Brool brool (n.) : a low roar; a deep murmur or humming

Haskell Performance: Array Creation

 |  performance haskell coding

Ran into another interesting performance problem while converting a small test program over to Haskell. Let’s say that you want to walk through every line of a text file, collate character frequencies, and return anything that maps to a particular frequency. For purposes of explanation we’ll do something really silly like look for lines with 10 capital ‘A’s.

{-# OPTIONS -XBangPatterns #-} import IO import System import Data.Word import Data.Array.Unboxed import Control.Monad.ST import Data.Array.ST import qualified Data.ByteString as B import qualified Data.ByteString.Internal as BI import qualified Data.ByteString.Char8 as C counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) -- collate character counts here return arr counts !line = runSTUArray (counts' line) hit !line = let freq = counts line in freq!65 == 10 main = do args <- getArgs f <- openFile "wordlist" ReadMode text <- B.hGetContents f print $ length $ filter hit (C.lines text)

This is fast (0.07s on 300K test file, compiled with -O2 on a Macbook Pro), but there’s no actual collation going on. It suddenly becomes two magnitudes slower as soon as you start to do anything based on the line:

-- version 1 : 9.5 seconds counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) readArray arr (B.head line) >>= \v -> writeArray arr (B.head line) (v+1) return arr counts !line = runSTUArray (counts' line)

.. equals 9.54 seconds just when you collate the first character of the string, and led me to believe that ByteStrings were slow, especially since a constant change to the array was fast (0.07 seconds):

-- version 2 : 0.07 seconds counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) readArray arr 0 >>= \v -> writeArray arr 0 (v+1) return arr

(Here I’ll elide the hours of me messing around with profiling and whatnot to try and figure out why B.head was so slow, or how to make B.foldl’ update a state variable, or sprinkling strictness bangs all over the place, or the other tons of false trails that I went down.)

It turns out that the slow portion in all of this is the newArray, not the character collation itself, which can be seen in the profile if the array creation is moved into its own routine and we look at the profile:

-- version 3 -- compile with: ghc -package bytestring stuarray.hs -prof -auto-all -O2 -o stuarray.out -- run with: ./stuarray.out +RTS -p initial_array = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) return arr counts' !line = do arr <- initial_array let ix = B.head line readArray arr ix >>= \v -> writeArray arr ix (v+1) return arr counts !line = runSTUArray (counts' line)

… and the relevant part of the profile:

MAIN                     MAIN                     1           0   0.0    0.0   100.0  100.0
 main                    Main                   240           1   0.7    0.3   100.0  100.0
  hit                    Main                   242      335075   0.0    0.0    99.3   99.7
   counts                Main                   243      335075   0.0    0.2    99.3   99.7
    counts'              Main                   244      335075   0.0    0.0    99.3   99.5
     initial_array       Main                   245      335075  99.3   99.5    99.3   99.5

Which brings up two interesting points:

The Ocaml version took about 0.720 seconds (Ocaml version below), compared to Haskell time of about 8.5 seconds. Reducing the array size to 128 in both cases reduced it to 0.112 seconds for Ocaml and 4.3 seconds in Haskell.

Array size Ocaml Haskell
128 0.112 4.3
255 0.184 8.5
256 0.720 8.5

Note the discontinuity in Ocaml between 255 and 256 elements, which I find interesting. The nice people in #haskell suggested that I stop constructing/destructing the array and instead just null it out between each line, and it turned out that the best way to do that was just to thaw a starter array.

-- version 4: 0.11 seconds initial_array = listArray (0,255) (repeat 0) :: UArray Word8 Int counts' !line = do arr <- thaw initial_array let ix = B.head line readArray arr ix >>= \v -> writeArray arr ix (v+1) return arr counts !line = runSTUArray (counts' line)

This is still kind of strange to me (because an object is still getting constructed/destructed – thaw guarantees a copy) but you can’t argue with a performance increase. (Interestingly, using Array.copy or Array.fill in Ocaml is slower than just using Array.make). Final times? 0.11 seconds in Haskell vs. the best of 0.720 second UPDATED: 0.12 seconds in Ocaml… and the Haskell version is just 50% slower than a C implementation with a gratuitous calloc instead of a memset.

UPDATE: An pointed out that zeroing the Ocaml array with a for loop is much faster than Array.copy or Array.fill, bringing it to about the same speed as Haskell.

Lessons learned?

open Char open String let all_lines fn filename = let chan = open_in filename in try while true do let line = input_line chan in fn line done with End_of_file -> close_in chan let initial_array = Array.make 256 0 let count line = let freq = initial_array in let ix = int_of_char line.[0] in for i = 0 to 255 do freq.(i) <- 0 done; freq.(ix) <- freq.(ix) + 1; freq let hit line = let freq = count line in freq.(65) = 10 let _ = let linecount = ref 0 in all_lines (fun line -> if hit line then linecount := !linecount + 1 else ()) "wordlist"; print_int !linecount


Comments are moderated whenever I remember that I have a blog.

dons | 2009-03-30 06:17:38
Did you consider just using a mutable array (e.g. STUArray) ? These are the primary array structure with identical behaviour/representation to what you'd use in an imperative language. They should have identical performance.
tim | 2009-03-30 17:58:07
@dons: Well, I use a STUArray in the innermost loop... it converts to a UArray on return but from what I can tell in the docs this should be an in-place modification, no copy necessary.
an | 2009-03-30 19:01:37
For what is worth, OCaml is faster than the fourth Haskell version if you remember to compile with -unsafe -nodynlink -inline 100, use a pre-allocated array and zero it out for each line (like in Haskell), with for i = 0 to 255 do freq.(i) &lt;- 0 done. It doesn't take hours to get there ;)
tim | 2009-03-31 05:32:44
@an: You're absolutely right -- even a standard -ccopt -O2 becomes faster when you zero out the array in a for loop. I'm amazed at the difference between that and Array.fill. Updated the original article, thanks for the info!
Add a comment