
Haskell Performance: Lowercase

Tags: performance, haskell, coding

I was trying to track down some performance issues in some text processing programs I was writing in Haskell, and ran into an interesting problem: one small change made my program five times slower, and I had to backtrack to find out what it was. So, start with a simple Haskell program that checks whether a word is in a wordlist:

    import IO
    import System
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI
    import qualified Data.ByteString.Char8 as C

    main = do
        args <- getArgs
        let searchfor = C.pack $ head args
        f <- openFile "wordlist" ReadMode
        text <- B.hGetContents f
        print $ length $ filter ((==) searchfor) (C.lines text)

Searching a smallish list of about 300K words takes 0.040 seconds on my computer, compared to 0.200 seconds for Python and 0.210 seconds for a naive Haskell implementation that does not use ByteStrings. However, let's just add lowercasing to the equation:

    import IO
    import System
    import Data.Char
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI
    import qualified Data.ByteString.Char8 as C

    main = do
        args <- getArgs
        let searchfor = C.pack $ head args
        f <- openFile "wordlist" ReadMode
        text <- B.hGetContents f
        print $ length $ filter (\x -> (C.map toLower x) == searchfor) (C.lines text)

Suddenly, the ByteString version becomes about 30% slower than the naive version (0.337 seconds vs. 0.251 seconds) and is even slower than the Python version. What the heck is going on here? Replacing the lowercasing with an identity map (i.e., C.map id x) was still fast, so I suspect that the toLower function itself is the slow part.
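For reference, here is roughly what I mean by the naive non-ByteString version, with the same toLower comparison (a sketch; it may not be line-for-line what I actually timed):

    import IO
    import System
    import Data.Char

    main = do
        args <- getArgs
        let searchfor = head args
        f <- openFile "wordlist" ReadMode
        -- plain String version: lazy I/O, Prelude lines, Data.Char.toLower
        text <- hGetContents f
        print $ length $ filter (\w -> map toLower w == searchfor) (lines text)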

Unfortunately, there doesn't seem to be a lowercase function available for ByteString; at the moment it seems you need to set up your own ctype table and use that:

    import IO
    import System
    import Data.Char
    import Data.Word
    import Data.Array.Unboxed
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI
    import qualified Data.ByteString.Char8 as C

    -- 256-entry lookup table: each byte mapped to its lowercase equivalent
    ctype_lower = listArray (0, 255) (map (BI.c2w . toLower) ['\0'..'\255']) :: UArray Word8 Word8

    -- lowercase a ByteString one byte at a time via the table
    lowercase = B.map (\x -> ctype_lower ! x)

    main = do
        args <- getArgs
        let searchfor = C.pack $ head args
        f <- openFile "wordlist" ReadMode
        text <- B.hGetContents f
        print $ length $ filter (\x -> (lowercase x) == searchfor) (C.lines text)

… which turns out to run really quickly at 0.070 seconds, about the same as a C program doing the same task.

Update: See dons' comment below. Data.Char.toLower operates on Unicode characters, which is what makes it slow. I wonder if a ctype.h-type library for ByteString makes sense?
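Something in the spirit of the table trick above might be a starting point. A minimal sketch, with byteTable and mapBytes as names I'm making up for illustration:

    import Data.Char
    import Data.Word
    import Data.Array.Unboxed
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI

    -- build a 256-entry byte table from any Char -> Char function, once
    byteTable :: (Char -> Char) -> UArray Word8 Word8
    byteTable f = listArray (0, 255) (map (BI.c2w . f) ['\0'..'\255'])

    -- map every byte of a ByteString through a precomputed table
    mapBytes :: UArray Word8 Word8 -> B.ByteString -> B.ByteString
    mapBytes table = B.map (table !)

    -- table-driven, byte-oriented case conversion
    lowercase = mapBytes (byteTable toLower)
    uppercase = mapBytes (byteTable toUpper)

The same pattern would presumably cover the classification functions (isspace, isalpha, and so on) with a Word8 -> Bool table instead.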

Discussion

Comments are moderated whenever I remember that I have a blog.

dons | 2009-03-29 06:54:54
I think you're running into the fact it is using Unicode toLower. Look in GHC.Unicode.toLower, and it is in terms of:

    foreign import ccall unsafe "u_towlower" towlower :: CInt -> CInt

So you're calling out to C each time. An ASCII-only version should be faster, e.g.,

    toLower c@(C# c#)
        | isAsciiUpper c = C# (chr# (ord# c# +# 32#))
        | isAscii c      = c
        | isUpper c      = unsafeChr (ord c `minusInt` ord 'A' `plusInt` ord 'a')
        | otherwise      = c

or your version.