C mmap + BZ2_bzDecompress way slower than bzip2 command
I'm using mmap + BZ2_bzDecompress to sequentially decompress a large file
(29GB). I need to parse the uncompressed XML data but only need small bits
of it, so it seemed far more efficient to process it sequentially than to
decompress the whole file (400GB uncompressed) and then parse it.
Interestingly, the decompression part alone is already extremely slow: the
shell command bzip2 manages a bit more than 52MB per second (measured over
several runs of timeout 10 bzip2 -c -k -d input.bz2 > output, dividing the
produced file size by 10), while my program does not even reach 2MB/s and
slows down to 1.2MB/s after a few seconds.
The file I'm trying to process uses multiple bz2 streams, so I'm checking
BZ2_bzDecompress for BZ_STREAM_END and, if it occurs and the file hasn't
been completely processed, calling BZ2_bzDecompressEnd( strm ); and
BZ2_bzDecompressInit( strm, 0, 0 ) to restart with the next stream. I also
tried without BZ2_bzDecompressEnd, but that didn't change anything (and I
can't really see in the documentation how one should handle multiple
streams correctly).
The file is mmap'ed beforehand; I tried different combinations of flags
there as well, currently PROT_READ + MAP_PRIVATE with madvise set to
MADV_SEQUENTIAL | MADV_WILLNEED | MADV_HUGEPAGE (I'm checking the return
value, and madvise does not report any problems; I'm on a Debian setup
with a Linux 3.2x kernel, which has hugepage support).
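A minimal sketch of the mapping step, assuming Linux with glibc; map_file_readonly is a hypothetical helper name. One detail worth noting: on Linux the madvise advice values are distinct small integers rather than bit flags, so the sketch issues one call per advice instead of ORing them together. A failed madvise is treated as a lost hint, not as an error.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() under strict -std= modes */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and pass access hints to the kernel.
 * Returns the mapping, or NULL on failure; the size is stored in *len. */
static void *map_file_readonly(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;

    /* One advice per call: the values are not a bitmask. */
    if (madvise(p, *len, MADV_SEQUENTIAL) != 0)
        perror("madvise(MADV_SEQUENTIAL)");
    if (madvise(p, *len, MADV_WILLNEED) != 0)
        perror("madvise(MADV_WILLNEED)");
#ifdef MADV_HUGEPAGE
    /* May be rejected for file-backed mappings; that is not fatal. */
    if (madvise(p, *len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
#endif
    return p;
}
```

The caller releases the mapping with munmap(p, len) when done.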
When profiling, I made sure that nothing else was run other than some
counters for measuring speed and a printf limited to once every n
iterations. This is on a modern multicore server processor where all other
cores were idle, and it's bare metal, not virtualized.
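The kind of speed counter described above could look roughly like this. It is a sketch only, with hypothetical names (throughput, throughput_start, throughput_add, mb_per_second); it assumes a POSIX system with clock_gettime and counts 1MB as 1024 * 1024 bytes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime() under strict -std= modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Running totals for a simple throughput report. */
typedef struct {
    struct timespec start;      /* when measurement began */
    unsigned long long bytes;   /* decompressed bytes so far */
    unsigned long calls;        /* number of throughput_add() calls */
} throughput;

/* Average rate in MB/s (1 MB = 1024 * 1024 bytes). */
static double mb_per_second(unsigned long long bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes / (1024.0 * 1024.0) / seconds : 0.0;
}

static void throughput_start(throughput *t)
{
    t->bytes = 0;
    t->calls = 0;
    clock_gettime(CLOCK_MONOTONIC, &t->start);
}

/* Add n bytes; print the average rate once every report_every calls
 * (report_every must be non-zero). */
static void throughput_add(throughput *t, size_t n, unsigned long report_every)
{
    t->bytes += n;
    if (++t->calls % report_every != 0)
        return;

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double secs = (double)(now.tv_sec - t->start.tv_sec)
                + (double)(now.tv_nsec - t->start.tv_nsec) / 1e9;
    fprintf(stderr, "%.1f MB/s average\n", mb_per_second(t->bytes, secs));
}
```

Using a monotonic clock keeps the measurement unaffected by wall-clock adjustments, and reporting an average since the start makes a gradual slowdown visible as a falling number.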
Any ideas on what I could be doing wrong / do to improve performance?