Compressing sparse GeoTiffs using PackBits

[2019-02-07 Thu]

GeoTiffs can be compressed using several algorithms (lossless and lossy) via the excellent command-line tools of GDAL. A nice round-up is in this blog post, which shows that if you want to optimize read-time (lossless), then the primitive PackBits algorithm is best. PackBits does best when there are a lot of identical bytes in a row. Let's try to take advantage of this when compressing a sparse geotiff (with sparse I mean a lot of no-data values).

I prepared an example geotiff raster with 1000x1000 Float32 values, the fill is about 10% and the nodata value is -9999.

>> gdalinfo sparse.tif
Driver: GTiff/GeoTIFF
Files: sparse.tif
Size is 1000, 1000
Coordinate System is:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0],
    UNIT["degree",0.0174532925199433],
    AUTHORITY["EPSG","4326"]]
Origin = (0.000000000000000,0.500000000000000)
Pixel Size = (0.000500000000000,-0.000500000000000)
Metadata:
  TIFFTAG_SOFTWARE=n/a
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (   0.0000000,   0.5000000) (  0d 0' 0.01"E,  0d30' 0.00"N)
Lower Left  (   0.0000000,   0.0000000) (  0d 0' 0.01"E,  0d 0' 0.01"N)
Upper Right (   0.5000000,   0.5000000) (  0d30' 0.00"E,  0d30' 0.00"N)
Lower Right (   0.5000000,   0.0000000) (  0d30' 0.00"E,  0d 0' 0.01"N)
Center      (   0.2500000,   0.2500000) (  0d15' 0.00"E,  0d15' 0.00"N)
Band 1 Block=1000x2 Type=Float32, ColorInterp=Gray
  NoData Value=-9999

Note that it is in `Float32` with NoData value -9999. If your file is in Float64, I suggest you change that (as it is unlikely that your data has 15 decimal digits of precision) with: gdal_translate -ot float32 in.tif out.tif.

Let's try to just compress it using PackBits:

>> gdalwarp -overwrite -co COMPRESS=PACKBITS sparse.tif sparse-v1.tif
>> ls -sh sparse*
3.9M sparse.tif  3.9M sparse-v1.tif

No improvement. As an aside, other compression schemes will work with this data:

>> gdalwarp -overwrite -co COMPRESS=DEFLATE sparse.tif sparse-deflate.tif
>> ls -sh sparse*
520K sparse-deflate.tif  3.9M sparse.tif  3.9M sparse-v1.tif

but I want to stick to Packbits for improved reading speed.

The trick to make Packbits work is to re-encode the no-data value into a value which has representation with a simple bit pattern. Let's first see what the byte pattern of my current NoData=-9999 value is:

julia> reinterpret(UInt32, -9999f0)
0xc61c3c00

Which shows bit-representation of -9999 in Float32 as hexadecimal "c61c3c00". Remember that one byte is encoded with two hexadecimal numbers, i.e. the bytes are "c6", "1c", "3c, "00".

Let's make a Float32 with all equal bytes (and also negative):

julia> reinterpret(Float32, parse(UInt32, "eeeeeeee", base = 16))
-3.697314f28

and use that to re-encode the NoData values (and compress):

>> gdalwarp -overwrite -srcnodata -9999 -dstnodata -3.697314e28 \
     -co COMPRESS=PACKBITS sparse.tif sparse-v2.tif
>> ls -sh sparse*
520K sparse-deflate.tif  3.9M sparse.tif  3.9M sparse-v1.tif  668K sparse-v2.tif

Cool, now my file compressed from 3.9M to 0.668M.

Let's just check the read-speeds (the reason I've used Packbits) with a ten times larger file than the above-provided example file:

>> time gdalinfo -checksum sparse-large.tif > /dev/null
 0.17s user 0.03s system 99% cpu 0.202 total
>> time gdalinfo -checksum sparse-large-v1.tif > /dev/null
 0.60s user 0.15s system 99% cpu 0.750 total
>> time gdalinfo -checksum sparse-large-v2.tif > /dev/null
 0.65s user 0.12s system 99% cpu 0.781 total
>> time gdalinfo -checksum sparse-large-deflate.tif > /dev/null
 0.83s user 0.10s system 99% cpu 0.940 total

which shows 0.17s for uncompressed, 0.6s for "v1", 0.65s for "v2" and 0.83s for "deflate". Which is a bit underwhelming, but hey, at least Packbits is faster than Deflate.

Note that PackBits only really works for sparse datasets, or datasets which have a lot of identical values (with simple byte patterns). Otherwise use Deflate or LZW.