The Consultative Committee for Space Data Systems (CCSDS) Rice coding is a recommendation
for lossless compression of satellite data. It has also been integrated into the HDF (Hierarchical Data
Format) software for lossless compression of scientific data, and has been proposed for lossless compression
of medical images. The CCSDS Rice coding is an approximate adaptive entropy coder. It uses a subset of
the family of Golomb codes to produce a simpler, suboptimal prefix code. The default preprocessor is a
unit-delay predictor with positive mapping. The adaptive entropy coder concurrently applies a set of
variable-length codes to a block of consecutive preprocessed samples. The code option that yields the
shortest codeword sequence for the current block of samples is then selected for transmission. A unique
identifier bit sequence is attached to the code block to indicate to the decoder which decoding option to
use. In this paper, we explore the parallel efficiency of the CCSDS Rice coding running on Graphics
Processing Units (GPUs) with the Compute Unified Device Architecture (CUDA). The GPU-based
CCSDS Rice encoder processes several codeword blocks in a massively parallel fashion on different
GPU multiprocessors. We parallelized the CCSDS Rice coding by using a sum reduction for code option
selection, a prefix sum for intra-block and inter-block bit stream concatenation, and asynchronous
data transfer. For NASA AVIRIS hyperspectral data, the speedup is nearly 6× compared with the
single-threaded CPU counterpart. The CCSDS Rice coding contains many flow-control instructions,
which significantly reduce instruction throughput by causing threads of the same CUDA warp to
diverge. The divergent execution paths must then be serialized, increasing the total number of
instructions executed for that warp. We conclude that this branch divergence is the bottleneck
of the Rice coding and leads to a smaller speedup than other entropy coders achieve on GPUs.
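
The preprocessing and code option selection described above can be illustrated with a minimal sequential sketch in Python. It is not the GPU implementation. It assumes unsigned n-bit samples, a prediction of 0 for the first sample of a block (the standard handles a reference sample differently), and only the split-sample options k = 0, ..., n - 3 plus the no-compression option; the zero-block and second-extension options and the identifier bits are left out. All function names are hypothetical.

```python
def positive_map(d, pred, n_bits):
    """Map a signed prediction error d to a non-negative integer.

    theta is the distance from the prediction to the nearer end of the
    sample range [0, 2**n_bits - 1] (unsigned samples assumed).
    """
    x_max = (1 << n_bits) - 1
    theta = min(pred, x_max - pred)
    if 0 <= d <= theta:
        return 2 * d
    if -theta <= d < 0:
        return 2 * (-d) - 1
    return theta + abs(d)


def preprocess(samples, n_bits):
    """Unit-delay predictor followed by positive mapping.

    Assumption: the prediction for the first sample is 0.
    """
    mapped = []
    prev = 0
    for x in samples:
        mapped.append(positive_map(x - prev, prev, n_bits))
        prev = x
    return mapped


def split_sample_length(block, k):
    """Bits needed when the k low bits of each value are sent verbatim and
    the remaining high bits are sent in unary with a terminating bit."""
    return sum((v >> k) + 1 + k for v in block)


def select_option(block, n_bits):
    """Return (k, length_in_bits) of the shortest option for one block.

    k is None when sending the block uncompressed is shortest.
    The per-k sums are the quantities a GPU version would obtain with a
    sum reduction; here they are computed sequentially.
    """
    best_k, best_len = None, len(block) * n_bits
    for k in range(n_bits - 2):
        length = split_sample_length(block, k)
        if length < best_len:
            best_k, best_len = k, length
    return best_k, best_len


if __name__ == "__main__":
    samples = [10, 12, 11, 11, 15, 14, 14, 13]
    mapped = preprocess(samples, n_bits=8)
    print(mapped)
    print(select_option(mapped, n_bits=8))
```

For the 8-bit block in the usage example, the mapped values are [10, 4, 1, 0, 8, 1, 0, 1], and the sketch selects k = 1 at 27 bits, against 64 bits for the uncompressed block. The branches in `positive_map` are the kind of data-dependent flow control that can cause threads of a warp to diverge when each thread handles one sample.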