Lossless Compression of English Short Messages

This lossless compressor achieves a much better compression ratio on English text than general-purpose compressors. Its typical compression ratio is 15% (the number of output bits divided by the number of input bits).

Compression relies on the probability of the next word as computed by the GPT-2 language model released by OpenAI, a neural network with 1.5 billion parameters based on the Transformer architecture. It is implemented using the LibNC library and runs on a standard PC.
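To illustrate why good predictions give short output, the sketch below computes the ideal code length of a message from the probabilities a model assigned to the symbols that actually occurred: a symbol predicted with probability p costs about -log2(p) bits. The function names and the probability values are hypothetical, and counting the input as 8 bits per UTF-8 byte is an assumption; nothing here calls GPT-2 or LibNC.

```python
import math


def ideal_code_length_bits(probabilities):
    """Sum of -log2(p) over the probabilities given to the observed symbols."""
    return sum(-math.log2(p) for p in probabilities)


def compression_ratio(output_bits, text):
    """Output bits divided by input bits (assumption: 8 bits per UTF-8 byte)."""
    input_bits = 8 * len(text.encode("utf-8"))
    return output_bits / input_bits


# Hypothetical probabilities for four consecutive symbols of a message.
example_probabilities = [0.5, 0.25, 0.125, 0.125]
example_bits = ideal_code_length_bits(example_probabilities)  # 1 + 2 + 3 + 3 bits
```

The better the model predicts each next symbol, the closer each probability is to 1 and the fewer bits the message costs.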

An arithmetic coder generates the bit stream. The bit stream is then written as text: each output character carries 15 data bits, drawn from the CJK and Hangul Syllables Unicode ranges.
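The following is a minimal sketch of how a bit stream could be packed 15 bits per character. The exact code point ranges and their order are not given here, so the RANGES table is an assumption: the CJK Unified Ideographs block (20,992 code points) and Hangul Syllables (11,172) together hold 32,164 code points, fewer than the 32,768 needed for 15 bits, so this sketch also counts CJK Extension A as part of "CJK". All function names are hypothetical.

```python
# Assumed code point ranges (inclusive). The real compressor may use different
# boundaries or a different order.
RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs block: 20,992 code points
    (0xAC00, 0xD7A3),  # Hangul Syllables: 11,172 code points
    (0x3400, 0x4DBF),  # CJK Extension A block (assumed, to reach 2**15 values)
]


def value_to_char(value):
    """Map a 15-bit value (0..32767) to one character."""
    if not 0 <= value < (1 << 15):
        raise ValueError("value must fit in 15 bits")
    for start, end in RANGES:
        size = end - start + 1
        if value < size:
            return chr(start + value)
        value -= size
    raise AssertionError("ranges hold fewer than 2**15 code points")


def char_to_value(ch):
    """Inverse of value_to_char."""
    code = ord(ch)
    offset = 0
    for start, end in RANGES:
        if start <= code <= end:
            return offset + (code - start)
        offset += end - start + 1
    raise ValueError("character is outside the assumed ranges")


def pack_bits(bits):
    """Pack a string of '0'/'1' into characters, zero-padding the last group."""
    padded = bits + "0" * (-len(bits) % 15)
    groups = (padded[i:i + 15] for i in range(0, len(padded), 15))
    return "".join(value_to_char(int(group, 2)) for group in groups)


def unpack_bits(text):
    """Recover the padded bit string from packed characters."""
    return "".join(format(char_to_value(ch), "015b") for ch in text)
```

Because the last group is zero-padded, a decoder built on this sketch would need another way to know where the real data ends, for example an end-of-message symbol in the coded stream.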

This compressor can be used to transmit text messages over very low bandwidth channels.