ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

Boosting Text Compression with Word-Based Statistical Encoding (2011)

Farina, A., Navarro, G., Parama, J. R.

Oxford University Press

In: Computer Journal

Details

Publication Date: 2011-12-23

Description: Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30–35%, they allow fast direct searching of the compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow prediction by partial matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
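
As an illustration of the kind of word-based byte-oriented dense coding the description refers to, the following is a minimal sketch of End-Tagged Dense Code, a well-known dense code. It is an assumption that this matches the preprocessing used in the article; the new suffix-free variant the authors present is not reproduced here. The tokenization (alternating runs of word and non-word characters) and the function names `encode_rank`, `compress` and `decompress` are hypothetical choices made for this example.

```python
import re
from collections import Counter


def encode_rank(rank):
    """Return the End-Tagged Dense Code for a 0-based frequency rank.

    The last byte of a codeword has its high bit set (128..255);
    all earlier bytes are in 0..127, so codeword ends are self-marking.
    """
    code = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        code.append(rank % 128)
        rank //= 128
    return bytes(reversed(code))


def compress(text):
    """Semistatic two-pass compression: count tokens, then encode them."""
    # Assumed tokenization: runs of word chars and runs of non-word chars.
    tokens = re.findall(r"\w+|\W+", text)
    counts = Counter(tokens)
    # More frequent tokens get smaller ranks, hence shorter codewords.
    vocab = sorted(counts, key=lambda t: (-counts[t], t))
    rank = {t: i for i, t in enumerate(vocab)}
    data = b"".join(encode_rank(rank[t]) for t in tokens)
    return vocab, data


def decompress(vocab, data):
    """Invert compress(): split on tagged bytes and look up each rank."""
    out = []
    rank = 0
    for b in data:
        if b < 128:
            # Continuation byte: accumulate the higher-order part.
            rank = rank * 128 + b + 1
        else:
            # Tagged byte: the codeword ends here.
            out.append(vocab[rank * 128 + (b - 128)])
            rank = 0
    return "".join(out)
```

Because the output of `compress` is an ordinary byte string, it could then be passed to a general-purpose compressor (for instance Python's standard `lzma.compress`, used here only as a stand-in for p7zip) to mimic the two-stage pipeline the description outlines. The vocabulary would also have to be stored alongside the compressed data, which this sketch leaves out.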

Print ISSN: 0010-4620

Electronic ISSN: 1460-2067

Topics: Computer Science

Published by Oxford University Press
