IRIS - Institutional Research Information System

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that z= O(blog (n/ b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where z= Ω(blog n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.

Gagie, T.; Navarro, G.; Prezza, Nicola. (2018). On the approximation ratio of Lempel-Ziv parsing. In 13th International Symposium on Latin American Theoretical Informatics, LATIN 2018 (pp. 490- 503). Springer. Isbn: 978-3-319-77403-9. Isbn: 978-3-319-77404-6. Doi: 10.1007/978-3-319-77404-6_36.

On the approximation ratio of Lempel-Ziv parsing

Gagie T.;Navarro G.;Prezza N.

2018

Abstract

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that z= O(blog (n/ b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where z= Ω(blog n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno del convegno
	
				2018
			
	Codice ISBN
	
				978-3-319-77403-9
978-3-319-77404-6
			
	Citazione
	
				Gagie, T.; Navarro, G.; Prezza, Nicola. (2018). On the approximation ratio of Lempel-Ziv parsing. In 13th International Symposium on Latin American Theoretical Informatics, LATIN 2018 (pp. 490- 503).  Springer. Isbn: 978-3-319-77403-9. Isbn: 978-3-319-77404-6. Doi: 10.1007/978-3-319-77404-6_36.
			
	Appare nelle tipologie:
	
				04.1 - Contributo in Atti di convegno (Paper in Proceedings)

File in questo prodotto:

File	Dimensione	Formato
latin18.pdf Solo gestori archivio Tipologia: Documento in Pre-print Licenza: DRM (Digital rights management) non definiti Dimensione 287.75 kB Formato Adobe PDF Visualizza/Apri	287.75 kB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11385/194129

Citazioni

25

24

30

social impact