Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.

Flexible indexing of repetitive collections / Belazzougui, D.; Cunial, F.; Gagie, T.; Prezza, Nicola; Raffinot, M.. - Unveiling Dynamics and Complexity: 13th Conference on Computability in Europe, CiE 2017, Proceedings, (2017), pp. 162-174. (13th Conference on Computability in Europe, CiE 2017, Turku, Finland, June 12-16, 2017). [10.1007/978-3-319-58741-7_17].

Flexible indexing of repetitive collections

Prezza N.;
2017

Abstract

Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.
2017
978-3-319-58740-0
978-3-319-58741-7
File in questo prodotto:
File Dimensione Formato  
CiE_final.pdf

Solo gestori archivio

Tipologia: Documento in Pre-print
Licenza: DRM (Digital rights management) non definiti
Dimensione 517.39 kB
Formato Adobe PDF
517.39 kB Adobe PDF   Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11385/194117
Citazioni
  • Scopus 7
  • ???jsp.display-item.citation.isi??? 6
social impact