In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.

Composite repetition-aware data structures / Belazzougui, Djamal; Cunial, Fabio; Gagie, Travis; Prezza, Nicola; Raffinot, Mathieu. - 26th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, (2015), pp. 26-39. (Combinatorial Pattern Matching, Ischia Island, Italy, June 29 - July 1, 2015). [10.1007/978-3-319-19929-0_3].

Composite repetition-aware data structures

PREZZA, NICOLA;
2015

Abstract

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.
2015
3319199293
Compact Directed Acyclic Word Graphs (CDAWG), Suffix Array Sample, LZ77 Indexes, Repetition Maximum, LZ Factors
File in questo prodotto:
File Dimensione Formato  
Composite repetition-aware data structures.pdf

Solo gestori archivio

Descrizione: articolo principale
Tipologia: Documento in Pre-print
Licenza: DRM (Digital rights management) non definiti
Dimensione 462.9 kB
Formato Adobe PDF
462.9 kB Adobe PDF   Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11385/192276
Citazioni
  • Scopus 54
  • ???jsp.display-item.citation.isi??? ND
  • OpenAlex ND
social impact