Toward a Definitive Compressibility Measure for Repetitive Sequences.

IEEE Trans. Inf. Theory (2023)

While the $k$th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of $k$ and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size $z$ of the Lempel–Ziv parse are frequently used to estimate repetitiveness. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute, and it is not monotone upon appending symbols. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma \le b$ lower-bounds all the previous relevant ones, while length-$n$ strings can be represented and efficiently indexed within space $O\left(\gamma \log \frac{n}{\gamma}\right)$, which also upper-bounds many measures, including $z$. Although $\gamma$ is arguably a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in $o(\gamma \log n)$ space. In this paper, we study an even smaller measure, $\delta \le \gamma$, which can be computed in linear time, is monotone, and allows encoding every string in $O\left(\delta \log \frac{n}{\delta}\right)$ space because $z = O\left(\delta \log \frac{n}{\delta}\right)$. We argue that $\delta$ better captures the compressibility of repetitive strings.
Concretely, we show that (1) $\delta$ can be strictly smaller than $\gamma$, by up to a logarithmic factor; (2) there are string families needing $\Omega\left(\delta \log \frac{n}{\delta}\right)$ space to be encoded, so this space is optimal for every $n$ and $\delta$; (3) one can build run-length context-free grammars of size $O\left(\delta \log \frac{n}{\delta}\right)$, whereas the smallest (non-run-length) grammar can be up to $\Theta(\log n/\log \log n)$ times larger; and (4) within $O\left(\delta \log \frac{n}{\delta}\right)$ space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size $\sigma$ of the string, showing that $\Theta\left(\delta \log \frac{n \log \sigma}{\delta \log n}\right)$ space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
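
To make the measures concrete, here is a minimal Python sketch. It relies on a definition the abstract does not state: in the substring-complexity literature, $\delta = \max_{1 \le k \le n} d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of the string. The function names are hypothetical, the code is a naive brute-force illustration rather than the linear-time algorithm the abstract refers to, and the Lempel–Ziv routine implements one common variant of the parse (self-overlapping sources allowed), which may differ in detail from the one used in the paper.

```python
from fractions import Fraction


def substring_complexity(s: str) -> list[int]:
    """Return [d_1, ..., d_n], where d_k counts distinct length-k substrings.

    Brute force for illustration only; not the linear-time method.
    """
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s: str) -> Fraction:
    """Assumed definition: delta = max over k of d_k / k (exact rational)."""
    return max(Fraction(d, k) for k, d in enumerate(substring_complexity(s), start=1))


def lz_parse_size(s: str) -> int:
    """Number of phrases z in a greedy Lempel-Ziv parse (one common variant).

    Each phrase is the longest prefix of the remaining suffix that also
    occurs starting at an earlier position (overlap with the phrase itself
    is allowed), or a single symbol if no such prefix exists.
    """
    i, z, n = 0, 0, len(s)
    while i < n:
        length = 0
        # Restricting the search to s[0 : i + length] forces any match of the
        # (length + 1)-symbol candidate to start strictly before position i.
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # a fresh symbol forms a phrase of length 1
        z += 1
    return z


if __name__ == "__main__":
    for text in ["abababab", "aaaa", "abcd"]:
        print(text, substring_complexity(text), delta(text), lz_parse_size(text))
```

On the highly repetitive string `abababab`, this sketch gives $\delta = 2$ and $z = 3$, consistent with the relation $\delta \le z$ stated above.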
Keywords: Data compression, Lempel–Ziv parse, repetitive sequences, string attractors, substring complexity