Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance

arXiv (2022)

Abstract
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let e_k(n) denote the average edit distance between random, independent strings of n characters from an alphabet of size k. For k ≥ 2, it is an open problem how to efficiently compute the exact value of α_k(n) = e_k(n)/n, as well as that of α_k = lim_{n→∞} α_k(n), a limit known to exist. This paper shows that α_k(n) − Q(n) ≤ α_k ≤ α_k(n) for a specific Q(n) = Θ(√(log n / n)), a result which implies that α_k is computable. The exact computation of α_k(n) is explored, leading to an algorithm running in time T = O(n^2 k min(3^n, k^n)), a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how α_k(n) can be evaluated with good accuracy, a high confidence level, and reasonable computation time for values of n up to, say, a quarter million. Correspondingly, 99.9% confidence intervals of width approximately 10^-2 are obtained for α_k. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound β_k^* to α_k, such that lim_{k→∞} β_k^* = 1. In general, β_k^* ≤ α_k ≤ 1 − 1/k; for k greater than a few dozen, computing β_k^* is much faster than generating good statistical estimates with confidence intervals of width 1 − 1/k − β_k^*. The techniques developed in the paper yield improvements on most previously published numerical values, as well as results for alphabet sizes and string lengths not reported before.
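The statistical approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and the constants in the paper's analysis may differ. The sketch assumes the following reasoning: the sample mean of d/n over N independent pairs of strings is a function of 2nN independent characters, and changing one character changes that mean by at most 1/(nN), so McDiarmid's inequality gives a two-sided deviation bound of 2·exp(−t² n N). Solving for a failure probability δ gives a half-width t = √(ln(2/δ)/(nN)). The function names `edit_distance` and `estimate_alpha`, and the parameters `num_pairs`, `delta`, and `seed`, are hypothetical choices made for this illustration.

```python
import math
import random


def edit_distance(x, y):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, yj in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete xi
                curr[j - 1] + 1,           # insert yj
                prev[j - 1] + (xi != yj),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def estimate_alpha(k, n, num_pairs, delta=0.001, seed=0):
    """Monte Carlo estimate of alpha_k(n) with a McDiarmid-style half-width.

    Assumption (see lead-in): each of the 2*n*num_pairs characters moves the
    sample mean by at most 1/(n*num_pairs), which yields
    half_width = sqrt(ln(2/delta) / (n*num_pairs)).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    mean = total / (num_pairs * n)
    half_width = math.sqrt(math.log(2 / delta) / (n * num_pairs))
    return mean, half_width


if __name__ == "__main__":
    mean, hw = estimate_alpha(k=4, n=200, num_pairs=50)
    print(f"alpha_4(200) is in [{mean - hw:.4f}, {mean + hw:.4f}] "
          f"with confidence at least 99.9% under the stated assumption")
```

The interval produced here concerns α_k(n) only. Turning it into an interval for α_k would additionally require the paper's explicit Q(n), which the abstract does not state. The quadratic-time dynamic program is also a simplification that would be slow for the string lengths quoted in the abstract.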