Comparing Two Counting Methods for Estimating the Probabilities of Strings

Paper title

Comparing Two Counting Methods for Estimating the Probabilities of Strings

Authors

Takamoto, Ayaka; Yoshida, Mitsuo; Umemura, Kyoji

Abstract

There are two methods for counting the number of occurrences of a string in another, larger string. One is to count the number of positions at which the string is found. The other is to determine how many copies of the string can be extracted without overlapping. The difference between the two becomes apparent when the string is part of a periodic pattern. This research reports that the difference is significant in estimating the occurrence probability of a pattern. In this study, the strings used in the experiments are approximated from time-series data. The task involves classifying strings by estimating the probability or computing the information quantity. First, the frequencies of all substrings of a string are computed. The two counting methods may sometimes produce different frequencies for an identical string. Second, the most probable segmentation of the string is selected. The probability of the string is the product of the probabilities of the substrings in the selected segmentation. The classification results demonstrate that the difference in counting methods is statistically significant, and that the method without overlapping is better.
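The two counting methods and the segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`count_positions`, `count_non_overlapping`, `best_segmentation`) and the `prob` callback are hypothetical, and how the paper turns substring frequencies into probabilities is not stated in the abstract, so the sketch takes the probability estimate as an input instead of guessing it.

```python
import math


def count_positions(text: str, pattern: str) -> int:
    """Count every start position where pattern occurs (overlaps allowed)."""
    if not pattern:
        return 0
    last_start = len(text) - len(pattern)
    return sum(1 for i in range(last_start + 1) if text.startswith(pattern, i))


def count_non_overlapping(text: str, pattern: str) -> int:
    """Count copies of pattern that can be extracted without overlapping."""
    if not pattern:
        return 0
    # str.count scans left to right and counts non-overlapping occurrences.
    return text.count(pattern)


def best_segmentation(s, prob, max_len=None):
    """Return (log probability, segments) of the most probable segmentation.

    `prob` is a hypothetical callback mapping a substring to an estimated
    probability; the estimator itself is an assumption left to the caller.
    """
    n = len(s)
    max_len = max_len or n
    best = [-math.inf] * (n + 1)  # best[i]: best log probability of s[:i]
    back = [0] * (n + 1)          # back[i]: start of the last segment of s[:i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            p = prob(s[j:i])
            if p > 0 and best[j] + math.log(p) > best[i]:
                best[i] = best[j] + math.log(p)
                back[i] = j
    if best[n] == -math.inf:
        return -math.inf, []      # no segmentation has nonzero probability
    segments = []
    i = n
    while i > 0:
        segments.append(s[back[i]:i])
        i = back[i]
    segments.reverse()
    return best[n], segments


# A periodic string makes the two counts diverge.
print(count_positions("abababab", "aba"))        # 3
print(count_non_overlapping("abababab", "aba"))  # 2
```

Summing log probabilities is equivalent to multiplying the substring probabilities, and the negated sum is the information quantity of the string under the chosen segmentation. Either counting function could feed the estimator passed as `prob`, which is where the choice between the two methods would affect the result.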
