Agglomerative Clustering of Handwritten Numerals to Determine Similarity of Different Languages

Paper Title

Agglomerative Clustering of Handwritten Numerals to Determine Similarity of Different Languages

Authors

Md. Rahat-uz-Zaman, Shadmaan Hye

Abstract

Handwritten numerals of different languages have distinct characteristics. Similarities and dissimilarities between the languages can be measured by analyzing features extracted from the numerals. Handwritten numeral datasets are available for many well-known languages from different regions. In this paper, several handwritten numeral datasets of different languages are collected and used to measure the similarity among those written languages by determining and comparing the similarity of each handwritten numeral. This helps to identify which languages share the same or a closely related parent language. First, a similarity measure for two numeral images is constructed with a Siamese network. Second, the similarity of the numeral datasets is determined using the Siamese network and a new similarity-averaging technique based on random sampling with replacement. Finally, agglomerative clustering is performed based on the similarities between the datasets. The clustering reveals several properties of the datasets; the property this paper focuses on is their regional resemblance. By analyzing the clusters, it becomes easy to identify which languages originated in similar regions.
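
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. `pair_similarity` is a hypothetical stand-in for the trained Siamese network. The dataset names and the toy 2-D feature vectors are invented for illustration. Average linkage is an assumed choice, since the abstract does not state the linkage criterion. The sketch also ignores digit classes, whereas the paper compares the similarity of each numeral.

```python
import random
from itertools import combinations


def pair_similarity(a, b):
    """Hypothetical stand-in for a Siamese network: similarity in (0, 1]."""
    dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + dist)


def dataset_similarity(ds_a, ds_b, rng, n_pairs=500):
    """Average pair similarity over random pairs drawn with replacement."""
    total = 0.0
    for _ in range(n_pairs):
        # rng.choice samples independently each time, i.e. with replacement.
        total += pair_similarity(rng.choice(ds_a), rng.choice(ds_b))
    return total / n_pairs


def average_linkage(c1, c2, sim):
    """Mean similarity between all cross-cluster dataset pairs."""
    scores = [sim[frozenset((a, b))] for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerative(names, sim):
    """Repeatedly merge the two most similar clusters.

    Returns the merges in the order they happen.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], sim),
        )
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


def make_toy_dataset(center, rng, n=50):
    """Invented toy data: noisy 2-D points standing in for image features."""
    return [[c + rng.gauss(0, 0.1) for c in center] for _ in range(n)]


rng = random.Random(42)
datasets = {
    "script_a": make_toy_dataset([0.0, 0.0], rng),
    "script_b": make_toy_dataset([0.3, 0.0], rng),
    "script_c": make_toy_dataset([5.0, 5.0], rng),
}
# Pairwise dataset similarities, keyed by the unordered pair of names.
sim = {
    frozenset((p, q)): dataset_similarity(datasets[p], datasets[q], rng)
    for p, q in combinations(datasets, 2)
}
merges = agglomerative(list(datasets), sim)
for left, right in merges:
    print("merge", left, "+", right)
```

In this toy setup the two datasets with nearby centers are merged first, which mirrors the idea that numeral datasets from related scripts end up in the same cluster. With real data, the pair similarity would come from the trained network rather than from a distance on hand-made vectors.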
