Paper Title

Support Aggregate Analytic Window Function over Large Data by Spilling

Authors

Shi, Xing, Wang, Chao

Abstract

Analytic functions, also called window functions, compute aggregates of data over a sliding window. For example, a simple query on an online stock platform returns the average price of a stock over the last three days. These functions are commonly used in SQL databases and are supported by most commercial databases. With the increasing use of cloud data infrastructure and machine learning technology, queries with analytic window functions are becoming more frequent. Some analytic functions, such as SUM and AVG, require only constant memory to store their state, while others, such as MIN and MAX, require linear space. When the window is extremely large, the state may be too large to hold in memory. In this case, the state must be spilled to disk, which is an expensive operation. In this paper, we propose an algorithm that manipulates the state data on disk to reduce disk I/O, making spilling both feasible and efficient. We analyze the complexity of the algorithm under different data distributions.
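To make the contrast between constant-state and linear-state aggregates concrete, here is a minimal in-memory sketch in Python. It is not the paper's spilling algorithm, which the abstract does not describe. It uses the standard monotonic-deque technique for a sliding MAX, and the function names `sliding_sum` and `sliding_max` are hypothetical. The sketch assumes a row-count window and that departing rows can be re-read from the input rather than kept as state.

```python
from collections import deque


def sliding_sum(values, window):
    """SUM over the last `window` rows; the only state is one running total."""
    out = []
    total = 0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            # The departing row is re-read from the input, not stored as state.
            total -= values[i - window]
        out.append(total)
    return out


def sliding_max(values, window):
    """MAX over the last `window` rows using a monotonic deque.

    Returns the per-row results and the peak number of candidates held,
    which is the state that could need spilling for a very large window.
    """
    out = []
    candidates = deque()  # row indices whose values are strictly decreasing
    peak_state = 0
    for i, v in enumerate(values):
        # Rows that are no larger than the new row can never be the max again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it falls out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        peak_state = max(peak_state, len(candidates))
        out.append(values[candidates[0]])
    return out, peak_state


if __name__ == "__main__":
    print(sliding_sum([1, 2, 3, 4, 5], 3))
    print(sliding_max([5, 4, 3, 2, 1], 3))  # decreasing input: state grows to the window size
    print(sliding_max([1, 2, 3, 4, 5], 3))  # increasing input: state stays at one entry
```

In this sketch the SUM state never grows, while the MAX state depends on the input: strictly decreasing data keeps every row in the window as a candidate, whereas increasing data keeps only one. This is one way to see why the size of the state, and therefore any spilling cost, varies with the data distribution.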