Model Joins: Enabling Analytics Over Joins of Absent Big Tables

Paper Title

Model Joins: Enabling Analytics Over Joins of Absent Big Tables

Authors

Ali Mohammadi Shanghooshabad, Peter Triantafillou

Abstract

This work is motivated by two key facts. First, it is highly desirable to be able to learn and perform knowledge discovery and analytics (LKD) tasks without the need to access raw-data tables. This may be due to organizations finding it increasingly frustrating and costly to manage and maintain ever-growing tables, or for privacy reasons. Hence, compact models can be developed from the raw data and used instead of the tables. Second, oftentimes, LKD tasks are to be performed on a (potentially very large) table which is itself the result of joining separate (potentially very large) relational tables. But how can one do this, when the individual to-be-joined tables are absent? Here, we pose the following fundamental questions: Q1: How can one "join models" of (absent/deleted) tables or "join models with other tables" in a way that enables LKD as if it were performed on the join of the actual raw tables? Q2: What are appropriate models to use per table? Q3: As the model join would be an approximation of the actual data join, how can one evaluate the quality of the model join result? This work puts forth a framework, Model Join, addressing these challenges. The framework integrates and joins the per-table models of the absent tables and generates a uniform and independent sample that is a high-quality approximation of a uniform and independent sample of the actual raw-data join. The approximation stems from the models, but not from the Model Join framework. The sample obtained by the Model Join can be used to perform LKD downstream tasks, such as approximate query processing, classification, clustering, regression, association rule mining, visualization, and so on. To our knowledge, this is the first work with this agenda and solutions. Detailed experiments with TPC-DS data and synthetic data showcase Model Join's usefulness.
