论文标题
Factorjoin:一个新的基数估算框架,用于联接查询
FactorJoin: A New Cardinality Estimation Framework for Join Queries
论文作者
论文摘要
基数估计是查询优化中最根本和最具挑战性的问题之一。在估计联接查询的基数时,经典和基于学习的方法均未产生令人满意的性能。他们要么依靠简化的假设导致无效的基数估计,要么建立大型模型以了解数据分布,从而导致较长的计划时间和跨查询的普遍性。 在本文中,我们提出了一个新的框架因子Join,用于估算联接查询。 Factorjoin结合了经典联接图方法背后的想法,以有效处理基于学习的方法以准确捕获属性相关性。具体而言,Factorjoin会在DB中扫描每个表格,并在离线准备阶段构建单桌子条件分布。当加入查询到来时,Factorjoin将其转化为一个因素图模型,以有效有效地估计其基数。 与现有的基于学习的方法不同,Factorjoin不需要在预先加入或需要执行的查询工作负载来训练模型。由于它仅依赖于单表统计数据,因此Factorjoin的空间较小,并且非常容易训练和维护。在我们的评估中,要素Join可以比以前的基于最新的学习方法产生更有效的估计,估计延迟,较小的模型尺寸较小40倍,并且以可比性或更高的精度以100倍较小的型号和100倍的训练速度。此外,Factorjoin可以在一秒钟内估算10,000个子计划查询,以优化查询计划,该查询计划非常接近商业DBMS中传统的基数估计器。
Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries. They either rely on simplified assumptions leading to ineffective cardinality estimates or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. In this paper, we propose a new framework FactorJoin for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method to efficiently handle joins with the learning-based methods to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query comes, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS.