Paper title
Fundamentals of Task-Agnostic Data Valuation
Paper authors
Paper abstract
We study how to value the data of a data owner/seller for a data seeker/buyer. Data valuation is often carried out for a specific task, assuming a particular utility metric (such as test accuracy on a validation set) that may not exist in practice. In this work, we focus on task-agnostic data valuation without any validation requirements. The data buyer has access to a limited amount of data (which could be publicly available) and seeks more data samples from a data seller. We formulate the problem as estimating the differences in the statistical properties of the seller's data with respect to the baseline data available at the buyer. We capture these statistical differences through second moments by measuring the diversity and relevance of the seller's data for the buyer; we estimate these measures through queries to the seller without requesting raw data. We design the queries so that the seller is blind to the buyer's raw data and lacks the knowledge needed to fabricate query responses that would yield a desired outcome of the diversity-relevance trade-off. We show through extensive experiments on real tabular and image datasets that the proposed estimates capture the diversity and relevance of the seller's data for the buyer.
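To make the query-based idea concrete, the following is a minimal sketch of one way second-moment queries could work, assuming the buyer sends principal directions of its baseline data and the seller returns only the variance of its own data along each direction. The function names (`buyer_queries`, `seller_response`, `diversity_relevance`) and the ratio-based scores are illustrative assumptions, not definitions taken from the abstract, and the sketch includes no safeguard against fabricated responses.

```python
import numpy as np

def buyer_queries(buyer_data, k):
    """Buyer side: top-k principal directions and variances of the baseline data."""
    centered = buyer_data - buyer_data.mean(axis=0)
    cov = centered.T @ centered / len(buyer_data)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return eigvecs[:, top], eigvals[top]

def seller_response(seller_data, directions):
    """Seller side: variance of the seller's data along each queried direction.

    The seller sees only the directions, never the buyer's raw samples.
    """
    centered = seller_data - seller_data.mean(axis=0)
    return (centered @ directions).var(axis=0)

def diversity_relevance(buyer_var, seller_var, eps=1e-12):
    """Hypothetical scores: per-direction variance overlap and gap, averaged."""
    larger = np.maximum(buyer_var, seller_var) + eps
    relevance = float(np.mean(np.minimum(buyer_var, seller_var) / larger))
    diversity = float(np.mean(np.abs(buyer_var - seller_var) / larger))
    return diversity, relevance

# Usage: a seller whose data has the same spread scores as fully relevant,
# while a seller whose data is stretched by 2x trades relevance for diversity.
rng = np.random.default_rng(0)
buyer = rng.normal(size=(200, 5))
directions, buyer_var = buyer_queries(buyer, k=3)
same = diversity_relevance(buyer_var, seller_response(buyer, directions))
scaled = diversity_relevance(buyer_var, seller_response(2.0 * buyer, directions))
```

Under these assumed scores, diversity and relevance sum to roughly one in each direction, which is one simple way to express the trade-off the abstract refers to.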