Paper title
Try Before You Buy: A practical data purchasing algorithm for real-world data marketplaces
Paper authors
Paper abstract
Data trading is becoming increasingly popular, as evidenced by the appearance of scores of Data Marketplaces (DMs) in the last few years. Pricing digital assets is particularly complex since, unlike physical assets, digital ones can be replicated at zero cost and stored and transmitted almost for free. In most DMs, data sellers are invited to indicate a price, together with a description of their datasets. Data buyers, however, can only decide whether paying the requested price makes sense after having used the data with their AI/ML algorithms. Theoretical works have analysed the problem of which datasets to buy, and at what price, in the context of full information models, in which the performance of algorithms over any of the O(2^N) possible subsets of N datasets is known a priori, together with the value functions of buyers. Such information is, however, difficult to compute, let alone to make public in the context of real-world DMs. In this paper, we show that if a DM provides potential buyers with a measure of the performance of their AI/ML algorithm on individual datasets, then buyers can select which datasets to buy with an efficacy that approximates that of a complete information model. We call the resulting algorithm Try Before You Buy (TBYB) and demonstrate over synthetic and real-world datasets how TBYB can lead to near-optimal buying performance with only O(N) instead of O(2^N) information released by a marketplace.
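
To make the O(N) versus O(2^N) contrast concrete, here is a minimal Python sketch. It is not the TBYB algorithm itself, whose details the abstract does not give. It only compares an exhaustive full-information search with a simple greedy heuristic that ranks datasets by their single-dataset accuracy per unit price. The dataset names, quality scores, prices, linear value function, and the toy diminishing-returns accuracy model are all assumptions invented for illustration.

```python
from itertools import combinations
from math import prod

# Toy numbers, invented for illustration only (not from the paper).
QUALITY = {"A": 0.5, "B": 0.4, "C": 0.1, "D": 0.05}
PRICES = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0}
VALUE_PER_ACCURACY = 100.0  # assumed buyer value for an accuracy of 1.0


def accuracy_of(subset):
    """Toy diminishing-returns model; stands in for training a real model."""
    return 1.0 - prod(1.0 - QUALITY[d] for d in subset)


def profit(subset):
    """Buyer's value of the achieved accuracy minus the total price paid."""
    return VALUE_PER_ACCURACY * accuracy_of(subset) - sum(PRICES[d] for d in subset)


def full_information_best():
    """Exhaustive baseline: needs the accuracy of all 2^N subsets up front."""
    names = sorted(PRICES)
    subsets = [s for k in range(len(names) + 1) for s in combinations(names, k)]
    best = max(subsets, key=profit)
    return set(best), profit(best), len(subsets)


def greedy_from_single_scores():
    """Hypothetical heuristic: the marketplace reveals only N single-dataset scores."""
    solo = {d: accuracy_of((d,)) for d in PRICES}  # the N released scores
    ranking = sorted(PRICES, key=lambda d: solo[d] / PRICES[d], reverse=True)
    owned = []
    for d in ranking:
        before = accuracy_of(owned)  # measured on data the buyer already owns
        owned.append(d)              # buy, then measure the real gain
        gain = VALUE_PER_ACCURACY * (accuracy_of(owned) - before)
        if gain < PRICES[d]:         # last purchase did not pay for itself: stop
            break
    return set(owned), profit(owned), len(solo)


if __name__ == "__main__":
    print(full_information_best())
    print(greedy_from_single_scores())
```

With these toy numbers, the exhaustive search inspects 16 subsets and settles on {A, B} for a profit of about 50, while the heuristic uses only 4 marketplace scores and ends up owning {A, B, C} for a profit of about 43. The gap comes from the one purchase that did not pay for itself. How close such a heuristic gets on real data depends on how datasets interact, which this toy model does not capture.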