Paper title
The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle
Paper authors
Paper abstract
The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper, we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
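As a rough illustration of why data layout matters for batch workloads, the following toy sketch stores the same three records in a record-at-a-time layout and in a column-at-a-time layout, then extracts only the MIME types from each. It is not the paper's experiment and does not implement the WARC, Parquet, or Avro formats: the length-prefixed encoding, the sample data, and all function names (`pack`, `read_fields`, `mime_types_from_rows`, `mime_types_from_columns`) are assumptions made up for this example, and the row reader is assumed to be a naive sequential scan.

```python
import struct

# Hypothetical sample records: (url, mime type, payload), all as bytes.
RECORDS = [
    (b"http://example.org/a", b"text/html", b"<html>" + b"a" * 1000 + b"</html>"),
    (b"http://example.org/b", b"image/png", b"\x89PNG" + b"b" * 5000),
    (b"http://example.org/c", b"text/html", b"<html>" + b"c" * 2000 + b"</html>"),
]


def pack(fields):
    """Concatenate fields, each prefixed with its 4-byte big-endian length."""
    return b"".join(struct.pack(">I", len(f)) + f for f in fields)


def read_fields(blob):
    """Yield each length-prefixed field stored in a byte string."""
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        yield blob[offset:offset + length]
        offset += length


def mime_types_from_rows(blob):
    """Naive sequential scan: every field of every record is parsed."""
    fields = list(read_fields(blob))
    mimes = [f.decode() for f in fields[1::3]]  # field 1 of each 3-field record
    return mimes, len(blob)  # bytes the scan had to pass over


def mime_types_from_columns(columns):
    """Columnar read: only the MIME-type column is touched."""
    mime_column = columns[1]
    mimes = [f.decode() for f in read_fields(mime_column)]
    return mimes, len(mime_column)


# Record-at-a-time layout: url, mime, payload, url, mime, payload, ...
row_blob = pack(field for record in RECORDS for field in record)

# Column-at-a-time layout: all urls, then all mime types, then all payloads.
columns = [pack(record[i] for record in RECORDS) for i in range(3)]

row_mimes, row_bytes = mime_types_from_rows(row_blob)
col_mimes, col_bytes = mime_types_from_columns(columns)
print("row layout:   ", row_mimes, "bytes scanned:", row_bytes)
print("column layout:", col_mimes, "bytes scanned:", col_bytes)
```

Both readers return the same MIME types, but the columnar reader touches only the small MIME-type block, while the sequential reader passes over every payload. The gap grows with payload size, which is one plausible way a layout change alone could affect batch analytics that need only a few fields per record.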