Paper Title

Toward the Detection of Polyglot Files

Authors

Luke Koch, Sean Oesch, Mary Adkisson, Sam Erwin, Brian Weber, Amul Chaulagain

Abstract

Standardized file formats play a key role in the development and use of computer software. However, standardized file formats can be abused by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded because the specifications of certain file formats are flexible. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, the "file" utility. Finally, we measured the accuracy, precision, recall, and F1 score of a range of machine learning and deep learning models. MalConv2 and CatBoost achieved the highest recall on our data set, with 95.16% and 95.45%, respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.
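
To make the evasion described in the abstract concrete, here is a minimal Python sketch of one well-known polyglot layout: JPEG-style leading bytes followed by a ZIP archive. It is an illustration only, not the mitra tool, the "file" utility, or the models evaluated in the paper. The function names, the small signature table, and the toy payload are all assumptions made for this example. It relies on the fact that ZIP readers locate the archive from a record near the end of the file, so extra bytes in front do not stop the archive from being read.

```python
import io
import zipfile

# Illustrative table of leading magic bytes (an assumption for this sketch,
# far smaller than what real identification tools use).
MAGIC_AT_OFFSET_ZERO = {
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
}


def identify_by_leading_signature(data: bytes) -> str:
    """Naive identification: report the first format whose magic starts the file."""
    for magic, name in MAGIC_AT_OFFSET_ZERO.items():
        if data.startswith(magic):
            return name
    return "unknown"


def build_toy_polyglot() -> bytes:
    """Build JPEG-looking leading bytes followed by a real ZIP archive.

    The prefix is not a decodable image; it only carries the JPEG start and
    end markers so that a leading-signature check reports "jpeg".
    """
    jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", "hidden content")
    return jpeg_like + buf.getvalue()


def candidate_formats(data: bytes) -> set[str]:
    """Collect every format this sketch can recognize, not just the first match."""
    found = set()
    leading = identify_by_leading_signature(data)
    if leading != "unknown":
        found.add(leading)
    # zipfile looks for the end-of-central-directory record near the end of
    # the data, so it still recognizes an archive with bytes prepended to it.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("zip")
    return found


if __name__ == "__main__":
    blob = build_toy_polyglot()
    print(identify_by_leading_signature(blob))  # single-label answer
    print(sorted(candidate_formats(blob)))      # all formats the sketch detects
```

The single-label check reports only the JPEG signature, while the archive remains fully readable by a ZIP parser. A pipeline that routes the file to JPEG-specific feature extraction would therefore never inspect the archive contents, which is the gap a polyglot filter is meant to close.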
