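This post was last edited by Q1628393554 on 2016-1-12 21:02
I happened to come across the technology page on the avast website and thought it was pretty good. It also covers FileRepMalware and Evo-gen, so I am reposting it here to share with everyone, and to knock out a forum task for fun while I am at it. Unfortunately my English is too poor and the article uses some fairly specialized terms, so in the end the translation attempt failed.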
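Sometimes a file is so new that a single machine cannot tell whether it is malicious, and this is where real-time classification becomes especially important. Given that we have 230 million users, the system risks being overloaded when a new threat emerges. To eliminate this problem, we built a caching proxy between the clients and the Medusa compute cluster. A file can be classified differently as new information arrives from Scavenger, so the TTL of the cached decisions is set to a few minutes. Despite this, the cache hits in almost 40% of the requests.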
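A minimal sketch of the caching idea described above, assuming an in-memory cache keyed by file hash. The class name `TTLCache`, the `classify` callback standing in for the Medusa cluster, and the 300-second TTL used in the usage note are all hypothetical; the source only says the TTL is "a few minutes".

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the expiry logic can be tested
        self._store = {}        # file_hash -> (verdict, expiry_time)
        self.hits = 0
        self.misses = 0

    def get_verdict(self, file_hash, classify):
        """Return a cached verdict, or ask the backend and cache its answer."""
        now = self.clock()
        entry = self._store.get(file_hash)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Missing or expired: the decision may have changed, so ask again.
        self.misses += 1
        verdict = classify(file_hash)
        self._store[file_hash] = (verdict, now + self.ttl)
        return verdict
```

With `cache = TTLCache(ttl_seconds=300)`, repeated calls to `cache.get_verdict(file_hash, backend)` within five minutes reach the backend only once. A short TTL trades some hit rate for fresher decisions.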
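Each Medusa node uses 2 or 4 Nvidia GPUs. Each Medusa cluster has one master node, which is responsible for the whole cluster, and several slave nodes. Classification needs a large number of clean and malicious samples. The Evo-gen generator also uses some unclassified samples. Because the sets are used differently, we keep them separate. The clean set is the most important, because false positives are very costly. As a result, the clean set takes up the most space and is the slowest to scan. To increase throughput, we keep the clean set mirrored. Recent malware and unclassified samples take up only about 10% of the space used by the clean files.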
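- Retraining the model, which only requires adding samples to the right set or removing them from it.
- Understanding the reason behind a particular decision.
- Tuning the false-positive rate.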
Custom-distance function
Each sample is represented by a constant-sized feature vector consisting of approximately 100 attributes. We keep the exact composition of the feature vector secret, but, for example, obvious candidates such as section table data in the Portable Executable format are included. In general, there are static and dynamic features, categorized as offsets, sizes, checksums, factors, bit flags and generic numbers. Taking into account the nature of the attributes, we ended up with several distance operators and a weighting scheme that equalizes the importance of the attributes. The following table contains a sample of the operators we use.
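A rough sketch of what a per-attribute distance with a weighting scheme could look like. The three operators below (`size_distance`, `checksum_distance`, `flags_distance`), the `SCHEMA` layout and the weights are purely hypothetical illustrations, since the real feature vector and operator table are not given here.

```python
def size_distance(a, b):
    """Relative difference in [0, 1] for non-negative sizes or offsets."""
    largest = max(a, b)
    return 0.0 if largest == 0 else abs(a - b) / largest


def checksum_distance(a, b):
    """Checksums carry no ordering, so they either match or they do not."""
    return 0.0 if a == b else 1.0


def flags_distance(a, b, width=32):
    """Fraction of differing bits (normalized Hamming distance)."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1") / width


# Hypothetical schema: one (name, operator, weight) triple per attribute.
SCHEMA = [
    ("section_size", size_distance, 1.0),
    ("section_checksum", checksum_distance, 1.0),
    ("section_flags", flags_distance, 1.0),
]


def distance(x, y, schema=SCHEMA):
    """Weighted average of per-attribute distances, always in [0, 1]."""
    if not (len(x) == len(y) == len(schema)):
        raise ValueError("feature vectors must match the schema length")
    total_weight = sum(weight for _, _, weight in schema)
    weighted = sum(
        weight * op(a, b) for (_, op, weight), a, b in zip(schema, x, y)
    )
    return weighted / total_weight
```

Because every operator returns a value between 0 and 1, the weights alone decide how much each attribute contributes, which is one simple way to equalize attributes of very different natures.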
kNN classifier
The most common approach for instance-based learning is the nearest neighbor classification. To fine-tune our classifier, we built a tool, called Pythia, which displays the nearest neighbors of a given query sample. It uses a dimensionality reduction method (NMDS) to display the neighbors in 2D space, and also displays additional metadata for the selected samples. This information can be used by a human to determine whether or not it is feasible to distinguish between malware and clean neighbors in the current case. The goal was to create a fully autonomous system, which means high precision at the cost of lower recall. After some experimenting, we added a few thresholds, including minimal allowed distance to clean files, maximal allowed distance to malware files, as well as a weighting term that shifts the balance between clean and malware sets.
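One way the thresholds described above could fit together, as a sketch only. The function name `knn_verdict`, the parameter names, the default values and the use of Euclidean distance (`math.dist`, Python 3.8+) are all assumptions; the real classifier uses the custom distance function and thresholds that are not published.

```python
import math


def knn_verdict(query, clean, malware, k=3, min_clean_dist=1.0,
                max_malware_dist=0.5, malware_weight=1.0, dist=math.dist):
    """Return "malware" only when every safety threshold is satisfied."""
    labelled = [(dist(query, s), "clean") for s in clean]
    labelled += [(dist(query, s), "malware") for s in malware]
    labelled.sort(key=lambda pair: pair[0])

    nearest_clean = min(
        (d for d, label in labelled if label == "clean"), default=math.inf
    )
    nearest_malware = min(
        (d for d, label in labelled if label == "malware"), default=math.inf
    )

    # Too close to a known clean file: refuse to convict (favors precision).
    if nearest_clean < min_clean_dist:
        return "unknown"
    # No known malware close enough to justify a detection.
    if nearest_malware > max_malware_dist:
        return "unknown"

    # Weighted vote among the k nearest neighbors.
    top = labelled[:k]
    malware_votes = sum(1 for _, label in top if label == "malware")
    clean_votes = len(top) - malware_votes
    if malware_weight * malware_votes > clean_votes:
        return "malware"
    return "unknown"
```

Returning "unknown" instead of "clean" in the doubtful cases reflects the trade-off in the text: the system gives up recall so that the detections it does make are very likely correct.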
Real-world data
Redundancy in real-world data is very pronounced. Our internal systems receive about 250,000 new PE files every day. Using strict clustering criteria (low threshold distance and complete linkage), 150,000 of them can be assigned directly to one of 20,000 clusters. Each cluster can then be classified as a whole. That means 130,000 fewer decisions to make, and that the total number of clusters does not grow by 20,000 every day, as the clusters overlap between days.
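To illustrate the complete-linkage criterion mentioned above, here is a small greedy sketch: a sample joins a cluster only if it is within the threshold of every existing member. This is a simplified streaming approximation, not full agglomerative clustering, and the function name `assign_to_clusters` and the toy 1D data in the usage note are made up for illustration.

```python
def assign_to_clusters(samples, threshold, dist):
    """Greedily group samples under a complete-linkage style criterion."""
    clusters = []
    for sample in samples:
        for members in clusters:
            # Complete linkage: the farthest member must still be close.
            if all(dist(sample, m) <= threshold for m in members):
                members.append(sample)
                break
        else:
            # No existing cluster accepts the sample, so it starts a new one.
            clusters.append([sample])
    return clusters
```

For example, `assign_to_clusters([0.0, 0.1, 0.2, 5.0, 5.1, 9.0], 0.25, lambda a, b: abs(a - b))` yields three clusters, so six samples need only three decisions. The same arithmetic gives the figure in the text: 150,000 files in 20,000 clusters means 150,000 - 20,000 = 130,000 fewer decisions.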
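Avast checks every executable on the user's machine before it runs. If the file does not match anything in the current virus definitions, its file reputation is queried. If the result shows that very few users have the file, it is executed in the avast Sandbox. If the program's behavior does not match any known attack behavior, the real-time classifier is invoked. Avast extracts the file's features, sends them to the cloud servers and waits for the response. Most low-prevalence files are benign. Of the roughly 250,000 requests per day, about 4,000 are judged to be malware.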
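The sequence of checks above can be sketched as a chain of early exits. Everything here is hypothetical: the function name `check_executable`, the four callbacks, the returned strings and the `low_prevalence` cut-off are placeholders, and the text does not say what happens to files that are widely used, so allowing them is an assumption.

```python
def check_executable(file, matches_signature, prevalence,
                     sandbox_flags_attack, cloud_classify, low_prevalence=10):
    """Walk a file through the layered checks and return a verdict string."""
    # 1. Known threats are caught by the current virus definitions.
    if matches_signature(file):
        return "blocked: signature"
    # 2. File reputation: only rarely seen files go on to the deeper checks.
    #    (Assumption: files with enough users are allowed at this point.)
    if prevalence(file) >= low_prevalence:
        return "allowed: prevalent"
    # 3. Run the file in the sandbox and compare with known attack behavior.
    if sandbox_flags_attack(file):
        return "blocked: behavior"
    # 4. Last resort: send the features to the cloud classifier.
    if cloud_classify(file) == "malware":
        return "blocked: classifier"
    return "allowed: classifier"
```

Only the files that pass through every earlier layer reach the classifier, which fits the figures in the text: about 4,000 of 250,000 daily requests, or 1.6%, end in a malware verdict.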
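The older string-based signatures approach was effective and generalized well across the variants of a threat. But signatures need analysts and time. In today's threat landscape, given all the variants and the relationships between them, there are not enough people or time to do the analysis. We need a signature-like method that does not depend on human intervention or take a lot of time.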
Enter Evo-gen. Evo-gen leverages the distance function to create a set of similar feature vectors. Once we have a set of very similar feature vectors from the distance function, we can start to pick features that make them similar and build a rule set from those features. It is somewhat similar to rule-set generation in decision trees, but the objectives are different. To speed up generalization, we use as few rules as possible while keeping hits in the clean set at zero. But there are many ways to pick 20 rules from 100 possible ones: 5.36 x 10^20, or about 536 quintillion, numerically speaking. We're currently taming the combinatorial explosion with a stochastic approach, which provides better results than Scavenger approaches. Here again, the speed of the GPUs proves extremely important. While trying to understand how the Evo-gen rule sets (blue) affect the signature "ecosystem," we produced the following visualization. Each blob represents a different rule set or signature, and the size of the blob is proportional to the number of detected variants.
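The count of ways to pick 20 rules from 100 can be checked with `math.comb` (Python 3.8+), and the idea of a stochastic search for a small rule set with zero clean hits can be sketched with random restarts. This is an illustrative toy, not Evo-gen itself: the rule format (exact-match on feature values), the function names and the number of tries are all assumptions.

```python
import math
import random

# Number of ways to choose 20 rules out of 100 candidates, about 5.36e20.
WAYS = math.comb(100, 20)


def matches(rule, vector):
    """A rule is a dict {feature_index: required_value}; all must match."""
    return all(vector[i] == value for i, value in rule.items())


def clean_hits(rule, clean):
    """Count clean vectors that the rule would wrongly detect."""
    return sum(1 for vector in clean if matches(rule, vector))


def find_rule(cluster, clean, tries=200, seed=0):
    """Randomized search for the smallest rule with zero clean hits."""
    rng = random.Random(seed)
    first = cluster[0]
    # Only features shared by every cluster member can appear in the rule,
    # otherwise the rule would miss some variant in the cluster.
    shared = [
        i for i in range(len(first))
        if all(vector[i] == first[i] for vector in cluster)
    ]
    best = None
    for _ in range(tries):
        order = shared[:]
        rng.shuffle(order)
        rule = {}
        for i in order:
            rule[i] = first[i]
            if clean_hits(rule, clean) == 0:
                break
        else:
            continue  # this ordering never reached zero clean hits
        if best is None or len(rule) < len(best):
            best = dict(rule)
    return best
```

Each try adds features in a random order and stops as soon as no clean file matches, so different orderings produce rule sets of different sizes and the shortest one is kept. Evaluating many candidate rule sets against a large clean set is the expensive part, which is where fast hardware matters.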
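By leveraging the largest cross-platform network, the cloud, machine learning and proprietary big-data analytics, avast delivers a uniquely refined security solution. Unlike other security vendors, because we have more users, our protection network becomes stronger and easier to manage. Anyone who connects to our network is protected immediately, wherever they are. We call this Global Security.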