Avast technology page

Posted on 2016-1-12 21:03:00

Last edited by Q1628393554 on 2016-1-12 21:02

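I stumbled on the technology page on the avast website and thought it was pretty good. It also covers FileRepMalware and Evo-gen, so I'm reposting it here to share with everyone, and to tick off a forum task while I'm at it. Unfortunately my English is too poor and the article contains some fairly specialized terms, so in the end I failed.
Below is the half-finished translation for reference. I recommend reading the English original.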

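How is Avast's unique security made?

Just as computing has changed enormously in recent years, the need to protect users has grown as well. The ever-increasing variety of platforms, operating systems, programs and devices means more avenues of attack. By leveraging the world's largest cross-platform sensor network, the cloud, machine learning and proprietary big-data analytics, avast offers a unique, highly sophisticated approach to today's security challenges.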

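Security in numbers

Our protection network has four main parts. First, we have 230 million devices spread around the world, which act as an unrivaled sensor network for our protection system. Second, we have a huge database of more than 5 billion files, all of which have been checked by avast; we call it FileRep (file reputation ^_^). Third, Scavenger, our internal system, tracks every sample avast has detected and handles all the analysis work of malware classification and attack detection. Fourth, we use GPUs to process the large volume of data, so new samples can be classified quickly and automatically. We call that system Medusa.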

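Synchronization

To be effective, Medusa needs to keep up with every change in Scavenger. That includes new files arriving, files being judged as malware or not, and files being removed from the classification sets. Persistent message queues in RabbitMQ ensure that no update is lost, even while the Medusa service is being upgraded or maintained. In addition, consistency between Medusa and Scavenger is checked continuously by comparing the complete identifier stores.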
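The identifier-store comparison mentioned above could look roughly like the sketch below. This is only an illustration: the function name and the idea of holding both stores as in-memory sets of sample IDs are assumptions, not details from the avast page.

```python
def diff_identifier_stores(scavenger_ids, medusa_ids):
    """Compare two identifier stores (hypothetical representation: sets of IDs).

    Returns two sorted lists:
      missing - IDs Scavenger knows about but Medusa lacks (lost additions)
      stale   - IDs Medusa still holds but Scavenger has dropped (lost removals)
    """
    scavenger_ids = set(scavenger_ids)
    medusa_ids = set(medusa_ids)
    missing = sorted(scavenger_ids - medusa_ids)
    stale = sorted(medusa_ids - scavenger_ids)
    return missing, stale


# Example: one update was lost in each direction.
missing, stale = diff_identifier_stores({"a1", "b2", "c3"}, {"a1", "b2", "e5"})
print("re-send to Medusa:", missing)
print("remove from Medusa:", stale)
```

An empty result on both sides means the two systems agree; anything else tells you which updates need to be replayed.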

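Real-time deployment

Sometimes a file is so new that a machine cannot determine on its own whether it is malware, and that is when real-time classification becomes especially important. Given that we have 230 million users, the system risks being overloaded when a new threat emerges. To eliminate that problem, we placed a caching proxy between the clients and the Medusa compute cluster. A file can be classified differently as new information arrives from Scavenger, so the TTL of the cached decisions is set to a few minutes. Despite this, the cache is hit in almost 40% of requests.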
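A minimal sketch of a cache with a short TTL, to show why a verdict can be reused for a few minutes and then has to be recomputed. The class name, the 180-second TTL and the `compute` callback standing in for the Medusa cluster are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Tiny TTL cache; `clock` is injectable so expiry can be tested."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (verdict, expiry timestamp)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1  # still fresh: answer without touching the backend
            return entry[0]
        self.misses += 1  # absent or expired: ask the backend again
        verdict = compute(key)
        self._entries[key] = (verdict, now + self.ttl)
        return verdict


# Hypothetical usage: the lambda stands in for a real classification request.
cache = TTLCache(ttl_seconds=180)
print(cache.get_or_compute("feature-hash-1", lambda key: "clean"))
print(cache.get_or_compute("feature-hash-1", lambda key: "clean"))
print("hits:", cache.hits, "misses:", cache.misses)
```

A short TTL trades hit rate for freshness: a changed verdict from the backend reaches clients within one TTL period at the latest.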

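Cluster setup

Each Medusa node uses 2 or 4 Nvidia GPUs. Each Medusa cluster has a master node responsible for the whole set, plus several slave nodes. Classification requires a large number of clean and malicious samples. The Evo-gen generator also uses some unclassified samples. Because they are used differently, we keep the sets separate. The clean set is the most important one, because false positives are very costly. As a result, the clean set takes up the most space and is the slowest to scan. To increase throughput, we keep the clean set mirrored. Recent malware and unclassified samples take up only about 10% of the space of the clean files.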

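Machine learning

The only way to make effective use of the information gathered over the years is machine learning. We use instance-based learning because it has many advantages, including the following:

Re-learning the model only requires adding samples to the right set or removing them from it.
The reasons behind a specific decision can be understood.
The false-positive rate can be tuned.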

Custom-distance function

Each sample is represented by a constant-sized feature vector consisting of approximately 100 attributes. We keep the exact composition of the feature vector secret, but, for example, obvious candidates such as section table data in the Portable Executable format are included. In general, there are static and dynamic features, categorized as offsets, sizes, checksums, factors, bit flags and generic numbers. Taking into account the nature of the attributes, we ended up with several distance operators and a weighting scheme that equalizes the importance of the attributes. The following table contains a sample of the operators we use.
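To make the idea of per-attribute distance operators plus a weighting scheme concrete, here is a small sketch. The real feature vector and operators are kept secret by avast, so every operator, attribute name and weight below is a made-up stand-in.

```python
def numeric_distance(a, b):
    """Relative difference in [0, 1]; assumes non-negative numbers."""
    if a == b:
        return 0.0
    return abs(a - b) / max(a, b)


def checksum_distance(a, b):
    """Checksums either match or they do not."""
    return 0.0 if a == b else 1.0


def bitflag_distance(a, b, width=32):
    """Fraction of differing bits (normalized Hamming distance)."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1") / width


# Hypothetical schema: (attribute name, operator, weight).
SCHEMA = [
    ("section_size", numeric_distance, 1.0),
    ("section_checksum", checksum_distance, 2.0),
    ("characteristics", bitflag_distance, 1.0),
]


def distance(x, y, schema=SCHEMA):
    """Weighted average of per-attribute distances, so the result is in [0, 1]."""
    total_weight = sum(weight for _, _, weight in schema)
    score = sum(weight * op(x[name], y[name]) for name, op, weight in schema)
    return score / total_weight


a = {"section_size": 100, "section_checksum": 0xAB, "characteristics": 0b1010}
b = {"section_size": 50, "section_checksum": 0xAB, "characteristics": 0b1010}
print(distance(a, b))
```

Each operator returns a value between 0 and 1, so the weights alone decide how much an attribute type contributes to the final distance.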

kNN classifier

The most common approach for instance-based learning is the nearest neighbor classification. To fine-tune our classifier, we built a tool, called Pythia, which displays the nearest neighbors of a given query sample. It uses a dimensionality reduction method (NMDS) to display the neighbors in 2D space, and also displays additional metadata for the selected samples. This information can be used by a human to determine whether or not it is feasible to distinguish between malware and clean neighbors in the current case. The goal was to create a fully autonomous system, which means high precision at the cost of lower recall. After some experimenting, we added a few thresholds, including minimal allowed distance to clean files, maximal allowed distance to malware files, as well as a weighting term that shifts the balance between clean and malware sets.
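The three thresholds described above can be sketched as guards around an ordinary kNN vote. This is a toy illustration under assumptions: Euclidean distance replaces the custom distance function, and the parameter names and default values are invented.

```python
import math


def classify(query, clean, malware, k=3,
             min_clean_distance=1.0, max_malware_distance=0.5,
             malware_weight=1.0, distance=math.dist):
    """Return 'malware' only when every guard passes, otherwise 'unknown'."""
    if not malware:
        return "unknown"
    d_clean = sorted(distance(query, c) for c in clean)
    d_malware = sorted(distance(query, m) for m in malware)

    # Guard 1: too close to a known clean file, so refuse to convict.
    if d_clean and d_clean[0] < min_clean_distance:
        return "unknown"
    # Guard 2: the nearest malware sample must be close enough.
    if d_malware[0] > max_malware_distance:
        return "unknown"

    # Vote among the k nearest neighbours; the weight shifts the balance.
    neighbours = sorted(
        [(d, "clean") for d in d_clean] + [(d, "malware") for d in d_malware]
    )[:k]
    malware_votes = malware_weight * sum(1 for _, lab in neighbours if lab == "malware")
    clean_votes = sum(1 for _, lab in neighbours if lab == "clean")
    return "malware" if malware_votes > clean_votes else "unknown"


clean_set = [(10.0, 10.0), (11.0, 10.0)]
malware_set = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2)]
print(classify((0.05, 0.05), clean_set, malware_set))
print(classify((9.9, 10.0), clean_set, malware_set))
```

Returning "unknown" rather than "clean" when a guard fails reflects the precision-over-recall goal: the sketch would rather miss a sample than convict a clean file.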

Real-world data

Real-world data is highly redundant. Our internal systems receive roughly 250,000 new PE files every day. With strict clustering criteria (low threshold distance and complete linkage), 150,000 of them can be assigned directly to one of 20,000 clusters. Each cluster can then be classified as a whole. That means 130,000 fewer decisions to make, and the total number of clusters does not grow by 20,000 every day, because the clusters overlap between days.
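Here is a rough sketch of what a complete-linkage criterion with a low threshold means in practice: a sample may join a cluster only if it is within the threshold of every member. Note that this is a simplified greedy, single-pass assignment rather than full agglomerative clustering, and Euclidean distance is used as a placeholder for the custom distance function.

```python
import math


def assign_to_clusters(samples, threshold, distance=math.dist):
    """Greedy assignment under a complete-linkage criterion (simplified)."""
    clusters = []
    for sample in samples:
        for members in clusters:
            # Complete linkage: the farthest member must still be within threshold.
            if all(distance(sample, m) <= threshold for m in members):
                members.append(sample)
                break
        else:
            clusters.append([sample])  # nothing fits, so start a new cluster
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
clusters = assign_to_clusters(points, threshold=0.5)
print(len(clusters), "clusters")

# The arithmetic from the paragraph: one decision per cluster instead of per file.
decisions_saved = 150_000 - 20_000
print(decisions_saved)
```

Five samples become two clusters here, so only two verdicts are needed instead of five, which is the same saving the post describes at a much larger scale.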

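World-class protection

We have several different ways of delivering malware classifications to users. To provide the highest level of protection, we use the following three methods.

Real-time classification

Avast checks every executable on the user's machine before it runs. If the file matches nothing in the current virus definitions, its file reputation is queried. If the response shows that very few users have the file, it is executed in the avast Sandbox. If the program's behavior does not match any known attack behavior, the real-time classifier is invoked: Avast extracts the file's features, sends them to the cloud servers and waits for the response. Most low-prevalence files are benign. Of the roughly 250,000 requests per day, about 4,000 are judged to be malware.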
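The order of checks in that paragraph can be written down as a small decision function. Everything here is a hypothetical stand-in: the callbacks, the dictionary layout, the prevalence threshold of 10 users, and in particular what happens to prevalent files, which the article does not spell out.

```python
def check_executable(file, definitions, query_reputation, run_in_sandbox,
                     classify_in_cloud, low_prevalence_threshold=10):
    """Walk through the checks in the order the article describes them."""
    # 1. Local virus definitions.
    if file["hash"] in definitions:
        return "blocked: known malware"
    # 2. File reputation lookup (assumption: prevalent files are let through).
    reputation = query_reputation(file["hash"])
    if reputation["user_count"] >= low_prevalence_threshold:
        return "allowed: prevalent file"
    # 3. Sandbox run; the callback returns True for known attack behaviour.
    if run_in_sandbox(file):
        return "blocked: known attack behaviour"
    # 4. Real-time classifier: send features to the cloud and wait.
    verdict = classify_in_cloud(file["features"])
    return "blocked: cloud verdict" if verdict == "malware" else "allowed"


# Hypothetical usage with stub callbacks instead of real services.
sample = {"hash": "abc", "features": [1, 2, 3]}
result = check_executable(
    sample,
    definitions={"known-bad"},
    query_reputation=lambda h: {"user_count": 2},
    run_in_sandbox=lambda f: False,
    classify_in_cloud=lambda feats: "malware",
)
print(result)
```

The cheap local checks come first, so the cloud classifier only sees the small share of files that are both unknown and rare.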

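FileRepMalware (file-reputation malware ^_^)

Once a file has been judged malicious and our internal systems have determined that it is safe to detect the file globally, a flag is set on the FileRep servers. Every avast client that encounters the file will then block it immediately and report it as FileRepMalware (I've actually seen this one ^_^).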

Evo-gen

String-based signatures used to be very effective, and they generalize well to variants of a threat. But signatures require analysts and time. In today's threat landscape, given all the variants and how they relate to each other, there are not enough people or time to do the analysis. We need a signature-like approach that does not depend on human intervention or take up a lot of time.
Enter Evo-gen. Evo-gen leverages the distance function to create a set of very similar feature vectors. Once we have that set, we can start to pick the features that make the vectors similar and build a rule set from them. It is somewhat similar to rule-set generation in decision trees, but the objectives are different. To speed up generalization, we can use as few rules as possible while keeping hits in the clean set at zero. But there are many ways to pick 20 rules from 100 possible ones: about 5.36 × 10^20, or 536 quintillion, numerically speaking. We're currently taming the combinatorial explosion with a stochastic approach, which provides better results than Scavenger approaches. Here again, the speed of the GPUs proves extremely important. While trying to understand how the Evo-gen rule sets (blue) affect the signature "ecosystem," we produced the following visualization. Each blob represents a different rule set or signature, and the size of the blob is proportional to the number of detected variants.
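The combinatorial count and the general idea of a stochastic search for a small rule set can be sketched as follows. This is not avast's algorithm: the random-restart strategy, the exact-match rules and all names are assumptions chosen to keep the example short.

```python
import math
import random

# Ways to pick 20 rules out of 100 candidates (roughly 5.36e20).
print(math.comb(100, 20))


def matches(rules, vector):
    """A vector matches when it agrees with every (index, value) rule."""
    return all(vector[i] == value for i, value in rules.items())


def stochastic_rule_set(cluster, clean_set, tries=200, seed=0):
    """Randomly search for a small rule set with zero hits in the clean set."""
    rng = random.Random(seed)
    reference = cluster[0]
    # Only features shared by every vector in the cluster can become rules.
    shared = [i for i in range(len(reference))
              if all(v[i] == reference[i] for v in cluster)]
    best = None
    for _ in range(tries):
        order = shared[:]
        rng.shuffle(order)
        rules = {}
        for i in order:
            rules[i] = reference[i]
            if not any(matches(rules, c) for c in clean_set):
                break  # zero clean hits reached, stop adding rules
        else:
            continue  # even all shared features still hit a clean file
        if best is None or len(rules) < len(best):
            best = rules
    return best


cluster = [(1, 2, 3, 4), (1, 2, 3, 9)]
clean = [(1, 0, 0, 0), (0, 2, 3, 0)]
rules = stochastic_rule_set(cluster, clean)
print(rules)
```

Each try adds shared features in a random order and stops as soon as no clean vector matches, so the shortest rule set found over many tries is kept. A production search would need to be much smarter, and this is presumably where the GPU speed mentioned in the post comes in.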

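Avast: the intelligent security solution

By leveraging the largest cross-platform network, the cloud, machine learning and proprietary big-data analytics, avast delivers a unique, fine-grained security solution. Unlike other security vendors, because we have more users, our protection network grows stronger and easier to manage. Anyone who connects to our network is protected immediately, wherever they are. We call this Global Security.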

Posted on 2016-1-12 22:02:32

Waiting for you to finish the translation.

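Posted on 2016-1-13 08:22:09

root1605 posted on 2016-1-12 22:02
Waiting for you to finish the translation.

Boss, I probably won't be able to finish it. Some of the fairly specialized terminology and the algorithm explanations are beyond me, and I can't just translate them blindly.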

Posted on 2016-1-14 16:43:40

The names are all so chuunibyou, lol

So avast's cloud is basically AI, then? Does anyone know whether other vendors are doing something similar?

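Posted on 2016-1-14 17:26:19

也不知道谁 posted on 2016-1-14 16:43
The names are all so chuunibyou, lol

So avast's cloud is basically AI, then? Does anyone know whether other vendors are doing something similar?

Everyone is probably doing something similar.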
