New Toy in the Avast Research Lab

Posted on 2013-4-28 00:31:07

Last edited by ssama on 2013-6-9 21:36

New Toy in the Avast Research Lab
vlk
December 3rd, 2012

The Avast Research Lab is where some of Avast’s brightest minds create new ways of detecting malware. These are both features inside the product (such as FileRep and autosandboxing, including all of its recent development) and components that run on our backend, i.e. things that users don’t necessarily see but that are equally important for the overall quality of the product.

In fact, working on the backend stuff takes up more of their time these days, as more and more intelligence in Avast is moving to the cloud and/or is being delivered in almost real time via the avast! streaming update technology.

The Avast backend classifiers use a number of techniques, but the two hot ones that the team has been working hard on recently are what we call Malware Similarity Search and Evo-Gen.

Malware Similarity Search is an important feature that allows us to categorize a large number of incoming samples almost instantly. That is, for any file, it can say whether the file looks similar to an already seen malware file (or a whole cluster of malware files) and whether it’s similar to a known clean file (or a cluster of these). This may sound like an easy problem to solve, but in practice it is pretty difficult. Of course, the secret sauce here is how you define the metric (to be able to talk about similarity) and what you take into account when representing a file. At Avast, we take into account both static properties of the file and the outcome of a dynamic analysis (i.e. logs gathered during the execution of the file).

Now, a technology like this is obviously very valuable as it allows us to make fast decisions about files that we have never seen before. For example, if a file is very similar to a cluster of known malware samples, and at the same time it is not similar to any clean files, we categorize it immediately as malware. Believe it or not, we’re seeing thousands of files like this every day.
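
To make the decision rule above concrete, here is a minimal Python sketch. It is not Avast's method: the article does not disclose the metric or the file representation, so the Jaccard similarity, the feature names and the 0.5 / 0.3 thresholds are all assumptions chosen only for illustration.

```python
from typing import FrozenSet, Iterable

Features = FrozenSet[str]


def jaccard(a: Features, b: Features) -> float:
    """Jaccard similarity of two feature sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_similarity(sample: Features, known: Iterable[Features]) -> float:
    """Similarity to the closest known file, or 0.0 if none are known."""
    return max((jaccard(sample, k) for k in known), default=0.0)


def classify(sample, malware, clean, near=0.5, far=0.3) -> str:
    """Assumed rule: close to one set AND far from the other, else undecided."""
    sim_malware = best_similarity(sample, malware)
    sim_clean = best_similarity(sample, clean)
    if sim_malware >= near and sim_clean <= far:
        return "malware"
    if sim_clean >= near and sim_malware <= far:
        return "clean"
    return "unknown"


# Hypothetical features mixing static properties and dynamic-analysis log events.
malware = [frozenset({"packed:upx", "imports:WriteProcessMemory",
                      "log:writes_run_key", "log:connects_irc"})]
clean = [frozenset({"signed:yes", "imports:CreateWindowEx", "log:reads_config"})]

sample = frozenset({"packed:upx", "imports:WriteProcessMemory",
                    "log:writes_run_key", "log:connects_http"})
print(classify(sample, malware, clean))
```

A real system would compare against clusters rather than single files and would need an index to avoid scanning every known sample, but the two-sided check (similar to malware, not similar to clean) is the part the paragraph describes.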

The second technology I mentioned, Evo-Gen, is somewhat similar but a bit subtler in nature. It is about finding descriptions of large sets of malware samples that are as short and generic as possible. Say you take a set of 1,000,000 malware samples (and 1,000,000 clean files) and give the algorithm the following task: find as few and as brief descriptions as possible that cover as many samples in the malware set as possible, without describing any file in the clean set. Evo-Gen is a genetic algorithm that we have developed just for that. It often finds some real gems for us, e.g. a description of an apparently random set of tens of thousands of malware files scattered across our virus sets. And the size of the description? 8 bytes.
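
The task handed to Evo-Gen can be illustrated with a toy genetic algorithm in Python. This is only a sketch under loud assumptions: the article does not say what a "description" is, so here it is assumed to be a fixed-length byte substring, and the fitness function, mutation, crossover and sample data are all hypothetical.

```python
import random
from typing import List, Tuple

PATTERN_LEN = 4  # toy length; the article mentions a real description of 8 bytes


def coverage(pattern: bytes, files: List[bytes]) -> int:
    """Number of files that contain the pattern."""
    return sum(pattern in f for f in files)


def fitness(pattern: bytes, malware: List[bytes], clean: List[bytes]) -> int:
    """Malware files covered; -1 if the pattern describes any clean file."""
    if coverage(pattern, clean) > 0:
        return -1
    return coverage(pattern, malware)


def random_ngram(rng: random.Random, files: List[bytes]) -> bytes:
    f = rng.choice(files)
    start = rng.randrange(len(f) - PATTERN_LEN + 1)
    return f[start:start + PATTERN_LEN]


def mutate(rng: random.Random, pattern: bytes, malware: List[bytes]) -> bytes:
    """Either resample a fresh n-gram from the malware set or change one byte."""
    if rng.random() < 0.5:
        return random_ngram(rng, malware)
    i = rng.randrange(PATTERN_LEN)
    return pattern[:i] + bytes([rng.randrange(256)]) + pattern[i + 1:]


def crossover(rng: random.Random, a: bytes, b: bytes) -> bytes:
    cut = rng.randrange(1, PATTERN_LEN)
    return a[:cut] + b[cut:]


def evolve(malware, clean, pop_size=30, generations=40, seed=0) -> Tuple[bytes, int]:
    rng = random.Random(seed)
    population = [random_ngram(rng, malware) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, malware, clean), reverse=True)
        survivors = population[: pop_size // 2]  # elitism: best half kept unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(rng, crossover(rng, a, b), malware))
        population = survivors + children
    best = max(population, key=lambda p: fitness(p, malware, clean))
    return best, fitness(best, malware, clean)


# Hypothetical data: every file shares a header; only malware carries MARKER.
HEADER = b"MZ\x90\x00"
MARKER = b"\xde\xad\xbe\xef"
malware = [HEADER + bytes([i] * 6) + MARKER + bytes([255 - i] * 3) for i in range(1, 9)]
clean = [HEADER + bytes([100 + i] * 12) for i in range(8)]

best, score = evolve(malware, clean)
print(best.hex(), score)
```

The shared header shows why the clean set matters: a pattern taken from the header covers every malware file but also every clean file, so it is rejected. This sketch evolves a single description; finding a small set of descriptions that together cover the malware set, as the article describes, would need an outer loop that is not shown here.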

Now, if you think about this for a while, you will find out that both of these algorithms have something in common. I mean, for both of them it’s necessary to have super-fast access to our vast sets of clean and malware files. Forget about sequential access (or any kind of processing of the files one by one). Even reading the samples off the disks takes hours.

For this purpose, the team has developed another great piece of technology that we call MDE. It’s basically an in-memory database that works on top of indexed data and allows heavily parallel access.
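
The article does not describe how MDE works internally, so the following Python sketch only illustrates the general idea of an in-memory index that many queries can read at once. The class name, the inverted-index layout and the feature strings are assumptions, not MDE's design.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Set


class InMemoryIndex:
    """Toy inverted index: feature -> ids of the samples that have it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, sample_id: int, features: Iterable[str]) -> None:
        for feature in features:
            self._postings[feature].add(sample_id)

    def query(self, features: Iterable[str]) -> Set[int]:
        """Ids of the samples that have every requested feature."""
        # .get() keeps lookups read-only, so concurrent queries never
        # insert new keys into the shared dictionary.
        sets = [self._postings.get(f, set()) for f in features]
        if not sets:
            return set()
        return set.intersection(*sets)


def query_many(index: InMemoryIndex, queries: List[List[str]],
               workers: int = 4) -> List[Set[int]]:
    """Run independent read-only queries concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index.query, queries))


index = InMemoryIndex()
index.add(1, ["packed:upx", "log:writes_run_key"])
index.add(2, ["packed:upx", "signed:yes"])
index.add(3, ["log:writes_run_key"])

results = query_many(index, [
    ["packed:upx"],
    ["packed:upx", "log:writes_run_key"],
    ["no_such_feature"],
])
print(results)
```

In CPython, threads will not speed up CPU-bound lookups because of the global interpreter lock, so this shows the access pattern only: the data is indexed once, kept in memory, and then read by many independent queries without touching the disk.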

Traditionally, we have been running these things on classic server hardware. For the most part, we use standard Dell servers based on Intel Xeon CPUs. However, the performance has never been great and we always thought we should be doing better.

One of the Intel CPU-based racks used by the Avast virus lab.

Posted on 2013-4-28 00:31:58

Last edited by ssama on 2013-4-28 00:33

The real breakthrough came when we started experimenting with GPUs. For starters, modern GPUs (both from NVidia and AMD) are not limited to high-end graphics or gaming. The good thing about them is that they are massively parallel: while today’s high-end Intel CPUs contain 6, 8 or maybe 10 cores, high-end gaming GPUs contain thousands of cores. True, each core is not that powerful on its own, but if you can unleash their potential with some good parallel algorithms, the resulting power is insane.

So, with MDE, we’re now in the process of transitioning to a GPU-based “supercomputing” farm. The environment we’re now evaluating looks like this:

You can see that it’s not a rackmount server but a workstation instead. A hell of a workstation, though, I should say. With an Intel i7 E3820 4C 3.6GHz CPU and 32 GB of DDR3 RAM, it’s not a bad start, but what’s really cool about the box is the 4 NVidia GPU-based graphics cards, each with 3 GB of RAM and connected to each other by a hose for external water cooling. The whole beast is powered by a 1,500W power supply, but in case that’s not enough, we are ready to add one more.

While we haven’t put these systems in production yet, we will likely do so soon. And I’m truly looking forward to that – as doing so will allow us to serve you, our users, even better. You never know – if this proves to be as useful as we think it will be, we may end up building something like the Titan one day…

(Now, my job in the meantime will be to keep the gamers out of the server room.)

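Posted on 2013-4-28 00:43:30

Well written: in-depth yet accessible and easy to follow. It's a bit long, but it's a fun read.

Malware Similarity Search and Evo-Gen

These two aren't exactly new anymore. GPU acceleration, though, looks like it will become a new selling point. I'm curious.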

Posted on 2013-4-28 11:56:48

ssama posted on 2013-4-28 00:31
The real breakthrough came when we started experimenting with the GPUs. For starters, modern GPUs (b ...

I'll run it through a translator and take a look.

Posted on 2013-4-28 19:36:11

[Machine translation of the full article into Chinese; it repeats the English text above.]

Posted on 2013-6-9 21:43:03

upd
It seems a lot of people don't know about Evo.

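Posted on 2013-6-9 22:15:30

ssama posted on 2013-6-9 21:43
upd
It seems a lot of people don't know about Evo.

I stared at that translation for ages and still don't understand what it is saying. Also, what on earth is Evo?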

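Posted on 2013-6-9 22:16:57

wefiwhfve posted on 2013-6-9 22:15
I stared at that translation for ages and still don't understand what it is saying. Also, what on earth is Evo?

Post #5 seems to have used Google Translate, so of course it's unreadable = =
@HearFish Ask her.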

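Posted on 2013-6-9 22:34:48

ssama posted on 2013-6-9 22:16
Post #5 seems to have used Google Translate, so of course it's unreadable = =
@HearFish Ask her.

Evo-Gen is that one, right? Genetic heuristic technology?
Also, it's "he".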

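Posted on 2013-6-9 22:43:53

wefiwhfve posted on 2013-6-9 22:15
I stared at that translation for ages and still don't understand what it is saying. Also, what on earth is Evo?

The article says they are working on two technologies. One is Malware Similarity Search, which looks at the similarity between malware samples. The other is Evo-Gen, which is presumably a genetic heuristic technique. The rest of the first post covers the details of these technologies.
Then post #2 talks about using idle GPU computing power to speed up Avast's scanning, or something like that.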
