This post was last edited by ssama on 2013-6-9 21:36
New Toy in the Avast Research Lab
vlk
December 3rd, 2012
The Avast Research Lab is where some of Avast’s brightest brains create new ways of detecting malware. These are either features inside the product (such as FileRep and autosandboxing, including all of its recent development) or components that run on our backend, i.e. things that users don’t necessarily see but that are equally important for the overall quality of the product.
In fact, working on the backend stuff takes up more of their time these days, as more and more intelligence in Avast is moving to the cloud and/or is being delivered in almost real time via the avast! streaming update technology.
The Avast backend classifiers use a number of techniques, but the two hot ones that the team has been working hard on recently are what we call Malware Similarity Search and Evo-Gen.
Malware Similarity Search is an important feature that allows us to categorize a large number of incoming samples almost instantly. That is, for any file, it is able to say whether the file looks similar to an already seen malware file (or a whole cluster of malware files) as well as whether it’s similar to a known clean file (or a cluster of these). This may sound like an easy problem to solve, but in practice it is pretty difficult. Of course, the secret sauce here is how you define the metric (to be able to talk about similarity) and what you take into account when representing a file. At Avast we take into account both static properties of the file and the outcome of a dynamic analysis (i.e. basically logs gathered during the execution of the file).
Now, a technology like this is obviously very valuable as it allows us to make fast decisions about files that we have never seen before. For example, if a file is very similar to a cluster of known malware samples, and at the same time it is not similar to any clean files, we categorize it immediately as malware. Believe it or not, we’re seeing thousands of files like this every day.
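To make the decision rule above concrete, here is a minimal sketch in Python. It is not Avast's method: the article does not disclose the metric or the file representation, so the Jaccard similarity, the string features, and the `hi`/`lo` thresholds are all assumptions chosen purely for illustration.

```python
from typing import FrozenSet, Iterable

Features = FrozenSet[str]  # hypothetical representation: a set of static/dynamic traits


def jaccard(a: Features, b: Features) -> float:
    """Jaccard similarity: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_similarity(sample: Features, known: Iterable[Features]) -> float:
    """Similarity to the closest known file (0.0 if nothing is known)."""
    return max((jaccard(sample, k) for k in known), default=0.0)


def classify(sample: Features,
             malware: Iterable[Features],
             clean: Iterable[Features],
             hi: float = 0.8,   # assumed "very similar" threshold
             lo: float = 0.3) -> str:  # assumed "not similar" threshold
    """Flag a file only when it is close to malware AND far from clean files."""
    mal = best_similarity(sample, malware)
    cln = best_similarity(sample, clean)
    if mal >= hi and cln <= lo:
        return "malware"
    if cln >= hi and mal <= lo:
        return "clean"
    return "unknown"
```

A real system would compare against clusters rather than scanning every known file, which is exactly why fast indexed access to the sample sets matters.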
The second technology I mentioned, Evo-Gen, is somewhat similar but a bit subtler in nature. It is about finding descriptions of large sets of malware samples that are as short and generic as possible. Say you take a set of 1,000,000 malware samples (and 1,000,000 clean files) and give the algorithm the following task: find as few and as brief descriptions as possible that together cover as many samples in the malware set as possible, without describing any file in the clean set. Evo-Gen is a genetic algorithm that we have developed just for that. It often happens to find some real gems for us, e.g. a description of an apparently random set of tens of thousands of malware files scattered somewhat randomly across our virus sets. And the size of the description? 8 bytes.
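The task described above can be illustrated with a toy evolutionary search. This is only a sketch under strong assumptions and not Evo-Gen itself: a "description" is assumed to be a fixed 8-byte substring, candidates are windows cut out of malware samples, and the search uses selection and mutation only (no crossover), so it is simpler than a full genetic algorithm. All names and parameters are hypothetical.

```python
import random
from typing import List, Tuple

PATTERN_LEN = 8  # assumption: fixed-size byte pattern, echoing the 8-byte example

Candidate = Tuple[int, int]  # (index of a malware sample, offset inside it)


def pattern_of(c: Candidate, malware: List[bytes]) -> bytes:
    i, off = c
    return malware[i][off:off + PATTERN_LEN]


def fitness(c: Candidate, malware: List[bytes], clean: List[bytes]) -> int:
    """Number of malware samples covered; 0 if any clean file matches."""
    pat = pattern_of(c, malware)
    if any(pat in f for f in clean):
        return 0
    return sum(pat in f for f in malware)


def random_candidate(rng: random.Random, malware: List[bytes]) -> Candidate:
    i = rng.randrange(len(malware))
    return (i, rng.randrange(len(malware[i]) - PATTERN_LEN + 1))


def mutate(c: Candidate, rng: random.Random, malware: List[bytes]) -> Candidate:
    """Either jump to a fresh window or slide the current one by a few bytes."""
    if rng.random() < 0.3:
        return random_candidate(rng, malware)
    i, off = c
    max_off = len(malware[i]) - PATTERN_LEN
    return (i, min(max(off + rng.choice((-2, -1, 1, 2)), 0), max_off))


def evolve(malware: List[bytes], clean: List[bytes],
           generations: int = 60, pop_size: int = 30,
           seed: int = 0) -> Tuple[bytes, int]:
    """Return (best pattern, number of malware samples it covers).

    Every malware sample must be at least PATTERN_LEN bytes long.
    """
    rng = random.Random(seed)
    pop = [random_candidate(rng, malware) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, malware, clean), reverse=True)
        survivors = pop[:pop_size // 3]  # elitism: keep the best third
        children = [mutate(rng.choice(survivors), rng, malware)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    best = max(pop, key=lambda c: fitness(c, malware, clean))
    return pattern_of(best, malware), fitness(best, malware, clean)
```

Note that every fitness evaluation scans both sample sets, which is affordable for a handful of toy byte strings but hints at why the real thing needs very fast access to millions of files.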
Now, if you think about this for a while, you will find out that both of these algorithms have something in common. I mean, for both of them it’s necessary to have super-fast access to our vast sets of clean and malware files. Forget about sequential access (or any kind of processing of the files one by one). Even reading the samples off the disks takes hours.
For this purpose, the team has developed another great piece of technology that we call MDE. It’s basically an in-memory database that works on top of indexed data and allows heavily parallel access.
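The article gives no details about MDE's internals, so the following is only a rough sketch of the general idea of an in-memory index queried in parallel. The inverted-index layout, the class and function names, and the use of a thread pool are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Set


class InMemoryIndex:
    """Hypothetical inverted index: feature -> ids of samples having it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, sample_id: str, features: Iterable[str]) -> None:
        for f in features:
            self._postings[f].add(sample_id)

    def lookup(self, features: Iterable[str]) -> Set[str]:
        """Return ids of samples that have every one of the given features."""
        sets = [self._postings.get(f, set()) for f in features]
        if not sets:
            return set()
        return set.intersection(*sets)


def parallel_lookup(index: InMemoryIndex,
                    queries: List[List[str]],
                    workers: int = 4) -> List[Set[str]]:
    """Run many read-only queries concurrently; results keep the query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index.lookup, queries))
```

Because the index is only read after it is built, concurrent lookups need no locking here. In CPython, threads will not give real CPU parallelism for this workload; a production system would more likely be written in a compiled language or spread across processes.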
Traditionally, we have been running these things on classic server hardware. For the most part, we use standard Dell servers based on Intel Xeon CPUs. However, the performance has never been great and we always thought we should be doing better.
One of the Intel CPU-based racks used by the Avast virus lab.