Original AV-C text (Testing Methodology and FAQ, Section 2)
Sorting procedure
Samples from all sources are copied to the incoming server.
Encrypted and archived samples/collections are decrypted and extracted from archives.
Duplicate samples are weeded out.
Files are renamed to make sorting and maintenance more effective.
File extensions are corrected to the proper executable extension by a tool created in-house.
Unrecognized file formats are given the extension “.VIR” and are moved to a separate location for further inspection.
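As a rough illustration of these first steps, the sketch below deduplicates incoming files by hash, assigns recognized formats an extension based on their magic bytes, and gives everything else a .VIR extension in a separate directory. The magic-byte table, the SHA-256-based naming and the directory layout are assumptions made for illustration, not AV-Comparatives' actual in-house tool.

```python
# Illustrative sketch only: deduplicate incoming samples by SHA-256 and give each
# file an extension based on its magic bytes, falling back to ".VIR" for formats
# the tool does not recognize. Paths and the magic table are assumptions.
import hashlib
import shutil
from pathlib import Path

MAGIC_TO_EXT = {           # tiny example table, not exhaustive
    b"MZ": ".exe",         # DOS/PE executables
    b"PK": ".zip",         # ZIP-based containers
    b"%PDF": ".pdf",
}

def sort_incoming(incoming: Path, sorted_dir: Path, unknown_dir: Path) -> None:
    seen: set[str] = set()                      # SHA-256 digests already kept
    for path in incoming.iterdir():
        if not path.is_file():
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:                      # duplicate sample -> weed out
            path.unlink()
            continue
        seen.add(digest)
        ext = next((e for m, e in MAGIC_TO_EXT.items() if data.startswith(m)), None)
        if ext is None:                         # unrecognized format -> .VIR, separate location
            shutil.move(str(path), unknown_dir / (digest + ".VIR"))
        else:                                   # rename to digest plus corrected extension
            shutil.move(str(path), sorted_dir / (digest + ext))
```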
Samples are analyzed using various tools (commercial tools, for example, but also tools used and maintained by the anti-virus community) in order to recognize known garbage or non-working samples. We also use several other static analyzers, PE parsers, and so on, including our own in-house tools.
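One very small example of the kind of structural check such static tools can perform is validating the PE layout itself. The sketch below uses only the Python standard library; the specific checks chosen (DOS header, PE signature, section count) are an illustrative subset, not a description of the tools actually used.

```python
# Illustrative sketch: a bare-bones structural check for PE files, the sort of
# test a static analyzer can use to weed out truncated or corrupted samples.
import struct

def looks_like_valid_pe(data: bytes) -> bool:
    if len(data) < 64 or data[:2] != b"MZ":              # DOS header magic
        return False
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]   # offset of the PE header
    if e_lfanew + 24 > len(data):                        # signature + COFF header must fit
        return False
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":     # PE signature
        return False
    num_sections = struct.unpack_from("<H", data, e_lfanew + 6)[0]
    return num_sections > 0                              # no sections -> almost certainly garbage
```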
Most known adware, hacker and virus tools, components, hoaxes, jokes, virus simulators, commercial software, constructors, keygens (key generators), cracks, keyloggers, engines, sniffers, unviable (bad, corrupted, inactive, damaged or intended) samples, virus source code, various garbage and disputed files, and so on are sorted out: basically, files and gray-area samples that should not be included in the main test-sets. Working adware, spyware, etc. is maintained separately for future tests based on such types of threat.
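Part of this sorting can plausibly be automated by screening the detection names attached to a sample for keywords that place it in one of the excluded categories; the keyword list and input format in the sketch below are purely illustrative assumptions.

```python
# Illustrative sketch: flag samples whose scanner verdicts suggest they belong to
# the gray-area categories listed above, so they stay out of the main test-set.
EXCLUDE_KEYWORDS = ("adware", "joke", "hoax", "keygen", "crack",
                    "constructor", "simulator", "corrupt", "intended")

def is_out_of_scope(verdicts: list[str]) -> bool:
    """verdicts: detection names reported for one sample by several scanners."""
    return any(k in v.lower() for v in verdicts for k in EXCLUDE_KEYWORDS)

# Example: is_out_of_scope(["Joke.Win32.Stressor", "not-a-virus:Keygen"]) -> True
```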
All PE malware is analyzed by a sandbox developed by people working at AV-Comparatives, and also by various commercial sandboxes, in order to exclude non-working samples and other garbage. Non-PE malware is also checked by some automated tools, but it usually needs to be checked manually, as do some PE files that our sandbox was not able to categorize reliably.
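As a sketch of how sandbox output might feed this triage, the code below assumes a deliberately simplified report: a list of observed behaviors, or None if the sandbox could not run the sample at all. Real sandboxes, including the ones mentioned here, produce far richer data, so this only illustrates the decision being made.

```python
# Illustrative sketch: triage samples from a simplified sandbox report.
# The report format (a list of behavior strings, or None) is an assumption.
from enum import Enum
from typing import Optional

class Triage(Enum):
    WORKING = "working"            # keep for the test-set
    NON_WORKING = "non-working"    # exclude as garbage
    MANUAL = "manual review"       # sandbox result inconclusive

def triage_sample(behaviors: Optional[list[str]]) -> Triage:
    if behaviors is None:          # sandbox could not execute the sample
        return Triage.MANUAL
    if not behaviors:              # sample ran but did nothing observable
        return Triage.NON_WORKING
    return Triage.WORKING
```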
Viruses are verified by replication, but we do not always use the replicated samples for the tests; we use some of them to check whether vendors have added the viruses with reliable accuracy, or whether a vendor has only added some checksums in order to detect the replicated samples. We may consider the latter unacceptable, and it can lead to exclusion of the product concerned. If a file does not appear viral or malicious, we do not include it; instead, we move it to the “unwanted” database. (We do this even if, for example, all anti-virus programs report the file as infected; in other words, we do not rely on anti-virus programs to select which samples to include in the test-set, and we advise other testers not to do so either.) Our test-sets do not contain samples that do not work under Microsoft Windows NT/2000/2003/XP/Vista. Old macro samples (prior to Microsoft Office 97) are not included either. In addition, we no longer include compromised HTML files.
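A rough illustration of the checksum concern described above: if a product detects the original replicating sample but misses most freshly replicated copies of the very same virus, that pattern is consistent with checksum-only detection. The data layout and the 50% threshold in the sketch are assumptions, not AV-Comparatives' actual criterion.

```python
# Illustrative sketch: flag a product whose results on freshly replicated copies
# of a virus suggest checksum-only signatures. The threshold is an assumption.
def suspect_checksum_only(detects_original: bool,
                          replicant_results: list[bool],
                          min_ratio: float = 0.5) -> bool:
    """replicant_results: True/False verdicts on replicated copies of one virus."""
    if not detects_original or not replicant_results:
        return False
    detected = sum(replicant_results)
    return detected / len(replicant_results) < min_ratio
```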
Verified samples are sorted into the various categories we use. As this task is often tricky, we also use tools such as VGrep to see how anti-virus vendors classify a sample (e.g. as a backdoor or a worm). Sorting is based on the majority verdict: for example, if most products classify a malicious program as a backdoor and one product classifies it as a worm, we classify it as a backdoor too. There are only a few exceptional cases where we do not agree with the way the majority of products classify some malware, and in those cases our own classification is applied. In the case of replicating or polymorphic malware, we take care not to include a disproportionate number of samples of the very same variant, in order to avoid skewed results. This is also one reason why our test-sets are often “smaller” than others'.
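The majority-verdict rule maps naturally onto a simple tally. The sketch below is a generic illustration in which the OVERRIDES table stands in for the rare cases where the testers apply their own classification instead of the majority one; the identifiers and data format are assumptions.

```python
# Illustrative sketch: classify a sample by the majority of product verdicts,
# e.g. {"backdoor": 9, "worm": 1} -> "backdoor". OVERRIDES represents the rare
# cases where the testers disagree with the majority and use their own class.
from collections import Counter

OVERRIDES: dict[str, str] = {}   # sample_id -> category decided manually

def classify(sample_id: str, verdicts: list[str]) -> str:
    if sample_id in OVERRIDES:
        return OVERRIDES[sample_id]
    if not verdicts:
        return "unclassified"
    counts = Counter(v.lower() for v in verdicts)
    category, _ = counts.most_common(1)[0]   # majority verdict wins
    return category
```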
All samples are validated at some point. As automated systems (not to mention humans, especially students…) are not foolproof, it can nevertheless happen that grey-area or totally inappropriate files slip in; they do, however, get removed from the sets later.
We freeze the February and August test-sets, usually a few days before the test starts, which means that many files that have not been fully analyzed by automated tools or by humans are also included in the test-set. While the tests are already running, we continue to check the recently added samples, and we remove any bad samples from the test-set afterwards. As the vendors also receive all the samples they missed in the meantime, they may get some bad samples as well, but these are removed before the end of the test and are not counted as misses in the published report (vendors have some weeks to report faults and bad samples).
After the tests, we look again to see whether there are any samples that were not detected by any product. Usually we find two or three files that are indeed not detected by any product, and on examination those files have always turned out to be bad samples. We therefore decided that samples found to be undetected by all tested products are removed from the test-set and are not counted as misses in the test actually performed (since they are garbage).
In the testing month, we focus our analysis on the samples that were missed by the tested products. We start with the samples that were missed by most products, as they have a higher probability of being non-working.
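Both post-test steps, removing samples that no product detects and reviewing the most widely missed samples first, can be expressed as one pass over a detection matrix. The results[product][sample] layout below is an assumption made for illustration.

```python
# Illustrative sketch over an assumed detection matrix:
# results[product][sample] = True if the product detected the sample.
def postprocess(results: dict[str, dict[str, bool]]) -> list[str]:
    products = list(results)
    samples = set().union(*(results[p].keys() for p in products))
    # Samples detected by no product are treated as garbage and removed entirely.
    undetected_by_all = {s for s in samples
                         if not any(results[p].get(s, False) for p in products)}
    for p in products:
        for s in undetected_by_all:
            results[p].pop(s, None)
    # Remaining samples ordered so the most widely missed are analyzed first.
    remaining = samples - undetected_by_all
    miss_count = {s: sum(not results[p].get(s, False) for p in products)
                  for s in remaining}
    return sorted(remaining, key=lambda s: miss_count[s], reverse=True)
```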
Files reported as bad by vendors are removed, and the results are corrected before they are published on the website. Thanks to the (approximately) two-week (peer-)review procedure, we are also able to include fresh malware in our sets and to analyze the samples even after the tests have already started. This also gives vendors the opportunity to report back testing faults or inappropriate samples, though they are not obligated to do so. All of this helps to ensure that in the end we publish correct results for our readers. Since we introduced this methodology in the research published at the beginning of this year, some bad samples may still be in the test-set, but considering the size of the test-set, they should be so few that they have practically no significant effect on the results and no discernible impact on the rankings or awards given. Should we ever find in our quality assurance that the error margin was higher than anticipated, or high enough to have an impact on a ranking or an award, we will publish that information.
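That final quality check amounts to recomputing the results without the flagged samples and seeing whether the ordering of products changes; the sketch below illustrates the comparison under the same assumed results layout as above.

```python
# Illustrative sketch: recompute the product ranking with flagged bad samples
# removed and report whether the ordering (and thus any award boundary) changes.
def detection_ranking(results: dict[str, dict[str, bool]],
                      exclude: set[str]) -> list[str]:
    rates = {}
    for product, verdicts in results.items():
        counted = {s: hit for s, hit in verdicts.items() if s not in exclude}
        rates[product] = sum(counted.values()) / len(counted) if counted else 0.0
    return sorted(rates, key=rates.get, reverse=True)

def ranking_changed(results: dict[str, dict[str, bool]],
                    bad_samples: set[str]) -> bool:
    return detection_ranking(results, set()) != detection_ranking(results, bad_samples)
```

If the two orderings differ, the discrepancy is exactly the kind of error-margin finding the paragraph above commits to publishing alongside the corrected results.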