Abstract:

With the recent advent of automated malware development toolkits, it
has become relatively easy for even unskilled aggressors to create new
variants of existing malware that are capable of evading antivirus
(AV) software.  This has led to a huge surge in the number of new
malware threats in recent years and a significant growth in the size
of the virus signature database.  AV companies such as Symantec
receive thousands of suspicious program samples every day.  The first
step in processing any sample is to determine if it is indeed
malicious.  Currently, this step is largely done manually and thus is
a major bottleneck in the malware processing workflow.  In this talk, I
will first present the design, implementation and evaluation of a
malware database management system called SMIT (Symantec Malware Indexing
Tree) that attempts to speed up this process.  The system is based on
the insight that since most new malware samples are simple syntactic
variations of existing malware, one way to ascertain the maliciousness
of a sample is to check if the sample is sufficiently similar to any
known malware programs. SMIT can efficiently make such decisions based
on the malware's function-call graph -- a high-level structural
representation that is less susceptible to the low-level obfuscations
employed by malware writers to evade detection.  To address the
scalability challenges, SMIT uses an efficient graph similarity
metric that exploits structural information in the underlying malware
programs, and a multi-resolution indexing scheme that achieves a good
balance between pruning efficiency and search effectiveness.  
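
To make the similarity check concrete, here is a minimal sketch in
Python.  It is not SMIT's actual metric or index, which the abstract
does not specify: it assumes a call graph can be reduced to a set of
(caller, callee) edges and uses plain Jaccard overlap with a linear
scan as a stand-in.  The names edge_jaccard and is_likely_malicious,
the sample graphs and the 0.7 threshold are all hypothetical.

```python
def edge_jaccard(g1, g2):
    """Jaccard overlap of two call graphs given as sets of (caller, callee) edges."""
    if not g1 and not g2:
        return 1.0  # two empty graphs are treated as identical
    return len(g1 & g2) / len(g1 | g2)


def is_likely_malicious(sample, known_malware, threshold):
    """Flag a sample if it is close enough to any known malware graph.

    This is a linear scan; an index would be needed to prune
    most of these comparisons on a large database.
    """
    return any(edge_jaccard(sample, g) >= threshold for g in known_malware)


# Hypothetical call graphs, for illustration only.
known = [frozenset({("main", "decrypt"), ("decrypt", "connect"),
                    ("connect", "send")})]
variant = frozenset({("main", "decrypt"), ("decrypt", "connect"),
                     ("connect", "send"), ("main", "sleep")})
benign = frozenset({("main", "open"), ("open", "read")})

print(is_likely_malicious(variant, known, threshold=0.7))
print(is_likely_malicious(benign, known, threshold=0.7))
```

The variant adds one call edge to a known sample and still clears the
threshold, while the unrelated program shares no edges and does not.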

In the second part of the talk, I will present an automatic
malware-signature generation system called Hancock.  Hancock aims to
address the signature explosion problem by automatically creating
string signatures, each of which corresponds to a contiguous byte
sequence that is meant to match multiple variants of a malware
family.  Hancock features several novel techniques that effectively
overcome the false positive problem of machine-generated string
signatures, including a scalable model that can accurately estimate
the occurrence probability of arbitrary byte sequences in benign
programs and a content-aware signature selection algorithm that checks
if a candidate is part of a library function or any generic code
sequence.  With these techniques, the string signatures that Hancock
automatically generates are able to meet the false positive rate
requirement of 0.1%.
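
The idea of scoring candidate signatures by how likely they are to
occur in benign programs can be sketched as follows.  This is only an
illustration under stated assumptions, not Hancock's model: it uses a
first-order byte-transition model with add-one smoothing, and the
function names, the toy benign corpus and the candidate byte strings
are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(benign_blobs):
    """Count byte-to-byte transitions in a corpus of benign byte strings."""
    pair_counts = Counter()
    first_counts = Counter()
    for blob in benign_blobs:
        for a, b in zip(blob, blob[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def log_prob(seq, model):
    """Sum of smoothed log transition probabilities along seq."""
    pair_counts, first_counts = model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        # Add-one smoothing over the 256 possible next bytes.
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + 256))
    return total


def pick_signature(candidates, model):
    """Prefer the candidate that is least likely to appear in benign code."""
    return min(candidates, key=lambda c: log_prob(c, model))


# Toy corpus: a common function prologue repeated many times.
model = train_bigram_model([b"\x55\x8b\xec" * 50])
common = b"\x55\x8b\xec\x55\x8b\xec"
rare = b"\xde\xad\xbe\xef\x13\x37"
best = pick_signature([common, rare], model)
print(best == rare)
```

A candidate built from byte transitions that are frequent in the benign
corpus scores a higher probability and is passed over in favor of the
rarer sequence, which is less likely to cause a false positive.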

Short Bio:

Xin Hu is currently a Ph.D. candidate in the Department of Electrical
Engineering and Computer Science at the University of Michigan, Ann
Arbor. He received his B.S. degree from Zhejiang University, China in
2005 and M.S. degree from the University of Michigan in 2007, both in
computer science. His research interests lie primarily in the area of
computer and network security, with an emphasis on large-scale
malware analysis, automatic signature generation, network monitoring
and botnet detection. He is a recipient of the Symantec Research Labs
fellowship and the 2nd place winner of the AT&T Best Applied Security
Research Paper Award 2010.