Abstract: With the recent advent of automated malware development toolkits, it has become relatively easy for even unskilled attackers to create new variants of existing malware that are capable of evading antivirus (AV) software. This has led to a huge surge in the number of new malware threats in recent years and a significant growth in the size of the virus signature database. AV companies such as Symantec receive thousands of suspicious program samples every day. The first step in processing any sample is to determine whether it is indeed malicious. Currently, this step is largely done manually and is thus a major bottleneck in the malware processing workflow. In this talk, I will first present the design, implementation and evaluation of a malware database management system called SMIT (Symantec Malware Indexing Tree) that attempts to speed up this process. The system is based on the insight that, since most new malware samples are simple syntactic variations of existing malware, one way to ascertain the maliciousness of a sample is to check whether it is sufficiently similar to any known malware program. SMIT can efficiently make such decisions based on the malware's function-call graph -- a high-level structural representation that is less susceptible to the low-level obfuscations employed by malware writers to evade detection. To address the scalability challenges, SMIT uses an efficient graph-similarity metric that exploits structural information in the underlying malware programs, and a multi-resolution indexing scheme that achieves a good balance between pruning efficiency and search effectiveness. In the second part of the talk, I will present an automatic malware-signature generation system called Hancock. Hancock aims to address the signature explosion problem by automatically creating string signatures, each of which corresponds to a contiguous byte sequence that is meant to match multiple variants of a malware family.
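
As a rough illustration of what a string signature is, the Python sketch below collects fixed-length byte sequences that occur in every variant of a family and drops any that also appear in a small set of benign programs. It is not Hancock's algorithm: the function names, the fixed window length `n`, and the exact-match benign filter are all assumptions made for this example. In particular, exact lookup against a benign set is only a stand-in for the model-based probability estimate that the abstract describes.

```python
from typing import Iterable


def ngrams(data: bytes, n: int) -> set[bytes]:
    """Return every contiguous n-byte window of `data` (empty if data is shorter than n)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def candidate_signatures(
    variants: Iterable[bytes],
    benign: Iterable[bytes],
    n: int = 8,
) -> set[bytes]:
    """Hypothetical helper: n-byte sequences present in all variants and in no benign program."""
    variant_list = list(variants)
    if not variant_list:
        return set()

    # Keep only the byte sequences shared by every variant of the family.
    shared = ngrams(variant_list[0], n)
    for sample in variant_list[1:]:
        shared &= ngrams(sample, n)

    # Assumption: exact exclusion against a benign corpus stands in for
    # a probabilistic estimate of how likely a sequence is in benign code.
    for program in benign:
        shared -= ngrams(program, n)
    return shared


if __name__ == "__main__":
    family = [b"AAAA_payload_BBBB", b"CC_payload_DD"]
    goodware = [b"xx_payloZZ"]
    print(sorted(candidate_signatures(family, goodware, n=4)))
```

In this toy setting, a candidate survives only if it is common to the whole family and never seen in the benign set. A real system would also need to handle a benign corpus far too large for exact lookup, which is the scalability concern that motivates a statistical model.
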
Hancock features several novel techniques that effectively overcome the false positive problem of machine-generated string signatures, including a scalable model that can accurately estimate the occurrence probability of arbitrary byte sequences in benign programs and a content-aware signature selection algorithm that checks whether a candidate is part of a library function or a generic code sequence. With these techniques, the string signatures that Hancock automatically generates are able to meet the false positive rate requirement of 0.1%.

Short Bio: Xin Hu is currently a Ph.D. candidate in the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor. He received his B.S. degree from Zhejiang University, China, in 2005 and his M.S. degree from the University of Michigan in 2007, both in computer science. His research interests lie primarily in the area of computer and network security, with an emphasis on large-scale malware analysis, automatic signature generation, network monitoring and botnet detection. He is a recipient of a Symantec Research Labs fellowship and the 2nd place winner of the AT&T Best Applied Security Research Paper Award 2010.