Where EB2 3211
When Fridays 1-2
Moderator Freeh (vwfreeh at ncsu dot edu)
|21 Jan||Troubleshooting Production Cloud Systems using Management Console Logs||Kamal|
|28 Jan||The Visual Development of GCC Plug-ins with GDE||Dean|
|4 Feb||ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs||Xing Wu|
|11 Feb||Jump-Oriented Programming: A New Class of Code-Reuse Attack||Bletsch|
|18 Feb||ARC discussion||David Fiala|
|22 Feb||Coordinated Power, Thermal, and Performance Management in Virtualized Data Centers||Xiaorui (Ray) Wang|
|24 Feb||Composable Abstractions for Synchronization in Dynamic Threading Platforms||Jim Sukha|
|4 Mar||Time-traveling Forensic Analysis of VM-based High-interaction Honeypots||Deepa Srinivasan|
|11 Mar||Spring break|
|21 Mar||Nanonetworks: A New Communication Paradigm||Ian Akyildiz, Triangle Distinguished Lecture|
|28 Mar||TDLS Talk: Automatic Programming Revisited||Rastislav Bodik|
|4 Apr||Rethinking Parallel Languages and Hardware||Sarita Adve|
|8 Apr||Predictable Task Migration for Locked Caches in Multi-Core Systems||Abhik|
|15 Apr||Detecting Capability Leaks in Android-based Smartphones||Mike Grace|
|22 Apr||System Software for Cloud Computing||Dilma Da Silva, IBM Research|
|29 Apr||Automatic Generation of Communication Specifications from Parallel Applications||Xing Wu|
Title: Automatic Generation of Executable Communication Specifications from Parallel Applications
Abstract: Portable parallel benchmarks are widely used and highly effective for (a) the evaluation, analysis and procurement of high-performance computing (HPC) systems and (b) quantifying the potential benefits of porting applications for new hardware platforms. Yet past techniques for synthetically parameterizing hand-coded HPC benchmarks prove insufficient for today’s rapidly evolving scientific codes, particularly when subject to multi-scale science modeling or when utilizing domain-specific libraries.
To address these problems, this work contributes novel methods to automatically generate highly portable and customizable communication benchmarks from HPC applications. We utilize ScalaTrace, a lossless, yet scalable, parallel application tracing framework to collect selected aspects of the run-time behavior of HPC applications, including communication operations and execution time, while abstracting away the details of the computation proper. We subsequently generate benchmarks with identical run-time behavior from the collected traces. A unique feature of our approach is that we generate benchmarks in CONCEPTUAL, a domain-specific language that enables the expression of sophisticated communication patterns using a rich and easily understandable grammar yet compiles to ordinary C+MPI. Experimental results demonstrate that the generated benchmarks are able to preserve the run-time behavior (including both the communication pattern and the execution time) of the original applications. Such automated benchmark generation is particularly valuable for proprietary, export-controlled, or classified application codes: when supplied to a third party, our auto-generated benchmarks ensure performance fidelity without the risks associated with releasing the original code. This ability to automatically generate performance-accurate benchmarks from parallel applications is novel and without precedent, to our knowledge.
System Software for Cloud Computing
Cloud computing has been receiving a lot of attention from the computing community. It is perceived by some as the “IT fad of the moment” and by others as a revolutionary approach to delivering computing services. In this talk we analyze cloud computing from the perspective of system software, exploring how this new model impacts current practices in operating systems and distributed computing. We identify a set of exciting research opportunities in resource management for cloud computing and discuss how cloud computing itself affects the way we carry out research projects.
Detecting Capability Leaks in Android-based Smartphones
Smartphones offer a multitude of features to their users, including the ability to run third-party applications. To manage the amount of access given to these applications, Android provides a fine-grained capability model, which allows programs to define, request, and check for arbitrary permissions. This capability model assumes that all exported API calls check the caller's permissions before performing any action that requires permission. Failing to check the permissions properly constitutes a capability leak, where an unprivileged caller can gain access to a capability through an intermediary. We provide a system, Woodpecker, which aims to locate such leaks through static analysis. Using Woodpecker, we survey eight real-world phone images, attempting to gain access to thirteen sensitive capabilities. Our results indicate that out of those thirteen capabilities, eleven were leaked, with individual phones leaking up to eight capabilities.
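To make the notion of a capability leak concrete, here is a minimal Python sketch, assuming a hand-written call graph. It reports exported entry points that can reach a sensitive operation without passing through a function that checks the caller's permission. The function `find_leaks`, the graph, and all names in it are hypothetical illustrations; this is not Woodpecker's actual analysis, which the abstract does not describe in detail.

```python
from collections import deque

def find_leaks(call_graph, exported, sensitive, guarded):
    """Map each exported entry point to the sensitive operations it can
    reach without passing through a permission-checking function.

    call_graph: function name -> list of callees (hypothetical input)
    exported:   entry points callable by unprivileged apps
    sensitive:  names standing in for capability-protected operations
    guarded:    functions assumed to check the caller's permission first
    """
    leaks = {}
    for entry in exported:
        if entry in guarded:
            continue  # the entry point itself checks permissions
        seen = {entry}
        queue = deque([entry])
        reached = set()
        while queue:
            fn = queue.popleft()
            for callee in call_graph.get(fn, ()):
                if callee in guarded or callee in seen:
                    continue  # a guarded path is not a leak
                seen.add(callee)
                if callee in sensitive:
                    reached.add(callee)
                else:
                    queue.append(callee)
        if reached:
            leaks[entry] = reached
    return leaks

# Hypothetical example: one unchecked path and one checked path.
call_graph = {
    "svc_send": ["helper_send"],
    "helper_send": ["SEND_SMS"],
    "svc_locate": ["locate_checked"],
    "locate_checked": ["READ_LOCATION"],
}
leaks = find_leaks(
    call_graph,
    exported={"svc_send", "svc_locate"},
    sensitive={"SEND_SMS", "READ_LOCATION"},
    guarded={"locate_checked"},
)
print(leaks)
```

In this toy graph only `svc_send` is reported, because `svc_locate` reaches its sensitive operation solely through a guarded function.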
Predictable Task Migration for Locked Caches in Multi-Core Systems
Locking cache lines in hard real-time systems is a common means of achieving predictability of cache access behavior and tightening as well as reducing worst case execution time, especially in a multitasking environment. However, cache locking poses a challenge for multi-core hard real-time systems since theoretically optimal scheduling techniques on multi-core architectures assume zero cost for task migration. Tasks with locked cache lines need to proactively migrate these lines before the next invocation of the task. Otherwise, cache locking on multi-core architectures becomes useless as predictability is compromised. This paper proposes hardware-based push-assisted cache migration as a means to retain locks on cache lines across migrations. We extend the push-assisted migration model with several cache migration techniques to efficiently retain locked cache lines on a bus-based chip multi-processor architecture. We also provide deterministic migration delay bounds that help the scheduler decide which migration technique(s) to utilize to relocate a single or multiple tasks. This information also allows the scheduler to determine feasibility of task migrations, which is critical for the safety of any hard real-time system. Such proactive migration of locked cache lines in multi-cores is unprecedented to our knowledge.
TDLS talk: Rethinking Parallel Languages and Hardware
Sarita Adve, Computer Science, U. Illinois at Urbana-Champaign
TDLS Talk: Automatic Programming Revisited
Speaker: Rastislav Bodik, Computer Science, UC Berkeley
Time-traveling Forensic Analysis of VM-based High-interaction Honeypots
Honeypots have proven to be an effective tool to capture computer intrusions (or malware infections) and analyze their exploitation techniques. However, forensic analysis of compromised honeypots is largely an ad-hoc and manual process. In this paper, we propose Timescope, a system that applies and extends recent advances in deterministic record and replay to high-interaction honeypots for extensible, fine-grained forensic analysis. In particular, we propose and implement a number of systematic analysis modules in Timescope, including contamination graph generator, transient evidence recoverer, shellcode extractor and break-in reconstructor, to facilitate honeypot forensics. These analysis modules can “travel back in time” to investigate various aspects of computer intrusions or malware infections during different execution time windows. We have developed Timescope based on the open-source QEMU virtual machine monitor and the evaluation with a number of real malware infections shows the practicality and effectiveness of Timescope.
Composable Abstractions for Synchronization in Dynamic Threading Platforms
Abstract: Modern dynamic threading platforms such as MIT Cilk and Intel Cilk Plus utilize work-stealing schedulers to execute fork-join computations efficiently. These schedulers are not designed, however, to handle additional dependencies that may be introduced by synchronization. In particular, these platforms do not effectively support nested parallelism inside critical regions of code, making it difficult for programmers to compose parallel functions that use synchronization.
This talk discusses two abstractions for composable synchronization in dynamic threading platforms, based on the ideas of helper locks and task graph synchronization. First, we describe HELPER, a prototype runtime for supporting helper locks in MIT Cilk. Helper locks are locks that protect critical sections which are parallel regions; when a worker fails to acquire a helper lock, it can help to complete the parallel region holding the lock. Second, we present Nabbit, a Cilk++ library for parallel execution of task graphs with arbitrary dependency edges. Nabbit also allows users to exploit parallelism within nodes of a task graph. HELPER and Nabbit, which provide complementary approaches for composable synchronization, both provide provable theoretical bounds on their performance.
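As a rough illustration of executing a task graph with arbitrary dependency edges, the Python sketch below runs each task as soon as all of its prerequisites have finished, using a count of unfinished prerequisites per task. It uses a central coordinator and a standard thread pool rather than a work-stealing scheduler, and `run_task_graph` is a hypothetical helper. It is not Nabbit's API and carries none of its performance bounds.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task_graph(tasks, deps, workers=4):
    """Run every task once all of its prerequisites have finished.

    tasks: name -> zero-argument callable
    deps:  name -> list of prerequisite task names
    Returns the task names in the order their completion was observed.
    """
    # Number of unfinished prerequisites for each task.
    remaining = {name: len(deps.get(name, ())) for name in tasks}
    # Reverse edges: which tasks become closer to ready when one finishes.
    successors = {name: [] for name in tasks}
    for name, prereqs in deps.items():
        for p in prereqs:
            successors[p].append(name)

    order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start with the tasks that have no prerequisites.
        pending = {pool.submit(tasks[n]): n for n in tasks if remaining[n] == 0}
        while pending:
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in finished:
                name = pending.pop(fut)
                fut.result()  # re-raise any exception from the task
                order.append(name)
                for s in successors[name]:
                    remaining[s] -= 1
                    if remaining[s] == 0:
                        pending[pool.submit(tasks[s])] = s
    return order

# Diamond-shaped graph: a -> (b, c) -> d
log = []
tasks = {n: (lambda n=n: log.append(n)) for n in "abcd"}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
order = run_task_graph(tasks, deps)
print(order)
```

Here `b` and `c` may run in either order or concurrently, but `a` always runs first and `d` always runs last.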
Coordinated Power, Thermal, and Performance Management in Virtualized Data Centers
Abstract: In recent years, power and thermal management has become one of the most important issues for cloud-scale data centers that are rapidly increasing the number of hosted servers. In addition to reducing operating costs, precisely controlling power consumption and heat dissipation is an essential way to avoid system failures caused by power capacity overload or overheating due to increasingly high server density (e.g., blade servers). Power and thermal control becomes even more challenging as many data centers start to adopt virtualization technology for resource sharing, leading to increased utilization and power consumption for each server.
In this talk, we will present a coordinated power, thermal, and performance management framework designed for today's virtualized data centers. Our framework first provides highly scalable power control solutions in a hierarchical way at three levels: single server, server rack, and entire data center, because there are physical and contractual power limits at each level. Our framework also includes novel performance control algorithms, which provide power-efficient application-level performance guarantees for multiple virtual machines running on the same physical servers. Furthermore, our framework coordinates power and performance control schemes at different system layers to achieve simultaneous guarantees on both power and performance in virtualized data centers. We will also introduce our work on power management for chip multiprocessors, optimal sensor placement for thermal monitoring, and electricity cost control for distributed data centers.
Introduction to the ARC Cluster
Abstract: David Fiala will be presenting a brief introduction and overview for users of the new ARC cluster that was recently installed in the CSC department. ARC is a versatile research cluster designed to house projects involving high-performance software (think OpenMP or MPI), GPUs, SSDs, power monitoring, and virtualization. In its current state ARC contains 1727 compute cores across 108 nodes with a 40Gbit/s InfiniBand interconnect. Of those nodes, presently 36 are outfitted with an NVIDIA Tesla C2050, and 16 nodes are equipped with OCZ RevoDrive 120GB SSDs. Anyone interested in learning about this new cluster is welcome to attend. Following the introduction, a quick getting-started tutorial will be presented with time for questions. Further information may be obtained at: http://moss.csc.ncsu.edu/~mueller/cluster/arc/
Jump-Oriented Programming: A New Class of Code-Reuse Attack
Abstract: Return-oriented programming is an effective code-reuse attack in which short code sequences ending in a ret instruction are found within existing binaries and executed in arbitrary order by taking control of the stack. This allows for Turing-complete behavior in the target program without the need for injecting attack code, thus significantly negating current code injection defense efforts (e.g., W^X). On the other hand, its inherent characteristics, such as the reliance on the stack and the consecutive execution of return-oriented gadgets, have prompted a variety of defenses to detect or prevent it from happening.
In this paper, we introduce a new class of code-reuse attack, called jump-oriented programming. This new attack eliminates the reliance on the stack and ret instructions (including ret-like instructions such as pop+jmp) seen in return-oriented programming without sacrificing expressive power. This attack still builds and chains functional gadgets, each performing certain primitive operations, except these gadgets end in an indirect branch rather than ret. Without the convenience of using ret to unify them, the attack relies on a dispatcher gadget to dispatch and execute the functional gadgets. We have successfully identified the availability of these jump-oriented gadgets in the GNU libc library. Our experience with an example shellcode attack demonstrates the practicality and effectiveness of this technique.
ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs
Abstract: Performance modeling for scientific applications is important for assessing potential application performance and systems procurement in high-performance computing (HPC). Recent progress on communication tracing opens up novel opportunities for communication modeling due to its lossless yet scalable trace collection. Estimating the impact of scaling on communication efficiency still remains non-trivial due to execution-time variations and exposure to hardware and software artifacts.
This work contributes a fundamentally novel modeling scheme. We synthetically generate the application trace for large numbers of nodes by extrapolation from a set of smaller traces. We devise an innovative approach for topology extrapolation of single program, multiple data (SPMD) codes with stencil or mesh communication. The extrapolated trace can subsequently be (a) replayed to assess communication requirements before porting an application, (b) transformed to auto-generate communication benchmarks for various target platforms, and (c) analyzed to detect communication inefficiencies and scalability limitations.
To the best of our knowledge, rapidly obtaining the communication behavior of parallel applications at arbitrary scale with the availability of timed replay, yet without actual execution of the application at this scale, is without precedent and has the potential to enable otherwise infeasible system simulation at the exascale level.
The Visual Development of GCC Plug-ins with GDE
Code transformations allow the seamless addition of custom optimizations or specialized functionality to code at compile time. GCC plug-ins give developers this ability while allowing developers to leave the source code of GCC unmodified. Although this makes applying completed plug-ins easy, a thorough understanding of GCC internals, internal representations, and non-trivial source-to-internal mappings is still required.
We present a visual approach to plug-in development consisting of two components: a GCC plug-in and visualizer. The GCC plug-in extracts intermediate representation data to a database used by the visualizer. Specifically, we are able to visualize GIMPLE trees, control flow graphs, call graphs and the mapping from original source code to these internal representations. We also provide an interface to GDB for run-time plug-in debugging with visualization capabilities. This paper will demonstrate how these visualizations significantly ameliorate several problems facing transformation developers without relying on a simpler intermediate representation.
Troubleshooting Production Cloud Systems using Management Console Logs
Management console logs are often the only information source for troubleshooting a production cloud system. However, it is a daunting task for system administrators to manually examine thousands of lines of console log messages to detect and diagnose various runtime system problems. In this paper, we present a novel hybrid console log analysis (HCLA) system that can automatically detect anomalies and extract diagnostic messages that are most relevant to the anomaly cause. HCLA can achieve both high analysis accuracy and low overhead by first performing fast clustering using a coarse-grained log feature called message appearance vector and then applying outlier detection within each cluster using a fine-grained log feature called message flow graph. We have implemented a prototype of the HCLA system and conducted an extensive experimental study using real management console logs of a production cloud system called Virtual Computing Lab. Our experimental results show that our approach can achieve a 60% higher detection rate and 75% fewer false alarms than previous schemes. HCLA is lightweight, taking only several seconds to analyze a reservation log with thousands of lines of messages.
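The two-stage idea (coarse clustering, then fine-grained outlier detection within each cluster) can be sketched in Python as follows. The abstract does not define the two features, so this sketch assumes that a message appearance vector records which message types occur in a log, and that a message flow graph is the set of transitions between consecutive message types. The clustering rule (identical vectors), the `min_support` threshold, and all function names are likewise assumptions and not the HCLA implementation.

```python
from collections import Counter, defaultdict

def appearance_vector(log, vocabulary):
    """Coarse feature (assumed): 1 if a message type occurs in the log, else 0."""
    present = set(log)
    return tuple(int(m in present) for m in vocabulary)

def flow_edges(log):
    """Fine feature (assumed): transitions between consecutive message types."""
    return set(zip(log, log[1:]))

def find_anomalies(logs, min_support=0.5):
    """logs: name -> list of message types, in order of appearance.

    Stage 1 groups logs with identical appearance vectors.
    Stage 2 flags a log if it contains a transition seen in fewer than
    min_support of the logs in its own cluster.
    """
    vocabulary = sorted({m for log in logs.values() for m in log})
    clusters = defaultdict(list)
    for name, log in logs.items():
        clusters[appearance_vector(log, vocabulary)].append(name)

    anomalies = []
    for members in clusters.values():
        counts = Counter(e for n in members for e in flow_edges(logs[n]))
        for n in members:
            if any(counts[e] / len(members) < min_support
                   for e in flow_edges(logs[n])):
                anomalies.append(n)
    return sorted(anomalies)

# Hypothetical reservation logs, already reduced to message types.
logs = {
    "r1": ["request", "boot", "ready", "release"],
    "r2": ["request", "boot", "ready", "release"],
    "r3": ["request", "boot", "ready", "release"],
    "r4": ["request", "boot", "boot", "ready", "release"],  # repeated boot
}
anomalies = find_anomalies(logs)
print(anomalies)
```

All four logs share one appearance vector, so only the fine-grained stage separates `r4`, whose repeated `boot` transition appears in just one of the four logs. A log with a unique appearance vector would form its own cluster and go unflagged, which is one limitation of exact-match clustering in this sketch.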