Mining and Understanding Software Enclaves (MUSE)
Suresh Jagannathan
Information Innovation Office, DARPA
http://www.darpa.mil/Our_Work/I2O/Programs/Mining_and_Understanding_Software_Enclaves_(MUSE).aspx
Distribution Statement A - Approved for Public Release, Distribution Unlimited
What is it?
• Next for DARPA: 'Autocomplete' for programmers (Source: Phys.org)
• Do We Really Need to Learn to Code? (Source: The New Yorker)
• Computer Programming Is a Dying Art (Source: Newsweek)
• Pentagon seeks 'big code' for 'big data' (Source: USA Today)
Trends
• > 21M repositories
• 24M
• > 10M LoC
• > 4M code snippets (open source)
Why should the government care?
• Navy’s newest warship (USS Zumwalt) runs on Linux
• The US government is the largest consumer of OSS in the world
Topic Modeling Open-Source Software
• Generic Program Properties
• Specialized Domain Properties
Source: ohloh.net
System Architecture
[Figure: architecture diagram. Labels: Source OR Binary; Artifact Generation; Graph Database and Analytics; Mining Engine; Inspection and Discovery; Property Checking and Repair; Learning and Synthesis; Program Analysis, Theorem Proving, Testing; mined components α1 to α3, β1 to β3, λ1 to λ3]
Query: “Synthesize a program that does X”
Program that satisfies X: f(α1) ∘ g(β2) ∘ h(λ3)
Enclaves
Redundancies in the corpus are exposed as dense components (enclaves) in the mined network
• Nodes represent properties: facts, claims, and evidence
• Edges connect related properties
Anomalous properties have a small number of connections
Likely invariants have a large number of connections
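The connection-count idea on this slide can be illustrated with a small sketch. It is only an illustration under stated assumptions: the property names, the edge list, and the `low` and `high` thresholds are all hypothetical, and plain degree counting stands in for whatever density measure the MUSE mining engine actually uses.

```python
from collections import defaultdict

def degrees(edges):
    """Count connections per property node in an undirected graph."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)

def classify(edges, low=1, high=3):
    """Split nodes into sparsely and densely connected properties.

    low and high are hypothetical thresholds chosen for this toy graph.
    """
    deg = degrees(edges)
    anomalous = sorted(n for n, d in deg.items() if d <= low)
    likely_invariants = sorted(n for n, d in deg.items() if d >= high)
    return anomalous, likely_invariants

# Hypothetical mined network: four mutually related properties form a
# dense component, and one property hangs off it by a single edge.
EDGES = [
    ("checks_length", "bounds_before_copy"),
    ("checks_length", "returns_on_error"),
    ("checks_length", "validates_input"),
    ("bounds_before_copy", "returns_on_error"),
    ("bounds_before_copy", "validates_input"),
    ("returns_on_error", "validates_input"),
    ("copies_without_check", "validates_input"),
]

if __name__ == "__main__":
    anomalous, likely_invariants = classify(EDGES)
    print("anomalous:", anomalous)
    print("likely invariants:", likely_invariants)
```

In this toy graph the singly connected property is flagged as anomalous, while the four properties in the dense component are reported as likely invariants.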
Big Code Front-End
Components: Collector; Classifier; Program Analyses; Database construction
• Diverse, representative, high-fidelity corpus
• Ontological structure
• Types and proofs
• Binary decompilation
• Static and dynamic analyses
• Theorem proving
• Environment and platform dependencies
• Tests and runtime verification
• Models (memory, execution, …)
• Executable specifications
• Model checking
• Abstract interpretation
• Contracts and assertions
• Documentation extraction
• Canonical and persistent representation of analysis outputs
Big Code Back-End
[Figure: back-end diagram. Labels as extracted: Mining; Inference Engine; Distributed Graph Database; Property Checking; Navigation and Search; Queries; Query Language; Specification Language; DSLs; Learning and Synthesis Framework; Protocol; Model Generation; Discovery]
Dependencies
[Figure: dependency diagram of program components. Labels as extracted: Infrastructure Widget Synthesis & Repair Analytics Sketch Mining Engine Based Synthesis Artifact Store Trace Analysis Graph Artifact Ontic Types Visualization Generators & Clichés Datalog Invariant Detection Evaluator Probabilistic Type Inference Systems Ontology / Datalog Analyses Protocol Repair Collection Classification (static, dynamic, concolic) & Patch Synthesis Repair Bayesian Demo Specification Queries Workshops Abstract Draft-based Interpretation Cloud Synthesis Infrastructure Static Challenge Abductive Dependently Analysis Deep Learning Problems Inference Typed IR & Hypothesis Specification Binary LLVM Generation Extraction Convex Fault Localization Optimization & Repair Multi-Layered Database Design Pattern Flaw Detection Synthesis from & Repair Specifications]
Corpus
Currently ~6TB of Java, C, and C++
Draper Labs: The DeepCode Architecture
Source: Draper Labs
Artifact Generation
clang and Draper’s open-source Fracture decompiler together support both compiling source down and lifting binaries to LLVM Intermediate Representation (IR)
Source: Draper Labs
Deep Learning Analytics
Source: Draper Labs
Finding Heartbleed using Big Code (Draper)
Corpus: 170K C/C++ projects, ~400GB, ~20M artifacts (call graphs, CFGs, etc.)
Pipeline: Artifact Generator (LLVM, ANTLR4, Fracture); Metadata Extractor; Graph Layer; Math Layer; Deep Learning
Identify and classify design patterns (flaws and repairs): Blue-Good, Red-Bad
Buggy Program: Heartbleed bug
Repaired Program: added bounds checks
    if (1+2+16 > s->s3->rrec.length)
        return 0;
    if (1+2+payload+16 > s->s3->rrec.length)
        return 0;
    if (write_length > SSL3_RT_MAX_PLAIN_LENGTH)
        return 0;
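To make the first two bounds checks on this slide concrete, here is a minimal model of them. It is a sketch under assumptions, not OpenSSL code: `len(record)` stands in for `s->s3->rrec.length`, the message layout (1 type byte, 2-byte big-endian payload length, payload, at least 16 bytes of padding) follows the TLS heartbeat extension, and the function and constant names are hypothetical.

```python
HEADER_LEN = 1 + 2   # message type byte + 2-byte payload length field
MIN_PADDING = 16     # minimum padding required after the payload

def claimed_payload_length(record: bytes) -> int:
    """Read the payload length the sender claims (bytes 1..2, big-endian)."""
    return int.from_bytes(record[1:3], "big")

def passes_bounds_checks(record: bytes) -> bool:
    """Model of the two checks against the received record length."""
    # Mirrors: if (1+2+16 > s->s3->rrec.length) return 0;
    if HEADER_LEN + MIN_PADDING > len(record):
        return False
    payload = claimed_payload_length(record)
    # Mirrors: if (1+2+payload+16 > s->s3->rrec.length) return 0;
    if HEADER_LEN + payload + MIN_PADDING > len(record):
        return False
    return True

if __name__ == "__main__":
    honest = b"\x01" + (4).to_bytes(2, "big") + b"ping" + bytes(16)
    # Claims a 16384-byte payload but carries none.
    overclaim = b"\x01" + (0x4000).to_bytes(2, "big") + bytes(16)
    print(passes_bounds_checks(honest))
    print(passes_bounds_checks(overclaim))
```

The second check is the one that rejects a record whose claimed payload length exceeds what was actually received, which is the condition the Heartbleed bug left unchecked.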
Kestrel Institute: Synthesis using Big Code
Source: Kestrel Institute
Kestrel Institute: Proof-Directed Synthesis Using Big-Code
Source: Kestrel Institute
Artifact Generation Process
Source: Kestrel Institute
Features
Source: Kestrel Institute
AES Synthesis using Big Code (Kestrel)
Specification:
    (defthm bytep-of-xtime
      (implies (bytep b)
               (bytep (xtime b)))
      :hints (("Goal" :in-theory (enable acl2::shl))))
Corpus: 130K Java projects, ~2.3B methods, ~200B facts
Machine Learning: 180 out of 130K projects relevant to AES
Analysis & Specification Extraction: Control Flow Graphs, Types, API sequences, Proofs; 422 Features
Synthesis + Proof Refinement
Program: Implementation + Proof of Correctness
    public static int lookup (int[][] arr, int hex) {
        int row = hex >> 4;
        int column = hex & 0xF;
        return arr[row][column];
    }
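The relationship between the specification and the synthesized lookup can be sketched in a few lines. This is an illustration only: the slide does not define `xtime`, so the version here assumes the standard AES definition (multiplication by x in GF(2^8), reducing with 0x1B), and an exhaustive test over all 256 bytes stands in for the ACL2 proof; it is a check, not a proof. `lookup` is a direct transliteration of the Java method on the slide.

```python
def bytep(b) -> bool:
    """True when b is an integer in the byte range 0..255."""
    return isinstance(b, int) and 0 <= b <= 0xFF

def xtime(b: int) -> int:
    """Assumed standard AES xtime: multiply by x in GF(2^8)."""
    shifted = (b << 1) & 0xFF
    # Reduce by the AES polynomial when the high bit was set.
    return shifted ^ 0x1B if b & 0x80 else shifted

def check_bytep_of_xtime() -> bool:
    """Exhaustively test the property stated by bytep-of-xtime."""
    return all(bytep(xtime(b)) for b in range(256))

def lookup(arr, hex_value: int) -> int:
    """High nibble selects the row, low nibble selects the column."""
    row = hex_value >> 4
    column = hex_value & 0xF
    return arr[row][column]

if __name__ == "__main__":
    # Hypothetical 16x16 identity table, used only to exercise lookup.
    table = [[row * 16 + col for col in range(16)] for row in range(16)]
    print(check_bytep_of_xtime())
    print(hex(xtime(0x57)), hex(lookup(table, 0x3C)))
```

The exhaustive check is feasible here only because the input domain has 256 values; the slide's approach pairs the implementation with a machine-checked proof instead.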
Challenge Problems – Phase 1
Problem: Approach
• Synthesis from demonstrations in Swing/Eclipse: Dynamic tracing analysis
• Synthesis of AES: Specification-driven (synthesis-by-construction)
• Automated repair of incorrect API usage in Android: Code transfer
• Repair of incorrect invariants (off-by-one errors) in C/C++ code: Deep learning
• Synthesize a communication module for a drone: User-directed cliché discovery
• Complete a partial implementation of binary search tree: Sketch-based synthesis
• Repair incorrect graph implementations from specifications: Graph classification and repair
www.darpa.mil