Learning a Static Analyzer from Data
Pavol Bielik, Veselin Raychev, Martin Vechev
Department of Computer Science, ETH Zurich
CAV 2017, July 22-28, Heidelberg
Writing a Static Analyzer
Examples: a static analysis framework for Java, a static type checker for JavaScript, a pointer analysis for JavaScript (between 17 and ~400 contributors each).
Writing a static analyzer is hard, frustrating, time consuming, and brittle.
Example of Unsound Analysis
[figure: two analysis runs compared; one misses an error, the other reports it correctly]
This Work: Learn a Static Analyzer
Can we learn a static analyzer (i.e., its abstract transformers)?
This Work: Learn a Static Analyzer from Data

Input:
  a dataset D = { ⟨x_j, y_j⟩ }_{j=1}^{N}
  a language L for abstract transformers

Synthesis + over-approximation produce the best analysis pa_best ∈ L.

How to obtain a suitable dataset?
What is the language over which to learn? How to allow generating new, interesting transformers?
How to design scalable learning over large search spaces? How to prevent overfitting?
This Work: Learn a Static Analyzer
Can we learn a static analyzer that is interpretable and sound?

Problem Formulation:
  pa_best = argmin_{pa ∈ L} cost(D, pa)      (analysis precision)
  s.t. ∀⟨x, y⟩ ∈ D. y ⊑ pa(x)                (analysis soundness)
An Example Transformer Learned

  Array.prototype.filter ::=
    if caller has one argument then
      points-to global object
    else if 2nd argument is Identifier then
      if 2nd argument is undefined then
        points-to global object
      else
        points-to 2nd argument
    else if 2nd argument is this then
      points-to 2nd argument
    else if 2nd argument is null then
      points-to global object
    else  // 2nd argument is a primitive value
      points-to new allocation site
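Read as code, the learned transformer is just a nested conditional over syntactic features of the call site. Below is a minimal TypeScript sketch of that tree; the CallSite shape, its field names, and the PointsTo encoding are illustrative assumptions, not the paper's representation.

```ts
// Illustrative rendering of the learned transformer as code.
// CallSite is a hypothetical, simplified view of a `filter` call.
type PointsTo =
  | { kind: "global" }             // points-to global object
  | { kind: "arg"; index: number } // points-to the given argument
  | { kind: "newAllocSite" };      // points-to a fresh allocation site

interface CallSite {
  argCount: number;
  secondArg?: {
    isIdentifier: boolean;
    isUndefined: boolean;
    isThis: boolean;
    isNull: boolean;
  };
}

// What does `this` point to inside the callback of Array.prototype.filter?
function filterThisPointsTo(call: CallSite): PointsTo {
  if (call.argCount === 1) return { kind: "global" };
  const a = call.secondArg!;
  if (a.isIdentifier) {
    return a.isUndefined ? { kind: "global" } : { kind: "arg", index: 2 };
  }
  if (a.isThis) return { kind: "arg", index: 2 };
  if (a.isNull) return { kind: "global" };
  // 2nd argument is a primitive value
  return { kind: "newAllocSite" };
}
```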
Let us illustrate the learning on an example analysis: points-to analysis.
Dataset: Points-to Analysis

Program:
  function collect(value, idx, obj) {
    if (value >= this.threshold) {
      ...
    }
    ...
  }

Abstract Syntax Tree (AST):
  n₁ FunctionDeclaration
  n₂ IfStatement
  n₃ BinaryExpression
  n₄ Identifier:value
  n₅ MemberExpression
  n₆ ThisExpression
     Property:threshold

Execution reads/writes: objects o₁, o₂, o₃ observed at run time.
Each observed read/write yields a training example, e.g. ⟨(ast, n₅), o₂⟩: the query (ast, n₅) is labeled with the object o₂ it was observed to point to during execution. Collecting these gives the dataset D = { ⟨x_j, y_j⟩ }_{j=1}^{N}.
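In code, a training example pairs such a query with its run-time label. A minimal sketch under assumed, simplified types (Example, NodeId, ObjectId are illustrative names, not the paper's):

```ts
// Hypothetical, simplified encoding of one points-to training example.
type NodeId = number;   // index of an AST node, e.g. n5
type ObjectId = number; // index of a concrete object, e.g. o2

interface Example {
  ast: object;     // the program's AST (shape elided here)
  query: NodeId;   // node whose points-to target we ask about
  label: ObjectId; // object actually read/written at run time
}

// The dataset D = { <x_j, y_j> } is then just a list of such examples.
type Dataset = Example[];
```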
Language Describing Abstract Transformers

  p ∈ L ::= a | if g then p₁ else p₂
  a ∈ Actions
  g ∈ Guards

Example with a points-to query:
  function collect(val, idx, obj) {
    if (val >= this.threshold) { ... }
  }
  var dat = [5, 3, 9];
  dat.filter(collect, ctx);   // query: what does `this` point to inside collect?

Example guards: g₁ = "method name is filter", g₂ = "has 2nd argument".
A program p ∈ L can be represented as a decision tree: guards (g₁, g₂, …) at internal nodes with true/false branches, and actions (a₁, a₂, a₃, …) at the leaves. Each root-to-leaf path is interpreted as one abstract transformer.
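As a data structure, the language is a tiny algebraic type, and evaluating a program walks one path of the tree. A TypeScript sketch (the Guard and Action representations are assumptions):

```ts
// The transformer language: p ::= a | if g then p1 else p2.
type Action = string;                     // e.g. "points-to global object"
type Guard = (query: unknown) => boolean; // e.g. "has 2nd argument"

type Prog =
  | { kind: "action"; action: Action }
  | { kind: "branch"; guard: Guard; then: Prog; else: Prog };

// Running a program on a query follows one root-to-leaf path,
// so each path corresponds to one abstract transformer case.
function run(p: Prog, query: unknown): Action {
  while (p.kind === "branch") {
    p = p.guard(query) ? p.then : p.else;
  }
  return p.action;
}
```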
Learning: Decision Trees + CEGIS

Synthesis (+ over-approximation) proposes a candidate analysis pa ∈ L from the current dataset D = { ⟨x_j, y_j⟩ }_{j=1}^{N}. An oracle then tests/verifies the analyzer:
  counter-example ⟨x, y⟩ ∉ D found → D ← D ∪ { ⟨x, y⟩ }, synthesize again
  no counter-example → return analysis pa
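The loop itself is standard counter-example guided inductive synthesis. A minimal TypeScript sketch, where synth stands for the decision-tree learner described below and Oracle is an assumed interface:

```ts
// Counter-example guided inductive synthesis (CEGIS) over the dataset.
interface Oracle<X, Y> {
  // Returns a <x, y> the candidate analysis gets wrong, or null if none found.
  findCounterExample(pa: (x: X) => Y): [X, Y] | null;
}

function cegis<X, Y>(
  d: Array<[X, Y]>,
  synth: (d: Array<[X, Y]>) => (x: X) => Y,
  oracle: Oracle<X, Y>,
): (x: X) => Y {
  for (;;) {
    const pa = synth(d);                       // candidate analysis pa in L
    const cex = oracle.findCounterExample(pa); // test/verify the analyzer
    if (cex === null) return pa;               // no counter-example: done
    d = [...d, cex];                           // D <- D u { <x, y> }
  }
}
```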
Learning: Problem Formulation

  pa_best = argmin_{pa ∈ L} cost(D, pa)
  s.t. ∀⟨x, y⟩ ∈ D. y ⊑ pa(x)        (guarantees analysis soundness)

Cost Function (prefer the analysis with fewer errors):
  r(x, y, pa) = if (y ≠ pa(x)) then 1 else 0
  cost(D, pa) = Σ_{⟨x, y⟩ ∈ D} r(x, y, pa)
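The cost function translates directly into code: count the examples the candidate analysis mislabels. A sketch (simple equality stands in for the more general check against ⊑):

```ts
// cost(D, pa) = sum over <x, y> in D of r(x, y, pa),
// where r is 1 if the analysis output differs from the label, else 0.
type Analysis<X, Y> = (x: X) => Y;

function cost<X, Y>(dataset: Array<[X, Y]>, pa: Analysis<X, Y>): number {
  let errors = 0;
  for (const [x, y] of dataset) {
    if (pa(x) !== y) errors += 1; // r(x, y, pa)
  }
  return errors;
}
```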
Learning Algorithm

  p ∈ L ::= a | if g then p₁ else p₂

Enumerating complete programs is intractable: a tree with three parts, each with ~10^6 candidates, already yields ~10^18 combinations.
Key Idea: Synthesize Programs in Parts
Instead of searching ~10^18 complete programs, synthesize the tree one part at a time: first one piece (~10^6 candidates), then the next, then the next, for roughly 10^6 + 10^6 + 10^6 candidates in total instead of 10^6 × 10^6 × 10^6.
Learning Algorithm (step 1)

  a_best = argmin_{a ∈ Actions} cost(D, a)

  cost(D, a_best) = 0 → no errors, return a_best
  cost(D, a_best) > 0 → refine the analysis
Learning Algorithm (step 2)

  g_best = argmax_{g ∈ Guards} InfGain(D, g, a_best)

  When cost(D, a_best) > 0, find the split g* that best separates D into D₁ and D₂, and refine the analysis on each part.
Learning Algorithm (step 3)

  When InfGain(D, g, a_best) = 0 for every guard g, no split reduces entropy: stop refining and return approximate(D), a sound over-approximation.
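Put together, the three steps form an ID3-style tree learner. A TypeScript sketch, slightly simplified (infGain here drops the a_best argument; cost, infGain, and approximate are assumed helpers standing in for the paper's definitions; Actions and Guards are assumed nonempty):

```ts
type Action = string;
type Guard<X> = (x: X) => boolean;
type Prog<X> =
  | { kind: "action"; action: Action }
  | { kind: "branch"; guard: Guard<X>; then: Prog<X>; else: Prog<X> };

interface Learner<X> {
  actions: Action[];
  guards: Guard<X>[];
  cost: (d: Array<[X, Action]>, a: Action) => number;      // # of mislabels
  infGain: (d: Array<[X, Action]>, g: Guard<X>) => number; // entropy reduction
  approximate: (d: Array<[X, Action]>) => Action;          // sound fallback
}

function synthesize<X>(L: Learner<X>, d: Array<[X, Action]>): Prog<X> {
  // Step 1: best single action on this dataset.
  const aBest = L.actions.reduce((a, b) =>
    L.cost(d, b) < L.cost(d, a) ? b : a);
  if (L.cost(d, aBest) === 0) return { kind: "action", action: aBest };

  // Step 2: best guard by information gain.
  const gBest = L.guards.reduce((g, h) =>
    L.infGain(d, h) > L.infGain(d, g) ? h : g);
  if (L.infGain(d, gBest) === 0) {
    // Step 3: no split reduces entropy; over-approximate soundly.
    return { kind: "action", action: L.approximate(d) };
  }

  // Recurse on the two sub-datasets induced by the split.
  const dTrue = d.filter(([x]) => gBest(x));
  const dFalse = d.filter(([x]) => !gBest(x));
  return { kind: "branch", guard: gBest,
           then: synthesize(L, dTrue), else: synthesize(L, dFalse) };
}
```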
Learning: Decision Trees + CEGIS

The oracle (test/verify the analyzer) drives the loop, feeding counter-examples ⟨x, y⟩ ∉ D back into D until none remain.
How to find complex counter-examples quickly?
How to efficiently explore hard-to-find corner cases?
Naive Approach: Random Fuzzing
1. Pick a random training example ⟨x, y⟩ ∈ D
2. Mutate the input randomly, obtaining x′
3. Obtain the correct label by executing x′
4. Check for correctness: ∀⟨x, y⟩ ∈ D′. y ⊑ pa(x)
5. Repeat
Naive Approach: Random Fuzzing
Problems: the space of random mutations is exponential; obtaining labels by execution is slow; and there is no clear criterion for when to stop.
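A direct rendering of the naive oracle makes these problems concrete: every iteration is independent, and the only stopping rule is an arbitrary budget. A sketch with assumed mutate/execute helpers (equality again stands in for the ⊑ check):

```ts
// Naive random-fuzzing oracle: mutate a random training program, re-run it
// to get the ground-truth label, and check the candidate against it.
function randomFuzz<P, Y>(
  d: Array<[P, Y]>,
  mutate: (p: P) => P, // random program mutation (assumed helper)
  execute: (p: P) => Y, // run the program to obtain the correct label
  pa: (p: P) => Y,      // candidate analysis under test
  budget: number,       // when to stop? only an arbitrary budget
): [P, Y] | null {
  for (let i = 0; i < budget; i++) {
    const [p] = d[Math.floor(Math.random() * d.length)]; // 1. pick example
    const p2 = mutate(p);                                // 2. mutate input
    const y2 = execute(p2);                              // 3. correct label
    if (pa(p2) !== y2) return [p2, y2];                  // 4. soundness check
  }
  return null; // 5. repeat until the budget is exhausted
}
```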
The Oracle: Testing an Analyzer
Key Idea: take advantage of the candidate analysis pa itself.
How to sample from the space of all programs?
The Oracle: Testing an Analyzer
Guide the search by the execution-path coverage of pa: steer testing toward decision-tree paths of pa that the dataset does not yet exercise.
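One way to realize this guidance, sketched under heavy assumptions: enumerate the paths of the candidate's decision tree, and mutate existing examples so they exercise paths the dataset has not yet covered. pathOf, targetedMutate, and allPaths are hypothetical helpers, not the paper's API.

```ts
// Guided testing: prefer mutants that exercise uncovered decision-tree
// paths of the candidate analysis pa, instead of mutating blindly.
function guidedFuzz<P, Y>(
  d: Array<[P, Y]>,
  pathOf: (pa: (p: P) => Y, p: P) => string, // which tree path p takes
  targetedMutate: (p: P, uncoveredPath: string) => P | null,
  execute: (p: P) => Y,
  pa: (p: P) => Y,
  allPaths: string[],
): [P, Y] | null {
  const covered = new Set(d.map(([p]) => pathOf(pa, p)));
  for (const path of allPaths) {
    if (covered.has(path)) continue;      // only chase uncovered behavior
    for (const [p] of d) {
      const p2 = targetedMutate(p, path); // steer the mutant onto `path`
      if (p2 === null) continue;
      const y2 = execute(p2);
      if (pa(p2) !== y2) return [p2, y2]; // counter-example found
    }
  }
  return null;
}
```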