Sourcerer: An Infrastructure for Large-scale Collection and Analysis of Open-source Code
Sushil Bajracharya, Joel Ossher, Cristina Lopes
Donald Bren School of Information and Computer Sciences, University of California, Irvine
jossher@uci.edu
SOURCERER
Sourcerer’s Inception
• Started in 2005
• Motivation
  – Explore the use of structural information for code retrieval
  – Enable data mining on large quantities of source code
• Target: Open-source Java code
  – Open Source movement provides a large quantity of high quality code
  – Java is popular, and amenable to static analysis
Sourcerer Today
• Collection of loosely coupled Java tools
  – www.github.com/sourcerer/Sourcerer
• Aggregated repository of open source code
  – www.ics.uci.edu/~lopes/datasets/index.html
• Services
  – http://sourcerer.ics.uci.edu/services/
• Applications
Layered Architecture
[Diagram: layers labeled Models, Applications, Services, Stored Content, Tools]
Tools and Stored Content
[Diagram: Repository Manager, File Repository, SourcererDB, Repository Creator, Code Indexer, Crawler, Search Index, Internet]
Code Crawler
[Diagram: Repository Manager, File Repository, Repository Creator, Crawler, Internet]
• Input
  – List of seed pages
• Output
  – List of project pages
• Plugin-based
  – Sourceforge
  – Java.net
  – Tigris
  – Google Code Hosting
  – Apache
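The plugin-based design above can be pictured with a minimal sketch. Everything named here (`CrawlerPlugin`, `FixedHostPlugin`, `Crawler`, the `example.org` host) is hypothetical and only illustrates the idea of one plugin per hosting site turning seed pages into project pages; it is not Sourcerer's actual crawler API, and the toy plugin does no network access.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract: one implementation per hosting site.
interface CrawlerPlugin {
    // True if this plugin knows how to handle the given seed page.
    boolean accepts(String seedUrl);

    // Returns the project pages discovered from the seed page.
    List<String> findProjectPages(String seedUrl);
}

// Toy plugin that builds project URLs from a fixed list instead of fetching pages.
final class FixedHostPlugin implements CrawlerPlugin {
    private final String host;
    private final List<String> projectNames;

    FixedHostPlugin(String host, List<String> projectNames) {
        this.host = host;
        this.projectNames = projectNames;
    }

    @Override
    public boolean accepts(String seedUrl) {
        return seedUrl.contains(host);
    }

    @Override
    public List<String> findProjectPages(String seedUrl) {
        List<String> pages = new ArrayList<>();
        for (String name : projectNames) {
            pages.add("http://" + host + "/projects/" + name);
        }
        return pages;
    }
}

final class Crawler {
    private final List<CrawlerPlugin> plugins;

    Crawler(List<CrawlerPlugin> plugins) {
        this.plugins = plugins;
    }

    // Dispatches each seed page to the first plugin that accepts it;
    // seeds that no plugin accepts are skipped.
    List<String> crawl(List<String> seedPages) {
        List<String> projectPages = new ArrayList<>();
        for (String seed : seedPages) {
            for (CrawlerPlugin plugin : plugins) {
                if (plugin.accepts(seed)) {
                    projectPages.addAll(plugin.findProjectPages(seed));
                    break;
                }
            }
        }
        return projectPages;
    }
}
```

Under this reading, supporting a new hosting site means adding one more `CrawlerPlugin` implementation while the dispatch loop stays unchanged.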
File Repository
[Diagram: Repository Manager, File Repository, Repository Creator, Crawler, Internet]
• Local aggregated repository
• Repository Creator
  – Input
    • List of project pages
  – Output
    • Populated file repository
• Repository Manager
  – Housekeeping tasks
Feature Extractor
[Diagram: Repository Manager, File Repository, Repository Creator, Crawler, Internet]
• Input
  – File Repository
• Output
  – Files containing entities and relations
• Entity-relationship metamodel
• Headless Eclipse plugin
  – Uses Eclipse Java development tools (JDT)
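To make the entity-relationship metamodel concrete, here is a minimal sketch of what extracted entities and relations might look like in memory. The kinds listed (`CLASS`, `METHOD`, `CALLS`, `INSIDE`, and so on) and all class names are assumptions chosen for illustration; the slide does not enumerate the real metamodel, and this sketch does not use Eclipse JDT.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical entity and relation kinds; the real metamodel may differ.
enum EntityKind { PACKAGE, CLASS, METHOD, FIELD }

enum RelationKind { INSIDE, EXTENDS, CALLS, USES }

final class Entity {
    final EntityKind kind;
    final String fqn; // fully qualified name

    Entity(EntityKind kind, String fqn) {
        this.kind = kind;
        this.fqn = fqn;
    }
}

final class Relation {
    final RelationKind kind;
    final Entity source;
    final Entity target;

    Relation(RelationKind kind, Entity source, Entity target) {
        this.kind = kind;
        this.source = source;
        this.target = target;
    }
}

// Holds the entities and relations extracted from one project.
final class ExtractedModel {
    final List<Entity> entities = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    Entity addEntity(EntityKind kind, String fqn) {
        Entity entity = new Entity(kind, fqn);
        entities.add(entity);
        return entity;
    }

    void relate(RelationKind kind, Entity source, Entity target) {
        relations.add(new Relation(kind, source, target));
    }

    // Returns the targets of all relations of the given kind leaving `source`.
    List<Entity> targetsOf(Entity source, RelationKind kind) {
        List<Entity> targets = new ArrayList<>();
        for (Relation relation : relations) {
            if (relation.source == source && relation.kind == kind) {
                targets.add(relation.target);
            }
        }
        return targets;
    }
}
```

A model like this is flat enough to be written out as files of entities and relations, which is the output the slide describes.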
SourcererDB
[Diagram: SourcererDB, Code Indexer, Search Index]
• MySQL database
• Database importer
  – Incremental
  – Parallel
  – Input
    • Feature extractor output
  – Output
    • SourcererDB
Search Index
[Diagram: SourcererDB, Code Indexer, Search Index]
• Text search for code entities
• Apache Solr
  – Search platform for Lucene
• Code Indexer
  – Heavily parallel
Stored Contents Recap
• File Repository
• SourcererDB
• Search Index
Sourcerer Services
• Repository Access
  – Look up text matching SourcererDB entities / relations
• Relational Query
  – Direct access to SourcererDB
• Code Search Service
  – Access the Lucene index
• Dependency Slicing
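The slide only names Dependency Slicing, so the following is one plausible reading rather than the service's actual behavior: starting from a seed entity, collect everything it transitively depends on. The `DependencySlicer` class and the map-based graph are hypothetical and stand in for relations that would really come from SourcererDB.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DependencySlicer {
    // Maps an entity name to the names it directly depends on.
    private final Map<String, List<String>> dependsOn;

    DependencySlicer(Map<String, List<String>> dependsOn) {
        this.dependsOn = dependsOn;
    }

    // Returns the seed plus everything reachable from it (breadth-first),
    // in the order each entity was first visited.
    Set<String> slice(String seed) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.add(seed);
        while (!work.isEmpty()) {
            String current = work.poll();
            if (!visited.add(current)) {
                continue; // already processed, which also guards against cycles
            }
            List<String> direct =
                dependsOn.getOrDefault(current, Collections.<String>emptyList());
            for (String dependency : direct) {
                work.add(dependency);
            }
        }
        return visited;
    }
}
```

Entities that merely depend on the seed, as opposed to being depended on by it, are left out of the slice, which keeps the result limited to what the seed needs in order to compile.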
Applications
• Sourcerer Code Search Engine
  – sourcerer.ics.uci.edu/sourcerer/search/index.jsp
• CodeGenie
  – Test-driven code search
• Sourcerer API Search
  – Demo!
LESSONS LEARNED
Lesson One: Reuse
• Feature extractor 1.0
  – Corollary: javac
• Code crawler woes
Lesson Two: Performance & Scalability
• Research prototype
• Jars directory
• Repository migration
Lesson Three: Loose Coupling
• Sourcerer M1
• CASI
Lesson Four: YCMEH
• You can’t make everyone happy
  – Why only Java?
  – Why no X project or Y repository?
  – Why no versioning information?
  – …
  – If you try, no one will be happy (since your tool will never be released)
Thank you!
• Contact: jossher@uci.edu