DtCraft: A High-performance Distributed Execution Engine at Scale
Dr. Tsung-Wei Huang
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, IL, USA

Outline
- Streamline the cluster programming
- DtCraft system
- Leverage your time to produce promising results
- Hands-on examples
- Q&A

Motivation: A "Hard-coded" Distributed Timer
- General design partitions
  - Logical, physical, or hierarchical partitions
  - Design data are stored in shared storage (e.g., NFS, GPFS)
- Single-server multiple-client model
  - Server is the centralized coordinator
  - Clients exchange boundary timing with the server
- Techniques involved: non-blocking IO, event-driven programming, serialization/deserialization
[Figure: two-level hierarchical design with a top level and hierarchies M1 and M2. Three partitions, top-level, M1, and M2 (given by design teams).]
Huang et al., "A Distributed Timing Analysis Framework for Large Designs," IEEE/ACM DAC, 2016

What does "Productivity" Mean?
- Programming language
- Transparency
- Performance

Our Solution: DtCraft
- A unified engine to streamline cluster programming
- Completely built from the ground up using C++17
- Saves you time otherwise lost to the pain of DevOps
[Figure: layered software stack. Top: high-level C++17-based Stream Graph API. Middle: network programming, event-driven reactor, resource control, I/O stream, serialization, … Bottom: DtCraft Kernel (master, agents, executors).]
T.-W. Huang, C.-X. Lin, and M. D. F. Wong, "DtCraft: A distributed execution engine for compute-intensive applications," IEEE/ACM ICCAD, 2017

System Architecture
- Express your parallelism in our stream graph model
  - Generic dataflow at any granularity
- Deliver transparent concurrency through the kernel
  - Automatic workload distribution and message passing
DtCraft website: http://dtcraft.web.engr.illinois.edu/

Stream Graph Programming Model
- A general representation of a dataflow
  - Abstraction over computation and communication
- Analogous to the assembly line model
  - Vertex storage: goods store
  - Stream processing unit: independent workers
[Figure: stream graph between vertices A and B. Each vertex is a compute unit that generates data; each data stream has an ostream end, an istream end, and a data stream buffer.]

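The assembly-line analogy can be made concrete with a small in-process sketch. Every name in it (ToyStreamGraph, vertex, stream, send, run) is hypothetical and chosen for illustration; this is not the DtCraft API, and it runs in a single thread with no networking or containers. It only shows the shape of the model: vertices are endpoints, each stream has a buffer, and a callback consumes data at the receiving end.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-process model of a stream graph (NOT the DtCraft API).
struct ToyStreamGraph {
  using Callback = std::function<void(const std::string&)>;

  struct Stream {
    std::size_t from;                // producing vertex (ostream end)
    std::size_t to;                  // consuming vertex (istream end)
    std::deque<std::string> buffer;  // data stream buffer
    Callback on_data;                // computation run when data arrives
  };

  std::size_t num_vertices = 0;
  std::vector<Stream> streams;

  // Adds a vertex and returns its id.
  std::size_t vertex() { return num_vertices++; }

  // Adds a directed stream from -> to and returns its id.
  // Build the whole graph before calling run().
  std::size_t stream(std::size_t from, std::size_t to, Callback cb) {
    streams.push_back(Stream{from, to, {}, std::move(cb)});
    return streams.size() - 1;
  }

  // Producer side: append one message to the stream's buffer.
  void send(std::size_t s, std::string data) {
    streams[s].buffer.push_back(std::move(data));
  }

  // Delivers buffered messages to callbacks until every buffer is empty.
  // Callbacks may call send() on any stream. Returns the delivery count.
  std::size_t run() {
    std::size_t delivered = 0;
    bool progress = true;
    while (progress) {
      progress = false;
      for (std::size_t i = 0; i < streams.size(); ++i) {
        while (!streams[i].buffer.empty()) {
          std::string msg = std::move(streams[i].buffer.front());
          streams[i].buffer.pop_front();
          streams[i].on_data(msg);
          ++delivered;
          progress = true;
        }
      }
    }
    return delivered;
  }
};
```

In a real engine the buffers would sit on either side of a network connection and the callbacks would be invoked by an event loop; the toy run() loop stands in for both.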
Write a DtCraft Application
- Step 1: Decide the stream graph of your application
- Step 2: Specify the data types to stream
- Step 3: Define the stream computation callback
- Step 4: Attach resources on vertices (optional)
- Step 5: Submit
  ./submit --master=host hello-world
[Figure: vertices A and B joined by an ostream/istream pair; streamed data types shown as "String str; int id;"; Container 1 (1 CPU / 4GB RAM) and Container 2 (1 CPU / 8GB RAM).]

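Step 2 depends on turning typed data into bytes and back. The sketch below hand-rolls that for the two fields shown on the slide (a string and an int). The names Message, serialize, and deserialize are hypothetical, and the wire layout (4-byte length, string bytes, 4-byte id, host byte order) is an assumption made for illustration; it is not DtCraft's built-in serialization format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical message type mirroring the slide's "String str; int id;".
struct Message {
  std::string str;
  std::int32_t id = 0;
};

// Assumed layout: [uint32 length][length bytes of str][int32 id],
// all in host byte order (fine within one machine, not portable).
std::vector<unsigned char> serialize(const Message& m) {
  const std::uint32_t len = static_cast<std::uint32_t>(m.str.size());
  std::vector<unsigned char> out(sizeof(len) + len + sizeof(m.id));
  unsigned char* p = out.data();
  std::memcpy(p, &len, sizeof(len));
  p += sizeof(len);
  std::memcpy(p, m.str.data(), len);
  p += len;
  std::memcpy(p, &m.id, sizeof(m.id));
  return out;
}

// Returns false if the buffer does not hold exactly one well-formed message.
bool deserialize(const std::vector<unsigned char>& in, Message& m) {
  const std::size_t header = sizeof(std::uint32_t);
  const std::size_t tail = sizeof(std::int32_t);
  if (in.size() < header + tail) return false;

  std::uint32_t len = 0;
  std::memcpy(&len, in.data(), header);
  if (in.size() - header - tail != len) return false;

  m.str.assign(reinterpret_cast<const char*>(in.data() + header), len);
  std::memcpy(&m.id, in.data() + header + len, tail);
  return true;
}
```

A sender would write the result of serialize() to its ostream end and the receiver's callback would call deserialize() on the bytes it extracts; a framework with built-in serialization hides both calls from the user.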
Feedback Control Flow Example: Concurrent Ping-pong
- Each end keeps sending binary data ('1' or '0', chosen at random) to the other end
- Iteration breaks when one end has received a hundred 1s
Step 1: stream graph
  A <-> B, break at counter ≥ 100
Step 2: data type on both streams
  bool flag;
Step 3: A->B callback (istream at B)
  [=] (auto& B, auto& is) {
    Extract bool from is;
    if received 100: close;
    else send bool to A;
  }
Step 3: B->A callback (istream at A)
  [=] (auto& A, auto& is) {
    Extract bool from is;
    if received 100: close;
    else send bool to B;
  }
Step 4: A's resource: 1 CPU / 1 GB RAM; B's resource: 1 CPU / 1 GB RAM
Step 5: ./submit --master=127.0.0.1 ping-pong

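The termination rule of the ping-pong example can be checked with a sequential stand-in. This is a simplification, not the distributed program from the slide: messages alternate strictly between the two ends instead of flowing concurrently, A is assumed to send first, and the names PingPongResult and run_ping_pong are hypothetical.

```cpp
#include <random>

// Counters observed when the exchange stops (hypothetical type).
struct PingPongResult {
  int ones_at_a = 0;   // number of 1s received by A
  int ones_at_b = 0;   // number of 1s received by B
  long messages = 0;   // total messages delivered
};

// Sequential simulation: one random bit travels per step, alternating
// direction, until either end has received `limit` ones.
PingPongResult run_ping_pong(unsigned seed, int limit = 100) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution coin(0.5);  // '1' or '0' with equal odds

  PingPongResult r;
  bool deliver_to_b = true;  // assumption: A sends the first message
  for (;;) {
    const bool flag = coin(rng);
    ++r.messages;

    // The receiving end extracts the bool and updates its counter.
    int& ones = deliver_to_b ? r.ones_at_b : r.ones_at_a;
    if (flag) ++ones;

    // "if received 100: close; else send bool to the other end"
    if (ones >= limit) break;
    deliver_to_b = !deliver_to_b;
  }
  return r;
}
```

Calling run_ping_pong(42) returns the counters at the moment the loop closes; exactly one end reaches the limit, which is the feedback condition that stops the stream graph on the slide.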
Build: g++ ping-pong.cpp -lDtCraft -o ping-pong
~40 lines of code
- Single program
- Sequential flow
- Fully distributed
- Simple syntax
- Resource control
- Built-in serialization
- Asynchronous IO
- Multi-threaded
- Isolation
- … and more
Run: ~$ ./submit ping-pong or ~$ ./ping-pong

Distributed Timing Analysis using DtCraft
- Two-level hierarchical design (three partitions)
- Three timer vertices
- One user vertex
- Four Linux containers
- Six input/output streams
- Each container has one OpenTimer program operating on one design hierarchy
API / timing commands: report_at, report_slew, report_rat, remove_gate, insert_gate, power_gate, insert_net, connect_pin, ...
[Figure: the two-level design (top level, M1, M2) next to its stream graph: a user vertex and timer vertices for Top-level, M1, and M2, with streams labeled timing commands, boundary timing, and optimization.]

Exchange Timing Data – Delay, Slew, etc.
DtCraft
- In-context streaming with < 30 lines
Existing framework
- Out-of-context streaming takes > 300 lines
- Many extra files: Extra.pb.h, Extra.pb.cpp, …, Source.cpp

Deploy the Distributed Timer in One Line
DtCraft
- Only three lines for resource control in Linux containers (Container 1, Container 2, Container 3)
- ~$ ./submit --master=127.0.0.1 binary
Existing framework
- Duplicate the code for each partition (Top.cpp, M1.cpp, M2.cpp)
- Wrap up with submission scripts

Comparison with the Hard-coded Method
- 17× fewer lines of code
  - 33% from message passing
  - 67% from boilerplate code
- 7-11% performance loss
  - Transparent concurrency
  - API cost
- The potential productivity gain is tremendous!
[Figure: two bar charts comparing DtCraft with the hard-coded method: development time (# weeks) and runtime on 40 AWS nodes for Small, Medium, and Large designs.]

Getting Involved with DtCraft
- GitHub: https://github.com/twhuang-uiuc/DtCraft
- Star our project to receive updates
- Open to collaboration!
DtCraft keywords: cluster computing, scalability, groovy API, security, productivity, MIT license

Distributed Online Machine Learning
- One image stream generator
- One image label classifier/trainer
[Figure: an image stream flowing from a data source (stream generator) to a DNN classifier (online image label classifier).]

Only 60 lines of code to create distributed ML with streaming

Thank you!
Tsung-Wei Huang
twh760812@gmail.com
http://web.engr.illinois.edu/~thuang19/