Weld: A Common Runtime for Data Analytics
Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia
Stanford InfoLab, *MIT CSAIL
Motivation
Modern data applications combine many disjoint processing libraries and functions
» Relational, statistics, machine learning, …
» E.g. the PyData stack
+ Great results by leveraging the work of 1000s of authors
– No optimization across these functions
How Bad is This Problem?
The growing gap between memory and processing speeds makes the traditional way of combining functions worse
  data = pandas.parse_csv(string)
  filtered = pandas.dropna(data)
  avg = numpy.mean(filtered)
[Diagram: the three calls run as separate steps: parse_csv, dropna, mean]
5-30x slowdowns in NumPy, Pandas, TensorFlow, etc.
How We Solve This
[Diagram: libraries (SQL, machine learning, graph algorithms, …) sit on top of a Common Runtime, which targets hardware (CPU, GPU, …)]
How We Solve This
[Diagram: libraries (SQL, machine learning, graph algorithms, …) call the Runtime API of the Weld runtime, which consists of the Weld IR, an Optimizer, and Backends targeting hardware (CPU, GPU, …)]
Runtime API
Uses lazy evaluation to collect work across libraries
User Application:
  data = lib1.f1()
  lib2.map(data, item => lib3.f2(item))
[Diagram: each function (f1, map, f2) submits an IR fragment through the Runtime API to the Weld Runtime, which builds a combined IR program and compiles it to optimized machine code that operates on the data in the application]
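
The lazy-evaluation idea on this slide can be illustrated with a minimal Python sketch. It is not the Weld Runtime API: `LazyValue`, `lib1_f1`, `lib2_map`, and `lib3_f2` are hypothetical names that mirror the slide's `lib1.f1`, `lib2.map`, and `lib3.f2`, and the "IR fragments" here are just labels plus Python callables. The sketch only shows that library calls can record work instead of running it, so the whole program is visible before anything executes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LazyValue:
    """Hypothetical stand-in for a value whose computation is deferred."""
    source: list
    fragments: List[str] = field(default_factory=list)   # recorded steps, in order
    ops: List[Callable] = field(default_factory=list)    # per-item functions to apply

    def defer(self, name: str, fn: Callable) -> "LazyValue":
        # Record the work instead of running it; return a new lazy value.
        return LazyValue(self.source, self.fragments + [name], self.ops + [fn])

    def evaluate(self) -> list:
        # The combined program runs only here: one pass, all recorded ops per item.
        out = []
        for item in self.source:
            for fn in self.ops:
                item = fn(item)
            out.append(item)
        return out


def lib1_f1() -> LazyValue:
    # Hypothetical "library 1": produces data lazily and registers its fragment.
    return LazyValue([1, 2, 3], fragments=["lib1.f1"])


def lib3_f2(item: int) -> int:
    # Hypothetical "library 3": a per-item function.
    return item * 10


def lib2_map(data: LazyValue, fn: Callable) -> LazyValue:
    # Hypothetical "library 2": registers a map fragment, does no work yet.
    return data.defer("lib2.map", fn)


data = lib1_f1()
result = lib2_map(data, lib3_f2)
print(result.fragments)   # fragments collected across the libraries
print(result.evaluate())  # work happens only when the result is needed
```

Nothing is computed until `evaluate()` is called, which is the point at which a real runtime could optimize the combined program before generating code.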
Weld IR
Designed to meet three goals:
1. Library composition: support complete workloads such as nested parallel calls
2. Ability to express optimizations: e.g. loop fusion, vectorization, loop tiling
3. Explicit parallelism
Weld IR
Small, powerful design inspired by “monad comprehensions”
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results
» E.g. append items to a list, compute a sum
» Can be implemented differently on different hardware
Captures relational algebra, functional APIs like Spark, linear algebra, and compositions thereof
Examples
Implement functional operators using builders
  def map(data, f):
    builder = new vecbuilder[int]
    for x in data:
      merge(builder, f(x))
    result(builder)
  def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
      merge(builder, x)
    result(builder)
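
The builder pseudocode above can be approximated in plain Python. This is a sequential sketch for illustration, not Weld IR: `VecBuilder`, `Merger`, `map_with_builder`, and `reduce_with_builder` are hypothetical names modeled on the slide's `vecbuilder`, `merger`, `map`, and `reduce`, and the loops run serially rather than in parallel.

```python
import operator


class VecBuilder:
    """Hypothetical analogue of the slide's vecbuilder: collects items into a list."""

    def __init__(self):
        self._items = []

    def merge(self, value):
        self._items.append(value)

    def result(self):
        return list(self._items)


class Merger:
    """Hypothetical analogue of the slide's merger[zero, func]: folds values into one."""

    def __init__(self, zero, func):
        self._acc = zero
        self._func = func

    def merge(self, value):
        self._acc = self._func(self._acc, value)

    def result(self):
        return self._acc


def map_with_builder(data, f):
    # Same shape as the slide's map: merge f(x) for every x, then take the result.
    builder = VecBuilder()
    for x in data:
        builder.merge(f(x))
    return builder.result()


def reduce_with_builder(data, zero, func):
    # Same shape as the slide's reduce: merge every x into a single accumulator.
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()


print(map_with_builder([1, 2, 3], lambda x: x * x))
print(reduce_with_builder([1, 2, 3], 0, operator.add))
```

Because callers only use `merge` and `result`, the storage strategy behind a builder is hidden, which is what lets a builder be implemented differently on different hardware.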
Example Optimization: Fusion
  squares = map(data, x => x * x)
  sum = reduce(data, 0, +)
After fusion:
  bld1 = new vecbuilder[int]
  bld2 = new merger[0, +]
  for x in data:
    merge(bld1, x * x)
    merge(bld2, x)
Loops can be merged into one pass over data
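
The effect of fusion can be checked with a small Python sketch. It is hand-written for illustration and is not output of the Weld optimizer: `CountingList`, `unfused`, and `fused` are hypothetical helpers, and plain lists and an integer accumulator stand in for the two builders. `CountingList` only counts how many times the data is iterated.

```python
class CountingList(list):
    """List that counts how many times it is iterated (illustration only)."""

    def __init__(self, *args):
        super().__init__(*args)
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return super().__iter__()


def unfused(data):
    # Two independent operators, each making its own pass over the data.
    squares = [x * x for x in data]  # pass 1: map
    total = 0
    for x in data:                   # pass 2: reduce
        total += x
    return squares, total


def fused(data):
    # One loop feeding both results, as in the fused form on the slide.
    squares = []
    total = 0
    for x in data:
        squares.append(x * x)
        total += x
    return squares, total


a = CountingList([1, 2, 3])
b = CountingList([1, 2, 3])
print(unfused(a), a.passes)  # same results, two passes
print(fused(b), b.passes)    # same results, one pass
```

Both versions return the same values; only the number of passes over `data` differs, which matters more as the gap between memory and processing speed grows.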
Implementation
Prototype with APIs in Scala and Python
» LLVM and Voodoo for code gen
Integrations: TensorFlow, NumPy, Pandas, Spark
Results: Individual Workloads
[Figure: runtime in seconds against number of threads. Panels: SQL (TPC-H queries Q1, Q3, Q6, Q12), PageRank, Word2Vec. Series shown: HyPer, GraphMat, TF, TF-Op, hand-optimized (H.o.), Weld. TF-Op = C++ operator]
Results: Existing Frameworks
[Figure: runtime in seconds, one panel on a log10 scale. Panels: TPC-H (Q1, Q6), Vector Sum (1 core, 12 cores), Logistic Regression (1 thread, 12 threads). Series shown: SparkSQL, NP, NExpr, TF, hand-optimized, Weld]
Integration effort: 500 lines glue, 30 lines/operator
Results: Cross-Library Optimization
[Figure: two panels. Pandas + NumPy: runtime in seconds (log10) for Current; Weld, no CLO; Weld, CLO; Weld, 12 core. Spark SQL UDF: runtime in seconds for Scala UDF and Weld. Speedup annotations in the figure: 14x, 31x, 290x. CLO = cross-library optimization]
Conclusion
The way we compose software will have to change to efficiently use modern hardware
Weld is our first attempt at such a design – lots of open questions!
» Optimization, specialized hardware, domain info, …
Open source: this spring
We’re hiring! (postdocs)