PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Presented by Qinyuan Sun
Slides are modified from first author Yunseong Lee's slides
Outline
• Prediction Serving Systems
• Limitations of Black-box Approaches
• PRETZEL: White-box Prediction Serving System
• Evaluation
• Conclusion
Machine Learning Prediction Serving
1. Models are learned from data
2. Models are deployed and served together
Performance goals:
1) Low latency
2) High throughput
3) Minimal resource usage
[Diagram: Data → Learn → Model → Deploy → Server ← Users; left half: training, right half: prediction serving]
ML Prediction Serving Systems
• State-of-the-art systems: Clipper, TF Serving, ML.Net
• External optimizations: replication, result ensemble, caching, request batching
• Assumption: models are black boxes
• Re-use the same code as in the training phase
• Encapsulate all operations into a single function call (e.g., predict())
• Apply only external optimizations
[Figure: a prediction serving system answering text analysis ("Pretzel is tasty") and image recognition (cat vs. car) requests]
How Do Models Look inside the Boxes?
• "Pretzel is tasty" (text) → Model → positive (☺) vs. negative (☹)
<Example: Sentiment Analysis>
How Do Models Look inside the Boxes?
• A model is a DAG of operators
• Featurizers: Tokenizer → {Char Ngram, Word Ngram} → Concat
• Predictor: Logistic Regression
<Example: Sentiment Analysis>
How Do Models Look inside the Boxes?
• Tokenizer splits text into tokens
• Char Ngram / Word Ngram extract n-grams
• Concat merges the two vectors
• Logistic Regression computes the final score
<Example: Sentiment Analysis>
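To make the DAG above concrete, here is a minimal sketch of the sentiment-analysis pipeline in plain Python. This is purely illustrative (the real pipeline is an ML.Net model in C#); the function names, vocabularies, and weights are all made up for the example.

```python
import math

def tokenize(text):
    # Tokenizer: split text into tokens
    return text.lower().split()

def char_ngrams(tokens, n=3):
    # Char Ngram: extract character n-grams from the joined token stream
    joined = " ".join(tokens)
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]

def word_ngrams(tokens, n=2):
    # Word Ngram: extract word n-grams
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(ngrams, vocab):
    # Map n-grams to a count vector over a fixed vocabulary
    vec = [0.0] * len(vocab)
    for g in ngrams:
        if g in vocab:
            vec[vocab[g]] += 1.0
    return vec

def predict(text, char_vocab, word_vocab, weights, bias):
    tokens = tokenize(text)
    # Concat: merge the two feature vectors into one
    features = (featurize(char_ngrams(tokens), char_vocab)
                + featurize(word_ngrams(tokens), word_vocab))
    # Logistic Regression: dot product + sigmoid gives the final score
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-margin))
```

A score above 0.5 would be read as "positive", below as "negative".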
Many Models Have Similar Structures
• Many parts of a model can be re-used in other models
• e.g., customer personalization, templates, transfer learning
• Identical sets of operators, with different parameters
Outline
• Prediction Serving Systems
• Limitations of Black-box Approaches
• PRETZEL: White-box Prediction Serving System
• Evaluation
• Conclusion
Limitation 1: Resource Waste
• Resources are isolated across black boxes
1. Unable to share memory space → memory is wasted on duplicate objects (despite similarities between models)
2. No coordination of CPU resources between boxes → serving many models on one machine can spawn too many threads
Limitation 2: Operators' Characteristics Are Ignored
1. Operators have different performance characteristics
• Concat materializes a new vector
• LogReg takes only 0.3% of the latency (contrary to the training phase)
2. A better plan exists if such characteristics are considered
• Re-use existing vectors instead of materializing new ones
• Apply in-place updates in LogReg
[Figure: latency breakdown of the pipeline — the n-gram featurizers dominate (34.2% and 32.7%), the remaining operators take 23.1% and 9.6%, and LogReg takes only 0.3%]
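The "in-place update" idea on this slide can be sketched as follows: instead of concatenating the char- and word-n-gram vectors and then taking a dot product, fold each featurizer's sparse output directly into a running margin. This is an illustrative sketch of the optimization, not PRETZEL's actual code; the sparse-dict representation is assumed for the example.

```python
import math

def logreg_score_inplace(sparse_parts, weight_parts, bias=0.0):
    """sparse_parts[i] is a sparse vector {index: value} produced by
    featurizer i; weight_parts[i] is the weight slice for that featurizer.
    The margin is accumulated in place, so Concat never materializes."""
    margin = bias
    for part, weights in zip(sparse_parts, weight_parts):
        for idx, val in part.items():
            margin += weights[idx] * val  # no concatenated vector is built
    return 1.0 / (1.0 + math.exp(-margin))
```

The output is identical to scoring the concatenated vector, but no intermediate vector is allocated on the prediction path.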
Limitation 3: Lazy Initialization
• ML.Net initializes code and memory lazily (efficient in the training phase)
• Experiment: run 250 Sentiment Analysis models 100 times each → cold: first execution / hot: average of the remaining 99
• Long-tail latency in the cold case: code analysis, just-in-time (JIT) compilation, memory allocation, etc.
• [Figure: cold executions are up to 444x slower than hot ones for some operators, 13x for others]
• Makes it difficult to provide strong Service Level Agreements (SLAs)
Outline
• (Black-box) Prediction Serving Systems
• Limitations of Black-box Approaches
• PRETZEL: White-box Prediction Serving System
• Evaluation
• Conclusion
PRETZEL: White-box Prediction Serving
• Analyze models to optimize their internal execution
• Let models co-exist on the same runtime, sharing computation and memory resources
• Optimize models in two directions:
1. End-to-end optimizations
2. Multi-model optimizations
End-to-End Optimizations
Optimize the execution of each individual model from start to end:
1. [Ahead-of-time compilation] Compile operators' code in advance → no JIT overhead
2. [Vector pooling] Pre-allocate data structures → no memory allocation on the data path
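The vector-pooling point can be illustrated with a minimal sketch: buffers are allocated once at model load time, and the prediction path only recycles them. This is an assumed design for illustration, not PRETZEL's implementation (which lives in the .NET runtime).

```python
class VectorPool:
    """Pre-allocates all buffers up front so the hot prediction path
    never touches the allocator."""

    def __init__(self, size, count):
        # Offline phase: allocate every buffer the model will ever need
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        # Data path: hand out an existing buffer, no allocation
        return self._free.pop()

    def release(self, buf):
        # Zero the buffer and return it to the pool for the next request
        for i in range(len(buf)):
            buf[i] = 0
        self._free.append(buf)
```

A model plan would size the pool using the "maximum vector size" statistic mentioned later in the offline phase.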
Multi-model Optimizations
Share computation and memory across models:
1. [Object Store] Share operators' parameters/weights → maintain only one copy
2. [Sub-plan Materialization] Reuse intermediate results computed by other models → save computation
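Both techniques on this slide amount to deduplication keyed on shared structure. The sketch below is an assumed, simplified design (the class names and keys are invented for illustration): an object store keeps one copy of identical parameters, and a cache memoizes a shared sub-plan's intermediate result.

```python
class ObjectStore:
    """One copy of identical operator parameters, shared across models."""

    def __init__(self):
        self._params = {}

    def put(self, key, value):
        # If another model already registered these parameters,
        # return the existing object instead of storing a duplicate.
        return self._params.setdefault(key, value)


class SubPlanCache:
    """Sub-plan materialization: reuse intermediate results that another
    model (with the same stage and input) has already computed."""

    def __init__(self):
        self._cache = {}

    def run(self, stage_id, inputs, fn):
        key = (stage_id, inputs)
        if key not in self._cache:
            self._cache[key] = fn(inputs)  # computed at most once
        return self._cache[key]
```

Two models whose featurizers share a dictionary would pass the same key to `put`, and two plans sharing a prefix stage would hit the same cache entry.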
System Components
1. Flour: intermediate representation (e.g., var fContext = ...; var tTokenizer = ...; return fPrgm.Plan();)
2. Oven: compiler/optimizer
3. Runtime: executes inference queries (Object Store, Scheduler)
4. FrontEnd: handles user requests
Prediction Serving with PRETZEL
1. Offline
• Analyze the structural information of models
• Build a ModelPlan for optimal execution
• Register the ModelPlan with the Runtime
2. Online
• Handle prediction requests
• Coordinate CPU & memory resources (FrontEnd + Runtime)
System Design: Offline Phase
1. Translate the Model (DAG: Tokenizer → Char Ngram / Word Ngram → Concat → LogReg) into a Flour program:

    var fContext = new FlourContext(...);
    var tTokenizer = fContext.CSV
        .FromText(fields, fieldsType, sep)
        .Tokenize();
    var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
    var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
    var fPrgrm = tCNgram
        .Concat(tWNgram)
        .ClassifierBinaryLinear(cParams);
    return fPrgrm.Plan();
System Design: Offline Phase
2. Oven (a rule-based optimizer/compiler) builds the ModelPlan from the Flour program:
• Push the linear predictor & remove Concat
• Group operators into stages (e.g., Stage 1: Tokenizer and the n-gram featurizers; Stage 2: the classifier)
• The ModelPlan records the logical DAG, the parameters (e.g., dictionary, n-gram length), and statistics (e.g., dense vs. sparse, maximum vector size)
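As a rough illustration of the stage-grouping rule (this is an assumed heuristic, not Oven's actual rule set): walk the operator chain and close a stage whenever an operator materializes its output, like Concat in the slide's example.

```python
def group_into_stages(operators, materializing=("Concat",)):
    """Group a linear chain of operators into stages, breaking after any
    operator that materializes its output. Purely illustrative."""
    stages, current = [], []
    for op in operators:
        current.append(op)
        if op in materializing:
            stages.append(current)   # materialization ends the stage
            current = []
    if current:
        stages.append(current)
    return stages
```

Applied to the sentiment-analysis chain, this yields the two stages shown on the slide.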
System Design: Offline Phase
3. The ModelPlan is registered with the Runtime:
1) Store the parameters in the Object Store & record the mapping between parameters and logical stages
2) Find the most efficient physical implementation for each logical stage, using the parameters & statistics (e.g., n-gram length, sparse vs. dense representation)
3) Register the selected physical stages in the Catalog
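Step 2) can be sketched as a simple lookup from stage statistics to a physical implementation. The rule set and kernel names below are invented for illustration; the real selection logic in Oven is not specified on this slide beyond "sparse vs. dense" and vector size.

```python
def select_physical_stage(stats):
    """Pick a physical implementation for a logical stage from its
    statistics (hypothetical rules and kernel names)."""
    if stats.get("sparse", False):
        return "sparse-ngram-kernel"       # sparse input -> sparse kernel
    if stats.get("max_vector_size", 0) > 1_000_000:
        return "streaming-dense-kernel"    # huge vectors -> streaming kernel
    return "dense-ngram-kernel"            # default dense implementation
```

The chosen implementation name is what gets registered in the Catalog for the online phase.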
System Design: Online Phase
1. A prediction request arrives, e.g., <Model1, "Pretzel is tasty">
2. Physical stages are instantiated along with their parameters from the Object Store
3. Stages are executed on thread pools managed by the Scheduler
4. The result is sent back to the client
• Models with matching logical stages (e.g., Model1's S1, S2 and Model2's S1', S2') share the same physical stages
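The online path above can be tied together in a small sketch: the runtime looks up a registered plan, binds each physical stage to its parameters from the object store, and runs the stages in order. This is an assumed, heavily simplified design (single-threaded; the real runtime schedules stages on thread pools).

```python
class Runtime:
    """Minimal sketch of the online phase: catalog lookup, parameter
    binding from the object store, then stage-by-stage execution."""

    def __init__(self):
        self.catalog = {}       # model_id -> list of (stage_fn, param_key)
        self.object_store = {}  # param_key -> shared parameters

    def register(self, model_id, plan):
        # Offline phase: a ModelPlan's selected stages are registered here
        self.catalog[model_id] = plan

    def predict(self, model_id, value):
        # Online phase: run each stage with its parameters bound
        for stage_fn, param_key in self.catalog[model_id]:
            value = stage_fn(value, self.object_store.get(param_key))
        return value
```

Two models registering plans that reference the same `param_key` would share one parameter object, as in the offline slides.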
Outline
• (Black-box) Prediction Serving Systems
• Limitations of Black-box Approaches
• PRETZEL: White-box Prediction Serving System
• Evaluation
• Conclusion
Evaluation
• Q: How much does PRETZEL improve performance over black-box approaches?
• in terms of latency, memory, and throughput
• 500 models from the Microsoft Machine Learning team:
• 250 Sentiment Analysis (memory-bound)
• 250 Attendee Count (compute-bound)
• System configuration: 16-core CPU, 32 GB RAM, Windows 10, .NET Core 2.0