

  1. The AI Thunderdome: Using OpenStack to accelerate AI training with Sahara, Spark, and Swift
     Sean Pryor, Sr. Cloud Consultant, RHCE, Red Hat
     https://www.redhat.com | spryor@redhat.com

  2. Overview
     This talk will cover:
     ● Brief explanations of ML, Spark, and Sahara
     ● Some notes on preparation for Sahara (and some issues we hit in our lab while preparing for this talk)
     ● A look at Machine Learning concepts inside Spark
     ● Cross Validation and Model Selection
     ● Sparkflow architecture
     ● Example code

  3. Big Data and OpenStack

  4. Big Data and OpenStack
     From the user survey (https://www.openstack.org/analytics):
     ● A lot of data resides on OpenStack already
     ● The data is already there. Why move it elsewhere to analyze it?
     ● The tools are already there to do the analysis

  5. Sahara+Spark+Swift Architecture
     Basic architecture outline
     ● Sahara is a wrapper around Heat
       ○ It does more than just Spark too
     ● The basic architecture involves just Spark on compute nodes
     ● The Spark cluster can directly access Swift via swift://container/object URLs (see the configuration sketch below)
     ● Code deployed on Spark clusters can access things independently as well
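     As a rough illustration of the Swift access pattern above, the sketch below configures the Hadoop Swift filesystem from PySpark and reads an object by swift:// URL. The service name ("sahara", matching the swift://container.sahara/... form used on Sahara-provisioned clusters), the container/object names, and the credential values are placeholders; the exact fs.swift.service.* keys and auth endpoint depend on the hadoop-openstack and Keystone versions in your image.

     from pyspark.sql import SparkSession

     # Placeholder Keystone credentials; spark.hadoop.* settings are passed
     # through to the Hadoop configuration used by the Swift filesystem.
     spark = (SparkSession.builder
              .appName("swift-read-example")
              .config("spark.hadoop.fs.swift.service.sahara.auth.url",
                      "https://keystone.example.com:5000/v2.0/tokens")
              .config("spark.hadoop.fs.swift.service.sahara.tenant", "demo")
              .config("spark.hadoop.fs.swift.service.sahara.username", "demo")
              .config("spark.hadoop.fs.swift.service.sahara.password", "secret")
              .getOrCreate())

     # Read an object from the "training-data" container (placeholder names)
     df = spark.read.csv("swift://training-data.sahara/samples.csv", header=True)
     df.show(5)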

  6. Spark Architecture Overview
     Basic architecture outline
     ● Spark has a master/slave architecture
     ● The cluster manager can be the built-in standalone manager, Mesos, YARN, or Kubernetes
     ● Spark is built on top of the traditional Map/Reduce framework, but has additional tools, notably ones for Machine Learning
     ● For TensorFlow, there are several frameworks that make training and deploying models on Spark a lot easier
     ● Workers keep an in-memory data cache - this is important to know when using TensorFlow (see the sketch below)
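     To make the master/worker split and the in-memory cache concrete, here is a minimal PySpark sketch. The spark:// master URL and host name are placeholders; the same code runs under YARN or Kubernetes by changing the master setting.

     from pyspark.sql import SparkSession

     # Placeholder standalone-master URL; a Sahara-provisioned Spark cluster
     # exposes one like this on the master node.
     spark = (SparkSession.builder
              .master("spark://spark-master.example.com:7077")
              .appName("cache-example")
              .getOrCreate())

     df = spark.range(0, 1000000)

     # cache() pins the DataFrame in the workers' memory, so repeated passes
     # (e.g. iterative ML training) don't re-read the source each time.
     df.cache()
     print(df.count())   # first action materializes and caches the data
     print(df.count())   # subsequent actions read from the in-memory cache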

  7. Deploying Sahara
     A few notes when deploying Spark clusters via Sahara
     Image modifications are needed
     ● guestmount works great here
     ● pip install:
       ○ tensorflow or tensorflow-gpu
       ○ keras
       ○ sparkdl
       ○ sparkflow
     ● Add supergroup to the ubuntu user
     Ensure hadoop swift support is present
     ● java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem not found
     ● This error indicates support is missing; you may need to reinstall /usr/lib/hadoop-mapreduce/hadoop-openstack.jar (see the sketch below)
     The OpenStack job framework doesn't support Python
     ● The Job/Job Execution/Job Template framework assumes Java
     ● In order to do Python, it likely means spark-submit
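     One possible workaround for the missing SwiftNativeFileSystem class, assuming hadoop-openstack.jar is already present at the path above on the cluster nodes, is to point Spark at the jar explicitly when building the session. The application name and swift:// object names here are placeholders.

     from pyspark.sql import SparkSession

     # spark.jars adds the listed jars to the driver and executor classpaths,
     # so the Swift filesystem class can be resolved.
     spark = (SparkSession.builder
              .appName("swift-support-check")
              .config("spark.jars", "/usr/lib/hadoop-mapreduce/hadoop-openstack.jar")
              .getOrCreate())

     # Quick smoke test: a read against any swift:// URL (placeholder names)
     # fails fast with ClassNotFoundException if Swift support is still missing.
     spark.read.text("swift://some-container.sahara/some-object.txt").show(1)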

  8. Machine Learning with Spark

  9. Training AI
     Basic overview of AI and AI training
     ● For ML techniques, broadly, each iteration tries to fit a function to the data. Each new iteration refines the function
     ● Features: characteristics of a single datapoint
     ● Labels: outputs of a Machine Learning model
     ● Learning rate: how much each new iteration changes the function
     ● Loss: how far from reality each label is
     ● Regularization: penalizes complex functions. This helps prevent overfitting (a toy sketch of these terms follows below)
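     The toy sketch below puts those terms together, assuming a 1-D linear model y = w*x, a squared-error loss, and an L2 penalty. It is plain Python for illustration only, not the Spark API; the data values are made up.

     # Toy gradient-descent sketch: fit y_hat = w * x to a handful of points.
     features = [1.0, 2.0, 3.0, 4.0]          # characteristics of each datapoint
     labels   = [2.1, 3.9, 6.2, 8.1]          # the outputs we want the model to learn
     learning_rate = 0.01                      # how much each iteration changes the function
     reg_strength = 0.1                        # weight of the regularization penalty

     w = 0.0                                   # the "function" being fitted
     for step in range(200):
         # Loss: how far each predicted label is from reality, plus a penalty on w
         predictions = [w * x for x in features]
         loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
         loss += reg_strength * w ** 2

         # Gradient of the loss with respect to w, then one refinement step
         grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, labels, features)) / len(labels)
         grad += 2 * reg_strength * w
         w -= learning_rate * grad

     print("learned weight: %.3f, final loss: %.4f" % (w, loss))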

  10. Spark Machine Learning
      Important Components in Spark ML (a short fit()/transform() sketch follows the lists below)
      DataFrame
      ● Built on the regular Spark RDD/DataFrame API
      ● SQL-like
      ● Lazy evaluation
      ● Notably, transform() doesn't trigger evaluation. Things like count() do
      ● Supports a Vector type in addition to regular datatypes
      Transformer
      ● Transformers add/change data in a DataFrame
      ● Transformers implement a transform() method which returns a modified DataFrame
      Estimator
      ● Estimators are Transformers that instead output a model
      ● Estimators implement a fit() method which trains the algorithm on the data
      ● Estimators can also give you data about the model, like weights and hyperparameters
      ● Models can be saved/reused
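      A minimal sketch of the Transformer/Estimator pattern, assuming an existing SparkSession named spark; the column names and rows are made up for illustration.

      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.feature import HashingTF, Tokenizer

      # A DataFrame of (id, text, label) rows; placeholder data
      df = spark.createDataFrame(
          [(0, "spark swift sahara", 1.0), (1, "cats and dogs", 0.0)],
          ["id", "text", "label"])

      # Transformers: transform() returns a new DataFrame (lazily; nothing runs yet)
      words = Tokenizer(inputCol="text", outputCol="words").transform(df)
      feats = HashingTF(inputCol="words", outputCol="features").transform(words)

      # Estimator: fit() trains on the data and returns a model (itself a Transformer)
      model = LogisticRegression(maxIter=10).fit(feats)

      # The fitted model exposes learned weights and can be saved/reused
      print(model.coefficients)
      model.transform(feats).select("id", "probability", "prediction").show()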

  11. Cross Validation
      Automatic selection of the best model
      ● CrossValidator allows you to select model parameters based on the results of parallel training
      ● Wraps a Pipeline, and executes several pipelines in parallel with different parameters
      ● Requires a grid of parameters to train against
      ● Splits the dataset into N folds, with a ⅔ train / ⅓ test split
      ● Requires a loss metric to optimize against; Evaluator classes have these pre-baked
      ● After evaluating on all sets of parameters, the best is trained and tested against the entire dataset
      ● The parameter grid should ideally be small
      ● The folding of the dataset means that it's not ideal for small datasets
      ● Still requires some expertise in making sure it doesn't overfit, or that other errors don't occur

  12. Example Code

  13. Parallel Hyperparameter Training: Spark CrossValidation Sample Code
      Right out of the manual: https://spark.apache.org/docs/2.3.0/ml-tuning.html

      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.feature import HashingTF, Tokenizer
      from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

      training = spark.createDataFrame([
          (0, "a b c d e spark", 1.0),
          (1, "b d", 0.0),
          # ... additional labeled rows elided on the slide
      ], ["id", "text", "label"])

      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
      lr = LogisticRegression(maxIter=10)
      pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

      paramGrid = ParamGridBuilder() \
          .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
          .addGrid(lr.regParam, [0.1, 0.01]) \
          .build()

      crossval = CrossValidator(
          estimator=pipeline,
          estimatorParamMaps=paramGrid,
          evaluator=BinaryClassificationEvaluator(),
          numFolds=2)  # use 3+ folds in practice

      cvModel = crossval.fit(training)

      test = spark.createDataFrame([
          (4, "spark i j k"),
          (5, "l m n"),
          (6, "mapreduce spark"),
          (7, "apache hadoop")
      ], ["id", "text"])

      prediction = cvModel.transform(test)
      selected = prediction.select("id", "text", "probability", "prediction")
      for row in selected.collect():
          print(row)

  14. Parallel Hyperparameter Training: Spark CrossValidation Sample Code
      ● The boilerplate start sets up the Spark Session and the training data
      ● Tokenizer takes in the input strings and outputs tokens
      ● HashingTF generates features by hashing based on the frequency of the input
      ● LogisticRegression is one of the pre-canned ML algorithms
      ● Pipeline sets up all the stages

      from pyspark.sql import SparkSession
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression   # needed for lr below
      from pyspark.ml.feature import HashingTF, Tokenizer

      spark = SparkSession.builder.appName("SparkCV").getOrCreate()

      training = spark.createDataFrame([
          (0, "a b c d e spark", 1.0),
          (1, "b d", 0.0),
          # ... additional labeled rows elided on the slide
      ], ["id", "text", "label"])

      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
      lr = LogisticRegression(maxIter=10)
      pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

  15. Parallel Hyperparameter Training: Spark CrossValidation Sample Code
      ● ParamGrid is a grid of different parameters to plug into the Pipeline segments from before
      ● CrossValidator is a wrapper around the pipeline it gets passed, and executes each pipeline with the values from the ParamGrid
      ● The evaluator parameter is the function we use to measure the loss of each model
      ● numFolds is how many folds we want to partition the dataset into
      ● cvModel is the model produced by the training; cvModel.bestModel is an alias for the best-performing model (a sketch for inspecting it follows below)

      paramGrid = ParamGridBuilder() \
          .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
          .addGrid(lr.regParam, [0.1, 0.01]) \
          .build()

      crossval = CrossValidator(
          estimator=pipeline,
          estimatorParamMaps=paramGrid,
          evaluator=BinaryClassificationEvaluator(),
          numFolds=2)  # use 3+ folds in practice

      cvModel = crossval.fit(training)
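      To round out the walkthrough, a small sketch of inspecting the result: avgMetrics and bestModel are standard CrossValidatorModel attributes, while the stage indexing assumes the three-stage pipeline above; exactly which fitted parameters are exposed varies with the Spark version.

      # Average evaluator metric for each entry in the parameter grid
      for params, metric in zip(paramGrid, cvModel.avgMetrics):
          print(metric, params)

      # bestModel is the PipelineModel refit on the full dataset with the winning
      # parameters; its stages mirror the pipeline: [tokenizer, hashingTF, lr]
      best = cvModel.bestModel
      print(best.stages[1].getNumFeatures())    # winning HashingTF numFeatures
      print(best.stages[2].extractParamMap())   # LogisticRegression params, as exposed by this Spark version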
