Ensemble Learning with Sagemaker and Step Functions

Ensemble Learning with Sagemaker and Step-Functions. Dr. Benjamin Weigel | 09.09.2019 Hamburg, Germany. Benjamin Weigel, Data Engineer & Cloud Coordinator, Europace AG (https://www.europace.de/). There is manual effort in obtaining a mortgage.


  1. Ensemble Learning with Sagemaker and Step-Functions Dr. Benjamin Weigel | 09.09.2019 Hamburg, Germany

  2. Benjamin Weigel Data Engineer & Cloud Coordinator Europace AG https://www.europace.de/

  3. There is manual effort in obtaining a mortgage

  4. Smart Document Classification

  5. Smart Document Classification
     - Text Model: trained on OCR-extracted text
     - Image Model: trained on page-bitmap
     - use output as input
     - Sequence Model: trained on sequence information (i.e. “Page 1-4 is a contract”)

  6. Options to Build a Model Training-Pipeline

  7. AWS Sagemaker

  8. AWS Step-Functions
     - define a distributed workflow as a series of steps
     - visual workflow
     - long-running workflows (max 1 yr)
     - 4000 transitions/month for free
     - after that: 0.025 USD per 1000 transitions
     - can get expensive quickly
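
The pricing figures on this slide can be turned into a quick estimate. The following is a minimal sketch that assumes only the numbers quoted above (4000 free transitions per month, then 0.025 USD per 1000 transitions); the workload in the example is hypothetical.

```python
def monthly_transition_cost(transitions, free_tier=4000, usd_per_1000=0.025):
    """Estimate the monthly cost in USD for a number of state transitions."""
    # Only transitions beyond the free tier are billed.
    billable = max(0, transitions - free_tier)
    return billable * usd_per_1000 / 1000


# Hypothetical workload: 30 executions per day, 40 transitions each, 30 days.
transitions = 30 * 40 * 30
print(f"{transitions} transitions -> {monthly_transition_cost(transitions):.2f} USD")
```

Retries count as transitions too, so a workflow with aggressive retry settings may consume the free tier faster than the number of states suggests.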

  9. AWS Step-Functions

  10. Amazon States Language
      - define a state-machine
      - JSON-based
      - describe:
        - a state and the transition to the next
        - error-conditions etc.

          {
            "Comment": "An example of the Amazon States Language.",
            "StartAt": "FirstState",
            "States": {
              "FirstState": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:...",
                "Next": "ChoiceState"
              },
              ...
            }
          }

          "ChoiceState": {
            "Type": "Choice",
            "Choices": [
              { "Variable": "$.foo", "NumericEquals": 1, "Next": "FirstMatchState" },
              { "Variable": "$.foo", "NumericEquals": 2, "Next": "SecondMatchState" }
            ],
            "Default": "DefaultState"
          }
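
To illustrate how the Choice state on this slide picks the next state, here is a small Python sketch. It is not the Step Functions implementation: it only handles `NumericEquals` and simple dotted paths such as `$.foo`, and the helper names `get_path` and `next_state` are made up for this example.

```python
def get_path(data, path):
    """Resolve a simple path of the form "$.a.b" against a dict."""
    value = data
    for key in path.lstrip("$").strip(".").split("."):
        if key:
            value = value[key]
    return value


def next_state(choice_state, state_input):
    """Return the name of the state this Choice state would transition to."""
    for rule in choice_state["Choices"]:
        value = get_path(state_input, rule["Variable"])
        # Booleans are excluded because Python treats True == 1 as equal.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if "NumericEquals" in rule and is_number and value == rule["NumericEquals"]:
            return rule["Next"]
    # No rule matched: fall back to the Default transition.
    return choice_state["Default"]


choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.foo", "NumericEquals": 1, "Next": "FirstMatchState"},
        {"Variable": "$.foo", "NumericEquals": 2, "Next": "SecondMatchState"},
    ],
    "Default": "DefaultState",
}

for foo in (1, 2, 3):
    print(foo, "->", next_state(choice_state, {"foo": foo}))
```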

  11. Step Functions & Sagemaker
      - Step Functions can control Sagemaker Transform and Training Jobs directly via these Resources:
        "arn:aws:states:::sagemaker:createTransformJob.sync"
        "arn:aws:states:::sagemaker:createTrainingJob.sync"
      - easy peasy
      https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

  12. sagemaker:createTrainingJob.sync
      - configure job via Parameters section

          "Image Model Training": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
              "TrainingJobName": "ImageModel",
              "AlgorithmSpecification": {
                "TrainingImage": "520713654638.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-mxnet:1.3-gpu-py3",
                "TrainingInputMode": "File"
              },
              "HyperParameters": {
                "epochs": "80",
                "batch_size": "10",
                "conv_block_length": "2",
                "cycle_length": "10",
                "depth": "5",
                "dropout": "0.5",
                "max_lr": "0.1",
                "min_lr": "0.0001",
                ...
                "start_filter": "4",
                "worker": "4"
              },
              "InputDataConfig.$": "$.generated.image_model.InputDataConfig",
              "OutputDataConfig": {
                "S3OutputPath.$": "$.generated.output_artifact_paths.image_model_prefix"
              },
              "ResourceConfig": {
                "InstanceCount": 4,
                "InstanceType": "ml.p2.xlarge",
                "VolumeSizeInGB": 10
              },
              "RoleArn": "arn:aws:iam::123456789012:role/sm-stepfunction-iam-role",
              "StoppingCondition": { "MaxRuntimeInSeconds": 172800 }
            }
          }

      https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax
      https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

  13. The Good Photo by Joshua Ness on Unsplash

  14. Start simple

          {
            "StartAt": "Train Text Model",
            "States": {
              "Train Text Model": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": { ... },
                "End": true
              }
            }
          }

  15. Expand from there

          {
            "StartAt": "Fetch Preprocessed Data",
            "States": {
              "Fetch Preprocessed Data": {
                "Type": "Task",
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Next": "Train Text Model",
                "Parameters": {
                  "JobName": "FetchPreparedData",
                  "JobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/job:2",
                  "JobQueue": "arn:aws:batch:us-east-1:1234567890:job-queue/queue",
                  "Parameters": {
                    "DATA_INPUT_PATH.$": "$.input_data",
                    "OUTPUT_PATH.$": "$.ready_to_use_artifacts"
                  }
                }
              },
              "Train Text Model": { ... }
            }
          }

  16. Retry if possible

          "Train Text Model": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            ...
            "Retry": [
              {
                "ErrorEquals": [ "SageMaker.AmazonSageMakerException" ],
                "IntervalSeconds": 1,
                "MaxAttempts": 100,
                "BackoffRate": 1.1
              },
              ...
            ]
          }
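
To get a feeling for what these Retry settings mean in wall-clock time, the sketch below computes the waiting time between attempts. It assumes the delay before retry n is `IntervalSeconds * BackoffRate ** (n - 1)`, which is how the Amazon States Language describes the backoff; the function name `retry_delays` is made up for this example.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the delay in seconds before each retry attempt."""
    # Retry 1 waits interval_seconds; every later retry multiplies by backoff_rate.
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Values taken from the Retry block on the slide.
delays = retry_delays(interval_seconds=1, max_attempts=100, backoff_rate=1.1)
total_hours = sum(delays) / 3600
print(f"first delay: {delays[0]:.1f} s")
print(f"last delay: {delays[-1] / 3600:.1f} h")
print(f"total waiting time if all retries are used: {total_hours:.1f} h")
```

Under that assumption, exhausting all 100 attempts adds up to roughly 38 hours of waiting on top of the job runtime itself, so even a small `BackoffRate` deserves a second look when `MaxAttempts` is large.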

  17. If all else fails ...

          "Train Text Model": {
            "Type": "Task",
            ...
            "Catch": [
              { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" }
            ]
          },
          "Notify Failure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "End": true,
            "Parameters": {
              "Subject": "[ERROR] Model Training failed!",
              "Message": "Error during model training!",
              "TopicArn": "arn:aws:sns:*:123456789012:alerting_topic",
              "MessageAttributes": { ... }
            }
          }

  18. But there is a catch ...it’s a valid state after all

  19. Fail successfully!

          "Notify Failure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Next": "Fail",
            ...
          },
          "Fail": { "Type": "Fail" }

  20. Add a few more models ...
      - Text Model: trained on OCR-extracted text
      - Image Model: trained on page-bitmap
      - use output as input
      - Sequence Model: trained on sequence information (i.e. “Page 1-4 is a contract”)

  21. Use concurrency for time efficiency

          "Fetch Preprocessed Data": { ... "Next": "Base Model Training" },
          "Base Model Training": {
            "Type": "Parallel",
            "Next": "Train Sequence Model",
            "Branches": [
              {
                "StartAt": "Train Image Model",
                "States": { "Train Image Model": { ... "End": true } }
              },
              {
                "StartAt": "Train Text Model",
                "States": { "Train Text Model": { ... "End": true } }
              }
            ]
          },
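
When the number of base models grows, writing each branch by hand gets repetitive. The following sketch shows one way to generate the Parallel state from Python; the helpers `training_task`, `single_state_branch` and `parallel_state` are hypothetical, and the training task is a placeholder that leaves out the "Parameters" section a real job needs.

```python
import json

SAGEMAKER_TRAINING = "arn:aws:states:::sagemaker:createTrainingJob.sync"


def training_task():
    """Hypothetical placeholder; a real task also needs a "Parameters" section."""
    return {"Type": "Task", "Resource": SAGEMAKER_TRAINING}


def single_state_branch(name, state):
    """Wrap one state into a branch that starts and ends with that state."""
    return {"StartAt": name, "States": {name: {**state, "End": True}}}


def parallel_state(branches, next_state):
    """Build a Parallel state that runs all branches, then moves to next_state."""
    return {"Type": "Parallel", "Next": next_state, "Branches": branches}


base_model_training = parallel_state(
    [
        single_state_branch("Train Image Model", training_task()),
        single_state_branch("Train Text Model", training_task()),
    ],
    next_state="Train Sequence Model",
)
print(json.dumps(base_model_training, indent=2))
```

A Parallel state passes an array with one output per branch to its next state, so the sequence model step has to pick the results of the image and text model out of that array.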

  22. Beware of “silent” errors: the notification trigger won’t fire because there is no state defined for this scenario -> unexpected failure

  23. Everything should fail the same

          "Base Model Training": {
            "Type": "Parallel",
            "Next": "Train Sequence Model",
            "Branches": [...],
            "Catch": [
              { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" }
            ]
          }
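
A forgotten Catch is easy to miss in a long definition. As a rough sketch, a small check like the one below could flag top-level Task and Parallel states that have no catch-all handler. The function name `states_without_catch_all` is made up, the example states are trimmed-down stand-ins, and the check does not look inside the branches of a Parallel state.

```python
def states_without_catch_all(states):
    """Return names of Task/Parallel states that lack a Catch for States.ALL."""
    missing = []
    for name, state in states.items():
        if state.get("Type") not in ("Task", "Parallel"):
            continue
        catches_all = any(
            "States.ALL" in catcher.get("ErrorEquals", [])
            for catcher in state.get("Catch", [])
        )
        if not catches_all:
            missing.append(name)
    return missing


catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "Notify Failure"}]

# Trimmed-down example: the Parallel state has no Catch yet.
states = {
    "Fetch Preprocessed Data": {"Type": "Task", "Catch": catch_all},
    "Base Model Training": {"Type": "Parallel", "Branches": []},
    "Train Sequence Model": {"Type": "Task", "Catch": catch_all},
    "Fail": {"Type": "Fail"},
}
print(states_without_catch_all(states))
```

Run against the full definition, such a check would also report the notification task itself if it has no Catch, which is worth a deliberate decision rather than an accident.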

  24. Some jobs are long-running and expensive. Then something fails and you have to debug (rerun) …

  25. Save time & money: skip some steps...

          "States": {
            "Skip Image Model Training?": {
              "Type": "Choice",
              "Choices": [
                {
                  "Variable": "$.train_image_model",
                  "BooleanEquals": false,
                  "Next": "Skip Fetch Preprocessing Artifacts"
                }
              ],
              "Default": "Train Image Model"
            },
            "Skip Fetch Preprocessing Artifacts": { "Type": "Pass", "End": true },
            "Train Image Model": { ... "End": true }
          }

  26. Rinse and repeat ...and add a little sprinkle on top

  27. Our Model Training Workflow
      - Lambda
      - Batch Job
      - Sagemaker
      - SNS
      - Choice
      - Pass (to skip steps)
      - Fail
      - Wait

  28. Our Model Training Workflow
      - Input to set up the state machine execution
        - define where the data is
        - (Hyper)Parameterization
      - Data & Models stored on S3 (each execution gets its own copy of the data)
      Diagram: “Setup” Data (S3) + Input -> Step Function -> Models & Data (S3)
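
The execution input described on this slide could look roughly like the output of the sketch below. The keys `input_data`, `ready_to_use_artifacts` and `train_image_model` are the ones referenced by the state definitions on the earlier slides; the function name, the bucket name and the S3 path layout are assumptions made for this example.

```python
import json


def build_execution_input(bucket, run_id, train_image_model=True):
    """Build the JSON input for one state machine execution.

    The S3 layout (one prefix per run) is hypothetical; it mirrors the idea
    that each execution gets its own copy of the data.
    """
    prefix = f"s3://{bucket}/{run_id}"
    return {
        "input_data": f"{prefix}/input",
        "ready_to_use_artifacts": f"{prefix}/artifacts",
        # Set to False to skip the image model branch when re-running.
        "train_image_model": train_image_model,
    }


execution_input = build_execution_input(
    bucket="example-training-bucket",  # hypothetical bucket name
    run_id="2019-09-09-run-1",  # hypothetical run identifier
    train_image_model=False,
)
print(json.dumps(execution_input, indent=2))
```

The resulting JSON string is what would be handed to the state machine when an execution is started, for example through the console or an SDK call.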

  29. The Bad Photo by Markus Spiske on Unsplash
