Petastorm: A Light-Weight Approach to Building ML Pipelines @ Uber
Yevgeni Litvin (yevgeni@uber.com), Uber ATG
Deep learning for self-driving vehicles
(Diagram: autonomous vehicle data is cluster-processed into TFRecords, HDF5 and PNG files, fed by Raw AV data, Maps and Labels APIs)
● Complex upstream APIs
● Huge row sizes (multiple MBs)
● Huge datasets (tens+ TB)
● Learning curve
● Many datasets
Consolidating datasets
● Research engineers (typically) don't do data extraction
● Train directly from the well-known dataset
(Diagram: Raw AV data, Maps and Labels APIs feed a single consolidated dataset)
Uber ATG Mission Introduce self-driving technology to the Uber network in order to make transporting people and goods safer, more efficient, and more affordable around the world.
About myself...
Yevgeni Litvin. I work on the data platform and on onboard integration of models.
Our talk today
● Enabling the "One Dataset" approach
● File formats
● Petastorm as an enabling tool
One dataset
One dataset used by multiple research projects.
● Easy to compare models. Easy to reproduce training.
● Faster research engineer ramp-up.
● ML infra-team management.
● Superset of the data a single project may require.
● No model-specific preprocessing.
● Efficient data access.
● TF/PyTorch/other framework native access.
Apache Parquet
● Efficient column-subset reads.
● Atomic read unit: one column from a row group (a chunk).
● Random access to a row group.
● Natively supported by Apache Spark, Hive and other big-data tools.
● No tensor support.
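To make the column-subset and row-group access pattern concrete, here is a minimal sketch (not from the talk) that uses pyarrow directly; the file path and column names are placeholders:

import pyarrow.parquet as pq

# Opening the file only reads the footer metadata, which describes the row groups.
pf = pq.ParquetFile('/tmp/example.parquet')

# Read just two columns from the first row group -- the atomic read unit.
table = pf.read_row_group(0, columns=['timestamp', 'object_type'])
print(pf.num_row_groups, table.num_rows)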
Petastorm
● Scalable
● Native TensorFlow, PyTorch
● Shuffling
● Sharding
● Queries, indexing
● Parquet partitions
● Local caching
● N-grams (windowing)
Research engineer experience
Before: Data extraction (query upstream systems, ETL at scale) → Train → Evaluate → Deploy
After: Train → Evaluate → Deploy
Two integration alternatives
1. Apache Parquet as a dataframe with tensors: nd-arrays and scalars (e.g. images, lidar point clouds) stored in a Petastorm Apache Parquet store.
2. Train from existing organizational Parquet stores (native types, no tensors): non-Petastorm Apache Parquet store.
Extra schema information

FrameSchema = Unischema('FrameSchema', [
    UnischemaField('timestamp', np.int32, (), ScalarCodec(IntegerType()), nullable=False),
    UnischemaField('front_cam', np.uint8, (1200, 1920, 3), CompressedImageCodec('png'), nullable=False),
    UnischemaField('label_box', np.uint8, (None, 2, 2), NdarrayCodec(), nullable=False),
])

● Stored with the Parquet store
● Defines the tensor serialization format
● Runtime type validation
● Needed for wiring natively into a TensorFlow graph
Generating a dataset

def row_generator(x):
    return {'timestamp': ...,
            'front_cam': np.asarray(...),
            'label_box': np.asarray(...)}

# 1. materialize_dataset configures the Spark row-group size and writes
#    Petastorm metadata at the end
with materialize_dataset(spark, output_url, FrameSchema, rowgroup_size_mb):
    # 2. Encode tensors and convert each dict to a Spark Row
    rows_rdd = sc.parallelize(range(rows_count)) \
        .map(row_generator) \
        .map(lambda x: dict_to_spark_row(FrameSchema, x))

    # 3. Spark schema derived from the Unischema
    spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
        .write \
        .parquet(output_url)
Python

with make_reader('hdfs:///tmp/hello_world_dataset') as reader:
    for sample in reader:
        print(sample.id)
        plt.imshow(sample.image1)
[Out 0] 0

# Reading from a non-Petastorm dataset (only native Apache Parquet types)
with make_batch_reader('hdfs:///tmp/hel...') as reader:
    for sample in reader:
        print(sample.id)
[Out 1] [0, 1, 2, 3, 4, 5]
Tensorflow

# Substitute make_batch_reader to read a non-Petastorm dataset
with make_reader('hdfs:///tmp/dataset') as reader:
    # Connect the Petastorm Reader object into the TF graph as tf tensors
    data = tf_tensors(reader)
    predictions = my_model(data.image1, data.image2)
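Petastorm also exposes a tf.data integration; a minimal sketch, assuming a TF 1.x graph-mode session and a placeholder dataset URL:

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader('hdfs:///tmp/dataset', num_epochs=1) as reader:
    # Wrap the Reader in a tf.data.Dataset; each element is a named tuple of tensors.
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()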
PyTorch

from petastorm.pytorch import DataLoader

with DataLoader(make_reader(dataset_url)) as train_loader:
    sample = next(iter(train_loader))
    print(sample['id'])
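A slightly fuller sketch of a batched training loop; the batch_size, the shuffling queue size and the 'image1' field name are illustrative, not from the slide:

from petastorm import make_reader
from petastorm.pytorch import DataLoader

with DataLoader(make_reader(dataset_url, num_epochs=1),
                batch_size=64, shuffling_queue_capacity=4096) as train_loader:
    for batch in train_loader:
        # Each batch is a dict mapping field name to a torch.Tensor
        images = batch['image1']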
Real example

with make_reader('hdfs:///path/to/uber/atg/dataset',
                 schema_fields=[AVSchema.lidar_xyz]) as reader:
    sample = next(reader)
    plt.plot(sample.returns_xyz[:, 0], sample.returns_xyz[:, 1], '.')
Reader architecture
● Uses Apache Arrow
● Reading workers (threads or processes)
● Row groups are filtered and shuffled
● Output rows as np.array, tf.tensor or tf.data.Dataset
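The worker pool is configurable when opening the reader; a minimal sketch, assuming make_reader's reader_pool_type / workers_count arguments and a placeholder URL:

from petastorm import make_reader

# Decode row groups on 16 worker processes instead of the default thread pool.
with make_reader('hdfs:///tmp/dataset',
                 reader_pool_type='process',
                 workers_count=16,
                 shuffle_row_groups=True) as reader:
    sample = next(reader)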
Petastorm row predicate
● User-defined row filter
● Optimizations

in_lambda(['object_type'],
          lambda object_type: object_type == 'car')

(Illustration: rows with object_type 'pedestrian' or 'bicycle' are filtered out; only 'car' rows remain.)
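A sketch of wiring the predicate into a reader (the dataset URL is a placeholder); only the 'object_type' column is needed to evaluate the filter:

from petastorm import make_reader
from petastorm.predicates import in_lambda

cars_only = in_lambda(['object_type'],
                      lambda object_type: object_type == 'car')

with make_reader('hdfs:///tmp/dataset', predicate=cars_only) as reader:
    for sample in reader:
        print(sample.object_type)   # only rows where object_type == 'car'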
Transform
● User-defined row update
● Runs on the reader's thread/process pool

def modify_row(row):
    row['list_of_lists_as_tensor'] = foo_to_tensor(row['list_of_lists'])
    del row['list_of_lists']
    return row

(Illustration: a column holding lists of lists, e.g. [[1, 2, 3], [4], [5, 6]], is converted into a fixed-shape tensor column.)
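A sketch of plugging the row transform into a reader via TransformSpec (assuming foo_to_tensor is defined elsewhere; the dataset URL is a placeholder):

from petastorm import make_reader
from petastorm.transform_spec import TransformSpec

# modify_row runs on the reader's worker pool before rows are returned.
with make_reader('hdfs:///tmp/dataset',
                 transform_spec=TransformSpec(modify_row)) as reader:
    sample = next(reader)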
Local cache
● Slow/expensive links
● In-memory cache

make_reader(..., cache_type='local-disk')
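A sketch of a fuller local-disk cache configuration; the cache location and size estimates are placeholders, and the exact set of required arguments may vary across Petastorm versions:

from petastorm import make_reader

with make_reader('hdfs:///remote/dataset',
                 cache_type='local-disk',
                 cache_location='/tmp/petastorm_cache',
                 cache_size_limit=10 * 2**30,          # ~10 GB of local disk
                 cache_row_size_estimate=2 * 2**20     # ~2 MB per row
                 ) as reader:
    sample = next(reader)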
Sharding
● Distributed training
● Quick experimentation

make_reader(..., cur_shard=3, shard_count=10)
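In distributed training each worker typically derives cur_shard from its rank; a minimal sketch, assuming RANK / WORLD_SIZE environment variables (illustrative, not part of Petastorm):

import os
from petastorm import make_reader

rank = int(os.environ.get('RANK', '0'))
world_size = int(os.environ.get('WORLD_SIZE', '1'))

# Each worker reads a disjoint subset of the row groups.
with make_reader('hdfs:///tmp/dataset',
                 cur_shard=rank, shard_count=world_size) as reader:
    sample = next(reader)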
NGrams (windowing)
● Sorted datasets
● Efficient IO/decoding
● Cons: RAM-wasteful shuffling

(Illustration: sliding windows over consecutive timestamps, e.g. [t=0, t=1, t=2], [t=1, t=2, t=3], [t=2, t=3, t=4].)
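A sketch of requesting windows of consecutive rows with petastorm.ngram.NGram, reusing field names from the FrameSchema example above; the delta_threshold value is illustrative:

from petastorm import make_reader
from petastorm.ngram import NGram

# A window of three consecutive frames, keyed by the relative offsets 0, 1, 2.
fields = {0: [FrameSchema.timestamp, FrameSchema.front_cam],
          1: [FrameSchema.timestamp, FrameSchema.front_cam],
          2: [FrameSchema.timestamp, FrameSchema.front_cam]}

ngram = NGram(fields=fields, delta_threshold=10,
              timestamp_field=FrameSchema.timestamp)

with make_reader('hdfs:///tmp/dataset', schema_fields=ngram) as reader:
    window = next(reader)   # dict: offset -> row, e.g. window[0], window[1], window[2]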
Conclusion
● Petastorm was developed to support the "One Dataset" workflow.
● Uses Apache Parquet as the store format:
  - Tensor support
  - Provides the set of tools needed for deep-learning training/evaluation
● Can also read the organization's data warehouse (non-Petastorm, native Parquet types).
(There is still lots of work left to be done... we are hiring!)
GitHub: https://github.com/uber/petastorm
Thank you! yevgeni@uber.com