Building Dask Bags & Globbing: Parallel Programming with Dask in Python

Building Dask Bags & Globbing
PARALLEL PROGRAMMING WITH DASK IN PYTHON
Dhavide Aruliah, Director of Training, Anaconda

Sequences to bags
    nested_containers = [[0, 1, 2, 3], {}, [6.5, 3.14], 'Python', {'version': 3}, '']
    import dask.bag as db
    the_bag = db.from_sequence(nested_containers)
    the_bag.count().compute()
    6
    the_bag.any().compute(), the_bag.all().compute()
    (True, False)

Reading text files
    import dask.bag as db
    zen = db.read_text('zen')
    taken = zen.take(1)
    type(taken)
    tuple

Reading text files
    taken
    ('The Zen of Python, by Tim Peters\n',)
    zen.take(3)
    ('The Zen of Python, by Tim Peters\n', '\n', 'Beautiful is better than ugly.\n')

Glob expressions
    import dask.dataframe as dd
    df = dd.read_csv('taxi/*.csv', assume_missing=True)
taxi/*.csv is a glob expression
taxi/*.csv matches:
    taxi/yellow_tripdata_2015-01.csv
    taxi/yellow_tripdata_2015-02.csv
    taxi/yellow_tripdata_2015-03.csv
    ...
    taxi/yellow_tripdata_2015-10.csv
    taxi/yellow_tripdata_2015-11.csv
    taxi/yellow_tripdata_2015-12.csv
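The same glob syntax can be used when building a bag from text files, since `db.read_text` accepts a glob string as well as a single path. The following is a minimal sketch that is not taken from the slides: the temporary directory, the file names, and the file contents are made up for illustration, and `scheduler='synchronous'` is chosen only to keep the example in a single process.

```python
import os
import tempfile

import dask.bag as db

# Hypothetical data: three small text files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for name in ['a01.txt', 'a02.txt', 'b05.txt']:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write('first line of {}\n'.format(name))
        f.write('second line of {}\n'.format(name))

# One glob expression stands in for all matching files.
lines = db.read_text(os.path.join(tmpdir, '*.txt'))

# With the default settings, each matched file becomes one partition.
n_partitions = lines.npartitions

# Count lines across all files; run in the calling process for simplicity.
n_lines = lines.count().compute(scheduler='synchronous')

# A narrower pattern selects fewer files.
a_only = db.read_text(os.path.join(tmpdir, 'a*.txt'))
n_a_lines = a_only.count().compute(scheduler='synchronous')
```

Each element of the resulting bag is one line of text, so the counts above are line counts, not file counts.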

Using Python's glob module
    %ls
    Alice  Dave  README   a02.txt  a04.txt  b05.txt  b07.txt  b09.txt  b11.t
    Bob    Lisa  a01.txt  a03.txt  a05.txt  b06.txt  b08.txt  b10.txt  taxi
    import glob
    txt_files = glob.glob('*.txt')
    txt_files
    ['a01.txt', 'a02.txt', ... 'b10.txt', 'b11.txt']

More glob patterns
    glob.glob('b*.txt')
    ['b05.txt', 'b06.txt', 'b07.txt', 'b08.txt', 'b09.txt', 'b10.txt', 'b11.txt']
    glob.glob('?0[1-6].txt')
    ['a01.txt', 'a02.txt', 'a03.txt', 'a04.txt', 'a05.txt', 'b05.txt', 'b06.txt']
    glob.glob('b?.txt')
    []

More glob patterns
    glob.glob('??[1-6].txt')
    ['a01.txt', 'a02.txt', 'a03.txt', 'a04.txt', 'a05.txt', 'b05.txt', 'b06.txt', 'b11.txt']

Permissible glob patterns
- Filename characters (e.g., file-02_tmp.txt)
- Wildcard character *: matches 0 or more characters
- Wildcard character ?: matches exactly 1 character
- Character ranges (e.g., [0-5], [a-m], [A-Z0-9])
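The pattern rules above can be checked without touching the file system. This sketch swaps in the standard-library `fnmatch` module, which applies the same `*`, `?`, and `[...]` rules to plain strings; the `names` list copies the file names from the earlier directory listing, and the helper `match` is a made-up name for illustration.

```python
import fnmatch

# File names copied from the directory listing on the earlier slide.
names = ['a01.txt', 'a02.txt', 'a03.txt', 'a04.txt', 'a05.txt',
         'b05.txt', 'b06.txt', 'b07.txt', 'b08.txt', 'b09.txt',
         'b10.txt', 'b11.txt', 'README']

def match(pattern):
    """Return the sorted names that match a shell-style pattern."""
    return sorted(fnmatch.filter(names, pattern))

# '*' matches any run of characters, including none.
star = match('b*.txt')

# '?' matches exactly one character, so 'b?.txt' needs a 6-character name.
question = match('b?.txt')

# '[1-6]' matches one character in the range 1 to 6.
ranged = match('?0[1-6].txt')
```

Here `question` is empty because every `b` file has two characters between `b` and `.txt`, which matches the empty result shown on the slide.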

Let's practice!

Functional Approaches using Dask Bags

Functional programming
- Functions: first-class data
- Higher-order functions: functions as input or output to functions
- Functions replacing loops with:
    - map operations
    - filter operations
    - reduction operations (or aggregations)
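To make the "functions replacing loops" idea concrete, here is a small sketch in plain Python that computes the same quantity twice: once with an explicit loop and once with filter, map, and a reduction. The function names and the input list are illustrative choices, not part of the slides at this point.

```python
from functools import reduce

def squared(x):
    return x ** 2

def is_even(x):
    return x % 2 == 0

def add(a, b):
    return a + b

numbers = [1, 2, 3, 4, 5, 6]

# Loop version: sum of the squares of the even numbers.
total_loop = 0
for n in numbers:
    if is_even(n):
        total_loop += squared(n)

# Functional version: filter, then map, then reduce.
# Functions are passed as arguments to higher-order functions.
evens = filter(is_even, numbers)
even_squares = map(squared, evens)
total_functional = reduce(add, even_squares, 0)
```

Both versions give 4 + 16 + 36 = 56. The functional form makes the three stages explicit, which is the structure that Dask bags parallelize.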

Using map
    def squared(x):
        return x ** 2
    squares = map(squared, [1, 2, 3, 4, 5, 6])
    squares
    <map at 0x1037a1b70>
    squares = list(squares)
    squares
    [1, 4, 9, 16, 25, 36]

Using filter
    def is_even(x):
        return x % 2 == 0
    evens = filter(is_even, [1, 2, 3, 4, 5, 6])
    list(evens)
    [2, 4, 6]
    even_squares = filter(is_even, squares)
    list(even_squares)
    [4, 16, 36]

Using dask.bag.map
    import dask.bag as db
    numbers = db.from_sequence([1, 2, 3, 4, 5, 6])
    squares = numbers.map(squared)
    squares
    dask.bag<map-squared, npartitions=6>
    result = squares.compute()  # Must fit in memory
    result
    [1, 4, 9, 16, 25, 36]

Using dask.bag.filter
    numbers = db.from_sequence([1, 2, 3, 4, 5, 6])
    evens = numbers.filter(is_even)
    evens.compute()
    [2, 4, 6]
    even_squares = numbers.map(squared).filter(is_even)
    even_squares.compute()
    [4, 16, 36]

Using .str & string methods
    zen = db.read_text('zen.txt')
    uppercase = zen.str.upper()
    uppercase.take(1)
    ('THE ZEN OF PYTHON, BY TIM PETERS\n',)
    def my_upper(string):
        return string.upper()
    my_uppercase = zen.map(my_upper)
    my_uppercase.take(1)
    ('THE ZEN OF PYTHON, BY TIM PETERS\n',)

A bigger example I
    import pandas as pd
    import dask.bag as db
    def load(k):
        template = 'yellow_tripdata_2015-{:02d}.csv'
        return pd.read_csv(template.format(k))
    def average(df):
        return df['total_amount'].mean()
    def total(df):
        return df['total_amount'].sum()
    data = db.from_sequence(range(1, 13)).map(load)
    data
    dask.bag<map-loa..., npartitions=12>

A bigger example II
    totals = data.map(total)
    totals.compute()
    [1175217.5200009614,
     947282.0900005419,
     956752.3400005258,
     1304602.4800011297,
     1354966.290001166,
     1251511.6500010253,
     1167936.1000008786,
     915174.880000469,
     994643.300000564,
     1273267.4800010026,
     1158279.990000822,
     1166242.130000856]
    averages = data.map(average)
    averages.compute()
    [14.75051171665384,
     15.463557844570461,
     15.790076907851297,
     15.971334410669527,
     16.477159899324676,
     16.250654434978838,
     16.163639508987067,
     16.164026987891997,
     16.364647910506154,
     16.544750841370114,
     16.385807916489675,
     16.28056690958003]

Reductions (aggregations)
    t_sum, t_min, t_max = totals.sum(), totals.min(), totals.max()
    t_mean, t_std = totals.mean(), totals.std()
    stats = [t_sum, t_min, t_max, t_mean, t_std]
    %time [s.compute() for s in stats]
    CPU times: user 142 ms, sys: 101 ms, total: 243 ms
    Wall time: 4.57 s
    [13665876.250009943,
     915174.880000469,
     1354966.290001166,
     1138823.0208341617,
     144025.81874405374]

Reductions (aggregations)
    import dask
    %time dask.compute(t_sum, t_min, t_max, t_mean, t_std)
    CPU times: user 63.7 ms, sys: 29.1 ms, total: 92.7 ms
    Wall time: 852 ms
    (13665876.250009943,
     915174.880000469,
     1354966.290001166,
     1138823.0208341617,
     144025.81874405374)
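The gap between the two wall times comes from how the work is submitted. Calling `.compute()` on each reduction separately repeats the shared upstream work (here, loading the files) once per call, while a single `dask.compute` call lets Dask share the common parts of the task graph. The sketch below shows only the calling pattern, with made-up numbers standing in for the taxi totals; `scheduler='synchronous'` is an assumption used to keep the example in one process.

```python
import dask
import dask.bag as db

# Made-up stand-ins for the monthly totals on the slide.
totals = db.from_sequence([2.0, 4.0, 6.0])

# Building reductions is lazy; nothing is computed yet.
t_sum, t_min, t_max = totals.sum(), totals.min(), totals.max()
t_mean, t_std = totals.mean(), totals.std()

# One call evaluates all five reductions together and returns a tuple.
results = dask.compute(t_sum, t_min, t_max, t_mean, t_std,
                       scheduler='synchronous')

r_sum, r_min, r_max, r_mean, r_std = results
```

Note that `dask.compute` returns a tuple, whereas the list comprehension on the previous slide returns a list.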

Analyzing Congressional Legislation

JSON data files
- JavaScript Object Notation:
    - stored as plain text
    - common web format
    - direct mapping to Python lists & dictionaries

Sample JSON File: items.json
    [
      {
        "name": "item1",
        "content": ["a", "b", "c"]
      },
      {
        "name": "item2",
        "content": {"a": 0, "b": 1}
      }
    ]

Using json module
    import json
    with open('items.json') as f:
        items = json.load(f)
    type(items)
    list
    items[0]
    {'content': ['a', 'b', 'c'], 'name': 'item1'}
    items[1]
    {'content': {'a': 0, 'b': 1}, 'name': 'item2'}
    items[1]['content']['b']
    1

JSON Files into Dask Bags
items-by-line.json:
    {"name": "item1", "content": ["a", "b", "c"]}
    {"name": "item2", "content": {"a": 0, "b": 1}}
Reading it into a bag:
    import dask.bag as db
    items = db.read_text('items-by-line.json')
    items.take(1)  # Note: tuple containing a *string*
    ('{"name": "item1", "content": ["a", "b", "c"]}\n',)

JSON Files into Dask Bags
    dict_items = items.map(json.loads)  # converts strings -> other data
    dict_items.take(2)  # Note: tuple containing dicts
    ({'content': ['a', 'b', 'c'], 'name': 'item1'},
     {'content': {'a': 0, 'b': 1}, 'name': 'item2'})
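The slides do not show how a file with one JSON document per line is produced. One possible way, sketched here with only the standard library, is to write each record with `json.dumps` followed by a newline. The output location is a temporary directory chosen for illustration, and the records are the two sample items from items.json.

```python
import json
import os
import tempfile

# The two sample records from items.json.
items = [
    {"name": "item1", "content": ["a", "b", "c"]},
    {"name": "item2", "content": {"a": 0, "b": 1}},
]

# Write one JSON document per line (hypothetical output location).
path = os.path.join(tempfile.mkdtemp(), 'items-by-line.json')
with open(path, 'w') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Reading the file back line by line and parsing each line with
# json.loads mirrors what items.map(json.loads) does on a bag.
with open(path) as f:
    raw_lines = f.readlines()
round_trip = [json.loads(line) for line in raw_lines]
```

This layout matters because `db.read_text` yields one bag element per line, so each line has to be a complete JSON document on its own.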

Plucking values
    type(dict_items.take(2))
    tuple
    dict_items.take(2)[1]['content']  # Chained indexing
    {'a': 0, 'b': 1}
    dict_items.take(1)[0]['name']  # Chained indexing
    'item1'

Plucking values
    contents = dict_items.pluck('content')
    names = dict_items.pluck('name')
    contents
    dask.bag<pluck-5..., npartitions=1>
    names
    dask.bag<pluck-3..., npartitions=1>
    contents.compute()
    [['a', 'b', 'c'], {'a': 0, 'b': 1}]
    names.compute()
    ['item1', 'item2']

Congressional legislation metadata
- 23 JSON files
- metadata about congressional bills
- up to 1500 pieces of legislation per congress
Load all into Dask Bag
- use current_status to count vetoed bills
- use date info to compute average times

Metadata keys
Selected dictionary keys:
    'bill_type'
    'title_without_number'
    'related_bills'
    'id'
    'titles'
    'display_number'
    'major_actions'
    'current_status_description'
    'link'
    'current_status_date'
    'committee_reports'
    'current_status_label'
    'introduced_date'
    'sponsor'
    'current_status'
    'title'
Warning: Not all available for every bill
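Because not every key is present in every record, plucking a key directly can fail on incomplete records. `Bag.pluck` takes a `default` argument for this case. The sketch below counts vetoed bills on a handful of made-up records: the `id` values and the status strings are placeholders and are not taken from the actual data set, the helper `is_vetoed` is a hypothetical name, and `scheduler='synchronous'` is used only to keep the example in one process.

```python
import dask.bag as db

# Hypothetical bill records; the real metadata has many more keys, and
# these status strings are placeholders, not values from the data set.
bills = db.from_sequence([
    {'id': 1, 'current_status': 'enacted'},
    {'id': 2, 'current_status': 'vetoed'},
    {'id': 3},  # 'current_status' is missing here
    {'id': 4, 'current_status': 'vetoed'},
])

# Supplying a default avoids a KeyError on incomplete records.
statuses = bills.pluck('current_status', default=None)

def is_vetoed(status):
    """True when a status string is present and mentions a veto."""
    return status is not None and 'veto' in status

all_statuses = statuses.compute(scheduler='synchronous')
n_vetoed = statuses.filter(is_vetoed).count().compute(scheduler='synchronous')
```

The same pluck, filter, count chain applies unchanged to a bag built from the real JSON files, provided the predicate is adapted to the status values found there.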
