Databases - Introduction to Data Engineering - Vincent - PowerPoint PPT Presentation



  1. Databases - INTRODUCTION TO DATA ENGINEERING - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  2. What are databases?
     - Hold data
     - Organize data
     - Retrieve/search data through a DBMS
     "A usually large collection of data organized especially for rapid search and retrieval."

  3. Databases and file storage
     - Databases: very organized; functionality like search, replication, ...
     - File systems: less organized; simple, less added functionality

  4. Structured and unstructured data
     - Structured: database schema (e.g. a relational database)
     - Semi-structured: JSON, e.g. {"key": "value"}
     - Unstructured: schemaless, more like files (e.g. videos, photos)

  5. SQL and NoSQL
     - SQL: tables; database schema; relational databases
     - NoSQL: non-relational databases; structured or unstructured; key-value stores (e.g. caching); document DBs (e.g. JSON objects)
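
     Illustrative only: plain Python structures standing in for the two NoSQL flavors named above (e.g. Redis-style key-value caching, MongoDB-style documents).

     import json

     # Key-value store (e.g. a cache): opaque values looked up by a key.
     cache = {}
     cache["user:42:profile_page"] = "<html>...</html>"   # hypothetical cached page

     # Document DB: semi-structured JSON documents, queryable by their fields.
     document = json.loads('{"id": 42, "first_name": "Vincent", "orders": [10, 12]}')
     print(document["first_name"])   # fields are addressable, unlike the opaque cache value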

  6. SQL: The database schema

     -- Create Customer table
     CREATE TABLE "Customer" (
         "id" SERIAL NOT NULL,
         "first_name" varchar,
         "last_name" varchar,
         PRIMARY KEY ("id")
     );

     -- Create Order table
     CREATE TABLE "Order" (
         "id" SERIAL NOT NULL,
         "customer_id" integer REFERENCES "Customer",
         "product_name" varchar,
         "product_price" integer,
         PRIMARY KEY ("id")
     );

     -- Join both tables on the foreign key
     SELECT *
     FROM "Customer"
     INNER JOIN "Order" ON "Order"."customer_id" = "Customer"."id";

     id | first_name | ... | product_price
     1  | Vincent    | ... | 10

  7. SQL: Star schema
     The star schema consists of one or more fact tables referencing any number of dimension tables. [1]
     - Facts: things that happened (e.g. product orders)
     - Dimensions: information on the world (e.g. customer information)
     [1] Wikipedia: https://en.wikipedia.org/wiki/Star_schema
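
     A hypothetical pandas sketch of the idea (table and column names are made up, not from the course): a fact table of orders resolving its reference to a customer dimension table.

     import pandas as pd

     # Dimension table: information on the world (customers).
     dim_customer = pd.DataFrame({
         "customer_id": [1, 2],
         "first_name": ["Vincent", "Ann"],
     })

     # Fact table: things that happened (orders), referencing the dimension by key.
     fact_orders = pd.DataFrame({
         "order_id": [10, 11, 12],
         "customer_id": [1, 1, 2],
         "product_price": [10, 25, 7],
     })

     # Resolve the dimension reference, as the star schema intends.
     print(fact_orders.merge(dim_customer, on="customer_id"))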

  8. Let's practice!

  9. What is parallel computing - INTRODUCTION TO DATA ENGINEERING - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  10. Idea behind parallel computing
     The basis of modern data processing tools; the payoff is memory and processing power.
     The idea:
     - Split the task into subtasks
     - Distribute the subtasks over several computers
     - Work together to finish the task

  11. The tailor shop
     Running a tailor shop with a goal of 100 shirts:
     - The best tailor finishes a shirt every 20 minutes
     - The other tailors each take 1 hour per shirt
     - Multiple tailors working together still outproduce the best tailor working alone
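
     A quick back-of-the-envelope check of that claim; the team size of four is an assumption, the other numbers come from the slide.

     shirts = 100

     # One best tailor working alone: 20 minutes per shirt.
     best_alone = shirts * 20          # 2000 minutes

     # Four average tailors in parallel: 25 shirts each at 60 minutes per shirt.
     team_of_four = (shirts / 4) * 60  # 1500 minutes of wall-clock time

     print(best_alone, team_of_four)   # 2000 1500.0 -> the team finishes first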

  12. Benefits of parallel computing
     - Processing power
     - Memory: partition the dataset
     (image: RAM memory chip)
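
     A minimal sketch of that partitioning idea with made-up data (np.array_split also accepts pandas DataFrames):

     import numpy as np
     import pandas as pd

     df = pd.DataFrame({"Age": range(100)})

     # Partition the dataset: each chunk can now live in a different process's memory.
     partitions = np.array_split(df, 4)
     print([len(part) for part in partitions])   # [25, 25, 25, 25]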

  13. Risks of parallel computing
     - Overhead due to communication can cause parallel slowdown
     - The task needs to be large enough to benefit
     - You need several processing units
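
     A hedged illustration of that overhead (a sketch, assuming a multi-core machine; exact timings vary): for a task this tiny, shipping the work to worker processes costs more than the work itself.

     from multiprocessing import Pool
     import time

     def tiny_task(x):
         return x + 1   # far too small to be worth distributing

     if __name__ == "__main__":
         numbers = list(range(100_000))

         start = time.time()
         serial = list(map(tiny_task, numbers))
         print("serial:  ", time.time() - start)

         start = time.time()
         with Pool(4) as p:
             parallel = p.map(tiny_task, numbers)
         print("parallel:", time.time() - start)   # usually slower here: communication overhead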

  14. An example

  15. multiprocessing.Pool

     from multiprocessing import Pool
     import pandas as pd

     # athlete_events is a pandas DataFrame loaded earlier in the course
     def take_mean_age(year_and_group):
         year, group = year_and_group
         return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

     with Pool(4) as p:
         results = p.map(take_mean_age, athlete_events.groupby("Year"))
     result_df = pd.concat(results)

  16. dask

     import dask.dataframe as dd

     # Partition the dataframe into 4
     athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

     # Run parallel computations on each partition
     result_df = athlete_events_dask.groupby('Year').Age.mean().compute()

  17. Let's practice!

  18. Parallel computation frameworks - INTRODUCTION TO DATA ENGINEERING - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  19. (image-only slide)

  20. HDFS (Hadoop Distributed File System)

  21. MapReduce

  22. Hive
     - Runs on Hadoop
     - Structured Query Language: Hive SQL
     - Initially MapReduce, now other tools

  23. Hive: an example

     SELECT year, AVG(age)
     FROM views.athlete_events
     GROUP BY year

  24. Apache Spark
     - Avoids disk writes (processes in memory where possible)
     - Maintained by the Apache Software Foundation

  25. Resilient Distributed Datasets (RDDs)
     - Spark relies on them
     - Similar to a list of tuples
     - Transformations: .map() or .filter()
     - Actions: .count() or .first()
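
     A hedged sketch of the lazy-transformation vs. action distinction, assuming a PySpark shell where the SparkContext sc already exists (the data is made up):

     # rdd behaves like a distributed list of tuples.
     rdd = sc.parallelize([("Vincent", 32), ("Ann", 28), ("Tom", 45)])

     # Transformations are lazy: nothing runs yet.
     over_30 = rdd.filter(lambda pair: pair[1] > 30)
     names = over_30.map(lambda pair: pair[0])

     # Actions trigger the actual computation.
     print(names.count())   # 2
     print(names.first())   # 'Vincent'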

  26. PySpark
     - Python interface to Spark
     - DataFrame abstraction
     - Looks similar to pandas

  27. PySpark: an example

     # The SQL equivalent:
     #   SELECT year, AVG(age) FROM views.athlete_events GROUP BY year

     # Load the dataset into athlete_events_spark first
     (athlete_events_spark
         .groupBy('Year')
         .mean('Age')
         .show())

  28. Let's practice!

  29. Workflow scheduling frameworks - INTRODUCTION TO DATA ENGINEERING - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  30. An example pipeline
     How to schedule it? Manually, with cron, or with a dedicated scheduling tool.
     What about dependencies? For example, a job that enriches data can only run after all ingestion jobs have finished.

  31. DAGs
     A Directed Acyclic Graph:
     - A set of nodes
     - Directed edges
     - No cycles
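
     A minimal sketch of a DAG as a plain Python adjacency list; the task names are hypothetical and foreshadow the Airflow example later in this deck.

     # Each key is a node; its list holds the outgoing (directed) edges.
     dag = {
         "start_cluster": ["ingest_customer_data", "ingest_product_data"],
         "ingest_customer_data": ["enrich_customer_data"],
         "ingest_product_data": ["enrich_customer_data"],
         "enrich_customer_data": [],   # no edge leads back to an earlier node: no cycles
     }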

  32. The tools for the job
     - Linux's cron
     - Spotify's Luigi
     - Apache Airflow

  33. Apache Airflow
     - Created at Airbnb
     - Built around DAGs
     - Written in Python

  34. Airflow: an example DAG

  35. Airflow: an example in code

     from airflow import DAG
     from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path
     # StartClusterOperator and SparkJobOperator are custom operators defined elsewhere

     # Create the DAG object
     dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")

     # Define operations
     start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
     ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
     ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
     enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)

     # Set up the dependency flow
     start_cluster.set_downstream(ingest_customer_data)
     ingest_customer_data.set_downstream(enrich_customer_data)
     ingest_product_data.set_downstream(enrich_customer_data)
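
     Side note: Airflow also supports a bitshift shorthand for the same dependencies, e.g. start_cluster >> ingest_customer_data >> enrich_customer_data and ingest_product_data >> enrich_customer_data; a >> b is equivalent to a.set_downstream(b).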

  36. Let's practice!
