Harnessing the power of Spark for Enterprise data engineering and analytics
Vickye Jain, Associate Principal, ZS Associates
June 4, 2019
ZS is a professional services firm that works side by side with companies to develop and deliver products that drive customer value and company results.
6,000+ ZSers who are passionately committed to helping companies and their customers thrive.
24 OFFICES WORLDWIDE: BANGALORE + BARCELONA + BOSTON + BUENOS AIRES + CHICAGO + EVANSTON + FRANKFURT + LONDON + LOS ANGELES + MILAN + NEW DELHI + NEW YORK + PARIS + PHILADELPHIA + PRINCETON + PUNE + SAN DIEGO + SAN FRANCISCO + SÃO PAULO + SHANGHAI + SINGAPORE + TOKYO + TORONTO + ZÜRICH
Typical enterprise data engineering & analytics problems and solutions we deal with
• Variety of data, no easy access → Enterprise Data Lakes
• Scalable reporting, packaged Analytical Apps → Cloud DW/BI Solutions
• Self-serve analytics → Web UI + NOSQL DBs
• Specialized advanced analytics → Data-science workbenches
Example use case highlights
Use Case Highlights
• <24 hours SLA for data to reports
• 50+ data sources (S3, FTP, internal DBs, SFDC)
• 100+ analytics-ready data packs
• 500+ business rules / KPIs
• 2,000+ users (field + HQ)
• 3,500+ GB data added weekly (500 GB inputs)
Business Challenges
• Frequently changing business rules
• Evolving internal and external input data
• Competing priorities within the user group
• Complex data quality challenges
• Business- and data-focused internal staff
Solution Architecture
• Compute: EMR / Databricks Spark clusters
• Data science: Notebooks
• DevOps pipeline: version control, continuous integration, truffleHog code scans, vulnerability scans
• Low-latency query: Redshift
• Reports / analytics services: API Gateway + AWS Lambda
• Storage: S3
• Orchestration: Airflow (a minimal DAG sketch follows)
• Serverless query: Athena
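To make the orchestration piece of this architecture concrete, here is a minimal sketch, assuming an Airflow 1.x-style DAG; the bucket paths, job scripts, and task names are illustrative and not taken from the deck. It simply chains a Spark processing job and a downstream data-quality job, both submitted to the cluster from S3.

    # Hypothetical Airflow DAG: refresh a data pack, then run its DQM checks.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG("datapack_refresh",
             start_date=datetime(2019, 6, 1),
             schedule_interval="@daily",
             catchup=False) as dag:

        # Spark job that builds the analytics-ready data pack (illustrative path)
        build_datapack = BashOperator(
            task_id="build_datapack",
            bash_command="spark-submit --deploy-mode cluster s3://example-bucket/jobs/build_datapack.py",
        )

        # Data-quality checks run only after the pack is built
        dqm_checks = BashOperator(
            task_id="dqm_checks",
            bash_command="spark-submit --deploy-mode cluster s3://example-bucket/jobs/dqm_checks.py",
        )

        build_datapack >> dqm_checks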
Summary of challenges
• Optimal infrastructure costs take some doing: elastic infra costs can initially be surprising, especially during development
• Shortfall of techno-functional experts: scripting, CI/CD, secure SDLC, memory-optimized data models, etc. need education
• Many enterprise ETL gatekeepers have not evolved: technical sophistication gets compromised when faced with tight timelines, needing continuous improvement
• Diversity of ETL jobs creates a need for tuning: different tuning approaches fit different job types
SQL or Scripting?
• Split the application into core technical components and business logic
• SQL is excellent for business logic and is second nature for domain experts
• Spark SQL is highly optimized and will run faster in many cases
• Encapsulate SQL in PySpark shells to retain maximum flexibility (see the sketch below)
• PySpark is excellent for technical components, and is easy to read and maintain
• The beauty of Spark is that both use the same execution engine and design patterns
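As a minimal sketch of the "SQL encapsulated in PySpark" idea above: the surrounding Python owns the technical plumbing (sessions, views, dependencies) while domain experts own the SQL string. The table and view names here are illustrative, not from the deck.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("business-rules").getOrCreate()

    def apply_business_rule(spark, rule_sql, view_name):
        # Run a business-rule SQL and register the result as a temp view
        # so the next rule (or a technical PySpark step) can build on it.
        df = spark.sql(rule_sql)
        df.createOrReplaceTempView(view_name)
        return df

    # Domain experts maintain this SQL; engineers maintain the wrapper above.
    cohort = apply_business_rule(
        spark,
        "SELECT DISTINCT p_id FROM rx_claims UNION SELECT DISTINCT p_id FROM px_claims",
        "patient_cohort",
    )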
Spark Modularized View (SMV) Data Application Framework
Enforced modularization: App → Stages → Modules → SmvDataSets

Without SMV:

    CREATE TABLE cohort AS
    SELECT DISTINCT p_id FROM (
        SELECT DISTINCT p_id FROM Rx
        UNION ALL
        SELECT DISTINCT p_id FROM Px)

With SMV:

    class PatientCohort(SmvModule):
        def requiresDS(self):
            return [Rx, Px]

        def run(self, i):
            # Select distinct patient ids for RX claims
            d_rx = i[Rx].select('p_id').dropDuplicates()
            # Select distinct patient ids for PX claims
            d_px = i[Px].select('p_id').dropDuplicates()
            # Combine RX & PX and drop duplicates
            cohort = d_rx.smvUnion(d_px).dropDuplicates()
            return cohort

Key Benefits
• Enforced modularization
• Nifty ETL functions, e.g. df.smvUnpivot("Col1", "Col2", "Col3") and df.smvGroupBy("ID").smvFillNullWithPrevValue($"claimid".asc)("Indication")
• Easily debug any step
• Code wrapped with data
• smv-run --run-app runs the entire application; smv-run -s stagename runs one stage only; smv-run -m stagename.module runs one module only

https://github.com/TresAmigosSD/SMV
Extreme performance tips
• Segregating storage and compute is a must for maximum elasticity
• Shuffles write to disk; optimize data models to minimize joins and aggregations
• Broadcast join is your best bet and the first thing to try for joins (see the sketch below)
• The cost-based optimizer is awesome; don't forget to analyze tables
• Keep UDFs in Scala/Java; PySpark UDFs are relatively slower
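A hedged sketch of two of these tips: trying a broadcast join first, and feeding the cost-based optimizer with table statistics. The table and column names are illustrative, not from the deck.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("perf-tips").getOrCreate()

    fact = spark.table("rx_claims")      # large fact table (illustrative)
    dim = spark.table("product_dim")     # small dimension table (illustrative)

    # Broadcast the small side explicitly so Spark avoids shuffling the fact table.
    joined = fact.join(broadcast(dim), "product_id")

    # Collect statistics so the cost-based optimizer (Spark 2.2+) can pick better plans.
    spark.sql("ANALYZE TABLE rx_claims COMPUTE STATISTICS FOR COLUMNS product_id")
    spark.sql("ANALYZE TABLE product_dim COMPUTE STATISTICS")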
Extreme performance tips: decouple storage and compute
• Process and DQM in a single cluster: Process 1 → DQM 1 → Process 2 → DQM 2 → Process 3 → DQM 3 → Process 4 → DQM 4, all running sequentially
• Process and DQM in separate clusters: Process 1 → Process 2 → Process 3 → Process 4 runs on one cluster while DQM 1 → DQM 2 → DQM 3 → DQM 4 runs on another, overlapping the two streams; possible only with decoupled storage and compute (a sketch follows)
*DQM – Data Quality Module
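A minimal sketch of how decoupled storage makes the second pattern possible: the processing cluster lands its output on S3, and a separate DQM cluster picks it up independently. The paths, table, and check are assumptions for illustration only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # On the processing cluster: write the finished pack to shared storage.
    processed = spark.table("stage.rx_claims_clean")   # illustrative source
    processed.write.mode("overwrite").parquet("s3://example-bucket/packs/rx_claims/")

    # On a separate DQM cluster (can run while the next pack is being processed):
    pack = spark.read.parquet("s3://example-bucket/packs/rx_claims/")
    null_ids = pack.filter("p_id IS NULL").count()
    assert null_ids == 0, "DQM check failed: null patient ids found"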
Extreme performance tips
• Think of task-level parallelism when packaging Spark jobs: run Check 1, Check 2, Check 3, … Check n concurrently rather than one after another (see the sketch below)
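One assumed way to get this task-level parallelism (not prescribed by the deck) is to submit independent checks from multiple threads inside a single Spark application, so the scheduler can fill the cluster with tasks from several checks at once. The checks and paths below are illustrative.

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dqm-parallel").getOrCreate()
    pack = spark.read.parquet("s3://example-bucket/packs/rx_claims/")  # illustrative path

    checks = {
        "null_patient_ids": lambda df: df.filter("p_id IS NULL").count() == 0,
        "positive_quantities": lambda df: df.filter("quantity <= 0").count() == 0,
        "known_channels": lambda df: df.filter("channel NOT IN ('RETAIL', 'MAIL')").count() == 0,
    }

    # Spark actions are safe to trigger from multiple threads on a shared session;
    # each check becomes its own job and their tasks interleave on the executors.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = {name: pool.submit(fn, pack) for name, fn in checks.items()}

    for name, future in results.items():
        print(name, "passed" if future.result() else "FAILED")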
Asking your Spark experts to codify tuning steps will also help functional experts learn to self-serve

Spark job tuning: how many stages does the job have?
• More than 4: check whether the job has succeeded end to end at least once and whether the later stages run longer, driving up total run time. If so, break the job into intermediate steps of no more than 4 stages each: split it into multiple steps and execute each one individually, writing intermediate data to disk to isolate the problem.
• 4 or fewer: work through the checks below depending on what the job involves.

Does the job involve a join?
• If the join involves one large table and one or more relatively smaller tables (~100 MM rows of 5 columns is small for ZS workloads), check whether the SQL plan in the Spark UI, or the execution plan shown on the shell, has ALL small tables being broadcast. If not, a sort-merge join will be used; add explicit broadcast hints for all small tables, and be sure to use aliases in the hint if aliases are defined in the SQL.
• If the metrics for the join stage show disk spill-over or straggler tasks, repartition the fact data right after it is read to increase the number of data partitions available for the join step. Increasing spark.sql.shuffle.partitions can also help the join step run faster with more partitions.
• If it is a 1:n join between fact and dimension that will cause fact rows to multiply, check for skewed keys that lead to disproportionate multiplication of data, causing some tasks to spill over while others run well. Filter such keys out into a separate dataset and optimize both joins separately, broadcasting with very fine partitions for the dataset with skewed keys (see the sketch after this guide).
• If the problem persists, shuffling in prior stages has most likely led to sub-optimal data distribution: increase spark.sql.shuffle.partitions by 3-5x and check whether the problem is eliminated. Note that this can result in smaller files in the output, and a step to coalesce data into fewer partitions at the end will benefit any direct consumers.
• Merge stage: if the summary task metrics show a near-even run time for tasks across all quartiles, increase the number of cores available to the job, either by increasing the number of executors or the cores per executor. If the peak memory used by tasks is low, changing the number of cores per executor will be most helpful.

Does the job involve aggregation?
• Check the summary task metrics for disk spill-over or straggler tasks. If a few straggler tasks exist, check for skewed keys or uneven input file splits.
  – Skewed keys: add additional keys and create an intermediate aggregate followed by a final aggregate.
  – Uneven file splits: repartition the input data to create more even file splits.
• Sort stage: check whether the longest-running stages are tied to the merge step or one of the sort steps. For the merge step, add more cores to the job, either by providing more executors or more cores per executor (provided no spill-over happens). Partitioning or bucketing the source data can significantly boost performance; this is best done when more than one job will benefit from the sorting and bucketing.

Does the job involve Window functions?
• If all tasks show spill-over, increase executor memory or reduce cores per executor. Each Window function behaves like a separate job, so essentially you are looking at many jobs clubbed together; the best way to tune this type of job is to combine steps needing the same window partitions into one step and break the others out into different steps.

Does the job slow down at the final write stage when data is being written to S3?
• Check that you are using the latest S3 committer configuration from the CC team, have speculation turned off, and, if need be, switch to Gzip compression for faster writes.

To be continued
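A hedged illustration of the skew-handling branch above: split out the skewed join keys, broadcast-join them separately against the matching dimension rows, and union the two results. The key values and table names are made up for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast, col

    spark = SparkSession.builder.appName("skew-split").getOrCreate()
    fact = spark.table("rx_claims")   # illustrative fact table
    dim = spark.table("hcp_dim")      # illustrative dimension table

    skewed_keys = ["UNKNOWN", "WALK_IN"]   # keys found to dominate task sizes

    # Join the well-behaved keys with a normal shuffle join.
    even = fact.filter(~col("hcp_id").isin(skewed_keys)).join(dim, "hcp_id")

    # Join the skewed keys separately, broadcasting the small matching dimension slice.
    skewed = (fact.filter(col("hcp_id").isin(skewed_keys))
                  .join(broadcast(dim.filter(col("hcp_id").isin(skewed_keys))), "hcp_id"))

    result = even.unionByName(skewed)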