PDI Sizing Overview and Case Study
Steve Szabo, Pentaho Lead Solution Engineer, Hitachi Vantara
Introduction and Agenda
• Introduction
  – What is PDI Sizing, and why do we need to be concerned about it?
• Agenda
  – Brief Anatomy of PDI
  – Example Sizing Problem
  – Review Test Cases
  – Example Sizing Solution
  – Review Major Constraints, Bottlenecks, and Best Practices
• Next Steps
  – Recommendations
  – Resources
Example Sizing Problem
• Retail business has daily data that must be readied for next-day analysis
  – Data volume fluctuates daily
  – 8-hour delivery window
  – 10 TB per day peak, 5 TB per day average
• Peak = 10 TB
• 10 TB per day ≈ 400 GB per hour (averaged over 24 hours)
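The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not part of the original test tooling: the function name `required_gb_per_hour` is hypothetical, and it assumes decimal units (1 TB = 1000 GB), which gives a result close to the rounded 400 GB per hour above.

```python
def required_gb_per_hour(daily_tb: float, window_hours: float, gb_per_tb: int = 1000) -> float:
    """Average throughput (GB/hour) needed to process daily_tb within window_hours.

    Assumption: decimal units, 1 TB = 1000 GB.
    """
    return daily_tb * gb_per_tb / window_hours


# Peak day spread over a full 24 hours
peak_over_day = required_gb_per_hour(10, 24)
# Peak day squeezed into the 8-hour delivery window
peak_in_window = required_gb_per_hour(10, 8)
# Average day spread over 24 hours
average_over_day = required_gb_per_hour(5, 24)

print(f"Peak, 24 h: {peak_over_day:.0f} GB/h")
print(f"Peak, 8 h window: {peak_in_window:.0f} GB/h")
print(f"Average, 24 h: {average_over_day:.0f} GB/h")
```

Whether the sizing target is the 24-hour rate or the 8-hour rate depends on when the data actually arrives and when processing may start.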
Sizing Disclaimers
• Past performance is not a guarantee of future results
• The best practice is to run throughput tests with fully representative data and transformation profiles on actual equipment
• Sizing should accommodate data growth and operational margins
• The results here represent throughput with a single transformation type under controlled conditions
  – Increasing the variety of transformations may result in lower performance
• See Pentaho Best Practices for performance tuning
What is PDI Sizing?
Determining the number of nodes and cores needed to process data within time constraints
(Diagram labels: Data Sources, Data Targets)
PDI Sizing Variables and Constraints
(Diagram labels: Customer Demands, Jobs, Transformations, Inputs, Available Memory, Available CPU (cores), Available ops/sec, Input bandwidth, Output bandwidth, Number of Nodes, Time)
Pentaho Data Integration Sizing Factors
• Available Time
  – Processing time required
  – Turn-around time requirements
• Amount of Data
  – Interdependencies and lag
• Available Resources
  – Computing power: cores, memory, storage, network
  – Number of nodes
• Complexity of Transformations
PDI Anatomy
• Platform
  – CPU, Cores, Memory
  – JVM
• Jobs
  – Orchestration
• Transformation attributes
  – Threads
  – Connections
  – Steps
    • Blocking Steps
    • Expensive Steps
  – Rowset buffers
  – Step copies
  – Multiple streams
PDI Sizing – Hardware Constraints
• Enterprise-grade supported processors (not end-of-life)
• 8-core processors
• 32 GB or more of available RAM
  – 24 GB+ per JVM
• High-speed network connections (1 Gb/sec – 10 Gb/sec)
• Low number of network hops
  – Co-located nodes on same segment
• Cluster Configurations
  – Carte Clustering
  – Hadoop MapReduce
  – AWS Auto Scaling
  – Spark Clusters
Example 1: High I/O, Moderate CPU use-case**
• Single 8-core node with a single type of transformation
• Throughput Peak:
  – 229 GB per Hour
  – 916 GB per 4-Hours
  – 5.5 TB per 24-Hours
Summarized Results – High I/O, Moderate CPU use-case**

| Number of Concurrent Transformations | Hourly Throughput | Daily Throughput | Notes |
|---|---|---|---|
| 1 | 121.9 GB per Hour | 2.9 TB per Day | Medium (15-steps) |
| 2 | 228.6 GB per Hour | 5.4 TB per Day | This represents an 88% increase in throughput |
| 3 | 229.2 GB per Hour | 5.5 TB per Day | This is less than 1% more throughput |
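The percentages in the Notes column can be reproduced from the hourly throughput figures. The sketch below is illustrative only; `percent_gain` is a hypothetical helper name, and the input values are the three measurements from this slide.

```python
def percent_gain(before: float, after: float) -> float:
    """Relative throughput gain, in percent, when moving from 'before' to 'after'."""
    return (after - before) / before * 100.0


# Hourly throughput (GB/hour) by number of concurrent transformations, from the slide
hourly_gb = {1: 121.9, 2: 228.6, 3: 229.2}

# Gain of each concurrency level over the previous one
gains = {n: percent_gain(hourly_gb[n - 1], hourly_gb[n]) for n in (2, 3)}

for n, gain in gains.items():
    print(f"{n - 1} -> {n} concurrent transformations: {gain:.1f}% more throughput")
```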
Example 2: High I/O, High CPU use-case**
• Single 8-core node with a single type of transformation
• Throughput Peak:
  – 113 GB per Hour
  – 452 GB per 4-Hours
  – 2.7 TB per 24-Hours
Summarized Results – High I/O, High CPU use-case**

| Number of Concurrent Transformations | Hourly Throughput | Daily Throughput | Notes |
|---|---|---|---|
| 1 | 88.8 GB per Hour | 2.1 TB per Day | Medium (15-steps) |
| 2 | 112.6 GB per Hour | 2.7 TB per Day | This represents a 27% increase in throughput |
| 3 | 113.4 GB per Hour | 2.7 TB per Day | This is less than 1% more throughput |
Example Sizing Solution
• Capacity requirement: 10 TB per day
• Solution: two 8-core nodes
  – 5.5 TB / day / 8-core
  – 5.5 TB / day / 8-core
• Contingency: additional nodes
  – 5.5 TB / day / 8-core
  – 5.5 TB / day / 8-core
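The node count follows from dividing the requirement by the measured per-node throughput and rounding up. This is a minimal sketch under one important assumption that the tests here do not verify: throughput scales linearly as identical 8-core nodes are added. The name `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(required_tb_per_day: float, per_node_tb_per_day: float) -> int:
    """Smallest whole number of nodes whose combined throughput meets the requirement.

    Assumption: throughput scales linearly with the number of identical nodes.
    """
    return math.ceil(required_tb_per_day / per_node_tb_per_day)


required = 10.0   # TB per day, peak requirement from the sizing problem
per_node = 5.5    # TB per day per 8-core node, Example 1 peak

base_nodes = nodes_needed(required, per_node)
headroom_tb = base_nodes * per_node - required

print(f"Nodes needed: {base_nodes}")
print(f"Headroom at peak: {headroom_tb:.1f} TB per day")
```

With only about 1 TB per day of headroom at peak, the contingency nodes are what absorb growth, maintenance windows, and backlog recovery.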
PDI Sizing – Use-Case Variation Considerations
• Data formats: JSON, XML, CSV, Binary
• Transformation sizes: small, medium, large (low-CPU, high-CPU)
• Step copies: 1, 2, 4, 8, 16
• Field sizes: 10B, 100B, 1K, 4K, 10K
• Row sizes: 1K, 4K, 16K, and Rowset buffer
• Step types: Regex, JavaScript
• Aggregations: Sort, Join, Analytic (Sum, Standard Deviation)
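One way to turn these variations into a test plan is to enumerate their combinations. The sketch below is an illustration, not the harness used for the results in this deck: it takes three of the dimensions listed above, and the dictionary keys are hypothetical names.

```python
from itertools import product

# Three of the variation dimensions from the slide
data_formats = ["JSON", "XML", "CSV", "Binary"]
transformation_sizes = ["small", "medium", "large"]
step_copies = [1, 2, 4, 8, 16]

# One test case per combination of format, transformation size, and step copies
test_matrix = [
    {"format": fmt, "size": size, "step_copies": copies}
    for fmt, size, copies in product(data_formats, transformation_sizes, step_copies)
]

print(f"{len(test_matrix)} test cases")
print(test_matrix[0])
```

The matrix grows multiplicatively with each added dimension, so in practice it may be worth testing only the combinations that resemble the production workload.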
PDI Sizing – Best Practices
• Performance Tuning
  – Optimize input and output structures (e.g. pre-sort data)
  – Identify slowest steps
  – Use optimal steps, such as Bulk Loading
• Scaling up
  – Number of step copies
  – Number of instances
  – Clustering transformations
• Perform load tests to determine peak throughput
• Monitoring and alerting
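Load-test results can be scanned for the point where extra concurrency stops paying off. The following is a rough sketch using the hourly throughput figures from the two summarized-results slides; the function name `saturation_point` and the 5% threshold are assumptions chosen for illustration.

```python
def saturation_point(throughput_by_concurrency: dict, min_gain_pct: float = 5.0) -> int:
    """Return the lowest concurrency level after which one more concurrent
    transformation adds less than min_gain_pct percent throughput.

    If every step adds at least min_gain_pct, the highest tested level is returned.
    """
    levels = sorted(throughput_by_concurrency)
    for current, following in zip(levels, levels[1:]):
        before = throughput_by_concurrency[current]
        after = throughput_by_concurrency[following]
        gain_pct = (after - before) / before * 100.0
        if gain_pct < min_gain_pct:
            return current
    return levels[-1]


# Hourly throughput (GB/hour) by number of concurrent transformations
moderate_cpu = {1: 121.9, 2: 228.6, 3: 229.2}   # Example 1
high_cpu = {1: 88.8, 2: 112.6, 3: 113.4}        # Example 2

print("Example 1 saturates at", saturation_point(moderate_cpu), "concurrent transformations")
print("Example 2 saturates at", saturation_point(high_cpu), "concurrent transformations")
```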
Considerations and Recommendations
• Capacity planning
  – CPU, Storage, Network
  – Allow for operating margins
    • 20% to 50%
  – Allow for system overhead
    • 10%
• Data forecasting
  – Monthly and annual growth
  – Ad-hoc projects
  – Data growth
  – Additional transformations
• System Maintenance
  – Allow for maintenance, scheduled and MTBF
    • Backups, Upgrades, Backlog Recovery
  – Redundancy, Failover
  – Ongoing system and performance monitoring
• Data Cycles
  – Near real-time streaming versus daily analytical batches
• System optimization
  – See Pentaho Best Practices
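The margin and overhead figures can be folded into the node count. The sketch below is one possible interpretation, not a prescribed formula: it assumes the operating margin and the system overhead are applied multiplicatively on top of the peak requirement, and that nodes scale linearly. The function names are hypothetical; the 10 TB and 5.5 TB figures come from the example sizing problem and Example 1.

```python
import math


def padded_requirement(peak_tb_per_day: float, operating_margin: float,
                       system_overhead: float = 0.10) -> float:
    """Peak requirement inflated by an operating margin and a system overhead.

    Assumption: both allowances are applied multiplicatively.
    """
    return peak_tb_per_day * (1 + operating_margin) * (1 + system_overhead)


def nodes_with_margin(peak_tb_per_day: float, per_node_tb_per_day: float,
                      operating_margin: float, system_overhead: float = 0.10) -> int:
    """Whole nodes needed to cover the padded requirement (linear scaling assumed)."""
    padded = padded_requirement(peak_tb_per_day, operating_margin, system_overhead)
    return math.ceil(padded / per_node_tb_per_day)


# 20% operating margin, 10% system overhead
low_margin_nodes = nodes_with_margin(10.0, 5.5, operating_margin=0.20)
print(f"Padded requirement: {padded_requirement(10.0, 0.20):.1f} TB per day")
print(f"Nodes needed with 20% margin: {low_margin_nodes}")
```

Under these assumptions the two-node baseline is no longer enough once margins are included, which is consistent with planning for additional contingency nodes.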
PDI Sizing – support.pentaho.com
• Best Practices
• Product Documentation
• Enterprise Support
• Pentaho Services