S8443: Feeding the Big Data Engine: How to Import Data in Parallel
Presented by: Brian Kennedy, CTO
Providence – Atlanta
Email: bkennedy@simantex.com
Introduction to Simantex, Inc.
• Simantex Leadership
– Experts in diverse public gaming, artificial intelligence applications, e-commerce, and software development
– Gaming industry experience in lottery, casino, horse racing, sports betting, and eSports
– Large business/enterprise pedigree complemented by start-up experience and the ability to scale up
– Track record of creating partnerships, ecosystems, and collaboration
– Global B2B and B2G experience
• Helios General Purpose AI/Simulation Platform
– Helios is a revolutionary new approach to Enterprise software, forming a marriage of Wisdom and Artificial Intelligence to provide real-world solutions
– Leveraging a proprietary simulation approach, Helios incorporates human learning, reasoning, and perceptual processes into an AI platform
– Simantex is looking to apply it to the emerging eSports industry to combat fraud, detect software weaknesses, and improve player performance
Motivation for High Speed Data Importing
This module is part of the Helios Platform's High Performance Data Querying & Access Layer. When we began work on this module, the intent was to achieve these objectives:
• Efficient utilization of server resources (multitenant / cost savings)
• Scalability to handle clients with massive data needs
• A complete enterprise solution that is 100% GPU based
The goal was to prove that just about any problem, no matter how serial in nature it appears, can be mapped to the GPU to achieve significant performance gains.
Complexities of the CSV format

Sample college applicant CSV file:
First Name,Last Name,Notes,Age,Applying From State,GPA
John,Smith,Honnor Student,18,Nevada,3.77
Marybeth,Parkins,Always "early" to class,17,Colorado,3.42
Sam,Peterson,"Hobbies include: tennis, football, swimming",19,Kansas,2.85
"Sarah",Baxter,17,New Jersey,2.90
Fred,Roberts,"Focuses on ""Extreme"" Sports like: sky diving, base jumping, etc",19,Texas,3.05
白,常,专注于研究和运动: hockey,17,California,3.65

• The first line of data could be a column-name header record
• Column widths are inconsistent from one record to the next
• Columns may be quoted (meaning they start and end with a quote)
• This means that the delimiter could be part of the data
• The quotes surrounding a column should not be treated as part of the column data
• Quotes may exist in column data where the column is not quoted (e.g., Always "early" to class)
• Quoted columns may have quotes in the data, which are then double quoted (e.g., ""Extreme"") – see the sketch below
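To make these quoting rules concrete, here is a minimal, hypothetical CPU-side sketch of a quote-aware record splitter (not the Helios implementation; the name split_csv_record is invented for illustration). It treats a delimiter inside quotes as data, drops the surrounding quotes, and collapses a doubled quote inside a quoted field into a single quote character:

    // Hypothetical sketch: split one CSV record into fields, honoring quoted
    // fields and "" escapes. Not the actual Helios importer code.
    #include <string>
    #include <vector>

    std::vector<std::string> split_csv_record(const std::string& line, char delim = ',')
    {
        std::vector<std::string> fields;
        std::string cur;
        bool in_quotes = false;

        for (size_t i = 0; i < line.size(); ++i) {
            char c = line[i];
            if (in_quotes) {
                if (c == '"') {
                    if (i + 1 < line.size() && line[i + 1] == '"') {
                        cur += '"';            // "" inside quotes -> one literal quote
                        ++i;
                    } else {
                        in_quotes = false;     // closing quote is not part of the data
                    }
                } else {
                    cur += c;                  // a delimiter inside quotes is data
                }
            } else {
                if (c == '"' && cur.empty()) {
                    in_quotes = true;          // opening quote starts a quoted field
                } else if (c == delim) {
                    fields.push_back(cur);     // field boundary
                    cur.clear();
                } else {
                    cur += c;                  // a bare quote mid-field stays as data
                }
            }
        }
        fields.push_back(cur);
        return fields;
    }

Run over the sample rows, this would return "Hobbies include: tennis, football, swimming" as one Notes field and restore the single quotes around "Extreme" in the Fred Roberts record.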
• Columns may exceed the target data size
• Let's say in this example the Notes column is an nvarchar(50). Measuring the Fred Roberts Notes field, each doubled quote ("") counts as only one character and the outer quotes are not counted at all. Even so, the column exceeds our size constraint, so this record is an error (see the counting sketch below).
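A hypothetical helper for that size check might count characters directly from the raw quoted text, skipping the outer quotes and treating each "" escape as one character (quoted_field_length is an invented name; this is a sketch, not the actual importer code):

    // Hypothetical sketch: logical length of a raw quoted field.
    // [start, end) spans the field text INCLUDING its outer quotes.
    #include <cstddef>

    std::size_t quoted_field_length(const char* start, const char* end)
    {
        std::size_t count = 0;
        for (const char* p = start + 1; p < end - 1; ++p) {   // skip the outer quotes
            if (*p == '"' && (p + 1) < (end - 1) && *(p + 1) == '"')
                ++p;                                          // "" counts as one character
            ++count;
        }
        return count;
    }

    // Example use: if quoted_field_length(...) > 50 for an nvarchar(50) column,
    // the record is flagged as an error.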
• The number of columns may differ from one record to the next (the "Sarah",Baxter record is missing its Notes column)
• Possible error situation
• UTF-8 text support for multi-language data means:
• A character may be 1–3 bytes long, which affects how we "count" characters against max-size constraints (see the sketch below)
• Columns can contain a mixture of 1-, 2-, and 3-byte characters
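One common way to count UTF-8 characters (a sketch of the general technique, not necessarily how Helios does it) is to count only lead bytes, since continuation bytes always carry the bit pattern 10xxxxxx:

    // Hypothetical sketch: count UTF-8 characters in a byte range by skipping
    // continuation bytes. With 1-3 byte characters, the number of lead bytes
    // equals the number of characters.
    __host__ __device__ inline bool is_utf8_lead_byte(unsigned char b)
    {
        return (b & 0xC0) != 0x80;              // not a 10xxxxxx continuation byte
    }

    __host__ __device__ inline size_t utf8_char_count(const unsigned char* bytes, size_t num_bytes)
    {
        size_t chars = 0;
        for (size_t i = 0; i < num_bytes; ++i)
            if (is_utf8_lead_byte(bytes[i]))
                ++chars;
        return chars;
    }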
• Not all columns may need to be retrieved from the text
• Maybe in this run we only want to import: Last Name, Age, and Applying From State
• So the Importer needs to be able to skip columns without writing out their data (see the sketch below)
• Note that the "Sarah",Baxter record, with its missing column, remains an error row and is not imported
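As a rough illustration of column skipping (hypothetical code with invented names, building on the split_csv_record sketch above), a per-column import mask decides whether a parsed field is written out at all:

    // Hypothetical sketch: skip columns via a per-column import mask.
    // Header: First Name, Last Name, Notes, Age, Applying From State, GPA
    #include <array>
    #include <string>
    #include <vector>

    void write_selected_columns(const std::vector<std::string>& fields)
    {
        // Import only Last Name (1), Age (3), and Applying From State (4).
        const std::array<bool, 6> import_column = { false, true, false, true, true, false };

        for (size_t col = 0; col < fields.size() && col < import_column.size(); ++col) {
            if (!import_column[col])
                continue;                 // skipped column: its data is never written out
            // ... write fields[col] to the output array for this column
        }
    }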
Thinking Differently, Adapting to Massively Parallel Approaches
This type of problem is traditionally handled by reading data sequentially and managing a variety of "states". Our approach will compute the "states" for each byte in the CSV file in parallel and store them in a series of arrays.
[Figure: per-byte state values (1, 2, 3) stored in parallel arrays]
Let's take a look at the general algorithm flow…
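First, though, a minimal sketch of the per-byte state idea (illustrative only, not the actual Helios kernel): one thread per input byte classifies its byte and records flags in parallel state arrays; later passes such as prefix scans can turn those flags into field and record boundaries.

    // Hypothetical sketch: classify every byte of the CSV chunk in parallel,
    // writing the per-byte "state" flags into separate arrays.
    __global__ void classify_bytes(const char* csv, size_t n,
                                   unsigned char* is_delim,
                                   unsigned char* is_quote,
                                   unsigned char* is_newline)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i >= n) return;

        char c = csv[i];
        is_delim[i]   = (c == ',');
        is_quote[i]   = (c == '"');
        is_newline[i] = (c == '\n');
    }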
CSV Reader Program Flow
1. Read the CSV file from disk into CPU memory in chunks.
2. cudaMalloc GPU buffers and cudaMemcpy each CSV file chunk into GPU memory.
3. The CSV Reader processes the chunk in GPU memory and writes its results to output arrays, one per column/field. The output arrays live in GPU memory.
4. GPU processing and calculations run directly on the output arrays: queries, data consolidation, math operations, etc.
5. Results are returned to the CPU via cudaMemcpy.
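A hypothetical host-side sketch of this flow, reusing the classify_bytes kernel from the sketch above (error handling omitted for brevity):

    // Hypothetical sketch of the host-side flow for one chunk.
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void classify_bytes(const char* csv, size_t n,
                                   unsigned char* is_delim,
                                   unsigned char* is_quote,
                                   unsigned char* is_newline);   // from the sketch above

    void process_chunk(const std::vector<char>& chunk)   // chunk already read from disk into CPU memory
    {
        size_t n = chunk.size();

        char* d_csv = nullptr;
        unsigned char *d_delim = nullptr, *d_quote = nullptr, *d_newline = nullptr;
        cudaMalloc(&d_csv, n);
        cudaMalloc(&d_delim, n);
        cudaMalloc(&d_quote, n);
        cudaMalloc(&d_newline, n);

        // Copy the CSV chunk into GPU memory.
        cudaMemcpy(d_csv, chunk.data(), n, cudaMemcpyHostToDevice);

        // Process the chunk on the GPU, writing per-byte results to the output arrays.
        const int block = 256;
        const int grid  = static_cast<int>((n + block - 1) / block);
        classify_bytes<<<grid, block>>>(d_csv, n, d_delim, d_quote, d_newline);
        cudaDeviceSynchronize();

        // Further GPU work (queries, consolidation, math) would run here on the
        // output arrays; only final results are copied back to the CPU.
        std::vector<unsigned char> h_delim(n);
        cudaMemcpy(h_delim.data(), d_delim, n, cudaMemcpyDeviceToHost);

        cudaFree(d_csv);
        cudaFree(d_delim);
        cudaFree(d_quote);
        cudaFree(d_newline);
    }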
A Simplified Example
To simplify the problem for now, let's assume:
1. Field delimiters only appear at field boundaries. No commas within quotes, and no double quotes used to escape a quote.
2. All data fit within their defined output array widths. There are no overruns.
3. All data are ASCII text characters, so we are always dealing with 1 byte per character.
4. All records (rows) have the correct number of fields (columns). No column-count errors.
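Under these simplified assumptions, even the seemingly serial step of numbering fields can be expressed as a parallel scan. The sketch below (illustrative only; compute_field_ids is an invented name) uses Thrust to flag separators and prefix-scan the flags, so that every byte knows which field of the file it belongs to:

    // Hypothetical sketch: with no quoted delimiters (assumption 1), a byte's
    // field index is just the number of separators that precede it, which an
    // exclusive prefix scan computes in parallel.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <thrust/transform.h>

    struct is_separator
    {
        __host__ __device__ int operator()(char c) const
        {
            return (c == ',' || c == '\n') ? 1 : 0;
        }
    };

    void compute_field_ids(const thrust::device_vector<char>& csv,
                           thrust::device_vector<int>& field_id)
    {
        field_id.resize(csv.size());

        thrust::device_vector<int> flags(csv.size());
        thrust::transform(csv.begin(), csv.end(), flags.begin(), is_separator());

        // field_id[i] = number of separators before byte i
        thrust::exclusive_scan(flags.begin(), flags.end(), field_id.begin());
    }

A similar scan over newline flags alone could assign record ids, and grouping bytes by these ids would then let each output column be gathered in parallel.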