CS 327E Class 9 April 8, 2019
No Quiz Today :)
● What to expect from upcoming Milestones:
  ○ Milestone 9: Find your secondary dataset, load it into BQ, and model the data with SQL transforms
  ○ Milestone 10: Create Beam pipelines that transform the data
  ○ Milestone 11: Create cross-dataset queries and data visualizations
  ○ Milestone 12: Create a workflow with Apache Airflow
  ○ Milestone 13: Present and demo your project
● Review your secondary dataset today in class: http://tinyurl.com/y7d2jzjj
Questions:
● How likely are young tech companies to sponsor H1B workers?
● How does the compensation of H1B workers compare to that of domestic workers who are performing the same role and living in the same region?

Datasets:
● Main Dataset: H1B applications for years 2015 - 2018 (source: US Dept of Labor)
● Secondary Dataset: Corporate registrations for various states (source: Secretaries of State)
● Secondary Dataset: Occupational Employment Survey for years 2015 - 2018 (source: Bureau of Labor Statistics)
Cross-Dataset Queries:
● Join the H1B Employer table with the Secretary of State's Corporate Registry table on the employer's name and city. Get the age of the company from the incorporation date in the registry record. Group the employers into age buckets to see how many young tech companies sponsor H1B workers (see the query sketch below).
● Technical challenges: 1) matching employers within the H1B dataset due to inconsistent spellings of the company's name; 2) matching employers across the H1B and Corporate Registry datasets due to inconsistent spellings of the company's name and address.
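To make the first query concrete, here is a minimal sketch of the age-bucket join, run from Python through the BigQuery client library. The table names (h1b_split.Employer, corp_reg.Registration) and column names (employer_id, name, city, incorporation_date) are assumptions for illustration only; the actual schemas come from the CTAS scripts linked later in the deck.

# Sketch only: table and column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  CASE
    WHEN DATE_DIFF(CURRENT_DATE(), r.incorporation_date, YEAR) < 3 THEN '0-2 years'
    WHEN DATE_DIFF(CURRENT_DATE(), r.incorporation_date, YEAR) < 6 THEN '3-5 years'
    ELSE '6+ years'
  END AS age_bucket,
  COUNT(DISTINCT e.employer_id) AS num_employers
FROM `h1b_split.Employer` e
JOIN `corp_reg.Registration` r
  ON e.name = r.name AND e.city = r.city
GROUP BY age_bucket
ORDER BY age_bucket
"""

for row in client.query(query):  # iterating the job waits for and fetches results
    print(row.age_bucket, row.num_employers)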
Main Dataset
Raw Table Stats

Year   Table Size   # Rows    # Columns
2015   241 MB       618,804   41
2016   233 MB       647,852   41
2017   253 MB       624,650   52
2018   283 MB       654,162   52
Source File: https://github.com/shirleycohen/h1b_analytics/blob/master/h1b_ctas.sql
● Normalizes the employer name, city and state
● Removes duplicate employer records
(A sketch of both steps follows the source files below.)
Source Files:
https://github.com/shirleycohen/h1b_analytics/blob/master/transform_employer_table_single.py
https://github.com/shirleycohen/h1b_analytics/blob/master/transform_employer_table_cluster.py
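Here is a minimal sketch of the normalize-and-dedup transform as a Beam pipeline, assuming dict-shaped rows with hypothetical name, city, state, and employer_id fields and a hypothetical output table; the full transforms are in the source files above.

import re
import apache_beam as beam

def normalize_employer(record):
    """Uppercase name/city/state and strip punctuation and extra whitespace."""
    for field in ('name', 'city', 'state'):
        value = (record.get(field) or '').upper()
        value = re.sub(r'[^A-Z0-9 ]', '', value)             # drop punctuation
        record[field] = re.sub(r'\s+', ' ', value).strip()   # collapse whitespace
    return record

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
           query='SELECT * FROM h1b_split.Employer', use_standard_sql=True))
     | 'Normalize' >> beam.Map(normalize_employer)
     | 'KeyByNameCity' >> beam.Map(lambda r: ((r['name'], r['city']), r))
     | 'GroupDuplicates' >> beam.GroupByKey()
     | 'KeepFirst' >> beam.Map(lambda kv: list(kv[1])[0])    # one record per (name, city)
     | 'Write' >> beam.io.WriteToBigQuery(
           'h1b_split.Employer_Temp',
           schema='employer_id:INTEGER,name:STRING,city:STRING,state:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))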
● Read the records from the Employer and Job/Application tables in BigQuery and create a PCollection from each source
● Normalize the employer's name, city and state in the Job/Application PCollection (using ParDo)
● Join the Job/Application and Employer PCollections on the employer's name and city (using CoGroupByKey)
● Extract the matching employer_id from the joined results and add it to the Job/Application element (using ParDo)
● Remove the employer's name and city from the Job/Application PCollection (using ParDo)
● Write the new Job/Application table to BigQuery
(A sketch of the join and employer_id extraction steps follows the source files below.)
Source Files:
https://github.com/shirleycohen/h1b_analytics/blob/master/transform_job_table_cluster.py
https://github.com/shirleycohen/h1b_analytics/blob/master/transform_application_table_cluster.py
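A minimal sketch of the CoGroupByKey join and the employer_id extraction, again with hypothetical table and field names; the normalization step from the previous sketch is omitted here for brevity, and the two cluster scripts above contain the full pipelines.

import apache_beam as beam

def key_by_name_city(record):
    # Key each record by (name, city) so matching rows meet at the join.
    return ((record['name'], record['city']), record)

def attach_employer_id(kv):
    """Copy employer_id from the matching Employer record onto each Job record,
    then drop the now-redundant name and city fields."""
    _, grouped = kv
    employers = list(grouped['employers'])
    for job in grouped['jobs']:
        if employers:
            job['employer_id'] = employers[0]['employer_id']
        job.pop('name', None)
        job.pop('city', None)
        yield job

with beam.Pipeline() as p:
    employers = (p
        | 'ReadEmployers' >> beam.io.Read(beam.io.BigQuerySource(
              query='SELECT * FROM h1b_split.Employer', use_standard_sql=True))
        | 'KeyEmployers' >> beam.Map(key_by_name_city))
    jobs = (p
        | 'ReadJobs' >> beam.io.Read(beam.io.BigQuerySource(
              query='SELECT * FROM h1b_split.Job', use_standard_sql=True))
        | 'KeyJobs' >> beam.Map(key_by_name_city))
    ({'employers': employers, 'jobs': jobs}
        | 'Join' >> beam.CoGroupByKey()
        | 'AttachEmployerId' >> beam.FlatMap(attach_employer_id)
        | 'Write' >> beam.io.WriteToBigQuery(
              'h1b_split.Job_Temp',
              schema='job_id:INTEGER,employer_id:INTEGER,job_title:STRING',
              create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
              write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))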
Secondary Dataset
Table Details
Source File: https://github.com/shirleycohen/h1b_analytics/blob/master/corporate_registrations_ctas.sql
Source File: https://github.com/shirleycohen/h1b_analytics/blob/master/transform_corpreg_table_cluster.py
Source File: https://github.com/shirleycohen/h1b_analytics/blob/master/employer_views.sql
http://www.cs.utexas.edu/~scohen/milestones/Milestone9.pdf