Stephen Strowes | AIMS 2019 | 2019-04-16
Introduction: Hadoop at the NCC
Lots of data
• RIPE Atlas generates a lot of measurement data
• In total, it consumes ~66TB (compressed)
• Stored on the NCC's Hadoop cluster(s)
Lots of data
• We need tools that make exploration and analysis of this data easy
• Apache Spark on Hadoop gets us part way there
Running an in-house Hadoop cluster is not easy
• Expenditure: hardware, rack space
• Expenditure: system engineering, maintenance, uptime, patching, user requests, support
• Expenditure: research engineering time
Data Analysis is Exploratory
• Iterative development of an analysis is critical
• We want this loop to be as tight as possible
Atlas → Cloud: A prototype
Why the cloud?
• The big three cloud platforms are many years old
  - they reduce expenditure on hardware and time
  - they have SLAs that help keep things running
  - they have all sorts of tooling ready to use (or not use, as we wish)
• We've been prototyping against Google Cloud Platform
Prototyping data ingress
Google Cloud Platform
• Cloud Storage
  - Avro files are dropped in here, to be accessed by BigQuery
• BigQuery
  - A data warehouse to store and query massive datasets, enabling super-fast SQL queries on Google's infrastructure
  - BigQuery abstracts almost everything away
Traceroute data includes nested results

  {
    "type" : "traceroute",
    "dst_addr" : "193.0.19.59",
    "dst_name" : "193.0.19.59",
    "src_addr" : "193.0.10.36",
    "from" : "193.0.10.36",
    "msm_name" : "Traceroute",
    "msm_id" : 5030,
    "prb_id" : 6003,
    "timestamp" : 1551700827,
    "endtime" : 1551700831,
    "result" : [
      { "hop" : 1,
        "result" : [
          { "rtt" : 2.728, "ttl" : 255, "from" : "193.0.10.2", "size" : 28 },
          { "rtt" : 2.011, "ttl" : 255, "from" : "193.0.10.2", "size" : 28 },
          { "rtt" : 1.628, "ttl" : 255, "from" : "193.0.10.2", "size" : 28 }
        ]
      },
      { "hop" : 2,
        "result" : [
          { "rtt" : 107.264, "ttl" : 62, "from" : "193.0.19.59", "size" : 68 },
          { "rtt" : 2.122, "ttl" : 62, "from" : "193.0.19.59", "size" : 68 },
          { "rtt" : 1.952, "ttl" : 62, "from" : "193.0.19.59", "size" : 68 }
        ]
      }
    ]
  }
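As an illustration, the two levels of nesting in a traceroute record can be flattened in a few lines of plain Python. This is a sketch: the field names follow the Atlas result format shown above, and the record here is a trimmed stand-in.

```python
import json

# A trimmed RIPE Atlas traceroute result, following the nested
# structure shown above: each entry in "result" is a hop, and each
# hop carries its own list of per-packet replies.
raw = '''
{"prb_id": 6003, "msm_id": 5030, "dst_addr": "193.0.19.59",
 "result": [
   {"hop": 1, "result": [{"rtt": 2.728, "from": "193.0.10.2"},
                         {"rtt": 2.011, "from": "193.0.10.2"},
                         {"rtt": 1.628, "from": "193.0.10.2"}]},
   {"hop": 2, "result": [{"rtt": 107.264, "from": "193.0.19.59"},
                         {"rtt": 2.122, "from": "193.0.19.59"},
                         {"rtt": 1.952, "from": "193.0.19.59"}]}]}
'''

msm = json.loads(raw)

# Flatten the two levels of nesting into (hop, from, rtt) rows: the
# same shape a relational query engine needs before it can aggregate
# over individual replies.
rows = [(hop["hop"], reply["from"], reply["rtt"])
        for hop in msm["result"]
        for reply in hop["result"]]

for row in rows:
    print(row)
```

This flattening is exactly what UNNEST does in BigQuery, and what an explode step does in Spark, before any per-reply aggregation can run.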
BigQuery table schema
BigQuery table schema: example data
Comparisons
Comparisons
• Apples vs. oranges:
  - Python with Apache Spark, running on a private Hadoop cluster, vs.
  - BigQuery, running on Google's own public platform
Example 1: Count the IPv6 addresses each probe ran traceroutes to in one day
Example 1: pyspark
• Execution time:
  - 16-20 minutes (ad hoc queue)
  - 5-6 minutes with a higher-priority queue when the cluster isn't loaded
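The pyspark job itself isn't shown on the slide; the aggregation it performs can be sketched in plain Python. The field names (prb_id, af, dst_addr) follow the Atlas result format, and the sample rows here are invented stand-ins.

```python
from collections import defaultdict

# Sketch of the Example 1 aggregation: for each probe, count the
# distinct IPv6 addresses it ran traceroutes to in one day.
measurements = [  # stand-in for a day of parsed traceroute results
    {"prb_id": 6003, "af": 6, "dst_addr": "2001:db8::1"},
    {"prb_id": 6003, "af": 6, "dst_addr": "2001:db8::2"},
    {"prb_id": 6003, "af": 6, "dst_addr": "2001:db8::1"},  # duplicate
    {"prb_id": 6001, "af": 4, "dst_addr": "193.0.19.59"},  # IPv4: skipped
    {"prb_id": 6001, "af": 6, "dst_addr": "2001:db8::3"},
]

dsts = defaultdict(set)
for m in measurements:
    if m["af"] == 6:                      # IPv6 traceroutes only
        dsts[m["prb_id"]].add(m["dst_addr"])

# Distinct destination count per probe
counts = {prb: len(addrs) for prb, addrs in dsts.items()}
print(counts)  # {6003: 2, 6001: 1}
```

In Spark or BigQuery the same logic is a filter on the address family followed by a COUNT(DISTINCT dst_addr) grouped by probe ID; the runtime difference in the slides comes from the platform, not the query shape.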
Example 1: bigquery
• Execution time:
  - 4-5 seconds
Example 2: Find the lowest RTT between the source and each hop
Example 2: pyspark
• Execution time:
  - ~30 minutes
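The pyspark code isn't reproduced on the slide; the underlying aggregation, the minimum RTT per (responding address, probe) pair, can be sketched in plain Python over flattened reply rows. The sample values are taken from the traceroute example earlier.

```python
# Sketch of the Example 2 aggregation: the minimum RTT seen between
# the probe and each responding hop address, grouped by (address,
# probe) pair, matching the GROUP BY in the BigQuery version.
replies = [  # stand-in flattened rows: (prb_id, from_addr, rtt)
    (6003, "193.0.10.2", 2.728),
    (6003, "193.0.10.2", 2.011),
    (6003, "193.0.10.2", 1.628),
    (6003, "193.0.19.59", 107.264),
    (6003, "193.0.19.59", 2.122),
    (6003, "193.0.19.59", 1.952),
]

min_rtt = {}
for prb_id, addr, rtt in replies:
    key = (addr, prb_id)
    if key not in min_rtt or rtt < min_rtt[key]:
        min_rtt[key] = rtt

print(min_rtt)
```

Note how the 107.264ms outlier on the second hop is discarded in favour of the 1.952ms minimum, which is why MIN(rtt) is the usual estimator for path latency in traceroute data.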
Example 2: bigquery

  SELECT result.from AS IpAddress, prbId, MIN(result.rtt) AS minRtt
  FROM `data-test-194508.prod.traceroute_atlas_prod`,
    UNNEST(hops) AS hop,
    UNNEST(resultHops) AS result
  WHERE startTime >= TIMESTAMP("2019-02-15")
    AND startTime < TIMESTAMP("2019-02-16")
  GROUP BY result.from, prbId

• Execution time:
  - ~25 seconds
Example 3: Emile's probe similarity work
Example 3: pyspark
• Execution time:
  - ~2 hours
Example 3: bigquery
• Execution time:
  - ~25 minutes
Takeaways
• The comparison is imperfect, but the point is that the abstractions are hidden well by the language, and processing time is faster
• The end result: more rapid data analysis
The Future
The Future
• This is prototype, exploratory work
  - e.g. putting other datasets in here: IPmap data, ping data, PeeringDB data
• The project is not yet costed
• But it looks promising
General Access to Data and Tooling?
• Most Atlas data is public, if not always easy to aggregate
• If the data is in a commodity cloud system, maybe it can be made more generally accessible
• Give people access to all the data, and the platform's tooling to operate over that data, easily
• Get to the science faster?
General Access to Data and Tooling?
• Charging models: the NCC provides the data, and researchers pay for the compute cycles/network transit they use
• Big vendors support open data initiatives with free storage:
  - https://aws.amazon.com/opendata/
  - https://cloud.google.com/bigquery/public-data/
• This doesn't have to be hosted on Google; any commodity platform that people are familiar with opens up the measurement data
Questions? Elena <edominguez@ripe.net>