
RSMG, 10 October 2019


  1. ORR protects the interests of rail and road users, improving the safety, value and performance of railways and roads today and in the future RSMG 10 October 2019

  2. Using R to automate reports: Reproducible Analytical Pipelines. Lucy Charlton

  3. The problem…
  Process: Data store (SQL) → Calculations in Excel / Power BI → Copy and paste charts into Word → Write commentary → Final report in Word
  ■ Need to produce graphs and commentary every month for an internal report.
  ■ Data in the same format, output in the same format.
  ■ Involves the same calculations, such as moving averages.
  ■ Takes at least half a day.
  ■ Manual steps make it time consuming, with a risk of errors (and dull).
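The moving averages mentioned above are simple to automate once they leave Excel. The deck does this in R; the sketch below uses Python purely for illustration, with a trailing three-month window as an assumed example.

```python
def moving_average(values, window=3):
    """Trailing moving average over a list of monthly figures.

    Returns None for months with fewer than `window` observations.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

# Example: three-month moving average of monthly counts
print(moving_average([10, 12, 14, 16], window=3))
# [None, None, 12.0, 14.0]
```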

  4. The solution: use RAP
  Process: Data store (SQL) → R functions → Output

  5. What is R Markdown?
  ■ R is an open source (free!) statistics programming language.
  ■ R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code.
  ■ Do all editing and formatting in the R document.
  ■ Can include all graphs and tables in one place: easy to update when the data updates.
  ■ Variety of outputs: Word, PDF, HTML, interactive dashboards, tables, websites…

  6. How to do it…
  ■ Use R Markdown to automatically create graphs and dynamic reports.
  ■ All graphs are consistent and presented in the same way.
  ■ Saves time and no manual calculation!
  ■ First, we had to agree set text for the commentary and write the code.
  ■ Success! It took less than 10 minutes to run the code, edit the text and check all figures, compared to the 4+ hours of work it took before. It will be faster next time.

  7. Dynamic report
  ■ Need nice tidy data.
  ■ No more coding in R: update three fields every period.
  ■ Very easy to amend the formatting of the report (uses a Word document as a template).

  8. Pros of using RAP
  ■ Auditability: records the process of how the report was created.
  ■ Speed: quick and easy to update; small changes can be implemented across many tables/reports simultaneously, so we save time on the repetitive task and have more time for interesting analysis.
  ■ Quality: QA can be built into the pipeline.
  ■ Knowledge transfer: all the information about how the figures are produced is embedded in the code, making it easy to hand over.

  9. Challenges of using RAP
  ■ Set aside time for the initial code.
  ■ New skills to learn (R, Python, GitHub).
  ■ IT challenges.
  ■ Finding suitable R packages; some only work for PDF or Word output, not both.
  ■ Might not always want fixed commentary.

  10. Is RAP for me?
  ■ Do you repeat the same workflow more than twice?
  ■ Is it time consuming and hard to replicate without a lot of manual intervention?
  ■ Do you copy and paste a lot between different software?
  ■ What is the impact of errors in your spreadsheet or report?
  ■ Could you reproduce publication statistics from one year ago? Five years ago?

  11. How do I get started?
  ■ Lots of online resources: DataCamp, Codecademy.
  ■ GovDataScience Slack domain (rap_collaboration channel): https://govdatascience.slack.com
  ■ RAP MOOC by Mat Gregory (GDS): https://www.udemy.com/reproducible-analytical-pipelines/

  12. Rail fares index automation and web scraping with Python. Greg Williams

  13. Automation of existing processes
  The focus of this presentation is on web scraping, but the automation of the production of the Fares Index is included as an example of an alternative application of Python in automating existing processes. This work is part of civil service efforts to drive efficiency gains by producing Reproducible Analytical Pipelines (RAP): https://gss.civilservice.gov.uk/blog/learning-about-reproducible-analytical-pipelines-rap-two-weeks-with-the-gss-good-practice-team/
  Why automate at all?
  • Consistently produced output
  • Remove human error
  • Process is defined within the code
  • Faster processing of data

  14. Fares index preparation
  What's involved:
  • 120 source files, 8 files of price information, 2 lookup tables, 4 file formats, 37 individual steps to follow
  • A total of 7.5 GB of data: equivalent to 7,600 digital photographs
  • Process map (diagram on the slide)

  15. Fares index preparation: results
  • Produced in just over an hour, not three months
  • Expensive proprietary software can be discontinued, saving £12,000 annually
  • Manual step of cross-checking with Avantix no longer needed
  • Time gained reinvested in deeper QA processes
  Red items are outputs not produced in the manual process:
  • 1 'superfile' of 78,000,000 rows of data at 12 GB
  • 5 diagnostic files
    • Checks for missing categories, products, duplicates in the RDG files, checksums
  • 3 subsets for checking:
    • Earnings over £500,000
    • Annual price change greater than 20%
    • Annual price change less than -20%
  • 2 subsets for calculation: advanced and non-advanced files
  • 1 hour needed, exclusive of source file data extraction
  • Final output based on the 'superfile', advanced and non-advanced files
  • 5 minutes needed to create final outputs
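The three checking subsets listed above amount to simple row filters. A minimal Python sketch, assuming hypothetical field names (`earnings`, `price`, `price_last_year`) rather than the actual RDG column names:

```python
# Sketch of the three checking subsets described above. The field names
# ("earnings", "price", "price_last_year") are illustrative assumptions,
# not the actual RDG column names.
def pct_change(row):
    """Annual price change, in percent."""
    return (row["price"] - row["price_last_year"]) / row["price_last_year"] * 100

def diagnostic_subsets(rows):
    """Split rows into the three checking subsets from the slide."""
    high_earnings = [r for r in rows if r["earnings"] > 500_000]
    big_rises = [r for r in rows if pct_change(r) > 20]
    big_falls = [r for r in rows if pct_change(r) < -20]
    return high_earnings, big_rises, big_falls

rows = [
    {"earnings": 600_000, "price": 100.0, "price_last_year": 80.0},  # high earnings, +25%
    {"earnings": 10_000, "price": 70.0, "price_last_year": 100.0},   # -30%
    {"earnings": 5_000, "price": 102.0, "price_last_year": 100.0},   # +2%, no flag
]
high, rises, falls = diagnostic_subsets(rows)
print(len(high), len(rises), len(falls))
```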

  16. What is "web scraping"?
  "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis." Source: https://en.wikipedia.org/wiki/Web_scraping
  This presentation will cover web scraping through HTTP, rather than an API.
  Direct web scraping via a URL: https://www.nationalrail.co.uk/

  17. The problem
  Economists want to see if the introduction of Open Access Operators changes the availability and/or prices of tickets. This means collecting individual ticket prices based on particular routes, days of the week and times of day.
  Manual data collection: go into the NRE website, request data for each route and time combination, and copy the results into a spreadsheet.
  • Time consuming
  • Search times change depending on the day of the week
  • Risk of human error
  • Adding routes increases the workload
  • Data not collected for weekends

  18. The solution
  Automatic data collection through web scraping:
  • Speed: 60 seconds to collect a route with 54 items of data, with data automatically appended
  • Runs automatically: runs every day in the background
  • Variables automatically calculated: the code works out the day of the week and searches for appropriate times for each route
  • Accurate: the data is converted into CSV format in a spreadsheet without human intervention, so no transcription errors
  • Flexible: a metadata sheet controls the nature and number of queries that will be made
  • A common standard: the CSV format is fixed for all users
  • Weekend coverage: the code executes three times on Fridays, so that weekend data is captured
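The "works out the day of the week" and Friday-weekend behaviour described above can be sketched as follows. The departure times and the rule for which dates a Friday run covers are illustrative assumptions, not the actual ORR scraper's configuration.

```python
from datetime import date, timedelta

# Sketch of the day-of-week logic described above. The departure times and
# the "cover the weekend on Fridays" rule are illustrative assumptions.
def queries_for(run_date):
    """Return (travel_date, depart_time) pairs to query on a given run date."""
    weekday_time, weekend_time = "08:00", "10:00"
    targets = [run_date + timedelta(days=1)]      # normal daily run: next day
    if run_date.weekday() == 4:                   # Friday: also cover Sat + Sun
        targets += [run_date + timedelta(days=2), run_date + timedelta(days=3)]
    return [(d, weekend_time if d.weekday() >= 5 else weekday_time)
            for d in targets]

# A Friday run produces three queries (Saturday, Sunday, Monday)
print(queries_for(date(2019, 10, 11)))
```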

  19. The benefits
  • It's been of genuine interest to users:
     Rolled out to 6 ORR economists
     Meeting with the DfT Data Science team for a technical demonstration/explanation on 16 October
  • Enables the team to collect more routes
  • Can add additional variables at little extra cost
  • Targets both:
     changes in tickets a given number of days from the present day
     changes in tickets for a fixed day in the future over a time series

  20. Steps in web scraping: direct web scraping

  Step | What to do                              | Library used
  1    | Build your URL via string concatenation | Generic: string
  2    | Send URL to website                     | Requests
  3    | Extract the underlying JSON data        | BeautifulSoup
  4    | Parse JSON data                         | json
  5    | Convert to CSV                          | csv
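The five steps in the table can be sketched end to end. This offline version runs against a canned HTML string instead of calling the live site, and the URL format and embedded-JSON shape are assumptions; in the real pipeline `requests` would fetch the page and BeautifulSoup would locate the script tag.

```python
import csv
import io
import json

# Offline sketch of steps 1-5 in the table above. The URL format and the
# embedded-JSON shape are assumptions, not the real NRE page structure.

# Step 1: build the query URL via string concatenation (illustrative format)
base = "https://www.nationalrail.co.uk/"
url = base + "journey-planner?from=KGX&to=YRK&date=2019-10-10"

# Step 2 (stubbed to keep the sketch offline): html = requests.get(url).text
html = ('<html><script id="data">'
        '{"fares": [{"route": "KGX-YRK", "price": 45.50}]}'
        '</script></html>')

# Step 3: extract the embedded JSON payload (BeautifulSoup in the real code)
payload = html[html.index('{"fares"'):html.index("</script>")]

# Step 4: parse the JSON
data = json.loads(payload)

# Step 5: convert the records to CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["route", "price"])
writer.writeheader()
writer.writerows(data["fares"])
print(buf.getvalue())
```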

  21. Journey Focussed Management: Observed Passenger Movement through a Network

  22. Agenda
  • Vision (NR)
  • Journey Focussed Management Project
  • Stakeholders
  • Outputs
  • Timeline and Next Steps
  • AOB

  23. System Operator
  We are 'the glue that holds the network together', enabling the seamless provision of cross-boundary services and coordinating capacity requests to make the best use of the network. Our vision is to become the recognised expert trusted by decision makers to plan Britain's railway. To achieve this we provide a whole-system, long-term view, using the detailed knowledge we have from capacity planning and timetabling the network. Our work is informed by our industry-wide interfaces with every train operator, the Network Rail routes, other infrastructure managers, public and private funders, and franchising bodies.
