Some Ideas for Best Practice in Scientific Computing Dr Owain - PowerPoint PPT Presentation

Some Ideas for “Best Practice” in Scientific Computing Dr Owain Kenway, (@owainkenway) UCL/ISD/RITS/RCAS/Team Leader

“Scientific Computing?” ● “Doing science with computers” – Generating data → Simulation – Analysing data → Filtering, statistical analysis… – Theorising about data → Machine learning/AI? ● Not just science – Arts/humanities → “Research Computing”

About Me ● Been at UCL since 2005 (Computational Chemistry PhD) ● Spent the last 8 or so years working in Research Computing in ISD – Team Lead of Research Computing Applications and Support – Look after users and applications on UCL ISD managed resources + design those services

Contents ● Overview of HPC/HTC services at UCL ● Version Control ● Publishing Code (+ Data) ● Pitfalls 4 contentious statements

UCL Research Computing Resources ● Parallel – Single job spans multiple nodes – Tightly coupled parallelisation usually in MPI – Sensitive to network performance – Currently primarily chemistry, physics, engineering ● High throughput – Lots (tens of thousands) of independent jobs on different data – High I/O – Currently, primarily biosciences and physics – In the future, digital humanities ● UCL only services: – Grace → High Performance Computing (HPC) – Myriad, Legion → High Throughput Computing (HTC) – Aristotle → Interactive teaching Linux service ● National services: – Thomas (Tier 2 MMM hub) – Michael (Faraday Battery Institute)

HPC: Many processes on many processors work simultaneously and communicate with each other. [Diagram: Input Data → processes → Output Data]

HTC: Many processes operate independently of each other and in any order. [Diagram: Input Data → processes → Output Data]
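As a toy illustration of the HTC pattern (independent tasks that can finish in any order), here is a minimal Python sketch. The `analyse` function and the input chunks are hypothetical, and a thread pool stands in for what would be separate scheduler jobs on a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyse(chunk):
    """Hypothetical per-task work: each call depends only on its own input."""
    return sum(x * x for x in chunk)


def run_independent_tasks(chunks, max_workers=4):
    """Run every task independently and collect results keyed by task index.

    Tasks never talk to each other, so they may finish in any order.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyse, chunk): i for i, chunk in enumerate(chunks)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_independent_tasks([[1, 2, 3], [4, 5], [6]]))
```

An HPC job differs in that the processes would exchange data while running (usually via MPI), so it cannot be split up this way.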

The what + why of version control ● Version control systems are tools that let you keep track of who changed a file or set of files, when, and what they changed. – If you are collaborating, they let you all work on a project and share changes in a structured way. – If you are working on a long term project (e.g. your PhD thesis!), they help you keep a record of what you did and when (and get old versions back). ● Many available, many types – from very basic (e.g. “track changes”) to very advanced decentralised systems.

Git and Github ● Git is an Open Source (GPL) command line tool originally written by Linus Torvalds. – But there are lots of graphical tools available that “talk git” – “Decentralised” - i.e. every person working on a repository has their own copy ● Github is a centralised service for hosting, sharing and contributing to git repositories of open source code – A sort of “social network” for coding – Free for public repositories – Recently bought by Microsoft! [Image: “Octocat”, Github’s cute mascot]

Github is an interesting place to explore ● It’s the default for RITS (including RSD) at UCL, e.g. – https://github.com/UCL/i_newspaper_rods - software to run queries over the British Museum’s Times Digital archive. – https://github.com/UCL-RITS/rcps-buildscripts/ - all the installation management for UCL RC services (and where you can request new software). ● Code for all sorts and scales of projects, inc. big companies like Microsoft, Valve...

Setting up git/Github ● Depends on whether you are using Linux, Mac or Windows! – Linux: often already installed, or install from your package manager – Mac: install from the Xcode developer tools – Windows: a lot more complicated, pick an option from: Command-line tools: https://git-scm.com/downloads or GUI choices: https://git-scm.com/downloads/guis ● Set up name and email in the client ● IF you want to use Github, register a Github account – More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/
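For the command-line client, "set up name and email" usually means two `git config --global` calls. A minimal sketch in Python follows. The helper names and the placeholder identity are assumptions for illustration, and the example only prints the commands instead of changing your configuration.

```python
import shlex
import subprocess


def git_identity_commands(name, email):
    """Build the two `git config` commands that record who you are."""
    return [
        ["git", "config", "--global", "user.name", name],
        ["git", "config", "--global", "user.email", email],
    ]


def set_git_identity(name, email):
    """Run the commands; raises CalledProcessError if git reports a failure."""
    for command in git_identity_commands(name, email):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder identity: show the commands rather than running them.
    for command in git_identity_commands("Ada Example", "ada@example.org"):
        print(shlex.join(command))
```

Calling `set_git_identity(...)` with your own details would apply the settings, assuming git is installed and on your PATH.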

But overall... ● You don’t have to like git or even use it: – Other version control systems are available (SVN, CVS...) – Anything is better than nothing – what is important is to have a good automated way of tracking what you did when and getting back “that” version of the code. – Find out if your research group already uses a version control tool and use that. – Similarly there are Github alternatives for collaboration like BitBucket. ● Anything that’s a text file(*) can go into version control – this includes LaTeX source if you use that for your thesis/papers. (*) Binary files can go in but you can’t see the difference between versions as easily

Aside: Code: Application vs Method ● Applications – Packaged as “ready for other people” – Works on machines other than the developer’s: no hard coding of paths, “sensible” install process, works on arbitrary dataset – Used directly by other people for work ● Method – “What I did” – Really a part of the write-up – Probably hard-coded to work with one dataset, in the few environments available to the user – Jupyter notebooks etc. – Inspire other people’s work

Publishing Code + Data First contentious statement: IF your research is publicly funded it is your moral obligation to make your code and data available to outsiders under a reasonable license. Increasingly, funding councils agree.

Publishing Code + Data: Motivations ● Citations: IF you license appropriately, get citations for free! ● Collaborations: IF you are willing to, potential collaborators will come to you! ● “Reproducibility” + finding errors: IF you are not evil, this is good.

Reproducibility Second contentious statement: “Reproducibility” in scientific computing has been hijacked by software engineers. ● Overwhelming focus on bit-perfect reproduction of results: – Containers/VMs to exactly reproduce environment. → Doesn’t work anyway because of hardware. – Only actually of use forensically (of course useful for moving your software about, which is a separate issue)

Reproducibility ● Relatively little focus on whether the general method is stable: – If your method only works with a particular compiler/MKL/whatever version then it may be a bug, not a valid result. – (Related) If your code stops working because a language feature is deprecated then expecting that old version to be available for the lifetime of your research is a bad idea – update your working version of your code.
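One way to see why bit-perfect agreement is a fragile target: floating-point addition is not associative, so summing the same numbers in a different order (as a different compiler or library might) can change the last bits. The sketch below compares results within a tolerance instead. The `agree` helper and its default tolerance are assumptions for illustration; an appropriate tolerance depends on your method.

```python
import math


def agree(reference, candidate, rel_tol=1e-9):
    """Compare two results within a relative tolerance instead of bit-for-bit."""
    return math.isclose(reference, candidate, rel_tol=rel_tol)


# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

if __name__ == "__main__":
    print(left_to_right == right_to_left)       # bit-perfect comparison
    print(agree(left_to_right, right_to_left))  # tolerance-based comparison
```

If two environments disagree by far more than rounding can explain, that points to the kind of method instability the slide warns about.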

Publishing Code: “Do” Things to think about: ● What you are publishing – is it an “Application” or is it a “Method”? – Set expectations for users in the documentation ● License – in order for people to actually use your software! ● Versions: – Keep old versions online and distinguish them i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz” – Tag releases (if on Github etc.). – When publishing results say which version/tag you used! ● DO THE SAME FOR DATA SETS!
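One lightweight way to make "say which version/tag you used" automatic is to store the version alongside the results. This is a minimal sketch only: the field names, the JSON layout and the example version strings are assumptions, not a standard.

```python
import json
import platform


def provenance(code_version, data_version):
    """Describe which code and data versions produced a result, and where."""
    return {
        "code_version": code_version,   # e.g. a release tag such as "myprg-1.2.3"
        "data_version": data_version,   # hypothetical data set version label
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


def save_results(path, results, code_version, data_version):
    """Write results together with their provenance as a single JSON file."""
    record = {
        "provenance": provenance(code_version, data_version),
        "results": results,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    return record


if __name__ == "__main__":
    print(json.dumps(provenance("myprg-1.2.3", "dataset-v2"), indent=2))
```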

Pitfalls ● Ritual ● The “things I did” explosion ● Obsessing over performance/not caring enough ● Designing experiments based on the contents of slide decks

Ritual Third contentious statement: – IF you are publishing research you should know how your results are generated. – i.e. it’s not enough just to plug some data into a black box whose workings you do not understand and then publish the results. THIS DOES NOT MEAN YOU NEED TO UNDERSTAND THE COMPUTER DOWN TO THE MICROCODE!

Avoiding Ritual ● Read up on the software you are using. ● Think about its limitations: – Is its output deterministic or is there a random element? – Where does that algorithm break down? – What sort of machine can I run this on? ● Think about how it might be applied to your problem: – Am I actually using this software appropriately? – What data requirements do I need to think about? ● Think about your results: – Are they reasonable?
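The "deterministic or random?" question can be checked directly. In the sketch below, `noisy_mean` is a hypothetical analysis with a random element. Fixing the seed makes repeated runs agree, and leaving it unset shows how much the answer can move between runs.

```python
import random


def noisy_mean(values, seed=None, noise=0.01):
    """Hypothetical analysis with a random element: the mean plus small noise."""
    rng = random.Random(seed)  # seed=None draws a fresh seed on every call
    return sum(values) / len(values) + rng.gauss(0.0, noise)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    # Same seed: the two runs are identical.
    print(noisy_mean(data, seed=42) == noisy_mean(data, seed=42))
    # No seed: runs will generally differ slightly.
    print(noisy_mean(data), noisy_mean(data))
```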

It’s just a model... This all dovetails neatly together into a larger problem: When we simulate things, we are just building a model. Models have limits! Computer models are not the only models: ● Animal models ● Building an actual model ● Theoretical models JUST BECAUSE THE COMPUTER MODEL SAYS IT IS TRUE DOESN’T MAKE IT SO!

It’s just a model... Real life... Physical scale model... Computer model….

The “things I did” explosion ● This is not unique to scientific computing but is encouraged by the way we use computers. ● It can be tempting to try a lot of unplanned things on our input data and see what “works”: “I’ll just run it through X and see...” – This can be difficult to track. – This can be dangerously close to “p-hacking” when analysing data... ● Always record what you did even if you didn’t plan it. – Version control helps (particularly if you are modifying code)
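Recording unplanned steps does not need heavy tooling. As a minimal sketch, assuming a plain-text log with one JSON object per line (the file layout and field names are illustrative choices, not a convention from the talk):

```python
import json
from datetime import datetime, timezone


def log_step(log_path, description, **details):
    """Append one timestamped entry describing what was just done."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": description,
        "details": details,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    # Hypothetical exploratory step, recorded even though it was not planned.
    print(log_step("things_i_did.jsonl", "ran input through X", threshold=0.5))
```

Keeping the log file itself under version control ties each entry to the state of the code at the time.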

Performance Performance is important. But what is important is the time to get to a meaningful solution, not the performance of code alone. ● There’s no point in learning C to make one job that takes 48 hours run in 4 hours. – But maybe if you have to run 10000 of them? ● Obsessive optimisation is madness. ● It’s completely worth slightly modifying your code to make it run 10x as fast.
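The 48 hours versus 4 hours example is a break-even calculation. Here is a rough sketch using the slide's numbers. The 200 hours of effort to learn C and rewrite the code is an assumed figure, and treating compute hours and human hours as interchangeable is a simplification.

```python
def hours_saved(jobs, slow_hours=48.0, fast_hours=4.0):
    """Total run time saved across all jobs by using the faster version."""
    return jobs * (slow_hours - fast_hours)


def worth_optimising(jobs, effort_hours, slow_hours=48.0, fast_hours=4.0):
    """True when the run time saved exceeds the effort spent optimising."""
    return hours_saved(jobs, slow_hours, fast_hours) > effort_hours


if __name__ == "__main__":
    effort = 200.0  # assumed hours to learn C and rewrite the code
    for jobs in (1, 10000):
        print(jobs, hours_saved(jobs), worth_optimising(jobs, effort))
```

With these assumptions a single job saves 44 hours, well short of the effort, while 10000 jobs save 440000 hours.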

Experiment design by slide deck Supervisor: “Hey, I went to this conference and saw a really interesting presentation by Prof. X’s group on this application they have developed and you should try using it on our problem”
