Some Ideas for “Best Practice” in Scientific Computing
Dr Owain Kenway (@owainkenway), UCL/ISD/RITS/RCAS, Team Leader
“Scientific Computing?”
● “Doing science with computers”
  – Generating data → Simulation
  – Analysing data → Filtering, statistical analysis…
  – Theorising about data → Machine learning/AI?
● Not just science
  – Arts/humanities → “Research Computing”
About Me
● Been at UCL since 2005 (Computational Chemistry PhD)
● Spent the last 8 or so years working in Research Computing in ISD
  – Team Lead of Research Computing Applications and Support
  – Look after users and applications on UCL ISD managed resources + design those services
Contents
● Overview of HPC/HTC services at UCL
● Version Control
● Publishing Code (+ Data)
● Pitfalls
(Plus 4 contentious statements along the way)
UCL Research Computing Resources
● Parallel
  – Single job spans multiple nodes
  – Tightly coupled parallelisation, usually in MPI
  – Sensitive to network performance
  – Currently primarily chemistry, physics, engineering
● High throughput
  – Lots (tens of thousands) of independent jobs on different data
  – High I/O
  – Currently primarily biosciences and physics
  – In the future, digital humanities
● UCL only services:
  – Grace → High Performance Computing (HPC)
  – Myriad, Legion → High Throughput Computing (HTC)
  – Aristotle → Interactive teaching Linux service
● National services:
  – Thomas (Tier 2 MMM hub)
  – Michael (Faraday Battery Institute)
HPC
Many processes on many processors work simultaneously and communicate with each other.
[Diagram: Input Data → many communicating parallel processes → Output Data]
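For illustration only, here is a minimal sketch of a tightly coupled HPC-style program, written here with the mpi4py Python bindings (an assumption – real MPI codes on these services are written in many languages). Each process does part of the work, then they communicate:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID
    size = comm.Get_size()   # total number of processes

    # Each process computes a partial result on its share of the work...
    partial = float(rank)

    # ...then the processes communicate: a reduction sums the
    # partial results onto process 0.
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"Sum over {size} processes: {total}")

Run with e.g. “mpirun -np 4 python hello_mpi.py” – all processes must run at once, which is why HPC jobs are sensitive to network performance.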
HTC
Many processes operate independently of each other and in any order.
[Diagram: Input Data → many independent processes → Output Data]
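By contrast, an HTC workload is often expressed as a scheduler “array job”: the same script runs many times, each on its own input, in any order. A hypothetical Sun Grid Engine sketch (the script and file names are made up – check your local service’s documentation):

    #!/bin/bash
    # Request an array of 10000 independent tasks
    #$ -t 1-10000

    # Each task analyses its own input file; tasks never talk to each other
    ./analyse input_${SGE_TASK_ID}.dat > output_${SGE_TASK_ID}.dat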
The what + why of version control
● Version control systems are tools that let you keep track of who changed a file or set of files, when, and what they changed.
  – If you are collaborating, they let you all work on a project and share changes in a structured way.
  – If you are working on a long-term project (e.g. your PhD thesis!) they help you keep a record of what you did and when (and get old versions back).
● Many are available, of many types – from very basic (e.g. “track changes”) to very advanced decentralised systems.
Git and Github
● Git is an Open Source (GPL) command line tool originally written by Linus Torvalds.
  – But there are lots of graphical tools available that “talk git”
  – “Decentralised” – i.e. every person working on a repository has their own copy
● Github is a centralised service for hosting, sharing and contributing to git repositories of open source code
  – A sort of “social network” for coding
  – Free for public repositories
  – Recently bought by Microsoft!
[Image: “Octocat”, Github’s cute mascot]
Github is an interesting place to explore
● It’s the default for RITS (including RSD) at UCL, e.g.:
  – https://github.com/UCL/i_newspaper_rods – software to run queries over the British Museum’s Times Digital archive.
  – https://github.com/UCL-RITS/rcps-buildscripts/ – all the installation management for UCL RC services (and where you can request new software).
● Code for all sorts and scales of projects, inc. big companies like Microsoft, Valve...
Setting up git/Github
● Depends on whether you are using Linux, Mac or Windows!
  – Linux – often already installed, or install from your package manager
  – Mac – install from the Xcode developer tools
  – Windows – a lot more complicated; pick an option from:
    ● Command-line tools: https://git-scm.com/downloads
    ● GUI choices: https://git-scm.com/downloads/guis
● Set up your name and email in the client (see the sketch below)
● IF you want to use Github, register a Github account
  – More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/
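As a minimal sketch, setting your name and email from the command line looks like this (substitute your own details – they are recorded in every commit you make):

    # Tell git who you are
    git config --global user.name "Your Name"
    git config --global user.email "your.name@ucl.ac.uk"

    # Check what is currently configured
    git config --list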
But overall...
● You don’t have to like git or even use it:
  – Other version control systems are available (SVN, CVS...)
  – Anything is better than nothing – what is important is to have a good, automated way of tracking what you did when, and of getting back “that” version of the code.
  – Find out if your research group already uses a version control tool and use that.
  – Similarly, there are Github alternatives for collaboration, like BitBucket.
● Anything that’s a text file(*) can go into version control – this includes LaTeX source if you use that for your thesis/papers.
(*) Binary files can go in too, but you can’t see the difference between versions as easily.
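For example, a minimal git workflow for keeping a thesis chapter under version control might look like the sketch below (the file name and commit messages are illustrative):

    # Create a new repository in the current directory
    git init

    # Record a first version of the chapter
    git add chapter1.tex
    git commit -m "First draft of chapter 1"

    # ...edit chapter1.tex, then see what changed and record it...
    git diff chapter1.tex
    git commit -am "Rework introduction after supervisor comments"

    # Later: find and retrieve "that" old version
    git log --oneline
    git checkout <old-commit-id> -- chapter1.tex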
Aside: Code: Application vs Method
● Applications:
  – Packaged as “ready for other people”
  – Works on machines other than the developer’s:
    ● No hard coding of paths
    ● “Sensible” install process
    ● Works on an arbitrary dataset
  – Used directly by other people for work
● Method:
  – “What I did”
  – Really a part of the write-up:
    ● Probably hard-coded to work with one dataset, in the few environments available to the user
    ● Jupyter notebooks etc.
  – Inspires other people’s work
Publishing Code + Data
First contentious statement:
IF your research is publicly funded, it is your moral obligation to make your code and data available to outsiders under a reasonable license.
Increasingly, funding councils agree.
Publishing Code + Data: Motivations
● Citations: IF you license appropriately, you get citations for free!
● Collaborations: IF you are willing, potential collaborators will come to you!
● “Reproducibility” + finding errors: IF you are not evil, this is good.
Reproducibility
Second contentious statement:
“Reproducibility” in scientific computing has been hijacked by software engineers.
● Overwhelming focus on bit-perfect reproduction of results:
  – Containers/VMs to exactly reproduce the environment.
  – → Doesn’t work anyway because of hardware. Only actually of use forensically.
  – (Of course containers are useful for moving your software about, but that is a separate issue.)
Reproducibility
● Relatively little focus on whether the general method is stable:
  – If your method only works with a particular compiler/MKL/whatever version, then it may be a bug, not a valid result.
  – (Related) If your code stops working because a language feature is deprecated, then expecting that old version to be available for the lifetime of your research is a bad idea – update your working version of your code.
Publishing Code: “Do”
Things to think about:
● What you are publishing – is it an “Application” or is it a “Method”?
  – Set expectations for users in the documentation
● License – in order for people to actually use your software!
● Versions:
  – Keep old versions online and distinguish them, i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz”
  – Tag releases (if on Github etc.) – see the sketch below.
  – When publishing results, say which version/tag you used!
● DO THE SAME FOR DATA SETS!
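As a sketch of the tagging step (the version number and message are illustrative):

    # Annotated tag marking the release used for a set of results
    git tag -a v1.2.3 -m "Version 1.2.3, as used for the results in paper X"

    # Tags are not pushed by default, so push them explicitly
    git push origin v1.2.3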
Pitfalls
● Ritual
● The “things I did” explosion
● Obsessing over performance/not caring enough
● Designing experiments based on the contents of slide decks
Ritual
Third contentious statement:
– IF you are publishing research, you should know how your results are generated.
– i.e. it is not enough just to plug some data into a black box whose workings you do not understand and then publish the results.
THIS DOES NOT MEAN YOU NEED TO UNDERSTAND THE COMPUTER DOWN TO THE MICROCODE!
Avoiding Ritual
● Read up on the software you are using.
● Think about its limitations:
  – Is its output deterministic or is there a random element?
  – Where does that algorithm break down?
  – What sort of machine can I run this on?
● Think about how it might be applied to your problem:
  – Am I actually using this software appropriately?
  – What data requirements do I need to think about?
● Think about your results:
  – Are they reasonable?
It’s just a model...
This all dovetails neatly into a larger problem: when we simulate things, we are just building a model. Models have limits!
Computer models are not the only models:
● Animal models
● Building an actual model
● Theoretical models
JUST BECAUSE THE COMPUTER MODEL SAYS IT IS TRUE DOESN’T MAKE IT SO!
It’s just a model...
[Images: real life vs a physical scale model vs a computer model]
The “things I did” explosion
● This is not unique to scientific computing, but it is encouraged by the way we use computers.
● It can be tempting to try a lot of unplanned things on our input data and see what “works”: “I’ll just run it through X and see...”
  – This can be difficult to track.
  – This can be dangerously close to “p-hacking” when analysing data...
● Always record what you did, even if you didn’t plan it.
  – Version control helps (particularly if you are modifying code) – see the sketch below.
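One low-effort way to keep that record, assuming you are already using git for your analysis code, is to commit each exploratory step with a message saying what you tried (a sketch; the file name and message are illustrative):

    # Record even unplanned experiments as they happen
    git add analysis.py
    git commit -m "Exploratory: try 5-point moving average smoothing on input"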
Performance
Performance is important. But what is important is the time to get to a meaningful solution, not the performance of the code alone.
● There’s no point in learning C to make one job that takes 48 hours run in 4 hours.
  – But maybe there is if you have to run 10,000 of them?
● Obsessive optimisation is madness.
● It’s completely worth slightly modifying your code to make it run 10x as fast (see the sketch below).
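As an illustrative (and hypothetical) example of such a slight modification in Python: replacing an interpreted loop with the equivalent NumPy vectorised operation is a one-line change that often gives roughly an order-of-magnitude speed-up:

    import time
    import numpy as np

    data = np.random.rand(10_000_000)

    # Plain Python loop: interpreted, one element at a time
    start = time.perf_counter()
    total = 0.0
    for x in data:
        total += x * x
    loop_time = time.perf_counter() - start

    # The same sum of squares, vectorised: the loop runs in compiled code
    start = time.perf_counter()
    total_np = float(np.sum(data * data))
    numpy_time = time.perf_counter() - start

    print(f"loop: {loop_time:.2f}s  numpy: {numpy_time:.2f}s")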
Experiment design by slide deck
Supervisor: “Hey, I went to this conference and saw a really interesting presentation by Prof. X’s group on this application they have developed, and you should try using it on our problem.”