Generation 3 Middleware
overview: (some of) the same slides as last time, almost
status: not the same slides as last time!
What's in the Gen3 Middleware?
The PipelineTask Framework (formerly the SuperTask Framework): a replacement for CmdLineTask
● Pipelines are lists of PipelineTasks.
● Preflight turns a Pipeline and a Data ID Expression into a directed acyclic QuantumGraph whose nodes are Datasets or Quanta.
● Execution runs the processing described by a QuantumGraph (possibly in parallel, possibly at scale).
LSST2018
What's in the Gen3 Middleware?
The Generation 3 Butler: a replacement for the current Butler
● Almost the same get and put interface for reading and writing Datasets.
● Predefined data model for Data ID keys, called DataUnits, that lets us capture the relationships between different kinds of data.
● New interfaces for subsets and other expressions.
● Very different data repository concept.
● Totally different implementation.
What's a PipelineTask?
Just a CmdLineTask, with a few modifications:
● configurable, introspectable input/output DatasetTypes
● a runQuantum method that does the work
Unlike CmdLineTasks, PipelineTasks don't run themselves or parse arguments; they provide enough information for something else to run them.
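A minimal sketch of that division of labor, assuming a toy base class. `ToyConfig`, `ToyPipelineTask`, the `get`/`put` callables, and the `runQuantum` signature used here are all hypothetical; only the ideas (configurable dataset types, introspection, a `runQuantum` that is called by something else) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical config: the dataset type names are configurable."""
    input_type: str = "raw"
    output_type: str = "PVI"


class ToyPipelineTask:
    """Hypothetical stand-in for a PipelineTask, not the real base class."""

    def __init__(self, config):
        self.config = config

    def input_dataset_types(self):
        # Introspectable: a framework can ask what the task reads...
        return [self.config.input_type]

    def output_dataset_types(self):
        # ...and what it writes, without running anything.
        return [self.config.output_type]

    def runQuantum(self, data_id, get, put):
        # The task never parses arguments or opens a repository;
        # the caller hands it callables for reading and writing.
        raw = get(self.config.input_type, data_id)
        put(self.config.output_type, data_id, self.run(raw))

    def run(self, raw):
        # Placeholder "algorithm" that works purely on in-memory objects.
        return [2 * x for x in raw]


# "Something else" runs the task: here, a few lines of driver code.
store = {("raw", 320, 10): [1, 2, 3]}


def get(name, data_id):
    return store[(name, data_id["visit"], data_id["sensor"])]


def put(name, data_id, obj):
    store[(name, data_id["visit"], data_id["sensor"])] = obj


task = ToyPipelineTask(ToyConfig())
task.runQuantum({"visit": 320, "sensor": 10}, get, put)
```

Because the task only declares what it needs, the same class could be driven by a laptop executor or a batch system without changes.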
What's a Quantum?
An indivisible, independent unit of processing. A Quantum has:
● A PipelineTask that executes it.
● A collection of input Datasets (subdivided into predicted and actual).
● A collection of output Datasets.
These naturally form a directed acyclic graph (DAG), which we call the QuantumGraph.
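One way to picture how Quanta "naturally form" a DAG is the sketch below. It assumes hypothetical `DatasetRef` and `Quantum` dataclasses and ignores the predicted/actual distinction for inputs; an edge exists wherever one Quantum's output is another's input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical dataset handle: a type name plus a data ID."""
    dataset_type: str
    data_id: tuple  # sorted (key, value) pairs, so the ref is hashable


@dataclass(frozen=True)
class Quantum:
    """Hypothetical Quantum: a task name plus its input and output datasets."""
    task: str
    inputs: frozenset
    outputs: frozenset


def ref(dataset_type, **data_id):
    return DatasetRef(dataset_type, tuple(sorted(data_id.items())))


def dependencies(quanta):
    """Map each quantum to the quanta that produce its inputs."""
    producer = {out: q for q in quanta for out in q.outputs}
    return {q: {producer[i] for i in q.inputs if i in producer} for q in quanta}


process = Quantum(
    "ProcessCcd",
    inputs=frozenset({ref("raw", visit=320, sensor=10)}),
    outputs=frozenset({ref("PVI", visit=320, sensor=10)}),
)
warp = Quantum(
    "MakeWarps",
    inputs=frozenset({ref("PVI", visit=320, sensor=10)}),
    outputs=frozenset({ref("warp", visit=320, tract=12, patch=50)}),
)

# MakeWarps depends on ProcessCcd because it reads the PVI that ProcessCcd writes.
deps = dependencies([process, warp])
```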
QuantumGraph
[Figure: example QuantumGraph. Quanta: quantum 1 (Task=ProcessCcd, visit=320, sensor=10), quantum 2 (Task=ProcessCcd, visit=321, sensor=10), quantum 3 (Task=JointCal), quantum 4 (Task=MakeWarps, visit=320, tract=12, patch=50), quantum 5 (Task=MakeWarps, visit=321, tract=12, patch=50), quantum 6 (Task=AssembleCoadd, tract=12, patch=50, filter=r). Datasets: raw, PVI, src, and WCS for visit=320 and visit=321 (sensor=10); flat (begin=2019-04-10, sensor=10, filter=r); warp for visit=320 and visit=321 (tract=12, patch=50); coadd (tract=12, patch=50, filter=r).]
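As an illustration of what execution can do with such a graph, the sketch below groups the six quanta from the figure into batches that could run in parallel, using the standard-library `graphlib`. The dependency edges are my reading of the figure (ProcessCcd feeds JointCal and MakeWarps, JointCal feeds MakeWarps, MakeWarps feeds AssembleCoadd) and should be treated as an assumption.

```python
from graphlib import TopologicalSorter

# Quantum number -> task name, as labeled in the figure.
tasks = {
    1: "ProcessCcd", 2: "ProcessCcd", 3: "JointCal",
    4: "MakeWarps", 5: "MakeWarps", 6: "AssembleCoadd",
}

# Quantum number -> quanta whose outputs it consumes (assumed edges).
depends_on = {
    1: set(),
    2: set(),
    3: {1, 2},   # JointCal reads src from both ProcessCcd quanta
    4: {1, 3},   # MakeWarps reads a PVI and a WCS
    5: {2, 3},
    6: {4, 5},   # AssembleCoadd reads both warps
}


def parallel_batches(depends_on):
    """Return lists of quanta; everything in one list can run concurrently."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(depends_on)
for batch in batches:
    print([f"quantum {n} ({tasks[n]})" for n in batch])
```

A real executor would hand each ready quantum to a worker and call `done` as results arrive, rather than waiting for a whole batch.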
What's a Pipeline?
A list of PipelineTask classes and their Config instances. It's conceptually a graph as well (though it will behave like a flat collection in Python).
[Figure: Pipeline graph linking the tasks ProcessCcd, JointCal, MakeWarps, and AssembleCoadd through the dataset types raw, flat, PVI, src, WCS, warp, and coadd.]
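A small sketch of "flat in Python, conceptually a graph": the Pipeline is an ordinary list, and the graph is derived from which dataset types each task reads and writes. The dict-based configs and the exact inputs and outputs per task are assumptions based on the figure, not the real Config classes.

```python
# Hypothetical Pipeline: a flat list of (task name, config) pairs.
pipeline = [
    ("ProcessCcd", {"inputs": ["raw", "flat"], "outputs": ["PVI", "src"]}),
    ("JointCal", {"inputs": ["src"], "outputs": ["WCS"]}),
    ("MakeWarps", {"inputs": ["PVI", "WCS"], "outputs": ["warp"]}),
    ("AssembleCoadd", {"inputs": ["warp"], "outputs": ["coadd"]}),
]


def task_graph(pipeline):
    """Map each task to the tasks that produce its input dataset types."""
    producer = {ds: name for name, cfg in pipeline for ds in cfg["outputs"]}
    return {
        name: sorted({producer[ds] for ds in cfg["inputs"] if ds in producer})
        for name, cfg in pipeline
    }


graph = task_graph(pipeline)
```

The order of the list does not matter here: the edges come entirely from the dataset types, so reversing `pipeline` yields the same graph.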
How it all works
[Diagram: a Data ID Expression and a Pipeline (PipelineTask 1..N, each with its Config 1..N) go into Preflight, which uses the Registry and a Preflight Solver to produce a QuantumGraph (Quantum 1..M). Execution hands the QuantumGraph to an Executor, which uses Executor Utilities and Butler(s). Legend: PipelineTask Core Framework, Common Extensions, Configuration, Operator Inputs, Gen3 Butler Components.]
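To make the Preflight step concrete, here is a deliberately simplified sketch. It assumes each task is described only by the data ID keys that define one of its quanta, and that the Data ID Expression has already been evaluated into a list of rows; the real solver queries the Registry, and none of these names come from the actual framework.

```python
# Hypothetical Pipeline: task name -> data ID keys that identify one quantum.
pipeline = [
    ("ProcessCcd", ("visit", "sensor")),
    ("MakeWarps", ("visit", "tract", "patch")),
    ("AssembleCoadd", ("tract", "patch")),
]

# Stand-in for the rows a Data ID Expression would select from the Registry.
rows = [
    {"visit": 320, "sensor": 10, "tract": 12, "patch": 50},
    {"visit": 321, "sensor": 10, "tract": 12, "patch": 50},
]


def preflight(pipeline, rows):
    """Create one quantum per task per distinct projection of the rows."""
    quanta = []
    for task, keys in pipeline:
        seen = set()
        for row in rows:
            data_id = tuple((k, row[k]) for k in keys)
            if data_id not in seen:
                seen.add(data_id)
                quanta.append((task, dict(data_id)))
    return quanta


quanta = preflight(pipeline, rows)
for task, data_id in quanta:
    print(task, data_id)
```

Two visits on the same patch give two ProcessCcd quanta, two MakeWarps quanta, and a single AssembleCoadd quantum, which mirrors the shape of the example QuantumGraph.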
Gen3 Butler Components
Master Repository: stores and manages Datasets
Registry: SQL database that manages metadata
● groups Datasets into Collections
● labels Datasets with DataUnits
● defines DatasetTypes dynamically
Datastore:
● stores Datasets
● understands file formats
Butler: high-level interface to the Datasets in a single Collection
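Butler.get
[Diagram: the Collection (RC_w_2018_42) comes from the Butler; the DatasetType (PVI) and the DataUnits (camera=HSC, sensor=30) come from the user. The Registry maps these to a dataset_id, and the Datastore uses the dataset_id to return a Python Object.]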
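Butler.put
[Diagram: the Collection (RC_w_2018_42) comes from the Butler; the DatasetType (PVI), the DataUnits (camera=HSC, sensor=30), and the Python Object come from the user. The Registry records the Dataset and provides a dataset_id, under which the Datastore stores the Python Object.]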
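The get and put flows can be sketched with toy classes, as below. `ToyRegistry`, `ToyDatastore`, and `ToyButler` are hypothetical in-memory stand-ins written for this illustration; they show only the split of responsibilities (Registry resolves a dataset_id, Datastore holds the object, Butler fixes the Collection), not the real implementation.

```python
import itertools


class ToyRegistry:
    """Hypothetical Registry: maps (collection, type, data ID) to a dataset_id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._rows = {}

    def add(self, collection, dataset_type, data_id):
        dataset_id = next(self._ids)
        self._rows[(collection, dataset_type, frozenset(data_id.items()))] = dataset_id
        return dataset_id

    def find(self, collection, dataset_type, data_id):
        return self._rows[(collection, dataset_type, frozenset(data_id.items()))]


class ToyDatastore:
    """Hypothetical Datastore: holds objects keyed by dataset_id."""

    def __init__(self):
        self._objects = {}

    def put(self, dataset_id, obj):
        self._objects[dataset_id] = obj

    def get(self, dataset_id):
        return self._objects[dataset_id]


class ToyButler:
    """Hypothetical Butler: a view of a single Collection."""

    def __init__(self, registry, datastore, collection):
        self.registry = registry
        self.datastore = datastore
        self.collection = collection  # supplied by the Butler, not the caller

    def put(self, obj, dataset_type, data_id):
        dataset_id = self.registry.add(self.collection, dataset_type, data_id)
        self.datastore.put(dataset_id, obj)

    def get(self, dataset_type, data_id):
        dataset_id = self.registry.find(self.collection, dataset_type, data_id)
        return self.datastore.get(dataset_id)


registry, datastore = ToyRegistry(), ToyDatastore()
butler = ToyButler(registry, datastore, collection="RC_w_2018_42")
butler.put("pixels", "PVI", {"camera": "HSC", "sensor": 30})
pvi = butler.get("PVI", {"camera": "HSC", "sensor": 30})
```

The caller never names the Collection or the dataset_id; a second `ToyButler` bound to a different Collection would not find the same Dataset.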
Gen3 Butler Design Principles
Make our data model explicit, as a SQL schema.
● Different cameras are mapped to our data model; we don't map our processing to theirs.
● Predefine the set of allowed keys for Data ID dicts ("DataUnits").
● The relationships between DataUnits are part of the data model:
○ Many-to-One: e.g. a Visit has a PhysicalFilter.
○ Many-to-Many: e.g. a Visit-Sensor combination spatially overlaps several Tract-Patch combinations.
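The two kinds of relationship could be expressed in SQL roughly as follows. This is a sketch using Python's built-in `sqlite3`; the table and column names are invented for illustration and are not the actual Gen3 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Many-to-one: each visit row points at exactly one physical filter.
CREATE TABLE physical_filter (name TEXT PRIMARY KEY);
CREATE TABLE visit (
    visit INTEGER PRIMARY KEY,
    physical_filter TEXT NOT NULL REFERENCES physical_filter(name)
);

-- Many-to-many: a join table of visit-sensor / tract-patch overlaps.
CREATE TABLE visit_sensor_patch_overlap (
    visit INTEGER NOT NULL REFERENCES visit(visit),
    sensor INTEGER NOT NULL,
    tract INTEGER NOT NULL,
    patch INTEGER NOT NULL,
    PRIMARY KEY (visit, sensor, tract, patch)
);

INSERT INTO physical_filter VALUES ('r');
INSERT INTO visit VALUES (320, 'r'), (321, 'r');
INSERT INTO visit_sensor_patch_overlap VALUES
    (320, 10, 12, 50), (320, 10, 12, 51), (321, 10, 12, 50);
""")

# Which patches does visit 320, sensor 10 overlap?
patches = conn.execute(
    "SELECT tract, patch FROM visit_sensor_patch_overlap "
    "WHERE visit = ? AND sensor = ? ORDER BY tract, patch",
    (320, 10),
).fetchall()

# Which visits contribute to tract 12, patch 50?
visits = conn.execute(
    "SELECT DISTINCT visit FROM visit_sensor_patch_overlap "
    "WHERE tract = ? AND patch = ? ORDER BY visit",
    (12, 50),
).fetchall()
```

With the relationships in tables, questions such as "which warps feed this coadd" become ordinary queries instead of camera-specific code.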
Gen3 Butler Design Principles
Separate Camera-specialization from filename mapping.
● Mappers will no longer exist.
● Cameras are themselves DataUnits, and obs_* packages are responsible for defining other DataUnits associated with Cameras.
● When we have files, filename templates will be customizable in the Butler client config at the DatasetType level and the Camera level.
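A sketch of what two-level template customization might look like. The config layout, the use of `str.format` syntax, and the rule that a Camera-level entry overrides a DatasetType-level one are all assumptions made for this example; the slide only says both levels will be configurable.

```python
# Hypothetical client config; the real template syntax may differ.
templates = {
    "default": "{collection}/{dataset_type}/{visit}-{sensor}",
    "by_dataset_type": {
        "PVI": "{collection}/pvi/v{visit}/s{sensor}.fits",
    },
    "by_camera": {
        "HSC": {"src": "{collection}/{camera}/src/{visit:07d}-{sensor:03d}.fits"},
    },
}


def lookup_template(templates, dataset_type, camera):
    """Pick the most specific template (assumed order: camera, type, default)."""
    per_camera = templates["by_camera"].get(camera, {})
    if dataset_type in per_camera:
        return per_camera[dataset_type]
    return templates["by_dataset_type"].get(dataset_type, templates["default"])


def filename(templates, collection, dataset_type, data_id):
    template = lookup_template(templates, dataset_type, data_id.get("camera"))
    return template.format(collection=collection, dataset_type=dataset_type, **data_id)


data_id = {"camera": "HSC", "visit": 320, "sensor": 30}
pvi_path = filename(templates, "RC_w_2018_42", "PVI", data_id)
src_path = filename(templates, "RC_w_2018_42", "src", data_id)
raw_path = filename(templates, "RC_w_2018_42", "raw", data_id)
```

Nothing camera-specific lives in code here: changing a path layout means editing config, which is the point of dropping Mappers.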
Gen3 Butler Design Principles
Define Collections in the database, not the filesystem.
● Multiple cameras, runs, etc. in a single data repository.
● Datasets may be associated with multiple Collections, which are just database tags.
● A Butler instance retrieves Datasets from a single Collection.
● The Collection for a processing run typically includes its inputs and its outputs (simulates Gen2 repository chaining).
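"Collections are just database tags" can be illustrated with a two-table sketch in `sqlite3`. The table names and the `raw/HSC` collection name are hypothetical; `RC_w_2018_42` is the collection name used in the Butler.get and Butler.put slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    dataset_type TEXT NOT NULL,
    visit INTEGER,
    sensor INTEGER
);

-- A Collection is nothing more than a tag attached to dataset rows.
CREATE TABLE dataset_collection (
    dataset_id INTEGER NOT NULL REFERENCES dataset(dataset_id),
    collection TEXT NOT NULL,
    PRIMARY KEY (dataset_id, collection)
);

INSERT INTO dataset VALUES (1, 'raw', 320, 10), (2, 'PVI', 320, 10);

-- The raw is tagged twice: once as ingested, once as an input of the run.
INSERT INTO dataset_collection VALUES
    (1, 'raw/HSC'), (1, 'RC_w_2018_42'), (2, 'RC_w_2018_42');
""")


def datasets_in(collection):
    """List the dataset types visible through one Collection."""
    rows = conn.execute(
        "SELECT d.dataset_type FROM dataset AS d "
        "JOIN dataset_collection AS c ON d.dataset_id = c.dataset_id "
        "WHERE c.collection = ? ORDER BY d.dataset_id",
        (collection,),
    )
    return [row[0] for row in rows]


run_view = datasets_in("RC_w_2018_42")
raw_view = datasets_in("raw/HSC")
```

Tagging the input raw into the run's Collection is what lets a single-Collection Butler see both inputs and outputs, without copying or linking files.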
Schema: DataUnits
[Schema diagram of the DataUnit tables.]
Also, without tables:
● ExposureRange
● SkyPix
● Label
Schema: Dataset
[Schema diagram of the Dataset table, including its DataUnit fields.]
Not shown: provenance tables (details are just distracting at this level)
Gen3 Butler Registry Status
SQLite Registry is basically working for low-level operations:
● registering DatasetTypes
● registering Cameras and SkyMaps
● get / put / single-file ingest
● single-Camera, single-SkyMap expressions for Preflight
● transactions
What's missing:
● Subsets/Transfers (copying Datasets from one repo to another).
● Provenance is nominally there, but hasn't really been tested.
● Details of the schema have not been finalized.
Gen3 Butler Datastore Status
PosixDatastore is basically working for low-level operations:
● get / put
● single-file ingest
● composites
ChainedDatastore and InMemoryDatastore have been prototyped; they need Registry support to really be evaluated.
What's missing:
● testing, polishing, edge-case support
● more concrete Datastore implementations
Gen3 Butler Status: Repo Conversion
We already have a script to (non-destructively) create a snapshot Gen3 view into a set of Gen2 repos.
It's actually running in ci_hsc: after the (Gen2-based) processing completes, we make a Gen3 repo and re-run all of the checks.
It doesn't do everything yet:
● Raw data isn't transformed properly on read.
● The script doesn't work on calibration repos yet.
Why we can't all use Gen3 Butler yet
It's integration time!
● obs_ package integration is totally different (no Mappers!), and still being prototyped.
● There are a lot of hacks in obs_ packages we don't want to propagate to Gen3; that means we need to actually fix those issues properly.
● It's not quite a drop-in replacement, so our CmdLineTasks can't use it directly.
PipelineTask Status
The base class is ready to be used, but many of our concrete CmdLineTasks aren't ready to use it, so we're getting them in shape first (e.g. run -> runDataRef, butlers in LoadReferencesTask).
Preflight and the "laptop" execution framework are working in simple tests with toy-example PipelineTasks. But no one has "run them in anger."
What We're Doing Now
● About to start writing PipelineTask versions of existing CmdLineTasks.
○ PipelineTask versions will use Gen3 Butler; CmdLineTask versions will continue to use Gen2.
● Putting together raw ingest hooks and customizations in obs_ packages.
● Standing up first versions of production Registries and Datastores.
What We're Doing Now
Overall: The basic pieces are there, but we need to use them to learn how they need to be improved.
To be able to use them, we need:
● concrete PipelineTasks that aren't just toy examples
● real obs_ packages with Gen3 integration