Applications of Bayesian Classification to Data Management
Christopher Lynnes, NASA/GSFC
Co-Authors: S. Berrick, A. Gopalan, X. Hua, S. Shen, P. Smith, K. Yang (NASA/GSFC); K. Wheeler, C. Curry (NASA/ARC)
6/30/04
Problem Statement
Science data volume keeps pace with technology.
Data demands are also increasing
• Lower latency: driven by applications
• Online access: driven by machine-to-machine interfaces (e.g., models)
• Volume: driven by advances in computing and data mining
• A solution is to manage data according to their “usefulness”.
Data Management Today: black-box paradigm
• Data are managed as largely opaque objects
  – albeit with labels (metadata) and “cover art” (browse)
[Diagram: Subscriptions, CACHE, Process, Store, Archive]
Content-based Data Management
• Subscription: “send data when my study area is clear”
• Subsetting: “give me just the clear pixels”
• Cache optimization: purge or keep
• Automatic quality assessment
[Diagram: Subscriptions, CACHE, Process, Store, Archive; clock symbol = time-critical]
Usefulness is in the eye of the beholder
Study Type vs. Pixel Characteristics (column headings include Clear-Sky):
• Cloud Properties: X
• Aerosols: X (X) X X
• Ocean Color: X
• Land Vegetation: X
• Snow Cover/Sea Ice: X
• Wildfires: X
Characterization of MODIS Calibrated Radiance
• Most popular product at Goddard DAAC
• Train algorithm to classify pixels
  – Cloud, glint, land, water, etc.
• Speed of the forward algorithm is critical.
  – However, we can afford time and CPU for training.
• Products from science algorithms train machine learning algorithms
  – Products as proxy for domain experts
  – Nearly unlimited supply of training and test data
  – Circular logic if we were making science products…
  – …but in the decision support domain, it serves as a high-speed approximator to the science algorithm.
Bayesian Classification Applied to MODIS Calibrated Radiance
• Bayesian classification:
  – $\Pr(C \mid E) = \dfrac{\prod_i \Pr(E_i \mid C) \times \Pr(C)}{\prod_i \Pr(E_i)}$
  – where $C$ is a class
  – and $E_i$ are measurements of independent variables (evidence).
  – $\Pr(C)$ is the prior probability
• Training: compute frequency histograms for $E_i \mid C$
  – MODIS cloudmask and ocean color products “train” the classifier.
[Figure: Frequency Histogram for Band 1. x-axis: Log(Calibrated Radiance), -0.5 to 2.5; y-axis: Percentage of Points in Class, 0% to 14%; classes: Cloud, Desert, Glint, Ice, Land, Water, Coast]
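The training step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `train_histograms`, the bin count, and the clamping of out-of-range values are assumptions, and the default range of -0.5 to 2.5 is simply taken from the x-axis of the Band 1 histogram figure. Each training sample is assumed to be a class label (as supplied by a science product) paired with one log-radiance value per band.

```python
def train_histograms(samples, n_bins=30, lo=-0.5, hi=2.5):
    """Estimate Pr(E_i | C) as per-class, per-band frequency histograms.

    samples: iterable of (class_label, [log-radiance for each band]).
    Returns {class_label: [histogram for band 0, histogram for band 1, ...]},
    where each histogram is a list of n_bins relative frequencies.
    """
    width = (hi - lo) / n_bins
    counts = {}  # class label -> one list of bin counts per band
    totals = {}  # class label -> number of training pixels seen
    for label, bands in samples:
        if label not in counts:
            counts[label] = [[0] * n_bins for _ in bands]
            totals[label] = 0
        for i, value in enumerate(bands):
            # Assumption: values outside [lo, hi) are clamped to the end bins.
            b = min(n_bins - 1, max(0, int((value - lo) / width)))
            counts[label][i][b] += 1
        totals[label] += 1
    # Normalize counts so each histogram sums to 1 within its class.
    return {
        label: [[c / totals[label] for c in hist] for hist in per_band]
        for label, per_band in counts.items()
    }
```

Because the labels come from existing products, the sample stream can be made as long as needed; only this training pass bears that cost, not the forward classification.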
Prior Probabilities
• Prior probabilities are “known” statistics for the Earth
  – Regional and seasonal variations
  – Derived from MODIS Level 3 gridded products
[Figure: December to February]
Practical Classification - Application
• For each class:
  – Look up the probability for each band measurement in frequency histograms
  – Compute product to get the overall probability for membership in that class
  – Choose the class with the highest overall probability
[Figure: Frequency Distribution of Band 1. x-axis: Log10(Radiance), 0 to 3; y-axis: 0% to 20%; classes: Cloud, Desert, Glint, Snow/Ice, Land, Water, Coast]
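The three steps above can be sketched as a short lookup-and-multiply loop. This is an illustrative sketch under stated assumptions, not the authors' code: `classify`, `bin_index`, and the `floor` value for empty histogram bins are hypothetical, and the default range of 0 to 3 comes from the x-axis of the figure. Two choices here go beyond the slide: the product is accumulated as a sum of logarithms to avoid underflow with many bands, and the denominator of the Bayes formula is omitted because it is the same for every class and does not affect which class wins.

```python
import math


def bin_index(value, lo, hi, n_bins):
    """Map a measurement to its histogram bin, clamping to the end bins."""
    width = (hi - lo) / n_bins
    return min(n_bins - 1, max(0, int((value - lo) / width)))


def classify(bands, histograms, priors, lo=0.0, hi=3.0, floor=1e-6):
    """Return the class with the highest prior-weighted product of likelihoods.

    bands:      one log-radiance value per band for a single pixel
    histograms: {class: [histogram per band]} of relative frequencies
    priors:     {class: prior probability}
    floor:      assumed minimum probability, so an empty bin does not
                drive the whole product to zero
    """
    best_label, best_score = None, -math.inf
    for label, per_band in histograms.items():
        score = math.log(max(priors.get(label, 0.0), floor))
        for value, hist in zip(bands, per_band):
            p = hist[bin_index(value, lo, hi, len(hist))]
            score += math.log(max(p, floor))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The per-pixel cost is one table lookup and one addition per band per class, which is consistent with the emphasis on a fast forward algorithm.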
Bayesian Classification Example
Terra/MODIS scene for 16:20-16:25Z, 2003-10-16
[Figure, two panels: Bayesian classification using bands 1, 2, 2/1, 31, 32; MODIS Cloudmask Product. Legend for both panels: Cloud, Desert, Land, Water, Coast, Glint, Ice]
Timing Results
[Figure: Algorithm Timing for 300 s of Data. x-axis: Number of Bands (1 to 7), plus Sci. Alg.; y-axis: Secs, 0 to 2000]
*Bayesian classification on 250 MHz SGI, as a function of number of bands used
Exploitation of Classification Results
• Add algorithm to Direct Broadcast processing stream
[Diagram, Direct Readout Laboratory: Terra; 3-meter X-band Antenna; Level 0 Raw Data; Level 1a Processor (L0 Reformatted in HDF); Geolocation; Calibration; Calibrated Radiances (bands 1, 2, 36); Classification; Calibrated Radiances w/supplement class layer]
Content-Based Subsetting
Deliver just the pixels likely to be useful, e.g., cloud-free
1. Classify using Bayesian classifier
2. Zero out pixels classified as cloud
3. Apply lossless compression
Currently implemented as an on-the-fly conversion in WUSTL FTP, e.g.:
    ftp g0dug03u.ecs.nasa.gov
    >cd /datapool/OPS/user/MODB/RMT021KM.001
    >ls
    >cd 2004.06.13
    >ls *.hdf
    >get RMT021KM.A2004165.1843.001.2004166072602.hdf.clr
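Steps 2 and 3 can be illustrated with a small sketch. Several things here are assumptions rather than details from the slide: radiances are treated as 16-bit unsigned integers, class labels are plain strings, the helper names `subset_clear` and `restore` are hypothetical, and `zlib` stands in for whichever lossless compressor the actual conversion uses.

```python
import zlib
from array import array


def subset_clear(radiances, labels, masked=frozenset({"cloud"})):
    """Zero out pixels whose class is in `masked`, then compress losslessly.

    radiances: integer pixel values (assumed to fit in 16 bits, unsigned)
    labels:    one class label per pixel, e.g. from a Bayesian classifier
    """
    kept = array("H", (0 if lab in masked else r
                       for r, lab in zip(radiances, labels)))
    return zlib.compress(kept.tobytes())


def restore(blob):
    """Inverse of the compression step; masked pixels come back as zeros."""
    out = array("H")
    out.frombytes(zlib.decompress(blob))
    return list(out)
```

Zeroing changes nothing about the file layout, so the receiver reads the result like any other granule; the saving comes from the compressor encoding long runs of zeros very compactly.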
Content-Based Data Selection
• Today: “select scenes where cloud cover < 50%”
  – Less than foolproof [Figure label: study area]
• Tomorrow: “select scenes where Lake Winnebago is visible”
• Ad hoc indexing / queries are difficult, but...
• ...subscription queries should be tractable
  – “Is anyone looking for data that are clear for a particular area in this scene?”
Automated Quality Assessment of Geolocation
• Compare observed land-water pattern with land-sea mask based on geolocation
  – Systematic geolocation error ⇒ systematic shift in pattern
• Technique:
  – Classify land/water/cloud from geolocated radiance
  – Assign +1.0 to land, -1.0 to water
    • Assign “unknown” classes a random number in the interval (-1.0, +1.0)
      – Cloud, snow/ice in classification
      – Ephemeral water in land-sea mask
  – Compute cross-correlation using 2-D FFT
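The technique above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_shift` and the integer coding (1 = land, 0 = water, -1 = unknown) are hypothetical, and the correlation is circular because it is computed directly from the FFT product without padding.

```python
import numpy as np


def estimate_shift(classified, mask, rng=None):
    """Estimate the (row, col) shift of `classified` relative to `mask`.

    Both inputs are 2-D integer arrays of the same shape, coded
    1 = land, 0 = water, -1 = unknown (assumed coding).
    """
    rng = np.random.default_rng(0) if rng is None else rng

    def encode(a):
        # +1.0 for land, -1.0 for water, random values for unknown pixels.
        out = np.where(a == 1, 1.0, -1.0)
        unknown = a == -1
        out[unknown] = rng.uniform(-1.0, 1.0, size=int(unknown.sum()))
        return out

    a, b = encode(classified), encode(mask)
    # Correlation theorem: circular cross-correlation from the FFT product.
    corr = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices past the midpoint correspond to negative shifts.
    return tuple(int(p - n) if p > n // 2 else int(p)
                 for p, n in zip(peak, corr.shape))
```

A peak at (0, 0) indicates that the classified pattern lines up with the land-sea mask; a consistent nonzero peak across scenes would suggest a systematic geolocation error. Giving unknown pixels random values keeps them from contributing a coherent signal to the correlation.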
Geolocation Case Study
• Terra/MODIS data for 19 June 2002 reprocessed with the usual onboard attitude and ephemeris
• But: a spacecraft maneuver made the onboard data inaccurate
  – Typically, definitive attitude/ephemeris are used in the vicinity of maneuvers
• Several months later… a group studying land cover change identified errors in geolocation
Geolocation Shift Effect
[Figure panels: Bayesian classification; Land-sea mask; Cross-Correlation, with the geolocation shift marked]