
Missing Data in Large Transaction Databases - Allan R. Wilks



  1. Missing Data in Large Transaction Databases
     Allan R. Wilks, AT&T Labs - Research

  2. Setting
     • call detail on AT&T long distance network
     • 300 million transactions (50 GB) per day
     • collected from 400 sources
     • reporting frequency ranging from continuous to every few weeks
     • complicated variable-length record format
     Workshop on Data Quality, 30 November 2000, Slide 1

  3. Use
     • fraud detection
     • streaming access
     • database access

  4. Problem
     • are we seeing all the data?
     • needle absence in haystack
     • niches for fraudsters
     • perception: database confidence

  5. Sources
     • are all sources reporting?
     • depends on having exhaustive source list
     • each source reporting everything?
         - volume monitoring
         - frequency monitoring
         - serial number monitoring
         - stratified -- all exchanges?
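
The serial number monitoring idea on this slide can be illustrated with a minimal sketch. It assumes each record carries a per-source serial number that increases by one, which the slides do not confirm; `find_serial_gaps` and the `(source_id, serial_number)` pair layout are hypothetical names chosen for illustration, and the sample records are made up.

```python
from collections import defaultdict


def find_serial_gaps(records):
    """Return {source: [(first_missing, last_missing), ...]}.

    `records` is an iterable of (source_id, serial_number) pairs.
    Both fields are hypothetical stand-ins for whatever the feed provides.
    """
    seen = defaultdict(set)
    for source, serial in records:
        seen[source].add(serial)

    gaps = {}
    for source, serials in seen.items():
        ordered = sorted(serials)
        missing = []
        # Any jump larger than 1 between neighbours is a run of missing serials.
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev > 1:
                missing.append((prev + 1, cur - 1))
        if missing:
            gaps[source] = missing
    return gaps


# Made-up sample: source "A" is missing serials 3 and 4, source "B" is complete.
records = [("A", 1), ("A", 2), ("A", 5), ("B", 10), ("B", 11), ("A", 6)]
gaps = find_serial_gaps(records)
print(gaps)
```

A check like this only finds gaps between serials that did arrive. Records missing before the first or after the last observed serial need a different signal, such as the volume or frequency monitoring listed on the same slide.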

  6. Holes in database
     • users can detect quite small holes -- surprising
         - do users alert? -- depends on their expectations
         - do users think about the data as they see it?
     • auto queries transverse to reporting sources
     • traceback
         - can the source of a hole be traced?
         - keep raw data
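
One possible reading of an automated query that cuts across reporting sources is a count of records per time bucket, compared against what is typical for that series. The sketch below is an assumption-laden illustration and not the method used in the talk: `find_holes`, the `threshold` parameter, and the hourly counts are all hypothetical.

```python
from statistics import median


def find_holes(counts, threshold=0.5):
    """Return the bucket labels whose count is below threshold * median.

    `counts` maps a bucket label (for example an hour of day) to the number
    of records seen in that bucket. The 0.5 default is an arbitrary choice.
    """
    typical = median(counts.values())
    return [bucket for bucket, n in counts.items() if n < threshold * typical]


# Made-up hourly counts with an obvious dip at hour "02".
hourly = {"00": 980, "01": 1010, "02": 120, "03": 995, "04": 1003}
holes = find_holes(hourly)
print(holes)
```

The median is used instead of the mean so that the hole itself does not pull the baseline down. Real traffic has daily and weekly cycles, so a practical baseline would more likely compare each bucket with the same bucket on earlier days.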

  7. Tools
     • streaming tools
         - sh, awk, C, ...
         - everything small
     • database tools
         - Daytona
         - integrates well with UNIX
         - 8 TB and growing
     • alerting via pager
         - software failures
         - system failures
         - heartbeat
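
A heartbeat-style check can be sketched as a staleness test on each source's last report. Because reporting frequency ranges from continuous to every few weeks, this sketch assumes a separate allowed silence per source. The function `silent_sources`, the source names, and the timestamps in seconds are all hypothetical, and the paging step itself is left out.

```python
def silent_sources(last_seen, now, max_silence):
    """Return, sorted, the sources whose last report is too old.

    `last_seen` maps source -> time of last report (seconds).
    `max_silence` maps source -> longest acceptable gap (seconds).
    """
    late = []
    for source, ts in last_seen.items():
        if now - ts > max_silence[source]:
            late.append(source)
    return sorted(late)


# Made-up example: two near-continuous feeds and one infrequent batch feed.
last_seen = {"switch-1": 1000, "switch-2": 400, "batch-7": 0}
max_silence = {"switch-1": 60, "switch-2": 60, "batch-7": 5000}
late = silent_sources(last_seen, now=1030, max_silence=max_silence)
print(late)
```

In this example only `switch-2` is flagged. The batch feed has been silent for longer, but stays within its own allowance.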

  8. Lessons
     • develop subject matter expertise
     • log everything
     • explain all anomalies
     • keep raw data
     • automate as much as possible
