Novel Data Linkage Techniques
Dongwon Lee
The Pennsylvania State University
http://pike.psu.edu/
dongwon@psu.edu
KOCSEA 2008

Problem Landscape
Data Linkage: Given two data collections D1 and D2, identify and link all pairs of similar data objects across them (the set S), with few false positives.
Abundant research in many disciplines:
  - DB: record linkage, merge/purge, approximate join
  - DL: citation matching, de-duplication
  - AI: identity matching
  - NLP: word sense disambiguation
  - IR: name disambiguation
  - LIS: name authority control

Data Linkage Project @ Penn State
  - Since 2006
  - Supported by IBM, Microsoft, and NSF
  - http://pike.psu.edu/linkage/
  - Focus on two unanswered challenges: Scalability and Flexibility
[Figure: Scalability (parallel, distributed, indexing, blocking) versus data Flexibility (name, record, image, video, time series, DNA seq.), with "Today's Focus" marked]

1. Group Linkage [ICDM 06, ICDE 07]
[Figure: "Group of Elements". Two groups for "T. Cruise" (labels PP0Q03 and TX204) with entries: Collateral, 04; Sofa-Jumping; The Last Samurai, 03; Minority Report, 02; Vanilla Sky; Vanilla Sky, 02; The Last Samurai; Mission Impossible; Mission Impossible 2]

1. Group Linkage [ICDM 06, ICDE 07]
Key Ideas
  - BM: generalized Jaccard similarity using a maximum-weight bipartite matching M; cost O(N^3)
  - UB: greedy approximation (upper bound) of BM; cost O(N)
  - Theorem: if UB(g1, g2) < θ, then BM(g1, g2) < θ, and therefore g1 ≠ g2

2. Video Linkage [CIVR 08]
[Figure: an original video and its edited copies. Edits shown: contrast, brightness, crop, color enhancement, color change, TV size, multi-editing, low resolution, noise/blur, small logo]
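
Returning to the Group Linkage key ideas (BM, UB, and the pruning theorem), the filter-then-verify pattern can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm from the cited papers: the token-level `jaccard` similarity, the element threshold `rho`, the exhaustive matching (practical only for tiny groups), and the particular bound in `upper_bound` are all choices made for this sketch. The bound used here is a simple one that is provably at least BM; it is not claimed to be the slide's greedy UB or to have its cost.

```python
from itertools import permutations

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed element similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def bm_similarity(g1, g2, sim=jaccard, rho=0.3):
    """Generalized Jaccard over a maximum-weight matching M of edges with sim >= rho.

    The matching is found by exhaustive search, so this is for small groups only.
    """
    if not g1 or not g2:
        return 0.0
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    w = [[sim(a, b) for b in large] for a in small]
    best = (0.0, 0)  # (total weight of M, number of edges in M)
    for perm in permutations(range(len(large)), len(small)):
        edges = [w[i][j] for i, j in enumerate(perm) if w[i][j] >= rho]
        best = max(best, (sum(edges), len(edges)))
    weight, size = best
    return weight / (len(g1) + len(g2) - size)

def upper_bound(g1, g2, sim=jaccard, rho=0.3):
    """A cheap bound with upper_bound >= bm_similarity (illustrative, not the paper's UB).

    Any matching weighs at most the sum of per-element best matches on either side,
    and |M| <= min(|g1|, |g2|), so the denominator is at least max(|g1|, |g2|).
    """
    if not g1 or not g2:
        return 0.0
    def best_sum(src, dst):
        total = 0.0
        for a in src:
            s = max(sim(a, b) for b in dst)
            if s >= rho:
                total += s
        return total
    return min(best_sum(g1, g2), best_sum(g2, g1)) / max(len(g1), len(g2))

def linked(g1, g2, theta, sim=jaccard, rho=0.3):
    """Prune with the bound first; run the expensive matching only if needed."""
    if upper_bound(g1, g2, sim, rho) < theta:
        return False  # by the theorem, BM < theta as well
    return bm_similarity(g1, g2, sim, rho) >= theta

g1 = ["the last samurai", "vanilla sky", "mission impossible"]
g2 = ["the last samurai 03", "vanilla sky 02", "minority report 02", "collateral 04"]
print(upper_bound(g1, g2), bm_similarity(g1, g2), linked(g1, g2, theta=0.5))
```

The point of the design is that `linked` never calls the matching step for pairs the bound already rules out, which is where the savings come from when most group pairs are dissimilar.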

2. Video Linkage [CIVR 08]
  - Video = a group of shots (shot 1, shot 2, shot 3, ...)
  - Shot = a group of frames
  - Key frames (key frame 1, 2, ..., n) = a group of key frames
Key frame selection:
  1. Dynamic
  2. Uniform
  3. Hybrid = Dynamic + Uniform (reduces the number of computations)

2. Video Linkage [CIVR 08]
[Figure: the frames of each shot in Video 1 (shots 1 and 2) are compared with the frames of each shot in Video 2 (shots 1, 2, and 3)]
We need features of a frame.
(slide footer date: 7/8/2008)
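
The slides name three key frame selection strategies without defining them, so the following is a sketch of one plausible reading rather than the method from the CIVR 08 paper. It assumes that "uniform" means sampling every `step`-th frame, that "dynamic" means starting a new key frame whenever a frame differs from the last key frame by more than `threshold`, and that "hybrid" applies the dynamic test only to the uniformly sampled candidates. The function names, the L1 distance, and the representation of a frame as a list of floats are all hypothetical.

```python
def l1(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def uniform_keys(frames, step):
    """Indices of every step-th frame."""
    return list(range(0, len(frames), step))

def dynamic_keys(frames, threshold, candidates=None):
    """Keep a frame when it differs enough from the most recent key frame."""
    idxs = range(len(frames)) if candidates is None else candidates
    keys = []
    for i in idxs:
        if not keys or l1(frames[i], frames[keys[-1]]) > threshold:
            keys.append(i)
    return keys

def hybrid_keys(frames, step, threshold):
    """Dynamic selection restricted to uniformly sampled candidates."""
    return dynamic_keys(frames, threshold, uniform_keys(frames, step))

# Toy shot: one-dimensional "features" with two abrupt changes.
shot = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2], [2.5], [2.6]]
print(uniform_keys(shot, 2))      # [0, 2, 4, 6]
print(dynamic_keys(shot, 0.5))    # [0, 3, 6]
print(hybrid_keys(shot, 2, 0.5))  # [0, 4, 6]
```

Under this reading, the hybrid variant computes a distance for only every `step`-th frame, which matches the slide's remark about reducing the number of computations.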

2. Video Linkage [CIVR 08]
Frame features:
  1. HSV color histogram (CH): H, S, V with 4 values, 16 values, 4 values; L = 256
  2. Vector of YCbCr blocks: YCbCr average per 16x16 pixel block; M = # of blocks
  3. Motion vector histogram: motion vectors per 16x16 pixel block, taken from (0,0), (0,1), (1,0), (1,1), (0,-1), (-1,0), (1,-1), (-1,1), (-1,-1); N = # of motion vectors = 9

3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
[Figure: Text is mapped either to a DNA gene sequence (BLAST, Parallel BLAST) or to a time series (ED, DTW, SAX)]
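
For the HSV color histogram feature listed in the Video Linkage slides, a minimal sketch of a 256-bin histogram is shown here. It assumes a frame is given as a list of 8-bit `(r, g, b)` tuples, and it assigns 4, 16, and 4 quantization levels to H, S, and V in the order the values appear on the slide; that assignment is a guess, since the slide layout does not make it certain. Comparing histograms by histogram intersection is also an assumption of this sketch, not something the slides state.

```python
import colorsys

def hsv_histogram(pixels, bins=(4, 16, 4)):
    """Normalized HSV histogram with bins[0] * bins[1] * bins[2] cells."""
    hb, sb, vb = bins
    hist = [0.0] * (hb * sb * vb)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 falls in the last bin.
        hi = min(int(h * hb), hb - 1)
        si = min(int(s * sb), sb - 1)
        vi = min(int(v * vb), vb - 1)
        hist[(hi * sb + si) * vb + vi] += 1.0
    n = len(pixels)
    return [c / n for c in hist] if n else hist

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red_frame = [(255, 0, 0)] * 4
blue_frame = [(0, 0, 255)] * 4
print(len(hsv_histogram(red_frame)))
print(intersection(hsv_histogram(red_frame), hsv_histogram(blue_frame)))
```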

3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
Text to DNA:
  - Text -> N-gram token set with tf.idf weight -> Conversion via Lookup Table -> DNA data -> BLAST
  - Data coding: 1-bit, 2-bit, or 4-bit
  - X = word weight (normalized); the codings use Prob(X = X_i), Prob(X ≤ C1), and Prob(X ≥ C2), with cut points C1 and C2
[Figure: lookup table ordered by importance (rank), e.g. VLDB, 11000, TGAA and SIGMOD, 11001, TGAC; sample DNA strings CTATGCAG and GAGAGGGTGGGC]

3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
[Figure: Text -> N-gram token set with tf.idf weight + Lookup Table -> Conversion; lookup table options shown: Hilbert curve, QWERTY layout]
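
The lookup-table conversion from weighted tokens to DNA strings can be sketched roughly as below. This is a hypothetical illustration only: the slides do not specify how codewords are assigned, so the sketch simply ranks tokens by a given weight and gives each rank a fixed-length base-4 codeword over A, C, G, T. It also uses whitespace tokens and hand-supplied weights in place of the N-gram token set and tf.idf weights named on the slide, and it does not reproduce the 1-bit, 2-bit, or 4-bit codings.

```python
ALPHABET = "ACGT"

def codeword(rank, length):
    """Fixed-length base-4 codeword for a rank, most significant letter first."""
    letters = []
    for _ in range(length):
        rank, digit = divmod(rank, 4)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))

def build_lookup(weights, length=4):
    """Map each token to a codeword, most important (highest weight) first."""
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    if len(ranked) > 4 ** length:
        raise ValueError("codeword length too short for this vocabulary")
    return {tok: codeword(i, length) for i, tok in enumerate(ranked)}

def to_dna(text, table):
    """Concatenate codewords of known tokens; unknown tokens are skipped."""
    return "".join(table[t] for t in text.lower().split() if t in table)

# Hypothetical weights standing in for normalized tf.idf scores.
weights = {"vldb": 0.9, "sigmod": 0.8, "data": 0.1}
table = build_lookup(weights)
print(table)
print(to_dna("SIGMOD data VLDB", table))
```

Once records are strings over A, C, G, T, they can be handed to a sequence alignment tool such as BLAST, which is the step the slide points to.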

Conclusion
  - Group Linkage
      - Handles the integration of CiteSeer and ACM DL
      - Each data collection has ~100,000 groups
  - Video Linkage
      - Can detect copied videos with high precision and recall
      - Applied to Flickr
  - Text Linkage
      - Tries to bridge three different disciplines
      - Solves record linkage and document clustering problems using DNA sequences or time series

Conclusion
  - Other Data Linkage Techniques
      - Name Linkage [CACM 08, WIDM 08, SemEval 07]
      - Parallel Linkage [CIKM 08]
      - Adaptive Linkage [JCDL 07]
      - Hashed Linkage [TR 08]
  - Future Work
      - Unifying framework
      - Application to other data analysis problems
  - http://pike.psu.edu/linkage/

Credit
  - Students @ Penn State
      - Yoojin Hong
      - Hung-sik Kim
      - Haibin Liu
      - Su Yan
      - Tao Yang
  - Collaborators
      - Ergin Elmacioglu, Yahoo, USA
      - Jaewoo Kang, Korea U., Korea
      - Nick Koudas, U. Toronto, Canada
      - Jeongkyu Lee, U. Bridgeport, USA
      - Byung-Won On, U. British Columbia, Canada
      - Jian Pei, Simon Fraser U., Canada
      - Divesh Srivastava, AT&T Labs – Research, USA