
Novel Data Linkage Techniques
Dongwon Lee, The Pennsylvania State University



  1. Novel Data Linkage Techniques
     Dongwon Lee, The Pennsylvania State University
     http://pike.psu.edu/
     dongwon@psu.edu
     KOCSEA 2008

     Problem Landscape
     - Data Linkage: given two data collections D1 and D2, identify/link the set S of all crosswise similar data objects, with few false positives
     - Abundant research in many disciplines:
       - DB: record linkage, merge/purge, approximate join
       - DL: citation matching, de-duplication
       - AI: identity matching
       - NLP: word sense disambiguation
       - IR: name disambiguation
       - LIS: name authority control
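The problem statement above can be illustrated with a naive baseline: compare every object in D1 against every object in D2 and keep the pairs whose similarity reaches a threshold. This is only a sketch under assumptions that are not on the slides: objects are strings, similarity is Jaccard over whitespace tokens, and the names `link` and `theta` are hypothetical.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(d1, d2, theta=0.5):
    """Naive linkage: all crosswise pairs (x, y) with similarity >= theta.

    Costs |d1| * |d2| comparisons, which is why scalability matters.
    """
    return [(x, y) for x in d1 for y in d2
            if jaccard(tokens(x), tokens(y)) >= theta]


d1 = ["the last samurai", "vanilla sky"]
d2 = ["last samurai", "vanilla sky", "mission impossible"]
pairs = link(d1, d2)
```

The quadratic number of comparisons in `link` is the cost that scalability techniques such as blocking and indexing aim to reduce.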

  2. Data Linkage Project @ Penn State
     - Since 2006
     - Supported by IBM, Microsoft, and NSF
     - http://pike.psu.edu/linkage/
     - Focus on two unanswered challenges: scalability and flexibility
     [Figure: "Today's Focus" plotted on two axes. Scalability labels: blocking, indexing, distributed, parallel. Data flexibility labels: name, record, image, video, time series, DNA seq.]

     1. Group Linkage [ICDM 06, ICDE 07]
     [Figure: two groups of elements to be linked. Labels shown: "T. Cruise", "Sofa-Jumping", "PP0Q03", "TX204". One group lists "Collateral, 04", "The Last Samurai, 03", "Minority Report, 02", "Vanilla Sky, 02"; the other lists "Vanilla Sky", "The Last Samurai", "Mission Impossible", "Mission Impossible 2".]

  3. 1. Group Linkage [ICDM 06, ICDE 07]
     Key Ideas
     - BM: generalized Jaccard similarity using max-weight bipartite matching M; cost O(N^3)
     - UB: greedy-algorithm-based approximation of BM; cost O(N)
     - Theorem: if UB(g1, g2) < θ, then BM(g1, g2) < θ, and therefore g1 ≠ g2
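The filter-then-verify idea behind the theorem can be sketched as follows. Everything here is an assumption made for illustration rather than the definitions from the cited papers: records are token sets, record similarity is plain Jaccard with an edge threshold `rho`, the group score is the matched weight divided by |g1| + |g2| - |M|, the exact matching is found by brute force (tiny groups only), and `ub` is one simple bound that provably never falls below that score. Groups are assumed to be non-empty.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two records given as token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_weights(g1, g2, rho):
    """Pairwise record similarities; pairs below rho get weight 0 (no edge)."""
    weights = []
    for r1 in g1:
        row = []
        for r2 in g2:
            s = jaccard(r1, r2)
            row.append(s if s >= rho else 0.0)
        weights.append(row)
    return weights


def bm(g1, g2, rho=0.5):
    """Matching-based group similarity, by brute force over all matchings.

    Exponential time, so only for tiny groups; a real implementation would
    use a polynomial max-weight bipartite matching algorithm.
    """
    if len(g1) > len(g2):
        g1, g2 = g2, g1  # make g1 the smaller group
    w = edge_weights(g1, g2, rho)
    best_weight, best_size = 0.0, 0
    for cols in permutations(range(len(g2)), len(g1)):
        matched = [w[i][j] for i, j in enumerate(cols) if w[i][j] > 0]
        if sum(matched) > best_weight:
            best_weight, best_size = sum(matched), len(matched)
    return best_weight / (len(g1) + len(g2) - best_size)


def ub(g1, g2, rho=0.5):
    """A cheap upper bound on bm (hypothetical; not necessarily the papers' UB).

    The matched weight cannot exceed the sum of per-record best edges on
    either side, and |M| <= min(|g1|, |g2|) bounds the denominator below.
    """
    w = edge_weights(g1, g2, rho)
    row_best = sum(max(row) for row in w)
    col_best = sum(max(col) for col in zip(*w))
    return min(row_best, col_best) / max(len(g1), len(g2))


def same_group(g1, g2, theta, rho=0.5):
    """Prune with ub first; run the expensive bm only if the pair survives."""
    if ub(g1, g2, rho) < theta:
        return False
    return bm(g1, g2, rho) >= theta


g1 = [{"collateral"}, {"the", "last", "samurai"},
      {"minority", "report"}, {"vanilla", "sky"}]
g2 = [{"vanilla", "sky"}, {"the", "last", "samurai"},
      {"mission", "impossible"}, {"mission", "impossible", "2"}]
```

With these two groups, two records match exactly, so `bm` is 2 / (4 + 4 - 2) and `ub` is 2 / 4; a threshold such as 0.6 lets `same_group` reject the pair without running the matching at all.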
     2. Video Linkage [CIVR 08]
     [Figure: an <Original Video> shown alongside modified versions of it:]
     - Contrast
     - Brightness
     - Crop
     - Color Enhancement
     - Color Change
     - TV size
     - Multi-editing
     - Low resolution
     - Noise/Blur
     - Small Logo

  4. 2. Video Linkage [CIVR 08]
     [Figure: a video is a group of shots (shot1, shot2, shot3); a shot is a group of frames; key frame selection turns a shot into a group of key frames (key frame 1, key frame 2, ..., key frame n).]
     Key frame selection:
     1. Dynamic
     2. Uniform
     3. Hybrid = Dynamic + Uniform
     - Reduce # of computations
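The slide names the three selection strategies without defining them, so the sketch below rests on guesses that should be read as assumptions: "uniform" keeps every `step`-th frame, "dynamic" keeps a frame whenever its feature vector differs from the last key frame by more than `threshold` (L1 distance), and "hybrid" is taken to be the union of both. All function and parameter names are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def uniform_keyframes(n_frames, step):
    """Indices of every step-th frame of a shot."""
    return list(range(0, n_frames, step))


def dynamic_keyframes(features, threshold):
    """Keep frame 0, then every frame that drifts far from the last key frame."""
    if not features:
        return []
    keys = [0]
    for i in range(1, len(features)):
        if l1_distance(features[i], features[keys[-1]]) > threshold:
            keys.append(i)
    return keys


def hybrid_keyframes(features, step, threshold):
    """Union of the uniform and dynamic selections, in frame order."""
    uniform = uniform_keyframes(len(features), step)
    dynamic = dynamic_keyframes(features, threshold)
    return sorted(set(uniform) | set(dynamic))


# One shot with six frames, each described by a one-dimensional feature.
shot = [[0.0], [0.1], [0.9], [1.0], [1.0], [0.2]]
keys = hybrid_keyframes(shot, step=4, threshold=0.5)
```

Comparing only key frames instead of all frames is what cuts down the number of frame-to-frame comparisons between two videos.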
     2. Video Linkage [CIVR 08]
     [Figure: Video 1 (shot 1, shot 2) and Video 2 (shot 1, shot 2, shot 3), each shot made of frames; the two videos are compared by comparing their frames. Slide dated 7/8/2008.]
     - We need features of a frame

  5. 2. Video Linkage [CIVR 08]
     Frame features:
     1. HSV color histogram (CH): computed over the H, S, and V channels
     2. Vector of YCbCr blocks: YCbCr average of each 16x16 pixel block; M = # of blocks
     3. Motion vector histogram: motion vectors of 16x16 pixel blocks; N = # of motion vectors; 9 directions: (0,0), (0,1), (1,0), (1,1), (0,-1), (-1,0), (1,-1), (-1,1), (-1,-1)
     [Figure labels whose placement is not recoverable: "4 values", "16 values", "4 values", "L = 256".]
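Two of the frame features listed above can be sketched with the standard library alone. The details are assumptions, not taken from the slides: the HSV histogram uses a 16 x 4 x 4 quantization (chosen only because it gives 256 bins), a motion vector is assigned to one of the 9 directions by the signs of its components, both histograms are normalized to sum to 1, and histogram intersection is used as a stand-in frame similarity.

```python
import colorsys

# The nine direction bins listed on the slide.
DIRECTIONS = [(0, 0), (0, 1), (1, 0), (1, 1), (0, -1),
              (-1, 0), (1, -1), (-1, 1), (-1, -1)]


def sign(v):
    """-1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def motion_histogram(vectors):
    """9-bin histogram of motion directions, normalized by N = len(vectors)."""
    hist = [0.0] * len(DIRECTIONS)
    for dx, dy in vectors:
        hist[DIRECTIONS.index((sign(dx), sign(dy)))] += 1
    n = len(vectors)
    return [h / n for h in hist] if n else hist


def hsv_histogram(pixels, bins=(16, 4, 4)):
    """Normalized HSV color histogram of (r, g, b) pixels in 0..255.

    The (H, S, V) bin counts are an assumed quantization.
    """
    hb, sb, vb = bins
    hist = [0.0] * (hb * sb * vb)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hi = min(int(h * hb), hb - 1)  # clamp values equal to 1.0
        si = min(int(s * sb), sb - 1)
        vi = min(int(v * vb), vb - 1)
        hist[(hi * sb + si) * vb + vi] += 1
        count += 1
    return [x / count for x in hist] if count else hist


def histogram_intersection(p, q):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(p, q))


red_frame = hsv_histogram([(255, 0, 0), (255, 0, 0)])
blue_frame = hsv_histogram([(0, 0, 255), (0, 0, 255)])
motion = motion_histogram([(3, 0), (-2, 5), (0, 0), (4, -1)])
```

Two frames with the same color distribution score 1.0 under `histogram_intersection`, while frames with disjoint colors, such as `red_frame` and `blue_frame`, score 0.0.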
     3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     [Figure: three kinds of data with their matching tools: DNA gene sequence (BLAST, Parallel BLAST), text (ED), and time series (DTW, SAX).]

  6. 3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     Text to DNA:
     - Text is represented as an N-gram token set with tf.idf weight
     - A conversion lookup table maps the tokens to DNA
     - Data coding: 1-bit, 2-bit, or 4-bit coding
     - Coding uses Prob(X = X_i), Prob(X ≤ C1), and Prob(X ≥ C2), where X = word weight (normalized)
     - The converted text is processed with BLAST
     [Figure: example in which words such as "VLDB" and "SIGMOD", with importance (rank) codes 11000 and 11001, are converted to DNA strings such as CTATGCAG, TGAA, TGAC, and GAGAGGGTGGGC.]

     3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     - Text: N-gram token set with tf.idf weight + conversion lookup table
     - Lookup table layouts shown: Hilbert Curve, QWERTY Layout
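The first stage of this pipeline, an N-gram token set with tf.idf weights, is standard and can be sketched directly. The conversion step is not specified on the slides, so the lookup table below is purely hypothetical: it numbers the distinct tokens and writes each number in base 4 over the alphabet A, C, G, T (2 bits per nucleotide), ignoring the weights. Character trigrams and the idf formula log(N / df) are also assumptions.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"  # assumed 2-bit coding: A=00, C=01, G=10, T=11


def ngrams(text, n=3):
    """Character n-grams of a lower-cased string, in order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs, n=3):
    """Per-document {token: tf * idf}, with idf = log(N / df)."""
    token_lists = [ngrams(d, n) for d in docs]
    df = Counter(t for toks in token_lists for t in set(toks))
    weights = []
    for toks in token_lists:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log(len(docs) / df[t])
                        for t, c in tf.items()})
    return weights


def build_lookup(tokens, length=4):
    """Hypothetical table: the k-th distinct token gets k written in base 4.

    A codeword of `length` nucleotides can distinguish 4 ** length tokens.
    """
    table = {}
    for k, tok in enumerate(sorted(set(tokens))):
        code, value = "", k
        for _ in range(length):
            code = NUCLEOTIDES[value % 4] + code
            value //= 4
        table[tok] = code
    return table


def to_dna(text, table, n=3):
    """Concatenate the codewords of the text's n-grams found in the table."""
    return "".join(table[t] for t in ngrams(text, n) if t in table)


docs = ["vldb", "vldb 2008"]
weights = tfidf(docs)
table = build_lookup(ngrams("vldb"))
dna = to_dna("vldb", table)
```

Once text is in this form, near-duplicate strings map to near-identical nucleotide sequences, which is the property a sequence-alignment tool would then exploit.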

  7. Conclusion
     - Group Linkage
       - Handles the integration of CiteSeer and ACM DL
       - Each data collection has ~100,000 groups
     - Video Linkage
       - Can detect copied videos with high precision/recall
       - Applied to Flickr
     - Text Linkage
       - Tries to bridge three different disciplines
       - Solves record linkage and document clustering problems using DNA sequence or time series

     Conclusion
     - Other Data Linkage Techniques
       - Name Linkage [CACM 08, WIDM 08, SemEval 07]
       - Parallel Linkage [CIKM 08]
       - Adaptive Linkage [JCDL 07]
       - Hashed Linkage [TR 08]
     - Future Work
       - Unifying Framework
       - Application to other data analysis problems
     - http://pike.psu.edu/linkage/

  8. Credit
     - Students @ Penn State
       - Yoojin Hong
       - Hung-sik Kim
       - Haibin Liu
       - Su Yan
       - Tao Yang
     - Collaborators
       - Ergin Elmacioglu, Yahoo, USA
       - Jaewoo Kang, Korea U., Korea
       - Nick Koudas, U. Toronto, Canada
       - Jeongkyu Lee, U. Bridgeport, USA
       - Byung-Won On, U. British Columbia, Canada
       - Jian Pei, Simon Fraser U., Canada
       - Divesh Srivastava, AT&T Labs – Research, USA
