author profiling

Author Profiling (Zhiming Zhou, 2017/6/26) - PowerPoint PPT Presentation



  1. Author Profiling. Zhiming Zhou, 2017/6/26

  2. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  3. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  4. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  5. PREVIOUS WORK

  6. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  7. SYSTEM OVERVIEW • Maximize the recall • Maximize the precision

  8. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  9. ALGORITHM: Pre-processing
      • Clean the data:
        1. Noisy First or Last Names
        2. Mistakenly Separated or Merged Name Units
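
The slide does not say how the names are cleaned. As a minimal sketch, assuming "noisy" means stray punctuation, digits, accents and inconsistent case, and that separated or merged name units can be reconciled by comparing a space-free form of the name, the cleaning could look like this. The function names `clean_name` and `merged_key` are hypothetical and not taken from the presentation.

```python
import re
import unicodedata


def clean_name(raw: str) -> str:
    """Lowercase a name, strip accents and non-letter noise, collapse spaces.

    Assumption: names are written in Latin script; other scripts would be
    dropped by the ASCII-only filter below.
    """
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Treat hyphens and periods as separators between name units.
    text = re.sub(r"[-.]", " ", text.lower())
    # Remove everything that is not a lowercase ASCII letter or a space.
    text = re.sub(r"[^a-z ]", "", text)
    return " ".join(text.split())


def merged_key(raw: str) -> str:
    """Space-free key, so split and merged name units compare equal."""
    return clean_name(raw).replace(" ", "")
```

With this sketch, `merged_key("Zhi Ming Zhou")` and `merged_key("Zhiming Zhou")` produce the same key, which is one simple way to catch mistakenly separated name units.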

  10. ALGORITHM: Improving the Recall
      • Name-Specific Consideration:
        1. Name Suffixes and Prefixes
        2. Nicknames
        3. Name Initials
        4. Asian Names and Western Names
      • String-based Consideration:
        1. Levenshtein Edit Distance
        2. Soundex Distance
        3. Overlapping Name Units
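
Two of the string-based signals named on this slide, Levenshtein edit distance and Soundex, are standard algorithms and can be sketched directly. How the original system combines them is not stated, so the `is_candidate_pair` filter and its `max_edits` threshold below are hypothetical illustrations only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character American Soundex code; empty string for empty input."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def is_candidate_pair(name_a: str, name_b: str, max_edits: int = 2) -> bool:
    """Hypothetical recall-oriented filter: keep a pair if either signal fires."""
    a, b = name_a.lower(), name_b.lower()
    return levenshtein(a, b) <= max_edits or soundex(a) == soundex(b)
```

Accepting a pair when either signal fires favors recall over precision, which matches the goal of this step; the later precision steps are then expected to discard the false matches.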

  11. ALGORITHM: Improving the Precision
      • Meta-Path-based Similarity: The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights for them are decreasing progressively.

  12. ALGORITHM: Improving the Precision
      • Meta-Path-based Similarity: The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights for them are decreasing progressively.

  13. ALGORITHM: Improving the Precision
      • Meta-Path-based Similarity: The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights for them are decreasing progressively.
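
The slides list the meta-paths but not the similarity formula. The following is a minimal sketch that assumes a PathSim-style normalization per meta-path and a weighted sum across meta-paths; both are swapped-in choices, not confirmed by the presentation. Reading A as author, P as paper and V as venue is also an assumption, and the weights and sample data are purely illustrative.

```python
from collections import Counter


def pathsim(half_a: Counter, half_b: Counter) -> float:
    """PathSim-style score for one symmetric meta-path.

    Each Counter holds the half-path counts of one author ID, for example
    the papers reached by A-P when the meta-path is APA.
    """
    def dot(x: Counter, y: Counter) -> int:
        return sum(count * y[key] for key, count in x.items())

    denom = dot(half_a, half_a) + dot(half_b, half_b)
    return 2 * dot(half_a, half_b) / denom if denom else 0.0


def combined_similarity(feats_a: dict, feats_b: dict, weights: dict) -> float:
    """Weighted sum of per-meta-path scores (weights are hypothetical)."""
    return sum(
        w * pathsim(feats_a.get(path, Counter()), feats_b.get(path, Counter()))
        for path, w in weights.items()
    )


# Illustrative data: two author IDs sharing one paper and one venue.
id_a = {"APA": Counter({"p1": 1, "p2": 1}), "APVPA": Counter({"KDD": 2})}
id_b = {"APA": Counter({"p2": 1, "p3": 1}),
        "APVPA": Counter({"KDD": 1, "ICML": 1})}
# Decreasing weights, mirroring the order in which the slide lists the paths.
example_weights = {"APA": 1.0, "APVPA": 0.5}
example_score = combined_similarity(id_a, id_b, example_weights)
```

The normalization keeps the score of each meta-path between 0 and 1, so the weights alone control how much each path contributes.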

  14. ALGORITHM: Improving the Precision
      • Ranking-based Merging: We scan from the top-ranked ID pair down to the lower-ranked ones to help infer the author entity. We skip conflicting IDs and look for a pair that has high similarity and also passes the name-matching comparison; we believe two such IDs have a high probability of being real duplicates. After that, if A is a duplicate of B and B is a duplicate of C, we consider A a duplicate of C. Another important strategy is to expand the author names corresponding to the IDs once we are confident that two IDs are duplicates. This idea is useful because it helps avoid mistakenly detected conflicts.
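
The merging procedure on this slide can be sketched with a union-find structure: pairs are scanned in rank order, conflicting pairs are skipped, transitivity follows from the shared root, and the name set of a merged group grows with every merge. This is an illustration under assumptions, not the author's code; in particular `merge_ranked_pairs`, the `compatible` callback and the token-subset rule used in the example are hypothetical.

```python
def merge_ranked_pairs(ranked_pairs, names, compatible):
    """Merge author IDs by scanning pairs from highest to lowest similarity.

    ranked_pairs: list of (id_a, id_b), already sorted by decreasing score.
    names: dict mapping each ID to a set of name strings.
    compatible: callable(set_of_names, set_of_names) -> bool.
    Returns the groups of IDs judged to be the same author.
    """
    parent = {i: i for i in names}
    group_names = {i: set(ns) for i, ns in names.items()}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ranked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already merged through transitivity
        if not compatible(group_names[root_a], group_names[root_b]):
            continue  # conflicting IDs are skipped
        parent[root_b] = root_a
        # Name expansion: the merged group now carries both name sets.
        group_names[root_a] |= group_names.pop(root_b)

    clusters = {}
    for i in names:
        clusters.setdefault(find(i), set()).add(i)
    return [ids for ids in clusters.values() if len(ids) > 1]


def token_subset(names_x, names_y):
    """Hypothetical rule: some name's tokens are contained in another's."""
    return any(
        set(x.split()) <= set(y.split()) or set(y.split()) <= set(x.split())
        for x in names_x for y in names_y
    )


example_names = {1: {"john smith"}, 2: {"john a smith"},
                 3: {"john adam smith"}, 4: {"jane smith"}}
example_groups = merge_ranked_pairs([(1, 2), (2, 3), (3, 4)],
                                    example_names, token_subset)
```

In the example, IDs 2 and 3 would conflict if compared on their own names, but after ID 2 has been merged with ID 1 the expanded name set contains "john smith", which is compatible with ID 3. This is the kind of mistakenly detected conflict that name expansion is meant to avoid.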

  15. ALGORITHM: Post-processing
      • Low-confidence duplicate author IDs should be removed even though their names are compatible and their meta-path-based similarity scores are acceptable. This step is crucial in that the later iterative framework requires highly confident output to gradually refine the results.

  16. ALGORITHM: Iterative Framework
      • An iterative framework takes the detected duplicates of the last iteration as part of the input, so that:
        1. we are able to generate much better meta-path-based similarity scores
        2. we can invoke again the name expansion module introduced at the end of the p-step
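
The feedback loop described here can be sketched as a fixed-point iteration. The `detect` callable below is a hypothetical stand-in for one full pass of the pipeline (recall step, precision step and post-processing); the stopping rule and `max_rounds` limit are assumptions, since the slides do not say when the iteration ends.

```python
def iterate_until_stable(detect, initial=frozenset(), max_rounds=10):
    """Feed each round's detected duplicates back into the next round.

    detect: callable taking the set of known duplicate pairs and returning
            the set of pairs detected in this round.
    Returns (known_pairs, rounds_run).
    """
    known = frozenset(initial)
    for round_no in range(1, max_rounds + 1):
        found = frozenset(detect(known))
        merged = known | found
        if merged == known:
            return known, round_no  # nothing new was found: stop
        known = merged
    return known, max_rounds


def toy_detect(known):
    """Toy detector: pair (2, 3) only becomes visible once (1, 2) is known."""
    found = {(1, 2)}
    if (1, 2) in known:
        found.add((2, 3))
    return found


example_pairs, example_rounds = iterate_until_stable(toy_detect)
```

The toy detector mimics the effect named on the slide: a duplicate found in one round improves the evidence available in the next, so a second pair only becomes detectable after the first has been accepted.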

  17. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  18. CONCLUSION
      We have tried to disambiguate author names, and we have found a better algorithm, one that proved practical in the KDD Cup 2013 data mining contest. But there is still a lot of work to be done. In the future, we need to adapt the code to our database and tune some of the parameters to obtain the best results. I am looking forward to the day we complete the work, and I firmly believe that our work will turn out to be a very important improvement to Acemap.

  19. Q&A

  20. Thank You!
