Further plans and available data sets - PowerPoint PPT Presentation

Further plans and available data sets for research in directed networks. Andras Benczur, Institute for Computer Science and Control, Hungarian Academy of Sciences


  1. Further plans and available data sets for research in directed networks
     Andras Benczur, Institute for Computer Science and Control, Hungarian Academy of Sciences
     benczur@sztaki.mta.hu, http://datamining.sztaki.hu
     Supported by the EC FET Open project "New tools and algorithms for directed network analysis" (NADINE No 288956)
     14 June 2013

  2. Overview
     • Web classification, ClueWeb12
     • Temporal ranking, learning to rank
     • Metadata extraction from PDF publications
     • Plagiarism detection
     • Twitter: 1 TB of data available, user graph collection in progress for Andreas’ data
     • Distributed systems for very large problems
     Hardware
     • 50-node old dual-core Hadoop cluster
     • 5-node new Hadoop/HBase cluster
     • 260 TB net Isilon storage

  3. Automatic metadata extraction
     • Careful selection of open source PDF converters
     • Feature generation
       o font size, face, upper/lower case, numeric characters, symbols
       o location (centered, vertical position, spacing, page number)
       o entity list (names, institutions)
     • Manual training for a Hungarian journal in Economics
     • Automatic training planned by using publication DBs
     • Selection of machine learning methods
       o Random forest is best, LogitBoost with trees is second best
       o Conditional random fields sound nice but are not nearly as good as claimed
     • Extraction depends on what we can train (manually label)
       o Author, title, institution
       o References extracted in structured form
       o Tables, figure captions
       o …?
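The feature generation step above can be illustrated with a minimal sketch. It assumes that a PDF converter already supplies each text line together with its font size and horizontal bounding box; the function name, argument names, and the 5% centering tolerance are hypothetical and not taken from the slides.

```python
def line_features(text, font_size, page_width, x_left, x_right, page_number):
    """Compute simple character and layout features for one text line.

    All argument names are hypothetical; a real PDF converter would
    supply the font size and the bounding-box coordinates.
    """
    letters = [ch for ch in text if ch.isalpha()]
    n_chars = max(len(text), 1)
    # A line counts as centered when its left and right margins are nearly equal
    # (the 5% tolerance is an arbitrary assumption).
    left_margin = x_left
    right_margin = page_width - x_right
    return {
        "font_size": font_size,
        "upper_ratio": sum(ch.isupper() for ch in letters) / max(len(letters), 1),
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "symbol_ratio": sum(
            (not ch.isalnum()) and (not ch.isspace()) for ch in text
        ) / n_chars,
        "centered": abs(left_margin - right_margin) < 0.05 * page_width,
        "page_number": page_number,
    }


features = line_features("DATA MINING 2013", 18.0, 600.0, 200.0, 400.0, 1)
```

Feature dictionaries of this kind could then be fed to a classifier such as the random forest mentioned on the slide, with one label (author, title, institution, other) per line.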

  4. Plagiarism detection
     • BonFIRE Future Internet Research and Experimentation testbed
     • KOPI: A plagiarism detection toolkit
       o http://kopi.sztaki.hu/
       o Translation plagiarism (English and Hungarian)
       o Now serving English Wikipedia
       o Service puts very heavy load on the search index (sentence-based checks, existing suboptimal code)
       o Index ported to several distributed key-value stores
       o We feed it with Web data

  5. Crosslingual Web Classification
     • Save resources, select quality and topic
     • Legal regulation (porn, illicit content)
     • Web-scale data (test: ClueWeb09, 25 TB, 0.5 billion English-language docs)
     • We just obtained ClueWeb12
     Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013 (@WWW)
     The classification power of Web features. Erdelyi, Benczur, Daroczy, Garzo, Kiss, Siklosi. Internet Mathematics, under revision
     Julien Masanes, Philippe Rigaux (Internet Memory, Paris)

  6. Large set of features
     • Term frequency
       o tf.idf or BM25 scores for frequent terms
     • Content
       o DOM, HTML, HTTP elements
       o Appearance of popular terms
       o Term, n-gram statistics, compressibility
     • Linkage
       o PageRank (truncated variants; ratios)
       o Neighborhood (only approximate counting is possible)
       o TrustRank
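As an illustration of the term-frequency features, here is a minimal sketch of BM25 scoring over tokenized documents. The slides do not say which BM25 variant or parameters were used; the non-negative idf form and the defaults `k1=1.2`, `b=0.75` below are assumptions, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25.

    The idf form log(1 + (N - df + 0.5) / (df + 0.5)) and the default
    k1, b values are assumptions, not taken from the slides.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: longer documents need more occurrences.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["cheap", "pills", "cheap"],
    ["web", "graph", "analysis"],
    ["cheap", "web", "hosting"],
]
scores = bm25_scores(docs, ["cheap"])
```

In a classification setting, one such score per frequent term would become one feature column per document.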

  7. Workflow (MapRed jobs indicated)

  8. SZTAKI Web Processing Framework

  9. Crosslingual Web Classification
     • Terms in the English model are translated into Portuguese to classify in the target language.
     • Strongest positive and negative predictions are used for training a model in the target language.
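The two steps on this slide can be sketched as follows, under strong simplifying assumptions: the English model is a linear bag-of-words model, translation is a plain bilingual dictionary lookup, and "strongest predictions" means the `n` highest and `n` lowest scores. All function names, the dictionary, and the toy data are hypothetical.

```python
def translate_model(weights, dictionary):
    """Map English term weights onto target-language terms.

    `dictionary` maps an English term to a list of translations;
    weights of terms sharing a translation are summed (an assumption).
    """
    translated = {}
    for term, w in weights.items():
        for target in dictionary.get(term, []):
            translated[target] = translated.get(target, 0.0) + w
    return translated


def score(doc_terms, weights):
    """Linear score of a tokenized document; unknown terms contribute 0."""
    return sum(weights.get(t, 0.0) for t in doc_terms)


def select_confident(docs, weights, n):
    """Return the n strongest positive and n strongest negative documents."""
    ranked = sorted(docs, key=lambda d: score(d, weights))
    negatives = ranked[:n]
    positives = ranked[-n:]
    return positives, negatives


english_weights = {"cheap": 2.0, "research": -1.5}    # positive = spam
en_to_pt = {"cheap": ["barato"], "research": ["pesquisa"]}
pt_docs = [["barato", "barato"], ["pesquisa"], ["casa"]]

pt_weights = translate_model(english_weights, en_to_pt)
positives, negatives = select_confident(pt_docs, pt_weights, n=1)
```

The selected documents would then serve as pseudo-labeled training data for a classifier that uses native target-language features; `n` should stay well below half the collection size so the two sets do not overlap.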

  10. Temporal Wikipedia Search (Julianna)

  11. YAGO: Yet Another Great Ontology
      • By MPII Saarbrücken, derived from Wikipedia, WordNet and GeoNames
      • 10+ million entities (persons, organizations, cities), 120+ million facts
      • We are developing a visualization similar to the one for Wikipedia (previous slide)

  12. Temporal trends in blog data
      Terms shown in the figure: Liberation_war, economic, promise, those, engine, this_year, in_effect, fulfill

  13. Temporal trends in blog data
      • Temporal Text Mining: probabilistic models, language models
      • Still in progress, challenging algorithmic issues
      Terms shown in the figure: thesis, case, phd, plagiarism, semmelweis, university, case_discovery

  14. SZTAKI Full Text Search Technology

  15. Network Influence in Recommenders

  16. Apply for Twitter: retweets
      • Twitter data:
        o topics (~bursts: occupy wall street ...)
        o Andreas has 4 topics ("10o", "occupy", "20n", "yosoy132")
      • For all topics we have a set of tweets (each can be a retweet)
      • In numbers:
        o Follower network: 10^6 users
        o Tweets: ~10^5 to 10^6 per topic
      • Social network (who follows whom) is missing
      • Needed since we only know the ROOT of a retweet sequence
      • Robert is collecting the network

  17. The Matrix Factorization recommender
      Learning
      Source of next slides: Domonkos Tikk, CEO, Gravity

  18. BRISMF model
      • Biased Regularized Incremental Simultaneous Matrix Factorization
      • Apply regularization to prevent overfitting
      • Use bias values to further decrease RMSE
      • Model: $\hat{r}_{ui} = \mathbf{p}_u \mathbf{q}_i + b_u + c_i = \sum_{k=1}^{K} p_{uk} q_{ki} + b_u + c_i$

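  19. BRISMF Learning
      • Loss function:
        $\sum_{(u,i) \in R_{\mathrm{train}}} \Big( r_{ui} - \sum_{k=1}^{K} p_{uk} q_{ki} - b_u - c_i \Big)^2 + \lambda \sum_{(u,k)} p_{uk}^2 + \lambda \sum_{(k,i)} q_{ki}^2 + \lambda \sum_{u} b_u^2 + \lambda \sum_{i} c_i^2$
      • SGD update rules, where $e_{ui} = r_{ui} - \hat{r}_{ui}$ is the prediction error:
        $\Delta p_{uk} = \eta \, (e_{ui} q_{ki} - \lambda p_{uk})$
        $\Delta q_{ki} = \eta \, (e_{ui} p_{uk} - \lambda q_{ki})$
        $\Delta b_u = \eta \, (e_{ui} - \lambda b_u)$
        $\Delta c_i = \eta \, (e_{ui} - \lambda c_i)$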
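The update rules above can be tried out with a minimal pure-Python sketch. It is not the Gravity implementation: the initialization range, the hyperparameter defaults, the fixed epoch count, and the toy ratings are all assumptions chosen for illustration.

```python
import random


def train_brismf(ratings, n_users, n_items, k=2, eta=0.05, lam=0.01,
                 epochs=500, seed=0):
    """Fit biased, regularized matrix factorization with plain SGD.

    `ratings` is a list of (user, item, rating) triples. Initialization
    and hyperparameters are assumptions, not taken from the slides.
    Returns a function predict(u, i).
    """
    rng = random.Random(seed)
    p = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    b = [0.0] * n_users   # user biases b_u
    c = [0.0] * n_items   # item biases c_i

    def predict(u, i):
        return sum(p[u][f] * q[i][f] for f in range(k)) + b[u] + c[i]

    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - predict(u, i)              # prediction error e_ui
            for f in range(k):
                puf, qif = p[u][f], q[i][f]    # use old values in both updates
                p[u][f] += eta * (e * qif - lam * puf)
                q[i][f] += eta * (e * puf - lam * qif)
            b[u] += eta * (e - lam * b[u])
            c[i] += eta * (e - lam * c[i])
    return predict


# Hypothetical toy data: 3 users, 3 items, 6 known ratings.
toy = [(0, 0, 1), (0, 1, 4), (1, 0, 4), (1, 2, 4), (2, 1, 2), (2, 2, 4)]
predict = train_brismf(toy, n_users=3, n_items=3)
train_rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in toy) / len(toy)) ** 0.5
```

Caching the old `p` and `q` values before updating keeps both factor updates consistent with the simultaneous form of the rules; in practice the learning rate and regularization would be tuned on a held-out set.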

  20. [Figure: worked example of matrix factorization, showing the rating matrix R with the user factor matrix P and the item factor matrix Q; the numeric entries are not recoverable from the extraction.]

  21. [Figure: continuation of the worked example, showing R with predicted ratings filled in from the learned factors P and Q; the numeric entries are not recoverable from the extraction.]

  22. Influence Learning by Gradient Descent
      • Present influence recommender:
        o heuristic weighted network learning
        o no artist-based learning part
      • Heuristic combination of the influence and factor models
        o Is it likely that user v influences user u on artist a?
        o Can user u be influenced at all in the case of artist a?
      • Use the SGD method to learn user and artist factors:
        $\hat{r}_{uat} = \sum_{v} \Gamma(\Delta t) \, (p_v q_a + b_v + c_i)$

  23. Distributed learning?
      • Hadoop has gathered a bad reputation recently
        o Wants to be too robust, keeps writing all temporary data to disk several times
        o Fails after a given number of servers
        o The learning and graph problems do more computation on less data compared to building a Google search index
      • My personal choice of frameworks
        o GraphLab (Danny Bickson, HUJI)
          • Nearly as efficient as possible C++ codes
          • But very hard to write them
          • We work with them on implementing learning-to-rank methods
        o Stratosphere (Volker Markl, Kostas Tzoumas, TU Berlin)
          • Developments coordinated by TU Berlin with lots of partners incl. us
          • Promises to simplify complex workflows like the spam filter
      • Yet what many applications need would be
        o Streaming (read data only once, no batch computations)
        o Fully distributed: no Facebook, Google, Netflix knowing each and every online action ever in our life; have P2P learning instead

  24. A distributed systems comparison slide
      “Scalable Machine Learning for Big Data” tutorial at ICDE 2012

  25. Mobility Data Stream processing (Orange D4D)

  26. Stream Processing Architecture Overview
      Goal is to hide Storm details from the user
      • Streaming infrastructure pluggable (could combine with Stratosphere)
      • Persistence layer pluggable

  27. Conclusions
      • Web classification plans to integrate with BUbiNG, use the SZTAKI cluster to test the crawler
      • Analyze ClueWeb12 and maybe a NADINE crawl?
      • Temporal ranking in Wikipedia: other temporal collections?
      • Use metadata extraction from online publications to infer topics and rich information that is available in full text only (beyond the usual DBLP graph analysis)
      • Network analysis in the plagiarism detection tool?
      • Twitter
        o Understand the 1 TB of data
        o Find influences in the user graph that we collect for Andreas’ data
      • Distributed machine learning and graph algorithms

  28. Questions?
      András Benczúr
      Head, Informatics Laboratory and “Big Data” lab
      http://datamining.sztaki.hu/
      benczur@sztaki.mta.hu
      Web and Social Media, 14 June 2013
