
Scalable, Generic, and Adaptive Systems for Focused Crawling



  1. Scalable, Generic, and Adaptive Systems for Focused Crawling. Georges Gouriten* (georges@netiru.fr), Silviu Maniu°, Pierre Senellart*°. * Télécom Paristech – Institut Mines-Télécom – LTCI CNRS; ° Hong Kong University

  2. What is focused crawling?

  3. A directed graph

  4. Web, social network, P2P, etc.

  5. Weighted [figure: graph with a weight on each node]

  6. Let u be a node, β(u) = count of the word Bhutan in all the tweets of u

  7. Even more weighted [figure: graph with a weight on each edge]

  8. Let (u, v) be an edge, α(u, v) = count of the word Bhutan in all the tweets of u mentioning v
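
The node and edge weights defined on slides 6 and 8 can be sketched as follows. The tweet format `(author, text, mentioned_users)` and the function name are assumptions made for illustration, and the whitespace tokenization is a simplification (it would miss "Bhutan!" with trailing punctuation).

```python
from collections import Counter

def node_and_edge_weights(tweets, word="bhutan"):
    """Compute beta (node weights) and alpha (edge weights).

    tweets: iterable of (author, text, mentioned_users) tuples.
    This input format is hypothetical, chosen only for the sketch.
    """
    beta = Counter()   # beta[u]: occurrences of `word` in all tweets of u
    alpha = Counter()  # alpha[(u, v)]: occurrences of `word` in tweets of u mentioning v
    for author, text, mentions in tweets:
        count = text.lower().split().count(word)
        beta[author] += count
        for v in mentions:
            alpha[(author, v)] += count
    return beta, alpha
```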

  9. The total graph [figure: graph with node and edge weights]

  10. A seed list [figure: graph with node and edge weights]

  11. The frontier [figure: graph with node and edge weights]

  12. Crawling one node [figure: graph with node and edge weights]

  13. A crawl sequence: let V_0 be the seed list, a set of nodes. A crawl sequence starting from V_0 is a sequence v_0, v_1, ... such that each v_i ∈ frontier(V_0 ∪ {v_0, v_1, ..., v_{i-1}})
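
A minimal sketch of this definition, assuming the frontier of a crawled set is the set of uncrawled nodes pointed to by at least one crawled node (the slides show the frontier only as a figure, so this reading is an assumption). The adjacency-dict representation and both function names are illustrative.

```python
def frontier(graph, crawled):
    """Uncrawled nodes reachable by one edge from a crawled node.

    graph: dict mapping each node to the set of its successors.
    crawled: set of already crawled nodes (seeds included).
    """
    reached = set()
    for u in crawled:
        reached |= graph.get(u, set())
    return reached - crawled

def is_crawl_sequence(graph, seeds, sequence):
    """Check that every node of `sequence` is in the frontier when it is picked."""
    crawled = set(seeds)
    for v in sequence:
        if v not in frontier(graph, crawled):
            return False
        crawled.add(v)
    return True
```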

  14. Goal of a focused crawler: produce crawl sequences with global scores (sum) as high as possible

  15. A high-level algorithm: (1) estimate scores at the frontier; (2) pick a node from the frontier; (3) crawl the node
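
The three steps can be sketched as a loop. This is an illustration under assumptions, not the authors' implementation: the graph is an adjacency dict, `estimate` is any callable scoring a frontier node, estimates are recomputed at every step, and ties are broken by node order.

```python
def focused_crawl(graph, beta, seeds, estimate, budget):
    """Run at most `budget` crawl steps and return (sequence, global score).

    graph: dict node -> set of successors; beta: dict node -> true score.
    estimate(v, crawled) -> estimated score of frontier node v (hypothetical API).
    """
    crawled = set(seeds)
    sequence = []
    for _ in range(budget):
        # Frontier: uncrawled successors of crawled nodes (assumed definition).
        front = set()
        for u in crawled:
            front |= graph.get(u, set())
        front -= crawled
        if not front:
            break
        # 1. Estimate scores at the frontier.
        scores = {v: estimate(v, crawled) for v in front}
        # 2. Pick the node with the highest estimate (sorted for determinism).
        best = max(sorted(front), key=scores.get)
        # 3. Crawl the node.
        crawled.add(best)
        sequence.append(best)
    return sequence, sum(beta.get(v, 0) for v in sequence)
```

Passing an oracle such as `lambda v, crawled: beta[v]` as `estimate` gives the greedy behavior with a perfect estimator.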

  16. Supposing a perfect estimator

  17. Finding an optimal crawl sequence offline: NP-hard. Greedy wins for a crawled graph > 1000 nodes. Refresh rate of 1 is better

  18. Estimation in practice

  19. Different kinds of estimators

  20. bfs [figure: graph with node weights]

  21. bfs [figure: graph with node weights]

  22. bfs

  23. nr: navigational rank; score propagation from the ancestors of a node, then to the children of a node

  24. nr

  25. opic: online page importance computation ~ online PageRank computation

  26. opic

  27. Open spaces in the state of the art: nr has a quadratic complexity; opic focuses on popularity; the rest is about how to score

  28. First-level neighborhood

  29. Second-level neighborhood

  30. Neighborhood-based estimators

  31. deg, e, n, ne. deg: number of neighbors; e: sum of incoming edges; n: sum of incoming nodes; ne: sum of incoming (node*edge)s
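
One possible reading of these four estimators for a frontier node v, sketched below. It assumes that "incoming" refers to already crawled predecessors u of v, with node weight β(u) and edge weight α(u, v); the slides do not spell this out, so the interpretation and the function name are assumptions.

```python
def neighborhood_estimates(v, crawled, predecessors, beta, alpha):
    """Return the deg, e, n and ne estimates for frontier node v.

    predecessors: dict node -> set of nodes with an edge to it.
    beta: dict node -> node weight; alpha: dict (u, v) -> edge weight.
    Only crawled predecessors are used, since only their weights are known.
    """
    incoming = [u for u in predecessors.get(v, ()) if u in crawled]
    return {
        "deg": len(incoming),
        "e": sum(alpha.get((u, v), 0) for u in incoming),
        "n": sum(beta.get(u, 0) for u in incoming),
        "ne": sum(beta.get(u, 0) * alpha.get((u, v), 0) for u in incoming),
    }
```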

  32. Linear regressions

  33. Multi-armed bandits (1): slot machine 1, slot machine 2, slot machine 3, slot machine 4, ...

  34. Multi-armed bandits (2): with a budget n, how to maximize the reward? Balance exploration and exploitation

  35. Applied to focused crawling: slot machines are the estimators; the reward is the score of the top node

  36. mab_ε: with probability 1-ε, the slot machine with the highest average reward; with probability ε, a random slot machine
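
A minimal sketch of this ε-greedy choice, where each arm is an estimator name. The running-average bookkeeping and the function names are assumptions added for illustration.

```python
import random

def choose_epsilon_greedy(avg_reward, epsilon, rng=random):
    """Pick an arm: random with probability epsilon, best average otherwise.

    avg_reward: dict arm name -> average reward observed so far.
    """
    arms = sorted(avg_reward)  # fixed order so ties break deterministically
    if rng.random() < epsilon:
        return rng.choice(arms)
    return max(arms, key=avg_reward.get)

def update_average(avg_reward, pulls, arm, reward):
    """Incrementally update the average reward of `arm` after one pull."""
    pulls[arm] += 1
    avg_reward[arm] += (reward - avg_reward[arm]) / pulls[arm]
```

In the crawling setting, the reward passed to `update_average` would be the score of the top node proposed by the chosen estimator.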

  37. mab_ε-first: steps [0, ⌊ε × N⌋]: random slot machine; steps [⌊ε × N⌋ + 1, N]: slot machine with the highest average reward
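
The ε-first schedule can be sketched as below: pure exploration up to step ⌊ε × N⌋, pure exploitation afterwards. The function and parameter names are illustrative assumptions.

```python
import math
import random

def choose_epsilon_first(step, total_steps, avg_reward, epsilon, rng=random):
    """Pick an arm at `step` (0-based) out of `total_steps` under epsilon-first.

    avg_reward: dict arm name -> average reward observed so far.
    """
    arms = sorted(avg_reward)
    if step <= math.floor(epsilon * total_steps):
        # Exploration phase: any slot machine, uniformly at random.
        return rng.choice(arms)
    # Exploitation phase: slot machine with the highest average reward.
    return max(arms, key=avg_reward.get)
```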

  38. mab_var: succession of ε-first strategies, with a reset every r steps, r varying with the context

  39. Their running times

  40. Expected running times. Twitter API for one week: 3 s, 200,000 nodes. One-domain website for one week: 1 s, 600,000 nodes

  41. Experimental framework (1)

  42. Experimental framework (2). Graph score: 10 seed graphs (1 seed graph: 50 seeds picked randomly among non-zero β); arithmetic average of the crawl scores (sum). Global score: normalization with a baseline (relative score); geometric average among the five graphs
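
The two aggregation levels can be sketched as follows, assuming the relative score is the graph score divided by the baseline's graph score (the slide only says "normalization with a baseline", so the division is an assumption). Function names are illustrative.

```python
import math

def graph_score(crawl_scores):
    """Arithmetic average of the crawl scores obtained from the seed graphs."""
    return sum(crawl_scores) / len(crawl_scores)

def global_score(graph_scores, baseline_scores):
    """Geometric average of the per-graph relative scores.

    Relative score = graph score / baseline graph score (assumed normalization).
    All relative scores must be strictly positive for the logarithm to exist.
    """
    relative = [s / b for s, b in zip(graph_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in relative) / len(relative))
```

A geometric average is a common choice for ratios because one graph with a very large relative score dominates it less than it would an arithmetic average.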

  43. Datasets and code are online http://netiru.fr/research/14fc

  44. To measure the running times: same crawl sequence (the oracle); storage in RAM (20G); 3.6 GHz

  45. The running times (ms)

  46. nr: quadratic complexity, with large constant factors

  47. Their precision

  48. The precision: same crawl sequence (the oracle); precision: distance of the top node to the actual top node, arithmetically averaged over a window of 1000 steps

  49. For bretagne

  50. Their ability to lead crawls

  51. Leading the crawl: different crawl sequences, defined by the top estimated nodes

  52. Average graph scores for France

  53. The multi-armed bandits

  54. All the estimators

  55. Conclusion

  56. What we learnt: generic model; NP-hardness offline; refresh rate of 1; greedy; neighborhood features; linear regressions; multi-armed bandit strategy

  57. Future work: approximation of the optimal score; distributed crawl; recrawling nodes; further multi-armed bandits comparisons

  58. Thank you. georges@netiru.fr

  59. Finding the optimal crawl sequences in a known graph

  60. PTime many-one reduction from the LST-Graph problem. The problem remains hard if nodes, not edges, are weighted. A subtree rooted at r is seen as a crawl sequence starting from r. Free edges are added to the graph to allow free crawls from the seed to any potential root of a subtree

  61. Rich friends will make you richer

  62. The greedy strategy: node picked = argmax(β(v)), v in frontier

  63. Is not always optimal [figure: counterexample graph with node weights]

  64. The altered greedy strategy: node picked = with probability q, argmax(β(v)); with probability 1-q, a random v such that max(β(u)) - β(v) <= ζ × max(β(u))
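
A minimal sketch of this altered greedy pick, assuming non-negative scores and a uniform random choice among the nodes within the ζ tolerance (the slide does not state the distribution, so uniformity is an assumption). Names are illustrative.

```python
import random

def altered_greedy_pick(frontier_scores, q, zeta, rng=random):
    """Pick a frontier node under the altered greedy strategy.

    frontier_scores: dict node -> beta(node) for every frontier node.
    With probability q: the node with the highest score.
    With probability 1 - q: a random node v whose score satisfies
    max_score - beta(v) <= zeta * max_score.
    """
    nodes = sorted(frontier_scores)  # fixed order so ties break deterministically
    top = max(frontier_scores.values())
    if rng.random() < q:
        return max(nodes, key=frontier_scores.get)
    close = [v for v in nodes if top - frontier_scores[v] <= zeta * top]
    return rng.choice(close)
```

With `q = 1` or `zeta = 0` and a unique maximum, this reduces to the plain greedy strategy.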

  65. Altered greedy vs greedy for jazz

  66. The refresh rate disadvantage

  67. When estimation takes too long

  68. The score degradation (%) at different steps
