
Building a Search Engine for the Cuban Web, by Jorge Luis Betancourt


  1. Building a Search Engine for the Cuban Web • Jorge Luis Betancourt, Search/Crawl Engineer • November 16-18, 2016 • Seville, Spain

  2. Who am I • Jorge Luis Betancourt González • Search/Crawl Engineer • Apache Nutch Committer & PMC • Apache Solr/ES enthusiast

  3. Agenda • Introduction & motivation • Technologies used • Customizations • Conclusions and future work

  4. Introduction / Motivation • Cuba: Internet vs. intranet • Global search engines can’t access documents hosted on the Cuban intranet

  5. Writing your own web search engine from scratch? Or …

  6. Common search engine features
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • thumbnails • show metadata • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  7. How to fulfill these requirements? At its core, a search engine does two things: it stores some information (store) and retrieves that information when a question is received (query).
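The store/query idea on this slide can be sketched with a toy inverted index. This is only an illustration under simplifying assumptions (whitespace tokenization, AND semantics, no ranking); the class name `TinyIndex` and the sample documents are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Store a document and index every whitespace-separated term."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of documents that contain all query terms."""
        terms = question.lower().split()
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


# Hypothetical sample documents.
index = TinyIndex()
index.store(1, "Havana news portal")
index.store(2, "University of Havana library")
index.store(3, "Weather news")
hits = index.query("havana news")
```

A production engine such as Lucene adds analysis, scoring and on-disk segment structures on top of this basic idea.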

  8. Open Source to the rescue … 1 Crawler • 2 Index server • 3 Web interface

  9. Apache Nutch: “Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”

  10. Apache Nutch • Highly scalable • Highly extensible • Pluggable parsing, protocols, storage, indexing, scoring • Active community • Apache License

  11. Apache Solr • Total downloads: 8M+ • Monthly downloads: 250,000+ • Apache License • Great community • Highly modular • Stability / Scalability • Based on Lucene • Battle tested

  12. Back to the list of features
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  13. Image search and thumbnails • Custom parser & indexer to store the image thumbnail • Custom parser, indexer & scoring to identify and store the text related to an image (surrounding h1, p and img elements)
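One way to picture "identify and store the text related to an image" is the sketch below. It is not the Nutch plugin from the talk; it is a standalone illustration that assumes the related text is the image's `alt` attribute, the most recent `h1`, and the paragraph enclosing the image. The class name `ImageTextParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ImageTextParser(HTMLParser):
    """Collects, for each <img>, its alt text, the latest <h1> and the enclosing <p> text."""

    def __init__(self):
        super().__init__()
        self.images = []        # one dict per image found
        self.last_heading = ""  # text of the most recent <h1>
        self._in_h1 = False
        self._in_p = False
        self._p_text = []       # text chunks of the currently open <p>
        self._p_images = []     # images seen inside the currently open <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_h1 = True
            self.last_heading = ""
        elif tag == "p":
            self._in_p = True
            self._p_text = []
            self._p_images = []
        elif tag == "img":
            image = {
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "heading": self.last_heading,
                "paragraph": "",
            }
            self.images.append(image)
            if self._in_p:
                self._p_images.append(image)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
            self.last_heading = self.last_heading.strip()
        elif tag == "p" and self._in_p:
            # Attach the normalized paragraph text to every image inside it.
            text = " ".join("".join(self._p_text).split())
            for image in self._p_images:
                image["paragraph"] = text
            self._in_p = False

    def handle_data(self, data):
        if self._in_h1:
            self.last_heading += data
        elif self._in_p:
            self._p_text.append(data)


# Hypothetical page fragment.
html = (
    '<h1>Old Havana</h1>'
    '<p>The Capitol at sunset <img src="capitol.jpg" alt="El Capitolio"> seen from Prado.</p>'
    '<img src="logo.png">'
)
parser = ImageTextParser()
parser.feed(html)
parser.close()
images = parser.images
```

The collected `alt`, `heading` and `paragraph` values could then be indexed as text fields of the image document, so that a text query can match an image.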

  14. How does it work? [Diagram: a page made of h1, p and img elements, with steps numbered 1 to 3]

  15. News search (NRT & alerting) • Nutch is really not suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario

  16. Our topology • Components: http://news-site.com, RSS, fetch, parse, index, flaxsearch/luwak (monitor) • The RSS step parses the RSS feed and outputs the news links to be processed according to the SC protocol • https://github.com/commoncrawl/news-crawl
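The two ideas in this topology, extracting news links from an RSS feed and matching each new item against stored queries for alerting, can be sketched as follows. This is a simplified stand-in and not the news-crawl or Luwak API: the feed content, the query names and the helper functions `parse_rss` and `match_stored_queries` are all hypothetical, and matching is a plain "all terms present" check.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch needs no network access.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>News site</title>
    <item>
      <title>Hurricane warning issued for Havana</title>
      <link>http://news-site.com/hurricane-warning</link>
    </item>
    <item>
      <title>Baseball season opens</title>
      <link>http://news-site.com/baseball-season</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return the title and link of every <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def match_stored_queries(stored_queries, text):
    """Return the names of stored queries whose terms all occur in the text."""
    words = set(text.lower().split())
    return sorted(name for name, terms in stored_queries.items() if terms <= words)


# Hypothetical alert subscriptions: query name -> required terms.
stored_queries = {
    "weather-alerts": {"hurricane", "warning"},
    "sports": {"baseball"},
}

items = parse_rss(RSS_SAMPLE)
alerts = {item["link"]: match_stored_queries(stored_queries, item["title"]) for item in items}
```

The point of the stored-query approach is that the queries are registered once and every incoming document is checked against them, which is the reverse of a normal search and fits near-real-time alerting.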

  17. Querying the data
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  18. Querying the data
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  19. Apache Solr • Full support for highlighting (3 implementations) • Powerful faceting capabilities (even more in recent releases) • Autocorrection support based on the index content • Awesome scalability (SolrCloud, classic master-slave replication)
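As a rough illustration of how these features are requested from Solr, the sketch below builds a `/select` URL that turns on highlighting, faceting and spellchecking through standard request parameters (`hl`, `hl.fl`, `facet`, `facet.field`, `spellcheck`, `spellcheck.collate`). The base URL, the core name `web`, and the field names `content`, `host` and `type` are assumptions for the example, and spellchecking only works if the request handler has the spellcheck component configured.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, user_query, facet_fields, highlight_field):
    """Build a Solr /select URL with highlighting, facets and spellcheck enabled."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),             # highlighting
        ("hl.fl", highlight_field),
        ("facet", "true"),          # faceting (filters)
    ]
    params += [("facet.field", field) for field in facet_fields]
    params += [
        ("spellcheck", "true"),     # autocorrection based on index content
        ("spellcheck.collate", "true"),
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical Solr location and field names.
url = build_select_url(
    "http://localhost:8983/solr/web",
    "universidad habana",
    facet_fields=["host", "type"],
    highlight_field="content",
)
decoded = parse_qs(urlsplit(url).query)
```

Building the URL without sending it keeps the sketch runnable offline; in practice the URL would be fetched by the web interface and the JSON response rendered as results, facets and suggestions.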

  20. The features, once again
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  21. The features, once again
     1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection
     2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images
     3. News search (alerting, notifications) • near real time • email, push, SMS

  22. Other features - monitoring • We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so … • (and metrics) time series store • (and logs) analytical tool (and facets)

  23. Other features - monitoring • (and logs) parsing & aggregation • (and metrics) time series store • (and logs) analytical tool (and facets)
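The "parsing & aggregation" step feeding a time series store can be sketched as below: log lines are parsed with a regular expression and counted per minute and log level, which yields the kind of data points a time series store would hold. The log format, the sample lines and the helper `aggregate` are hypothetical; the talk does not show the actual pipeline configuration.

```python
import re
from collections import Counter

# Assumed log format: "YYYY-MM-DD HH:MM:SS LEVEL message".
LOG_LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (?P<level>[A-Z]+) (?P<message>.*)$"
)


def aggregate(lines):
    """Count log events per (minute, level); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match["minute"], match["level"])] += 1
    return counts


# Hypothetical crawler log lines.
sample_lines = [
    "2016-11-16 10:15:02 INFO fetching http://news-site.com/",
    "2016-11-16 10:15:40 ERROR fetch failed",
    "2016-11-16 10:15:59 INFO parsing done",
    "2016-11-16 10:16:01 INFO indexing",
    "not a log line",
]
counts = aggregate(sample_lines)
```

Each `(minute, level)` key with its count could be indexed as one small document, which is then easy to chart with facets in a dashboard.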

  24. Banana (Kibana port) for visualizations

  25. Infrastructure [Diagram: Web • Solr 2 (replica) • Solr 1 (master) • Nutch crawlers; connections labelled HTTP and JAVABIN]

  26. Some usage stats • fewer than 10 000 visits • around 600 unique visitors

  27. Future work • Apply deep learning techniques to process the raw images and mix the results with the current approach • Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)

  28. Thanks! Questions? • jorgelbg@apache.org • @jorgelbg
