Building a Search Engine for the Cuban Web • Jorge Luis Betancourt, Search/Crawl Engineer • November 16-18, 2016 • Seville, Spain
Who am I • Jorge Luis Betancourt González • Search/Crawl Engineer • Apache Nutch Committer & PMC • Apache Solr/ES enthusiast 2
Agenda • Introduction & motivation • Technologies used • Customizations • Conclusions and future work 3
Introduction / Motivation Cuba Internet Intranet Global search engines can’t access documents hosted on the Cuban Intranet 4
Writing your own web search engine from scratch? or … 5
Common search engine features 1 Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2 Image search (size, format, color, objects) • thumbnails • show metadata • filters (facets) • match text with images 3 News search (alerting, notifications) • near real time • email, push, SMS 6
How to fulfill these requirements? At its core, a search engine does two things: it stores some information (store) and retrieves that information when a question is received (query). 7
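The store/query idea above can be sketched with a toy inverted index. This is only an illustration under simplifying assumptions (whitespace tokenization, AND semantics, no ranking); the class name `TinyIndex` and the sample documents are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Store a document and register each of its terms."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of documents containing every term of the question."""
        terms = question.lower().split()
        if not terms:
            return []
        # .get avoids creating empty postings entries for unknown terms.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.store("a", "Havana weather report")
index.store("b", "Havana baseball results")
print(index.query("havana"))           # ['a', 'b']
print(index.query("havana baseball"))  # ['b']
```

A production index server such as Solr adds analysis, ranking, and persistence on top of this basic structure.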
Open Source to the rescue … 1 crawler • 2 index server • 3 web interface 8
Apache Nutch “Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.” 9
Apache Nutch • Highly scalable • Highly extensible • Pluggable parsing, protocols, storage, indexing, scoring • Active community • Apache License 10
Apache Solr • Total downloads: 8M+ • Monthly downloads: 250,000+ • Apache License • Great community • Highly modular • Stability / Scalability • Based on Lucene • Battle tested 11
Back to the list of features 1 Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2 Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3 News search (alerting, notifications) • near real time • email, push, SMS 12
Image search and thumbnails • Custom parser & indexer to store the image thumbnail • Custom parser & indexer & scoring to identify and store the text related to an image (h1, p, img) 13
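One way to picture "identify and store the text related to an image" is the sketch below. It is not the Nutch plugin from the talk (those plugins are written in Java); it is a Python illustration that assumes the related text is the image's `alt` attribute plus the nearest preceding `h1` and `p`. The class name `ImageContextParser` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collects, for every <img>, its alt text plus the nearest preceding h1 and p text."""

    def __init__(self):
        super().__init__()
        self.current_tag = None                # h1 or p currently open, if any
        self.last_text = {"h1": "", "p": ""}   # most recent text seen per tag
        self.images = []                       # one dict per image found

    def handle_starttag(self, tag, attrs):
        if tag in self.last_text:
            self.current_tag = tag
            self.last_text[tag] = ""
        elif tag == "img":
            attr = dict(attrs)
            pieces = [attr.get("alt"), self.last_text["h1"], self.last_text["p"]]
            self.images.append({
                "src": attr.get("src") or "",
                # Drop empty or missing pieces before joining.
                "text": " ".join(p for p in pieces if p),
            })

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        if self.current_tag:
            joined = self.last_text[self.current_tag] + " " + data.strip()
            self.last_text[self.current_tag] = joined.strip()


page = """
<h1>Old Havana</h1>
<p>The cathedral at sunset.</p>
<img src="/img/cathedral.jpg" alt="Cathedral">
"""
parser = ImageContextParser()
parser.feed(page)
print(parser.images)
```

The resulting `text` value is what would be indexed next to the image so that a text query can match it.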
How does it work? [Diagram: a page's h1, p, and img elements, annotated with steps 1-3] 14
News search (NRT & alerting) Nutch is really not suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario 15
Our topology • Source: http://news-site.com • Stages: RSS, fetch, parse, index • Monitor: flaxsearch/luwak • The RSS stage parses the RSS feed and outputs the news links to be processed according to the SC protocol. https://github.com/commoncrawl/news-crawl 16
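The RSS step (parse the feed, output the news links) can be illustrated as follows. This is a standalone Python sketch, not the code of the topology shown in the talk; it assumes a plain RSS 2.0 feed, and the function name `extract_news_links` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


# Hypothetical feed, used only to exercise the function.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>News site</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/story-1</link></item>
    <item><title>Second story</title><link>http://news-site.com/story-2</link></item>
  </channel>
</rss>"""

for url in extract_news_links(SAMPLE_FEED):
    print(url)
```

Each emitted link would then go through the fetch, parse, and index stages.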
Querying the data 1 Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2 Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3 News search (alerting, notifications) • near real time • email, push, SMS 17
Apache Solr • Solr has full support for highlighting (3 implementations) • powerful faceting capabilities (even more in recent releases) • autocorrection support based on the index content • awesome scalability (SolrCloud, classic master-slave replication) 19
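As a rough illustration of how highlighting, faceting, and spellchecking are requested from Solr, the sketch below builds a `/select` URL with the standard `hl`, `facet`, and `spellcheck` request parameters. The base URL, the core name `web`, and the field names `content`, `host`, and `type` are assumptions, not values from the talk, and spellchecking only responds if the spellcheck component is configured on the request handler.

```python
from urllib.parse import urlencode


def build_search_url(base_url, user_query, facet_fields=("host", "type")):
    """Build a Solr /select URL asking for highlights, facets, and spelling suggestions."""
    params = [
        ("q", user_query),
        ("hl", "true"),          # enable highlighting
        ("hl.fl", "content"),    # field to highlight (assumed field name)
        ("facet", "true"),       # enable faceting
        ("spellcheck", "true"),  # needs the spellcheck component in the handler
        ("wt", "json"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local core named "web".
url = build_search_url("http://localhost:8983/solr/web", "la habana")
print(url)
```

The web interface would issue this request and render the `highlighting`, `facet_counts`, and `spellcheck` sections of the JSON response.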
The features, once again 1 Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2 Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3 News search (alerting, notifications) • near real time • email, push, SMS 20
Other features - monitoring We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so … (and metrics) time series store (and logs) analytical tool (and facets) 22
Other features - monitoring (and logs) parsing & aggregation (and metrics) time series store (and logs) analytical tool (and facets) 23
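The "parsing & aggregation" step that feeds a time series store can be pictured with the sketch below. It does not reproduce the tooling used in the talk; the log line format (`timestamp LEVEL message`), the function name `aggregate_per_minute`, and the sample lines are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def aggregate_per_minute(lines):
    """Count log lines per (minute, level); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # not in the assumed timestamp format
        minute = ts.replace(second=0).isoformat()
        counts[(minute, parts[1])] += 1
    return counts


# Hypothetical crawler log lines.
sample = [
    "2016-11-16T10:15:02 INFO fetched page",
    "2016-11-16T10:15:40 ERROR fetch timeout",
    "2016-11-16T10:16:05 ERROR fetch timeout",
    "not a log line",
]
for (minute, level), n in sorted(aggregate_per_minute(sample).items()):
    print(minute, level, n)
```

Each `(minute, level, count)` triple is the kind of data point that would be indexed and later charted in a dashboard.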
Banana (Kibana port) for visualizations 24
Infrastructure [Diagram: Web, Solr 2 (replica), Solr 1 (master), Nutch crawlers; connections labeled HTTP and JAVABIN] 25
Some usage stats • less than 10 000 visits • around 600 unique visitors 26
Future work • Apply deep learning techniques to process the raw images and combine them with the current approach • Increase the number of signals that we get from our crawlers (correlate even more crawl-related events) 27
Thanks. Questions? jorgelbg@apache.org • @jorgelbg