Scaling NewSum: Big data text Clustering and Summarization using N-Gram graphs

  1. Scaling NewSum: Big data text Clustering and Summarization using N-Gram graphs. https://www.scify.org. Alexandros Tzoumas | a.tzoumas@scify.org. WWW.FED4FIRE.EU

  2. What’s our product about? Scaling NewSum | SciFY.org

  6. Business Goals

  7. Goals. Business goals: improve the quality of the solutions our product offers; allow NewSum technology to expand to new domains/markets. From a technical perspective, measure and evaluate: the accuracy of candidate clustering components; the effectiveness (summary quality) of alternative summarization components; the overall scalability of the system.

  8. Challenges. Business challenges: expansion to new markets should take domain-specific characteristics into account as system parameters; a product manager is not able to configure the product-related settings appropriate for each domain, so a semi-supervised process would be invaluable. From a technical perspective: define a process for evaluating different clustering and summarization components; scale the algorithms to process thousands of sources/articles.

  9. The Experiments

  10. Setup: Tengu testbed, with the support of IMEC. Stack: Cassandra, Hadoop, Spark.

  11. Experiments: Experiment set 1. Goal: Measure effectiveness of NewSum’s candidate clustering implementations. Related datasets: MultiLing (articles with clustering information); 6GB database of news articles. Methodology: Run clustering on 2 different clustering implementations and measure recall and precision. Automatic evaluation for the MultiLing dataset; manual process for the news articles dataset. Results: Selected the algorithm with higher precision & recall.
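The slides do not say how precision and recall were computed for the clustering output. One common choice, shown here as a minimal sketch under that assumption, is pairwise scoring: every pair of articles counts as a true positive when both the predicted and the reference clustering put the two articles together. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, gold):
    """Pairwise clustering precision/recall.

    predicted, gold: dicts mapping article id -> cluster label.
    Only articles present in both mappings are scored.
    """
    ids = sorted(set(predicted) & set(gold))
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1  # pair grouped together in both clusterings
        elif same_pred:
            fp += 1  # grouped by the system, separate in the reference
        elif same_gold:
            fn += 1  # grouped in the reference, missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): four articles, two reference clusters.
gold = {"a1": "x", "a2": "x", "a3": "y", "a4": "y"}
predicted = {"a1": 0, "a2": 0, "a3": 0, "a4": 1}
precision, recall = pairwise_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same function over the output of each candidate implementation gives directly comparable scores, which is all that is needed to pick the one with higher precision and recall.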

  12. Experiments: Experiment set 2. Goal: Measure scalability. Related dataset: 6GB database of news articles. Methodology: Run the clustering pipeline using as input a) the algorithm from experiment set 1, b) a variable number of articles. Measure speed. Results: Made the clustering pipeline 5 times faster! Identified areas of improvement.
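The scalability methodology (fixed algorithm, variable number of articles, measure speed) can be sketched as a small timing harness. This is an illustration only: `toy_cluster` is a hypothetical stand-in for the real clustering pipeline, which the slides do not show, and the input sizes are arbitrary.

```python
import time


def time_pipeline(cluster_fn, articles, sizes):
    """Run cluster_fn on the first `size` articles for each size.

    Returns a list of (size, elapsed_seconds) pairs.
    """
    timings = []
    for size in sizes:
        batch = articles[:size]
        start = time.perf_counter()
        cluster_fn(batch)
        timings.append((size, time.perf_counter() - start))
    return timings


def toy_cluster(batch):
    # Hypothetical placeholder: groups articles by their first word.
    groups = {}
    for text in batch:
        words = text.split()
        key = words[0] if words else ""
        groups.setdefault(key, []).append(text)
    return groups


# Synthetic input (hypothetical), standing in for the news database.
articles = [f"topic{i % 10} article number {i}" for i in range(1000)]
timings = time_pipeline(toy_cluster, articles, sizes=[100, 500, 1000])
for size, seconds in timings:
    print(f"{size} articles: {seconds:.6f} s")
```

Plotting elapsed time against input size shows whether the pipeline grows roughly linearly or worse, which is one way to spot bottlenecks.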

  13. Experiments: Experiment set 3. Goal: Measure effectiveness of NewSum’s candidate summarization implementations. Related datasets: 6GB database of news articles. Methodology: Run the summarization pipeline using as input a) configuration/parameter setting, b) a number of clusters to be summarized. Recall and precision were measured through a manual process. Results: Implemented/Identified the process for selecting the algorithm appropriate for each scenario.

  14. Conclusions

  15. What we achieved: defined a process for evaluating clustering algorithms; defined a process for evaluating summarization components; made the clustering pipeline 5 times faster; measured scalability and identified bottlenecks.

  16. How Fed4Fire+ helped us: Patron’s support was crucial to the success of the experiments.

  17. How Fed4Fire+ helped us: Provided a quick way to start experimenting with big data without having to worry about the underlying technologies.

  18. How Fed4Fire+ helped us: Funding allowed us to allocate time to implement the algorithms and analyze next steps.

  19. Next Steps

  20. Next steps: Continue working on algorithm implementations (distributed N-gram graphs; improve clustering speed using blocking methodology). Automate the setup of a pipeline in a cloud environment to be used in production. Release a domain-specific product related to blockchain news.
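Blocking, in the general sense, means comparing only those items that share a cheap key instead of comparing every pair. The slides do not specify which blocking key NewSum would use, so the sketch below assumes a deliberately simple one: any word of at least `min_len` characters. The function name, the key choice, and the sample headlines are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(articles, min_len=6):
    """Return article-id pairs that share at least one blocking key.

    articles: dict mapping article id -> text.
    Blocking key (an assumption): each lowercase word of min_len+ characters.
    """
    blocks = defaultdict(set)
    for art_id, text in articles.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")
            if len(word) >= min_len:
                blocks[word].add(art_id)
    pairs = set()
    for ids in blocks.values():
        # Only articles inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


articles = {
    "a1": "Bitcoin price rises after exchange listing",
    "a2": "Exchange lists new bitcoin futures",
    "a3": "Local football team wins championship",
    "a4": "Championship final draws record crowd",
}
pairs = candidate_pairs(articles)
print(sorted(pairs))
```

With four articles there are six possible pairs; here only the pairs sharing a key would be passed on to the more expensive similarity comparison. The trade-off is that a poorly chosen key can hide true matches, so the key needs to be evaluated alongside precision and recall.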

  22. www.scify.org WWW.FED4FIRE.EU This project has received funding from the European Union’s Horizon 2020 research and innovation programme, which is co-funded by the European Commission and the Swiss State Secretariat for Education, Research and Innovation, under grant agreement No 732638.
