CAPTO, January 2010
The problem to solve
• The volume of information published on the internet is no longer manageable; as a consequence, internet searches are imprecise.
• Because of the overwhelming amount of information and the inherent nature of the internet (a polling protocol), manual retrieval is an exhausting activity for people.
• Relevant information is only a fraction of what is available.
• All these problems, which lead to a loss of information (and hence of power), also apply to the information a company creates itself.
The goal
A way to retrieve information that is:
• On time: available when needed
• Precise: with reduced noise
• Fruitful: structured and harmonized
• Complete: extracted from any media
The solution
Capto is a complete solution for creating information by acquiring and indexing media from multiple sources.
Characteristics
• Focus on relevant information
• A single portal to retrieve all the information you need
• Users can subscribe to ‘information channels’ and are notified when new pertinent information is created
• A complete information management workflow
Technical characteristics
• Enhanced crawling capabilities (authentication, JavaScript processing, Web 2.0)
• Distributed and scalable acquisition from internet sources
• Enhanced text indexing (stemming, ranking (BM25), probabilistic search,…)
• A highly configurable CMS portal (JSR-168 compatible portlets, which can be registered in any legacy CMS)
• Scales up to millions of indexed documents
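The BM25 ranking mentioned above can be illustrated with a minimal sketch. This is not Capto's implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b are common default values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up sample documents, tokenized by whitespace.
docs = [
    "energy production environmental regulation".split(),
    "stock market news".split(),
    "regional environmental law on energy".split(),
]
scores = bm25_scores("environmental energy".split(), docs)
```

In this sketch, a document that contains no query term scores zero, and of two documents that match the same terms equally often, the shorter one ranks higher because of the length normalization.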
Application domains
• Data monitoring: finance, stock markets…
• Information monitoring and analysis (document repositories, news, web press, news feeds, blogs, mails,…)
• Brand analysis (brand monitoring, sentiment analysis,…)
• Massive text indexing and retrieval
• …and, by and large, any domain where the retrieval and analysis of information creates new (and more useful) information
The architecture
[Architecture diagram. Labels: domain dependent; domain independent; sources: www, external file system, DBMS,…]
PA case history: Edison
The problem: monitoring Italian laws and regulations on the environmental impact of energy production.
The solution:
• Automatic acquisition from several national, regional, federal and local web portals
• A complete validation workflow
• Information precision: before (manual acquisition) <50%, after ~100%
Other products on the market
Text indexing and ranking:
• Apache Lucene (http://lucene.apache.org)
• ClusterClick (www.clusterclick.com)
• Amberfish (http://www.etymon.com/tr.html)
• Terrier (http://ir.dcs.gla.ac.uk/terrier/)
Document management:
• OpenText (www.opentext.com)
• SearchExpress (www.searchexpress.com)
• IndexData (www.indexdata.com)
• AutonomyVirage (www.virage.com)
Internet information retrieval:
• HtDig (www.htdig.org)
Conclusions
Capto:
• Can be used to monitor the acquisition of multimedia from internet sources
• Can be used to index and retrieve textual information from any archived media
• Can be used to shorten the time-to-information
• Can be used to provide more precise information (and to map the information you have)
• Can be easily adopted (low cost of software adoption)
• Is domain-agnostic and multi-language