Apache Tika
What’s new with 2.0?

Nick Burch
CTO, Quanticate

Tika, in a nutshell
“small, yellow and leech-like, and probably the oddest thing in the Universe”
• Like a Babel Fish for content!
• Helps you work out what sort of thing your content (1s & 0s) is
• Helps you extract the metadata from it, in a consistent way
• Lets you get a plain text version of your content, eg for full text indexing
• Provides a rich (XHTML) version too

Tika in the news
• Panama Papers – Tika used to extract content from most of the files before indexing in Apache Solr
  https://source.opennews.org/en-US/articles/people-and-tech-behind-panama-papers/
• MEMEX – DARPA funded project
  https://nakedsecurity.sophos.com/2015/02/16/memex-darpas-search-engine-for-the-dark-web/
• http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Tika at ApacheCon
• Tim Allison, tomorrow (Thursday), 2.40pm
  Evaluating Text Extraction: Apache Tika's™ New Tika-Eval Module
• Also related: David North (same time...)
  Apache POI: The Challenges and Rewards of a 15 Year Old Codebase
• Several Committers around, come find us!

A bit of history

Before Tika
• In the early 2000s, everyone was building a search engine / search system for their CMS / web spider / etc
• Lucene mailing list and wiki had lots of code snippets for using libraries to extract text
• Lots of bugs, people using old versions, people missing out on useful formats, confusion abounded
• Handful of commercial libraries, generally expensive and aimed at large companies and/or computer forensics
• Everyone was re-inventing the wheel, and doing it badly...

Tika's History (in brief)
• The idea for Tika first came from the Apache Nutch project, who wanted to get useful things out of all the content they were spidering and indexing
• The Apache Lucene project (which Nutch used) were also interested, as lots of people there had the same problems
• Ideas and discussions started in 2006
• Project founded in 2007, in the Apache Incubator
• Initial contributions from Nutch, Lucene and Lius
• Graduated in 2008, v1.0 in 2011

Tika Releases
[Timeline chart: releases 0.1 to 0.10 and 1.0 to 1.14, plotted on a date axis running from 01/07 to 12/17]

A (brief) introduction to Tika

(Some) Supported Formats
• HTML, XHTML, XML
• Microsoft Office
  – Word, Excel, PowerPoint, Works, Publisher, Visio
  – Binary and OOXML formats
• OpenDocument (OpenOffice)
• iWorks – Keynote, Pages, Numbers
• PDF, RTF, Plain Text, CHM Help
• Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc
• Atom, RSS, ePub
• Lots of Scientific formats
• Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav
• Image – JPEG, TIFF, PNG, BMP, GIF, ICO

Detection
• Work out what kind of file something is
• Based on a mixture of things
  – Filename
  – Mime magic (first few hundred bytes)
  – Dedicated code (eg containers)
  – Some combination of all of these
• Can be used as a standalone – what is this thing?
• Can be combined with parsers – figure out what this is, then find a parser to work on it
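
A minimal sketch of the "mixture of things" idea, written in Python for brevity. This is not Tika's detector: the magic table, the extension table and the `detect` function are all illustrative assumptions. Magic bytes win when they match, and the filename is only allowed to refine a container type (a .docx is a Zip on the inside).

```python
import os
from typing import Optional

# Tiny, hand-picked magic table: (leading bytes, media type). Illustrative only.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Filename hints, also illustrative.
EXTENSIONS = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".zip": "application/zip",
    ".docx": DOCX,
    ".txt": "text/plain",
}

# Container types whose magic is too generic, and the types a filename may refine them to.
REFINES = {"application/zip": {DOCX}}


def detect(data: bytes, filename: Optional[str] = None) -> str:
    """Guess a media type from the leading bytes, using the filename as a hint."""
    by_magic = next((t for prefix, t in MAGIC if data.startswith(prefix)), None)
    by_name = None
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        by_name = EXTENSIONS.get(ext)
    if by_magic is None:
        # No magic match: trust the filename, else give up politely.
        return by_name or "application/octet-stream"
    if by_name in REFINES.get(by_magic, set()):
        # The container magic matched and the name points at a known refinement.
        return by_name
    return by_magic


print(detect(b"%PDF-1.4 ...", "report.txt"))
print(detect(b"PK\x03\x04...", "letter.docx"))
```

Note the first call: the content says PDF even though the name says .txt, so the content wins.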

Metadata
• Describes a file
  – eg Title, Author, Creation Date, Location
• Tika provides a way to extract this (where present)
• However, each file format tends to have its own kind of metadata, which can vary a lot
  – eg Author, Creator, Created By, First Author, Creator[0]
• Tika tries to map file format specific metadata onto common, consistent metadata keys
• “Give me the thing that most closely represents what Dublin Core defines as Creator”
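
The mapping idea can be sketched in a few lines of Python. The alias table, the `dc:creator` / `dc:title` target names and the `to_common_keys` function are assumptions made for illustration, not Tika's real mapping tables.

```python
import re

# Format-specific spellings mapped onto one common key. Illustrative only.
ALIASES = {
    "author": "dc:creator",
    "creator": "dc:creator",
    "created by": "dc:creator",
    "first author": "dc:creator",
    "title": "dc:title",
}


def to_common_keys(raw: dict) -> dict:
    """Collect values under a common key, keeping unknown keys unchanged."""
    common = {}
    for key, value in raw.items():
        # Drop a trailing index such as "[0]" and ignore case before the lookup.
        base = re.sub(r"\[\d+\]$", "", key.strip()).lower()
        target = ALIASES.get(base, key)
        common.setdefault(target, []).append(value)
    return common


print(to_common_keys({"Created By": "X", "Creator[0]": "Y", "Page Count": "3"}))
```

Values are kept as lists, since several format-specific fields can land on the same common key.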

Plain Text
• Most file formats include at least some text
• For a plain text file, that's everything in it!
• For others, it's only part
• Lots of libraries out there which can extract text, but how you call them varies a lot
• Tika wraps all that up for you, and gives consistency
• Plain Text is ideal for things like Full Text Indexing, eg to feed into Solr, Lucene or Elasticsearch

XHTML
• Structured Text extraction
• Outputs SAX events for the tags and text of a file
• This is actually the Tika default; Plain Text is implemented by only catching the text parts of the SAX output
• Isn't supposed to be the “exact representation”
• Aims to give meaningful, semantic but simple output
• Can be used for basic previews
• Can be used to filter, eg ignore header + footer then give remainder as plain text
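
To make "plain text is the text parts of the SAX output" concrete, here is a small Python sketch using the standard library's SAX parser on a hand-written XHTML string. It only mirrors the idea; the `BodyTextHandler` class and the sample document are assumptions, and Tika's own handlers are Java classes with their own behaviour.

```python
import xml.sax


class BodyTextHandler(xml.sax.ContentHandler):
    """Keep only character events seen inside <body>, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.parts = []

    def startElement(self, name, attrs):
        if name == "body":
            self.in_body = True

    def endElement(self, name):
        if name == "body":
            self.in_body = False
        elif self.in_body:
            # Separate the text of adjacent elements with a space.
            self.parts.append(" ")

    def characters(self, content):
        if self.in_body:
            self.parts.append(content)

    def text(self) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self.parts).split())


XHTML = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>Demo</title></head>"
    b"<body><h1>Heading</h1><p>First paragraph.</p></body>"
    b"</html>"
)

handler = BodyTextHandler()
xml.sax.parseString(XHTML, handler)
plain = handler.text()
print(plain)
```

The same pattern extends to filtering: a handler that tracks which element it is inside can skip a header or footer and keep the rest.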

Tika “Architecture”, in brief
• Hide complexity
• Hide differences
• Identify, pick and use the “best” libraries and tools
• Work with all the upstreams for you
• Come “Batteries Included” where possible / not too big, “Batteries Nearby” otherwise
• Try to avoid surprises
• Support JVM + Non-JVM users as equals
• Work to fix any of the above that we happen to miss!

What's New?

Formats and Parsers

Supported Formats
• HTML
• XML
• Microsoft Office
  – Word
  – PowerPoint
  – Excel (2,3,4,5,97+)
  – Visio
  – Outlook
  – Pre-OOXML XML formats, Lock Files etc!

Supported Formats
• Open Document Format (ODF)
• iWorks, Word Perfect
• PDF, RTF
• ePUB
• Fonts + Font Metrics
• Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200
• Plain Text
• RSS and Atom

Supported Formats
• IPTC ANPA Newswire
• CHM Help
• Wav, MIDI
• MP3, MP4 Audio
• Ogg Vorbis, Speex, FLAC, Opus, Theora
• PNG, JPG, JP2, JPX, BMP, TIFF, BPG, ICNS, PSD, PPM, WebP
• FLV, MP4 Video – Metadata and video histograms
• Java classes

Supported Formats
• Source Code
• Mbox, RFC822, Outlook PST, Outlook MSG, TNEF
• DWG CAD
• DIF, GDAL, ISO-19139, Grib, HDF, ISA-Tab, NetCDF, Matlab
• Executables (Windows, Linux, Mac)
• Pkcs7, Time Stamp Data Envelope TSD
• SQLite, dBase DBF
• Microsoft Access

OCR

OCR
• What if you don't have a text file, but instead a photo of some text? Or a scan of some text?
• OCR (Optical Character Recognition) to the rescue!
• Tesseract is an Open Source OCR tool
• Tika has a parser which can use Tesseract for found images
• Tesseract is detected, and used if found on your path
• Explicit path can be given, or can be disabled
• TODO: Better combining of OCR + normal, or eg PDF only
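
The "use it if it is on your path, unless told otherwise" behaviour can be sketched like this in Python. The `find_tesseract` function and its parameters are hypothetical names for illustration; they are not Tika's configuration options.

```python
import shutil
from typing import Optional


def find_tesseract(explicit_dir: Optional[str] = None, enabled: bool = True) -> Optional[str]:
    """Return the path of a tesseract executable, or None if OCR should be skipped."""
    if not enabled:
        # OCR switched off: do not even look.
        return None
    # With explicit_dir=None, shutil.which searches the PATH environment variable.
    return shutil.which("tesseract", path=explicit_dir)


location = find_tesseract()
print("OCR available" if location else "OCR skipped: tesseract not found")
```

A caller would only hand images to the OCR step when this returns a path, and fall back to normal parsing otherwise.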

Container Formats

Databases

Databases
• A surprising number of Database and “database” systems have a single-file mode
• If there's a single file, and a suitable library or program, then Tika can get the data out!
• Main ones so far are MS Access & SQLite
• Panama Papers dump may inspire some more!
• How best to represent the contents in XHTML?
• One HTML table per Database Table best we have, so far...
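
As a rough illustration of "one HTML table per database table", the Python sketch below walks an in-memory SQLite database and emits one table element per database table. The `tables_to_xhtml` function, the heading per table and the sample data are assumptions; the markup Tika actually produces may differ.

```python
import sqlite3
from html import escape


def tables_to_xhtml(conn: sqlite3.Connection) -> str:
    """Render every table in the database as a heading plus an HTML table."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    out = []
    for name in names:
        # Table names come from the database itself; quoting is kept simple for the demo.
        cursor = conn.execute(f'SELECT * FROM "{name}"')
        header = "".join(f"<th>{escape(col[0])}</th>" for col in cursor.description)
        rows = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in cursor
        )
        out.append(f"<h2>{escape(name)}</h2><table><tr>{header}</tr>{rows}</table>")
    return "\n".join(out)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("INSERT INTO people VALUES ('Ada', 36)")
xhtml = tables_to_xhtml(conn)
print(xhtml)
```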

Tika Config XML

Tika Config XML
• Using Config, you can specify what to use for: Parsers, Detectors, Translator, Service Loader + Warnings / Errors, Encoding Detectors, Mime Types
• You can do it explicitly
• You can do it implicitly (with defaults)
• You can do “default except”
• Tools available to dump out a running config as XML
• Use the Tika App to see what you have + save it

Tika Config XML example
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
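
The example is a "default except" config. A quick way to see what it declares is to read it with any XML library; the Python sketch below does only that. It is not how Tika loads its config, and the `summarise` function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
"""


def summarise(config_xml: bytes) -> list:
    """List each configured parser with its includes and excludes."""
    root = ET.fromstring(config_xml)
    summary = []
    for parser in root.findall("./parsers/parser"):
        summary.append(
            {
                "class": parser.get("class"),
                "mime": [m.text for m in parser.findall("mime")],
                "mime_exclude": [m.text for m in parser.findall("mime-exclude")],
                "parser_exclude": [p.get("class") for p in parser.findall("parser-exclude")],
            }
        )
    return summary


summary = summarise(CONFIG)
for entry in summary:
    print(entry)
```

Read this way, the config says: use the default parser for everything except JPEG and PDF, leave out the executable parser, and send PDFs to the empty parser.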

Embedded Resources

Tika App

Tika Server

OSGi

Tika Batch

Tika Batch
• Easy way to run Tika against a very large number of documents, for testing and for bulk ingestion
• Multi-threaded, but not yet Hadoop enabled, see https://wiki.apache.org/tika/TikaInHadoop for more there
• Output Text or XHTML, metadata, optionally embedded
• Records failures too, so you know where things go wrong
• Sets up parent/child processes to robustly handle permanent hangs/OOMs
• Optionally restart child every x mins to mitigate memory leaks.
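
The parent/child idea can be sketched in Python: the parent runs each job in a separate process, kills it if it hangs, and records the outcome either way. This is a toy model under stated assumptions, not Tika Batch itself; `run_in_child`, `run_batch` and the one-line "jobs" are hypothetical stand-ins for parsing real documents.

```python
import subprocess
import sys


def run_in_child(code: str, timeout_s: float) -> str:
    """Run one job in a child Python process and classify the outcome."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hang cannot block the parent.
        return "timeout"
    return "ok" if done.returncode == 0 else "failed"


def run_batch(jobs: dict, timeout_s: float = 5.0) -> dict:
    """Run every job in its own child and record successes and failures alike."""
    return {name: run_in_child(code, timeout_s) for name, code in jobs.items()}


results = run_batch(
    {
        "good.doc": "print('extracted text')",
        "crash.doc": "raise SystemExit(3)",
        "hang.doc": "import time; time.sleep(60)",
    }
)
print(results)
```

Because the risky work happens in the child, a crash, hang or out-of-memory error costs one document rather than the whole run.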
