The Panama Papers, Graphs, and Data Science: Unravelling the Shady World of Offshore Finance One Data Structure at a Time. Dr. Jim Webber, Chief Scientist, Neo4j
About @jimwebber Graphs, databases, distributed systems Socialist, activist, agitator, #SJW
#panamapapers
Disclaimer Offshore companies are not illegal. There is no suggestion that parties listed in the Panama Papers documents have necessarily broken the law or acted improperly.
Almost 400 journalists Based in 76 countries “Our aim is to bring journalists from different countries together in teams - eliminating rivalry and promoting collaboration. Together, we aim to be the world’s best cross-border investigative team.” icij.org/about
You may remember them from... #BahamasLeak
Source Material • The ICIJ presentation • The Reddit AMA • Online publications (SZ, Guardian, TNW et al.) • The ICIJ website • https://panamapapers.icij.org/ • The Power Players • Key Numbers & Figures
Hidden Secrets No Longer The leak exposed the offshore holdings of 12 current and former world leaders, and the dealings of 128 more politicians and public officials around the world. In all: 140 politicians from 50 countries, connected to companies in 21 tax havens.
System Architecture
Stack
Unstructured data extraction
● Nuix professional OCR service
● ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO
● Python for wiring
Database
● Apache Solr (open source, Java)
● Redis (open source, C)
● Neo4j (open source, Java)
App
● Oxwall (open source, secure social network)
● Blacklight (open source, Rails)
● Linkurious (closed source, JS)
Other
● Redis for queues
● Talend for ETL from other DBs
● AWS for cloud hardware
Data Flow Architecture [Diagram; labels: Raw Files, Database, POWER, Meta-Data, Discovery, Search]
3 million files for OCR x 10 seconds per file = 1 yr / 35 servers = 1.5 weeks. Investigators used Nuix’s optical character recognition to make millions of scanned documents text-searchable. They used Nuix’s named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.
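The back-of-the-envelope OCR estimate can be checked in a few lines. This is only a sketch: it assumes the work splits perfectly evenly across servers with no scheduling or I/O overhead, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ocr_wall_clock_days(files: int, seconds_per_file: float, servers: int) -> float:
    """Wall-clock days to OCR `files` documents spread evenly over `servers`."""
    total_seconds = files * seconds_per_file
    return total_seconds / servers / SECONDS_PER_DAY


# Figures from the slide: 3 million files, 10 seconds each, 35 servers.
single = ocr_wall_clock_days(3_000_000, 10, 1)
cluster = ocr_wall_clock_days(3_000_000, 10, 35)

print(f"1 server:   {single:.0f} days")
print(f"35 servers: {cluster:.1f} days ({cluster / 7:.1f} weeks)")
```

One server needs just under a year, and 35 servers bring that down to roughly ten days, which is in line with the rounded "1.5 weeks" on the slide.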
400 users Lucene syntax queries with proximity matching!
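For anyone who hasn't used it, a Lucene proximity query is a quoted phrase followed by `~N`, meaning the terms must appear within N positions of each other. Here is a minimal sketch of building such a query string; the helper name and the `content` field are assumptions for illustration, not ICIJ's actual Solr schema.

```python
def proximity_query(terms, slop, field=None):
    """Build a Lucene proximity query such as "john miller"~3.

    `slop` is the maximum number of positions allowed between the terms.
    `field` is optional; when given, the query is scoped to that field.
    """
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Inside a quoted phrase, backslashes and double quotes need escaping.
    escaped = [t.replace("\\", "\\\\").replace('"', '\\"') for t in terms]
    query = '"{}"~{}'.format(" ".join(escaped), slop)
    return f"{field}:{query}" if field else query


print(proximity_query(["mossack", "fonseca"], 5))
print(proximity_query(["john", "miller"], 3, field="content"))  # hypothetical field name
```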
Disconnected Documents
Context is King [Diagram: PERSON {name: "John", last: "Miller", role: "Negotiator"}; PERSON {name: "Jose", last: "Pereia", position: "Governor"}; PERSON {name: "Alice", last: "Smith", role: "Advisor"}; PERSON {name: "Maria", last: "Osara"}; {name: "Some Media Ltd", value: "$70M"}]
Context is King [Diagram: PERSON {name: "John", last: "Miller", role: "Negotiator"}; PERSON {name: "Jose", last: "Pereia", position: "Governor"}; PERSON {name: "Alice", last: "Smith", role: "Advisor"}; PERSON {name: "Maria", last: "Osara"}; {name: "Some Media Ltd", value: "$70M"}; relationship MENTIONS {since: Jan 10, 2011}]
Need to store and query connections between entities, whether they’re physical or inferred by algorithms or humans.
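To make "store and query connections" concrete, here is a toy in-memory property graph. It is a sketch of the idea only, not how Neo4j stores data: the class names, the COMPANY label, the endpoints of the MENTIONS relationship, and the `source` property are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict


@dataclass
class Relationship:
    start: int
    rel_type: str
    end: int
    props: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: labelled nodes, typed relationships, properties on both."""

    def __init__(self):
        self.nodes = {}
        self.rels = []
        self._next_id = 0

    def add_node(self, label, **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = Node(label, props)
        return node_id

    def relate(self, start, rel_type, end, **props):
        self.rels.append(Relationship(start, rel_type, end, props))

    def neighbours(self, node_id, rel_type=None):
        """Nodes one outgoing hop away, optionally filtered by relationship type."""
        return [
            self.nodes[r.end]
            for r in self.rels
            if r.start == node_id and (rel_type is None or r.rel_type == rel_type)
        ]


graph = PropertyGraph()
john = graph.add_node("PERSON", name="John", last="Miller", role="Negotiator")
media = graph.add_node("COMPANY", name="Some Media Ltd")  # label is an assumption
# Hypothetical endpoints; `source` records whether a human or an algorithm inferred the link.
graph.relate(john, "MENTIONS", media, since="2011-01-10", source="inferred")

print([n.props["name"] for n in graph.neighbours(john, "MENTIONS")])
```

Putting properties on the relationship itself is what lets a physical link and an inferred one live side by side in the same structure.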
Neo4j: All about Patterns (:Person {name:"Dan"}) -[:KNOWS]-> (:Person {name:"Ann"}) [Diagram: two NODEs, Dan and Ann, each with the LABEL Person and a name PROPERTY, joined by KNOWS] http://neo4j.com/developer/cypher
Cypher: Find Patterns MATCH (:Person {name:"Dan"}) -[:KNOWS]-> (who:Person) RETURN who [Diagram: NODE Dan, with the LABEL Person and a name PROPERTY, KNOWS an unknown NODE (???) with the LABEL Person, bound to the ALIAS who] http://neo4j.com/developer/cypher
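What the MATCH clause does can be mimicked in plain Python over a tiny in-memory graph. This is a sketch of the semantics only; Neo4j's query engine works very differently, and the function name and data layout here are invented for illustration.

```python
# Two Person nodes and one KNOWS relationship, as on the slide.
nodes = {
    1: {"label": "Person", "name": "Dan"},
    2: {"label": "Person", "name": "Ann"},
}
relationships = [(1, "KNOWS", 2)]  # (start node id, type, end node id)


def match_outgoing(nodes, relationships, start_label, start_props, rel_type, end_label):
    """Return end nodes for (:start_label {start_props})-[:rel_type]->(:end_label)."""
    results = []
    for start_id, r_type, end_id in relationships:
        start, end = nodes[start_id], nodes[end_id]
        if r_type != rel_type:
            continue
        if start["label"] != start_label or end["label"] != end_label:
            continue
        if all(start.get(key) == value for key, value in start_props.items()):
            results.append(end)
    return results


# Equivalent in spirit to:
#   MATCH (:Person {name:"Dan"})-[:KNOWS]->(who:Person) RETURN who
who = match_outgoing(nodes, relationships, "Person", {"name": "Dan"}, "KNOWS", "Person")
print(who)
```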
Data Model
Meta Data Entities
• Document, Email, Contract, DB-Record
• Meta: Author, Date, Source, Keywords
• Conversation: Sender, Receiver, Topic
Actual Entities
• Person
• Representative (Officer)
• Address
• Client
• Company
• Account
Data Model for Relationships
Meta-Data
• sent, received, cc‘ed
• mentioned, topic-of
• created, signed
• attached
• roles
• family relationships
Activities
• open account
• manage
• has shares
• registered address
• money flow
The Basic ICIJ Data Model
The Basic ICIJ Data Model in Neo4j
The Real ICIJ Data Model
What’s Been Delivered?
Data initially exposed as interactive visualization • Public figures and leaders • Different shell companies & involvements
@apcj @technige
OSS Stack Enables • Find interesting spots with full-text and fuzzy search • See neighbourhoods of suspects and interesting facts • Find connections and shortest paths between seemingly disconnected information • Add new knowledge as relationships enriching the graph structure • Stories emerge from the collaboration • Add more information from other sources
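Finding the shortest path between two seemingly unrelated entities is, at its simplest, a breadth-first search. The sketch below runs over an undirected toy graph; every entity name in it is a made-up placeholder, not data from the leak, and a graph database would normally do this for you rather than requiring hand-written traversal code.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency map from a list of (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in parents:
                continue  # already reached by an equally short or shorter route
            parents[neighbour] = current
            if neighbour == goal:
                # Walk the parent links back to the start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical placeholder entities.
edges = [
    ("Person A", "Shell Co 1"),
    ("Shell Co 1", "Intermediary X"),
    ("Intermediary X", "Shell Co 2"),
    ("Shell Co 2", "Person B"),
    ("Person A", "Address 1"),
]
adjacency = build_adjacency(edges)
print(shortest_path(adjacency, "Person A", "Person B"))
```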
Neo4j ICIJ Distribution We have also made a distribution of Neo4j available with the data in it. This will allow you to query the database to fully explore from your computer the connections between people and companies. The package also includes a guide that explains how to use Neo4j.
What’s Been Discovered?
Distorting markets London is wonderful, but expensive. Tax-avoiding investors have been able to distort the property market to suit their objective of capital gains. Tax-avoidance multiplies their advantage to the disadvantage of regular Londoners.
Tax is optional for the rich Lionel Messi’s net worth is estimated at €200,000,000. Average income in Barcelona is a more modest €33,000 per year at headline 30% income tax. Messi remains entitled to roads, emergency services and all other benefits of citizenship in his host country.
We’re not all in this together Britain’s former Prime Minister declared that the country was “all in it together” after the 2008 financial collapse. The British people have seen massive declines in education, healthcare, and social services. Cameron benefitted from his dad’s investment fund’s involvement with Mossack Fonseca.
Ice, Ice, Baby Icelandic citizens took the brunt of their banking system’s collapse. Their Prime Minister had a conflict of interest in deciding how much government money would be used to compensate shareholders. He was a (transitive) beneficiary.
Lava Jato I’ll leave this one to you, folks. Bad behaviour spans borders, but so does our technology stack and our commitment to building better communities.
What does this mean for us?
Open source data technology democratises the capabilities that were once the domain of the Web giants. What they can do, we can approximate at low cost and with high effectiveness. Bad guys beware - it’s cheap to find you!
One more thing
Enjoy the conference! @jimwebber