Wrangling Court Data on a National Level. A presentation by Mike Lissner, creator of CourtListener.com and Juriscraper.
The agenda ● Who am I? ● What is CourtListener? ● What is Juriscraper? ● How does it work? ● What does it do? ● How can you contribute? ● What's the future hold?
Me ● Mike Lissner ● Not: ● A lawyer ● A computer scientist ● Am: ● Grad from UC Berkeley School of Information ● Employee of a search company you may know ● Open source/access enthusiast ● Blog at http://michaeljaylissner.com
CourtListener Background ● Started in 2010 ● Aggregates data and provides alerts ● Powerful search engine ● Data dumps ● Citation linking (see Rowyn's presentation!) ● Free. Free. Free. ● Demo
Juriscraper ● Our main topic du jour. ● A newer project used live on CourtListener ● A simple open source scraper that anybody can use
Juriscraper's Features ● Extensibility ● Solid, modern code ● Character detection and normalization ● Simple installation ● Harmonization ● Sophisticated title casing ● Sanity checking and hard failures
Extensibility ● Supports: ● Varied geographies (countries, states, federal) ● Languages ● Media types (video, oral arguments, text) ● Currently has scrapers for: ● Federal Appeals courts ● Some states ● Some special jurisdictions ● Some back scrapers
Modern Code ● Requires: DRY, OO, PEP8 ● Uses: ● Python 2.7 ● lxml and XPath ● Requests ● chardet
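To give a feel for that stack, here is a toy fetch-and-parse snippet. It is not Juriscraper code; the URL and XPath selector are invented for illustration.

import requests
from lxml import html

# Illustrative only -- not Juriscraper code. URL and XPath are made up.
page = requests.get('http://www.ca1.uscourts.gov/opinions')
tree = html.fromstring(page.content)
# XPath expressions pull each metadata field out of the court's HTML.
case_names = tree.xpath('//table//td[1]/text()')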
Character Encodings ● Detects the declaration in XML or HTML pages ● If that's missing, sniffs the encoding based on the binary data ● Normalizes everything to UTF-8
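A minimal sketch of that fallback chain, assuming a hypothetical decode_page() helper wrapped around the real chardet.detect() call:

import chardet

def decode_page(raw_bytes, declared_encoding=None):
    # Prefer the encoding declared in the XML or HTML page itself.
    encoding = declared_encoding
    if not encoding:
        # No declaration: sniff the encoding from the binary data.
        encoding = chardet.detect(raw_bytes)['encoding'] or 'utf-8'
    # Decode, then re-encode so everything downstream is UTF-8.
    return raw_bytes.decode(encoding, 'replace').encode('utf-8')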
Harmonization ● Words like "et al, appellant, executor", etc. all get removed. ● All forms of "USA" get normalized (U.S.A., U.S., United States, US, etc.) ● All forms of "vs" get normalized. ● Text gets titlecased if needed (much harder than it seems!) ● Junk punctuation gets removed/replaced ● Dates get converted to Python objects and results are guaranteed in reverse chronological order.
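In regex form, the cleanups above might look roughly like this; harmonize() and its patterns are simplified stand-ins, not Juriscraper's actual rules.

import re

def harmonize(case_name):
    # Simplified illustration -- the real rules are more thorough.
    # Strip party descriptors such as "et al", "appellant", "executor".
    case_name = re.sub(r'\b(et al|appellant|executor)\b\.?', '', case_name,
                       flags=re.IGNORECASE)
    # Normalize common spellings of the United States.
    case_name = re.sub(r'\b(?:USA|U\.S\.A\.|U\.S\.|United States of America)',
                       'United States', case_name)
    # Normalize "v", "vs", "vs." to the conventional "v.".
    case_name = re.sub(r'\s+vs?\.?\s+', ' v. ', case_name,
                       flags=re.IGNORECASE)
    # Collapse any whitespace the removals left behind.
    return re.sub(r'\s+', ' ', case_name).strip()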
Sanity Checking and Hard Failures ● Court websites change frequently ● If our meta data is bad, we should fail completely and loudly
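A sketch of what failing loudly can mean in practice; SanityCheckError and the fields being checked are illustrative, not Juriscraper's actual checks.

class SanityCheckError(Exception):
    """Hypothetical error type: raised so bad scrapes fail loudly."""

def sanity_check(case_names, download_urls, case_dates):
    # Metadata lists that disagree in length mean the court's HTML
    # changed underneath us -- fail hard rather than guess.
    lengths = {len(case_names), len(download_urls), len(case_dates)}
    if len(lengths) != 1:
        raise SanityCheckError('Field counts differ: %s' % lengths)
    for name in case_names:
        if not name.strip():
            raise SanityCheckError('Empty case name found.')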
Integrating Juriscraper aka “All about the Caller” ● You have to build a “caller” ● You'll want: ● Duplicate detection ● Minimal impact on court websites ● Mimetype detection ● OCR ● PDF “Decryption”
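A bare-bones caller might start like this; the module path and Site attributes follow Juriscraper's layout, but treat the specifics as assumptions.

# Module path and attributes are assumed, not quoted from the docs.
from juriscraper.opinions.united_states.federal_appellate import ca1

site = ca1.Site()
site.parse()
for name, url, date in zip(site.case_names, site.download_urls,
                           site.case_dates):
    # A real caller adds duplicate detection, mimetype checks, OCR, etc.
    print name, url, date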
Duplicate Detection ● Test if the site has changed using a hash ● If so, extract the meta data from the page using Juriscraper. ● Iterate over the items, download their text or binary. ● If a hash of the text or binary is new, save the item and proceed to the next. ● Else, dup_count++ ● If proceeding, check the date of the next item. ● If prior to the dup we found, terminate. ● Else check a hash on the next item. ● If dup_count == 5, terminate.
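The same loop sketched in Python, with save_item() as a hypothetical persistence stub and items assumed to arrive newest-first:

import hashlib

SEEN_HASHES = set()  # Real callers persist this between runs.
MAX_DUPS = 5

def save_item(content, date):
    # Stand-in for real persistence (database write, file save, ...).
    pass

def scrape_items(items):
    # `items` is assumed to be (content_bytes, date) pairs in the
    # reverse chronological order that Juriscraper guarantees.
    dup_count = 0
    last_dup_date = None
    for content, date in items:
        item_hash = hashlib.sha1(content).hexdigest()
        if item_hash in SEEN_HASHES:
            dup_count += 1
            last_dup_date = date
            if dup_count == MAX_DUPS:
                break  # Five dups: nothing new left on this page.
        else:
            if last_dup_date is not None and date < last_dup_date:
                break  # Older than a known dup: past the fresh content.
            SEEN_HASHES.add(item_hash)
            save_item(content, date)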
Impact Minimization ● Methods: ● Reasonable duplicate detection algorithms ● User-agent set to “juriscraper” ● Free sharing of data via our API
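For example, a caller that fetches pages itself with Requests can mirror that user-agent; the URL here is illustrative.

import requests

# Identify ourselves to court servers the same way Juriscraper does,
# so courts see one consistent, contactable scraper.
response = requests.get(
    'http://www.ca1.uscourts.gov/opinions',
    headers={'User-Agent': 'juriscraper'},
)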
Mimetypes, OCR and PDFs ● Mimetypes can be detected via "magic numbers" ● Text can then be extracted. ● If no text, use OCR. ● If text is garbled, try "decrypting" it. This would be awful, but...
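A rough sketch of that pipeline, assuming the python-magic library and poppler's pdftotext are installed; the OCR fallback and the "decryption" step are left out.

import subprocess
import magic  # python-magic, a libmagic binding

def extract_text(path):
    # Route a downloaded file by its magic-number mimetype.
    mimetype = magic.from_file(path, mime=True)
    if mimetype == 'application/pdf':
        # pdftotext writes extracted text to stdout when given '-'.
        return subprocess.check_output(['pdftotext', path, '-'])
    elif mimetype.startswith('text/'):
        return open(path).read()
    # Image-only PDFs and scans would fall through to an OCR pass
    # (e.g. tesseract), omitted here.
    return None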
We built a sample caller. Two, actually.
Getting involved ● No more siloed scrapers! ● All code is open source (BSD license) ● Installation is simple (five minutes using pip) ● We built some custom tools to make development easier. ● Looking for: ● More users ● More developers
Why this is important ● Scaling is vital. ● More callers means: ● More jurisdictions ● Faster response times ● Improved code ● A unified court scraper (user-agent)
Juriscraper's Future ● Better alerts for downed scrapers ● Court-level rate throttling ● HTML tidying ● API Refactoring ● More courts! ● More backscrapers ● More unit tests
Juriscraper Demo/walkthrough
Thank you. ● http://courtlistener.com/ ● https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/ ● https://bitbucket.org/mlissner/juriscraper/ ● http://michaeljaylissner.com/