A thread about the history of our legal citation extractor and open source. 1/?
In 2015 or so, two students at @BerkeleyISchool wrote the first version of it. It was pretty good, and was able to find basic citations in a paragraph, look them up in CourtListener, and then make them into links. Cool, v1 was born.
Later, we wanted it to work on all kinds of citations, and we started building a huge database of reporters, their abbreviations, dates, etc. This became our reporters DB: https://github.com/freelawproject/reporters-db/. A *bunch* of folks have now sunk weeks of their lives into making it really good.
And it works! Using the database of reporter dates and abbreviations, we can find practically all of the citations in a block of text. Awesome. But it was missing a few things:
1. It didn't handle depth of treatment.
2. It was really bad at weird citation formats.
Well, we're in 2020 now, and along comes another volunteer. Out of nowhere, he implements support for Id., supra, etc. Wow. https://free.law/2020/03/05/citation-data-gets-richer/
Next, we realize it'd be great if all this citation stuff lived outside of CourtListener's code base so that others could use it. A few weeks ago, "eyecite" was born. Now, if you want to pull citations out of text, there's an easy drop-in tool for that: https://pypi.org/project/eyecite/
But we're not done yet, b/c it's 2021 now. The next thing that happens is that Jack Cushman from @harvardlil drops by and makes it 10× faster via some embarrassingly simple performance tweaks. The fruit was hanging low, folks.
Now we're working on making it match a lot more stuff — like statutes — while keeping the performance mostly stable. It's incredible work, and we'll be sharing more about eyecite soon, but it's hard not to rave right now. /fin