by Dr. Michael Covington, Director of Intelligent Analytics
The client had a large collection of legal documents, all relating to business transactions that were very similar and often interacted with each other. The client needed a study of whether structured information could be extracted from the documents automatically and loaded into a database.
The main challenges were:
- The documents used complex legal language, which existing natural language processing tools often had trouble analyzing
- The documents were on paper, some poorly printed, and many with handwritten annotations
- The extracted information needed to be very accurate: the goal was to extract exact items, not to take a statistical sampling of the documents
The feasibility study addressed three areas: optical character recognition with human assistance (along with a longer-term plan to make this step unnecessary by saving the documents as word processor files or textual PDFs); natural language parsing and information extraction; and "how to crawl before we walk," that is, how to benefit from methods that are only partly successful. The study recommended a technical approach and estimated the implementation effort required.
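To make the "crawl before we walk" idea concrete, here is a minimal Python sketch of one way a partly successful method can still be useful: accept an extracted item automatically only when the match is unambiguous, and send everything else to a person. This is an illustration, not the approach recommended in the study. The `AMOUNT` pattern, the `Result` class, and the function names are all hypothetical, and a dollar amount is used only as a stand-in for whatever exact items the documents contain.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical pattern for one kind of item: a dollar amount such as
# "$1,250,000.00". The trailing lookahead rejects matches that run
# straight into a letter or digit, which often signals an OCR error.
AMOUNT = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![0-9A-Za-z])")


@dataclass
class Result:
    text: str               # the passage that was examined
    amount: Optional[str]   # extracted value, or None if not accepted
    needs_review: bool      # True when a person should check the passage


def extract_amount(text: str) -> Result:
    """Accept a value only when exactly one clean match is found."""
    matches = AMOUNT.findall(text)
    if len(matches) == 1:
        return Result(text, matches[0], needs_review=False)
    # No match (possibly damaged text) or several candidates:
    # do not guess, hand the passage to a human reviewer.
    return Result(text, None, needs_review=True)


def split_batch(passages: List[str]) -> Tuple[List[Result], List[Result]]:
    """Separate automatically accepted results from those needing review."""
    results = [extract_amount(p) for p in passages]
    accepted = [r for r in results if not r.needs_review]
    review = [r for r in results if r.needs_review]
    return accepted, review


if __name__ == "__main__":
    accepted, review = split_batch([
        "The purchase price is $1,250,000.00.",
        "Deposit of $10,000 and balance of $90,000.",
        "Buyer pays S500,000 at closing.",  # "$" misread as "S"
    ])
    print(len(accepted), "accepted;", len(review), "sent for review")
```

The design choice in this sketch is to prefer a missed extraction over a wrong one, which fits the requirement for exact items: the automatic path handles only the clear cases, and human effort is concentrated on the rest.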