This post defines predictive analytics and explains how it works and when it can be applied during discovery to make workflows more efficient.
In the previous article we introduced the three groups of analytics one can apply in discovery: structured, conceptual and predictive analytics. This article gives an overview of predictive analytics.
Predictive analytics is used in many fields: actuarial science, financial services, insurance, telecommunications, retail, travel, healthcare and pharmaceuticals. We encounter it every time we order a book on Amazon and the right side of the page is populated with suggestions for other books we might also like. Do you use Pandora radio? What about Facebook or LinkedIn? Predictive analytics is at work every time you like or dislike a song, follow a company or friend someone. It is everywhere.
Predictive analytics can significantly lower costs, dramatically reduce review time and substantially increase quality for document review. It has been proven effective to the point where the judiciary is suggesting (and sometimes ordering) that counsel consider predictive analytics in their eDiscovery protocols. Furthermore, the Department of Justice (DOJ) Antitrust Division recently stated that it prefers merging parties to use predictive analytics for Hart-Scott-Rodino (HSR) Second Requests.
What is Predictive Analytics?
Predictive analytics encompasses a variety of machine learning techniques from the fields of statistics, computer science, data mining and game theory. It is a workflow in which a human subject matter expert reviews a subset of documents in order to train the system on what they are looking for. The system can then statistically “predict” how the human would code the rest of the collection. Once that is complete, the legal team can make informed decisions on how best to approach the collection for review and determine the total cost implications. Predictive analytics is often referred to as “Predictive Coding” (PC) or “Technology Assisted Review” (TAR).
How Does Predictive Analytics Work?
Predictive analytics combines statistics and the efficiencies of a computerized sampling system with knowledge from one or more human experts. The experts interact with the system by answering “yes/no” questions for a series of controlled samples. Depending on the workflow and goal, the question might be:
- “Does this document have legal or business relevance for this case?”
- “Is this document responsive to a specific issue in a Production Request?”
- “Will this produced document support my side of the case?”
- “Does our document retention policy require us to keep this document?”
- “Does this document pertain to the XYZ compliance requirement?”
The system builds a classification model in the background as it learns from the expert and presents subsequent samples. As it learns, the system tunes the classification model to the point where it can “predict” what the human will choose as “affirmative” (relevant) in the sample they are reviewing. Quality control and verification can then be run to confirm that the system has indeed reached stability. At that point, TAR systems based on conceptual analytics will code documents above a certain threshold as “responsive” and those below the threshold as “not responsive.” Support Vector Machine (SVM) based Predictive Coding systems will calculate relevancy scores ranging between 0 and 100 for the rest of the collection, where documents with higher scores are more likely relevant and those with lower scores are more likely not relevant. The results from the TAR and PC systems are then used in a variety of discovery workflows. Examples include Early Data Assessment, Strategic Document Review, Accelerated Document Review and Document Review QC/Verification.
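The train, score and threshold loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it swaps in a simple naive Bayes word-count classifier rather than the SVM or conceptual engines that commercial tools use, and the sample documents, the function names (`train`, `relevance_score`) and the cutoff of 50 are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it on whitespace."""
    return text.lower().split()

def train(coded_samples):
    """Build per-class word counts from (text, is_relevant) pairs coded by the expert."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in coded_samples:
        word_counts[is_relevant].update(tokenize(text))
        doc_counts[is_relevant] += 1
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary

def relevance_score(model, text):
    """Return a 0-100 score: the estimated probability of relevance times 100."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        # Prior: the share of expert-coded documents carrying this label.
        log_prob[label] = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # skip words never seen during training
                # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
                p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                log_prob[label] += math.log(p)
    # Normalize the two log scores into a probability of relevance.
    return 100.0 / (1.0 + math.exp(log_prob[False] - log_prob[True]))

# Hypothetical expert coding: True means the expert answered "yes, relevant".
coded = [
    ("pricing agreement with competitor on widget prices", True),
    ("meeting to discuss competitor pricing strategy", True),
    ("lunch menu for the office party", False),
    ("office party parking reminder", False),
]
model = train(coded)

# Score the uncoded remainder of the collection, then apply an assumed cutoff.
uncoded = ["draft pricing agreement for widget", "party lunch reminder"]
scores = [relevance_score(model, doc) for doc in uncoded]
THRESHOLD = 50  # hypothetical; real cutoffs are set through QC and sampling
coding = ["responsive" if s >= THRESHOLD else "not responsive" for s in scores]
```

In this toy collection the pricing draft scores well above the cutoff and the party reminder well below it. A real system would train on far more samples and validate the cutoff statistically before relying on it.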
How Predictive Analytics Can Be Applied to Your Document Collection
Predictive analytics has recently been applied to support document retention policies for corporations and government entities. Think about it: if you can train a system to identify likely relevant documents, you can certainly train it to identify which documents need to be kept to support a specific compliance requirement, as well as which documents can be considered for deletion. Think of the cost savings! This is an emerging area for predictive analytics, and we are just beginning to see the efficacy of such an approach. Stay tuned: this area of predictive analytics will most definitely grow in the next few years.
Early Data Assessment
A new matter emerges where opposing counsel has requested documents from several custodians. There is a small percentage of potentially relevant documents. Both sides have tried to negotiate search terms but cannot agree. Predictive analytics could be used on your key custodians to get a sense of what facts are in the collection. This may provide sufficient information for your firm to make a “fight” or “settle” decision. Predictive analytics can also be used to identify the 75-85% of the relevant documents that typically reside in 15-25% of the collection. Early Data Assessment workflows provide both early insight into what is contained in the corpus and defensible metrics for making strategic decisions about review options early in the eDiscovery process, all with an eye towards reducing total costs.
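The claim that most relevant documents sit in a small slice of a ranked collection is a statement about recall at a given review depth. A minimal sketch of that calculation follows, assuming the documents are already sorted by relevance score and that their true relevance is known from a coded validation set. The function name `recall_at_depth` and the 20-document ranking are hypothetical.

```python
def recall_at_depth(ranked_labels, depth_fraction):
    """Share of all relevant documents found in the top `depth_fraction` of a ranking.

    ranked_labels: booleans ordered from highest to lowest relevance score,
                   where True means the document is actually relevant.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    cutoff = int(len(ranked_labels) * depth_fraction)
    return sum(ranked_labels[:cutoff]) / total_relevant

# Hypothetical ranking of 20 documents, 5 of which are relevant.
ranked = [True, True, False, True, True] + [False] * 9 + [True] + [False] * 5

# Review only the top 25% of the ranking (5 documents).
recall = recall_at_depth(ranked, 0.25)
```

Here four of the five relevant documents fall in the top quarter, so `recall` is 0.8. Figures like this, measured on a properly drawn sample, are the kind of metric a team can use to argue that reviewing a fraction of the collection is sufficient.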
Incoming Production from Opposing Counsel
Using predictive analytics for an incoming production is all about identifying the documents that support your side of the case as quickly as possible. Simply train the system on what is important to you and let it bring the most important documents to the top.
Document Review Prioritization
Your firm is tasked with reviewing a large set of documents (50,000+ records). This set may or may not have already been culled by key terms agreed to by both parties. It is suspected that a large amount of non-relevant material remains in the corpus. Predictive analytics can be used to find the top 10% of highest-ranking, most likely relevant documents and have your lead counsel focus on those. The next 11-25%, holding the less relevant documents, could be assigned to lower-cost contract reviewers. The percentage of relevant documents in the remaining 75% of the collection (normally 15% or less) could be sampled for defensibility and cost-shifting arguments. Using predictive analytics (especially Predictive Coding) to prioritize the documents that need to be reviewed makes the overall review process more efficient.
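As a rough sketch of this prioritization, the ranked collection can be split into the three tiers described above and a random sample drawn from the lowest tier for manual checking. The tier names, the document IDs, the synthetic scores and the sample size of 50 are all assumptions made for illustration.

```python
import random

def assign_tiers(scored_docs):
    """Split (doc_id, score) pairs into three review tiers by rank."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    n = len(ranked)
    top_cut = n * 10 // 100   # end of the top 10% of the ranking
    mid_cut = n * 25 // 100   # end of the top 25% of the ranking
    return {
        "lead_counsel": [doc_id for doc_id, _ in ranked[:top_cut]],
        "contract_review": [doc_id for doc_id, _ in ranked[top_cut:mid_cut]],
        "sample_only": [doc_id for doc_id, _ in ranked[mid_cut:]],
    }

def draw_sample(doc_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample for manual checking."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))

# Synthetic scores purely for illustration: document i gets score i % 101.
docs = [(f"DOC-{i:05d}", i % 101) for i in range(1000)]
tiers = assign_tiers(docs)

# Reviewers would code this sample to estimate how many relevant
# documents remain in the unreviewed 75%.
sample = draw_sample(tiers["sample_only"], 50)
```

With 1,000 documents this yields 100 for lead counsel, 150 for contract review and 750 in the sampled remainder. The share of relevant documents that reviewers find in `sample` is the estimate that would back a defensibility or cost-shifting argument.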
Predictive analytics is not new and is widely used in many industries. Well-known examples include Amazon, Zappos and Pandora radio, as well as major segments such as energy, transportation and financial management.
The predictive analytics used in today’s Predictive Coding and Technology Assisted Review solutions goes far beyond keyword searching. It is a powerful tool that uses well-established classification algorithms rather than just discrete keyword searches. Unlike keyword searches, predictive analytics takes into account all the words in a document, as well as words to exclude. It also identifies the conceptual relationships of the words to one another, which help determine what is and is not likely to be relevant. Predictive analytics incorporates human intelligence to leverage the results of review across large document populations. It can be used in a variety of workflows at several points along the EDRM lifecycle, including Early Data Assessment, Accelerated Document Review, Review QC/Verification and Information Governance. The key benefits of predictive analytics are increased quality, reduced review time and reduced cost: yes, all three.