Anyone who works closely with me knows I love to define things. Why? Because I want to ensure that everyone I work with talks about things in the same way, and that clear expectations are established on projects. Things tend to go wrong in the eDiscovery process when assumptions are made.
So what is this all leading up to? One simple question: do you know the difference between extracted text and OCR text?
It’s okay if you don’t! After 17 years of working in the eDiscovery industry, I still have this discussion on a regular basis during my consulting engagements. My theory is that we are so focused on all of the advanced technology that we forget about the basics. I am revisiting this topic because I want everyone to understand the importance of text in eDiscovery, especially when working with integrated databases that contain both ESI and scanned paper documents, as well as when using advanced technologies, such as predictive analytics, predictive coding and email threading.
What are the two different types of text?
- Extracted Text – Extracted text files are the end result of an ESI text extraction process. Text extraction refers to a normalization process where native file text content, title information and/or various native file metadata are extracted from ESI so important content can be searched in almost any database, repository or file system. Important Note: Extracted text content is 100% accurate since it is pulled directly from the source ESI.
- OCR Text (Optical Character Recognition) – OCR text files are the end result of an OCR process. OCR is a technology that converts digital or scanned image content (TIFF images, JPEG images, image-only PDFs, etc.) into searchable text so that documents can be searched electronically. Important Note: OCR is not 100% accurate (it is typically between 0%-90%, depending on the quality of the image), and OCR text is not the same as extracted text.
It is important for organizations to differentiate between extracted text and OCR text, especially when matters involve both ESI and scanned paper. But it goes even further than that. What happens to native ESI where text could not be extracted? Are these native files being imaged and OCRd on your projects? Do you know the answer to that question? You should, especially if discovery protocols ever come into question.
When looking at many of the data deliverables I receive, I am oftentimes unable to differentiate between the different types of text. If legal teams are unable to differentiate between extracted and OCR text, they’ll have to wait until the data is loaded into a review platform to figure it out. At that point, if you cannot validate the quality of the text, any advanced technology that relies on textual content is going to expose the text problems for you, but only after you have paid a lot of money to implement it.
So what are some key things you can do to differentiate between extracted text and OCR and validate the quality of text for your eDiscovery projects?
5 Best Practices for Dealing with Text in eDiscovery
1. When finalizing production orders, ask to have a field called ‘Source Data Type’ included in all deliverables, with either an ‘ESI’ or ‘Image’ value applied.
If parties agree to provide this information, you then have the ability to quickly search or filter on data deliverables to confirm quality of text.
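To make this concrete, here is a minimal sketch of how a deliverable could be filtered on that field once it arrives. It assumes a simple comma-delimited load file; the column names 'DocID' and 'Text Path', the sample rows and the 'UNSPECIFIED' bucket are all hypothetical and used only for illustration. Real load files vary in delimiter and layout, so treat this as a starting point rather than a finished tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical load-file excerpt. Only 'Source Data Type' comes from the
# practice described above; the other columns and rows are assumptions.
SAMPLE_LOAD_FILE = """DocID,Source Data Type,Text Path
DOC0001,ESI,text/DOC0001.txt
DOC0002,Image,text/DOC0002.txt
DOC0003,ESI,text/DOC0003.txt
DOC0004,,text/DOC0004.txt
"""


def group_by_source_type(load_file_text, field="Source Data Type"):
    """Group document IDs by the value found in the source-type field."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(load_file_text)):
        # Blank or missing values get their own bucket so they stand out.
        value = (row.get(field) or "").strip() or "UNSPECIFIED"
        groups[value].append(row["DocID"])
    return dict(groups)


groups = group_by_source_type(SAMPLE_LOAD_FILE)
for source_type, doc_ids in sorted(groups.items()):
    print(f"{source_type}: {len(doc_ids)} document(s) -> {doc_ids}")
```

Records that land in the 'UNSPECIFIED' bucket are the ones worth raising with the producing party, since their text type cannot be confirmed from the deliverable alone.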
2. When building databases, organize data based on whether the documents are ESI or scanned paper.
Since scanned paper typically has OCR (and sometimes no text at all), keeping data organized in two separate buckets will allow you to isolate images until the quality of the text is confirmed. This will also allow you to easily exclude documents with poor text from any analytics indexes you create, or email threading processes that you run.
3. When sending ESI processing projects to your internal support team or to outside vendors, make sure to ask how they handle source ESI files where text cannot be extracted.
This is an important discussion to have for two reasons:
- Data should be produced as it is stored in the normal course of business. With the increasing use of advanced technology such as analytics, predictive coding and email threading, many of the eDiscovery protocol disputes we see involve the quality of text. Since source ESI without extracted text most often gets imaged and OCRd (going from 100% accurate extracted text to 0%-90% quality text), do you have a way to identify these documents in the event your preservation and collection protocols are called into question?
- If source ESI without extracted text doesn’t get OCRd then what is done? If the answer is nothing and they are processed as is without imaging and OCR, then these documents will go into a review platform, or get produced, without any text at all. This means these documents won’t be searchable and will not be included in any advanced technology workflows you implement. Are they flagged for you so you can do a linear review? If not, these documents may be missed completely during review.
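Both scenarios above come down to being able to flag the affected documents. The sketch below shows one way that flagging could look, assuming each record carries a source type and a note on where its text came from. The `Document` class, the `text_source` values and the flag names are hypothetical; they are not fields from any particular processing tool or review platform.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    source_type: str  # "ESI" or "Image" (assumed values)
    text_source: str  # "extracted", "ocr" or "none" (assumed values)
    text: str = ""


def flag_documents(docs):
    """Flag ESI that fell back to OCR and documents with no usable text."""
    flags = {"esi_ocr_fallback": [], "no_text": []}
    for doc in docs:
        if not doc.text.strip():
            # No searchable text at all: candidate for linear review.
            flags["no_text"].append(doc.doc_id)
        elif doc.source_type == "ESI" and doc.text_source == "ocr":
            # Native file that was imaged and OCRd instead of extracted.
            flags["esi_ocr_fallback"].append(doc.doc_id)
    return flags


docs = [
    Document("DOC0001", "ESI", "extracted", "Quarterly forecast attached."),
    Document("DOC0002", "ESI", "ocr", "Quarter1y f0recast attached."),
    Document("DOC0003", "ESI", "none", ""),
    Document("DOC0004", "Image", "ocr", "   "),
]
flags = flag_documents(docs)
print(flags)
```

With flags like these in place, the OCR fallback set can be identified if your protocols are questioned, and the no-text set can be routed to linear review instead of silently dropping out of search and analytics.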
4. Always validate the quality of your textual content before implementing any advanced technology solution, searching content or starting review.
Well-organized data at the beginning of a project will ensure more accurate review in the end.
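As a rough illustration of what a first-pass text check could look like, the sketch below scores a document by the share of its tokens that are purely alphabetic or purely numeric. This is a crude heuristic I am assuming for illustration, not an industry-standard metric: the function names and the 0.8 threshold are hypothetical, and legitimate content such as email addresses or part numbers will lower the score. It is meant to surface candidates for a closer look, not to replace one.

```python
import string


def text_quality_score(text):
    """Return the fraction of tokens that look like clean words or numbers."""
    tokens = [t.strip(string.punctuation) for t in text.split()]
    # Ignore internal apostrophes and hyphens so "don't" and
    # "well-organized" are not counted as noise.
    tokens = [t.replace("'", "").replace("-", "") for t in tokens if t]
    if not tokens:
        return 0.0
    clean = sum(1 for t in tokens if t.isalpha() or t.isdigit())
    return clean / len(tokens)


def needs_manual_check(text, threshold=0.8):
    """Flag text whose score falls below the (assumed) threshold."""
    return text_quality_score(text) < threshold


good = "Please review the attached contract before Friday."
noisy = "P1ease rev!ew th3 att@ched c0ntract bef0re Fr1day."

print(text_quality_score(good), needs_manual_check(good))
print(text_quality_score(noisy), needs_manual_check(noisy))
```

Documents that fall below the threshold can be held out of analytics indexes and email threading until someone has looked at the underlying images.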
5. Don’t ignore conversations about data formats and text.
Even though this goes against eDiscovery rules and protocols, many times we receive deliverables where ESI has been printed, scanned and then OCRd instead of being processed correctly. Don’t settle for less than accurate text when most often you can receive 100% accurate, extracted text from ESI.
This may seem like common knowledge, but you would be surprised at how many issues arise during document review because the quality of text was never validated or discussed. If you don’t take these things into consideration, all of the advanced technology in the world won’t help you, and in the end you’ll spend even more money cleaning up the workflows that were supposed to expedite review in the first place.