In my last post I gave three reasons why legacy eDiscovery review tools no longer meet the needs of responsible litigators and litigation support teams: burgeoning volumes of data, limited database functionality, and poor reliability. Newer SQL-based review tools like kCura’s Relativity and Ipro’s Eclipse are far better equipped to handle these and the other issues facing today’s document reviewer.
But a little bad always comes with the good. For all the benefits provided by the SQL back-end, there is a considerable amount of upkeep that comes with it. Like a garden, a SQL database must be tended if it is truly to flourish. Below are some tips to help your database thrive.
Limit the number of text fields
Too many fields, or bloated ones, can hurt performance. Databases with a large number of text fields are slow to respond to queries and to display lists. SQL Server has a row size limit of 8,060 bytes. When a row exceeds this limit, the server moves one or more of its variable-length columns to pages in the ROW_OVERFLOW_DATA allocation unit. In layman’s terms, when too many characters are allocated to text fields, a portion of the data gets stored elsewhere in the database. So when you run a search that displays those fields, or displays documents containing them, the system has to leave the standard data storage area and fetch the overflow data from its separate location. Your search will be slow to finish.
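If you have direct access to the SQL Server instance, you can check whether row overflow is actually occurring by querying the system catalog. A minimal sketch, run against the workspace database (these are standard SQL Server system views, not review-platform tables):

```sql
-- Find tables whose rows have spilled into ROW_OVERFLOW_DATA
-- pages; the more pages listed, the more often queries must
-- leave in-row storage to fetch the overflow data.
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    au.total_pages,
    au.used_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p
    ON au.container_id = p.hobt_id
WHERE au.type_desc = 'ROW_OVERFLOW_DATA'
  AND au.total_pages > 0
ORDER BY au.total_pages DESC;
```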
Limit the number of folders
In certain SQL-based review platforms like Relativity, the folder is the physical location of the document in the database. Searches, batch sets, and views all pull data from these folders, and to do so, mini-searches are run on the back end to locate the documents. If the folder structure is complicated, or there are simply too many folders, the searches required to display the documents take longer to run. A complicated folder structure inhibits document-to-document navigation because the system is running all of these mini-searches.
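Relativity’s actual schema is proprietary, so the table and column names below are hypothetical, but the shape of the problem is easy to sketch: resolving “all documents under this folder” means walking the folder tree, and the deeper and wider that tree, the more work each mini-search does before a single document comes back.

```sql
-- Hypothetical schema: Folder(FolderID, ParentID) and
-- Document(DocumentID, FolderID). Fetching a folder's
-- contents including all subfolders requires a recursive
-- walk of the hierarchy; every extra level of nesting adds
-- another pass before any documents are returned.
DECLARE @RootFolderID int = 1;

WITH FolderTree AS (
    SELECT FolderID
    FROM Folder
    WHERE FolderID = @RootFolderID
    UNION ALL
    SELECT f.FolderID
    FROM Folder AS f
    JOIN FolderTree AS ft ON f.ParentID = ft.FolderID
)
SELECT d.DocumentID
FROM Document AS d
JOIN FolderTree AS ft ON d.FolderID = ft.FolderID;
```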
Take care with persistent highlight sets
Persistent highlighting is probably the single biggest drag on document-to-document navigation speed. Reviewers see highlights as a boon to productivity and have become increasingly dependent on them, and that dependency leads to ever longer lists of ever more complicated highlight terms. There are two ways to display highlights: from a list of terms, or from a field pre-populated with the terms that actually appear in the specific document. When highlighting from a list, the more complicated the terms and the more prevalent the hits, the longer the highlights take to display: the system searches every document for every word in the list, even words that do not appear in that document, and the document cannot finish loading until the highlighting is done. So there are competing interests: more highlights may make for a more productive review, but they also make for slower system response. The database administrator and the legal team need to talk through the best compromise between the two.
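The field-based approach can be pre-computed, so the viewer only has to paint terms already known to be in the document. The sketch below assumes a hypothetical Document(DocumentID, ExtractedText, HitTerms) table and a SearchTerm(Term) list; in practice you would drive this from the dtSearch or full-text index rather than a brute-force CHARINDEX scan, but the idea is the same.

```sql
-- Populate a per-document HitTerms field once, up front, so
-- highlighting at review time reads a short per-document list
-- instead of testing every term against every document.
-- (STRING_AGG requires SQL Server 2017 or later.)
UPDATE d
SET d.HitTerms = hits.TermList
FROM Document AS d
CROSS APPLY (
    SELECT STRING_AGG(st.Term, '; ') AS TermList
    FROM SearchTerm AS st
    WHERE CHARINDEX(st.Term, d.ExtractedText) > 0
) AS hits;
```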
Use hard tags instead of complex nested queries
Once upon a time, creating queries looked something like this: X AND NOT Y. The system stored your queries and numbered them, and you could refer back to them by number to combine them in different ways. Today’s SQL databases let you draft far more complex queries combining dtSearch logic with field conditions and folder restrictions. When you combine these with other searches doing equally complicated work, the queries can bog down performance. Worse, when you combine searches that refer to other searches, the SQL database has to work extremely hard to identify and report your results. Rather than stacking those queries and pushing the system to its limits, you are better off using hard tags or persistent lists to expedite the query. This takes a little planning: you need a tagging workflow so that queries are updated as tags are set. The result is a bit more front-end work, but a faster, healthier search that gets the most out of SQL.
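In SQL terms, the difference looks something like the sketch below. The schema and criteria are hypothetical (a DocumentTag join table and made-up conditions), but the pattern is real: run the expensive logic once to set a tag, then let every downstream search filter on the tag with a cheap indexed join.

```sql
-- Hypothetical schema: Document(DocumentID, Custodian,
-- DocDate) and DocumentTag(DocumentID, TagID).
DECLARE @PrivTag int = 42;  -- made-up tag ID

-- One-time: materialize the complex criteria as a hard tag.
INSERT INTO DocumentTag (DocumentID, TagID)
SELECT d.DocumentID, @PrivTag
FROM Document AS d
WHERE d.Custodian = 'Smith'        -- illustrative stand-ins
  AND d.DocDate >= '2012-01-01';   -- for the nested logic

-- Thereafter: downstream searches filter on the tag instead
-- of re-evaluating the nested searches on every run.
SELECT d.DocumentID
FROM Document AS d
JOIN DocumentTag AS dt ON dt.DocumentID = d.DocumentID
WHERE dt.TagID = @PrivTag;
```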
Manage the data going into your search indexes
The entire King James Bible contains only about 4 MB of text. By comparison, the volume of full text found in many electronic files is astounding. Excel and PowerPoint files often contain upwards of 10–20 MB of text; the average meeting invitation contains about 0.002 MB. The effect these large text files have on searches is equally striking: they slow index builds, search execution, and term highlighting. Very large text files also inflate the hosted size of the database, because they are counted twice: once for hosting and once for the index. In a recent case we found that 16,000 records in a 4.8 million record database, a third of one percent of the collection, accounted for over half of the index size. These files usually represent only a small fraction of the documents in the database, but their presence can drag down performance for the whole collection. To solve the problem, consider building a separate index for the very large files, or removing them from the index altogether and reviewing them linearly. Doing so will save time and aggravation over the life of the project.
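Finding the offenders is straightforward if you can query the extracted text directly. A minimal sketch against a hypothetical Document(DocumentID, ExtractedText) table:

```sql
-- List the 100 documents with the most extracted text; these
-- are the candidates for a separate index or a linear review.
SELECT TOP (100)
    DocumentID,
    DATALENGTH(ExtractedText) / 1048576.0 AS text_size_mb
FROM Document
ORDER BY DATALENGTH(ExtractedText) DESC;
```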
Your SQL-based database has a life of its own, and it shouldn’t be a stretch to think of it as a garden that requires planning and tending. A database is a complex web of hardware, software, documents, searches, indexes, and objective coding decisions used to research, support, and build a legal argument. Over time, it can become overgrown with the work product of past decisions, abandoned strategies, and unneeded data. Every so often it should be pruned. To flourish, it needs your care and feeding.