Originally posted on April 11, 2012
Da Silva Moore + Kleen Products = It’s All About the Math
This was one of my favorite early blog posts and is the theme of the Predictive Coding Thought Leadership Series I have been teaching around the country. I think it is more relevant today than when I initially posted it. Due to a server issue, we lost all of the original blog posts from DocReviewMD, which is giving me an opportunity to “re-release” these posts with an updated introductory paragraph identifying the relevance of each post today. Enjoy…
It is nice to know that people are reading these blog posts. It’s amazing how much more immediate the reaction to a blog post is than the reaction to a podcast. Now that I have had a chance to digest where we are, I thought it would be a good time to summarize where I believe the tea leaves say we are with Da Silva Moore and Kleen. The posturing over judicial and vendor conspiracies, untested technologies, whether an expert is competent, and who should control the process selected for document review and production is really a side issue in both of these cases. The common strand in both matters is offering an adversary some comfort that most of the ESI related to the case is in fact being produced, and that any gaps in production are not intentional but are caused by search and retrieval limitations. Both cases are about this same issue and are less about predictive coding than people realize.
The Da Silva Moore plaintiffs are comfortable that predictive coding works better than key word searching. What they are not comfortable with is how they can be sure they received the relevant documents, and that is what is at the heart of their debate over the size of the sample used to validate the process. The Kleen plaintiffs want predictive coding to be used because they know that identifying key words in an antitrust case is really hard to do. They believe their best chance of finding the relevant ESI is to use some form of predictive coding across a wide range of data sources and custodians. The key theme in both cases is that the parties receiving the ESI lack comfort about what they are receiving and want to receive what they are entitled to.
The answer to this issue has to come from statistics. There is no way to measure a large data set and say what was done without sampling. Reviewing everything is too expensive, is prone to huge inconsistencies, and would take too long. Providing these assurances is what sampling is intended to do and has been doing since the time of the Greeks. So the easiest way for lawyers to understand these cases about predictive coding is to forget that predictive coding is even involved. The analysis I am blogging about works the same with predictive coding, computer assisted review, key word guessing, linear review, producing every 5th document, or any other method of document review or production going forward. The test is a simple two-part test:
- How did you find and produce your relevant ESI and discharge your obligation under Rule 26(g)(1), or a similar state court rule, to certify that you have done a reasonable job?
- How are you going to show you satisfied this obligation to the other side and the court?
The issue of what you did really becomes less relevant to this test. It only becomes important if you can’t show that you met your burden in the second question. That is when picking the right approach really becomes more of an issue. This is why, in my mind, predictive coding will eventually take off. Kleen cries out for this result as you hear the defense talk about the thousands of hours spent, the more than a million documents produced, and the teams of experts used to find key words. But what the Kleen defendants have not shown is a sampling process across the population of what has been left behind or what is potentially there. This information is what any party on the other side needs to see to be comfortable they have received what they are entitled to. This is where the ugliness can occur. Going forward, these disputes need to be addressed earlier, before resources are spent. No one wants to pay to redo work. But knowing that this redo risk exists once we start measuring will undoubtedly aid judges and parties in deciding whether a production is complete and in choosing the most efficient review approach going forward, just in case they have to do more work! This is one of the strengths of predictive coding. It enables parties to do more work if they have to or if the facts change. Linear review does not. But parties should be able to choose their own processes. They should also have to show results that prove they have done a reasonable job. Again, these cases are about statistical sampling more than predictive coding.
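To make the sampling idea concrete, here is a minimal sketch of the arithmetic, assuming a simple random sample is drawn from the documents that were left behind (not produced) and each sampled document is reviewed for relevance. Nothing here comes from the record in either case: the function names, the 95% confidence level, and the document counts are hypothetical, and the Wilson score interval is just one reasonable way to put bounds on the estimate.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95, expected_rate=0.5):
    """Number of documents to sample to estimate a rate within +/- margin_of_error.

    Normal approximation, no finite population correction.
    expected_rate=0.5 is the most conservative (largest sample) assumption.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * expected_rate * (1 - expected_rate) / margin_of_error ** 2)


def wilson_interval(hits, n, confidence=0.95):
    """Wilson score interval for the true rate, given `hits` relevant documents in a sample of `n`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = hits / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    left_behind = 1_000_000          # documents not produced
    n = sample_size(0.02)            # sample needed for +/-2% at 95% confidence
    relevant_in_sample = 24          # relevant documents found when reviewing the sample

    low, high = wilson_interval(relevant_in_sample, n)
    print(f"Sample size: {n}")
    print(f"Estimated rate of relevant documents left behind: {low:.2%} to {high:.2%}")
    print(f"Estimated relevant documents left behind: {low * left_behind:,.0f} to {high * left_behind:,.0f}")
```

The point of a calculation like this is that it is the same no matter how the production was made. Whether the left-behind pile came from predictive coding, key words, or linear review, the receiving party gets a range it can evaluate instead of an assurance it has to take on faith.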
So my final conclusion about both of these cases is that, regardless of what the decisions are, the issues are not going away. Large data sets, fear the other side is hiding something, unknown technology, mistakes being made, high costs… it’s a long list of fears parties have, which are compounded by the uncertainty of what one is receiving in a production. Understanding how to make reasonable statistical offerings and provide sampling results is, I believe, the answer to both of these cases and the many more like them that will follow, regardless of whether predictive coding, stratified key word selection, linguistic analysis, or key word guessing is used.