Originally posted on March 27, 2012
When a server crashed and I lost many of my original blog posts, it gave me the opportunity to update the DocReviewMD posts and look back. Sorry to say, I have been unable to locate the first da Silva Moore blog post, but I will keep looking. Today that case continues to grow and has become a battle between experts. In essence it is a train wreck. But this is no surprise, as it is the first case using predictive coding and there are many issues surrounding the use of these tools which we are still figuring out. That is the basis of the 2013 Predictive Coding Thought Leadership Series I have been touring the country with. But it is fun to look through the back catalogue at what we were saying back in early 2012 about these tools. Enjoy…
Connecting Blog Post on Da Silva Moore to the Kleen case
This blog post is Part two of my initial analysis of Da Silva Moore and the Kleen case. To wrap up Da Silva Moore, the final analysis I was planning to offer on this case as it stands right now was given by Herb Roitblat, Chief Scientist at OrcaTec and Chairman of the Electronic Discovery Institute, and me on ESIBytes in Validating Predictive Coding, Da Silva Moore and other Current Issues, where we covered in more detail the statistics and validation differences we have with those proposed by the plaintiffs and defendant. I am also linking to the blog post written by Herb Roitblat on this topic, entitled Da Silva Moore Plaintiffs Slash and Burn their Way Through EDiscovery, which adds a few more interesting points. To summarize, the problem I have with the plaintiffs’ position is that they are getting a huge amount of data to get comfortable with the validity of the predictive coding that has been done: they are getting the null set of documents and can see what the software is being instructed to leave out of the production, and they have the ability to review results and make adjustments iteratively. No plaintiff using key word searching has ever been given so much insight and opportunity to impact the quality of their review set, with the exception perhaps of the federal government, which typically holds all the cards in its investigations. Is this a perfect outcome? There is no such thing as perfect between warring parties in litigation. But there is nothing unfair about this and much the plaintiffs can be pleased about.
B. Kleen Products, LLC v. Packaging Corporation of America
So if you accept that Da Silva Moore is dictum on the question of whether parties should be forced to use predictive coding, because the parties there had already agreed to use it, the next big chance for THE DECISION to be written will be in the Kleen case. I have read the transcripts thus far and will be in Chicago on March 28th for the next hearing. I am not optimistic for a slam dunk decision forcing the defendants to give up on key word searching and to adopt predictive coding. For one thing, the defendants have testified about their iterative process and use of sampling to validate their results. Personally, I am not convinced that their results, which showed roughly 4% responsive documents in the random samples of the “null set” of excluded documents, are something to be proud of, and there are some issues with their methodology that have come out in testimony thus far. If Judge Nan Nolan thinks that this result shows the defendants’ process was not reasonable and they are missing data, this could enable the plaintiffs to sway the court that more work is needed, such as predictive coding. The plaintiffs are also raising issues about the adequacy of the collection effort, and we know nothing about the results of the sampling against other custodians.
A: Really all that’s assumed here is that all the reasonable sources of responsive documents are collected and made available to the system.
Q: And I guess we’ll defer for another day what those reasonable sources may be. Step 2, test set creation, could you explain that, please.
Testimony of Dave Lewis, Page 159.
Furthermore, the use of 4 or 5 custodians from only one of the nine defendant companies to derive key words for all of the defendants in separate companies is troubling. Therefore, I believe there is plenty of room down the road for the plaintiffs to seek additional discovery to validate that the defendants’ process was in fact reasonable even if predictive coding is not adopted. But these defendants have also done more work than many in the field do when key words are selected, and they have expended a lot of money thus far in discovery with their process. That factor, in addition to the novel question of whether one party can force another party to use its e-discovery search process, makes this a difficult case for the plaintiffs and, in the final analysis, might influence the court’s final decision.
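To put the roughly 4% null-set figure in context, here is a minimal sketch of how a sample result like that projects onto the whole excluded set. The sample size (1,000), the hit count (40) and the null-set size (1,000,000) are hypothetical numbers chosen only for illustration; the testimony summarized above does not give them here. The interval used is the standard Wilson score interval for a proportion, not necessarily the method either side's experts applied.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95% confidence)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def projected_missed(null_set_size, hits, n, z=1.96):
    """Project a null-set sample onto the full set of excluded documents."""
    low, high = wilson_interval(hits, n, z)
    return {
        "point": round(null_set_size * hits / n),
        "low": round(null_set_size * low),
        "high": round(null_set_size * high),
    }


if __name__ == "__main__":
    # Hypothetical inputs: 40 responsive documents in a 1,000-document sample
    # drawn from a null set of 1,000,000 excluded documents.
    print(wilson_interval(40, 1000))
    print(projected_missed(1_000_000, 40, 1000))
```

The point of the sketch is that a small-sounding percentage in the null set can still translate into a large absolute number of responsive documents left behind, and that the estimate only applies to the custodians the sample was actually drawn from.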
- 1. Plaintiff’s Proposals to Set a New Baseline
One of the plaintiffs’ proposals for using predictive coding would build off of the iterative work already done by the defendants to create key words, Boolean searches and responsive documents. They would start by taking a random sample from across the nine defendants’ key executives to create a benchmark of how many responsive documents there might be in the entire collection. This broadens the benchmark pool across all defendants instead of four or five custodians at a single defendant. This would also be done using random samples to provide some level of confidence in the process. Then the plaintiffs can compare these results to what the defendants have already done to see if it is working. The plaintiffs theorize this approach would reveal shortcomings in what the defendants have offered thus far, and they would then propose that the parties work to develop a more representative seed set through an iterative approach of searches or by using a more direct method of predictive coding.
My reaction to this proposal is that there would be logistical challenges in a class action in making a single bucket of electronically stored information out of the collected data from nine defendants and many of their key officers, since these nine defendants are also competitors. As a result, the random sampling might have to be done nine separate times. Nevertheless, this is still an efficient way to determine what is in the entire data set as a baseline. One of my big problems with the defendants’ approach is that they can’t give an accurate estimate of what documents they are leaving behind across the data set by key word culling on only 4 custodians from a single defendant. A random sample can only be used to generalize about the set of documents from which the sample was drawn. Of course, no party has ever been ordered to do document review this way before, but if we had stopped medical testing in the 1800s we’d still be treating lots of ailments by draining blood. My ultimate opinion is that the plaintiffs are on more solid ground: by opening up testing to more of the defendants, their approach will result in more responsive documents being found. The real challenge here is the amount of money which has already been spent by the defendants doing the document review, even if the plaintiffs’ approach is more efficient.
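Because a random sample only speaks for the set it was drawn from, sampling "nine separate times" means sizing and drawing a sample inside each defendant's own collection. The sketch below shows one common way to do that, using the textbook sample-size formula for a proportion with a finite population correction. The defendant names, collection sizes, 2% margin of error and 95% confidence level are all hypothetical choices for illustration, not figures from the case.

```python
import math
import random


def sample_size(population, margin=0.02, z=1.96, p=0.5):
    """Documents to sample to estimate a proportion within +/- margin.

    p=0.5 is the conservative worst case; z=1.96 is about 95% confidence.
    """
    n0 = (z * z * p * (1 - p)) / (margin * margin)
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return min(population, math.ceil(n))


def draw_samples(collections, margin=0.02, seed=0):
    """Draw an independent simple random sample from each collection."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    samples = {}
    for name, doc_ids in collections.items():
        n = sample_size(len(doc_ids), margin)
        samples[name] = rng.sample(doc_ids, n)
    return samples


if __name__ == "__main__":
    # Hypothetical collections keyed by a made-up defendant label.
    collections = {
        "Defendant A": list(range(140_000)),
        "Defendant B": list(range(25_000)),
    }
    for name, ids in draw_samples(collections).items():
        print(name, len(ids))
```

Each per-defendant sample supports an estimate only for that defendant's collection, which is exactly why key word results tested on a handful of custodians at one company say little about the other eight.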
- 2. The Problem of the Plaintiff’s Participation in Key Word Selection
Potentially troubling to the plaintiffs’ position is that the testimony revealed the plaintiffs did participate in correspondence with the defendants: they received the list of proposed search terms on August 5th, 2011 and responded on September 15th, 2011 with additional key words they were interested in using. Now at least one of the defendants, Georgia Pacific, has completed its review of 140,000 documents pulled from 17 custodians, and at this time the plaintiffs are asking the defendants to change review strategy. Personally, even though I am a fan of predictive coding, that might be “a dollar late and a day short”. The resources expended, as identified in the defendants’ reply brief, exceed 30,000 hours of review time including contract attorney time. Even at a reduced contract attorney rate of $50 an hour, this totals $1.5 million. Regardless of the actual dollars spent, what is clear is that substantial resources were already spent by the defendants. If the court believes the defendants were attempting to
- 3. The Costs to Identify Key Words
The last interesting point is even if each defendant were to use predictive coding, the testimony in the case shows that the Georgia Pacific defendants spent at least 1400 hours working on developing key words with 500 hours by KPMG and 900 hours by Counsel on Call. This doesn’t include the time spent by one of International Paper’s law firms coming up with an initial list of search terms. This is the equivalent of 9 months of time for a single expert and at a blended billing rate of say, $400 an hour would be $360,000 just to come up with key words to use!!! This total is probably not included in the 30,000 hours spent doing the review.
Instead of spending this estimated $360,000 to come up with key words, predictive coding could have been used to cull a data set into responsive documents, using at least one platform I have worked with, in approximately 40 hours of review time, presuming the predictive coding only took 3,000 documents to train. That equates to $16,000 spent to create a review set for each defendant and $144,000 in total. If you somehow got all of the data for the 9 defendants into one data repository, it might have only cost $16,000 to do this same task. Of course, you would need to add the processing and hosting costs to this total, plus sampling to validate, but the cost saving potential of this approach is significant. This estimated $360,000 spent by the defendants is a large amount of money, and it appears they and the plaintiffs are not close to agreeing on this approach. One might think that if an economic argument had been made through the meet and confer process BEFORE the review had been done, the defendants might have accepted the plaintiffs’ approach before writing large checks to evaluate key words on one defendant’s data and then setting up a linear review that consumed 30,000 hours of time and potentially $1.5 million.
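For readers who want to check the numbers, the short script below simply multiplies out the hours and rates quoted in this post. The only inputs are figures stated above ($400 blended rate, $50 contract rate, 500 and 900 keyword hours, 40 training hours, 9 defendants, 30,000 review hours); the variable names are mine. One thing the arithmetic makes visible: 900 hours at $400 is $360,000, while all 1,400 keyword hours at that same rate would come to $560,000, so the $360,000 estimate is on the low side if the KPMG hours are billed at the blended rate too.

```python
BLENDED_RATE = 400    # dollars per hour, the blended rate assumed in the post
CONTRACT_RATE = 50    # dollars per hour, reduced contract attorney rate

keyword_hours = {"KPMG": 500, "Counsel on Call": 900}
total_keyword_hours = sum(keyword_hours.values())

# Keyword development cost at the blended rate.
counsel_on_call_cost = keyword_hours["Counsel on Call"] * BLENDED_RATE
all_keyword_cost = total_keyword_hours * BLENDED_RATE

# Predictive coding training estimate: about 40 hours per defendant.
TRAINING_HOURS = 40
DEFENDANTS = 9
training_cost_each = TRAINING_HOURS * BLENDED_RATE
training_cost_all = training_cost_each * DEFENDANTS

# Linear review already performed.
REVIEW_HOURS = 30_000
review_cost = REVIEW_HOURS * CONTRACT_RATE

if __name__ == "__main__":
    print(f"Keyword hours:               {total_keyword_hours:,}")
    print(f"Counsel on Call at $400/hr:  ${counsel_on_call_cost:,}")
    print(f"All keyword hours at $400:   ${all_keyword_cost:,}")
    print(f"Training cost per defendant: ${training_cost_each:,}")
    print(f"Training cost, 9 defendants: ${training_cost_all:,}")
    print(f"Linear review at $50/hr:     ${review_cost:,}")
```

Either way the comparison holds: even nine separate training exercises cost a fraction of the keyword development effort, before processing, hosting and validation sampling are added in.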
One of the plaintiffs’ compelling objections is that the key word selection was done using only 1 of the defendants, whereas their approach would incorporate all of the defendants’ data. This concern targets the assumption that the language used across the nine different defendant companies will be the same language used by 5 executives at a single defendant company. This is not necessarily true. Each defendant might use different language to describe the issues. Sampling on one custodian tells you nothing, or at least very little, about the other custodians, let alone other companies. This point is highlighted in the Blair and Maron study analyzing the effectiveness of key word searching on documents related to a BART train accident in San Francisco. There were approximately 40,000 documents in the collection, and a key part in the train accident was identified correctly by search terms 3 different times. But upon further review, it was found that the part had 26 other names. When you add misspellings into the mix, the number of search terms can grow even larger.
- 4. The Kleen Case is Really about a Failure in Cooperation
Lawyers are naturally going to disagree on what approach makes the best sense, especially given the novelty of having parties discuss the validation of search approaches in a litigation setting. What this case really shows to me again is how important cooperation is in electronic discovery. First of all, no party should put itself in the position of condoning or participating in a bad process of key word searching if it wants predictive coding to be used. If a party finds that the TREC data and vendor data make a compelling case that predictive coding is better, faster and cheaper than key word searching, it might be better off not even participating in the key word selection process. Parties advocating for predictive coding might want to ask for this approach early on and seek judicial relief right away, or at least make it clear they do not agree with the adversary’s approach and that they have academic studies and experts to support their concerns. Several federal court judges, including Judges Francis (S.D. of NY), Hornak (W.D. of PA) and Grimm (Md), have said on past ESIBytes podcasts that they would prefer parties bring this type of dispute to their attention early on, before work is done, so they don’t have to order an expensive redo. It seems to me that this is the position the Kleen defendants would have been in if the plaintiffs had objected very early on. It is not clear from the record when the plaintiffs objected, and this could be one of the keys the judge will look to when deciding this case.
- 5. Why Can’t the Plaintiffs Say What They Want?
Another observation, one a good friend of mine in the e-discovery field made, is that the plaintiffs aren’t even making it clear what they want as a remedy. To be stuck on this point drives home how easy key words are. Of course the plaintiffs can’t say exactly what they want the way a party can with key words. Predictive coding bypasses the fact that a party doesn’t know all the good key words: instead, the party looks for documents which contain buckets of key words and metadata found in relevant documents and then uses these documents to search for more like them. Key words have friends, and studies have shown that using documents as your means for finding other similar documents is far more accurate than guessing at key words. So the plaintiffs’ general request to sample a broad collection for potential responsiveness is the best they can do from a distance without more information. As a result, I am not as bothered by the plaintiffs being unable to say exactly what they want. If they can show a poor collection effort, they might be able to broaden a predictive coding exercise over a greater number of custodians, potentially across defendants.
- 6. Regardless of the Data Culling and Review Approach, Validation Sampling at the End Makes Sense
The best outcome might be to let parties choose their own document selection and review approach. That could be because the costs of evaluating the data have already been expended, as they have been in Kleen, or because judges might not feel it is appropriate to tell parties how they should do their review without first seeing data showing real problems. From this perspective, while I believe predictive coding is a good approach for efficiently finding responsive documents, the best outcome might be to allow parties to choose their own document selection method and to enable sampling validation by the other party going forward. Done early enough, it might be enough to force a party to do a redo if it picks an approach which is flawed, even if substantial resources were spent. To me, this is a fair outcome even if the parties had negotiated in good faith early on and had disagreements. It allows parties to pick their own review approaches but holds them responsible for the outcomes through sampling. It also provides some level of assurance that enough of the responsive documents have been produced to the other party.
If Judge Nolan decides the costs spent thus far outweigh the plaintiffs’ arguments that predictive coding will result in a better job, then I hope she will at least issue a strong, Peck-like endorsement of using predictive coding or of sampling production results, even if she doesn’t order the defendants to use it. Another nice option might be to offer some form of cost shifting like Judge Scheindlin used so effectively in one of the later Zubulake opinions, when backup tapes were examined using sampling to identify if the defendant had missed relevant ESI. That could open up a huge can of worms if substantial data was missed, as it likely was given the use of key words over a limited number of custodians.
These are a lot of issues to chew on and ponder. I can’t wait to read this opinion and see how it reverberates through the 7th Circuit Pilot program. I also can’t wait to get to Chicago on Wednesday to watch day 2 of the Kleen hearings.