Kleen Hearing Day Two – The Battle of Boolean Searches versus Sampling and Predictive Coding and Attacking Expert Witnesses
Losing my entire collection of blog posts to a server error has allowed me to revisit the posts I could find on my hard drive and update them, the way artists do with box sets. This first paragraph is that update. I revisited this post in an eDJ Group piece last year, when the Kleen Products litigants agreed to cooperate and continue with the existing Boolean searches. Many pundits claimed a victory for the anti-predictive coding camp. I said it was always going to be an uphill battle for the plaintiffs, given the burden of changing, mid-stream, a review approach the defendants had already been using across six defendants. I was thrilled to watch this argument first hand, but the real challenge today isn’t arguing over the process. It is that lawyers are not yet comfortable with how to validate these tools, which keeps them from proposing predictive coding in large numbers. That is the purpose behind the Predictive Coding Thought Leadership Series, for which I have been touring the country leading CLE programs on validation and statistics. So enjoy my assessment of this case, though I admit I overstated the pace of the coming shift to predictive coding….
Originally Published April 1, 2012
These are my partial observations, though I should clarify that I didn’t see the entire hearing held on March 27th because I flew in that morning and left near the end for a meeting. So I encourage people to read the transcripts in their entirety when they come out, as well as additional blog posts on this topic.
The Hottest Ticket in E-Discovery
When I decided to spend the extra $300 to come to Chicago a day early and watch the Kleen hearings, my rationale was this is probably the best entertainment dollar I could spend in E-Discovery given the relative novelty of predictive coding in law (as opposed to other fields). For the price of what I might spend going to see a classic rock band like Bruce Springsteen or the Rolling Stones, I witnessed over 5 hours of entertaining E-Discovery skirmishes. The battle of Kleen is a classic case of old school search and retrieval approaches of Boolean searches taking on new technology approaches of statistical sampling and predictive coding and not much middle ground in sight at this time.
My first reaction on attending was: where the heck was everyone else? Outside of a few technologists I knew, there might have been 10 people in the room who were not taking part in the proceedings. That is underwhelming when you consider how much expertise the legal field still needs to develop with these types of technologies within law firms. So why aren’t more lawyers coming in to watch the debate? I am especially critical of this when you consider that several of my guests on a recent ESIBytes podcast, all supporters of predictive coding, estimated that if even 5% of the cases they see use predictive coding in 2012, it will count as a major success, primarily because lawyers don’t yet know how to use these tools and it will take time.
This is a Really Messy Case
The most common thread of the hearing, which won’t come through clearly in the transcript, is how visibly upset Judge Nolan is with the proceedings thus far. This is a very complicated case, owing to the number of large defendants in an antitrust action and the fact that it was recently reassigned to Judge Nan Nolan. The judge openly stated at times during day 2 of testimony that she is just trying to get her arms around some basic information in the case and is having a hard time doing it. These are small facts, like whether the defendants have even produced data to the plaintiffs (yes, over a million documents) and what the agreed-upon form of production was (which took about 15 minutes to answer). At one point, she vented that a huge volume of filings had been dumped on her and said she was hoping Dan Regard could answer her questions, because he is at least getting paid to read everything in the record. Dan, without missing a beat, humorously replied that he is challenged like the judge, with only so many hours in the day.
Whatever comes out of this case, I would not be surprised to see a Judge Grimm-like Mancia v. Mayflower opinion, extolling the virtues of cooperation to the parties and holding this case up as an example of what happens when cooperation principles for E-Discovery are not adopted. The Seventh Circuit Pilot Program, in which Judge Nolan is actively involved, has given the judge a wonderful opportunity to study how cooperation is supposed to work through the meet and confer process. I am not sure that training has prepared her for the mess of a case she is now witnessing, but don’t be surprised if the parties are given a lecture, behind closed doors and in an opinion, if they cannot find common ground in their discovery dispute. I do see a recipe for cooperation that could save face for both parties, which I will offer at the end of this post.
Sampling 101 Primer
First, I’d like to offer the observation that this is a great case for helping lawyers understand what “recall” means and what basic tests statisticians use to calculate it. Perhaps the most important issue being debated in Kleen is the validation and testing of recall in E-Discovery cases. David Lewis, on behalf of the plaintiffs, opined on day 2 that recall is one of the most important measures in E-Discovery. Recall is measured by establishing a baseline of all the responsive documents that exist in a data population, then evaluating how much of that relevant information the retrieval system actually found compared to the baseline.
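For readers new to the terminology, the recall calculation itself is trivial once a baseline exists; the hard part, as the testimony shows, is estimating the baseline. A minimal sketch, using hypothetical document counts rather than figures from the case:

```python
# Recall = (responsive documents the search found) / (responsive documents
# that exist in the collection). The denominator is the "baseline" that
# must be estimated, typically by sampling the entire collection.

def recall(found_responsive, total_responsive):
    """Fraction of all responsive documents actually retrieved."""
    return found_responsive / total_responsive

# Hypothetical numbers: suppose sampling indicates 10,000 responsive
# documents exist, and the key word searches retrieved 6,500 of them.
print(recall(6_500, 10_000))  # 0.65, i.e. 65% recall
```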
On the one hand, the defendants have attempted a more traditional effort: generate key words and phrases, then show they work by testing them on a small sample of custodians. The best they can say about this testing is how effective the key words were on the sample they took; they have no recall measurement against the rest of the custodians or sources of data.
I find the plaintiffs’ approach more compelling. It follows the playbook of predictive coding companies like Recommind, Equivio, and OrcaTec, which have had to create systems for measuring recall to gain the trust of those who do not trust a “black box’s” results. A good way to show the parties in a case that a “black box” has worked is to measure the richness of responsive documents in the entire population before a project starts, then use sampling to see how the final results predicted by the algorithm compare to that measurement. Given the scrutiny and battles predictive coding companies have faced to gain acceptance, their approaches have had to start with measurement to provide assurances that their systems worked. This is also the basis of academic studies like TREC and Blair and Maron, which use baselines to compare approaches against. Without a baseline, there is no way to say what has been found in proportion to what exists.
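To make the measure-richness-first idea concrete, here is a sketch of how a baseline might be estimated from a random sample drawn across the whole collection. The collection below is simulated; in a real matter, reviewers would code the sampled documents by hand:

```python
import random

# Simulated collection: 100,000 documents, of which 2% are truly
# responsive (in real life this figure is unknown; that is the point).
random.seed(7)
collection = [1] * 2_000 + [0] * 98_000

# Draw a simple random sample from the ENTIRE collection, not from a
# handful of custodians, and have reviewers code each sampled document.
sample = random.sample(collection, 1_500)

richness = sum(sample) / len(sample)   # estimated prevalence of responsive docs
baseline = richness * len(collection)  # projected responsive documents overall

print(f"estimated richness: {richness:.3%}")
print(f"projected responsive documents: {baseline:,.0f}")
```

The projected count is the baseline against which recall of any later retrieval effort, Boolean or predictive, can be judged.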
The plaintiffs’ predictive coding expert, David Lewis, on day 2 of the Kleen hearings held firm to his day 1 testimony: the Georgia Pacific defendant’s sampling of the null set (documents deemed non-responsive), drawn from only 5 custodians out of the 17 ultimately selected, does not give him enough information to determine how many responsive documents might exist in the Georgia Pacific data collection or in the other defendants’ collections. To do that, the sample has to be drawn from the entire data set being searched. Otherwise, the defendants and the plaintiffs are guessing that results from the defendants’ limited sampling of a few custodians will hold true across the other custodians and, indeed, the other organizations in the litigation. The questions posed by the defendants on cross examination show either that they don’t have a good grasp of this or that they are trying to ignore the fact that they have not set a working baseline to evaluate recall. That baseline is central to the plaintiffs’ request: it is what would let them become comfortable that whatever they receive from the defendants contains most of the relevant ESI. The defendants seem more interested in holding up their approach as good enough and consistent with best practices for search and retrieval than in addressing how to create a baseline sample of the entire data set where the most likely sources of ESI are stored. They also cite the fact that no case has required this level of testing for recall. From a statistical perspective, the tests the defendants have run are fairly unimpressive, because at best they can estimate recall only for the five custodians Georgia Pacific tested, not for the remaining custodians or any of the other six defendants. It will be interesting to see whether Judge Nolan finds this significant difference in statistical approaches worth a landmark opinion if the dispute isn’t resolved.
Every statistician I have spoken with agrees it is much better to have a baseline of what is in a data set than to ask the parties to play the E-Discovery version of Go Fish, guessing which words will hit on the responsive data and forcing the parties to live with the results without measuring what was potentially left behind. Using statistical sampling up front can offer the parties much more comfort through a baseline measure of the richness of a data set, obtained via a random sample of the data collection. Then, regardless of the method of information retrieval used, the party without access to all the data can gain some comfort, through testing of what was produced, that they are receiving most of the documents they are entitled to receive. How confident they want to be in the results, and with what margin of error, will determine the size of the sample.
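That closing point about confidence and accuracy maps onto the standard sample-size formula for estimating a proportion. The parameter values below are illustrative assumptions, not numbers from the Kleen record:

```python
import math

def sample_size(z, margin_of_error, p=0.5, population=None):
    """n = z^2 * p * (1 - p) / e^2, with an optional finite-population
    correction. Using p = 0.5 gives the worst-case (largest) sample."""
    n = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)

# 95% confidence (z ~= 1.96) with a +/-2% margin of error:
print(sample_size(1.96, 0.02))  # 2,401 documents

# The required sample barely grows with collection size, which is why
# sampling even a million-document collection is comparatively cheap:
print(sample_size(1.96, 0.02, population=1_000_000))
```

This is why the up-front testing is easy to employ once the population is agreed upon: a few thousand reviewed documents can characterize a collection of millions.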
One of the bigger-picture questions of Kleen is how this heightened scrutiny of where searches for ESI are conducted will play out in the E-Discovery field. My sense is that this discussion of “where to search,” and of sampling to set baselines for recall, will become the new “key word” negotiation in the world of predictive coding. Time will tell, but I would be hard pressed not to advise a client to do, or ask for, this type of testing up front. Trusting is fine, but verifying through measurement is better. Since these tests are based on sampling, they are easy to employ once the population of documents to be sampled is agreed upon. Statistics and openness in results can only improve cooperation, which requires both parties to have some degree of trust in each other to work.
Identifying the appropriate universe of potentially relevant ESI is where the art of E-Discovery occurs, and it is what the parties in Kleen need to be talking about. Sampling can help with this task too; it is what was done in the Zubulake case to assess whether there was missing ESI on backup tapes that the defendants had not produced in their review of active email servers. Sampling what is not produced is a good way to validate that a data source is not worth searching. The defendants would argue they have done this with the five Georgia Pacific custodians they tested key words on. But this is a sample within a sample. Those five custodians may differ from the other 12 custodians searched at Georgia Pacific, and the inference becomes even more tenuous when their results are extrapolated to the custodians of the six other companies. David Lewis ended his cross examination on day 2 by stating point blank that the methodology used by the defendants could not be relied upon to support an opinion of statistical effectiveness.
To climb further into the weeds of testing and validation, there was additional discussion on cross examination of David Lewis about how reliable the defendants’ key word testing was, given the potential for bias: the three Counsel on Call contract attorneys who reviewed the null set of 660 documents also reviewed the 400 documents randomly pulled from the composite set of key word hits, and they knew which set came from the key words and which did not. The problem is that this knowledge can create biases. See generally the testimony of Sam Brown on day 1, pages 131-140, describing the process, and page 175, confirming that the same 3 reviewers reviewed both the null set and the composite set; compare it to the testimony of David Lewis on day 2, mid-morning. The clearest illustration came near the end of David Lewis’s testimony, when he drew an analogy to double-blind experiments in medical research: when a placebo and an experimental drug are studied, the doctors do not know who has the placebo and who has the drug. Only third-party scientists know which data set is which. This wasn’t done by the Georgia Pacific defendant in the review of the samples by the Counsel on Call attorneys.
The defendants attempted to attack David Lewis over their use of small teams to iteratively train a system, which David Lewis agreed you could do. He opined that it is appropriate to use training sets like the 4 custodians used by Georgia Pacific as a strategy to learn about the case and begin training your software, but it is not reasonable to use such a sample as a baseline for testing the output. Again, David Lewis was hammering home the point that the defendants have set no benchmark for how much likely responsive data is in the collection. He testified several times that the defendants’ attorney was mixing up training tasks with how to do a sound sampling approach.
Predictive Coding Debated and Analytics Explored to Find the Hard to Find Document
On cross examination, David Lewis opined that predictive coding offers the best shot at finding a hard-to-find document, like one saying “they are with us,” that could win an antitrust case. The reason a Boolean search is limiting is that we don’t know the words that might be used in documents to express agreement. Predictive coding looks at the statistical properties of the metadata and the words used, and it offers the best chance of finding documents in a difficult search. Judge Nolan interjected some questions about whether the history of the past 10 years of antitrust cases might help the plaintiffs construct more effective Boolean searches. David Lewis disagreed, because people can code their language in different ways. Judge Nolan asked an excellent question: even with predictive coding software, don’t you need that document, or one like it, to find more like it? David Lewis explained it would be hard with any approach, but that patterns in metadata, such as middle-of-the-night emails, particular email accounts used, or where an email was stored, could help find the individual documents. This would be much harder to do with Boolean searches without knowing the words to use. This point is less about predictive coding and more about using advanced analytics.
Defendants Say They Followed Industry-Accepted Practices, and Blair and Maron Is Used to Support Boolean Searches
Dan Regard testified for the defendants on day 2, starting after lunch on direct examination, and walked through the process used by the remaining six defendants. The plaintiffs objected because it isn’t clear how much of the work Dan Regard did himself, but Judge Nolan said this was the first chance she had to hear what the other six defendants did. Dan Regard testified that all of the defendants participated in developing the key words, and Georgia Pacific took the lead in testing the key words on its documents. After these tests were done, the list was shared with the plaintiffs and their feedback was received; Georgia Pacific did further testing and then shared the list with the other defendants. These terms were modified by removing the Georgia Pacific name and specific facilities and substituting each defendant’s own terms. The defendants also modified the search terms for the technology they were using. While Georgia Pacific’s set contained 15 search strings, one defendant broke it into 41. The defendants used wildcards and stemming, and Dan estimates that, across all the variations of search terms, connectors, wildcards, and so on, over 2 million searches were run.
Dan Regard walked through Defendants’ Exhibit 12 and explained in detail what each defendant except Georgia Pacific did, focusing his testimony on Temple-Inland, the client that originally hired him for collection work. He said he called each of the defendants individually, talked with their lawyers and consultants, and kept the chart to help him understand the entire process. He thought it was important that the results differ for each of the defendants. He also testified that each defendant relied on Georgia Pacific’s testing of the search terms.
One other interesting point from the expert testimony comes from Dan Regard’s testimony in support of Boolean searches. He opined that Boolean searches are extremely transparent, are familiar to all parties in E-Discovery, have been studied since 1985 in the famous Blair and Maron study, can be broadened so they do not need to be precise, and can be applied consistently to all documents, regardless of order, without changing the results. In comparison, machine learning requires training, indexing with machine-assisted learning is more expensive, and the level of care needed to train machine-assisted review is much higher. The trade-off is that it takes a lot of time to make key words work well, time which Georgia Pacific invested.
What Dan Regard left out is the dreadful results that have come out of studies involving key words. He cited, perhaps for the first time ever in support of key word searching, the Blair and Maron study as evidence that we have been studying key words since 1985. What he didn’t mention is that Blair and Maron stands for the proposition that we are often shocked at how poorly key words do. The defendants’ position is that Boolean searches work well enough, but Blair and Maron is not consistent with that position: its participants thought they were finding 75% of the responsive documents using key words, and follow-up testing revealed they had found only 20%. Those are not good results, and frankly this is the outcome the plaintiffs are most concerned about if Boolean searches are relied upon.
Dan Regard also made it clear that the use of Clearwell’s topic feature as a tool to evaluate documents and generate additional key words was a form of computer-assisted review. His opinion is that the overall process used by the defendants exceeds what many of his clients do and is consistent with the Sedona Conference principles.
Dan Regard also challenged the day 1 testimony of plaintiffs’ collection expert Tim Hanner that it is possible, or accepted practice, to collect every server and system at a company rather than focusing collections on custodians. Regard opined that this is not how it is done and is not appropriate. Using the analogy of searching a house for information: if you were looking for books, you would go to the room with the books in it instead of searching the entire house.
Discrediting the Experts
The defendants focused some cross examination time on whether David Lewis had actually used one of these predictive coding tools to do what he is suggesting the defendants do, or had ever used the statistical measurements he is suggesting. He testified that he has developed such a system and understands it has been used, but he has not done the field work himself and does not know first-hand of a vendor who has, though he could identify vendors that could do this. To be fair to David Lewis, this has been done in litigation only a handful of times, which is exactly why this case is so important for validating newer approaches. As Judge Nolan continually pointed out, we are in new territory here without a lot of ground rules.
The defendants also pushed David Lewis on cross examination to identify any case in which his theory of setting a baseline for recall has been required by a court. He could not provide one.
This did less damage than what the plaintiffs did on cross examination of the defendants’ expert, Dan Regard. They ended the day by surprising him with the news that his Louisiana bar license was neither active nor inactive but in ineligible status. This was news to Dan Regard, and he promised to clean up what may be as simple as a lapse in CLE requirements. Next, the plaintiffs ripped through his resume and his lack of experience (just as David Lewis was attacked) with predictive coding, statistical measurement, and, for good measure, Boolean searches. The end result puts the judge in a bit of a dilemma. The defendants’ position is basically “we measured a little, and that is much better than what most people do,” along with citations to Sedona papers. But they do not address these newer predictive coding tools head on, and their E-Discovery expert does not appear to have the experience to opine effectively that the plaintiffs’ approach is technically flawed. He can only say it isn’t what the E-Discovery field is doing today, which we already know. I didn’t watch all of Dan Regard’s testimony, and maybe he said some compelling things in the last hour of the hearing that changed what appeared to be a damaging line of questioning. I suggest reading the transcripts, as I will be doing when they are available, to learn how things ended up.
A Great Example of a Failure to Find a Proportionate Middle Ground, So I Try to Offer One
I spent the day after the hearing mulling over an attack that one of the defendants, Georgia Pacific, launched on the cost of putting all of its shared network data into a predictive coding tool, and trying to figure out how the parties could dig out of the uncooperative mess they have created for themselves and the judge. Georgia Pacific’s attorney, while cross examining David Lewis, offered that they have 7.6 terabytes of data in shared drives for the containerboard business and headquarters related to packaging, and estimated that at a low predictive coding price of $400 a gigabyte, it would cost over $3,000,000 in software costs alone to run all of this data through a predictive coding tool, for just one of the 7 defendants. This number is unreasonably large and frankly not an approach any predictive coding company would suggest today.
In contrast, the plaintiffs have not done a good job of defining what information they would like to load into the predictive coding software, beyond saying that they need all the sources of potentially responsive information. Proportionality and reasonableness make it unlikely that this volume of data would be ordered provided to the plaintiffs, especially given the large sums already spent on linear review. However, there is likely some middle ground the plaintiffs could start with that would make predictive coding’s price point drop dramatically, and that academic studies like TREC suggest would also substantially improve the recall rates of the defendants’ production.
The plaintiffs might offer, as a starting point, to accept the 17 custodians Georgia Pacific ended up running key words against, plus the other six defendants’ custodians against whom Boolean searches were run, as listed in Defendants’ Exhibit 12. Since I don’t have access to Defendants’ Exhibit 12, which contains a breakdown of the custodians and the data collected from them, I will assume there are 17 custodians for each of the 7 defendants and that each has 4 gigabytes of data. On these assumptions, the total population of ESI to be evaluated by predictive coding across all 7 defendants would be 476 gigabytes. At the $400-a-gigabyte figure the defendants used to calculate their $3,000,000 estimate for Georgia Pacific’s data alone, the total cost of the predictive coding technology falls to $190,400 for hosting and processing the data, plus roughly a week of attorney time to run predictive coding on each defendant’s data set. Some vendors could complete this process in 10 days.
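The arithmetic behind these two cost figures is easy to check. The script below simply reproduces the assumptions stated above (7.6 terabytes of shared-drive data, 17 custodians per defendant at an assumed 4 GB each, $400 per gigabyte); the custodian data volumes are my illustrative assumption, not figures from the court record:

```python
PRICE_PER_GB = 400                 # defendants' low-end predictive coding estimate

# Defendants' scenario: all of Georgia Pacific's shared-drive data.
gp_shared_gb = 7.6 * 1024          # 7.6 TB expressed in gigabytes
print(f"GP shared drives: ${gp_shared_gb * PRICE_PER_GB:,.0f}")  # over $3,000,000

# Proposed middle ground: 17 custodians x 7 defendants x 4 GB each.
custodial_gb = 17 * 7 * 4          # 476 GB
print(f"Custodial start:  ${custodial_gb * PRICE_PER_GB:,.0f}")  # $190,400
```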
The plaintiffs would probably argue that there are more custodians and unstructured data sources they need access to, and I would not disagree with that point as raised by their expert Tim Hanner on day 1 of the testimony. But their expert David Lewis testified on day 2 of the hearings that using a smaller number of custodians is a reasonable strategy for beginning to train a tool (as opposed to validating a baseline for recall). The defendants’ exhibit identifies a starting point that could be used to create a training set for predictive coding. In addition, some predictive coding tools anticipate this iterative learning about a case and allow for network analysis, showing who was emailing whom and in what volumes, to fill in gaps of missing custodians and ESI. By using predictive coding on this set of custodians, whom the defendants clearly believe are involved in the case, a better set of search terms might be extracted from the key terms found in the relevant documents, or a set of predictive coding exemplar documents might be devised to point at some of the non-custodial data that has been collected and preserved, and potentially at other custodians.
Tying together a reasonable request is something that hasn’t come out in the testimony. Frankly, the cost of sending 20 professionals and experts to 18 hours of hearings at, say, $400 an hour is $144,000 without travel expenses, and there is a third day being scheduled! The numbers cry out for a compromise, and the judge’s pained expressions also suggest she expects the parties to work harder to fashion one. Reasonable requests, proportionality, and iterative learning are the foundations of the Sedona principles in the Cooperation Proclamation. While I don’t believe the Sedona best practices guide for search and retrieval from 2007 is particularly relevant when evaluating the effectiveness of new technology, the more general principles of cooperation and iterative process still are.
The pained look on Judge Nolan’s face as I left is the lasting image I have from the hearings. Somehow she has to make sense of this convoluted process with little precedent to guide her, two primary experts who have been somewhat tarnished for lacking the right experience to offer their opinions, a shortage of participants able to say how this will work out beyond theory, and two parties who couldn’t be further apart. This is not a good place to be for either litigant when trying to get the best possible decision. This case calls for a compromise in approaches, and the parties, who have been working on this case since its inception, are best able to resolve it fairly. It really shouldn’t be this complicated.
by Doc Review, MD – AKA Karl Schieneman