
Investigating Received Data: Creative Use of Data Analytics in eDiscovery (part 2 of 4)

Tuesday, May 16, 2017
Posted by: Mary Mack

By Ian Campbell and Michael Fischer

In our last article, we shared how Jeffrey, a partner at a firm specializing in consumer class actions, could speed up his deposition prep by implementing online collaboration across multiple legal teams. Now, let’s look at his co-counsel, Sandra, who uses data analytics during the document review of a received production to let the data tell her the story.

Sandra received a dump of 1,500,000 documents from the defendant, produced under a clawback provision. Of course, that means lots of non-relevant data; the defense has left the relevance review to the plaintiff firm. How does a small firm with limited resources (did we mention that it’s a contingency case?) get through 1.5 million documents without breaking the bank? The answer is definitely not a linear review. Sandra has options, though: her review platform offers several data analytics tools, and she is ready to give them a try. She works with her PM, who suggests that they start with clustering to get a quick look at what they received.

Why start with data clustering (unsupervised machine learning)?
Data clustering gives you an idea of what’s in the dataset. Without getting into the weeds, here is a quick explanation of clustering.

By starting with clustering, you get an unbiased view into the set: the tool identifies the key concepts within it, and you can compare those concepts with the original case theory. The team will also use this tool to remove whole categories of obviously irrelevant material.
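For readers who want to see the idea in miniature, here is a rough sketch of unsupervised grouping. It is not how any particular review platform works. It uses a simple greedy, single-pass approach with cosine similarity over word counts, and the document IDs, sample text, stop-word list, and threshold are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real tools use much larger ones.
STOP_WORDS = {"the", "to", "and", "before", "a", "of", "in"}

def term_counts(text):
    """Lowercase the text and count its words, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_documents(docs, threshold=0.3):
    """Greedy single-pass clustering: no labels or training data required.

    Each document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for doc_id, text in docs.items():
        vec = term_counts(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["terms"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"terms": Counter(vec), "doc_ids": [doc_id]})
        else:
            best["terms"].update(vec)  # fold the document into the cluster profile
            best["doc_ids"].append(doc_id)
    return clusters

# Made-up sample documents.
docs = {
    "D1": "team doctor ordered testosterone patches before the race",
    "D2": "doctor delivered testosterone patches to the team bus",
    "D3": "cafeteria lunch menu pizza friday",
    "D4": "friday cafeteria menu pizza and salad",
}

clusters = cluster_documents(docs)
for cluster in clusters:
    top_terms = [term for term, _ in cluster["terms"].most_common(4)]
    print(cluster["doc_ids"], top_terms)
```

In this toy set, the two doping-related documents land in one group and the two cafeteria documents in another. The second group is the kind of obviously irrelevant category a team could tag and set aside in bulk.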

Here is an example of the clusters identified in a dataset based on details from the USADA’s U.S. Postal Service Pro Cycling Team Investigation. On the left, a folder-navigation-style display organizes the documents with the most prevalent concepts on top. The right side of this image is a graphic representation of the same information.

Sandra starts clicking into the interesting concept groups and perusing the documents. As she does, she finds key ideas that she believes will be important to the case. She tags and/or folders the documents, and adds comments where appropriate. This exercise is a game-changer for Sandra and her team. In the first days of analyzing the data, she has accomplished what would have required ten attorneys and too many hours in a linear review: she is getting a fairly in-depth view into the contents of the dataset.  

The system will also surface clusters that contain irrelevant data. These can be tagged and removed from future review projects.

As she looks through the documents, she also captures the key sections of issue-based text to begin building an exemplar document. This example, or hybrid example document, is the quintessential representation of what she and her team are hoping to find; many times it is referred to as the “Dream Smoking Gun” document.

The cluster data can also be used after a search, letting Sandra and her team review the results in batches of similar documents. That speeds up review and makes it easy to route content to the right subject experts.

This is what I really need to find. Is it in here?
That question can be answered by using a supervised machine learning tool like Xmplar. It is an intelligent “more-like-this” search tool. With it, the case team can tell the analytics engine exactly what they are looking for. Sandra can copy and paste text from documents as she reviews, and/or craft the words and phrases her team hopes to find. The machine trains on that content and brings back more documents containing those concepts. 

The exemplar document is a living, breathing document that the attorneys update as they learn more about the case. Much like the continuous learning of TAR 2.0, the machine adjusts its search parameters as the case team learns more about the case. As the attorneys learn, so does the machine.
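The general “more-like-this” idea can be sketched in a few lines. This is only an illustration of ranking documents by similarity to an exemplar, not a description of how Xmplar or any other product works internally. The function names, sample documents, and exemplar text are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def more_like_this(exemplar, docs, top_n=3):
    """Rank documents by similarity to the exemplar text, best match first."""
    target = term_vector(exemplar)
    scored = [(cosine(target, term_vector(text)), doc_id)
              for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_n]]

# Made-up sample documents.
docs = {
    "A": "the supervisor approved the shipment after the failed safety test",
    "B": "quarterly sales figures for the northeast region",
    "C": "safety test schedule for next month",
}

# First version of the hypothetical exemplar document.
exemplar = "supervisor approved shipment despite failed safety test"
first_pass = more_like_this(exemplar, docs)
print(first_pass)

# The team learns something new and extends the exemplar; the ranking follows.
updated_exemplar = exemplar + " northeast sales figures"
second_pass = more_like_this(updated_exemplar, docs)
print(second_pass)
```

The second call shows the “living document” effect: once the exemplar mentions sales figures, a document that was previously ignored starts to appear in the results.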

Now that Sandra and her team have found some key documents, she needs to identify all the people who had copies of those documents, or some version of them. 

Who helped draft this?
This is where near-duplicate identification comes into play. Imagine that a company spent several weeks drafting a piece of regulatory correspondence. Sandra has found the final version of that document. She wants to find out who, besides the owner of the file folder, was involved in the drafting. But track changes have been cleared, so she can’t see who the contributors were. There’s an analytics tool for that: near-dupe identification compares each document in the dataset to every other document and finds those that have a high level of similarity.

This is a fast and easy way to identify all versions of a document across the dataset, perhaps find a version that includes the track changes (assuming you requested native and not image production), and learn who contributed to the drafting. Each of these tools helps her expand her knowledge and build the story.
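To make the comparison concrete, here is a minimal sketch of one common way to measure near-duplication: break each document into overlapping word sequences (shingles) and compare the sets with Jaccard similarity. The document names, text, and 0.5 threshold are assumptions for illustration. Comparing every pair is fine for a handful of documents, but production tools typically rely on hashing techniques such as MinHash to scale to millions.

```python
import re
from itertools import combinations

def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of shingles the two documents have in common."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.5, k=3):
    """Compare every pair of documents and keep the highly similar ones."""
    shingle_sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    pairs = []
    for left, right in combinations(shingle_sets, 2):
        similarity = jaccard(shingle_sets[left], shingle_sets[right])
        if similarity >= threshold:
            pairs.append((left, right, round(similarity, 2)))
    return pairs

# Made-up sample documents: a final letter, an earlier draft, and an unrelated memo.
docs = {
    "final": "we confirm that the device passed all required safety inspections in march",
    "draft_v2": "we confirm that the device passed all required safety inspections in february",
    "memo": "please book the conference room for the budget meeting on friday",
}

pairs = near_duplicates(docs)
print(pairs)
```

Once the pairs are known, looking up the custodian of each matching version is what tells Sandra who else had a hand in the draft.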

Sandra believes the department supervisor provided feedback on the document, but he doesn’t show up in the document history. Maybe he received it as an email attachment and commented in the body of the email. To find those comments, she could use family grouping and email threading to find emails with attached drafts, which would surface any commentary about those drafts.

Who else knew about it?
Email threading delivers an entire email conversation, not just individual messages, to help the reviewer understand the messages in context. It can also identify additional participants in the conversation. If the supervisor Sandra is investigating had been in the conversation but dropped off midway, he would not show up in the metadata of any of the emails sent from that point forward. Email threading associates all of the emails in the conversation, so reviewers can follow the thread from start to finish. Perhaps Sandra finds the supervisor’s comments in one of the earlier emails. Eureka! These advanced text analytics, partnered with traditional text searches, are making short work of what could have been a challenging and much longer project.
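The drop-off scenario can be illustrated with a small sketch. It assumes each email carries an identifier and a pointer to the message it replies to, similar to the Message-ID and In-Reply-To headers of Internet email. The field names and sample messages are hypothetical, and commercial threading tools also draw on other signals such as quoted text and subject lines.

```python
from collections import defaultdict

def find_root(msg_id, parent_of):
    """Follow reply links upward until reaching a message with no known parent."""
    seen = set()
    while parent_of.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)  # guard against malformed, circular reply chains
        msg_id = parent_of[msg_id]
    return msg_id

def build_threads(emails):
    """Group emails into conversations keyed by the root message id."""
    parent_of = {e["id"]: e["in_reply_to"] for e in emails}
    threads = defaultdict(list)
    for e in emails:
        threads[find_root(e["id"], parent_of)].append(e)
    return dict(threads)

def participants(messages):
    """Everyone who sent or received any of the given messages."""
    people = set()
    for m in messages:
        people.add(m["sender"])
        people.update(m["recipients"])
    return people

# Made-up sample messages: the supervisor is dropped from the last reply.
emails = [
    {"id": "m1", "in_reply_to": None, "sender": "analyst",
     "recipients": ["supervisor", "counsel"]},
    {"id": "m2", "in_reply_to": "m1", "sender": "supervisor",
     "recipients": ["analyst", "counsel"]},
    {"id": "m3", "in_reply_to": "m2", "sender": "analyst",
     "recipients": ["counsel"]},
    {"id": "m4", "in_reply_to": None, "sender": "hr",
     "recipients": ["analyst"]},
]

threads = build_threads(emails)
conversation = threads["m1"]
print([m["id"] for m in conversation])
print("Last message only:", sorted(participants([emails[2]])))
print("Whole thread:", sorted(participants(conversation)))
```

Looking at the last message alone, the supervisor is invisible. Looking at the whole thread, he reappears as both a recipient and a sender.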

In this scenario, Sandra used clustering, more-like-this with formulated example documents, near-duplicate identification and email threading. She has a handle on what is in the data set and has identified key facts and key players. Now she can develop an informed review strategy from the corpus. 

In the next article, we’ll look at additional analytics tools that can be used to organize and prioritize documents for review. 

For more information about Data Analytics for Plaintiff firms, download this information sheet.

About the Authors
Ian Campbell is the President and CEO of iCONECT Development LLC, which has been developing innovative eDiscovery review software since 1999. He is responsible for sales operations, business development, product lifecycle development, and partner relations. With more than 16 years of strategic product development in the litigation support field, Campbell is a frequent industry spokesperson, sharing his experiences and expert commentary with audiences for the American Bar Association, LegalTech, ILTA, AIIM, IQPC, Marcus-Evans and other legal and management groups around the world.

Mike Fischer is Director of Information Services at Schlichter Bogard & Denton, LLP. Mike has worked for nearly 10 years in legal technology and manages e-discovery review projects for many large-scale, multi-party complex matters. Schlichter, Bogard & Denton, LLP represents individuals harmed by corporate wrongdoing and consistently prevails at trial. Their work has been repeatedly profiled in the media and recognized by judges; among other things, it has been called “pioneer[ing],” “tireless,” and “historic.”




©2018 Association of Certified E-Discovery Specialists
All Rights Reserved