Speed, speed, speed

The most important feature of ediscovery software is speed.

Why is speed so important?

Westlaw is fast. Google is fast. Spotlight, the search on a Mac, is fast. Lawyers can and should expect the same level of speed from their ediscovery software. If Google can search the entire Internet and Westlaw can search hundreds of years of primary law in an instant, then instantaneous searching of 500,000, 5,000,000, or even 50,000,000 documents in a typical commercial case should be possible too.

Speed is important for three reasons. The first reason, which is obvious, is that time costs money when lawyers and paralegals charge by the hour. (The same is true for review firms that charge their clients by the document; a slow system increases review firms’ costs for the same amount of revenue.) The second reason is that review is often a rush: lawyers need to get documents out the door or find evidence for a deposition or hearing in a hurry.

The third reason, though, is less appreciated and more important. Slow ediscovery software — software where you have to wait seconds or tens of seconds for searches to load and seconds or tens of seconds to navigate between documents or pages of results — makes document review awful. Slow software makes document review so awful, in fact, that lawyers stop doing it as soon as they are senior enough to get someone else to do it for them!

Slow software means the least experienced members of the team are the only ones doing the review. Fast software means that senior litigators — the lawyers who will try the case, or argue the dispositive motions — are willing to access the evidence directly, just as they are willing to pull critical cases from Westlaw. A major and little-understood benefit of speed is that it gets you better lawyers working directly with the evidence.

How fast is fast?

How should you judge whether ediscovery software is fast? First, you need to know what “fast” is: a good benchmark is 1/3 of a second for searches and 1/10 of a second for document navigation, including rendering complex documents like PowerPoint or Excel files so that they look in the ediscovery software just like they would look in PowerPoint or Excel.
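One rough way to hold a vendor to these numbers is to time the operations yourself. The sketch below is only an illustration: `measure` and `within_budget` are hypothetical helper names, and the `action` you pass in stands for whatever call runs a search or opens a document in the system you are testing.

```python
import statistics
import time

# Benchmarks from the text: 1/3 second per search, 1/10 second per navigation.
SEARCH_BUDGET_S = 1 / 3
NAVIGATION_BUDGET_S = 1 / 10


def measure(action, repeats=20):
    """Call `action` repeatedly and return each wall-clock duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def within_budget(timings, budget_s):
    """True only if even the slowest observed run stayed within the budget."""
    return max(timings) <= budget_s


def summarize(timings):
    """Report the typical and the worst case, since reviewers feel both."""
    return {"median_s": statistics.median(timings), "worst_s": max(timings)}
```

Checking the worst run rather than the average is a deliberate choice here: a system that is usually fast but occasionally stalls for several seconds still makes review miserable.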

Second, you need to test ediscovery software against a large database, preferably the complete Enron set. Too often you will see a demo from a vendor on 50 documents, or 500 documents, and get acceptable speeds, but when you use the software on a real-life database of hundreds of thousands or millions of documents, everything grinds to a halt. Demand to see the software in operation on the full Enron dataset of about 500,000 documents; that is a real speed test.

What makes ediscovery software fast?

Fast ediscovery software requires three things: (1) modern search and rendering technology; (2) doing as much as possible up front, rather than during the review; and (3) using the fastest storage hardware money can buy and scaling it appropriately as data grows.

Modern technology

Search is a solved problem. Companies like Apple, Amazon, IBM, and Twitter all use an open-source search technology called Lucene. Unfortunately, many ediscovery companies still use a legacy search technology called dtSearch. Simply switching from dtSearch to Lucene — or picking software that is built on top of Lucene — makes things faster.

Some people like dtSearch because they are familiar with its search syntax, that is, how you input searches. But syntax is not a good reason to prefer one search technology over another. This is because the search syntax is independent of the underlying search technology: software can take any search syntax and “parse,” or translate, it into the language of the underlying search technology.
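To make the point concrete, here is a minimal sketch of translating one piece of syntax. It assumes the dtSearch-style proximity operator `w/N` (as in `apple w/5 pear`) and rewrites it as a Lucene-style proximity phrase (`"apple pear"~5`). The function name `translate_proximity` is hypothetical, the sketch handles only single-word operands, and the two engines do not define "within N words" in exactly the same way, so a real translator would need more care.

```python
import re

# Matches: <word> w/<number> <word>, e.g. "apple w/5 pear".
_PROXIMITY = re.compile(r"(\w+)\s+w/(\d+)\s+(\w+)", re.IGNORECASE)


def translate_proximity(query):
    """Rewrite `a w/N b` as the Lucene-style proximity phrase `"a b"~N`.

    Anything that does not match the pattern is passed through unchanged.
    """
    return _PROXIMITY.sub(
        lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}',
        query,
    )


print(translate_proximity("apple w/5 pear"))
```

The reviewer keeps typing the syntax she already knows; the software does the translation before the query ever reaches the search engine.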

Similarly, for rendering, a legacy approach is to use embedded viewers that require downloading additional software or having the viewer “read” the native document when the reviewer pulls it up to look at it. By standardizing documents, for example to PDF, and by rendering the PDFs in the web browser itself, you skip the embedded viewer step and get a substantial increase in speed.

Doing the work up front

The second big contributor to speed is doing all the things that take time up front. If there is processing to be done or conversions to be performed or, in general, anything that takes time, you want it to happen when the data is loaded into the ediscovery software, not when a reviewer is sitting in front of her computer running searches or reviewing documents.

For example, the following should happen at the ingestion or processing stage, rather than at the search and review stage: (1) organizing emails into threads and grouping attachments with their parent emails; (2) finding related documents or “near duplicates”; and (3) converting documents to PDF or TIFF for rendering in review (that is, no “on the fly” conversions). By doing these things up front, good ediscovery software avoids wasting the reviewer’s time.
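The idea of paying the cost once at ingestion can be sketched in a few lines. This is an assumption-laden illustration, not a description of how any particular product works: the field names `id`, `thread_id`, and `parent_id` are hypothetical, and real processing would first have to work out which emails belong to which thread.

```python
from collections import defaultdict


def build_indexes(documents):
    """Run once at ingestion: group documents by thread and by parent email."""
    threads = defaultdict(list)      # thread_id -> ids of documents in the thread
    attachments = defaultdict(list)  # parent id -> ids of its attachments
    for doc in documents:
        threads[doc["thread_id"]].append(doc["id"])
        if doc.get("parent_id") is not None:
            attachments[doc["parent_id"]].append(doc["id"])
    return dict(threads), dict(attachments)


# Hypothetical sample data: one email with an attachment, plus a reply.
docs = [
    {"id": 1, "thread_id": "t1", "parent_id": None},
    {"id": 2, "thread_id": "t1", "parent_id": 1},
    {"id": 3, "thread_id": "t1", "parent_id": None},
]
threads, attachments = build_indexes(docs)

# At review time, finding a document's family is a lookup, not a computation.
print(threads["t1"], attachments.get(1, []))
```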

Another big benefit of doing work like this up front is that it can easily be broken up across many different computers all working together. This is called parallelization. If one computer can do the job in a week, then five computers working together can do it much more quickly, say, in a day or two. Indeed, because most of this work can be done one document at a time, adding more computers lets you process the same data in less time or more data in the same time, and this processing results in faster searching and document navigation for reviewers during the search and review phase.
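Here is a minimal sketch of the pattern, using Python's standard `concurrent.futures` module. The worker threads stand in for separate computers, and `process_document` is a hypothetical placeholder for real per-document work such as text extraction or PDF conversion; a production system would more likely spread the work across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(text):
    """Placeholder for real per-document work; here it just counts words."""
    return len(text.split())


def process_in_parallel(documents, workers=5):
    """Hand each document to whichever worker is free; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_document, documents))


print(process_in_parallel(["a b c", "d e"], workers=2))
```

This works because no document's processing depends on any other document's. Steps that compare documents with each other, such as finding near duplicates, need more coordination and do not split up quite so cleanly.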

Fast hardware scaled appropriately

One of the most common causes of poor performance is an ediscovery software installation where either commodity hardware was used from the start or high-end hardware was used at first but was not scaled as the data grew. At Disco, we use the best enterprise-grade solid-state storage available in the market today, and we offer Disco on top of a managed-services infrastructure where our team handles all hardware acquisition, integration, and scaling. You shouldn’t have to worry about installations or hardware.

If speed is so important, why is so much ediscovery software so slow?

This is a great example of the disconnect between so many ediscovery vendors and the practicing litigators who are their ultimate customers. Most practicing litigators will not use the bevy of exotic features that ediscovery vendors work so hard on (the “web of email” or the “concept wheel”), but every single litigator will use the core search and review features. And litigators hate the software they use largely because these search and review features are so slow.

Lawyers who have grown up with fast Google, fast Westlaw, fast Spotlight, and fast everything are not going to tolerate the current speed deficiency for long. I predict that the coming year will see a shift in focus by the savviest ediscovery companies from superfluous features back to the core features; this will be the year of “shoring up the base” in the sense of making sure that the things lawyers do 90% of the time work as well as possible.

That’s what we’ve tried to do at Disco: see for yourself what speed looks like by running some searches and viewing some documents in the complete Enron database, available at http://demo.csdisco.com.
