Future of legal big data

The party line in the ediscovery and information-lifecycle-governance industries is that businesses need tools to reduce the amount of data they retain and the amount of data they produce in connection with investigations and litigation. This is wrong.

The party line is historically driven. The cost, in the past, of collecting, processing, reviewing, and producing electronically stored information is what led to the party line’s adoption. When collection services cost $300 per hour, processing costs $250 per gigabyte, and review involves thousands of hours of lawyer time plus hosting fees, user fees, and the like, the natural reaction is to reduce, by any means possible, the amount of data that has to be collected, processed, reviewed, and produced. This thinking resonates nicely with the desire of businesses to keep their information confidential and to produce as little as they defensibly can.

What do you see when you set aside the party line and consider the problem of big data from a business perspective, in light of the technology available today and the technology we expect to be available soon?

The first and most important realization is that retention, storage, and access to business data should be driven by operational concerns: businesses should decide what data to keep and how to store it and make it accessible based on how that data can help them make money. The prior approach, where retention, storage, and access are driven by compliance, regulatory, or litigation concerns, is the tail wagging the dog. We should think first about how to keep data and use it to advance business goals and, only secondarily, about how to satisfy compliance, regulatory, and litigation requirements at reasonable cost.

Email is a great example. Companies, driven by the party line, are looking for more and more aggressive tools to reduce the email they keep and to reduce further the email that is subject to review in investigations and litigation. But email, in almost all companies, is the primary, natural repository of operational information: it’s where business people live. The operational value of tools that can extract actionable business information from email is tremendous. The holy grail is technology that can extract this information with no change in the way people create and use email in day-to-day business. Employees generate valuable information in email as always, and technology surfaces that email when there is an operational need for it.

What problems can we expect to face as we open a new office in New York? Well, let’s extract that information from the email (and other documents and electronically stored information) that we created when we opened a new office in Boston last year. This kind of operational thinking — how can we generate actionable business information from the data we create in the ordinary course of business — is what should drive decisions about how data is retained and stored.

The concern that motivates the party line, the cost and intrusiveness of discovery, should be addressed not by reducing the amount of data we keep, but by reducing the cost per unit of data of storing, collecting, accessing, and reviewing that data. This is what we’re out to do at Disco. Already, we’ve brought to market ediscovery software that delivers a 10x speed improvement in search and review and does it at perfectly predictable flat-fee pricing that is half or less of the prices charged by others. Speed will only increase; cost will only fall.

The operational usefulness of data is the carrot for keeping it. The stick is that the arguments for why it isn’t available — why it wasn’t retained, why it can’t be collected or searched or reviewed — are going to get weaker over time. As technology like Disco makes it easier to collect, search, and review data, businesses and lawyers will be harder and harder pressed to argue that it can’t or shouldn’t be done.

The right approach is to retain lots of data, use technology to extract actionable business information from it, and use technology to reduce, dramatically, the costs of handling that data in investigations and litigation. Destroying data isn’t the answer.

Disco Release Notes

Like most software, Disco is updated constantly. To share some of the most recent updates, here are the release notes for December 16, 2013 to January 20, 2014.


  • Near-native PDFs now display in the review tab for unknown file types
  • Extra whitespace in Concordance load files is now supported
  • Added ImagePath as supported load file metadata
  • Addressed a concurrency issue to allow more than one production to be run from a single database at the same time
  • Improved production speed by using batch inserts
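
The batch-insert item above is a general technique: many rows are written in one call and one transaction instead of one round trip per row. The following is a minimal sketch of that idea only. It uses Python's built-in sqlite3 module purely for illustration; the table name `produced_docs`, its columns, and the helper `insert_batch` are hypothetical and say nothing about how Disco stores production data.

```python
import sqlite3

def insert_batch(conn, rows):
    """Insert all rows with a single executemany call inside one transaction."""
    # Using the connection as a context manager commits once on success
    # and rolls back if any row fails.
    with conn:
        conn.executemany(
            "INSERT INTO produced_docs (bates, path) VALUES (?, ?)", rows
        )

# Hypothetical data: 1,000 produced documents with sequential Bates numbers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE produced_docs (bates TEXT PRIMARY KEY, path TEXT)")
rows = [(f"ABC{n:07d}", f"natives/{n}.pdf") for n in range(1, 1001)]
insert_batch(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM produced_docs").fetchone()[0]
```

The gain usually comes from paying the per-statement and per-commit overhead once per batch rather than once per document.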


  • Improved date range search speed
  • Addressed concurrency issue to ensure all documents are timely marked for production
  • Increased page responsiveness by changing the way audit events are written
  • Use the PDF and OCR text supplied in the load file if a native document fails to load
  • Improved production performance
  • Fixed attachments of deduplicated parents not receiving all deduplicated metadata
  • Fixed range queries for Bates numbers that contain spaces
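
To illustrate why spaces matter for the Bates range item above, here is a rough sketch of one way to compare Bates numbers so that "ABC 0000150" and "ABC0000150" are treated alike. The functions `parse_bates` and `in_bates_range`, and the assumption that a Bates number is a prefix followed by a trailing run of digits, are illustrative only and are not taken from Disco.

```python
import re

# Assumed shape: an arbitrary prefix followed by a trailing run of digits,
# with optional whitespace anywhere around or between them.
_BATES = re.compile(r"^\s*(.*?)\s*(\d+)\s*$")

def parse_bates(value):
    """Split a Bates number into (prefix, number), ignoring whitespace."""
    match = _BATES.match(value)
    if match is None:
        raise ValueError(f"not a Bates number: {value!r}")
    prefix, digits = match.groups()
    # Remove whitespace inside the prefix so spacing differences do not matter.
    return "".join(prefix.split()), int(digits)

def in_bates_range(value, start, end):
    """Return True if value shares the range's prefix and falls between start and end."""
    prefix, number = parse_bates(value)
    start_prefix, start_number = parse_bates(start)
    end_prefix, end_number = parse_bates(end)
    if not (prefix == start_prefix == end_prefix):
        return False
    return start_number <= number <= end_number
```

Comparing the numeric part as an integer, rather than comparing the raw strings, also avoids surprises when zero padding differs between documents.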


  • Improved text encoding detection for UTF-8 files that have no byte order mark (BOM)
  • Made custom fields available for searching in search builder
  • Added custom fields to search results report
  • Supported searching on custom field names that contain spaces
  • Fixed Bates extraction from email file names
  • Added support for overlay files that only use custom fields
  • Changed search builder to handle custom field names with spaces
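
The encoding-detection item in the list above concerns files that are UTF-8 but carry no byte order mark. One common heuristic, sketched here under assumptions, is to attempt a strict UTF-8 decode and fall back to a legacy encoding only if that fails. The function name `decode_text` and the choice of cp1252 as the fallback are hypothetical; Disco's actual detection logic is not described in these notes.

```python
import codecs

def decode_text(data, fallback="cp1252"):
    """Decode bytes as UTF-8 when valid, otherwise use a fallback encoding.

    Returns a (text, encoding_used) pair.
    """
    # Strip a UTF-8 byte order mark if one happens to be present.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8,
        # which is what makes this usable as a detection step.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback for legacy Western European text.
        return data.decode(fallback, errors="replace"), fallback
```

This works because text in most legacy single-byte encodings rarely happens to form valid multi-byte UTF-8 sequences, so a successful strict decode is strong evidence that the file really is UTF-8.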


  • Upgraded to a newer version of RavenDB
  • Improved PDF rendering in IE 11
  • Improved searching in the DocumentNote field to include quoted phrases
  • Improved mass tagging speed
  • Improved searching with proximity clauses that contain phrases


  • Improved search syntax to handle unusual searches (e.g., a/n (b c d))
  • Changed indexing to better capture PowerPoint notes
  • Improved handling of quotes from search syntax copied from Word documents
  • Altered processing to leave zip files and other containers in a load file as natives
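
The item above about search syntax copied from Word documents most likely relates to Word replacing straight quotes with typographic ("curly") quotes, which a query parser may not recognize as phrase delimiters. As a small sketch of how such input could be normalized before parsing, assuming that is the issue, the hypothetical helper `normalize_quotes` below maps curly quotes to their plain ASCII equivalents.

```python
# Map typographic quotes, as produced by Word's autocorrect, to ASCII quotes.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
})

def normalize_quotes(query):
    """Return the query with curly quotes replaced by straight quotes."""
    return query.translate(_QUOTE_MAP)

normalized = normalize_quotes("\u201cbreach of contract\u201d AND O\u2019Brien")
```

Running the normalization before the query is parsed means the rest of the search pipeline only ever has to handle one kind of quote character.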