Why we don’t do installations

Disco is a hosted software-as-a-service (SaaS) solution.

We’re often asked whether a law firm, corporate client, or channel partner can license Disco for installation on their own servers “behind the firewall.” We do not do behind-the-firewall installations for three main reasons:

  1. The quality of the user experience — especially speed — depends on the server and storage hardware (enterprise-grade solid-state high-volume storage) on which Disco runs. It is much harder for us to control the quality of hardware used on installations than it is to control the quality of hardware in our data centers. By controlling the hardware, we eliminate this potential cause of degraded user experience.
  2. We can push code changes daily to address issues and add features because we need to maintain only a single codebase on our servers. We don’t need to schedule regular releases, issue patches, or worry about whether particular installations are running old versions.
  3. Data is stored at Codero data centers in Arizona and Virginia. Our data centers are SSAE No. 16 certified and were SAS 70 Type II certified (SSAE No. 16 has replaced SAS 70). Data is encrypted on disk and in transit between the data centers and users using HTTPS / TLS (the successor to SSL). Usually the security available at installation locations (especially channel partners, law firms, and all but the largest companies) is not as good as the security provided by our data centers.

For the one installation we have done, we specified the hardware, have direct access to it, and can push code changes to it just as though it were at one of our data centers. We did the install to accommodate a particular client with unique governmental security requirements. While we will consider installation requests, our price for doing installs reflects our preference for the hosted SaaS approach.


Deduplication done right

You collect email from the five members of an executive team. They email each other all the time. So when #1 sends an email to #2, #3, #4, and #5, you will have five copies of that email in your collection, one from each of the five executives. But you only want your reviewers to review that email once — to save time, but also to ensure that it doesn’t get tagged one way by one reviewer and another way by a different reviewer.

If the executives are discussing a proposed contract, that contract might be attached in PDF format to 5, 10, or more emails from each of the executives. For example, #1 might send it to #3 and #4 for their comments, #1 might send it to the company’s outside counsel for review, and #1 might have received it from the contract counterparty originally. The same document would be attached to each of #1’s emails. Copies might also appear on #3’s laptop and on the file server in New York. Again, you want your reviewers to review the contract only once.

Deduplication is how you do this. Deduplication has three steps:

  1. On ingestion, you identify what documents are duplicates, remove all but one of the duplicates, and for the one you keep, store metadata about the duplicates that you removed, for example, where they came from (path, custodian, etc.). Identifying what documents are duplicates is done in a variety of ways, for example, using metadata (to, from, subject, date/time, etc.) for email or a hash value for office documents. The result of this process is a single document with metadata telling you all the places in which that document appeared.
  2. On review, reviewers see only the single document that was the result of ingestion. But that document contains the metadata for all its duplicates so that, for example, if #1 and #3 are both custodians and both of them had the document, it will show up in searches for custodian(#1) or custodian(#3). That is, the deduplicated document is a search result whenever any of the original documents would have been search results. This is why metadata from the removed duplicates is stored with the one copy that is kept for review.
  3. On production, the deduplicated documents are de-deduplicated so that the documents produced look the way they were kept in the ordinary course of business. In our example, custodians #1-#5 will each have a copy of the document and, if custodian #1 attached the document to three different emails, each of those emails will have a copy of the document following it.

Deduplication is like an accordion: you identify duplicates, collapse them all into one document for purposes of review, and then expand them back out on production so that the production looks like the data you collected in the first place.
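To make step 1 concrete, here is a minimal sketch in Python of collapsing duplicates by content hash while keeping every copy’s custodian and path metadata. The `collapse` function and the document records are illustrative, not Disco’s actual code:

```python
import hashlib
from collections import defaultdict

def collapse(documents):
    """Collapse exact duplicates into one document per content hash,
    folding every copy's custodian/path metadata into the one kept."""
    groups = defaultdict(list)
    for doc in documents:
        digest = hashlib.sha256(doc["content"].encode()).hexdigest()
        groups[digest].append(doc)
    deduped = []
    for digest, copies in groups.items():
        keeper = dict(copies[0])
        keeper["hash"] = digest
        keeper["instances"] = [
            {"custodian": c["custodian"], "path": c["path"]} for c in copies
        ]
        deduped.append(keeper)
    return deduped

docs = [
    {"content": "proposed contract v3", "custodian": "#1", "path": "/mail/out"},
    {"content": "proposed contract v3", "custodian": "#3", "path": "/laptop"},
    {"content": "meeting notes", "custodian": "#2", "path": "/mail/in"},
]
deduped = collapse(docs)
# Two unique documents survive; the contract keeps both custodians'
# metadata, so it matches searches for custodian(#1) or custodian(#3).
```

Because the kept copy carries the metadata of every instance, step 2 (searching by any original custodian) falls out for free.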

Unfortunately, a lot of software deduplicates incorrectly. For example, some software will deduplicate between custodians (our first situation above), but not within a custodian (our second situation above). The reasoning is that each time a document shows up as an attachment, you want a separate opportunity to tag it: if the document is attached to emails A and B, and email A is tagged relevant while email B is tagged privileged, then, the argument goes, the document should be tagged relevant when attached to email A and privileged when attached to email B, even though it’s the same document.

This argument for “partial” or “custodian-level” deduplication is unpersuasive. In the example above, the same document is at issue: it is either privileged or not, and sending it with email A either destroyed the privilege or it did not. If email A destroyed the privilege, then the attachment is not privileged. By showing only one copy of the document during review, the review system forces reviewers to make this call (and not, for example, to privilege-log a document a duplicate of which was produced as an attachment to a relevant email). Documents, not instances or appearances of documents, are the unit of review.

Custodian-level deduplication is also sometimes used to compensate for defects in the software used to produce documents. For example, suppose a review set contains emails A, B, and C; a document X is attached to all three; and A and B are tagged “hold back” while C and X are tagged “relevant.” Some production software will produce C and three copies of X: X as C’s attachment, plus X twice more because it is a relevant attachment to A and B, even though A and B themselves are not produced. Custodian-level deduplication prevents this because the copies of X attached to A and B would also have been tagged “hold back” and so would not have been produced.

But the correct answer to this problem is not custodian-level deduplication. It is, on production, to include copies of attachments that themselves fall within a production only (a) once per parent that is being produced or (b) once alone if no parent is being produced. That is, instead of fixing a bug at step 3 of deduplication (de-deduplicating on production), custodian-level deduplication attempts to work around that bug by introducing a compensating bug at step 1. To correct for improper de-deduplication, systems like this don’t deduplicate fully, which opens the door to inconsistent tags and forces reviewers to review documents multiple times.
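The correct rule can be sketched in a few lines, using the A/B/C/X example from above. The `produce` function here is a hypothetical illustration of that rule, not any vendor’s actual production code:

```python
def produce(emails, attachments, tags):
    """Correct de-deduplication on production: an attachment in the
    production set appears once under each *produced* parent, or once
    standalone if none of its parents is produced."""
    produced = []
    attached_somewhere = set()
    for email in emails:
        if tags.get(email["id"]) != "relevant":
            continue  # held-back parents are not produced at all
        produced.append(email["id"])
        for att in email["attachments"]:
            if tags.get(att) == "relevant":
                produced.append(att)      # once per produced parent
                attached_somewhere.add(att)
    for att in attachments:
        if tags.get(att) == "relevant" and att not in attached_somewhere:
            produced.append(att)          # once alone: no parent produced
    return produced

emails = [
    {"id": "A", "attachments": ["X"]},
    {"id": "B", "attachments": ["X"]},
    {"id": "C", "attachments": ["X"]},
]
tags = {"A": "hold back", "B": "hold back", "C": "relevant", "X": "relevant"}
# produce(emails, ["X"], tags) yields ["C", "X"]: a single copy of X,
# following its one produced parent C; nothing leaks out via A or B.
```

Note that the fix lives entirely in the production step; deduplication at ingest stays complete, so reviewers still see each document exactly once.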

The takeaway from all this is simple: deduplication is a hard task that software designers should think through carefully and that users should not mess with. Disco deduplicates and de-deduplicates correctly so that you don’t have to go down this rabbit hole yourself. It just works.

Automagical processing

Disco has automagical processing.

When you send us native data, all we do is click “go.” Disco handles OCR, imaging everything to PDF, creating email threads, unzipping containers to arbitrary depth, linking emails and attachments and parents and embedded objects, normalizing timezones, deduplication, deNISTing, detecting near duplicates, and everything else that’s involved in “processing” and “ingestion.”

When you send us images with a load file (for example, a production from the other side), all we do is click “go.” Disco handles extracting data from load files in any standard format (EDRM, Concordance, Summation), loading the images into the database, and correcting or supplementing metadata as necessary (by running the processing and ingestion tasks just described if they haven’t been done or have been done poorly). If the load file has random extra fields, Disco loads them as searchable custom fields.
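As an illustration of the load-file side, a Concordance-style DAT file conventionally delimits fields with ASCII 0x14 and quotes values with the thorn character (þ, 0xFE). A minimal sketch of pulling records, including unexpected extra fields, out of such a file might look like this (`parse_dat` and the sample data are hypothetical; real load files vary and need more defensive parsing):

```python
# Conventional Concordance delimiters; production code would detect
# the actual delimiters per file rather than assume them.
DELIM = "\x14"
QUOTE = "\u00fe"  # þ

def parse_dat(text):
    """Parse a Concordance-style DAT into a list of record dicts."""
    lines = text.strip().splitlines()
    rows = [
        [field.strip(QUOTE) for field in line.split(DELIM)]
        for line in lines
    ]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]

sample = (
    "þBEGBATESþ\x14þCUSTODIANþ\x14þMYFIELDþ\n"
    "þABC000001þ\x14þ#1þ\x14þextra valueþ\n"
)
docs = parse_dat(sample)
# Unknown columns like MYFIELD simply come through as extra keys,
# ready to be indexed as searchable custom fields.
```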

Processing involves all kinds of subtle decisions: how deeply to deduplicate (by custodian? attachments?) and how much to de-deduplicate on production; to what timezone to normalize; how to backfill email addresses (x@y.com) where an Exchange server dumped only names (“Billy Patterson”); whether to show an image logo that appears in thousands of emails as an attachment (please, no); etc. Disco makes those decisions sensibly so you don’t need to worry about them. You feed in data and Disco returns a beautiful review set.
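Take timezone normalization as one of those subtle decisions: each extracted timestamp is naive, so it must be anchored in its source timezone and then converted to a single review timezone before threads will sort correctly. A sketch using Python’s standard zoneinfo module (the function and values are illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def normalize(sent, source_tz, review_tz="UTC"):
    """Interpret a naive extracted timestamp in the custodian's
    timezone, then convert it to the review set's single timezone."""
    aware = sent.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(ZoneInfo(review_tz))

# The same instant, extracted from two custodians' mailboxes:
ny = normalize(datetime(2013, 6, 1, 9, 30), "America/New_York")
la = normalize(datetime(2013, 6, 1, 6, 30), "America/Los_Angeles")
# Both normalize to 13:30 UTC, so the two copies of the email
# line up in the thread instead of appearing hours apart.
```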

If you’re like me, you might think that this kind of automatic processing is standard in the industry. Shockingly, it’s not.

Most people process data using manual tools like Law or Nuix that require an operator to go through the data and specify what happens to it; it’s the opposite of just clicking go. The situation is the same on production: in order to get black-and-white single-page TIFFs and color JPEGs that comply with the SEC’s rules, some people actually look through the documents, find the color ones, and then create color JPEGs for those documents. This is why vendors have teams of project managers, processing engineers, and QC teams to look over their work: they have to, because their process is manual.

Disco replaces these manual processing teams with software. The upside for you is that complex decisions about how to process are made correctly and those correct decisions are implemented by computers with the reliability that only they can provide — and at a fraction of the price of the manual alternative. To take the color example, Disco algorithmically detects color and creates color JPEGs when color is detected. (This is not as easy as it sounds, but it didn’t even occur to us to do this manually; that sort of thing is our nightmare.)
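To make the color example concrete, one simple way to detect color algorithmically is to treat a page as color if any pixel’s RGB channels diverge by more than a small tolerance. This sketch operates on already-decoded pixel tuples; a real implementation would decode the page with an imaging library and would likely sample pixels rather than scan every one:

```python
def has_color(pixels, tolerance=8):
    """Treat a page as color when any pixel's RGB channels differ by
    more than a small tolerance (the tolerance absorbs scanner noise
    that leaves 'gray' pixels slightly off-balance)."""
    return any(max(p) - min(p) > tolerance for p in pixels)

grayscale_page = [(128, 128, 128), (131, 130, 129), (0, 0, 0)]
highlighted_page = grayscale_page + [(200, 40, 40)]  # a red stamp
# has_color(grayscale_page) is False; has_color(highlighted_page) is
# True, so only the second page would be imaged as a color JPEG.
```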

When you hear a vendor bragging about their “processing team,” what they’re really telling you is that their software sucks. Because everything that processing team is doing should be done — better, faster, cheaper — by computers. And the processing team, experts in ediscovery, should be using their knowledge to add value in managing and structuring reviews, sharing searching and tagging best practices, helping with collections, and the like.

How are you so much faster?

The first thing you notice about Disco is how fast it is: 10x faster than the advertised speeds of some of our competitors. How do we get that speed?

One is the tech: a modern web stack on top of Lucene for search and RavenDB (an open-source document database), all running on fast, enterprise-grade solid-state storage. Our competitors mostly run dtSearch on top of SQL databases, and the quality of their hardware varies greatly because their channel partners or end users run their own hardware.

Two is that we do everything computationally intensive on ingest rather than during review. That includes OCR, imaging everything to PDF so we can display a near-native rendering in the browser (no need to “TIFF on the fly” because we’ve done it all on ingest), recreating threads, deduplicating, detecting near duplicates, extracting text and metadata, and other “processing” tasks. By doing things on ingest rather than review, (a) we can parallelize as much of it as possible and (b) the user doesn’t have to wait while computation happens during review. Think of it as trading off storage for speed — as a result of our processing, we have a net 2x growth in data we have to store, but we get a 10x speed advantage.
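The ingest-time strategy amounts to a parallel map over the collection, with every expensive step running once, up front. In this sketch the worker functions are trivial stand-ins for the real OCR, PDF rendering, and hashing work:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real ingest-time work (OCR, rendering to PDF,
# thread reconstruction, ...); the point is that all of it runs
# once, in parallel, before any reviewer opens a document.
def extract_text(doc):
    return doc["raw"].upper()

def content_hash(doc):
    return hashlib.sha256(doc["raw"].encode()).hexdigest()

def ingest(doc):
    return {"id": doc["id"],
            "text": extract_text(doc),
            "hash": content_hash(doc)}  # feeds deduplication

collected = [{"id": i, "raw": f"document {i}"} for i in range(100)]
with ThreadPoolExecutor() as pool:
    processed = list(pool.map(ingest, collected))
# Review-time display is now a plain read of precomputed results:
# no computation happens while a reviewer waits, at the cost of
# storing everything the ingest pass produced.
```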

Three is design (this is not really computer speed, but speed from the user’s point of view). Disco’s user interface lets you do everything from three screens: search, review, and production. Unlike our competitors’ tools, Disco has no separate screens for search builders, filters, “advanced” searching, mass tagging, etc. We’ve also set up linear review so that it can be done using the keyboard only: shortcuts for navigating between documents and for tagging. These things seem minor — until you’re in the middle of a 10,000,000-document review that occupies your life as a junior associate for six months.

Of course, there’s some secret sauce under all this, but top-end modern tech, doing everything on ingest, and designing for speed of use are really the top three contributors to Disco’s smoking the competition.

Early case assessment (ECA) is obsolete

Early case assessment (ECA) software takes collected data (scanned documents, native documents, emails, forensic images, etc.) and lets you: (1) run searches using keywords, date ranges, etc. on the data; (2) see analytics for the data as a whole and for sets of search results, for example, number of documents, file types, custodians, and common words or phrases; and (3) “cull” the data before fully processing it and loading it into separate review software so that only part of the data collected actually makes it into the review software.

The only reason people think they need separate ECA software is price: it might cost $30 / GB or $40 / GB to process and load data into ECA software, whereas it might cost $200 / GB, $300 / GB, $400 / GB, or more to load the same data into review software. By loading the data into ECA software first, then using the ECA software to cull the data so that much less remains to be loaded into review software, you can greatly reduce the total cost of getting the data into the review software. Or so the argument for ECA goes.
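The arithmetic behind that argument is easy to sketch with hypothetical numbers drawn from the ranges above (all rates and volumes here are illustrative):

```python
# Hypothetical volumes and per-GB rates.
collected_gb = 100
culled_gb = 25        # what survives ECA culling
eca_rate = 35         # $/GB to process into ECA software
review_rate = 300     # $/GB to load into traditional review software

with_eca = collected_gb * eca_rate + culled_gb * review_rate
without_eca = collected_gb * review_rate

print(with_eca)     # 11000
print(without_eca)  # 30000
```

Close the rate gap, so that loading into review software costs roughly what loading into ECA software costs, and the savings vanish entirely.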

If you take away this cost disparity — that is, if you make loading data into review software cost the same as or less than loading it into ECA software — then there is no good reason to use ECA software. And there are many reasons why using ECA software is bad. The main one is that culling, understood as deciding on a broad-stroke basis what data merits further review, is not a process that should take place only at the outset of litigation. As a case progresses, as claims are added, new important facts are discovered, and parties are added or removed, the data that merits further review will change.

For example, when a case first starts out, a particular custodian or a particular date range might seem unimportant and might be culled out at the ECA stage. But later in the case that custodian or date range might become relevant, or even critical: perhaps the other side claims that a contract was modified orally or by email two years after it was negotiated and two years before the breach; or perhaps an explosion is tracked to an alleged defect in a particular valve previously thought unimportant, but only after the expert testimony has been developed. In situations like this, the culling decision you would make with the new information is different from the culling decision you would have made at the outset of the case.

What you want is for all the data you’ve collected to be available all the time. You can run searches on that data and tag the search results (say, “For Further Review” or “Possibly Relevant”) to do the equivalent of ECA culling. Then review assignments can be drawn from the documents so tagged. But, if what’s possibly relevant changes later, you can either untag these documents or tag other documents, effectively updating your culling decisions in real time. Because all the data is already in the review tool, there is no additional cost or delay in processing new data to be added or taking that new data through ECA.

(Here’s another way of making the same point. Lawyers often fear predictive coding because it seems too simplistic; how can you generalize from some documents to others without individual review? But culling based on search terms or date ranges is egregiously simplistic generalization, far more simplistic than even the worst predictive coding on the market.)

Disco includes all ECA features as part of the review tool itself. And Disco’s pricing and automated processing make it cost-effective to just load all the data you collect into Disco and use tagging in place of ECA and culling. ECA, like so many other aspects of ediscovery, is not a best practice, or even a reasonable practice, but rather a second-best or third-best reaction to how expensive processing and loading into other review tools is. In short, Disco makes separate ECA tools obsolete.