Good, Fast, Cheap: Pick Three

Everyone has heard the old joke:  “Fast, good, cheap: pick any two.”  We believe you can have all three with Disco.

We have the fast part down.  As detailed in an earlier post, Disco is extremely fast and beats the advertised speeds of any eDiscovery software out there.

We think we have the good part handled.  The software is easy to use, with search syntax that lawyers already know and an interface familiar to anyone who has ever done an internet search.  The software also provides an ingestion engine and a production page, which allow users to run a review from start to finish with plain-English explanations and little support required.

We also have the cheap part covered.  We offer Disco at what we believe to be some of the lowest prices in the market.  Our price includes everything: ingestion (including deduplication, threading, and deNISTing, among others), review (with near-native PDF, text-only, and native views), and production.  There is simply no need to have separate tools for ingestion, ECA, and production (and pay the associated costs).  Since Disco is offered in a software-as-a-service model, users also get the latest improvements and upgrades as soon as they are rolled out, at no additional cost.

So, why settle for two when you can have all three?

The Technology Underlying Disco

From time to time we receive questions about Disco’s technological underpinnings and whether they are robust enough to handle large amounts of data.  The short answer is yes:  Disco can handle some of the largest datasets likely to be found in litigation.

While our engineers can provide a more detailed description, a high-level, layman’s description is often helpful.  Disco uses RavenDB as the underlying database and Apache Lucene to handle the searching and indexing.

RavenDB is an open source, document-oriented database.  Document-oriented databases help drive companies like Amazon, Google and Facebook.  As the name suggests, document-oriented databases are ideal for data organized as discrete collections (figuratively called “documents”), which makes them an obvious choice for storing the literal documents of a document review platform like Disco.  When retrieving a document, the application has to make only one call to the database instead of the many calls necessary to retrieve the data from multiple tables in a SQL-based database.
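
To make the "one call versus many" point concrete, here is a minimal sketch in Python.  It does not use RavenDB's API; the in-memory tables, the `emails/1` key, and the function names are all hypothetical stand-ins chosen for illustration.  It simply counts how many lookups are needed to assemble the same record in each storage style.

```python
import json

# Hypothetical in-memory stand-ins for two storage styles.
# This is an illustration only and is NOT RavenDB's API.

# Relational style: one reviewed email is spread across three tables.
emails = {1: {"subject": "Q3 forecast", "sender": "a@example.com"}}
recipients = [
    {"email_id": 1, "address": "b@example.com"},
    {"email_id": 1, "address": "c@example.com"},
]
tags = [{"email_id": 1, "tag": "privileged"}]


def fetch_relational(email_id):
    """Assemble one email from three tables; returns (record, lookups)."""
    lookups = 0
    record = dict(emails[email_id])
    lookups += 1
    record["recipients"] = [
        r["address"] for r in recipients if r["email_id"] == email_id
    ]
    lookups += 1
    record["tags"] = [t["tag"] for t in tags if t["email_id"] == email_id]
    lookups += 1
    return record, lookups


# Document style: the whole email is stored as one JSON document under one key.
document_store = {
    "emails/1": json.dumps({
        "subject": "Q3 forecast",
        "sender": "a@example.com",
        "recipients": ["b@example.com", "c@example.com"],
        "tags": ["privileged"],
    })
}


def fetch_document(key):
    """Load one complete document with a single lookup."""
    return json.loads(document_store[key]), 1


relational_record, relational_lookups = fetch_relational(1)
document_record, document_lookups = fetch_document("emails/1")
print(relational_lookups, document_lookups)
```

Both functions return the same record; the difference is only in how many separate lookups it takes to put it together.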

RavenDB has a 16 terabyte limit for data in a single database and no limit on individual file size.  If storage needs exceed 16 terabytes, strategies such as compression and sharding allow for even larger data sizes.  Let’s put that in perspective.  While there is no effective way to determine file size except by looking at each file, a fairly well recognized average is that one gigabyte of data holds approximately 16,000 to 17,000 Word documents.  Sixteen terabytes is roughly 16,000 gigabytes, so, very roughly speaking, RavenDB could handle about 256,000,000 Word documents in a single database.
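
The back-of-the-envelope arithmetic can be checked with a short Python sketch.  The figures come from the paragraph above; treating one terabyte as 1,000 gigabytes is an assumption made here to match the rounded total, and the function name is hypothetical.

```python
# Figures taken from the text above.
LIMIT_TB = 16
DOCS_PER_GB_LOW = 16_000
DOCS_PER_GB_HIGH = 17_000

# Assumption: decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1_000


def word_docs_capacity(docs_per_gb, limit_tb=LIMIT_TB):
    """Rough number of Word documents that fit in limit_tb terabytes."""
    return limit_tb * GB_PER_TB * docs_per_gb


low_estimate = word_docs_capacity(DOCS_PER_GB_LOW)
high_estimate = word_docs_capacity(DOCS_PER_GB_HIGH)
print(f"{low_estimate:,} to {high_estimate:,} Word documents")
# prints "256,000,000 to 272,000,000 Word documents"
```

The 256,000,000 figure corresponds to the low end of the per-gigabyte average.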

A different measure common in the legal industry is the banker’s box.  A banker’s box holds between 2,700 and 3,000 pages.  The average scanned size of paper documents is about 125 megabytes per box, or 8 boxes per gigabyte.  At 8 boxes per gigabyte, 16 terabytes (roughly 16,000 gigabytes) of storage space means RavenDB could handle roughly 128,000 banker’s boxes in a single database.
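
The same kind of rough check works for boxes.  This Python sketch uses only the figures in the paragraph above, again assuming decimal units (1,000 megabytes per gigabyte and 1,000 gigabytes per terabyte); the page totals it prints are derived from those figures and are equally approximate.

```python
# Figures taken from the text above.
LIMIT_TB = 16
MB_PER_BOX = 125
PAGES_PER_BOX_LOW = 2_700
PAGES_PER_BOX_HIGH = 3_000

# Assumption: decimal units throughout.
MB_PER_GB = 1_000
GB_PER_TB = 1_000

boxes_per_gb = MB_PER_GB // MB_PER_BOX            # 8 boxes per gigabyte
total_boxes = LIMIT_TB * GB_PER_TB * boxes_per_gb  # boxes in 16 terabytes
pages_low = total_boxes * PAGES_PER_BOX_LOW
pages_high = total_boxes * PAGES_PER_BOX_HIGH

print(f"{total_boxes:,} boxes, about {pages_low:,} to {pages_high:,} pages")
# prints "128,000 boxes, about 345,600,000 to 384,000,000 pages"
```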

Lucene is a similarly robust, open source, full-text information retrieval software library.  Lucene is used by companies such as Apple, Twitter (for its real-time search), and LinkedIn.

To allow fast searching, Disco runs searches against document indexes created when documents are ingested into the database.  Lucene provides those indexes and runs the searches.  A single Lucene index can hold approximately 2.1 billion documents as an upper limit.  From a practical perspective, Lucene indexes are unlikely to reach that limit and will rarely get above a few hundred million documents except in the absolute largest of cases.
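
For readers curious why searching a prebuilt index is fast, here is a toy inverted index in Python.  It is a simplified sketch of the general idea, not Lucene's implementation or API: the function names and sample documents are hypothetical, and real engines add tokenization rules, ranking, and on-disk structures.  The index is built once at ingestion, and a search then only looks up terms instead of rereading every document.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, indexed once at "ingestion" time.
docs = {
    1: "merger agreement draft",
    2: "lunch on friday",
    3: "signed merger agreement",
}
index = build_index(docs)
hits = sorted(search(index, "merger agreement"))
print(hits)  # prints [1, 3]
```

As an aside, the roughly 2.1 billion ceiling is consistent with document IDs being held in a signed 32-bit integer, whose maximum value is 2,147,483,647; that connection is our assumption rather than something stated above.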