From time to time we receive questions about Disco’s technology underpinnings and whether it is robust enough to handle large amounts of data. The short answer is yes: Disco can handle some of the largest datasets likely to be found in litigation.
While our engineers can provide a more detailed description, a high-level, layman’s description is often helpful. Disco uses RavenDB as the underlying database and Apache Lucene to handle indexing and searching.
RavenDB is an open-source, document-oriented database. Document-oriented databases help power companies like Amazon, Google and Facebook. As the name suggests, they are ideal for data organized as discrete collections (figuratively called “documents”), which makes them an obvious choice for storing the literal documents of a document review platform like Disco. When retrieving a document, the application has to make only one call to the database instead of the many calls necessary to assemble the same data from multiple tables in a SQL-based database.
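The difference in retrieval can be illustrated with a small sketch. This is not RavenDB’s API: the document store here is just a Python dictionary, the relational side uses the standard-library `sqlite3` module, and the table names, key format and sample record are all hypothetical.

```python
import sqlite3

# Relational layout: one review document is spread across three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (doc_id INTEGER, name TEXT);
    CREATE TABLE tags (doc_id INTEGER, tag TEXT);
    INSERT INTO docs VALUES (1, 'Supply agreement');
    INSERT INTO authors VALUES (1, 'J. Smith');
    INSERT INTO tags VALUES (1, 'privileged'), (1, 'contract');
""")

def load_relational(doc_id):
    """Reassemble one document with one query per table (three calls)."""
    title = conn.execute(
        "SELECT title FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()[0]
    authors = [row[0] for row in conn.execute(
        "SELECT name FROM authors WHERE doc_id = ?", (doc_id,))]
    tags = [row[0] for row in conn.execute(
        "SELECT tag FROM tags WHERE doc_id = ? ORDER BY tag", (doc_id,))]
    return {"title": title, "authors": authors, "tags": tags}

# Document layout: the same record is stored whole under a single key.
# A plain dict stands in for the document database (an assumption for illustration).
document_store = {
    "docs/1": {
        "title": "Supply agreement",
        "authors": ["J. Smith"],
        "tags": ["contract", "privileged"],
    },
}

def load_document(key):
    """Fetch the whole document with a single lookup."""
    return document_store[key]
```

Both functions return the same record; the relational version simply has to visit three tables to build it, while the document version reads it in one step.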
RavenDB has a 16 terabyte limit for data in a single database and no limit on individual file size. If storage needs exceed 16 terabytes, strategies such as compression and sharding allow for even larger data sizes. Let’s put that in perspective. While there is no reliable way to determine file size except by looking at each file, a fairly well recognized average is that one gigabyte of data holds approximately 16,000 to 17,000 Word documents. Very roughly speaking, 16 terabytes is about 16,000 gigabytes, so at 16,000 documents per gigabyte RavenDB could handle about 256,000,000 Word documents in a single database.
A different measure common in the legal industry is the banker’s box. A banker’s box holds between 2,700 and 3,000 pages. The average scanned size of paper documents is about 125 megabytes per box, or 8 boxes per gigabyte. At roughly 16,000 gigabytes, 16 terabytes of storage space means RavenDB could handle about 128,000 banker’s boxes in a single database.
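For readers who want to check the numbers, the two estimates above can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 terabyte = 1,000 gigabytes) and the low end of the Word document average; the page total in the last line is derived from the per-box figures and is only as good as those averages.

```python
# Assumptions: decimal units (1 TB = 1,000 GB) and the low end of each average.
DATABASE_LIMIT_TB = 16
GB_PER_TB = 1_000
WORD_DOCS_PER_GB = 16_000       # low end of the 16,000 to 17,000 range
BOXES_PER_GB = 8                # about 125 MB per scanned box
PAGES_PER_BOX = (2_700, 3_000)  # low and high page counts per box

limit_gb = DATABASE_LIMIT_TB * GB_PER_TB
word_docs = limit_gb * WORD_DOCS_PER_GB
boxes = limit_gb * BOXES_PER_GB
pages_low, pages_high = (boxes * pages for pages in PAGES_PER_BOX)

print(f"{word_docs:,} Word documents")            # 256,000,000 Word documents
print(f"{boxes:,} banker's boxes")                # 128,000 banker's boxes
print(f"{pages_low:,} to {pages_high:,} pages")   # 345,600,000 to 384,000,000 pages
```

Using binary units (1,024 gigabytes per terabyte) or the high end of the averages would push each figure up slightly, which is why these are best read as rough orders of magnitude.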
Lucene is a similarly robust, open-source library for full-text indexing and search. Lucene is used by companies such as Apple, Twitter (for their real-time search), and LinkedIn.
To allow fast searching, Disco runs searches against document indexes created when documents are ingested into the database. Lucene builds those indexes and runs the searches. Lucene can handle an upper limit of approximately 2.1 billion documents in a single index. In practice, Lucene indexes won’t reach that limit and will rarely exceed a few hundred million documents except in the very largest of cases.
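The idea of building an index at ingestion time and searching the index rather than the documents can be shown with a toy inverted index. This is a simplified illustration, not Lucene’s implementation or API; the `TinyIndex` class and its methods are hypothetical, and real engines add tokenization, ranking and on-disk storage.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def ingest(self, doc_id, text):
        # Index at ingestion time so that later searches never rescan the text.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, *terms):
        # Return ids of documents that contain every query term (an AND search).
        sets = [self.postings.get(term.lower(), set()) for term in terms]
        return sorted(set.intersection(*sets)) if sets else []

index = TinyIndex()
index.ingest(1, "Supply agreement draft")
index.ingest(2, "Signed supply agreement")
index.ingest(3, "Meeting notes")

print(index.search("supply", "agreement"))  # [1, 2]
print(index.search("signed"))               # [2]
```

Because each search only looks up terms in the index, its cost depends mainly on how many documents match, not on how much text was ingested.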