Disco Deep Dive: Processing

We will be releasing a series of “Deep Dive” documents that take users into the details of Disco. The first of these is Disco Deep Dive: Processing, which covers Disco’s 12-step automated processing for electronically stored information.

*     *     *

Disco includes full processing of electronically stored information.

Processing involves extracting all searchable information, including metadata and full text, from documents and email and formatting them for lawyer review. Disco processes email from Exchange, Outlook, and Lotus, including files in PST, MBOX, EML, and MSG format; Excel, PowerPoint, and Word documents; PDFs; some CAD and CAM files; images, including JPGs, GIFs, PNGs, and TIFFs; and audio and video files.

No preprocessing is required before loading native data into Disco. Files that Disco does not process, for example, proprietary databases, are not full-text searchable and cannot be imaged by Disco, but can still be searched by their metadata, downloaded, tagged, and produced.

Processing is a twelve-step process:

  1. Extract files from all containers and compressed archives. Disco extracts files from containers contained in other containers all the way down to arbitrary depth. For example, Disco will extract an Excel file embedded in a Word document contained in a ZIP file attached to an email contained in a PST. In addition to ordinary containers and compressed archives, Disco can extract files from forensic images.
  2. Remove system files. Disco removes “system files,” like the copy of Windows or Word collected as part of an image of a custodian’s computer. The National Institute of Standards and Technology (NIST) publishes the industry-standard list of system files. Removing these files is sometimes called deNISTing.
  3. Extract all text and metadata. Disco extracts all available text and metadata from native documents. This includes, for example, the full text of a Word document, the text in all the cells of an Excel spreadsheet, the sent date of an email, the created and last modified dates of files on a file system, and the like. The extracted text and metadata become fully searchable in Disco.
  4. OCR all images. Disco runs optical character recognition (OCR) on all images, including scanned documents, and uses the OCR results as the full text for those images. To make the OCR quality as high as possible, Disco rotates images and can be configured for foreign languages. Disco uses the Tesseract OCR engine, which was originally developed at Hewlett-Packard, released as open source, and later developed with Google’s sponsorship. Disco also recognizes printed or imaged email from its OCR output and creates parent–child and conversation relationships for email recognized in this way.
  5. Normalize time zones. Documents and email collected in different parts of the world have times expressed in different time zones, often with no notation of the time zone. Disco normalizes all times to the time zone of the reviewer or to a single time zone for the matter so that documents and email appear in the correct order without reviewers having to convert time zones mentally.
  6. Detect duplicates. Each file that Disco ingests is called an “instance.” Disco detects duplicate instances by comparing the hashes of certain parts of each instance. A hash is a fixed-length alphanumeric string generated by a hashing function from certain input data. Two inputs that produce the same hash are, for practical purposes, identical. For email, Disco detects duplicates by comparing the hashes of the concatenated sender address, date sent, normalized subject, normalized message body, and the hash of each attachment. As a result, Disco detects two copies of an email message, one collected from its sender and the other from its recipient, as duplicates even though the copy collected from the recipient has additional header information indicating when and by what address path it was delivered. For images with load files, Disco detects duplicates by comparing the hash of the native plus the image. As a result, Disco does not treat the same native produced multiple times in a single production, like an attachment attached to multiple emails, as a duplicate if it is produced with different images, for example, with different Bates stamps. For all other files, Disco detects duplicates by comparing the hash of the entire file. Essentially, Disco treats two instances as duplicates if they look identical in the review window and when printed out.
  7. Detect near duplicates. Disco detects near duplicates by conducting a paragraph-wise text comparison of documents. Documents with substantial paragraph-wise overlap are flagged as near duplicates. Reviewers can see the number of near duplicates that a document has in the search results summary grid and can navigate from a document to its near duplicates in the review window.
  8. Generate near-native rendering of all documents for review. One of the keys to Disco’s speed is creating near-native renderings of all documents during processing so that reviewers don’t have to wait for these to be created during review. During review, Disco displays these stored near-native renderings in the browser. Disco can create multiple near-native renderings for documents, for example, if the document contains redlines or otherwise can be viewed in multiple ways in its native format.
  9. Create parent–child relationships. Disco creates parent–child relationships between emails and their attachments and between documents and their embedded objects. Disco shows the number of children that a document has in the search results summary grid and lets you navigate from parents to children and children to parents in the review window. You can also search for documents with or without children.
  10. Create email conversations. Disco normalizes email subjects by removing blank space and prefixes like Re: and Fwd:. Then Disco groups as conversations all emails that share the same normalized subject and have at least one participant in common. This is an algorithm that errs in favor of grouping emails together.
  11. Create search indices and review database. Disco creates 17 or more different search indices on ingest to make searching and generating counts along a variety of dimensions as fast as possible. Disco stores native files and its near-native renderings on fast network-attached storage (NAS) and stores metadata and extracted or OCR text as well as the search indices in a document-based NoSQL database on a database server.
  12. Generate a complete ingest report. Disco generates an ingest report for every file it ingests. The ingest report shows the file’s hash, when it was ingested, its custodian, file length, file path, and container path, how Disco treated it, and whether there were any ingestion problems, for example, if the file was password protected and the password was not supplied. The consolidated ingest report for an entire database is available for download on the analytics page in Disco.
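
Steps 6 and 10 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Disco’s actual code: the function names, the use of SHA-256, the lowercasing, the whitespace-collapsing body normalization, and the reuse of one subject normalization for both steps are all hypothetical choices made for the example. The source specifies only which fields are hashed and that subjects are normalized by removing blank space and prefixes like Re: and Fwd:.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical prefix list; the text above names only "Re:" and "Fwd:".
_PREFIX = re.compile(r"^\s*(re|fwd|fw)\s*:\s*", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes and all whitespace; lowercasing is an assumption."""
    previous = None
    while previous != subject:  # peel nested prefixes such as "Re: Fwd: Re:"
        previous = subject
        subject = _PREFIX.sub("", subject)
    return re.sub(r"\s+", "", subject).lower()


def normalize_body(body: str) -> str:
    """Collapse whitespace runs; an assumed stand-in for body normalization."""
    return " ".join(body.split())


def email_dedupe_key(sender, sent, subject, body, attachments):
    """Hash the concatenated fields named in step 6; delivery headers are never used."""
    parts = [
        sender.strip().lower(),
        sent,
        normalize_subject(subject),
        normalize_body(body),
        *(hashlib.sha256(data).hexdigest() for data in attachments),
    ]
    # A separator keeps ("ab", "c") from hashing the same as ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def group_conversations(emails):
    """Group emails that share a normalized subject and at least one participant.

    `emails` is a list of (email_id, subject, participants) tuples. Sharing is
    applied transitively, so the grouping errs toward larger conversations.
    """
    by_subject = defaultdict(list)
    for email_id, subject, participants in emails:
        by_subject[normalize_subject(subject)].append((email_id, set(participants)))

    conversations = []
    for items in by_subject.values():
        groups = []  # each entry: (participants seen so far, email ids)
        for email_id, people in items:
            merged_people, merged_ids = set(people), [email_id]
            remaining = []
            for group_people, group_ids in groups:
                if group_people & people:  # shared participant: merge the groups
                    merged_people |= group_people
                    merged_ids = group_ids + merged_ids
                else:
                    remaining.append((group_people, group_ids))
            remaining.append((merged_people, merged_ids))
            groups = remaining
        conversations.extend(sorted(ids) for _, ids in groups)
    return conversations
```

Because the key is built only from sender, sent date, subject, body, and attachment hashes, the sender’s copy and the recipient’s copy of a message produce the same key even though their delivery headers differ.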

Disco processing speed depends on the kind of data. Dense container files, like PSTs, take longer per GB than flat files like Word documents or PDFs. On average, Disco can process about 50,000 documents, not pages, per hour. Disco processes the full Enron set, which is 60 GB of PSTs, in about 4 hours. Disco processing can add documents to a database while the database is live. Because processing involves multiple passes through the data, you will see new documents appear first and then, after all the documents have appeared, you will see conversation counts and other second-pass information appear.
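
For rough planning, the two figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes processing time scales linearly with document count or PST size, which the figures above do not guarantee; the function names are hypothetical.

```python
DOCS_PER_HOUR = 50_000         # average rate quoted above (documents, not pages)
ENRON_GB, ENRON_HOURS = 60, 4  # Enron PST benchmark quoted above


def hours_for_documents(doc_count: int) -> float:
    """Estimate hours from the average documents-per-hour rate."""
    return doc_count / DOCS_PER_HOUR


def hours_for_pst_gb(size_gb: float) -> float:
    """Estimate hours for PST data by scaling the Enron benchmark linearly."""
    return size_gb * ENRON_HOURS / ENRON_GB


print(hours_for_documents(1_200_000))  # 24.0
print(hours_for_pst_gb(150))           # 10.0
```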

Disco processing is entirely automated. Automated processing lets Disco deliver faster speeds and increased reliability without human error. It also allows Disco to include all processing features on every matter without nickel-and-diming customers with separate line items for processing features like email threading, duplicates and near duplicates, and imaging. Disco processing is equivalent to processing using dedicated processing software with all options checked and full analytics included.

Frequently asked questions

Q. Can or should I preprocess data before loading it into Disco?

A. No preprocessing is required before loading data into Disco. We discourage preprocessing native data because preprocessing alters, destroys, or obfuscates information that is better extracted by Disco processing.

Q. Can I load productions or exports from other ediscovery software into Disco?

A. Yes. Productions from other parties and exports from other ediscovery software are compatible with Disco if they are in PDF, TIFF, or JPG format, single-page or multi-page, accompanied by a load file in any industry-standard format. We recommend a DAT load file with accompanying OPT. Note that Disco will process this kind of data so that, for example, duplicates not detected by other ediscovery software or noted in the incoming load file will be detected by Disco. As a result, the document counts in Disco may differ from the document counts in the load file or in other ediscovery software. Disco can optionally use a load-file-supplied hash field for deduplication, in which case only those entries with the same hash in the load file will be deduplicated.
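
The load-file hash option can be pictured with a short sketch. It assumes the load file has already been parsed into one dictionary per entry; the field names MD5HASH and BEGBATES and the function name are hypothetical, and how entries without a hash value are treated is also an assumption (here they are never deduplicated).

```python
def dedupe_by_load_file_hash(records, hash_field="MD5HASH"):
    """Keep the first record per hash value; records with no hash are all kept."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        value = (record.get(hash_field) or "").strip().lower()
        if value and value in seen:
            duplicates.append(record)  # same hash as an earlier entry
            continue
        if value:
            seen.add(value)
        unique.append(record)
    return unique, duplicates


records = [
    {"BEGBATES": "ABC0001", "MD5HASH": "aa"},
    {"BEGBATES": "ABC0002", "MD5HASH": "AA"},  # same hash, different case
    {"BEGBATES": "ABC0003", "MD5HASH": ""},    # no hash supplied
]
unique, duplicates = dedupe_by_load_file_hash(records)
print([r["BEGBATES"] for r in unique])      # ['ABC0001', 'ABC0003']
print([r["BEGBATES"] for r in duplicates])  # ['ABC0002']
```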

Q. Can I cull data before processing?

A. Disco processes all data and allows you to conduct early case assessment and cull data from the full-featured review tool. Because there is no separate charge for early case assessment versus full Disco processing, you have access to fully processed data and a full-featured review tool when making culling decisions.

Q. How is data size measured for billing purposes?

A. Disco measures data size after expanding any top-level compressed file, for example, a ZIP file containing all the data, but before any other processing. Some Disco processing, like deduplication, reduces the data size; other Disco processing, like generating near-native renderings up front, increases the data size. On average, the size of the data on Disco’s servers is 2x the size of the data before processing. But because Disco bills on the size of the data before processing, you know what the data size on your invoice will be before you submit your data to Disco. This billing transparency is one of Disco’s strengths.
