Disco Deep Dive: Processing

We will be releasing a series of “Deep Dive” documents that take users into the details of Disco. The first of these is Disco Deep Dive: Processing, which covers Disco’s 12-step automated processing for electronically stored information.

*     *     *

Disco includes full processing of electronically stored information.

Processing involves extracting all searchable information, including metadata and full text, from documents and email and formatting them for lawyer review. Disco processes email from Exchange, Outlook, and Lotus, including files in PST, MBOX, EML, and MSG format; Excel, PowerPoint, and Word documents; PDFs; some CAD and CAM files; images, including JPGs, GIFs, PNGs, and TIFFs; and audio and video files.

No preprocessing is required before loading native data into Disco. Files that Disco does not process, for example, proprietary databases, are not full-text searchable and cannot be imaged by Disco, but can still be searched by their metadata, downloaded, tagged, and produced.

Processing is a twelve-step process:

  1. Extract files from all containers and compressed archives. Disco extracts files from containers contained in other containers all the way down to arbitrary depth. For example, Disco will extract an Excel file embedded in a Word document contained in a ZIP file attached to an email contained in a PST. In addition to ordinary containers and compressed archives, Disco can extract files from forensic images.
  2. Remove system files. Disco removes “system files,” like the copy of Windows or Word collected as part of an image of a custodian’s computer. The National Institute of Standards and Technology (NIST) publishes the industry-standard list of system files. Removing these files is sometimes called deNISTing.
  3. Extract all text and metadata. Disco extracts all available text and metadata from native documents. This includes, for example, the full text of a Word document, the text in all the cells of an Excel spreadsheet, the sent date of an email, the created and last modified dates of files on a file system, and the like. The extracted text and metadata become fully searchable in Disco.
  4. OCR all images. Disco runs optical character recognition (OCR) on all images, including scanned documents, and uses the results of OCR as the full text for the images. Disco rotates images and can be set for foreign languages to make the quality of the OCR as high as possible. Disco uses the Tesseract OCR engine, which was originally developed at Hewlett-Packard, was released as open source by HP in 2005, and was subsequently developed with sponsorship from Google. Disco also recognizes printed or imaged email from its OCR and creates parent–child and conversation relationships for email recognized in this way.
  5. Normalize time zones.  Documents and email collected in different parts of the world have time expressed in different time zones, often with no notation of the time zone. Disco normalizes all time zones to the time zone of the reviewer or to a single time zone for the matter so that documents and email appear in the correct order without reviewers having to convert time zones mentally.
  6. Detect duplicates. Each file that Disco ingests is called an “instance.” Disco detects duplicate instances by comparing the hashes of certain parts of each instance. A hash is a fixed-length alphanumeric string generated by a hashing function from certain input data. Two inputs that produce the same hash are, for all practical purposes, identical. For email, Disco detects duplicates by comparing the hashes of the concatenated sender address, date sent, normalized subject, normalized message body, and the hash of each attachment. This is so that, for example, Disco detects two copies of an email message, one collected from its sender and the other from its recipient, as duplicates even though the one collected from the recipient has additional header information indicating when and by what address path it was delivered. For images with load files, Disco detects duplicates by comparing the hash of the native plus the image. This is so that, for example, Disco does not detect the same native produced multiple times in a single production, like an attachment attached to multiple emails, as a duplicate if it is produced with different images, for example, with different Bates stamps. For all other files, Disco detects duplicates by comparing the hash of the entire file. Essentially, Disco treats two instances as duplicates if they look identical in the review window and when printed out.
  7. Detect near duplicates. Disco detects near duplicates by conducting a paragraph-wise text comparison of documents. Documents with substantial paragraph-wise overlap are flagged as near duplicates. Reviewers can see the number of near duplicates that a document has in the search results summary grid and can navigate from a document to its near duplicates in the review window.
  8. Generate near-native rendering of all documents for review. One of the keys to Disco’s speed is creating near-native renderings of all documents during processing so that reviewers don’t have to wait for these to be created during review. During review, Disco displays these stored near-native renderings in the browser. Disco can create multiple near-native renderings for documents, for example, if the document contains redlines or otherwise can be viewed in multiple ways in its native format.
  9. Create parent–child relationships. Disco creates parent–child relationships between emails and their attachments and between documents and their embedded objects. Disco shows the number of children that a document has in the search results summary grid and lets you navigate from parents to children and children to parents in the review window. You can also search for documents with or without children.
  10. Create email conversations. Disco normalizes email subjects by removing blank space and prefixes like Re: and Fwd:. Then Disco groups as conversations all emails that share the same normalized subject and have at least one participant in common. This is an algorithm that errs in favor of grouping emails together.
  11. Create search indices and review database. Disco creates 17 or more different search indices on ingest to make searching and generating counts along a variety of dimensions as fast as possible. Disco stores native files and its near-native renderings on fast network-attached storage (NAS) and stores metadata and extracted or OCR text as well as the search indices in a document-based NoSQL database on a database server.
  12. Generate a complete ingest report. Disco generates an ingest report for every file it ingests. The ingest report shows the file’s hash, when it was ingested, its custodian, file length, file path, and container path, how Disco treated it, and whether there were any ingestion problems, for example, if the file was password protected and the password was not supplied. The consolidated ingest report for an entire database is available for download on the analytics page in Disco.
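
The email comparison described in step 6 can be sketched roughly as follows. This is only an illustration, not Disco's implementation: the choice of SHA-256, the field separator, the exact normalization rules, and the function names are all assumptions, since the text above names the fields that are hashed but not how they are combined.

```python
import hashlib
import re

# Hypothetical normalization rules; the text does not spell out Disco's own.
_REPLY_PREFIX = re.compile(r"^(\s*(re|fw|fwd)\s*:)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Drop leading Re:/Fwd: prefixes, collapse whitespace, and lowercase."""
    without_prefix = _REPLY_PREFIX.sub("", subject)
    return " ".join(without_prefix.split()).lower()


def normalize_body(body: str) -> str:
    """Collapse runs of whitespace so line-wrapping differences do not matter."""
    return " ".join(body.split())


def email_dedup_key(sender, date_sent, subject, body, attachments):
    """Hash the concatenated fields named in step 6 (SHA-256 is an assumption)."""
    fields = [
        sender.strip().lower(),
        date_sent,
        normalize_subject(subject),
        normalize_body(body),
    ]
    # Each attachment contributes its own hash, as the text describes.
    fields.extend(hashlib.sha256(data).hexdigest() for data in attachments)
    # A separator keeps ("ab", "c") from colliding with ("a", "bc").
    joined = "\x1f".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Delivery headers never enter the key, so the sender's copy and the
# recipient's copy of one message produce the same key.
sender_copy = email_dedup_key(
    "ken@example.com", "2001-05-01T09:00:00Z", "Budget",
    "See attached.\n", [b"spreadsheet bytes"],
)
recipient_copy = email_dedup_key(
    "Ken@Example.com ", "2001-05-01T09:00:00Z", "Budget ",
    "See  attached.", [b"spreadsheet bytes"],
)
different_attachment = email_dedup_key(
    "ken@example.com", "2001-05-01T09:00:00Z", "Budget",
    "See attached.\n", [b"other bytes"],
)
```

Because only the listed fields feed the hash, anything outside them (routing headers, for instance) cannot make two copies of the same message look different, while a changed attachment does.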

Disco processing speed depends on the kind of data. Dense container files, like PSTs, take longer per GB than flat files like Word documents or PDFs. On average, Disco can process about 50,000 documents, not pages, per hour. Disco processes the full Enron set, which is 60 GB of PSTs, in about 4 hours. Disco processing can add documents to a database while the database is live. Because processing involves multiple passes through the data, you will see new documents appear first and then, after all the documents have appeared, you will see conversation counts and other second-pass information appear.
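
For planning purposes, the averages quoted above can be turned into a back-of-the-envelope estimate. The sketch below uses only the figures in the preceding paragraph; the function names are made up for illustration, and actual times will vary with the kind of data.

```python
# Figures quoted in the text above; treat them as averages, not guarantees.
DOCS_PER_HOUR = 50_000
ENRON_SIZE_GB = 60
ENRON_HOURS = 4


def estimated_hours(document_count: int) -> float:
    """Rough processing time at the average documents-per-hour rate."""
    return document_count / DOCS_PER_HOUR


def enron_gb_per_hour() -> float:
    """Throughput implied by the Enron figures (dense PST data)."""
    return ENRON_SIZE_GB / ENRON_HOURS


print(estimated_hours(125_000))  # 2.5
print(enron_gb_per_hour())       # 15.0
```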

Disco processing is entirely automated. Automated processing lets Disco deliver faster speeds and increased reliability without human error. It also allows Disco to include all processing features on every matter without nickel-and-diming customers with separate line items for processing features like email threading, duplicates and near duplicates, and imaging. Disco processing is equivalent to processing using dedicated processing software with all options checked and full analytics included.

Frequently asked questions

Q. Can or should I preprocess data before loading it into Disco?

A. No preprocessing is required before loading data into Disco. We discourage preprocessing native data because preprocessing alters, destroys, or obfuscates information that is better extracted by Disco processing.

Q. Can I load productions or exports from other ediscovery software into Disco?

A. Yes. Productions from other parties and exports from other ediscovery software are compatible with Disco if they are in PDF, TIFF, or JPG format, single-page or multi-page, accompanied by a load file in any industry-standard format. We recommend a DAT load file with accompanying OPT. Note that Disco will process this kind of data so that, for example, duplicates not detected by other ediscovery software or noted in the incoming load file will be detected by Disco. As a result, the document counts in Disco may be different from the document counts in the load file or in other ediscovery software. Disco can optionally use a load-file-supplied hash field for deduplication, in which case only those entries with the same hash in the load file will be deduplicated.

Q. Can I cull data before processing?

A. Disco processes all data and allows you to conduct early case assessment and cull data from the full-featured review tool. Because there is no separate charge for early case assessment versus full Disco processing, you have access to fully processed data and a full-featured review tool when making culling decisions.

Q. How is data size measured for billing purposes?

A. Disco measures data size after expanding any top-level compressed file, for example, a ZIP file containing all the data, but before any other processing. Some Disco processing, like deduplication, reduces the data size; other Disco processing, like generating near-native renderings up front, increases the data size. On average, the size of the data on Disco’s servers is 2x the size of the data before processing. But because Disco bills on the size of the data before processing, you know what the data size on your invoice will be before you submit your data to Disco. This billing transparency is one of Disco’s strengths.
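
To estimate the billed size before submitting data, you can total the uncompressed sizes of the entries in the top-level archive. The sketch below assumes the data is delivered as a single ZIP file and uses Python's standard zipfile module; the function names are illustrative, and the 2x figure is the average quoted above, not a guarantee.

```python
import io
import zipfile


def billable_bytes(archive) -> int:
    """Sum the uncompressed sizes of the entries in a top-level ZIP.

    Only the top level is expanded: a nested archive counts at the size
    it occupies inside the ZIP, matching "before any other processing."
    """
    with zipfile.ZipFile(archive) as zf:
        return sum(info.file_size for info in zf.infolist())


def estimated_stored_bytes(billable: int) -> int:
    """Average on-server footprint, using the 2x figure from the text."""
    return 2 * billable


# Small in-memory demonstration; a real call would pass a file path.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("memo.txt", "x" * 1000)
    zf.writestr("nested.zip", b"0" * 500)  # stays unexpanded for billing

demo_billable = billable_bytes(buf)
print(demo_billable)                          # 1500
print(estimated_stored_bytes(demo_billable))  # 3000
```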

Proposed Rule of Civil Procedure on Confidentiality Orders

The Committee on Court Rules has proposed to the Supreme Court of Texas a new Rule 76b governing confidentiality orders. Download a copy of the rule proposal here: Rule 76b Proposal.

The proposed rule provides a formal basis for confidentiality orders in Texas state court, brings confidentiality orders into harmony with Rule 76a (the rule that governs sealing of court records), and includes a form confidentiality order that is likely to be the default for Texas cases in the future.

This post contains my comments on and exposition of the proposed rule. These are the views of only one member of the committee; the full committee may not share these views, and they are not part of the formal rule proposal. We would welcome any comments from the public or members of the bar.

RULE 76b. Confidentiality Orders

1. Motion for Confidentiality Order. Any party may move for an order protecting the confidentiality of information produced in the course of discovery.

Comment: This provides a formal basis in the rules for a motion to enter a confidentiality order.

2. Motion Challenging a Designation of Confidentiality Under a Confidentiality Order.  Any party may challenge a designation that discovery materials are confidential under a confidentiality order using the procedures set forth in this rule.  

(a) To initiate a challenge to a designation of confidentiality, the challenging party must serve a written request that the designating party withdraw the confidentiality designation for the discovery materials at issue. The challenging party must specifically identify the challenged discovery materials and state the basis for challenging their designation as confidential.

(b) Within 15 days of service of a request under subparagraph 2(a), the designating party may serve a written response that withdraws the designation to some or all of the materials identified in the challenging party’s request.

(c) If any designations remain in dispute, any party may move for an order determining whether the designation of the information as confidential is proper. The parties must treat disputed materials as confidential unless the court determines otherwise.

(d) Nothing in this rule prohibits an agreement or court order specifying different procedures for challenging a designation of confidentiality.

(e) Nothing in this rule shifts the burden of proof for establishing the confidential nature of information.

Comment: This provides a default procedure for handling disputes about whether information designated as confidential really is confidential. The procedure has two parts: (1) a formal conference procedure between the parties, in which the party challenging the designation states its challenge and the designating party can either withdraw or stand on its designation; and (2) a motion to the court to allow the court to decide the issue if the parties continue to disagree after the formal conference.

In the formal conference procedure, the party challenging a confidentiality designation must (a) specifically identify the designations with which it disagrees, for example, by Bates number or by page and line number in a transcript; and (b) state the basis for challenging the designations, for example, that the information is already lawfully in the public record. There is then a 15-day period during which the designating party can choose to withdraw the designation. Neither party can seek court intervention during this 15-day period.

If the 15 days pass and the challenged designations, or some of them, are not withdrawn, then either party may file a motion asking the court to decide whether the challenged designations that have not been withdrawn are proper. While the rule says that either party may file this motion, the rule also provides that information will continue to be treated as confidential until the court determines otherwise. The effect of this is that, in practice, it will almost always be the party challenging the designations who has to file the motion.

The burden to establish that information should be treated as confidential generally lies with the designating party. The fact that the party challenging the designation is the party who must file a motion with the court is not intended to change this burden. Of course, the burden on certain arguments that are in the nature of affirmative defenses, like an argument that the challenging party had already received the information lawfully before receiving it from the designating party, lies on the challenging party; nothing in the rule shifts this burden either.

3. Order on Motion for Confidentiality Order.  To protect the confidentiality of information, to facilitate the resolution of disputes, and to minimize the burdens on the parties and the court, upon a motion filed under paragraph 1, the court may enter an order regarding confidentiality in the form set forth in paragraph 5 below or in another manner the court deems appropriate given the particularities of the case. 

Comment: This provides a formal basis in the rules for a court to enter a confidentiality order on motion of a party. This also introduces the form order and emphasizes that a court can customize it to the circumstances of a particular case.

4. Temporary Sealing. An order entered under paragraph 3 may provide that it serves as a temporary sealing order under Rule 76a(5). If the order serves as a temporary sealing order, the parties must post the order at the place for notices under Rule 76a(3). If court records are filed under temporary seal under the order, the temporary sealing expires 30 days after filing, unless within that time a party moves to seal the court records under, and provides the notice required by, Rule 76a(3). If such a motion is timely filed, the court records remain sealed until the motion is decided or judgment is final in the cause, whichever is later. Any procedures for filing confidential information under seal must otherwise comply with Rule 76a.

Comment: This brings Rule 76b into harmony with Rule 76a. Rule 76a is the rule that governs sealing orders in Texas state courts. It is a rule very much in favor of public access to court records and makes sealing court records much harder in Texas than it is in federal courts or in many other states. The trouble with Rule 76a in the context of confidentiality orders is that it takes quite some time to obtain sealing orders under Rule 76a — which is a problem when a party has only limited time to file, for example, a response to a motion for summary judgment to which the party must attach documents or testimony that have been designated as confidential. Nothing in Rule 76a provides for the kind of “automatic” sealing of such documents or testimony that is routinely provided in federal court.

Proposed Rule 76b(4) would address this problem by providing that a confidentiality order may act as a temporary sealing order under Rule 76a, allowing parties to file documents under seal without first going through the procedures required under Rule 76a for obtaining a temporary sealing order. The party that wants confidential information to remain under seal must still timely move for a permanent sealing order under Rule 76a and must comply with the requirements of Rule 76a in connection with that motion. Thus, without weakening the substantive requirements of Rule 76a, proposed Rule 76b(4) provides a workable solution that lets attorneys file confidential documents under temporary seal in the heat of litigation deadlines.

5. Form of Confidentiality Order.

[Case Style]

It is hereby ORDERED that:

1. “Confidential Information” in this Order means any information of any type, kind or character that a Designating Party designates as “Confidential,” whether it be a document, information in a document, or information disclosed during a deposition, in an interrogatory answer, or otherwise.  Only information that the Designating Party in good faith believes is confidential may be designated as Confidential.  Information is not entitled to treatment as Confidential Information under this Order if:

(a) the information is in the public domain at the time of disclosure;

(b) the information becomes part of the public domain, other than by violation of this Order;

(c) the receiving party can show that the information was in its rightful and lawful possession at the time of disclosure; or

(d) the receiving party can make some other showing of a lawful receipt of such information.

Comment: This (a) provides that anything can be the subject of a confidentiality designation, including, for example, documents produced, deposition testimony, and the like; and (b) explicitly carves out from what may be designated as confidential information certain information that has been lawfully released to the receiving party or the public. Notably, the proposed order does not define what is confidential beyond the four explicit carveouts and does not limit confidential treatment to trade secrets or the like.

2. “Qualified Persons,” as used in this Order, means:

(a) attorneys of record for the parties in this cause and employees of such attorneys to whom it is necessary that the material be shown for purposes of this cause;

(b) the party or a party representative;

(c) actual or potential independent experts and consultants, or attorneys not of record for the parties in this cause;

(d) any document processing, document storage, reprographic, or similar litigation-support service providers working at the direction of a person described in subparagraphs (a), (b), or (c); and

(e) any other person the Court designates as a Qualified Person, after notice to all parties and an opportunity to be heard.

All Qualified Persons, except the Qualified Persons described in subparagraph 2(a) and those otherwise exempted by the Court, must sign Exhibit A before Confidential Information is disclosed to them. The parties may, by agreement, specify multiple party representatives who will serve as Qualified Persons. If a Qualified Person under subparagraph 2(d) is not a natural person, only its authorized agent must sign Exhibit A. Attorneys of record must retain all signed copies of Exhibit A.

Comment: Qualified Persons are those persons with whom you can share information designated as confidential. Exhibit A is an acknowledgment of the confidentiality order and an agreement to be bound by it. Outside counsel and their employees working on the case need not sign Exhibit A; all other qualified persons must. In addition to outside counsel, one party representative, any experts or potential experts and any counsel not of record, and litigation-support providers may see confidential information. Each of these is treated differently.

A party is allowed one party representative unless the parties agree on multiple party representatives. This is generally the business guy who will have to give settlement approval. The experts and counsel not of record provision allows for house counsel to see confidential information without using up the party representative slot. So the form order allows a party to show confidential information to all its house counsel who sign Exhibit A as well as one further party representative who need not be a lawyer. There is also a special provision for litigation-support providers, such as copy shops or ediscovery companies: only an authorized agent of the company, not each individual who has access to confidential information, must sign Exhibit A. This recognizes the practical reality that no one obtains individual signatures from staff of litigation-support providers.

3. “Designating Party” means a person that designates information as Confidential Information, if that person is a party to this cause or a nonparty who signed Exhibit B to this Order, thereby agreeing to be bound by the terms of this Order.

4. All Confidential Information must be used solely for the purpose of preparation and trial of this cause, and must not be disclosed to any person except in accordance with the terms of this Order.

5. Confidential Information must not be disclosed or made available by any receiving party to persons other than Qualified Persons.

Comment: The form order does not permit sharing confidential information in connection with other, related litigation (for example, multiple plaintiffs’ counsel separately suing a single defendant may not pool confidential information).

6. Documents or other discovery material produced in this cause may be designated as Confidential Information by (a) conspicuously marking each page “Confidential,” or (b) otherwise clearly identifying the Confidential Information.

Comment: The catchall “otherwise clearly identifying the Confidential Information” provides for documents that are not easily marked on each page, for example, audio or video files or electronically stored information provided in native format. Such items can be marked as confidential by, for example, appending “CONFIDENTIAL” to their filenames or in any other reasonable way.

7. Confidential Information in a deposition transcript must be designated as such by: (a) stating on the record at the deposition that testimony is “Confidential Information”; and (b) by providing written notice to all parties of the testimony that is Confidential Information, by page and line number and identifying any exhibits that contain Confidential Information, before the expiration of time provided by Rule 203.1(b) for the deponent’s return of the deposition transcript.

Comment: Designating deposition testimony requires (a) a statement on the record at the deposition that the testimony is confidential and (b) separately, written notice identifying the confidential information by page and line number and by exhibit once the transcript has been received. The statement on the record ensures that everyone knows what is potentially confidential as soon as the deposition happens; the later written notice makes the statement specific by identifying the exact testimony or exhibits that are confidential.

8. Information inadvertently disclosed without designation as Confidential Information may be designated as Confidential Information by notice to the receiving party, in writing, specifically identifying the Confidential Information.

Comment: Note that the designation is not retroactive — meaning that, until confidential information is actually so designated, a receiving party is not bound by the restrictions on use contained in this order. If confidential information is publicly disseminated before a designation under this paragraph is made, the information may no longer qualify for treatment as confidential.

9. Nothing in this Order prevents disclosure of Confidential Information beyond the terms of this Order if the Designating Party consents to such disclosure or if the Court, after notice to all affected parties, orders such disclosure. Nor does anything in this Order prevent any counsel of record from utilizing Confidential Information in the examination or cross-examination of any person who is indicated on the document as being an author, source, or recipient of the Confidential Information.

Comment: This paragraph presents some possibility of trouble where you need to use confidential information to cross examine a witness who is not an author, source, or recipient of the confidential information. For example, you may need to cross examine a product engineer in a defect case about the design of a particular component using information from a confidential investigation report even though the product engineer was not an author, source, or recipient of the information in the investigation report.

10. In the event a party wishes to use any Confidential Information in any document filed with the Court in this cause, this Order serves as a temporary sealing order, sealing such Confidential Information under Rule 76a(5) without necessity for a separate motion or order.  Notice of the filing of Confidential Information must be given to the Designating Party immediately upon filing the Confidential Information, unless the Designating Party is the party filing the Confidential Information.

Comment: This provision reflects federal practice: a person filing confidential information may file it under seal notwithstanding Rule 76a because this order counts under that rule as a temporary sealing order; and the person seeking to make the temporary sealing order permanent then has the burden of seeking a permanent sealing order under Rule 76a.

11. Within 120 days after the conclusion of this cause and any appeal thereof, any Confidential Information produced by a party in the possession of any Qualified Person must be returned to the producing party or destroyed with a signed certification of destruction, except as otherwise agreed by the parties or ordered by this Court.

12. This Order does not bar an attorney, in rendering advice to the attorney’s client with respect to this cause, from conveying in a general way to a client who is not a Qualified Person the attorney’s evaluation of Confidential Information.

Comment: This is one of the old problems with confidentiality orders: how does a lawyer advise his client when the basis of the advice is information designated as confidential? This is mitigated by the provision allowing for one party representative to be a qualified person and therefore to receive confidential information. Courts must be careful to limit application of or modify this provision so that lawyers are not prevented by confidentiality orders from freely and fully advising their clients.

13. This Order does not prevent any Designating Party from seeking additional protection for its Confidential Information.

Final Comment: The proposed rule does not address the problem of gross overdesignation of information as confidential. Indeed, by placing the burden on the party challenging a designation of confidentiality to file a motion seeking to overturn the designation, the rule makes it likely that confidentiality will be the norm and overturned confidentiality designations the exception. But the obvious alternative — allowing designations that are not agreed to expire unless the designating party seeks and obtains a court order — seems likely to consume an unacceptable amount of court time litigating confidentiality issues. We found no good solution to this problem, just as there is no good solution to the general problem of counsel acting as gamesmen in discovery rather than as officers of the law.

Overcoming Objections: The Two-Minute Demo

One objection potential customers raise is that they are too busy to see a demo. The customer expects to spend 30 to 60 minutes looking at the software and doesn’t want to commit the time. The solution is the two-minute demo.

Instead of asking for and scheduling a time to show the software, the next time you are in a potential customer’s office or have a customer on the phone, ask if the person can spare two minutes to see something. When you get a yes, have them go to the demo website (demo.csdisco.com). It is important to have them do the typing and to do it from their work computer. Have them log in using the pre-populated fields. Have them select any of the three cases and have them run a search. Do any search they want, but I recommend using “Ken Lay” because it will return slightly more than 20,000 results. That’s it. That’s the two-minute demo.

By doing this, you let the customer see Disco’s two biggest selling points: speed and ease of use. You’ve also demonstrated Westlaw-style search syntax and shown the potential customer how fast Disco will run on their system and over their Internet connection, using a 60 GB document set running on our production servers. Coupled with the most recent whitepaper describing Disco’s features, the two-minute demo is a great introduction to the product and should be enough to get you a real, longer demo in front of decision makers.

Free ediscovery software for law school classrooms

We have a standing offer at Disco to make the complete Enron dataset available in a fully functional Disco database, with unlimited accounts for students, for anyone teaching a law school class. If you’re interested in this, send email to CeCe Cohen at cece@csdisco.com. We can also optionally provide guest speakers, who are former litigation partners or law professors who now work at Disco, to talk with students about Disco or legal technology in general.

Disco can be included in a course focused on ediscovery or in a broader skills course covering the litigation process. A sample syllabus for the latter kind of course is available here. For the former kind of course, the takeaways for students are:

  1. an understanding of the ediscovery market, including technology companies v. services companies v. channel partners;
  2. the relationship between law firms, managed-review providers, and alternative legal providers like Axiom;
  3. an appreciation of the importance of ediscovery to the outcome of major cases;
  4. the ediscovery process, from investigation to conference to collection to review to production, and the kinds of motion disputes that can come up along the way as well as how they are typically resolved (both practically through negotiation and by courts);
  5. the technical details of ediscovery software (ingestion, deduplication, etc.);
  6. best practices for participating in or organizing a large review;
  7. costs of a review and how to control them;
  8. client-management issues (hiding data, inadvertently or otherwise);
  9. spoliation, proving it, and obtaining sanctions; and
  10. the frontiers of legal technology, both in the ediscovery space and beyond.

A real, nitty-gritty understanding of legal technology, and of how great lawyers can use it to accomplish 10x what they could accomplish before, will be essential to lawyers’ careers over the next 30 years.