You collect email from the five members of an executive team. They email each other all the time. So when #1 sends an email to #2, #3, #4, and #5, you will have five copies of that email in your collection, one from each of the five executives. But you only want your reviewers to review that email once — to save time, but also to ensure that it doesn’t get tagged one way by one reviewer and another way by a different reviewer.
If the executives are discussing a proposed contract, that contract might be attached in PDF format to 5, 10, or more emails from each of the executives. For example, #1 might send it to #3 and #4 for their comments, #1 might send it to the company’s outside counsel for review, and #1 might have received it from the contract counterparty originally. The same document would be attached to each of #1’s emails. Copies might also appear on #3’s laptop and on the file server in New York. Again, you want your reviewers to review the contract only once.
Deduplication is how you do this. Deduplication has three steps:
- On ingestion, you identify what documents are duplicates, remove all but one of the duplicates, and for the one you keep, store metadata about the duplicates that you removed, for example, where they came from (path, custodian, etc.). Identifying what documents are duplicates is done in a variety of ways, for example, using metadata (to, from, subject, date/time, etc.) for email or a hash value for office documents. The result of this process is a single document with metadata telling you all the places in which that document appeared.
- On review, reviewers see only the single document that was the result of ingestion. But that document contains the metadata for all its duplicates so that, for example, if #1 and #3 are both custodians and both of them had the document, it will show up in searches for custodian(#1) or custodian(#3). That is, the deduplicated document is a search result whenever any of the original documents would have been search results. This is why metadata from the removed duplicates is stored with the one copy that is kept for review.
- On production, the deduplicated documents are de-deduplicated so that the documents produced look the way they were kept in the ordinary course of business. In our example, custodians #1-#5 will each have a copy of the document and, if custodian #1 attached the document to three different emails, each of those emails will have a copy of the document following it.
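The three steps above can be sketched in a few lines of Python. This is a minimal illustration under an assumed data model (each collected file is a dict with `content`, `custodian`, and `path`; office documents are keyed by a content hash, whereas real systems would key email on normalized metadata instead), not how any particular product implements it:

```python
from hashlib import sha256

def dedup_key(doc):
    # Hypothetical key for office documents: a hash of the content.
    # Email would instead be keyed on normalized metadata
    # (to, from, subject, date/time, body).
    return sha256(doc["content"].encode()).hexdigest()

def deduplicate(collected):
    """Step 1 (ingestion): collapse duplicates into one document per key,
    keeping metadata (custodian, path) from every removed copy."""
    master = {}
    for doc in collected:
        key = dedup_key(doc)
        entry = master.setdefault(key, {"content": doc["content"], "instances": []})
        entry["instances"].append({"custodian": doc["custodian"], "path": doc["path"]})
    return master

def matches_custodian(entry, custodian):
    """Step 2 (review): the one kept copy is a search hit for any
    custodian who held any of the removed duplicates."""
    return any(i["custodian"] == custodian for i in entry["instances"])

def de_deduplicate(master):
    """Step 3 (production): expand back out so the production mirrors
    the data as it was originally collected."""
    return [
        {"content": e["content"], **inst}
        for e in master.values()
        for inst in e["instances"]
    ]
```

With two custodians holding the same contract, `deduplicate` keeps one document whose `instances` list records both copies, `matches_custodian` returns true for either custodian, and `de_deduplicate` restores both copies for production.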
Deduplication is like an accordion: you identify duplicates, collapse them all into one document for purposes of review, and then expand them back out on production so that the production looks like the data you collected in the first place.
Unfortunately, a lot of software deduplicates incorrectly. For example, some software will deduplicate between custodians (our first situation above), but not within a custodian (our second situation above). The reasoning for this is that each time a document shows up as an attachment, you want a separate opportunity to tag it: if the document is attached to emails A and B and email A is tagged relevant while email B is tagged privileged, then, the argument goes, the document should be tagged relevant when attached to email A and privileged when attached to email B even though it’s the same document.
This argument for “partial” or “custodian-level” deduplication is unpersuasive. In the example above, the same document is at issue: it is either privileged or not, and attaching it to relevant email A either destroyed the privilege or it did not. If it did, the attachment is not privileged wherever it appears. By showing only one copy of the document during review, the review system forces reviewers to make this call once (and not, for example, to privilege log a document when a duplicate of it was produced as an attachment to a relevant email). Documents, not instances or appearances of documents, are the unit of review.
Custodian-level deduplication is also sometimes used to paper over defects in the software used to produce documents. For example, suppose a review set contains emails A, B, and C; a document X is attached to all three; and A and B are tagged “hold back” while C and X are tagged “relevant.” Some production software will produce C and three copies of X: X once following C as its attachment, and X twice more because it is a relevant attachment of A and B, even though A and B themselves are not produced. Custodian-level deduplication prevents this because the copies of X attached to A and B would also have been tagged “hold back” and so would not have been produced.
But the correct answer to this problem is not custodian-level deduplication: it is, on production, to include copies of attachments that themselves fall within a production only (a) one time per parent that is being produced or (b) one time alone if no parent is being produced. That is, instead of fixing a bug at the third step of deduplication (de-deduplication on production), custodian-level deduplication attempts to work around it by introducing a compensating bug at the first step (identifying and collapsing duplicates on ingestion). To compensate for improper de-deduplication, systems like this don’t deduplicate fully, which gives rise to the possibility of inconsistent tags and forces reviewers to review the same document multiple times.
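The correct production rule can be sketched directly. This is a hypothetical model (emails as dicts with `id` and `attachments`, a map from each attachment to its parent emails, and a tag per document), meant only to show rules (a) and (b) above:

```python
def documents_to_produce(emails, attachment_parents, tags, produce_tag="relevant"):
    """Correct de-deduplication on production: an attachment tagged for
    production is produced (a) once per parent email that is itself
    produced, or (b) once on its own if none of its parents are produced."""
    produced = []
    for email in emails:
        if tags.get(email["id"]) == produce_tag:
            produced.append(email["id"])
            for att in email["attachments"]:
                if tags.get(att) == produce_tag:
                    produced.append(att)  # rule (a): once per produced parent
    # Rule (b): attachments tagged for production whose parents
    # were all held back are produced once, standalone.
    for att, parents in attachment_parents.items():
        if tags.get(att) == produce_tag and not any(
            tags.get(p) == produce_tag for p in parents
        ):
            produced.append(att)
    return produced
```

On the example above (X attached to A, B, and C; A and B held back; C and X relevant), this produces C with one copy of X following it, rather than three copies of X.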
The takeaway from all this is simple: deduplication is a hard task that software designers should think through carefully and that users should not mess with. Disco deduplicates and de-deduplicates correctly so that you don’t have to go down this rabbit hole yourself. It just works.