Sorting the mail

Jim Leous
3 min read · Nov 7, 2016
Sorting mail at the New York Port of Embarkation (from:WikiCommons)

Over the weekend, FBI Director Comey sent a letter to various Congressional committee heads stating, “Based on our review, we have not changed our conclusions that we expressed in July.”

Many media outlets reported that the relevant emails were duplicates of messages already in the database from the previous investigation of Mrs. Clinton.

This led Mr. Trump’s adviser, General Flynn, to tweet the following:

Now I have no idea if the 650k figure is correct, but I’ll assume it is for the sake of argument. I also have no quibble with his math: there are 691,200 seconds in 8 days. What baffles me is that General Flynn assumes that FBI agents are reading each piece of email.
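For anyone who wants to check the arithmetic, here is a quick sketch. The 650k count is the same unverified figure assumed above, and the function name is just something I made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def seconds_per_email(days: int, email_count: int) -> float:
    """Average time available per message if every one were read by hand."""
    return days * SECONDS_PER_DAY / email_count


total_seconds = 8 * SECONDS_PER_DAY
print(total_seconds)                            # 691200
print(round(seconds_per_email(8, 650_000), 2))  # 1.06
```

So reading every message would leave roughly one second per email, which is exactly why nobody would approach the job that way.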

Were I given this task (as someone who ran an enterprise email system for about a decade), I would first identify those email messages that have Mrs. Clinton in the “To:”, “From:”, “Cc:” or “Bcc:” field. Since this was allegedly on the laptop of the estranged husband of one of Mrs. Clinton’s trusted advisers, I would assume that this number is probably in the tens or hundreds rather than 650k. That would be my first cut at reducing a much larger data set to a significantly smaller one.
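A minimal sketch of that first cut, using Python's standard `email` package. I have no idea what tools the FBI actually uses; the addresses, the sample messages, and the helper names `involves` and `first_cut` are all hypothetical.

```python
from email import message_from_string
from email.message import Message
from email.utils import getaddresses

ADDRESS_HEADERS = ("From", "To", "Cc", "Bcc")


def involves(msg: Message, address: str) -> bool:
    """True if `address` appears in any sender or recipient header."""
    target = address.lower()
    for header in ADDRESS_HEADERS:
        values = msg.get_all(header, [])
        for _name, addr in getaddresses(values):
            if addr.lower() == target:
                return True
    return False


def first_cut(messages, address):
    """Reduce a large pile of messages to those involving one address."""
    return [m for m in messages if involves(m, address)]


# Made-up sample data, for illustration only.
raw = [
    "From: alice@example.com\nTo: person@example.org\nSubject: hi\n\nbody one\n",
    "From: bob@example.com\nTo: carol@example.com\n"
    "Cc: Person <PERSON@example.org>\nSubject: fyi\n\nbody two\n",
    "From: bob@example.com\nTo: carol@example.com\nSubject: lunch\n\nbody three\n",
]
messages = [message_from_string(r) for r in raw]
hits = first_cut(messages, "person@example.org")
print(len(hits))  # 2
```

Each message is touched once and only its headers are examined, so the work grows linearly with the size of the pile.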

How quickly can this be done? First think about your email client (that’s Outlook, Thunderbird, Mac Mail or any number of Web-based email clients), and ask yourself what you do when you know that you received an email from “Greg” but you can’t remember where it is. You “filter” it. In my client, I put Greg’s email address in the search space, or I just say “From:Greg”. I have about 50k emails in my inbox (stay tuned for my “Inbox:Infinity” post coming soon), and when I do this, the client returns a list in less than 3 seconds. My client (as do Gmail, O365, Outlook, etc.) allows me to combine searches with “AND”s and “OR”s to ask things like, “Show me every email I’ve received from Bob where Mike is copied in the last 7 days” or “Show me every message where Tom is in the ‘To:’ line, the ‘Cc:’ line or the ‘From:’ line.” Again, even the complex searches take 3–4 seconds to come up with a list of relevant emails. Gmail is even faster (Hey, they’re a search company…).
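To make the “AND”/“OR” idea concrete, here is a toy sketch of how such searches compose. It is not how any particular mail client is implemented; the `Mail` record, the helper names, and the sample inbox are all invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Mail:
    sender: str
    to: tuple
    cc: tuple
    received: datetime


# Each helper returns a predicate: a function from Mail to bool.
def sent_by(addr):
    return lambda m: m.sender == addr


def sent_to(addr):
    return lambda m: addr in m.to


def copied(addr):
    return lambda m: addr in m.cc


def received_since(cutoff):
    return lambda m: m.received >= cutoff


def all_of(*preds):  # logical AND
    return lambda m: all(p(m) for p in preds)


def any_of(*preds):  # logical OR
    return lambda m: any(p(m) for p in preds)


now = datetime(2016, 11, 7)
me, bob, mike, tom = (f"{n}@example.com" for n in ("me", "bob", "mike", "tom"))
inbox = [
    Mail(bob, (me,), (mike,), now - timedelta(days=2)),
    Mail(bob, (me,), (), now - timedelta(days=1)),
    Mail(bob, (me,), (mike,), now - timedelta(days=30)),
    Mail(tom, (me,), (), now - timedelta(days=3)),
    Mail("alice@example.com", (tom,), (), now - timedelta(days=3)),
]

# "Every email from Bob where Mike is copied in the last 7 days"
bob_and_mike = all_of(sent_by(bob), copied(mike),
                      received_since(now - timedelta(days=7)))
# "Every message where Tom is in the To:, Cc: or From: line"
anything_tom = any_of(sent_to(tom), copied(tom), sent_by(tom))

print(len([m for m in inbox if bob_and_mike(m)]))  # 1
print(len([m for m in inbox if anything_tom(m)]))  # 2
```

Real clients answer these queries from a prebuilt index rather than by scanning every message, which is part of why they feel instantaneous.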

Now with that much smaller result set of emails, there are any number of ways that I can determine whether they are duplicates of email messages from the original investigation. If the mails are simply forwarded or replied to and I have full email headers, I can match on the “Message-ID” field, a unique identifier assigned to each email (a reply carries the original’s Message-ID in its “In-Reply-To” and “References” headers). I could also match on date stamps, email threads, and email size. In the absence of headers, I could compare the text from parts of the email with parts of existing emails in the original set. The easiest way to do this is using tools (a “cryptographic hash” or a “message digest function”) which create a much smaller, effectively unique string of characters for each block of text, and then matching those strings rather than the messages themselves.
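Here is a rough sketch of the two duplicate checks described above: first the “Message-ID” header, then a digest of the body text. SHA-256 and the whitespace/case normalization are my own choices for the example, the sample messages are made up, and a real pipeline would also need to decode transfer encodings and handle attachments.

```python
import hashlib
from email import message_from_string
from email.message import Message


def body_text(msg: Message) -> str:
    """Concatenate the text parts of a message (no transfer decoding)."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return "\n".join(parts)


def body_digest(msg: Message) -> str:
    """SHA-256 of the body with whitespace collapsed and case folded."""
    normalized = " ".join(body_text(msg).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(known):
    """Collect the Message-IDs and body digests of already-reviewed mail."""
    ids = {m["Message-ID"].strip() for m in known if m["Message-ID"]}
    digests = {body_digest(m) for m in known}
    return ids, digests


def is_duplicate(msg: Message, ids, digests) -> bool:
    mid = msg["Message-ID"]
    if mid and mid.strip() in ids:
        return True
    return body_digest(msg) in digests


known = [message_from_string(
    "Message-ID: <a1@example.com>\nSubject: one\n\nMeet at noon.\n")]
candidates = [message_from_string(r) for r in (
    "Message-ID: <a1@example.com>\nSubject: one\n\nMeet at noon. (copy)\n",
    "Subject: no headers to speak of\n\nmeet   at NOON.\n",
    "Message-ID: <z9@example.com>\nSubject: two\n\nSomething new.\n",
)]

ids, digests = build_index(known)
remaining = [m for m in candidates if not is_duplicate(m, ids, digests)]
print(len(remaining))  # 1
```

Both lookups are set-membership tests, so checking a few hundred candidates against a large original set takes a fraction of a second once the index is built.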

At that point, any remaining messages (my guess is a handful) could be sifted through by agents actually reading the messages.

What surprises me is that it took eight days.

Jim Leous thinks about Emerging Technologies for Penn State