Adventure in Email Archiving

Up until 2009, my approach to personal email relied on pretty simple tools on Unix systems. I checked and sent email using “mail” at the command line most of the time. In the late 1990s, I wrote a Perl script that altered the email-checking part of my life. With forwarding, sendmail would pass each incoming email to the script. The script used whitelists and keywords to classify each new piece of email, then appended it to a file. Files were named with dates, and the extensions told me what sort of mail was in each. I could use Emacs to browse my mail. High-priority stuff, as determined by whitelists, all ended up in one file. Newsletters and other periodic email got their own files. Things known to be bad got shunted off. The remainder of unclassified stuff went into a catchall file. Archiving was simple: move the files to an archive folder. Searching email was simple: use “agrep” with the -d option to pull out individual email messages whole. I had little concern about email viruses or malware, since any attachment decoding I did was done manually and with direct attention.
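
None of that Perl survives here, but a minimal sketch of the scheme, rendered in Python rather than the original Perl, would run along these lines. The addresses, keywords, and file naming are invented for illustration; sendmail forwarding would pipe each message to the script on standard input.

    import email
    import sys
    import time

    # Hypothetical classification lists; the real script's whitelists and
    # keywords were accumulated over years.
    WHITELIST = {"friend@example.org", "colleague@example.edu"}
    NEWSLETTER_KEYWORDS = ("newsletter", "digest", "weekly")
    BLOCKLIST = {"spammer@example.net"}

    def classify(msg):
        sender = (msg.get("From") or "").lower()
        subject = (msg.get("Subject") or "").lower()
        if any(addr in sender for addr in BLOCKLIST):
            return "bad"        # known-bad mail gets shunted off
        if any(addr in sender for addr in WHITELIST):
            return "priority"   # high-priority mail all in one file
        if any(word in subject for word in NEWSLETTER_KEYWORDS):
            return "news"       # periodic mail gets its own file
        return "misc"           # catchall for everything else

    def main():
        raw = sys.stdin.buffer.read()          # one message, piped in by the MTA
        msg = email.message_from_bytes(raw)
        # Date-named file; the extension says what sort of mail is inside.
        fname = time.strftime("%Y%m%d") + "." + classify(msg)
        with open(fname, "ab") as out:
            out.write(raw)
            out.write(b"\n")                   # separator between appended messages

    if __name__ == "__main__":
        main()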

In 2009, though, the email hosting setup got revamped. Sendmail was out, Postfix was in, and IMAP was the protocol of choice. I shifted to mostly using the Thunderbird email application. In some ways, this was a plus for convenience. For one thing, doing it this way allowed us to set up SquirrelMail as a webmail interface. I could get access to email just by getting access to a browser. Mostly.

For a couple of months, things were great. But SquirrelMail has some issues when it tries to deal with lots of messages on an IMAP server. So archiving email became an issue all over again.

I started using the “getmail” tool for periodically moving a bolus of email messages off the server and into a Maildir structure on my local file server. This would restore functionality to SquirrelMail, for a while at least.

“getmail” very likely could be set up to deliver individual email messages to a script, like the Perl script I had used for a decade. And I probably should have taken the time to figure that out. But I didn’t. I basically got out of the habit of treating my email as a searchable resource. And since “getmail” could be set up for automatic expiry of messages on the server, I decided to give that a try. “getmail” got set up as a cron job, invoked every night to download messages and delete any on the server that getmail had first seen more than 90 days earlier. And I thought that was that.
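
For the record, the setup amounted to something like the following. The server name, credentials, paths, and schedule here are all hypothetical stand-ins, with the option names taken from getmail’s documentation as I understand it.

    # ~/.getmail/getmailrc -- all values hypothetical
    [retriever]
    type = SimpleIMAPRetriever
    server = imap.example.org
    username = wre
    password = notmyrealpassword

    [destination]
    type = Maildir
    path = ~/Maildir/

    [options]
    delete_after = 90   # expire messages on the server after 90 days

    # And the nightly crontab entry:
    # 15 2 * * * /usr/bin/getmail --rcfile /home/wre/.getmail/getmailrc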

Several months went by. We got a new machine to expand our file storage capacity. Copying stuff off the old file server to the new one revealed an issue: the personal files were taking up more space than expected. The biggest chunk of that was a never-deleted set of subdirectories holding digital photos from the 2000 to 2008 timeframe. Transferring those to the partition set aside for photo files fixed that up and gave back several hundred gigabytes. The next largest block of space, though, was in my personal file subtree. A little exploration revealed that my Maildir for automatic archiving of email was taking up 110 gigabytes. “ls” wouldn’t touch the “new” directory at all. I eventually downloaded a C source file that went at a directory with the “getdents” system call directly, revealing that the directory held some four and a half million files.
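
For what it’s worth, more recent Python can do that count without resorting to C: os.scandir (added in Python 3.5) streams directory entries by way of the same underlying system call, instead of materializing one enormous list. A sketch:

    import os
    import sys

    def count_entries(path):
        """Count directory entries one at a time rather than listing them
        all at once, which is what makes an ls-killer directory tractable."""
        count = 0
        for _ in os.scandir(path):
            count += 1
        return count

    if __name__ == "__main__":
        print(count_entries(sys.argv[1]))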

This caused me to revise my expectations for what “getmail” was actually doing. I was expecting getmail to look at a message on the server, check for its existence in the local store, and download the thing once. What getmail was actually doing, though, was downloading every message on the server every time it was invoked. Given invocation every day and a 90-day expiry time, I’d expect on average 90 copies of each individual email. So those 4.5 million email files probably represented roughly 50,000 unique email messages.
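
If I’m reading getmail’s documentation correctly, this is the read_all option at work: it defaults to retrieving every available message on every run, and setting it false tells getmail to skip messages it has already retrieved. The eventual fix, then, is a one-line change to the rc file (hypothetical snippet):

    [options]
    read_all = false    # fetch only messages not already retrieved
    delete_after = 90   # keep expiring server copies after 90 days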

OK, so this is a problem. Unix has solutions for duplicate files, right? Sure it does. New problem: duplicate-file utilities expect to do their work by finding files that are actually identical, and the directory getmail was dumping messages into (rather like the buckets of water in “The Sorcerer’s Apprentice”, making me Mickey) turned out not to have duplicate files. Huh? Well, here was another aspect of getmail behavior I had come to know. It acts like an RFC 822 MTA (mail transfer agent), dutifully recording its handling in trace headers added to each message it delivers. So each day’s copy of an email gets, at a minimum, a different date and timestamp in those headers.

So here’s my task. I need to tackle finding copies of email messages in a directory with 4.5 million files in it, retaining one and only one copy of each message, and deleting the rest.

And here’s my approach. I wrote a Python program to do the job. The os.listdir function is right out. Instead, I’m using the glob.iglob file-list iterator, which can handle incrementally reading the immense directory. I’m using an SQLite3 database with one table that holds a path, filename, and Message-ID string for each email message stored. The table has a unique constraint on Message-ID, so only one row per message can be stored. As the program iterates through files in the directory, it parses each file with email.parser to get the headers as a dictionary and extracts the Message-ID. A query to the database looks to see if that Message-ID already has an entry. If it does, the program checks that the recorded file still exists; if it does, and the current file is not identical in path and filename to the recorded one, the current file gets deleted. If the recorded file no longer exists, the entry is updated to point at the current file. Yes, every single file is being opened and read. A little reflection indicates that I could also optimize space utilization in the database with a second table for directory paths, storing a foreign key pointing to a path entry rather than the whole path in each row. However, I’m not motivated to do that at the moment.
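
A condensed sketch of that program follows. It is not the original source: the paths, table name, and column names are invented, and it’s rendered for Python 3, but the flow matches the description above.

    import glob
    import os
    import sqlite3
    from email.parser import BytesParser

    MAILDIR_NEW = "/home/wre/Maildir/new"        # hypothetical path
    DB_PATH = "/home/wre/mail_dedup.sqlite3"     # hypothetical path

    def main():
        conn = sqlite3.connect(DB_PATH)
        # One row per unique Message-ID, courtesy of the UNIQUE constraint.
        conn.execute("""CREATE TABLE IF NOT EXISTS messages (
                            path     TEXT NOT NULL,
                            filename TEXT NOT NULL,
                            msgid    TEXT NOT NULL UNIQUE)""")
        parser = BytesParser()
        # glob.iglob iterates instead of building one giant in-memory list
        # the way os.listdir would.
        for filepath in glob.iglob(os.path.join(MAILDIR_NEW, "*")):
            with open(filepath, "rb") as handle:
                headers = parser.parse(handle, headersonly=True)
            msgid = headers.get("Message-ID")
            if msgid is None:
                continue                          # leave oddballs for manual review
            dirpath, filename = os.path.split(filepath)
            row = conn.execute(
                "SELECT path, filename FROM messages WHERE msgid = ?",
                (msgid,)).fetchone()
            if row is None:
                # First sighting of this Message-ID: record it.
                conn.execute(
                    "INSERT INTO messages (path, filename, msgid) VALUES (?, ?, ?)",
                    (dirpath, filename, msgid))
                conn.commit()                     # commit per change keeps the scan restartable
            else:
                kept = os.path.join(row[0], row[1])
                if kept == filepath:
                    continue                      # this is the kept copy itself
                if os.path.exists(kept):
                    os.remove(filepath)           # redundant copy: delete it
                else:
                    # Recorded copy vanished; adopt the current file instead.
                    conn.execute(
                        "UPDATE messages SET path = ?, filename = ? WHERE msgid = ?",
                        (dirpath, filename, msgid))
                    conn.commit()

    if __name__ == "__main__":
        main()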

I started the Python program yesterday morning. Over the past 19 hours or so, it has deleted about 530,000 redundant messages while adding about 30,000 entries to the database. So this will take a few days to clear out my email archive problem. Once the immediate issue is dealt with, I’ll need to revisit the email archive issue from the ground up.

Wesley R. Elsberry


2 thoughts on “Adventure in Email Archiving”

  • 2013/09/13 at 12:58 pm

    The easiest solution (one that is widely used in our department) is to simply forward all work email to a Gmail account.

  • 2013/10/19 at 11:22 am

    I have correspondence that I feel more comfortable hosting myself rather than storing with some third party. I’m writing up scripts to make the back-end search easier.
