Speaking of Archiving
One thing it would be good to do is keep track of changes to my web pages over time. If everything I did went through a content management system, one way would be to keep some sort of transaction log. But that’s not the situation I want to consider. Many of my pages go back to 1994 and 1995, well before content management systems were common. So given a set of pages, without access to any backend, how best to track changes over time?
There are open source programs to mirror websites. To name a couple, there’s HTTrack and w3mir. I’ve used w3mir for quite some time on Unix hosts, and the Windows port of HTTrack. w3mir provides an option to only download changed items given a previous directory of mirrored pages. So here’s what I am thinking about…
Start with a full mirror of a site. I don’t think that a mirroring operation should happen more than once a day, so I’m thinking that a directory named for the site and subdirectories named for the date of mirroring should be sufficient. If I’m wrong on frequency, certainly the subdirectories could be named by both date and time of mirroring.
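To make the naming scheme concrete, here’s a minimal Python sketch; the mirrors root is a placeholder I made up, and the function name is mine:

    import datetime
    import pathlib

    MIRROR_ROOT = pathlib.Path("/srv/mirrors")  # hypothetical location

    def mirror_dir(site: str) -> pathlib.Path:
        # Date-named subdirectory, e.g. /srv/mirrors/example.org/2006-01-15.
        # If once a day turns out to be too coarse, swap in
        # datetime.datetime.now().strftime("%Y-%m-%d-%H%M") instead.
        return MIRROR_ROOT / site / datetime.date.today().isoformat()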
For subsequent checks of the site, copy the most recent mirror into a new date-named directory, then invoke w3mir in “update” mode. This should minimize bandwidth use, since for unchanged files only the HTTP header information needs to be fetched. I need to see whether w3mir’s logging output would be useful for post-processing, but even if not, comparing the files in successive directories would quickly yield a record of which files changed between visits.
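As a sketch of that cycle, again in Python: the w3mir invocation below is only a stand-in (the actual options for update mode would come from its manual page), and the directory names are hypothetical:

    import filecmp
    import shutil
    import subprocess

    def update_mirror(prev_dir: str, new_dir: str, url: str) -> None:
        # Seed today's directory with the previous mirror...
        shutil.copytree(prev_dir, new_dir, symlinks=True)
        # ...then let w3mir refresh it in place. The argument list here
        # is illustrative only; the real flags belong to w3mir's docs.
        subprocess.run(["w3mir", url], cwd=new_dir, check=True)

    def report_changes(cmp: filecmp.dircmp, prefix: str = "") -> None:
        # Recursively list files that differ between two mirror trees.
        # Note dircmp's comparison is shallow (size and mtime) by
        # default; filecmp.cmp(a, b, shallow=False) checks byte-by-byte.
        for name in cmp.diff_files:
            print("changed:", prefix + name)
        for name in cmp.right_only:
            print("added:  ", prefix + name)
        for name in cmp.left_only:
            print("removed:", prefix + name)
        for name, sub in cmp.subdirs.items():
            report_changes(sub, prefix + name + "/")

    report_changes(filecmp.dircmp("mirrors/example.org/2006-01-15",
                                  "mirrors/example.org/2006-01-16"))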
There’s the issue of browsing a set of mirrors with possible changes. This is a problem that the Internet Archive has already solved, but so far as I can tell, they don’t make their software available for others to use. And that’s about where I’ve gotten in musing about this. I’m hoping to be able to use symbolic links on a Unix host to patch up each mirror, allowing full browsing while keeping only one copy of each unique file, so as to minimize hard disk space usage.
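A minimal sketch of that deduplication step, assuming absolute symlinks are acceptable and that the pass runs after each update; the function name is mine:

    import filecmp
    import os

    def dedupe(prev_mirror: str, new_mirror: str) -> None:
        # Replace files in new_mirror that are byte-identical to their
        # counterparts in prev_mirror with symlinks, so only one real
        # copy of each unchanged file is kept on disk.
        for dirpath, _dirnames, filenames in os.walk(new_mirror):
            for name in filenames:
                new_file = os.path.join(dirpath, name)
                if os.path.islink(new_file):
                    continue  # already points at an older copy
                rel = os.path.relpath(new_file, new_mirror)
                old_file = os.path.join(prev_mirror, rel)
                if not os.path.isfile(old_file):
                    continue  # file is new in this mirror
                if filecmp.cmp(old_file, new_file, shallow=False):
                    # realpath skips past any symlink chain from earlier
                    # passes, so links point at the oldest real copy.
                    target = os.path.realpath(old_file)
                    os.remove(new_file)
                    os.symlink(target, new_file)

Since a browser (or a local web server pointed at the directory) follows symlinks transparently, each dated mirror should still browse as a complete snapshot.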