How to Create an Archive of a Website

Recently, I was asked to archive a website so that the static HTML files could be browsed offline, with links to scripts, stylesheets, and images continuing to work properly. Options such as the Wayback Machine or Webrecorder required me to manually visit every page I wanted archived, and they didn’t capture every resource as reliably as I wanted. Eventually, I found HTTrack, which was perfect for my needs.

HTTrack offers a command line interface and does a fantastic job of getting everything on a website. I tried running the command with several different combinations of flags, and found that something like this worked best for my needs (the two filters are quoted so the shell doesn’t try to expand the asterisks):

$ httrack "https://zystvan.com" -O ./ --mirrorlinks -%v "-*" "+zystvan.com/*"

This command crawls https://zystvan.com and applies the following flags:

-O (capital letter O) sets the output path, here the current folder (./).
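--mirrorlinks (short form -Y) mirrors all links found on the first-level pages. HTTrack rewrites links between downloaded documents to relative paths by default, so they continue to work in your local copy.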
-%v displays filenames as they’re downloaded.
-* excludes everything from being downloaded.
+zystvan.com/* overrides our exclude and allows files on zystvan.com to be downloaded.
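
To reuse the same flags for another site, the command can be assembled from a start URL. The sketch below only prints the command rather than running HTTrack; `mirror_command` is a hypothetical helper name of my own, and the host extraction assumes a plain scheme://host/path URL with no port or credentials.

```shell
#!/bin/sh
# Print an HTTrack command that mirrors only the host of the given URL.
# mirror_command is a hypothetical helper, not part of HTTrack.
mirror_command() {
    url=$1
    out_dir=${2:-./}     # default to the current folder, as in the post
    host=${url#*://}     # strip the scheme (https://)
    host=${host%%/*}     # strip any path after the host
    # Filters are quoted so the shell passes the asterisks through untouched.
    printf 'httrack "%s" -O "%s" --mirrorlinks -%%v "-*" "+%s/*"\n' \
        "$url" "$out_dir" "$host"
}

mirror_command https://zystvan.com
# prints: httrack "https://zystvan.com" -O "./" --mirrorlinks -%v "-*" "+zystvan.com/*"
```

Once the printed command looks right, it can be pasted into a terminal to run it.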

The exclude/include part at the end was necessary to prevent HTTrack from also archiving any sites linked from the main one. Without it, HTTrack might see a link to a Twitter profile, for example, and then try to archive the entirety of Twitter. I suspect there might be a flag to do the same thing, but haven’t checked.
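
The way the two filters interact can be approximated with ordinary shell glob patterns. This is only a rough sketch, under the assumption that the last matching rule decides the outcome (which is what "overrides" above implies). HTTrack’s real matcher has its own wildcard syntax, and `is_allowed` is a hypothetical helper, not an HTTrack command.

```shell
#!/bin/sh
# Approximate the exclude/include filters with shell globs: the last rule
# that matches a link decides whether it would be downloaded.
# is_allowed is a hypothetical helper, not part of HTTrack.
is_allowed() {
    link=$1
    shift
    verdict=deny                     # assumed default, for this sketch only
    for rule in "$@"; do
        sign=${rule%"${rule#?}"}     # first character of the rule: + or -
        pattern=${rule#?}            # remainder of the rule is the glob
        case $link in
            $pattern)
                if [ "$sign" = "+" ]; then verdict=allow; else verdict=deny; fi
                ;;
        esac
    done
    [ "$verdict" = allow ]
}

is_allowed "zystvan.com/about/" "-*" "+zystvan.com/*" && echo "zystvan.com/about/: download"
is_allowed "twitter.com/zystvan" "-*" "+zystvan.com/*" || echo "twitter.com/zystvan: skip"
```

Swapping the order of the two rules makes the exclude win for every link, which is why the include comes last in the command.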

Tada! You’ll now have a completely working local copy of the website you chose to archive.

July 15, 2017