Gwulo forever: backups and archives

Submitted by David on Sat, 06/20/2020 - 18:00

This website holds a lot of valuable information, representing tens of thousands of hours of work. To recreate it would take literally years, and in some cases the information is irreplaceable as the original contributors have already passed away.

As a website is a surprisingly fragile thing, this newsletter looks at some of the steps that are in place to keep Gwulo's information available for the long term.

 

1. Backups: Coping with short-term damage

The Gwulo website has already broken down several times over the 10+ years since we started, due to problems with the server, mistakes on my part, or attacks by hackers.

The solution is to reach for a backup and restore a copy of the website that was made before the problem happened - hopefully not too long before, so that we don't lose too many days' posts and comments.

We have multiple levels of backup working:

  1. Once a day the Gwulo website makes a backup copy of its databases, which contain all the comments, posts, etc.
    If it isn't a database problem, eg there's a risk that the server has been infected by hackers, then ...
  2. Several times a week the hosting company takes a complete snapshot of the server, including all the databases and all the files.
    If for some reason the hosting company went out of business overnight, then ...
  3. Once a month I back up the databases and files from the server in the US to my computer here in Hong Kong. A copy of that backup is also sent to a third-party backup service at another location in the US.

So far, all the problems we've faced have been solved with the first- or second-level backup, typically losing a day or two's new content.

I hope it never happens, but in the worst case we'd need to use the third-level backup, losing up to a month's new content.
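
As a rough illustration of how the three levels fit together, here is a minimal Python sketch. It is not the actual tooling used for Gwulo: the names (BackupTier, choose_tier, the failure labels) are hypothetical, and the 3-day figure for the hosting company's snapshots is an assumption standing in for "several times a week".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    name: str
    covers: frozenset      # kinds of failure this tier can recover from
    max_loss_days: int     # worst-case age of the newest copy, in days


# Ordered from freshest to oldest, so the first match loses the least content.
TIERS = [
    BackupTier("daily database backup", frozenset({"database"}), 1),
    # Assumption: "several times a week" is treated as a gap of up to 3 days.
    BackupTier("hosting company snapshot", frozenset({"database", "server"}), 3),
    BackupTier("monthly off-site copy",
               frozenset({"database", "server", "host_gone"}), 31),
]


def choose_tier(failure: str) -> BackupTier:
    """Return the freshest backup tier that can recover from this failure."""
    for tier in TIERS:
        if failure in tier.covers:
            return tier
    raise ValueError(f"no backup tier covers failure: {failure!r}")


if __name__ == "__main__":
    for failure in ("database", "server", "host_gone"):
        tier = choose_tier(failure)
        print(f"{failure}: restore from {tier.name}, "
              f"losing up to {tier.max_loss_days} day(s)")
```

The point of the ordering is simply that a lower level is only reached for when the levels above it can't help, which is why the worst case grows from a day to a month.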

 

2. Archives: A static copy for the long term

What if the hosting company went out of business the same day I got run over by a tram, and there was no more Gwulo.com?

Wouldn't it be good if there was a copy of the Gwulo website available elsewhere on the web that you could still access? You wouldn't be able to add anything new to the site, but at least you could still read all that valuable information.

Say hello to website archiving: taking regular copies of a website's pages, and hosting them publicly so that they can still be read even after the original website no longer exists.

Copies of Gwulo's pages currently exist in three website archives:

 

2.1 The Wayback Machine

This is part of the Internet Archive, a non-profit organisation based in the US that copies and archives pages from websites around the world. You can visit the Wayback Machine's homepage and type in any website's domain to see how many pages it has copied.

The Wayback Machine has been running for a long time, so it can show you how Gwulo's original homepage looked in July 2009 during the initial split from Batgung.com, or go back even further to show you the Batgung homepage from 2002.

The Wayback Machine reports that it has 100,000 URLs from Gwulo in its collection, but its coverage is still far from complete. For example, we have well over 20,000 photos on the Gwulo website, but the Wayback Machine only has copies of around 3,800 of them.
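
To put a number on that gap, here is a small Python sketch using the rounded figures above. The function name coverage is just for illustration, and because the site has "well over" 20,000 photos, the result is an upper bound on the real coverage.

```python
def coverage(archived: int, total: int) -> float:
    """Fraction of items that have an archived copy."""
    if total <= 0:
        raise ValueError("total must be positive")
    return archived / total


if __name__ == "__main__":
    # Rounded figures from the text: ~3,800 archived out of 20,000+ photos.
    photo_coverage = coverage(3_800, 20_000)
    print(f"At most {photo_coverage:.0%} of the photos are archived")
```

In other words, fewer than one photo in five currently has a copy in the Wayback Machine.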

 

2.2 The British Library's Web Archives

One of the roles of a country's central library is to gather, preserve and make publicly available copies of new publications issued in that country. Traditionally that meant books, magazines, and newspapers, but libraries such as the British Library have expanded their role to also include material published to the web. They kindly included Gwulo.com in their Open UK Web Archive:

The Open UK Web Archive is a smaller collection (approximately 15,000) of selected websites archived by the British Library and its partners since 2003, with permissions from the owners. Selected websites will continue to be added to this open access collection, again with the permission of website owners. This content can be viewed anywhere.

A search for "gwulo.com" in the Open UK Web Archive returns 5,422 results, so again that is far from complete. I also note that all the copies were made in 2017 - anything newer than that hasn't been included.

 

2.3 Special Collections, University of Hong Kong Libraries

Here in Hong Kong, I'm not aware that Hong Kong's Public Library runs any website archiving service. Fortunately, Special Collections at HKU Libraries have stepped in to archive websites that document Hong Kong's history.

Here's their first archive copy of the Gwulo website that they made last summer. If you browse through that copy, you'll see they've done a great job: it is much more complete than either of the copies at the Wayback Machine or the British Library.

Though this archive can already be accessed by the public, it isn't actively promoted yet. HKU say that will change when the original website is no longer available. So if I've been knocked down by that tram and you don't see Gwulo.com online any more, please let them know!

We're currently adding over 9,000 photos, pages, and comments to Gwulo each year, which means that last summer's copy in the HKU archive is already missing a lot of material. I'll cross my fingers that they'll continue to update their copy.

There are several people to thank for getting Gwulo included in HKU's archive. First, thanks to Hugh Farmer, who mentioned in the newsletter for his Industrial History of Hong Kong website that the site had been archived by HKU. Next, to Stephen Davies, who made the introductions to HKU Libraries, and finally to Vivian So and Iris Chan at HKU for making the archive happen.

 


 

Summary

I'm happy with the current backup procedures we have in place, but the solution for long-term archiving is still a work in progress.

Ideally I'd like to see a complete copy of the website included in two archives (in case one fails). It would also be great to see the archives updated monthly or quarterly, rather than annually, so that less material is lost if the original website disappears. One option could be to let the archivers know what's new or changed, so that a monthly update would only need them to make copies of the 1,000 or so new / changed pages, instead of copying the 40,000+ pages of the whole site.
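
The "what's new or changed" idea could look something like the Python sketch below. It is only an illustration under assumptions: it supposes the site can produce a last-modified date for each page, and the function name changed_since and the example paths are hypothetical rather than anything Gwulo or the archives use today.

```python
from datetime import date


def changed_since(pages: dict, last_archived: date) -> list:
    """Return the URLs of pages added or modified after the last archive run.

    `pages` maps each page URL to the date it was last modified.
    """
    return sorted(url for url, modified in pages.items()
                  if modified > last_archived)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    pages = {
        "/example/old-page": date(2019, 3, 1),
        "/example/edited-page": date(2020, 6, 2),
        "/example/new-photo": date(2020, 6, 15),
    }
    to_archive = changed_since(pages, last_archived=date(2020, 5, 31))
    print(f"{len(to_archive)} of {len(pages)} pages need a fresh copy:")
    for url in to_archive:
        print(" ", url)
```

With the figures above, a monthly run like this would touch roughly 1,000 of 40,000+ pages, or about 2.5% of the site, instead of recopying everything.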

In the meantime:

  • If you know of any other web archive that is willing to archive the Gwulo website, please let me know.
  • If there are any pages on the Gwulo website that you want to make sure are archived, the Wayback Machine has a Save Page Now feature that lets you add them to their archive.
  • And a couple of tasks for me to follow up on:
    • Get in touch with the British Library to ask if they can start archiving pages from Gwulo again.
    • Make a small change to how Places are displayed, so that they show the values of their Latitude and Longitude. On Gwulo we rely on maps to show the location of a Place, but currently none of the archives can display Gwulo's maps. At least if the Latitude and Longitude are displayed in the archived copy of a Place, a reader can work out where the Place was located.
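
For anyone who wants to use Save Page Now on more than a handful of pages, here is a minimal Python sketch that builds the links to visit. It assumes the feature can be reached by putting the page's address after https://web.archive.org/save/, which is how it appears to work at the time of writing, so treat that prefix as an assumption rather than a documented guarantee. The function name and the example address are hypothetical, and the code only builds the link; it doesn't contact the Wayback Machine.

```python
# Assumption: Save Page Now accepts a page address appended to this prefix.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the link that asks the Wayback Machine to archive one page."""
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("page_url must start with http:// or https://")
    return SAVE_PREFIX + page_url


if __name__ == "__main__":
    # Hypothetical example address, for illustration only.
    print(save_page_now_url("https://gwulo.com/"))
```

Opening the printed link in a browser should have the same effect as typing the page's address into the Save Page Now box.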

If you've read this far, thank you! Backups and archives belong with flossing teeth in the "important but not very exciting" category, but I thought people who've contributed information and images to Gwulo might like to see how it is preserved.

Comments

Submitted on Sun, 06/21/2020 - 11:40

Thanks David for making a special effort to ensure that this treasure trove is preserved in perpetuity! Backing up computer files is of utmost importance to historians (and actually everyone) in this era, in which fewer and fewer printed copies of anything exist.

It's amazing how this site has grown, but also how much responsibility you have had to take on as site owner/custodian. So a big thanks for all your effort and diligence in keeping it up and running over all the years.