Serving Atom feeds with GitHub Pages
I recently checked on my blog through Google’s Webmaster Tools. Surprisingly, my sitemap couldn’t be read because it contained an error. That was weird because it had worked fine for a long time. Google’s error message didn’t help at all:
We were unable to read your Sitemap. It may contain an entry we are unable to recognize. Please validate your Sitemap before resubmitting.
My sitemap is an Atom feed of all my posts. Since Atom is XML, I can check it with a more useful validator, like the W3C’s feed validator. It came back with this, which was much more helpful:
Missing “charset” attribute for “text/xml” document.
Sorry, I am unable to validate this document because on line 86 it contained one or more bytes that I cannot interpret as us-ascii (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: ascii “\xC2” does not map to Unicode
There are actually two problems here. One is that the charset parameter isn’t being set on the Content-Type header. It should be utf-8, but without it text/xml defaults to us-ascii. The other problem is that text/xml is discouraged for this kind of document and should be replaced with application/xml. Or, better yet, application/atom+xml.
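Fortunately, solving the second problem solves the first problem too. When no charset is given, application/xml (and application/atom+xml) falls back to the encoding declared in the document itself, which defaults to utf-8.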
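To make the failure concrete, here is a minimal Python sketch of the defaulting rule described above. The function names (effective_charset, can_decode) are mine, not part of any validator, and the fallback to utf-8 is a simplification: a real XML processor would also look at the byte order mark and the XML declaration.

```python
from email.message import Message


def effective_charset(content_type: str) -> str:
    """Guess the charset a strict consumer assumes for an XML response.

    Simplified rule (an assumption of this sketch): an explicit charset
    wins, bare text/xml means us-ascii, anything else means utf-8.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        return "us-ascii"
    return "utf-8"


def can_decode(body: bytes, content_type: str) -> bool:
    """Return True if the body decodes under the assumed charset."""
    try:
        body.decode(effective_charset(content_type))
        return True
    except UnicodeDecodeError:
        return False


# A non-breaking space is b"\xc2\xa0" in UTF-8, so it produces the same
# leading byte (\xC2) that the validator complained about.
body = "<title>a\u00a0b</title>".encode("utf-8")

print(can_decode(body, "text/xml"))
print(can_decode(body, "application/atom+xml"))
```

The same bytes fail under a bare text/xml header and decode fine once the type changes, which is why fixing the MIME type also fixes the charset error.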
Unfortunately, my blog is hosted on GitHub Pages. I can’t modify the headers. I can, however, modify the filenames. GitHub doesn’t know that my sitemap is an Atom feed since it has an .xml extension. Changing it to .atom clues them in, and they serve it with the right header. See for yourself:
$ curl -v 'http://taylor.fausak.me/sitemap.atom'
* About to connect() to taylor.fausak.me port 80 (#0)
* Trying 204.232.175.78... connected
* Connected to taylor.fausak.me (204.232.175.78) port 80 (#0)
> GET /sitemap.atom HTTP/1.1
> User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
> Host: taylor.fausak.me
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.0.13
< Date: Thu, 26 Apr 2012 16:47:37 GMT
< Content-Type: application/atom+xml
< Content-Length: 188478
< Last-Modified: Thu, 26 Apr 2012 16:44:12 GMT
< Connection: keep-alive
< Expires: Fri, 27 Apr 2012 16:47:37 GMT
< Cache-Control: max-age=86400
< Accept-Ranges: bytes
<
In short, to get GitHub Pages to serve your Atom feed with the right MIME type, use .atom instead of .xml. (The same thing goes for RSS: use .rss instead of .xml.)
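If you would rather script the check than read curl output, something like the following should work. It is only a sketch: content_type_of and is_atom are names I made up, and the URL in the comment is a placeholder for your own feed.

```python
import urllib.request


def content_type_of(url: str) -> str:
    """Send a HEAD request and return the raw Content-Type header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type", "")


def is_atom(content_type: str) -> bool:
    """Return True if the header's media type is application/atom+xml.

    Parameters such as charset are ignored, and the comparison is
    case-insensitive.
    """
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/atom+xml"


# Example usage (needs network access; the URL is a placeholder):
#   is_atom(content_type_of("http://example.com/sitemap.atom"))
```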
Renaming the sitemap means that anyone subscribed to that feed won’t receive updates. I could keep both of them around indefinitely, but I don’t want to duplicate the content. Instead, I’ll create a feed at the old URL with one entry that points to the new URL.
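A stub feed like that could be generated with a few lines of Python. This is a sketch, not what my blog actually uses: moved_feed, the example.com URLs, the author name, and the wording of the entry are all placeholders.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def moved_feed(old_url: str, new_url: str, updated: str, author: str) -> str:
    """Build a one-entry Atom feed telling subscribers where the feed moved."""
    # Serialize Atom as the default namespace instead of a generated prefix.
    ET.register_namespace("", ATOM_NS)

    def add(parent, tag, text=None, **attrs):
        element = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}", attrs)
        element.text = text
        return element

    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    add(feed, "id", old_url)
    add(feed, "title", "This feed has moved")
    add(feed, "updated", updated)
    add(feed, "link", rel="self", href=old_url)
    add(add(feed, "author"), "name", author)

    entry = add(feed, "entry")
    add(entry, "id", new_url)
    add(entry, "title", "This feed has moved")
    add(entry, "updated", updated)
    add(entry, "link", rel="alternate", href=new_url)
    add(entry, "summary", f"Please subscribe to {new_url} instead.")

    return ET.tostring(feed, encoding="unicode")


xml_text = moved_feed(
    "http://example.com/sitemap.xml",   # placeholder old URL
    "http://example.com/sitemap.atom",  # placeholder new URL
    "2012-04-26T16:44:12Z",
    "Example Author",
)
print(xml_text)
```

Keeping the stub ASCII-only means it stays readable even when the old URL is still served as text/xml with no charset.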