Processes Blogger archive XML files into HTML, Markdown and JSON files. Meaning you can convert...
- Blogger to HTML
- Blogger to Markdown (noting the body content remains HTML, it is not a full markdown conversion)
- Blogger to JSON
It also...
- extracts the display template
- extracts a list of commenters to JSON
- extracts a list of contributors to JSON
- extracts a list of labels to JSON
- extracts a list of images and their associated blog ID; also provided are JSON files mapping the old URL to the new file
- downloads images hosted on blogspot (you can allow and block other domains in the config)
- downloads the XML sitemap file(s)
This tool will help you migrate content from Blogger to a new solution. It is not a drop-in solution for any specific blog tool; the idea is to generate source files you can copy and paste into your new blog project.
Be aware this is a pretty rough conversion so you must inspect the output yourself before you rely on it. The project has been hacked together in holiday breaks, has no tests and generally... use at your own risk ;)
I expect most people will want to customise:
- settings in config.json
- rendering templates in _templates.js
Likely uses for each format:
- Markdown is used by a ton of static site generators, noting they may all need differently-formatted frontmatter
- JSON is a common format and easier to work with than Blogger's XML
- HTML is provided more to inspect the conversion than anything else, although it does create a quite clean archive. If you want HTML fragments, edit the templates to remove what you don't need.
If you aren't migrating to a new tool and just want to store a static HTML copy of your site as it was published, I'd suggest you download a copy with wget instead:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.example.com
You will need to use a bash terminal like OSX Terminal or WSL on Windows 10. This will not work in anything else as it calls out to curl.
Log in to Blogger, then go to Settings - Other, then choose Back Up Content to download the XML file. You can also use Google Takeout to export your Blog in ATOM format. This tool should be able to read both, although it has mostly been tested with content exported from the Blogger web UI (not Takeout).
Note that /input/ is gitignored, so if you want to ensure your data is not accidentally committed to git, create that directory and put your export files in there.
- Clone or download this repo
- Ensure NodeJS is installed
  - If you have NVM just run nvm install && nvm use
  - Otherwise check the .nvmrc file in this repo and install that version
- Run npm install from the root of the repo
- Run the smoke tests with npm test to confirm everything is working.
After setup is done, run this in your bash terminal:
npm run bloggerconvert <filename>
...where <filename> is either a downloaded XML file from Blogger's interface, or an .atom file from Google Takeout (eg. node index.js blog-01-01-2018.xml).
To change configuration, edit config.json. Options:
- structure: tree (default) | flat
  - flat: blog posts are all generated in one directory
  - tree: blog posts are generated in a YYYY/MM/DD directory tree
- imagePath: the path to images in your final blog, used in the src attribute of img elements.
- imageUrls: array of substrings; if an img src contains any of the substrings, the image will be downloaded.
- imageExclusionUrls: array of substrings; if any match, that image will be excluded from download.
- imageRelativeUrl: since Blogger changed to hosted-only, relative images are handled with a fallback URL. You can check this setting in Blogger settings -> basic -> blog address (edit) -> Fallback subdomain (CNAME).
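To make the two substring lists more concrete, here is a minimal sketch of how such a filter could behave. The function name shouldDownload and the example values are made up for illustration, and it assumes an exclusion match wins over an inclusion match; the tool's own logic may differ in detail.

```javascript
// Hypothetical helper, not taken from the tool's source.
// Decides whether an image URL would be downloaded, given the two
// substring lists described in the options above.
function shouldDownload(src, config) {
  const includes = config.imageUrls || [];
  const excludes = config.imageExclusionUrls || [];
  const matches = (list) => list.some((fragment) => src.includes(fragment));
  // Assumption: an exclusion match wins over an inclusion match.
  return matches(includes) && !matches(excludes);
}

// Example values only; put whatever substrings suit your blog in config.json.
const exampleConfig = {
  imageUrls: ['blogspot.com', 'example.org/uploads'],
  imageExclusionUrls: ['tracking'],
};

console.log(shouldDownload('https://1.bp.blogspot.com/a/photo.jpg', exampleConfig));
console.log(shouldDownload('https://example.org/uploads/tracking/pixel.gif', exampleConfig));
console.log(shouldDownload('https://other.example.net/pic.png', exampleConfig));
```

Because matching is by plain substring, a short fragment can match more URLs than you expect, so it is worth keeping the entries reasonably specific.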
Notes:
- All generated files will be in /output/blogname/
- Old output for each blog is deleted every time you run the tool. This is to ensure each run is clean if you change anything.
- If you need multiple conversions of the same site, you will need to make copies manually between runs.
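As a rough illustration of how the structure setting relates to the output directory, the sketch below works out where a post could land for each value. The function outputDirFor is hypothetical and the exact paths the tool writes may differ; it only assumes the /output/blogname/ root and the YYYY/MM/DD tree described above.

```javascript
const path = require('node:path');

// Hypothetical sketch: where a post might land for each structure setting.
function outputDirFor(blogName, publishedISO, structure) {
  const base = path.posix.join('output', blogName);
  if (structure === 'flat') return base;
  // Take YYYY, MM and DD straight from the ISO date string so that no
  // timezone conversion can shift the day.
  const [year, month, day] = publishedISO.slice(0, 10).split('-');
  return path.posix.join(base, year, month, day);
}

console.log(outputDirFor('myblog', '2018-01-31T10:15:00Z', 'tree'));
console.log(outputDirFor('myblog', '2018-01-31T10:15:00Z', 'flat'));
```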
IMPORTANT - READ THIS! - the image downloader is very very simple (ie. dumb as rocks), does not follow redirects or retry etc, and you MUST check the validity of downloaded images. DO NOT ASSUME THEY HAVE WORKED.
If an image has not downloaded correctly, the easiest thing to do is delete it from the image cache and run the script again. Anything from a network timeout to a legitimate error can cause a failed download. If repeated attempts fail, you will need to manually download images or at the least manually test one of the failed URLs.
If you do not edit config.json, all images hosted on blogspot URLs will be downloaded, and IMG elements in blog posts will be updated to link to the downloads. As each blog tool will have a different path to image storage, you need to edit the imagePath option to suit your target system.
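The effect of imagePath can be pictured with a small sketch. It assumes the new src is simply imagePath plus the file name from the end of the old URL; the function newImageSrc is hypothetical and the tool may name downloaded files differently, so check the generated mapping files for the real values.

```javascript
// Hypothetical sketch: build the new src for a downloaded image from the
// imagePath option and the file name at the end of the old URL.
function newImageSrc(oldUrl, imagePath) {
  const fileName = new URL(oldUrl).pathname.split('/').pop();
  // Make sure exactly one slash separates the path and the file name.
  const prefix = imagePath.endsWith('/') ? imagePath : `${imagePath}/`;
  return prefix + fileName;
}

console.log(newImageSrc('https://1.bp.blogspot.com/-abc/s1600/photo.jpg', '/images'));
console.log(newImageSrc('https://1.bp.blogspot.com/-abc/s1600/photo.jpg', '/assets/blog/'));
```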
If you want to download images from other domains, or exclude certain domains, there are options for that as well.
Images are stored in a local cache that is not deleted between runs, so if you do multiple downloads you don't smash your connection each time. This does mean that failed downloads may create an empty file. As stated above, you really must manually check your images!
By default HTML is written with a very simple HTML5 wrapper. It does not apply the Blogger template, as that's a proprietary format and not worth reproducing. To perhaps state the obvious, if you really want to use Blogger templating you should probably stay on Blogger. If you simply want to preserve a read-only copy of your Blogger site in its original format, it would be better to download a copy with wget (see instructions above).
Post content is not parsed to Markdown, it is injected as raw HTML. If you need true Markdown, you will need to convert the extracted HTML yourself. However if you just need to repost the old content without editing it, this probably isn't necessary.
Markdown files contain frontmatter in YAML format:
---
key: value
key: value
key:
- value
- value
---
This frontmatter format is used in many popular static site generators like Hugo, Hexo, etc. You will need to choose the way you want to render labels into tags or categories.
${module.exports.renderCategories(labels)}
renders labels in this format:
categories:
- foo
- bar
- baz
${module.exports.renderTags(labels)}
renders labels in this format:
tags: ["foo","bar","baz"]
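If you want to write your own label renderer, the two shapes above are easy to reproduce. The functions below are hypothetical stand-ins (note the Sketch suffix) that only mimic the output formats shown; they are not copied from _templates.js.

```javascript
// Hypothetical stand-ins that reproduce the two output shapes shown above;
// they are not copied from _templates.js.
function renderCategoriesSketch(labels) {
  // One "- label" line per label, under a "categories:" key.
  return ['categories:', ...labels.map((label) => `- ${label}`)].join('\n');
}

function renderTagsSketch(labels) {
  // JSON.stringify produces a compact double-quoted array and also
  // escapes any quotes inside a label.
  return `tags: ${JSON.stringify(labels)}`;
}

console.log(renderCategoriesSketch(['foo', 'bar', 'baz']));
console.log(renderTagsSketch(['foo', 'bar', 'baz']));
```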
Example Hexo frontmatter:
---
title: ${post.metadata.title}
date: ${post.metadata.published.tidyISO}
updated: ${post.metadata.updated.tidyISO}
categories: ${module.exports.renderCategories(labels)}
tags: ${module.exports.renderTags(labels)}
permalink: ${post.metadata.url}
---
Example Hugo frontmatter:
---
title: ${post.metadata.title}
date: ${post.metadata.published.tidyISO}
lastmod: ${post.metadata.updated.tidyISO}
categories: ${module.exports.renderCategories(labels)}
tags: ${module.exports.renderTags(labels)}
slug: ${post.metadata.url}
---
To set up your target blog's exact format, refer to 'customising the output' below.
While a direct conversion of Blogger's XML to JSON is possible (in fact this tool is based on it), it's not clean due to the flat data structure of Blogger's archive format. This tool gives a set of 'tidied up' JSON files that should be easier to work with.
Output templates are in _templates.js. The file uses ES6 template literals to provide a very simple templating solution. When called, they are passed the data for a single post or page - so anything you can see in those data files is available for use.
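For anyone who hasn't used template literals this way, here is a minimal sketch of the idea. The template name markdownPostSketch, the post.body field and the sample date value are assumptions made for the example; check the generated JSON files for the real field names and formats before editing _templates.js.

```javascript
// Hypothetical template in the style described above; the real templates
// in _templates.js may be structured differently.
const markdownPostSketch = (post) => `---
title: ${JSON.stringify(post.metadata.title)}
date: ${post.metadata.published.tidyISO}
---
${post.body}
`;

// Invented sample data, shaped like the fields used in the frontmatter examples.
const samplePost = {
  metadata: {
    title: 'Hello: world',
    published: { tidyISO: '2018-01-01T09:30:00' },
  },
  body: '<p>First post.</p>', // assumption: the post HTML lives in `body`
};

console.log(markdownPostSketch(samplePost));
```

Quoting the title with JSON.stringify is a design choice for the sketch: a title containing a colon would otherwise break the YAML frontmatter.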
You can modify the HTML/contents of posts using Cheerio. Look for processPostBody in index.js.
Comments are included in the output as raw HTML, with a heading at the top noting how many comments and a note at the end that comments are closed. To generate without comments or modify any of these specifics, just remove them from the templates and run the conversion again.
If you need more-complex comment conversion, you will need to take care of that yourself - or perhaps raise a pull request :) I didn't need it so I didn't do it.
Generally speaking the script will try to tell you about problems.
If you are getting NodeJS errors (particularly binding errors), the first step is to ensure you are using a compatible version of Node. Check the .nvmrc file in the repo root; or install NVM and run nvm install && nvm use in the root of this repository.
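If you don't use NVM, a quick way to compare versions is a few lines of Node run from the repo root. This is a sketch rather than part of the tool, and it assumes that matching the major version is close enough.

```javascript
const fs = require('node:fs');

// Extracts the major version from strings such as "v18.17.0" or "8.9.4".
// Aliases with no leading number (eg. "lts/hydrogen") yield null.
function majorVersion(versionString) {
  const match = /^v?(\d+)/.exec(versionString.trim());
  return match ? Number(match[1]) : null;
}

// Compares the version in .nvmrc with the version of Node running this script.
function checkNodeVersion(nvmrcPath = '.nvmrc') {
  if (!fs.existsSync(nvmrcPath)) return 'no .nvmrc found';
  const wanted = majorVersion(fs.readFileSync(nvmrcPath, 'utf8'));
  const running = majorVersion(process.version);
  if (wanted === null) return 'could not read a version number from .nvmrc';
  return wanted === running
    ? `ok: running Node ${running}`
    : `mismatch: .nvmrc wants Node ${wanted}, running Node ${running}`;
}

console.log(checkNodeVersion());
```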
If trouble persists, try deleting node_modules and doing a fresh npm install - it's node's version of turning it off and then on again... but it often works.
MIT.
But don't be waiting for it to happen, if you need them right away DIY or submit a pull request...
- create option to base64 encode all images
- add an index file for the HTML export
- harden the image downloader with retries etc