---
title: "WordPress export to Markdown"
author: Bob Rubbens
publish_timestamp: 2026-01-23T18:51:51+01:00
state: published
template: template/post.mako
id: b106d383-1bd3-4b3c-8e46-fb3b0a7accc4
---

I'm working on consolidating my previous blogs into this blog. That is, I'm giving the content of the old blogs an immutable corner on this new site, for reference and enjoyment, so that curious people can see what I was up to in the 2010s, and so that it is preserved for the future. While [my old WordPress blog](https://knightsofthecompiler.wordpress.com/) is still live at the time of writing, there's no guarantee how long that will remain the case.

As part of this effort, I *braincoded* [a quick script](./wordpress_export_extract_md.py.html) that turns WordPress data exports into folders of Markdown files. The WordPress export I had access to comes in two parts:

- A chunky XML file containing all posts, pages, comments etc. of the site, called "wordpressContentExport", or content export for short.
- A dated folder hierarchy filled with media, mostly pictures in my case. This is called the "wordpressMediaExport", or media export for short.

My script only considers the content export as input. The script generates the following folder hierarchy in the output folder:

```
./out/
├── index.md
├── page
│   ├── about-me
│   │   └── index.md
│   ├── february-operation-get-out
│   │   └── index.md
│   ...
└── post
    ├── doing-some-catch-up
    │   └── index.md
    ├── first-level-done
    │   └── index.md
    ...
```

The top-level `index.md` is a sort of table of contents for the whole directory. Then, for each type of top-level item in the WordPress export, the script creates a folder. These are the `page`, `post`, etc. folders. These top-level folders then contain a subfolder for each item of that type in the content export, where the `index.md` files within *those* folders contain the actual page/post/etc. content and metadata.
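
For a rough idea of how such a hierarchy can be produced, here is a minimal sketch, not the actual script. It assumes the content export follows the WXR 1.2 layout (`wp:post_type`, `wp:post_name` and `content:encoded` inside each `item`). The function name `export_items` and the `SAMPLE` data are made up for illustration, and the body is written out as raw HTML without any conversion or metadata.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace URIs as used by WXR 1.2 exports (assumption: check the root
# element of your own export, the version suffix may differ).
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Tiny hypothetical export, only for trying out the function below.
SAMPLE = """<rss version="2.0"
  xmlns:wp="http://wordpress.org/export/1.2/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
  <item>
    <title>First level done</title>
    <wp:post_name>first-level-done</wp:post_name>
    <wp:post_type>post</wp:post_type>
    <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
  </item>
  <item>
    <title>About me</title>
    <wp:post_name>about-me</wp:post_name>
    <wp:post_type>page</wp:post_type>
    <content:encoded><![CDATA[<p>Hi there</p>]]></content:encoded>
  </item>
</channel>
</rss>"""


def export_items(xml_text: str, out_dir: Path) -> list[Path]:
    """Write one <type>/<slug>/index.md per item and return the written paths."""
    root = ET.fromstring(xml_text)
    written = []
    for item in root.iterfind("./channel/item"):
        post_type = item.findtext("wp:post_type", default="", namespaces=NS) or "unknown"
        slug = item.findtext("wp:post_name", default="", namespaces=NS) or "untitled"
        title = item.findtext("title", default="")
        body = item.findtext("content:encoded", default="", namespaces=NS)
        target = out_dir / post_type / slug / "index.md"
        target.parent.mkdir(parents=True, exist_ok=True)
        # The body is still HTML at this point; conversion is a separate step.
        target.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling `export_items(SAMPLE, Path("out"))` would create `out/post/first-level-done/index.md` and `out/page/about-me/index.md`. Items without a slug all land in `untitled` here, which a real script would have to handle more carefully.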

Other folders the script generates are `attachment` and `wp_global_styles`. I didn't integrate those properly with the page and post directories, nor with the media export. It's possible, but would require more work. I basically stopped working on the script when I reached the point that integrating the Markdown export into my site proper was very little work, and hence further work on the script wasn't worth it anymore. I can imagine if you have a "proper" blog, and not just a handful of posts like on my old blog, you'd prefer expanding the script to cover the media export as well, instead of doing it manually.

The post and page conversion from HTML into Markdown is far from perfect. I use `pandoc` to convert the HTML into Markdown, which mostly works, except there are a few HTML tags used by WordPress that pandoc doesn't recognize. Obviously, this is to be expected. I'm happy to get anything useful for free at all. The most important part is that the output is straightforward to edit, as it is almost entirely plain Markdown, with a few HTML-literal warts here and there.
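
The conversion step itself can be sketched as follows. Note that this is a simplified stand-in: it calls the `pandoc` command-line tool through `subprocess` rather than going through the Python `pandoc` package that my script uses, and the function name `html_to_markdown` is made up for this example.

```python
import shutil
import subprocess


def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to Markdown by piping it through the pandoc CLI."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown"],
        input=html,            # the HTML is passed on stdin
        capture_output=True,   # the Markdown comes back on stdout
        encoding="utf-8",
        check=True,            # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

Anything pandoc cannot express in Markdown should come back as literal HTML in the returned string, which is where the warts mentioned above come from.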

I think the script could be adapted to recognize WordPress HTML markup and convert it into Markdown equivalents. It's not hard, just a bit annoying to detect and get right. You'd have to at least recognize headings and figures with captions. In its current form, the script leaves this to pandoc, which turns these constructs into HTML literals within the Markdown. It works, and is in fact sufficient from a visual perspective. If you pipe the Markdown files created by the script through `pandoc` again to get HTML, the resulting HTML looks decent enough in the browser, including headings and figures. However, from the perspective of long-term storage and consistency, I prefer having everything in a nice and minimal Markdown style.
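
To give an idea of what such an adaptation might look like, here is a small sketch that rewrites leftover HTML literals after the pandoc step. It is not part of the script. The exact shape of the literals (a `<figure>` wrapping an `<img>` and an optional `<figcaption>`, and `<h1>` to `<h6>` tags with attributes) is an assumption, and `rewrite_wp_literals` is a hypothetical name. Regular expressions are fragile for HTML, so treat this as a starting point only.

```python
import re

# <figure ...><img ... src="..." ...><figcaption>...</figcaption></figure>
# The figcaption part is optional.
FIGURE_RE = re.compile(
    r'<figure[^>]*>\s*<img[^>]*?src="(?P<src>[^"]+)"[^>]*>\s*'
    r'(?:<figcaption[^>]*>(?P<caption>.*?)</figcaption>\s*)?</figure>',
    re.DOTALL,
)

# <h2 class="...">Title</h2>, for heading levels 1 through 6.
HEADING_RE = re.compile(
    r'<h(?P<level>[1-6])[^>]*>(?P<text>.*?)</h(?P=level)>',
    re.DOTALL,
)


def rewrite_wp_literals(markdown: str) -> str:
    """Replace simple figure and heading HTML literals with Markdown syntax."""

    def figure(match: re.Match) -> str:
        # The caption ends up as the image's alt text; inline HTML inside
        # the caption is left untouched.
        caption = (match.group("caption") or "").strip()
        return f'![{caption}]({match.group("src")})'

    def heading(match: re.Match) -> str:
        return "#" * int(match.group("level")) + " " + match.group("text").strip()

    markdown = FIGURE_RE.sub(figure, markdown)
    return HEADING_RE.sub(heading, markdown)
```

For example, a literal like `<h2 class="wp-block-heading">Title</h2>` becomes `## Title`, and a figure with a caption becomes a regular Markdown image with the caption as alt text.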

The script requires both the [`pandoc`](https://pandoc.org) command-line tool and the Python [`pandoc`](https://pypi.org/project/pandoc/) package, as well as Python 3.12, though I expect that 3.11 and 3.10 would also work.

There you have it. A simple but effective script, which provides at most a humble starting point for your own WordPress exporting endeavors. At the very least you will need to adapt the script to output a format compatible with the static site generator, or other website-related software, of your choice. Nevertheless, this script is a good example and does get the basics of rummaging through the WordPress XML format out of the way.