I’m consolidating my previous blogs into this one. That is, I’m giving the content of the old blogs an immutable corner on this new site, for reference and enjoyment, so that curious people can see what I was up to in the 2010s, and so that it is preserved for the future. While my old WordPress blog is still live at the time of writing, there’s no guarantee how long that will remain the case.
As part of this effort, I braincoded a quick script that turns WordPress data exports into folders of Markdown files. The WordPress export I had access to comes in two parts: a content export and a media export.
My script only considers the content export as input. The script generates the following folder hierarchy in the output folder:
./out/
├── index.md
├── page
│   ├── about-me
│   │   └── index.md
│   ├── february-operation-get-out
│   │   └── index.md
│   ...
└── post
    ├── doing-some-catch-up
    │   └── index.md
    ├── first-level-done
    │   └── index.md
    ...
The top-level index.md is a sort of table of contents
for the whole directory. Then, for each type of top-level item in the
WordPress export, the script creates a folder. These are the
page, post, etc. folders. These top-level
folders then contain subfolders for each item of that type in the
content export, where the index.md files within
those folders contain the actual page/post/etc. content and
metadata.
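To make the layout concrete, here is a minimal sketch of how such a hierarchy could be produced from a content export. It is not the actual script. It assumes the export is a WXR file (RSS with `wp:` and `content:` extensions) using the namespace URIs shown below, and the function name `export_items` is made up for illustration. The body is written out as raw HTML here; converting it to Markdown is a separate step.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed namespace URIs (WXR 1.2). Check the xmlns declarations
# at the top of your own export, since the version may differ.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def export_items(xml_text: str, out_dir: Path) -> list[Path]:
    """Write <type>/<slug>/index.md for every <item>, plus a top-level index.md."""
    root = ET.fromstring(xml_text)
    written = []
    toc_lines = []
    for item in root.iter("item"):
        post_type = item.findtext("wp:post_type", default="unknown", namespaces=NS)
        slug = item.findtext("wp:post_name", default="", namespaces=NS)
        if not slug:
            # Items without a slug fall back to their numeric ID.
            slug = item.findtext("wp:post_id", default="untitled", namespaces=NS)
        title = item.findtext("title", default="") or slug
        body = item.findtext("content:encoded", default="", namespaces=NS)

        target = out_dir / post_type / slug / "index.md"
        target.parent.mkdir(parents=True, exist_ok=True)
        # The body is still HTML at this point.
        target.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
        written.append(target)
        toc_lines.append(f"- [{title}]({post_type}/{slug}/index.md)")

    # The top-level index.md acts as a table of contents.
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "index.md").write_text("\n".join(toc_lines) + "\n", encoding="utf-8")
    return written
```

With a hypothetical export file name, this would be called as `export_items(Path("export.xml").read_text(encoding="utf-8"), Path("out"))`.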
Other folders the script generates are attachment and
wp_global_styles. I didn’t integrate those properly with
the page and post directories, nor with the media export. It’s possible,
but would require more work. I basically stopped working on the script
when I reached the point that integrating the Markdown export into my
site proper was very little work, and hence further work on the script
wasn’t worth it anymore. I can imagine that if you have a “proper” blog,
and not just a handful of posts like on my old blog, you’d prefer
expanding the script to cover the media export as well, instead of
handling it manually.
The post and page conversion from HTML into Markdown is far from
perfect. I use pandoc to convert the HTML into Markdown,
which mostly works, except there are a few HTML tags used by WordPress
that pandoc doesn’t recognize. Obviously, this is to be expected. I’m
happy to get anything useful for free at all. The most important part is
that the output is straightforward to edit as it is almost entirely
plain Markdown, with a few HTML-literal warts here and there.
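For illustration, the conversion step can be sketched as follows. Note that this version shells out to the pandoc command-line tool rather than going through the Python pandoc package, so it is a stand-in for the idea, not a copy of what the script does. The function names are made up; `--from`, `--to` and `--wrap` are regular pandoc options.

```python
import shutil
import subprocess


def pandoc_command() -> list[str]:
    """Build the pandoc invocation for HTML to Markdown conversion."""
    # --wrap=none keeps each paragraph on a single line, which diffs nicely.
    return ["pandoc", "--from=html", "--to=markdown", "--wrap=none"]


def html_to_markdown(html: str) -> str:
    """Pipe an HTML fragment through pandoc and return the Markdown output."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc command-line tool not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        encoding="utf-8",  # exchange text, not bytes, with pandoc
        check=True,        # raise if pandoc exits with an error
    )
    return result.stdout
```

Tags that pandoc cannot map to Markdown are passed through as raw HTML, which is where the warts mentioned above come from.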
I think the script could be adapted to recognize WordPress-specific HTML
constructs and convert them into Markdown equivalents. It’s not hard, just
a bit annoying to detect and get right. You’d have to at least recognize
headings and figures with captions. In its current form, the script
leaves this to pandoc, which turns these constructs into HTML literals
within the Markdown. It works, and is in fact sufficient from a visual
perspective. If you pipe the Markdown files created by the script
through pandoc again to get HTML, the resulting HTML looks
decent enough in the browser, including headings and figures. However,
from the perspective of long-term storage and consistency, I prefer
having everything in a nice and minimal Markdown style.
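As a rough idea of what such an adaptation could look like, here is a sketch of a post-processing pass over the Markdown that pandoc produced, rewriting leftover HTML headings and captioned figures. The exact markup in a WordPress export varies, so the patterns below are assumptions (for instance, that `src` attributes use double quotes), the function names are made up, and regular expressions are only a stopgap compared to a real HTML parser.

```python
import re

FIGURE_RE = re.compile(r"<figure\b[^>]*>(.*?)</figure>", re.IGNORECASE | re.DOTALL)
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)
CAPTION_RE = re.compile(r"<figcaption\b[^>]*>(.*?)</figcaption>", re.IGNORECASE | re.DOTALL)
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def _strip_tags(fragment: str) -> str:
    """Drop inline tags and collapse whitespace, keeping only the text."""
    return " ".join(TAG_RE.sub("", fragment).split())


def _figure_to_markdown(match: re.Match) -> str:
    inner = match.group(1)
    img = IMG_SRC_RE.search(inner)
    if img is None:
        # Not an image figure (e.g. a table): leave it untouched.
        return match.group(0)
    caption = CAPTION_RE.search(inner)
    text = _strip_tags(caption.group(1)) if caption else ""
    return f"\n\n![{text}]({img.group(1)})\n\n"


def _heading_to_markdown(match: re.Match) -> str:
    level = int(match.group(1))
    return f"\n\n{'#' * level} {_strip_tags(match.group(2))}\n\n"


def tidy_wordpress_literals(markdown: str) -> str:
    """Replace raw HTML figures and headings with Markdown equivalents."""
    markdown = FIGURE_RE.sub(_figure_to_markdown, markdown)
    markdown = HEADING_RE.sub(_heading_to_markdown, markdown)
    # Collapse the extra blank lines introduced by the replacements.
    return re.sub(r"\n{3,}", "\n\n", markdown).strip() + "\n"
```

The caption ends up as the image’s alt text here, which loses any inline formatting in the caption. Whether that trade-off is acceptable depends on your content.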
The script requires both the pandoc command-line tool and
the Python pandoc package,
as well as Python 3.12, though I expect that 3.11 and 3.10 would also work.
There you have it. A simple but effective script, which provides at most a humble starting point for your own WordPress exporting endeavors. At the very least you will need to adapt the script to output a format compatible with the static site generator, or other website-related software, of your choice. Nevertheless, this script is a good example and does get the basics of rummaging through the WordPress XML format out of the way.
License: CC-BY-SA.