Unit Propagation - Moving from reStructuredText and docutils to Markdown and pandoc

In the second half of 2024, I got an itch to start a blog again. By now, I’m old enough to realize this is a 3-5 yearly cycle, so there’s not much use resisting it. “Who knows,” I thought back then, “maybe this time it will stick!”

When writing for personal interest, there is one topic that I try to stay away from as much as possible. For one, this topic is only interesting to a small handful of people. It also sucks up large amounts of time easily, because it’s both fun to write and a complicated fractal that can be explored endlessly.

This thing I avoid is, you might’ve guessed it already, the meta blog post. That is, a blog post about writing for and maintaining the blog on which the post will be posted.

As it is now exactly a year since I published my first blog post for this site, I decided to indulge myself and write a meta blog post after all.

When I created this site in 2024, static site generators (SSGs) were, and still are, all the hype, and I was in the market for a plaintext system to build my own SSG with. A few months earlier, Hillel Wayne had written a blog post praising reStructuredText (rST).¹ rST also has a semi-formal specification, which, as I am a formal methods researcher, attracted me like a moth to a flame. Finally, for a previous blog I had also used rST to write posts, so in a way it felt somewhat familiar.

Unfortunately, after using rST and docutils, a library for piecemeal processing of rST, I concluded the experience is not as pleasant as I’d hoped.

The main takeaway is that, if you want a zero-hassle setup, with basic good-looking outputs for little effort, it’s hard to beat MD+pandoc. However, if you have some ideas about what the output should look like, and have time to invest in getting things right, I’d wholeheartedly recommend rST+docutils. You’ll get more bang for your buck there in the long term. Or, as another alternative, if you don’t mind a more opinionated tool that takes more control of your workflow, or want something that has more features to offer, Sphinx is also a good candidate.

To know all the gory details of how I came to these conclusions, read on! I’ll cover how I ended up choosing rST, the problems I ran into while using rST+docutils for my SSG, and what I expect to gain by moving to MD+pandoc.

How I ended up choosing rST

I remember using rST for my previous blog (web archive link). I was only using it because it happened to be the default input format of Nikola, so I experienced it more from an end-user perspective.² Generally, I liked rST! I had some experience with Markdown (MD), so rST wasn’t too strange. I do remember having a hard time remembering the proper rST syntax for things, and looking them up a lot.

With this experience in the back of my mind, I encountered Hillel’s blog post, and was immediately convinced. To summarize, these were the primary reasons.

Extensible syntax

reStructuredText has two pieces of syntax designed to be customized by users: directives and interpreted text roles (also called just roles). Directives are blocks of text that look something like this (example from the rST docs):

.. figure:: larch.png
   :scale: 50

   The larch.

In this example, figure is not a keyword built into the rST language, but just one instance of this reusable block syntax. By replacing figure with a keyword of your choice, you can change the meaning of the block into something specific to your use case. Maybe you want some kind of media block? Or an interactive quiz block?
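A minimal sketch of how such a custom block could be hooked into docutils at run time. The quiz directive, its :answer: option, and the generated markup are all hypothetical, made up purely for illustration; only the registration mechanism itself is standard docutils.

```python
from docutils import nodes
from docutils.core import publish_parts
from docutils.parsers.rst import Directive, directives


class Quiz(Directive):
    """Hypothetical ``quiz`` directive: a question plus one expected answer."""

    has_content = True
    option_spec = {"answer": directives.unchanged_required}

    def run(self):
        # Wrap the block in a generic container that any writer can render.
        container = nodes.container(classes=["quiz"])
        # Parse the directive body as ordinary rST into the container.
        self.state.nested_parse(self.content, self.content_offset, container)
        answer = self.options.get("answer", "")
        container += nodes.paragraph(text=f"Answer: {answer}")
        return [container]


# Make the (hypothetical) keyword known to the rST parser.
directives.register_directive("quiz", Quiz)

SOURCE = """\
.. quiz::
   :answer: Monday

   On which day does the week start?
"""

html_body = publish_parts(SOURCE, writer_name="html5")["body"]
print(html_body)
```

The directive returns plain docutils nodes, so the HTML writer needs no changes here; a fancier element would likely need its own node class and writer support.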

The other customizable elements are roles. These are intended for inline text, and look like this (example from the rST docs):

See :RFC:`2822` for more info.

Here, the customizable keyword is RFC, and the text for the inline element is between the backticks.
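As a rough illustration, a custom role can be registered with docutils in a few lines. The issue role and the tracker URL below are assumptions invented for this sketch, not something from the rST docs or from my own setup.

```python
from docutils import nodes
from docutils.core import publish_parts
from docutils.parsers.rst import roles


def issue_role(name, rawtext, text, lineno, inliner, options=None, content=None):
    """Hypothetical ``issue`` role: turn an issue number into a link."""
    url = f"https://tracker.example/issues/{text}"  # placeholder URL
    node = nodes.reference(rawtext, f"issue #{text}", refuri=url)
    # A role returns the generated nodes plus a list of system messages.
    return [node], []


# Register the (hypothetical) keyword for use between colons.
roles.register_local_role("issue", issue_role)

html_body = publish_parts(
    "See :issue:`42` for details.", writer_name="html5"
)["body"]
print(html_body)
```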

In my opinion, there’s just enough structure in these syntaxes to be useful for a wide spectrum of use cases, without being overly verbose. When I first found out about this, I immediately realized this is a killer feature, as it offers a clear next step for when you want to start using more annotations in your documents. Possibly you could even go in a semantic direction. It’s unfortunate that the MD community has not settled on a standard syntax for a custom inline and block element.

Precise design

reStructuredText immediately gave me the impression that effort had been put into making the design at least somewhat precise. The reference spec for rST is detailed, providing a solid starting point for implementors, or at least a clear description that would allow a developer to ask useful questions about ambiguities. Part of rST is the docutils document model, which is essentially an XML-like notation to describe rST documents without ambiguity. I have used this with success to reliably inspect and debug rST documents.
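For the curious, one way to look at that document model is to ask docutils for its pseudo-XML output. This is a small sketch with a made-up input document; the exact tree printed depends on the docutils version.

```python
from docutils.core import publish_string

SOURCE = """\
Title
=====

Some *emphasised* text.
"""

# The "pseudoxml" writer dumps the parsed document tree in an indented,
# XML-like notation. Setting output_encoding to "unicode" yields a str.
tree = publish_string(
    SOURCE,
    writer_name="pseudoxml",
    settings_overrides={"output_encoding": "unicode"},
)
print(tree)
```

Each rST construct shows up as a named node (title, paragraph, emphasis, and so on), which makes it easy to see how the parser actually interpreted a piece of markup.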

The docutils library

Docutils is, as far as I know, the main library around to process rST in Python. To me, its major selling point is that it allows adding customizable syntaxes. It achieves this by having a read-transform-write architecture, which makes it clear where I need to add code if I want to e.g. transform custom elements into HTML, or would ever want to consider another output format.

I also like that docutils supports adding customizable syntaxes at run-time. This means I can just hook up my handlers before calling docutils, instead of having to fork docutils and put my custom handlers into the sources of docutils.

There’s one obvious other choice in the rST space, which is Sphinx. You might know it as the standard documentation generator for Python. While Sphinx is a powerful and featureful tool, I still went with docutils. This is because Sphinx seemed to be more like a foundation to build upon, compared to the small and effective component I was looking for to drop into my SSG design. This is reinforced by the fact that Sphinx actually uses docutils, too.

Maybe all my problems stem from this one decision to use docutils and not Sphinx! I certainly don’t exclude the possibility. Maybe in a next documentation-related project I’ll consider Sphinx more seriously, but for this project, docutils just seemed like the better fit.

Annoyances I ran into

Documentation quantity

Documentation of docutils felt a bit chaotic. There’s one site that has some API docs available. While useful, it feels a bit like an afterthought. It’s not as helpful as it could be to someone who wants to use docutils in a standalone manner.

The official docutils documentation impressed me with what’s available, and the depth with which some topics are discussed. This convinced me that the developers know how to write quality documentation. However, there are plenty of topics where the content is either way too terse, still a work in progress, or not present at all.

Given the complexity of the docutils library, this means that to use it effectively, you have to be ready to participate in the community, e.g. via the mailing list or by making bug reports.

Over-layering of the docutils architecture

The library consists of quite a few moving parts and abstraction layers, which all contribute to either making the library modular, or implementing some part of the versatile docutils pipeline. Because of this flexibility, the library does not feel like a coherent whole to me, but more like a mishmash amalgamation of a variety of tools. This chimera produces what you want in some cases, and fails in subtle ways in others.

A specific example of this is the following. I wanted syntax highlighting in my code samples, so I instructed docutils to use Pygments. I learned the hard way that the CSS classes Pygments uses to mark up the code partially overlap with the CSS classes of docutils. I spent quite some time debugging why parts of my code samples were suddenly struck through! It turns out that the class for marking up a string in Pygments overlaps with the one for strikethrough in the docutils HTML5 writer. The docutils one took precedence in the HTML, making the marked-up source code look strange.

Once I realized what was happening, it wasn’t too much work to massage docutils and Pygments into avoiding this. I vaguely remember I could specify the format of CSS class names on both sides, and there was a (non-standard) combination where this overlap didn’t happen.

Clearly docutils is quite flexible and extendable. The fact that I could avoid this problem by “just” tuning subcomponents speaks of a flexible architecture. But, the fact that I have to spend time manually tuning the low-level output of both the syntax highlighter and the HTML writer is not ideal.

Another example is that I had, and am still having, a hard time understanding the settings model of docutils. I get that, from a tool point of view, you want several entry points to supply defaults. Moreover, overriding settings on a per-tool and per-component basis definitely sounds useful. E.g. if you are writing a LaTeX replacement, you probably want to have the option to override settings at any level. But if you “just” want to use docutils as a library, there are just too many places to insert settings, and it’s not really clear which settings go where. I wound up putting a dict in some global object so I could configure my custom components. Maybe that’s the way to go, I don’t know. If it is, the documentation failed to mention it.
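For built-in settings, at least, one entry point that works for library use is the settings_overrides argument of the publish functions. The sketch below only covers standard docutils settings; it is not the global-dict approach I used for my custom components, and the chosen values are just examples.

```python
from docutils.core import publish_parts

SOURCE = """\
Section
=======

Body text.
"""

overrides = {
    # Keep a lone section title as a section instead of the document title.
    "doctitle_xform": False,
    # Render top-level sections as <h2> rather than <h1>.
    "initial_header_level": 2,
}

body = publish_parts(
    SOURCE, writer_name="html5", settings_overrides=overrides
)["body"]
print(body)
```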

No out-of-the-box bibliography support

When I was considering moving to MD, this was actually one of the major motivations.

For the record: it’s possible to generate a proper academic bibliography to accompany your rST document, using this library. However, what made bibliographies in rST annoying to use is:

Note that this is only a mild complaint! The fact that you can find a well-designed plugin like this, ready to be extended, is actually a very good place to be for a library and its surrounding ecosystem. However, as my goal is to build a no-hassle SSG, the fact that I need to extend standard primitives is not ideal.

Why are you complaining? You’re just a spoiled user!

You might be thinking, if I have so many opinions about docutils, I should just extend docutils, and contribute back to the library! This is completely fair. It’s likely that these problems are unsolved in docutils because there simply hasn’t been developer priority and/or funding to implement and polish the required features. Sadly I don’t have the energy and the time to actually do this.³ Just to make this explicit: this blog post is not written to complain, but because I like documenting my experience with the combination of design choices I made, and hope it might be interesting or useful to others.

For what it’s worth, I think the flaws I mentioned are all fixable. E.g. there’s no reason why docutils cannot also have a simple interface, similar to the Python pandoc library.

However, I also think there are some design improvements, or even simplifications, that could be made to simplify docutils. For example, docutils has a dedicated element for command line options. Why is this a docutils built-in, but bibliography handling is a plugin? I think, ideally, both should be plugins. These two features are, IMHO, good examples of what should be handled at the Sphinx level.

More subjectively, I think the system for propagating settings between components in docutils, or more generally the whole component/plugin/reader/writer system for composing components, is a bit over-engineered, and could probably be simplified. Unfortunately I moved to MD+pandoc before I could really form a concrete opinion on this.

A challenger appears: Markdown and pandoc

Markdown (MD) is a widespread plaintext markup language. It’s similar to rST, except that it’s a lot more minimal. Pandoc is a flexible tool for converting between markup formats, including various flavors of Markdown. To give you an idea of pandoc’s place in the ecosystem: pandoc refers to itself as a “swiss-army knife” for converting files from one markup format into another.

I’ve used Markdown and pandoc for small documentation processing projects before, and with good success: I remember pandoc causing very few problems and getting mostly out of the way. Markdown being as prevalent as it is, it’s pretty much a cultural blind spot for software engineers at this point. Therefore it’s a safe and obvious choice for most plain text-related tasks.

After dealing with the above problems with rST for a while, I happened to run into pandoc again on the internet. Remembering how easy it was to work with pandoc for earlier projects, I realized I should port my SSG to MD+pandoc if I want to minimize complexity in my SSG. This was probably back in April 2025.

Aside: extensible syntax for Markdown

For me, the most important feature MD is lacking is an extensible syntax. As far as I can tell, there are no theoretical problems to be solved here. The community needs to pick one of the available syntaxes, and then stick to it. I know, that’s probably asking too much…

From what I’ve seen in practice, the attribute syntax of pandoc is strongly tied to HTML. For example, using pandoc’s attribute syntax, in the text { #X .Y }, X is an identifier (in the HTML element sense), and Y is a class (in the CSS sense).
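To make the three parts of an attribute block concrete, here is a toy parser. It is a deliberately simplified sketch and not pandoc’s actual grammar: it only understands #identifier, .class, and key="value" with double quotes. The function name parse_attributes is made up.

```python
import re

# One token is either "#identifier", ".class", or key="value".
TOKEN_RE = re.compile(
    r'#(?P<id>[^\s#.=]+)'
    r'|\.(?P<cls>[^\s#.=]+)'
    r'|(?P<key>[^\s#.=]+)="(?P<val>[^"]*)"'
)


def parse_attributes(text):
    """Split a pandoc-style attribute block into (identifier, classes, key-values)."""
    inner = text.strip()
    if not (inner.startswith("{") and inner.endswith("}")):
        raise ValueError("attribute block must be wrapped in braces")
    identifier, classes, keyvals = "", [], {}
    for match in TOKEN_RE.finditer(inner[1:-1]):
        if match.group("id"):
            identifier = match.group("id")
        elif match.group("cls"):
            classes.append(match.group("cls"))
        else:
            keyvals[match.group("key")] = match.group("val")
    return identifier, classes, keyvals


print(parse_attributes("{ #X .Y }"))
print(parse_attributes('{#quiz1 .quiz answer1="Monday"}'))
```

The identifier and class slots map directly onto HTML’s id and class, which is exactly the coupling described above; only the key-value part is reasonably neutral.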

That doesn’t mean you can’t use those syntactical elements for generic/custom purposes. But it does mean there will always be tools, perhaps older ones, that assume this is a mechanism intended for interfacing with HTML output. Instead, I would like an extension that is agnostic to the particular back-end used.

For example, the key-value notation for attributes is a good start for a generic mechanism (even though those are also kind of tied to HTML attributes…). In addition, I’d like some kind of notation, not historically tied to HTML, to indicate the type of the element. Maybe using an exclamation mark, as that is already used in the MD syntax to indicate you want to import an image. E.g. ! quiz { answer1="Monday" } `content` to indicate a quiz element, where the first answer is “Monday”, and there’s some content to go with the question.
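Purely as a thought experiment, this hypothetical notation is simple enough to recognise with two regular expressions. Nothing below exists in pandoc or any MD flavour; the CustomElement type and the function name are invented for the sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomElement:
    kind: str         # element type, e.g. "quiz"
    attributes: dict  # key="value" pairs from the braces
    content: str      # text between the backticks


# Matches: ! <kind> { <attributes> } `<content>`
ELEMENT_RE = re.compile(
    r'!\s*(?P<kind>\w+)\s*\{(?P<attrs>[^}]*)\}\s*`(?P<content>[^`]*)`'
)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_custom_elements(text):
    """Find every hypothetical custom element in a piece of text."""
    return [
        CustomElement(
            kind=match["kind"],
            attributes=dict(ATTR_RE.findall(match["attrs"])),
            content=match["content"],
        )
        for match in ELEMENT_RE.finditer(text)
    ]


sample = 'Question: ! quiz { answer1="Monday" } `First day of the week?`'
print(parse_custom_elements(sample))
```

Because the exclamation mark must be followed by a word rather than a bracket, ordinary image syntax is left alone by this pattern.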

Most importantly, I’d like to emphasise I’m not one to bikeshed: if someone with more leverage in the community has an idea of how to “just” use the attribute syntax, possibly by proposing some conventions to work around the fact that it’s tied to HTML, I’d happily rewrite my MD sources to use that.

Some other ideas for inline extensible syntax, since I’m indulging myself anyway:

Or, perhaps John MacFarlane can come up with a proposal to “just” combine the syntax of rST with . Then we can call it a day!

Changes when moving to pandoc

There were improvements, things that kind of stayed the same, and some things that got worse:

Improvements

In general, it feels like using pandoc is a more coherent experience when compared to using docutils. Whenever I had questions, I could either find the answer easily in the pandoc docs, which are of high quality, or searching for it wasn’t too hard - it’s quite a popular tool, after all.

All in all, the switch was pretty easy. Translating the posts was some manual work, but none of them are very long, so it’s almost not worth mentioning.

I was expecting citation support in pandoc to be far better than in docutils. In fact, this was more or less the main reason for me to move to pandoc, as I disliked the complexity citations introduced into my rST+docutils setup. However, now that I’ve used citations in pandoc for a bit, I’ve changed my opinion. Unsubstantiated claim warning: pandoc seems to produce similar quality bibliographies with less hassle in terms of API usage. However, when writing the post about CReduce, which has a bunch of citations, I realized that when using all three (citations, footnotes and direct links) simultaneously, the writing becomes incoherent and syntactically busy.

There’s simply no need for three syntactic categories to refer to something. You need one syntax to refer to something directly, and another to refer to something with a bit of context. The former can be done with inline links, the latter with footnotes. In that interpretation, academic citations are just a footnote with an academic syntactical sugar topping.

I realized citations don’t actually work so well in blog posts, where inline links and footnotes are more prevalent in practice, too. Because of this, I won’t be using the bibliography feature of my SSG going forward, instead focusing on just using direct links and footnotes well. If it ever happens I need to cite something in an academic manner, I’ll just one-off generate a citation and copy-paste that into the post.

Medium

From a format perspective, portability improved significantly, as MD is a lot more widespread than rST. Then again, MD doesn’t have a spec, meaning you have to choose one of the flavours, which hurts portability. In the end, I feel that my text sources for this blog were portable, and still are. In an emergency, I can probably get pandoc to translate them to some other format, which should take care of most of the work.

Portability in tooling terms is not so bad. I guess instead of depending on just a Python library, I’m now depending on both a Python library and a Haskell binary. Then again, pandoc is so widespread I think I don’t have to worry too much about pandoc becoming hard to use for a while.

The code is a little bit simpler now, after kicking out docutils entirely. It’s hard to beat the markdown Python library interface. Basically it’s just a function for “parse this MD” and one for “write this MD to another format”. I like the relatively minimal (compared to docutils!) document datatype, which makes future processing far less daunting.
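To give a feel for how little glue is needed, here is a stand-in sketch that shells out to the pandoc executable directly. Note that this is not the Python library interface mentioned above, just the simplest self-contained equivalent I can show; the convert function is a made-up helper and it assumes pandoc is on the PATH.

```python
import shutil
import subprocess


def convert(text, source_format="markdown", target_format="html"):
    """Pipe text through the pandoc executable and return the converted output."""
    result = subprocess.run(
        ["pandoc", "--from", source_format, "--to", target_format],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout


# Only run the demo when a pandoc binary is actually installed.
if shutil.which("pandoc") is not None:
    print(convert("Hello *world*"))
```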

One thing that annoyed me before I switched to MD is that docutils had a bug where the syntax highlighting and styling CSS overlapped, which required some fidgeting with docutils flags to sort out. Glad that the code handling this is now gone!

Regressions

My heart aches at the idea of losing the conceptually awesome reusable custom syntax of rST! This is clearly a large step back; pandoc’s MD has no direct equivalent.

I’ve mentioned it before, but I think this is an important enough point to mention it separately: the spec for pandoc’s MD is not as rigid as rST’s. Beyond that, just basic MD as it was originally defined can still be spartan at times. To get around that, you have to pick a flavour. For me, that’ll be the pandoc flavour for now. I haven’t run into cases yet where I disagree with pandoc’s design choices.

Could this having to choose one of many incompatible flavours be related to not having extensible syntax? Possibly. Basic MD does allow inline HTML, so maybe I have the wrong expectations.

Performance is quite a bit worse. I have some experience with calling pandoc on many MD files and that wasn’t too great. Docutils felt pretty snappy to me. To compare, before I made the switch compiling my site took around 3 seconds. Now it takes at least 10! Granted, I actually have more than two posts now, and my site, while still simple, has gained a few features since it was first compiled a year ago. Still, quite a large chunk of that time is spent running pandoc a few times, for at least the following reasons:

There are plenty of solutions to the performance problem if it ever becomes too large, though: using a different MD library, batching stuff using pandoc-server, removing the source highlighting feature, or caching outputs of my SSG library. So I’m not worried. For now I’ll just let it be.
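Of these options, caching is probably the easiest to picture. Below is a minimal sketch of a content-addressed cache, assuming rendering is a pure function of the source text; cached_render and the fake renderer are hypothetical names, not part of my SSG.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_render(source, render, cache_dir):
    """Return render(source), reusing a stored result when the source is unchanged."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    output = render(source)  # the slow step, e.g. a pandoc invocation
    cache_file.write_text(output, encoding="utf-8")
    return output


# Demo with a fake renderer that records how often it is called.
calls = []


def fake_render(text):
    calls.append(text)
    return f"<p>{text}</p>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_render("hello", fake_render, Path(tmp))
    second = cached_render("hello", fake_render, Path(tmp))

print(first, second, len(calls))
```

Keying on the source text alone ignores changes to templates or pandoc options, so a real version would need to fold those into the hash as well.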

My personal verdict

If you’re running a large project, and have allocated time and money ~~to throw at the problem~~ to make something nice: definitely consider rST. But: expect to actually improve docutils a bit to get where you need to be. I don’t think this is really a downside! Contributing back to the community is important.

However, if you “just” need something to work, and/or want to tap into the rich MD ecosystem, then MD+pandoc is your best bet. The combination might be a bit limiting in places, but I’ve not seen anyone encounter this barrier yet.

Again, I want to put in an honorable mention for Sphinx, which seems to be a good candidate if you don’t mind a more heavyweight and opinionated tool. It clearly has a lot to offer, but, to me, seems less like a small cog you drop into your existing workflow, and more like a powerful engine you build your product around.