Speeding Up HTML Generation by 2000%
Bob Rubbens
When I started this blog in 2024, generating the HTML for this site took
between 3 and 5 seconds. This was good enough at the time.
Time passed, and it’s now a year later. While archiving one of my old
blogs on this site, I noticed that HTML generation took over 20
seconds.
This happened because my old blog consists of 20-some pages and (small)
posts, which more or less doubled the volume of my site.
You’d expect a simple homebrew static site generator (SSG) to be quick,
but clearly mine was not. After some measuring, it turned out pandoc was
the bottleneck. Long story short: I am now caching all calls to pandoc.
Excluding cache warmup, HTML generation times are between 1 and 1.8
seconds, depending on which laptop I’m on and whether it is charging.
And achieving that performance required only very small changes!
How can pandoc even be your bottleneck?
Some of the pages of my site require multiple calls to pandoc. The core
assumption of my SSG is that calling pandoc is basically free, so I use
it pretty much everywhere. For example, the blog posts on this site
require:
- Two Markdown import calls: one to turn the raw Markdown into a
Python datastructure, the other to extract the metadata snippet at
the beginning of a post.
- Another export call to generate some HTML from the datastructure.
- Some calls to pandoc for conversion to other formats: plaintext and
a pretty-printed source.
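In sketch form, the work for a single post amounts to something like
this (simplified; extract_metadata_snippet is a stand-in for the real
logic, not an actual helper in my SSG):

import pandoc

source = open("post.md").read()

# Import call 1: raw Markdown to the Pandoc datastructure.
doc = pandoc.read(source, format="markdown")

# Import call 2: just the metadata snippet at the top of the post.
meta = pandoc.read(extract_metadata_snippet(source), format="markdown")

# One export call for the HTML body...
html = pandoc.write(doc, format="html")

# ...and two more for the plaintext and pretty-printed versions.
plain = pandoc.write(doc, format="plain")
pretty = pandoc.write(doc, format="markdown")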
From my experiments, a single pandoc invocation usually takes between
50ms and 300ms. For a general-purpose Swiss Army knife for Markdown
documents, that’s fine, especially if you can get pandoc to
batch-convert all your files. For my use case, invoking pandoc
separately for every piece of Markdown-related work as if it were a
Python built-in, it’s not ideal. Back of the envelope: at roughly five
calls per post and forty-some pages, even 100ms per call already adds up
to 20 seconds.
From the first moment I switched to pandoc, the goal was to reduce the
code size of my SSG at the cost of run-time performance. That trade-off
still holds up; I just hadn’t foreseen that the run-time overhead would
be this large.
Long story long: prototyping a build system
The longer story of resolving this bottleneck is that I spent the
Christmas break working on my own build system, taskgraph. The idea was
to specify dependencies and outputs precisely, so the build script could
use graph analysis to figure out which parts of the site actually needed
rebuilding.
The general architecture was simple. Each task would define a list of
files as inputs, and a list of files as outputs. Combined, the tasks
implicitly form a graph, where tasks and files are nodes, and
dependencies and outputs are directed edges. taskgraph would then
compute the partial order of the tasks and check each output for changed
dependencies. Whenever it detected an outdated output, the corresponding
task was executed to bring the outputs up to date. A mapping from input
path to content hash was persisted between executions.
I used content-based change detection, basically combining and checking
file hashes to see if anything needed updating. Computing hashes is not
free, but it was generally fast enough for my use case. It also has the
benefit that merely deleting a dependency triggers a rebuild. That can
be a problem in build systems such as GNU Make, where adding or removing
a dependency does not always trigger a rebuild.
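In sketch form, the change detection boils down to something like this
(simplified; the sketch keys the persisted mapping by task rather than
by path):

import hashlib
import pathlib

def task_fingerprint(task):
    # Hash the input list together with the content of every input.
    # Removing an input changes the fingerprint too, so deleted
    # dependencies also trigger a rebuild.
    h = hashlib.sha256()
    for path in sorted(task.inputs()):
        h.update(str(path).encode())
        h.update(pathlib.Path(path).read_bytes())
    return h.hexdigest()

def run_if_outdated(task, fingerprints):
    # fingerprints is the mapping persisted between executions.
    key = type(task).__name__
    fp = task_fingerprint(task)
    missing = any(not pathlib.Path(p).exists() for p in task.outputs())
    if missing or fingerprints.get(key) != fp:
        task.run(None)
        fingerprints[key] = fp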
It was a fun side project. I got a basic prototype going that has all
the essential functionality of a build system and generates roughly 20%
of my blog. It also ran pretty much instantly when I changed only one
file. It seemed the primary goal of this SSG rewrite was within reach!
Alas, I did not fully move my SSG to taskgraph. There were three
problematic downsides.
Shortcomings of taskgraph
Problem one: I still needed to port over the remaining 80% of my SSG.
While certainly possible, it felt like unnecessary work, especially in
light of problem two: fine-grained specification of inputs and outputs
is annoying and verbose. At least, it is the way I designed taskgraph.
Here’s the class for the task that imports Markdown into the Python
datastructure:
class MdToPandoc:
    paths: MdPaths

    def inputs(self):
        return [self.paths.file_path]

    def outputs(self):
        return [self.paths.ast_path]

    def run(self, ctx):
        doc = pandoc.read(file=str(self.paths.file_path))
        write_pickle(self.paths.ast_path, doc)
        return [self.paths.ast_path]
14 lines, assuming Black formatting. And essentially all it accomplishes
is caching the doc = ... line. While my SSG does not contain that many
moving parts, I was expecting its code size to grow by an order of
magnitude, at the very least.
Sure, you can come up with a shorter inline syntax to define tasks like
the one above. And maybe you can make pickling and unpickling of files
happen implicitly in taskgraph somehow. But the prospect of having to
spell out all required files one by one, and of having an order of
magnitude more code to maintain, annoyed me.
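For illustration, such an inline syntax could have looked something like
this (purely hypothetical; taskgraph never grew this feature):

import pickle
import pandoc

def task(inputs, outputs):
    # Hypothetical decorator: derive inputs(), outputs() and run() from
    # two lambdas and the decorated function.
    def wrap(fn):
        class InlineTask:
            def __init__(self, paths):
                self.paths = paths

            def inputs(self):
                return inputs(self.paths)

            def outputs(self):
                return outputs(self.paths)

            def run(self, ctx):
                return fn(self.paths)

        return InlineTask
    return wrap

@task(inputs=lambda p: [p.file_path], outputs=lambda p: [p.ast_path])
def md_to_pandoc(p):
    doc = pandoc.read(file=str(p.file_path))
    with open(p.ast_path, "wb") as f:
        pickle.dump(doc, f)
    return [p.ast_path]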
And that’s only the start of the problem. My SSG relies heavily on
arbitrary Python execution in Mako templates. While possible, fitting
this into the mold of taskgraph tasks is annoying. I like being able to
extend the blog by putting more logic in the templates, keeping the SSG
base script small. Adding friction there would be a high price to pay.
Besides, the Mako templates are definitely not the bottleneck in the
SSG’s performance; they would gain very little from being cached, so
paying the cost of porting them to taskgraph made no sense.
The third problem is that I had basically implemented a generic task
library that already exists: pydoit. While I haven’t looked at it in
depth, it seems similar to taskgraph, but better. This left me with a
choice: use my own large and clunky thing, or introduce another
dependency into my SSG?
The current state of things: caching pandoc
I took a step back, and realized that I actually only needed to speed up
the 10 lines of code in my SSG that look like this:
doc = pandoc.read(file=str(self.paths.file_path))
This actually wasn’t difficult. This small class is now doing the heavy
lifting of all my pandoc-related needs, and caching the results:
class PandocStore:
    def __init__(self):
        self.read_cache = {}
        self.write_cache = {}

    def read(self, doc, options=[]):
        args = (pickle.dumps(doc), tuple(options))
        if args not in self.read_cache:
            result = pandoc.read(doc, options=options)
            self.read_cache[args] = result
        return self.read_cache[args]

    def write(self, doc, options=[]):
        args = (pickle.dumps(doc), tuple(options))
        if args not in self.write_cache:
            result = pandoc.write(doc, options=options)
            self.write_cache[args] = result
        return self.write_cache[args]
It’s basically wrapping the pandoc Python package in a simple caching
layer.
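The part not shown above is persisting the store between runs. That
amounts to something like the following sketch (the cache location and
pickling scheme are illustrative):

import pathlib
import pickle

CACHE = pathlib.Path("pandoc_cache.pickle")  # hypothetical location

# Load the persisted store, or start fresh on the first run.
store = pickle.loads(CACHE.read_bytes()) if CACHE.exists() else PandocStore()

doc = store.read(pathlib.Path("post.md").read_text())  # cached pandoc.read
html = store.write(doc, options=[])                    # cached pandoc.write

# Save the (possibly grown) cache for the next run.
CACHE.write_bytes(pickle.dumps(store))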
The nice part is that this approach keeps working even when the build
script changes. That wasn’t the case with the taskgraph approach:
changing the build script required manually cleaning the cache. That
sounds easy to detect automatically, but it is not: what if a system
upgrade silently updates one of the libraries Python implicitly uses?
Now the cache only needs to be deleted manually when pandoc’s behaviour
changes, which is pretty rare.
There are a few small downsides. They are acceptable, and I expect they
will remain so in the foreseeable future.[1]
There’s lots of pickling and unpickling going on. This is required
because the Pandoc datastructure is mutable and therefore not hashable,
so it can’t serve as a dictionary key directly. Luckily, Python pickling
is fast, so it’s not a bottleneck. Similarly, to make the list of
options hashable, I shallowly freeze it by turning it into a tuple. This
works because, in my SSG, the options are only ever strings. If more
complicated arguments are ever used, I’ll need to pickle those, too.
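A small illustration of why the pickling is needed:

import pickle

doc = ["a", "mutable", "datastructure"]  # stand-in for a Pandoc document
cache = {}

# cache[(doc, ())] = ...  would raise TypeError: unhashable type: 'list'
cache[(pickle.dumps(doc), ())] = "cached result"  # bytes are hashable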
The new structure required all my calls to pandoc to be
path-independent: any call such as pandoc.read(file=path) had to be
changed to pandoc.read(path.read_text()). This was already mostly the
case, so refactoring the few remaining calls was easy.
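A representative before-and-after, reusing the PandocStore from above
(path and names illustrative):

import pandoc
from pathlib import Path

store = PandocStore()
path = Path("posts/example.md")

# Before: pandoc opens the file itself, so the work is tied to the path.
doc = pandoc.read(file=str(path))

# After: the content is the argument, so the cache key depends only on
# what is in the file, not on where it lives.
doc = store.read(path.read_text())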
Finally, I also need to empty the cache manually every once in a while;
as is, it will keep growing indefinitely. I could adapt the system to
remove unused cache entries on each run. However, the cache is only
17MB, so it’s not worth the effort yet.
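If it ever does become worth the effort, a simple mark-and-sweep would
do: track which keys are used during a run, and drop the rest before
persisting. A sketch (evict_unused is hypothetical):

def evict_unused(cache, touched_keys):
    # Keep only the entries whose keys were looked up during this run;
    # everything else belongs to pages that no longer exist.
    return {key: value for key, value in cache.items() if key in touched_keys}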
Going forward
Generating my site is now pretty snappy. Maybe it’s only a matter of
time until I add some extension to my site that makes generation slow
again. If that happens a few times, I may have to reconsider the
build-system approach. For now, I like this surgical change because it’s
so small. I hope, and expect, that I can apply the same approach to
future bottlenecks, too.
[1] Famous last words…