When I started this blog in 2024, generating the HTML for this site took between 3 and 5 seconds. This was good enough at the time.
Time passed, and it’s now a year later. I’m archiving one of my old blogs on this site when I notice that HTML generation takes over 20 seconds.
This happened because my old blog consists of 20-some pages and (small) posts, which more or less doubled the volume of my site.
You’d expect a simple homebrew static site generator (SSG) to be quick, but clearly mine was not. After some measuring, it turned out pandoc was the bottleneck. Long story short: I am now caching all calls to pandoc. Excluding cache warmup, HTML generation now takes between 1 and 1.8 seconds, depending on which laptop I’m on and whether it’s charging. And achieving that performance required only very small changes!
Some of the pages of my site require multiple calls to pandoc. The core assumption of my SSG is that calling pandoc is basically free, so I use it pretty much everywhere. For example, a single blog post on this site requires several separate pandoc invocations.
From my experiments, the runtime of pandoc is usually between 50ms and 300ms. As a general purpose swiss-army knife for Markdown documents, this is fine. Especially if you can get pandoc to batch-convert all your files. For my use case, which is invoking pandoc separately for each piece of Markdown-related work I have as if it’s a Python built-in, it’s not ideal.
From the moment I switched to pandoc, the goal was to reduce the code size of my SSG at the cost of run-time performance. That trade-off still holds up; I just hadn’t foreseen that the run-time overhead would be this large.
The longer story of resolving this bottleneck is that I spent the Christmas break working on my own build system, taskgraph. The idea was to formulate dependencies and outputs precisely, so the build script could use graph analysis to figure out how to rebuild only the parts of the site that needed updating.
The general architecture was simple. Each task would define a list of input files and a list of output files. Combining all tasks, this implicitly forms a graph, where tasks and files are nodes, and dependencies and outputs are directed edges. taskgraph would then compute the partial order of the tasks and check each output for changed dependencies. Whenever it detected an outdated output, the corresponding task was executed to regenerate it. A mapping from path to hash of inputs was kept between executions.
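taskgraph’s own code isn’t shown here, but the partial-order computation it describes can be sketched with Python’s standard-library graphlib. The file names below are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical build: each output file maps to the input files it
# depends on, which implicitly defines the task graph.
tasks = {
    "post.ast": ["post.md"],                     # pandoc parse step
    "post.html": ["post.ast", "template.mako"],  # template render step
    "index.html": ["post.html"],                 # index page step
}

# static_order() yields every file so that inputs always come before
# the outputs that depend on them -- the partial order of the build.
order = list(TopologicalSorter(tasks).static_order())
print(order)
```

Walking the files in this order, a build system can compare each output against its inputs and re-run only the tasks whose outputs are stale.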
I used content-based change detection, basically combining and checking file hashes to see if anything needs updating. Computing hashes is not free, but it was generally fast enough for my use case. It also has the benefit that merely deleting a dependency causes a rebuild. This can be a problem in build systems, e.g. GNU Make, where adding or removing a dependency does not always trigger a rebuild.
It was a fun side-project. I got a basic prototype going that has all the essential functionality of a build system and generates roughly 20% of my blog. It also ran pretty much instantly when I would change only one file. It seemed the primary goal of this SSG rewrite was within reach!
Alas, I did not fully move my SSG to taskgraph. There were three problematic downsides.
Problem one: I still needed to port over the remaining 80% of my SSG. While certainly possible, it felt like unnecessary work, especially in the presence of problem two: fine-grained specification of inputs and outputs is annoying and verbose. At least, it is the way I designed taskgraph. Here’s the class for the task that imports Markdown into the Python data structure:
```python
@dataclass
class MdToPandoc:
    paths: MdPaths

    def inputs(self):
        return [self.paths.file_path]

    def outputs(self):
        return [self.paths.ast_path]

    def run(self, ctx):
        doc = pandoc.read(file=str(self.paths.file_path))
        write_pickle(self.paths.ast_path, doc)
        return [self.paths.ast_path]
```

That’s 14 lines, assuming Black formatting. And all it essentially accomplishes is that the doc = ... line is cached. While my SSG does not contain that many moving parts, I was expecting the code size of my SSG to grow by at least an order of magnitude. Sure, you can come up with a shorter inline syntax to define tasks like the one above. And maybe you can make pickling/unpickling of files happen implicitly in taskgraph somehow. But the prospect of having to spell out all required files one by one, and having to pay an order of magnitude more code to maintain, annoyed me.
And that’s only the start of the problem. My SSG heavily relies on arbitrary Python execution in Mako templates. While possible, it’s annoying to fit this model into the mold of taskgraph tasks. I like being able to extend the blog by putting more logic in the templates, keeping the SSG base script small. Adding friction there would be a high price to pay. In addition, the Mako templates are definitely not the bottleneck in the SSG performance. They would therefore gain very little by being cached, so paying the cost of porting them to taskgraph made no sense.
The third problem is that I had basically implemented a generic task library that already exists: pydoit. While I haven’t looked at it in depth, it seems similar to taskgraph, but better. This left me with two choices: use my own large and clunky thing, or introduce another dependency to my SSG.
I took a step back, and realized that I actually only need to speed up the 10 lines of code in my SSG that look like this:
```python
doc = pandoc.read(file=str(self.paths.file_path))
```

This actually wasn’t difficult. This small class is now doing the heavy lifting of all my pandoc-related needs, and caching the results:
```python
class PandocStore:
    def __init__(self):
        self.read_cache = {}
        self.write_cache = {}

    def read(self, doc, options=[]):
        args = (pickle.dumps(doc), tuple(options))
        if args not in self.read_cache:
            result = pandoc.read(doc, options=options)
            self.read_cache[args] = result
        return self.read_cache[args]

    def write(self, doc, options=[]):
        args = (pickle.dumps(doc), tuple(options))
        if args not in self.write_cache:
            result = pandoc.write(doc, options=options)
            self.write_cache[args] = result
        return self.write_cache[args]
```

It’s basically wrapping the pandoc Python package in a simple caching layer.
The nice part is that this approach works properly even when the build script changes. This wasn’t the case with the taskgraph approach: changing the build script required manually cleaning the cache. This sounds easy to detect automatically, but it actually is not. What if a system upgrade silently upgrades one of the libraries Python implicitly uses? Now the cache only needs to be deleted manually if pandoc’s behaviour changes, which is pretty rare.
There are a few small downsides. They are acceptable, and I expect they will remain so in the foreseeable future.1
There’s lots of pickling/unpickling going on. This is required because the Python data structure for a Markdown document is mutable, so I can’t rely on hashing it. Luckily, Python pickling is fast, so it’s not a bottleneck.
Similarly, to hash the list of options, I shallowly freeze it by turning it into a tuple. This works because, for my SSG, the options are only ever strings. If more complicated arguments are ever used, I’ll need to pickle those, too.
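The trick of keying a dict on pickled bytes can be shown in isolation; the nested dict below just stands in for a pandoc document and is not real pandoc output:

```python
import pickle

doc = {"meta": {}, "blocks": ["Hello"]}  # stands in for a mutable pandoc AST
options = ["--wrap=none"]                # example option, a list of strings

# A dict can't key another dict (it's unhashable), but its pickled
# bytes and a frozen tuple of options can.
key = (pickle.dumps(doc), tuple(options))
cache = {key: "<p>Hello</p>"}
print(cache[(pickle.dumps(doc), tuple(options))])  # → <p>Hello</p>
```

Note that this relies on repeated pickles producing identical bytes, which holds here because the key is rebuilt from the very same object each time.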
The new structure required all my calls to pandoc to be path-independent. Basically, any call such as pandoc.read(file=path) had to be changed to pandoc.read(path.read_text()). This was already mostly the case, so refactoring the few remaining calls was easy.
Finally, I also need to empty the cache manually every once in a while. As is, it will keep growing indefinitely. I could adapt the system to remove unused cache entries each run. However, the cache is only 17MB, so it’s not worth the effort yet.
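Removing unused cache entries each run could be a simple mark-and-sweep over the cache dict. A hypothetical sketch, with names that don’t exist in the actual SSG:

```python
class PrunedCache:
    """Dict-backed cache that can evict entries unused in the current run."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})
        self.touched = set()  # keys hit since the last sweep

    def get_or_compute(self, key, compute):
        if key not in self.entries:
            self.entries[key] = compute()
        self.touched.add(key)  # mark: this entry is still needed
        return self.entries[key]

    def sweep(self):
        # Keep only the entries this run actually used, then reset the marks.
        self.entries = {k: v for k, v in self.entries.items() if k in self.touched}
        self.touched = set()
```

Calling sweep() at the end of each build would bound the cache to whatever the current site actually needs.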
Generating my site is now pretty snappy. Maybe it’s only a matter of time until I add some extension of my site that makes generation slow again. Possibly, if that happens a few times, I’ll have to reconsider the build system approach. For now, I’m liking this surgical change because it’s so small. I hope, and expect, that I can apply this approach to future bottlenecks, too.
Generated with BYOB.
License: CC-BY-SA.
This page is designed to last.
This site is part of the UT webring.