---
title: "Investigating Erdős Problem #307 with Genetic Programming"
author: Bob Rubbens
publish_timestamp: 2025-10-05T20:36:40+02:00
# modified_timestamp: 2025-09-30T20:39:00+02:00
state: published
template: post.mako
id: f7162567-2e48-44bc-b8e1-8c0ea5a4dfa3
---

A few days ago [Stijn Cambie](https://www.ibs.re.kr/ecopro/stijncambie/) gave a nice presentation at the FMT colloquium about the [Erdős problem database](https://www.erdosproblems.com/) and the community surrounding it. Of the interesting problems discussed in the talk, one in particular caught my eye: [problem #307](https://www.erdosproblems.com/go_to/307) (EP307). Not because I thought this young and naive computer science PhD student could make some progress on it, but because I was curious how badly (or how well!) [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming) (GP) would do at solving it.

In particular, I thought GP was a nice fit here because there is a concise way to encode a candidate solution for EP307 into a genome, i.e. a list of natural numbers. The script was fun to write, but in the end the results were not very impressive. To my amateur mathematician's eyes, it got pretty close to a solution, but I can imagine an expert might still judge the result to be wayyy off.

Skip to the end if you just want to run the code. The script is written using only Python 3.8, so it should run just about anywhere. Otherwise, read on for more about EP307, genetic programming, and some observations about the script's results.

# Erdős problem #307

[Paul Erdős](https://en.wikipedia.org/wiki/Paul_Erd%C5%91s) is a famous mathematician from the previous century. Most people know him from the concept of the [Erdős number](https://en.wikipedia.org/wiki/Erd%C5%91s_number), which is your distance to Erdős in the co-authorship graph, where two authors are connected if they wrote a paper together. I've been told [my Erdős number](https://www.csauthors.net/robert-rubbens/) is 4, but I've *also* been told the average is between 4 and 5. Do with that information what you will `:-)`.

Among mathematicians, Erdős has a reputation for posing and solving many interesting mathematical problems. New Erdős problems are discovered every day and added to the Erdős problem database as mathematicians read papers by Erdős. This post is about [Erdős problem #307](https://www.erdosproblems.com/go_to/307) (EP307), which asks the following.

Do there exist two finite sets of primes $P$ and $Q$ such that

$$
1 = \left(\sum_{p \in P} \frac{1}{p}\right) \left(\sum_{q \in Q} \frac{1}{q}\right)?
$$

So basically: sum the reciprocals of the elements of $P$, do the same for $Q$, and multiply the two results. If you can get to 1 for some $P$ and $Q$, you win. Sounds easy enough! Let's try to find $P$ and $Q$ with GP. Spoiler alert: it won't work. But it's a fun exercise!
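
For example, take $P = \{2, 3\}$ and $Q = \{2, 5\}$:

$$
\left(\frac{1}{2} + \frac{1}{3}\right) \left(\frac{1}{2} + \frac{1}{5}\right) = \frac{5}{6} \cdot \frac{7}{10} = \frac{7}{12}
$$

Not 1, so this particular pair doesn't win, but it shows what evaluating a candidate solution looks like.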

# A brief introduction to Genetic Programming

Genetic programming seeks to find solutions to problems by applying the principle of evolution. The idea is to start with a population of candidate solutions, and then to keep evolving the population, throwing out unpromising candidates every once in a while, until an individual appears that solves the problem: then you're done!

Each individual has a *genome*, which is usually a sequence of integers, but this need not be the case. Then, given two individuals, you can make two *new* individuals by *splicing* their genomes. Splicing works by picking a point somewhere in both genomes, cutting the two genomes there, and then swapping the genome halves. That sounds confusing, so here is a picture:

![An illustration showing how to splice two genomes into two new genomes.](./splicing.svg)
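
In code, one variant of splicing (cutting both genomes at the same point) can be as small as this sketch, where `a` and `b` are genomes represented as Python lists of at least two elements:

```python
import random

def splice(a, b):
    # Cut both genomes at the same random point and swap the tails,
    # as in the picture above.
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```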

That's the essence of it. Of course, you can come up with many ways to combine genomes. For example, another option is to swap a single number between the two genomes, creating two new genomes that are the same as their parents, except at one index.

Another feature that GP algorithms often have is the *mutation* operation. This takes a single genome and mutates it in some way. For example, the mutation might add or remove numbers in the genome, or increment or decrement some numbers.
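
A mutation operator could look something like the following sketch (the probabilities and value ranges are arbitrary choices here, and the genome is assumed to be non-empty):

```python
import random

def mutate(genome):
    g = list(genome)
    roll = random.random()
    if roll < 0.5:
        # Increment or decrement a random gene (but keep it non-negative)
        i = random.randrange(len(g))
        g[i] = max(0, g[i] + random.choice((-1, 1)))
    elif roll < 0.75:
        # Insert a random gene at a random position
        g.insert(random.randrange(len(g) + 1), random.randint(0, 999))
    elif len(g) > 2:
        # Remove a random gene, keeping the genome non-trivial
        del g[random.randrange(len(g))]
    return g
```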

The nice thing about GP is that, to use it, you only need to provide an *encoding* of your problem into GP. Then, because this evolutionary mechanism of splicing, mutation, and natural selection is so general, the algorithm finds a good solution entirely on its own. This encoding requires the following:

1. An automatic way to translate a candidate solution to and from a sequence of integers.
2. A way to check that a genome is "ok". GP works best if this check is as lenient as possible, otherwise it might restrict generation of new genomes.
3. A function that calculates the *quality* of a genome. In a GP context, this is usually called a *fitness function*.

Putting all of this together, you get the following pseudocode:

```python
import random

# generate_ok_individuals, splice, is_ok, fitness, and goal_fitness
# are supplied by the problem encoding (see below).
population = generate_ok_individuals(150)
while fitness(max(population, key=fitness)) < goal_fitness:
  A = random.choice(population)
  B = random.choice(population)
  children = splice(A, B)
  for child in children:
    if is_ok(child):
      population.append(child)

  # Maintain a stable population size by removing the least fit
  while len(population) > 150:
    population.remove(min(population, key=fitness))
```

For some problems the above pseudocode might already suffice, but if not, there are many ways to extend it. That's the beauty of GP: it's a general framework that you can easily extend with your own ideas. You can come up with new ways to combine and mutate genomes, attach probabilities to them, and in this way tune your algorithm to be more or less noisy. Maybe you want to cull a large part of your population every generation. Or perhaps when the search gets stuck, you take the 10 most promising individuals and restart the search. One cool extension is to [splice genomes together with a probability proportional to their relative fitness](https://stackoverflow.com/a/4463613/2078414), which is somewhat resource intensive to do. I also like the idea of [tournament selection](https://en.wikipedia.org/wiki/Tournament_selection), where you hold small tournaments in which genomes compete to reproduce, though it is also resource intensive.
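
To give an idea of how little code such an extension takes, here is a sketch of tournament selection (using `fitness` from the pseudocode above, where bigger is better):

```python
import random

def tournament_select(population, k=5):
    # Sample k individuals and let the fittest one reproduce.
    # A larger k means stronger selection pressure.
    return max(random.sample(population, k), key=fitness)

# In the main loop, this would replace random.choice(population)
# when picking the parents A and B.
```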

That's actually also the fun practical part of GP: making all of this fast. For example, by [memoizing](https://en.wikipedia.org/wiki/Memoization) the result of the fitness function, [clever data structure use](https://en.wikipedia.org/wiki/Heap_(data_structure)) for easily removing the best and worst individuals from the population, and coming up with an encoding that encodes *just* the right amount of information into a genome.
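
Memoizing the fitness function, for example, is nearly free in Python, provided genomes are stored as hashable tuples (`raw_fitness` here is a stand-in for the actual, expensive computation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fitness(genome):
    # genome must be a tuple (lists aren't hashable); repeated
    # lookups of the same genome now cost a dictionary access.
    return raw_fitness(genome)
```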

There's only one requirement (as far as I know; be sure to look into the literature if you're curious about this stuff) for GP to be effective on a problem: the solution space has to be kind of *continuous*. By continuous, I mean that, for most candidate solutions, it should be easy to make them a bit better by applying some small changes. You can imagine that if you always need to change a lot about a candidate solution to improve it, then it's unlikely that splicing and mutation will yield genomes with higher fitness. Luckily, lots of problems have this property, or can be rephrased in continuous terms, so there is lots of literature to read up on if you're interested in using this technique 😄.

To get a better idea of what the encoding into GP can look like, let's have a look at encoding EP307 into GP.

# Encoding EP307 for GP

To complete the encoding of EP307 for GP, we need three things:

1. A back-and-forth translation from sets $P$ and $Q$ to an integer sequence.
2. A definition of what it means for a genome to be "ok" in this context.
3. A fitness function to find out which genome is best. Please make it nice and continuous while you're at it.

## Back-and-forth translation

The back-and-forth translation needs two qualities. First, as said before, it needs to turn two sets into a single list of numbers. Second, it needs to be robust against the splicing and mutation operations that the GP algorithm will apply to the genomes. Ideally, the translation should be robust enough that randomly splicing two genomes together has a very high probability of resulting in a new, valid genome.

### Basic insight: storing lists in a list

We can use the following small insight: you can encode any two lists of integers in a single list of integers, as long as you know where to split the list to get the original lists back. We'll call this special index the *split index*: the (0-based) index of the last element that still belongs to the first list. For example, the lists `[2, 7, 5]` and `[13, 17]` are encoded in the list `[2, 7, 5, 13, 17]`, because splitting just after the split index (2) gives back the original lists. Then, if we prepend the split index to the beginning of the list, we're done. This way, the list `[2, 2, 7, 5, 13, 17]` encodes exactly the candidate solution $P = \{2, 7, 5\}, Q = \{13, 17\}$.

### Making it robust against mutations

We still need to adapt this encoding a bit, because it can break very easily. For example, consider again the genome from earlier, `[2, 2, 7, 5, 13, 17]`. What if a mutation adds a `100` to the front, resulting in the genome `[100, 2, 2, 7, 5, 13, 17]`? If we try to split the rest of the list at index `100`... that doesn't work out! Another problem is that the numbers in the list might not remain prime. For example, if a mutation causes the `7` in the genome to be incremented to `8`, the resulting $P$ contains the number 8, which means $P$ is no longer a set of just primes.

We solve the first problem by always taking the split index *modulo* the length of the remaining list. So in the case of the genome `[100, 2, 2, 7, 5, 13, 17]`, the split index would be $100 \bmod 6 = 4$.

The second problem is solved by not having primes directly in the genome, but *indices of primes*. In other words, we start with a list of primes, say, the first 1000. Then we represent the primes in $P$ and $Q$ by their indices in this list of primes. Applying this indirection to `[2, 2, 7, 5, 13, 17]`, we get: `[2, 0, 3, 2, 5, 6]`.
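
Putting the two fixes together, decoding a genome into $P$ and $Q$ could look something like this sketch (`PRIMES` is assumed to hold the first 1000 primes, the genome is assumed to have at least two genes, and taking the prime indices modulo `len(PRIMES)` is an extra guard to keep mutated genes in range):

```python
def decode(genome):
    split, rest = genome[0], genome[1:]
    # The split index points at the last element of P (modulo the
    # length of the rest of the genome), so cut just after it.
    cut = split % len(rest) + 1
    P = {PRIMES[i % len(PRIMES)] for i in rest[:cut]}
    Q = {PRIMES[i % len(PRIMES)] for i in rest[cut:]}
    return P, Q
```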

Now, there's still a problem. Consider the genome `[1, 0, 1]`, which represents $P = \{2, 3\}, Q = \emptyset$. When $Q = \emptyset$, the final product described by EP307 will never be 1, so we'd like to exclude those solutions from the population. We'll do this by adding this constraint to the definition of when a genome is "ok".

## Defining "ok"

A genome being "ok" means that it is valid, i.e. safe to put in the population. In other words, an ok genome does not violate the constraints imposed by the problem we're trying to solve with GP. What it means to be "ok" is implemented by the `is_ok` function, which returns `True` if a given genome is ok.

Ideally, we'd rather not have to check if each new genome we produce is ok. You might even be able to come up with an encoding that doesn't need it. However, in the interest of time, I decided the encoding I had in mind was good enough, and patched up the remaining holes with `is_ok`.

The first constraint we put in `is_ok` is that the $P$ and $Q$ derived from the genome must be non-empty. That excludes all nonsensical candidate solutions.

What's nice about `is_ok` is that you can use it to further optimize the GP process. For example, [Stijn observed](https://www.erdosproblems.com/go_to/307) that a possible solution of EP307 will also have the following properties:

- $P$ and $Q$ should be disjoint.
- $P$ and $Q$ will have at least 60 elements in total.

By putting these constraints in `is_ok`, we make sure that we don't get individuals in the population that have no chance at ever becoming a solution.
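
In code, `is_ok` with all three constraints could be as simple as this sketch (reusing the hypothetical `decode` from earlier):

```python
def is_ok(genome):
    P, Q = decode(genome)
    return bool(P and Q                      # both sets non-empty
                and not (P & Q)              # P and Q are disjoint
                and len(P) + len(Q) >= 60)   # at least 60 primes in total
```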

There's a hidden assumption in adding these constraints to `is_ok`: we are assuming that we can find a solution to EP307 by randomly splicing together "ok" genomes. But this might be very difficult. It could be the case that by allowing genomes in the population that are actually *not* ok according to Stijn's constraints, we make it easier to get to a solution that *is* in fact ok in terms of Stijn's constraints.

To phrase it differently, imagine GP as finding a series of stepping stones to the other side of a river. Naturally, you restrict yourself to only the rough stones for your route, because you want to have a safe route. But in a river, rough stones are actually rare, so you might not be able to find a safe route. If you relax your search to also include somewhat slippery stones, you might end up with a dangerous route, but at least you'll have a route! That's what we're doing by adding Stijn's constraints: we're only considering the rough stones, and in doing so, assuming there are enough rough stones to get us to the other side.

As I'm just doing this for fun, it doesn't matter too much. For a more serious application of GP, these questions are important.

## Fitness function

Now for the crown jewel of any GP encoding, the *fitness function*. It is actually a bit boring: given a genome $g$, we just extract the sets $P$ and $Q$ from it, and compute the product described by EP307, with the reciprocals and summing and all that. Then we take the absolute difference from 1, and voilà. This is easy to compute, and also continuous in the sense that small changes to $P$ and $Q$ cause small changes in fitness. The only tricky part is that, for this fitness function, a smaller fitness is better than a larger fitness. This means the `min` and `max` need to be flipped around in the pseudocode from earlier.
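
As a sketch, again reusing the hypothetical `decode` from earlier (floats are fine for searching, though verifying an actual solution would call for exact arithmetic with `fractions.Fraction`):

```python
def fitness(genome):
    # The absolute distance of the EP307 product from 1.
    # Smaller is better; exactly 0 means the problem is solved.
    P, Q = decode(genome)
    return abs(sum(1 / p for p in P) * sum(1 / q for q in Q) - 1)
```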

And that's all you need to get started with solving EP307 with genetic programming!

# Observations

While running my script a bunch of times and eyeballing the output, I noticed some things, which, to my amateur mathematician's mind, seemed interesting. When I told Stijn about them, he didn't seem too surprised, so some of these likely follow directly from the definition of EP307.

**Limited candidate solution composition:** any time I inspected the best candidate solution so far, $P$ and $Q$ only ever included some of the first 100 or so prime numbers. I haven't seen the prime sets go beyond 541. I don't have a good explanation or even an intuition for this. To be honest, I would've expected the opposite: that every once in a while you'd see a large prime floating by. Instead, the solutions my script has been reporting have been pretty compact, that is, with few primes missing in between the smallest and largest prime of the solution.

Taking this one step further: if we assume this is an indication that a proper solution to EP307 will only use small primes, a nice optimization could be to only use the first 128 primes. This means you can represent each of $P$ and $Q$ as a 128-bit bitset, i.e. two 64-bit integers, which I'm sure can help speed things up (though you might need a language with access to low-level primitives).
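
As a sketch of the idea (Python integers already behave as arbitrary-width bitsets; in a lower-level language each 128-bit set would be two 64-bit integers):

```python
def to_bitset(S):
    # Bit i is set iff the i-th prime is in S.
    return sum(1 << i for i, p in enumerate(PRIMES[:128]) if p in S)

# With P and Q as bitsets, the disjointness check from is_ok
# collapses to a single AND: (to_bitset(P) & to_bitset(Q)) == 0
```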

**Quick convergence towards a "final" solution:** my script almost always converges to its final solution in fewer than 1000 generations. By "final" I mean that, after it has reached this solution, it rarely finds a better one if you leave it running for an hour or so.

For the record, this is the best solution I've seen, in terms of the exponent of its fitness. I've never seen a solution with a fitness exponent smaller than -9.

- Fitness: 2.611821372986363e-09
- P: [3, 7, 11, 13, 17, 29, 31, 37, 43, 53, 59, 61, 97, 107, 109, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 193, 197, 199, 229, 257, 281, 283, 337, 349, 373, 379, 397, 401, 409, 419, 431, 449, 479, 487, 499, 503, 509, 523]
- Q: [2, 5, 19, 23, 41, 47, 71, 73, 79, 83, 89, 103, 113, 181, 191, 211, 263, 307, 317, 359, 389, 421, 521]

**Is genetic programming the right fit for the problem?** Maybe. I'm fairly sure it's not a bad fit, but calling it a good fit is maybe a bit much. Probably some hybrid algorithm that combines [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) with something [Monte Carlo](https://en.wikipedia.org/wiki/Monte_Carlo_method)-inspired could be faster and find solutions with higher fitness. That is, I think it makes a lot of sense to start with some imperfect solution, and then breadth-first explore the state space around that solution with some randomness sprinkled in. If anyone ends up doing any work in this direction, do let me know `:-)`.

# Script

If you want to have a look at the code: fair warning, it's not polished, but it works. It's built using only Python 3.8 primitives, and includes the first 1000 primes, absolutely *for free!*[^1] So it should be pretty portable. It should also be pretty fertile ground for more GP experiments and refactoring `:-)`.

[You can find it here.](./ep307.py.zip) Enjoy!

[^1]: Licensed under GNU GPLv3.