The number of optimal addition chains for *n* is denoted *NMC(n)*. For 27 we have *NMC(27) = 5*, since there are five different optimal addition chains:

1 2 3 6 9 18 27

1 2 3 6 12 15 27

1 2 3 6 12 24 27

1 2 4 5 9 18 27

1 2 4 8 9 18 27

*NMC* counts addition chains as unique sequences of numbers without regard to how each element is constructed. For example:

1 2 3 4 7

1 2 3 5 7

1 2 3 6 7

1 2 4 5 7

1 2 4 6 7

*NMC(7) = 5*

We count the chain 1 2 3 4 7 only once even though there are two ways of generating it: 4 = 3 + 1 and 4 = 2 + 2.
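This counting convention can be checked with a tiny brute-force search. The sketch below is a minimal, unoptimized illustration (the names `count_chains` and `nmc` are mine, not from the program described later): it enumerates increasing addition chains by depth-first search and recurses once per *distinct* next value, so 1 2 3 4 7 is counted a single time even though 4 has two decompositions.

```c
#include <assert.h>

#define MAXN 64

/* Count addition chains for n of exactly `len` steps, as unique value
   sequences: at each step we recurse once per distinct next value, so
   1 2 3 4 7 is counted once even though 4 = 3 + 1 = 2 + 2. */
static int count_chains(int *a, int depth, int len, int n) {
    if (depth == len) return a[depth] == n ? 1 : 0;
    int found = 0;
    char seen[2 * MAXN + 1] = {0};          /* distinct sums tried so far */
    for (int j = 0; j <= depth; j++)
        for (int k = j; k <= depth; k++) {
            int v = a[j] + a[k];
            if (v <= a[depth] || v > n || seen[v]) continue;
            seen[v] = 1;
            a[depth + 1] = v;
            found += count_chains(a, depth + 1, len, n);
        }
    return found;
}

/* l(n) and NMC(n) by iterative deepening: the first length with any
   chain ending at n is optimal, and the count at that length is NMC(n). */
static int nmc(int n, int *l_out) {
    int a[32] = {1};
    for (int len = 0;; len++) {
        int c = count_chains(a, 0, len, n);
        if (c > 0) { if (l_out) *l_out = len; return c; }
    }
}
```

Running this reproduces the values above, e.g. *NMC(7) = 5* with *l(7) = 4* and *NMC(27) = 5* with *l(27) = 6*.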

A small table of the early *NMC* values:

| *n* | *l(n)* | *NMC(n)* | *n* | *l(n)* | *NMC(n)* |
|----:|-------:|---------:|----:|-------:|---------:|
| 1 | 0 | 1 | 11 | 5 | 15 |
| 2 | 1 | 1 | 12 | 4 | 3 |
| 3 | 2 | 1 | 13 | 5 | 10 |
| 4 | 2 | 1 | 14 | 5 | 14 |
| 5 | 3 | 2 | 15 | 5 | 4 |
| 6 | 3 | 2 | 16 | 4 | 1 |
| 7 | 4 | 5 | 17 | 5 | 2 |
| 8 | 3 | 1 | 18 | 5 | 7 |
| 9 | 4 | 3 | 19 | 6 | 33 |
| 10 | 4 | 4 | 20 | 5 | 6 |

Achim Flammenkamp wrote a program to calculate *NMC* values by enumerating all addition chains for *n* with *l(n) ≤ r*. The code built an array of addition chain values, keeping track of which values had not yet been used. If the chain was not at its end position, the unused elements were added together and pruned using a class A bound [1] derived from *c(r)*. *c(r)* is the smallest number whose optimal addition chain has length *r*; so *l(c(r)) = r*, and if *m < c(r)* then *l(m) < r*. A class A bound is very simple: if we are at position *i < r* in the chain with an element *m*, then we require 2^{r-i}m ≥ *c(r)*.
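The bound amounts to one shift and one compare. A minimal sketch (the function name is mine): the first few values of *c(r)* are 1, 2, 3, 5, 7, 11, 19, 29, 47, 71 — the smallest *n* with *l(n) = r* — and a branch is pruned when even doubling at every remaining step cannot reach *c(r)*.

```c
#include <stdint.h>

/* First few values of c(r), the smallest n with l(n) = r:
   c(0)=1, c(1)=2, c(2)=3, c(3)=5, c(4)=7, c(5)=11, c(6)=19, ... */
static const uint32_t c_table[] = {1, 2, 3, 5, 7, 11, 19, 29, 47, 71};

/* Class A bound: at position i in a chain of target length r, with
   element m, the best we can do is double r - i times. If even
   2^(r-i) * m falls below c(r), no element reached at depth r can
   have optimal length r, so the branch is pruned. */
static int class_a_prune(int i, int r, uint32_t m) {
    return ((uint64_t)m << (r - i)) < c_table[r];
}
```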

Achim calculated *NMC(n)* for *n ≤ 2^{22}*. I added the class F bounds [2]. To do this we generate the class F bound for each integer.

Instead of recursively searching for addition chains in a depth-first manner, I shifted to a push-down stack of problems as described in [3]. This technique, though a little more complicated, allows the code to be converted easily to run on multi-core processors. Idle processors/threads can request that threads with active values on their stacks pass their largest problems (those at the smallest depths) to the idle processors via a queue in global memory. This scales very nicely, even to systems with a great many cores.

If you think about how you might code this search, you need an array of *l(n)* values so that when you find an addition chain for *n* at depth *r* you can tell whether it is optimal; if it is, you record that fact by incrementing the entry for *n* in the *NMC* array. Updating *NMC(n)* requires an interlocked operation, as it's done from multiple threads concurrently. These memory accesses and shared-memory writes make the code memory bound (limited by the speed of the paths to and from memory). We can avoid the interlocked operations by splitting the *NMC(n)* counts on a per-thread basis and combining them at the end of execution.
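A minimal sketch of this per-thread split, assuming pthreads (the array names and the stand-in worker are mine, not the actual search code): each thread bumps only its private counters with plain stores, and the private arrays are summed once after all threads join.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NMAX 1024
#define NTHREADS 4

/* One private counter array per thread: increments need no interlocked
   (atomic) operation because no other thread touches this array. */
static uint32_t per_thread_nmc[NTHREADS][NMAX];

struct job { int tid; };

/* Stand-in for the search: each thread "finds" some chains and bumps
   only its own counters. The real program would increment when an
   optimal chain for n is completed at depth r. */
static void *worker(void *arg) {
    int tid = ((struct job *)arg)->tid;
    for (int n = tid; n < NMAX; n += NTHREADS)
        per_thread_nmc[tid][n] += 1;       /* plain, non-atomic increment */
    return NULL;
}

/* Combine the private arrays once all threads have joined. */
static void combine(uint32_t *nmc_out) {
    memset(nmc_out, 0, NMAX * sizeof *nmc_out);
    for (int t = 0; t < NTHREADS; t++)
        for (int n = 0; n < NMAX; n++)
            nmc_out[n] += per_thread_nmc[t][n];
}
```

The trade is memory for speed: *NTHREADS* copies of the counter array in exchange for zero contended writes during the search.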

We still waste a lot of time recalculating things, though. Instead, we should search for chains of length *r* only for those *n* with *l(n) = r*. So we run the program with *r = 0*, then *r = 1*, and so on. We can then dispense with the *l(n)* values and instead use a bitmap that has a bit set for *n* if *l(n) = r*.
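The bitmap itself is the standard one-bit-per-integer layout; a minimal sketch (helper names are mine): *n* >> 5 selects the 32-bit word and *n* & 31 the bit within it.

```c
#include <stdint.h>

/* Bitmap over n: one bit per integer, with the bit set iff l(n) equals
   the current pass length r. n >> 5 selects the 32-bit word, n & 31
   the bit number within that word. */
static inline void bitmap_set(uint32_t *bm, uint32_t n) {
    bm[n >> 5] |= (uint32_t)1 << (n & 31);
}

static inline int bitmap_test(const uint32_t *bm, uint32_t n) {
    return (bm[n >> 5] >> (n & 31)) & 1;
}
```

For *n ≤ 2^{28}* this is a 32 MB table per pass instead of a full 32-bit *l(n)* entry per integer.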

I was running this program on a Dell Precision 7920 with dual Intel Xeon 6258R processors: a 56-core/112-thread machine. This machine supports AVX512 and so has registers that can hold 16 32-bit integers. If I restrict the depth to *r ≤ 32* (a huge range) and special-case *n = 2^{32}*, I might be able to do even more.

You then meet a problem that seems to be quite well known in the literature: it's hard to use SIMD in backtracking searches, since some searches run faster than others and you end up with some slots sitting unused unless you rebalance, and rebalancing is costly. I got this to work by changing the code that pulls elements from the stack for new problems so that it pulls multiple elements when there are gaps and combines them.
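The refill step can be sketched in scalar C (the function and variable names are mine; the real code works on zmm registers and masks): idle lanes, indicated by clear bits in an active mask, are topped up from the problem stack before the next vectorized step.

```c
#include <stdint.h>

#define LANES 16

/* Refill idle SIMD lanes: whenever a lane's search has finished (its
   bit in `active` is clear), pop the next problem off the stack into
   that lane, so the 16-wide batch stays as full as possible. Returns
   the updated active mask; *sp is the stack pointer. */
static uint16_t refill_lanes(uint32_t lanes[LANES], uint16_t active,
                             uint32_t *stack, int *sp) {
    for (int i = 0; i < LANES && *sp > 0; i++) {
        if (active & (1u << i)) continue;   /* lane still busy */
        lanes[i] = stack[--*sp];            /* pop a problem into it */
        active |= 1u << i;
    }
    return active;
}
```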

The biggest improvement from AVX512 came from the gathered read. Imagine that at depth *r* we have up to 16 integers packed into a zmm register. We want to look each value up in a bitmap and, only if its bit is set, increment the *NMC* array. I use the _mm512_mask_i32gather_epi32 intrinsic to read from the bitmap in parallel. Each *n* is shifted down by 5 bits to create a 32-bit element offset into the bitmap; ANDing *n* with 31 gives the bit number. The gather-loaded values are shifted down by the bit number, and only the lanes with the bottom bit set are used to increment the *NMC* values.
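Here is the per-lane logic written out as scalar C (a sketch of my own to show the data flow; the real code does all 16 lanes at once with _mm512_mask_i32gather_epi32 and mask registers rather than this loop):

```c
#include <stdint.h>

#define LANES 16

/* Scalar equivalent of the masked-gather step: for every active lane
   holding a chain endpoint n, fetch bitmap word n >> 5, shift it right
   by bit number n & 31, and bump NMC[n] only when the bottom bit is
   set (i.e. when l(n) == r for the current pass). */
static void gather_and_count(const uint32_t *bitmap, uint32_t *nmc,
                             const uint32_t n_lanes[LANES], uint16_t mask) {
    for (int i = 0; i < LANES; i++) {
        if (!(mask & (1u << i))) continue;      /* inactive lane */
        uint32_t n = n_lanes[i];
        uint32_t word = bitmap[n >> 5];         /* the gathered load */
        if ((word >> (n & 31)) & 1)             /* bit set: optimal at r */
            nmc[n] += 1;
    }
}
```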

Much to my surprise, it turns out the code spends most of its time finding non-optimal chains, so rejecting them in parallel is a big deal. There must be far more non-optimal chains of length *r* than optimal ones. With this code I was able to generate *NMC(n)* for *n ≤ 2^{28}*.

The text file format is the same as the table shown above: *n*, followed by *l(n)*, and then *NMC(n)*.

[1] Thurber EG (1999), *"Efficient Generation of Minimal Length Addition Chains"*, SIAM Journal on Computing. Vol. 28(4), pp. 1247-1263.

[2] Thurber EG and Clift NM (2021), *"Addition chains, vector chains, and efficient computation"*, Discrete Mathematics, Vol. 344(2).

[3] Thurber EG (1999), *"Efficient Generation of Minimal Length Addition Chains"*, SIAM Journal on Computing. Vol. 28(4), pp. 1247-1263.