The Number of Optimal Addition Chains for a Given Target

The number of optimal (minimal-length) addition chains for \(n\) is denoted \(NMC(n)\). For example, \(NMC(27) = 5\), since 27 has 5 different optimal addition chains:

1 2 3 6 9 18 27
1 2 3 6 12 15 27
1 2 3 6 12 24 27
1 2 4 5 9 18 27
1 2 4 8 9 18 27

\(NMC\) counts addition chains as unique sequences of numbers without regard to how each element is constructed. For example:

1 2 3 4 7
1 2 3 5 7
1 2 3 6 7
1 2 4 5 7
1 2 4 6 7

\(NMC(7) = 5\)

We count the chain 1 2 3 4 7 only once even though there are two ways of generating it: 4 = 3 + 1 and 4 = 2 + 2.
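A brute-force computation makes the definition concrete. The sketch below is my own illustration (not the search code described later in this note): it deepens on the chain length and collects minimal-length chains as unique sequences, so duplicate constructions such as 4 = 3 + 1 and 4 = 2 + 2 are counted once.

```python
def nmc(n):
    """Return (l(n), NMC(n)) by brute force: deepen on the chain length r
    and collect all chains of that minimal length as unique sequences."""
    for r in range(64):                 # iterative deepening on chain length
        found = set()                   # set of tuples dedupes constructions
        def search(chain, steps_left):
            last = chain[-1]
            if last == n:
                found.add(tuple(chain))
                return
            # prune: even doubling at every remaining step cannot reach n
            if steps_left == 0 or last << steps_left < n:
                return
            # every extension is a sum of two (not necessarily distinct)
            # earlier elements; chains are strictly increasing
            sums = {a + b for i, a in enumerate(chain) for b in chain[i:]}
            for s in sums:
                if last < s <= n:
                    search(chain + [s], steps_left - 1)
        search([1], r)
        if found:                       # first depth with chains is l(n)
            return r, len(found)
```

For example, `nmc(7)` returns `(4, 5)`, matching the five chains listed above.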

A small table of the early \(NMC\) values:

 \(n\)  \(l(n)\)  \(NMC(n)\)        \(n\)  \(l(n)\)  \(NMC(n)\)

  1    0       1               11    5      15
  2    1       1               12    4       3
  3    2       1               13    5      10
  4    2       1               14    5      14
  5    3       2               15    5       4
  6    3       2               16    4       1
  7    4       5               17    5       2
  8    3       1               18    5       7
  9    4       3               19    6      33
 10    4       4               20    5       6

As \(n\) grows, larger values of \(NMC(n)\) appear, presumably because longer chains can be rearranged in combinatorially more ways. Gage Siebert asked whether I thought \(NMC\) was unbounded. I think the answer is yes, based on the following argument:

We know from [5] that the worst possible addition chain length, \(l(n)=\lambda(n)+v(n)-1\) (essentially, the binary method is optimal; here \(\lambda(n)=\lfloor\log_2 n\rfloor\) and \(v(n)\) is the number of ones in the binary representation of \(n\)), is attained by integers whose set bits are spread out enough to preclude adding them in more than one at a time. For such an \(n\) we can form the addition chain \(1,2,4,\ldots,2^{\lambda(n)}\) and then continue it in \((v(n)-1)!\) ways by adding in the remaining bits one at a time. For example, the number 465 has \(v(465)=5\), so there must be at least \(4!=24\) chains starting \(1,\ldots,2^8\) and finishing by adding in the remaining bits:

1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 100010001 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 100010001 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 101000001 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 101000001 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 110000001 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100000001 110000001 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 100010001 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 100010001 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 101010000 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 101010000 111010000 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 110010000 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 100010000 110010000 111010000 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 101000001 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 101000001 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 101010000 101010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 101010000 111010000 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 111000000 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 101000000 111000000 111010000 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 110000001 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 110000001 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 110010000 110010001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 110010000 111010000 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 111000000 111000001 111010001
1 10 100 1000 10000 100000 1000000 10000000 100000000 110000000 111000000 111010000 111010001

Picking \(n\) with ever larger values of \(v(n)\) therefore gives unboundedly large \(NMC(n)\) values.
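This construction is easy to mechanize. The sketch below is my own illustration of the argument: it builds the \((v(n)-1)!\) chains for a suitable \(n\) by permuting the order in which the low set bits are added in.

```python
from itertools import permutations

def permutation_chains(n):
    """Build the chains 1, 2, ..., 2^lambda(n) followed by adding the
    remaining set bits of n in every possible order: (v(n)-1)! distinct
    chains, each of the maximal length lambda(n) + v(n) - 1."""
    lam = n.bit_length() - 1
    doublings = [1 << i for i in range(lam + 1)]         # 1, 2, ..., 2^lam
    rest = [1 << b for b in range(lam) if (n >> b) & 1]  # the lower set bits
    chains = set()
    for order in permutations(rest):
        chain, acc = list(doublings), 1 << lam
        for bit in order:
            acc += bit                   # add one remaining bit per step
            chain.append(acc)
        chains.add(tuple(chain))
    return chains
```

`permutation_chains(465)` yields exactly the 24 chains listed above (in decimal rather than binary).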

Achim Flammenkamp wrote a program to calculate \(NMC\) values by enumerating all addition chains for \(n\) with \(l(n) \le r\). The code built an array of addition chain values, keeping track of which values had not yet been used. If the chain was not at its end position, the unused elements were added together and pruned using a class A bound [1] derived from \(c(r)\). Here \(c(r)\) is the smallest number whose optimal addition chain has length \(r\); so \(l(c(r)) = r\), and if \(m < c(r)\) then \(l(m) < r\). A class A bound is very simple: if we are at position \(i < r\) in the chain with an element \(m\), then we must have \(2^{r-i}m\ge c(r)\), since doubling at every remaining step is the fastest possible growth.
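As a concrete sketch (the function name is my own; the \(c(r)\) values come from the table later in this note), the bound is a one-line test:

```python
# c(r) for r = 0..10: the smallest n with l(n) = r
C = [1, 2, 3, 5, 7, 11, 19, 29, 47, 71, 127]

def class_a_ok(m, i, r):
    """Class A bound: an element m at position i can still lead to a
    target with l = r only if doubling at every one of the remaining
    r - i steps can reach at least c(r)."""
    return (m << (r - i)) >= C[r]
```

For example, an element of 3 at position 5 cannot contribute to any optimal length-10 chain, since \(3\cdot 2^{5} = 96 < c(10) = 127\).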

Achim calculated \(NMC(n)\) for \(n \le 2^{22}\), \(l(n)\le 22\). I added the class F bounds [2]: we generate the class F bound for each integer \(n\le 2^r\) in order to calculate \(NMC(n)\) for any \(n\) with \(l(n)\le r\). Since these are lower bounds on element values, we take their minimum over all targets in each position. These bounds halved the search time.

Instead of recursively searching for addition chains in a depth-first manner, I shifted to a push-down stack of problems, as described in [1]. This technique, though a little more complicated, makes it easy to convert the code to run on multi-core processors. Idle processors/threads can request that threads with active entries on their stacks pass their largest problems (those at the smallest depths) to the idle processors via a queue in global memory. This scales nicely even to systems with very many cores.
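A toy model of the idea (not the original code; in the real search each entry is a partial addition chain rather than a tree node): replace the recursion with an explicit stack of independent subproblems, so that the shallowest entries, which carry the most remaining work, are the natural ones to hand to idle threads.

```python
def count_leaves_stack(depth):
    """Explicit problem stack instead of recursion.  Each entry is an
    independent subproblem; here a problem at depth d simply splits in
    two until `depth` is reached.  Entries near the bottom of the stack
    have the smallest depth and so are the best ones to donate."""
    stack = [0]                # the root problem at depth 0
    leaves = 0
    while stack:
        d = stack.pop()
        if d == depth:
            leaves += 1        # a fully solved problem
        else:
            stack.extend((d + 1, d + 1))   # split into two subproblems
    return leaves              # a binary tree has 2**depth leaves
```

Because every stack entry is self-contained, donating one to another thread needs no shared state beyond the queue itself.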

If you think about how you might code this search, you need an array of \(l(n)\) values so that you can tell when you have found an addition chain for \(n\) at depth \(l(n)=r\); you then record that fact by incrementing the entry for \(n\) in the \(NMC\) array. Updating \(NMC(n)\) requires an interlocked operation, since it is done from multiple threads concurrently. These memory accesses and shared-memory writes make the code memory bound (limited by the speed of the paths to and from memory). We can avoid the interlocked operations by splitting the \(NMC(n)\) counts on a per-thread basis and combining them at the end of execution.
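A minimal sketch of the per-thread splitting (illustrative only; the names and the thread pool are my own, and the hit lists are hypothetical): each worker tallies into a private counter, and the counters are summed once at the end, so no interlocked adds touch shared memory during the search.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def tally(hits):
    """Tally found chain targets into a thread-private Counter."""
    local = Counter()
    for n in hits:
        local[n] += 1          # plain add: no other thread sees `local`
    return local

# hypothetical per-thread lists of targets for which optimal chains were found
per_thread_hits = [[7, 7, 15], [7, 27], [15, 7]]
with ThreadPoolExecutor(max_workers=3) as pool:
    # combine the private counters only once, after the workers finish
    nmc_counts = sum(pool.map(tally, per_thread_hits), Counter())
```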

We waste a lot of time, though, recalculating things. Instead, we should search for chains of length \(r\) for \(n\) only if \(l(n)=r\). So we run the program with \(r=0\), then \(r=1\), and so on. We can then dispense with the array of \(l(n)\) values and instead use a bitmap that has a bit set for \(n\) exactly when \(l(n) = r\).
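Building that bitmap is straightforward; the sketch below uses my own layout choice of 32-bit words, matching the gather step described later, and takes the \(l(n)\) values from the small table earlier in this note.

```python
def bitmap_for_depth(lengths, r):
    """Pack one bit per n, set exactly when l(n) == r.  32-bit words so
    that n >> 5 selects a word and n & 31 selects a bit within it."""
    words = [0] * ((len(lengths) + 31) >> 5)
    for n, l in enumerate(lengths):
        if l == r:
            words[n >> 5] |= 1 << (n & 31)
    return words

# l(n) for n = 1..20, from the table earlier in the note (index 0 unused)
L_TABLE = [-1, 0, 1, 2, 2, 3, 3, 4, 3, 4, 4, 5, 4, 5, 5, 5, 4, 5, 5, 6, 5]
words = bitmap_for_depth(L_TABLE, 4)   # n with l(n) = 4: 7, 9, 10, 12, 16
```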

I ran this program on a Dell Precision 7920 with dual Intel Xeon Gold 6258R processors: a 56-core/112-thread machine. This machine supports AVX512 and so has registers that can hold 16 32-bit integers. If I restrict the depth to \(r \le 32\) (a huge range) and special-case \(n=2^{32}\), I might be able to do even more.

You then meet a problem that seems to be well known in the literature [4]: it is hard to use SIMD in backtracking searches, because some searches finish faster than others and you end up with unused slots unless you rebalance, and rebalancing is costly. I got this to work by changing the code that pulls elements from the stack for new problems to pull multiple elements when there are gaps and combine them. This seemed to work best when two problems could be combined to fit into one, so that one problem can be deleted after the combination.

The biggest improvement from AVX512 came from the gathered read. Imagine that at depth \(r\) we have up to 16 integers packed into a zmm register. We want to look each value up in a bitmap and, only if the corresponding bit is set, increment the \(NMC\) array. I use the _mm512_mask_i32gather_epi32 intrinsic to read from the bitmap in parallel. Each \(n\) is shifted down by 5 bits to create a 32-bit element offset into the bitmap, and ANDing \(n\) with 31 gives the bit number. The gathered values are shifted down by the bit number, and only the lanes with the bottom bit set are used to increment the \(NMC\) values.
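In scalar form, the per-lane logic is roughly this (a model of one lane of the vector step, not the actual intrinsic code; the toy bitmap and function name are my own):

```python
def bump_if_optimal(values, bitmap, nmc):
    """Scalar model of the vectorized step: read the word bitmap[n >> 5]
    (what _mm512_mask_i32gather_epi32 does for 16 lanes at once), shift
    it down by the bit number n & 31, and increment NMC(n) only when the
    bottom bit of the result is set, i.e. when l(n) == r."""
    for n in values:
        word = bitmap[n >> 5]            # the gathered read
        if (word >> (n & 31)) & 1:       # bit test: chain is optimal
            nmc[n] = nmc.get(n, 0) + 1

# toy bitmap with bits set for n = 7 and n = 9 (i.e. l(7) = l(9) = r)
bitmap = [(1 << 7) | (1 << 9)]
nmc = {}
bump_if_optimal([7, 8, 9, 7], bitmap, nmc)
```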

Much to my surprise, it turns out that the code spends most of its time finding non-optimal chains, so rejecting them in parallel is a big win. There must be far more non-optimal chains of length \(r\) than optimal ones. With this code I was able to generate \(NMC(n)\) for \(n\le 2^{28}\), \(l(n)\le 28\).

nmc28.zip

The text file format is the same as the tables shown above: \(n\), followed by \(l(n)\), and then \(NMC(n)\). Note that the file is sparse, since it contains no numbers \(n\) with \(l(n)>28\). The file should be dense up to \(c(29)=7624319\), since this is the first number requiring more than 28 steps in its optimal addition chains. Gage Siebert kindly sent me a fascinating heat map of \(NMC(n)\) against \(n\):

NMC(n) vs n

Knuth [3] says of the \(c(r)\) values that "The computation of \(l(n)\) seems to be hardest for this sequence of n's". This leads some authors to think that finding optimal chains for \(c(r)\) values is hard. In fact, this is not the case: if you know \(l(c(r))\), then finding chains appears to be easy, as these numbers have very large numbers of optimal chains. What is hard is proving that there is no chain shorter than \(l(c(r))\). Do the \(c(r)\) values have the largest number of chains among all numbers with \(l(n)=r\)? We might think so, since a chain for \(c(r)\) may contain any number \(m<c(r)\), because \(l(m)<r\).

We explore this question in the following table. \(MNMC(r)\) is the number \(n\) with \(l(n)=r\) that has the largest \(NMC(n)\) value. Where values tie, we show all of them.

 Do \(c(r)\) values have the largest \(NMC\) values?
\(r\)    \(c(r)\)   \(NMC(c(r))\)   \(MNMC(r)\)   \(NMC(MNMC(r))\)
 0          1            1             1               1
 1          2            1             2               1
 2          3            1           3,4               1
 3          5            2           5,6               2
 4          7            5             7               5
 5         11           15            11              15
 6         19           33            22              40
 7         29          132            29             132
 8         47          220            58             352
 9         71         1258            71            1258
10        127         2661           142            3903
11        191         9787           191            9787
12        379         9076           508           17961
13        607        31373           758           33076
14       1087        55262          1214          110624
15       1903        68810          2540          230110
16       3583        57443          5080          413376
17       6271       209898          8436          780274
18      11231       252024         16872         1508603
19      18287       665583         33744         2533736
20      34303      1158949         48482         5110044
21      65131      1661908         77802        12383931
22     110591      1517938        155604        29390402
23     196591      6556097        311208        55506704
24     357887      2973869        551880        97132092
25     685951     13736135       1103760       208839226
26    1176431     10619809       2207520       406164668
27    2211837      7266410       4415040       732644565
28    4169527     15095632       7257602      2032609296

[1] Thurber EG (1999), "Efficient Generation of Minimal Length Addition Chains", SIAM Journal on Computing, Vol. 28(4), pp. 1247-1263.

[2] Thurber EG and Clift NM (2021), "Addition chains, vector chains, and efficient computation", Discrete Mathematics, Vol. 344(2).

[3] Knuth DE (1998), "The Art of Computer Programming, Vol. 2: Seminumerical Algorithms", 3rd ed., Addison-Wesley, Reading, MA, p. 477.

[4] Powley C, Ferguson C and Korf RE (1993), "Depth-first heuristic search on a SIMD machine", Artificial Intelligence, Vol. 60(2), pp. 199-242.

[5] Hansen W (1957), "Untersuchungen über die Scholz-Brauerschen Additionsketten und deren Verallgemeinerung", Dissertation, Göttingen.