The Number of Optimal Addition Chains for a Given Target

The number of optimal addition chains for n is denoted NMC(n). So for 27 we have NMC(27) = 5 since we have 5 different optimal addition chains:

1 2 3 6 9 18 27
1 2 3 6 12 15 27
1 2 3 6 12 24 27
1 2 4 5 9 18 27
1 2 4 8 9 18 27

NMC counts addition chains as unique sequences of numbers without regard to how each element is constructed. For example:

1 2 3 4 7
1 2 3 5 7
1 2 3 6 7
1 2 4 5 7
1 2 4 6 7

NMC(7) = 5

We count only one chain 1 2 3 4 7 even though there are two ways of generating this chain. We have 4 = 3 + 1 and 4 = 2 + 2. 

A small table of the early NMC values:

 n  l(n) NMC(n)         n  l(n) NMC(n)

  1   0      1                11   5    15
  2   1      1                12   4      3
  3   2      1                13   5    10
  4   2      1                14   5    14
  5   3      2                15   5      4
  6   3      2                16   4      1
  7   4      5                17   5      2
  8   3      1                18   5      7
  9   4      3                19   6    33
10   4      4                20   5      6

Achim Flammenkamp wrote a program to calculate NMC values by enumerating all addition chains for n with l(n) ≤ r. The code built an array of addition chain values keeping track of which values had not been used. If the chain was not at its end position the unused elements were added together and pruned using a class A bound [1] derived from c(r). c(r) is the smallest number whose optimal addition chain is of length r. So, l(c(r)) = r and if m < c(r) then l(m) < r. A class A bound is very simple. If we are at position i < r in the chain with an element m then 2r-im ≥ c(r).

Achim calculated NMC(n) for n ≤ 222. I added the class F bounds [2]. To do this we generate the class F bound for each integer n ≤ 2r to calculate the NMC(n) for any n with l(n) ≤ r. Since these are lower bounds for element values, we take their minimum value in each position. These bounds halved the search time.

Instead of recursively searching in a depth first manner for addition chains I shifted to a push down stack of problems as described in [1]. This technique though a little more complicated allows the easy conversion of the code to run on multi-core processors. Idle processors/threads can request threads with active values on their stacks to pass the largest problems (the smallest depths) to the idle processors via a queue in global memory. This very nicely scales to even very large core systems.

If you think about how, you might code this search, you need an array of l(n) values so that you can tell if you find an addition chain for n at depth r you must record that fact by incrementing the array for the NMC(n) value. Updating NMC(n) requires an interlocked operation as it's done from multiple threads concurrently. These memory accesses and shared memory writes make the code memory bound (limited by the speed of paths to and from memory). We can avoid interlocked operations by splitting the NMC(n) on a per/thread basis and combining at the end of execution.

We waste a lot of time though recalculating stuff. Instead, we should search only for chains of length r for n if l(n)=r. So, we run the program with r=0 then r=1 etc. We can dispense with l(n) values and instead use a bitmap that has a bit set for n if l(n) = r.

I was running this program on a Dell Precision 7920 Dual Intel 6258R. So, a 58/116 core/thread machine. This machine supports AVX512 and so has registers that can hold 16 32-bit integers. If I restrict the depth to r ≤  32 (a huge range) and special case n=232 I might be able to do even more.

You then meet a problem that seems to be quite well known in the literature. It's hard to use SIMD in backtracking searches since some searches run faster than others and you end up with some slots being unused unless you rebalance. Rebalancing is costly. I got this to work by changing the code that pulls elements from the stack for new problems to actually pull multiple elements if there are gaps and combine them.

The biggest improvement from AVX512 came from the gathered read. If you imagine at depth r we have up to 16 integers packed into a zmm register. We want to look each value up in a bitmap and only if the bitmap is set increment the NMC array. I use the _mm512_mask_i32gather_epi32 intrinsic to read from the bitmap in parallel. Each n is shifted down by 5 bits to create a 32-bit element offset in the bitmap. Anding the n with 31 give the bit number. The gather loaded values are shifted down by the bit number and only the values that had the bottom bit set are used to increment the NMC values.

It turns out that much to my surprise the code spends much of its time finding non-optimal chains so rejecting in parallel is a big deal. So, there must be way more no-optimal chains of length r than optimal ones. With this code I was able to generate NMC(n) for n ≤ 228.

nmc28.zip

The text file format is the same as the tables shown above. n followed by l(n) and then by NMC(n).

Knuth [3] says that for c(r) values "The computation of l(n) seems to be hardest for this sequence of n's". This leads some authors to think that finding optimal chains for c(r) values is hard. In fact, this is not the case. If you know l(c(r)) then finding chains appears to be easy as they have very large numbers of chains. What is hard is proving that there is no chain shorter than l(c(r)). Do the c(r) values have the largest number of chains for all numbers with l(n)=r? We might think this is true since a chain for c(r) may contain any number m<c(r) since l(m)<r.

We explore this question in the following table. MNMC(r) is the number n with l(n)=r with the largest NMC(n) value. Values may tie and, in that case, we show them all.

 Do c(r) values have the largest NMC values?
r c(r) NMC(c(r)) MNMC(r) NMC(MNMC(r)) r c(r) NMC(c(r)) MNMC(r) NMC(MNMC(r)) r c(r) NMC(c(r)) MNMC(r) NMC(MNMC(r))
0 1 1 1 1 11 191 9787 191 9787 22 110591 1517938 155604 29390402
1 2 1 2 1 12 379 9076 508 17961 23 196591 6556097 311208 55506704
2 3 1 3,4 1 13 607 31373 758 33076 24 357887 2973869 551880 97132092
3 5 2 5,6 2 14 1087 55262 1214 110624 25 685951 13736135 1103760 208839226
4 7 5 7 5 15 1903 68810 2540 230110 26 1176431 10619809 2207520 406164668
5 11 15 11 15 16 3583 57443 5080 413376 27 2211837 7266410 4415040 732644565
6 19 33 22 40 17 6271 209898 8436 780274 28 4169527 15095632 7257602 2032609296
7 29 132 29 132 18 11231 252024 16872 1508603
8 47 220 58 352 19 18287 665583 33744 2533736
9 71 1258 71 1258 20 34303 1158949 48482 5110044
10 127 2661 142 3903 21 65131 1661908 77802 12383931

[1] Thurber EG (1999), "Efficient Generation of Minimal Length Addition Chains", SIAM Journal on Computing. Vol. 28(4), pp. 1247-1263.

[2] Thurber EG and Clift NM, "Addition chains, vector chains, and efficient computation", Discrete Mathematics, Volume 344, Issue 2, 2021,

[3] Knuth DE (1998), "The art of computer programming, 2: seminumerical algorithms, Addison Wesley", Reading, MA. , pp. 477