Discovery via classifiers and/or AI?

This post is more on the computer side, less on the math side.

Here are start numbers (in binary) labeled as {\color{red}0} if their trajectory ends in 5-8-4-2-1, and as {\color{red}1} if it ends in 16-8-4-2-1.

1001010100101011111 {\color{red}0} (start number 305503 in decimal)
11010110011010001001 {\color{red}0}
111100100110011011 {\color{red}0}
1000011101111111 {\color{red}0}
100100011101100001 {\color{red}0}
110111110000101 {\color{red}1}
1111101101100101 {\color{red}1}
1000110101111111 {\color{red}1}
1010010101100101111 {\color{red}0}
11101001101101111 {\color{red}1}
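
Generating these labels is cheap. Here's a minimal sketch (function names mine), using the fact that a trajectory ends in 5-8-4-2-1 exactly when it visits 5, and descends through 16-8-4-2-1 otherwise:

```python
def passes_through_5(n: int) -> bool:
    """True iff the Collatz trajectory of n visits 5 before reaching 1
    (so, under the shortcut map, it ends 5-8-4-2-1); False means it
    descends through 16-8-4-2-1 without touching 5."""
    while n != 1:
        if n == 5:
            return True
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return False

def labeled_example(n: int) -> str:
    """One training line: binary digits of n, then the class label."""
    return f"{n:b} {0 if passes_through_5(n) else 1}"
```

For instance, `labeled_example(305503)` reproduces the first line above (minus the color markup).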

And here are parity sequences (not start numbers!) labeled as {\color{red}1} if they correspond to integer cycle shapes for some 3n+d problem (with \lvert d \rvert < 2^{\text{length}}-3^{\text{weight}}), and {\color{red}0} otherwise. This is the concept of a near-cycle.

00010001111 {\color{red}1} (3n+95 cycle)
111111011101101001011110000 {\color{red}1} (3n+5 cycle)
11101010 {\color{red}0} (3n+13 cycle, yes, but 13 \not\lt 2^8-3^5)
1111111100000 {\color{red}0}
1111111010000 {\color{red}0}
0010100111111 {\color{red}1} (3n+233 cycle)
0011001101111 {\color{red}1} (3n+233 cycle)
0001011111101 {\color{red}0}
0001111101101 {\color{red}0}
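
For reference, the near-cycle label can be computed directly from the standard cycle equation (a sketch in my own notation; assumes 2^{\text{length}} > 3^{\text{weight}}):

```python
from math import gcd

def near_cycle(parity: str):
    """For T_d(n) = (3n+d)/2 (n odd), n/2 (n even), a cycle with parity
    vector v (1 = odd step, read left to right) of length k and weight w
    satisfies  n * (2^k - 3^w) = d * S,  where
        S = sum over ones at position i of 3^(#ones after i) * 2^i.
    With g = gcd(S, 2^k - 3^w), the smallest positive d giving integer n
    is d = (2^k - 3^w) / g, so v is a near-cycle (d < 2^k - 3^w)
    exactly when g > 1.  Returns (is_near_cycle, d)."""
    k, w = len(parity), parity.count("1")
    ones = [i for i, b in enumerate(parity) if b == "1"]
    S = sum(3 ** (w - j - 1) * 2 ** i for j, i in enumerate(ones))
    m = 2 ** k - 3 ** w
    g = gcd(S, m)
    return g > 1, m // g
```

On the examples above this reproduces the labels, e.g. 00010001111 gives d = 95.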

In each case, we have a potentially-infinite supply of train/test examples for a classifier or AI system to locate predictive relationships between input sequence and output class.

Maybe the AI system doesn’t learn the whole concept, but if it does better than chance (on test data), it may have found a piece of the puzzle. For example, it might trivially learn that any power of 2 (any input of the form 10^*) goes through 16 rather than 5 … or, less trivially, that no high-cycle parity vector is a near-cycle.

Interested to know if anyone’s tried this type of classification and/or machine-assisted discovery, or if there are other data sets that might be generated and tried.

Well, if you knew two sequences had followed the same pattern of even vs. odd steps for some time, without knowing when they diverge, then the only difference would be the sizes of the original start numbers. So not only would the parity (e/o) be needed, but also the size of the start number, to determine how it reaches one. To figure it out that way, we would have to bring in the size, or how many multiples of the smallest instance of the pattern it is compared to other instances of the pattern. An AI would need to categorize each type of pattern, then the start number’s size for that pattern, and then the side (5 or 16) it comes down for each number of each pattern, in hopes of finding correlations.

Right, not asking for 100% accuracy on unseen numbers, just slightly better than chance.

The idea is to identify classes of numbers that come down one path or another, or classes of parity vectors that correspond to near-cycles. Something we do by hand already, that might be partially automated.

I’d like to distinguish between “if this worked, would it be useful” and “is this too hard”?

I think it would be useful because people are already doing things like this. For example,

  1. proving that Collatz k-cycles (k<91) can’t be populated by integers
  2. proving that the Collatz high cycle can’t be populated by integers
  3. proving that an integer Collatz circuit (if it existed) would necessarily have co-prime numbers of up-moves and down-moves

As for “is this too hard?” … it’s definitely hard.

Here’s a concrete case. Given parity vector v=11100101101010011, is it a near-cycle? If 2^k-3^x is prime, the answer is no. That covers a big chunk of parity vectors. But, even given vast amounts of labeled data (vectors paired with yes/no), what kind of machine learning system could identify the feature “prime 2^k-3^x” and correlate it with “no”? Pretty certain an LSTM or Transformer is not able to.

To simplify further, imagine a vast labeled dataset consisting of integers (in binary) labeled with “0” (prime) or “1” (composite). What sort of machine learning system could analyze that data and attain 100% classification accuracy on the training set by identifying the correct generalization, leading to perfect performance on integers (input binary strings) much longer than what was given in the training data? A typical neural network can’t, because classifying prime/composite for a new large number requires unbounded computation. So that’s hard.

Although, it’s interesting to think about what examples and/or a priori knowledge a machine needs to discover the concept of primality. Like the “smallest program in class X that’s consistent with a labeled dataset”, maybe. Kids do it unsupervised: “Here’s 11 blocks, I’m gonna make a rectangle … wait, I can’t. How about 12 blocks?”

Even if the machine doesn’t understand the target concept 100%, better-than-chance performance on test data might still be useful. Like, if the system said, “anything divisible by 2, 3, or 5 is in class 1.” Divisibility by 2 is an easy concept to learn on binary integers (check the last digit), although divisibility by 3 looks harder. What sorts of ML architectures can and can’t do it?

A totally easy pattern is “any large-enough power of 2 (as a Collatz start number) passes through 16 on its way to 1.” For that, we just need to identify that the input string starts with 1 and ends with a bunch of 0s. You can definitely imagine a machine learning system picking up on that fact from data alone.

Another intermediate pattern is “high cycles aren’t near-cycles.” A simplification of this would be a vast dataset consisting of high cycles (labeled 1) and non-high cycles (labeled 0). After training, we ask, “is 1101101011010 a high cycle?”


The question is what kind of patterns (1) regarding Collatz trajectories, (2) that we’re interested in … might a machine learning system be able to identify from data?

Here is one attempt that I’m aware of:
Collatz Conjecture: Neural Networks and Total Stopping Time: New TST values are not chaotic.

Unfortunately, nothing is said on the choice of neural network. I tried to reproduce this experiment, which aims to predict the sequence of total stopping times (number of steps needed to reach 1), but the accuracy was no better than 15%. Very likely, my choice of neural network (a CNN) was not suitable.

Yet, this is not classification.

Regarding your two initial proposals, I suspect the classification of trajectories to be (slightly) less messy than that of near vs primitive cycles, according to this post.

Note that the notion of “near” cycles has been extensively studied by Belaga and Mignotte, e.g., in this paper, although the focus was on primitive cycles.


@oros Thanks for the references! I’ll check them out. Both of these examples seem hard … maybe there are easier things to attack.

Sorry, this reference is less relevant than I thought for near cycles… It seems that little is known regarding this notion, which is a particular case of primitive cycle.


And here is a new preprint suggesting that Collatz might become a relevant benchmark for machine-learning in the arithmetic field.

The authors investigate whether a transformer can learn how to perform a sequence of Collatz iterations whose parity vector is of the form 1^s0^t in various bases. Apparently, it is mostly challenging in base 3 (less than 25% accuracy), which is quite interesting. They didn’t try in base 3/2 though…

I would love to see a similar study on total stopping times (TST). If we could achieve 15% accuracy on positive integers of any length, represented in a suitable base, we could focus on those with predictable TSTs. Proving that they reach 1 for even 1% of them would be a breakthrough.

Great idea, and for these kinds of problems, there’s no reason multiple base representations couldn’t simultaneously populate the input string.

I trained transformers for two tasks:

  • Does the trajectory of start number n pass through 5 or 16? Seq2seq formulation: given n as a sequence of digits in base 2, produce a sequence consisting of one digit (0 for 5, or 1 for 16).

  • What is the total stopping time (TST) of start number n? (Number of steps to reach one). Seq2seq formulation: given n as a sequence of digits in base 2 (or 6), produce step count as a sequence of digits in base 2 (or 6).

These are ridiculously hard tasks, because we’re asking the machine to find a pattern that we can’t (as expert humans) find. But what the heck.
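
Data generation is the easy part. Something like this builds the training pairs for the TST task (helper names mine; single-character digits, so bases up to 10):

```python
def total_stopping_time(n: int) -> int:
    """Number of Collatz steps (3n+1 for odd, n/2 for even) to reach 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def to_base(n: int, b: int) -> str:
    """Digit string of n in base b (single-character digits, so b <= 10)."""
    digits = []
    while n:
        digits.append(str(n % b))
        n //= b
    return "".join(reversed(digits)) or "0"

def tst_pair(n: int, base: int = 2):
    """One seq2seq training pair: digits of n -> digits of TST(n)."""
    return to_base(n, base), to_base(total_stopping_time(n), base)
```

E.g. `tst_pair(5)` gives `("101", "101")`, since TST(5) = 5.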


5 vs. 16 task.

80k training instances (randomly-chosen 30-bit numbers), 20k test instances. Four-layer transformer, 512 dimensions, 1000 epochs.

  • Train loss decreases from 0.1126 to 0.0258.
  • Train accuracy: 98.39% (versus 93.78% from always guessing 5)
  • Test accuracy: 90.85% (versus 94.00% from always guessing 5)

Not much luck.

Spurious patterns. The transformer seems to lock onto spurious patterns in the 80k training examples, particularly patterns that identify the rarer “16”-type numbers. Applied to test data, those patterns cause random instances to be classified as “16”-type, making accuracy even worse than “always guess 5”.

Smaller model. Reducing the power of the transformer (4 layers to 2, 512 dimensions to 128) makes train loss drop more slowly and (maybe) encourages it to focus on finding one “big” pattern rather than lots of spurious “small” patterns.

Ray of hope? Under the smaller model scenario, we get a small ray of hope. When we sort the test instances by P(output = 0), we see more bona fide “16”-type numbers at the top of the list (11 in the top 100) than at the bottom (only 5 in the bottom 100). “Hard to say if that’s a coincidence,” as we say in Collatz world.


TST prediction task.

80k training instances (start numbers 1-80k), 20k test instances (start numbers 80,001-100k). Two-layer transformer, 128 dimensions.

| base | loss | epoch | train seq acc | train digit acc | test seq acc | test digit acc |
|---|---|---|---|---|---|---|
| 2 | 0.64 | 20 | 3.2 | 57.6 | 1.2 | 55.9 |
| 2 | 0.37 | 500 | 19.3 | 67.5 | 0.4 | 59.8 |
| 2 | 0.31 | 1000 | 26.9 | 72.4 | 0.3 | 60.5 |
| 6 |  | 20 | 12.5 | 32.5 | 0.9 | 23.9 |
| 6 | 0.76 | 60 | 17.2 | 35.8 | 0.4 | 25.9 |
| 6 | 0.68 | 210 | 21.5 | 39.3 | 0.7 | 26.8 |
| 6 | 0.60 | 1000 | 29.6 | 46.1 | 0.8 | 27.4 |

Not much luck? Test sequence accuracy (guessing TST exactly) is pretty bad, under one percent.

Digit accuracy. Most of the gold answers are three digits long (in base 6). The system is bad at guessing whole TSTs, but better than random (one-sixth) at guessing each digit, because the first digit is very predictable. Training correctly homes in on the fact that a large chunk of TSTs begin with 1 (or 2), so its guesses reflect that. This is some Benford-like behavior, since TSTs are basically logs of large numbers.

Length modeling. The TST transformer models learn “the longer the input, the longer the output,” though the extrapolation isn’t perfect. After training on start numbers from 1 to 80k, the model is tested on 80k-100k, outputting base 2 strings of average length 7.15 instead of (gold) 7.24. When tested on start numbers in the range of one billion, it outputs length 7.51 instead of 7.80.

Data sets. Constructing data sets for TST is tricky. For example, among the first 100k start numbers, there are only 315 unique TST values – some are very popular, appearing near the mode of the distribution. So a first-order approximation is just to learn those popular values and spit them back, no matter what the start number is. So, data selection maybe needs some work.


Comments welcome.

PS. If you’re not so much into number theory, but enjoy machine learning, there are lots of Collatz-related things to do.

The combinatorics in the total stopping time are insane. As I’ve mentioned before, the odd numbers increase sequentially ×1.5, so take just the odds in sequential positions that are powers of 2, like 3, 7, 15, 31…

3,5 increases once

7,11,17 increases twice

15,23,35,53 increases three times

31 increases four times….you get the picture

Then the same pattern starts again from s=10 (start 19), then again from s=14 (start 27)… As start numbers get bigger, complexity grows; even as you increase up through odd starting points in sequential positions of powers of 2, the potential for complexity grows. This, I think, is why probability in any form just doesn’t cut it, as this complexity potential is unbounded. Interestingly, the ×1.5 structure may be the key to the fractal appearance: take 16, 24, 36, 54, 81 and note that the difference between these increases is also ×1.5

16+8

24+12

36+18

54+27. There’s a direct relationship as well, as s16=31 and 31, 47… the difference is the s number. It’s not pleasant to deal with, but this is the nature of the structure: combinatoric complexity increases as the set of whole numbers increases. I don’t see any way around this distribution, although I’m certainly not qualified to say there is no way around it.

Totally agree. It’s almost for sure too much to ask to learn this whole mapping from examples.

The idea is to at least learn some tiny, predictable portion of the mapping from n to TST(n).

For example, “If n is a power of 2, then TST(n) = \log_2 n.” Here, it’s easy to imagine a machine learning algorithm recognizing powers of two, given binary input, and it’s equally easy to imagine it taking \log_2 of such numbers, by counting the zeroes in the input.
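
That rule is literally a few lines over the raw digit string (a toy sketch; function name mine):

```python
def predict_tst_if_power_of_two(bits: str):
    """If the binary input is a 1 followed only by 0s (a power of two),
    TST is just the count of those trailing zeros; otherwise decline
    to predict."""
    if bits and bits[0] == "1" and set(bits[1:]) <= {"0"}:
        return len(bits) - 1
    return None
```

So "10000" (n = 16) gets the prediction 4, and anything else gets no prediction.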

(That particular portion of the overall mapping is too tiny to be of use, but that’s the general idea :))

Yes, both tasks are difficult.
Great investigation on whether the machine does better than us or not!
Some comments:

  • On 5 vs 16, the transformer seems good at compressing training data, though not without loss. And it’s doing rather poorly on extrapolation. It probably missed the existing patterns.

  • On TST, it reaches 25% or so on training data, which is not bad (I got only 15% with TensorFlow, on a larger dataset). Extrapolation is more challenging, and getting down to 1% is not surprising. Very likely the accuracy will tend to 0 when predicting TST on increasingly large inputs. In base 2, the digit accuracy remains above 50%, but not by far… I agree with your explanation related to Benford’s law. I do not expect accurate predictions on the least significant digit in any base.

I wonder what “epoch” means in your results. Usually, it refers to the number of times the neural network passes through the training dataset, which should be a whole number.

@oros Loss and epoch columns were switched – fixed, thanks. Loss means training loss, what the learner is actually trying to minimize (lower is better). Binary digit accuracy gets a boost because the first digit is always 1 :slight_smile:

I’m wondering if there’s a way to pre-factor this system into the model, and use computational methods to notice a pattern in the sets of numbers by creating some kind of bijection that marks the distribution of increasing sets rather than individual integers. If you take a base like 3 (or s=2): 3 leads to 5, and if you iterate 3×4+1 you’ll get all the odd numbers that also lead to 5. This process can be carried out on s by repeating s×4-1, giving 2, 7, 27, 107, 427, 1707. Starting from the first multiple of 3 you can apply s×64-21, then divide them all by 1.5 until they reach an even ‘base’ that’s not divisible by 3:

27,18,12,8

1707,1138

109227,72818

6990507,4660338,3106892

…the distribution of these even s numbers marks the size of the ‘net’ which captures merging sequences. Sorry if this is irrelevant or I’m not explaining myself well. If nothing else, you can target only sequences starting from integers with s ≡ 3 mod 6, and reduce this even further by computing the ‘bases’: any 3 mod 6 that produces an integer via (s+21)÷64 is not a base, and any 3 mod 6 that does not produce an integer under these conditions is a base that all other integers must run through. That cuts out the patterns we already know about, giving the computational system a fresh starting point.

Just a side note, sticking with s: you can take any 1 mod 5 or even s and apply s×4-1 to produce these multiples of 3, which will all be 3 mod 12. Convert these back to n via s×2-1 (to n, Collatz rules apply):

(3(2×2-1)+1)÷2=5

(3(7×2-1)+1)÷2=20

(3(27×2-1)+1)÷2=80… and so on. So above s=2 you can predict the parity vector increasing by 00 after the original 1 starting point; of course the 1 doesn’t repeat as it passes down through s=7 and s=2, so this can help optimise the learning model.

But if you take s=27 and apply s×64-21 as mentioned before, you’ll get the multiples of 3; now divide these by 3 to produce 9 mod 12. These will also merge, but with a different parity pattern:

27÷3=9 … (3(9×2-1)+1)÷4=13 … 13×3+1=40

1707÷3=569 … (3(569×2-1)+1)÷4=853 … 853×3+1=2560

And of course you can take 9, apply s×4-1, and restart the same system, all merging. The 9…569… numbers all enter the path via parity pattern 1001, and then, depending how high up they are, the 0s that follow should be easily calculated. Sorry to bang on; let me know if this is relevant. I’m not just pushing my ideas: I think they can really help to optimise computational systems, by narrowing down start numbers and ruling out the need to check merging integers that sit above them, similar to how we tend to start from odd numbers for the same reason. You can also cut out large parts of the vector if you want to optimise further, as 27 and 9 merge:

27,18,12,8

9,6,4. When dividing by 1.5, for certain even s numbers you can just divide by 2, then continue multiplying by 1.5, and you have an equivalence. If you’re trying to calculate net rise vs. net fall for large numbers, this could highlight net fall over rise by doing away with a partial rising parity vector altogether.

@Fumblingnumbers I can’t follow everything, but it seems like you’re saying that we can group numbers, so that if one number follows a certain pattern, the other numbers in its group will also follow (merge into) that same pattern, which makes a lot of sense.


Yes. If we know the structure above numbers, and we can translate known equalities to the parity sequence, we’ve already done part of the job. You could actually use these to test your learning models, by prompting them to notice rather than explicitly programming them. The simplest pattern to notice is n×3+1=(3(4n+1)+1)÷4. As 4n+1 is always odd, we know the structure that produces multiples of 4 above any odd base: remove the starting 1, add 00, then replace the 1 in the parity sequence as you increase 4n+1 on the starting n. I’m not clear on the binary expressions, so I don’t know how these express themselves in your research, but this pattern will be there, so it’s a good test for any approach, including computational. I like to use my s rules, as the system preserves multiples of 3 on the increases and the decreases, making the prospect of finding inequalities more viable. And it exposes merge rates systematically, in a way that I now only see in Collatz because of this approach, although I’m now starting to see I may be going the long way round. As all n×4+1=n’ congruent to 1 mod 4, you can just change the rule for all 1 mod 4 to -1÷4; this way they will overshoot and fall to even numbers at times, which highlights the combinatoric nature, e.g.
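
A quick check of that identity and the parity-sequence claim (helper name mine): since 3(4n+1)+1 = 4(3n+1), the trajectory of 4n+1 takes one odd step and two extra halvings, then rejoins n's trajectory.

```python
def parity_vector(n: int):
    """Parity of n at each Collatz step (1 = odd, 0 = even) until 1."""
    bits = []
    while n != 1:
        bits.append(n % 2)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return bits

# 3(4n+1)+1 = 4(3n+1), so the parity vector of 4n+1 is 1,0,0 followed
# by n's parity vector with its leading 1 removed, for odd n.
for n in range(3, 200, 2):
    assert parity_vector(4 * n + 1) == [1, 0, 0] + parity_vector(n)[1:]
```

For example, 3 has parity vector 1010000 and 13 = 4·3+1 has 100010000.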

(149-1)÷4=37

(37-1)÷4=9

(9-1)÷4=2

2×3+1=7

(9×3+1)÷4=7

(37×3+1)÷16=7

(149×3+1)÷64=7

The interesting thing is when they fall to multiples of 4 using this

(277-1)÷4=69

(69-1)÷4=17

(17-1)÷4=4

4×3+1=13

So apply -1÷4 to 13 to correct

((4×3+1)-1)÷4=(4×3)÷4=3. This is a good way to see how the combinatorics work as numbers fall to their bases.
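
A quick script confirming the merges above (helper name mine; note the first step applies 3n+1 to the even base 2, matching the computations in the post):

```python
def odd_successor(n: int) -> int:
    """Apply 3n+1, then halve until odd."""
    n = 3 * n + 1
    while n % 2 == 0:
        n //= 2
    return n

# Build the chain 2, 9, 37, 149 by repeating n -> 4n + 1; since
# 3(4n+1) + 1 = 4(3n+1), every element falls to the same odd number, 7.
chain, n = [], 2
for _ in range(4):
    chain.append(n)
    n = 4 * n + 1
assert chain == [2, 9, 37, 149]
assert all(odd_successor(m) == 7 for m in chain)
```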

I replicated the Charton and Narayanan paper @oros pointed to, since I had the transformer scripts in place.

C&N take (odd) n and predict the (odd) endpoint of n’s initial trajectory after p up steps and q down steps. For example, 11 maps to 13, because 11 \rightarrow^{up} 17 \rightarrow^{up} 26 \rightarrow^{down} 13. Here are some sample base-3 training instances:

2222010120010120222010101 → 10202102020220210110211
2001211120022102211201120 → 222111111122111122011111111
2220022110202210221001112 → 102000100101112001102012
20020201202021110012222 → 120211002202002211111111
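
The mapping itself is easy to compute; here is my reading of the task as a sketch (assumes the 1^p 0^q pattern actually matches n's trajectory, as in C&N's setup):

```python
def endpoint(n: int, p: int, q: int) -> int:
    """Apply p 'up' steps n -> (3n+1)/2, then q halvings n -> n/2,
    following a parity pattern of the form 1^p 0^q."""
    for _ in range(p):
        n = (3 * n + 1) // 2
    for _ in range(q):
        n //= 2
    return n
```

E.g. `endpoint(11, 2, 1)` follows 11 → 17 → 26 → 13 and returns 13.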

I ran with 100k examples instead of 300m (all in the range 1 to 10^{26}), with 100 training epochs. I’d like to thank my hardworking laptop who additionally moonlights locally running the software that hosts this forum :slight_smile:

Results:

| Base | Epoch | Loss | Train Seq Acc | Train Digit Acc | Test Seq Acc | Test Digit Acc |
|---|---|---|---|---|---|---|
| 2 | 10 | 0.50 | 0.00 | 64.69 | 0.00 | 64.51 |
| 2 | 20 | 0.46 | 0.00 | 67.15 | 0.00 | 66.86 |
| 2 | 100 | 0.06 | 83.48 | 96.30 | \color{red}80.85 | 95.36 |
| 3 | 10 | 0.83 | 0.00 | 38.86 | 0.00 | 38.86 |
| 3 | 20 | 0.72 | 0.00 | 40.47 | 0.00 | 40.34 |
| 3 | 100 | 0.10 | 51.20 | 68.30 | \color{red}20.71 | 47.30 |
| 6 | 10 | 0.48 | 61.65 | 72.09 | 61.70 | 72.22 |
| 6 | 20 | 0.16 | 79.28 | 83.94 | 79.44 | 84.16 |
| 6 | 100 | 0.03 | 90.30 | 92.43 | \color{red}88.47 | 90.97 |
| 24 | 10 | 0.23 | 89.97 | 93.24 | 89.62 | 92.97 |
| 24 | 20 | 0.12 | 94.39 | 96.43 | 93.50 | 96.00 |
| 24 | 100 | 0.02 | 99.29 | 99.79 | \color{red}96.54 | 98.07 |

These results track C&N and would probably match exactly with more epochs and data.


Now we have some Collatz-related tasks for transformers that turn out to be easy, plus some tasks that are likely impossible. Interested to do more …

I think the point I’m trying to make is about the example pool / starting numbers to test. If we can agree that all numbers not merging immediately to 1 must merge through certain numbers, this makes computing exponentially more efficient as the test block gets bigger. Your computer will thank you.