Good suggestion, especially when it’s time to look at extrapolation to instances larger than those in the training set. I’m ignoring that for now – as a quick check, though, taking the base-24 TST predictor trained on 100,000 start numbers between 1 and 10^{12} and applying it to the original test set of small start numbers between 80,001 and 100,000 gives an accuracy of 0\%.
For TST, I upped the training set to one million start numbers between 1 and 10^{12}. For comparison, the previous result was:
- 100k training start numbers and their TSTs, written in base-24.
- Result: 41\% accuracy on train data, 10\% on test data
- Example instances:
  - 2 8 1 15 13 9 0 15 19 | 13 13 | 7 16 | Wrong
  - 2 12 16 2 0 8 7 5 9 | 5 7 | 6 10 | Wrong
  - 5 15 16 22 3 6 12 6 2 | 7 22 | 7 22 | Right
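For reference, instances in that format can be generated with a short sketch. I'm assuming here that TST means the total stopping time of the Collatz map (number of steps to reach 1), and that both the start number and its TST are written as base-24 digit sequences, most significant digit first:

```python
def total_stopping_time(n: int) -> int:
    """Collatz total stopping time: steps until n reaches 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def to_base24(n: int) -> list[int]:
    """Digits of n in base 24, most significant first."""
    digits = []
    while n:
        n, d = divmod(n, 24)
        digits.append(d)
    return digits[::-1] or [0]

def make_instance(n: int) -> str:
    """One 'input | target' line in the format shown above."""
    x = " ".join(map(str, to_base24(n)))
    y = " ".join(map(str, to_base24(total_stopping_time(n))))
    return f"{x} | {y}"
```

For example, `make_instance(27)` gives `1 3 | 4 15`, since 27 is `1 3` in base 24 and its total stopping time is 111, which is `4 15` in base 24.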
With 10x training data, using the same test set, we get:
| Epoch | Loss | Train Acc | Test Acc |
|---|---|---|---|
| 10 | 1.088 | 8.91% | 9.64% |
| 50 | 1.068 | 10.38% | 10.81% |
| 100 | 1.044 | 11.00% | 11.11% |
| 150 | 1.041 | 11.51% | 11.84% |
| 200 | 1.044 | 11.82% | 11.58% |
| 250 | 1.027 | 12.09% | 11.91% |
| 300 | 1.025 | 12.30% | 12.30% |
This time it doesn’t overfit, and the test accuracy steadily improves (and is still improving). The question is: what is it learning? Some kind of interpolation?
I’ll ponder on this some more.