Can ChatGPT learn to add?
Recently, I’ve been thinking a lot about entropy as it relates to transformer-based LLMs. Some of these models contain hundreds of billions of parameters, which is certainly a lot! There is a great deal of information you can encode with 100+ billion parameters, which is why I’ve been thinking about addition as it relates to transformers. The rules of addition are very simple; even small children can learn them, and the rules are almost certainly encodable given 100 billion parameters. However, while the rules are simple, the entropy required to memoize 64-bit addition vastly exceeds the capacity of even the largest LLMs: there are 2^128 possible pairs of 64-bit operands, far more than any parameter count could cover as a lookup table. Additionally, the representation of binary numbers is extremely compact, meaning adding two 64-bit numbers should fit well under the token limit of any model. This compactness of representation also means it’s easy for a human to create examples that would be impossible to memoize. So this raises the question: can a stochastic parrot learn to add? If not, does this represent a fundamental limit of transformers?
Here is my first attempt at describing binary addition to GPT-4-backed ChatGPT:
I will describe binary addition to you. A binary number is a string of 0s and 1s. The least significant bit is the rightmost digit, and the most significant bit is the leftmost. To add two binary numbers, start with the least significant digit of both numbers and add them. 0 and 0 add to 0 with 0 carry. 0 and 1 add to 1 with 0 carry. 1 and 1 add to 0 with 1 carry. Repeat this process from least significant bit to most significant bit. For each bit after the least significant bit, add the previous carry to the result for that bit as well. The output of the addition of the nth least significant bit becomes the nth least significant bit of the answer.
Now add two binary numbers: 1001 and 0110
Here is the output:
You've explained binary addition correctly. Now let's add the two binary numbers you provided: 1001 and 0110.
1001
+ 0110
Starting from the least significant bit:
1. 1 (rightmost bit of 1001) + 0 (rightmost bit of 0110) = 1, with 0 carry.
2. 0 + 1 = 1, with 0 carry.
3. 0 + 1 + 0 (carry from previous step) = 1, with 0 carry.
4. 1 + 0 + 0 (carry from previous step) = 1, with 0 carry.
So, the result of the binary addition of 1001 and 0110 is 1111.
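The procedure described in the prompt is ordinary ripple-carry addition, and it is easy to sanity-check. Here is a minimal Python sketch of that procedure (written by me, not produced by ChatGPT):

```python
def ripple_carry_add(a: str, b: str) -> str:
    """Add two equal-length binary strings bit by bit, LSB first,
    exactly as described in the prompt above."""
    result = []
    carry = 0
    # Walk from the rightmost (least significant) bit to the leftmost.
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        result.append(str(total % 2))   # this bit of the answer
        carry = total // 2              # carry into the next bit
    if carry:
        result.append("1")
    return "".join(reversed(result))

print(ripple_carry_add("1001", "0110"))  # → 1111, matching ChatGPT's answer
```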
Wow, amazing! It appears to understand how to do addition. Now let’s generate two random 32-bit numbers, say 10011111110100011100011011110010 and 00111010100101000100000101111011.
Prompt: Now add 10011111110100011100011011110010 and 00111010100101000100000101111011. Do not explain your work. Just output the answer.
ChatGPT: The result of the binary addition is 110010100110011000001000011010101.
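Python’s integer parsing makes ChatGPT’s answer easy to check. The correct sum below was computed with this snippet, not taken from the model:

```python
a = int("10011111110100011100011011110010", 2)
b = int("00111010100101000100000101111011", 2)
print(format(a + b, "b"))   # → 11011010011001100000100001101101 (the correct sum)

chatgpt_answer = int("110010100110011000001000011010101", 2)
print(a + b == chatgpt_answer)  # → False
```

Note that ChatGPT’s answer is not even the right length: it is 33 bits long, while the true sum fits in 32.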
Clearly, ChatGPT didn’t pay enough attention in arithmetic! Humans have a hard time comprehending how much information it’s possible to encode inside billions of parameters. What transformers have been made to do is beyond impressive, but I posit they can only work properly in low-entropy spaces, where the correct answer to the prompt can be contained inside the parameter space. This limitation will prevent transformers from becoming a general human replacement.
sharemywin on Hacker News raised a good point: would a human be able to compute the answer without showing work? Not that ChatGPT and a human should necessarily have the same abilities, but there is something to the thought that “showing one’s work” encodes the answer in a way that is usable by ChatGPT to come to the correct answer. So I revised my request to add two 32-bit numbers:
Now add 10011111110100011100011011110010 and 00111010100101000100000101111011. Limit work on each bit to a single line.
Sure, let's add the two binary numbers you provided, keeping each bit operation to a single line.
10011111110100011100011011110010
+ 00111010100101000100000101111011
Starting from the least significant bit:
1. 0 + 1 = 1
2. 1 + 0 = 1
3. 0 + 1 = 1
4. 0 + 0 + 1 (carry from step 3) = 1
5. 1 + 1 = 0, with 1 carry.
6. 1 + 0 + 1 (carry from step 5) = 0, with 1 carry.
7. 0 + 0 + 1 (carry from step 6) = 1
8. 1 + 1 = 0, with 1 carry.
9. 1 + 0 + 1 (carry from step 8) = 0, with 1 carry.
10. 0 + 0 + 1 (carry from step 9) = 1
11. 0 + 1 = 1
12. 1 + 0 = 1
13. 1 + 1 = 0, with 1 carry.
14. 1 + 0 + 1 (carry from step 13) = 0, with 1 carry.
15. 0 + 1 + 1 (carry from step 14) = 0, with 1 carry.
16. 0 + 0 + 1 (carry from step 15) = 1
17. 1 + 1 = 0, with 1 carry.
18. 1 + 0 + 1 (carry from step 17) = 0, with 1 carry.
19. 1 + 1 + 1 (carry from step 18) = 1, with 1 carry.
20. 0 + 0 + 1 (carry from step 19) = 1
21. 0 + 1 = 1
22. 1 + 0 = 1
23. 0 + 0 = 0
24. 1 + 1 = 0, with 1 carry.
25. 0 + 0 + 1 (carry from step 24) = 1
26. 0 + 1 = 1
27. 1 + 0 = 1
28. 0 + 0 = 0
29. 1 + 1 = 0, with 1 carry.
30. 1 + 0 + 1 (carry from step 29) = 0, with 1 carry.
31. 1 + 1 + 1 (carry from step 30) = 1, with 1 carry.
32. 0 + 0 + 1 (carry from step 31) = 1
So, the result of the binary addition is:
11011001000010100010001110101101
Still wrong! To me, this indicates one of two possibilities:
1. More than 100 billion parameters are required to encode the rules of addition inside a transformer.
2. What we see when ChatGPT “understands” a complicated piece of logic is really just a low-entropy example that has somehow been directly encoded into the parameter space of the model.
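For comparison, the kind of step-by-step trace ChatGPT was asked to produce can be generated mechanically. Running this (my own script, not model output) shows the model’s work diverging as early as step 2, where the actual operand bits are 1 and 1, not 1 and 0:

```python
def trace_add(a: str, b: str) -> list[str]:
    """Generate the per-bit work lines for adding two binary strings."""
    lines, carry = [], 0
    # enumerate from 1 so step numbers match the transcript style above
    for i, (x, y) in enumerate(zip(reversed(a), reversed(b)), start=1):
        total = int(x) + int(y) + carry
        lines.append(f"{i}. {x} + {y} + {carry} = {total % 2}, carry {total // 2}")
        carry = total // 2
    return lines

for line in trace_add("10011111110100011100011011110010",
                      "00111010100101000100000101111011"):
    print(line)
```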
Others on Hacker News have complained that this example is unfair because the tokenization model doesn’t tokenize binary numbers well. I think this criticism is unfounded. Even when tokenized, each binary number is a handful of smaller binary pieces. If you gave these smaller binary pieces to a human, it would be trivial to apply the rules of addition. However, I thought it worth trying better tokenization anyway, so I redid the addition prompt with spaces between each binary digit and verified that each binary digit was a separate token. Here is the output:
Sure, let's add the two binary numbers:
1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 0 1 0
+ 0 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1
______________________________________________________________
Starting from the least significant bit and moving towards the most significant bit:
1. 0 + 1 = 1, carry 0
2. 1 + 1 = 0, carry 1
3. 0 + 1 + 1 = 0, carry 1
4. 0 + 1 + 1 = 0, carry 1
5. 1 + 1 + 1 = 1, carry 1
6. 1 + 0 + 1 = 0, carry 1
7. 1 + 1 + 1 = 1, carry 1
8. 0 + 0 + 1 = 1, carry 0
9. 1 + 1 = 0, carry 1
10. 1 + 0 + 1 = 0, carry 1
11. 1 + 1 + 1 = 1, carry 1
12. 0 + 0 + 1 = 1, carry 0
13. 1 + 1 = 0, carry 1
14. 0 + 0 + 1 = 1, carry 0
15. 0 + 1 = 1, carry 0
16. 1 + 0 = 1, carry 0
17. 1 + 0 = 1, carry 0
18. 1 + 0 = 1, carry 0
19. 0 + 0 = 0, carry 0
20. 0 + 0 = 0, carry 0
21. 0 + 0 = 0, carry 0
22. 1 + 1 = 0, carry 1
23. 1 + 0 + 1 = 0, carry 1
24. 0 + 1 + 1 = 0, carry 1
25. 1 + 1 + 1 = 1, carry 1
26. 1 + 1 + 1 = 1, carry 1
27. 1 + 1 + 1 = 1, carry 1
28. 1 + 1 + 1 = 1, carry 1
29. 0 + 0 + 1 = 1, carry 0
30. 0 + 1 = 1, carry 0
31. 0 + 0 = 0, carry 0
32. 1 + 0 = 1, carry 0
The sum of the two binary numbers is:
1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 0 1 0
ChatGPT still fails to apply the rules of addition.
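In fact, it is worth noting what the model actually produced here: the claimed “sum” is, character for character, the first operand echoed back. A quick check of the strings from the transcript above:

```python
first_operand = "1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 0 1 0"
second_operand = "0 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1"
claimed_sum = "1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 0 1 0"

# The "sum" is just the first input repeated verbatim...
print(claimed_sum == first_operand)  # → True

# ...and it is not the actual sum of the two operands.
a = int(first_operand.replace(" ", ""), 2)
b = int(second_operand.replace(" ", ""), 2)
print(format(a + b, "b") == claimed_sum.replace(" ", ""))  # → False
```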