Is GPT-4 getting worse over time?
Thanks for this in-depth analysis. It surprises me how many people are rushing to conclusions without even having read a single page of that paper.
The authors of the manuscript could have framed this in terms of repeatability, or lack thereof. It’s difficult to use instruments that aren’t repeatable for high-stakes applications or for scientific discovery. And, as suspected, LLM outputs don’t appear to be repeatable.
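For what it's worth, repeatability is easy to probe empirically. Here is a minimal sketch, assuming the current OpenAI Python client; the model name and the prompt are just placeholders. It sends the same question repeatedly at temperature 0 and counts how many distinct answers come back.

```python
# Minimal sketch: probe repeatability by sending the same prompt several
# times and counting distinct responses. Assumes the OpenAI Python client
# (pip install openai) and an OPENAI_API_KEY in the environment; the model
# name and prompt below are illustrative placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()


def sample_responses(prompt: str, n: int = 10, model: str = "gpt-4") -> list[str]:
    """Ask the same question n times at temperature 0 and collect the answers."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # nominally deterministic, yet outputs can still vary
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers


if __name__ == "__main__":
    answers = sample_responses("Is 17077 a prime number? Answer 'yes' or 'no'.")
    counts = Counter(answers)
    # Perfect repeatability would yield exactly one distinct response.
    print(f"{len(counts)} distinct responses across {len(answers)} identical calls")
    for answer, count in counts.most_common():
        print(f"{count:2d}x {answer!r}")
```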
Thank you, useful analysis.
This article perhaps needs a conclusion to summarize and synthesize what you are saying, in a form that's easily understandable to a lay audience.
Another great review from you folks.
After reading the paper myself, I found two points in this article that seem to be incorrect.
1. Testing primality - Here is a direct quote from the paper: "The dataset contains 1,000 questions, where 500 primes were extracted from [ZPM+23] and 500 composite numbers were sampled uniformly from all composite numbers within the interval [1,000, 20,000]." Nowhere does this suggest that they only tested prime numbers, as was stated in the article. (A sketch of how such a balanced set can be built appears below, after point 2.)
2. Directly executable - While the name is indeed slightly misleading, this is how the paper defines directly executable: "We call it directly executable if the online judge accepts the answer (i.e., the answer is valid Python and passes its tests)." This explicitly states that the code is considered directly executable if A. it is valid Python and B. it passes its tests. Therefore, according to the paper, they do indeed test for correctness. They also note that, after post-processing and removing the extra markup text, the performance of GPT-4 increased from 50% in March to 70% in June. However, part of their argument was that they explicitly asked GPT-4 to generate only the code, an instruction that it followed in March but failed to follow in June. (The second sketch below illustrates the kind of post-processing involved.)
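To make point 1 concrete, here is a rough sketch of how a balanced set like that can be built. This is my own illustration, not the authors' code: the paper takes its 500 primes from a prior dataset [ZPM+23], while for simplicity this sketch samples both halves from the interval [1,000, 20,000].

```python
# Sketch of a balanced primality dataset in the spirit of the paper's setup:
# 500 primes and 500 composites, the composites sampled uniformly from
# [1,000, 20,000]. The paper's primes come from [ZPM+23]; sampling them from
# the same interval here is a simplification for illustration only.
import random


def is_prime(n: int) -> bool:
    """Trial division; fine for numbers this small."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


LOW, HIGH = 1_000, 20_000
primes = [n for n in range(LOW, HIGH + 1) if is_prime(n)]
composites = [n for n in range(LOW, HIGH + 1) if not is_prime(n)]

random.seed(0)  # fixed seed so the sampled dataset is reproducible
dataset = [(p, "prime") for p in random.sample(primes, 500)] + [
    (c, "composite") for c in random.sample(composites, 500)
]
random.shuffle(dataset)
print(len(dataset), dataset[:3])  # 1,000 labelled questions, half of each class
```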
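And for point 2, the post-processing in question is roughly this kind of cleanup: strip the Markdown code fences the June model wrapped around its answer, then check whether what remains even parses before it would be handed to the judge's tests. A rough sketch of my own; run_tests is a hypothetical stand-in for the online judge, not a real API.

```python
# Sketch of the post-processing discussed in point 2: strip the Markdown
# code fences the June model wrapped around its answer, then check whether
# the remainder is valid Python (i.e., parses) before it would be handed to
# the judge's tests. run_tests below is a hypothetical stand-in for the
# online judge, not a real API.
import ast
import re

FENCE = "`" * 3  # build the literal fence marker without writing it verbatim here
FENCE_RE = re.compile(r"^`{3}(?:python)?\s*\n(.*?)\n`{3}\s*$", re.DOTALL)


def strip_fences(answer: str) -> str:
    """Remove a surrounding Markdown code fence, if present."""
    match = FENCE_RE.match(answer.strip())
    return match.group(1) if match else answer.strip()


def is_valid_python(code: str) -> bool:
    """'Valid Python' in the paper's sense of at least parsing cleanly."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


# A June-style answer: correct code, but wrapped in Markdown despite the
# instruction to generate only the code.
raw_answer = (
    FENCE + "python\nclass Solution:\n    def add(self, a, b):\n        return a + b\n" + FENCE
)

code = strip_fences(raw_answer)
print(is_valid_python(raw_answer))  # False: the fences themselves are not Python
print(is_valid_python(code))        # True once the markup is removed
# run_tests(code)  # hypothetical: submit the cleaned code to the judge's tests
```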
Right, but consistently low repeatability, while perhaps undesirable, is not degradation.
Check these links:
https://www.reddit.com/r/ChatGPT/comments/153xee8/has_chatgpt_gotten_dumber_a_response_to_the/
https://www.reddit.com/r/ChatGPTPro/comments/154jl5s/research_suggests_that_chatgpt_is_getting_dumber/
https://twitter.com/thatroblennon/status/1681685508852965377
Bottom line: The Stanford/Cal paper appears to be heavily flawed research. It hurts not only the reputations of Stanford and Cal but also that of arXiv (Cornell).
Thanks for the analysis. Why capability vs. behaviour? A GPT-4 user cares about the outputs of API calls and may not differentiate capability from behaviour. Capability vs. behaviour is something like latent variable vs. observation, since GPT-4 is not open source.