12 Comments
Jul 19, 2023 · Liked by Sayash Kapoor, Arvind Narayanan

Thanks for this in-depth analysis. It surprises me how many people are rushing to conclusions without even having read a single page of that paper.

Jul 19, 2023 · Liked by Sayash Kapoor, Arvind Narayanan

The authors of the manuscript could have framed this in terms of repeatability, or the lack thereof. It's difficult to use instruments that aren't repeatable for high-stakes applications or for scientific discovery. And, as suspected, LLM outputs don't appear to be repeatable.


Thank you, useful analysis.


This article perhaps needs a conclusion that summarizes and synthesizes the full argument in terms a lay audience can easily understand.


Another great review from you folks.


After reading the paper myself, two points in this article seem to be incorrect.

1. Testing primality - Here is a direct quote from the paper. "The dataset contains 1,000 questions, where 500 primes were extracted from [ZPM+23] and 500 composite numbers were sampled uniformly from all composite numbers within the interval [1,000, 20,000]." Nowhere does this suggest that they only tested prime numbers, as was stated within the article.

2. Directly executable - While the name is indeed slightly misleading, this is how the paper defines directly executable: "We call it directly executable if the online judge accepts the answer (i.e., the answer is valid Python and passes its tests)." This explicitly states that the code is considered directly executable if (a) it is valid Python and (b) it passes its tests. Therefore, according to the paper, they do indeed test for correctness. They also note that after post-processing, i.e., removing the extra markup text, GPT-4's performance increased from 50% in March to 70% in June. However, part of their argument was that they explicitly asked GPT-4 to generate only the code, an instruction it followed in March but failed to follow in June.
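To make point 1 concrete, here is a minimal sketch of how a balanced dataset like the one the paper describes could be built: sample composite numbers uniformly from [1,000, 20,000] alongside a list of primes. The function names, trial-division test, and fixed seed are my own illustrative choices, not the paper's code.

```python
import random

def is_prime(n: int) -> bool:
    """Trial-division primality test; fast enough for n up to ~20,000."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def sample_composites(k: int, lo: int = 1000, hi: int = 20000, seed: int = 0):
    """Sample k composite numbers uniformly, without replacement, from [lo, hi]."""
    rng = random.Random(seed)
    composites = [n for n in range(lo, hi + 1) if not is_prime(n)]
    return rng.sample(composites, k)

# 500 composites, mirroring the 500 primes drawn from [ZPM+23] in the paper.
composites = sample_composites(500)
assert len(composites) == 500
assert all(not is_prime(n) for n in composites)
```

The point is that such a dataset contains both primes and composites by construction, so a model answering "yes, prime" to everything would score only ~50%.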
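The post-processing mentioned in point 2 amounts to stripping the extra markup before submitting the code to the judge. A rough sketch of such a step, assuming the model wraps its answer in Markdown code fences (this is an illustration, not the paper's actual pipeline):

```python
import re

def strip_code_fences(answer: str) -> str:
    """If the model wrapped its code in ```python ... ``` fences,
    return just the code inside; otherwise return the text unchanged."""
    m = re.search(r"```(?:python)?\n(.*?)```", answer, re.DOTALL)
    return m.group(1).strip() if m else answer.strip()

wrapped = "```python\nprint('hello')\n```"
print(strip_code_fences(wrapped))  # → print('hello')
```

Under this reading, June's GPT-4 produced code that passed the tests once de-fenced, but failed the explicit "generate only the code" instruction that March's model followed.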


Right, but consistently low repeatability, while perhaps undesirable, is not degradation.


Thanks for the analysis. Why distinguish capability from behaviour? A GPT-4 user cares about the outputs of API calls, and may not differentiate capability from behaviour. Capability vs. behaviour is something like a latent variable vs. an observation, since GPT-4 is not open source.
