Is GPT-4 getting worse over time?
Thanks for this in-depth analysis. It surprises me how many people are rushing to conclusions without even having read a single page of that paper.
The authors of the manuscript could have framed this in terms of repeatability, or lack thereof. It’s difficult to use instruments that aren’t repeatable for high-stakes applications or for scientific discovery. And, as suspected, LLM outputs don’t appear to be repeatable.
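For what it's worth, repeatability is easy to probe empirically. Here is a minimal sketch, assuming the current OpenAI Python client; the model name and the prompt are just placeholders. It sends the same question repeatedly at temperature 0 and counts how many distinct answers come back.

```python
# Minimal sketch: probe repeatability by sending the same prompt several
# times and counting distinct responses. Assumes the OpenAI Python client
# (pip install openai) and an OPENAI_API_KEY in the environment; the model
# name and prompt below are illustrative placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()


def sample_responses(prompt: str, n: int = 10, model: str = "gpt-4") -> list[str]:
    """Ask the same question n times at temperature 0 and collect the answers."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # nominally deterministic, yet outputs can still vary
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers


if __name__ == "__main__":
    answers = sample_responses("Is 17077 a prime number? Answer 'yes' or 'no'.")
    counts = Counter(answers)
    # Perfect repeatability would yield exactly one distinct response.
    print(f"{len(counts)} distinct responses across {len(answers)} identical calls")
    for answer, count in counts.most_common():
        print(f"{count:2d}x {answer!r}")
```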
Thank you, useful analysis.
This article perhaps needs a conclusion to summarize and synthesize what you are saying, in a form that's easily understandable to a lay audience.
Another great review from you folks.
After reading the paper myself, I found two points in this article that seem to be incorrect.
1. Testing primality - Here is a direct quote from the paper: "The dataset contains 1,000 questions, where 500 primes were extracted from [ZPM+23] and 500 composite numbers were sampled uniformly from all composite numbers within the interval [1,000, 20,000]." Nowhere does this suggest that they only tested prime numbers, as was stated in the article. (A sketch of how such a balanced set can be built appears below, after point 2.)
2. Directly executable - While the name is indeed slightly misleading, this is how the paper defines directly executable: "We call it directly executable if the online judge accepts the answer (i.e., the answer is valid Python and passes its tests)." This explicitly states that the code is considered directly executable if A. it is valid Python and B. it passes its tests. Therefore, according to the paper, they do indeed test for correctness. They also note that, after post-processing and removing the extra markup text, the performance of GPT-4 increased from 50% in March to 70% in June. However, part of their argument was that they explicitly asked GPT-4 to generate only the code, an instruction that it followed in March but failed to follow in June. (The second sketch below illustrates the kind of post-processing involved.)
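To make point 1 concrete, here is a rough sketch of how a balanced set like that can be built. This is my own illustration, not the authors' code: the paper takes its 500 primes from a prior dataset [ZPM+23], while for simplicity this sketch samples both halves from the interval [1,000, 20,000].

```python
# Sketch of a balanced primality dataset in the spirit of the paper's setup:
# 500 primes and 500 composites, the composites sampled uniformly from
# [1,000, 20,000]. The paper's primes come from [ZPM+23]; sampling them from
# the same interval here is a simplification for illustration only.
import random


def is_prime(n: int) -> bool:
    """Trial division; fine for numbers this small."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


LOW, HIGH = 1_000, 20_000
primes = [n for n in range(LOW, HIGH + 1) if is_prime(n)]
composites = [n for n in range(LOW, HIGH + 1) if not is_prime(n)]

random.seed(0)  # fixed seed so the sampled dataset is reproducible
dataset = [(p, "prime") for p in random.sample(primes, 500)] + [
    (c, "composite") for c in random.sample(composites, 500)
]
random.shuffle(dataset)
print(len(dataset), dataset[:3])  # 1,000 labelled questions, half of each class
```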
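And for point 2, the post-processing in question is roughly this kind of cleanup: strip the Markdown code fences the June model wrapped around its answer, then check whether what remains even parses before it would be handed to the judge's tests. A rough sketch of my own; run_tests is a hypothetical stand-in for the online judge, not a real API.

```python
# Sketch of the post-processing discussed in point 2: strip the Markdown
# code fences the June model wrapped around its answer, then check whether
# the remainder is valid Python (i.e., parses) before it would be handed to
# the judge's tests. run_tests below is a hypothetical stand-in for the
# online judge, not a real API.
import ast
import re

FENCE = "`" * 3  # build the literal fence marker without writing it verbatim here
FENCE_RE = re.compile(r"^`{3}(?:python)?\s*\n(.*?)\n`{3}\s*$", re.DOTALL)


def strip_fences(answer: str) -> str:
    """Remove a surrounding Markdown code fence, if present."""
    match = FENCE_RE.match(answer.strip())
    return match.group(1) if match else answer.strip()


def is_valid_python(code: str) -> bool:
    """'Valid Python' in the paper's sense of at least parsing cleanly."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


# A June-style answer: correct code, but wrapped in Markdown despite the
# instruction to generate only the code.
raw_answer = (
    FENCE + "python\nclass Solution:\n    def add(self, a, b):\n        return a + b\n" + FENCE
)

code = strip_fences(raw_answer)
print(is_valid_python(raw_answer))  # False: the fences themselves are not Python
print(is_valid_python(code))        # True once the markup is removed
# run_tests(code)  # hypothetical: submit the cleaned code to the judge's tests
```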
Right, but consistently low repeatability, while perhaps undesirable, is not degradation.
Check these links:
https://www.reddit.com/r/ChatGPT/comments/153xee8/has_chatgpt_gotten_dumber_a_response_to_the/
https://www.reddit.com/r/ChatGPTPro/comments/154jl5s/research_suggests_that_chatgpt_is_getting_dumber/
https://twitter.com/thatroblennon/status/1681685508852965377
Bottom line: The Stanford/Cal paper appears to be heavily flawed research. It hurts not only the reputations of Stanford and Cal but also that of arXiv (Cornell).
Thanks for the analysis. Why capability vs. behaviour? A GPT-4 user cares about the outputs of API calls and may not differentiate capability from behaviour. Capability vs. behaviour is something like latent variable vs. observation, since GPT-4 is not open source.