22 Comments
Mar 21, 2023 · Liked by Arvind Narayanan

You guys are amazing! Crush the hype! Expose the parrot! :)


Re "Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable." — A quick experiment with GPT-4 suggests that, unlike ChatGPT, GPT-4 is far less sensible to changes in wording. In our (brief) experiments with the same prompt Mitchell used, GPT-4 output the correct answer every time.


I'd guess this is more of the same, and that OpenAI has included these rewordings in the training data for GPT-4.


It's GPT-4's failures, not its successes, that tell us that it can only regurgitate, not reason. (Which everyone should already realize from what an LLM is.)

Mar 26, 2023 · edited Mar 26, 2023 · Liked by Arvind Narayanan

I am a non-expert and even I can understand this! I appreciate this substack and look forward to the book!


Disagree. There is research on world-model emergence, theory of mind, novel content generation (math, coding, etc.), and more. Deployed systems are not simply "naked auto-regressive LLMs" but a combination of supervised, unsupervised, and reinforcement-based components built on neural network architectures (typically with a transformer base) working in unison. The public, not the experts, is the one creating the hype; but to say that these large base models can only regurgitate is far from the truth.


The first job that will disappear with AI is that of scientists who use quantitative measures to predict which jobs will disappear with AI.


So is it then fair to say standardized tests are a completely bogus way of estimating how someone will perform in the real world (based on the above thesis)? Or could they even be directionally damaging? Imagine a doctor who aces their standardized test but does poorly in the real world. Or is the thesis that for humans, standardized tests are a proxy for real-world knowledge, but not so for AI? As in, humans build upon and adapt, whereas AI learning stops at memorized training data? Thanks, and I will take your answer in the comments.


Not at all. The authors' point is that since AI learns differently than humans, these tests of learning are measuring different things for AIs vs. humans. Using standardized tests to compare across members of the same group - where the actual learning being tested is the same - is valid.


You missed my point. What is "learning" vs. "memorization"? That which the authors accuse AI of should hold for humans as well, who can ace standardized tests with enough practice but may perform very poorly in real-world scenarios. I will let the authors respond. Thank you.

Author

Yes, we mentioned in the post that standardized exams have been heavily criticized for this reason even for humans. Our point is that the difficulty of measuring real-world skills with exams is amplified when you use them to evaluate language models, because they do orders of magnitude more memorization than any human possibly could.

Mar 23, 2023 · edited Mar 23, 2023 · Liked by Arvind Narayanan

You might have seen this from Microsoft research. Definitely extends the discussion here https://arxiv.org/pdf/2303.12712.pdf


To argue that standardized tests are "completely bogus" or "directionally damaging" would be a slippery-slope argument. There is a degree to which standardized testing can help evaluate a person's, or even a language model's, skills. And even with a completely fresh test set, an exam is not a substitute for real-world evaluation.

To use your example, a medical doctor needs to complete a long residency before they become a full "doctor". That is how the profession safeguards itself with real-world "testing".


Isn't the biggest issue that ChatGPT is accessing a database, period?

This is exactly like a prospective lawyer being able to bring cheat notes of infinite pages into a bar exam. Whether it is an RDBMS or the equivalent of a pointer-based database is irrelevant; 175B parameters perform like a gigantic database.

And no, 175B parameters is not comparable to human memorization - it is far greater. The entire Encyclopedia Britannica is about 23 gigabytes compressed, so the ChatGPT parameters are at least the equivalent of having memorized the entire Encyclopedia Britannica, and likely far, far more.
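
For what it's worth, a back-of-envelope sketch (my own rough numbers: assuming 16-bit parameters, and taking the 23 GB figure above at face value) makes the scale comparison concrete:

```python
# Back-of-envelope only. Assumes 2 bytes (fp16) per parameter, a common but
# not universal storage format, and the 23 GB figure cited in the comment above.
params = 175e9               # GPT-3-scale parameter count
bytes_per_param = 2          # assumption: 16-bit weights
model_bytes = params * bytes_per_param

britannica_bytes = 23e9      # compressed figure cited above

print(f"Raw weight storage: ~{model_bytes / 1e9:.0f} GB")                             # ~350 GB
print(f"Ratio to the cited Britannica size: ~{model_bytes / britannica_bytes:.0f}x")  # ~15x
```

Raw storage is a crude measure - the weights are lossy, compressed statistics of the training data rather than a verbatim archive - but the order of magnitude supports the point.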

Mar 22, 2023 · Liked by Sayash Kapoor · Comment deleted

The vast, vast majority of human minds are literally incapable of perfect memorization (i.e. zero loss of fidelity) of a large, integral mass of data. Thus the comparison of raw storage capacity is invalid.

Secondly, the point wasn't just raw storage capacity - it was that ChatGPT's 175B parameters are - even on a parameter-to-bit comparison - more than capable of literally memorizing the Encyclopedia Britannica with perfect fidelity. But in point of fact, I believe each of the 175B parameters holds more than 1 bit - i.e. more than just a 1 or 0 state - in which case both storage and processing capacity are far greater. Parameters that are ternary rather than binary, for example, would represent far more states in absolute terms.

So while I fully agree with parts of your response - I think you're missing the key part: that ChatGPT is not obviously "reasoning" as opposed to doing pure recall plus some fudge factor.

Dumping massive bar test prep data into such a construct could as easily create an utterly useless pile of junk that can pass the bar but cannot reliably render a competent legal outcome at any given time for any given situation - which is something very different than an "absent-minded" human legal clerk.

Nor am I being cynical about this - this is precisely the situation that has been ongoing with far less complex "AI" such as image recognition. Image recognition is theoretically far simpler than legal reasoning, yet the litany of image recognition failures is still ongoing.


You guys share amazing information with us.


Whoa, this is a great take on GPT-4! If you gave me internet access during all of the AP tests, I could probably do about as well as GPT-4.


In mid-2022, Bill Gates suggested using AP Bio (Advanced Placement biology exam) to judge GPT.

He told OpenAI to make it "capable of answering questions that it hasn’t been specifically trained for" and that AP Bio was *THE* test to prove their work is a revolutionary "breakthrough". They came back after a few months and GPT aced the test. See

https://www.gatesnotes.com/The-Age-of-AI-Has-Begun?WT.mc_id=20230321100000_Artificial-Intelligence_BG-TW_&WT.tsrc=BGTW

Was Gates fooled by OpenAI, or might unintended "contamination" (as described above) have fooled them all?


I don't think there's any way they COULD have gotten a full set of post-2021 exam questions to ask it, which immediately makes it suspect. The College Board doesn't actually release that many of the multiple choice questions, I don't think, especially not recently.

It would still be accurate to say that GPT wasn't "specifically trained" for the test, in the same way that I might not have "specifically trained" to answer a riddle you ask me? But since you probably learned your riddles from roughly the same media that every other English speaker did, whatever you ask me, there's a pretty good chance I'll have heard it before.


While I responded to the thread above, I felt this adds to the discussion and merits its own comment thread, if the authors are so inclined. This is from Microsoft Research in the last 20 hours: https://arxiv.org/pdf/2303.12712.pdf


In their post above, Arvind Narayanan and Sayash Kapoor (N&K) point out the potential contamination issue, where AI evaluation tests can be contaminated by the data corpus the AI was trained on.

In the pdf link you gave, the MS "Sparks of AGI" authors (Sparks paper authors) were aware of this issue because on page 25 they say (in reference to their deep learning optimizer Python code example), "It is important to note that this particular optimizer does not exist in the literature or on the internet, and thus the models cannot have it memorized, and must instead compose the concepts correctly in order to produce the code.".

However, they don't say how they determined that. As N&K observed, OpenAI's approach for detecting memorization by using substring match is a brittle method for making such a determination while other methods such as embedding distances involve subjective choices. Presumably, the Sparks paper authors found a more robust method that enabled them to make the assertion that they did regarding the potential memorization issue.
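
To make the brittleness concrete, here is a toy illustration (my own sketch, not OpenAI's actual procedure) of why verbatim substring matching misses contamination as soon as a question is reworded:

```python
# Toy illustration of a substring-match contamination check - not OpenAI's code.
# It flags verbatim overlap but is defeated by a trivial rewording.

def substring_contaminated(test_item: str, training_corpus: str, n: int = 50) -> bool:
    """Flag the test item if any n-character window of it appears verbatim in the corpus."""
    windows = [test_item[i:i + n] for i in range(max(1, len(test_item) - n + 1))]
    return any(w in training_corpus for w in windows)

corpus = "An investor buys a bond with a face value of $1,000 and a coupon rate of 5 percent..."
original = "An investor buys a bond with a face value of $1,000 and a coupon rate of 5 percent..."
reworded = "Someone purchases a bond whose face value is $1,000, paying a coupon of 5 percent..."

print(substring_contaminated(original, corpus))  # True  - verbatim overlap is caught
print(substring_contaminated(reworded, corpus))  # False - a light paraphrase slips through
```

An embedding-distance check would catch the paraphrase, but only at the cost of choosing a similarity threshold.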

Do you (or anyone else) know what method the Sparks paper authors used for detecting a potential memorization issue relating to their optimizer Python code example?


Very interesting read. Thank you.
