Introducing the REFORMS checklist for…

Aug 16, 2023

ML-based science is in trouble. Clear reporting standards for researchers could help.

8 Comments

Aug 17, 2023

Brilliant work!!! Guidelines like these help keep science on the pathway of accuracy, productivity, and ethical clarity.

Michel Schellekens

Aug 16, 2023

I would be interested to know what you make of a recent paper in Nature from AlphaDev which claims to have developed a novel faster sorting algorithm. It did not contain a new sorting algorithm. Coming up with assembly level tricks is not the same as finding a new sorting algorithm. The results contained in it are interesting. It just does not do what it says on the tin. Is AI given a free pass when it comes to reviews in that particular venue?

Arvind Narayanan

Aug 16, 2023

Yes, exactly. The framing of that paper was really unfortunate and it is embarrassing that Nature reviewers/editors apparently did not push back.

Michel Schellekens

Aug 16, 2023

It ends up being counterproductive for the authors. The lack of pushback means that published overstatements, once apparent, end up lowering the credit that would naturally be attributed to the genuine contribution. And it does not do Nature any favours, of course.

Michel Schellekens

Aug 16, 2023

It would be good if the authors corrected the matter openly.

Greg G

Aug 18, 2023

This seems like it applies well to avoiding ML screwups in a commercial setting also.

Test Bench

Aug 17, 2023

Have you requested feedback from Dr. Gelman? This sort of thing seems like it would be right up his alley, even if the subject matter isn’t his area of expertise.

Jörg Wittkewitz

Aug 17, 2023 (edited)

It gets even worse when it comes to real-life research that aims to find causal relations between genAI and outcome variables that many organisations care about, such as productivity. The quality of the texts was rated by three people (who unwillingly fed the training data set, which is a very human form of data leakage), with a self-invented interrater reliability of IRC = .40, which is "small" at best. This one is also in a renowned journal:

https://www.science.org/doi/10.1126/science.adh2586
