9 Comments

This is really cool, great step forward! How are the individual measures weighted to form the final score?

Thanks, guys!!! This is amazing!!! How are we to think about the fact that the highest percentage here is 54%? If these models were taking one of my classes, none would be passing at this point. What would it take for a model to raise its score to a more acceptable level?

I feel like this may give too much weight to what the model providers say their models should and should not do, relative to the actual details about the models. The details matter more.

It seems like a fundamental flaw if Llama and BLOOM score so close to GPT-4, which is absolutely not transparent.
