The “Human-level Performance” of GPT-4
Some critical thinking, please
There has been much excitement about the potential of GPT-4, the latest iteration of OpenAI’s language model, to the point that some talk about “human-level performance”. For instance, GPT-4 has “passed” several standardized admissions tests (such as the GRE and the LSAT) with scores in the top 10% of test takers, and this has been taken as proof that the chatbot is better than most human university applicants.
However, this excitement is based on a misunderstanding of what “human-level performance” actually means. Worse, the term may be misused intentionally, to generate buzz and to capture (and monetize) eyeballs in the process.
The main problem with the term “human-level performance” in AI is that performance is measured only on particular tasks, such as solving multiple-choice tests. But human intelligence is much more than that.
What about “human-level performance” in arithmetic? We all know that the calculator in your cell phone outperforms us by orders of magnitude, but this doesn’t mean the calculator is more intelligent than we are.
The (mis)use of school tests
I’ve seen many expressions of awe (and even fear) in the news about GPT-4’s performance on standardized school tests…