“Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?”

The problem has a light quiz style and is arguably no challenge for most adults, and probably not for some children either.
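For reference, the intended answer is M + 1: the brother's sisters are Alice's M sisters plus Alice herself. Here is a minimal sketch of that reasoning, assuming all the children are full siblings (the puzzle implies this but does not state it). The function names are made up for illustration and do not come from the paper.

```python
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    """Closed-form answer: each brother has Alice's M sisters plus Alice."""
    if n_brothers < 1:
        raise ValueError("the question assumes Alice has at least one brother")
    if m_sisters < 0:
        raise ValueError("the number of sisters cannot be negative")
    # N does not affect the result; it only guarantees that a brother exists.
    return m_sisters + 1


def sisters_by_enumeration(n_brothers: int, m_sisters: int) -> int:
    """Same answer, found by listing the family and counting the girls."""
    if n_brothers < 1:
        raise ValueError("the question assumes Alice has at least one brother")
    # Assumption: one family in which every child shares the same parents.
    family = [("Alice", "F")]
    family += [(f"sister_{i}", "F") for i in range(m_sisters)]
    family += [(f"brother_{i}", "M") for i in range(n_brothers)]
    # From the point of view of any brother, every girl in the family is a sister.
    return sum(1 for _name, sex in family if sex == "F")


if __name__ == "__main__":
    print(sisters_of_alices_brother(3, 6))
    print(sisters_by_enumeration(3, 6))
```

The distractor is N: the number of brothers has no effect on the answer, and forgetting to count Alice herself gives the tempting wrong answer M.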

The scientists posed several variations of this simple problem to state-of-the-art LLMs that claim strong reasoning capabilities: GPT-3.5/4/4o, Claude 3 Opus, Gemini, Llama 2/3, Mistral and Mixtral, as well as the very recent Dbrx and Command R+.

They observed a strong collapse of reasoning and an inability to answer the simple question as formulated above across most of the tested models, despite their claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4, which occasionally manage to provide correct responses.

This breakdown is dramatic not only because it happens on such a seemingly simple problem, but also because the models tend to be strongly overconfident, reporting their wrong solutions as correct. They often add confabulations to explain the final answer, mimicking a reasoning-like tone while backing up the wrong answers with nonsensical arguments.

  • rufusOP
    19 days ago

    Well, I’d say there is information in language. That’s kinda the point of it and why we use it. And language is powerful. We can describe and talk about a lot of things. (And it’s an interesting question what can not be described with language.)

    I don’t think the stochastic parrot thing is a proper debate. It’s just that lots of people don’t know what AI is and what it can and cannot do. It isn’t easy to understand, and the consequences aren’t always obvious either.

    Training LLMs involves some clever trickery, like limiting their size, so they can’t just memorize everything and are instead forced to learn the concepts behind those texts.

    I think they form models of the world inside of them. At least of things they’ve learned from the dataset. That’s why they can for example translate text. They have some concept of a cat stored inside of them and can apply that to a different language that uses entirely different characters to name that animal.

    I wouldn’t say they are “tools to learn more aspects about nature”. They aren’t a sensor or something. And they can infer things, but not ‘measure’ things like an X-ray.