Also, how do you know it read the book, and not a summary of it, of which there are loads on the internet?
In the case of ChatGPT, it’s hard to tell. OpenAI won’t even reveal what their training dataset was.
Researchers have done some tests to tease this out, and they’re pretty confident that it has read quite a few books and memorized them verbatim. See one of my favorite papers in a while, Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.
Reading the paper: even for the best-performing books, it guesses a name in a passage correctly more than seventy percent of the time on only 5 of the over 500 books tested. On “The Fellowship of the Ring” it scored barely over 50%, and that’s hardly a little-known book. These LLMs are clearly familiar with the content, but I would hardly call that memorizing verbatim. (Humans are also reasonably good at this after reading a book.)