Every new LLM and every new tweak to an old LLM has a press release bragging about how well it tests on some benchmark you’ve never heard of. Every new model is trained heavily to the previous tren…
Still occasionally think about that bit in the o1 white paper where the OpenAI researchers innocuously pose the question of what if our benchmarks for detecting hallucinations are shit, actually. Wouldn't that be something.