Got the pointer to this from Allison Parrish who says it better than I could:
it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.
They started calling it AI and people immediately started asking the kinds of questions that science fiction had primed us for, whether that’s Skynet, Data, or Marvin. Apparently even among the supposedly intelligent folks the fact never quite landed that these stories (or at least the good ones) were ultimately trying to comment on people rather than creating fanciful situations to be dissected for their own sake.