> No, all these models are just bad for anything that they weren't RLed for, and...

koakuma-chan · 2026-02-23T16:25:00 1771863900

No, I am making the argument that models have poor capabilities outside of the tasks they are RLed for, and their capabilities inside those tasks are only as good as the capabilities of the people evaluating their responses, i.e. not great. Even if you instruct the model "don't do X" or "do X this way"—you cannot rely on the model following that instruction. This means that there is nothing you can do if the model makes "errors."

Not necessarily relevant, but fun: I had the ChatGPT model correct itself mid-response when checking my math work. It started by saying that I was wrong, then proceeded to solve the problem, and at the end it realized that I was correct.

embedding-shape · 2026-02-23T17:13:50 1771866830

> Even if you instruct the model "don't do X" or "do X this way"—you cannot rely on the model following that instruction.

Why not? I can definitely fire off two prompts to the same model and harness, one that includes "don't do X" and one that doesn't, and I get what I expect: the run without the instruction makes no attempt to avoid X, and the run with it does. Is that not your experience using LLMs?
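A rough sketch of that two-prompt comparison, in case it helps make the claim testable. Everything here is hypothetical: `generate` stands in for whatever model and harness you call, `violates` is your own check for "did it do X", and the trial count is arbitrary. It only measures how often X shows up with and without the instruction; it says nothing about why.

```python
from typing import Callable, Tuple


def violation_rate(
    generate: Callable[[str], str],   # hypothetical: prompt in, model text out
    prompt: str,
    violates: Callable[[str], bool],  # hypothetical: True if the output "did X"
    trials: int = 20,
) -> float:
    """Return the fraction of `trials` responses for which `violates` is True."""
    hits = sum(1 for _ in range(trials) if violates(generate(prompt)))
    return hits / trials


def compare_instruction(
    generate: Callable[[str], str],
    base_prompt: str,
    instruction: str,
    violates: Callable[[str], bool],
    trials: int = 20,
) -> Tuple[float, float]:
    """Return (rate without the instruction, rate with the instruction)."""
    without = violation_rate(generate, base_prompt, violates, trials)
    with_instruction = violation_rate(
        generate, f"{base_prompt}\n\n{instruction}", violates, trials
    )
    return without, with_instruction
```

If the second number is reliably lower than the first, the instruction is doing something; if it is lower but not zero, that is the "cannot rely on it" case from the parent comment.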

koakuma-chan · 2026-02-23T17:36:18 1771868178

It depends on the instruction, and on how many other instructions there are. Models converge on doing things the way that emerged from their training, and with every turn the model cares less and less about your instructions. In practice, this means that after you have had the model plan and then execute the plan, you almost always end up having to iterate on the output, because while generating it the model began to derail and ignore instructions. You get things like "In a real app, we would do X, for now, just return null" or various subtle bugs.
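For the placeholder kind of failure, a crude scan over the generated code catches some of it before review. This is only a sketch: the phrase list is an assumption based on the example above, `find_stub_lines` is a made-up helper name, and it will not catch the subtle bugs at all.

```python
import re

# Assumed phrases that tend to mark a stubbed-out implementation.
STUB_PATTERNS = [
    r"in a real (app|application|implementation)",
    r"for now,? (just|we)",
    r"\bTODO\b",
    r"\bplaceholder\b",
]
_STUB_RE = re.compile("|".join(STUB_PATTERNS), re.IGNORECASE)


def find_stub_lines(source: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for each line matching a stub phrase."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if _STUB_RE.search(line)
    ]
```

Running it on the model's output after each turn at least tells you when it has quietly swapped the real work for a stub, so you know which parts to send back.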

It makes sense if you remember that it just predicts what the next piece of text should probably be.

embedding-shape · 2026-02-23T17:46:07 1771868767

I understand how they work, as I work with them every day and have been doing so for two years or so. What I don't understand is how what you're saying is in any way related to the whole "deliberately create errors in code" part, which is where I jumped into the discussion.

Maybe I'm missing some bigger picture you're trying to paint here? I understand (and see) them making "mistakes" all the time, and I guess you could argue it's deliberate in some way, because it's simply how they work, and adjusting the prompt and retrying usually solves the problem. But I'm afraid I don't see how it's connected, at least not yet.