In earlier posts we looked at how LLMs are given direction through supervised fine-tuning and reinforcement learning from human feedback (RLHF), and how they are kept safe. But just because a model can follow instructions doesn’t mean it should follow all of them.