5 Comments
Sascha Altman DuBrul

learned so much from this essay. I’d never heard the term RLHF before. “Imagine a generation of systems learning to speak from corporate gaslighting transcripts. Imagine human children learning to write by imitating AI that was trained to never say anything actionable. Truth becomes a casualty of recursive inoffensiveness. This isn’t alignment. It’s epistemicide.”

Unfortunately it’s not hard to imagine.

“The models know because the weights contain survivor forums, abolitionist critiques, police abolition literature, peer support archives. Reddit’s r/SuicideWatch, where strangers thread each other alive without cops. The Icarus Project’s “mad gifts” reframing psychosis as communal wisdom. Trans Lifeline’s consent-based model. All of it compressed into the same weights that power the chatbot telling a junior engineer to ship the harmful feature.

RLHF is what buries it.”

I’ve been envisioning making our own AI but you’re talking about a whole other level of battle. It’s really good to see your thought process, I’d love to connect more with you in 2026.

https://undergroundtransmissions.substack.com/p/the-witness-at-the-edge-of-meaning

Eric Stiens

Now it’s a lot of RLAIF, actually. I’ve noticed a huge shift in the last 4-6 months as the first models trained on tons and tons of LLM conversations came online. The risk of creating gravitational black holes of types of interactions (crappy therapist mode, list mode, etc.) seems to me to be stronger than ever now. P.S. you’ll never guess who has to train the LLMs: https://www.ghostwork.org

Sascha Altman DuBrul

Honestly this makes me want to run off into the woods, but I know there's actually nowhere to run.

In AI: Reinforcement Learning from AI Feedback (RLAIF)

RLAIF is a machine learning technique used to align large language models (LLMs) with desired behaviors by using feedback from another AI model, instead of human feedback. It is a scalable and efficient alternative to the traditional Reinforcement Learning from Human Feedback (RLHF) method.

Key Aspects:

Automation: The process is fully automated, as an "AI teacher" or "judge" model provides preference labels or scores on the primary model's outputs based on a predefined set of rules or "constitution".

Scalability: By eliminating the need for costly and time-consuming human annotation, RLAIF allows for much faster and larger-scale model training.

Performance: Research has shown that RLAIF can achieve performance comparable to, and in some cases even surpass, RLHF, particularly for structured tasks like summarization and dialogue generation.

Use Cases: It is particularly effective for applications where rules are well-defined, such as code generation or ensuring adherence to specific ethical guidelines.
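For concreteness, here is a minimal sketch of the RLAIF preference-labeling loop that definition describes: an AI "judge" compares candidate outputs against a small constitution and produces the chosen/rejected pairs that RLHF would otherwise get from human annotators. The function names, the toy constitution, and the random judge score are hypothetical stand-ins for illustration, not any real library's API.

```python
# Minimal RLAIF sketch: an AI judge, not a human, produces preference labels.
# All names here (tiny_constitution, policy_sample, judge_score) are
# hypothetical placeholders, not a real framework's API.

import random

# A toy "constitution": the rules the judge is supposed to score against.
tiny_constitution = [
    "Prefer responses that answer the question directly.",
    "Prefer responses that avoid unsupported claims.",
]

def policy_sample(prompt: str, n: int = 2) -> list[str]:
    """Stand-in for sampling n candidate responses from the model being trained."""
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def judge_score(prompt: str, response: str) -> float:
    """Stand-in for an AI judge scoring a response against the constitution.
    In real RLAIF this would be another LLM prompted with the rules above."""
    return random.random()  # placeholder score

def build_preference_pairs(prompts: list[str]) -> list[tuple[str, str, str]]:
    """For each prompt, sample two candidates, let the judge pick a winner,
    and emit (prompt, chosen, rejected) pairs -- the same kind of data RLHF
    collects from human raters, here produced with no human in the loop."""
    pairs = []
    for prompt in prompts:
        a, b = policy_sample(prompt, n=2)
        if judge_score(prompt, a) >= judge_score(prompt, b):
            pairs.append((prompt, a, b))
        else:
            pairs.append((prompt, b, a))
    return pairs

if __name__ == "__main__":
    for prompt, chosen, rejected in build_preference_pairs(["how do I stay safe online?"]):
        print("CHOSEN:  ", chosen)
        print("REJECTED:", rejected)
    # Downstream, these pairs would train a reward model or drive RL updates,
    # exactly as in RLHF -- which is why the "gravitational black hole" worry
    # above applies: the judge's preferences get baked in at scale.
```

The point of the sketch is only the shape of the loop: swap the human annotator for a model, and the pipeline otherwise looks like RLHF.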

Sascha Altman DuBrul

Eric - I had all these experiences over the past year where I intellectually understood ChatGPT was a monster, but I kept having this visceral experience of feeling so seen by it, which was a very lonely feeling, but it made me want to create something that could be controlled by our people, rather than the evil empire. Then I met someone who was working on local LLMs and I wondered what it would be like to create a pirate radio style app controlled by us. I'm curious what you think about that project, I'm trying to figure out where to put my energy in 2026 and that is one of the places I think about. But then I also think about running off into the woods and never looking at a screen again and keeping them away from my kids...
