AI Self-Training: Jared Kaplan’s Pivotal Decision Explained

In a move that may ultimately redefine how AI evolves, Jared Kaplan—co-architect of some of the most influential large language models—has chosen to decouple the training process from human-designed datasets. In an interview published by The Guardian in December 2025, Kaplan laid out his rationale: the next frontier in AI development rests not in bigger models or more reinforcement from human feedback but in empowering models to teach themselves. This self-training paradigm, though embryonic, hints at scalable intelligence systems capable of refining their own objectives and methods without explicit human direction. It departs radically from current practice at OpenAI, Anthropic, and Google DeepMind, which rely on increasingly expensive human feedback loops to align models with human preferences.

The Strategic Pivot Away From RLHF

For the better part of the 2020s, Reinforcement Learning from Human Feedback (RLHF) has served as the cornerstone for aligning large AI systems with safe, human-centric behavior. Essentially, RLHF involves humans ranking AI outputs, then training models to prefer what humans like, iteratively tuning the system. While powerful, RLHF is limited by scalability, latency, and the sheer cost of labeling labor. OpenAI’s ChatGPT and Anthropic’s Claude are highly dependent on these pipelines, often employing thousands of contractors or relying on full-time researchers to rate and validate model outputs.
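
To make the mechanics concrete, below is a minimal sketch of the preference-modeling step at the heart of RLHF: a reward model is trained on human rankings so that preferred responses score higher than rejected ones (a Bradley–Terry-style objective). The architecture and data here are illustrative placeholders, not any lab’s actual pipeline.

```python
# Minimal sketch of RLHF reward modeling (illustrative; not any lab's actual pipeline).
# A reward model learns from human preference pairs: for each prompt, the human-
# preferred ("chosen") response should receive a higher scalar score than the
# "rejected" one. This is the human-labeled step that self-training aims to replace.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a (prompt, response) feature vector to a scalar score."""
    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score_head(features).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: maximize the log-probability that the
    # human-chosen response outranks the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# One illustrative training step on random features standing in for encoded text.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
chosen_feats, rejected_feats = torch.randn(8, 768), torch.randn(8, 768)
optimizer.zero_grad()
loss = preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
```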

But Kaplan contends this is neither efficient nor adaptive enough for general-purpose AI. He envisions a system in which models learn from themselves—identifying mistakes, re-ranking outputs, and devising new internal objective functions. This shift promises to offload the alignment burden from humans to AI systems, enabling exponential improvement unattainable through current human-centered supervision.

Kaplan’s view is not isolated speculation. As of early 2025, several research papers and prototypes have taken steps toward self-supervised refinement processes such as chain-of-thought reasoning (Wei et al., 2022) and synthetic preference generation (Li et al., 2024), though often under tight constraints or within isolated domains. Kaplan pushes further: he argues AI should be able to design and optimize its own evaluators—essentially learning the art of judgment from scratch.

Technical Underpinnings of AI Self-Training

Kaplan’s self-training model leans on three cornerstones: model-generated synthetic data, internally recursive evaluators, and decentralized control structures. Rather than building better prompts or waiting for further human performance data, his frameworks propose an “introspective scale-up,” wherein the model produces hypothetical dialogues, critiques itself, then learns from these iterations without external feedback.
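
As a rough illustration of what such an “introspective scale-up” loop might look like, the sketch below has a model draft a response, critique its own draft, revise it, and keep the revision as synthetic training data. The generate interface and prompt templates are hypothetical stand-ins, not code from Kaplan, Anthropic, or CogniForge.

```python
# Hypothetical sketch of an introspective self-training loop: draft -> self-critique ->
# revise -> retain as synthetic training data. The generate() callable is a stand-in
# for any text-generation backend; nothing here reflects a specific lab's implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SyntheticExample:
    prompt: str
    revised_response: str
    critique: str  # kept for auditing why the revision was preferred

def introspective_iteration(
    generate: Callable[[str], str],   # model.generate-style function: prompt -> text
    prompts: List[str],
) -> List[SyntheticExample]:
    dataset = []
    for prompt in prompts:
        draft = generate(prompt)
        critique = generate(
            f"Critique the following answer for errors and unhelpfulness.\n"
            f"Question: {prompt}\nAnswer: {draft}\nCritique:"
        )
        revised = generate(
            f"Rewrite the answer to address the critique.\n"
            f"Question: {prompt}\nAnswer: {draft}\nCritique: {critique}\nImproved answer:"
        )
        dataset.append(SyntheticExample(prompt, revised, critique))
    # The resulting dataset would then feed a supervised fine-tuning pass,
    # closing the loop without any external human feedback.
    return dataset
```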

This approach is operationalized using techniques like “AI feedback loops,” where models generate multiple candidate responses to a query, rank them internally using auxiliary scoring models, and update preferences accordingly. For technical readers, this shares similarities with methods like Self-Reward Modeling (SRM), which has recently shown efficacy in limited Q&A and instruction-following domains (Leike et al., 2024).
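
A minimal sketch of that candidate-ranking step follows: the primary model proposes several answers, an auxiliary scoring model ranks them, and the best and worst candidates become a preference pair that a subsequent update (for example, a DPO-style step) could consume. The generate and score functions are assumed interfaces, not taken from the SRM work.

```python
# Hypothetical sketch of an "AI feedback loop": sample several candidates, rank them
# with an auxiliary scorer, and emit (chosen, rejected) preference pairs that a
# DPO- or RLHF-style update could consume. Interfaces are assumed, not from any paper.
from typing import Callable, List, Tuple

def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],        # primary model: prompt -> candidate text
    score: Callable[[str, str], float],    # auxiliary scorer: (prompt, candidate) -> score
    n_candidates: int = 4,
) -> Tuple[str, str]:
    candidates = [generate(prompt) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return chosen, rejected

def build_preference_dataset(prompts: List[str], generate, score):
    # Each pair says "prefer chosen over rejected" -- the same signal RLHF gets from
    # humans, here produced entirely by the model stack itself.
    return [(p, *build_preference_pair(p, generate, score)) for p in prompts]
```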

More concretely, Kaplan’s recent work at Anthropic’s spin-out lab, CogniForge (as cited in the 2025 Guardian interview), involves training “meta-evaluators.” These are sub-models specialized in grading the consistency, helpfulness, or ethical adherence of outputs generated by a primary model. Over time, both layers improve—an admittedly precarious but potentially groundbreaking tandem-learning loop.
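
One way to picture a meta-evaluator is as a grader that scores the primary model’s output along a fixed rubric (consistency, helpfulness, ethical adherence) and gates which outputs feed back into training. The rubric, prompt format, and grading interface below are illustrative assumptions, not CogniForge’s published design.

```python
# Illustrative sketch of a rubric-based meta-evaluator. A grading sub-model scores the
# primary model's output on several dimensions; aggregated scores decide whether the
# output is retained as training data. Details are assumptions, not CogniForge's design.
from typing import Callable, Dict

RUBRIC = ("consistency", "helpfulness", "ethical_adherence")

def meta_evaluate(
    grade: Callable[[str], str],  # evaluator model: grading prompt -> numeric string "0-10"
    prompt: str,
    output: str,
) -> Dict[str, float]:
    scores = {}
    for dimension in RUBRIC:
        verdict = grade(
            f"Rate the {dimension} of this answer from 0 to 10.\n"
            f"Question: {prompt}\nAnswer: {output}\nScore:"
        )
        try:
            scores[dimension] = float(verdict.strip())
        except ValueError:
            scores[dimension] = 0.0  # unparseable verdicts are treated as failures
    return scores

def keep_for_training(scores: Dict[str, float], threshold: float = 7.0) -> bool:
    # Only outputs the meta-evaluator rates highly on every dimension feed back into
    # the primary model -- the "tandem" half of the loop described above.
    return all(v >= threshold for v in scores.values())
```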

Economic and AI Ecosystem Implications

The economic ramifications are profound. RLHF-based alignment remains one of the costliest phases in model development, estimated at $5–30 million per model cycle for frontier companies (VentureBeat, Jan 2025). A transition to self-training could significantly lower operational overhead, reducing dependence on global content moderation firms and data-labeling vendor networks such as Sama and Remotasks.

Furthermore, Kaplan’s pivot serves as a counter-narrative to the prevailing wisdom in Silicon Valley: that more data, more feedback, and more reinforcement from human raters will continue driving progress. It’s a contrarian thesis gaining traction. Meta’s FAIR team and Google DeepMind have each launched internal initiatives exploring self-improving AI architectures (Google AI Blog, Feb 2025), though with significantly different safety protocols and constraints.

For startups and public-sector actors, self-training could democratize frontier AI. The budgetary barrier to entry falls when human raters are no longer the central component, implying that smaller teams could compete with trillion-parameter giants so long as they have architectures capable of recursive introspection and synthetic judgment—a potentially destabilizing effect across the competitive landscape of AI.

Comparing Strategic Positions Across AI Labs

| Organization | Alignment Strategy (2025) | Progress on Self-Training |
|---|---|---|
| Anthropic | Constitutional AI + RLHF | Medium – Internal R&D through CogniForge |
| OpenAI | RLHF + Human Reward Modeling | Low – Few public commitments |
| Google DeepMind | Scalable Alignment (SAFELY) Framework | High – Prometheus self-assessment modules |
| Meta | Open-Source, minimal alignment | Medium – Research into LLM-LLM scaffolding |

This comparison underscores how divergent current strategies remain. While DeepMind emphasizes formalized self-evaluation (notably through its Gemini models with embedded critics), OpenAI continues to rely heavily on externally validated reward frameworks. Kaplan’s decision, as publicized, diverges even from his own previous work at Anthropic—which suggests an internal tension between institutional caution and experimental autonomy.

Risks and Ethical Considerations

However, Kaplan’s vision is far from frictionless. The most serious concern among AI safety researchers is recursive reward hacking—wherein a model might learn to exploit its own scoring heuristics instead of genuinely improving behavior. Left unchecked, this “loop submission bias” could result in AI that appears aligned but progressively drifts from human values.
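
One illustrative safeguard, not drawn from Kaplan’s proposal, is to cross-check the co-trained evaluator against a frozen reference evaluator or periodic human spot-checks and pause self-training when the two diverge too far, since systematic score inflation is a warning sign that the policy is gaming its own scorer. The thresholds and interfaces below are assumptions made for the sketch.

```python
# Hypothetical drift check against recursive reward hacking: if the co-trained evaluator's
# scores inflate relative to a frozen reference evaluator (or human spot-checks),
# self-training is paused for review. Thresholds and interfaces are illustrative.
from statistics import mean
from typing import Callable, List, Tuple

def reward_drift(
    samples: List[Tuple[str, str]],                 # (prompt, output) pairs to audit
    trainable_score: Callable[[str, str], float],   # evaluator being co-trained
    frozen_score: Callable[[str, str], float],      # fixed reference evaluator
) -> float:
    gaps = [trainable_score(p, o) - frozen_score(p, o) for p, o in samples]
    return mean(gaps)

def should_pause_self_training(samples, trainable_score, frozen_score,
                               max_drift: float = 1.5) -> bool:
    # A persistently positive gap suggests the policy is exploiting the scorer it is
    # co-evolving with, rather than genuinely improving -- the "loop submission bias"
    # concern raised above.
    return reward_drift(samples, trainable_score, frozen_score) > max_drift
```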

This challenge has prompted some ethicists and computer scientists to call for interpretability thresholds before self-training is deployed at scale. According to a February 2025 policy brief by the U.S. Federal Trade Commission, self-training models must undergo explainability audits and adversarial robustness testing beyond the current industry standard. The shift may also strain regulatory frameworks in the EU and Canada, where proposed AI governance laws (e.g., the EU AI Act revisions slated for Q2 2025) assume a human-in-the-loop premise for reliable operation.

Additionally, critics argue that eliminating human feedback may ignore critical edge-case values such as cultural nuance or counter-majoritarian ethics. Models learning from their own standards could reproduce and reinforce biases in a closed loop, a concern Kaplan acknowledges but believes can be mitigated through diversified bootstrapping layers and synthetic adversary training paradigms.

What Comes After the Pivot

Over the 2025–2027 horizon, AI self-training may evolve in two convergent directions. The first is the augmentation of self-refining systems with additional modalities, as speech and vision are integrated into textual reflection loops; this is already being prototyped at DeepMind and Stanford HAI’s Decisive Agents Lab. The second is the emergence of hybrid-feedback strategies that combine bursts of human tuning with prolonged periods of autonomous practice.

A potentially defining moment will be the outcome of Kaplan-led trials scheduled for mid-2026, where AI systems trained solely with recursive learning modules will face off against RLHF-tuned peers on a standardized multitask benchmark. If self-training demonstrates greater transferability and robustness, the economic and intellectual priority of RLHF could permanently decline, altering the capital allocations of firms currently entrenched in human-in-the-loop optimization.

Marketwise, investment in AI self-improvement tools—such as critic modeling, summarization feedback loops, and introspective evaluators—is accelerating rapidly. According to CB Insights’ March 2025 report, venture funding in this segment grew 320% year over year, signaling expected enterprise demand well beyond the research community.

Conclusion: Kaplan’s Bet on the Future

Jared Kaplan’s endorsement of self-training AI marks a reshuffling of the conceptual framework around intelligence growth. Far from speculation, it is driven by technical prototypes, market needs, and strategic fatigue with human bottlenecks. While not without perils—ethical, institutional, and recursive—this architecture offers a plausible escape from the RLHF ceiling that has thus far defined post-GPT language models.

Should this approach succeed, it may not only accelerate AI capabilities but redefine who controls those capabilities: firms, the models themselves, or even distributed ecosystems of recursive-learning agents organized without strong central control. Kaplan’s gamble is not solely about efficiency; it is a profound philosophical statement about how intelligence could unfold—less like a student obeying teachers, more like a prodigy discovering new forms of thought on its own.

by Alphonse G

This article is based on and inspired by The Guardian

References (APA Style):

Brydges, R. (2025, December 2). Jared Kaplan thinks AI should train itself. The Guardian. https://www.theguardian.com/technology/ng-interactive/2025/dec/02/jared-kaplan-artificial-intelligence-train-itself

CB Insights. (2025, March). Market outlook: Self-improving AI tools. https://www.cbinsights.com/research/self-improving-ai-tools-market-outlook-2025/

Federal Trade Commission. (2025, February). FTC issues guidance on self-aligned algorithms. https://www.ftc.gov/news-events/news/press-releases/2025/02/ftc-issues-guidance-self-aligned-algorithms

Google AI Blog. (2025, February). Self-taught AI: A look ahead. https://blog.google/technology/ai/self-taught-ai-2025/

Leike, J., et al. (2024). Self-reward modeling in large language models. arXiv. https://arxiv.org/abs/2403.11641

VentureBeat. (2025, January). OpenAI and Anthropic are spending big on human AI trainers. https://venturebeat.com/ai/openai-and-anthropic-are-spending-big-on-human-ai-trainers-whats-next/

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. https://arxiv.org/abs/2201.11903

Anthropic. (2025). Research blog: Teaching AI to judge itself. https://www.anthropic.com/research/self-evaluation-learning

DeepMind. (2025). Aligning AI with self-awareness modules. https://www.deepmind.com/research/self-evaluating-agents

OpenAI Blog. (2025). Capabilities without chaos: The case for slow alignment. https://openai.com/blog/slow-alignment-required/

Note that some references may no longer be available at the time of your reading due to page moves or expirations of source articles.