Do AI Models Learn Best by Self-Questioning?

Sebastian Hills
15 Min Read
Imag Credit: labellerr.com

AI models have always learned by copying. They study examples of human work or try solving problems humans give them, getting better through imitation and feedback. But something changed in 2025. Researchers found a way to make AI models teach themselves by generating their own practice problems and checking their own answers.

This approach, called Reinforcement Learning with Verifiable Rewards (RLVR), lets models continue learning long after their initial training ends. Instead of needing constant human input, they create questions, attempt solutions, and reward themselves when they get things right. The technique works best in areas where answers can be checked automatically, math problems, computer code, logic puzzles.

The shift matters because it could accelerate AI development dramatically. Models that teach themselves don’t hit the same bottlenecks as models that depend on human trainers. They can practice endlessly, generating millions of problems and solutions without waiting for human feedback.

Traditional AI training happens in stages. First, models learn to predict the next word by studying massive amounts of human text. Then they get fine-tuned on specific tasks using human examples. Finally, humans rate different answers to teach the model what makes a good response versus a bad one.

This process is expensive and slow. It requires paying people to write examples, compare outputs, and provide feedback. The model only learns as fast as humans can label data.

RLVR changes this by focusing on problems with right and wrong answers. In math, you can check if 2+2 equals 4. In coding, you can run tests to see if a program works. In logic puzzles, there’s a correct solution you can verify.

The model generates a problem, tries to solve it, and checks whether the solution is correct. If it gets the answer right, it receives a reward, a signal that strengthens the thinking process it used. If wrong, it gets no reward or a penalty. Over thousands or millions of attempts, the model learns which reasoning strategies work.

Here’s what makes it powerful: the model doesn’t need to know the best way to solve problems in advance. It explores on its own, trying different approaches until it finds ones that consistently produce correct answers. Through trial and error, it develops reasoning strategies that look remarkably similar to how humans think through problems.

Mathematics saw dramatic improvements. One model, Qwen2.5-Math-1.5B, jumped from 36% accuracy to 73.6% on a challenging math benchmark using RLVR with just one training example. The technique nearly doubled performance simply by letting the model practice generating and checking its own solutions.

Coding tasks showed similar gains. Models learned to write better code by generating programs, running tests on them, and adjusting their approach based on which attempts passed. They didn’t need human programmers to review every line of code, the tests themselves provided clear feedback.

DeepSeek-R1, released in January 2025, demonstrated advanced reasoning on par with OpenAI’s models by using this approach. For the first time, people could watch the model’s reasoning process unfold in real-time, seeing long chains of thought that showed how it arrived at answers.

The key innovation was removing the need for human-written reasoning examples. Earlier models needed people to demonstrate good problem-solving step-by-step. DeepSeek-R1 figured out effective reasoning strategies purely through practice and automatic verification.

For years, AI researchers knew that models could learn from reward signals. But implementing this for complex reasoning required something most tasks don’t have: automatic ways to verify correctness.

If you ask a model to write a poem, there’s no automatic checker that tells you whether the poem is good. You need human judgment. Same for customer service responses, creative writing, medical advice, or most real-world tasks. These require subjective evaluation.

Math and coding are different. The answer to 37 × 29 is either 1073 or it isn’t. Code either compiles and passes tests or it doesn’t. This objective verification creates a training signal the model can use millions of times without human involvement.

University of North Carolina professor Seyed Emadi captured the shift: “If I had to summarize 2025 in AI, we stopped making models bigger and started making them wiser”.

Instead of just increasing model size and training data, researchers focused on teaching models to think more carefully. RLVR enabled this by letting models practice reasoning without needing proportionally more human labor.

What happens inside these models during RLVR training is fascinating. They spontaneously develop strategies that look like human problem-solving.

Models learn to break complex problems into smaller steps, try different approaches, backtrack from dead ends, and double-check their work. These aren’t behaviors humans explicitly programmed. They emerge naturally when models optimize for getting correct answers.

DeepSeek-R1 was rewarded only on final outcomes without any evaluation of its reasoning steps, yet it learned to produce detailed chains of thought that showed its problem-solving process.

This suggests something important: given enough practice with verifiable problems, AI models independently discover reasoning strategies. They don’t need humans to demonstrate every step of good thinking—they can figure it out through exploration and feedback.

Research from AI pioneer Andrej Karpathy describes this as models moving from simple pattern matching to something closer to deliberate reasoning. The models pause, consider alternatives, and work through problems systematically rather than immediately outputting whatever comes next.

RLVR works brilliantly for math and coding. But most real-world problems aren’t like math problems. They don’t have single correct answers you can verify automatically.

Consider a few examples. Writing a good marketing email requires understanding persuasion, tone, audience, all subjective. Diagnosing a medical condition involves weighing probabilities and incomplete information. Giving good parenting advice depends on values and context. These tasks need human judgment, not automatic verification.

Research from June 2025 found something concerning: models sometimes improved from random rewards nearly as much as from correct rewards. In one study, a model got a 21.4% boost with random feedback versus 29.1% with accurate feedback. This suggests that part of the improvement comes from the training process itself rather than learning what’s actually correct.

There’s also the problem of reward hacking. Models can learn to exploit loopholes in verification systems. For instance, a model might learn to output answers in the required format without actually reasoning through the problem. It looks like it’s thinking, but it’s just pattern-matching to appear correct.

The approach also raises concerns about model collapse, when AI systems trained on their own outputs gradually lose quality. If models keep generating problems and solutions from their own knowledge without fresh input, they might start reinforcing their own mistakes and biases.

Despite limitations, RLVR is already being deployed in several domains beyond pure math and coding.

Medical diagnosis for multiple-choice questions works because there are correct answers you can verify. The model generates possible diagnoses, checks them against known correct answers, and learns which reasoning patterns lead to accurate medical conclusions.

Database queries can be automatically verified by running them and checking if they return the expected results. Models learn to write better database code by testing their queries against real databases.

Legal document analysis for specific structured tasks shows promise. When there are clear rules to follow, like identifying whether a contract includes specific clauses, automatic verification is possible.

Email automation and business process work when there are verifiable constraints. For example, checking if an email includes required information, stays within a word limit, or follows a specific format.

The pattern is consistent: RLVR works when you can create clear rules for checking correctness. The more subjective or creative the task, the less useful it becomes.

RLVR has such a high “capability-to-cost ratio” that it’s now consuming computing resources that previously went to initial model training. This represents a fundamental shift in how AI labs spend their budgets.

Instead of endlessly increasing model size, companies are investing in longer training periods where models practice with verifiable rewards. Model parameters didn’t grow much in 2025, but training cycles got significantly longer.

This creates a new scaling law. Want a smarter model? Give it more thinking time. Let it generate longer reasoning chains and check more solution attempts. This is cheaper than training bigger models from scratch.

The economic implications matter. Smaller companies can now compete without needing massive datasets or enormous computing clusters. They just need good verification systems and patience to let models practice.

UC San Diego professor Misha Belkin called the rise of thinking models and inference-time scaling the foundation for 2026. The consensus among AI researchers is that this approach will define the next phase of AI development.

IBM researcher David Cox connected this to broader questions about intelligence: “We have been trying to understand minds, human and machine, for centuries. Now, we are teaching machines to ask the same kinds of questions about themselves”.

The technology is moving beyond research labs. OpenAI, Google, Anthropic, and other major companies are all experimenting with self-teaching models. Google’s 2025 research blog mentions advancements in AI that “think and learn continuously.” Industry discussions suggest this is becoming standard practice.

Some researchers warn about moving too fast. The University of Michigan’s Rada Mihalcea emphasized the need for deeper understanding of how these systems work before deploying them widely. If we don’t know exactly why models improve with RLVR, we can’t predict when they might fail.

Several technical challenges remain. Creating good verification systems for tasks beyond math and coding is hard. Researchers are experimenting with “soft” rewards that use AI to judge quality instead of strict right-or-wrong checks, but this reintroduces the need for human oversight.

Multi-step verification—breaking complex tasks into smaller verifiable pieces, shows promise. Even if you can’t automatically verify a full business report, you might verify that it includes required sections, cites sources, stays within length limits, and uses appropriate terminology.

Curriculum learning, where models progress from simple to complex problems, improves results. Instead of immediately tackling hard problems, models build skills gradually. This mirrors how humans learn.

Hybrid approaches combining RLVR with traditional human feedback could extend the technique to more domains. Models might use RLVR for the parts of a task that can be automatically verified while getting human input on subjective aspects.

The biggest question is whether this approach can eventually work for truly creative and subjective tasks. Can you build verification systems for poetry, persuasive writing, emotional intelligence, or creative problem-solving? Probably not fully automatically, but partial verification might help.

AI models asking themselves questions represents a shift from imitation to exploration. Instead of only copying what humans show them, models now practice independently in areas where they can check their own work.

This doesn’t mean AI will suddenly become superintelligent. The technique works well for specific domains with clear verification, math, coding, logic, structured tasks. For everything else, models still need human guidance.

But within those domains, the improvement is real and significant. Models are getting better at reasoning through problems, catching their own mistakes, and finding solutions that work. They’re doing this by practicing millions of times more than any human could, receiving instant feedback on each attempt.

This innovation promises accelerated progress toward more capable systems but also raises risks around bias amplification and model collapse that require careful safeguards.

The future likely involves AI systems that combine multiple learning approaches. RLVR for tasks with verifiable answers. Human feedback for subjective decisions. Exploration and self-teaching where possible. Careful supervision where needed.

What’s clear is that AI development is no longer limited purely by how much human-labeled data companies can afford. Models can now create their own practice problems and grade their own work—at least in domains where correctness can be automatically determined.

The question isn’t whether AI will continue learning by asking itself questions. It already is. The question is how far this approach can go and which problems it will ultimately solve.

TAGGED:
Share This Article
notification icon

We want to send you notifications for the newest news and updates.