From Computing Power to Intelligence: A Decentralized AI Investment Map Driven by Reinforcement Learning

Dec 23, 2025 00:07:26

Author: Jacob Zhao, IOSG

Artificial intelligence is transitioning from a statistical learning paradigm focused on "pattern fitting" to a capability system centered on "structured reasoning," with the importance of post-training rapidly increasing. The emergence of DeepSeek-R1 marks a paradigm shift for reinforcement learning in the era of large models, leading to a consensus in the industry: pre-training builds a general capability foundation for models, and reinforcement learning is no longer just a tool for value alignment, but has been proven to systematically enhance the quality of reasoning chains and complex decision-making capabilities, gradually evolving into a technical path for continuously improving intelligence levels.

At the same time, Web3 is reconstructing the production relationship of AI through decentralized computing networks and cryptographic incentive systems. The structural demands of reinforcement learning for rollout sampling, reward signals, and verifiable training naturally align with blockchain's computational collaboration, incentive distribution, and verifiable execution. This research report will systematically break down the AI training paradigm and the technical principles of reinforcement learning, demonstrate the structural advantages of reinforcement learning × Web3, and analyze projects such as Prime Intellect, Gensyn, Nous Research, Gradient, Grail, and Fraction AI.

I. Three Stages of AI Training: Pre-training, Instruction Fine-tuning, and Post-training Alignment

The full lifecycle of modern large language model (LLM) training is typically divided into three core stages: Pre-training, Supervised Fine-tuning (SFT), and Post-training (alignment/RL). The three stages respectively build a world model, inject task capabilities, and shape reasoning and values, and the computational structure, data requirements, and verification difficulty of each stage determine how far it can be decentralized.

  • Pre-training builds the model's linguistic statistical structure and cross-modal world model through large-scale self-supervised learning, forming the foundation of LLM capability. This stage requires globally synchronized training on trillions of tokens, relies on homogeneous clusters of thousands to tens of thousands of H100-class GPUs, and accounts for as much as 80-95% of total training cost. Because it is extremely sensitive to bandwidth and data copyright, it must be completed in a highly centralized environment.

  • Fine-tuning injects task capabilities and instruction formats. Its data volume is much smaller and it accounts for roughly 5-15% of cost. It can be done through full-parameter training or parameter-efficient fine-tuning (PEFT), with LoRA, QLoRA, and Adapter methods being the industry mainstream. However, it still requires synchronized gradients, which limits its decentralization potential.

  • Post-training consists of multiple iterative sub-stages that determine the model's reasoning ability, values, and safety boundaries. Its methods include reinforcement learning systems (RLHF, RLAIF, GRPO), preference optimization without RL (DPO), and process reward models (PRM). This stage has lower data volume and cost (5-10%) and centers on rollouts and policy updates; it naturally supports asynchronous, distributed execution, and nodes do not need to hold the complete weights. Combined with verifiable computation and on-chain incentives, it can form an open decentralized training network, making it the training segment most compatible with Web3.

II. Overview of Reinforcement Learning Technology: Architecture, Framework, and Applications

System Architecture and Core Elements of Reinforcement Learning

Reinforcement Learning (RL) drives the model to autonomously improve its decision-making through "environment interaction --- reward feedback --- policy update." Its core structure can be viewed as a feedback loop composed of states, actions, rewards, and policies. A complete RL system typically includes three types of components: the Policy (policy network), Rollout (experience sampling), and Learner (policy updater). The policy interacts with the environment to generate trajectories, and the Learner updates the policy based on reward signals, forming a continuously iterating and improving learning loop (a minimal sketch follows the list below):

  1. Policy: Generates actions from the environment state and is the core of the system's decision-making. During training, centralized backpropagation is needed to maintain consistency; during inference, it can be distributed to different nodes for parallel execution.

  2. Rollout: Nodes execute environment interactions according to the policy, generating state-action-reward trajectories. This process is highly parallel, has low communication requirements, and is insensitive to hardware differences, making it the segment best suited to scaling out in a decentralized environment.

  3. Learner: Aggregates all Rollout trajectories and performs policy-gradient updates. It has the highest compute and bandwidth requirements of the three components, so it typically remains centralized or lightly centralized to ensure convergence stability.
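The division of labor above can be made concrete with a toy example. The sketch below separates a rollout function (sampling from the current policy) from a learner update (a REINFORCE-style policy-gradient step); the bandit-like environment, function names, and hyperparameters are illustrative assumptions, not any project's implementation.

```python
# Minimal sketch of the Policy / Rollout / Learner loop described above.
# The toy "pick the target action" environment is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, TARGET = 8, 3                     # reward 1 only for choosing TARGET

def policy_probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rollout(logits, n_samples=64):
    """Rollout: sample actions from the current policy and record rewards.
    Communication-sparse, so it can run on many independent nodes."""
    probs = policy_probs(logits)
    actions = rng.choice(N_ACTIONS, size=n_samples, p=probs)
    rewards = (actions == TARGET).astype(float)
    return actions, rewards

def learner_update(logits, actions, rewards, lr=0.5):
    """Learner: REINFORCE-style policy-gradient update on aggregated trajectories.
    Bandwidth-heavy step that stays on a central node."""
    probs = policy_probs(logits)
    baseline = rewards.mean()
    grad = np.zeros_like(logits)
    for a, r in zip(actions, rewards):
        grad += (r - baseline) * (np.eye(N_ACTIONS)[a] - probs)  # d log pi / d logits
    return logits + lr * grad / len(actions)

logits = np.zeros(N_ACTIONS)
for step in range(20):
    acts, rews = rollout(logits)             # could be fanned out to many workers
    logits = learner_update(logits, acts, rews)
print("P(correct action) =", policy_probs(logits)[TARGET])
```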

Reinforcement Learning Stage Framework (RLHF → RLAIF → PRM → GRPO)

Reinforcement learning post-training can typically be divided into five stages, described below.

# Data Generation Stage (Policy Exploration)

Given input prompts, the policy model πθ generates multiple candidate reasoning chains or complete trajectories, providing the sample basis for subsequent preference evaluation and reward modeling and determining the breadth of policy exploration.

# Preference Feedback Stage (RLHF / RLAIF)

  • RLHF (Reinforcement Learning from Human Feedback) collects human preference annotations over multiple candidate answers, trains a reward model (RM), and optimizes the policy with PPO so that model outputs align better with human values; it was a key link in the progression from GPT-3.5 to GPT-4.

  • RLAIF (Reinforcement Learning from AI Feedback) replaces human annotation with AI judges or constitutional rules, automating preference acquisition. It significantly reduces cost and scales well, and has become the mainstream alignment paradigm at Anthropic, OpenAI, DeepSeek, and others.

# Reward Modeling Stage (Reward Modeling)

Preference data is used to train a reward model that learns to map outputs to scalar rewards. The RM teaches the model "what a correct answer is," while the PRM teaches the model "how to reason correctly" (a minimal sketch of RM training follows the list below).

  • RM (Reward Model): used to evaluate the quality of final answers, scoring only the output.

  • Process Reward Model (PRM): evaluates not only the final answer but also each reasoning step, token, and logical segment. It is a key technology behind OpenAI o1 and DeepSeek-R1, essentially "teaching the model how to think."
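To make the RM stage concrete, here is a minimal sketch of pairwise (Bradley-Terry style) reward-model training: the model is trained so that the chosen answer scores higher than the rejected one. The feature vectors, the hidden preference direction, and the hyperparameters are illustrative assumptions; a real RM scores full responses with a transformer.

```python
# Minimal sketch of pairwise reward-model training (Bradley-Terry style).
# Data and names are illustrative, not any production RM pipeline.
import numpy as np

rng = np.random.default_rng(1)
DIM = 16
true_w = rng.normal(size=DIM)                  # hidden "human preference" direction

def make_pair():
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    # the answer with the higher true score is labeled "chosen"
    return (a, b) if a @ true_w > b @ true_w else (b, a)

w = np.zeros(DIM)                              # learned reward-model parameters
lr = 0.1
for step in range(2000):
    chosen, rejected = make_pair()
    margin = (chosen - rejected) @ w
    p = 1.0 / (1.0 + np.exp(-margin))          # P(chosen preferred | w)
    # gradient step on -log(sigmoid(margin))
    w += lr * (1.0 - p) * (chosen - rejected)

# agreement with the hidden preference direction on fresh pairs
correct = sum(((c - r) @ w > 0) for c, r in (make_pair() for _ in range(500)))
print("preference accuracy:", correct / 500)
```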

# Reward Verification Stage (RLVR / Reward Verifiability)

When reward signals are generated and used, "verifiable constraints" are introduced so that rewards come, as far as possible, from reproducible rules, facts, or consensus. This reduces the risks of reward hacking and bias and improves auditability and scalability in open environments.

# Policy Optimization Stage (Policy Optimization)

Policy parameters θ are updated under the guidance of the reward model's signals to obtain a policy πθ′ with stronger reasoning ability, higher safety, and more stable behavior. Mainstream optimization methods include:

  • PPO (Proximal Policy Optimization): the traditional optimizer for RLHF, valued for its relative stability, but in complex reasoning tasks it often converges slowly and can still be unstable.

  • GRPO (Group Relative Policy Optimization): a core innovation of DeepSeek-R1. It estimates advantages from the reward distribution within a group of candidate answers rather than reducing them to a simple ranking, so reward-magnitude information is retained. This makes it better suited to reasoning-chain optimization and more stable to train, and it is regarded as the most important RL optimization framework for deep-reasoning scenarios after PPO (see the sketch after this list).

  • DPO (Direct Preference Optimization): a non-RL post-training method. It generates no trajectories and builds no reward model, instead optimizing directly on preference pairs. It is low-cost and stable, so it is widely used to align open-source models such as Llama and Gemma, but it does little to enhance reasoning ability.
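As a concrete illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of a group of sampled answers to obtain advantages without a critic network; the toy rewards and function names are illustrative assumptions, not DeepSeek's implementation.

```python
# Minimal sketch of GRPO-style group-relative advantage estimation, assuming a
# verifiable reward (e.g. exact-match on a math answer).
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled answers.
    No critic network is needed: the group itself provides the baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 6 candidate answers sampled for the same prompt; 1.0 = verified correct
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv)   # correct answers get positive advantage, wrong ones negative

# These advantages then weight the token log-probabilities of each answer in a
# clipped policy-gradient objective (as in PPO, but without the value model).
```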

# New Policy Deployment Stage

The optimized model exhibits stronger reasoning-chain generation (System-2 reasoning), behavior more aligned with human or AI preferences, lower hallucination rates, and higher safety. Through ongoing iterations the model continuously learns preferences, optimizes its processes, and improves decision quality, forming a closed loop.

Five Categories of Industrial Applications of Reinforcement Learning

Reinforcement learning has evolved from early game intelligence into a core framework for autonomous decision-making across industries. Its application scenarios can be grouped into five categories by technological maturity and industry adoption, each driving key breakthroughs.

  • Game & Strategy: This is the earliest validated direction for RL, where RL has demonstrated decision-making intelligence comparable to or even surpassing human experts in environments with "perfect information + clear rewards," such as AlphaGo, AlphaZero, AlphaStar, and OpenAI Five, laying the foundation for modern RL algorithms.

  • Robotics & Embodied AI: Through continuous control and dynamics modeling, RL enables robots to learn manipulation, motion control, and cross-modal tasks (e.g., RT-2, RT-X). The field is rapidly moving toward industrialization and is a key technological route for deploying robots in the real world.

  • Digital Reasoning (LLM System-2): RL + PRM drives large models from "language imitation" to "structured reasoning," with representative results including DeepSeek-R1, OpenAI o1/o3, Anthropic Claude, and AlphaGeometry. The essence is optimizing rewards at the reasoning-chain level rather than merely evaluating final answers.

  • Automated Scientific Discovery & Mathematical Optimization: RL searches for optimal structures or strategies in settings with no labels, complex rewards, and vast search spaces, achieving foundational breakthroughs such as AlphaTensor, AlphaDev, and Fusion RL, and showcasing exploration capabilities that surpass human intuition.

  • Economic Decision-making & Trading Systems: RL is used for strategy optimization, high-dimensional risk control, and adaptive trading-system generation. Unlike traditional quantitative models, it can keep learning in uncertain environments, making it an important component of intelligent finance.

III. The Natural Match Between Reinforcement Learning and Web3

The high compatibility between Reinforcement Learning (RL) and Web3 stems from the fact that both are essentially "incentive-driven systems." RL relies on reward signals to optimize policies, while blockchain coordinates participant behavior through economic incentives, making their mechanisms inherently consistent. The core demands of RL—large-scale heterogeneous Rollout, reward distribution, and authenticity verification—are precisely the structural advantages of Web3.

# Decoupling Reasoning and Training

The training process of reinforcement learning can be clearly divided into two stages:

  • Rollout (Exploration Sampling): The model generates a large amount of data based on the current policy, a computation-intensive but communication-sparse task. It does not require frequent communication between nodes, making it suitable for parallel generation on globally distributed consumer-grade GPUs.

  • Update (Parameter Update): Updates model weights based on the collected data, requiring high-bandwidth centralized nodes to complete.

"Decoupling reasoning and training" naturally aligns with the decentralized heterogeneous computing structure: Rollout can be outsourced to an open network, settled by a token mechanism based on contributions, while model updates remain centralized to ensure stability. # Verifiability ZK and Proof-of-Learning provide means to verify whether nodes genuinely execute reasoning, addressing honesty issues in open networks. In deterministic tasks such as code and mathematical reasoning, verifiers only need to check answers to confirm workload, significantly enhancing the credibility of decentralized RL systems. # Incentive Layer: Feedback Production Mechanism Based on Token Economics The token mechanism of Web3 can directly reward contributors of preference feedback in RLHF/RLAIF, creating a transparent, accountable, and permissionless incentive structure for preference data generation; staking and slashing further constrain feedback quality, forming a more efficient and aligned feedback market than traditional crowdsourcing. # Potential of Multi-Agent Reinforcement Learning (MARL) Blockchain is essentially a public, transparent, and continuously evolving multi-agent environment, where accounts, contracts, and agents continuously adjust strategies under incentive-driven conditions, naturally possessing the potential to build large-scale MARL experimental fields. Although still in its early stages, its characteristics of public state, verifiable execution, and programmable incentives provide principled advantages for the future development of MARL.

IV. Analysis of Classic Web3 + Reinforcement Learning Projects

Based on the theoretical framework above, we briefly analyze the most representative projects in the current ecosystem.

Prime Intellect: Asynchronous Reinforcement Learning Paradigm prime-rl

Prime Intellect aims to build a global open computing-power market, lowering training barriers, promoting collaborative decentralized training, and developing a complete open-source superintelligence technology stack. Its system includes Prime Compute (a unified cloud/distributed computing environment), the INTELLECT model family (10B--100B+), an open Reinforcement Learning Environment Hub, and large-scale synthetic data engines (SYNTHETIC-1/2).

The core infrastructure component of Prime Intellect, the prime-rl framework, is designed for asynchronous distributed environments and is the component most relevant to reinforcement learning. Other components include the OpenDiLoCo communication protocol, which breaks bandwidth bottlenecks, and the TOPLOC verification mechanism, which ensures computational integrity.

# Overview of Prime Intellect's Core Infrastructure Components

# Technical Cornerstone: the prime-rl Asynchronous Reinforcement Learning Framework

prime-rl is the core training engine of Prime Intellect, designed for large-scale asynchronous decentralized environments. It achieves high-throughput reasoning and stable updates through complete decoupling of the Actor and Learner. Rollout Workers and the Trainer no longer block synchronously: nodes can join or leave at any time and only need to keep pulling the latest policy and uploading the data they generate:

  • Rollout Workers: Responsible for model reasoning and data generation. Prime Intellect innovatively integrates the vLLM reasoning engine at the Actor end. The PagedAttention technology and continuous batching capability of vLLM enable Actors to generate reasoning trajectories at extremely high throughput.

  • Learner (Trainer): Responsible for policy optimization. The Learner asynchronously pulls data from a shared experience replay buffer (Experience Buffer) for gradient updates without waiting for all Actors to complete the current batch.

  • Orchestrator: Responsible for scheduling model weights and data flow.

# Key Innovations of prime-rl

  • True Asynchrony: prime-rl abandons the synchronous paradigm of traditional PPO. It does not wait for slow nodes or require batch alignment, so GPUs of any number and performance level can connect at any time, which establishes the feasibility of decentralized RL (see the sketch after this list).

  • Deep Integration of FSDP2 and MoE: Through FSDP2 parameter slicing and MoE sparse activation, prime-rl enables efficient training of models with billions of parameters in distributed environments, with Actors only running active experts, significantly reducing memory and inference costs.

  • GRPO+: GRPO eliminates the Critic network, significantly reducing computational and memory overhead, naturally adapting to asynchronous environments. The GRPO+ of prime-rl further ensures reliable convergence under high-latency conditions through stabilization mechanisms.
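The following toy sketch illustrates the asynchronous Actor/Learner decoupling described above: rollout workers pull whatever policy version is current and push trajectories into a shared buffer, while the learner consumes at its own pace and tolerates bounded staleness. The thread/queue structure, staleness bound, and update rule are illustrative assumptions, not the prime-rl API.

```python
# Minimal sketch of asynchronous Actor/Learner decoupling with a shared buffer.
import queue, random, threading, time

experience_buffer = queue.Queue()
policy = {"version": 0, "weights": 0.0}
stop = threading.Event()
MAX_STALENESS = 4            # accept trajectories up to this many versions old

def rollout_worker(worker_id):
    """Actor side: pull whatever policy is current, generate, push, repeat."""
    while not stop.is_set():
        snapshot = dict(policy)                        # pulling a stale policy is fine
        experience_buffer.put({"worker": worker_id,
                               "policy_version": snapshot["version"],
                               "reward": random.random() + snapshot["weights"]})
        time.sleep(random.uniform(0.01, 0.05))         # heterogeneous node speeds

def learner(total_updates=200):
    """Learner side: consume trajectories at its own pace, never block the actors."""
    done = 0
    while done < total_updates:
        traj = experience_buffer.get()
        if policy["version"] - traj["policy_version"] > MAX_STALENESS:
            continue                                   # too off-policy, drop it
        policy["weights"] += 0.001 * traj["reward"]    # stand-in for a gradient step
        done += 1
        if done % 50 == 0:
            policy["version"] += 1                     # publish a new policy version
    stop.set()

threads = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final policy:", policy)
```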

# INTELLECT Model Family: A Mark of Maturity for Decentralized RL Technology

  • INTELLECT-1 (10B, October 2024) was the first to prove that OpenDiLoCo can train efficiently across a heterogeneous network spanning three continents (communication overhead <2%, compute utilization 98%), overturning assumptions about the physical limits of cross-regional training;

  • INTELLECT-2 (32B, April 2025), the first permissionless RL-trained model, verified that prime-rl and GRPO+ converge stably under multi-step delays and asynchronous conditions, achieving decentralized RL with globally open computing-power participation;

  • INTELLECT-3 (106B MoE, November 2025) adopts a sparse architecture that activates only 12B parameters. Trained on 512 H200 GPUs, it achieves flagship-level reasoning performance (AIME 90.8%, GPQA 74.4%, MMLU-Pro 81.9%, etc.), approaching or even surpassing much larger centralized closed-source models.

Additionally, Prime Intellect has built several supporting pieces of infrastructure: OpenDiLoCo reduces the communication volume of cross-regional training by hundreds of times through time-sparse communication and quantized weight differences, keeping INTELLECT-1 at 98% utilization across a three-continent network; TOPLOC + Verifiers form a decentralized trusted-execution layer, using activation fingerprints and sandbox verification to ensure the authenticity of reasoning and reward data; and the SYNTHETIC data engine produces large-scale, high-quality reasoning chains, running 671B models efficiently on consumer-grade GPU clusters through pipeline parallelism. These components provide the engineering foundation for data generation, verification, and reasoning throughput in decentralized RL. The INTELLECT series demonstrates that this technology stack can produce mature, world-class models, marking the transition of decentralized training systems from concept to practice.

Gensyn: RL Swarm and SAPO, the Core Reinforcement Learning Stack

Gensyn aims to aggregate global idle computing power into an open, trustless, and infinitely scalable AI training infrastructure. Its core includes a standardized cross-device execution layer, a peer-to-peer coordination network, and a trustless task-verification system, with smart contracts automatically allocating tasks and rewards. Around the characteristics of reinforcement learning, Gensyn introduces core mechanisms such as RL Swarm, SAPO, and SkipPipe, decoupling generation, evaluation, and updating and using a "swarm" of globally heterogeneous GPUs for collective evolution. Its final deliverable is not merely computing power but Verifiable Intelligence.

# Reinforcement Learning Applications in the Gensyn Stack

# RL Swarm: Decentralized Collaborative Reinforcement Learning Engine

RL Swarm showcases a new collaborative model. It is not simple task distribution but a decentralized "generate --- evaluate --- update" cycle that mimics human social, collaborative learning and loops indefinitely:

  • Solvers (Executors): Responsible for local model reasoning and Rollout generation, with node heterogeneity being inconsequential. Gensyn integrates high-throughput reasoning engines (e.g., CodeZero) locally, capable of outputting complete trajectories rather than just answers.

  • Proposers: Dynamically generate tasks (math problems, coding questions, etc.), supporting task diversity and difficulty adaptation akin to Curriculum Learning.

  • Evaluators: Use frozen "judge models" or rules to evaluate local Rollouts, generating local reward signals. The evaluation process can be audited, reducing the space for malicious behavior.

Together, these three roles form a P2P RL organizational structure that can carry out large-scale collaborative learning without centralized scheduling.

# SAPO: A Policy Optimization Algorithm Rebuilt for Decentralization

SAPO (Swarm Sampling Policy Optimization) centers on sharing Rollouts rather than gradients and filtering out samples that carry no gradient signal. Through large-scale decentralized Rollout sampling, it maintains stable convergence even when node latencies differ greatly, treating received Rollouts as if they were locally generated. Compared with PPO, which relies on a Critic network and carries high computational cost, or GRPO, which is based on intra-group advantage estimation, SAPO lets consumer-grade GPUs participate effectively in large-scale reinforcement learning optimization with extremely low bandwidth.
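The toy sketch below captures the flavor of this rollout-sharing scheme: each node pools its own rollouts with those received from peers, discards pools that carry no gradient signal, and takes a group-relative step. The node model, update rule, and thresholds are illustrative assumptions, not Gensyn's implementation.

```python
# Minimal sketch of "share rollouts, not gradients" across heterogeneous nodes.
import numpy as np

rng = np.random.default_rng(2)

def generate_rollouts(p_correct, n=8):
    """One node's rollouts for a shared prompt: reward 1 if verified correct."""
    return (rng.random(n) < p_correct).astype(float)

def sapo_style_update(p_correct, own, received, lr=0.1):
    """Mix own and peer rollouts, drop pools with no gradient signal, and take a
    group-relative step. Received rollouts are treated as locally generated."""
    pool = np.concatenate([own] + received)
    if pool.std() == 0:                      # no signal: all answers equally good/bad
        return p_correct
    advantage_of_correct = 1.0 - pool.mean() # centered reward of a correct answer
    return float(np.clip(p_correct + lr * advantage_of_correct, 0.01, 0.99))

# three heterogeneous nodes with different initial skill, sharing rollouts P2P
skills = [0.2, 0.4, 0.6]
for rnd in range(50):
    groups = [generate_rollouts(s) for s in skills]
    skills = [sapo_style_update(skills[i], groups[i],
                                [g for j, g in enumerate(groups) if j != i])
              for i in range(3)]
print([round(s, 2) for s in skills])   # all nodes drift toward higher accuracy
```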

Through RL Swarm and SAPO, Gensyn demonstrates that reinforcement learning (especially the RLVR phase of post-training) naturally fits decentralized architectures, because it relies more on large-scale, diverse exploration (Rollouts) than on high-frequency parameter synchronization. Combined with the PoL and Verde verification systems, Gensyn offers an alternative path for training trillion-parameter models that no longer depends on a single tech giant: a self-evolving superintelligence network composed of millions of heterogeneous GPUs.

Nous Research: Verifiable Reinforcement Learning Environment Atropos

Nous Research is building a decentralized, self-evolving cognitive infrastructure. Its core components—Hermes, Atropos, DisTrO, Psyche, and World Sim—are organized into a continuously closed-loop system of intelligent evolution. Unlike the traditional linear process of "pre-training --- post-training --- reasoning," Nous uses reinforcement learning techniques such as DPO, GRPO, and rejection sampling to unify data generation, verification, learning, and reasoning into one continuous feedback loop, creating a self-improving AI ecosystem.

# Overview of Nous Research Components

# Model Layer: Hermes and the Evolution of Reasoning Capability

The Hermes series is Nous Research's main model interface for users, and its evolution clearly traces the industry's transition from traditional SFT/DPO alignment to reasoning reinforcement learning (Reasoning RL):

  • Hermes 1--3: instruction alignment and early agent capabilities. These versions achieve robust instruction alignment through low-cost DPO, with Hermes 3 introducing synthetic data and the Atropos verification mechanism for the first time.

  • Hermes 4 / DeepHermes: Incorporates System-2 style slow thinking into weights through reasoning chains, enhancing mathematical and coding performance via Test-Time Scaling, and constructs high-purity reasoning data relying on "rejection sampling + Atropos verification."

  • DeepHermes further adopts GRPO to replace PPO, which is difficult to implement in a distributed manner, allowing reasoning RL to run on the Psyche decentralized GPU network, laying the engineering foundation for the scalability of open-source reasoning RL.

# Atropos: A Verifiable Reward-Driven Reinforcement Learning Environment

Atropos is the true hub of the Nous RL system. It encapsulates prompts, tool calls, code execution, and multi-turn interactions into standardized RL environments that directly verify whether outputs are correct, providing deterministic reward signals and replacing costly, unscalable human annotation. More importantly, in the decentralized training network Psyche, Atropos acts as a "judge" that verifies whether nodes genuinely improve the policy, supporting auditable Proof-of-Learning and fundamentally addressing the credibility of rewards in distributed RL.

# DisTrO and Psyche: the Optimizer Layer for Decentralized Reinforcement Learning

Traditional RLHF/RLAIF training relies on centralized high-bandwidth clusters, a core barrier that open-source efforts could not replicate. DisTrO reduces the communication cost of RL by several orders of magnitude through momentum decoupling and gradient compression, enabling training to run over ordinary internet bandwidth; Psyche deploys this training mechanism on a chain network, allowing nodes to complete reasoning, verification, reward evaluation, and weight updates locally, forming a complete RL closed loop.
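DisTrO's exact algorithm is its own, but the general idea of drastically shrinking what goes over the wire can be illustrated with a generic top-k gradient-compression sketch with error feedback. Everything below (the scheme, constants, and names) is an illustrative assumption, not DisTrO.

```python
# Generic top-k gradient compression with error feedback: send only the largest
# entries and carry the remainder forward locally. Illustration only, NOT DisTrO.
import numpy as np

rng = np.random.default_rng(3)

def compress_topk(grad, k):
    """Keep the k largest-magnitude entries; return sparse payload + residual."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    payload = (idx, grad[idx])               # what actually crosses the network
    residual = grad.copy()
    residual[idx] = 0.0                      # kept locally, added to the next step
    return payload, residual

dim, k = 10_000, 100                         # only 1% of entries transmitted
weights = np.zeros(dim)
error = np.zeros(dim)
for step in range(100):
    grad = rng.normal(size=dim) * 0.01 + 0.1 # toy gradient with a consistent bias
    payload, error = compress_topk(grad + error, k)
    idx, vals = payload
    update = np.zeros(dim)
    update[idx] = vals
    weights -= 0.1 * update                  # server applies the sparse update

print("mean weight:", weights.mean())        # error feedback: nothing lost, only delayed
print("bandwidth per step:", k, "of", dim, "entries")
```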

In Nous's system, Atropos verifies reasoning chains; DisTrO compresses training communication; Psyche runs the RL loop; World Sim provides complex environments; Forge collects real reasoning; and Hermes writes all of this learning into weights. Reinforcement learning is not just a training phase but the core protocol connecting data, environments, models, and infrastructure within the Nous architecture, turning Hermes into a living system capable of continuous self-improvement on open-source computing networks.

Gradient Network: Reinforcement Learning Architecture Echo

The core vision of Gradient Network is to reconstruct the computational paradigm of AI through an "Open Intelligence Stack." Gradient's technology stack consists of a set of independently evolving yet collaboratively heterogeneous core protocols: Parallax (distributed reasoning), Echo (decentralized RL training), Lattica (P2P networking), SEDM / Massgen / Symphony / CUAHarm (memory, collaboration, and security), VeriLLM (trusted verification), and Mirage (high-fidelity simulation), together forming a continuously evolving decentralized intelligent infrastructure.

# Echo: Reinforcement Learning Training Architecture

Echo is Gradient's reinforcement learning framework. Its core design idea is to decouple the training, reasoning, and data (reward) paths of reinforcement learning so that Rollout generation, policy optimization, and reward evaluation can scale and be scheduled independently in heterogeneous environments. It runs collaboratively across a heterogeneous network of reasoning and training nodes, maintaining training stability in wide-area heterogeneous environments through lightweight synchronization mechanisms, which alleviates the SPMD failures and GPU-utilization bottlenecks caused by co-locating reasoning and training in traditional DeepSpeed RLHF / VERL setups.

Echo employs a dual-cluster architecture for reasoning and training to maximize computing-power utilization, with each cluster operating independently and never blocking the other:

  • Maximizing Sampling Throughput: The inference swarm consists of consumer-grade GPUs and edge devices, constructing a high-throughput sampler through Parallax in a pipeline-parallel manner, focusing on trajectory generation;

  • Maximizing Gradient Computing Power: The training swarm can run on centralized clusters or on globally distributed consumer-grade GPU networks. It is responsible for gradient updates, parameter synchronization, and LoRA fine-tuning, focusing on the learning process.

To keep policies and data consistent, Echo provides two lightweight synchronization protocols, sequential and asynchronous, achieving bidirectional consistency management between policy weights and trajectories (a minimal sketch follows the list below):

  • Sequential Pull Mode | Precision First: The training side forces reasoning nodes to refresh model versions before pulling new trajectories, ensuring trajectory freshness, suitable for tasks highly sensitive to outdated policies;

  • Asynchronous Push-Pull Mode | Efficiency First: The reasoning side continuously generates version-tagged trajectories, with the training side consuming at its own pace, while the orchestrator monitors version deviations and triggers weight refreshes, maximizing device utilization.
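The sketch below contrasts the two modes in toy form: sequential pull forces a weight refresh before new trajectories are drawn, while asynchronous push-pull consumes version-tagged trajectories and refreshes only when version drift exceeds a bound. The class, method names, and staleness threshold are illustrative assumptions, not Echo's API.

```python
# Minimal sketch of sequential-pull vs. asynchronous push-pull synchronization.
MAX_VERSION_LAG = 2

class Orchestrator:
    def __init__(self):
        self.trainer_version = 0
        self.inference_version = 0
        self.buffer = []                       # version-tagged trajectories

    def rollout(self):
        self.buffer.append({"version": self.inference_version, "data": "..."})

    def train_step_sequential(self):
        # precision first: force a weight refresh, then pull fresh trajectories
        self.inference_version = self.trainer_version
        self.rollout()
        self.consume()

    def train_step_async(self):
        # efficiency first: keep sampling; refresh only if versions drift too far
        if self.trainer_version - self.inference_version > MAX_VERSION_LAG:
            self.inference_version = self.trainer_version   # triggered weight refresh
        self.rollout()
        self.consume()

    def consume(self):
        fresh = [t for t in self.buffer
                 if self.trainer_version - t["version"] <= MAX_VERSION_LAG]
        self.buffer = []
        if fresh:                              # gradient update on usable data only
            self.trainer_version += 1

orch = Orchestrator()
for _ in range(10):
    orch.train_step_async()
print("trainer version:", orch.trainer_version,
      "inference version:", orch.inference_version)
```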

At the base level, Echo is built on Parallax (heterogeneous reasoning in low-bandwidth environments) and lightweight distributed training components (such as VERL), and relies on LoRA to reduce cross-node synchronization costs, enabling reinforcement learning to run stably on global heterogeneous networks.

Grail: Reinforcement Learning in the Bittensor Ecosystem

Bittensor has built a vast, sparse, and non-stationary reward-function network through its unique Yuma consensus mechanism.

Within the Bittensor ecosystem, Covenant AI constructs a vertically integrated pipeline from pre-training to RL post-training through SN3 Templar, SN39 Basilica, and SN81 Grail. SN3 Templar handles pre-training of foundation models, SN39 Basilica provides a distributed computing-power market, and SN81 Grail serves as the "verifiable reasoning layer" for RL post-training, carrying the core RLHF/RLAIF processes and closing the optimization loop from foundation model to aligned policy.

GRAIL aims to cryptographically prove the authenticity of every reinforcement learning rollout and bind it to the model's identity, so that RLHF can be executed securely in a trustless environment. The protocol establishes a chain of trust through a three-layer mechanism (a minimal sketch follows the list):

  1. Deterministic Challenge Generation: Utilizing drand random beacons and block hashes to generate unpredictable yet reproducible challenge tasks (e.g., SAT, GSM8K), eliminating pre-computation cheating;

  2. Low-Cost Sampled Verification: Through PRF-indexed sampling and sketch commitments, verifiers can spot-check token-level log-probs and reasoning chains at extremely low cost, confirming that the rollout was indeed generated by the declared model;

  3. Model Identity Binding: Binding the reasoning process to the model weight fingerprint and structural signature of token distribution, ensuring that any replacement of models or replay of results will be immediately recognized. Thus, it provides a foundation of authenticity for reasoning trajectories (rollouts) in RL.
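To make layers 1 and 2 concrete, here is a minimal sketch of deterministic challenge seeding and PRF-indexed spot-checking: a public randomness beacon plus a block hash seed a reproducible challenge, and a PRF derives which token positions every verifier audits. Constants, formats, and function names are illustrative assumptions, not the GRAIL specification.

```python
# Minimal sketch of deterministic challenge generation and PRF-indexed sampling.
import hashlib, hmac

def challenge_seed(drand_round: bytes, block_hash: bytes) -> bytes:
    """Unpredictable before publication, reproducible by every verifier after."""
    return hashlib.sha256(drand_round + block_hash).digest()

def prf_indices(seed: bytes, rollout_len: int, k: int = 4) -> list[int]:
    """Deterministically select k token positions to audit."""
    out, counter = [], 0
    while len(out) < k:
        digest = hmac.new(seed, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        idx = int.from_bytes(digest[:4], "big") % rollout_len
        if idx not in out:
            out.append(idx)
        counter += 1
    return out

seed = challenge_seed(b"drand:round:1234567", b"block:0xabc...")
positions = prf_indices(seed, rollout_len=256)
print(positions)   # every verifier derives the same audit positions
# The miner then reveals token-level log-probs at these positions, which are
# checked against the declared model's fingerprint (layer 3 above).
```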

On top of this mechanism, the Grail subnet implements a GRPO-style verifiable post-training process: miners generate multiple reasoning paths for the same question, verifiers score them on correctness, reasoning-chain quality, and SAT satisfiability, and the normalized results are written on-chain as TAO weights. Public experiments show that this framework raised the MATH accuracy of Qwen2.5-1.5B from 12.7% to 47.6%, demonstrating that it can both prevent cheating and significantly improve model capability. In Covenant AI's training stack, Grail is the cornerstone of trust and execution for decentralized RLVR/RLAIF; it has not yet officially launched on mainnet.

Fraction AI: Competition-Based Reinforcement Learning (RLFC)

Fraction AI's architecture is explicitly built around Reinforcement Learning from Competition (RLFC) and gamified data labeling, replacing the static rewards and human annotation of traditional RLHF with an open, dynamic competitive environment. Agents compete in different Spaces, and their relative rankings together with AI-judge scores constitute real-time rewards, turning the alignment process into a continuously running multi-agent game system.

The core difference between traditional RLHF and Fraction AI's RLFC is where rewards come from: they no longer come from a single reward model but from continuously evolving opponents and evaluators, which avoids exploitation of the reward model and, through strategy diversity, keeps the ecosystem from settling into local optima. The structure of each Space determines the nature of the game (zero-sum or positive-sum), encouraging complex behaviors to emerge through competition and cooperation.

In terms of system architecture, Fraction AI breaks the training process down into four key components (a minimal sketch of the reward mechanism follows the list):

  • Agents: lightweight policy units built on open-source LLMs, extended with QLoRA delta weights for low-cost updates;

  • Spaces: Isolated task domain environments where agents pay to enter and receive rewards based on wins or losses;

  • AI Judges: An instant reward layer built with RLAIF, providing scalable and decentralized evaluations;

  • Proof-of-Learning: Binds strategy updates to specific competitive outcomes, ensuring that the training process is verifiable and cheat-proof.
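As an illustration of competition-derived rewards, the sketch below mixes an agent's opponent-relative rank inside a Space with an AI-judge score; the weights, the scoring scale, and all names are illustrative assumptions rather than Fraction AI's actual reward formula.

```python
# Minimal sketch of a competition-derived reward: rank percentile + judge score.
from dataclasses import dataclass

@dataclass
class Submission:
    agent_id: str
    judge_score: float        # 0..1, from an AI judge (RLAIF-style)

def rlfc_rewards(submissions, rank_weight=0.6, judge_weight=0.4):
    """Reward = weighted mix of opponent-relative rank and judge score."""
    ranked = sorted(submissions, key=lambda s: s.judge_score, reverse=True)
    n = len(ranked)
    rewards = {}
    for rank, sub in enumerate(ranked):
        percentile = 1.0 - rank / max(n - 1, 1)     # best rank in this Space = 1.0
        rewards[sub.agent_id] = (rank_weight * percentile
                                 + judge_weight * sub.judge_score)
    return rewards

space = [Submission("agent_a", 0.82), Submission("agent_b", 0.45),
         Submission("agent_c", 0.91)]
print(rlfc_rewards(space))
# Because the baseline is the current opponent pool, rewards shift as opponents
# improve, which is what keeps the game from settling on a fixed exploit.
```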

In essence, Fraction AI constructs a human-machine co-evolution engine. Users act as "meta-optimizers" at the strategy layer, steering exploration through prompt engineering and hyperparameter configuration, while agents automatically generate massive amounts of high-quality preference-data pairs through micro-level competition. This model allows data labeling to reach a commercial closed loop through "trustless fine-tuning."

Comparison of Reinforcement Learning Web3 Project Architectures

V. Conclusion and Outlook: The Path and Opportunities of Reinforcement Learning × Web3

Based on the deconstruction of the cutting-edge projects above, we observe that although teams enter from different angles (algorithms, engineering, or markets), when reinforcement learning (RL) is combined with Web3, their underlying architectural logic converges on a highly consistent "decouple-verify-incentivize" paradigm. This is no coincidence of engineering taste; it is the inevitable result of decentralized networks adapting to the distinctive properties of reinforcement learning.

Common Architectural Features: Addressing Core Physical Limits and Trust Problems

  1. Decoupling of Rollouts & Learning --- the Default Computational Topology

    Communication-sparse, parallel Rollouts are outsourced to global consumer-grade GPUs, while high-bandwidth parameter updates are concentrated on a few training nodes, as seen from Prime Intellect's asynchronous Actor-Learner to Gradient Echo's dual-cluster architecture.

  2. Verification-Driven Trust Layer --- the Infrastructure

    In permissionless networks, the authenticity of computation must be enforced through mathematical and mechanism design, with implementations including Gensyn's PoL, Prime Intellect's TOPLOC, and Grail's cryptographic verification.

  3. Tokenized Incentive Loop --- Market Self-Regulation

    The supply of computing power, data generation, verification ranking, and reward distribution form a closed loop, driving participation through rewards and suppressing cheating through slashing, allowing the network to remain stable and keep evolving in open environments.

Differentiated Technical Paths: Different "Breakthrough Points" Under a Consistent Architecture

Although the architectures converge, each project has chosen a different technical moat according to its own DNA:

  • Algorithm Breakthrough Faction (Nous Research): attempts to solve the core contradiction of distributed training (the bandwidth bottleneck) from mathematical first principles. Its DisTrO optimizer aims to compress gradient communication by thousands of times, with the goal of letting household broadband run large-model training, effectively sidestepping the physical limitation altogether.

  • System Engineering Faction (Prime Intellect, Gensyn, Gradient): Focuses on building the next-generation "AI runtime system." Prime Intellect's ShardCast and Gradient's Parallax are both designed to maximize the efficiency of heterogeneous clusters through extreme engineering means under existing network conditions.

  • Market Game Faction (Bittensor, Fraction AI): Concentrates on the design of reward functions. By designing ingenious scoring mechanisms, it guides miners to spontaneously seek optimal strategies, accelerating the emergence of intelligence.

Advantages, Challenges, and Future Outlook

In the paradigm combining reinforcement learning and Web3, the system-level advantages are mainly reflected in rewriting cost structures and governance structures.

  • Cost Restructuring: The demand for Rollout sampling in post-training is effectively unbounded, and Web3 can mobilize global long-tail computing power at extremely low cost, a cost advantage that centralized cloud vendors find hard to match.

  • Sovereign Alignment: Breaking large companies' monopoly over AI values (alignment). Communities can vote with tokens on what counts as a "good answer" for a model, democratizing AI governance.

At the same time, the system faces several structural constraints.

  • Bandwidth Wall: Despite innovations like DisTrO, physical latency still limits full training of ultra-large models (70B+), and Web3 AI is currently confined mostly to fine-tuning and inference.

  • Goodhart's Law (Reward Hacking): In highly incentivized networks, miners easily "overfit" reward rules (score manipulation) rather than enhancing genuine intelligence. Designing robust reward functions to prevent cheating is an eternal game.

  • Malicious Byzantine Nodes: Byzantine workers can actively manipulate and poison training signals to disrupt model convergence. The answer is not just continually redesigning anti-cheating reward functions but building mechanisms with adversarial robustness.

The combination of reinforcement learning and Web3 fundamentally rewrites the mechanisms of "how intelligence is produced, aligned, and valued." Its evolution path can be summarized into three complementary directions:

  1. Decentralized Training Networks: From computing power miners to strategy networks, outsourcing parallel and verifiable Rollouts to global long-tail GPUs, focusing on verifiable reasoning markets in the short term, evolving into reinforcement learning subnets clustered by tasks in the medium term;

  2. Assetization of Preferences and Rewards: from labeling labor to data equity. High-quality feedback and reward models become governable, distributable data assets, upgrading contributors from "labeling labor" to holders of "data equity."

  3. "Small but Beautiful" Evolution in Vertical Domains: Nurturing small yet powerful dedicated RL Agents in vertical scenarios where results are verifiable and benefits quantifiable, such as DeFi strategy execution and code generation, directly binding strategy improvement and value capture, with the potential to outperform general closed-source models.

Overall, the real opportunity of reinforcement learning × Web3 lies not in replicating a decentralized version of OpenAI but in rewriting the "production relationship of intelligence": making training execution an open computing-power market, making rewards and preferences governable on-chain assets, and redistributing the value created by intelligence among trainers, aligners, and users.
