
DeepSeek-R1: A Deep Dive into its Architecture


DeepSeek-R1: Cost-Effective LLM Design Explained

DeepSeek-R1's foundation is DeepSeek-V3 Base, a Mixture-of-Experts (MoE) transformer pre-trained on a massive corpus. The model has roughly 671B total parameters, of which only about 37B are activated per token, so each forward pass touches a small fraction of the network. Starting from this pre-trained base avoids training from scratch and substantially reduces computational costs.
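To make the total-versus-activated distinction concrete, here is a minimal sketch that counts expert feed-forward parameters in a generic MoE stack. The layer sizes are placeholders chosen only to illustrate the ratio, not DeepSeek-V3's published dimensions, and the count ignores attention, embeddings, and shared experts.

```python
# Rough sketch of why an MoE model's "activated" parameter count is far below
# its total count. Layer sizes are made-up placeholders, NOT DeepSeek-V3's
# published dimensions, and only the routed-expert FFN weights are counted.

def moe_param_counts(n_layers, d_model, n_experts, n_active, d_ffn):
    """Return (total, activated-per-token) parameter counts for the expert FFNs."""
    params_per_expert = 2 * d_model * d_ffn               # up- and down-projection matrices
    total = n_layers * n_experts * params_per_expert
    activated = n_layers * n_active * params_per_expert   # only the routed experts fire
    return total, activated

total, activated = moe_param_counts(
    n_layers=60, d_model=7168, n_experts=256, n_active=8, d_ffn=2048
)
print(f"total expert params: {total / 1e9:.0f}B, activated per token: {activated / 1e9:.1f}B")
```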

To bootstrap its reasoning capabilities, a "Cold Start" stage is employed using long Chain-of-Thought (CoT) data. A relatively small seed set of samples, on the order of thousands, is collected, each walking through a complex, multi-step problem. These samples are carefully curated to guide the model through a step-by-step reasoning process, effectively teaching it to decompose problems into manageable sub-tasks. Seeding with a compact, high-quality set reduces the need for massive datasets, saving on data acquisition and processing costs.
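As an illustration of what such a seed example might look like, the sketch below packs a question, a step-by-step reasoning trace, and a final answer into a single training record. The <think>/<answer> tags and the helper name build_cold_start_sample are assumptions for illustration, not DeepSeek's exact template.

```python
# Minimal sketch of a cold-start long-CoT training sample. The tag names and
# template are illustrative assumptions; the published work only describes a
# readable pattern separating the reasoning process from the final answer.

def build_cold_start_sample(question: str, reasoning_steps: list[str], answer: str) -> dict:
    """Pack a question and a long, step-by-step reasoning trace into one SFT example."""
    chain_of_thought = "\n".join(
        f"Step {i + 1}: {step}" for i, step in enumerate(reasoning_steps)
    )
    target = f"<think>\n{chain_of_thought}\n</think>\n<answer>{answer}</answer>"
    return {"prompt": question, "completion": target}

sample = build_cold_start_sample(
    question="A train travels 120 km in 1.5 hours. What is its average speed?",
    reasoning_steps=[
        "Average speed is distance divided by time.",
        "Distance is 120 km and time is 1.5 hours.",
        "120 / 1.5 = 80.",
    ],
    answer="80 km/h",
)
print(sample["completion"])
```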

After the cold start, Supervised Fine-Tuning (SFT) is performed. The model is trained on paired input-output examples whose outputs demonstrate the desired reasoning patterns, aligning its behavior with human-annotated examples and refining its grasp of the target task. Focused SFT reduces the number of training iterations required, saving computational resources.
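A minimal SFT sketch, assuming a standard causal-language-model objective over concatenated prompt+completion text. The base model name, the toy example, and the hyperparameters are placeholders, not DeepSeek's actual recipe.

```python
# Minimal SFT sketch with PyTorch + Hugging Face Transformers: train a causal
# LM on prompt+completion pairs so it imitates the demonstrated reasoning style.
# The base model, data, and hyperparameters are placeholders for illustration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model, not DeepSeek-V3
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

pairs = [
    {"prompt": "Q: What is 12 * 7?\nA: ",
     "completion": "<think>12 * 7 = 84</think><answer>84</answer>"},
]

def collate(batch):
    texts = [ex["prompt"] + ex["completion"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(2):  # epoch count is illustrative
    for batch in loader:
        loss = model(**batch).loss  # standard next-token cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```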

The core innovation lies in Reasoning-Oriented Reinforcement Learning. GRPO (Group Relative Policy Optimization) is used: each sampled answer is scored relative to a group of samples for the same prompt, which stabilizes training without a separate value model. Crucially, rule-based rewards are incorporated, focusing on accuracy and formatting. This ensures the model not only generates correct answers but also presents them in a structured, interpretable format, crucial for downstream applications. A language consistency reward further discourages language mixing inside the chain of thought, keeping reasoning traces readable. Efficient RL training with cheap, rule-based rewards leads to faster convergence and reduced training time.
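The sketch below shows what rule-based reward signals of this kind can look like: an accuracy check against a reference answer, a format check for the expected tag structure, a crude language-consistency heuristic, and a GRPO-style group-relative normalization of rewards. The specific rules, weights, and tag names are assumptions for illustration; DeepSeek's reward code is not public.

```python
# Illustrative rule-based rewards: accuracy against a reference answer, a
# format check for the expected tag structure, a crude language-consistency
# heuristic, plus a GRPO-style group-relative normalization. Rules, weights,
# and tags are assumptions; DeepSeek's actual reward code is not public.
import re
import statistics

def accuracy_reward(completion: str, reference_answer: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    # Reward completions that wrap reasoning in <think> and the result in <answer>.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def language_consistency_reward(completion: str) -> float:
    # Crude proxy: fraction of word characters that are ASCII, penalizing mixing.
    chars = re.findall(r"\w", completion)
    return sum(c.isascii() for c in chars) / len(chars) if chars else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    return (accuracy_reward(completion, reference_answer)
            + 0.5 * format_reward(completion)                  # weights are arbitrary
            + 0.1 * language_consistency_reward(completion))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style advantage: score each sample relative to its own group's
    # mean and spread, avoiding the need for a learned value model.
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

print(total_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))
print(group_relative_advantages([1.6, 0.1, 1.6, 0.0]))
```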

Data curation is paramount. A dataset of roughly 600,000 reasoning samples is assembled, spanning a wide range of problem types and difficulties. Rejection sampling, with DeepSeek-V3 acting as a judge model, filters out inconsistent or incorrect reasoning traces, ensuring data quality. Roughly 200,000 non-reasoning samples are added for broader context and general knowledge, giving a combined set of about 800,000 examples. High-quality, curated data reduces the need for extensive data cleaning and retraining, saving time and resources.
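A small sketch of rejection sampling with a judge model, under the assumption that we have a candidate generator and a judge callable (both hypothetical stand-ins for the DeepSeek-V3-based setup): sample several completions per prompt and keep only those the judge accepts.

```python
# Sketch of rejection sampling with a judge: generate several candidates per
# prompt and keep only completions the judge accepts. `generate_candidates`
# and `judge` are hypothetical stand-ins for the DeepSeek-V3-based judging setup.
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate_candidates: Callable[[str, int], list[str]],
    judge: Callable[[str, str], bool],
    n_candidates: int = 8,
) -> list[dict]:
    """Keep at most one judged-correct completion per prompt."""
    curated = []
    for prompt in prompts:
        for completion in generate_candidates(prompt, n_candidates):
            if judge(prompt, completion):  # judge accepts correct, well-formed reasoning
                curated.append({"prompt": prompt, "completion": completion})
                break  # one clean sample per prompt is enough for this sketch
    return curated

# Toy usage with stand-in callables:
print(rejection_sample(
    prompts=["What is 2 + 2?"],
    generate_candidates=lambda p, n: ["<think>2 + 2 = 4</think><answer>4</answer>"] * n,
    judge=lambda p, c: "<answer>4</answer>" in c,
))
```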

The combined ~800,000-sample dataset then undergoes two epochs of SFT, further refining the model's understanding and generalization. Because the data is already well curated, relatively few passes are needed, keeping this stage inexpensive.

Finally, Reinforcement Learning with preference rewards is employed. This allows the model to learn from human preferences, beyond simple accuracy metrics. Diverse training prompts are used to ensure robustness across various domains, preventing overfitting to specific problem types. Effective RL training with preference rewards reduces the need for extensive fine-tuning and adaptation.
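One way to picture this stage is a blended reward that mixes a learned preference score with the earlier rule-based reasoning reward. The function below is a hedged sketch; preference_model, rule_reward, and the blending weight are hypothetical and not taken from DeepSeek's implementation.

```python
# Hedged sketch of blending a learned preference score with the rule-based
# reasoning reward from the earlier stage. `preference_model`, `rule_reward`,
# and the blending weight are hypothetical, not DeepSeek's implementation.
from typing import Callable, Optional

def combined_reward(
    prompt: str,
    completion: str,
    reference_answer: Optional[str],
    preference_model: Callable[[str, str], float],  # learned helpfulness/harmlessness score
    rule_reward: Callable[[str, str], float],       # e.g. total_reward from the RL stage above
    w_pref: float = 0.5,
) -> float:
    pref = preference_model(prompt, completion)
    # Fall back to preference-only scoring when there is no checkable answer.
    rule = rule_reward(completion, reference_answer) if reference_answer else 0.0
    return w_pref * pref + (1.0 - w_pref) * rule
```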

To create smaller, more efficient models, knowledge distillation is used: DeepSeek-R1 acts as the teacher, and its reasoning traces are used to fine-tune smaller student models (the DeepSeek-R1-Distill variants built on Qwen and Llama bases; DeepSeek-R1-Zero, by contrast, is the full-size pure-RL variant trained without the cold start). Distillation preserves much of the teacher's reasoning performance at a fraction of the parameter count, making the students suitable for resource-constrained environments and reducing deployment costs.
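In this style of distillation, the teacher's generated reasoning traces simply become ordinary SFT data for the student. The sketch below shows that data-generation step; teacher_generate is a hypothetical stand-in for sampling from DeepSeek-R1, and the resulting records feed the same fine-tuning loop shown earlier.

```python
# Distillation-by-data sketch: the teacher writes reasoning traces and the
# student is fine-tuned on them with the same SFT loop shown earlier.
# `teacher_generate` is a hypothetical stand-in for sampling from DeepSeek-R1.
from typing import Callable

def build_distillation_set(
    prompts: list[str],
    teacher_generate: Callable[[str], str],
) -> list[dict]:
    """Turn teacher completions into ordinary SFT examples for the student."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# The resulting records can replace `pairs` in the SFT sketch above, with a
# smaller student model loaded in place of the base model.
```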

DeepSeek-R1's architecture exemplifies a systematic approach to building advanced reasoning capabilities in large language models. The integration of strategic prompting, supervised learning, reinforcement learning, and knowledge distillation results in a powerful and efficient model, designed for cost-effectiveness at every stage of development and deployment.
