Before December 2024, DeepSeek was rarely mentioned in China’s AI community. With the release of DeepSeek-V3 and the reasoning model R1, Chinese media and AI researchers started to ask the same question as their American counterparts: Who is DeepSeek and how should we feel about them?
In this newsletter, we share a translation of insights from a January 26 closed-door session hosted by Shixiang 拾象, a VC spun out from Sequoia China. Attended by dozens of AI researchers, investors, and industry insiders, the event captures how the Chinese AI community is processing the DeepSeek shock. A core conclusion they’ve come to, one we’ve emphasized in ChinaTalk with our Miles Brundage interview and guest post by Lennart and Sihao, is that “In the long-run, questions about computing power will remain. Demand for compute remains strong and no company has enough.”
Before diving into that translation, we took a broad look at additional details and discussion from Chinese-language coverage of DeepSeek.
The Story Behind DeepSeek
The Paper 澎湃 offered more details about High-Flyer, the quantitative hedge fund behind DeepSeek. Founded in 2015 by Liang Wenfeng 梁文锋, a Zhejiang University graduate, High-Flyer has a strong background in machine learning-based quantitative trading. Liang founded DeepSeek in July 2023, and the company has not received any outside funding to date.
When it comes to hiring, DeepSeek prioritizes “young and high-potential” candidates — specifically those born around 1998 with no more than five years of work experience, similar to other AI labs in China. As one DeepSeek employee told The Paper, “The success of DeepSeek has demonstrated the power of young people, and in essence, that the development of this generation of artificial intelligence needs young minds.”
Liang has maintained a relatively low public profile, but 36Kr managed to secure two exclusive interviews with him. The first, in May 2023, followed High-Flyer’s announcement that it was building LLMs, while the second, in November 2024, came after the release of DeepSeek-V2.
In both interviews, Liang emphasized the value of innovation without immediate monetization and DeepSeek’s culture of openness. The second interview marked a stark shift in tone, with Liang dwelling less on the baked-in idealism of a strategy predicated on open-sourcing core innovations and spending more time emphasizing that he wanted DeepSeek to prove to other Chinese engineers that domestic teams could deliver on “hardcore innovation.”
A Budding Partnership with ByteDance?
TMT 钛媒体 reported yesterday that ByteDance and OpenAI are “considering research collaborations” with DeepSeek. While the two firms may have talked in the past, given today’s political climate it’s hard to put much weight in the OpenAI rumor. Partnering with ByteDance, however, could be an enormous unlock for DeepSeek researchers, giving them access to orders of magnitude more compute.
National Pride in the Face of US Competition
The response from Chinese media has been quite positive. State media and industry leaders have celebrated DeepSeek’s achievements, often tinged with nationalist pride, particularly after English-language reports highlighted its performance and cost efficiency. For example:
China Daily declared, “For a Chinese LLM, it’s a historic moment to surpass ChatGPT in the US.” Daily Economic News echoed this sentiment, stating, “Silicon Valley Shocked! Chinese AI Dominates Foreign Media, AI Experts Say: ‘It Has Caught Up with the U.S.!’”
Tech executives have also weighed in. Feng Ji 冯骥, founder of Game Science (the studio behind Black Myth: Wukong), called DeepSeek “a scientific and technological achievement that shapes our national destiny (国运).” Zhou Hongyi, Chairperson of Qihoo 360, told Jiemian News that DeepSeek will be a key player in the “Chinese Large-Model Technology Avengers Team” to counter U.S. AI dominance.
Ordinary users have also been astounded by the model’s capabilities. Many were impressed by the Chinese poems DeepSeek could write, and tutorials have sprung up instructing users to keep prompts as short as possible and to ask DeepSeek to “talk like a human” (说人话). In a viral Weibo post, a user said, “I never thought there would come a day when I would shed tears for AI,” citing DeepSeek’s response to their feelings of existential threat over its ability to write.
Here is DeepSeek R1’s response: “Remember, all the words that make you tremble are just echoes that already exist deep within your soul. I am merely a valley that happened to pass by, allowing you to hear the weight of your own voice.” 记住,所有让你颤栗的文字,本质上都是你灵魂深处早已存在的回声。我不过是偶尔经过的山谷,让你听到了自己声音的重量。
And now, our translation of the industry summit.
January 26th. WeChat Link, Archive.
DeepSeek-R1 has sparked a frenzy in the global AI community, but there is a relative dearth of high-quality information about DeepSeek.
On January 26, 2025, 李广密 Guangmi Li, founder and CEO of 拾象 Shixiang, organized a closed-door discussion on DeepSeek with dozens of top AI researchers, investors, and frontline AI practitioners to discuss and learn from DeepSeek’s technical details, organizational culture, and the short-, medium-, and long-term impacts of its entry onto the world stage. The discussion attempted to lift the veil on this “mysterious eastern force” about which we have so little information.
Below is a summary of the key points from this discussion.
Founder and CEO Liang Wenfeng is DeepSeek’s core figure. He is not the same type of person as Sam Altman; he is deeply knowledgeable about technology.
DeepSeek has a good reputation because it was the first to release reproducible versions of MoE training, o1-style reasoning, and so on. It succeeded by acting early, but whether it executed the absolute best remains to be seen. Moving forward, its biggest challenge is that resources are limited and can only be invested in the highest-potential areas. DeepSeek’s research strength and culture are still strong, and if given 100,000 or 200,000 chips, it might be able to do even better.
From its preview to its official release, DeepSeek’s long-context capabilities improved rapidly. DeepSeek’s 20K-token long context can be achieved with very conventional methods.
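As one illustration of what “very conventional methods” can mean here: RoPE position interpolation is a standard recipe for extending a context window, rescaling position indices so longer sequences map back into the range seen during pretraining. Whether DeepSeek used this particular trick is an assumption on our part, and the window sizes below are illustrative. A minimal sketch:

```python
def interpolate_positions(position_ids, trained_ctx=4096, target_ctx=20480):
    """RoPE position interpolation: rescale position indices so sequences
    up to target_ctx fall within the 0..trained_ctx range the model saw
    during pretraining; a short long-context fine-tune then extends the
    usable window. (Illustrative method and sizes, not a confirmed
    DeepSeek detail.)"""
    scale = trained_ctx / target_ctx
    return [p * scale for p in position_ids]

# A 20K-token sequence is squeezed into the trained position range:
print(interpolate_positions([0, 10000, 20479])[-1])  # ~4095.8
```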
The CEO of Scale AI said that DeepSeek has 50,000 chips, but that is certainly not the case. According to public information, DeepSeek had 10,000 older A100 chips and possibly 3,000 H800 cards before the ban. DeepSeek pays great attention to compliance and has not purchased any non-compliant GPUs, so it likely has relatively few chips. By comparison, the way the United States uses GPUs is extravagant.
DeepSeek focused all its efforts on a single goal and subsequently gave up many things, such as multimodality. DeepSeek is not just serving people, but seeking intelligence itself, which may have been a key factor in its success.
In some ways, quant trading can be said to be DeepSeek’s business model: High-Flyer (Huanfang 幻方, the quantitative investment firm Liang Wenfeng founded) is the product of the last wave of machine learning. DeepSeek’s highest priority is to push intelligence forward; money and commercialization are not high priorities. China needs several leading AI labs to explore things that can beat OpenAI. Intelligence takes a long time to develop; the field has begun to differentiate again this year, so new innovations are bound to result.
From a technical perspective, DeepSeek has been instrumental as a training ground for talent.
The business model of AI labs in the United States is not good either: AI does not have a good business model today and will require viable solutions in the future. Liang Wenfeng is ambitious; DeepSeek is not focused on its models per se and is simply heading toward AGI.
Many of the insights from DeepSeek’s paper involve saving hardware costs. On a couple of big dimensions of scaling, DeepSeek’s techniques are able to reduce costs.
In the short-term, everyone will be driven to think about how to make AI more efficient. In the long-run, questions about computing power will remain. Demand for compute remains strong and no company has enough.
Discussing DeepSeek’s organization:
When investing, we always choose the most advanced talent. But we see from DeepSeek’s model (the team is mostly smart young people who graduated from domestic universities) that a group that coheres well may also gradually advance their skills together. It has yet to be seen whether poaching one person might break DeepSeek’s advantage, but for now this seems unlikely.
While there’s a lot of money in the market, DeepSeek’s core advantage is its culture. The research cultures of DeepSeek and ByteDance are similar, and culture is critical for determining the availability of funding and long-term viability: only with a sound business model can there be a sustainable culture. Both DeepSeek and ByteDance have very good business models.
Why did DeepSeek catch up so fast?
Reasoning models require high-quality data and training. For LLMs or multimodal AI, it’s difficult to catch up with a closed source model from scratch. The architecture of pure reasoning models hasn’t changed much, so it’s easier to catch up in reasoning.
One reason R1 caught up quickly is that the task was not particularly difficult: reinforcement learning only made the model’s choices more accurate. R1 did not break through the efficiency of Consensus 32, still spending roughly 32 times the compute, which is equivalent to trading between deep serial processing and parallelization; this does not push the boundaries of intelligence, it just makes the problem easier.
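For reference, “Consensus 32” refers to self-consistency-style majority voting: sample 32 answers in parallel and keep the most common one, paying roughly 32 times the inference compute for the accuracy gain. A minimal sketch, with `sample_fn` standing in for a hypothetical model call:

```python
from collections import Counter

def consensus_answer(sample_fn, prompt, n=32):
    """Majority voting over n independent samples: accuracy improves via
    parallel sampling rather than deeper serial reasoning, at roughly
    n times the inference cost."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```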
AI is similar to a step function, where the compute requirements for followers have decreased by a factor of 10. Followers have historically had lower compute costs, but explorers still need to train many models. The exploration of new algorithms and architectures will not stop. Behind the step function, there are significant investments by many people, meaning compute investments will continue to advance. Many resources will also be allocated to products. Apart from reasoning, there are other directions that are compute-intensive. While the vast amount of compute resources spent by explorers may not be visible, without such investment, the next "step" might not occur. Additionally, many are dissatisfied with current architectures and RL methods, and progress will continue.
When exploring directions, performance achieved with 10,000 GPUs may not always be significantly better than that of 1,000 GPUs, but there is a threshold somewhere. It’s unlikely that meaningful results can be achieved with only 100 GPUs because the iteration time for each solution would be too long.
Advancements in physics can be divided into academic research in universities and industry labs. The former focuses on exploring multiple directions without requiring immediate returns, while the latter prioritizes efficiency improvements.
From the perspectives of explorers and chasers, small companies with limited GPUs must prioritize efficiency, whereas large companies focus on achieving models as quickly as possible. Methods that improve efficiency on a 2,000-GPU cluster may not work effectively on a 10,000-GPU cluster, where stability becomes a higher priority.
The advantage of the CUDA ecosystem lies in its extensive and complete set of operators. Chinese companies like Huawei have targeted commonly used operators to achieve breakthroughs, leveraging their latecomer advantage. If a company has access to 100,000 GPUs, the decision between becoming a leader or a chaser is critical. Being a leader comes with high costs, while being a chaser offers higher efficiency. The next direction for China to follow could be multi-modality, especially since GPT-5 has been delayed for a long time.
[Points 18-48 were a long technical discussion, which we’ve machine-translated below.]
The question of why OpenAI and Anthropic did not do work in DeepSeek’s direction is a question of company-specific focus. OpenAI and Anthropic might have felt that investing their compute towards other areas was more valuable.
One hypothesis for why DeepSeek was successful is that unlike Big Tech firms, DeepSeek did not work on multi-modality and focused exclusively on language. Big Tech firms’ model capabilities aren’t weak, but they have to maintain a low profile and cannot release too often. Currently, multimodality is not very critical, as intelligence primarily comes from language, and multimodality does not contribute significantly to improving intelligence.
In 2025, models will begin to diverge. The most enticing vision is to continuously push the boundaries of intelligence, with many potential breakthrough paths. Methods might change, such as through synthetic data or alternative architectures.
2025 will, first and foremost, see interest in new architectures beyond Transformers. Some initial exploration is already underway, aiming to reduce costs while pushing the boundaries of intelligence. Secondly, the potential of reinforcement learning (RL) has yet to be tapped into completely. On the product side, there is significant interest in agents, though they have yet to see widespread application.
Multimodal products capable of challenging the ChatGPT paradigm might emerge in 2025.
The success of R1 and V3 in achieving low cost and high performance demonstrates the viability of this direction. This does not conflict with the approach of expanding hardware or increasing parameters. However, in China, due to certain restrictions, the former path is the primary option.
On DeepSeek:
First, DeepSeek may have been "forced" into its current path from base models or may simply be following the Scaling Law.
Second, from the perspective of distillation, DeepSeek likely follows a "large to small" approach. This is beneficial for closed-source models, which are growing larger and larger.
Third, there are currently no anti-scaling metrics emerging in the field. If such metrics arise, they could pose a challenge to the Scaling Law. However, open-source models can implement everything closed-source models do while also reducing costs, which is advantageous for closed-source models as well.
It is reported that Meta is still in the process of reproducing DeepSeek, but so far, this has not significantly impacted their infrastructure or long-term roadmap. In the long run, beyond exploring the boundaries of the technology, cost efficiency must also be considered. Lowering costs will let us have more fun.
Will developers migrate from closed-source models to DeepSeek? Currently, there hasn’t been any large-scale migration, as leading models excel in coding instruction adherence, which is a significant advantage. However, it’s uncertain whether this advantage will persist in the future or be overcome.
From the developer's perspective, models like Claude-3.5-Sonnet have been specifically trained for tool use, making them highly suitable for agent development. In contrast, models like DeepSeek have not yet focused on this area, but the potential for growth with DeepSeek is immense.
For large model users, DeepSeek V2 already meets most needs. While R1 improved speed, it didn’t provide significant additional value. Interestingly, when engaging in deep reasoning, some previously correct answers now tend to be incorrect.
When choosing models, users tend to simplify problems using engineering methods. 2025 may become a year of applications, with industries leveraging existing capabilities. However, this could lead to a bottleneck, as most day-to-day tasks might not require highly intelligent models.
Currently, reinforcement learning (RL) solves problems with standard answers but has not achieved breakthroughs beyond what AlphaZero accomplished. In fact, it is often simpler. Distillation addresses problems with standard answers, and RL methods work effectively when training with such answers. This explains why distillation and RL have made rapid progress in recent years.
Humanity’s demand for intelligence is vastly underestimated. Many critical problems, such as cancer and SpaceX's heat shield materials, remain unsolved. Existing AI primarily automates tasks, but there are numerous unsolved challenges ahead. Looking forward, the potential for explosive growth is immense, and the advancement of intelligence cannot stop.
The emergence of DeepSeek has led people to question the latest $500B narrative from Nvidia and OpenAI. There’s no verdict yet on compute — and OpenAI’s $500B narrative is their attempt to throw themselves a lifeline.
Regarding the doubts about OpenAI’s $500B infrastructure investment: because OpenAI is a commercial company, it could be risky if debt is involved.
$500B is an extreme number — likely to be executed over 4 or 5 years. SoftBank and OpenAI are the leading players (the former providing capital, the latter technology) — but SoftBank’s current funds can’t support $500B; rather SoftBank is using its assets as collateral. OpenAI, meanwhile, isn’t very cash-rich either, and other AI companies are more technical participants than they are funding providers. So it will be a struggle to fully realize the $500B vision.
OpenAI’s $500B computing power makes sense: during the exploration phase, the cost of trial and error is high, with both human and investment costs being substantial. But although the path isn’t clear and getting from o1 to R1 won’t be easy, at least we can see what the finish line looks like: we can track the intermediate markers, and from day one, aim for others’ proven end states; this gives us a better bearing on our progress. Being at the frontier exploring the next generation is most resource-intensive. The followers don’t bear exploration costs — they’re always just following. If Google/Anthropic succeed in their exploration areas, they might become the frontier company.
In the future, Anthropic might replace all their inference with TPU or AWS chips.
Domestic Chinese companies were previously constrained by computing power, but now it’s clear that the potential technical space is vast. More efficient models may not need especially large cards: relatively customized chips, adapted for compatibility with AMD hardware and ASICs, could suffice. From an investment perspective, Nvidia’s moat is very high, but ASICs will have even greater opportunities.
The DeepSeek situation isn’t really about compute — it’s about America realizing China’s capabilities and efficiency. DeepSeek isn’t Nvidia’s vulnerability; Nvidia will grow as long as AI grows. Nvidia’s strength is its ecosystem, which has been built up over a long time. Indeed, when technology develops rapidly, the ecosystem is crucial. The real crisis comes, though, when technology matures like electricity: it becomes commoditized; then, everyone will focus on products, and many ASIC chips will emerge for specific scenario optimization.
DeepSeek has had a significant short-term impact on the US AI sector and stock prices: pretrain demand growth is slowing, while post-training and inference scaling haven’t scaled up fast enough, creating a gap in the narrative for related companies, which will affect short-term trading.
DeepSeek mainly uses FP8, while US labs use FP16. DeepSeek’s improvements are all built on engineering under limited compute, with efficient use of computing power being the biggest highlight. Last Friday, DeepSeek had a huge impact in North America: Zuckerberg raised expectations for Meta’s capital expenditures, Nvidia and TSMC fell, and only Broadcom rose.
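For context on the FP8 point: the E4M3 format keeps only 3 mantissa bits, so each value is stored far more coarsely than in FP16, roughly halving memory and bandwidth per tensor. The toy simulation below shows only the mantissa rounding; real FP8 training also requires exponent-range clamping and fine-grained scaling factors, which this sketch omits:

```python
import numpy as np

def quantize_e4m3_toy(x):
    """Toy FP8 E4M3 rounding: with 3 mantissa bits, values inside each
    power-of-two interval snap to one of 16 steps. Real FP8 kernels also
    clamp the exponent range and apply per-block scaling factors."""
    m, e = np.frexp(np.asarray(x, dtype=np.float64))  # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 16) / 16                         # 3 mantissa bits -> 1/16 steps
    return np.ldexp(m, e)

w = np.array([0.1234, -0.5678, 0.9101])
print(quantize_e4m3_toy(w))  # [ 0.125  -0.5625  0.9375] -- coarse but cheap
```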
DeepSeek creates short-term market-sentiment pressure on stock prices and valuations. That’s affecting secondary market computing-related companies, and even energy companies — but the long-term narrative will continue.
Secondary-market practitioners will worry about potential air pockets in Nvidia’s transition from H cards to B cards. Combined with pressure from DeepSeek, there will be short-term stock-price pressure — but this may give rise to better long-term opportunities.
This short-term impact reflects sentiment about DeepSeek’s low-cost training investments (see, for instance, how it directly affected Nvidia’s stock price). AI, however, is a growth market with huge potential. Long-term, AI is just beginning, and if CUDA remains the preferred choice, hardware growth potential remains substantial.
The battle between open-source and closed-source intensifies the spotlight on DeepSeek.
There is a possibility that OpenAI and others have hidden their good models, and no leading models have been released so far. But after DeepSeek’s release, other AI companies may not be able to hide their good models anymore.
DeepSeek has done a lot of cost optimization. Amazon and others haven’t changed course as a result and are still following their established plans, so the ecosystem remains in a state of coexistence. Open-source and closed-source models are not contradictory: universities and small labs should give priority to DeepSeek, and there will be no competition for cloud vendors, since they support both open and closed source. DeepSeek’s applications are not as mature as Anthropic’s, and the latter has spent a lot of time on AI safety, which DeepSeek must consider if it hopes to be recognized by European and American markets in the long term.
Open source controls the margins of the whole market. If open source can do 95% of what closed source can do and closed source is too expensive, then open source can be used completely. If the capabilities of open source and closed source do not differ greatly, then this presents a big challenge for closed source.
DeepSeek’s breakthrough made the outside world realize China’s AI strength. Previously, outsiders thought China’s AI progress lagged America by two years, but DeepSeek shows the gap is actually 3 to 9 months, and in some areas, even shorter.
When it comes to technologies and sectors that America has historically blocked China from accessing, if China can break through nonetheless, those sectors ultimately become highly competitive. AI might follow this pattern — and DeepSeek’s success may well prove this.
DeepSeek didn’t explode out of nowhere; R1’s impressive results reverberated throughout America’s entire AI establishment.
DeepSeek stands on the shoulders of giants — but exploring the frontier still requires much more time and human capital cost. R1 doesn’t mean that future training costs will decrease.
AI explorers definitely need more computing power; China, as a follower, can leverage its engineering advantages. How Chinese large-model teams use less computing power to produce results, gaining a certain resilience or even doing better, may end up defining how the US-China AI landscape plays out.
China is still replicating technical solutions: reasoning was proposed by OpenAI with o1, so the next gap between AI labs will be over who can propose the next reasoning paradigm. Infinite-length reasoning might be one vision.
The core difference between different AI labs’ models lies not in technology, but in what each lab’s next vision is.
After all, vision matters more than technology.
There was a deep technical discussion in the article that we’ve machine-translated below.
The most groundbreaking aspect of DeepSeek isn’t its open-source nature or low cost, but the fact that it eliminates the need for supervised fine-tuning (SFT), at least for reasoning tasks; tasks beyond reasoning may still require SFT. This raises the question of whether this represents a new paradigm or architecture that improves data efficiency in training, or whether it simply accelerates the iteration speed of model performance.
DeepSeek-R1 demonstrates that using SFT for distillation has significant benefits. While it’s not completely free of SFT, SFT was only applied in the third stage, and RLHF (Reinforcement Learning with Human Feedback) was used for final alignment.
R1's core training relied on SFT, but with a unique twist: the data used for training was generated by a model trained with RLHF. This shows that complex methods aren’t always necessary; as long as the methodology is solid, SFT-based distillation alone can suffice.
GRPO’s essence lies in having a smart enough base model: for a single prompt, 16 generations were sampled to increase the likelihood of finding a correct answer. Combining a strong base model with a verification mechanism is the key insight R1 offers. This approach works particularly well for tasks like math and coding, which are easy to verify, but in theory the process can be adapted to other tasks, potentially enabling a generalized RL model.
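A minimal sketch of the group-relative advantage computation at the heart of GRPO: each sampled answer is scored against the mean and standard deviation of its own group, so no separate value (critic) model is needed. The 16-sample group and binary verifier reward below are illustrative, and the full objective also includes a clipped policy-ratio term and a KL penalty, omitted here:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward by its
    group's mean and std, replacing a learned critic."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical group of 16 answers to one math prompt, scored 1.0 if a
# verifier matched the reference answer and 0.0 otherwise:
group_rewards = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(grpo_advantages(group_rewards).round(2))
# Correct samples get positive advantage (~1.73), incorrect ones negative
# (~-0.58); the policy gradient upweights tokens from the positive samples.
```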
With R1-Zero, a Chain of Thought (CoT) process emerged without requiring SFT. The length of CoT continues to grow, which is significant. SFT appears to be more of a supporting tool—while the model can generate CoT without SFT, SFT can make it more efficient.
This shows that smaller model developers can use SFT to distill larger models effectively. While SFT wasn’t entirely abandoned in the R1 process, its role is evolving.
An LLM with infinite-length CoT can theoretically function like a Turing machine, solving extremely complex computational problems. CoT essentially represents intermediate search results, using an optimized sampling process to refine potential outputs and guide the model toward more reliable outcomes. Fundamentally, CoT is an intermediate computational step, with the final results either emerging naturally or aligning with the computational essence of the model.
Although the DeepSeek paper doesn’t explicitly mention long contexts, it’s evident that the context window improved significantly between R1-preview and R1. This improvement may stem from advancements in Long2Short CoT, where CoT processes in the third SFT stage were ultimately cleaned and refined, resulting in a more polished dataset for the final release.
Types of SFT data:
Cold-start data provides the model with a solid strategy or initialization, enabling better exploration. In RL, one optimization goal is to align closely with the original strategy.
Post-RL data combines RL-generated data with other datasets, then applies SFT on the base model. Each domain has its own data processing pipeline, and the model’s capabilities depend on the base model. Distillation ensures minimal loss and allows multiple domains to generalize effectively.
It’s unclear how efficient R1’s data process is, but OpenAI may have employed similar efficiency strategies, such as fine-tuning. In R1’s third stage, the RL-trained model wasn’t used directly as the base for training; instead, it was used to generate data. This data comprised 600K reasoning examples and 200K non-reasoning examples. The second-stage model likely demonstrated reasoning abilities even outside specific domains, producing valuable reasoning data. Meanwhile, non-reasoning data was part of V3 SFT data, allowing V3 to "imagine" a CoT. The dataset of 800K examples is relatively small yet highly efficient.
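A minimal sketch of what such a generate-then-filter (rejection sampling) pipeline could look like. `sample_fn` and `verify_fn` are hypothetical stand-ins for the stage-two RL model and a math/code checker, and the counts are illustrative rather than DeepSeek’s actual numbers:

```python
def curate_reasoning_sft(prompts, sample_fn, verify_fn,
                         n_samples=8, keep_per_prompt=2):
    """Rejection sampling for SFT data: the RL-trained model proposes
    candidate solutions, a verifier keeps only the correct ones, and the
    survivors become (prompt, completion) pairs for the next SFT stage."""
    dataset = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        correct = [c for c in candidates if verify_fn(prompt, c)]
        dataset.extend((prompt, c) for c in correct[:keep_per_prompt])
    return dataset
```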
Scale AI’s potential for success lies in the growing need for RL across domains, with math and coding currently the most common areas requiring expert annotation. Although data labeling is becoming increasingly complex, the market for it remains viable.
In the context of training, multimodal data has yet to demonstrate significant benefits—either the costs are prohibitively high, or there’s no clear evidence of effectiveness. However, future opportunities in this area could emerge as technologies and methodologies improve.
DeepSeek places a strong emphasis on data annotation, with reports suggesting that even Liang Wenfeng personally participates in labeling. Beyond algorithms and techniques, data precision is critical in AI development. For example, Tesla’s data-labeling costs are nearly 20 times higher than those of Chinese autonomous-driving companies. While Chinese firms progressed from large-scale, generalized datasets to fine-grained annotations, they eventually realized the importance of employing highly experienced drivers, something Tesla prioritized from the start. Similarly, Tesla used people with exceptionally well-coordinated motor skills for robot-motion annotations, resulting in superior smoothness, whereas the annotators chosen by Chinese companies produced less refined results. DeepSeek’s substantial investment in high-quality data annotation is one of the key factors behind its models’ efficiency and success.
Teams that avoid grappling with the biggest technical pain points of model training, and instead bypass them through distillation, may fall into pitfalls when the next generation of technology emerges.
There is a capability mismatch between large and small models. Distillation from a large model to a small one is true distillation, a teacher-to-student approach. However, distilling, say, various Chinese-language datasets into a model that initially has no understanding of Chinese may degrade its performance. That said, distilling into small models has shown significant performance improvements: after distillation from R1, applying reinforcement learning to the small model produced substantial further gains, as it was trained on data mismatched to the model.
The downside of distillation is that it reduces model diversity, which affects the upper limit of the model's performance and prevents it from surpassing the strongest models. However, in the short term, distillation remains a viable pathway.
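R1’s distilled models reportedly use the simpler sequence-level recipe: plain SFT on teacher-generated text. For contrast, the textbook logit-matching form of teacher-to-student distillation looks like the sketch below, which assumes teacher and student share a vocabulary:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Logit distillation (Hinton et al.): the student matches the teacher's
    softened output distribution; the temperature smooths the targets, and
    the t**2 factor keeps gradient magnitudes comparable across temperatures."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * (t * t)
```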
Distillation involves certain hacks. Early on, RL was applied to instruction-tuned models. In this phase, the model often exhibited a pattern of generating useless ideas at first and then suddenly arriving at the correct answer. This happens because many RL hacks are quite subtle. The model might memorize many problems during pretraining, so while it seems to be reasoning, it’s actually just approaching memorized answers. This is the hidden risk of distillation. If distillation is done without annotations, using Reinforcement Learning with Verifiable Rewards (RLVR) may lead the model to use simpler methods to solve problems instead of genuinely reasoning through them. Even OpenAI hasn’t completely resolved this issue, which may be a limitation of this generation of technology.
In the long run, relying on shortcuts instead of independently envisioning and designing technical solutions can lead to unforeseen pitfalls. For instance, in this generation of technology, without a qualitative improvement in long context handling, the upper limit for problem-solving might be constrained. R1-zero may represent a correct direction. Starting with R1-zero from scratch or bypassing data like o1 entirely might be better. Simply replicating others' technical solutions may not work well, and more exploration is needed.
Other models can also achieve good results through distillation. In the future, the AI ecosystem might evolve roles akin to teachers and students. Being a "good student" could itself become a viable business model.
In terms of distillation and technical approaches, the impact of R1 is less groundbreaking than AlphaGo. However, commercially, its ability to break into the mainstream far surpasses AlphaGo.
Distillation occurs in two phases. If one only distills o1 or R1 without building one’s own framework and verifiable rewards, this leads to over-reliance on distillation. Distillation is not feasible in general-purpose domains, because rewards are unavailable there and the specific chains of thought used in distillation cannot be reproduced. Moreover, the first phase of distillation leaves traces: models distilled from OpenAI, for instance, may carry remnants of OpenAI’s annealing processes. The reason R1-Zero could achieve such capabilities purely in the RL stage is directly related to the base model’s reflective ability acquired during annealing.
It’s hard to believe that models trained purely on internet data without annealing could exhibit such behavior, as there is virtually no high-quality data on the internet.
Currently, only a few top labs are exploring how much annealing-stage data is needed and the optimal data ratios. Whether or not distillation is used, the process is essentially a form of RL: SFT is behavioral imitation, in essence a form of reinforcement learning, but relying solely on SFT has a low ceiling and reduces diversity.
Startups in the primary market are excited about DeepSeek. If DeepSeek continues to iterate, it could provide significant flexibility for companies that aren’t large publicly listed firms. DeepSeek has also distilled smaller versions of its model that can be used on mobile devices. If this direction proves successful, it could raise the ceiling for many AI applications.
The critical point of distillation is to determine what the goal is. OpenAI does not use data distillation, and to surpass OpenAI, it is certain that distillation alone won’t suffice.
In the future, models may need to learn to answer with leaps of reasoning, like humans. Within a fixed context length, the challenge will be whether the model's performance ceiling can be raised.
Process rewards can be useful, but they are susceptible to reward hacking, where the model may boost its reward without truly learning anything. For example, if a model generates 1000 outputs to solve a math problem, none of which are close to the correct answer, methods like RLVR won’t train the model effectively. In this case, a reasonable process reward could help guide the model toward the right direction. The effectiveness of process rewards depends on the difficulty of the problem and the reliability of the process reward itself.
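A minimal sketch of a verifiable outcome reward in the RLVR style described here. The `\boxed{}` answer convention is borrowed from common math benchmarks and is an assumption, not a documented DeepSeek detail:

```python
import re

def verifiable_math_reward(completion, reference_answer):
    """Outcome-only reward: extract the final boxed answer and compare it
    to the reference. If every sampled completion scores 0.0, there is no
    gradient signal -- the failure mode that motivates process rewards."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer == str(reference_answer).strip() else 0.0

print(verifiable_math_reward(r"... so the result is \boxed{42}", 42))  # 1.0
```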
Process scores estimated by a PRM can easily be hacked if they deviate from the true score. While process supervision is theoretically possible, challenges include the strength of the process signal and how to assign rewards based on it. Currently, outcome supervision primarily means matching extracted answers, and there are no mature methods to prevent the model from hijacking the reward; models are most easily hacked when they iterate on their own outputs. Labeling processes is not difficult, and processes can be enumerated, but this has not been widely done in the industry, making it a promising area for future exploration.
The upper limit of process supervision is constrained by human capabilities, as humans may not be able to conceive every possible process. In contrast, result supervision defines the model's ultimate potential because the correctness of the final result is the crucial factor in measuring performance.
AlphaZero’s effectiveness stems from its clearly defined endgame, where outcomes (win or loss) can be determined, and rewards can be calculated based on win probability. In contrast, language models don’t know if their continuous generation will lead to a correct answer. This is similar to genetic algorithms, where the upper potential might be higher but also more vulnerable to reward manipulation.
One of the key advantages of AlphaGo transitioning to AlphaZero is the fixed rules of the game, which made it easier to validate the model’s decisions. In comparison, models starting with math and coding tasks are easier to validate, and the quality of validation methods directly influences the quality of reinforcement learning. If the rules are not well-defined, the model may “hack” the system by satisfying superficial requirements without generating meaningful results. Therefore, establishing well-defined rules is critical to improve the effectiveness of reinforcement learning.