
Suspicions Rise as New DeepSeek Model Mimics Google’s Gemini

Copy paste?

TNCR Staff


The latest model from Chinese AI lab DeepSeek, named R1-0528, is receiving praise for its reasoning capabilities but also facing allegations that it was trained using outputs from Google’s Gemini models.

Released on Hugging Face, DeepSeek’s model performs well on math, logic, and programming tasks, approaching the performance of OpenAI’s o3 and Google’s Gemini 2.5 Pro. However, multiple developers and researchers have raised suspicions about the model’s origins and linguistic patterns.

AI analyst Sam Paech posted evidence on social media suggesting the model’s outputs closely mirror those from Gemini, hinting at a possible use of synthetic Gemini data during training. Others, including anonymous developers, have noted the model’s “thought traces” are reminiscent of Gemini’s, reinforcing the speculation.

This follows earlier accusations that DeepSeek’s models might have incorporated ChatGPT-derived content.

Why It Matters: If substantiated, the use of outputs from competitors’ models, especially without authorization, raises significant ethical and legal concerns in the AI community. It underscores the difficulty of maintaining model originality in a time awash with AI-generated content, and challenges platforms to enforce clearer boundaries and protections.

Performance Achievements Come with Controversy: DeepSeek’s R1-0528 model has been praised for sharp improvements in mathematical problem-solving, code generation, and general reasoning. Its benchmark results show it rapidly closing the gap with OpenAI’s o3 and Google’s Gemini 2.5 Pro. The company attributes the boost to algorithmic refinements and increased post-training resources. Independent researchers have challenged that account, suspecting that the leap in quality may stem partly from training on Gemini’s outputs, which raises doubts about how the gains were achieved.

Independent Researchers Detect Gemini-Like Linguistic Patterns: Developer Sam Paech, who analyzes AI behavior using bioinformatics tools, stated that the R1-0528 model exhibits phrasing and trace behavior almost identical to Gemini 2.5 Pro. Comparing the model’s outputs with Gemini’s responses, he reported a notable overlap in vocabulary and syntactic structure. Paech shared his findings and analysis on GitHub and in posts on X. Another developer, known only as the creator of SpeechMap, observed that the “chain-of-thought” outputs from DeepSeek bore a strong resemblance to those typically produced by Gemini, further fueling the suspicions.
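To illustrate what a vocabulary-overlap comparison can look like in practice, here is a minimal Python sketch. It is not Paech’s method, which the article does not describe in detail. It simply assumes that each model can be summarized by its most frequent word pairs and that two such summaries can be compared with Jaccard similarity. The function names and the sample strings are hypothetical.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Split text into lowercase words and return consecutive n-word tuples."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def profile(texts, n=2, top_k=50):
    """Summarize a set of outputs by its top_k most frequent n-grams."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def jaccard(a, b):
    """Share of items common to both sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample outputs, invented purely for illustration.
model_a = ["Let us consider the problem step by step.",
           "Let us consider the edge cases first."]
model_b = ["Let us consider the problem from another angle."]

score = jaccard(profile(model_a), profile(model_b))
print(f"Bigram profile overlap: {score:.2f}")
```

A high score on a toy comparison like this shows only that two sets of outputs share phrasing. It cannot by itself show where that phrasing came from.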

Pattern of Model Distillation and Prior Incidents with ChatGPT Data: Allegations of distilling competitors’ models are not new to DeepSeek. In December 2024, developers noticed that DeepSeek’s V3 model occasionally identified itself as ChatGPT, suggesting it may have been trained on outputs from OpenAI’s chatbot. OpenAI and Microsoft reportedly found suspicious activity in late 2024 involving the exfiltration of large volumes of data through developer accounts, which was allegedly traced back to DeepSeek affiliates. Taken together, these incidents lend weight to the idea that R1-0528 may be another case of unauthorized distillation, though none of them has been proven.

Big Tech Fights Back with Security and Obfuscation Measures: To combat data scraping and unauthorized training, companies like OpenAI, Google, and Anthropic have tightened security around their models. OpenAI, for example, now requires organizations to complete identity verification before accessing certain advanced models, and China is not among the countries the process supports. Google has started “summarizing” its model traces to make it harder for outsiders to train on them. Similarly, Anthropic announced in May that it would summarize its Claude model traces to protect its competitive position.

AI Data Contamination Makes Source Attribution Difficult: The broader AI community is grappling with the growing volume of AI-generated content online, which makes it hard to trace the true origins of training data. Whether intentionally harvested or incidentally included, AI outputs now saturate platforms like Reddit, Hugging Face, and X, forming part of the “open web” that many companies scrape for data. As a result, AI models may converge on similar outputs purely through shared exposure to this material. This complicates the case against DeepSeek: shared language traits do not prove deliberate distillation, but they do show how much harder model originality has become to verify.

Go Deeper -> DeepSeek may have used Google’s Gemini to train its latest model – TechCrunch

DeepSeek Allegedly Use Google’s Gemini to Train Its Latest AI Model – Tempo


June 5, 2025
