Johanna Cabildo: How Big Tech's Obsession with Data is Harming AI
In Brief
Big Tech's growing dependence on synthetic data is not only degrading the quality of artificial intelligence but also perpetuating bias and centralizing power. A more effective resolution lies in building an equitable, transparent, and human-centric data ecosystem.

Meta's LLaMA-4 was introduced amid great expectations, but ultimately it disappointed. Compared to the previous version, it exhibited poorer logical reasoning, more frequent inaccuracies, and an overall reduction in capability. Johanna Cabildo, CEO of D-GN, pointed out that the underlying issue wasn't a deficiency in computing resources or innovation; it was the data itself. After depleting the internet's reservoir of reliable, diverse, high-quality text, Meta turned to synthetic data: AI-generated content crafted for training newer AI models. This creates a self-perpetuating cycle in which models train on their own outputs, losing precision and depth with each pass.
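For intuition, the toy simulation below (a hypothetical sketch, not Meta's or anyone's actual pipeline) repeatedly "retrains" a model on data sampled from its own previous output. Because nothing new ever enters the loop, the diversity of the data can only shrink with each generation, which is the same degradation cycle in miniature.

```python
# Toy illustration of recursive training on synthetic outputs ("model collapse").
# Each generation's "model" is simply the empirical distribution of the previous
# generation's data, and new "training data" is sampled from it. Diversity can
# only shrink: a value that fails to be resampled is lost for good. This is a
# simplified analogue for intuition, not any real training pipeline.
import random

random.seed(42)
human_data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # diverse original corpus

data = human_data
for generation in range(1, 11):
    # "Train" on the previous generation's output: resample from it with replacement.
    data = [random.choice(data) for _ in range(len(data))]
    spread = max(data) - min(data)      # the range can only stay equal or shrink
    distinct = len(set(data))           # unique values only disappear, never return
    print(f"gen {generation:2d}: distinct={distinct:4d}  spread={spread:.3f}")

# The number of distinct values falls from 1,000 to a small fraction of that
# within ten generations, mirroring the gradual loss of depth described above.
```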
Other prominent firms such as OpenAI, Google, and Anthropic are stuck in a similar predicament. The era of plentiful, authentic training data has come to an end. What remains is increasingly limited, producing stagnation masked by a façade of progress.
It has been observed that just eight companies now dominate 89% of the world's AI training data and infrastructure. This is not merely a matter of competitive power; it fundamentally shapes the knowledge encoded in AI and crowds out diverse viewpoints in favor of synthetic filler. Models trained on biased or limited datasets can perpetuate real-world harms. For instance, AI systems built on American healthcare data may misdiagnose patients in other countries. Similarly, recruitment algorithms may unfairly disadvantage candidates whose names are uncommon in Western contexts.
Who Owns the Data?
The 2024 Stanford AI Index reported that facial recognition technologies are notably less reliable for individuals with darker skin tones, especially women. Moderation systems, meanwhile, can flag minority dialects as offensive or irrelevant. Both trends push underrepresented identities further to the margins.
As these models increasingly rely on synthetic data, the errors they produce escalate. Experts caution against the emergence of 'polished nonsense': language that appears accurate but is riddled with fabrications. A Columbia Journalism Review study in early 2025 found that Google Gemini provided completely accurate citations only 10% of the time. The more these systems feed on their own flawed outputs in recursive loops, the faster their quality deteriorates.

AI companies originally built their models on a wealth of publicly accessible knowledge, including books, Wikipedia entries, online forums, and news articles. Now those same companies are restricting access to their models and monetizing them. In late 2023, The New York Times sued OpenAI and Microsoft over unauthorized usage of its content. At the same time, platforms like Reddit and Stack Overflow have entered exclusive licensing deals, collaborating with OpenAI to supply user-generated content that was once freely available to everyone.
The strategy is unmistakable: gather freely available public knowledge, exploit it commercially, and confine it behind APIs. The very firms that profited from open systems are now limiting access while promoting synthetic data as a viable substitute, despite mounting evidence that it undermines model performance. AI cannot progress merely by learning from its own reflection; it gains no new knowledge from a mirror. Addressing AI's data crisis doesn't require more computing power or larger models; it requires a fundamental overhaul of how data is collected, valued, and governed.
Locked In, Locked Out
Web3 technologies could present a promising direction. Blockchains can trace the provenance of data, and token-based frameworks could ensure fair compensation for the people who contribute their knowledge. Initiatives like Morpheus Labs have used these tools to improve the performance of Swahili-language AI by 30%, simply by incentivizing community participation.
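As a rough illustration of how provenance tracking and contributor rewards could fit together, the sketch below hash-chains each data contribution (so its origin and order are tamper-evident) and credits tokens to its contributor. The `DataLedger` class, record fields, and reward amounts are illustrative assumptions, not the design of Morpheus Labs or any other project.

```python
# Minimal, hypothetical sketch of provenance plus reward bookkeeping for data
# contributions. Each record is hash-chained to the previous one, so altering
# any earlier contribution breaks every later hash (tamper evidence), and
# token credits are tallied per contributor. Illustration only, not a real
# platform's protocol.
import hashlib
import json
import time

class DataLedger:
    def __init__(self, reward_per_item: float = 1.0):
        self.reward_per_item = reward_per_item
        self.records = []       # ordered, hash-chained entries
        self.balances = {}      # contributor -> token credits

    def add_contribution(self, contributor: str, content: str) -> dict:
        prev_hash = self.records[-1]["hash"] if self.records else "0" * 64
        record = {
            "contributor": contributor,
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.records.append(record)
        self.balances[contributor] = self.balances.get(contributor, 0.0) + self.reward_per_item
        return record

    def verify(self) -> bool:
        """Recompute the chain; any edited record invalidates every later link."""
        prev_hash = "0" * 64
        for rec in self.records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if rec["prev_hash"] != prev_hash:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["hash"]:
                return False
            prev_hash = rec["hash"]
        return True

ledger = DataLedger(reward_per_item=2.5)
ledger.add_contribution("amina", "Swahili sentence pair #1")
ledger.add_contribution("joseph", "Swahili sentence pair #2")
print(ledger.verify(), ledger.balances)   # True {'amina': 2.5, 'joseph': 2.5}
```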
Privacy-enhancing tools such as zero-knowledge proofs provide another essential layer of trust. They make it possible to train models on sensitive information, such as healthcare data, without exposing individual records, enabling ethical learning without sacrificing performance. These concepts are not just theoretical: startups are already adopting decentralized technologies to build culturally attuned, privacy-compliant AI solutions globally.
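To get a feel for what a zero-knowledge proof actually does, the toy Schnorr identification protocol below lets a prover demonstrate knowledge of a secret without ever revealing it. Production systems for private model training rely on far heavier machinery (zk-SNARKs, secure aggregation, and the like), so treat this as an illustration of the principle only, with deliberately tiny, insecure parameters.

```python
# Toy Schnorr identification protocol: a classic interactive zero-knowledge
# proof of knowledge. The prover convinces the verifier that it knows the
# secret x behind the public value y = g^x mod p without revealing x.
# Parameters are deliberately tiny and insecure -- for intuition only, not a
# real privacy-preserving training system.
import secrets

p = 2039    # small safe prime: p = 2q + 1
q = 1019    # prime order of the subgroup
g = 4       # generator of the order-q subgroup (a quadratic residue mod p)

# Prover's secret and the corresponding public value.
x = secrets.randbelow(q - 1) + 1      # secret "knowledge" (never sent)
y = pow(g, x, p)                      # public commitment to the secret

def prove_and_verify() -> bool:
    # 1. Commitment: prover picks a fresh random nonce and sends t.
    r = secrets.randbelow(q)
    t = pow(g, r, p)
    # 2. Challenge: verifier replies with a random challenge c.
    c = secrets.randbelow(q)
    # 3. Response: prover answers with s, which blends the nonce and the secret.
    s = (r + c * x) % q
    # 4. Check: holds exactly when the prover knows x, yet t, c, and s
    #    together do not reveal x.
    return pow(g, s, p) == (t * pow(y, c, p)) % p

print(all(prove_and_verify() for _ in range(100)))   # True
```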
AI is influencing the structures that shape our society—education, healthcare, employment, and communication. The pivotal issue has shifted from whether AI will take a dominant role to who will dictate its evolution.
A Different Path
As the AI industry grapples with the challenges posed by synthetic data and monopolistic infrastructure, services like D-GN point toward a promising path ahead: a future where AI is shaped by people, for people, contributing to a more equitable and intelligent world.
Should we permit a select few companies to recycle their own outputs, diminish model quality, and reinforce biases? Or will we make the commitment to create a new data ecosystem—one that appreciates transparency, equity, and shared ownership?
The root of the issue isn't that machines lack data; it's that the data they use is growing increasingly synthetic, narrow, and controlled. The remedy is to empower the people who create valuable content and ensure they are rewarded for their contributions. Better AI begins with better data, and better data begins with us.