Hey everyone, I'm Ning Fan.
Recently, FanFan came across a set of numbers that left him restless. In May 2026, Epoch AI dropped a report suggesting that large language models might use up all publicly available text data on the internet sometime between 2026 and 2032. Meanwhile, a more aggressive report from China's Telecom Institute predicts that by 2026, large language model training could completely exhaust usable text data.
This isn't some far-off sci-fi scenario; it's happening right now. The AI industry is facing more than just copyright lawsuits, like the one on May 5th, when Elsevier teamed up with five major publishers to collectively sue Meta, accusing Llama of training on a massive amount of pirated books. The deeper crisis is that high-quality data is running dry. The low-hanging fruit of publicly available internet data is getting stripped bare, while the truly valuable vertical data (medical imaging, financial transaction records, legal precedents, industrial parameters) is all locked away in institutional vaults where AI can't access it.
To put it plainly: the 'food crisis' for AI is upon us. And it's not that food has become expensive; it's that there's really not much left.
This is also why I've been keeping an eye on @OpenLedger lately. This project isn't telling the tired old story of 'decentralized GPT'; they're getting right into the data source — that's their Datanets system.
Datanets can be understood as 'data cooperatives'. For example, in the medical imaging sector, we could set up a specialized Datanet where doctors, hospitals, and research institutions worldwide contribute anonymized imaging data. Contributors earn $OPEN rewards based on data quality and usage frequency, while model developers pay to access these verified high-quality datasets for training specialized models. Each vertical sector—finance, industrial manufacturing, legal contracts—can create its own Datanet to unlock the 'dark data' trapped within institutions.
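To make the "rewards based on data quality and usage frequency" idea concrete, here's a minimal sketch of how a reward pool could be split. Everything in it is my own assumption for illustration: the `Contribution` fields, the `split_rewards` function, and the simple quality-times-usage weighting are hypothetical, not OpenLedger's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    contributor: str
    quality: float  # hypothetical quality score between 0 and 1
    uses: int       # hypothetical count of times the data was accessed


def split_rewards(pool: float, contributions: list[Contribution]) -> dict[str, float]:
    """Split a reward pool in proportion to quality * usage (an assumed weighting)."""
    weights: dict[str, float] = {}
    for c in contributions:
        weights[c.contributor] = weights.get(c.contributor, 0.0) + c.quality * c.uses
    total = sum(weights.values())
    if total == 0:
        # Nothing was used or everything scored zero: nobody gets paid.
        return {name: 0.0 for name in weights}
    return {name: pool * w / total for name, w in weights.items()}


# Illustrative numbers only.
rewards = split_rewards(1000.0, [
    Contribution("hospital_a", quality=0.9, uses=100),
    Contribution("clinic_b", quality=0.5, uses=20),
])
print(rewards)
```

With these made-up numbers, the high-quality, heavily used dataset takes roughly nine tenths of the pool, which is the incentive the cooperative model is going for.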
I believe the reason this logic holds up is that it breaks through a major barrier. There are massive amounts of high-quality data globally, but it's locked away in 'data silos': institutions don't share common standards, data formats are inconsistent, and there's almost no mechanism for cross-platform sharing. OpenLedger's goal isn't to collect data themselves but to provide infrastructure that allows any community to 'self-organize' around specific domain data.
Its core weapon is the Proof of Attribution we talked about before — but today, I want to flip the script and look at this from the perspective of the 'data supply chain'.
In traditional AI training, provenance is a tangled mess: nobody can reliably say where the data came from, who handled it, how it was processed, or which parts of the model output it ultimately influenced. Data contributors get one-time buyouts, and they see none of the profits the models go on to generate.
On OpenLedger, every piece of data is hashed and anchored on-chain from the moment it's uploaded. The entire process of labeling and verification is recorded, with training logs and dataset references going on-chain during model training. When it comes time for inference, the attribution engine automatically traces back which data points contributed the most, and then distributes rewards through smart contracts. Data contributors aren't bought out; they hold 'data equity' — as long as your data is being used, you keep earning.
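Here's a rough sketch of that flow in Python, just to show the shape of it. To be clear about what's assumed: the `registry` dict is a stand-in for an on-chain record, the attribution scores are numbers I made up, and `fingerprint` and `distribute` are hypothetical helpers, not OpenLedger's API. How a real attribution engine computes its scores is outside this sketch.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the raw data, the kind of hash that could be anchored on-chain."""
    return hashlib.sha256(data).hexdigest()


def distribute(fee: float, scores: dict[str, float]) -> dict[str, float]:
    """Split an inference fee in proportion to positive attribution scores."""
    positive = {key: s for key, s in scores.items() if s > 0}
    total = sum(positive.values())
    if total == 0:
        return {}
    return {key: fee * s / total for key, s in positive.items()}


# Stand-in for the on-chain registry: fingerprint -> contributor.
registry: dict[str, str] = {}
for owner, blob in [("alice", b"scan-001"), ("bob", b"scan-002")]:
    registry[fingerprint(blob)] = owner

# Hypothetical attribution scores for a single inference call.
scores = {fingerprint(b"scan-001"): 0.75, fingerprint(b"scan-002"): 0.25}

# Map each fingerprint's share back to the contributor who registered it.
payouts: dict[str, float] = {}
for fp, amount in distribute(2.0, scores).items():
    owner = registry[fp]
    payouts[owner] = payouts.get(owner, 0.0) + amount

print(payouts)
```

The point of keying everything on the hash is that the payout can always be traced back to a specific registered upload, which is what separates ongoing 'data equity' from a one-time buyout.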
OpenLedger refers to this entire chain as a 'verifiable data pipeline'. I like to call it a 'sunshine supply chain' for data. From collection to cleansing to verification to transmission, every step is auditable on-chain, so any malicious data contamination or unclear sources can be spotted right away.
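Why would tampering be easy to spot? One common way to get that property is a hash chain, where each pipeline record includes the hash of the record before it. The sketch below is my own illustration of that general technique, assuming a plain Python list in place of a blockchain; the function names and the record layout are hypothetical and not taken from OpenLedger.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, step: str, payload: dict) -> str:
    """Hash one pipeline step together with the hash of the step before it."""
    body = json.dumps({"prev": prev_hash, "step": step, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list, step: str, payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"step": step, "payload": payload, "prev": prev,
                "hash": entry_hash(prev, step, payload)})


def verify(log: list) -> int:
    """Return the index of the first inconsistent entry, or -1 if the log is intact."""
    prev = GENESIS
    for i, entry in enumerate(log):
        expected = entry_hash(prev, entry["step"], entry["payload"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return i
        prev = entry["hash"]
    return -1


log: list = []
for step in ["collection", "cleansing", "verification", "transmission"]:
    append(log, step, {"records": 100})

intact = verify(log)                 # -1: nothing has been altered
log[1]["payload"] = {"records": 99}  # someone quietly edits the cleansing step
tampered_at = verify(log)            # 1: the edited entry no longer matches its hash
```

Because every record commits to the one before it, changing any step after the fact breaks the chain from that point on, so an auditor only has to recompute the hashes.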
Plus, OpenLedger isn't in this alone. They teamed up with Story Protocol for a major move in January 2026, jointly launching a new standard for copyright clearance and automatic payments for creators of AI training data. How does it work? Story handles IP registration and licensing terms, while OpenLedger manages execution and verification: when authorized content is used in training, the system cryptographically verifies the IP usage and then automatically pays the rights holder. Under this system, drawn-out rights battles like the Elsevier case might never need to happen.
Let’s also discuss $OPEN’s role in the grand scheme. I've looked around and found that $OPEN isn't just a 'governance token' trying to pull the wool over your eyes. Data contributors earn $OPEN rewards through the attribution engine, model developers pay $OPEN as gas to register and deploy models, and users pay for model inference in $OPEN as well. Of each inference fee, a portion goes to the model creators, another part to upstream data contributors, and the rest goes into the public infrastructure fund. This entire economic cycle is what OpenLedger calls 'payable AI': everyone who does work along the AI pipeline gets paid for it, and economic activity is no longer a game dominated by giants.
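To picture the three-way split, here's a tiny sketch. The percentages are placeholders I picked for illustration (the post's sources don't give the real ratios), and `split_fee` is a hypothetical helper, not a real contract call. It works in integer token units and basis points so no fraction of a fee gets lost to rounding.

```python
def split_fee(fee_units: int, creator_bps: int = 5000, data_bps: int = 3000) -> dict[str, int]:
    """Split an inference fee three ways using basis points (1 bps = 0.01%).

    The default ratios are illustrative assumptions, not published parameters.
    """
    if creator_bps < 0 or data_bps < 0 or creator_bps + data_bps > 10_000:
        raise ValueError("shares must be non-negative and sum to at most 10,000 bps")
    creator = fee_units * creator_bps // 10_000
    data = fee_units * data_bps // 10_000
    # Whatever remains, including rounding dust, goes to the infrastructure fund.
    fund = fee_units - creator - data
    return {"model_creator": creator, "data_contributors": data, "infrastructure_fund": fund}


shares = split_fee(1_000)
print(shares)
```

Giving the remainder to the fund is just one design choice; the key property is that the three shares always add back up to the fee that was paid.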
I've always believed that the sexiest narrative of Web3 isn't about recreating a casino, but using tech to tackle real-world issues. The AI data depletion problem is no exaggeration: if the way data is produced and paid for doesn't change, the ceiling for AI development is painfully visible. I can't guarantee that OpenLedger will be the game changer, but at least their proposal pushes the conversation about 'how data is produced and how profits are shared' significantly forward.
What do you all think? Is the data famine really here, or is it just fearmongering? Can decentralized data really hold its ground? Let's chat in the comments, I'm here waiting. Don't forget to follow @OpenLedger and $OPEN for the latest scoop, we'll discuss as we go!
