There is a well-known idea in management: people adjust their behavior based on how they are measured. The same idea applies to artificial intelligence. In today’s AI world, how models are measured often shapes how they are built, improved, and used.
As generative AI has grown, one challenge has quietly become harder than expected: keeping track of models. New names, versions, formats, and releases appear constantly. Even experienced engineers find it difficult to know which model performs best for a specific task. This confusion has made benchmarking more important than ever.
A detailed performance spreadsheet created by Harlan Lewis recently attracted attention in the AI community. It compares many leading AI models and shows how their performance changes over time. Resources like this are valuable because they simplify complex information and give people a clear starting point for understanding model quality.
But looking deeper raises a bigger question. As AI systems move toward more autonomous and agent-based designs, are current benchmarks still enough? Or are they starting to show their limits?
At its simplest, a benchmark is a reference point. In AI, it usually means testing models on the same dataset using the same rules and metrics. The assumption is clear: higher scores mean better models. This method has helped the field move forward for years.
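To make this concrete, here is a minimal Python sketch of what a benchmark reduces to: the same test set, the same metric, and a ranking by score. The model names, labels, and predictions below are made up for illustration.

```python
# A benchmark in miniature: every model is scored on the same test set
# with the same metric, and the scores are ranked. All names and data
# here are hypothetical placeholders.

def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(pred == ref for pred, ref in zip(predictions, labels))
    return correct / len(labels)

# Shared test set: identical reference answers for every model.
test_labels = ["cat", "dog", "dog", "cat", "bird"]

# Each model's predictions on the same inputs (made-up values).
model_outputs = {
    "model_a": ["cat", "dog", "cat", "cat", "bird"],
    "model_b": ["dog", "dog", "dog", "cat", "dog"],
}

# Same rules, same metric: a higher score is read as "a better model".
leaderboard = sorted(
    ((accuracy(preds, test_labels), name) for name, preds in model_outputs.items()),
    reverse=True,
)
for score, name in leaderboard:
    print(f"{name}: {score:.2f}")  # model_a: 0.80, model_b: 0.60
```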
Benchmarking serves two main purposes. First, it helps developers choose the right model for real-world use. Second, it shows researchers where improvements are needed. This second role has been especially important in machine learning. Progress often comes from trying to beat previous scores on shared tasks.
This idea was clearly described by statistician David Donoho in his work on the “Common Task Framework.” In this framework, researchers share public datasets, define clear goals, and use hidden test data to score models fairly. This structure helps prevent cheating and ensures results can be compared objectively.
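A simplified sketch of that structure is shown below: an organizer publishes the test inputs, keeps the answer key hidden, and scores every submission with one agreed metric. The class, data, and method names are illustrative assumptions, not any real platform’s API.

```python
# A toy Common Task Framework: participants see the inputs but never the
# held-out answers; only the organizer can compute a score.

class BenchmarkOrganizer:
    def __init__(self, test_inputs, hidden_labels):
        self._test_inputs = test_inputs      # published to participants
        self._hidden_labels = hidden_labels  # never released

    def get_test_inputs(self):
        """Participants may download the inputs, but not the answers."""
        return list(self._test_inputs)

    def score(self, predictions):
        """Exact-match accuracy against the hidden labels."""
        correct = sum(p == ref for p, ref in zip(predictions, self._hidden_labels))
        return correct / len(self._hidden_labels)

# The organizer holds the answer key; participants only ever call score().
organizer = BenchmarkOrganizer(
    test_inputs=["2+2", "3*3", "10-4"],
    hidden_labels=["4", "9", "6"],
)

# A participant's (hypothetical) model predicts on the public inputs...
predictions = ["4", "9", "7"]
# ...and the organizer reports a single comparable number.
print(f"score: {organizer.score(predictions):.2f}")  # score: 0.67
```

Because the labels stay hidden, a submission cannot simply memorize the answers, which is what makes scores comparable across teams.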
Over time, this approach proved very effective. Platforms like Kaggle and Papers with Code, along with tasks like image recognition, speech processing, and translation, all benefited from it. The reason is simple: these problems have well-defined answers, bounded datasets, and measurable outcomes.
However, not all problems fit this structure.
One growing concern is artificial general intelligence. By definition, general intelligence should work across many situations, not just inside narrow tests. Some researchers argue that benchmarks naturally limit what can be claimed. A model doing well on a benchmark only proves it performs well on that specific test, nothing more.
This becomes clearer when thinking about everyday human tasks. Many daily activities involve judgment, context, and personal interpretation. Designing a single dataset and metric for such tasks is extremely difficult. Even when a benchmark is defined, there will always be real-life cases that fall outside its scope.
There is also a modern challenge. Many AI models are trained on massive web-scale datasets that may already include benchmark test data. This makes evaluation harder, because a high score may reflect memorization of leaked test items rather than genuine capability.
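One rough but common way to probe for this kind of contamination is to check whether long word sequences from test items appear verbatim in the training corpus. The sketch below is a minimal illustration under that assumption; the data is invented, and real contamination audits are considerably more involved.

```python
# A crude contamination probe: if long n-grams from a benchmark item occur
# verbatim in the training corpus, the score on that item is suspect.

def ngrams(text, n=8):
    """Set of lowercase n-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item, training_corpus, n=8):
    """True if any n-gram of the test item occurs verbatim in training data."""
    return bool(ngrams(test_item, n) & ngrams(training_corpus, n))

# Hypothetical data: the first test item was scraped into the training set.
training_corpus = "... the quick brown fox jumps over the lazy dog near the riverbank ..."
test_items = [
    "the quick brown fox jumps over the lazy dog near the riverbank",
    "a completely different question that never appeared during training",
]

for item in test_items:
    flag = "possibly contaminated" if looks_contaminated(item, training_corpus) else "clean"
    print(f"{flag}: {item[:40]}")
```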
So where does this leave the AI field? There are two possibilities. Either the Common Task Framework is still enough, and the right benchmarks for general intelligence simply haven’t been built yet. Or the framework itself has limits, and a new way of measuring AI progress will be needed.
At the moment, no one has a clear answer. What is clear is that access to high-quality data remains critical. As long as benchmarks guide development, progress depends heavily on who can access data and how fairly it is shared.
This is where Kite AI focuses its efforts. The project aims to support an open and healthy AI ecosystem by improving data access and aligning incentives between contributors and AI agents. By doing so, it hopes to reduce barriers and make AI development more inclusive.
Kite AI does not claim to have solved the benchmarking challenge. Instead, it takes a practical step forward by ensuring that lack of data does not slow innovation. In a world where measurement shapes behavior, fair access to data may be one of the most important foundations for meaningful AI progress.
About Kite AI
Kite AI is a decentralized infrastructure built to support the future AI economy. Its core idea, called Proof of AI, is designed to fairly reward contributors who provide data, models, and intelligent agents.
By solving issues around incentives, collaboration, and trust, Kite AI aims to create a more open and balanced AI ecosystem. Through blockchain-based design and a focus on transparency, the project works toward making AI development more accessible, secure, and globally inclusive.