Looking Ahead to 2025

Although it's February now, I still think it would be nice to wrap up 2024 together with the start of 2025 and look ahead into the new year.

The past thirteen months have been great, with a lot of good things happening. We've made huge progress across models' multimodal, reasoning, and agentic abilities, which are all important components on my own imaginary roadmap to capable AI systems that would have a huge impact on our species (or what people call "AGI").

On multimodal models

This is an interesting topic. It matters because I strongly believe that letting models "feel" the world in many different ways is a key to helping them better understand physics, the world, and the whole universe. Text does not capture everything in "language"; language is diverse, and it's much richer than text.

Currently, I think the existing tokenizer is what holds models back. In fact, there are many simple tasks that we, as humans, would definitely not get wrong; however, even the strongest LLM currently available (i.e., o1-pro) easily gets stuck on them. For example:

So I think we have to come up with a real multimodal model that gets rid of the limitations of current visual encoders/tokenizers and really understands images. This is a SUPER basic ability that models need to have.
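As a small text-side analogy (the token segmentation below is an assumption for illustration, not the output of any real tokenizer), subword tokenization hides character-level structure, which is one reason seemingly trivial questions can trip models up:

```python
# Hypothetical illustration: subword tokenizers split words into chunks,
# so a model never directly "sees" individual characters.
# This segmentation is assumed for illustration only.
tokens = ["str", "aw", "berry"]

word = "".join(tokens)
print(word)             # strawberry
print(word.count("r"))  # 3 (trivial at the character level)

# From the model's point of view, "how many r's?" must be answered from
# opaque token IDs, none of which encodes "this chunk contains 2 r's".
per_token_r = [t.count("r") for t in tokens]
print(per_token_r)      # [1, 0, 2]
```

Visual encoders impose an analogous bottleneck: the model reasons over compressed patch embeddings rather than the raw pixels.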

Beyond multimodal input, we have multimodal output. This showed up in GPT-4o back in May and in the new Gemini 2. The reason I think it is also pretty cool is that it's better than letting a model write prompts for DALL·E or Midjourney to create images, because traditional text-to-image models have a lot of limitations. They sometimes get stuck on complex things, and they don't understand what they are drawing. Models with true multimodal output, however, know what they need to generate, and humans can have them iterate on the results. What's more, we could do more fun things with such an ability, like:

Pretty cool, right? And since you can let the model generate or edit a picture for you, everyone can do Photoshop-style work without actually having that expertise, which is super convenient.

On reasoning models

This has been the hottest topic of the last few months, and I already wrote about it last August. So far, we have several reasoning/thinking models in hand (the o-series models, R1, the Gemini Thinking models, and a lot of others from research).

Thanks to RL, progress is really fast; for example, in roughly three months from o1 to o3, the model became able to solve a bunch of ARC-AGI tasks, and we can expect more crazy things in the coming months.

I think the idea of giving the model more time to respond is great. However, the model sometimes has an overthinking problem, which wastes both time and compute. For example, when you ask R1 "1+1", it will think for seconds (~100 tokens):

So that's why I'd say a model being able to control when it needs to think is another important ability, and it may be the next focus for researchers. But before that, we need to make general reasoning ability (beyond math and coding) better. BTW, in the blog I wrote last year, I mentioned humans' system 1 and system 2 thinking patterns. I still think that framing is useful; directly applying it to current models isn't feasible, but we can still borrow some ideas from it.
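One way to picture "controlling when to think" is a router that assigns a thinking budget per query, in the spirit of system 1 vs. system 2. This is purely a sketch; the heuristic and budget numbers below are my own assumptions, not any lab's actual method:

```python
# Sketch of a "thinking budget" router: easy queries skip the long
# chain-of-thought; hard ones get a larger token budget.
# The keyword heuristic and budget values are illustrative assumptions.

def thinking_budget(query: str) -> int:
    hard_markers = ("prove", "derive", "optimize", "debug")
    q = query.lower()
    if len(query) < 20 and not any(m in q for m in hard_markers):
        return 0      # answer directly, system-1 style
    if any(m in q for m in hard_markers):
        return 4096   # long deliberate reasoning, system-2 style
    return 512        # a modest default amount of thinking

print(thinking_budget("1+1"))                        # 0
print(thinking_budget("prove this bound is tight"))  # 4096
```

A real model would presumably learn this routing rather than rely on keywords, but the interface (a per-query compute budget) is the point of the sketch.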

Besides, another interesting thing mentioned in R1's paper is that the model sometimes mixes languages in its thinking process. I think as we continue to scale RL and test-time compute, we may even see models generate nonsense or scrambled text while the final result is not affected at all. That would be the moment when we say, "Okay, RL just works." (But this would be a total disaster for Anthropic and some AI doomers lol :P)

On agents

Besides reasoning, this is another term that everyone is addicted to using. I still remember that last year almost every product claimed to have some "AI agents" stuff (and I blacklist every product that says so).

There are only a few real agents in my mind, like Project Astra from DeepMind, and Operator and Deep Research from OpenAI. These tools are really AI systems that can take reasonable actions for you.

My definition here is that only when you have a good reasoner, i.e., your model can reason well, can you call the tool or system built upon your model an agent. That is what we should expect, instead of those weird, fancy tools where you click a button and they summarize some emails or things like that.

Although the products that claimed to have agents over the past year are not acceptable to me, the idea behind them is actually kind of good; what they need is a better base model, e.g., o3-mini: fast, cheap, and capable.

Another core feature that would really push these agents forward is in-thinking tool use. When o1 with tool use was released, I was concerned that o1's flow was just thinking, calling a tool, then responding directly; but now with o3, my concerns have vanished. o3's tool-use flow is thinking, tool use, rethinking (maybe for another few turns), then responding. In fact, I've seen great benefits from this pattern in o3-mini with web browsing and in Deep Research powered by fine-tuned o3. And I expect to see more agents from OpenAI and other research labs.
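The think, tool, rethink, respond flow can be sketched as a simple loop. This is a toy sketch: the "model" is a scripted stand-in, and the calculator tool and trace format are made up for illustration, since the real pipelines are not public:

```python
# Toy sketch of in-thinking tool use: reasoning steps are interleaved
# with tool calls until the model decides to answer.
# The tool name and scripted trace are illustrative assumptions.

def calculator(expr: str) -> str:
    return str(eval(expr))  # toy tool; never eval untrusted input in practice

TOOLS = {"calculator": calculator}

# A scripted trace standing in for model outputs: each step is either
# ("think", text), ("tool", name, args), or ("answer", text).
trace = [
    ("think", "The user wants 17 * 23; I should verify with a tool."),
    ("tool", "calculator", "17 * 23"),
    ("think", "The tool returned 391, so I can answer confidently."),
    ("answer", "17 * 23 = 391"),
]

def run(trace):
    observations = []
    for step in trace:
        if step[0] == "tool":
            _, name, args = step
            observations.append(TOOLS[name](args))  # result feeds back into "thinking"
        elif step[0] == "answer":
            return step[1]
    return None

print(run(trace))  # 17 * 23 = 391
```

The key difference from the older pattern is that tool results land mid-reasoning, so the model can rethink (and call more tools) before committing to an answer.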

Phew~ that’s all I wanna say. January is just a starting point, and we’re gonna have a wild ride in the coming months! Just buckle up for it.