For some time, I have been thinking about how we could push the frontier of current reasoning models forward: not just their performance, but also other “use cases” or “features”. So I came up with these two questions, which I haven’t really seen others discuss (maybe they have?):
How could the current reasoning model paradigm affect or enhance multimodal models?
How could so-called “hybrid reasoning models” (possibly) work?
And around these two topics, I have some personal thoughts. I’m not aiming to be definitively correct, but rather just to share my own opinions.
1. How could the current reasoning model paradigm affect or enhance multimodal models?
People are shocked by how well Gemini-2.5 Flash Image (aka Nano-Banana) performs: image generation, editing, and so much more. A lot of models today can do more than just output text; they can also create audio and images. So I’m wondering whether we could bring those multimodal capabilities into reasoning models. Instead of only thinking in text, could they also think in images and audio? If we could let a reasoning model use multimodal content not just in its final response but also during its internal thinking, the results could be much better. This may sound confusing, so here are a few examples (plus a rough sketch of the data shape after them):
audio modality: A user is preparing a speech and wants the model to generate an audio sample. gpt-realtime or gemini-2.5 may help, but the quality isn’t guaranteed. If the model could reason about what emotions to convey at specific points, produce draft audio to try them out, and iterate, the resulting quality may improve. For example:
user:
I’ll have a speech about the future of our planet earth next week at the UN. Could you please give me a sample for it?
model (CoT):
The user is asking: “...” So they want a sample speech.
I could give a text script, but that wouldn’t be all that helpful. I guess giving them an example audio would be better. It may also need some emotion. I should work out the structure and content first.
...
Alright, I’m done with the initial composition. Let me explore what the emotions should be, do some drafts, and then it will be good to go.
Well, I should have hope, urgency, inspiration, responsibility, solidarity, empathy, ...
OK, so for hope, I think I should use a rising tone, just like this:
[a voice clip with a rising tone]
Cool, sounds good, let’s continue:
...
Alright, I think I’ve got all the emotions done. Let me make a first draft:
[full composed speech with correct emotions]
...
model (response):
Here’s the speech I prepared for you:
[full final speech audio]
image modality: Although the latest Nano-Banana can create and edit amazing images, it still struggles in complex scenarios. This makes sense, because the model outputs the image in a single shot. If it could take some time to think and draft first, like humans do, the results might be better. For example:
user:
Please create a cinematic image of an interstellar war. It needs to be epic.
model (CoT):
The user is asking me to generate an image of ...
So let me first imagine the scene: a massive fleet of sleek starships engaged in battle over a luminous nebula, laser beams and plasma torpedoes streaking across space, explosions creating bright fireballs, a giant battlecruiser in the foreground with detailed hull, background with distant stars and a planet partially visible ...
Cool, let me create a sketch first:
[a sketch of the scene with only a few lines]
...
Hmm, the planet in the image doesn’t look real; let me change its position:
[a near-final version of the image with all elements and colors]
...
Great, it seems perfect, let me give this to the user.
model (response):
Here’s the image you want:
[the final version of the image]
or even video modality (I know there’s no language model yet that can output both text and video, but maybe it’s a good idea)
Today’s video generation models, like Veo-3, are great as well; however, they all share some big weaknesses: short clips, inability to follow a complex story, etc. So giving the model the ability to think in audio, images, and more could enhance it greatly. For example:
user:
please give me a short movie about space exploration.
model (CoT):
The user asks for a short movie ...
I plan to write the script first, then draw some key frames and create some suitable BGM. This seems good. Let me proceed with some ideas.
...
Phew. I think this is amazing; the user is gonna love it. So next, frames.
For the opening, I think this would be good:
[a sketch frame for the first part of the story]
...
That’s great. I’m all done. Now, music.
[some background audio clips for specific frames]
...
Sounds good. Then let me move on to the final video.
...
model (response):
Here’s the movie you want:
[final movie]
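To make the shape of this idea a bit more concrete, here is a minimal sketch of what an interleaved multimodal reasoning trace could look like as data. Every type and field name here (TextPart, MediaPart, ReasoningTrace) is my own assumption for illustration, not any real API; the point is only that the thinking part is allowed to hold media drafts, not just text.

```python
from dataclasses import dataclass, field
from typing import Literal, Union

@dataclass
class TextPart:
    text: str

@dataclass
class MediaPart:
    modality: Literal["image", "audio", "video"]
    data: bytes        # raw bytes or latent tokens, decoded on demand
    note: str = ""     # why the model produced this draft

@dataclass
class ReasoningTrace:
    # the hidden chain of thought: text interleaved with media drafts
    thinking: list[Union[TextPart, MediaPart]] = field(default_factory=list)
    # the visible answer, which can also be multimodal
    response: list[Union[TextPart, MediaPart]] = field(default_factory=list)

# The image example above would roughly serialize to:
trace = ReasoningTrace(
    thinking=[
        TextPart("Let me first imagine the scene ..."),
        MediaPart("image", b"<sketch latents>", note="rough line sketch"),
        TextPart("The planet doesn't look real, let me change its position."),
        MediaPart("image", b"<revised latents>", note="near-final draft"),
    ],
    response=[
        TextPart("Here's the image you want:"),
        MediaPart("image", b"<final image>"),
    ],
)
```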
2. How could so-called “hybrid reasoning models” (possibly) work?
So the definition of hybrid reasoning models: models that can either respond directly or think deeply before responding (or even decide on their own when to think more). Currently, only a few models have this ability: Claude-3.7, Claude-4, DeepSeek-v3.1, Qwen-3, and some more (GPT-5 doesn’t count for now, because it uses a router).
Claude is closed-source, so we don’t know how their thinking toggle actually works (maybe it’s similar to other models). For DeepSeek-v3.1 and Qwen-3’s non-thinking mode, the prompt is simply prefilled with a blank thinking block (like <think> </think>). This is a quick and easy way to let the model skip thinking and respond directly, but… well, it seems the results were not that satisfying for the Qwen team, and they soon separated the thinking and non-thinking modes into two models (here).
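As a side note, here is a rough sketch of that prefill trick. The actual chat templates of DeepSeek-v3.1 and Qwen-3 use different special tokens; this is only meant to show the idea, not the real templates.

```python
def build_prompt(system: str, user: str, thinking: bool) -> str:
    # Toy chat template; the special tokens are placeholders, not the real ones.
    prompt = f"<|system|>{system}<|user|>{user}<|assistant|>"
    if not thinking:
        # Prefill an empty thought block so the model starts its answer directly.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

# thinking=True  -> the model is free to open its own <think>...</think> block
# thinking=False -> the thinking block is already "spent", so it answers right away
print(build_prompt("You are a helpful assistant.", "What is 1+1?", thinking=False))
```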
But what if we let the model always think, just about different things depending on the setting? Here’s what I mean:
In a basic sense, we could train the model to know how to react to each setting (thinking mode on, off, or auto). At inference time, we let the model know (e.g., via the system prompt) which mode the user picked. Then, in every mode, the model always takes a look at the current setting and decides on its own what it needs to do. This means less manual intervention; the model just knows what to do.
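Here is a minimal sketch of what the inference-time side of this could look like. The <thinking_mode> tag matches the example behaviors below, but build_system_prompt, chat, and model.generate are hypothetical names I’m using purely for illustration.

```python
from typing import Literal

Mode = Literal["on", "off", "auto"]

def build_system_prompt(base_instructions: str, mode: Mode) -> str:
    # The model is trained to read this tag at the start of its CoT and act on it.
    return f"{base_instructions}\n<thinking_mode>{mode}</thinking_mode>"

def chat(model, user_message: str, mode: Mode = "auto") -> str:
    system = build_system_prompt("You are a helpful assistant.", mode)
    # Note: no prefilled <think> block here. The model always gets to "think",
    # but in off mode its thought should just be a short acknowledgement of the
    # setting before it responds. (model.generate is a hypothetical API.)
    return model.generate(system=system, user=user_message)
```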
Example behavior:
Thinking mode on:
system: ... <thinking_mode>on</thinking_mode>
user:
...
model (CoT):
Let me see. I see the thinking mode is set to on, which means I should take more time to keep thinking before I respond.
The user asks ...
model (response):
...
Thinking mode off:
system: ... <thinking_mode>off</thinking_mode>
user:
...
model (CoT):
Hmmm... I see the thinking mode is off. This means I should respond directly. Yes, no more thinking. Responding now.
model (response):
...
Thinking mode with auto:
system: ... <thinking_mode>auto</thinking_mode>
user:
What is 1+1?
model (CoT):
I see the thinking mode is set to auto. So basically I just need to decide how much thinking this needs. Hmmm... Let me see.
The user asks: “What is 1+1?” This is trivial: 2. Anything else to consider? No. Just the number is fine. Respond now.
model (response):
2.
or
system: ... <thinking_mode>auto</thinking_mode>
user:
...
model (CoT):
... Oh my god, this is hard. Given my setting, I guess I should take more time to think this through.
The user asks ...
model (response):
...
And we could also use RL to let the model learn such behaviors:
For the general “react to the setting” behavior, we could set up a verifier that checks whether the model actually acknowledges its current setting, which is feasible because the model will usually use similar phrasing or wording when reacting to the setting.
For whether the model correctly follows the setting (e.g., stops thinking when it’s off, keeps thinking when it’s on), we could reward or penalize the model based on its subsequent behavior: if the model still insists on thinking further when the setting is off, a penalty is applied.
For auto thinking, I think we could use a pre-labelled dataset with “easy/hard” labels, then reward outputs that handle the question appropriately under the auto setting (a rough sketch of such a verifier follows this list).
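As a rough illustration, a rule-based verifier for the mode-compliance part could look something like the sketch below. All the thresholds, field names, and the easy/hard labels are my own assumptions; a real setup would combine this with task-specific rewards.

```python
from typing import Optional

def mode_compliance_reward(mode: str, thinking_tokens: int,
                           difficulty: Optional[str] = None) -> float:
    """Toy reward for 'did the model react to its thinking_mode setting?'.

    mode: the <thinking_mode> value in the system prompt ("on" / "off" / "auto").
    thinking_tokens: how many tokens the model spent inside its CoT block.
    difficulty: pre-labelled "easy" / "hard" tag, only used in auto mode.
    All thresholds here are made up for illustration.
    """
    if mode == "off":
        # A brief acknowledgement of the setting is fine; long thinking is not.
        return 1.0 if thinking_tokens <= 32 else -1.0
    if mode == "on":
        # Reward actually spending effort when the user asked for it.
        return 1.0 if thinking_tokens >= 128 else -1.0
    if mode == "auto":
        # Short thinking on easy prompts, longer thinking on hard ones.
        if difficulty == "easy":
            return 1.0 if thinking_tokens <= 64 else -0.5
        if difficulty == "hard":
            return 1.0 if thinking_tokens >= 256 else -0.5
    return 0.0
```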
This may not actually work, but it may be a path worth exploring. Why? Take a look at OpenAI’s o-series models and the GPT-5-thinking model. Their reasoning effort is controlled by an internal parameter called “juice” (GPT-5 even has a parameter called “oververbosity” that controls the verbosity of the final response). Also, Anthropic uses something like <max_thinking_length> to tell the model how long it should think in total. So via curated data and RL training, the model may also gain the ability to think more adaptively and efficiently.