Parameter Update: 2026-09
"gpt-5.4.3.2.1" edition
GPT-5.4 is very strong (and one of the worst pieces of branding we've seen from OpenAI in recent memory). Apart from that, a pretty slow week.
OpenAI
GPT-5.4-Thinking
While GPT-5.3-Codex was a fair match for Opus 4.6 in coding, OpenAI has actually been a good bit behind in more general use cases for a while now. That changed this week with GPT-5.4, which matches or exceeds Opus on basically every benchmark. The focus, based on what OpenAI is choosing to talk about, appears to be on two things:
- Computer Use / Knowledge Work: While Anthropic has been pushing this angle for a while now, this is the first time since CUA that OpenAI is pitching a model explicitly for its computer-use abilities, labeling it as "designed for professional work".
- Making the model pleasant to talk to: Altman has openly admitted that GPT-5.2 was not great to talk to, and in my experience it sometimes displayed shockingly low emotional range. According to my Twitter timeline, this has been fixed with GPT-5.4, though in my personal experience it is still quite far behind Opus 4.6 in that regard.
Interestingly, it also appears to be ever-so-slightly better than GPT-5.3-Codex at coding, making it the best model OpenAI has to offer right now, across all use cases.
GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT.
GPT-5.4 is also now available in the API and Codex.
GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.
— OpenAI (@OpenAI) March 5, 2026
GPT-5.3 Instant
Just before launching GPT-5.4-Thinking, OpenAI launched another model, this time limited to non-reasoning usage in ChatGPT. I have no idea why they didn't just also roll out GPT-5.4 'non-thinking', but that aside, it still appears to be a nice upgrade? I personally don't really use non-reasoning models these days, and I haven't seen any benchmarks yet, so the nicest thing I can say is that it might be a significant upgrade for the majority of people who aren't paying for ChatGPT?
GPT-5.3 Instant in ChatGPT is now rolling out to everyone.
More accurate, less cringe.
— OpenAI (@OpenAI) March 3, 2026
Alibaba Qwen
Qwen 3.5 Small
Alibaba has released a set of small models, "based on the same foundation" as the original "large" Qwen 3.5 release, which was competitive with GPT-5.2 and Opus 4.5. The benchmarks are extremely impressive: the 9B model comes close to GPT-OSS-120B. It is important to note that we've seen small models putting up impressive numbers before, and that's not always indicative of real-world utility, but packing that much intelligence into a tiny model is still commendable. I certainly didn't have a 2B model approaching GPT-4 in benchmarks on my bingo card this early.
🚀 Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B
✨ More intelligence, less compute.
These small models are built on the same Qwen3.5 foundation — native multimodal, improved architecture, scaled RL:
• 0.8B / 2B → tiny, fast,…
— Qwen (@Alibaba_Qwen) March 2, 2026
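To make "tiny" concrete, here's a quick back-of-the-envelope sketch (my own arithmetic, not from the release) of roughly how much memory each of the listed sizes needs just for its weights, assuming dense models and ignoring KV cache and runtime overhead:

```python
# Rough weight-memory footprint: params × bytes-per-parameter.
# Assumes dense models; real usage adds KV cache and framework overhead.

GIB = 1024 ** 3

def weight_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_param / 8 / GIB

for size in (0.8, 2, 4, 9):
    print(f"{size}B params: ~{weight_gib(size, 16):.1f} GiB at fp16, "
          f"~{weight_gib(size, 4):.1f} GiB at 4-bit")
```

Even the 9B model fits comfortably on a consumer GPU once quantized, which is presumably the point of the series.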
Team falling apart?
Just after the impressive launch, Junyang Lin, head of the Qwen team, officially resigned. This would be interesting enough by itself, as senior leadership departures like this are pretty rare, but it's also the third notable departure from the Qwen team this year, and one contributor noted (in a since-deleted post) "I know leaving wasn’t your choice". Whether or not that is true, I sincerely hope the Qwen team can continue working on new models, given their stellar track record lately.
me stepping down. bye my beloved qwen.
— Junyang Lin (@JustinLin610) March 3, 2026
DeepMind: Gemini 3.1 Flash-Lite
In one of the rare pieces of fun branding from Google, DeepMind has announced Gemini 3.1 Flash-Lite (🔦). It's a more capable model than the "full" Gemini 2.5 Flash in most use cases, while (unfortunately) also being much closer to it in price than to the Gemini 2.5 Flash-Lite model it's meant to replace.
Personally, I still don't see many use cases for these types of models: if a task is valuable enough to use a proprietary model, it's probably also valuable enough to use a strong one? Large-scale tasks are viable with weaker, cheaper models (given some tuning), but why would you spend the time and resources on model-specific tuning for a model that might get discontinued at any point?
Introducing Gemini 3.1 Flash-Lite 🔦, a huge step forward on the boundary of intelligence, beating 2.5 Flash on many tasks.
— Logan Kilpatrick (@OfficialLoganK) March 3, 2026