Parameter Update: 2026-09

"gpt-5.4.3.2.1" edition

GPT-5.4 is very strong (and one of the worst pieces of branding we've seen from OpenAI in recent memory). Apart from that, a pretty slow week.

OpenAI

GPT-5.4-Thinking

While GPT-5.3-Codex was a fair opponent for Opus 4.6 in coding, OpenAI has actually been a fair bit behind in more general use cases for a while now. This changed this week, with GPT-5.4, which matches or exceeds Opus in basically every benchmark. The focus, based on what OpenAI is choosing to talk about, appears to be on two things:

  • Computer Use / Knowledge Work: While Anthropic has been pushing this angle for a while now, this is the first time since CUA that OpenAI is pitching a model explicitly for its computer use abilities, explicitly labeling it as "designed for professional work".
  • Making the model pleasant to talk to: Altman has openly admitted that GPT-5.2 was not great to talk to, and in my experience it sometimes displayed shockingly low emotional range. According to my Twitter timeline, this has been fixed with GPT-5.4, though in my personal experience it is still quite far behind Opus 4.6 in that regard.

Interestingly, it also appears to be ever-so-slightly better than GPT-5.3-Codex at code, making it the generally best model OpenAI has to offer right now, for all use cases.

GPT-5.3 Instant

Just before launching GPT-5.4-Thinking, OpenAI launched another model, this time limited to non-reasoning usage on ChatGPT. I have no idea why they didn't just also roll out GPT-5.4 'non-thinking', but that aside, it still appears to be a nice upgrade? I personally don't really use non-reasoning models these days, and I haven't seen any benchmarks yet, so the nicest thing I have to say is that it might be a significant upgrade for the majority of people who aren't paying for ChatGPT?

Alibaba Qwen

Qwen 3.5 Small

Alibaba has released a set of small models, "based on the same foundation" as the original "large" Qwen 3.5 release, which was competitive with GPT-5.2 and Opus 4.5. The benchmarks are extremely impressive - the 9B model comes close to GPT-OSS-120B. It is important to note that we've seen small models putting up impressive numbers before, and that's not always indicative of real-world utility, but packing that much intelligence into a tiny model is still commendable. I certainly didn't have a 2B model approaching GPT-4 in benchmarks on my bingo card this early.

Team falling apart?

Just after the impressive launch, Lin Junyang, head of the Qwen team, officially resigned. This would be interesting enough by itself - senior leadership departures like this are pretty rare - but it's also the third notable departure from the Qwen team this year, and one contributor noted in a since-deleted post: "I know leaving wasn’t your choice". Whether or not that is true, I sincerely hope the Qwen team can continue working on new models, given their stellar track record lately.

DeepMind: Gemini 3.1 Flash-Lite

In one of the rare pieces of fun branding from Google, DeepMind has announced Gemini 3.1 Flash-Lite (🔦). It's a more capable model than "full" Gemini 2.5 Flash in most use cases, while (unfortunately) also being much closer to it in price than the Gemini 2.5 Flash-Lite model it's meant to be replacing.

Personally, I still don't see many use cases for these types of models - if a task is valuable enough to use a proprietary model, it's probably also valuable enough to use a strong one? Large-scale tasks are viable with weaker, cheaper models (given some tuning), but why would you expend the time and resources on model-specific tuning for a model that might get discontinued at any point?