Parameter Update: 2026-14

"cyber-warfare-over-refusal" edition

The last couple of weeks have been unusually busy, so it's good to see some return to normalcy now. That being said, I do wonder what's going on at Anthropic, and we're expecting a bigger OpenAI model launch very soon.

Anthropic

Claude Opus 4.7

Two weeks after announcing the "machine god" that is Claude Mythos (if you trust the hype), Anthropic announced a surprisingly boring model update - a minor version bump to Opus. Looking at the benchmarks, it seems like a small-but-notable improvement.

On internal benchmarks, it matches Opus 4.6 at much better token efficiency:

Stolen from Anthropic's official announcement

The real story appears to be a bit more nuanced, though. At least on my timeline, people have consistently complained about the over-refusals and inconsistent results:

These complaints appear to stem from Anthropic rolling out "safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses" to prepare for an eventual Mythos launch. And even the better token efficiency might not be all that useful, given the new tokenizer seems to use more tokens to encode the same text as before:

and the model will use more tokens on higher reasoning settings. This fumble of a launch follows rumors of Anthropic launching their own vibe coding platform, their own Design application, and consistent complaints about the new Claude Desktop app. I've personally experienced some of these issues and it really seems like the Anthropic team is both distracted and compute-constrained right now - while still training the best model out there. Add the ongoing DOD drama, and it seems like a very interesting time for the company.

OpenAI

Codex Improvements

The Codex app got a number of updates this week, turning it into more of a "general purpose productivity tool" than pure developer tooling. The app can now:

  • Interact with any application on your Mac (in the background, without it taking focus!)
  • Generate images using gpt-image-1.5 and automatically include them in projects
  • Run automations in existing threads

They also added 90+ new 'Plugins' (these appear to just be Skills?) for existing apps like Microsoft Teams, Google Calendar, Attio, Binance, and others.

GPT-Rosalind

While Anthropic positioned Mythos' cyberattack capabilities as mostly 'emergent' and a side effect of overall scaling, OpenAI announced this week that they intentionally tuned a new model for a higher-risk domain - biological reasoning and drug discovery. Like Anthropic, OpenAI is limiting access to selected partners for now. Going one step further, they didn't even provide a full system card or benchmarks (except for one barely useful graph) - boo!

Google

Gemini Robotics ER 1.6

While Anthropic and OpenAI are advancing cyber-attack capabilities and biological warfare respectively, DeepMind is focussed on embodied AI in the form of visual and spatial reasoning with their Gemini Robotics ER 1.6 model. Most of the performance improvements seem like small-but-significant upgrades over the Gemini 3.0 Flash base model, so the most interesting things here are (1) the fact they highlight specific functionality like instrument reading, (2) their ongoing collaboration with Boston Dynamics (even after they sold them in 2017), and (3) the fact this model exists at all (you wouldn't bother fine-tuning a model for robotic reasoning tasks unless you had some reason for it to exist, and Google keeps doing it).