Parameter Update: 2025-08

"robots!" edition

xAI: Grok 3

After predicting a load of announcements last week, it turned out we got exactly one of them - Grok. And in one word, this one is weird.

The Good:

The Bad:

  • "Real world" coding performance (or at least the ball in cube demo everyone loves doing) seems to be... not great. That has also been my personal experience.
  • I would've appreciated some more clarity in their product communication - what is the difference between "Thinking" and "Big Brain Mode" except that the latter takes longer and sounds extremely cringe?
  • While also very funny, it's slightly troubling that it seems you can get around most safety tuning by negging Grok?

The Ugly

As it stands, Grok 3 is fascinating. It's extremely impressive and, frankly, better than I was expecting. The UX feels polished and I've gotten some deeply impressive results from the model. At the same time, there are some rather obvious shortcomings (where's the API? Do they expect anyone to use any of this when they continue to hardcode propaganda into the system prompt?) and, similarly to Gemini 2.0 Pro, anyone looking for a step-function increase in capabilities will be disappointed yet again.

Figure: Helix

Somewhat surprisingly, the past week has been packed with robotics-related announcements and demos. Most of these are only somewhat AI-related, but they are all 😎very cool😎, so I'm keeping them in.

The headliner

Last week, Figure announced Helix, their next-gen "vision-language-action model". While I would have loved more technical details (and the "scaling law" graph from the blog post feels ridiculous), this is our clearest look yet at why they very publicly broke up with OpenAI just a few weeks ago. On the plus side, we got an impressive demo video - and a nice diagram explaining how they are splitting world understanding between two differently sized models, similar to the concept of System 1 and System 2 thinking in humans.

The System 1 model is just 80M parameters in size and runs at 200Hz, while the System 2 model is a more sizable 7B pretrained VLM running at a (still very impressive) 7-9Hz. The special sauce, then, lies in the integration of the two, allowing latent representations from the smarter but slower model to influence the outputs of the fast model. The fact that they can scale this across two robots in the same model also feels like a pretty major step.
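To make the two-speed idea a bit more concrete, here is a toy sketch of a dual-rate loop: a fast policy ticks at 200Hz and always uses the most recent latent produced by a slower model. Everything in it is an assumption for illustration only. The names `slow_model`, `fast_policy` and `Latent` are hypothetical, the 8Hz rate is just a round number inside the quoted 7-9Hz range, and the "models" are stubs. It says nothing about how Figure actually wires Helix together.

```python
from dataclasses import dataclass

FAST_HZ = 200  # System 1 rate quoted in the post
SLOW_HZ = 8    # assumption: a round value inside the quoted 7-9Hz range


@dataclass
class Latent:
    step: int  # fast-loop tick at which this latent was produced


def slow_model(tick: int) -> Latent:
    """Hypothetical stand-in for the 7B VLM: emits a latent tagged with its tick."""
    return Latent(step=tick)


def fast_policy(tick: int, latent: Latent) -> tuple:
    """Hypothetical stand-in for the 80M policy: returns (tick, latent age in ticks)."""
    return tick, tick - latent.step


def run(seconds: float = 1.0):
    ticks_per_slow = FAST_HZ // SLOW_HZ  # fast ticks between slow updates
    latent = slow_model(0)               # initial latent before the loop starts
    slow_updates = 1
    actions = []
    for tick in range(int(seconds * FAST_HZ)):
        # The slow model only refreshes its latent every `ticks_per_slow` ticks...
        if tick > 0 and tick % ticks_per_slow == 0:
            latent = slow_model(tick)
            slow_updates += 1
        # ...while the fast policy acts on every tick using the latest latent.
        actions.append(fast_policy(tick, latent))
    return actions, slow_updates


if __name__ == "__main__":
    actions, slow_updates = run(1.0)
    print(len(actions), "fast actions,", slow_updates, "slow updates")
```

The point of the sketch is only the rate mismatch: the fast loop never waits for the slow one, it just acts on a latent that may be a few ticks stale.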

The rest

While less technically interesting, here are some of the other robot announcements that caught my eye last week:

The dancing Unitree G1 video was so awesome, I was actually convinced it was fully AI until the CEO followed it up with another one. Now I only believe it's staged extremely well.

Clone's "Protoclone" looks terrifying enough by itself, but for some reason they insist on filming it in a barely lit environment with intense apocalyptic music playing in the background? Who runs their Twitter?

Finally, 1X announced Neo Gamma in a video that feels like an intentional antithesis to Clone while feeling extremely hollow and even less realistic than what Unitree is showing?

Stepfun

Just as we were starting to recover from DeepSeek, it seems that China's AI progress is not slowing down any time soon. This week, we've seen some cool drops from a lab I had personally never heard of before: Stepfun AI. Founded in April 2023, these guys have been quite busy shipping for a while now, though this is the first time they've garnered notable attention outside China. There are two models of theirs to talk about:

Step-Audio-Chat

The first of the two is Step-Audio-Chat, a 130B Audio-to-Audio model (which is a bit of an underserved niche in the open source space). It handily beats the competition in most benchmarks, though my attempts to get it to perform some German music failed pretty spectacularly.

Step-Video-T2V

The second, in my opinion more interesting, model is Step-Video-T2V - a competitor to closed-source video models like Sora or the just-launched Veo 2 (which seems mostly noteworthy for its impressively large price tag). Looking at the demo below, I am starting to realize that benchmarking video models seems to be even harder than it is for text models - who is working on this?

Politics

Thinking Machines

Since Mira Murati left OpenAI in September last year, I've been waiting to see what she's cooking. Well, now we know... something? Thinking Machines will be an "artificial intelligence research and product company" that has managed to acquire some pretty intense talent. Excited to see how quickly they can start to execute!

Satya Nadella

Dwarkesh had Satya Nadella on his podcast this week. The entire thing is worth a listen, but it notably included some pretty significant vibe shifts. After pushing for more and more capex last year, Nadella is now a lot more careful in his wording and promises. I suspect at least some of that is related to OpenAI shifting their compute away from Microsoft?

Humane

Not sure where else to put this, but I didn't want to leave it out. Humane is dead. On the one hand, this sucks for the dozens of customers that were still using their AI pin. On the other hand, it has led to some pretty great memes.

Personally, I am genuinely surprised that they didn't fail a lot harder after their initial launch. Sure, $116 million is pretty far from their peak valuation of $850 million, but with employees keeping their jobs for now, this actually seems like a pretty okay outcome?

Phind

While I'd heard of (and even used) Phind before, I would've completely missed them soft-relaunching their entire product if my colleague Florian hadn't told me about it (thanks!). After reading through their very detailed blog post and trying the new search myself, this feels in many ways more polished than the equivalent products offered by major players like Perplexity. I can't believe they're building this with a total of four people. Hats off!

Research

DeepSeek: Native Sparse Attention

DeepSeek seems to have missed the note about not comparing your attention alternatives to the real deal (because you'll lose dramatically) and has developed something... better? Not just more computationally efficient, but actually getting better results? Also check out the paper, it's really cool.

Google: Co-Scientist

Google claims to have developed an AI system assisting scientific discovery. I didn't look too much into it, as it seemed pretty fluffy, but maybe a cool read?

Microsoft: OmniParser v2 & Muse

In my bookmarks this week I had two Microsoft announcements. The first one is OmniParser v2, a screen parser that allows regular LLMs to interact with UIs.

The other one is Muse - an AI that generates "minutes of AI 'gameplay'". Honestly, this seems technically really cool, but I still don't see the endgame here?