Parameter Update: 2025-29

"HRM đź‘‘" edition


Not quite GPT-5 yet, but still some interesting stuff in here this week!

Runway Aleph

Massively better prompt control for AI video. Feels a bit like a gpt-image-1 moment for video generation (if it works as advertised). It will be interesting to see the next movers in the space (Runway tends to be very expensive).

Anthropic

Claude Code limits

Ever since Cursor introduced their pretty draconian usage limits a few weeks ago, the writing has been on the wall for Claude Code with its extremely generous rate limits (at least in the $200/month tier).

Well, this week we got the first taste of what that may look like in practice:

While they try to dress it up nicely, specifying that under 5% of users will be impacted, I would be remiss not to note that that is a lot, actually. Opus can get expensive very fast, and I (despite writing less code these days than I used to!) still managed to rack up the equivalent of a few thousand USD in API costs in the last month after switching over from Cursor once they made their software unusable. It truly feels like the ZIRP-equivalent for token usage may be coming to an end faster than I anticipated.

Persona Vectors

In the next instalment of "alignment research leading to more interesting findings than I anticipated": Anthropic managed to figure out how to programmatically steer model personality through vectors inside the model's activations. Besides finding an "evil" vector (unsettlingly enough), they also managed to identify a "hallucination" vector that, when turned up, massively increased the model's hallucination rate. I'm also very happy about them looking into the "sycophancy" vector to stop the wave of AI psychosis currently hitting my timeline. Hard to believe this all follows from good old Golden Gate Claude!
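For intuition, here is a minimal sketch of the underlying technique: adding a direction vector to a model's hidden activations during the forward pass. Everything concrete below (the model, the layer index, the random placeholder vector, the coefficient) is my own assumption for illustration; Anthropic extracts the actual persona vectors with their own method, which this does not reproduce.

```python
# Minimal activation-steering sketch (all concrete values are hypothetical).
# A fixed "persona" direction is added to one layer's hidden states via a
# forward hook, scaled by a steering coefficient.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any HF causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6  # which block's residual stream to steer (arbitrary choice)
coeff = 4.0    # steering strength; negative values push the opposite way
persona_vec = torch.randn(model.config.hidden_size)  # placeholder direction
persona_vec = persona_vec / persona_vec.norm()

def steer(module, inputs, output):
    # output[0] holds the block's hidden states, shape (batch, seq, hidden)
    hidden = output[0] + coeff * persona_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tok("The assistant is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0]))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```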

Hierarchical Reasoning Model

After my colleague Marc pointed out to me that this one managed to make it out of my Twitter timeline and into more mainstream Instagram Reels, I took a bit of a closer look at Sapient Intelligence's new Hierarchical Reasoning Model paper, as I figured it might make a nice example of how I go about looking into news like this.

At first glance, I can certainly see why it would make the rounds. In their results section, they claim to outperform reasoning models like o3-mini or R1 using a massively smaller 27M (!) parameter model. Does it pass the sniff test? Let's take a look together:

What's the exact claim?
While a large portion of Twitter reported the model as "solving" reasoning at a much smaller model size through "neuroscience-adjacent" modelling, the paper itself is more nuanced, only claiming competitive results on three benchmark tasks (including ARC-AGI).

When did it happen?
The original paper was published on arXiv over a month ago, but it only gained real prominence over the last week.

Who did it?
Sapient Intelligence, an AI startup I personally hadn't heard of before, with business addresses listed in Singapore and China. Their website looks mostly legit, but from what I was able to find out about their team, they don't appear to have the same resources or talent at scale as the large labs. To clarify: this doesn't mean they're not very smart people - just that this is not a DeepSeek situation (i.e., a huge hedge fund subsidizing its own lab on the side)!

What was released?
This seems like the gold standard. Apart from transparency about their training data, we got both a comprehensive writeup (in the form of the primary paper) and a reference implementation in code. This is good!

Next, let's look into the paper. Here we find some interesting things:

  • HRM is not actually a foundation model. It is not even really adjacent to one. Instead of training on large amounts of text, it is trained only on task-specific samples. In that sense, it is closer to, e.g., a character recognition model than to an autoregressive text model (a toy sketch of this setup follows this list).
  • Looking into the results, we see the slightly worrying statement that input-output example pairs appear in both the training and the evaluation sets. While this has been called out on Twitter, it's not as bad as it may initially seem, as they appear to only train on the 1–2 demonstration examples provided per problem, not the actual answers the model is supposed to generate.
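To make that distinction concrete, here is a toy sketch of the per-task training setup as I understand it (the network, shapes, and hyperparameters below are placeholders of my own, not the HRM architecture): fit a tiny model on a single task's few demonstration pairs for many epochs, then run it on that task's test input.

```python
# Toy sketch of per-task training (placeholder model, not HRM).
# A tiny network is fit on a task's 1-2 demonstration pairs only,
# then run on the held-out test input of the same task.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical task: two demonstration input/output grids, flattened.
demo_x = torch.randn(2, 64)   # the examples shipped with the task
demo_y = torch.randn(2, 64)
test_x = torch.randn(1, 64)   # the input the model must answer

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Many epochs over very few samples - far past any "compute-optimal" regime.
for _ in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(demo_x), demo_y)
    loss.backward()
    opt.step()

prediction = model(test_x)    # the test answer itself is never trained on
```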

Now, the first one here is the interesting "finding" (i.e., the thing people - not the authors! - are misrepresenting). ARC-AGI is not a general reasoning benchmark. Despite the name, solving it does not require AGI. Instead, it tests a specific set of capabilities. Claiming "general reasoning" because you solved it with a specialized model is similar to claiming to have made a smarter image recognition model because you beat 4o on CIFAR-100!

Again, this doesn't undercut the feat of achieving such a high score with such a small model - it's still really impressive! Nor does it mean the architecture doesn't scale - we just don't know yet.

Here is what we do know:

  • Historically, small models used in ensembles with big models have been really useful. The latest iteration of this might be exposing smaller models as tools via MCP (a minimal sketch follows this list).
  • The proposed architecture seems interesting. It appears to be very resilient to overfitting (they use a lot of epochs with very few examples, way past Chinchilla-optimal compute scaling).
  • This is not even the first time something like this has been done - take a look at this mostly similar paper from September 2019!
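On that first point, here is a minimal sketch of what exposing a small specialist model as a tool could look like with the official Python MCP SDK; the classifier and the tool itself are made up for illustration.

```python
# Sketch: serving a small specialist model as an MCP tool so that a
# larger model can call it. Uses the official Python MCP SDK; the
# "model" here is a stand-in one-liner.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("small-specialist")

def tiny_sentiment_model(text: str) -> str:
    # Placeholder for a small task-specific model.
    return "positive" if "good" in text.lower() else "negative"

@mcp.tool()
def classify_sentiment(text: str) -> str:
    """Classify sentiment using a small local specialist model."""
    return tiny_sentiment_model(text)

if __name__ == "__main__":
    # A big model connected to this server can now call the tool on demand.
    mcp.run()
```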

Small hype waves like this seem like an artifact of the scientific method in AI: scientists train small models on new techniques to validate them, then see if they scale to larger models suitable for LLM-style deployment. Often, the hype is welcomed - media attention does make it easier to go into the next funding round - but just as often, it leads to misrepresentations of genuinely interesting findings. In a sense, this is the AI equivalent of the new miracle drug curing a specific type of cancer in mice. It's exciting! But it is extremely unclear if it scales to larger sample sizes, let alone humans. It's also unclear if it can be produced at scale in a way that is remotely cost-efficient. Unfortunately, that type of nuance rarely makes for good headlines.

As I said, this is still a very interesting release that might be useful in many cases. It fits well into the current research trend of going back to "pure" RL and attempts to make good contributions to that field (I am frankly not well informed enough to judge this reliably!) - but it is almost certainly just one of many contributions, and it's most certainly not AGI.

Miscellaneous Twitter Stuff

Prompt Engineering Study

Ethan Mollick posted results from their new prompt engineering study. Turns out that things might have changed more than many people expected since the original ChatGPT release!

While some of the findings go against my own subjective experience (esp. CoT prompting being less effective on non-reasoning models), this does provide some much-needed quantitative analysis of a traditionally very subjective field.
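For reference, the CoT prompting the study tests is the classic trick of appending a reasoning cue to an otherwise identical prompt; a minimal illustration (the wording below is mine, not the study's exact prompts):

```python
# Direct prompt vs. chain-of-thought prompt (illustrative wording only).
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

direct_prompt = question
cot_prompt = question + "\nLet's think step by step."
```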

New Models

While we still didn't get GPT-5 or the promised open source model from OpenAI this week (just more vagueposting from Altman), we did see a slew of new experimental models (e.g., "Horizon Alpha" and "Horizon Beta" on OpenRouter, following the Summit/Zenith models last week), a leak of the OpenAI open source model's architecture and weights, and even a possible GPT-5 API for a few minutes:

Talent War

If you thought that the AI talent wars would slow down a bit after Meta got their Superintelligence Lab off the ground, you would be wrong! Two recent highlights: the first US$1 billion offer - which was turned down! And Meta introducing the new Chief Scientist of their Superintelligence Lab, a role which is (apparently?) notably different from their Chief AI Scientist, Yann LeCun, or their Chief AI Officer, Alexandr Wang.

Lume

In what may be the most unsettling video that has hit my timeline recently, the laundry-folding bedside robot (!) Lume can now be preordered for a $50 deposit, with a final price aimed at under $2,000. I genuinely wish these guys the best - building hardware is hard and all that - but with the CGI video, the incredibly low price, and the Stripe payment link for the deposit, this feels like a throwback to 2010s Kickstarter vaporware - just a lot weirder.