Parameter Update: 2025-4

"everybody gets o1" edition

Welcome to what I am tentatively calling Parameter Update: a place for me to dump all the possibly interesting things I encounter during the week and summarize them for myself and my work friends.

DeepSeek R1

If anyone was still under the illusion that limiting access to GPUs would be an effective measure to prevent China from building an o1-level model after Berkeley made one for $450, they should now be fully disabused of that notion.

While I remain sceptical of claims that R1 was a side hustle (have you seen the list of authors on the paper?), the roughly 95% cost decrease in both training and inference compared to o1 - made possible by genuine RL innovation, not a Chinese psyop (lol) - is an incredible achievement that has unlocked a veritable explosion of cool new stuff. The fact that they open-sourced all of it is the cherry on top: this is the new baseline now.

Some of the coolest things I've seen this week:

OpenAI

Stargate Project

Ironically announced just one day before DeepSeek came out swinging with R1, the project rivals the Manhattan Project as a percentage of US GDP. It must be noted that the funding is mostly non-governmental at this point, though Trump did pose for a photo-op. Interestingly enough, it seems this might have kicked off some drama, with Musk and Altman feuding in the replies:

Quickly followed by Nadella clearing things up from his end:

Operator

In other OpenAI news, we finally got our proverbial hands on OpenAI's first real agent product (I'm not counting their Swarm framework from late last year or Tasks from last week).

Operator is an AI model based on GPT-4o (actually a new model, not just a new UI!) that can independently operate a sandboxed web browser to complete various actions on your behalf. It crushes other VLM-based approaches, but still falls short of humans at the majority of tasks. It's also potentially vulnerable to some new forms of jailbreaks.

Apart from some of the compute considerations (20-minute limit on tasks, ...), what I find particularly interesting is the large rift between the way OpenAI is packaging this (a preview, limited to the Pro subscription) vs. how Anthropic announced Computer Use a few months ago (API changes plus a ready-made Docker image to use them).

Honestly, I might be lacking the necessary creativity to see why most people would be willing to shell out $200 for what is being offered right now. At the same time, I am fascinated by what is essentially the first move OpenAI has made away from having their models interface with ready-made APIs, and by what these capabilities signal about future use cases enabled by a combination of o3, memory, Search, RAG, ... - and I'm looking forward to experimenting when it inevitably comes to the Plus tier later on.

o3-mini

In what seems like a direct response to R1, Altman has announced that o3-mini will be available to free users, with Plus subscribers getting 100 queries per day. Pro subscribers, on the other hand, will be getting (broader? sole?) access to o3 when it launches later this year.
While the latter matches up with my expectations given the enormous compute requirements of full o3 (OpenAI spent half the DeepSeek v3 training cost on a single benchmark run), I am surprised they are willing to expend the compute it'll take to roll out a test-time-compute (TTC) model to free-tier users - it really seems like scaling SSP is no longer a priority for them.

Google: Gemini 2.0 Flash Thinking Experimental 01-21

Despite their model names somehow being the worst of the bunch, DeepMind has been on a roll lately when it comes to actually shipping stuff. While R1 has stolen most of this launch's thunder, using the model in AI Studio has been a really good experience this week, as it is (to my knowledge) the first TTC model directly usable with Code Execution (if you don't count o1 now supporting Canvas). A one-million-token context limit doesn't hurt either - nice!
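
For reference, here's roughly what enabling the Code Execution tool looks like over the REST API. This is a minimal sketch that only assembles the request body: the model name is the one published in AI Studio, but the exact v1beta payload shape and the `code_execution` tool key are my assumptions from Google's published examples, and actually sending it would need a real API key.

```python
# Sketch: a generateContent request against the Gemini v1beta REST API with
# the built-in code-execution tool enabled. Payload construction only; the
# "code_execution" key and body shape are assumptions from Google's examples.
import json

MODEL = "gemini-2.0-flash-thinking-exp-01-21"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent"
)

def thinking_request(prompt: str) -> dict:
    """Assemble the JSON body; the tool entry lets the model write and run Python."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "tools": [{"code_execution": {}}],
    }

body = thinking_request("What is the 50th Fibonacci number? Compute it.")
print(json.dumps(body, indent=2))
# An actual call would POST `body` to URL with the API key in the
# x-goog-api-key header (or a ?key= query parameter).
```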

Perplexity

Sonar APIs

After neglecting their APIs for the longest time to focus on more important matters like buying TikTok, it seems that Aravind Srinivas has finally woken up to the reality of Google (with its Search Grounding) and other competitors catching up to their product. In response, Perplexity has rolled out two new models in their API:

  • Sonar, promoted as being "great for everyday questions"
  • Sonar Pro, billed as "state-of-the-art factuality"
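
Both models sit behind Perplexity's OpenAI-compatible chat-completions endpoint. Here's a minimal sketch of a request - payload construction only, since actually sending it needs a real API key in the Authorization header; the model identifiers `sonar` and `sonar-pro` match the announcement:

```python
# Sketch: Perplexity's Sonar models are served through an OpenAI-compatible
# chat-completions endpoint. This only builds the request body; a real call
# would POST it to PPLX_URL with an "Authorization: Bearer <key>" header.
import json

PPLX_URL = "https://api.perplexity.ai/chat/completions"

def sonar_payload(question: str, pro: bool = False) -> dict:
    """Assemble a chat-completions body for Sonar or Sonar Pro."""
    return {
        "model": "sonar-pro" if pro else "sonar",
        "messages": [
            {"role": "system", "content": "Be precise and cite sources."},
            {"role": "user", "content": question},
        ],
    }

print(json.dumps(sonar_payload("Who founded DeepSeek?", pro=True), indent=2))
```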

I remain sceptical of Perplexity due to their rapid, short-notice deprecation of old models and their IMO outrageous $9 billion valuation (lol), but this seems like a very solid step towards building better search products. The majority of my slightly trickier prompts still failed to produce usable results, though.

Assistant

As a complete surprise (to me, anyway) came the launch of Perplexity Assistant - a drop-in replacement for Google's Gemini assistant on Android.

I'm not sure I get the product reasoning behind this - Perplexity is a much smaller competitor, now fragmenting their product to compete on Google's home turf. While I love Perplexity's design language, it is unsurprising that reaching feature parity with the many smaller integrations Google has built may prove difficult. On the other hand, it seems (from the outside in) like Google's assistant offering is at a bit of a pivotal moment as well, so now may be the best time for them to gain a foothold in this market. And with Apple prioritizing Controls APIs as part of their Apple Intelligence push (and being forced to allow switching default apps by the EU DMA), even a cross-platform product may be feasible.