The last six months in LLMs in five minutes

Simon Willison used a five-minute lightning talk at PyCon US 2026 to compress six months of LLM history into something digestible, and his annotated slide deck tells the story well: November 2025 was a turning point, coding agents quietly crossed a quality threshold that actually matters, and the open-weight models running on laptop hardware have become surprisingly hard to dismiss.

The period Willison covers begins with what he calls the November 2025 inflection point. The title of "best model" was held by five different models that month alone, trading between Anthropic, OpenAI, and Google in a sequence that started with Claude Sonnet 4.5, jumped to GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, before Anthropic reclaimed it with Claude Opus 4.5. Willison tracks this with his signature benchmark: asking each model to generate an SVG of a pelican riding a bicycle. The logic is sound. Pelicans are hard to draw, bicycles are hard to draw, the combination is physically impossible, and no AI lab would ever specifically train for it. Gemini 3 drew the best pelican of the November batch, he notes, though Opus 4.5 held the broader practitioner consensus for the next couple of months.

The pelican derby is fun, but the more consequential development from November was quieter and took longer to register. OpenAI and Anthropic had spent most of 2025 applying Reinforcement Learning from Verifiable Rewards to their models, specifically to improve code quality when paired with their respective agent frameworks. By November, the results were visible. Coding agents moved, in Willison's framing, from "often-work" to "mostly-work." That sounds like modest progress, but crossing the threshold where you can trust an agent as a daily driver without spending the majority of your time cleaning up its errors is a genuine shift in what the tools are actually useful for.
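To make "verifiable rewards" concrete: the idea is that the training signal comes from a check a program can run, such as whether generated code passes its tests, rather than from a human rating. The following is only a toy sketch of that idea, not a description of how OpenAI or Anthropic actually train their models. The function name `verifiable_reward`, the convention that a candidate defines `solve`, and the all-or-nothing scoring are assumptions made for illustration.

```python
def verifiable_reward(candidate_source, tests, func_name="solve"):
    """Return 1.0 if the candidate code passes every test case, else 0.0.

    Hypothetical helper for illustration: `tests` is a list of
    (args_tuple, expected_result) pairs, and the candidate is expected
    to define a function called `func_name`.
    """
    namespace = {}
    try:
        # Run the candidate's source to define its functions.
        # A real system would sandbox this step; a bare exec is unsafe.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0
    return 1.0


tests = [((1, 2), 3), ((0, 0), 0)]
good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a - b\n"

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

The point of the sketch is that the reward needs no human judgment: code either passes the checks or it does not, which is what makes the signal cheap to compute at scale.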

Also in November, someone named Pete made an initial commit to an obscure repository called Warelay. Willison includes this as a data point worth flagging, because Warelay went through several name changes over December and January before arriving at its final identity: OpenClaw, a personal AI assistant that, by February, had become a phenomenon well beyond what a three-month-old project typically achieves. "Claw" took hold as a category term alongside it, driven by spinoff projects like NanoClaw and ZeroClaw. Mac Minis started selling out around Silicon Valley as people bought them specifically to run their Claws locally. Drew Breunig, as Willison recounts, joked that they had become the new digital pets, the Mac Mini functioning as the perfect aquarium. Willison's own preferred metaphor is Alfred Molina's Doc Ock from Spider-Man 2: AI-powered appendages that work fine until something damages the inhibitor chip.

The December and January holiday period brought its own pattern. A lot of people, Willison included, used the break to push the new coding agents to their limits and got somewhat carried away. His own holiday project was a vibe-coded JavaScript interpreter written in Python, a loose port of MicroQuickJS he named micro-javascript. The demo runs JavaScript in Python, inside Pyodide, inside WebAssembly, inside JavaScript, inside a browser. Technically layered and genuinely impressive. Also, by his own admission, something nobody actually needed. He has since quietly retired that one and several others from the same period.

February brought Gemini 3.1 Pro, which produced what Willison considers the best pelican-on-a-bicycle result he has seen, complete with a fish in the basket. Google's Jeff Dean amplified the moment by posting an animated version that included a frog on a penny-farthing, a giraffe driving a tiny car, an ostrich on roller skates, a turtle kickflipping a skateboard, and a dachshund driving a stretch limousine. Whether the AI labs had been watching Willison's benchmark or simply got very good at drawing improbable animals on vehicles, the output quality was hard to argue with.

April brought more developments worth noting. Google released the Gemma 4 series, which Willison describes as the most capable open-weight models he has seen from a US company. Chinese lab Z.ai released GLM-5.1, an open-weight model at 754 billion parameters and 1.51 terabytes in size: effective, but requiring hardware most people cannot casually afford. GLM-5.1 produced a competent pelican, though it struggled to animate it cleanly. It also, when prompted by a Bluesky commenter, generated an illustrated and animated SVG of a North Virginia Opossum on an e-scooter, complete with the tagline "Cruising the commonwealth since dusk." Willison reports that other models cannot come close to that result.

The other notable April contribution came from Qwen. The Qwen3.6-35B-A3B model, a 20.9GB file that runs on a laptop, drew a better pelican-on-a-bicycle than Claude Opus 4.7. Willison acknowledges this probably says as much about the limits of the benchmark as it does about the model, but the underlying point stands: locally runnable open-weight models have started producing results that were implausible a year ago.
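The file sizes quoted for these two models make more sense with a little arithmetic. The sketch below divides each file size by its parameter count to get storage per parameter. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and reads the Qwen total of 35 billion parameters off the "35B" in the model name; neither assumption is confirmed by the talk. The helper name `bits_per_parameter` is made up for this example.

```python
def bits_per_parameter(file_bytes, parameters):
    """Average number of bits used to store each parameter."""
    return file_bytes * 8 / parameters


# GLM-5.1: 1.51 TB for 754 billion parameters (figures from the text above).
glm_bits = bits_per_parameter(1.51e12, 754e9)

# Qwen3.6-35B-A3B: 20.9 GB; 35 billion parameters is assumed from the name.
qwen_bits = bits_per_parameter(20.9e9, 35e9)

print(f"GLM-5.1: {glm_bits:.1f} bits per parameter")          # 16.0
print(f"Qwen3.6-35B-A3B: {qwen_bits:.1f} bits per parameter")  # 4.8
```

Under those assumptions, the GLM-5.1 figure works out to about 16 bits per parameter, which would be consistent with 16-bit weights. The Qwen file comes in under 5 bits per parameter, which suggests a quantized build, although the talk does not say so. That gap in storage per parameter, on top of the gap in parameter count, is a large part of why one model fits on a laptop and the other does not.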

The two themes Willison pulls out of the six months are straightforward. Coding agents got genuinely good enough to use. And laptop-scale open-weight models, while still well behind the frontier, have started wildly outperforming expectations. Both of those developments have practical weight, and neither looks like a temporary blip.