On June 10, 2026, Google released DiffusionGemma, an experimental "open model" that generates text in a way ordinary AI does not. A typical chatbot writes one word-piece (token) at a time, left to right; this model instead spreads out a blank 256-cell "canvas" and refines the whole thing at once to finish the text. It produces writing much the way a picture is drawn, Google says.

The point isn't simply that "Gemma got faster." It is an experiment that trades quality for speed — and even that speed doesn't appear just anywhere: it shows up on a local machine, and only when few requests arrive at once. Google spells out the condition itself.

What the model is

DiffusionGemma is an "open-weights" model built by Google DeepMind. That means the files that make up the model's brain (its weights) are released in full, so anyone can download it and run it on their own computer. It is released under the permissive Apache 2.0 license, which allows commercial use.

It is based on Google's Gemma 4 family. In total it is made of about 26 billion (26B) parameters, the numbers that make up its weights, but only about 3.8B of them switch on for any single calculation. This design, which turns on only the parts needed at a given moment, is called a mixture of experts (MoE), and it lets a bulky model run relatively light. Onto that, Google added a new component that produces text the "diffusion" way.
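To make the "only part of it switches on" idea concrete, here is a minimal Python sketch. The two parameter counts come from the figures above; the router, the eight experts, and their scores are made up for illustration and say nothing about how DiffusionGemma's actual routing works.

```python
TOTAL_PARAMS_B = 26.0   # total parameters, in billions (figure quoted in the article)
ACTIVE_PARAMS_B = 3.8   # parameters active per calculation (figure quoted in the article)


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for any single calculation."""
    return active_b / total_b


def route_top_k(router_scores: list[float], k: int) -> list[int]:
    """Hypothetical router: return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"active share: {share:.1%}")
    # Illustrative scores for 8 made-up experts; only 2 of them run for this token.
    print(route_top_k([0.1, 0.7, 0.05, 0.3, 0.9, 0.2, 0.0, 0.4], k=2))
```

By these figures roughly one parameter in seven is active at a time, which is where the "runs relatively light" claim comes from.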

One thing to be clear about: this is not a new consumer product like the ChatGPT or Gemini apps. What was released is the model files plus developer paths to run them (download from Hugging Face or Kaggle and serve with "model-running programs" like vLLM or Transformers). No one hosts it for users yet, but within days of release dozens of reworked versions had gone up — as of June 29, 11 fine-tunes, 29 quantizations, and 17 demo Spaces.

It takes text and images as input and answers in text. Google's own materials are inconsistent on this point, though: the prose lists video input as well, while the spec table on the same page lists only Text and Image.

It can't yet be downloaded and run straight through Ollama — being a new diffusion architecture, it needs a dedicated build of its own (as of June 29).

One block at a time, not one word

A typical chatbot AI writes like a typewriter. It puts down one word-piece (token), looks at it, then writes the next, moving one slot at a time, left to right. Deciding each next piece from the pieces already written is called autoregressive generation. In a large service (the cloud) this is efficient, because thousands of requests are batched together; but run alone on a personal computer, the GPU (graphics card) sits idle most of the time, waiting for the "next token."
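The typewriter loop can be sketched in a few lines of Python. The lookup table below is a made-up stand-in for a model's forward pass; only the shape of the loop matters, namely one sequential model call per token.

```python
def next_token(prefix: list[str]) -> str:
    """Stand-in for one model call: choose the next token from the text so far."""
    table = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "<e>"}
    return table[prefix[-1]]


def generate_autoregressive(max_tokens: int) -> tuple[list[str], int]:
    """Generate left to right; return the tokens and the number of model calls."""
    tokens = ["<s>"]
    steps = 0
    while len(tokens) - 1 < max_tokens:
        tok = next_token(tokens)   # each call must wait for the previous one
        steps += 1
        if tok == "<e>":           # end-of-text marker
            break
        tokens.append(tok)
    return tokens[1:], steps


if __name__ == "__main__":
    print(generate_autoregressive(max_tokens=10))
```

Because every call depends on the previous result, a single user's request cannot be spread across the GPU; the calls simply queue up one after another.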

DiffusionGemma turns this into a printing press. Instead of striking one character at a time, it stamps out a whole block (here, 256 cells) at once. The sequence runs like this: it first reads the prompt once and holds it in memory (this memory is called the KV cache), then fills a blank 256-cell canvas with random "noise." It sweeps over the whole canvas several times, fixing cells starting where it is most confident and blurring the rest back out to correct them on the next pass. It applies to text the same principle by which an image generator coaxes a clear picture out of fuzzy noise.

The key is that every cell on the canvas sees both directions at once (bidirectional). An ordinary model sees only the left side it has already written; this one writes while taking in the whole text, so it can fix its own mistakes along the way. A one-direction model drags an early slip all the way to the end, whereas here the whole text gradually converges on its finished form, by Google's account.
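One common way to picture "sees only the left" versus "sees both directions" is as a visibility mask over positions. The sketch below is an illustration of that idea in plain Python, not a description of DiffusionGemma's internals.

```python
def causal_mask(n: int) -> list[list[int]]:
    """mask[i][j] == 1 if position i may look at position j (itself and the left only)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[int]]:
    """Every position may look at every other position in the block."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    print()
    for row in bidirectional_mask(4):
        print(row)
```

In the causal mask the first position sees nothing to its right, so it can never react to what comes later; in the bidirectional mask every row is full.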

One layer of hype has to come off, though. "256 cells at once" doesn't mean it finishes in a single shot. By the model card, each sweep fixes 15–20 cells, and finishing one block takes up to 48 rounds (denoising steps); once a block is done it is locked in and the next one begins. In other words: block by block in sequence, all at once within a block.
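The pass count is easy to check with a toy loop. The sketch below uses the block size and step cap quoted above and assumes, as a simplification, that every pass fixes the same number of cells; the confidence scores are random stand-ins for the model, and no actual text is produced.

```python
import math
import random

BLOCK_SIZE = 256   # cells per canvas (figure quoted in the article)
MAX_STEPS = 48     # upper bound on denoising steps per block (figure quoted in the article)


def denoise_block(fix_per_step: int, seed: int = 0) -> int:
    """Toy loop: each pass locks in the most confident open cells. Returns passes used."""
    rng = random.Random(seed)
    fixed = [False] * BLOCK_SIZE
    steps = 0
    while not all(fixed) and steps < MAX_STEPS:
        open_cells = [i for i in range(BLOCK_SIZE) if not fixed[i]]
        # Stand-in for the model: a made-up confidence score per still-open cell.
        confidence = {i: rng.random() for i in open_cells}
        best = sorted(open_cells, key=confidence.get, reverse=True)[:fix_per_step]
        for i in best:
            fixed[i] = True   # locked in; the rest stay noisy for the next pass
        steps += 1
    return steps


if __name__ == "__main__":
    for k in (15, 20):
        print(k, "cells per pass ->", denoise_block(k), "passes",
              "(ceil:", math.ceil(BLOCK_SIZE / k), ")")
```

If every pass really fixed 15 to 20 cells, a block would finish in 13 to 18 passes, well inside the 48-step cap, compared with 256 sequential steps for a token-at-a-time loop over the same cells.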

The conditions on "fast"

First, the numbers. Google reports up to 4x faster on a dedicated GPU: more than 1,000 tokens per second on a single NVIDIA H100, and more than 700 on the gaming-grade GeForce RTX 5090. ("Tokens per second" is the unit for how fast text is produced; the model card also cites 1,100+ under certain conditions.) All of these, though, are Google's own figures and have not been independently verified by a third party.

The interesting part is the conditions. This speed is meant for a local machine, when few requests arrive at once. It is a different story in a large service where requests pour into one server all at once (high QPS, meaning a great many queries per second). There, a conventional (autoregressive) model can already keep the GPU fully packed, so DiffusionGemma's block-at-once approach yields less benefit and can cost more to run, Google says. The benefit is therefore largest when a single GPU handles a low or medium number of requests at a time. Google also adds a caveat: unified-memory designs, where the processor and memory share a single package, such as Apple Silicon Macs, may not see the same speed gain.

The hardware bar, on the other hand, is fairly low. It is 26B in total but uses only 3.8B for the actual computation, so once the model is shrunk down (quantized) it fits even on a high-end graphics card with 18GB of VRAM, Google says (VRAM is a graphics card's dedicated memory).
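A back-of-the-envelope estimate shows why shrinking the model matters. The sketch below counts memory for the weights alone; the bit widths are assumptions chosen for illustration, since the source does not say which quantization level corresponds to the 18GB figure, and real usage adds the KV cache and working memory on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 26B total parameters (from the article); bit widths are illustrative assumptions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: about {weight_memory_gb(26, bits):.0f} GB")
```

Note that the estimate uses the full 26B rather than the 3.8B active parameters: in a mixture-of-experts model all experts typically have to be held in memory, even though only a few run at a time.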

The price of speed — quality

Google doesn't hide the trade. It states outright that DiffusionGemma's "overall output quality is lower than standard Gemma 4," and recommends standard Gemma 4 wherever top quality is needed.

The evidence is in the same material. Across a range of test scores (benchmarks), DiffusionGemma lands below the same-size, ordinary Gemma 4 26B on nearly all of them (again Google's own figures, not independently verified).

| Benchmark | DiffusionGemma 26B A4B | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 (no tools) | 69.1% | 88.3% |
| LiveCodeBench v6 | 69.1% | 77.1% |
| GPQA Diamond | 73.2% | 82.3% |
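To see how large the gaps are, the scores above can be subtracted directly. The short Python sketch below uses only the figures in the table (Google's own, not independently verified) and adds nothing beyond the arithmetic.

```python
# (DiffusionGemma 26B A4B, Gemma 4 26B A4B) scores in percent, copied from the table.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026 (no tools)": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
}


def gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Percentage-point lead of standard Gemma 4 over DiffusionGemma per benchmark."""
    return {name: round(standard - diffusion, 1)
            for name, (diffusion, standard) in scores.items()}


if __name__ == "__main__":
    for name, gap in sorted(gaps(SCORES).items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {gap} points behind")
```

On these four benchmarks the gap ranges from 5.0 points (MMLU Pro) to 19.2 points (AIME 2026).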

Its strengths lie elsewhere. Because every cell takes in the others as a block is refined together, it does better on work that can't be solved strictly in order — writing code to fill a gap in the middle of a function, editing just one picked-out part of text already written, or satisfying interlocking constraints at once (Google's examples also include amino acid sequences and mathematical graphs). Google in fact points to one group (Unsloth) that separately trained the model to solve Sudoku — a puzzle where later squares govern earlier ones, exactly the kind of thing an ordinary one-direction model is weak at.

So what changed

Reducing it to "Gemma got faster" misses the point. What this model touches is where the speed bottleneck sits. If the bet so far has been "make the model bigger and smarter," DiffusionGemma shifts the weight toward how text is produced and how fully the GPU is used. It gives up some quality in return, and the trade-off splits cleanly: where it fits (interactive work on a local machine that needs a fast response) and where it doesn't (large cloud services, jobs that demand top quality).

Sources