luma.gl

WebGPU Compute Demo

GPT-2 124M next-token lab

Load the llm.c checkpoint, inspect tensor activity, and sample continuations with luma.gl compute shaders driving the dense projections and language-model head.

Prompt

Max new tokens

Context window

Temperature 0.10

Loading GPT-2 124M model and tokenizer from Hugging Face...

Tensor atlas

Canvas hover: move over the atlas after a run to inspect the tensor span under the mouse. Scroll to zoom, drag to pan, double-click resets.

The prompt is tokenized with byte-level GPT-2 BPE, converted to token plus absolute position embeddings, and run through the selected sliding context window. Each transformer block applies layer norm, causal self-attention, a residual add, another layer norm, an MLP with GELU, and a final residual add.
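The input stage described above can be sketched in a few lines. This is a minimal CPU illustration, not the demo's own code: the function names `selectContext` and `embed` are hypothetical, the embedding tables are assumed to be row-major `Float32Array`s, and positions are assumed to restart at 0 inside the selected window.

```typescript
// Hypothetical helper names; the demo's real code may differ.

/** Keep only the most recent `contextWindow` token ids (the sliding window). */
function selectContext(tokenIds: number[], contextWindow: number): number[] {
  return tokenIds.slice(Math.max(0, tokenIds.length - contextWindow));
}

/**
 * Sum token and absolute position embeddings for each context position.
 * wte: [vocabSize, channels] row-major, wpe: [maxPositions, channels] row-major.
 */
function embed(
  context: number[],
  wte: Float32Array,
  wpe: Float32Array,
  channels: number
): Float32Array {
  const out = new Float32Array(context.length * channels);
  for (let t = 0; t < context.length; t++) {
    const tokenRow = context[t] * channels;
    const positionRow = t * channels; // assumption: positions restart at 0 in the window
    for (let c = 0; c < channels; c++) {
      out[t * channels + c] = wte[tokenRow + c] + wpe[positionRow + c];
    }
  }
  return out;
}
```

The result is a `[contextLength, channels]` buffer, which is the shape each transformer block then reads and writes.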

WebGPU compute shaders run the tensor-heavy work: layer norms, QKV projections, attention, projection matrices, GELU, residual adds, and the final language-model head. The CPU still handles UI, tokenization, sampling, and reading back final logits.
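For checking a GPU readback against known-good values, a CPU reference of two of the kernels named above can be handy. The sketch below is an assumption about what those kernels compute, not a copy of the demo's shaders: layer norm over one row with an epsilon of 1e-5, and the tanh approximation of GELU that GPT-2 uses. The function names are illustrative.

```typescript
/** Layer norm over a single row: normalize, then scale by gamma and shift by beta. */
function layerNorm(
  x: Float32Array,
  gamma: Float32Array,
  beta: Float32Array,
  eps = 1e-5 // assumed epsilon
): Float32Array {
  const n = x.length;
  let mean = 0;
  for (let i = 0; i < n; i++) mean += x[i];
  mean /= n;

  let variance = 0;
  for (let i = 0; i < n; i++) {
    const d = x[i] - mean;
    variance += d * d;
  }
  variance /= n;

  const invStd = 1 / Math.sqrt(variance + eps);
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    out[i] = (x[i] - mean) * invStd * gamma[i] + beta[i];
  }
  return out;
}

/** GELU, tanh approximation. */
function gelu(x: number): number {
  const k = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(k * (x + 0.044715 * x * x * x)));
}
```

Comparing a few rows from a readback against these functions, within a small tolerance, is usually enough to catch indexing or stride mistakes in a kernel.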

The final logits are unnormalized token scores. Temperature 0 picks the token with the highest logit; temperatures above 0 divide the logits by the temperature, apply softmax to turn them into probabilities, and sample from that distribution. For each candidate token, the logits table shows the raw logit, the temperature-scaled delta from the best token, the probability, and the decoded token text. During generation, click any generated-token chip to inspect the logits that produced that token; the demo keeps the logits history for the current run.
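The sampling step can be illustrated with a short CPU sketch. It assumes the usual softmax-with-temperature formulation, where each logit's distance from the best logit is divided by the temperature before exponentiation; the function names and the injectable `random` parameter are hypothetical and only there to make the example testable.

```typescript
/** Index of the largest value; ties go to the first occurrence. */
function argmax(values: ArrayLike<number>): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

/** Probabilities from logits at a temperature above 0. */
function softmaxWithTemperature(
  logits: ArrayLike<number>,
  temperature: number
): Float64Array {
  const bestLogit = logits[argmax(logits)];
  const probs = new Float64Array(logits.length);
  let sum = 0;
  for (let i = 0; i < logits.length; i++) {
    // (logit - best) / temperature is the temperature-scaled delta; it is <= 0.
    probs[i] = Math.exp((logits[i] - bestLogit) / temperature);
    sum += probs[i];
  }
  for (let i = 0; i < probs.length; i++) probs[i] /= sum;
  return probs;
}

/** Greedy at temperature 0, otherwise sample from the softmax distribution. */
function sampleToken(
  logits: ArrayLike<number>,
  temperature: number,
  random: () => number = Math.random
): number {
  if (temperature <= 0) return argmax(logits);
  const probs = softmaxWithTemperature(logits, temperature);
  let r = random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r < 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

Subtracting the best logit before exponentiating keeps the exponent at or below zero, which avoids overflow at low temperatures such as 0.10.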

The canvas is a dense tensor atlas. It uses 640 by 360 tiny cells, so the latest forward pass can show token embeddings, position embeddings, sampled matrix weights, activations, and logits across all layers. Read it left to right, top to bottom; new tensor spans are appended in the same order the model runs.
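One way to picture the layout is as a flat array of 640 × 360 = 230,400 cells filled in row-major order, with each tensor claiming the next free run of cells. The sketch below is an illustration under that assumption; `AtlasLayout`, `Span`, and `cellToXY` are hypothetical names, and truncating a span that would overflow the atlas is a choice made for this example only.

```typescript
const ATLAS_WIDTH = 640;
const ATLAS_HEIGHT = 360;

interface Span {
  name: string;
  start: number; // first cell index, row-major
  length: number; // number of cells
}

/** Convert a row-major cell index to atlas coordinates. */
function cellToXY(cell: number): { x: number; y: number } {
  return { x: cell % ATLAS_WIDTH, y: Math.floor(cell / ATLAS_WIDTH) };
}

class AtlasLayout {
  readonly spans: Span[] = [];
  private next = 0;

  /** Append a span in execution order; truncate if the atlas is full. */
  append(name: string, length: number): Span {
    const capacity = ATLAS_WIDTH * ATLAS_HEIGHT;
    const clamped = Math.max(0, Math.min(length, capacity - this.next));
    const span: Span = { name, start: this.next, length: clamped };
    this.spans.push(span);
    this.next += clamped;
    return span;
  }

  /** Find the span that owns the cell at (x, y), if any. */
  spanAt(x: number, y: number): Span | undefined {
    const cell = y * ATLAS_WIDTH + x;
    return this.spans.find(
      (s) => cell >= s.start && cell < s.start + s.length
    );
  }
}
```

Because spans are appended in the order the model runs, a linear scan over `spans` also reads as a trace of the forward pass.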

Each tiny cell is one sampled tensor value: yellow is positive, cyan is negative, dark gray is zero, and red is NaN or infinity. Small tensors are shown value-for-value; large matrices are sampled from the whole matrix so every layer and major multiply gets canvas space.
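The color rule and the sampling of large tensors can be sketched as follows. This is an illustration only: the exact RGB values, the `scale` normalization, and the evenly strided sampling are assumptions, and `cellColor` and `sampleTensor` are hypothetical names.

```typescript
type RGB = [number, number, number];

/** Map one tensor value to a cell color. */
function cellColor(value: number, scale = 1): RGB {
  if (!Number.isFinite(value)) return [255, 0, 0]; // NaN or +/-Infinity: red
  if (value === 0) return [40, 40, 40]; // zero: dark gray
  const intensity = Math.min(1, Math.abs(value) / scale);
  const level = Math.round(255 * intensity);
  // Positive values are yellow (R+G), negative values are cyan (G+B).
  return value > 0 ? [level, level, 0] : [0, level, level];
}

/**
 * Fit a tensor into `cellCount` cells. Small tensors are copied
 * value-for-value; large ones are sampled at an even stride so the
 * samples cover the whole tensor.
 */
function sampleTensor(tensor: Float32Array, cellCount: number): Float32Array {
  if (tensor.length <= cellCount) return tensor.slice();
  const out = new Float32Array(cellCount);
  for (let i = 0; i < cellCount; i++) {
    out[i] = tensor[Math.floor((i * tensor.length) / cellCount)];
  }
  return out;
}
```

With strided sampling, the approximate source index for cell `i` of a span is `Math.floor(i * tensor.length / cellCount)`, which is the kind of value a hover readout can report.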

The first spans are the selected context token embeddings, position embeddings, and their sum. Then each GPT-2 block adds sampled parameter tensors for layer norm, QKV, attention projection, MLP, and output projection, followed by the activations produced by those WebGPU kernels. The final spans show the last layer norm and the language-model-head logits over the vocabulary.

Repeated horizontal textures usually mean a matrix is being sampled across rows and columns. Large yellow or cyan bands show strongly signed values; mixed cyan/yellow noise means weights or activations are balanced around zero. Red is the important failure signal: it means a tensor readback saw NaN or infinity.

Move the mouse over the canvas to see the tensor span, atlas cell, approximate sampled source index, sign, and intensity under the pointer. Context-token spans and the final logits span also show the actual token id and decoded token text under the mouse. Scroll on the canvas to zoom around the cursor, drag to pan the zoomed atlas, and double-click the canvas to reset to the full atlas. The canvas visualization and Debug Trace have separate controls. Enable Debug Trace to see the exact tensor names, stats, and pixel ranges that correspond to the atlas spans; keep it off for faster generation.
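The zoom, pan, and hover behavior can be modeled with a small view transform. The sketch below assumes the mapping `canvasPixel = atlasCell * scale + offset`; the `View` type, the function names, and the minimum scale of 1 are hypothetical and may not match the demo.

```typescript
interface View {
  scale: number; // canvas pixels per atlas cell
  offsetX: number; // canvas-pixel position of atlas cell (0, 0)
  offsetY: number;
}

/** Full-atlas view, as restored by a double-click. */
const RESET_VIEW: View = { scale: 1, offsetX: 0, offsetY: 0 };

/** Zoom by `factor` while keeping the atlas point under the cursor fixed. */
function zoomAt(view: View, cursorX: number, cursorY: number, factor: number): View {
  const scale = Math.max(1, view.scale * factor); // assumed minimum zoom
  const ratio = scale / view.scale;
  return {
    scale,
    offsetX: cursorX - (cursorX - view.offsetX) * ratio,
    offsetY: cursorY - (cursorY - view.offsetY) * ratio,
  };
}

/** Pan by a drag delta measured in canvas pixels. */
function panBy(view: View, dx: number, dy: number): View {
  return { ...view, offsetX: view.offsetX + dx, offsetY: view.offsetY + dy };
}

/** Atlas cell under a canvas-pixel pointer position. */
function pointerToCell(view: View, px: number, py: number): { x: number; y: number } {
  return {
    x: Math.floor((px - view.offsetX) / view.scale),
    y: Math.floor((py - view.offsetY) / view.scale),
  };
}
```

The cell returned by `pointerToCell` is what a span lookup would take as input to produce the hover readout.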

Canvas tensor visualization Debug Trace

This demo downloads GPT-2 124M model and tokenizer artifacts from Andrej Karpathy's llmc-starter-pack, which packages files for the MIT-licensed llm.c workflow. GPT-2 was originally released by OpenAI in the gpt-2 repository.