WebGPU inference engine for BitNet b1.58 — and notes on I2_S format documentation #412

@m96-chan

WebGPU Implementation

We built 0xBitNet — a TypeScript library that runs BitNet b1.58 2B-4T entirely on WebGPU, with custom WGSL compute shaders for ternary matmul, RMSNorm, RoPE, and ReLU².

  • Live demo: browser chat (no server, no WASM — pure WebGPU)
  • npm: npm install 0xbitnet
  • Platforms: Chrome/Edge 113+, Deno, Node.js (via Dawn bindings)

The model GGUF is fetched directly from Hugging Face, parsed in-browser, and uploaded to GPU buffers. Works offline after the first download (cached in IndexedDB).
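
For anyone reproducing the fetch-then-cache flow outside the library, here is a minimal sketch of the pattern. It is not 0xBitNet's actual loader; the URL, database name, and store name are illustrative, and only the standard fetch and IndexedDB APIs are used.

// Illustrative URL; any GGUF of the model works here.
const GGUF_URL =
  "https://huggingface.co/microsoft/BitNet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf";

function openCache(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("bitnet-cache", 1);
    req.onupgradeneeded = () => { req.result.createObjectStore("models"); };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function loadGguf(): Promise<ArrayBuffer> {
  const db = await openCache();
  // Serve from IndexedDB if the file was already downloaded once.
  const cached = await new Promise<ArrayBuffer | undefined>((resolve, reject) => {
    const get = db.transaction("models").objectStore("models").get(GGUF_URL);
    get.onsuccess = () => resolve(get.result as ArrayBuffer | undefined);
    get.onerror = () => reject(get.error);
  });
  if (cached) return cached;

  // First visit: download from Hugging Face, then persist for offline use.
  const buf = await (await fetch(GGUF_URL)).arrayBuffer();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("models", "readwrite");
    tx.objectStore("models").put(buf, GGUF_URL);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
  return buf;
}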

I2_S Format — Documentation Gap

While implementing the GGUF loader, we found the I2_S quantization format the hardest part to get right because it is essentially undocumented. We're sharing our findings here in case they help other runtime implementers.

What we had to reverse-engineer

The I2_S format uses 128-element block interleaving, not sequential packing. This is not documented anywhere outside the source code of the llama.cpp fork.

The actual layout (from dequantize_row_i2_s() and quantize_i2_s()):

Each 32-byte block stores 128 ternary elements in 4 groups of 32.

Byte[gp] within a block encodes elements at positions [gp, 32+gp, 64+gp, 96+gp]:
  bits[7:6] = group 0 (offset 0)
  bits[5:4] = group 1 (offset 32)
  bits[3:2] = group 2 (offset 64)
  bits[1:0] = group 3 (offset 96)

Ternary encoding: 0b00 = 0, 0b01 = +1, 0b10 = -1

Total byte size per tensor: ceil(num_elements / 4) + 32
  (trailing 32 bytes = per-tensor float32 scale, replicated 8×)
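
Given that layout, locating the per-tensor scale is a simple offset computation. A minimal sketch (ours, not the fork's code), assuming the tensor's raw bytes are already in a Uint8Array:

function readI2sScale(raw: Uint8Array, numElements: number): number {
  // Packed ternary data: 4 elements per byte.
  const dataBytes = Math.ceil(numElements / 4);
  // The trailing 32 bytes hold one float32 scale replicated 8 times;
  // reading the first copy is enough. GGUF tensor data is little-endian.
  const view = new DataView(raw.buffer, raw.byteOffset + dataBytes, 4);
  return view.getFloat32(0, true);
}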

To extract the element at logical index k (integer division throughout):

block = k / 128
pos   = k % 128
group = pos / 32
gp    = pos % 32
byte_offset = block * 32 + gp
shift = 6 - 2 * group
value = (byte >> shift) & 0x03
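
The same index math in TypeScript, applying the scale from the sketch above. This is a per-element illustration that follows the recipe exactly, not an optimized kernel:

function i2sElement(raw: Uint8Array, scale: number, k: number): number {
  const block = Math.floor(k / 128);   // 128 elements per 32-byte block
  const pos = k % 128;
  const group = Math.floor(pos / 32);  // which 2-bit field inside the byte
  const gp = pos % 32;                 // byte index inside the block
  const byte = raw[block * 32 + gp];
  const shift = 6 - 2 * group;
  const code = (byte >> shift) & 0x03;
  const ternary = code === 1 ? 1 : code === 2 ? -1 : 0; // 0b01 = +1, 0b10 = -1
  return ternary * scale;
}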

Other non-obvious details we discovered

  • I2_S is type 36 in the Eddie-Wang1120/llama.cpp fork (not type 27, which is I64 in upstream ggml)
  • The per-tensor scale in the trailing 32 bytes is a single float32 replicated 8 times
  • token_embd.weight is F16 (type 1), not I2_S — embeddings are not ternary
  • No output.weight tensor — tied embeddings (lm_head reuses token_embd)
  • GGUF metadata uses architecture prefix bitnet-25, not bitnet or llama
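
Taken together, these quirks translate into a handful of special cases in the loader. An illustrative sketch of the checks involved (the constant and helper names are ours, not from the fork or from 0xBitNet):

const I2S_TYPE_ID = 36; // id used by the Eddie-Wang1120/llama.cpp fork (27 is I64 upstream)
const F16_TYPE_ID = 1;  // token_embd.weight ships as F16, not I2_S

interface TensorInfo { name: string; typeId: number; data: Uint8Array; }

// Architecture-scoped metadata keys use the "bitnet-25" prefix,
// e.g. "bitnet-25.embedding_length" rather than "bitnet.embedding_length".
function metaKey(field: string): string {
  return `bitnet-25.${field}`;
}

// There is no output.weight tensor: the lm_head reuses token_embd.weight.
function lmHeadTensor(tensors: Map<string, TensorInfo>): TensorInfo {
  const tied = tensors.get("output.weight") ?? tensors.get("token_embd.weight");
  if (!tied) throw new Error("neither output.weight nor token_embd.weight found");
  return tied;
}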

Suggestion

Would it be possible to add documentation for the I2_S quant format to this repo? Even a brief section in the README or a docs/ file describing the block-interleaved layout would save significant time for anyone building alternative inference runtimes (WebGPU, Vulkan compute, Metal, etc.).

Happy to contribute a documentation PR if that would be helpful.
