WebGPU inference engine for BitNet b1.58 — and notes on I2_S format documentation #412

@m96-chan

WebGPU Implementation

We built 0xBitNet — a TypeScript library that runs BitNet b1.58 2B-4T entirely on WebGPU, with custom WGSL compute shaders for ternary matmul, RMSNorm, RoPE, and ReLU².

  • Live demo: browser chat (no server, no WASM — pure WebGPU)
  • npm: npm install 0xbitnet
  • Platforms: Chrome/Edge 113+, Deno, Node.js (via Dawn bindings)

The model GGUF is fetched directly from Hugging Face, parsed in-browser, and uploaded to GPU buffers. Works offline after the first download (cached in IndexedDB).
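
For anyone reproducing the fetch-then-cache flow outside the library, here is a minimal sketch of the pattern. It is not 0xBitNet's actual loader; the URL, database name, and store name are illustrative, and only the standard fetch and IndexedDB APIs are used.

// Illustrative URL; any GGUF of the model works here.
const GGUF_URL =
  "https://huggingface.co/microsoft/BitNet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf";

function openCache(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("bitnet-cache", 1);
    req.onupgradeneeded = () => { req.result.createObjectStore("models"); };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function loadGguf(): Promise<ArrayBuffer> {
  const db = await openCache();
  // Serve from IndexedDB if the file was already downloaded once.
  const cached = await new Promise<ArrayBuffer | undefined>((resolve, reject) => {
    const get = db.transaction("models").objectStore("models").get(GGUF_URL);
    get.onsuccess = () => resolve(get.result as ArrayBuffer | undefined);
    get.onerror = () => reject(get.error);
  });
  if (cached) return cached;

  // First visit: download from Hugging Face, then persist for offline use.
  const buf = await (await fetch(GGUF_URL)).arrayBuffer();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("models", "readwrite");
    tx.objectStore("models").put(buf, GGUF_URL);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
  return buf;
}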

I2_S Format — Documentation Gap

While implementing the GGUF loader, we found the I2_S quantization format the hardest part to get right because it is essentially undocumented. We're sharing our findings here in case they help other runtime implementers.

What we had to reverse-engineer

The I2_S format uses 128-element block interleaving, not sequential packing. This is not documented anywhere outside the source code of the llama.cpp fork.

The actual layout (from dequantize_row_i2_s() and quantize_i2_s()):

Each 32-byte block stores 128 ternary elements in 4 groups of 32.

Byte[gp] within a block encodes elements at positions [gp, 32+gp, 64+gp, 96+gp]:
  bits[7:6] = group 0 (offset 0)
  bits[5:4] = group 1 (offset 32)
  bits[3:2] = group 2 (offset 64)
  bits[1:0] = group 3 (offset 96)

Ternary encoding: 0b00 = 0, 0b01 = +1, 0b10 = -1

Total byte size per tensor: ceil(num_elements / 4) + 32
  (trailing 32 bytes = per-tensor float32 scale, replicated 8×)
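
Given that layout, locating the per-tensor scale is a simple offset computation. A minimal sketch (ours, not the fork's code), assuming the tensor's raw bytes are already in a Uint8Array:

function readI2sScale(raw: Uint8Array, numElements: number): number {
  // Packed ternary data: 4 elements per byte.
  const dataBytes = Math.ceil(numElements / 4);
  // The trailing 32 bytes hold one float32 scale replicated 8 times;
  // reading the first copy is enough. GGUF tensor data is little-endian.
  const view = new DataView(raw.buffer, raw.byteOffset + dataBytes, 4);
  return view.getFloat32(0, true);
}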

To extract the element at logical index k (integer division throughout):

block = k / 128
pos   = k % 128
group = pos / 32
gp    = pos % 32
byte_offset = block * 32 + gp
shift = 6 - 2 * group
value = (byte >> shift) & 0x03
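
The same index math in TypeScript, applying the scale from the sketch above. This is a per-element illustration that follows the recipe exactly, not an optimized kernel:

function i2sElement(raw: Uint8Array, scale: number, k: number): number {
  const block = Math.floor(k / 128);   // 128 elements per 32-byte block
  const pos = k % 128;
  const group = Math.floor(pos / 32);  // which 2-bit field inside the byte
  const gp = pos % 32;                 // byte index inside the block
  const byte = raw[block * 32 + gp];
  const shift = 6 - 2 * group;
  const code = (byte >> shift) & 0x03;
  const ternary = code === 1 ? 1 : code === 2 ? -1 : 0; // 0b01 = +1, 0b10 = -1
  return ternary * scale;
}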

Other non-obvious details we discovered

  • I2_S is type 36 in the Eddie-Wang1120/llama.cpp fork (not type 27, which is I64 in upstream ggml)
  • The per-tensor scale in the trailing 32 bytes is a single float32 replicated 8 times
  • token_embd.weight is F16 (type 1), not I2_S — embeddings are not ternary
  • No output.weight tensor — tied embeddings (lm_head reuses token_embd)
  • GGUF metadata uses architecture prefix bitnet-25, not bitnet or llama
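
Taken together, these quirks translate into a handful of special cases in the loader. An illustrative sketch of the checks involved (the constant and helper names are ours, not from the fork or from 0xBitNet):

const I2S_TYPE_ID = 36; // id used by the Eddie-Wang1120/llama.cpp fork (27 is I64 upstream)
const F16_TYPE_ID = 1;  // token_embd.weight ships as F16, not I2_S

interface TensorInfo { name: string; typeId: number; data: Uint8Array; }

// Architecture-scoped metadata keys use the "bitnet-25" prefix,
// e.g. "bitnet-25.embedding_length" rather than "bitnet.embedding_length".
function metaKey(field: string): string {
  return `bitnet-25.${field}`;
}

// There is no output.weight tensor: the lm_head reuses token_embd.weight.
function lmHeadTensor(tensors: Map<string, TensorInfo>): TensorInfo {
  const tied = tensors.get("output.weight") ?? tensors.get("token_embd.weight");
  if (!tied) throw new Error("neither output.weight nor token_embd.weight found");
  return tied;
}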

Suggestion

Would it be possible to add documentation for the I2_S quant format to this repo? Even a brief section in the README or a docs/ file describing the block-interleaved layout would save significant time for anyone building alternative inference runtimes (WebGPU, Vulkan compute, Metal, etc.).

Happy to contribute a documentation PR if that would be helpful.
