WebGPU Implementation
We built 0xBitNet — a TypeScript library that runs BitNet b1.58 2B-4T entirely on WebGPU, with custom WGSL compute shaders for ternary matmul, RMSNorm, RoPE, and ReLU².
- Live demo: browser chat (no server, no WASM — pure WebGPU)
- npm: `npm install 0xbitnet`
- Platforms: Chrome/Edge 113+, Deno, Node.js (via Dawn bindings)
The model GGUF is fetched directly from Hugging Face, parsed in-browser, and uploaded to GPU buffers. Works offline after the first download (cached in IndexedDB).
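For illustration, here is a minimal sketch of that fetch-once-then-serve-from-cache flow. This is not 0xBitNet's actual code; the function and store names are hypothetical.

```ts
const DB_NAME = "gguf-cache";
const STORE = "files";

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => req.result.createObjectStore(STORE);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Hypothetical helper: download the GGUF once, then serve it from IndexedDB.
async function fetchGgufCached(url: string): Promise<ArrayBuffer> {
  const db = await openDb();
  const cached = await new Promise<ArrayBuffer | undefined>((resolve, reject) => {
    const req = db.transaction(STORE).objectStore(STORE).get(url);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
  if (cached) return cached;

  const buf = await (await fetch(url)).arrayBuffer();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction(STORE, "readwrite");
    tx.objectStore(STORE).put(buf, url); // keyed by URL
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
  return buf;
}
```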
I2_S Format — Documentation Gap
While implementing the GGUF loader, the I2_S quantization format was the hardest part to get right because it is essentially undocumented. We're sharing our findings here in case they help other runtime implementers.
What we had to reverse-engineer
The I2_S format uses 128-element block interleaving, not sequential packing. This is not documented anywhere outside the source code of the llama.cpp fork.
The actual layout (from dequantize_row_i2_s() and quantize_i2_s()):
Each 32-byte block stores 128 ternary elements in 4 groups of 32.
Byte[gp] within a block encodes elements at positions [gp, 32+gp, 64+gp, 96+gp]:
```
bits[7:6] = group 0 (offset  0)
bits[5:4] = group 1 (offset 32)
bits[3:2] = group 2 (offset 64)
bits[1:0] = group 3 (offset 96)
```
Ternary encoding: 0b00 = 0, 0b01 = +1, 0b10 = -1
Total byte size per tensor: ceil(num_elements / 4) + 32
(trailing 32 bytes = per-tensor float32 scale, replicated 8×)
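As a sanity check on the formula, a hypothetical tensor with 1,048,576 elements occupies 1,048,576 / 4 + 32 = 262,176 bytes: 262,144 packed data bytes followed by the 32-byte scale tail.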
To extract the element at logical index k:

```
block       = k / 128
pos         = k % 128
group       = pos / 32
gp          = pos % 32
byte_offset = block * 32 + gp
shift       = 6 - 2 * group
value       = (data[byte_offset] >> shift) & 0x03
```
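Putting the layout and the scale tail together, here is a TypeScript sketch of a whole-tensor dequantizer built from the rules above. The names are ours, not from the fork, and it assumes the scale is a little-endian float32 as elsewhere in GGUF.

```ts
// Sketch: dequantize an I2_S tensor to Float32Array.
// `raw` is the tensor's bytes from the GGUF: ceil(n / 4) data bytes + 32 scale bytes.
// Assumes n is a multiple of 128, so the interleave never straddles the data end.
function dequantizeI2S(raw: Uint8Array, n: number): Float32Array {
  const dataBytes = Math.ceil(n / 4);
  // Trailing 32 bytes hold one float32 scale replicated 8 times; read the first copy.
  const scale = new DataView(raw.buffer, raw.byteOffset + dataBytes, 4).getFloat32(0, true);

  const out = new Float32Array(n);
  for (let k = 0; k < n; k++) {
    const block = Math.floor(k / 128);  // 32-byte block of 128 elements
    const pos = k % 128;
    const group = Math.floor(pos / 32); // which 2-bit field within the byte
    const gp = pos % 32;                // byte index within the block
    const code = (raw[block * 32 + gp] >> (6 - 2 * group)) & 0x03;
    const t = code === 0b10 ? -1 : code; // 0b00 = 0, 0b01 = +1, 0b10 = -1
    out[k] = t * scale;
  }
  return out;
}
```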
Other non-obvious details we discovered
- I2_S is type 36 in the Eddie-Wang1120/llama.cpp fork (not type 27, which is I64 in upstream ggml)
- The per-tensor scale in the trailing 32 bytes is a single float32 replicated 8 times
- `token_embd.weight` is F16 (type 1), not I2_S; embeddings are not ternary
- No `output.weight` tensor; tied embeddings (`lm_head` reuses `token_embd`)
- GGUF metadata uses the architecture prefix `bitnet-25`, not `bitnet` or `llama`
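A couple of these matter directly in the loader. Here is an illustrative TypeScript sketch of how they can be accounted for; apart from the literal tensor names and metadata keys, all identifiers are hypothetical, and we assume the usual GGUF convention of prefixing per-architecture keys with `general.architecture`.

```ts
// Assumed shape of a parsed GGUF metadata map (illustrative).
type GgufMetadata = Record<string, unknown>;

const GGML_TYPE_F16 = 1;   // type of token_embd.weight
const GGML_TYPE_I2_S = 36; // fork-specific; 27 is I64 in upstream ggml

function readLayerCount(metadata: GgufMetadata): number {
  // Per-architecture keys are prefixed with the architecture name, "bitnet-25" here.
  const arch = metadata["general.architecture"] as string;
  return metadata[`${arch}.block_count`] as number;
}

// Embeddings stay F16 and double as the output head (tied weights),
// so only type-36 tensors go through the ternary path.
function isTernary(tensorName: string, typeId: number): boolean {
  return typeId === GGML_TYPE_I2_S && tensorName !== "token_embd.weight";
}
```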
Suggestion
Would it be possible to add documentation for the I2_S quant format to this repo? Even a brief section in the README or a docs/ file describing the block-interleaved layout would save significant time for anyone building alternative inference runtimes (WebGPU, Vulkan compute, Metal, etc.).
Happy to contribute a documentation PR if that would be helpful.