Conversation
Would we re-use the dictionary across sync lines? Is the dictionary dynamic and sent by the service?
In short, yes. My initial idea is that compression would be applied to the data rather than to the protocol. This allows keeping the data compressed as-is from bucket storage -> sync stream (websocket) -> ps_oplog, and potentially even in the data tables (that would be opt-in, since it would slow down queries). However, if we compress individual rows on their own, the compression ratio for small rows isn't great (around 0.6 of the original size in my tests). If we instead pre-train a dictionary on the same data, the compression ratio can be as good as 0.2 or 0.1, a 5-10x reduction in size. This does have some implications:
So overall the project becomes quite complex, but I don't think compression is worth it without using dictionaries. This PR is just one small piece of evaluating the feasibility of the project: Can we efficiently do decompression on the client?
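As a rough illustration of the ratio comparison above, the sketch below trains a dictionary with the `zstd` crate and compares per-row compression with and without it. The sample rows, compression level, and dictionary size are made-up placeholders, not the data or settings from the tests mentioned above:

```rust
// Sketch: per-row zstd compression with and without a pre-trained dictionary.
// All inputs here are placeholders for whatever row data would really be synced.
use zstd::bulk::Compressor;
use zstd::dict::from_samples;

fn main() -> std::io::Result<()> {
    // Placeholder "rows": small JSON-ish payloads, one per synced row.
    let rows: Vec<Vec<u8>> = (0..100_000)
        .map(|i| format!(r#"{{"id":{i},"name":"user {i}","active":true}}"#).into_bytes())
        .collect();

    // Train a dictionary on the same data (16 KiB is an arbitrary size for this sketch).
    let dictionary = from_samples(&rows, 16 * 1024)?;

    let mut plain = Compressor::new(3)?;
    let mut with_dict = Compressor::with_dictionary(3, &dictionary)?;

    let (mut raw, mut no_dict, mut dict) = (0usize, 0usize, 0usize);
    for row in &rows {
        raw += row.len();
        no_dict += plain.compress(row)?.len();
        dict += with_dict.compress(row)?.len();
    }

    println!("no dictionary ratio:   {:.2}", no_dict as f64 / raw as f64);
    println!("with dictionary ratio: {:.2}", dict as f64 / raw as f64);
    Ok(())
}
```

The exact ratios depend heavily on how similar the rows are, which is why the dictionary would need to be trained on representative data and shipped alongside it.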
This investigates the possibility of decompressing zstd data in the core extension, which could allow us to use zstd-compressed data in the protocol. This POC only looks at the zstd decompression itself and tests its performance; it does not actively use it anywhere.
Usage:
On my machine, this takes around 500ms to decompress 80MB of data over 100k rows. This will likely be more efficient if we parse the dictionary up-front.
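For illustration, a minimal sketch (again using the `zstd` crate, with placeholder argument names and an arbitrary per-row output capacity) of what parsing the dictionary up-front could look like: the dictionary is digested once and a single decompression context is reused for every row, instead of re-parsing the dictionary per call:

```rust
// Sketch: reuse a pre-parsed dictionary across many rows.
// `dictionary` and `compressed_rows` stand in for data received from the service.
use zstd::bulk::Decompressor;
use zstd::dict::DecoderDictionary;

fn decompress_all(
    dictionary: &[u8],
    compressed_rows: &[Vec<u8>],
) -> std::io::Result<Vec<Vec<u8>>> {
    // Parse/digest the dictionary once...
    let prepared = DecoderDictionary::copy(dictionary);
    // ...and keep one decompression context alive for the whole batch.
    let mut decompressor = Decompressor::with_prepared_dictionary(&prepared)?;

    compressed_rows
        .iter()
        // 1 MiB is an arbitrary per-row output capacity for this sketch.
        .map(|row| decompressor.decompress(row, 1024 * 1024))
        .collect()
}
```
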
This increases the Linux release build size from around 400 KB to 487 KB. Adding compression support would add another 260 KB or so, and I don't think we have a good use case for that on the client.
To actually use this with compressed data, we'd additionally need to: