I Implemented a Google Research Paper as an Open-Source JS Library
It started with a YouTube video.
I was watching Lucas Montano's channel when he broke down a paper from Google Research — TurboQuant: Online Vector Quantization, accepted at ICLR 2026. He compared it to Pied Piper from Silicon Valley: miracle compression that actually works, except this time it's real and built for the AI era. Google's tweet about the paper hit 19 million views. Micron and SanDisk stocks dropped on the news — if TurboQuant delivers what it promises, the demand for memory hardware in AI infrastructure could shrink significantly.
But what caught my attention most was Lucas showing his own implementation. He'd built TurboQuant into Persua, his AI assistant SaaS, to solve a concrete problem: users upload entire folders of documents to create knowledge bases for their AI assistants, and the vector embeddings from those documents were eating all the client's memory. His solution was to compress the embedding indexes using TurboQuant and store everything in IndexedDB — no server, no cloud, all client-side. After indexing 400+ video scripts, memory consumption stayed flat.
That's when it clicked: if it solved a real problem for his product, it could solve the same problem for plenty of other developers. The existing implementations relied on WASM or native binaries — there was nothing you could just npm install and use in pure TypeScript.
So I thought: why not build it as an open-source library?
I've been a heavy consumer of open source throughout my career — npm install and pip install everywhere — but I'd never been on the other side. I'd never published something other developers could use. This felt like the right place to start.
turboquant-js is the result: a pure TypeScript implementation of the TurboQuant algorithm. Zero dependencies, works in Node.js and browsers, ~15 KB bundled.
```
npm install turboquant-js
```

Or see it in action: Live Demo — Client-Side Semantic Search
First, what are vector embeddings?
If you've worked with full-text search (Elasticsearch, PostgreSQL tsvector), you know how it works: you match keywords. Search for "cheap flights to Paris" and you'll find documents containing those words.
Vector embeddings are the next step. Instead of matching keywords, an AI model converts text (or images, audio, etc.) into an array of numbers — typically 384 or 768 floats. These arrays capture meaning, not just words. Two sentences with completely different words but similar meaning will have similar arrays.
```typescript
// This is what an embedding looks like — a regular array of numbers
const embedding = [0.023, -0.041, 0.089, ..., 0.017]; // 384 floats
```

To search by meaning, you compare these arrays using math (cosine similarity or dot product). The closer two arrays are, the more similar their meaning.
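That comparison is a few lines of arithmetic. Here's a minimal cosine-similarity sketch in plain TypeScript (not part of turboquant-js — just to show what "comparing arrays" means):

```typescript
// Cosine similarity: dot product divided by the product of the vectors'
// lengths. Ranges from -1 to 1; higher means more similar in meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([0.5, 0.1], [1.0, 0.2]); // → 1 (same direction)
cosineSimilarity([1, 0], [0, 1]);         // → 0 (unrelated)
```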
This is what powers features like "similar products", "related articles", semantic search bars, and increasingly RAG (Retrieval-Augmented Generation) — where an LLM looks up relevant context before answering a question.
The problem: embeddings are expensive to store
Each embedding is just a Float64Array, but they add up fast:
| Documents | Dimensions | Raw size |
|---|---|---|
| 1,000 | 384 | 3 MB |
| 100,000 | 384 | 300 MB |
| 1,000,000 | 384 | 3 GB |
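The raw sizes above are simple arithmetic — documents × dimensions × 8 bytes per float64:

```typescript
// Raw storage for uncompressed float64 embeddings
function rawEmbeddingBytes(docs: number, dims: number): number {
  return docs * dims * 8; // 8 bytes per float64
}

rawEmbeddingBytes(1_000, 384);     // 3,072,000 bytes ≈ 3 MB
rawEmbeddingBytes(1_000_000, 384); // ≈ 3 GB
```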
If you're running a server, 3 GB is manageable. But what if you want to run this in the browser? For a privacy-first search feature, an offline app, or a browser extension? 3 GB on the client is a non-starter.
This is exactly the problem Lucas ran into with Persua: hundreds of documents embedded into vectors, all stored client-side, where memory use can't be allowed to grow linearly with the document count. And it's the same problem you'll hit building any client-side AI feature that works with embeddings.
The typical solution is to use a server-side vector database (Pinecone, Qdrant, Weaviate) or a WASM-compiled library like FAISS. But these come with trade-offs:
- Server-side databases mean your data leaves the client — bad for privacy-sensitive use cases
- FAISS-WASM is a ~2 MB binary blob with a complex build process
- Product Quantization (PQ), the standard compression technique, requires a training step — you need a representative dataset to "teach" the compressor what your data looks like before you can use it
What if you could compress those 3 GB down to ~150 MB with no server, no WASM, no training step — just npm install and a few lines of TypeScript?
What TurboQuant does differently
Think of it like image compression. A JPEG compresses a photo by throwing away visual detail your eyes won't notice. TurboQuant does something similar for number arrays: it compresses each float down to 2-4 bits while preserving the mathematical relationships between vectors.
The critical property — and this is what Lucas was excited about in his video — is that when you compare two compressed vectors, the similarity score you get is not systematically wrong. It might have small random noise (like any compression), but it doesn't consistently drift in one direction. The paper proves this mathematically. As Lucas put it, referencing Silicon Valley: "what Richard Hendricks tried to do across six seasons, Google actually did — and it works."
At 3-bit quantization, you get 20.8x compression:
| Format | Size per 1M docs | Compression |
|---|---|---|
| float64 (raw) | 3 GB | 1x |
| 4-bit quantized | 192 MB | 15.7x |
| 3-bit quantized | 144 MB | 20.8x |
| 2-bit quantized | 98 MB | 30.7x |
That's the difference between "impossible in a browser" and "totally feasible".
How it works (without the PhD)
The algorithm combines three techniques — and Lucas actually walked through all three in his video, which helped me understand them before diving into the paper:
Step 1: Random rotation. Before compressing, TurboQuant randomly rotates the vector using an orthogonal rotation. Think of it as scrambling the numbers so that no single coordinate is more important than any other. This simplifies the geometry and makes per-coordinate compression near-optimal.
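turboquant-js implements this rotation as a Randomized Hadamard Transform. Here's a minimal sketch of the idea — random ±1 sign flips followed by a fast Walsh–Hadamard transform — not the library's actual code, and it assumes the dimension is a power of 2:

```typescript
// Random orthogonal rotation: flip each coordinate's sign at random,
// then apply the fast Walsh–Hadamard transform in O(n log n).
// The normalization keeps the vector's length unchanged.
function randomizedHadamard(v: number[], signs: number[]): number[] {
  const n = v.length;
  const out = v.map((x, i) => x * signs[i]); // random ±1 sign flips
  for (let len = 1; len < n; len *= 2) {
    for (let i = 0; i < n; i += 2 * len) {
      for (let j = i; j < i + len; j++) {
        const a = out[j], b = out[j + len];
        out[j] = a + b;
        out[j + len] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  return out.map((x) => x * scale);
}

// A spike [1, 0, 0, 0] gets "smeared" evenly across all coordinates —
// exactly the effect that makes per-coordinate compression near-optimal
randomizedHadamard([1, 0, 0, 0], [1, 1, 1, 1]); // → [0.5, 0.5, 0.5, 0.5]
```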
Step 2: Scalar quantization. Once rotated, each number follows a predictable distribution, so you can round it to a small set of values (the "codebook") that are pre-computed to minimize error. No training needed — the codebook is determined purely by math.
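Mechanically, scalar quantization looks like this. The uniform codebook below is illustrative only — the library computes Lloyd-Max levels optimized for the post-rotation distribution — but the round-to-nearest-level machinery is the same:

```typescript
// 3-bit quantization: 2^3 = 8 codebook levels. Each float is replaced
// by the index of its nearest level; only the 3-bit index is stored.
const codebook = [-0.875, -0.625, -0.375, -0.125, 0.125, 0.375, 0.625, 0.875];

function quantize(v: number[]): number[] {
  return v.map((x) => {
    let best = 0;
    for (let i = 1; i < codebook.length; i++) {
      if (Math.abs(codebook[i] - x) < Math.abs(codebook[best] - x)) best = i;
    }
    return best;
  });
}

function dequantize(codes: number[]): number[] {
  return codes.map((c) => codebook[c]);
}

quantize([0.1, -0.6, 0.9]); // → [4, 1, 7]
```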
Step 3: QJL error correction. Rounding introduces error, and that error would make similarity comparisons slightly biased — like a scale that consistently reads 0.5 kg too high. TurboQuant fixes this with a 1-bit correction term based on the Quantized Johnson-Lindenstrauss projection. As Lucas explained: "it costs you one extra bit, but you get near-perfect precision in attention score calculations." The result: similarity scores on compressed vectors are unbiased estimates of the true scores.
You don't need to understand the math to use the library. But if you're curious, there's a detailed THEORY.md mapping the implementation to the paper.
Building it: vibe coding a research paper
Here's a detail that connects my story to Lucas's: he also used AI to implement TurboQuant. In his video, he mentioned feeding the Google papers to Claude Code and having it implement the algorithm for Persua. I did the same thing — I built turboquant-js almost entirely through vibe coding.
We're in a moment where every company is racing to integrate AI into their workflows, and I wanted to see firsthand what it's like to build a non-trivial project from scratch with AI as a co-pilot. Not a todo app or a CRUD API — an actual implementation of a research paper with real math, bit-level operations, and numerical precision requirements.
The process was genuinely surprising. Things like the Randomized Hadamard Transform, Lloyd-Max codebook generation with adaptive Simpson quadrature, and bit-packing into Uint8Array buffers — these are the kind of tasks where you'd normally spend days reading textbooks and debugging off-by-one errors. With AI assistance, I could focus on understanding what the algorithm needed to do and let the tooling handle the mechanical translation into working TypeScript.
That said, it wasn't just "prompt and ship." I had to understand the paper deeply enough to validate the output, write meaningful tests (196 of them, including statistical z-tests for unbiasedness), and make architectural decisions that the AI couldn't make for me — like choosing the Randomized Hadamard Transform over dense QR decomposition for the rotation step, which brought the complexity from O(d³) down to O(d log d).
The takeaway: vibe coding works surprisingly well for implementing well-defined algorithms. The paper was the spec, the math was the test oracle, and AI was the translator between the two. It's not magic — you still need to understand what you're building — but it dramatically lowers the barrier to turning a research paper into working software.
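To illustrate the kind of statistical testing mentioned above, here's a toy z-test against a deliberately unbiased quantizer (stochastic rounding) — not the library's test suite, just the shape of the approach:

```typescript
// Stochastic rounding: round up with probability equal to the fractional
// part. Its errors average to zero, so it's an unbiased quantizer.
function stochasticRound(x: number): number {
  const lo = Math.floor(x);
  return Math.random() < x - lo ? lo + 1 : lo;
}

// z-statistic: how many standard errors the mean quantization error
// sits from zero. For an unbiased quantizer, |z| stays small.
function zStatistic(n: number): number {
  const errors: number[] = [];
  for (let i = 0; i < n; i++) {
    const x = Math.random() * 10;
    errors.push(stochasticRound(x) - x);
  }
  const mean = errors.reduce((s, e) => s + e, 0) / n;
  const variance = errors.reduce((s, e) => s + (e - mean) ** 2, 0) / (n - 1);
  return mean / Math.sqrt(variance / n);
}

const z = zStatistic(100_000); // |z| < 4 almost surely if unbiased
```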
Using it in practice
The library exposes two high-level APIs. If you've used any ORM or search client, the pattern will feel familiar.
Vector search index
```typescript
import { VectorIndex } from 'turboquant-js';

// Create an index — like creating a table
const index = new VectorIndex({ dimension: 384, bits: 3, metric: 'cosine' });

// Add vectors — like inserting rows
index.add('doc1', embedding1);
index.add('doc2', embedding2);
index.add('doc3', embedding3);

// Search — like a query, but by meaning instead of keywords
const results = index.search(queryEmbedding, 10);
// => [{ id: 'doc2', score: 0.93 }, { id: 'doc1', score: 0.87 }, ...]

// Check how much memory you saved
console.log(index.memoryUsage);
// => { compressionRatio: 20.8, actualBytes: 7200 }

// Save to disk / IndexedDB / send over the network
const buffer = index.toBuffer();

// Later: restore it
const restored = VectorIndex.fromBuffer(buffer, { dimension: 384 });
```

KV cache compression (for LLM applications)
If you're working with LLMs in the browser (using Transformers.js or WebLLM), the KV cache is a major memory bottleneck — it's the short-term memory that grows with every token the model generates. This is the exact problem Google's paper targets, and what Lucas demonstrated with Gemma 3 12B going from 6 GB to under 5 GB of RAM usage:
```typescript
import { KVCacheCompressor } from 'turboquant-js';

const compressor = new KVCacheCompressor({
  keyDim: 128, valueDim: 128,
  keyBits: 3,   // unbiased attention scores
  valueBits: 2, // low-error value reconstruction
});

compressor.append(keys, values);
const scores = compressor.attentionScores(queryVector);
```

Benchmarks
Real numbers from npm run bench:
| Dimension | Bits | Avg error (MSE) | Bias | Compression |
|---|---|---|---|---|
| 384 | 2 | 0.0012 | ~0 | 30.7x |
| 384 | 3 | 0.00074 | ~0 | 20.8x |
| 384 | 4 | 0.0006 | ~0 | 15.7x |
To put the error in perspective: an MSE of 0.00074 means the average error per coordinate is about 0.027. For a 384-dimensional vector, that's negligible — search results are nearly identical to uncompressed search.
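That per-coordinate figure is just the square root of the MSE:

```typescript
// RMS error per coordinate = sqrt(mean squared error)
const perCoordError = Math.sqrt(0.00074); // ≈ 0.027
```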
The live demo shows this concretely: at 3-bit quantization, 3 out of 5 top results typically match exact brute-force search.
Live Demo
The best way to understand what this enables is to try it yourself. The demo runs a complete semantic search pipeline in your browser:
- Downloads a small embedding model (~30 MB) via Transformers.js
- Converts 50 articles into 384-dimensional vectors
- Compresses them with turboquant-js (3-bit by default — 20.8x compression)
- Lets you search by meaning, not keywords
Everything runs client-side. No server, no API calls, no data leaving your browser. You can toggle between 2-bit, 3-bit, and 4-bit quantization and see the quality/compression trade-off in real time. The demo shows side-by-side results comparing quantized search against exact brute-force search.
How it compares
| | turboquant-js | FAISS-WASM | Pinecone / Qdrant |
|---|---|---|---|
| Runtime | Pure TypeScript | WASM binary | Server-side |
| Setup | npm install | Complex build | API key + infra |
| Training required | None | Yes (PQ) | N/A |
| Unbiased scores | Yes (proven) | No | N/A |
| Bundle size | ~15 KB | ~2 MB | N/A |
| Privacy | Data stays on client | Data stays on client | Data sent to server |
When would you use this?
- AI assistants with knowledge bases — Exactly what Lucas built with Persua: let users upload folders of documents, embed them, compress the indexes, and search by meaning — all in the browser, stored in IndexedDB.
- Browser extensions — Index bookmarks, history, or notes for local semantic search. Data never leaves the device.
- Offline-capable apps — Build search or RAG features that work without internet. Pair with Transformers.js for the embedding model.
- Privacy-first features — "Similar documents" or "smart search" where user data must stay client-side (GDPR, healthcare, legal).
- Edge functions — In constrained JS runtimes like Cloudflare Workers and Vercel Edge, shipping a WASM binary adds bundle size and build friction. turboquant-js is plain JavaScript and works there out of the box (~15 KB).
- Prototyping — Need vector search in a side project? Skip the Docker containers and API keys: npm install, three lines of code, done.
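For the IndexedDB use cases above, persisting an index comes down to storing the serialized buffer. A sketch using the raw IndexedDB API — the database and store names are arbitrary, and it assumes toBuffer() (shown earlier) yields an ArrayBuffer-compatible value:

```typescript
// Open (or create) a database with a single object store for index buffers
function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('search-db', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('indexes');
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Persist a serialized index, e.g. saveIndex('docs', index.toBuffer())
async function saveIndex(key: string, buffer: ArrayBuffer): Promise<void> {
  const db = await openDb();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction('indexes', 'readwrite');
    tx.objectStore('indexes').put(buffer, key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Load it back, e.g. VectorIndex.fromBuffer(await loadIndex('docs'), ...)
async function loadIndex(key: string): Promise<ArrayBuffer | undefined> {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const req = db.transaction('indexes').objectStore('indexes').get(key);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}
```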
From consumer to contributor
Building turboquant-js was a milestone for me. After years of being on the receiving end of open source — using other people's libraries, reading other people's code — I finally published something of my own.
Lucas said something in his video that stuck with me: he was shocked that nobody was really talking about TurboQuant despite its potential. I felt the same way — and instead of just talking about it, I decided to make it accessible. A single npm install away from anyone who needs it.
If you've been thinking about making your first open-source contribution but don't know where to start, here's what I learned: find a paper, a technique, or a tool that solves a real problem, and make it accessible. The world doesn't need another framework. It needs more bridges between research and practice.
Try it
```
npm install turboquant-js
```

- GitHub: github.com/danilodevhub/turboquant-js
- npm: npmjs.com/package/turboquant-js
- Live Demo: danilodevhub.github.io/turboquant-js-examples
- Paper: arxiv.org/abs/2504.19874
The library is MIT-licensed and contributions are welcome. If you're building anything with vector embeddings in JavaScript, I'd love to hear how you use it — open an issue or reach out.
turboquant-js is based on the paper "TurboQuant: Online Vector Quantization" by Zandieh, Daliri, Hadian, and Mirrokni (ICLR 2026).