Compressing high-dimensional vectors using quantization methods.
Vector quantization is a technique used to compress high-dimensional vectors, which can significantly reduce memory consumption in vector databases. This is particularly useful when dealing with large datasets or when deploying models on resource-constrained devices.
Binary quantization is the simplest form of quantization, where each component in a vector is converted from a float32 to a single bit: each component is rounded to 0 if it is ≤ 0 or to 1 if it is > 0, which greatly reduces the memory consumption of the vector. Binary quantization can be defined as:

$$Q(v)_i = \begin{cases} 0 & \text{if } v_i \leq 0 \\ 1 & \text{if } v_i > 0 \end{cases}$$

where $v_i$ is the $i$-th component of vector $v \in \mathbb{R}^D$ for $0 \leq i < D$.
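As a rough sketch of what this looks like in code (the function name and example vector are purely illustrative, assuming NumPy):

```python
import numpy as np

def binary_quantize(v: np.ndarray) -> np.ndarray:
    """Map each float32 component to a single bit: 0 if <= 0, 1 if > 0."""
    return (v > 0).astype(np.uint8)

v = np.array([0.12, -0.48, 0.0, 0.93], dtype=np.float32)
print(binary_quantize(v))  # [1 0 0 1]
```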
If I had a 1536-dimensional vector, it would originally consume $\frac{1536 \times 32}{8} = 6144$ bytes. After binary quantization, it would only consume $\frac{1536 \times 1}{8} = 192$ bytes. That's a reduction of ≈97% in memory consumption!

The trade-off with binary quantization is that it can lead to a loss of semantic meaning, as the original values are not entirely preserved. However, it can still be effective for certain applications where high precision is not critical.
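To actually realize the 192-byte figure, the individual bits also have to be packed into bytes rather than stored one per byte. A minimal sketch of that packing step, assuming NumPy's `packbits`:

```python
import numpy as np

v = np.random.randn(1536).astype(np.float32)  # original vector: 1536 * 4 bytes = 6144 bytes
bits = (v > 0).astype(np.uint8)               # one bit of information per component
packed = np.packbits(bits)                    # 1536 bits / 8 = 192 bytes
print(v.nbytes, packed.nbytes)                # 6144 192
```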
Scalar quantization is a method that reduces the precision of each component in a vector by mapping it to a smaller set of values. This is done by dividing the range of values into intervals and assigning each interval a representative value. The number of bits allocated to each component determines the number of intervals.
Let's say we have an N-dimensional vector

$$v = [v_0, v_1, v_2, \ldots, v_{N-1}]$$

where each component $v_i$ is a continuous float32 value in the range $[-1, 1]$. We can apply scalar quantization by mapping each component to a smaller set of values, such as integers or fixed-point numbers. For this example, we will map each float32 component to an 8-bit integer.
We first define the number of bits to allocate to each component, which in this case is $n_{bits} = 8$.
We then calculate the number of intervals as $\text{intervals} = 2^{n_{bits}} = 256$.
Next, we map each component of our original float32 vector to the nearest interval in the discrete range [0,256). To achieve this, we can use min-max normalization to scale our values linearly.
$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$

scales $x$ to the range $[a, b]$. In our case, we can set $a = 0$ and $b = 2^{n_{bits}} - 1 = 255$ to map our float32 values to the range of an 8-bit integer, $[0, 256)$.
After passing each component through the min-max normalization, we can round each value to the nearest integer in the range [0,255].
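As a quick sanity check with a made-up component value of $x = 0.25$ (so $\min(x) = -1$ and $\max(x) = 1$ from our assumed range):

$$x' = 0 + \frac{(0.25 - (-1))(255 - 0)}{1 - (-1)} = \frac{1.25 \times 255}{2} = 159.375 \approx 159$$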
Finally, we can represent each component as an 8-bit integer, which reduces the memory consumption of the vector. Our initial N-dimensional vector $v$ consumed $N \times 32$ bits, but now only consumes $N \times 8$ bits, which is a 4× reduction in memory consumption!
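Putting the steps together, here is a minimal sketch of scalar quantization in NumPy (the function name is illustrative, and it assumes the observed min and max of the vector are used as the value range):

```python
import numpy as np

def scalar_quantize(v: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Min-max normalize each component to [0, 2**n_bits - 1] and round to an integer."""
    a, b = 0, 2 ** n_bits - 1
    v_min, v_max = v.min(), v.max()
    scaled = a + (v - v_min) * (b - a) / (v_max - v_min)
    return np.round(scaled).astype(np.uint8)  # 8 bits per component instead of 32

v = np.array([-1.0, -0.25, 0.25, 1.0], dtype=np.float32)
print(scalar_quantize(v))  # [  0  96 159 255]
```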
Product quantization is a more advanced quantization method which uses a technique called sub-vector quantization: high-dimensional vectors are split into sub-vectors, and each sub-vector is compressed using a set of centroids. This method is particularly useful for large datasets and can achieve significant memory savings while maintaining a reasonable level of accuracy.
We have a 6-dimensional index ($D = 6$) and would like to apply product quantization to the following vector:
$$v = [0.62, -0.14, 0.31, 0.97, -0.53, 0.45]$$
We then choose to split the vector into $m = 3$ sub-vectors, each with dimension $D' = \frac{D}{m} = 2$.
$$u_1 = [0.62, -0.14] \quad u_2 = [0.31, 0.97] \quad u_3 = [-0.53, 0.45]$$

Note: the chosen $m$ value must divide $D$ (i.e. $D \bmod m = 0$).
We then generate $\frac{D}{D'} = 3$ codebooks, one for each sub-vector. A codebook is a set of centroids in a $D'$-dimensional space, which are used to approximate each sub-vector. For example, let's assume we have the following 3 codebooks:
$$c_1 = \{c_{11}, c_{12}, c_{13}, c_{14}\} \quad c_2 = \{c_{21}, c_{22}, c_{23}, c_{24}\} \quad c_3 = \{c_{31}, c_{32}, c_{33}, c_{34}\}$$

where $c_{ij}$ is a $D'$-dimensional centroid used to approximate the $i$-th sub-vector of any given vector.

Note: the number of centroids generated per sub-vector is $2^{n_{bits}}$, where $n_{bits}$ is the number of bits allocated to each single value in our outputted compressed vector. Currently, in an uncompressed vector like $v$, $n_{bits} = 32$ as vectors originally store float32 values. In the example above, we set $n_{bits} = 2$, which creates $2^{n_{bits}} = 4$ centroids per codebook.
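One concrete way to hold these codebooks in memory is a single array of shape $(m, 2^{n_{bits}}, D')$. In the sketch below the centroid values are random placeholders standing in for centroids that would normally be learned from data (more on that shortly):

```python
import numpy as np

m, n_bits, d_sub = 3, 2, 2
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((m, 2 ** n_bits, d_sub))  # 3 codebooks, 4 centroids each, each 2-dimensional
print(codebooks.shape)  # (3, 4, 2)
```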
Given our sub-vectors above, we can then approximate each sub-vector by finding the closest centroid in each codebook using simple Euclidean distance. Let’s say after computing the nearest centroids to each sub-vector, we approximate them as follows:
$$u_1 \approx c_{12} \quad u_2 \approx c_{23} \quad u_3 \approx c_{31}$$

In our original vector, we can replace each sub-vector with its corresponding centroid index $j$ in its respective codebook. We will denote the quantized vector as $Q(v)$:

$$v = [0.62, -0.14, 0.31, 0.97, -0.53, 0.45] \rightarrow v = [u_1, u_2, u_3] \rightarrow v \approx [c_{12}, c_{23}, c_{31}] \rightarrow Q(v) = [2, 3, 1]$$

Given that $n_{bits} = 2$, each centroid index can be stored in only 2 bits (four possible values). This means that $v$ consumed $6 \times 32 = 192$ bits, but $Q(v)$ now only consumes $3 \times 2 = 6$ bits. This is a memory reduction of ≈97%!

The memory consumption of a vector passed through product quantization can be calculated as follows:

$$\text{Memory consumption} = m \times n_{bits}$$

where $m = \frac{D}{D'}$ is the number of sub-vectors, $D$ is the dimension of the initial vector, $D'$ is the dimension of each sub-vector, and $n_{bits}$ is the number of bits allocated to each centroid index.
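A minimal sketch of this encoding step, assuming a codebook array of shape $(m, 2^{n_{bits}}, D')$ like the one above (since these centroids are random placeholders, the resulting indexes won't match the worked example):

```python
import numpy as np

def pq_encode(v: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Replace each sub-vector of v with the index of its nearest centroid."""
    m, k, d_sub = codebooks.shape
    sub_vectors = v.reshape(m, d_sub)
    codes = np.empty(m, dtype=np.uint8)
    for i, u in enumerate(sub_vectors):
        distances = np.linalg.norm(codebooks[i] - u, axis=1)  # Euclidean distance to each centroid
        codes[i] = np.argmin(distances)
    return codes

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((3, 4, 2))     # m=3, 2**2=4 centroids per codebook, D'=2
v = np.array([0.62, -0.14, 0.31, 0.97, -0.53, 0.45])
print(pq_encode(v, codebooks))                 # three centroid indexes, one per sub-vector
```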
One of the challenges with product quantization is generating the codebooks. Approximating a sub-vector with a centroid from a codebook introduces loss, so the centroids we decide to use will directly affect how lossy the quantization is, and therefore the overall accuracy of results in the vector database.

A common approach for generating codebooks is to use a clustering algorithm, like k-means clustering. K-means clustering is a machine learning algorithm that partitions a set of vectors into k clusters, where each cluster is represented by its centroid. As with most machine-learning algorithms, k-means needs to learn the shape of the data it's given to accurately partition it.
Given a dataset of many vectors, product quantization requires splitting every vector in the dataset into $m$ sub-vectors. This creates $m$ sub-spaces, each of dimension $D'$. K-means clustering can then be applied to each of these sub-spaces to learn the shape of the sub-vectors it contains, generating centroids that best approximate them. This results in $m$ codebooks that greatly reduce the loss when approximating sub-vectors with centroids.

Since k-means clustering is a machine learning algorithm that needs to "learn" the shape of its data, a vector database using product quantization must be trained on a given dataset of vectors before it can be used effectively. This training process involves running k-means clustering on the dataset to generate the codebooks, which can then be used for accurate quantization. If a different dataset is later used with the same trained database, the codebooks must be regenerated: the database needs to "re-learn" the shape of the new data to ensure accurate quantization.
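A sketch of that training step, assuming scikit-learn's `KMeans` is available and that `dataset` is an (n_vectors, D) array (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(dataset: np.ndarray, m: int, n_bits: int) -> np.ndarray:
    """Run k-means independently in each sub-space to produce one codebook per sub-vector."""
    n_vectors, d = dataset.shape
    d_sub = d // m                    # D must be divisible by m
    k = 2 ** n_bits                   # number of centroids per codebook
    codebooks = np.empty((m, k, d_sub), dtype=np.float32)
    for i in range(m):
        sub_space = dataset[:, i * d_sub:(i + 1) * d_sub]  # all sub-vectors for position i
        codebooks[i] = KMeans(n_clusters=k).fit(sub_space).cluster_centers_
    return codebooks

dataset = np.random.randn(1000, 6).astype(np.float32)
codebooks = train_codebooks(dataset, m=3, n_bits=2)  # shape (3, 4, 2)
```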