fsunavala-msft
As organizations continue to harness the power of Generative AI for building Retrieval-Augmented Generation (RAG) applications and agents, the need for efficient, high-performance, and scalable solutions has never been greater. Today, we're excited to introduce Binary Quantization, a new feature that reduces vector index size by up to 96% and search latency by up to 40%.
What is Binary Quantization?
Binary Quantization (BQ) is a technique that compresses high-dimensional vectors by representing each dimension as a single bit. This method drastically reduces the memory footprint of a vector index and accelerates vector comparison operations at the cost of recall. The loss of recall can be compensated for with two techniques called oversampling and reranking, giving you tools to choose what to prioritize in your application: recall, speed, or cost.
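To make the idea concrete, here is a minimal NumPy sketch of the core transformation, assuming a simple sign threshold (a positive component maps to bit 1). The exact thresholding rule used by the service may differ, and the `binary_quantize` helper is purely illustrative:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Map each float32 component to one bit (1 if positive, 0 otherwise),
    then pack 8 bits per byte. A 1536-dim vector becomes 192 bytes."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((4, 1536)).astype(np.float32)  # toy embeddings

codes = binary_quantize(embeddings)
print(embeddings.nbytes)  # 24576 bytes (4 vectors x 1536 dims x 4 bytes)
print(codes.nbytes)       # 768 bytes   (4 vectors x 192 bytes), 32x smaller
```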
Why should I use Binary Quantization?
Binary quantization is most applicable to customers who want to store a very large number of vectors at a low cost. Azure AI Search keeps vector indexes in memory to offer the best possible search performance. Binary Quantization (BQ) lets you reduce the size of the in-memory vector index, which in turn reduces the number of Azure AI Search partitions needed to fit your data, leading to cost reductions.
Binary quantization reduces the size of the in-memory vector index by converting 32-bit floating-point numbers into 1-bit values, achieving up to a 28x reduction in vector index size (slightly less than the theoretical 32x because of overhead introduced by the index data structures). The table below shows the impact of binary quantization on vector index size and storage use.
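The arithmetic is straightforward. A back-of-the-envelope calculation for 1M documents at 1536 dimensions (the configuration in Table 1.1) looks like this:

```python
docs, dims = 1_000_000, 1536

full_bytes   = docs * dims * 4     # float32: 4 bytes per dimension
packed_bytes = docs * dims // 8    # binary quantization: 1 bit per dimension

print(full_bytes / 2**30)          # ~5.72 GiB of raw float32 vectors
print(packed_bytes / 2**30)        # ~0.18 GiB of packed bit codes
print(full_bytes // packed_bytes)  # 32 -- graph and index overhead brings the
                                   # observed end-to-end ratio closer to ~28x
```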
Table 1.1: Vector Index Storage Benchmarks
| Compression Configuration | Document Count | Vector Index Size (GB) | Total Storage Size (GB) | % Vector Index Savings | % Storage Savings |
|---|---|---|---|---|---|
| Uncompressed | 1M | 5.77 | 24.77 | | |
| SQ | 1M | 1.48 | 20.48 | 74% | 17% |
| BQ | 1M | 0.235 | 19.23 | 96% | 22% |

Table 1.1 compares the storage metrics of three vector compression configurations: Uncompressed, Scalar Quantization (SQ), and Binary Quantization (BQ). The data shows significant storage savings with Binary Quantization: up to 96% in vector index size and 22% in overall storage. The benchmark used MTEB/dbpedia with default vector search settings and OpenAI text-embedding-ada-002 embeddings at 1536 dimensions.
Increased Performance
Binary Quantization (BQ) also improves performance, reducing query latencies by 10-40% compared to uncompressed indexes. The exact improvement varies with oversampling rate, dataset size, vector dimensionality, and service configuration. BQ is fast for a few reasons: Hamming distance is cheaper to compute than cosine similarity, and packed bit vectors are smaller, which improves memory locality. This makes BQ a great choice where speed is critical, and it leaves room to apply moderate oversampling to balance speed with relevance.
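The sketch below (illustrative only, not the service's implementation) shows the bitwise form of the distance computation: comparing two 1536-dimensional vectors becomes an XOR plus popcount over 192 bytes instead of 1536 floating-point multiply-adds. The `hamming_distances` helper and the toy data are assumptions made for this example:

```python
import numpy as np

# Popcount lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query code and many packed codes:
    XOR the bytes, then count the bits that differ."""
    return POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1, dtype=np.int32)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 1536)).astype(np.float32)
codes = np.packbits((vectors > 0).astype(np.uint8), axis=1)   # (10000, 192) bytes

print(hamming_distances(codes[0], codes)[:3])  # first entry is 0 (identical code)
```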
Quality Retention
When binary quantization is used, the reduction in storage use and the improvement in search performance come at the cost of recall. However, the tradeoff can be managed effectively with oversampling and reranking. Oversampling retrieves a larger set of candidate documents to offset the resolution loss introduced by quantization, and reranking recalculates similarity scores using the full-resolution vectors. The table below shows mean NDCG@10 on a subset of the MTEB datasets for OpenAI and Cohere embeddings with binary quantization, with and without reranking/oversampling.
Table 1.2: Impact of Binary Quantization on Mean NDCG@10 Across MTEB Subset
| Model | No Rerank (Δ / %) | Rerank 2x Oversampling (Δ / %) |
|---|---|---|
| Cohere Embed V3 (1024d) | -4.883 (-9.5%) | -0.393 (-0.76%) |
| OpenAI text-embedding-3-small (1536d) | -2.312 (-4.55%) | +0.069 (+0.14%) |
| OpenAI text-embedding-3-large (3072d) | -1.024 (-1.86%) | +0.006 (+0.01%) |
Table 1.2 shows the change in mean NDCG@10 (absolute points and percent) with Binary Quantization relative to an uncompressed index, across different embedding models on a subset of the MTEB datasets.
Key takeaways:
- BQ+Reranking yields higher retrieval quality compared to no reranking
- The impact of reranking is more pronounced in models with lower dimensions, while for higher dimensions, the effect is smaller and sometimes negligible
- Strongly consider reranking with full-precision vectors to minimize, or even eliminate, the recall loss caused by quantization (see the sketch below)
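To make the two-stage flow concrete, here is a minimal NumPy sketch of oversampling plus full-precision reranking. The `search_with_rerank` helper and the toy corpus are illustrative, not the service's internal implementation; `oversampling=2.0` mirrors the 2x column in Table 1.2:

```python
import numpy as np

POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def search_with_rerank(query, vectors, codes, k=10, oversampling=2.0):
    """Two-stage search: candidate generation over packed bit codes, then
    reranking of the oversampled candidate set with full-precision vectors."""
    # Stage 1: rank k * oversampling candidates by Hamming distance on bit codes.
    query_code = np.packbits((query > 0).astype(np.uint8))
    dists = POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1, dtype=np.int32)
    candidates = np.argsort(dists)[: int(k * oversampling)]

    # Stage 2: rescore only those candidates with full-precision cosine similarity.
    cand = vectors[candidates]
    sims = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:k]]

rng = np.random.default_rng(1)
vectors = rng.standard_normal((10_000, 1536)).astype(np.float32)
codes = np.packbits((vectors > 0).astype(np.uint8), axis=1)

top10 = search_with_rerank(vectors[42], vectors, codes, k=10, oversampling=2.0)
print(top10[0])  # 42 -- the query document ranks itself first after reranking
```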
When to Use Binary Quantization
Binary Quantization is recommended for applications with high-dimensional vectors and large datasets, where storage efficiency and fast search performance are critical. It is particularly effective for embeddings with dimensions greater than 1024. For smaller dimensions, we recommend testing BQ's quality or considering SQ as an alternative. Additionally, BQ performs exceptionally well when embeddings are centered around zero, as is the case for popular embedding models from OpenAI and Cohere.
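As a quick, illustrative diagnostic, you could run something like the following on a sample of your own embeddings before enabling BQ. It assumes a sign-based threshold, which favors distributions centered around zero; the `centering_report` helper is hypothetical:

```python
import numpy as np

def centering_report(embeddings: np.ndarray) -> None:
    """Rough diagnostic for sign-threshold quantization: per-dimension means
    close to 0 and a near 50/50 positive/negative split are good signs."""
    per_dim_mean = np.abs(embeddings.mean(axis=0)).mean()
    positive_frac = (embeddings > 0).mean()
    print(f"avg |per-dimension mean|: {per_dim_mean:.4f}")
    print(f"fraction of positive components: {positive_frac:.3f}")

# Run on a representative sample of your own embeddings, e.g.:
# centering_report(np.asarray(sample_embeddings, dtype=np.float32))
```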
BQ + reranking/oversampling works by searching a compressed vector index in memory and reranking with the full-precision vectors stored on disk, allowing you to significantly reduce costs while maintaining strong search quality. This approach operates efficiently in memory-constrained settings by leveraging both memory and SSDs to deliver high performance and scalability on large datasets.
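The general pattern can be sketched outside the service as well. This is only an illustration of the memory/disk split, not how Azure AI Search's storage engine is implemented: packed codes stay resident in memory, while full-precision vectors are memory-mapped from disk and touched only for the candidates being reranked.

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.standard_normal((10_000, 1536)).astype(np.float32)

# Full-precision vectors are written to disk and memory-mapped...
full = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=vectors.shape)
full[:] = vectors
full.flush()

# ...while only the packed bit codes stay resident in memory.
codes = np.packbits((vectors > 0).astype(np.uint8), axis=1)
print(codes.nbytes / full.nbytes)      # 0.03125 -- 1/32 of the float32 footprint

# At query time, only the oversampled candidates are read back for reranking:
candidates = np.array([3, 17, 42])
reranked_rows = np.asarray(full[candidates])   # touches just those rows on disk
```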
BQ adds to our price-performance enhancements made over the past several months, offering storage savings and performance improvements. By adopting this feature, organizations can achieve faster search results and lower operational costs, ultimately driving better outcomes and user experiences.
More Functionality Now Generally Available
We're pleased to share that several vector search enhancements are now generally available in Azure AI Search. These updates give you more control over the retriever in your RAG solutions and help optimize LLM performance. Here are the key highlights:
- Integrated vectorization with Azure OpenAI for Azure AI Search is now generally available!
- Support for Binary Vector Types: Azure AI Search supports narrow vector types including binary vectors. This feature enables the storage and processing of larger vector datasets at lower costs while maintaining fast search capabilities.
- Vector Weighting: This feature allows users to assign relative importance to vector queries over term queries in hybrid search scenarios. It gives more control over the final result set by enabling users to favor vector similarity over keyword similarity.
- Document Boosting: Boost your search results with scoring profiles tailored to vector and hybrid search queries. Whether you prioritize freshness, geolocation, or specific keywords, our new feature allows for targeted document boosting, ensuring more relevant results for your needs.
Getting started with Azure AI Search
To get started with binary quantization, visit our official documentation here: Reduce vector size - Azure AI Search | Microsoft Learn
- Learn more about Azure AI Search and about all the latest features.
- Start creating a search service in the Azure portal, or with the Azure CLI, the Management REST API, an ARM template, or a Bicep file.
- Learn about Retrieval Augmented Generation in Azure AI Search.
- Explore our preview client libraries in Python, .NET, Java, and JavaScript, offering diverse integration methods to cater to varying user needs.
- Explore how to create end-to-end RAG applications with Azure AI Studio.