Good Question: Looking for a Better Vector Database Solution?
Discover how leading teams are revolutionizing their vector database workflows with AI-powered automation.
In short: Vector database guide for 2026: storing and searching embeddings for AI apps. Compare Pinecone, Weaviate, pgvector, and learn architecture for RAG systems.
Key Features
ANN search algorithms
HNSW, IVF-PQ, DiskANN, and ScaNN power sub-10ms approximate nearest-neighbor lookups across hundreds of millions of vectors with tunable recall/latency trade-offs.
Horizontal scalability
Sharded indexes, replica routing, and tiered storage let modern vector DBs scale from 100K embeddings on a laptop to 10B+ vectors across distributed clusters.
Hybrid search (dense + sparse)
Combine BM25 keyword scoring with dense vector similarity (and reranker fusion) to recover the precision that pure semantic search misses on names, IDs, and code tokens.
Metadata filtering & namespaces
Pre-filter or post-filter on tenant, language, document_type, ACLs, and time ranges so a single index serves many surfaces without leaking data across users.
Multi-tenancy & isolation
Per-tenant namespaces, encryption keys, and row-level security let SaaS builders host thousands of customers in one cluster without rebuilding the index per tenant.
Observability & quality monitoring
Built-in dashboards for recall@k, latency percentiles, drift detection, and reranker hit-rate tell you when your embeddings or chunking strategy needs to change.
By Anjali Rao · Research Engineer, Long-Context Systems
Updated May 6, 2026
What vector databases actually solve in 2026
A vector database is a storage and retrieval system specialized for high-dimensional embedding vectors — the numerical representations produced by models like OpenAI text-embedding-3-large, Voyage-3-large, or Cohere embed-english-v4. Instead of searching by exact keyword match, you search by semantic similarity: "find me the 50 product reviews most like this query," "retrieve the 10 internal documents closest in meaning to this customer ticket," "surface images that look like this one." Under the hood, vector DBs index those embeddings using approximate nearest-neighbor (ANN) algorithms such as HNSW, IVF-PQ, ScaNN, and DiskANN to deliver single-digit-millisecond lookups across hundreds of millions of items.
The 2026 vector DB landscape matters because retrieval-augmented generation has become the default architecture for production AI: you cannot fit your entire knowledge base into even a 2M-token context window, so you embed your data, store the vectors, and retrieve the top-K relevant chunks at query time. Beyond RAG, vector DBs power semantic search, deduplication, recommendation, anomaly detection, code search (think Cursor and Copilot), and any system that needs embeddings at scale.
The shape of the market in 2026: managed services (Pinecone, Weaviate Cloud, Qdrant Cloud) dominate new RAG builds; pgvector eats the under-10M-vector segment for Postgres shops; Milvus and Qdrant own self-hosted at hyperscale; and open-source LanceDB and Chroma rule local development. Swfte Studio integrates with all of them so you can swap stores without rewriting your retrieval layer.
Top 8 Vector Databases (2026)
| Vendor | Hosting | Pricing model | Best for | Latency at 10M vectors | Scale ceiling |
|---|---|---|---|---|---|
| Pinecone | Fully managed serverless | Per read/write unit + storage GB | Zero-ops managed RAG, fastest time-to-first-query | 8-15ms p99 | 10B+ vectors (sharded) |
| Weaviate | OSS + managed cloud | Per node-hour or per object on cloud | Hybrid BM25+vector search, multi-tenant SaaS apps | 10-20ms p99 | 1B+ vectors per cluster |
| Qdrant | OSS + managed cloud | Per node-hour, generous free tier | Self-hosted at scale, advanced payload filtering | 6-15ms p99 | 5B+ vectors per cluster |
| Milvus | OSS + Zilliz Cloud | Per cluster + storage tiering | Hyperscale (10B+), GPU-accelerated indexing | 8-25ms p99 | 100B+ vectors (sharded) |
| pgvector | Self-hosted Postgres / Neon / Supabase / RDS | Postgres compute + storage | Postgres-native apps under 10M vectors | 15-40ms p99 (HNSW) | ~50M vectors per index |
| Chroma | OSS + Chroma Cloud (2026) | Free OSS, managed in beta | Local dev, prototyping, notebooks | 20-60ms p99 | ~10M vectors per node |
| LanceDB | OSS embedded + managed | Free OSS, S3-backed | Edge/serverless, columnar analytics on vectors | 10-30ms p99 | 1B+ vectors on object storage |
| MongoDB Atlas Vector Search | Managed (Atlas) | Atlas cluster tier + search nodes | Mongo-native apps, document + vector hybrid | 15-35ms p99 | 500M+ vectors per cluster |
Latency assumes 768-dim embeddings, HNSW (or vendor equivalent), default recall targets ~0.95. Always benchmark with your real query distribution.
When you actually need a vector database (and when you don't)
You probably need one when:
- Your knowledge base exceeds ~100K chunks or grows continuously (support tickets, product catalog, internal wiki).
- You are building RAG over thousands of documents and need sub-50ms retrieval.
- You need semantic similarity for recommendations, deduplication, or fraud/anomaly clustering.
- Your app is multi-tenant and each tenant needs its own isolated index/namespace.
- You require hybrid search (vector + keyword + metadata filters) over the same corpus.
- You expect to scale past 10M vectors or sustain over 100 QPS in production.
You probably do not need one when:
- Your entire corpus fits in a 1M-2M token context window — just stuff the prompt.
- You have under ~10K rows; a simple in-memory FAISS or even brute-force cosine works fine.
- Your users actually want exact keyword matching (legal, compliance, code lookup); BM25 / Elasticsearch beats vectors here.
- Your data updates so rarely that re-embedding once a month into a flat file is acceptable.
- You are still validating the product idea — postpone the infra decision until traffic justifies it.
- You are already on Postgres with under 5M items and modest QPS; pgvector is enough.
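For the under-10K-rows case in the list above, exact brute-force search is often all that is needed. The following is a minimal sketch in pure Python, assuming toy hand-written 3-dimensional vectors in place of real embeddings; the function names and sample ids are illustrative, not part of any library.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, corpus, k=3):
    """Exact top-k search: score every item, keep the k best.

    corpus maps an item id to its embedding (toy vectors in this sketch).
    """
    scored = ((cosine_similarity(query, vec), item_id)
              for item_id, vec in corpus.items())
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]


# Hypothetical sample data; real embeddings come from an embedding model.
corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "return-window": [0.8, 0.2, 0.1],
}
results = brute_force_top_k([1.0, 0.0, 0.0], corpus, k=2)
print([item_id for item_id, _ in results])
```

Every query scans the whole corpus, so cost grows linearly with corpus size. That is the property ANN indexes exist to avoid, and it is why this approach only makes sense while the corpus is small.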
Postgres + pgvector vs dedicated vector DB: the 2026 verdict
Two years ago this debate had a clear answer — pgvector was a toy. In 2026 the answer is genuinely "it depends." pgvector with HNSW, binary quantization, and pgvectorscale now handles 5-10M vectors at single-digit-millisecond latency on a modest Postgres instance, and you get transactional consistency between your metadata rows and your vectors for free. For most internal AI tools, customer support copilots, and B2B SaaS RAG features, that is more than enough — and shipping one fewer service is a real engineering win.
Where dedicated vector DBs (Pinecone, Qdrant, Weaviate, Milvus) still pull decisively ahead: (1) beyond 50M vectors per index, (2) sustained 1,000+ QPS workloads, (3) heavy multi-tenant SaaS where per-tenant namespaces matter operationally, and (4) when you need first-class hybrid search and rerankers wired in. The honest 2026 default: start with pgvector, graduate to a dedicated store when your p99 latency or operational pain crosses the threshold. Swfte Studio abstracts both behind one retrieval API so the migration is a config change, not a rewrite.
How vector search actually works under the hood
Vector search is dominated by four families of approximate nearest-neighbor (ANN) algorithms, each with sharply different trade-offs once your corpus crosses 1M items.
- HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph where queries greedily walk from a coarse top layer down to a dense base layer; it dominates managed vendors (Pinecone, Weaviate, Qdrant, pgvector) because it delivers 0.95-0.99 recall at sub-10ms latency on million-scale indexes. The cost is RAM: a 768-dim float32 HNSW graph eats roughly 4-6KB per vector, so 100M vectors need 400-600GB of memory.
- IVF-PQ (Inverted File with Product Quantization) clusters vectors into Voronoi cells and compresses each into 8-32 bytes via product quantization. It cuts memory 16-64x at the price of 1-3% recall and slightly higher latency, and it is the workhorse for 100M+ vector workloads in Milvus, Faiss, and pgvectorscale.
- DiskANN (Microsoft) and its successor FreshDiskANN hold most of the graph on NVMe SSD, fetching only the visited nodes; this delivers 50-100M vectors per node at 5-30ms p99 with 90%+ less RAM than HNSW.
- ScaNN (Google) uses anisotropic vector quantization optimized for inner-product loss and is the foundation of Vertex AI Vector Search and most internal Google retrieval.
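The RAM figures above follow from simple arithmetic, sketched below. The per-vector footprints (4 KB and 6 KB, using decimal units) are the estimates quoted in the text. The 96-byte quantized code size is purely an assumed example to show how a compression ratio is computed; it is not a figure from any vendor.

```python
def raw_vector_bytes(dim, bytes_per_component=4):
    """Bytes for one uncompressed vector (float32 = 4 bytes per component)."""
    return dim * bytes_per_component


def index_ram_gb(num_vectors, bytes_per_vector):
    """Total RAM in decimal gigabytes for a given per-vector footprint."""
    return num_vectors * bytes_per_vector / 1e9


def compression_ratio(dim, code_bytes, bytes_per_component=4):
    """How many times smaller a quantized code is than the raw vector."""
    return raw_vector_bytes(dim, bytes_per_component) / code_bytes


raw = raw_vector_bytes(768)                   # raw floats only, no graph links
low_gb = index_ram_gb(100_000_000, 4_000)     # 4 KB per vector (from the text)
high_gb = index_ram_gb(100_000_000, 6_000)    # 6 KB per vector (from the text)
ratio = compression_ratio(768, 96)            # 96-byte code is an assumption

print(raw, low_gb, high_gb, ratio)
```

The gap between the 3,072 raw bytes and the 4-6 KB estimate is the graph's neighbor lists and bookkeeping. Rerunning the arithmetic with your own dimension and vector count is a quick sanity check before sizing nodes.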
The recall vs latency trade-off is the daily knob: turning efSearch in HNSW from 64 to 256 lifts recall@10 from 0.94 to 0.99 but doubles tail latency. Heavy metadata pre-filters (e.g. tenant_id IN (...) AND lang = 'en') punish HNSW more than IVF-PQ because they break graph locality.
What changes at each scale tier:
- At 1M vectors, almost any algorithm on a single node hits sub-10ms p99; pick whatever is operationally simplest.
- At 10M vectors, RAM cost forces a real choice: HNSW if latency-critical, IVF-PQ if cost-sensitive, and you start caring about index build time (HNSW rebuilds at 10M can take 30+ minutes).
- At 100M vectors, sharding across 4-8 nodes becomes mandatory, DiskANN or IVF-PQ replaces in-memory HNSW for cost reasons, and you need real query routing.
- At 1B vectors, you are operating a small distributed system: tiered storage (hot HNSW + cold IVF-PQ), multi-region replication, hierarchical clustering, and quality monitoring per shard. Almost no team needs to operate at this tier, but if you do, Milvus, Vespa, and Vertex AI Vector Search are the proven options.
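To make the metadata-filter point above more concrete, here is a small sketch of why filters and top-K search interact badly. It uses exact search over toy 2-dimensional vectors as a stand-in (it is not an HNSW implementation), and all names and data are hypothetical. Filtering after the search can return fewer than K hits, while filtering before it shrinks the candidate set the index is allowed to visit.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Exact top-k by dot product over (id, vector, metadata) tuples."""
    ranked = sorted(items, key=lambda item: dot(query, item[1]), reverse=True)
    return ranked[:k]


def post_filter_search(query, items, k, predicate):
    """Search first, then drop non-matching hits; may return fewer than k."""
    return [item[0] for item in top_k(query, items, k) if predicate(item[2])]


def pre_filter_search(query, items, k, predicate):
    """Filter first, then search only the matching subset."""
    allowed = [item for item in items if predicate(item[2])]
    return [item[0] for item in top_k(query, allowed, k)]


def is_english(meta):
    return meta["lang"] == "en"


# Hypothetical corpus: the two closest vectors are both German.
items = [
    ("a", [1.0, 0.0], {"lang": "de"}),
    ("b", [0.9, 0.1], {"lang": "de"}),
    ("c", [0.5, 0.5], {"lang": "en"}),
    ("d", [0.2, 0.8], {"lang": "en"}),
]
query = [1.0, 0.0]

post_hits = post_filter_search(query, items, 2, is_english)
pre_hits = pre_filter_search(query, items, 2, is_english)
print(post_hits, pre_hits)
```

In this toy data the post-filtered search comes back empty because both nearest neighbors fail the filter. In a graph index the pre-filtered variant has its own cost, since many of a node's neighbors become ineligible during the walk, which is one reason to benchmark with realistic filters.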
How to choose a vector database in 2026 (12 steps)
- Project peak scale honestly. Take your current vector count, multiply by 10x for 18 months, then add a buffer for re-embedding when you swap models. Most teams undersize and re-platform within a year.
- Decide hosting model. Fully managed (Pinecone, Weaviate Cloud, Qdrant Cloud, Zilliz) for fastest velocity. Self-hosted OSS (Qdrant, Milvus, Weaviate, pgvector) for cost control or sovereignty. Embedded (LanceDB, Chroma) for edge or offline.
- Map hybrid search needs. If users search by exact IDs, SKUs, names, or code tokens you need first-class BM25 + dense fusion. Weaviate, Vespa, Elasticsearch, and Qdrant lead here; Pinecone added it in late 2025.
- Profile the metadata filter shape. Heavy pre-filters on high-cardinality fields punish HNSW. Test with realistic filters before benchmarking — vendors quote unfiltered numbers.
- Plan multi-tenancy from day one. Per-tenant namespaces (Pinecone, Qdrant collections, Weaviate tenants) scale operationally; a single shared index with a tenant_id filter does not past a few hundred tenants.
- Quantify observability. Demand recall@k, p50/p95/p99 latency per shard, ingest lag, replica health, and reranker hit rate out of the box. Rolling your own dashboards is a recurring tax.
- Model the cost curve. Pinecone serverless is read/write units + storage; Weaviate Cloud is per object + node-hour; Qdrant Cloud is per vCPU-hour; pgvector is just Postgres. Run your projected QPS through each pricing page — costs differ 5-10x at the same scale.
- Estimate migration risk. Re-embedding 100M docs is a multi-day GPU bill. Pick a vendor whose data export and ingest formats are clean (Parquet, Arrow, JSONL) so you can leave without rewriting your pipeline.
- Audit the ecosystem. First-class clients for Python, TypeScript, and the framework you actually use (LangChain, LlamaIndex, Haystack, Vercel AI SDK). Native integrations with your embedding provider matter more than benchmarks.
- Verify compliance posture. SOC 2 Type II, HIPAA, GDPR data residency, BYOK / customer-managed keys, VPC peering. Regulated buyers will block you on missing controls.
- Test reranking integration. If your final answer quality depends on a reranker (Cohere Rerank 3.5, Voyage Rerank-2), the vector DB should make passing top-100 candidates to the reranker trivial — not a custom service.
- Run a 7-day shadow benchmark. Mirror real production traffic to two candidate stores in parallel, score recall against a held-out gold set, and compare p99 latency under your actual filter and QPS distribution. Synthetic benchmarks lie.
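The scoring side of the shadow benchmark in the last step can be sketched as follows. This is a minimal illustration, assuming you have already collected each candidate store's results and latencies per query; the function names, the nearest-rank percentile method, and the sample data are assumptions, not a prescribed methodology.

```python
import math


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the gold-set relevant ids found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def score_store(results_by_query, gold_by_query, latencies_ms, k=10):
    """Mean recall@k across gold queries plus p99 latency for one store."""
    recalls = [
        recall_at_k(results_by_query.get(query, []), gold, k)
        for query, gold in gold_by_query.items()
    ]
    return {
        "mean_recall_at_k": sum(recalls) / len(recalls),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Hypothetical gold set and mirrored-traffic results for one candidate store.
gold = {"q1": ["d1", "d2"], "q2": ["d3"]}
store_a_results = {"q1": ["d1", "d9", "d2"], "q2": ["d7"]}
store_a_latencies = list(range(1, 101))  # 1ms .. 100ms, one sample each

report = score_store(store_a_results, gold, store_a_latencies, k=3)
print(report)
```

Running the same scoring over both candidate stores with identical queries and filters gives a like-for-like comparison, which is the point of mirroring production traffic rather than trusting synthetic numbers.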
Embedding model + dimension trade-offs (2026)
| Model | Dimensions | MTEB score | Cost per 1M tokens | Best for |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (truncatable to 256-3072) | 64.6 | $0.13 | General-purpose English, Matryoshka truncation for cost tuning |
| Cohere Embed v4 | 1024 / 1536 (configurable) | 65.1 | $0.10 | Multilingual (100+ languages), pairs natively with Cohere Rerank 3.5 |
| Voyage-3-large | 1024 | 66.3 | $0.18 | Top retrieval quality on legal, finance, code; strong with Voyage Rerank-2 |
| BGE-M3 (open) | 1024 | 63.9 | Self-hosted (~$0.01) | Multilingual + multi-functional (dense+sparse+colbert) in one pass |
| Nomic Embed v2 (open) | 768 (Matryoshka) | 62.4 | Self-hosted (~$0.01) | Permissive license, fully reproducible training data, edge deployment |
| Gemini Embedding-004 | 3072 (truncatable) | 65.8 | $0.025 | Cheapest frontier-tier; Vertex AI native; long-document optimized |
| Jina Embeddings v4 | 1024 | 63.2 | $0.05 | 8K token chunks, multimodal variant for image+text in one space |
| Mistral Embed 2 | 1024 | 64.0 | $0.10 | European data residency, strong on French/German/Spanish |
MTEB scores reflect the v2 leaderboard average across retrieval, classification, clustering, and reranking tasks (May 2026). Higher is better. Cost is for the embedding API; self-hosted estimates assume amortized H100 inference at high utilization.
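The "truncatable" and "Matryoshka" entries in the table refer to keeping only the leading components of an embedding to save storage. A minimal sketch of that operation is shown below, assuming the model was trained so that leading dimensions carry most of the signal; truncating embeddings from a model without that training will likely hurt quality. The toy 4-dimensional vector stands in for a real embedding.

```python
import math


def truncate_and_normalize(embedding, target_dim):
    """Keep the first target_dim components, then rescale to unit length."""
    if not 0 < target_dim <= len(embedding):
        raise ValueError("target_dim must be between 1 and len(embedding)")
    head = list(embedding[:target_dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def float32_bytes(dim):
    """Storage for one float32 vector of the given dimension."""
    return dim * 4


full = [0.6, 0.0, 0.8, 0.0]               # toy vector, not a real embedding
short = truncate_and_normalize(full, 2)   # unit-length 2-dim prefix
savings = float32_bytes(3072) / float32_bytes(256)

print(short, savings)
```

Renormalizing matters because cosine and dot-product scores assume comparable vector lengths. The storage ratio shows why truncation is attractive for cost tuning, but recall at the shorter dimension should be measured on your own gold set before committing to it.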
Real example: a fintech RAG migration from pgvector to Pinecone to Qdrant
A mid-market fintech we worked with in 2025-2026 ran a customer-facing financial-document RAG system over a corpus that grew from 800K to 50M chunks in 14 months (10-K filings, earnings transcripts, analyst notes, news). They started on pgvector running on Neon Postgres because their core app was already Postgres and their team had zero ops appetite. At 800K vectors with 1024-dim Voyage embeddings, p99 sat at 12ms and the whole retrieval stack was a single SQL query joining metadata filters with the HNSW index — beautiful. They held this architecture until ~5M vectors. Past that, ingest pressure during nightly re-embedding started competing with read traffic, p99 climbed to 80ms during business hours, and the team began babysitting maintenance_work_mem and HNSW build parameters weekly.
At 8M vectors they migrated to Pinecone serverless. The migration itself took about three weeks: dual-writing to both stores for two weeks while validating recall parity on a 5,000-query gold set, then a clean cutover. Pinecone gave them sub-10ms p99 again, native namespaces for the 200+ enterprise tenants they were onboarding, and removed the operational tax — but the cost curve started biting hard around 25M vectors as read units scaled with their fast-growing query volume (their bill went from ~$2.4K/month to ~$11K/month over six months, and the projection at 50M vectors was ~$32K/month).
At 35M vectors they migrated again to self-hosted Qdrant on three c7i.4xlarge nodes (running about $1,800/month all-in, including S3 backups). The trigger was not technical dissatisfaction with Pinecone — recall and latency were excellent — but pure unit economics: Qdrant's payload filtering, native sharding, and binary quantization let them hit the same SLA at roughly 18% of the managed cost. The lesson is one we see constantly: the right vector DB at 1M is rarely the right one at 50M, and the migration cost is real but bounded. Picking a vendor with clean export tooling matters more than picking the "best" vendor on day one. See our RAG architecture guide for the dual-write migration pattern they used.
When you don't need a vector database at all
- Document count under ~10K — a flat in-memory FAISS index, an Annoy file, or even brute-force cosine in NumPy will out-perform a managed vector DB on both latency and total cost. The operational overhead simply is not justified.
- Exact-lookup workloads — if users search by SKU, order ID, license plate, ticker symbol, or any other identifier where they expect an exact hit, a B-tree or hash index in Postgres beats vectors on recall, latency, and cost. Vectors are for fuzzy meaning, not exact tokens.
- High-precision regex or rule-matching — compliance keyword scanning ("does this email contain any of these 12,000 banned phrases?"), code linting, and policy enforcement want deterministic matching. Vectors will give you 88% recall when you need 100%.
- Single-language keyword search where users type query terms verbatim — for many internal-search tools, BM25 in Elasticsearch, OpenSearch, or Tantivy beats dense retrieval on user-perceived quality, especially for short queries with proper nouns.
- Corpus that fits in your context window — if you only have 80 PDFs and a 1-2M-token model context, just put them in the prompt with prompt caching enabled. You skip the entire embedding plus retrieval stack.
- Strict legal or regulated retrieval — when miss rates carry liability (medical records lookup, legal discovery), a deterministic SQL or BM25 index with audit trails is often the safer default, with vectors used only as a re-ranking signal.
Hybrid search: BM25 + dense vectors + reranker
Pure dense vector search is wrong about 8-15% of the time on real queries — it confidently returns semantically similar but lexically wrong results, especially for proper nouns, codes, IDs, and rare technical terms. The 2026 production answer is hybrid search: run BM25 (lexical) and dense vector (semantic) in parallel, fuse the scores with Reciprocal Rank Fusion (RRF) or a learned weighted sum, then send the top-50 to a cross-encoder reranker for the final top-10. This three-stage pipeline (lexical + dense + rerank) is now the default for every serious RAG system we deploy.
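The fusion step described above can be sketched with Reciprocal Rank Fusion in a few lines. This is an illustrative version, assuming each retriever returns an ordered list of document ids; the constant k=60 is a commonly used default treated here as an assumption, and the sample rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked id lists: each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    fused = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # lexical ranking (toy data)
dense_hits = ["doc-a", "doc-c", "sku-123"]   # semantic ranking (toy data)

fused = reciprocal_rank_fusion([bm25_hits, dense_hits], top_n=3)
print(fused)
```

RRF only looks at ranks, so it avoids having to calibrate BM25 scores against cosine similarities. Documents that both retrievers agree on rise to the top, and the fused list is what would then be passed to the reranker.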
The reranker is the part teams most often skip and most often regret. Cohere Rerank 3.5 and Voyage Rerank-2 are the two best closed-source options in 2026, with open alternatives like BGE Reranker v2-M3 and mxbai-rerank-large-v2 close behind. A reranker scores each (query, candidate) pair with a full cross-attention forward pass — much more expensive per pair than a dot product, but vastly more accurate. Reranking the top-50 candidates from hybrid retrieval typically lifts NDCG@10 by 8-20% over plain hybrid, and 15-30% over plain dense. The reranker costs roughly $1-$2 per 1M reranked pairs at managed pricing — material at scale, free at low volume.
Latency budgeting matters: a typical hybrid + rerank stack adds 50-200ms over plain dense retrieval (BM25 in 5-15ms, dense in 5-15ms, rerank top-50 in 30-150ms). For interactive chat (target: under 1 second to first token) this is fine. For autocomplete or sub-100ms paths it is not. The standard escape hatch is to skip the reranker for the bottom 80% of queries (low ambiguity, single-intent) and only invoke it when the top-K dense scores are clustered close together — a cheap classifier or score-gap heuristic. See our RAG concepts guide for the full hybrid architecture and code patterns.
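The score-gap heuristic mentioned above might look like the sketch below. The threshold value, the use of the spread between the best and worst top-K score, and the `rerank_fn` callable are all assumptions made for illustration; a production system would tune the cutoff on real traffic or replace it with a small classifier.

```python
def should_rerank(top_scores, spread_threshold=0.05):
    """Invoke the reranker only when the top-K scores are tightly clustered.

    top_scores: similarity scores for the top-K dense hits, best first.
    spread_threshold: hypothetical cutoff; tune it on your own queries.
    """
    if len(top_scores) < 2:
        return False  # nothing to reorder
    spread = top_scores[0] - top_scores[-1]
    return spread < spread_threshold


def maybe_rerank(candidates, rerank_fn, spread_threshold=0.05):
    """candidates: list of (doc_id, score), best first.

    rerank_fn is a hypothetical callable that takes a list of doc ids and
    returns them in a new order (for example, a call to a reranker service).
    """
    ids = [doc_id for doc_id, _ in candidates]
    scores = [score for _, score in candidates]
    if should_rerank(scores, spread_threshold):
        return rerank_fn(ids)
    return ids  # clear winner: skip the expensive rerank call


def reverse_order(ids):
    """Stand-in reranker used only to make the example observable."""
    return list(reversed(ids))


clear = maybe_rerank([("a", 0.90), ("b", 0.50)], reverse_order)
ambiguous = maybe_rerank([("a", 0.80), ("b", 0.79)], reverse_order)
print(clear, ambiguous)
```

When the dense scores are well separated the reranker is unlikely to change the order, so skipping it saves the largest item in the latency budget. When they are bunched together, the extra call is where the quality gain comes from.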
pgvector vs dedicated vendor: the 4 questions that decide it
If you can answer "no" to all four of these in good faith, stay on pgvector. If even one is "yes" today (not "maybe in 18 months"), start evaluating a dedicated vendor.
- Will you cross 10M vectors per index in the next 12 months? pgvector with HNSW is genuinely fine to ~10M; past that, RAM and rebuild times become operational drag.
- Do you need sustained 500+ QPS with under-50ms p99? pgvector can do bursts; dedicated vendors handle sustained high-QPS with replica routing and shard-aware caching out of the box.
- Do you have 100+ tenants needing isolated namespaces? pgvector forces a tenant_id filter; vendor namespaces (Pinecone, Qdrant collections, Weaviate tenants) scale operationally and isolate quality regressions.
- Do you need first-class hybrid search and reranker hooks? pgvector + pg_search is workable; dedicated vendors ship hybrid + rerank as a single API call. The engineering time saved is real.
The honest default in 2026: start on pgvector, plan the migration trigger explicitly, and pick your dedicated-vendor escape hatch on day one. Swfte Studio abstracts both behind one retrieval API.
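The four questions above can be written down as an explicit migration trigger, so the decision is revisited with numbers rather than by feel. The sketch below encodes the thresholds from the list; the function and parameter names are hypothetical.

```python
def reasons_to_evaluate_dedicated(
    vectors_per_index_in_12_months,
    sustained_qps,
    needs_sub_50ms_p99,
    isolated_tenants,
    needs_first_class_hybrid_and_rerank,
):
    """Return the questions answered 'yes'; an empty list means stay on pgvector."""
    reasons = []
    if vectors_per_index_in_12_months > 10_000_000:
        reasons.append("will cross 10M vectors per index within 12 months")
    if sustained_qps >= 500 and needs_sub_50ms_p99:
        reasons.append("sustained 500+ QPS with under-50ms p99")
    if isolated_tenants >= 100:
        reasons.append("100+ tenants needing isolated namespaces")
    if needs_first_class_hybrid_and_rerank:
        reasons.append("first-class hybrid search and reranker hooks")
    return reasons


# Hypothetical workloads for illustration.
internal_tool = reasons_to_evaluate_dedicated(
    vectors_per_index_in_12_months=2_000_000,
    sustained_qps=20,
    needs_sub_50ms_p99=False,
    isolated_tenants=1,
    needs_first_class_hybrid_and_rerank=False,
)
saas_product = reasons_to_evaluate_dedicated(
    vectors_per_index_in_12_months=40_000_000,
    sustained_qps=200,
    needs_sub_50ms_p99=True,
    isolated_tenants=250,
    needs_first_class_hybrid_and_rerank=False,
)
print(internal_tool, saas_product)
```

Feeding the function today's measured values, not 18-month hopes, matches the "yes today" rule in the text: any non-empty result is the cue to start evaluating a dedicated vendor.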
Trusted by Teams Worldwide
"This platform transformed how we work. We've automated 80% of our manual processes and our team is more productive than ever."
Sarah Chen
VP of Operations at TechCorp
"The best investment we've made this year. ROI was positive within 2 months, and the time savings have been incredible."
Michael Rodriguez
CEO at StartupXYZ
"Finally, a solution that just works. Setup was painless, features are powerful yet intuitive, and support has been outstanding."
Emily Thompson
Director of Engineering at InnovateLabs
Frequently Asked Questions
What is a vector database?
A vector database is a storage system optimized for high-dimensional embedding vectors (typically 384-3072 dimensions) and similarity search over them. Instead of querying by exact match like a traditional database, you ask "give me the K most similar items to this query vector" and the engine uses approximate nearest-neighbor (ANN) algorithms such as HNSW or IVF-PQ to return results in single-digit milliseconds — even across billions of vectors. Vector DBs are the storage backbone for retrieval-augmented generation, semantic search, recommendations, anomaly detection, and any workflow that compares meaning rather than literal strings.
How is a vector database different from a traditional database?
Traditional relational and document databases index scalar values (numbers, strings, JSON) with B-trees and hash maps for exact lookups. Vector databases index high-dimensional float arrays with ANN graphs (HNSW, NSG) or quantized inverted lists (IVF-PQ) for similarity search. The query model is fundamentally different: SQL asks "WHERE category = electronics", vector search asks "ORDER BY similarity(query_embedding, item_embedding) LIMIT 50". Modern systems blur the line — Postgres adds pgvector, MongoDB adds Atlas Vector Search, Elasticsearch adds dense_vector — but a purpose-built vector DB still wins on recall, latency, and operational simplicity once you cross ~10M vectors.
Which vector database is best in 2026?
There is no single best option — it depends on scale, infra preference, and how much hybrid/filtering you need. For zero-ops managed RAG under 50M vectors, Pinecone and Weaviate Cloud are the easiest. For self-hosted at scale, Qdrant and Milvus are the strongest open-source contenders. For teams already on Postgres with under ~5M vectors and hybrid SQL needs, pgvector + pgvectorscale is excellent and removes a service. For local prototyping and edge deployment, LanceDB and Chroma are friction-free. Most production RAG stacks we see in 2026 pair a managed vector DB with a reranker (Cohere Rerank 3.5 or Voyage Rerank-2) for the final top-10.
Pinecone vs Weaviate vs pgvector: what is the difference?
Pinecone is fully managed, serverless, and the fastest to ship — no index tuning, predictable pricing, but limited control and no on-prem option. Weaviate is open-source with a managed cloud, has first-class hybrid search, multi-tenancy, and modules for embedding generation; it is the most flexible. pgvector turns Postgres into a vector store — perfect when you already have Postgres, need transactional consistency with metadata, and have under 5-10M vectors per index. Rule of thumb: choose pgvector if you want one less service, Weaviate if you need hybrid + multi-tenancy at scale, Pinecone if you want a managed service to disappear into the background.
Do I need a vector database for my AI app?
Only if your app retrieves over more documents, products, or chunks than fit in the model context window — practically, beyond ~50-100 documents or anything updated frequently. If you have 30 PDFs that fit in a 1M-token context, just stuff them in the prompt; you do not need a vector DB. If you have 30,000 support articles, customer tickets, code files, or product images that change daily, you need embeddings + a vector index. Many teams over-engineer with a vector DB when a simple in-memory FAISS index or even keyword search would suffice for their first 10K rows.
How much does a vector database cost?
Costs in 2026 fall into three tiers. Tier 1 (free / under $50/mo): pgvector on a small Postgres instance, Chroma or LanceDB self-hosted, Pinecone serverless free tier (100K vectors). Tier 2 ($100-$1,000/mo): managed Pinecone, Weaviate Cloud, Qdrant Cloud at 1-10M vectors with single-digit QPS. Tier 3 ($2,000-$50,000+/mo): production RAG at 50M-1B+ vectors with high QPS, multi-region replication, and SLA. The big cost driver is not storage but query throughput and recall — paying for higher-recall HNSW configs roughly doubles RAM per vector.
Can I use Postgres with pgvector instead of a dedicated vector database?
Yes — pgvector has matured significantly and pgvectorscale (TimescaleDB) and pg_search push it further with HNSW indexes, IVFFlat, binary quantization, and disk-based ANN. For workloads under ~5-10M vectors with strong filter/SQL needs and existing Postgres infrastructure, pgvector is often the right call: one less service, transactional joins between metadata and vectors, and the Postgres ecosystem (replication, backups, observability). Beyond 10M vectors, or if you need sub-5ms latency at high QPS, dedicated vector DBs (Qdrant, Pinecone, Milvus) still pull ahead on raw performance and operational ergonomics.
How fast is vector search at scale?
On a well-tuned HNSW index with 768-dim embeddings: at 1M vectors expect 2-5ms p50 and under 10ms p99 on a single small node. At 10M vectors, plan for 5-15ms p50, 20-40ms p99, and roughly 16-32GB RAM for the in-memory graph if the vectors are quantized or stored at half precision; raw float32 vectors alone come to about 31GB (10M × 768 × 4 bytes) before any graph overhead. At 100M vectors you typically shard across 4-8 nodes with disk-backed ANN (DiskANN, IVF-PQ) to get 15-50ms p99 at 100-1,000 QPS. These numbers degrade sharply if you turn up recall (efSearch), apply heavy metadata filters, or run on cold storage, so always benchmark with your real query distribution before sizing.