Search

Query RAG indexes with semantic search, hybrid ranking, and advanced filtering for optimal document retrieval.

Search Methods

M3 Forge supports three retrieval strategies, each optimized for different use cases.

Semantic Search

Vector similarity search using cosine similarity between query and document embeddings:

{
  "index_id": "customer-docs",
  "query": "How do I configure SSL certificates?",
  "search_type": "semantic",
  "top_k": 5
}

How it works:

  1. The query is embedded with the same model used for the documents
  2. The vector database computes cosine similarity between the query vector and all chunk vectors
  3. The top-K most similar chunks are returned, ranked by score
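
The ranking step can be sketched in TypeScript (an illustrative sketch, not M3 Forge's actual implementation):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query embedding and keep the top K.
function topK(
  query: number[],
  chunks: { id: string; embedding: number[] }[],
  k: number,
): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Real vector databases replace the exhaustive scan with an index, but the score and ordering are the same.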

Best for:

  • Natural language questions
  • Conceptual queries (“how to improve performance”)
  • Multilingual search (with multilingual embeddings)
  • Queries with synonyms or paraphrasing

Limitations:

  • May miss exact keyword matches
  • Struggles with rare terms, acronyms, product codes
  • Requires query and documents in similar semantic space

Keyword Search (BM25)

Traditional full-text search ranked with BM25 (a refinement of TF-IDF term weighting):

{
  "index_id": "product-catalog",
  "query": "SKU-12345",
  "search_type": "keyword",
  "top_k": 10
}

Best for:

  • Exact matches (SKUs, error codes, IDs)
  • Rare or technical terms
  • Short queries (1-3 words)
  • Deterministic retrieval (same query always returns same results)

Limitations:

  • No semantic understanding
  • Sensitive to query phrasing
  • Weak on long-form questions
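
For intuition, a single-term BM25 contribution can be sketched as follows (illustrative only; the k1 = 1.2 and b = 0.75 defaults are the common textbook values, not necessarily what M3 Forge uses):

```typescript
// BM25 score of a single query term against one document.
function bm25Term(
  termFreq: number,    // occurrences of the term in this document
  docLen: number,      // tokens in this document
  avgDocLen: number,   // average tokens per document in the index
  docCount: number,    // total documents in the index
  docsWithTerm: number,
  k1 = 1.2,
  b = 0.75,
): number {
  // Rare terms get a large inverse-document-frequency weight.
  const idf = Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
  // Term frequency saturates (k1) and is normalized by document length (b).
  const norm =
    (termFreq * (k1 + 1)) /
    (termFreq + k1 * (1 - b + b * (docLen / avgDocLen)));
  return idf * norm;
}
```

This is why BM25 excels at rare terms like SKUs: a term appearing in one document out of thousands carries a far larger IDF weight than a common word.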

Hybrid Search

Combines semantic and keyword search with weighted score fusion:

{
  "index_id": "support-tickets",
  "query": "database connection error SQLSTATE[HY000]",
  "search_type": "hybrid",
  "top_k": 5,
  "hybrid_alpha": 0.7
}

hybrid_alpha controls the balance:

  • 0.0 - Pure keyword search
  • 0.5 - Equal weighting
  • 0.7 (default) - Favor semantic with a keyword boost
  • 1.0 - Pure semantic search
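
The fusion itself reduces to a weighted sum. A minimal sketch, assuming both score sets are already normalized to [0, 1]:

```typescript
// Fuse normalized semantic and keyword scores with a single alpha weight.
// alpha = 1.0 -> pure semantic; alpha = 0.0 -> pure keyword.
function fuseScores(
  semantic: Map<string, number>,
  keyword: Map<string, number>,
  alpha: number,
): Map<string, number> {
  const fused = new Map<string, number>();
  const ids = new Set([...semantic.keys(), ...keyword.keys()]);
  for (const id of ids) {
    const s = semantic.get(id) ?? 0; // absent from one index counts as 0
    const k = keyword.get(id) ?? 0;
    fused.set(id, alpha * s + (1 - alpha) * k);
  }
  return fused;
}
```

A chunk found by only one index still gets a fused score, just without the other component's contribution.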

Best for:

  • General-purpose retrieval (most use cases)
  • Mixed queries (natural language + specific terms)
  • Production systems requiring robustness

Hybrid search provides the best balance of precision and recall. Start with hybrid_alpha: 0.7 and tune based on evaluation metrics.

Query Parameters

Top-K Results

Control result count with top_k:

{
  "top_k": 5  // Return 5 most relevant chunks
}

Guidelines:

  • RAG Context: 3-5 chunks (fits most LLM context windows)
  • Search UI: 10-20 chunks (user browses results)
  • Reranking: 50-100 chunks (reranker selects best subset)

More results increase recall but reduce precision and add latency.

Similarity Threshold

Filter results by minimum similarity score:

{
  "threshold": 0.7  // Only return chunks with score >= 0.7
}

Scores range from 0.0 (no similarity) to 1.0 (identical):

  • 0.9+ - Near-duplicate content
  • 0.7-0.9 - Highly relevant
  • 0.5-0.7 - Somewhat relevant
  • < 0.5 - Weak relevance (likely noise)

Thresholds prevent low-quality results from polluting LLM context.

Metadata Filtering

Restrict search to documents matching metadata criteria:

{
  "metadata_filter": {
    "product": "enterprise",
    "version": ["2.4.0", "2.5.0"],
    "category": "installation"
  }
}

Operators:

  • Equality: "key": "value" - Exact match
  • Array: "key": ["val1", "val2"] - Match any value
  • Range: "date": {"gte": "2024-01-01", "lt": "2024-12-31"} - Numeric/date ranges
  • Existence: "key": {"exists": true} - Field is present

Filters apply before similarity search, reducing search space and improving latency.
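
A sketch of how such a filter might be evaluated against one chunk's metadata (illustrative semantics only; the `matchesFilter` helper is not part of the API):

```typescript
type FilterValue =
  | string
  | string[]
  | { gte?: string; lt?: string; exists?: boolean };

// Check one chunk's metadata against a filter object:
// equality, any-of arrays, gte/lt ranges (string comparison,
// so ISO-8601 dates order correctly), and exists.
function matchesFilter(
  metadata: Record<string, string | undefined>,
  filter: Record<string, FilterValue>,
): boolean {
  for (const [key, cond] of Object.entries(filter)) {
    const value = metadata[key];
    if (typeof cond === "string") {
      if (value !== cond) return false;
    } else if (Array.isArray(cond)) {
      if (value === undefined || !cond.includes(value)) return false;
    } else {
      if (cond.exists !== undefined && (value !== undefined) !== cond.exists) return false;
      if (cond.gte !== undefined && (value === undefined || value < cond.gte)) return false;
      if (cond.lt !== undefined && (value === undefined || value >= cond.lt)) return false;
    }
  }
  return true;
}
```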

Advanced Retrieval

Reranking

Secondary model re-scores top results for improved relevance:

{
  "rerank": true,
  "rerank_model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
  "rerank_top_k": 3
}

Process:

  1. Initial retrieval returns top_k: 50 candidates
  2. Cross-encoder model scores query-document pairs
  3. Top rerank_top_k: 3 highest-scoring chunks returned

Reranking adds 50-200ms latency but can improve precision by 10-30%.
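
The two-stage process above can be sketched as follows (the `crossEncoderScore` callback stands in for a real cross-encoder model call):

```typescript
type Chunk = { id: string; text: string; score: number };

// Two-stage retrieval: a cheap first pass returns many candidates,
// then an expensive pairwise scorer re-ranks that shortlist.
function retrieveAndRerank(
  candidates: Chunk[], // e.g. top 50 from the vector index
  crossEncoderScore: (query: string, text: string) => number,
  query: string,
  rerankTopK: number,
): Chunk[] {
  return candidates
    .map((c) => ({ ...c, score: crossEncoderScore(query, c.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, rerankTopK);
}
```

The cross-encoder sees the query and chunk text together, so it can demote false positives that merely share surface keywords.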

Example: initial retrieval for the query “SSL certificate installation”

Results:

  1. “Configuring SSL/TLS in production” (score: 0.82)
  2. “Certificate renewal procedures” (score: 0.79)
  3. “Installing packages via apt-get” (score: 0.75)
  4. “SSL certificate generation guide” (score: 0.74)

Result 3 is a false positive (it mentions “install” but in the wrong context); a reranker would typically demote it.

MMR (Maximal Marginal Relevance)

Diversify results to reduce redundancy:

{
  "mmr": true,
  "mmr_lambda": 0.5,
  "top_k": 10
}

MMR balances relevance and diversity:

  • mmr_lambda: 1.0 - Pure relevance (may return duplicates)
  • mmr_lambda: 0.5 - Balance relevance and diversity
  • mmr_lambda: 0.0 - Pure diversity (may return less relevant results)
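
The greedy selection loop behind MMR can be sketched as follows (illustrative; `mmrSelect` is not an API function):

```typescript
// Greedy MMR: at each step pick the candidate maximizing
// lambda * relevance - (1 - lambda) * (max similarity to already-picked results).
function mmrSelect(
  candidates: { id: string; relevance: number; embedding: number[] }[],
  similarity: (a: number[], b: number[]) => number,
  lambda: number,
  k: number,
): string[] {
  const selected: typeof candidates = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Penalize candidates too similar to results already chosen.
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => similarity(pool[i].embedding, s.embedding)));
      const score = lambda * pool[i].relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected.map((s) => s.id);
}
```

With lambda = 0.5, a near-duplicate of an already-selected chunk loses to a less relevant but novel one.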

Use cases:

  • Search UIs (avoid repetitive results)
  • RAG with long context (maximize information density)
  • Exploratory search (discover related topics)

Contextual Expansion

Include surrounding chunks for better context:

{
  "expand_context": true,
  "context_window": 1
}

For each matched chunk, retrieve:

  • context_window: 1 - Previous and next chunk
  • context_window: 2 - Two chunks before and after

Expanded context improves LLM understanding but increases token usage.
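
A sketch of the expansion step, assuming chunk IDs follow a `<doc>-chunk<N>` convention where N is the chunk's position in the document (an assumption for illustration):

```typescript
// Expand each hit to include neighboring chunks from the same document.
function expandContext(hitIds: string[], window: number): string[] {
  const expanded = new Set<string>();
  for (const id of hitIds) {
    const match = id.match(/^(.*-chunk)(\d+)$/);
    if (!match) {
      expanded.add(id); // unrecognized ID: keep the hit as-is
      continue;
    }
    const [, prefix, num] = match;
    const pos = parseInt(num, 10);
    // Add the chunk itself plus `window` chunks on either side.
    for (let i = Math.max(0, pos - window); i <= pos + window; i++) {
      expanded.add(`${prefix}${i}`);
    }
  }
  return [...expanded];
}
```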

Search Response

Successful query returns:

{
  "results": [
    {
      "chunk_id": "doc123-chunk5",
      "text": "To configure SSL certificates, navigate to...",
      "score": 0.87,
      "metadata": {
        "document_id": "doc123",
        "filename": "ssl-guide.pdf",
        "page": 12,
        "product": "enterprise",
        "version": "2.5.0"
      },
      "highlights": ["SSL", "certificates", "configure"]
    }
  ],
  "total": 127,
  "latency_ms": 42
}

Fields:

  • chunk_id - Unique identifier for the text chunk
  • text - Chunk content (truncated if > 2000 chars)
  • score - Similarity score (0.0-1.0)
  • metadata - Custom fields attached during indexing
  • highlights - Query terms found in chunk (keyword search only)
  • total - Total matching chunks before top-k filtering
  • latency_ms - Query execution time

Relevance Tuning

Evaluation Metrics

Measure search quality with:

| Metric | Definition | Target |
| --- | --- | --- |
| Precision@K | Relevant results in top-K / K | > 0.8 |
| Recall@K | Relevant results in top-K / total relevant | > 0.6 |
| MRR (Mean Reciprocal Rank) | 1 / rank of first relevant result | > 0.7 |
| NDCG (Normalized Discounted Cumulative Gain) | Ranking quality weighted by position | > 0.75 |

Use the Evaluations dashboard to track metrics over time and compare configurations.
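
These metrics are straightforward to compute from a ranked result list and a labeled set of relevant chunk IDs; a minimal sketch:

```typescript
// Precision@K, Recall@K and reciprocal rank for one query.
function evaluate(
  ranked: string[],       // result IDs in ranked order
  relevant: Set<string>,  // ground-truth relevant IDs
  k: number,
): { precisionAtK: number; recallAtK: number; reciprocalRank: number } {
  const topK = ranked.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  const firstRelevant = ranked.findIndex((id) => relevant.has(id));
  return {
    precisionAtK: hits / k,
    recallAtK: relevant.size === 0 ? 0 : hits / relevant.size,
    reciprocalRank: firstRelevant === -1 ? 0 : 1 / (firstRelevant + 1),
  };
}
```

MRR is this reciprocal rank averaged over a query set; NDCG additionally weights graded relevance by position.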

Tuning Strategies

Poor Precision (too many irrelevant results):

  • Increase threshold to 0.75+
  • Switch to hybrid_alpha: 0.5 (more keyword weight)
  • Enable reranking
  • Reduce top_k to focus on highest-scoring chunks

Poor Recall (missing relevant results):

  • Decrease threshold to 0.5-0.6
  • Increase top_k to 20-50
  • Switch to search_type: semantic (pure vector search)
  • Add query expansion (synonyms, related terms)

Noisy Results (lots of near-duplicates):

  • Enable MMR with mmr_lambda: 0.5
  • Increase chunk size during indexing
  • Add metadata filters to narrow search scope

A/B Testing

Compare search configurations:

  1. Create two index versions with different chunking/embedding
  2. Route 50% of queries to each index
  3. Measure precision, recall, latency
  4. Promote winning configuration
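
The routing in step 2 should be deterministic per user, so each user sees a consistent variant for the duration of the test; a minimal sketch (the hash and index names are illustrative):

```typescript
// Deterministic 50/50 routing: hash the user/session ID so the same
// user always hits the same index variant during the A/B test.
function chooseVariant(userId: string, variants: [string, string]): string {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return variants[hash % 2];
}
```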

The Evaluations section provides built-in A/B test tracking.

Integration Patterns

Workflow Node

Use RAG Retrieval Node in workflows:

{
  "type": "rag-retrieval",
  "config": {
    "index_id": "support-docs",
    "query": "$.data.user_question",
    "search_type": "hybrid",
    "top_k": 5,
    "threshold": 0.7,
    "metadata_filter": {
      "category": "$.data.product_category"
    }
  }
}

Output is available at $.nodes.<node_id>.output.chunks for downstream LLM nodes.

API Client

Query via REST API:

curl -X POST https://your-instance/api/rag/search \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "index_id": "customer-docs",
    "query": "database backup procedures",
    "search_type": "hybrid",
    "top_k": 5
  }'

tRPC Client (TypeScript)

Type-safe queries from M3 Forge frontend:

const { data } = trpc.rag.search.useQuery({
  indexId: 'product-docs',
  query: searchQuery,
  searchType: 'hybrid',
  topK: 10,
  metadataFilter: {
    product: selectedProduct,
  },
});

Performance Optimization

Query Latency

Typical latencies:

  • Semantic search: 20-50ms (vector index lookup)
  • Keyword search: 10-30ms (full-text index)
  • Hybrid search: 40-80ms (both indexes + fusion)
  • With reranking: +50-200ms (cross-encoder inference)

Optimization techniques:

  • Use smaller embedding models (768-dim vs 3072-dim)
  • Cache frequent queries (Redis/Memcached)
  • Partition large indexes by metadata
  • Use approximate nearest neighbors (ANN) for > 1M chunks

Cost Control

Reduce embedding API costs:

  • Cache query embeddings (same query = same embedding)
  • Use smaller top_k (fewer chunks to embed/rank)
  • Batch queries when possible
  • Choose cost-efficient models (Jina v4 vs OpenAI)
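
Query-embedding caching can be as simple as a map keyed on normalized query text; a minimal in-process sketch (a shared Redis cache would replace the Map in production, and the `embed` callback stands in for the embedding API call):

```typescript
// In-memory cache for query embeddings: identical query text reuses
// the stored vector instead of calling the embedding API again.
class EmbeddingCache {
  private store = new Map<string, number[]>();
  misses = 0; // number of actual embedding API calls made

  constructor(private embed: (query: string) => number[]) {}

  get(query: string): number[] {
    const key = query.trim().toLowerCase(); // normalize trivial variants
    let vec = this.store.get(key);
    if (!vec) {
      this.misses++;
      vec = this.embed(key);
      this.store.set(key, vec);
    }
    return vec;
  }
}
```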

Best Practices

Query Formulation

Good queries:

  • “How do I configure SSL certificates?” (natural language, specific)
  • “database connection error SQLSTATE[HY000]” (mixed natural + technical)
  • “backup procedures for PostgreSQL” (clear intent, key terms)

Poor queries:

  • “help” (too vague, no context)
  • “How do I do the thing with the stuff?” (ambiguous pronouns)
  • Overly long queries (> 100 words) - truncate or summarize first

Result Presentation

When displaying search results to users:

  • Show snippet with highlighted query terms
  • Include source document name and page number
  • Link to full document for context
  • Display relevance score for transparency

Monitoring

Track in production:

  • Query latency (p50, p95, p99)
  • Result quality (CTR, user feedback)
  • Cache hit rate (for query caching)
  • Error rate (failed queries, timeouts)

Set up alerts for latency spikes or quality degradation.

Troubleshooting

No Results Returned

Possible causes:

  • threshold too high (relax to 0.5)
  • metadata_filter too restrictive (check filter logic)
  • Query embedding mismatch (ensure same model as index)
  • Index is empty (verify documents are uploaded)

Irrelevant Results

Solutions:

  • Increase threshold to 0.75+
  • Enable reranking
  • Switch to hybrid search
  • Add metadata filters
  • Review chunking strategy (chunks may be too large/small)

High Latency

Optimizations:

  • Reduce top_k to minimum needed
  • Disable reranking for non-critical paths
  • Partition index by metadata
  • Scale vector database horizontally

Multilingual Search

For multilingual retrieval:

  1. Use multilingual embedding model (Jina v4, Cohere multilingual-v3)
  2. Index documents in all target languages
  3. Queries automatically work across languages
  4. Consider language-specific indexes for better precision
