
SearchEngine-rs

A high-performance hybrid web search engine implemented in Rust, combining efficient inverted indexing with state-of-the-art semantic search capabilities. It supports indexing millions of documents with limited memory and provides fast multilingual query processing through both command-line and web interfaces.

Overview

This search engine was built from scratch to demonstrate core information retrieval principles and modern hybrid search techniques. It features:

  • Hybrid Search: Combines BM25 lexical search with dense vector semantic search.
  • Cross-Encoder Reranking: Improves result precision by reranking top candidates using a cross-encoder model.
  • Memory-efficient indexing of large datasets using block-based processing and k-way merging.
  • Fast query processing with caching optimization and variable-byte compression.
  • Multilingual support through Unicode-aware tokenization.
  • Dual interfaces for both command-line and web-based interaction.
  • Centralized Configuration: All settings managed via a single TOML configuration file.
  • Python Interop: Seamless integration with Python for deep learning models (Sentence Transformers).

The system has been tested on the MS MARCO datasets:

  • Passage-ranking: 8.8M documents, indexed in ~100 seconds using <600MB RAM
  • Document-ranking: 3.2M documents, indexed in ~600 seconds using <800MB RAM

Architecture

The search engine consists of three main components:

1. Indexer (src/bin/indexer.rs)

Builds an inverted index from TSV datasets by:

  • Reading and tokenizing documents with a language-agnostic tokenizer
  • Building in-memory inverted index mappings (term → postings list)
  • Periodically flushing partial indexes to disk when a memory threshold is reached (see the sketch after this list)
  • Supporting multiple output formats: text, binary, and variable-byte encoding
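
A minimal sketch of this flush loop, with hypothetical types and a simplified threshold (the real implementation is in src/bin/indexer.rs and is driven by the block_size and buffer_size settings from the config):

use std::collections::BTreeMap;

// In-memory partial index: term -> postings list of (doc_id, term_frequency).
// BTreeMap keeps terms sorted, so flushed blocks are already in merge order.
type PartialIndex = BTreeMap<String, Vec<(u32, u32)>>;

fn build_partials(docs: &[(u32, String)], flush_threshold: usize) -> Vec<PartialIndex> {
    let mut partials = Vec::new();
    let mut index = PartialIndex::new();
    for (doc_id, text) in docs {
        for term in text.split_whitespace() { // stand-in for the real tokenizer
            let postings = index.entry(term.to_ascii_lowercase()).or_default();
            match postings.last_mut() {
                Some(last) if last.0 == *doc_id => last.1 += 1, // same doc: bump tf
                _ => postings.push((*doc_id, 1)),               // new doc: new posting
            }
        }
        if index.len() >= flush_threshold {
            // In the real indexer this block is written to disk, not kept in memory.
            partials.push(std::mem::take(&mut index));
        }
    }
    if !index.is_empty() {
        partials.push(index);
    }
    partials
}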

2. Merger (src/bin/merger.rs)

Merges partial indexes into a final consolidated index using:

  • K-way merge algorithm with min-heap for efficient processing (sketched below)
  • Delta encoding (d-gap) to compress document IDs
  • Lexicon construction for fast term lookup
  • Pre-computing BM25 statistics
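
A sketch of the heap-driven merge over term-sorted streams (names and stream types here are illustrative; d-gap encoding and lexicon construction are omitted, and the real logic lives in src/lib/merger.rs):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k term-sorted streams of (term, postings) into one sorted stream.
fn kway_merge(
    mut streams: Vec<std::vec::IntoIter<(String, Vec<u32>)>>,
) -> Vec<(String, Vec<u32>)> {
    let mut heap = BinaryHeap::new();
    for (i, s) in streams.iter_mut().enumerate() {
        if let Some(entry) = s.next() {
            heap.push(Reverse((entry, i))); // min-heap via Reverse, keyed on term
        }
    }
    let mut merged: Vec<(String, Vec<u32>)> = Vec::new();
    while let Some(Reverse(((term, postings), i))) = heap.pop() {
        match merged.last_mut() {
            // Same term from several partials: concatenate postings (assuming doc
            // IDs are assigned in block order, the concatenation stays sorted).
            Some((t, p)) if *t == term => p.extend(postings),
            _ => merged.push((term, postings)),
        }
        if let Some(entry) = streams[i].next() {
            heap.push(Reverse((entry, i))); // refill from the stream we consumed
        }
    }
    merged
}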

3. Query Processor (src/bin/query_processor.rs)

Processes user queries and returns ranked results by:

  • Lexical Search: Standard BM25 ranking using the inverted index.
  • Semantic Search: Dense vector retrieval using pre-computed embeddings (via Python/PyO3; see the interop sketch after this list).
  • Hybrid Reranking: Re-scoring top results using a Cross-Encoder model for higher accuracy.
  • Caching: LRU caching for inverted lists and document content.
  • Snippet Generation: Direct file seeking for O(1) snippet extraction.
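
The Python bridge uses PyO3. A hedged sketch of what such a call can look like; the module path, function name, and return shape below are assumptions, and the exact import API varies across pyo3 versions:

use pyo3::prelude::*;

// Ask the Python side for semantic scores. Illustrative only: `search` and
// its (doc_id, score) return shape are assumptions, not the project's API.
fn semantic_scores(query: &str, top_k: usize) -> PyResult<Vec<(i64, f32)>> {
    Python::with_gil(|py| {
        let module = py.import("python.semantic_search.search")?;
        let hits = module.getattr("search")?.call1((query, top_k))?;
        hits.extract()
    })
}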

Installation

Prerequisites

  • Rust 1.85.0 or later (required for the 2024 edition)
  • Cargo 1.85.0 or later (ships with the matching Rust toolchain)
  • Python 3.10+ (for semantic search features)

Python Environment Setup

To enable semantic search, you need to set up a Python environment with the required dependencies:

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

Build

# Clone the repository
git clone https://github.com/JDScript/SearchEngine-rs.git
cd SearchEngine-rs

# Build in release mode (recommended for performance)
cargo build --release

Configuration

The system is configured via config/search_engine.toml. This file controls all aspects of indexing, merging, and querying.

Example configuration:

[indexer]
dataset = "./data/msmarco-docs.tsv"
output = "./index"
format = "vbyte"
block_size = 50000
buffer_size = 8_388_608

[merger]
input = "./index/temporary"
output = "./index"
input_format = "vbyte"
output_format = "vbyte"
store_diff = true
chunk_size = 8192
output_entry_size = 131072
docs = "./index/docs.txt"
bm25_output = "./index/bm25.bin"
bm25_k = 0.9
bm25_b = 0.4

[query]
lexicon = "./index/storage_vbyte.txt"
docs = "./index/docs.txt"
ids = "./index/index.vbyte"
freqs = "./index/freqs.vbyte"
format = "vbyte"
collection = "./data/msmarco-docs.tsv"
bm25 = "./index/bm25.bin"

[query.semantic]
model = "msmarco-MiniLM-L6-cos-v5"
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
embeddings_path = "corpus_embeddings_msmarco-MiniLM-L6-cos-v5.pt"
doc_ids_path = "embedding_doc_ids.pt"
device = "cuda:0" # Use "cpu" if no GPU is available
port = 8080
snippet_len = 200
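
Loading the file in Rust is plain serde deserialization; a minimal sketch, assuming the toml crate is available (only the [indexer] table shown; src/lib/config.rs holds the real structs):

use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    indexer: IndexerConfig,
    // merger and query tables omitted for brevity
}

#[derive(Deserialize)]
struct IndexerConfig {
    dataset: String,
    output: String,
    format: String,
    block_size: usize,
    buffer_size: usize,
}

fn load_config(path: &str) -> Config {
    let raw = std::fs::read_to_string(path).expect("cannot read config file");
    toml::from_str(&raw).expect("invalid TOML")
}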

Usage

Step 1: Index Your Dataset

cargo run --bin indexer --release -- --config config/search_engine.toml

Step 2: Merge Partial Indexes

cargo run --bin merger --release -- --config config/search_engine.toml

Step 3: Generate Embeddings (Optional, for Semantic Search)

If you want to use semantic search, you need to generate embeddings for your dataset.

# Ensure your virtual environment is active
source .venv/bin/activate

# Run the embedding generation script
python -m python.semantic_search.create_embedding

Note: This script reads the docs.txt generated by the indexer to ensure alignment between the Rust index and Python embeddings.

Step 4: Query the Index

Command-Line Interface

cargo run --bin query_processor --release -- --config config/search_engine.toml

Once running, you can use the following commands:

  • <query>: Typing any plain text runs a standard BM25 search.
  • /mode sem: Switch to Semantic Search mode.
  • /mode dis: Switch to Disjunctive (OR) mode.
  • /mode con: Switch to Conjunctive (AND) mode.
  • /rerank on: Enable Cross-Encoder reranking.
  • /rerank off: Disable Cross-Encoder reranking.

Web Interface

If port is set to a non-zero value in config/search_engine.toml (e.g., 8080), the query processor will start a web server.

Open http://localhost:8080 in your browser to use the search interface.
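
The server is built on axum and tokio (see Dependencies). A minimal sketch of such an endpoint, assuming axum 0.7-style serving and a hypothetical search_index helper (the real handler lives in query_processor.rs):

use axum::{extract::Query, routing::get, Json, Router};
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchParams {
    q: String,
}

// Hypothetical stand-in for the real BM25/semantic search call.
fn search_index(query: &str) -> Vec<String> {
    vec![format!("results for: {query}")]
}

async fn search(Query(params): Query<SearchParams>) -> Json<Vec<String>> {
    Json(search_index(&params.q))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/search", get(search));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}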

Performance (Not Updated Yet)

Indexing (MS MARCO Passage-Ranking, 8.8M documents)

Format    Time       Total Size   Compression   Peak RAM
Text      110.12s    3.86GB       100%          ~600MB
Binary    89.38s     6.15GB       159%          ~600MB
VByte     102.68s    2.07GB       54%           ~600MB

Merging (MS MARCO Passage-Ranking)

Input → Output     Time     Final Size (Index + Freqs)   Peak RAM
Binary → Binary    13.98s   2.86GB                       ~700MB
VByte → VByte      28.94s   830.5MB                      ~700MB
Text → Text        26.88s   1.83GB                       ~700MB

Query Processing (MS MARCO Passage-Ranking)

Query              Mode   Load Time*    Search Time*    Total Time*
Hello World        AND    239ms (0ms)   1.3ms (2.5ms)   243ms (3ms)
Hello World        OR     232ms (0ms)   65ms (110ms)    498ms (305ms)
Dog Cat Elephant   AND    99ms (0ms)    0.2ms (0.6ms)   100ms (1ms)
Dog Cat Elephant   OR     79ms (0ms)    57ms (30ms)     221ms (80ms)

*Times shown as: first run (cached run)

Key Observations:

  • Variable-byte encoding provides the best compression (~50% reduction) with acceptable performance
  • Conjunctive (AND) queries are significantly faster than disjunctive (OR) queries
  • Caching dramatically improves repeated query performance
  • Common terms (e.g., "and", "the") significantly slow down disjunctive queries

Features

Core Functionality

  • Block-based Indexing: Processes datasets larger than available RAM by flushing partial indexes periodically
  • K-way Merging: Efficiently merges multiple partial indexes with minimal memory usage
  • BM25 Ranking: Industry-standard probabilistic relevance ranking with tunable parameters (k=0.9, b=0.4)
  • Variable-byte Encoding: Efficient compression for document IDs and term frequencies
  • Delta Encoding: Further compresses sorted document IDs using gap encoding (see the example after this list)
  • LRU Caching: Intelligent caching of frequently accessed inverted lists
  • Direct Snippet Retrieval: O(1) snippet generation using stored document offsets
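
A quick illustration of gap (d-gap) encoding on a sorted postings list (helper names here are illustrative, not the project's actual API):

/// Replace sorted doc IDs with gaps: [5, 8, 20, 21] -> [5, 3, 12, 1].
/// Gaps are small, so they compress well under variable-byte encoding.
fn to_dgaps(doc_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    doc_ids
        .iter()
        .map(|&id| {
            let gap = id - prev;
            prev = id;
            gap
        })
        .collect()
}

/// Invert the transform at query time by running a prefix sum.
fn from_dgaps(gaps: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    gaps.iter().map(|&g| { acc += g; acc }).collect()
}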

Query Modes

  • Conjunctive (AND): Returns documents containing all query terms (intersection; a sketch follows this list)
  • Disjunctive (OR): Returns documents containing any query term (union)
  • Semantic: Returns documents based on vector similarity (cosine similarity)
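
This is also why AND queries are faster in the benchmarks above: intersection only needs a two-pointer walk over sorted postings lists and stops once the shorter list is exhausted. A minimal sketch:

/// Intersect two sorted doc-ID lists with a two-pointer walk: O(|a| + |b|).
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,    // advance the smaller side
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);                    // doc appears in both lists
                i += 1;
                j += 1;
            }
        }
    }
    out
}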

Multilingual Support

The tokenizer handles multiple languages through Unicode-aware character processing:

  • ASCII text (case-insensitive)
  • Chinese, Japanese, Korean (CJK)
  • Arabic, Hebrew, and other RTL scripts
  • Cyrillic, Greek, and other alphabets

Project Structure

search_engine/
├── config/                      # Configuration files
│   └── search_engine.toml
├── src/
│   ├── bin/                     # Executable binaries
│   │   ├── indexer.rs           # Index builder
│   │   ├── merger.rs            # Index merger
│   │   └── query_processor.rs   # Query handler & web server
│   └── lib/                     # Shared library code
│       ├── config.rs            # Configuration loading
│       ├── encoding.rs          # Variable-byte encoding/decoding
│       ├── format.rs            # File format handlers
│       ├── inverted_list.rs     # Inverted list abstraction
│       ├── merger.rs            # K-way merge logic
│       ├── types.rs             # Core data structures
│       ├── utils.rs             # Utilities (tokenization, etc.)
│       └── mod.rs               # Library root
├── python/                      # Python scripts for semantic search
│   └── semantic_search/
│       ├── create_embedding.py  # Offline embedding generation
│       └── search.py            # Runtime search logic
├── static/                      # Web interface assets
│   ├── index.html               # Main web UI
│   └── assets/                  # CSS, JS, images
├── data/                        # Dataset and indexes
├── Cargo.toml                   # Rust package manifest
└── README.md                    # This file

Technical Details

Tokenization

The tokenizer uses a character-based approach that (a minimal sketch follows the list):

  1. Converts ASCII letters to lowercase
  2. Preserves non-ASCII characters (for multilingual support)
  3. Treats whitespace and punctuation as delimiters
  4. Supports Unicode character classes (General Punctuation, CJK Symbols)
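
A minimal sketch of these rules (the real tokenizer lives in src/lib/utils.rs; the Unicode ranges below, General Punctuation U+2000–U+206F and CJK Symbols and Punctuation U+3000–U+303F, are an approximation of the classes it handles):

/// Split on whitespace, ASCII punctuation, and common Unicode punctuation
/// blocks; lowercase ASCII letters and keep other characters unchanged.
fn tokenize(text: &str) -> Vec<String> {
    let is_delim = |c: char| {
        c.is_whitespace()
            || c.is_ascii_punctuation()
            || matches!(c, '\u{2000}'..='\u{206F}' | '\u{3000}'..='\u{303F}')
    };
    text.split(is_delim)
        .filter(|t| !t.is_empty())
        .map(|t| t.to_ascii_lowercase()) // lowercases ASCII only; non-ASCII untouched
        .collect()
}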

Variable-byte Encoding

Each integer is encoded using 7 bits per byte (encoder/decoder sketch below):

  • MSB (bit 7) = continuation flag (0 = more bytes follow, 1 = last byte)
  • Lower 7 bits store value chunks in little-endian order
  • Highly efficient for small numbers (<128 uses 1 byte)
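
A sketch of encode/decode under exactly this convention (the project's version is in src/lib/encoding.rs):

/// Encode one integer; values below 128 take a single byte.
fn vbyte_encode(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let chunk = (n & 0x7F) as u8; // low 7 bits
        n >>= 7;
        if n == 0 {
            out.push(chunk | 0x80); // MSB = 1 marks the last byte
            return;
        }
        out.push(chunk); // MSB = 0: more bytes follow
    }
}

/// Decode one integer, returning (value, bytes consumed).
fn vbyte_decode(bytes: &[u8]) -> (u64, usize) {
    let mut n = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        n |= u64::from(b & 0x7F) << (7 * i); // chunks arrive little-endian
        if b & 0x80 != 0 {
            return (n, i + 1); // last byte reached
        }
    }
    (n, bytes.len())
}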

BM25 Scoring

BM25(d, q) = Σ_{t ∈ q} IDF(t) · [ f(t,d) · (k + 1) ] / [ f(t,d) + k · (1 − b + b · |d| / avgdl) ]

Where:

  • f(t, d) = term frequency in document
  • |d| = document length
  • avgdl = average document length
  • k = 0.9, b = 0.4 (tuning parameters)
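
The per-term contribution as a plain function (a sketch; the real query processor reads pre-computed statistics from bm25.bin):

/// Per-term BM25 contribution; sum this over all query terms in the document.
fn bm25_term(idf: f64, tf: f64, doc_len: f64, avgdl: f64) -> f64 {
    const K: f64 = 0.9; // term-frequency saturation
    const B: f64 = 0.4; // document-length normalization
    idf * (tf * (K + 1.0)) / (tf + K * (1.0 - B + B * doc_len / avgdl))
}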

Known Limitations

Indexer

  • Single-threaded: Could benefit from parallel processing for multi-core systems
  • Naive tokenization: No stemming, lemmatization, or stop-word removal
  • Static memory control: Block size based on document count rather than actual memory usage

Merger

  • Memory spikes: Very common terms may cause temporary memory pressure
  • Sequential merging: Could use multi-level merging for hundreds of partial indexes

Query Processor

  • No query optimization: Doesn't reorder terms or perform early termination
  • Limited caching: Cache size is arbitrary rather than memory-aware
  • Basic term matching: No phrase search, proximity matching, or wildcards
  • Online BM25: Scores computed on-the-fly (could precompute partial scores)

Future Improvements

  • Parallel indexing with thread-based document partitioning
  • Advanced tokenization with stemming and language detection
  • Query optimization (term reordering, early termination)
  • Phrase and proximity search support
  • Score caching and pre-computation
  • Dynamic memory-aware caching

Dependencies

Key Rust crates used:

  • axum - Web framework for HTTP server
  • tokio - Async runtime
  • clap - Command-line argument parsing
  • serde/serde_json - Serialization
  • byteorder - Binary I/O utilities
  • indicatif - Progress bars
  • colored - Terminal colors
  • pyo3 - Rust bindings for Python (used for semantic search)

See Cargo.toml for complete list.

License

This project is an academic assignment for CS-GY 6913: Web Search Engines at NYU.

Author

Zeyu Jiang (zj2631, N16724747)
[email protected]

Acknowledgments

  • Dr. Torsten Suel for course instruction
  • Open-source Rust community for excellent libraries
  • Claude for assistance with web interface and Rust programming guidance
