PSEEDR

Curated Digest: Building a Semantic Search Engine for LessWrong

Coverage of lessw-blog

· PSEEDR Editorial

lessw-blog details the development of a specialized semantic search engine for the LessWrong community, highlighting practical approaches to embeddings, document chunking, and similarity search for niche content archives.

In a recent post, lessw-blog discusses the creation and deployment of a semantic search engine specifically tailored for LessWrong and the broader rationalist Substack ecosystem.

As organizations and communities accumulate vast archives of specialized knowledge, traditional keyword search often falls short in surfacing conceptually related content. This is especially true in communities where specialized terminology evolves rapidly or complex arguments span multiple authors and years. Semantic search, a foundational component of modern Retrieval-Augmented Generation (RAG) systems, addresses this by mapping documents into vector spaces where proximity indicates thematic similarity. Understanding how to build and scale these systems is highly relevant for enterprise teams looking to improve information retrieval across proprietary datasets.

The author presents a working prototype that processes approximately 372,000 documents, including LessWrong posts, highly-rated comments (karma of 5 or higher), and articles from 173 related Substacks. Each document is chunked and embedded with the Voyage-3.5 model into 1024-dimensional vectors, and the chunk embeddings are averaged into a single document vector. Users can input a specific URL, and the engine calculates cosine distance to return the top 100 most semantically similar matches.
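The chunk-then-average pipeline described above can be sketched as follows. This is a minimal illustration, not the post's actual code: the chunking strategy is assumed (the post doesn't specify it), and `embed_chunks` is a hypothetical placeholder standing in for a call to the Voyage-3.5 embedding API.

```python
import numpy as np

EMBED_DIM = 1024  # dimensionality reported for Voyage-3.5 vectors


def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split a document into fixed-size character chunks (illustrative only)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed_chunks(chunks: list[str]) -> np.ndarray:
    """Placeholder embedder returning one vector per chunk.
    In the real system this would call the Voyage-3.5 model."""
    rng = np.random.default_rng(len(chunks))  # deterministic stand-in
    return rng.normal(size=(len(chunks), EMBED_DIM))


def document_vector(text: str) -> np.ndarray:
    """Average the chunk embeddings into one document vector, then
    L2-normalize so cosine similarity reduces to a dot product."""
    chunk_vecs = embed_chunks(chunk_text(text))
    doc_vec = chunk_vecs.mean(axis=0)
    return doc_vec / np.linalg.norm(doc_vec)


vec = document_vector("A long LessWrong post about epistemics... " * 200)
```

Averaging loses some within-document nuance compared to keeping per-chunk vectors, but it keeps the index at one vector per document, which is what makes a brute-force scan over ~372,000 entries tractable.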

The decision to average chunked embeddings into a single document vector represents a pragmatic approach to handling long-form content, balancing retrieval speed with semantic representation. lessw-blog also touches on the practical constraints of scaling such systems. While features like open-ended custom queries and BGE reranking are planned for future iterations, they currently present cost and complexity hurdles for public deployment. The author candidly discusses the operational realities of maintaining such a database, noting that the current index reflects a snapshot from mid-April, which underscores the ongoing maintenance required for live semantic search tools.
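Because the stored document vectors can be L2-normalized, the URL-based lookup reduces to a dot product followed by a top-k sort. The sketch below assumes normalized vectors and a toy corpus; function and variable names are illustrative, not taken from the post.

```python
import numpy as np


def top_matches(doc_vectors: np.ndarray, query_idx: int, k: int = 100) -> list[int]:
    """Rank all documents by cosine similarity to the query document,
    excluding the query itself, and return the top-k indices."""
    query = doc_vectors[query_idx]
    sims = doc_vectors @ query        # cosine similarity (unit-length vectors)
    sims[query_idx] = -np.inf         # never return the query document
    order = np.argsort(-sims)         # descending similarity
    return order[:k].tolist()


# Toy corpus: 5 random unit vectors in a small dimension for illustration.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
matches = top_matches(vecs, query_idx=0, k=3)
```

At a few hundred thousand 1024-d vectors, this exhaustive scan is still feasible on one machine, which is part of why deferring heavier features like reranking is a reasonable scaling decision.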

This project serves as an excellent case study in applying modern embedding models to niche community content, offering valuable insights into the trade-offs between search accuracy, computational cost, and system architecture. For engineers and product managers exploring AI-powered search solutions, this breakdown provides a clear view into data ingestion, embedding strategies, and the realities of production constraints. Read the full post to explore the technical implementation and try the prototype.

Key Takeaways

  • A working semantic search prototype was built for LessWrong and 173 related Substacks, indexing roughly 372,000 documents.
  • The system utilizes Voyage-3.5 embeddings (1024-d), chunking documents and averaging the chunk embeddings into single document vectors for comparison.
  • Similarity is determined via cosine distance, currently allowing users to find the top 100 matches based on a provided URL.
  • Scaling advanced features like open-ended queries and BGE reranking remains a challenge due to cost and computational complexity.

Read the original post at lessw-blog
