Hi HN, we're Lewis and Edgar, building Captain to simplify unstructured data search (https://runcaptain.com). Captain automates the building and maintenance of file-based RAG pipelines. It indexes cloud storage like S3 and GCS, plus SaaS sources like Google Drive. There's a quick walkthrough at https://youtu.be/EIQkwAsIPmc.
We also put up a demo site called "Ask PG's Essays", which lets you ask/search the corpus of pg's essays, to get a feel for how it works: https://pg.runcaptain.com. The RAG part of this took Captain about 3 minutes to set up.
Here are some sample prompts to get a feel for the experience:
"When do we do things that don't scale? When should we be more cautious?"
https://pg.runcaptain.com/?q=When%20do%20we%20do%20things%20...
"Give me some advice, I'm fundraising"
https://pg.runcaptain.com/?q=Give%20me%20some%20advice%2C%20...
"What are the biggest advantages of Lisp?"
https://pg.runcaptain.com/?q=what%20are%20the%20biggest%20ad...
A good production RAG pipeline takes substantial effort to build, especially for file workloads. You have to handle ETL or text extraction, chunking, embedding, storage, search, re-ranking, inference, and often compliance and observability, all while optimizing for latency and reliability. It's a lot to manage. grep works well in some cases, but for agents, semantic search performs significantly better. Cursor uses both and reports 6.5%-23.5% accuracy gains from vector search over grep (https://cursor.com/blog/semsearch).
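To make one of those stages concrete, here is a naive fixed-size chunker. This is our own illustrative baseline, not Captain's implementation; much of the DIY tuning effort mentioned below goes into replacing exactly this kind of baseline with sentence- and section-aware splitting:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    A deliberately naive baseline: it ignores sentence and section
    boundaries, which is where production chunkers spend their effort.
    """
    step = size - overlap
    # Overlap keeps context that straddles a boundary present in both chunks.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, a 1,000-character document with the defaults yields two chunks of 800 and 400 characters that share a 200-character overlap.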
We've spent the past four years scaling RAG pipelines for companies, and Edgar's work at Purdue's NLP lab directly informed our chunking techniques. In conversations with dozens of engineers, we repeatedly saw DIY pipelines produce inconsistent results, even after weeks of tuning. Many teams lacked clarity on which retrieval strategies best fit their data.
We realized that a system to provision storage and embeddings, handle indexing, and continuously update pipelines to reflect the latest search techniques could remove the need for every team to rebuild RAG themselves. That idea became Captain.
In practice, one API call indexes URLs, cloud storage buckets, directories, or individual files. Under the hood, we're converting everything to Markdown. For this, we've had good results with Gemini 3 Pro for images, Reducto for complex documents, and Extend for basic OCR. For embedding models, "gemini-embedding-001" performed reasonably well at first, but we later switched to the contextualized embeddings from "voyage-context-3". It produced more relevant results than even the newer Voyage 4 models because its chunk embeddings are encoded with awareness of the surrounding document context. We then apply Voyage's "rerank-2.5" as a second-stage re-ranker, reducing 50 initial chunks to a final top 15 (configurable in Captain's API). Dense embeddings are only half the picture; full-text search fused via reciprocal rank fusion (RRF) completes our hybrid retrieval. In the Captain API, these techniques are exposed through a single /query endpoint. Access controls can be configured via metadata filters, and page-number citations are returned automatically.
The stack is constantly changing, but the Captain API provides a standard interface over it. You can try Captain free for one month and build your own pipelines at https://runcaptain.com. We're looking for candid feedback, especially anything that would make it more useful, and we look forward to your comments!