
Show HN: Ragctl – document ingestion CLI for RAG (OCR, chunking, Qdrant)

4 points - today at 1:35 AM


Hi HN, sharing ragctl, an open-source CLI for the most failure-prone part of RAG pipelines: document ingestion, OCR, parsing/cleaning, and chunking.

Vector DB setup is fairly standardized now, but getting high-quality, consistent text + metadata into it still takes a lot of brittle glue code. ragctl aims to make that "pre-vector" step repeatable: turn messy documents into retrieval-ready chunks in a few commands.

Features
• Multi-format input: PDF, DOCX, HTML, images
• OCR for scanned/image-based docs
• Semantic chunking (LangChain)
• Batch runs with retries + error handling
• Output: direct ingestion into Qdrant (for now)
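
To make "batch runs with retries + error handling" concrete, here is a minimal Python sketch of the general pattern. It is not ragctl's actual code. The names `chunk_text` and `run_batch` and the `extract` callback are hypothetical, and the fixed-size overlapping splitter is only a simple stand-in for semantic chunking.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def chunk_text(text: str, max_chars: int = 200, overlap: int = 20) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical stand-in for semantic chunking: it ignores sentence and
    section boundaries and only illustrates the shape of the output.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    if not text:
        return []
    step = max_chars - overlap
    # Stop once the previous window already reaches the end of the text.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, last_start, step)]


def run_batch(
    paths: Iterable[str],
    extract: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Extract and chunk each document, retrying failed extractions.

    `extract` is an assumed callback (parser or OCR) that returns plain text.
    A document that still fails after `max_attempts` is recorded in the
    failures dict instead of aborting the whole batch.
    """
    chunks_by_path: Dict[str, List[str]] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                text = extract(path)
            except Exception as exc:  # isolate per-document errors
                if attempt == max_attempts:
                    failures[path] = repr(exc)
                else:
                    # Exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
            else:
                chunks_by_path[path] = chunk_text(text)
                break
    return chunks_by_path, failures
```

Returning failures alongside the successful chunks, rather than raising, lets one unreadable PDF be reported at the end without discarding the rest of the batch.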

Looking for feedback
• DX: is the CLI intuitive?
• Performance / edge cases: weird PDFs, mixed layouts, tables
• Roadmap: which connectors (S3, Slack, Notion) or vector stores should be next?

Repo: https://github.com/datallmhub/ragstudio

Happy to answer questions about the architecture and chunking approach.
