Show HN: Ragctl - document ingestion CLI for RAG (OCR, chunking, Qdrant)
4 points - today at 1:35 AM
Hi HN, I'm sharing ragctl, an open-source CLI for the most failure-prone part of RAG pipelines: document ingestion, OCR, parsing/cleaning, and chunking.
Vector DB setup is fairly standardized now, but getting high-quality, consistent text + metadata into it still takes a lot of brittle glue code. ragctl aims to make that "pre-vector" step repeatable: turn messy documents into retrieval-ready chunks in a few commands.
Features
• Multi-format input: PDF, DOCX, HTML, images
• OCR for scanned/image-based docs
• Semantic chunking (LangChain)
• Batch runs with retries + error handling
• Output: direct ingestion into Qdrant (for now)
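For readers who want a picture of what "batch runs with retries" plus chunking looks like in practice, here is a rough stdlib-only sketch. It is not ragctl's code: the names `chunk_text`, `with_retries`, `ingest_batch`, and the `extract` callback are all hypothetical, and the chunker is a plain paragraph-packing splitter standing in for the LangChain-based semantic chunking the tool actually uses.

```python
import time
from typing import Callable, Iterable


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is hard-split on character boundaries.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


def with_retries(
    fn: Callable[[str], str],
    arg: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn(arg), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be >= 1")


def ingest_batch(
    paths: Iterable[str],
    extract: Callable[[str], str],  # hypothetical: returns plain text for one document
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Chunk every document; one bad file is recorded instead of aborting the batch."""
    results: dict[str, list[str]] = {}
    errors: dict[str, str] = {}
    for path in paths:
        try:
            text = with_retries(extract, path, attempts=attempts, sleep=sleep)
            results[path] = chunk_text(text)
        except Exception as exc:
            errors[path] = repr(exc)
    return results, errors
```

The design point is that failures are isolated per document: transient errors (a flaky OCR call, say) get retried, while a permanently broken file ends up in `errors` and the rest of the batch still produces chunks.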
Looking for feedback
• DX: is the CLI intuitive?
• Performance / edge cases: weird PDFs, mixed layouts, tables
• Roadmap: which connectors (S3, Slack, Notion) or vector stores should be next?
Repo: https://github.com/datallmhub/ragstudio
Happy to answer questions about the architecture and chunking approach.