See the page. Not just the text.
Velix is an open-source research project on visual-first retrieval and structured extraction for legal and real-asset documents. The retrieval and extraction layers are built. A hosted demo and frontend integration are next.
Four ideas the whole project rests on.
Visual retrieval
Pages are indexed as images using ColQwen2 multi-vector embeddings. No OCR step on the retrieval path.
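Multi-vector page embeddings are typically compared with late-interaction (MaxSim) scoring: each query vector is matched to its best page vector, and the maxima are summed. The sketch below shows that scoring in plain Python with toy vectors. The function names, page ids, and two-dimensional vectors are illustrative assumptions, not Velix's code.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the page's vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: Sequence[Vector], pages: dict) -> list:
    """Return page ids ordered from highest to lowest MaxSim score."""
    return sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]), reverse=True)


# Toy 2-dimensional vectors (hypothetical); real embeddings are much wider.
query = [(1.0, 0.0), (0.0, 1.0)]
pages = {
    "deed_p1": [(0.9, 0.1), (0.2, 0.8)],
    "lease_p4": [(0.1, 0.2), (0.3, 0.1)],
}
ranking = rank_pages(query, pages)
```

Because every page keeps many vectors instead of one pooled vector, a query term can match a small region of the page, such as a stamp or a table cell, without being averaged away.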
Typed extraction
Output is constrained by Pydantic schemas for six oil and gas document types. Invalid output is rejected, not coerced.
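A minimal sketch of "rejected, not coerced", assuming Pydantic v2. The `LeaseSummary` model and its fields are hypothetical and stand in for one of the real schemas, which are not reproduced here.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class LeaseSummary(BaseModel):
    """Hypothetical schema; field names are illustrative, not Velix's."""

    # strict=True disables type coercion; extra="forbid" rejects unknown keys.
    model_config = ConfigDict(strict=True, extra="forbid")

    lessor: str = Field(min_length=1)
    lessee: str = Field(min_length=1)
    royalty_fraction: float = Field(gt=0, le=1)
    primary_term_years: int = Field(gt=0)


def parse_or_reject(raw: dict) -> Optional[LeaseSummary]:
    """Return a validated model, or None if the candidate output is invalid."""
    try:
        return LeaseSummary.model_validate(raw)
    except ValidationError:
        return None


good = parse_or_reject(
    {"lessor": "A. Smith", "lessee": "Acme Oil",
     "royalty_fraction": 0.125, "primary_term_years": 3}
)
# "3" is a string, so strict mode rejects it instead of converting it to 3.
bad = parse_or_reject(
    {"lessor": "A. Smith", "lessee": "Acme Oil",
     "royalty_fraction": 0.125, "primary_term_years": "3"}
)
```

In Pydantic's default (lax) mode the second payload would validate, with `"3"` converted to `3`. Strict mode is what turns that silent conversion into a rejection.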
Composable layers
Embedder, schema, and store are independent. Mock implementations let the whole pipeline run on CPU for tests.
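One way to get that independence is to code against small interfaces and swap in deterministic stand-ins for tests. The sketch below uses `typing.Protocol`. The names `Embedder`, `Store`, `MockEmbedder`, `InMemoryStore`, and `index_page` are assumptions for illustration and may not match the repo.

```python
import hashlib
from typing import Protocol


class Embedder(Protocol):
    def embed(self, page_bytes: bytes) -> list: ...


class Store(Protocol):
    def add(self, page_id: str, vectors: list) -> None: ...
    def get(self, page_id: str) -> list: ...


class MockEmbedder:
    """Deterministic CPU-only stand-in: derives vectors from a hash of the input."""

    def embed(self, page_bytes: bytes) -> list:
        digest = hashlib.sha256(page_bytes).digest()  # always 32 bytes
        # Four 8-dimensional vectors with components in [0, 1].
        return [[b / 255 for b in digest[i:i + 8]] for i in range(0, 32, 8)]


class InMemoryStore:
    """Dictionary-backed store, enough for unit tests."""

    def __init__(self) -> None:
        self._pages = {}

    def add(self, page_id: str, vectors: list) -> None:
        self._pages[page_id] = vectors

    def get(self, page_id: str) -> list:
        return self._pages[page_id]


def index_page(page_id: str, page_bytes: bytes, embedder: Embedder, store: Store) -> None:
    """The pipeline depends only on the two protocols, not on concrete classes."""
    store.add(page_id, embedder.embed(page_bytes))


store = InMemoryStore()
index_page("p1", b"fake page image", MockEmbedder(), store)
vectors = store.get("p1")
```

A GPU-backed embedder or a real vector database can replace either mock without touching `index_page`, as long as it offers the same methods.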
On-demand by design
Indexing runs once per page. Field extraction runs only when a page is queried. Compute follows usage, not corpus size.
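The sketch below illustrates the on-demand pattern: extraction is deferred until a page is first requested. Caching the result afterwards is an added assumption, and `LazyExtractor` and `fake_extract` are hypothetical names.

```python
class LazyExtractor:
    """Runs the (expensive) extraction function only for pages that are queried."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._cache = {}   # page_id -> extracted fields
        self.calls = 0     # number of times extract_fn actually ran

    def fields_for(self, page_id: str) -> dict:
        if page_id not in self._cache:
            self.calls += 1
            self._cache[page_id] = self._extract_fn(page_id)
        return self._cache[page_id]


def fake_extract(page_id: str) -> dict:
    """Stand-in for a model call; returns a trivial payload."""
    return {"page": page_id}


corpus = [f"page_{i}" for i in range(1000)]  # indexed, but nothing extracted yet
extractor = LazyExtractor(fake_extract)

first = extractor.fields_for("page_7")
second = extractor.fields_for("page_7")  # served from the cache
```

With a thousand pages indexed and one page queried twice, the extraction function runs once, so cost tracks what users look at instead of how much is stored.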
One screen for the page, the fields, and the checks.
The PDF on the left, the typed extraction on the right, validators showing their work. Click into any indexed document and try it.
- Source PDF rendered inline, scrollable, zoomable.
- Pydantic-typed fields mirror the document's schema; what you see is what feeds the LLM.
- Domain validators (PLSS, mineral fractions, party chains) show pass or fail per field.
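As a rough illustration of per-field pass or fail checks, here is a sketch of two validators using only the standard library. The PLSS string format (`T2N R3W Sec 14`), the rule that mineral shares must not total more than 1, and all function and field names are assumptions for this example. The project's actual validators may apply different rules.

```python
import re
from fractions import Fraction

# Assumed format: township, range, section, e.g. "T2N R3W Sec 14".
PLSS_PATTERN = re.compile(r"^T(\d{1,3})([NS]) R(\d{1,3})([EW]) Sec (\d{1,2})$")


def check_plss(text: str) -> bool:
    """Pass when the text matches the assumed format and the section is 1-36."""
    m = PLSS_PATTERN.match(text.strip())
    return bool(m) and 1 <= int(m.group(5)) <= 36


def check_mineral_fractions(shares: list) -> bool:
    """Pass when every share parses as a fraction in (0, 1] and the total is at most 1."""
    try:
        parsed = [Fraction(s) for s in shares]
    except (ValueError, ZeroDivisionError):
        return False
    return all(0 < f <= 1 for f in parsed) and sum(parsed) <= 1


def run_checks(fields: dict) -> dict:
    """Return a pass/fail verdict per field."""
    return {
        "legal_description": "pass" if check_plss(fields["legal_description"]) else "fail",
        "mineral_shares": "pass" if check_mineral_fractions(fields["mineral_shares"]) else "fail",
    }


report = run_checks(
    {
        "legal_description": "T2N R3W Sec 14",
        "mineral_shares": ["1/2", "1/4", "3/8"],  # totals 9/8, more than the whole
    }
)
```

Using `Fraction` keeps the arithmetic exact, so shares such as 1/3 + 1/3 + 1/3 total exactly 1 instead of drifting by floating-point error.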
The repo is the demo.
Code, tests, schemas, and the build scripts that produced the corpus are all open. Read the README, clone, run the test suite. The hosted demo follows once the backend is live.