Proposal: Metadata contracts v0.1 — source/document/chunk schemas for the proof loop #3
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Issue: Metadata contracts v0.1 —
source,document,chunkschemas for the proof loopWhy this issue exists
Per the roadmap, the next foundation step after Phase 0 is defining metadata contracts. Rather than design all seven schemas GPT sketched (
document,source,license,trust_score,chunk,index_manifest,bundle_manifest) up front, this issue proposes only the three the proof loop actually touches:source,document,chunk.The proof loop we're aiming at:
These three schemas are exactly the contract surface that loop needs. Everything else (trust scoring, index/bundle manifests, embeddings) is deliberately deferred until we've run real documents through and felt what those schemas need to contain — you can't design a good index manifest in the abstract.
Important: these schemas are not invented from scratch. Every field maps to a rule already written in
meta/policies/source_acceptance_policy.mdandmeta/policies/license_policy.md. This issue just makes those policies machine-checkable.Scope
In scope
meta/schemas/source.schema.json— one record per original work: provenance, traceability, license, relevance.meta/schemas/document.schema.json— one record per concrete file artifact (original PDF, extracted text, OCR output…), with checksum + transformation lineage.meta/schemas/chunk.schema.json— one record per retrievable passage, with offsets + human-facing location for citations.pipeline/validate/validate_records.pyor similar) that checks records against the schemas. JSON Schema draft 2020-12, validated withjsonschemain Python.Out of scope (deferred on purpose)
trust_score— meaningless until we have competing/conflicting sources to score.index_manifest,bundle_manifest— design after we've indexed/bundled real docs.meta/vocabularies/*.yaml) — for now the enums live inline in the schemas; promote to vocab files later if they grow.Design notes
source(the work) ←document(files derived from it, each with aroleand aderived_frompointer) ←chunk(passages, pointing back to both). This keeps "the AI's chunk" provably traceable to an original file and source — which is the whole point of Arkive's source-is-authority principle.sourceas a reusable$defs.licenseobject rather than a separate top-level schema, to keep v0.1 at three files. It's structured so it can graduate to its ownlicense.schema.jsonlater with no field changes.chunk.location+chunk.source_id. A cited answer shows the passage text, its page/section, and a link back to the source record. No-source-no-answer falls out naturally: retrieval returns chunk IDs; if none clear the threshold, the runtime refuses.license.statususes the six statuses fromlicense_policy.md;license.conditionsuses that policy's condition list;domainsmirrors the relevance list insource_acceptance_policy.md.additionalProperties: falseeverywhere to keep records honest; a free-textnotesfield is provided where humans need room.meta/schemas/source.schema.jsonmeta/schemas/document.schema.jsonmeta/schemas/chunk.schema.jsonWorked example (grounded in a real candidate)
A good first candidate is a U.S. Army survival field manual, because U.S. federal government works are public domain in the U.S. under 17 U.S.C. § 105 — a clean, defensible
redistributablestatus that's perfect for testing the license fields. (Caveat to verify during review: make sure the specific PDF is the actual government work, not a commercial reprint that adds copyrighted front-matter or illustrations — exactly the kind of check the schema is meant to force.)source record
document records (original + extracted text)
chunk record
With those three records, the proof loop is fully expressible: search hits the chunk, the answer cites FM 21-76, p.169, "Water Purification", and the link resolves through
source_id/document_idback to the original PDF and page.Acceptance criteria
meta/schemas/and are valid JSON Schema draft 2020-12.license.status = "unknown"butbundleable = trueis rejected (cross-field rule — may need a small custom check beyond pure JSON Schema).Suggested branch
feature/issue-N-metadata-schemas-v0.1offdevelop, PR intodevelop. (Replace N with the real issue number.)