Proposal: Metadata contracts v0.1 — source/document/chunk schemas for the proof loop #3

Open
opened 2026-06-03 22:25:38 -04:00 by McJuniorstein · 0 comments
Member

Issue: Metadata contracts v0.1 — source, document, chunk schemas for the proof loop

Why this issue exists

Per the roadmap, the next foundation step after Phase 0 is defining metadata contracts. Rather than design all seven schemas GPT sketched (document, source, license, trust_score, chunk, index_manifest, bundle_manifest) up front, this issue proposes only the three the proof loop actually touches: source, document, chunk.

The proof loop we're aiming at:

Take 5 legally-clean documents → record their provenance + license → extract and chunk them → search offline → answer one question with a citation that opens the right passage.

These three schemas are exactly the contract surface that loop needs. Everything else (trust scoring, index/bundle manifests, embeddings) is deliberately deferred until we've run real documents through and felt what those schemas need to contain — you can't design a good index manifest in the abstract.

Important: these schemas are not invented from scratch. Every field maps to a rule already written in meta/policies/source_acceptance_policy.md and meta/policies/license_policy.md. This issue just makes those policies machine-checkable.

Scope

In scope

  • meta/schemas/source.schema.json — one record per original work: provenance, traceability, license, relevance.
  • meta/schemas/document.schema.json — one record per concrete file artifact (original PDF, extracted text, OCR output…), with checksum + transformation lineage.
  • meta/schemas/chunk.schema.json — one record per retrievable passage, with offsets + human-facing location for citations.
  • One worked example record per schema, validating against the schema.
  • A tiny validator script (pipeline/validate/validate_records.py or similar) that checks records against the schemas. JSON Schema draft 2020-12, validated with jsonschema in Python.

Out of scope (deferred on purpose)

  • trust_score — meaningless until we have competing/conflicting sources to score.
  • index_manifest, bundle_manifest — design after we've indexed/bundled real docs.
  • Embeddings / vector storage — the chunk schema leaves an optional pointer field, but no vectors yet.
  • Controlled-vocabulary files (meta/vocabularies/*.yaml) — for now the enums live inline in the schemas; promote to vocab files later if they grow.

Design notes

  • Three records, clear lineage. source (the work) ← document (files derived from it, each with a role and a derived_from pointer) ← chunk (passages, pointing back to both). This keeps "the AI's chunk" provably traceable to an original file and source — which is the whole point of Arkive's source-is-authority principle.
  • License is embedded in source as a reusable $defs.license object rather than a separate top-level schema, to keep v0.1 at three files. It's structured so it can graduate to its own license.schema.json later with no field changes.
  • Citation = chunk.location + chunk.source_id. A cited answer shows the passage text, its page/section, and a link back to the source record. No-source-no-answer falls out naturally: retrieval returns chunk IDs; if none clear the threshold, the runtime refuses.
  • Enums mirror the policies exactly. license.status uses the six statuses from license_policy.md; license.conditions uses that policy's condition list; domains mirrors the relevance list in source_acceptance_policy.md.
  • additionalProperties: false everywhere to keep records honest; a free-text notes field is provided where humans need room.

meta/schemas/source.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "urn:arkive:schema:source:0.1.0",
  "title": "Arkive Source Record",
  "description": "One original work Arkive has reviewed: provenance, traceability, license, relevance. Maps to source_acceptance_policy.md and license_policy.md.",
  "type": "object",
  "additionalProperties": false,
  "required": ["source_id", "title", "review_status", "license"],
  "properties": {
    "source_id": {
      "type": "string",
      "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$",
      "description": "Stable identifier, e.g. src_army-survival-fm21-76."
    },
    "title": { "type": "string", "minLength": 1 },
    "authors": { "type": "array", "items": { "type": "string" }, "default": [] },
    "publisher": { "type": "string" },
    "publication_date": {
      "type": "string",
      "description": "ISO date (YYYY-MM-DD) or year (YYYY) when full date is unknown.",
      "pattern": "^[0-9]{4}(-[0-9]{2}-[0-9]{2})?$"
    },
    "edition": { "type": "string" },
    "language": {
      "type": "string",
      "description": "BCP-47 / ISO 639 code, e.g. 'en', 'es', 'fr'.",
      "pattern": "^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$"
    },
    "origin": {
      "type": "object",
      "additionalProperties": false,
      "description": "Where the material came from (traceability).",
      "properties": {
        "url": { "type": "string", "format": "uri" },
        "acquisition_path": { "type": "string", "description": "How it was obtained if not a plain URL." },
        "date_accessed": { "type": "string", "format": "date" }
      }
    },
    "domains": {
      "type": "array",
      "description": "Relevance tags mirroring source_acceptance_policy.md. Promote to meta/vocabularies/domains.yaml when this grows.",
      "items": {
        "type": "string",
        "enum": [
          "medicine", "first_aid", "childbirth_womens_health", "water_purification",
          "sanitation", "food_preservation", "agriculture", "seed_saving",
          "animal_husbandry", "shelter", "carpentry", "mechanics", "electricity",
          "radio", "chemistry", "materials_science", "textiles", "fuel_production",
          "education", "governance", "disaster_response", "mapping", "forestry",
          "manufacturing", "toolmaking", "mining_metallurgy",
          "mental_health_community_resilience", "general_reference"
        ]
      },
      "default": []
    },
    "safety_critical": {
      "type": "boolean",
      "default": false,
      "description": "True if the source falls in a high-risk domain per the policy; triggers extra review caution."
    },
    "review_status": {
      "type": "string",
      "enum": ["pending_review", "accepted", "rejected"],
      "description": "Acceptance lifecycle, distinct from license status."
    },
    "rejection_reason": {
      "type": "string",
      "description": "Required in practice when review_status = rejected; see rejected-source records in the policy."
    },
    "redundancy_group": {
      "type": "string",
      "description": "Optional group key linking multiple sources covering the same critical topic."
    },
    "license": { "$ref": "#/$defs/license" },
    "notes": { "type": "string" },
    "created_at": { "type": "string", "format": "date-time" },
    "updated_at": { "type": "string", "format": "date-time" }
  },
  "$defs": {
    "license": {
      "type": "object",
      "additionalProperties": false,
      "description": "License / redistribution status. Maps directly to license_policy.md. Promote to its own schema later if needed.",
      "required": ["status", "bundleable"],
      "properties": {
        "status": {
          "type": "string",
          "enum": [
            "redistributable",
            "redistributable_with_conditions",
            "metadata_only",
            "pending_review",
            "rejected",
            "unknown"
          ]
        },
        "name": { "type": "string", "description": "Human license name, e.g. 'Public Domain (US Gov work)', 'CC BY-NC-SA 4.0'." },
        "url": { "type": "string", "format": "uri" },
        "attribution_text": { "type": "string", "description": "Exact attribution string to preserve, when required." },
        "conditions": {
          "type": "array",
          "description": "Redistribution conditions from license_policy.md.",
          "items": {
            "type": "string",
            "enum": [
              "attribution",
              "preserve_license_text",
              "share_alike",
              "non_commercial",
              "no_derivatives",
              "identify_changes",
              "link_original"
            ]
          },
          "default": []
        },
        "evidence": {
          "type": "array",
          "description": "License evidence preserved with the record (do not rely on memory).",
          "items": {
            "type": "object",
            "additionalProperties": false,
            "required": ["type"],
            "properties": {
              "type": {
                "type": "string",
                "enum": [
                  "license_in_document", "official_license_url", "publisher_rights_statement",
                  "public_domain_notice", "government_publication", "repository_license_file",
                  "archived_license_page"
                ]
              },
              "detail": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "date_accessed": { "type": "string", "format": "date" }
            }
          },
          "default": []
        },
        "bundleable": {
          "type": "boolean",
          "description": "Whether this source may be included in redistributable releases. MUST be false unless status is redistributable or redistributable_with_conditions."
        }
      }
    }
  }
}

meta/schemas/document.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "urn:arkive:schema:document:0.1.0",
  "title": "Arkive Document Record",
  "description": "One concrete file artifact tied to a source: the original file or a derived form (extracted text, OCR, cleaned text). Preserves checksum and transformation lineage per source_acceptance_policy.md.",
  "type": "object",
  "additionalProperties": false,
  "required": ["document_id", "source_id", "role", "media_type", "path", "checksum"],
  "properties": {
    "document_id": {
      "type": "string",
      "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$"
    },
    "source_id": {
      "type": "string",
      "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$",
      "description": "FK to the source record this file belongs to."
    },
    "role": {
      "type": "string",
      "enum": ["original", "extracted_text", "ocr_text", "cleaned_text"],
      "description": "What this file is in the processing lineage."
    },
    "media_type": {
      "type": "string",
      "description": "IANA media type, e.g. application/pdf, text/plain, text/html.",
      "pattern": "^[a-z]+/[a-z0-9.+-]+$"
    },
    "path": {
      "type": "string",
      "description": "Repository-relative path under data/, e.g. data/sources/core/army-survival/original.pdf."
    },
    "byte_size": { "type": "integer", "minimum": 0 },
    "checksum": {
      "type": "object",
      "additionalProperties": false,
      "required": ["algorithm", "value"],
      "properties": {
        "algorithm": { "type": "string", "enum": ["sha256"] },
        "value": { "type": "string", "pattern": "^[a-f0-9]{64}$" }
      }
    },
    "page_count": { "type": "integer", "minimum": 0 },
    "language": { "type": "string", "pattern": "^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$" },
    "derived_from": {
      "type": "string",
      "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$",
      "description": "document_id this file was derived from (e.g. extracted_text derived from original). Omit for originals."
    },
    "transformation": {
      "type": "object",
      "additionalProperties": false,
      "description": "How a derived file was produced. Omit for role=original.",
      "required": ["method"],
      "properties": {
        "method": {
          "type": "string",
          "enum": ["pdf_text_extract", "ocr", "cleanup", "format_convert", "translate"]
        },
        "tool": { "type": "string", "description": "Tool + version, e.g. pdfminer.six 20240706." },
        "params": { "type": "object", "description": "Free-form parameters used, for reproducibility." },
        "date": { "type": "string", "format": "date-time" }
      }
    },
    "notes": { "type": "string" },
    "created_at": { "type": "string", "format": "date-time" }
  }
}

meta/schemas/chunk.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "urn:arkive:schema:chunk:0.1.0",
  "title": "Arkive Chunk Record",
  "description": "One retrievable passage. Carries the text plus enough location data to cite it and open the right page/section. Back-references both the document and the source.",
  "type": "object",
  "additionalProperties": false,
  "required": ["chunk_id", "document_id", "source_id", "sequence", "text"],
  "properties": {
    "chunk_id": {
      "type": "string",
      "pattern": "^chk_[a-z0-9][a-z0-9_-]{2,79}$"
    },
    "document_id": {
      "type": "string",
      "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$",
      "description": "FK to the text document this chunk was cut from (usually an extracted_text/cleaned_text doc)."
    },
    "source_id": {
      "type": "string",
      "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$",
      "description": "FK to the source. Denormalized so citations don't require a document lookup."
    },
    "sequence": {
      "type": "integer",
      "minimum": 0,
      "description": "Order of this chunk within its document."
    },
    "text": { "type": "string", "minLength": 1 },
    "char_start": {
      "type": "integer",
      "minimum": 0,
      "description": "Start offset into the document's text, for exact passage display."
    },
    "char_end": { "type": "integer", "minimum": 0 },
    "token_count": { "type": "integer", "minimum": 0 },
    "location": {
      "type": "object",
      "additionalProperties": false,
      "description": "Human-facing citation anchor — what lets a cited answer open the right page.",
      "properties": {
        "page": { "type": "integer", "minimum": 1 },
        "section": { "type": "string" },
        "heading": { "type": "string" }
      }
    },
    "embedding_ref": {
      "type": "string",
      "description": "Optional pointer to a future vector record. Embeddings are out of scope for v0.1; this field reserves the slot."
    },
    "created_at": { "type": "string", "format": "date-time" }
  }
}

Worked example (grounded in a real candidate)

A good first candidate is a U.S. Army survival field manual, because U.S. federal government works are public domain in the U.S. under 17 U.S.C. § 105 — a clean, defensible redistributable status that's perfect for testing the license fields. (Caveat to verify during review: make sure the specific PDF is the actual government work, not a commercial reprint that adds copyrighted front-matter or illustrations — exactly the kind of check the schema is meant to force.)

source record

{
  "source_id": "src_army-survival-fm21-76",
  "title": "FM 21-76 Survival",
  "authors": ["U.S. Department of the Army"],
  "publisher": "U.S. Department of the Army",
  "publication_date": "1992",
  "language": "en",
  "origin": {
    "url": "https://example.gov/fm21-76.pdf",
    "date_accessed": "2026-06-03"
  },
  "domains": ["first_aid", "water_purification", "shelter", "disaster_response"],
  "safety_critical": true,
  "review_status": "pending_review",
  "license": {
    "status": "redistributable",
    "name": "Public Domain (U.S. Government work)",
    "conditions": [],
    "evidence": [
      {
        "type": "government_publication",
        "detail": "Work prepared by U.S. Dept. of the Army; public domain in the U.S. under 17 U.S.C. § 105. Verify the specific file is the genuine government work, not a copyrighted reprint.",
        "date_accessed": "2026-06-03"
      }
    ],
    "bundleable": true
  },
  "notes": "First proof-loop candidate. License reasoning to be confirmed during review before any bundling."
}

document records (original + extracted text)

[
  {
    "document_id": "doc_army-survival-original",
    "source_id": "src_army-survival-fm21-76",
    "role": "original",
    "media_type": "application/pdf",
    "path": "data/sources/core/army-survival/original.pdf",
    "checksum": { "algorithm": "sha256", "value": "0000000000000000000000000000000000000000000000000000000000000000" },
    "page_count": 277,
    "language": "en"
  },
  {
    "document_id": "doc_army-survival-text",
    "source_id": "src_army-survival-fm21-76",
    "role": "extracted_text",
    "media_type": "text/plain",
    "path": "data/sources/core/army-survival/extracted.txt",
    "checksum": { "algorithm": "sha256", "value": "1111111111111111111111111111111111111111111111111111111111111111" },
    "derived_from": "doc_army-survival-original",
    "transformation": {
      "method": "pdf_text_extract",
      "tool": "pdfminer.six 20240706"
    }
  }
]

chunk record

{
  "chunk_id": "chk_army-survival-water-0001",
  "document_id": "doc_army-survival-text",
  "source_id": "src_army-survival-fm21-76",
  "sequence": 142,
  "text": "Boiling is the safest method of purifying water. Bring water to a rolling boil for at least one minute...",
  "char_start": 184213,
  "char_end": 184498,
  "location": { "page": 169, "section": "Water Procurement", "heading": "Water Purification" }
}

With those three records, the proof loop is fully expressible: search hits the chunk, the answer cites FM 21-76, p.169, "Water Purification", and the link resolves through source_id/document_id back to the original PDF and page.

Acceptance criteria

  • Three schema files exist under meta/schemas/ and are valid JSON Schema draft 2020-12.
  • A validator script loads the schemas and validates a record; passing records exit 0, failing records report which field/rule broke.
  • The worked-example records above validate cleanly.
  • A record with license.status = "unknown" but bundleable = true is rejected (cross-field rule — may need a small custom check beyond pure JSON Schema).
  • README/notes explain the three-record lineage in 1–2 paragraphs.

Suggested branch

feature/issue-N-metadata-schemas-v0.1 off develop, PR into develop. (Replace N with the real issue number.)

# Issue: Metadata contracts v0.1 — `source`, `document`, `chunk` schemas for the proof loop ## Why this issue exists Per the roadmap, the next foundation step after Phase 0 is defining metadata contracts. Rather than design all seven schemas GPT sketched (`document`, `source`, `license`, `trust_score`, `chunk`, `index_manifest`, `bundle_manifest`) up front, this issue proposes **only the three the proof loop actually touches**: `source`, `document`, `chunk`. The proof loop we're aiming at: > Take 5 legally-clean documents → record their provenance + license → extract and chunk them → search offline → answer one question with a citation that opens the right passage. These three schemas are exactly the contract surface that loop needs. Everything else (trust scoring, index/bundle manifests, embeddings) is deliberately deferred until we've run real documents through and *felt* what those schemas need to contain — you can't design a good index manifest in the abstract. **Important:** these schemas are not invented from scratch. Every field maps to a rule already written in `meta/policies/source_acceptance_policy.md` and `meta/policies/license_policy.md`. This issue just makes those policies machine-checkable. ## Scope **In scope** - `meta/schemas/source.schema.json` — one record per original work: provenance, traceability, license, relevance. - `meta/schemas/document.schema.json` — one record per concrete file artifact (original PDF, extracted text, OCR output…), with checksum + transformation lineage. - `meta/schemas/chunk.schema.json` — one record per retrievable passage, with offsets + human-facing location for citations. - One worked example record per schema, validating against the schema. - A tiny validator script (`pipeline/validate/validate_records.py` or similar) that checks records against the schemas. JSON Schema draft 2020-12, validated with `jsonschema` in Python. **Out of scope (deferred on purpose)** - `trust_score` — meaningless until we have competing/conflicting sources to score. - `index_manifest`, `bundle_manifest` — design after we've indexed/bundled real docs. - Embeddings / vector storage — the chunk schema leaves an optional pointer field, but no vectors yet. - Controlled-vocabulary files (`meta/vocabularies/*.yaml`) — for now the enums live inline in the schemas; promote to vocab files later if they grow. ## Design notes - **Three records, clear lineage.** `source` (the work) ← `document` (files derived from it, each with a `role` and a `derived_from` pointer) ← `chunk` (passages, pointing back to both). This keeps "the AI's chunk" provably traceable to an original file and source — which is the whole point of Arkive's source-is-authority principle. - **License is embedded in `source`** as a reusable `$defs.license` object rather than a separate top-level schema, to keep v0.1 at three files. It's structured so it can graduate to its own `license.schema.json` later with no field changes. - **Citation = `chunk.location` + `chunk.source_id`.** A cited answer shows the passage text, its page/section, and a link back to the source record. No-source-no-answer falls out naturally: retrieval returns chunk IDs; if none clear the threshold, the runtime refuses. - **Enums mirror the policies exactly.** `license.status` uses the six statuses from `license_policy.md`; `license.conditions` uses that policy's condition list; `domains` mirrors the relevance list in `source_acceptance_policy.md`. - `additionalProperties: false` everywhere to keep records honest; a free-text `notes` field is provided where humans need room. --- ## `meta/schemas/source.schema.json` ```json { "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "urn:arkive:schema:source:0.1.0", "title": "Arkive Source Record", "description": "One original work Arkive has reviewed: provenance, traceability, license, relevance. Maps to source_acceptance_policy.md and license_policy.md.", "type": "object", "additionalProperties": false, "required": ["source_id", "title", "review_status", "license"], "properties": { "source_id": { "type": "string", "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$", "description": "Stable identifier, e.g. src_army-survival-fm21-76." }, "title": { "type": "string", "minLength": 1 }, "authors": { "type": "array", "items": { "type": "string" }, "default": [] }, "publisher": { "type": "string" }, "publication_date": { "type": "string", "description": "ISO date (YYYY-MM-DD) or year (YYYY) when full date is unknown.", "pattern": "^[0-9]{4}(-[0-9]{2}-[0-9]{2})?$" }, "edition": { "type": "string" }, "language": { "type": "string", "description": "BCP-47 / ISO 639 code, e.g. 'en', 'es', 'fr'.", "pattern": "^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$" }, "origin": { "type": "object", "additionalProperties": false, "description": "Where the material came from (traceability).", "properties": { "url": { "type": "string", "format": "uri" }, "acquisition_path": { "type": "string", "description": "How it was obtained if not a plain URL." }, "date_accessed": { "type": "string", "format": "date" } } }, "domains": { "type": "array", "description": "Relevance tags mirroring source_acceptance_policy.md. Promote to meta/vocabularies/domains.yaml when this grows.", "items": { "type": "string", "enum": [ "medicine", "first_aid", "childbirth_womens_health", "water_purification", "sanitation", "food_preservation", "agriculture", "seed_saving", "animal_husbandry", "shelter", "carpentry", "mechanics", "electricity", "radio", "chemistry", "materials_science", "textiles", "fuel_production", "education", "governance", "disaster_response", "mapping", "forestry", "manufacturing", "toolmaking", "mining_metallurgy", "mental_health_community_resilience", "general_reference" ] }, "default": [] }, "safety_critical": { "type": "boolean", "default": false, "description": "True if the source falls in a high-risk domain per the policy; triggers extra review caution." }, "review_status": { "type": "string", "enum": ["pending_review", "accepted", "rejected"], "description": "Acceptance lifecycle, distinct from license status." }, "rejection_reason": { "type": "string", "description": "Required in practice when review_status = rejected; see rejected-source records in the policy." }, "redundancy_group": { "type": "string", "description": "Optional group key linking multiple sources covering the same critical topic." }, "license": { "$ref": "#/$defs/license" }, "notes": { "type": "string" }, "created_at": { "type": "string", "format": "date-time" }, "updated_at": { "type": "string", "format": "date-time" } }, "$defs": { "license": { "type": "object", "additionalProperties": false, "description": "License / redistribution status. Maps directly to license_policy.md. Promote to its own schema later if needed.", "required": ["status", "bundleable"], "properties": { "status": { "type": "string", "enum": [ "redistributable", "redistributable_with_conditions", "metadata_only", "pending_review", "rejected", "unknown" ] }, "name": { "type": "string", "description": "Human license name, e.g. 'Public Domain (US Gov work)', 'CC BY-NC-SA 4.0'." }, "url": { "type": "string", "format": "uri" }, "attribution_text": { "type": "string", "description": "Exact attribution string to preserve, when required." }, "conditions": { "type": "array", "description": "Redistribution conditions from license_policy.md.", "items": { "type": "string", "enum": [ "attribution", "preserve_license_text", "share_alike", "non_commercial", "no_derivatives", "identify_changes", "link_original" ] }, "default": [] }, "evidence": { "type": "array", "description": "License evidence preserved with the record (do not rely on memory).", "items": { "type": "object", "additionalProperties": false, "required": ["type"], "properties": { "type": { "type": "string", "enum": [ "license_in_document", "official_license_url", "publisher_rights_statement", "public_domain_notice", "government_publication", "repository_license_file", "archived_license_page" ] }, "detail": { "type": "string" }, "url": { "type": "string", "format": "uri" }, "date_accessed": { "type": "string", "format": "date" } } }, "default": [] }, "bundleable": { "type": "boolean", "description": "Whether this source may be included in redistributable releases. MUST be false unless status is redistributable or redistributable_with_conditions." } } } } } ``` ## `meta/schemas/document.schema.json` ```json { "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "urn:arkive:schema:document:0.1.0", "title": "Arkive Document Record", "description": "One concrete file artifact tied to a source: the original file or a derived form (extracted text, OCR, cleaned text). Preserves checksum and transformation lineage per source_acceptance_policy.md.", "type": "object", "additionalProperties": false, "required": ["document_id", "source_id", "role", "media_type", "path", "checksum"], "properties": { "document_id": { "type": "string", "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$" }, "source_id": { "type": "string", "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$", "description": "FK to the source record this file belongs to." }, "role": { "type": "string", "enum": ["original", "extracted_text", "ocr_text", "cleaned_text"], "description": "What this file is in the processing lineage." }, "media_type": { "type": "string", "description": "IANA media type, e.g. application/pdf, text/plain, text/html.", "pattern": "^[a-z]+/[a-z0-9.+-]+$" }, "path": { "type": "string", "description": "Repository-relative path under data/, e.g. data/sources/core/army-survival/original.pdf." }, "byte_size": { "type": "integer", "minimum": 0 }, "checksum": { "type": "object", "additionalProperties": false, "required": ["algorithm", "value"], "properties": { "algorithm": { "type": "string", "enum": ["sha256"] }, "value": { "type": "string", "pattern": "^[a-f0-9]{64}$" } } }, "page_count": { "type": "integer", "minimum": 0 }, "language": { "type": "string", "pattern": "^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$" }, "derived_from": { "type": "string", "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$", "description": "document_id this file was derived from (e.g. extracted_text derived from original). Omit for originals." }, "transformation": { "type": "object", "additionalProperties": false, "description": "How a derived file was produced. Omit for role=original.", "required": ["method"], "properties": { "method": { "type": "string", "enum": ["pdf_text_extract", "ocr", "cleanup", "format_convert", "translate"] }, "tool": { "type": "string", "description": "Tool + version, e.g. pdfminer.six 20240706." }, "params": { "type": "object", "description": "Free-form parameters used, for reproducibility." }, "date": { "type": "string", "format": "date-time" } } }, "notes": { "type": "string" }, "created_at": { "type": "string", "format": "date-time" } } } ``` ## `meta/schemas/chunk.schema.json` ```json { "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "urn:arkive:schema:chunk:0.1.0", "title": "Arkive Chunk Record", "description": "One retrievable passage. Carries the text plus enough location data to cite it and open the right page/section. Back-references both the document and the source.", "type": "object", "additionalProperties": false, "required": ["chunk_id", "document_id", "source_id", "sequence", "text"], "properties": { "chunk_id": { "type": "string", "pattern": "^chk_[a-z0-9][a-z0-9_-]{2,79}$" }, "document_id": { "type": "string", "pattern": "^doc_[a-z0-9][a-z0-9_-]{2,63}$", "description": "FK to the text document this chunk was cut from (usually an extracted_text/cleaned_text doc)." }, "source_id": { "type": "string", "pattern": "^src_[a-z0-9][a-z0-9_-]{2,63}$", "description": "FK to the source. Denormalized so citations don't require a document lookup." }, "sequence": { "type": "integer", "minimum": 0, "description": "Order of this chunk within its document." }, "text": { "type": "string", "minLength": 1 }, "char_start": { "type": "integer", "minimum": 0, "description": "Start offset into the document's text, for exact passage display." }, "char_end": { "type": "integer", "minimum": 0 }, "token_count": { "type": "integer", "minimum": 0 }, "location": { "type": "object", "additionalProperties": false, "description": "Human-facing citation anchor — what lets a cited answer open the right page.", "properties": { "page": { "type": "integer", "minimum": 1 }, "section": { "type": "string" }, "heading": { "type": "string" } } }, "embedding_ref": { "type": "string", "description": "Optional pointer to a future vector record. Embeddings are out of scope for v0.1; this field reserves the slot." }, "created_at": { "type": "string", "format": "date-time" } } } ``` --- ## Worked example (grounded in a real candidate) A good first candidate is a **U.S. Army survival field manual**, because U.S. federal government works are public domain in the U.S. under 17 U.S.C. § 105 — a clean, defensible `redistributable` status that's perfect for testing the license fields. *(Caveat to verify during review: make sure the specific PDF is the actual government work, not a commercial reprint that adds copyrighted front-matter or illustrations — exactly the kind of check the schema is meant to force.)* **source record** ```json { "source_id": "src_army-survival-fm21-76", "title": "FM 21-76 Survival", "authors": ["U.S. Department of the Army"], "publisher": "U.S. Department of the Army", "publication_date": "1992", "language": "en", "origin": { "url": "https://example.gov/fm21-76.pdf", "date_accessed": "2026-06-03" }, "domains": ["first_aid", "water_purification", "shelter", "disaster_response"], "safety_critical": true, "review_status": "pending_review", "license": { "status": "redistributable", "name": "Public Domain (U.S. Government work)", "conditions": [], "evidence": [ { "type": "government_publication", "detail": "Work prepared by U.S. Dept. of the Army; public domain in the U.S. under 17 U.S.C. § 105. Verify the specific file is the genuine government work, not a copyrighted reprint.", "date_accessed": "2026-06-03" } ], "bundleable": true }, "notes": "First proof-loop candidate. License reasoning to be confirmed during review before any bundling." } ``` **document records** (original + extracted text) ```json [ { "document_id": "doc_army-survival-original", "source_id": "src_army-survival-fm21-76", "role": "original", "media_type": "application/pdf", "path": "data/sources/core/army-survival/original.pdf", "checksum": { "algorithm": "sha256", "value": "0000000000000000000000000000000000000000000000000000000000000000" }, "page_count": 277, "language": "en" }, { "document_id": "doc_army-survival-text", "source_id": "src_army-survival-fm21-76", "role": "extracted_text", "media_type": "text/plain", "path": "data/sources/core/army-survival/extracted.txt", "checksum": { "algorithm": "sha256", "value": "1111111111111111111111111111111111111111111111111111111111111111" }, "derived_from": "doc_army-survival-original", "transformation": { "method": "pdf_text_extract", "tool": "pdfminer.six 20240706" } } ] ``` **chunk record** ```json { "chunk_id": "chk_army-survival-water-0001", "document_id": "doc_army-survival-text", "source_id": "src_army-survival-fm21-76", "sequence": 142, "text": "Boiling is the safest method of purifying water. Bring water to a rolling boil for at least one minute...", "char_start": 184213, "char_end": 184498, "location": { "page": 169, "section": "Water Procurement", "heading": "Water Purification" } } ``` With those three records, the proof loop is fully expressible: search hits the chunk, the answer cites *FM 21-76, p.169, "Water Purification"*, and the link resolves through `source_id`/`document_id` back to the original PDF and page. ## Acceptance criteria - [ ] Three schema files exist under `meta/schemas/` and are valid JSON Schema draft 2020-12. - [ ] A validator script loads the schemas and validates a record; passing records exit 0, failing records report which field/rule broke. - [ ] The worked-example records above validate cleanly. - [ ] A record with `license.status = "unknown"` but `bundleable = true` is rejected (cross-field rule — may need a small custom check beyond pure JSON Schema). - [ ] README/notes explain the three-record lineage in 1–2 paragraphs. ## Suggested branch `feature/issue-N-metadata-schemas-v0.1` off `develop`, PR into `develop`. (Replace N with the real issue number.)
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Arkive/arkive#3
No description provided.