Provenance & Attestation

Every PikoClaw extraction produces provenance.json -- a metadata file that records exactly what was extracted, from where, and when. This supports audit trails, compliance requirements, and on-chain attestation.

provenance.json

{
  "tool": "PikoClaw",
  "version": "0.5.0",
  "extracted_at": "2026-02-23T15:30:00+00:00",
  "source_files": ["mailbox.pst"],
  "source_hash": "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890",
  "source_format": "pst",
  "statistics": {
    "total_messages": 12847,
    "total_emails": 12847,
    "total_calendar_events": 234,
    "total_contacts": 342,
    "total_threads": 4291,
    "multi_message_threads": 1847
  },
  "warnings": []
}

Fields

Field          Description
tool           Always "PikoClaw"
version        PikoClaw version used for extraction
extracted_at   ISO 8601 timestamp in UTC
source_files   List of input file paths
source_hash    SHA-256 digest of the source file(s)
source_format  Format identifier (pst, mbox, maildir, mixed)
statistics     Extraction statistics (emails, contacts, threads, etc.)
warnings       List of issues encountered during extraction
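
As a rough illustration of how a consumer might sanity-check these fields, the sketch below validates a loaded record. The field names and format identifiers come from the table above; the function name check_provenance and the specific rules are assumptions made for this example, not checks that PikoClaw itself is known to perform.

```python
import string
from datetime import datetime, timedelta

# Field names come from the documentation; the expected types are inferred
# from the sample provenance.json and are an assumption of this sketch.
REQUIRED_FIELDS = {
    "tool": str,
    "version": str,
    "extracted_at": str,
    "source_files": list,
    "source_hash": str,
    "source_format": str,
    "statistics": dict,
    "warnings": list,
}
KNOWN_FORMATS = {"pst", "mbox", "maildir", "mixed"}


def check_provenance(record: dict) -> list[str]:
    """Return a list of problems found in a provenance record (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # later checks assume the basic shape is correct

    if record["tool"] != "PikoClaw":
        problems.append("unexpected tool name")
    if record["source_format"] not in KNOWN_FORMATS:
        problems.append(f"unknown source_format: {record['source_format']}")

    # A SHA-256 digest in hex is exactly 64 hexadecimal characters.
    digest = record["source_hash"]
    if len(digest) != 64 or not all(c in string.hexdigits for c in digest):
        problems.append("source_hash is not a 64-character hex digest")

    try:
        stamp = datetime.fromisoformat(record["extracted_at"])
    except ValueError:
        problems.append("extracted_at is not an ISO 8601 timestamp")
    else:
        # Naive timestamps have no offset (None), so they are flagged too.
        if stamp.utcoffset() != timedelta(0):
            problems.append("extracted_at is not in UTC")
    return problems
```

Calling check_provenance on the parsed contents of provenance.json returns an empty list when every check passes.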

Source Hash

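The source_hash is a SHA-256 digest computed over the raw bytes of all input files (sorted by path). For Maildir directories, the hash covers only the directory path and file count, because hashing every file in a large Maildir would be prohibitively slow. As a result, a Maildir hash does not change when message contents change but the path and file count stay the same.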
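
One way such a digest could be computed for regular files is sketched below. It assumes the files are fed into a single SHA-256 state in sorted path order with no separators between them; the text above does not specify how PikoClaw combines files, so this may not reproduce its source_hash for multi-file inputs. The function name combined_sha256 is hypothetical.

```python
import hashlib


def combined_sha256(paths, chunk_size=1 << 20):
    """Hash the raw bytes of all files, visited in sorted path order.

    Assumption for this sketch: file contents are concatenated with no
    separators or per-file metadata. PikoClaw's actual scheme may differ.
    """
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        with open(path, "rb") as handle:
            # Read in chunks so large archives are not loaded into memory.
            while chunk := handle.read(chunk_size):
                digest.update(chunk)
    return digest.hexdigest()
```

For a single input file, this produces the same value as running sha256sum on that file.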

This hash enables:

  • Verification -- Confirm that an extraction result corresponds to a specific source archive
  • Deduplication -- Detect if the same archive has been processed before
  • Attestation -- Anchor the extraction to a cryptographic proof for on-chain or legal use
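
The deduplication use can be sketched with a small in-memory index keyed by source_hash. The helper name record_extraction and the idea of keeping earlier records in a dict are assumptions for illustration; PikoClaw is not known to ship such an index.

```python
def record_extraction(seen: dict, provenance: dict):
    """Register a provenance record in `seen`, keyed by its source_hash.

    Returns the earlier record if this source was already processed,
    otherwise None. `seen` is a plain dict owned by the caller.
    """
    key = provenance["source_hash"]
    earlier = seen.get(key)
    if earlier is None:
        seen[key] = provenance
    return earlier
```

A non-None return value signals that the same archive was extracted before, and the returned record shows when and with which version.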

Multiple Sources

When processing multiple files, source_files lists all inputs and source_format is set to "mixed":

{
  "source_files": ["mailbox.pst", "archive.mbox"],
  "source_hash": "combined-hash-of-all-files",
  "source_format": "mixed"
}

Warnings

The warnings list captures non-fatal issues encountered during extraction:

  • Malformed messages that were skipped
  • Encoding errors in body text
  • Missing or corrupted headers
  • Attachment extraction failures

An empty warnings list means the extraction completed cleanly.
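
A pipeline could use this rule to gate downstream steps. The sketch below assumes each warning is a string or can be rendered as one; the exact shape of warning entries is not documented here, and extraction_status is a hypothetical helper.

```python
def extraction_status(provenance: dict) -> str:
    """Summarize the warnings list of a provenance record.

    Assumption: each entry can be converted to text with str().
    """
    warnings = [str(item) for item in provenance["warnings"]]
    if not warnings:
        return "clean"
    lines = [f"completed with {len(warnings)} warning(s):"]
    lines.extend(f"  - {text}" for text in warnings)
    return "\n".join(lines)
```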

Using Provenance

Audit Trail

# Extract with provenance
pikoclaw extract mailbox.pst --output ./audit-2024

# Verify the source hash later (single-file source)
sha256sum mailbox.pst
# The printed digest should match source_hash in provenance.json
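
The manual comparison can also be scripted. The sketch below covers the single-file case only and assumes provenance.json sits inside the output directory (for example ./audit-2024/provenance.json); that location and the helper names are assumptions, not documented behavior.

```python
import hashlib
import json


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_provenance(provenance_path, source_path) -> bool:
    """Check a single source file against the recorded source_hash.

    Only meaningful for single-file sources; Maildir and multi-file
    inputs are hashed differently according to the Source Hash section.
    """
    with open(provenance_path, encoding="utf-8") as handle:
        recorded = json.load(handle)["source_hash"]
    return file_sha256(source_path) == recorded.lower()
```

For the commands above, the call would be matches_provenance("./audit-2024/provenance.json", "mailbox.pst"), with the first path adjusted to wherever the file was actually written.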

Compliance

The combination of source_hash, extracted_at, and version ties each extraction to a specific source archive, point in time, and tool version, which supports chain-of-custody records. In legal discovery, this helps demonstrate that the output was derived from the stated source archive.

On-Chain Attestation

The source_hash and extraction metadata are designed to be compatible with ERC-8004 attestation flows. The provenance JSON can be hashed and committed to a blockchain or timestamping service to create an immutable record of the extraction.
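
Before a provenance record is committed anywhere, it needs a stable byte representation, since two JSON files with the same content can differ in key order or whitespace. The sketch below hashes a canonical serialization (sorted keys, compact separators, UTF-8). This canonical form is an assumption chosen for illustration; it is not taken from ERC-8004 or from PikoClaw, and an attestation flow may prescribe its own encoding.

```python
import hashlib
import json


def provenance_commitment(provenance: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding.

    Canonical form used here (an assumption of this sketch): keys sorted,
    no insignificant whitespace, UTF-8 encoded.
    """
    canonical = json.dumps(
        provenance,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

The resulting digest is what would be submitted to a blockchain or timestamping service, while provenance.json itself is kept alongside the extraction so the digest can be recomputed later.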