Provenance & Attestation¶
Every PikoClaw extraction produces provenance.json -- a metadata file that records exactly what was extracted, from where, and when. This supports audit trails, compliance requirements, and on-chain attestation.
provenance.json¶
{
  "tool": "PikoClaw",
  "version": "0.5.0",
  "extracted_at": "2026-02-23T15:30:00+00:00",
  "source_files": ["mailbox.pst"],
  "source_hash": "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890",
  "source_format": "pst",
  "statistics": {
    "total_messages": 12847,
    "total_emails": 12847,
    "total_calendar_events": 234,
    "total_contacts": 342,
    "total_threads": 4291,
    "multi_message_threads": 1847
  },
  "warnings": []
}
Fields¶
| Field | Description |
|---|---|
| tool | Always "PikoClaw" |
| version | PikoClaw version used for extraction |
| extracted_at | ISO 8601 timestamp in UTC |
| source_files | List of input file paths |
| source_hash | SHA-256 digest of the source file(s) |
| source_format | Format identifier (pst, mbox, maildir, mixed) |
| statistics | Extraction statistics (emails, contacts, threads, etc.) |
| warnings | List of issues encountered during extraction |
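The field list above can be turned into a simple structural check. The following is a minimal sketch, not part of PikoClaw: the function name check_provenance and the EXPECTED_FIELDS mapping are hypothetical, and the expected types are inferred from the example file rather than from a published schema.

```python
from datetime import datetime, timedelta

# Top-level fields and their JSON types, inferred from the example file
# (an assumption, not an official PikoClaw schema).
EXPECTED_FIELDS = {
    "tool": str,
    "version": str,
    "extracted_at": str,
    "source_files": list,
    "source_hash": str,
    "source_format": str,
    "statistics": dict,
    "warnings": list,
}


def check_provenance(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")

    # extracted_at is documented as ISO 8601 in UTC, so require a zero offset.
    # Note: datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    if isinstance(record.get("extracted_at"), str):
        try:
            stamp = datetime.fromisoformat(record["extracted_at"])
        except ValueError:
            problems.append("extracted_at: not an ISO 8601 timestamp")
        else:
            if stamp.utcoffset() != timedelta(0):
                problems.append("extracted_at: not in UTC")
    return problems
```

A caller would typically load the file with json.load and treat a non-empty result as a reason to stop before relying on the record.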
Source Hash¶
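The source_hash is a SHA-256 digest computed over the raw bytes of all input files (sorted by path). For Maildir directories, the hash covers only the directory path and file count, because hashing every file in a large Maildir would be prohibitively slow. As a consequence, a Maildir hash does not change when message contents change but the path and file count stay the same.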
This hash enables:
- Verification -- Confirm that an extraction result corresponds to a specific source archive
- Deduplication -- Detect if the same archive has been processed before
- Attestation -- Anchor the extraction to a cryptographic proof for on-chain or legal use
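The file-based hashing described above can be sketched as follows. This is an illustration under assumptions, not PikoClaw's implementation: the function name combined_sha256 is hypothetical, and the sketch assumes the file bytes are fed into a single SHA-256 state back to back, with no paths, lengths, or separators mixed in. The text does not specify those details, so a digest produced this way may differ from source_hash for multi-file inputs.

```python
import hashlib
from pathlib import Path


def combined_sha256(paths: list[str], chunk_size: int = 1 << 20) -> str:
    """Hash the raw bytes of every file, visiting files in sorted path order."""
    digest = hashlib.sha256()
    # Sorting makes the result independent of the order the caller lists files in.
    for path in sorted(paths):
        with Path(path).open("rb") as handle:
            # Stream in chunks so large archives are never loaded into memory whole.
            while chunk := handle.read(chunk_size):
                digest.update(chunk)
    return digest.hexdigest()
```

For a single input file, this reduces to the plain SHA-256 of that file, which is the same value sha256sum prints.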
Multiple Sources¶
When processing multiple files, source_files lists all inputs and source_format is set to "mixed":
{
  "source_files": ["mailbox.pst", "archive.mbox"],
  "source_hash": "combined-hash-of-all-files",
  "source_format": "mixed"
}
Warnings¶
The warnings list captures non-fatal issues encountered during extraction:
- Malformed messages that were skipped
- Encoding errors in body text
- Missing or corrupted headers
- Attachment extraction failures
An empty warnings list means the extraction completed cleanly.
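A downstream pipeline might use the warnings list as a gate before accepting an extraction. The sketch below assumes only that warnings is a JSON array, as in the example file; the function name extraction_is_clean is hypothetical, and the shape of individual warning entries is not specified here, so they are printed as-is.

```python
import json
from pathlib import Path


def extraction_is_clean(provenance_path: str) -> bool:
    """Return True when the provenance file records no warnings."""
    record = json.loads(Path(provenance_path).read_text(encoding="utf-8"))
    # A missing key raises KeyError on purpose: every provenance.json is
    # expected to carry a warnings list, so its absence is itself a problem.
    warnings = record["warnings"]
    for warning in warnings:
        print(f"warning: {warning}")
    return len(warnings) == 0
```

Whether a non-empty list should block further processing is a policy decision; skipped messages may be acceptable for analysis but not for legal use.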
Using Provenance¶
Audit Trail¶
# Extract with provenance
pikoclaw extract mailbox.pst --output ./audit-2024
# Verify the source hash later
sha256sum mailbox.pst
# Compare with provenance.json source_hash
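The manual comparison in the last step can be automated. The following sketch covers the single-file case only and makes two assumptions worth checking: that provenance.json is written inside the output directory (here ./audit-2024/provenance.json), and that source_hash for one file equals that file's plain SHA-256. The names sha256_file and verify_source are hypothetical helpers, not PikoClaw commands.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source(archive_path: str, provenance_path: str) -> bool:
    """Check that the archive on disk matches the hash recorded at extraction time."""
    record = json.loads(Path(provenance_path).read_text(encoding="utf-8"))
    # Compare case-insensitively, since hex digests may be written in either case.
    return sha256_file(archive_path) == record["source_hash"].lower()
```

Under those assumptions, a call such as verify_source("mailbox.pst", "./audit-2024/provenance.json") returns False if the archive has been modified since extraction.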
Compliance¶
Together, source_hash, extracted_at, and version record which archive was processed, when, and by which PikoClaw version. In legal discovery, this record can support a chain-of-custody argument by tying the extraction output to a specific, verifiable source archive.
On-Chain Attestation¶
The source_hash and extraction metadata are designed to be compatible with ERC-8004 attestation flows. The provenance JSON can be hashed and committed to a blockchain or timestamping service to create an immutable record of the extraction.
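Hashing the provenance JSON only yields a stable commitment if the bytes being hashed are reproducible. The sketch below is one way to do that, under clearly labeled assumptions: the canonical form (sorted keys, compact separators, UTF-8) and the choice of SHA-256 are illustrative and are not prescribed by PikoClaw or by ERC-8004 as far as this page states. The function name provenance_digest is hypothetical, and submitting the digest to a chain or timestamping service is out of scope.

```python
import hashlib
import json


def provenance_digest(record: dict) -> str:
    """Return a SHA-256 hex digest of a deterministic serialization of the record."""
    # Sorted keys and fixed separators make the output independent of key order
    # and whitespace in the original file (an assumed canonical form).
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Whoever later verifies the commitment must apply the same serialization rules, so the chosen canonical form should be recorded alongside the digest.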