PikoClaw

📧 → 🧠

Extract institutional knowledge from email archives.

Turn decades of Outlook PST, MBOX, Maildir, EML, and Slack archives into a navigable knowledge base — contacts, threads, graphs, provenance — in a single command.

What You Get

Point PikoClaw at an email archive and it produces:

| Output | Description |
|---|---|
| `wiki/` | Obsidian-native wiki with `[[wikilinks]]` for browsing |
| `contacts.json` | Contact graph with HITS, PageRank, communities, knowledge risk metrics |
| `threads.md` | Conversation threads grouped by topic |
| `network-analysis.md` | Communication intelligence: authorities, hubs, influence, risk |
| `provenance.json` | SHA-256 source hash, tool version, extraction metadata |
| `extraction.json` | Full structured export with schema validation |
| `emails.csv` | Tabular export for spreadsheets and BI tools |
| `graph.html` | Interactive D3 force-directed visualization |
| HTTP API | `pikoclaw serve` for agent queries and search |

Quick Start

pip install pikoclaw
pikoclaw extract mailbox.pst

Output lands in `./pikoclaw-output/wiki/`. That's it.


Why PikoClaw?

When someone leaves your organization, their email doesn't have to leave with them. PikoClaw preserves institutional memory — who knew what, when decisions were made, where expertise lived.

Most teams facing email archive analysis end up writing custom scripts with `libpff` or the standard library's `mailbox` module. That works for one-off extractions but falls apart when you need threading, contact graphs, search, provenance, or repeatable output across formats.

PikoClaw turns a week of scripting into a single command.

vs. Rolling Your Own

| | Custom scripts | PikoClaw |
|---|---|---|
| Format support | One script per format | Auto-detects PST, MBOX, Maildir, EML, Slack |
| Threading | In-Reply-To only (breaks on Outlook) | 4-signal union-find: References, Conversation-Index, Gmail ID, subject |
| Contact intelligence | Manual counting | HITS, PageRank, communities, knowledge risk |
| Search | `grep` or custom index | TF-IDF with temporal filters (`--after`, `--before`, `--sender`) |
| Provenance | None | SHA-256 source hash + tool version + audit trail |
| Output | Custom JSON | Wiki, JSON, CSV, D3 viz, HTTP API |
| Redaction | Regex one-offs | Built-in PII scrubbing (email, phone, SSN, credit card, IP) |
| Network dependency | Varies | Zero. Air-gapped by default. |
| Testing | Hope and prayer | 259 automated tests (fuzz, benchmarks, integration) |

vs. Commercial Tools

| | E-discovery SaaS | PikoClaw |
|---|---|---|
| Cost | $1000s/month | Free (MIT license) |
| Data custody | Upload to cloud | Local-first. You control it. |
| Privacy | Terms of Service + trust | Air-gapped. No telemetry. No cloud calls. |
| Extensibility | Vendor APIs (if any) | Open source. Python. HTTP API. Agent-ready. |
| Output | Proprietary formats | Obsidian wiki, JSON, CSV, HTML viz |
| LLM integration | Vendor LLM (black box) | Opt-in. Local models first. Full auditability. |

Supported Formats

PikoClaw auto-detects the format. Mix formats in a single command.

| Format | Extension | Dependencies |
|---|---|---|
| Outlook PST/OST | `.pst`, `.ost` | `libpff-python` |
| Maildir | directory | none (stdlib) |
| MBOX | `.mbox`, `.mbx` | none (stdlib) |
| EML | `.eml` | none (stdlib) |
| Slack export | `.zip` or directory | none (stdlib) |

# Mix formats in a single knowledge base
pikoclaw extract mailbox.pst archive.mbox /path/to/maildir --output ./kb
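Auto-detection of this kind can usually be done from file signatures and directory layout alone. The sketch below is not PikoClaw's detector: `detect_format` is a hypothetical helper, and the checks it uses (the `!BDN` magic bytes that open PST/OST files, a leading `From ` line for MBOX, `cur`/`new`/`tmp` subdirectories for Maildir, and top-level `channels.json` and `users.json` for a Slack export) are assumptions about what a reasonable detector might look at.

```python
import zipfile
from pathlib import Path

PST_MAGIC = b"!BDN"  # first four bytes of a PST/OST file
SLACK_FILES = {"channels.json", "users.json"}  # assumed Slack export markers


def detect_format(path):
    """Guess the archive format of `path` (hypothetical helper, not PikoClaw's API)."""
    p = Path(path)
    if p.is_dir():
        # Maildir: a directory holding cur/, new/ and tmp/ subdirectories.
        if all((p / sub).is_dir() for sub in ("cur", "new", "tmp")):
            return "maildir"
        # Unzipped Slack export: metadata files sit at the top level.
        if all((p / name).is_file() for name in SLACK_FILES):
            return "slack"
        return "unknown"
    if zipfile.is_zipfile(str(p)):
        with zipfile.ZipFile(str(p)) as zf:
            names = set(zf.namelist())
        return "slack" if SLACK_FILES <= names else "unknown"
    with p.open("rb") as fh:
        head = fh.read(5)
    if head.startswith(PST_MAGIC):
        return "pst"
    if head == b"From ":
        return "mbox"
    # EML has no magic bytes, so fall back to the file extension.
    if p.suffix.lower() == ".eml":
        return "eml"
    return "unknown"
```

Checking content before extensions means a misnamed file (for example an MBOX saved as `.txt`) is still recognized; only EML, which has no signature, relies on its extension here.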

Key Capabilities

🧵 Understands conversations, not just messages

Four threading signals (References, Conversation-Index, Gmail ID, and subject) are merged with a union-find structure, so messages are grouped into real conversations even in Outlook PSTs where References headers are missing and only Conversation-Index exists.

Outcome: See decision threads, not isolated messages. Know who was involved.
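To make the union-find idea concrete, here is a minimal sketch, not PikoClaw's implementation. It assumes each message is a dict with hypothetical keys `id`, `references`, `conversation_index` (already decoded to bytes), `gmail_thread_id`, and `subject`. Any two messages that share at least one signal end up in the same thread.

```python
import re
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure over arbitrary hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def normalize_subject(subject):
    """Strip leading Re:/Fwd:-style prefixes, then collapse case and whitespace."""
    stripped = re.sub(r"^\s*((re|fwd?|aw|wg)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return " ".join(stripped.lower().split())


def thread_messages(messages):
    """Group message ids into threads; field names are illustrative assumptions."""
    uf = UnionFind()
    for msg in messages:
        node = ("msg", msg["id"])
        uf.find(node)  # register the message even if no signal matches
        # Signal 1: Message-IDs listed in References / In-Reply-To.
        for ref in msg.get("references", []):
            uf.union(node, ("msg", ref))
        # Signal 2: Outlook Conversation-Index; its 22-byte header is shared
        # by every message in the conversation (replies append 5-byte blocks).
        conv = msg.get("conversation_index")
        if conv:
            uf.union(node, ("conv", conv[:22]))
        # Signal 3: Gmail thread identifier.
        gmail_id = msg.get("gmail_thread_id")
        if gmail_id:
            uf.union(node, ("gmail", gmail_id))
        # Signal 4: normalized subject line.
        subject = normalize_subject(msg.get("subject", ""))
        if subject:
            uf.union(node, ("subject", subject))
    threads = defaultdict(list)
    for msg in messages:
        threads[uf.find(("msg", msg["id"]))].append(msg["id"])
    return sorted(sorted(ids) for ids in threads.values())
```

Treating each signal value as its own node keeps the merge logic uniform: a referenced Message-ID that is missing from the archive still bridges the messages that cite it. Subject matching on its own is aggressive, so a real implementation would likely restrict it, for example with a time window.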

🕸️ Reveals who matters and why

HITS hub/authority scores, PageRank, Louvain community detection, and knowledge concentration risk scoring show you the hidden structure of an organization's communication.

Outcome: Identify key people, silos, and single points of failure before they become crises.
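As an illustration of two of these metrics, the sketch below computes PageRank and HITS by plain power iteration over a weighted sender-to-recipient graph. It is a simplified stand-in under stated assumptions, not PikoClaw's code: the function names and the `(sender, recipient)` edge format are hypothetical, and community detection and risk scoring are left out.

```python
from collections import defaultdict


def build_graph(edges):
    """edges: iterable of (sender, recipient) pairs, one per delivered message."""
    out = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for sender, recipient in edges:
        out[sender][recipient] += 1  # edge weight = number of messages
        nodes.update((sender, recipient))
    return sorted(nodes), out


def pagerank(nodes, out, damping=0.85, iterations=50):
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            total = sum(out[v].values()) if v in out else 0
            if total == 0:
                # Dangling node (never sends mail): spread its rank evenly.
                for u in nodes:
                    nxt[u] += damping * rank[v] / n
            else:
                for u, w in out[v].items():
                    nxt[u] += damping * rank[v] * w / total
        rank = nxt
    return rank


def _normalized(scores):
    norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
    return {v: s / norm for v, s in scores.items()}


def hits(nodes, out, iterations=50):
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of everyone who writes to you.
        auth = {v: 0.0 for v in nodes}
        for v in out:
            for u, w in out[v].items():
                auth[u] += w * hub[v]
        auth = _normalized(auth)
        # Hub: weighted sum of the authority scores of everyone you write to.
        new_hub = {v: 0.0 for v in nodes}
        for v in out:
            for u, w in out[v].items():
                new_hub[v] += w * auth[u]
        hub = _normalized(new_hub)
    return hub, auth
```

For example, `nodes, out = build_graph([("alice", "bob"), ("carol", "bob")])` followed by `pagerank(nodes, out)` returns a dict of scores that sum to 1, with the most-written-to address ranked highest.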

🔍 Finds what you're looking for

TF-IDF search index with date-range and sender filters (`--after`, `--before`, `--sender`). Query archives with `pikoclaw search`.

Outcome: Answer "What did we know about X in Q3 2024?" in seconds, not hours.
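The idea behind filtered TF-IDF search can be sketched in a few lines. This is an illustrative toy, not PikoClaw's index: the `TfidfIndex` class, the document keys (`id`, `sender`, `date`, `text`), and the choice of inclusive date bounds are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from datetime import date


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """Tiny in-memory TF-IDF index (hypothetical, for illustration only)."""

    def __init__(self, docs):
        # docs: list of dicts with keys id, sender, date (datetime.date), text.
        self.docs = docs
        self.tf = [Counter(tokenize(d["text"])) for d in docs]
        doc_freq = Counter(term for tf in self.tf for term in tf)
        n = len(docs)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

    def search(self, query, after=None, before=None, sender=None, limit=10):
        terms = tokenize(query)
        hits = []
        for doc, tf in zip(self.docs, self.tf):
            # Apply metadata filters before scoring (bounds are inclusive here).
            if after and doc["date"] < after:
                continue
            if before and doc["date"] > before:
                continue
            if sender and doc["sender"] != sender:
                continue
            length = sum(tf.values()) or 1
            score = sum(tf[t] / length * self.idf.get(t, 0.0) for t in terms)
            if score > 0:
                hits.append((score, doc["id"]))
        # Highest score first; ties broken by id for stable output.
        hits.sort(key=lambda h: (-h[0], h[1]))
        return [doc_id for _, doc_id in hits[:limit]]
```

A question like "What did we know about X in Q3 2024?" then maps to `index.search("X", after=date(2024, 7, 1), before=date(2024, 9, 30))`.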

📤 Produces output you can use today

Obsidian-native wiki with `[[wikilinks]]`, JSON with schema validation, CSV for spreadsheets, D3 force-directed graph, and an HTTP API (`pikoclaw serve`) for agent integration.

Outcome: Browse in Obsidian, analyze in Excel, query from Python, visualize in the browser.

✅ Proves what it did

`provenance.json` records the SHA-256 source hash, a tool version stamp, and a list of warnings, so every extraction is reproducible and auditable.

Outcome: Chain of custody for regulatory compliance, forensics, and trust.
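A provenance record of this kind is straightforward to produce and to verify. The sketch below shows one way to hash a source archive in chunks and assemble a record; the function names and JSON field names are hypothetical, and PikoClaw's actual `provenance.json` schema may differ.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(source_path, tool_version, warnings=()):
    """Assemble a provenance record; field names are illustrative assumptions."""
    return {
        "source_path": str(source_path),
        "source_sha256": sha256_file(source_path),
        "tool_version": tool_version,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "warnings": list(warnings),
    }


def verify_source(source_path, record):
    """Return True if the archive on disk still matches the recorded hash."""
    return sha256_file(source_path) == record["source_sha256"]
```

The record serializes directly with `json.dumps(record, indent=2)`. Re-running `verify_source` later confirms the archive has not changed since extraction, which is the core of a chain-of-custody check.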

🔒 Runs anywhere, trusts no one

Zero network dependencies. No telemetry. No cloud calls. Air-gapped by default. All future LLM features are opt-in and support local models.

Outcome: Run on isolated networks, in secure environments, or on a Raspberry Pi. Your data never leaves your control.

Real-World Use Cases

🏢 Organizational offboarding — Preserve institutional knowledge when employees leave. Extract decision history, expertise maps, and context.

🔬 E-discovery & forensics — Chain of custody, provenance metadata, and auditable extraction for legal and regulatory compliance.

🏠 Personal archiving — Convert decades of Gmail Takeout or Outlook archives into a browsable, searchable personal knowledge base.

🤝 DAO governance — Extract decision threads, proposal discussions, and contributor graphs from Slack or Discord exports.

🧠 Agent memory layer — Persistent, queryable institutional knowledge for AI agents. PikoClaw is the memory; PicoClaw is the agent.

Conference-Ready

Panathenea 2026 — May 27–29, Athens, Greece

PikoClaw will be presented as part of the institutional knowledge preservation and web3 identity track. The demo shows a complete end-to-end extraction, from the Enron corpus to an Obsidian wiki, in under 60 seconds, with provenance attestation and contact graph visualization.

🎥 Demo script available — reproduce the demo on your own machine.

📊 Stress tested — 259 automated tests, including fuzz tests (binary garbage, null bytes, circular refs, 100K headers, XSS, Unicode) and benchmarks (100–5000 message scale).

🔬 Validated against real-world data — Enron corpus (~1.3 GB, ~500K messages), Gmail Takeout archives, Slack exports with 10K+ messages.

Get Started

Installation Guide · Quick Start Tutorial · CLI Reference · Architecture Docs