
PikoClaw

📧 → 🧠

Extract institutional knowledge from email archives.

Turn decades of Outlook PST, MBOX, Maildir, EML, and Slack archives into a navigable knowledge base (contacts, threads, graphs, provenance) in a single command.


What You Get

Point PikoClaw at an email archive and it produces:

| Output | Description |
| --- | --- |
| wiki/ | Obsidian-native wiki with [[wikilinks]] for browsing |
| contacts.json | Contact graph with HITS, PageRank, communities, knowledge risk metrics |
| threads.md | Conversation threads grouped by topic |
| network-analysis.md | Communication intelligence: authorities, hubs, influence, risk |
| provenance.json | SHA-256 source hash, tool version, extraction metadata |
| extraction.json | Full structured export with schema validation |
| emails.csv | Tabular export for spreadsheets and BI tools |
| graph.html | Interactive D3 force-directed visualization |
| HTTP API | pikoclaw serve for agent queries and search |

Quick Start

pip install pikoclaw
pikoclaw extract mailbox.pst

Output lands in ./pikoclaw-output/wiki/. That's it.



Why PikoClaw?

When someone leaves your organization, their email doesn't have to leave with them. PikoClaw preserves institutional memory: who knew what, when decisions were made, where expertise lived.

Most teams facing email archive analysis end up writing custom scripts with libpff or mailbox from the stdlib. That works for one-off extractions but falls apart when you need threading, contact graphs, search, provenance, or repeatable output across formats.

PikoClaw turns a week of scripting into a single command.

vs. Rolling Your Own

| | Custom scripts | PikoClaw |
| --- | --- | --- |
| Format support | One script per format | Auto-detects PST, MBOX, Maildir, EML, Slack |
| Threading | In-Reply-To only (breaks on Outlook) | 4-signal union-find: References, Conversation-Index, Gmail ID, subject |
| Contact intelligence | Manual counting | HITS, PageRank, communities, knowledge risk |
| Search | grep or custom index | TF-IDF with date and sender filters (--after, --before, --sender) |
| Provenance | None | SHA-256 source hash + tool version + audit trail |
| Output | Custom JSON | Wiki, JSON, CSV, D3 viz, HTTP API |
| Redaction | Regex one-offs | Built-in PII scrubbing (email, phone, SSN, credit card, IP) |
| Network dependency | Varies | Zero. Air-gapped by default. |
| Testing | Hope and prayer | 259 automated tests (fuzz, benchmarks, integration) |
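
To make the redaction row concrete, here is a minimal sketch of regex-based PII scrubbing for the five categories listed. The patterns, the `PII_PATTERNS` table, and the `redact` function are simplified assumptions for illustration, not PikoClaw's actual rules. Production redaction typically also needs checksum validation for card numbers and locale-aware phone formats.

```python
import re

# Simplified, illustrative patterns (assumptions, not PikoClaw's rules).
# Order matters: earlier replacements remove text that later, looser
# patterns might otherwise match.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CREDIT_CARD", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PHONE", re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b")),
]


def redact(text):
    """Replace each PII match with a bracketed category label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


sample = (
    "Contact alice@example.com or 555-867-5309; SSN 123-45-6789, "
    "card 4111 1111 1111 1111, host 192.168.1.10."
)
print(redact(sample))
```

Replacing matches with category labels rather than deleting them keeps the surrounding sentence readable and makes it obvious to a reviewer what kind of value was removed.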

vs. Commercial Tools

| | E-discovery SaaS | PikoClaw |
| --- | --- | --- |
| Cost | $1000s/month | Free (MIT license) |
| Data custody | Upload to cloud | Local-first. You control it. |
| Privacy | Terms of Service + trust | Air-gapped. No telemetry. No cloud calls. |
| Extensibility | Vendor APIs (if any) | Open source. Python. HTTP API. Agent-ready. |
| Output | Proprietary formats | Obsidian wiki, JSON, CSV, HTML viz |
| LLM integration | Vendor LLM (black box) | Opt-in. Local models first. Full auditability. |

Supported Formats

PikoClaw auto-detects the format. Mix formats in a single command.

| Format | Extension | Dependencies |
| --- | --- | --- |
| Outlook PST/OST | .pst, .ost | libpff-python |
| Maildir | directory | none (stdlib) |
| MBOX | .mbox, .mbx | none (stdlib) |
| EML | .eml | none (stdlib) |
| Slack export | .zip or directory | none (stdlib) |

# Mix formats in a single knowledge base
pikoclaw extract mailbox.pst archive.mbox /path/to/maildir --output ./kb

Key Capabilities

🧡 Understands conversations, not just messages

4-signal threading via union-find groups messages into real conversations, even across Outlook PSTs where References headers are missing and only Conversation-Index exists.

Outcome: See decision threads, not isolated messages. Know who was involved.
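
As a rough illustration of the union-find idea (not PikoClaw's actual implementation), the sketch below merges any two messages that share at least one of the four signals. The message dictionaries, the field names (`id`, `references`, `conversation_index`, `gmail_thread_id`, `subject`), and the subject normalization are assumptions made for the example. It also treats the Conversation-Index as an already extracted conversation key, which simplifies how Outlook encodes it.

```python
import re
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def normalize_subject(subject):
    """Strip reply/forward prefixes so 'RE: Budget' matches 'Budget'."""
    stripped = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I)
    return stripped.strip().lower()


def thread_messages(messages):
    """Group message ids into threads; any shared signal merges two messages."""
    uf = UnionFind()
    for msg in messages:
        node = ("msg", msg["id"])
        uf.find(node)
        # Signal 1: message ids listed in References / In-Reply-To.
        for ref in msg.get("references", []):
            uf.union(node, ("msg", ref))
        # Signals 2-4: conversation key, Gmail thread id, normalized subject.
        for kind, key in (
            ("conv", msg.get("conversation_index")),
            ("gmail", msg.get("gmail_thread_id")),
            ("subj", normalize_subject(msg.get("subject", ""))),
        ):
            if key:
                uf.union(node, (kind, key))
    threads = defaultdict(list)
    for msg in messages:
        threads[uf.find(("msg", msg["id"]))].append(msg["id"])
    return sorted(sorted(ids) for ids in threads.values())


messages = [
    {"id": "a@x", "subject": "Budget 2024"},
    {"id": "b@x", "subject": "RE: Budget 2024", "references": ["a@x"]},
    {"id": "c@x", "subject": "Unrelated reply", "conversation_index": "K1"},
    {"id": "d@x", "subject": "Something else", "conversation_index": "K1"},
    {"id": "e@x", "subject": "Lunch?"},
]
print(thread_messages(messages))
```

Messages c@x and d@x carry no References header at all, yet they still land in one thread through the shared conversation key. Subject matching on its own is aggressive (two unrelated "Hello" threads would merge), so a real implementation would likely constrain it, for example by time window or participants.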

🕸️ Reveals who matters and why

HITS hub/authority scores, PageRank, Louvain community detection, and knowledge concentration risk scoring show you the hidden structure of an organization's communication.

Outcome: Identify key people, silos, and single points of failure before they become crises.
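
For readers unfamiliar with HITS, here is a minimal power-iteration sketch on a directed "sender writes to recipient" graph. It is a generic textbook formulation, not PikoClaw's code. The edge list, the names, the fixed iteration count, and the L2 normalization are all assumptions chosen for the example.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of everyone who writes to you.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of everyone you write to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Each pair is (sender, recipient).
edges = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("alice", "bob"),
]
hub, auth = hits(edges)
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

In this toy graph carol receives mail from three people and ranks as the top authority, while alice writes to the two people who receive mail and ranks as the top hub. A person who is the only strong authority in a topic area is the kind of single point of failure a concentration-risk score is meant to surface.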

🔍 Finds what you're looking for

TF-IDF search index with date-range and sender filters. Query archives with pikoclaw search.

Outcome: Answer "What did we know about X in Q3 2024?" in seconds, not hours.
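
The sketch below shows one way TF-IDF ranking can combine with date and sender filters. It is a simplified stand-in, not the index PikoClaw builds. The document dictionaries, the `search` function, its parameter names, and the smoothed IDF formula are assumptions for illustration.

```python
import math
import re
from collections import Counter
from datetime import date


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(docs, query, after=None, before=None, sender=None):
    """Rank docs by summed TF-IDF of the query terms; date bounds are inclusive."""
    # Document frequency is computed over the whole corpus, not the filtered set.
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc["body"])))

    results = []
    for doc in docs:
        if after and doc["date"] < after:
            continue
        if before and doc["date"] > before:
            continue
        if sender and doc["sender"] != sender:
            continue
        tokens = tokenize(doc["body"])
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            results.append((score, doc["id"]))
    return [doc_id for _, doc_id in sorted(results, reverse=True)]


docs = [
    {"id": 1, "sender": "alice@example.com", "date": date(2024, 8, 12),
     "body": "Vendor contract renewal risk discussed with legal"},
    {"id": 2, "sender": "bob@example.com", "date": date(2024, 9, 3),
     "body": "Lunch menu for the offsite"},
    {"id": 3, "sender": "alice@example.com", "date": date(2025, 1, 20),
     "body": "Contract signed, vendor onboarding starts"},
]

# "What did we know about the vendor contract in Q3 2024?"
print(search(docs, "vendor contract", after=date(2024, 7, 1), before=date(2024, 9, 30)))
```

Applying the filters before scoring keeps the ranking focused on the time window in question, which is what makes a "what did we know then" query different from a plain keyword search.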

📤 Produces output you can use today

Obsidian-native wiki with [[wikilinks]], JSON with schema validation, CSV for spreadsheets, D3 force-directed graph, and an HTTP API for agent integration.

Outcome: Browse in Obsidian, analyze in Excel, query from Python, visualize in the browser.

✅ Proves what it did

SHA-256 source hash, tool version stamp, and warnings list in provenance.json. Every extraction is reproducible and auditable.

Outcome: Chain of custody for regulatory compliance, forensics, and trust.
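
A provenance record of this kind can be sketched with the standard library alone. The field names, the `build_provenance` helper, and the placeholder version string below are hypothetical. The actual schema of provenance.json may differ.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(path, tool_version, warnings=()):
    """Assemble an illustrative provenance record (field names are hypothetical)."""
    return {
        "source": Path(path).name,
        "source_sha256": sha256_file(path),
        "tool_version": tool_version,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "warnings": list(warnings),
    }


# Demonstrate on a throwaway file rather than a real archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.mbox"
    sample.write_bytes(b"From alice@example.com\n\nhello\n")
    record = build_provenance(sample, tool_version="0.0.0-example")
    print(json.dumps(record, indent=2))
```

Anyone holding the original archive can recompute the hash and compare it with the recorded value to confirm that the extraction was run against an unmodified source.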

🔒 Runs anywhere, trusts no one

Zero network dependencies. No telemetry. No cloud calls. Air-gapped by default. All future LLM features are opt-in and support local models.

Outcome: Run on isolated networks, in secure environments, or on a Raspberry Pi. Your data never leaves your control.


Real-World Use Cases

🏒 Organizational offboarding β€” Preserve institutional knowledge when employees leave. Extract decision history, expertise maps, and context.

πŸ”¬ E-discovery & forensics β€” Chain of custody, provenance metadata, and auditable extraction for legal and regulatory compliance.

🏠 Personal archiving β€” Convert decades of Gmail Takeout or Outlook archives into a browsable, searchable personal knowledge base.

🀝 DAO governance β€” Extract decision threads, proposal discussions, and contributor graphs from Slack or Discord exports.

🧠 Agent memory layer β€” Persistent, queryable institutional knowledge for AI agents. PikoClaw is the memory; PicoClaw is the agent.


Conference-Ready

Panathenea 2026: May 27–29, Athens, Greece

PikoClaw will be presented as part of the institutional knowledge preservation and web3 identity track. The demo shows a complete end-to-end extraction, from the Enron corpus to an Obsidian wiki, in under 60 seconds, with provenance attestation and contact graph visualization.

🎥 Demo script available: reproduce the demo on your own machine.

📊 Stress tested: 259 automated tests, including fuzz tests (binary garbage, null bytes, circular refs, 100K headers, XSS, Unicode) and benchmarks (100–5000 message scale).

🔬 Validated against real-world data: Enron corpus (~1.3 GB, ~500K messages), Gmail Takeout archives, Slack exports with 10K+ messages.


Get Started

Installation Guide · Quick Start Tutorial · CLI Reference · Architecture Docs