PikoClaw
Extract institutional knowledge from email archives.
Turn decades of Outlook PST, MBOX, Maildir, EML, and Slack archives into a navigable knowledge base (contacts, threads, graphs, provenance) in a single command.
What You Get
Point PikoClaw at an email archive and it produces:
| Output | Description |
|---|---|
| `wiki/` | Obsidian-native wiki with [[wikilinks]] for browsing |
| `contacts.json` | Contact graph with HITS, PageRank, communities, knowledge risk metrics |
| `threads.md` | Conversation threads grouped by topic |
| `network-analysis.md` | Communication intelligence: authorities, hubs, influence, risk |
| `provenance.json` | SHA-256 source hash, tool version, extraction metadata |
| `extraction.json` | Full structured export with schema validation |
| `emails.csv` | Tabular export for spreadsheets and BI tools |
| `graph.html` | Interactive D3 force-directed visualization |
| HTTP API | `pikoclaw serve` for agent queries and search |
Quick Start
Output lands in ./pikoclaw-output/wiki/. That's it.
Get started
Why PikoClaw?
When someone leaves your organization, their email doesn't have to leave with them. PikoClaw preserves institutional memory: who knew what, when decisions were made, where expertise lived.
Most teams facing email archive analysis end up writing custom scripts with libpff or Python's standard-library mailbox module. That works for one-off extractions but falls apart when you need threading, contact graphs, search, provenance, or repeatable output across formats.
PikoClaw turns a week of scripting into a single command.
vs. Rolling Your Own
| | Custom scripts | PikoClaw |
|---|---|---|
| Format support | One script per format | Auto-detects PST, MBOX, Maildir, EML, Slack |
| Threading | In-Reply-To only (breaks on Outlook) | 4-signal union-find: References, Conversation-Index, Gmail ID, subject |
| Contact intelligence | Manual counting | HITS, PageRank, communities, knowledge risk |
| Search | `grep` or custom index | TF-IDF with temporal filters (`--after`, `--before`, `--sender`) |
| Provenance | None | SHA-256 source hash + tool version + audit trail |
| Output | Custom JSON | Wiki, JSON, CSV, D3 viz, HTTP API |
| Redaction | Regex one-offs | Built-in PII scrubbing (email, phone, SSN, credit card, IP) |
| Network dependency | Varies | Zero. Air-gapped by default. |
| Testing | Hope and prayer | 259 automated tests (fuzz, benchmarks, integration) |
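To make the redaction row concrete, here is a minimal sketch of regex-based scrubbing for the five PII categories listed in the table. The patterns, placeholder tokens, and the `scrub` function name are illustrative assumptions, not PikoClaw's actual rules; they cover only common US-style formats.

```python
import re

# Hypothetical patterns, applied in insertion order.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "[CARD]": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),  # 16-digit cards only
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # US-style numbers only
    "[IP]": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),  # IPv4 only
}


def scrub(text):
    """Replace every match of each pattern with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text


sample = (
    "Contact jane.doe@example.com or 555-123-4567; SSN 123-45-6789; "
    "card 4111 1111 1111 1111; host 10.0.0.1."
)
scrubbed = scrub(sample)
print(scrubbed)
```

One-off regexes like these miss international phone formats, IPv6 addresses, and obfuscated addresses, which is the gap the table row points at.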
vs. Commercial Tools
| | E-discovery SaaS | PikoClaw |
|---|---|---|
| Cost | $1000s/month | Free (MIT license) |
| Data custody | Upload to cloud | Local-first. You control it. |
| Privacy | Terms of Service + trust | Air-gapped. No telemetry. No cloud calls. |
| Extensibility | Vendor APIs (if any) | Open source. Python. HTTP API. Agent-ready. |
| Output | Proprietary formats | Obsidian wiki, JSON, CSV, HTML viz |
| LLM integration | Vendor LLM (black box) | Opt-in. Local models first. Full auditability. |
Supported Formats
PikoClaw auto-detects the format. Mix formats in a single command.
| Format | Extension | Dependencies |
|---|---|---|
| Outlook PST/OST | `.pst`, `.ost` | `libpff-python` |
| Maildir | directory | none (stdlib) |
| MBOX | `.mbox`, `.mbx` | none (stdlib) |
| EML | `.eml` | none (stdlib) |
| Slack export | `.zip` or directory | none (stdlib) |
# Mix formats in a single knowledge base
pikoclaw extract mailbox.pst archive.mbox /path/to/maildir --output ./kb
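Format auto-detection of this kind can be sketched from the table above: file extensions for PST, MBOX, and EML, and directory layout for Maildir and Slack. The `detect_format` function and its return labels are hypothetical. The Maildir check relies on the standard `cur`/`new`/`tmp` subdirectories, and the Slack check assumes the export contains a top-level `channels.json`; PikoClaw's real detection logic may differ.

```python
import tempfile
import zipfile
from pathlib import Path

# Extension-to-format map taken from the table above.
EXTENSIONS = {".pst": "pst", ".ost": "pst", ".mbox": "mbox", ".mbx": "mbox", ".eml": "eml"}


def detect_format(path):
    """Guess the archive format from directory layout or file extension."""
    path = Path(path)
    if path.is_dir():
        # A Maildir has cur/, new/, and tmp/ subdirectories.
        if all((path / sub).is_dir() for sub in ("cur", "new", "tmp")):
            return "maildir"
        # Assumption: an unpacked Slack export has channels.json at its root.
        if (path / "channels.json").is_file():
            return "slack"
        return "unknown"
    suffix = path.suffix.lower()
    if suffix == ".zip":
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                if "channels.json" in archive.namelist():
                    return "slack"
        return "unknown"
    return EXTENSIONS.get(suffix, "unknown")


# Build a throwaway Maildir skeleton to exercise the directory branch.
with tempfile.TemporaryDirectory() as root:
    for sub in ("cur", "new", "tmp"):
        (Path(root) / sub).mkdir()
    maildir_result = detect_format(root)

print(maildir_result, detect_format("mailbox.pst"), detect_format("notes.txt"))
```

Extension matching alone is fragile for renamed files, so a production detector would likely also inspect file signatures.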
Key Capabilities
Understands conversations, not just messages
4-signal threading via union-find groups messages into real conversations, even across Outlook PSTs where References headers are missing and only Conversation-Index exists.
Outcome: See decision threads, not isolated messages. Know who was involved.
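The idea behind multi-signal threading can be sketched with a small union-find: every message and every signal value becomes a node, and a message is merged with each signal it carries, so two messages sharing any one signal end up in the same thread. The message dictionary fields (`references`, `conversation_index`, `gmail_thread_id`, `subject`) are hypothetical, as is the assumption that the Conversation-Index is hex-encoded with a 22-byte root prefix. This is not PikoClaw's actual implementation.

```python
import re
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def normalize_subject(subject):
    """Strip leading Re:/Fw:/Fwd: prefixes and lowercase the rest."""
    return re.sub(r"^(\s*(re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I).strip().lower()


def thread_keys(msg):
    """Yield namespaced signal keys; sharing any key joins two messages."""
    for ref in msg.get("references", []):
        yield ("msgid", ref)
    if msg.get("conversation_index"):
        # Assumption: first 22 bytes (44 hex chars) identify the conversation root.
        yield ("convidx", msg["conversation_index"][:44])
    if msg.get("gmail_thread_id"):
        yield ("gmail", msg["gmail_thread_id"])
    subject = normalize_subject(msg.get("subject", ""))
    if subject:
        yield ("subject", subject)


def group_threads(messages):
    uf = UnionFind()
    for msg in messages:
        node = ("msgid", msg["id"])
        uf.find(node)  # register messages that carry no signals
        for key in thread_keys(msg):
            uf.union(node, key)
    threads = defaultdict(list)
    for msg in messages:
        threads[uf.find(("msgid", msg["id"]))].append(msg["id"])
    return sorted(sorted(ids) for ids in threads.values())


messages = [
    {"id": "a", "subject": "Q3 budget"},
    {"id": "b", "subject": "RE: Q3 budget", "conversation_index": "AA" * 22},
    {"id": "c", "subject": "", "conversation_index": "AA" * 22 + "0000000001"},
    {"id": "d", "subject": "Lunch"},
    {"id": "e", "subject": "Unrelated title", "references": ["d"]},
]
threads = group_threads(messages)
print(threads)
```

Message `c` has no subject and no References header, yet it joins `a` and `b` through the shared Conversation-Index prefix, which is the Outlook case described above. Subject matching is the weakest signal and can over-merge unrelated threads with generic subjects, so a real implementation would likely constrain it (for example by time window or participants).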
Reveals who matters and why
HITS hub/authority scores, PageRank, Louvain community detection, and knowledge concentration risk scoring show you the hidden structure of an organization's communication.
Outcome: Identify key people, silos, and single points of failure before they become crises.
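As an illustration of one of these metrics, the sketch below runs HITS by power iteration on a tiny sender-to-recipient edge list. It is a textbook version with hypothetical names and data, assuming unweighted edges; PageRank, Louvain communities, and risk scoring are not shown, and PikoClaw's own computation may differ.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {name: value / norm for name, value in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for directed (sender, recipient) edges."""
    nodes = sorted({name for edge in edges for name in edge})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: total hub score of everyone who writes to you.
        auth = dict.fromkeys(nodes, 0.0)
        for sender, recipient in edges:
            auth[recipient] += hub[sender]
        # Hub: total authority score of everyone you write to.
        hub = dict.fromkeys(nodes, 0.0)
        for sender, recipient in edges:
            hub[sender] += auth[recipient]
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth


edges = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"), ("carol", "alice")]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
print(top_authority)
```

In this toy graph everyone writes to `carol`, so she is the top authority, while the people writing to her score as hubs. A person who is the only strong authority for a community is the kind of single point of failure the risk metrics are meant to surface.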
Finds what you're looking for
TF-IDF search index with date-range and sender filters. Query archives with pikoclaw search.
Outcome: Answer "What did we know about X in Q3 2024?" in seconds, not hours.
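A minimal sketch of TF-IDF ranking combined with date and sender filters is shown below. The `after`, `before`, and `sender` keyword arguments mirror the CLI flags named in the comparison table, but the function names, message fields, inclusive date bounds, and smoothed IDF formula are all assumptions for illustration rather than PikoClaw's actual index.

```python
import math
import re
from collections import Counter
from datetime import date


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(messages):
    """Return one {term: tf-idf weight} dict per message."""
    counts = [Counter(tokenize(m["body"])) for m in messages]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(messages)
    index = []
    for c in counts:
        total = sum(c.values()) or 1
        index.append({
            # Smoothed IDF keeps weights positive for terms found in every message.
            term: (freq / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        })
    return index


def search(messages, index, query, after=None, before=None, sender=None):
    """Rank matching message ids by summed TF-IDF weight of the query terms."""
    terms = tokenize(query)
    hits = []
    for msg, weights in zip(messages, index):
        if after and msg["date"] < after:
            continue
        if before and msg["date"] > before:
            continue
        if sender and msg["sender"] != sender:
            continue
        score = sum(weights.get(term, 0.0) for term in terms)
        if score > 0:
            hits.append((score, msg["id"]))
    return [msg_id for _, msg_id in sorted(hits, reverse=True)]


messages = [
    {"id": "m1", "sender": "alice", "date": date(2024, 8, 14),
     "body": "Vendor contract renewal pricing discussed"},
    {"id": "m2", "sender": "bob", "date": date(2024, 2, 2),
     "body": "Vendor contract draft"},
    {"id": "m3", "sender": "alice", "date": date(2024, 9, 1),
     "body": "Lunch plans for Friday"},
]
index = build_index(messages)
q3_hits = search(messages, index, "vendor contract",
                 after=date(2024, 7, 1), before=date(2024, 9, 30))
print(q3_hits)
```

Filtering before scoring keeps the "Q3 2024" style of question cheap: messages outside the window are never ranked.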
Produces output you can use today
Obsidian-native wiki with [[wikilinks]], JSON with schema validation, CSV for spreadsheets, D3 force-directed graph, and a REST API for agent integration.
Outcome: Browse in Obsidian, analyze in Excel, query from Python, visualize in the browser.
Proves what it did
SHA-256 source hash, tool version stamp, and warnings list in provenance.json. Every extraction is reproducible and auditable.
Outcome: Chain of custody for regulatory compliance, forensics, and trust.
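A provenance record of this shape can be sketched with the standard library alone: hash the source archive in chunks, then record the hash next to the tool version and any warnings. The field names and the `build_provenance` helper are hypothetical; the actual `provenance.json` schema is not shown in this document.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(path, tool_version, warnings=()):
    """Assemble a provenance record (field names are illustrative)."""
    return {
        "source_path": str(path),
        "source_sha256": sha256_file(path),
        "tool_version": tool_version,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "warnings": list(warnings),
    }


# Demonstrate on a throwaway file standing in for a real archive.
with tempfile.NamedTemporaryFile(delete=False, suffix=".mbox") as tmp:
    tmp.write(b"From alice@example.com\n")
record = build_provenance(tmp.name, tool_version="0.0.0-example")
os.unlink(tmp.name)
print(json.dumps(record, indent=2))
```

Re-hashing the archive later and comparing against `source_sha256` is what lets an auditor confirm the extraction ran on an unmodified source.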
Runs anywhere, trusts no one
Zero network dependencies. No telemetry. No cloud calls. Air-gapped by default. All future LLM features are opt-in and support local models.
Outcome: Run on isolated networks, in secure environments, or on a Raspberry Pi. Your data never leaves your control.
Real-World Use Cases
Organizational offboarding: Preserve institutional knowledge when employees leave. Extract decision history, expertise maps, and context.
E-discovery & forensics: Chain of custody, provenance metadata, and auditable extraction for legal and regulatory compliance.
Personal archiving: Convert decades of Gmail Takeout or Outlook archives into a browsable, searchable personal knowledge base.
DAO governance: Extract decision threads, proposal discussions, and contributor graphs from Slack or Discord exports.
Agent memory layer: Persistent, queryable institutional knowledge for AI agents. PikoClaw is the memory; PicoClaw is the agent.
Conference-Ready
Panathenea 2026, May 27–29, Athens, Greece
PikoClaw will be presented as part of the institutional knowledge preservation and web3 identity track. The demo shows a complete end-to-end extraction from the Enron corpus to an Obsidian wiki in under 60 seconds, with provenance attestation and contact graph visualization.
Demo script available: reproduce the demo on your own machine.
Stress tested: 259 automated tests, including fuzz tests (binary garbage, null bytes, circular refs, 100K headers, XSS, Unicode) and benchmarks (100–5,000 message scale).
Validated against real-world data: Enron corpus (~1.3 GB, ~500K messages), Gmail Takeout archives, Slack exports with 10K+ messages.
Get Started
Installation Guide | Quick Start Tutorial | CLI Reference | Architecture Docs