Roadmap¶
Last updated: 2026-02-24 (v0.5.0)
Maintainer: @fuzzywigg / smtp.eth
Conference deadline: Panathenea — May 27–29, 2026, Athens, Greece
PikoClaw is evolving from a PST/OST extractor into a universal institutional memory layer — the piece of your stack that turns decades of archived communication into structured, queryable, agent-readable knowledge.
How to Read This Document¶
This roadmap is organized into four time horizons. Each item includes a priority tag, status, and the source of the requirement (partner feedback, community input, or internal planning).
| Horizon | Timeframe | Goal |
|---|---|---|
| NOW | Feb–May 2026 | Ship for conference demo |
| NEXT | Jun–Sep 2026 | Post-conference hardening |
| LATER | Q4 2026–Q1 2027 | Product expansion |
| VISION | 2027+ | Full ecosystem play |
Status legend: [ ] not started · [~] in progress · [x] done · [—] deferred
NOW — Ship for Panathenea (Feb–May 2026)¶
Conference deadline: May 27–29, 2026 — Athens, Greece
The immediate focus is validation, stress testing, and polish for the conference presentation.
1. Core Pipeline Consolidation¶
Source: Internal — models.py + CONSOLIDATION-PLAN.md
- [x] Install canonical data models (
models.py) as single source of truth - [x] Create adapter base class; convert PST/Maildir/MBOX/EML extractors to adapters
- [x] Build core pipeline:
Source → Adapter → Models → OutputGenerators - [x] Rewire CLI entry point to use new pipeline
- [ ] Verify end-to-end with Enron dataset
2. Provenance & Trust (5 lines, enormous credibility)¶
Source: Gap Analysis #5; Partner Q17
- [x] SHA-256 source hash in
provenance.json - [x] Tool version stamp
- [x] Add
warningslist to ExtractionResult - [x] Add
provenance.jsontobuild_result()output - [x] Generate
provenance.jsonon every extraction run
3. Conversation-Index Header Parsing¶
Source: Gap Analysis #1 — CRITICAL for threading quality
- [x] Parse
Conversation-Indexheader inpst_adapter.py - [x] Feed parsed index into union-find threading alongside References/In-Reply-To
- [x] Conversation-Index root extraction (first 22 bytes) used in union-find as Signal #2
- [ ] Test against Exchange-generated PSTs (Enron has these)
4. Stress Testing & Honest Limits¶
Source: Partner Q4 — "What's the largest PST you've tested?"
- [x] Enron corpus (~1.3 GB, ~500K messages) — works
- [ ] Test with 5 GB synthetic PST — document memory ceiling
- [ ] Test with 10 GB PST — document failure mode
- [ ] Document results in README: "Tested up to X GB / Y messages"
5. Demo Recording (Fallback)¶
Source: Internal — conference prep
- [ ] Record clean screen capture:
pip install pikoclaw && pikoclaw extract enron.pst - [ ] Show wiki output in Obsidian with wikilinks working
- [ ] Show
contacts.jsonnetwork metrics - [ ] Show
provenance.jsonwith SHA-256 hash - [ ] 60 seconds or less, no dependencies on venue WiFi
6. Docs Site Polish¶
Source: Docs review — nft2-me.github.io/PikoClaw
- [x] MkDocs Material site deployed
- [x] Getting Started / User Guide / Architecture nav structure
- [x] Architecture sub-pages written (Overview, Data Model, Adapters, Pipeline)
- [x] Add hero section to landing page
- [x] Fill Roadmap page with content from ROADMAP.md
- [x] Add "Why PikoClaw?" comparison section (vs. rolling your own + vs. commercial tools)
- [x] Tighten "Key Capabilities" — lead with outcomes, not mechanisms
7. Open-Source Presence¶
Source: Internal — credibility for Athens
- [x] Public GitHub repo at nft2-me/PikoClaw
- [x] Clean README with install instructions, example output, badges
- [ ] Working
pip install pikoclaw→ wiki output in under 60 seconds - [x] Add GitHub topics:
email,knowledge-extraction,pst,institutional-knowledge,mbox,maildir,outlook,forensics,email-archiving - [ ] 100+ stars target (engage PicoClaw community with "memory claw" positioning)
7a. CLI & Testing Infrastructure ✅¶
Source: Internal — operational maturity for v0.5.0
- [x]
pikoclaw infocommand: version, Python version, installed deps, adapter/feature inventory - [x]
pikoclaw statscommand: extraction statistics fromextraction.json - [x]
pikoclaw validatecommand: schema validation ofextraction.json - [x] CSV export (
--csv): emails.csv, contacts.csv, threads.csv viaCSVGenerator - [x]
-v/--verboseand-q/--quietlogging flags (DEBUG/INFO/WARNING levels) - [x] Temporal search:
--after/--beforedate-range filters onpikoclaw search - [x] Fuzz test suite: 41 tests — binary garbage, null bytes, circular refs, 100K headers, XSS, Unicode
- [x] Benchmark suite: 18 tests — pipeline/threading/output timing at 100–5000 message scale
- [x] Search robustness: graceful handling of very small document sets and all-stop-word edge cases
NEXT — Post-Conference Hardening (Jun–Sep 2026)¶
8. Ecosystem Positioning: "The Memory Claw"¶
Source: Partner Q1–Q2 — PicoClaw ecosystem play
PicoClaw (Sipeed, 12K+ stars) is the lightweight edge agent. PikoClaw is its long-term, auditable memory store. The Claw ecosystem also includes OpenClaw, MimiClaw, ZeroClaw, and NanoClaw. PikoClaw occupies a unique layer: persistent institutional knowledge that any agent can query.
- [ ] Publish positioning blog post: "PicoClaw is the agent, PikoClaw is its memory"
- [x] Add
pikoclaw serve --port 8080HTTP API for agent queries - [ ] Define simple query protocol:
GET /api/search?q=budget+decisions+Q3 - [ ] Test PikoClaw → PicoClaw integration on localhost
9. Contact Graph Intelligence ✅¶
Source: Gap Analysis #2 — real but not demo-critical
- [x] NetworkX integration for HITS scores, Louvain communities, degree centrality
- [x]
contacts.jsonoutput with full graph metrics - [x] Knowledge concentration risk score: flag when >X% of communication flows through single node
10. Docker Image¶
Source: Partner Q19 — install friction is real
- [x]
Dockerfilewith libpff pre-compiled - [ ]
docker run -v ./data:/data pikoclaw extract /data/mailbox.pst - [ ] Publish to GitHub Container Registry (ghcr.io)
- [ ] Eliminates the #1 install friction point (libpff compilation)
11. Incremental / Delta Extraction ✅¶
Source: Partner Q6
- [x] Hash-based dedup: skip messages where
message_idalready exists in output - [x]
--incrementalflag that reads existingextraction.jsonand only processes new messages - [x] Timestamp watermark in
provenance.jsonfor last extraction run
12. Interactive Graph Visualization ✅¶
Source: Partner Q13
- [x] Force-directed D3 graph exported as standalone HTML
- [ ] Search bar for contacts/threads
- [x] Node size = message count, color = Louvain community
- [x] Ship as
pikoclaw vizcommand
13. PII Scrubbing / Redaction Mode ✅¶
Source: Partner Q16 — first enterprise feature
- [x]
--redactflag with configurable entity types (email, phone, SSN, credit_card, ip) - [x] Regex-based scrubbing with Luhn/SSN validation (no LLM dependency)
- [x] Produces redacted wiki + audit log (audit via provenance warnings; dedicated log planned)
- [ ] spaCy NER upgrade for name redaction
- [ ] GDPR/CCPA "right to be forgotten" workflow documentation
LATER — Product Expansion (Q4 2026–Q1 2027)¶
Design principle
Core extraction stays deterministic and auditable. No LLM dependency in the core pipeline. All LLM features are opt-in flags routed through a single local endpoint.
14. Model Routing Spine¶
Source: Perplexity/CC feedback §1 — critical for any LLM-powered features
When PikoClaw adds optional LLM features (summarization, embeddings, RAG), it needs a routing layer that abstracts provider choice.
- [ ] Vendor LLM proxy (LiteLLM, ScalePortal/llm-router, or similar)
- [ ] Single endpoint:
http://127.0.0.1:4801/v1/chat/completions - [ ] Routing policies: light tasks → cheap/fast model, heavy tasks → capable model
- [ ] Request metadata:
intent,surface,importancefields - [ ] Cost + latency instrumentation per surface (inbox, calendar, discord)
Design principle: PikoClaw never calls providers directly. All LLM traffic goes through one local URL. Core extraction pipeline remains zero-LLM.
15. CC-Style Inbox/Calendar Agent¶
Source: Perplexity/CC feedback §2, §4 — Google CC (Dec 2025) as reference implementation
Google CC is an inbox-native AI agent that scans Gmail/Calendar/Drive and sends "Your Day Ahead" briefings. PikoClaw can offer a privacy-first, self-hosted alternative.
15a. Google API Connectors¶
- [ ] Gmail connector:
list_unread_threads,get_thread,label_thread - [ ] Calendar connector:
list_events(time_range),create_event - [ ] OAuth token management with local encrypted store
- [ ] Scoped tokens: Gmail + Calendar only, single account
- [ ] Config file:
pikoclaw_cc.yamllisting watched labels/folders
15b. Daily Briefing Workflow¶
- [ ] "Your Day Ahead" skill: summarize schedule + key emails + action items
- [ ] Output: email digest and/or Discord post
- [ ] Sections: Today's meetings, Tasks, Waiting on others, Alerts
- [ ] Ambient polling: cron or internal scheduler, 7-8am briefing
15c. Inbox Classification & Labeling¶
- [ ] Labels: urgent, important, reference, automated, travel, billing, social
- [ ] Store classifications in Gmail labels or local db
- [ ] Pattern: clone Relay.app "Inbox AI Assistant" template approach
15d. Event & Thread Linking¶
- [ ] Detect "event-worthy" emails (birthday planning, trip, meeting)
- [ ] Propose/create calendar events with summary and link
- [ ] Skills:
create_event,plan_thread_summary
16. Discord "Geryon" Surface¶
Source: Perplexity/CC feedback §3 — concrete Discord integration
Geryon = Discord persona. PikoClaw = brain. Proxy = model router.
- [ ] Minimal Discord bot service (discord.py or similar)
- [ ] Env:
DISCORD_BOT_TOKEN,GUILD_ID, allowed channel IDs - [ ] Webhook:
POST /discord-event→ PikoClaw - [ ] Channel routing:
#planning,#birthdays,#inbox-cc - [ ] Bot replies in threads under triggering message
- [ ] Mirror inbox/calendar context: "when is Katie's birthday thing?" → PikoClaw lookup
17. CC Skill Bundle¶
Source: Perplexity/CC feedback §5 — standardize all agent behaviors as skills
- [ ] Define skill bundle in PikoClaw's skills system:
cc_daily_briefingcc_inbox_triagecc_event_planning(birthdays, trips)cc_discord_reply(formatting & tone for Discord)- [ ] Standardize metadata for every model call:
surface,intent,importance,deadline - [ ] Router uses metadata to pick models and log structured metrics
18. Optional LLM Integration (Pluggable, Never Core)¶
Source: Partner Q9, Q10, Q11
Design principle: Core pipeline stays deterministic and auditable. No LLM dependency. All LLM features are opt-in flags.
- [ ]
--summarize: Executive summaries per thread/contact/quarter via local model (Ollama) or API - [ ]
--embed: Vector embeddings output → Chroma/LanceDB folder alongside wiki - [ ]
--risk-report: Anomaly detection — knowledge concentration risk, communication drop-offs - [ ] All LLM calls go through model routing spine (§14)
19. Additional Adapter Formats¶
Source: Partner Q20
- [x] Gmail Takeout (MBOX) — works today
- [x] Apple Mail (MBOX) — works today
- [x] EML files — works today
- [ ] Google Takeout native format (structured JSON, not just MBOX)
- [x] Slack export (ZIP + directory) —
SlackAdapterwith user resolution, thread grouping, 26 tests - [ ] Microsoft Teams export
- [ ] ProtonMail export (MBOX — should work, needs testing)
20. Packaging Improvements¶
Source: Partner Q19
- [x]
pip install pikoclaw(with libpff friction on PST) - [x] Docker image (see §10)
- [ ] Pre-built wheels for macOS (arm64 + x86_64) and Windows
- [ ] PyOxidizer / Nuitka single binary option
- [ ] Maildir/MBOX-only mode: zero native deps, runs anywhere
VISION — Full Ecosystem Play (2027+)¶
21. Organizational Brain¶
Source: Partner Q25, Q26
Multiple people's archives → unified org knowledge graph with:
- Per-source attribution and version history
- Role-based access controls
- Agentic querying via HTTP API (PikoClaw serve)
- Temporal querying: "What decisions were made about X between March and June?"
22. Living CRM Seed¶
Source: Partner Q27
Communication graph → auto-populated CRM:
contacts.jsonalready maps to Salesforce, HubSpot, MS Dynamics field schemas- Communication frequency, recency, and relationship strength scores
- Auto-sync to CRM via standard APIs
23. ERC-8004 On-Chain Attestation¶
Source: Partner Q3 — Panathenea thesis, post-conference development
- [ ] Merkle-tree proofs of extraction outputs
- [ ] IPFS pinning of provenance metadata
- [ ] Soulbound tokens for "verified institutional handoff"
- [ ] Integration with ENS/smtp.eth identity
24. Academic Publication¶
Source: Partner Q28
The 3-signal threading algorithm (References chain + Conversation-Index + normalized subject, unified via union-find) is genuinely novel synthesis work.
- [ ] Open-source threading + graph algorithms as standalone library
- [ ] Short paper describing the approach and benchmarks
- [ ] Comparative evaluation against existing threading implementations
25. PikoClaw + PicoClaw Demo Kit¶
Source: Partner Q29 — community challenge
One command, fresh Raspberry Pi, both running, talking to each other:
- PicoClaw as the interactive agent
- PikoClaw as the persistent knowledge store
- HTTP API connecting them
- Community challenge: first person to build it gets featured prominently
26. Governance, Safety & Failure Modes¶
Source: Perplexity/CC feedback §6
- [ ] No autonomous external email/DM sending without human confirmation
- [ ] Rate limits per channel/user for agent surfaces
- [ ] Graceful degradation: "I can't see your calendar right now" vs. silent failure
- [ ] Structured logging:
input → model → actionfor replay and prompt refinement - [ ] All LLM features support zero-trust/air-gapped mode (local models via Ollama)
General Observations¶
Existing Search UI
A V1 of the search page (web/src/app/search/page.tsx) already exists in the main branch. This needs to be reconciled with the planned API work for search to ensure a consistent and integrated user experience.
Frequently Asked Questions¶
These answers correspond to the 29-question assessment from our support partner.
Naming & Positioning (Q1-Q3)¶
Q1: Is "PikoClaw" a deliberate play on PicoClaw?
Independent origin, serendipitous timing. PicoClaw (Sipeed, 12K+ stars) is the lightweight edge agent. PikoClaw is the long-term memory store. Different layer of the stack.
Q2: Open to "the memory claw" positioning?
Yes. This is the plan. See §8.
Q3: How far into web3/ENS/smtp.eth?
Provenance ships now (SHA-256 + tool version). ERC-8004 attestation is the post-conference thesis (§23). Merkle proofs and IPFS are roadmap, not pre-conference.
Parsing Robustness (Q4-Q7)¶
Q4: Largest PST tested?
Enron corpus (~1.3 GB, ~500K messages). Stress testing to 5-10 GB is a pre-conference TODO (§4).
Q5: Password-protected PSTs, TNEF, corrupted items?
Password-protected PSTs supported (--password, --skip-protected). Corrupted items skip with warnings and SHA-256 fingerprints in provenance. TNEF/winmail.dat is on the roadmap.
Q6: Incremental extraction?
--incremental flag skips messages whose message_id already exists in previous extraction output. See §11.
Q7: Multi-user/shared mailbox?
Multiple archives in a single command already work. Org-level access controls and per-user attribution are future (§21).
Intelligence Layer (Q8-Q11)¶
Q8: Next intelligence layer?
Temporal analysis (shipped: --after/--before search), contact intelligence, topic clustering, and knowledge concentration risk scoring. See §7a, §9, §18.
Q9: Auto-summarization?
Optional, pluggable, never core. --summarize flag with local model support. See §18.
Q10: RAG-ready output?
TF-IDF search index ships today (pikoclaw search). --embed flag producing Chroma/LanceDB folder planned. See §18.
Q11: Anomaly/risk detection?
Knowledge concentration risk score computed in graph analysis. Full --risk-report flag planned (§18).
Output Experience (Q12-Q15)¶
Q12: Wiki in Obsidian?
Designed for it. Internal [[wikilinks]] work natively.
Q13: Interactive graph viz?
Shipped as pikoclaw viz command with D3 force-directed graph. See §12.
Q14: pikoclaw serve?
Shipped. HTTP API with search, contacts, emails, threads, provenance endpoints. See §8.
Q15: Attachment handling?
Attachments extracted with metadata (--extract-attachments). Inline previews, OCR, base64 embedding are future features.
Security & Privacy (Q16-Q18)¶
Q16: PII scrubbing?
Foundation shipped. --redact flag with regex-based scrubbing for email, phone, SSN, credit card, IP. spaCy NER upgrade planned. See §13.
Q17: Chain-of-custody?
provenance.json ships now with SHA-256 hash, tool version, warnings. Digital signatures and EDRM export are on radar.
Q18: Air-gapped mode?
This is the default. Zero network dependencies. No telemetry. No cloud calls. Any future LLM features are opt-in and support local models first.
Deployment (Q19-Q21)¶
Q19: Install friction?
Docker image ships with libpff pre-compiled. See §10, §20.
Q20: Other tool integrations?
Gmail Takeout, Apple Mail (MBOX), EML, and Slack exports work today. Teams adapter planned. See §19.
Q21: Raspberry Pi?
Core Python runs fine on ARM. Maildir/MBOX mode (no PST/libpff) runs on a Pi today. See §25 for the PikoClaw+PicoClaw demo kit community challenge.
Business & Sustainability (Q22-Q24)¶
Q22: Monetization?
Core stays MIT. Enterprise features (password PSTs, PII redaction, SSO, hosted RAG, compliance exports) are the commercial layer. No timeline -- community first.
Q23: Target personas?
1) Offboarding/institutional knowledge consultants, 2) Personal archivists & homelab tinkerers, 3) Corporate knowledge managers, 4) E-discovery pros, 5) Web3 DAOs.
Q24: Success metrics (3-6 months)?
100+ GitHub stars, 3+ real-world extractions on non-Enron data, working 60-second demo, Panathenea presentation, one published write-up.
Vision (Q25-Q29)¶
Q25: 2027 version?
Multi-source knowledge graph with temporal querying, agentic access, optional NL queries. See §21.
Q26: Organizational brain?
Yes. Multiple archives → unified org graph with access controls and versioning. See §21.
Q27: Living CRM seed?
Data model already maps to Salesforce/HubSpot/Dynamics schemas. See §22.
Q28: Academic paper?
Yes. 3-signal threading algorithm is novel synthesis work. See §24.
Q29: PikoClaw + PicoClaw demo kit?
Community challenge. See §25.
Reference Links¶
| Resource | URL |
|---|---|
| PikoClaw repo | github.com/nft2-me/PikoClaw |
| PikoClaw docs | nft2-me.github.io/PikoClaw |
| PicoClaw (Sipeed) | github.com/sipeed/picoclaw |
| Enron dataset | cs.cmu.edu/~enron |
| Panathenea 2026 | panathenea.org |