Adapters
Adapters are the input layer of PikoClaw. Each adapter handles a specific email archive format and normalizes its contents into the universal Message type.
BaseAdapter Interface
Every adapter implements this abstract base class:
    from abc import ABC, abstractmethod
    from pathlib import Path


    class BaseAdapter(ABC):
        @abstractmethod
        def extract(
            self,
            path: str | Path,
            *,
            max_messages: int | None = None,
            verbose: bool = True,
            extract_attachments: bool = False,
        ) -> tuple[list[Message], list[CalendarEvent]]:
            ...

        @abstractmethod
        def can_handle(self, path: str | Path) -> bool:
            ...

        @property
        @abstractmethod
        def format_name(self) -> str:
            ...
Contract:
- extract() returns (messages, calendar_events)
- can_handle() returns True if this adapter can process the given path
- format_name is a short identifier like "pst", "mbox", "maildir"
Auto-Detection Registry
get_adapter(path) in adapters/__init__.py tries each registered adapter in order:
- PSTAdapter -- checks for .pst or .ost extension (lazy import, skipped if pypff is not installed)
- MaildirAdapter -- checks for a directory with cur/new/tmp subdirs or Enron-style message files
- MboxAdapter -- checks for .mbox or .mbx extension
- EmlAdapter -- checks for .eml extension (single message files)
The first adapter whose can_handle() returns True wins.
PST Adapter
File: adapters/pst_adapter.py
Parses Microsoft Outlook PST and OST files using pypff (libpff-python).
Features:
- Recursive folder walking with depth tracking
- MAPI message class detection (IPM.Note = email, IPM.Appointment = calendar, etc.)
- Folder path heuristics for calendar/contact/task classification
- Transport header parsing for Message-ID, In-Reply-To, References
- Conversation-Index parsing (Microsoft's proprietary threading signal)
- Binary attachment extraction (opt-in via extract_attachments)
- Progress reporting to stderr
Key helpers:
- safe_str() -- handles pypff's occasional None returns
- extract_time() -- converts MAPI timestamps to ISO 8601
- safe_body() / safe_html() -- extracts body with encoding fallback
- normalize_subject() -- strips Re:/Fwd:/FW: prefixes for thread grouping
- classify_item() -- determines MessageKind from MAPI class + folder path
Maildir Adapter
File: adapters/maildir_adapter.py
Parses Maildir directories and Enron-style directory trees.
Detection logic:
- Standard Maildir: directory with cur/, new/, tmp/ subdirectories
- Enron-style: directory tree with message files (used for the Enron email corpus)
Features:
- Recursive directory walking
- Uses shared _rfc2822.py parser for message parsing
- Enron X-header support (X-From, X-To, X-Folder, X-Origin stored in extra dict)
- Calendar detection via folder name heuristic
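As a rough illustration of the X-header support, the sketch below collects the four Enron headers into a plain dict using the standard library. The function name and the choice of header names as dict keys are assumptions; the adapter may key its extra dict differently.

```python
from email import message_from_string

ENRON_HEADERS = ("X-From", "X-To", "X-Folder", "X-Origin")


def collect_enron_headers(raw_message: str) -> dict[str, str]:
    """Return the Enron-specific headers that are present in the message."""
    msg = message_from_string(raw_message)
    extra: dict[str, str] = {}
    for name in ENRON_HEADERS:
        value = msg.get(name)  # header lookup is case-insensitive
        if value is not None:
            extra[name] = value.strip()
    return extra
```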
MBOX Adapter
File: adapters/mbox_adapter.py
Parses MBOX files using Python's standard library mailbox.mbox.
Detection logic:
- File with .mbox or .mbx extension
Features:
- Handles Gmail Takeout MBOX exports
- Uses shared _rfc2822.py parser for message content
- stdlib-only (no external dependencies)
Shared RFC 2822 Parser
File: adapters/_rfc2822.py
Shared parsing logic used by the Maildir and MBOX adapters (approximately 80% shared code) and by the EML adapter.
Functions:
| Function | Purpose |
|---|---|
| parse_addresses(header) | Parse RFC 5322 address fields into list[EmailAddress] |
| parse_date(date_str) | Convert RFC 2822 date strings to ISO 8601 |
| normalize_subject(subject) | Strip Re:/Fwd:/FW:/Fw:/fw: prefixes |
| parse_rfc2822_message(msg) | Convert email.message.Message to PikoClaw Message |
| parse_rfc2822_file(path) | Read and parse a single .eml file |
EML Adapter
File: adapters/eml_adapter.py
Parses individual .eml files (RFC 2822 format).
Detection logic:
- File with .eml extension
Features:
- Uses shared _rfc2822.py parser for message parsing
- stdlib-only (no external dependencies)
- Handles single message files exported from any email client
Adding a New Adapter
To add support for a new format (e.g., Google Takeout JSON, Slack export):
- Create adapters/new_adapter.py implementing BaseAdapter
- Implement can_handle() to detect the format
- Implement extract() to return (list[Message], list[CalendarEvent])
- Register in adapters/__init__.py by adding to _ensure_registered()
The adapter only needs to produce the (list[Message], list[CalendarEvent]) pair returned by extract(). Threading, contact aggregation, graph analysis, and output generation are all handled by the pipeline and output generators.