Adapters

Adapters are the input layer of PikoClaw. Each adapter handles a specific email archive format and normalizes its contents into the universal Message type.

BaseAdapter Interface

Every adapter implements this abstract base class:

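from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path

# Message and CalendarEvent are PikoClaw's normalized model types;
# import them from the module where the package defines them.


class BaseAdapter(ABC):

    @abstractmethod
    def extract(
        self,
        path: str | Path,
        *,
        max_messages: int | None = None,
        verbose: bool = True,
        extract_attachments: bool = False,
    ) -> tuple[list[Message], list[CalendarEvent]]:
        ...

    @abstractmethod
    def can_handle(self, path: str | Path) -> bool:
        ...

    @property
    @abstractmethod
    def format_name(self) -> str:
        ...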

Contract:

  • extract() returns (messages, calendar_events)
  • can_handle() returns True if this adapter can process the given path
  • format_name is a short identifier like "pst", "mbox", "maildir"

Auto-Detection Registry

get_adapter(path) in adapters/__init__.py tries each registered adapter in order:

  1. PSTAdapter -- checks for .pst or .ost extension (lazily imported; skipped if pypff is not installed)
  2. MaildirAdapter -- checks for directory with cur/new/tmp subdirs or Enron-style message files
  3. MboxAdapter -- checks for .mbox or .mbx extension
  4. EmlAdapter -- checks for .eml extension (single message files)

The first adapter whose can_handle() returns True wins.
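The first-match rule can be illustrated with a minimal sketch. The names below (SuffixAdapter, REGISTRY, pick_adapter) are hypothetical stand-ins rather than PikoClaw's actual code, the Maildir entry is left out because it inspects directories instead of extensions, and raising ValueError when nothing matches is an assumption.

```python
from pathlib import Path


class SuffixAdapter:
    """Hypothetical stand-in for a real adapter; matches on file extension only."""

    def __init__(self, format_name, suffixes):
        self.format_name = format_name
        self._suffixes = frozenset(suffixes)

    def can_handle(self, path):
        return Path(path).suffix.lower() in self._suffixes


# Order matters: earlier entries are asked first.
REGISTRY = [
    SuffixAdapter("pst", {".pst", ".ost"}),
    SuffixAdapter("mbox", {".mbox", ".mbx"}),
    SuffixAdapter("eml", {".eml"}),
]


def pick_adapter(path):
    """Return the first registered adapter whose can_handle() accepts the path."""
    for adapter in REGISTRY:
        if adapter.can_handle(path):
            return adapter
    # Assumption: the real get_adapter() may signal "no match" differently.
    raise ValueError(f"No adapter can handle {path!r}")
```

Because matching stops at the first hit, an adapter with a broad can_handle() should be registered after more specific ones.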

PST Adapter

File: adapters/pst_adapter.py

Parses Microsoft Outlook PST and OST files using pypff (libpff-python).

Features:

  • Recursive folder walking with depth tracking
  • MAPI message class detection (IPM.Note = email, IPM.Appointment = calendar, etc.)
  • Folder path heuristics for calendar/contact/task classification
  • Transport header parsing for Message-ID, In-Reply-To, References
  • Conversation-Index parsing (Microsoft's proprietary threading signal)
  • Binary attachment extraction (opt-in via extract_attachments)
  • Progress reporting to stderr
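As an illustration of the Conversation-Index signal, the sketch below decodes the header value using the commonly documented layout: a 22-byte header (1 reserved byte, a 5-byte timestamp, a 16-byte conversation GUID) followed by one 5-byte block per reply. The function name and return shape are assumptions; the actual PST adapter may extract different fields.

```python
import base64


def parse_conversation_index(value):
    """Return (conversation_guid_hex, reply_depth), or None if the value is unusable."""
    try:
        raw = base64.b64decode(value)
    except (ValueError, TypeError):
        return None
    if len(raw) < 22:
        return None
    # Bytes 6..21 hold the GUID shared by every message in the conversation.
    guid = raw[6:22].hex()
    # Each reply or forward appends one 5-byte child block after the header.
    depth = (len(raw) - 22) // 5
    return guid, depth
```

Messages that share the same GUID belong to one conversation, and the depth gives a rough position in the reply chain.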

Key helpers:

  • safe_str() -- handles pypff's occasional None returns
  • extract_time() -- converts MAPI timestamps to ISO 8601
  • safe_body() / safe_html() -- extracts body with encoding fallback
  • normalize_subject() -- strips Re:/Fwd:/FW: prefixes for thread grouping
  • classify_item() -- determines MessageKind from MAPI class + folder path
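A classifier along the lines of classify_item() could be sketched as follows. The MessageKind members, the folder keywords, and the precedence (MAPI class first, folder path as fallback) are assumptions made for illustration; only the IPM.* class names are standard MAPI values.

```python
from enum import Enum


class MessageKind(Enum):
    # Hypothetical members; the real enum may differ.
    EMAIL = "email"
    CALENDAR = "calendar"
    CONTACT = "contact"
    TASK = "task"


# MAPI message-class prefixes, matched case-insensitively.
CLASS_PREFIXES = (
    ("ipm.appointment", MessageKind.CALENDAR),
    ("ipm.schedule.meeting", MessageKind.CALENDAR),
    ("ipm.contact", MessageKind.CONTACT),
    ("ipm.task", MessageKind.TASK),
    ("ipm.note", MessageKind.EMAIL),
)

# Folder-name hints used only when the message class is missing or unknown.
FOLDER_HINTS = (
    ("calendar", MessageKind.CALENDAR),
    ("contacts", MessageKind.CONTACT),
    ("tasks", MessageKind.TASK),
)


def classify_item_sketch(message_class, folder_path):
    """Pick a kind from the MAPI class, falling back to the folder path."""
    cls = (message_class or "").lower()
    for prefix, kind in CLASS_PREFIXES:
        if cls.startswith(prefix):
            return kind
    folder = (folder_path or "").lower()
    for hint, kind in FOLDER_HINTS:
        if hint in folder:
            return kind
    return MessageKind.EMAIL
```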

Maildir Adapter

File: adapters/maildir_adapter.py

Parses Maildir directories and Enron-style directory trees.

Detection logic:

  • Standard Maildir: directory with cur/, new/, tmp/ subdirectories
  • Enron-style: directory tree with message files (used for the Enron email corpus)
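The standard Maildir check can be sketched in a few lines. The function name is hypothetical, and the Enron-style fallback is omitted because the text does not spell out its exact rule.

```python
from pathlib import Path


def looks_like_maildir(path):
    """True if path is a directory containing cur/, new/, and tmp/ subdirectories."""
    root = Path(path)
    return root.is_dir() and all(
        (root / sub).is_dir() for sub in ("cur", "new", "tmp")
    )
```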

Features:

  • Recursive directory walking
  • Uses shared _rfc2822.py parser for message parsing
  • Enron X-header support (X-From, X-To, X-Folder, X-Origin stored in extra dict)
  • Calendar detection via folder name heuristic
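Collecting the Enron X-headers into an extra dict might look like the sketch below. The helper name and the exact shape of the dict are assumptions; only the four header names come from the list above.

```python
from email import message_from_string

ENRON_HEADERS = ("X-From", "X-To", "X-Folder", "X-Origin")


def collect_x_headers(msg):
    """Copy the Enron-specific headers that are present into a plain dict."""
    return {name: msg[name] for name in ENRON_HEADERS if msg[name] is not None}


sample = message_from_string(
    "X-From: Alice Example\nX-Folder: Inbox\nSubject: hi\n\nbody\n"
)
extra = collect_x_headers(sample)
```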

MBOX Adapter

File: adapters/mbox_adapter.py

Parses MBOX files using Python's standard library mailbox.mbox.

Detection logic:

  • File with .mbox or .mbx extension

Features:

  • Handles Gmail Takeout MBOX exports
  • Uses shared _rfc2822.py parser for message content
  • stdlib-only (no external dependencies)
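Reading an MBOX file with the standard library can be sketched as follows. The function name and the returned (subject, message_id) tuples are illustrative only; the real adapter hands each message to the shared parser instead.

```python
import mailbox


def read_mbox_headers(path, max_messages=None):
    """Return (subject, message_id) pairs for up to max_messages messages."""
    # create=False raises an error for a missing file instead of creating one.
    box = mailbox.mbox(str(path), create=False)
    results = []
    try:
        for index, msg in enumerate(box):
            if max_messages is not None and index >= max_messages:
                break
            results.append((msg.get("Subject", ""), msg.get("Message-ID", "")))
    finally:
        box.close()
    return results
```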

Shared RFC 2822 Parser

File: adapters/_rfc2822.py

Shared parsing logic used by the Maildir, MBOX, and EML adapters (approximately 80% of the Maildir and MBOX adapter code is shared through this module).

Functions:

| Function | Purpose |
|---|---|
| parse_addresses(header) | Parse RFC 5322 address fields into list[EmailAddress] |
| parse_date(date_str) | Convert RFC 2822 date strings to ISO 8601 |
| normalize_subject(subject) | Strip Re:/Fwd:/FW:/Fw:/fw: prefixes |
| parse_rfc2822_message(msg) | Convert email.message.Message to PikoClaw Message |
| parse_rfc2822_file(path) | Read and parse a single .eml file |
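The first three helpers can be approximated with the standard library, as in the sketch below. These are stand-ins, not the module's code: the address helper returns plain (name, address) tuples instead of EmailAddress objects, and returning None for an unparseable date is an assumption.

```python
import re
from email.utils import getaddresses, parsedate_to_datetime


def parse_addresses_sketch(header):
    """Split an address header into (display_name, address) tuples."""
    return [(name, addr) for name, addr in getaddresses([header or ""]) if addr]


def parse_date_sketch(date_str):
    """Convert an RFC 2822 date string to ISO 8601, or None if it cannot be parsed."""
    if not date_str:
        return None
    try:
        return parsedate_to_datetime(date_str).isoformat()
    except (TypeError, ValueError):
        return None


# One or more leading reply/forward markers, in any letter case.
_PREFIXES = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject_sketch(subject):
    """Strip leading Re:/Fwd:/FW: prefixes so replies group with the original."""
    return _PREFIXES.sub("", subject or "").strip()
```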

EML Adapter

File: adapters/eml_adapter.py

Parses individual .eml files (RFC 2822 format).

Detection logic:

  • File with .eml extension

Features:

  • Uses shared _rfc2822.py parser for message parsing
  • stdlib-only (no external dependencies)
  • Handles single message files exported from any email client
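Parsing a single .eml file with the standard library can be sketched as below. The function name and the returned dict are illustrative; the actual adapter delegates to the shared parser and produces a Message.

```python
from email import policy
from email.parser import BytesParser


def read_eml(path):
    """Parse one .eml file and return a few normalized fields."""
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    # Prefer the plain-text part, falling back to HTML when that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["Subject"] or ""),
        "from": str(msg["From"] or ""),
        "message_id": str(msg["Message-ID"] or ""),
        "body": body_part.get_content() if body_part is not None else "",
    }
```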

Adding a New Adapter

To add support for a new format (e.g., Google Takeout JSON, Slack export):

  1. Create adapters/new_adapter.py implementing BaseAdapter
  2. Implement can_handle() to detect the format
  3. Implement extract() to return (list[Message], list[CalendarEvent])
  4. Register in adapters/__init__.py by adding to _ensure_registered()

The adapter only needs to return the messages (and any calendar events) it extracts. Threading, contact aggregation, graph analysis, and output generation are all handled downstream by the pipeline and output generators.
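The steps above can be sketched with a hypothetical JSON Lines format. Everything here is a stand-in so that the example runs on its own: the Message and CalendarEvent dataclasses, the local BaseAdapter copy, and JsonlAdapter itself are assumptions, and registration in _ensure_registered() is not shown because its body is not documented here.

```python
from __future__ import annotations

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


# Stand-ins for PikoClaw's real types (assumed shapes, for illustration only).
@dataclass
class Message:
    subject: str
    sender: str


@dataclass
class CalendarEvent:
    title: str


class BaseAdapter(ABC):
    @abstractmethod
    def extract(self, path, *, max_messages=None, verbose=True,
                extract_attachments=False): ...

    @abstractmethod
    def can_handle(self, path) -> bool: ...

    @property
    @abstractmethod
    def format_name(self) -> str: ...


class JsonlAdapter(BaseAdapter):
    """Hypothetical adapter: one JSON object per line with subject/from keys."""

    @property
    def format_name(self) -> str:
        return "jsonl"

    def can_handle(self, path) -> bool:
        p = Path(path)
        return p.is_file() and p.suffix.lower() == ".jsonl"

    def extract(self, path, *, max_messages=None, verbose=True,
                extract_attachments=False):
        messages: list[Message] = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if max_messages is not None and len(messages) >= max_messages:
                    break
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                messages.append(
                    Message(record.get("subject", ""), record.get("from", ""))
                )
        # This format carries no calendar data, so the second list is empty.
        return messages, []
```

Returning an empty list for calendar events keeps the (messages, calendar_events) contract intact for formats that carry no calendar data.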