Extraction Process

PikoClaw is designed to handle a variety of common email archive formats, processing them through a unified pipeline to build a knowledge graph.

Supported Formats

PST (Personal Storage Table): The format used by Microsoft Outlook.
MBOX (Mailbox): A common format used by many email clients, including Thunderbird and Apple Mail.
EML (Email): Individual email message files.

Each archive passes through the following pipeline stages:

Adapter Selection: Based on the file extension, PikoClaw selects the appropriate adapter to read the archive.
Message Parsing: Each message is parsed to extract key metadata (sender, recipients, subject, date) and content (body text, HTML).
PII Redaction: The body content is scanned for Personally Identifiable Information (PII) such as email addresses, phone numbers, and Social Security numbers (SSNs), and each match is replaced with a placeholder.
Topic Clustering: The redacted text is fed into a machine learning model to identify and group messages into topics.
Contact Intelligence: A graph of all unique contacts is built, deduplicating entries and linking individuals.
Storage: The original file is archived in R2 storage, and the extracted metadata, contacts, and topics are stored in a D1 database for querying via the API.
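
The first three stages can be sketched with the Python standard library. This is an illustrative sketch only, not PikoClaw's actual implementation: the names select_adapter, parse_message, and redact, the adapter labels, the regular expressions, and the placeholder strings ([EMAIL], [PHONE], [SSN]) are all assumptions made for this example. The patterns cover only simple US-style phone numbers and SSNs, and the later stages (topic clustering, contact graph, storage) are omitted because they depend on details not given here.

```python
import re
from email import message_from_string, policy
from email.utils import getaddresses
from pathlib import Path

# Hypothetical mapping from file extension to adapter name.
ADAPTERS = {".pst": "pst", ".mbox": "mbox", ".eml": "eml"}

# Illustrative patterns; real PII detection needs far broader coverage.
PII_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]


def select_adapter(filename: str) -> str:
    """Pick an adapter name from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    if ext not in ADAPTERS:
        raise ValueError(f"unsupported archive format: {filename!r}")
    return ADAPTERS[ext]


def parse_message(raw: str) -> dict:
    """Extract metadata and body content from one RFC 5322 message."""
    msg = message_from_string(raw, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    # Collect To and Cc headers as plain strings before splitting addresses.
    headers = [str(v) for v in msg.get_all("To", []) + msg.get_all("Cc", [])]
    return {
        "sender": str(msg["From"] or ""),
        "recipients": [addr for _, addr in getaddresses(headers)],
        "subject": str(msg["Subject"] or ""),
        "date": str(msg["Date"] or ""),
        "text": text_part.get_content() if text_part else "",
        "html": html_part.get_content() if html_part else "",
    }


def redact(text: str) -> str:
    """Replace each PII match with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


RAW = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Call 555-123-4567 or mail carol@example.com. SSN 123-45-6789.\n"
)

if __name__ == "__main__":
    parsed = parse_message(RAW)
    print(select_adapter("inbox.mbox"), parsed["subject"])
    print(redact(parsed["text"]))
```

Note that only the body is redacted in this sketch; the sender and recipient headers are kept intact, since the contact graph stage needs the original addresses.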