Extraction Process
PikoClaw is designed to handle a variety of common email archive formats, processing them through a unified pipeline to build a knowledge graph.
Supported Formats
- PST (Personal Storage Table): The format used by Microsoft Outlook.
- MBOX (Mailbox): A common format used by many email clients, including Thunderbird and Apple Mail.
- EML (Email): Individual email message files.
The Pipeline
- Adapter Selection: Based on the file extension, PikoClaw selects the appropriate adapter to read the archive.
- Message Parsing: Each message is parsed to extract key metadata (sender, recipients, subject, date) and content (body text, HTML).
- PII Redaction: The body content is scanned for Personally Identifiable Information (PII) such as email addresses, phone numbers, and Social Security numbers (SSNs). Each match is replaced with a placeholder.
- Topic Clustering: The redacted text is fed into a machine learning model to identify and group messages into topics.
- Contact Intelligence: A graph of all unique contacts is built, with duplicate entries merged and related individuals linked.
- Storage: The original file is archived in R2 storage, and the extracted metadata, contacts, and topics are stored in a D1 database for querying via the API.
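The first three stages above (adapter selection, message parsing, PII redaction) can be illustrated with a minimal Python sketch. This is not PikoClaw's actual implementation: the `ADAPTERS` mapping, the function names, the placeholder strings, and the regular expressions are all assumptions made for illustration, and the patterns only cover common US-style phone numbers and SSNs. Parsing relies on Python's standard `email` package.

```python
import re
from email import message_from_string, policy
from pathlib import Path

# Hypothetical mapping from file extension to adapter name.
ADAPTERS = {".pst": "pst", ".mbox": "mbox", ".eml": "eml"}


def select_adapter(filename: str) -> str:
    """Pick an adapter from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    if ext not in ADAPTERS:
        raise ValueError(f"unsupported archive format: {filename!r}")
    return ADAPTERS[ext]


# Illustrative patterns only; real PII detection needs broader coverage.
# Order matters: SSNs are replaced before the looser phone pattern runs.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
]


def redact(text: str) -> str:
    """Replace each PII match with its placeholder."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def parse_message(raw: str) -> dict:
    """Extract metadata and a redacted body from one raw message."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text part, falling back to HTML.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "sender": msg["From"],
        "recipients": msg.get_all("To", []),
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": redact(body.get_content()) if body is not None else "",
    }
```

In this sketch only the body is redacted, matching the description above; header metadata such as the sender is kept intact because the contact graph depends on it. Topic clustering, contact deduplication, and storage are left out because the source does not describe how they work.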