Extraction Process

PikoClaw reads three common email archive formats and processes them through a single pipeline to build a knowledge graph.

Supported Formats

  • PST (Personal Storage Table): The format used by Microsoft Outlook.
  • MBOX (Mailbox): A common format used by many email clients, including Thunderbird and Apple Mail.
  • EML (Email): Individual email message files.
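
As a rough illustration of how these formats might map to adapters, the sketch below picks an adapter name from a file's extension. The `ADAPTERS` table and the `select_adapter` function are hypothetical names chosen for this example, not PikoClaw's actual API.

```python
from pathlib import Path

# Hypothetical mapping from file extension to adapter name (illustration only).
ADAPTERS = {
    ".pst": "pst",
    ".mbox": "mbox",
    ".eml": "eml",
}


def select_adapter(filename: str) -> str:
    """Return the adapter name for an archive, based on its file extension."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    if suffix not in ADAPTERS:
        raise ValueError(f"unsupported archive format: {filename}")
    return ADAPTERS[suffix]
```

Keying on the lowercased suffix means `Inbox.PST` and `inbox.pst` resolve to the same adapter, and any other extension fails early with a clear error.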

The Pipeline

  1. Adapter Selection: Based on the file extension, PikoClaw selects the appropriate adapter to read the archive.
  2. Message Parsing: Each message is parsed to extract key metadata (sender, recipients, subject, date) and content (body text, HTML).
  3. PII Redaction: The body content is scanned for Personally Identifiable Information (PII) such as email addresses, phone numbers, and Social Security numbers (SSNs), which are replaced with placeholders.
  4. Topic Clustering: The redacted text is fed into a machine learning model to identify and group messages into topics.
  5. Contact Intelligence: PikoClaw builds a graph of all unique contacts, deduplicating entries and linking individuals.
  6. Storage: The original file is archived in R2 storage, and the extracted metadata, contacts, and topics are stored in a D1 database for querying via the API.
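
Steps 2, 3, and 5 above can be sketched with Python's standard library. This is a minimal illustration, not PikoClaw's implementation: it assumes the input is the raw text of a single message (as found in an EML file), the function names and placeholder strings are hypothetical, and the regular expressions cover only simple US-style phone numbers and SSNs.

```python
import re
from email import message_from_string, policy
from email.utils import getaddresses, parseaddr

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]


def parse_message(raw: str) -> dict:
    """Extract key metadata and the body text from one raw message."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text body, falling back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    to_cc = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {
        "sender": parseaddr(str(msg["From"] or "")),
        "recipients": getaddresses([str(value) for value in to_cc]),
        "subject": str(msg["Subject"] or ""),
        "date": str(msg["Date"] or ""),
        "body": part.get_content() if part is not None else "",
    }


def redact(text: str) -> str:
    """Replace each PII match with its placeholder."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def merge_contacts(messages: list[dict]) -> dict[str, str]:
    """Deduplicate contacts by lowercased address, keeping the first non-empty name."""
    contacts: dict[str, str] = {}
    for message in messages:
        for name, address in [message["sender"], *message["recipients"]]:
            key = address.lower()
            if key and not contacts.get(key):
                contacts[key] = name
    return contacts
```

A caller would run `parse_message` on each message, pass the resulting `body` through `redact` before any topic clustering, and feed the parsed messages to `merge_contacts`. Deduplicating on the lowercased address is a deliberately simple choice; linking several addresses to one individual would need additional heuristics.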