Skip to content

Database Schema

The Corgi Recommender Service uses PostgreSQL as its primary database, leveraging advanced features like pgvector for semantic similarity search and JSONB for flexible metadata storage. This document provides a comprehensive reference for all database tables, their columns, and relationships.

Schema Overview

The database is organized into six logical domains, each serving a specific purpose in the recommendation system. The diagram below shows how these domains interconnect to support privacy-first, personalized content recommendations.

Domain Organization

graph TB
    subgraph Identity ["User Identity & Privacy"]
        USERS[users<br/>Core user records]
        USER_IDS[user_identities<br/>Mastodon accounts]
        PRIVACY[privacy_settings<br/>User preferences]
    end

    subgraph Content ["Content & Metadata"]
        POSTS[posts<br/>Mastodon posts]
        RANKINGS[post_rankings<br/>Pre-computed scores]
    end

    subgraph Recommendations ["Recommendation System"]
        RECS[recommendations<br/>Generated recommendations]
        INTERACTIONS[interactions<br/>User engagement]
    end

    subgraph Experiments ["A/B Testing & Analytics"]
        EXPS[experiments<br/>Test definitions]
        ASSIGNS[experiment_assignments<br/>User assignments]
        METRICS[experiment_metrics<br/>Results tracking]
    end

    subgraph Knowledge ["Knowledge Base & RAG"]
        CHUNKS[knowledge_chunks<br/>Documentation chunks]
        EMBEDS[knowledge_embeddings<br/>Vector embeddings]
    end

    subgraph Agents ["Agent System"]
        EXECS[agent_executions<br/>Run history]
        COSTS[agent_costs<br/>Cost tracking]
    end

    %% User relationships
    USERS --> USER_IDS
    USERS --> PRIVACY
    USERS --> RECS
    USERS --> RANKINGS
    USERS --> ASSIGNS

    %% Identity relationships
    USER_IDS --> INTERACTIONS

    %% Post relationships
    POSTS --> INTERACTIONS
    POSTS --> RECS
    POSTS --> RANKINGS
    POSTS --> EMBEDS

    %% Experiment relationships
    EXPS --> ASSIGNS
    EXPS --> METRICS

    %% Knowledge relationships
    CHUNKS --> EMBEDS

    %% Agent relationships
    EXECS -.-> COSTS

    classDef identity fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef content fill:#e8f5e8,stroke:#388e3c,color:#000
    classDef recommendation fill:#fce4ec,stroke:#c2185b,color:#000
    classDef experiment fill:#fff3e0,stroke:#f57c00,color:#000
    classDef knowledge fill:#f3e5f5,stroke:#7b1fa2,color:#000
    classDef agent fill:#efebe9,stroke:#5d4037,color:#000

    class USERS,USER_IDS,PRIVACY identity
    class POSTS,RANKINGS content
    class RECS,INTERACTIONS recommendation
    class EXPS,ASSIGNS,METRICS experiment
    class CHUNKS,EMBEDS knowledge
    class EXECS,COSTS agent

Key Relationships

The schema design emphasizes data privacy and efficient querying:

  1. User Identity Separation: The users table represents a Corgi user, while user_identities maps to their various Mastodon accounts. This allows users to maintain privacy across different instances.

  2. Interaction Tracking: User engagement flows from user_identitiesinteractionsposts, enabling personalized recommendations while maintaining account separation.

  3. Recommendation Pipeline: The system generates recommendations by analyzing interactions, computing rankings, and storing them in the recommendations table for fast retrieval.

  4. Experimentation Framework: A/B tests are managed through experimentsexperiment_assignmentsusers, with results tracked in experiment_metrics.

  5. Knowledge Management: The RAG system uses knowledge_chunksknowledge_embeddings with pgvector for semantic search capabilities.

  6. Agent Monitoring: Autonomous agents log their execution history and costs for transparency and control.

Detailed Table Relationships

graph TD
    subgraph Legend ["Relationship Types"]
        L1[1:1 One to One]
        L2[1:N One to Many]
        L3[N:M Many to Many]
    end

    USERS[users]
    USER_IDS[user_identities<br/>1:N with users]
    PRIVACY[privacy_settings<br/>1:1 with users]
    POSTS[posts]
    INTERACTIONS[interactions<br/>N:M bridge table]
    RECS[recommendations<br/>N:M with scoring]
    RANKINGS[post_rankings<br/>1:N per post/user]

    USERS -->|1:N| USER_IDS
    USERS -->|1:1| PRIVACY
    USER_IDS -->|1:N| INTERACTIONS
    POSTS -->|1:N| INTERACTIONS
    USERS -->|1:N| RECS
    POSTS -->|1:N| RECS
    POSTS -->|1:N| RANKINGS
    USERS -->|1:N| RANKINGS

    style INTERACTIONS fill:#fce4ec,stroke:#c2185b
    style RECS fill:#fce4ec,stroke:#c2185b

    classDef bridge fill:#fce4ec,stroke:#c2185b,color:#000
    classDef entity fill:#e3f2fd,stroke:#1976d2,color:#000

    class INTERACTIONS,RECS bridge
    class USERS,USER_IDS,PRIVACY,POSTS,RANKINGS entity

Core Tables

users

The primary user table storing aggregated user information across all their Mastodon identities.

Column Data Type Description
id UUID Primary key, unique user identifier
created_at TIMESTAMP When the user was first seen by Corgi
updated_at TIMESTAMP Last modification timestamp
total_interactions INTEGER Cached count of all user interactions
last_active TIMESTAMP Last interaction timestamp
preferred_languages TEXT[] Array of ISO language codes
timezone TEXT User's timezone (if detected)

user_identities

Maps Mastodon accounts to Corgi users, supporting users with multiple accounts.

Column Data Type Description
id UUID Primary key
user_id UUID Foreign key to users table
mastodon_instance TEXT Mastodon instance domain
mastodon_user_id TEXT User ID on the Mastodon instance
username TEXT Mastodon username
display_name TEXT User's display name
created_at TIMESTAMP When this identity was added
is_primary BOOLEAN Whether this is the user's primary identity
last_seen TIMESTAMP Last time this identity was active

posts

Stores all crawled and analyzed Mastodon posts.

Column Data Type Description
id UUID Primary key
mastodon_instance TEXT Origin instance domain
mastodon_id TEXT Post ID on origin instance
author_id TEXT Mastodon author ID
content TEXT Post content (HTML)
content_plain TEXT Plain text version for analysis
language TEXT Detected language code
created_at TIMESTAMP When post was created on Mastodon
crawled_at TIMESTAMP When Corgi first saw the post
updated_at TIMESTAMP Last update timestamp
media_attachments JSONB Array of media attachment metadata
tags TEXT[] Array of hashtags
mentions TEXT[] Array of mentioned usernames
is_reply BOOLEAN Whether post is a reply
is_reblog BOOLEAN Whether post is a reblog
sentiment_score FLOAT Calculated sentiment (-1 to 1)
toxicity_score FLOAT Content safety score (0 to 1)
embedding vector(384) Semantic embedding from sentence-transformers

interactions

Records all user interactions with posts for personalization.

Column Data Type Description
id UUID Primary key
user_identity_id UUID Foreign key to user_identities
post_id UUID Foreign key to posts
interaction_type TEXT Type: view, like, reblog, reply, bookmark
interaction_value FLOAT Normalized interaction strength (0-1)
created_at TIMESTAMP When interaction occurred
context JSONB Additional context (e.g., view duration)
experiment_id UUID Associated A/B test experiment

recommendations

Stores generated recommendations with scores and metadata.

Column Data Type Description
id UUID Primary key
user_id UUID Foreign key to users
post_id UUID Foreign key to posts
score FLOAT Final recommendation score
algorithm_scores JSONB Individual algorithm scores
created_at TIMESTAMP When recommendation was generated
served_at TIMESTAMP When recommendation was served
position INTEGER Position in recommendation list
experiment_id UUID Associated A/B test
model_version TEXT Model version used

privacy_settings

User privacy preferences and data handling settings.

Column Data Type Description
id UUID Primary key
user_id UUID Foreign key to users
data_collection_enabled BOOLEAN Whether to collect interaction data
recommendation_enabled BOOLEAN Whether to show recommendations
sensitive_content_filter BOOLEAN Filter sensitive content
blocked_instances TEXT[] Instances to exclude
blocked_users TEXT[] Users to exclude
created_at TIMESTAMP Settings creation time
updated_at TIMESTAMP Last modification time

Analytics Tables

A/B Testing Flow

graph LR
    subgraph Setup ["Experiment Setup"]
        EXP[experiments<br/>Define test variants]
    end

    subgraph Assignment ["User Assignment"]
        USER[User Request]
        ASSIGN[experiment_assignments<br/>Random assignment]
    end

    subgraph Execution ["Test Execution"]
        VAR_A[Variant A<br/>Control]
        VAR_B[Variant B<br/>Treatment]
    end

    subgraph Tracking ["Results Tracking"]
        INTER[Track Interactions]
        METRICS[experiment_metrics<br/>Aggregate results]
    end

    EXP --> USER
    USER --> ASSIGN
    ASSIGN --> VAR_A
    ASSIGN --> VAR_B
    VAR_A --> INTER
    VAR_B --> INTER
    INTER --> METRICS
    METRICS --> EXP

    classDef setup fill:#f3e5f5,stroke:#7b1fa2,color:#000
    classDef assign fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef exec fill:#e8f5e8,stroke:#388e3c,color:#000
    classDef track fill:#fff3e0,stroke:#f57c00,color:#000

    class EXP setup
    class USER,ASSIGN assign
    class VAR_A,VAR_B exec
    class INTER,METRICS track

post_rankings

Pre-computed post rankings for performance optimization.

Column Data Type Description
id UUID Primary key
post_id UUID Foreign key to posts
user_id UUID Foreign key to users (null for global)
ranking_type TEXT Type: trending, quality, personalized
score FLOAT Computed ranking score
factors JSONB Score breakdown by factor
valid_until TIMESTAMP Cache expiration time
created_at TIMESTAMP When ranking was computed

experiment_metrics

Tracks A/B test metrics and outcomes.

Column Data Type Description
id UUID Primary key
experiment_id UUID Foreign key to experiments
metric_name TEXT Metric identifier
variant TEXT Test variant (control/treatment)
value FLOAT Metric value
sample_size INTEGER Number of users
confidence_interval FLOAT[] Statistical confidence bounds
computed_at TIMESTAMP When metric was calculated

Agent System Tables

agent_executions

Logs all agent runs for monitoring and debugging.

Column Data Type Description
id UUID Primary key
agent_name TEXT Name of the agent
execution_id TEXT Unique execution identifier
started_at TIMESTAMP Execution start time
completed_at TIMESTAMP Execution end time
status TEXT Status: running, completed, failed
metrics JSONB Performance metrics
errors JSONB Error details if failed
actions_taken JSONB Summary of actions performed

agent_costs

Tracks costs associated with agent operations.

Column Data Type Description
id UUID Primary key
agent_name TEXT Name of the agent
execution_id TEXT Link to agent_executions
cost_type TEXT Type: api_calls, compute_time
amount DECIMAL(10,4) Cost amount in USD
details JSONB Breakdown of costs
created_at TIMESTAMP When cost was incurred

Knowledge Base Tables

knowledge_chunks

Stores documentation and code chunks for RAG system.

Column Data Type Description
id UUID Primary key
content TEXT Chunk content
metadata JSONB Source file, type, etc.
chunk_index INTEGER Position in source
created_at TIMESTAMP When chunk was created
updated_at TIMESTAMP Last modification

knowledge_embeddings

Vector embeddings for semantic search.

Column Data Type Description
id UUID Primary key
chunk_id UUID Foreign key to knowledge_chunks
embedding vector(384) Semantic embedding
model_version TEXT Embedding model used
created_at TIMESTAMP When embedding was generated

Indexes and Constraints

Vector Search Architecture

graph LR
    subgraph Query ["Query Processing"]
        QT[Query Text]
        QE[Query Embedding<br/>vector(384)]
    end

    subgraph Indexes ["Vector Indexes"]
        PI[posts.embedding<br/>IVFFlat Index]
        KI[knowledge_embeddings.embedding<br/>IVFFlat Index]
    end

    subgraph Results ["Search Results"]
        PR[Similar Posts<br/>Cosine Distance]
        KR[Similar Knowledge<br/>Cosine Distance]
    end

    QT --> QE
    QE --> PI
    QE --> KI
    PI --> PR
    KI --> KR

    classDef query fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef index fill:#fff3e0,stroke:#f57c00,color:#000
    classDef result fill:#e8f5e8,stroke:#388e3c,color:#000

    class QT,QE query
    class PI,KI index
    class PR,KR result

Performance Indexes

-- User lookup indexes
CREATE INDEX idx_user_identities_mastodon ON user_identities(mastodon_instance, mastodon_user_id);
CREATE INDEX idx_users_last_active ON users(last_active DESC);

-- Post retrieval indexes
CREATE INDEX idx_posts_created_at ON posts(created_at DESC);
CREATE INDEX idx_posts_mastodon ON posts(mastodon_instance, mastodon_id);
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_language ON posts(language);

-- Interaction queries
CREATE INDEX idx_interactions_user_post ON interactions(user_identity_id, post_id);
CREATE INDEX idx_interactions_created ON interactions(created_at DESC);

-- Vector similarity search
CREATE INDEX idx_posts_embedding ON posts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX idx_knowledge_embedding ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops);

-- Analytics indexes
CREATE INDEX idx_recommendations_user_created ON recommendations(user_id, created_at DESC);
CREATE INDEX idx_rankings_valid ON post_rankings(ranking_type, valid_until) WHERE valid_until > NOW();

Constraints

-- Unique constraints
ALTER TABLE user_identities ADD CONSTRAINT unique_mastodon_identity 
  UNIQUE (mastodon_instance, mastodon_user_id);

ALTER TABLE posts ADD CONSTRAINT unique_mastodon_post 
  UNIQUE (mastodon_instance, mastodon_id);

-- Foreign key constraints
ALTER TABLE interactions ADD CONSTRAINT fk_interaction_user 
  FOREIGN KEY (user_identity_id) REFERENCES user_identities(id) ON DELETE CASCADE;

ALTER TABLE interactions ADD CONSTRAINT fk_interaction_post 
  FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE;

-- Check constraints
ALTER TABLE interactions ADD CONSTRAINT check_interaction_value 
  CHECK (interaction_value >= 0 AND interaction_value <= 1);

ALTER TABLE posts ADD CONSTRAINT check_sentiment_range 
  CHECK (sentiment_score >= -1 AND sentiment_score <= 1);

ALTER TABLE posts ADD CONSTRAINT check_toxicity_range 
  CHECK (toxicity_score >= 0 AND toxicity_score <= 1);

Design Decisions

Why PostgreSQL?

  • pgvector Extension: Native support for vector similarity search, essential for content-based recommendations
  • JSONB Support: Flexible schema for metadata without sacrificing query performance
  • ACID Compliance: Critical for maintaining data consistency in recommendation tracking
  • Advanced Indexing: B-tree, GiST, and IVFFlat indexes for diverse query patterns

Why UUIDs?

  • Distributed Generation: No central bottleneck for ID generation
  • Mastodon Compatibility: Aligns with Fediverse's distributed nature
  • No Information Leakage: Unlike sequential IDs, UUIDs don't reveal usage patterns

Why Separate user_identities?

  • Multi-Account Support: Users often have accounts on multiple instances
  • Privacy: Allows users to control data sharing per identity
  • Federation: Aligns with Mastodon's federated architecture

Why JSONB for Metadata?

  • Schema Flexibility: Mastodon's ActivityPub objects vary in structure
  • Query Performance: PostgreSQL's JSONB indexing enables efficient queries
  • Future-Proofing: New fields can be added without migrations

Migration Management

Database migrations are managed through numbered SQL files in db/migrations/:

db/migrations/
├── 001_initial_schema.sql
├── 002_add_vector_search.sql
├── 003_add_privacy_settings.sql
├── 004_add_agent_tables.sql
├── 005_add_analytics_tables.sql
├── 006_add_content_classification.sql
├── 007_add_knowledge_embeddings.sql
└── 008_optimize_indexes.sql

Apply migrations in order:

psql -U corgi_user -d corgi_db -f db/migrations/001_initial_schema.sql

Performance Considerations

  1. Partitioning: The interactions table is partitioned by month for scalability
  2. Materialized Views: Pre-computed trending posts refresh every hour
  3. Connection Pooling: PgBouncer manages connection pooling for high concurrency
  4. Vacuum Strategy: Aggressive autovacuum on high-churn tables like interactions
  5. Index Maintenance: Weekly REINDEX on vector similarity indexes

Security Considerations

  1. Row-Level Security: Enabled on privacy_settings and user_identities
  2. Encryption: Sensitive fields use pgcrypto for at-rest encryption
  3. Audit Logging: All data modifications logged to audit_log table
  4. Least Privilege: Application uses role with minimal required permissions
  5. SQL Injection: All queries use parameterized statements