Database Schema

The Corgi Recommender Service uses PostgreSQL as its primary database, leveraging advanced features like pgvector for semantic similarity search and JSONB for flexible metadata storage. This document provides a comprehensive reference for all database tables, their columns, and relationships.

Schema Overview

The database is organized into six logical domains, each serving a specific purpose in the recommendation system. The diagram below shows how these domains interconnect to support privacy-first, personalized content recommendations.

Domain Organization

graph TB
    subgraph Identity ["User Identity & Privacy"]
        USERS[users<br/>Core user records]
        USER_IDS[user_identities<br/>Mastodon accounts]
        PRIVACY[privacy_settings<br/>User preferences]
    end

    subgraph Content ["Content & Metadata"]
        POSTS[posts<br/>Mastodon posts]
        RANKINGS[post_rankings<br/>Pre-computed scores]
    end

    subgraph Recommendations ["Recommendation System"]
        RECS[recommendations<br/>Generated recommendations]
        INTERACTIONS[interactions<br/>User engagement]
    end

    subgraph Experiments ["A/B Testing & Analytics"]
        EXPS[experiments<br/>Test definitions]
        ASSIGNS[experiment_assignments<br/>User assignments]
        METRICS[experiment_metrics<br/>Results tracking]
    end

    subgraph Knowledge ["Knowledge Base & RAG"]
        CHUNKS[knowledge_chunks<br/>Documentation chunks]
        EMBEDS[knowledge_embeddings<br/>Vector embeddings]
    end

    subgraph Agents ["Agent System"]
        EXECS[agent_executions<br/>Run history]
        COSTS[agent_costs<br/>Cost tracking]
    end

    %% User relationships
    USERS --> USER_IDS
    USERS --> PRIVACY
    USERS --> RECS
    USERS --> RANKINGS
    USERS --> ASSIGNS

    %% Identity relationships
    USER_IDS --> INTERACTIONS

    %% Post relationships
    POSTS --> INTERACTIONS
    POSTS --> RECS
    POSTS --> RANKINGS
    POSTS --> EMBEDS

    %% Experiment relationships
    EXPS --> ASSIGNS
    EXPS --> METRICS

    %% Knowledge relationships
    CHUNKS --> EMBEDS

    %% Agent relationships
    EXECS -.-> COSTS

    classDef identity fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef content fill:#e8f5e8,stroke:#388e3c,color:#000
    classDef recommendation fill:#fce4ec,stroke:#c2185b,color:#000
    classDef experiment fill:#fff3e0,stroke:#f57c00,color:#000
    classDef knowledge fill:#f3e5f5,stroke:#7b1fa2,color:#000
    classDef agent fill:#efebe9,stroke:#5d4037,color:#000

    class USERS,USER_IDS,PRIVACY identity
    class POSTS,RANKINGS content
    class RECS,INTERACTIONS recommendation
    class EXPS,ASSIGNS,METRICS experiment
    class CHUNKS,EMBEDS knowledge
    class EXECS,COSTS agent

Key Relationships

The schema design emphasizes data privacy and efficient querying:

User Identity Separation: The users table represents a Corgi user, while user_identities maps to their various Mastodon accounts. This allows users to maintain privacy across different instances.
Interaction Tracking: User engagement flows from user_identities → interactions → posts, enabling personalized recommendations while maintaining account separation.
Recommendation Pipeline: The system generates recommendations by analyzing interactions, computing rankings, and storing them in the recommendations table for fast retrieval.
Experimentation Framework: A/B tests are managed through experiments → experiment_assignments → users, with results tracked in experiment_metrics.
Knowledge Management: The RAG system uses knowledge_chunks → knowledge_embeddings with pgvector for semantic search capabilities.
Agent Monitoring: Autonomous agents log their execution history and costs for transparency and control.

Detailed Table Relationships

graph TD
    subgraph Legend ["Relationship Types"]
        L1[1:1 One to One]
        L2[1:N One to Many]
        L3[N:M Many to Many]
    end

    USERS[users]
    USER_IDS[user_identities<br/>1:N with users]
    PRIVACY[privacy_settings<br/>1:1 with users]
    POSTS[posts]
    INTERACTIONS[interactions<br/>N:M bridge table]
    RECS[recommendations<br/>N:M with scoring]
    RANKINGS[post_rankings<br/>1:N per post/user]

    USERS -->|1:N| USER_IDS
    USERS -->|1:1| PRIVACY
    USER_IDS -->|1:N| INTERACTIONS
    POSTS -->|1:N| INTERACTIONS
    USERS -->|1:N| RECS
    POSTS -->|1:N| RECS
    POSTS -->|1:N| RANKINGS
    USERS -->|1:N| RANKINGS

    style INTERACTIONS fill:#fce4ec,stroke:#c2185b
    style RECS fill:#fce4ec,stroke:#c2185b

    classDef bridge fill:#fce4ec,stroke:#c2185b,color:#000
    classDef entity fill:#e3f2fd,stroke:#1976d2,color:#000

    class INTERACTIONS,RECS bridge
    class USERS,USER_IDS,PRIVACY,POSTS,RANKINGS entity

Core Tables

users

The primary user table storing aggregated user information across all their Mastodon identities.

Column	Data Type	Description
`id`	UUID	Primary key, unique user identifier
`created_at`	TIMESTAMP	When the user was first seen by Corgi
`updated_at`	TIMESTAMP	Last modification timestamp
`total_interactions`	INTEGER	Cached count of all user interactions
`last_active`	TIMESTAMP	Last interaction timestamp
`preferred_languages`	TEXT[]	Array of ISO language codes
`timezone`	TEXT	User's timezone (if detected)

user_identities

Maps Mastodon accounts to Corgi users, supporting users with multiple accounts.

Column	Data Type	Description
`id`	UUID	Primary key
`user_id`	UUID	Foreign key to users table
`mastodon_instance`	TEXT	Mastodon instance domain
`mastodon_user_id`	TEXT	User ID on the Mastodon instance
`username`	TEXT	Mastodon username
`display_name`	TEXT	User's display name
`created_at`	TIMESTAMP	When this identity was added
`is_primary`	BOOLEAN	Whether this is the user's primary identity
`last_seen`	TIMESTAMP	Last time this identity was active

posts

Stores all crawled and analyzed Mastodon posts.

Column	Data Type	Description
`id`	UUID	Primary key
`mastodon_instance`	TEXT	Origin instance domain
`mastodon_id`	TEXT	Post ID on origin instance
`author_id`	TEXT	Mastodon author ID
`content`	TEXT	Post content (HTML)
`content_plain`	TEXT	Plain text version for analysis
`language`	TEXT	Detected language code
`created_at`	TIMESTAMP	When post was created on Mastodon
`crawled_at`	TIMESTAMP	When Corgi first saw the post
`updated_at`	TIMESTAMP	Last update timestamp
`media_attachments`	JSONB	Array of media attachment metadata
`tags`	TEXT[]	Array of hashtags
`mentions`	TEXT[]	Array of mentioned usernames
`is_reply`	BOOLEAN	Whether post is a reply
`is_reblog`	BOOLEAN	Whether post is a reblog
`sentiment_score`	FLOAT	Calculated sentiment (-1 to 1)
`toxicity_score`	FLOAT	Content safety score (0 to 1)
`embedding`	vector(384)	Semantic embedding from sentence-transformers

interactions

Records all user interactions with posts for personalization.

Column	Data Type	Description
`id`	UUID	Primary key
`user_identity_id`	UUID	Foreign key to user_identities
`post_id`	UUID	Foreign key to posts
`interaction_type`	TEXT	Type: view, like, reblog, reply, bookmark
`interaction_value`	FLOAT	Normalized interaction strength (0-1)
`created_at`	TIMESTAMP	When interaction occurred
`context`	JSONB	Additional context (e.g., view duration)
`experiment_id`	UUID	Associated A/B test experiment

recommendations

Stores generated recommendations with scores and metadata.

Column	Data Type	Description
`id`	UUID	Primary key
`user_id`	UUID	Foreign key to users
`post_id`	UUID	Foreign key to posts
`score`	FLOAT	Final recommendation score
`algorithm_scores`	JSONB	Individual algorithm scores
`created_at`	TIMESTAMP	When recommendation was generated
`served_at`	TIMESTAMP	When recommendation was served
`position`	INTEGER	Position in recommendation list
`experiment_id`	UUID	Associated A/B test
`model_version`	TEXT	Model version used

privacy_settings

User privacy preferences and data handling settings.

Column	Data Type	Description
`id`	UUID	Primary key
`user_id`	UUID	Foreign key to users
`data_collection_enabled`	BOOLEAN	Whether to collect interaction data
`recommendation_enabled`	BOOLEAN	Whether to show recommendations
`sensitive_content_filter`	BOOLEAN	Filter sensitive content
`blocked_instances`	TEXT[]	Instances to exclude
`blocked_users`	TEXT[]	Users to exclude
`created_at`	TIMESTAMP	Settings creation time
`updated_at`	TIMESTAMP	Last modification time

Analytics Tables

A/B Testing Flow

graph LR
    subgraph Setup ["Experiment Setup"]
        EXP[experiments<br/>Define test variants]
    end

    subgraph Assignment ["User Assignment"]
        USER[User Request]
        ASSIGN[experiment_assignments<br/>Random assignment]
    end

    subgraph Execution ["Test Execution"]
        VAR_A[Variant A<br/>Control]
        VAR_B[Variant B<br/>Treatment]
    end

    subgraph Tracking ["Results Tracking"]
        INTER[Track Interactions]
        METRICS[experiment_metrics<br/>Aggregate results]
    end

    EXP --> USER
    USER --> ASSIGN
    ASSIGN --> VAR_A
    ASSIGN --> VAR_B
    VAR_A --> INTER
    VAR_B --> INTER
    INTER --> METRICS
    METRICS --> EXP

    classDef setup fill:#f3e5f5,stroke:#7b1fa2,color:#000
    classDef assign fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef exec fill:#e8f5e8,stroke:#388e3c,color:#000
    classDef track fill:#fff3e0,stroke:#f57c00,color:#000

    class EXP setup
    class USER,ASSIGN assign
    class VAR_A,VAR_B exec
    class INTER,METRICS track

post_rankings

Pre-computed post rankings for performance optimization.

Column	Data Type	Description
`id`	UUID	Primary key
`post_id`	UUID	Foreign key to posts
`user_id`	UUID	Foreign key to users (null for global)
`ranking_type`	TEXT	Type: trending, quality, personalized
`score`	FLOAT	Computed ranking score
`factors`	JSONB	Score breakdown by factor
`valid_until`	TIMESTAMP	Cache expiration time
`created_at`	TIMESTAMP	When ranking was computed

experiment_metrics

Tracks A/B test metrics and outcomes.

Column	Data Type	Description
`id`	UUID	Primary key
`experiment_id`	UUID	Foreign key to experiments
`metric_name`	TEXT	Metric identifier
`variant`	TEXT	Test variant (control/treatment)
`value`	FLOAT	Metric value
`sample_size`	INTEGER	Number of users
`confidence_interval`	FLOAT[]	Statistical confidence bounds
`computed_at`	TIMESTAMP	When metric was calculated

Agent System Tables

agent_executions

Logs all agent runs for monitoring and debugging.

Column	Data Type	Description
`id`	UUID	Primary key
`agent_name`	TEXT	Name of the agent
`execution_id`	TEXT	Unique execution identifier
`started_at`	TIMESTAMP	Execution start time
`completed_at`	TIMESTAMP	Execution end time
`status`	TEXT	Status: running, completed, failed
`metrics`	JSONB	Performance metrics
`errors`	JSONB	Error details if failed
`actions_taken`	JSONB	Summary of actions performed

agent_costs

Tracks costs associated with agent operations.

Column	Data Type	Description
`id`	UUID	Primary key
`agent_name`	TEXT	Name of the agent
`execution_id`	TEXT	Link to agent_executions
`cost_type`	TEXT	Type: api_calls, compute_time
`amount`	DECIMAL(10,4)	Cost amount in USD
`details`	JSONB	Breakdown of costs
`created_at`	TIMESTAMP	When cost was incurred

Knowledge Base Tables

knowledge_chunks

Stores documentation and code chunks for RAG system.

Column	Data Type	Description
`id`	UUID	Primary key
`content`	TEXT	Chunk content
`metadata`	JSONB	Source file, type, etc.
`chunk_index`	INTEGER	Position in source
`created_at`	TIMESTAMP	When chunk was created
`updated_at`	TIMESTAMP	Last modification

knowledge_embeddings

Vector embeddings for semantic search.

Column	Data Type	Description
`id`	UUID	Primary key
`chunk_id`	UUID	Foreign key to knowledge_chunks
`embedding`	vector(384)	Semantic embedding
`model_version`	TEXT	Embedding model used
`created_at`	TIMESTAMP	When embedding was generated

Indexes and Constraints

Vector Search Architecture

graph LR
    subgraph Query ["Query Processing"]
        QT[Query Text]
        QE[Query Embedding<br/>vector(384)]
    end

    subgraph Indexes ["Vector Indexes"]
        PI[posts.embedding<br/>IVFFlat Index]
        KI[knowledge_embeddings.embedding<br/>IVFFlat Index]
    end

    subgraph Results ["Search Results"]
        PR[Similar Posts<br/>Cosine Distance]
        KR[Similar Knowledge<br/>Cosine Distance]
    end

    QT --> QE
    QE --> PI
    QE --> KI
    PI --> PR
    KI --> KR

    classDef query fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef index fill:#fff3e0,stroke:#f57c00,color:#000
    classDef result fill:#e8f5e8,stroke:#388e3c,color:#000

    class QT,QE query
    class PI,KI index
    class PR,KR result

Performance Indexes

-- User lookup indexes
CREATE INDEX idx_user_identities_mastodon ON user_identities(mastodon_instance, mastodon_user_id);
CREATE INDEX idx_users_last_active ON users(last_active DESC);

-- Post retrieval indexes
CREATE INDEX idx_posts_created_at ON posts(created_at DESC);
CREATE INDEX idx_posts_mastodon ON posts(mastodon_instance, mastodon_id);
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_language ON posts(language);

-- Interaction queries
CREATE INDEX idx_interactions_user_post ON interactions(user_identity_id, post_id);
CREATE INDEX idx_interactions_created ON interactions(created_at DESC);

-- Vector similarity search
CREATE INDEX idx_posts_embedding ON posts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX idx_knowledge_embedding ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops);

-- Analytics indexes
CREATE INDEX idx_recommendations_user_created ON recommendations(user_id, created_at DESC);
CREATE INDEX idx_rankings_valid ON post_rankings(ranking_type, valid_until) WHERE valid_until > NOW();

Constraints

-- Unique constraints
ALTER TABLE user_identities ADD CONSTRAINT unique_mastodon_identity 
  UNIQUE (mastodon_instance, mastodon_user_id);

ALTER TABLE posts ADD CONSTRAINT unique_mastodon_post 
  UNIQUE (mastodon_instance, mastodon_id);

-- Foreign key constraints
ALTER TABLE interactions ADD CONSTRAINT fk_interaction_user 
  FOREIGN KEY (user_identity_id) REFERENCES user_identities(id) ON DELETE CASCADE;

ALTER TABLE interactions ADD CONSTRAINT fk_interaction_post 
  FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE;

-- Check constraints
ALTER TABLE interactions ADD CONSTRAINT check_interaction_value 
  CHECK (interaction_value >= 0 AND interaction_value <= 1);

ALTER TABLE posts ADD CONSTRAINT check_sentiment_range 
  CHECK (sentiment_score >= -1 AND sentiment_score <= 1);

ALTER TABLE posts ADD CONSTRAINT check_toxicity_range 
  CHECK (toxicity_score >= 0 AND toxicity_score <= 1);

Design Decisions

Why PostgreSQL?

pgvector Extension: Native support for vector similarity search, essential for content-based recommendations
JSONB Support: Flexible schema for metadata without sacrificing query performance
ACID Compliance: Critical for maintaining data consistency in recommendation tracking
Advanced Indexing: B-tree, GiST, and IVFFlat indexes for diverse query patterns

Why UUIDs?

Distributed Generation: No central bottleneck for ID generation
Mastodon Compatibility: Aligns with Fediverse's distributed nature
No Information Leakage: Unlike sequential IDs, UUIDs don't reveal usage patterns

Why Separate `user_identities`?

Multi-Account Support: Users often have accounts on multiple instances
Privacy: Allows users to control data sharing per identity
Federation: Aligns with Mastodon's federated architecture

Why JSONB for Metadata?

Schema Flexibility: Mastodon's ActivityPub objects vary in structure
Query Performance: PostgreSQL's JSONB indexing enables efficient queries
Future-Proofing: New fields can be added without migrations

Migration Management

Database migrations are managed through numbered SQL files in db/migrations/:

db/migrations/
├── 001_initial_schema.sql
├── 002_add_vector_search.sql
├── 003_add_privacy_settings.sql
├── 004_add_agent_tables.sql
├── 005_add_analytics_tables.sql
├── 006_add_content_classification.sql
├── 007_add_knowledge_embeddings.sql
└── 008_optimize_indexes.sql

Apply migrations in order:

psql -U corgi_user -d corgi_db -f db/migrations/001_initial_schema.sql

Performance Considerations

Partitioning: The interactions table is partitioned by month for scalability
Materialized Views: Pre-computed trending posts refresh every hour
Connection Pooling: PgBouncer manages connection pooling for high concurrency
Vacuum Strategy: Aggressive autovacuum on high-churn tables like interactions
Index Maintenance: Weekly REINDEX on vector similarity indexes

Security Considerations

Row-Level Security: Enabled on privacy_settings and user_identities
Encryption: Sensitive fields use pgcrypto for at-rest encryption
Audit Logging: All data modifications logged to audit_log table
Least Privilege: Application uses role with minimal required permissions
SQL Injection: All queries use parameterized statements