Database Schema
The Corgi Recommender Service uses PostgreSQL as its primary database, leveraging advanced features like pgvector for semantic similarity search and JSONB for flexible metadata storage. This document provides a comprehensive reference for all database tables, their columns, and relationships.
Schema Overview
The database is organized into six logical domains, each serving a specific purpose in the recommendation system. The diagram below shows how these domains interconnect to support privacy-first, personalized content recommendations.
Domain Organization
graph TB
subgraph Identity ["User Identity & Privacy"]
USERS[users<br/>Core user records]
USER_IDS[user_identities<br/>Mastodon accounts]
PRIVACY[privacy_settings<br/>User preferences]
end
subgraph Content ["Content & Metadata"]
POSTS[posts<br/>Mastodon posts]
RANKINGS[post_rankings<br/>Pre-computed scores]
end
subgraph Recommendations ["Recommendation System"]
RECS[recommendations<br/>Generated recommendations]
INTERACTIONS[interactions<br/>User engagement]
end
subgraph Experiments ["A/B Testing & Analytics"]
EXPS[experiments<br/>Test definitions]
ASSIGNS[experiment_assignments<br/>User assignments]
METRICS[experiment_metrics<br/>Results tracking]
end
subgraph Knowledge ["Knowledge Base & RAG"]
CHUNKS[knowledge_chunks<br/>Documentation chunks]
EMBEDS[knowledge_embeddings<br/>Vector embeddings]
end
subgraph Agents ["Agent System"]
EXECS[agent_executions<br/>Run history]
COSTS[agent_costs<br/>Cost tracking]
end
%% User relationships
USERS --> USER_IDS
USERS --> PRIVACY
USERS --> RECS
USERS --> RANKINGS
USERS --> ASSIGNS
%% Identity relationships
USER_IDS --> INTERACTIONS
%% Post relationships
POSTS --> INTERACTIONS
POSTS --> RECS
POSTS --> RANKINGS
POSTS --> EMBEDS
%% Experiment relationships
EXPS --> ASSIGNS
EXPS --> METRICS
%% Knowledge relationships
CHUNKS --> EMBEDS
%% Agent relationships
EXECS -.-> COSTS
classDef identity fill:#e3f2fd,stroke:#1976d2,color:#000
classDef content fill:#e8f5e8,stroke:#388e3c,color:#000
classDef recommendation fill:#fce4ec,stroke:#c2185b,color:#000
classDef experiment fill:#fff3e0,stroke:#f57c00,color:#000
classDef knowledge fill:#f3e5f5,stroke:#7b1fa2,color:#000
classDef agent fill:#efebe9,stroke:#5d4037,color:#000
class USERS,USER_IDS,PRIVACY identity
class POSTS,RANKINGS content
class RECS,INTERACTIONS recommendation
class EXPS,ASSIGNS,METRICS experiment
class CHUNKS,EMBEDS knowledge
class EXECS,COSTS agent
Key Relationships
The schema design emphasizes data privacy and efficient querying:
-
User Identity Separation: The
userstable represents a Corgi user, whileuser_identitiesmaps to their various Mastodon accounts. This allows users to maintain privacy across different instances. -
Interaction Tracking: User engagement flows from
user_identities→interactions→posts, enabling personalized recommendations while maintaining account separation. -
Recommendation Pipeline: The system generates recommendations by analyzing interactions, computing rankings, and storing them in the
recommendationstable for fast retrieval. -
Experimentation Framework: A/B tests are managed through
experiments→experiment_assignments→users, with results tracked inexperiment_metrics. -
Knowledge Management: The RAG system uses
knowledge_chunks→knowledge_embeddingswith pgvector for semantic search capabilities. -
Agent Monitoring: Autonomous agents log their execution history and costs for transparency and control.
Detailed Table Relationships
graph TD
subgraph Legend ["Relationship Types"]
L1[1:1 One to One]
L2[1:N One to Many]
L3[N:M Many to Many]
end
USERS[users]
USER_IDS[user_identities<br/>1:N with users]
PRIVACY[privacy_settings<br/>1:1 with users]
POSTS[posts]
INTERACTIONS[interactions<br/>N:M bridge table]
RECS[recommendations<br/>N:M with scoring]
RANKINGS[post_rankings<br/>1:N per post/user]
USERS -->|1:N| USER_IDS
USERS -->|1:1| PRIVACY
USER_IDS -->|1:N| INTERACTIONS
POSTS -->|1:N| INTERACTIONS
USERS -->|1:N| RECS
POSTS -->|1:N| RECS
POSTS -->|1:N| RANKINGS
USERS -->|1:N| RANKINGS
style INTERACTIONS fill:#fce4ec,stroke:#c2185b
style RECS fill:#fce4ec,stroke:#c2185b
classDef bridge fill:#fce4ec,stroke:#c2185b,color:#000
classDef entity fill:#e3f2fd,stroke:#1976d2,color:#000
class INTERACTIONS,RECS bridge
class USERS,USER_IDS,PRIVACY,POSTS,RANKINGS entity
Core Tables
users
The primary user table storing aggregated user information across all their Mastodon identities.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key, unique user identifier |
created_at |
TIMESTAMP | When the user was first seen by Corgi |
updated_at |
TIMESTAMP | Last modification timestamp |
total_interactions |
INTEGER | Cached count of all user interactions |
last_active |
TIMESTAMP | Last interaction timestamp |
preferred_languages |
TEXT[] | Array of ISO language codes |
timezone |
TEXT | User's timezone (if detected) |
user_identities
Maps Mastodon accounts to Corgi users, supporting users with multiple accounts.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
user_id |
UUID | Foreign key to users table |
mastodon_instance |
TEXT | Mastodon instance domain |
mastodon_user_id |
TEXT | User ID on the Mastodon instance |
username |
TEXT | Mastodon username |
display_name |
TEXT | User's display name |
created_at |
TIMESTAMP | When this identity was added |
is_primary |
BOOLEAN | Whether this is the user's primary identity |
last_seen |
TIMESTAMP | Last time this identity was active |
posts
Stores all crawled and analyzed Mastodon posts.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
mastodon_instance |
TEXT | Origin instance domain |
mastodon_id |
TEXT | Post ID on origin instance |
author_id |
TEXT | Mastodon author ID |
content |
TEXT | Post content (HTML) |
content_plain |
TEXT | Plain text version for analysis |
language |
TEXT | Detected language code |
created_at |
TIMESTAMP | When post was created on Mastodon |
crawled_at |
TIMESTAMP | When Corgi first saw the post |
updated_at |
TIMESTAMP | Last update timestamp |
media_attachments |
JSONB | Array of media attachment metadata |
tags |
TEXT[] | Array of hashtags |
mentions |
TEXT[] | Array of mentioned usernames |
is_reply |
BOOLEAN | Whether post is a reply |
is_reblog |
BOOLEAN | Whether post is a reblog |
sentiment_score |
FLOAT | Calculated sentiment (-1 to 1) |
toxicity_score |
FLOAT | Content safety score (0 to 1) |
embedding |
vector(384) | Semantic embedding from sentence-transformers |
interactions
Records all user interactions with posts for personalization.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
user_identity_id |
UUID | Foreign key to user_identities |
post_id |
UUID | Foreign key to posts |
interaction_type |
TEXT | Type: view, like, reblog, reply, bookmark |
interaction_value |
FLOAT | Normalized interaction strength (0-1) |
created_at |
TIMESTAMP | When interaction occurred |
context |
JSONB | Additional context (e.g., view duration) |
experiment_id |
UUID | Associated A/B test experiment |
recommendations
Stores generated recommendations with scores and metadata.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
user_id |
UUID | Foreign key to users |
post_id |
UUID | Foreign key to posts |
score |
FLOAT | Final recommendation score |
algorithm_scores |
JSONB | Individual algorithm scores |
created_at |
TIMESTAMP | When recommendation was generated |
served_at |
TIMESTAMP | When recommendation was served |
position |
INTEGER | Position in recommendation list |
experiment_id |
UUID | Associated A/B test |
model_version |
TEXT | Model version used |
privacy_settings
User privacy preferences and data handling settings.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
user_id |
UUID | Foreign key to users |
data_collection_enabled |
BOOLEAN | Whether to collect interaction data |
recommendation_enabled |
BOOLEAN | Whether to show recommendations |
sensitive_content_filter |
BOOLEAN | Filter sensitive content |
blocked_instances |
TEXT[] | Instances to exclude |
blocked_users |
TEXT[] | Users to exclude |
created_at |
TIMESTAMP | Settings creation time |
updated_at |
TIMESTAMP | Last modification time |
Analytics Tables
A/B Testing Flow
graph LR
subgraph Setup ["Experiment Setup"]
EXP[experiments<br/>Define test variants]
end
subgraph Assignment ["User Assignment"]
USER[User Request]
ASSIGN[experiment_assignments<br/>Random assignment]
end
subgraph Execution ["Test Execution"]
VAR_A[Variant A<br/>Control]
VAR_B[Variant B<br/>Treatment]
end
subgraph Tracking ["Results Tracking"]
INTER[Track Interactions]
METRICS[experiment_metrics<br/>Aggregate results]
end
EXP --> USER
USER --> ASSIGN
ASSIGN --> VAR_A
ASSIGN --> VAR_B
VAR_A --> INTER
VAR_B --> INTER
INTER --> METRICS
METRICS --> EXP
classDef setup fill:#f3e5f5,stroke:#7b1fa2,color:#000
classDef assign fill:#e3f2fd,stroke:#1976d2,color:#000
classDef exec fill:#e8f5e8,stroke:#388e3c,color:#000
classDef track fill:#fff3e0,stroke:#f57c00,color:#000
class EXP setup
class USER,ASSIGN assign
class VAR_A,VAR_B exec
class INTER,METRICS track
post_rankings
Pre-computed post rankings for performance optimization.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
post_id |
UUID | Foreign key to posts |
user_id |
UUID | Foreign key to users (null for global) |
ranking_type |
TEXT | Type: trending, quality, personalized |
score |
FLOAT | Computed ranking score |
factors |
JSONB | Score breakdown by factor |
valid_until |
TIMESTAMP | Cache expiration time |
created_at |
TIMESTAMP | When ranking was computed |
experiment_metrics
Tracks A/B test metrics and outcomes.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
experiment_id |
UUID | Foreign key to experiments |
metric_name |
TEXT | Metric identifier |
variant |
TEXT | Test variant (control/treatment) |
value |
FLOAT | Metric value |
sample_size |
INTEGER | Number of users |
confidence_interval |
FLOAT[] | Statistical confidence bounds |
computed_at |
TIMESTAMP | When metric was calculated |
Agent System Tables
agent_executions
Logs all agent runs for monitoring and debugging.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
agent_name |
TEXT | Name of the agent |
execution_id |
TEXT | Unique execution identifier |
started_at |
TIMESTAMP | Execution start time |
completed_at |
TIMESTAMP | Execution end time |
status |
TEXT | Status: running, completed, failed |
metrics |
JSONB | Performance metrics |
errors |
JSONB | Error details if failed |
actions_taken |
JSONB | Summary of actions performed |
agent_costs
Tracks costs associated with agent operations.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
agent_name |
TEXT | Name of the agent |
execution_id |
TEXT | Link to agent_executions |
cost_type |
TEXT | Type: api_calls, compute_time |
amount |
DECIMAL(10,4) | Cost amount in USD |
details |
JSONB | Breakdown of costs |
created_at |
TIMESTAMP | When cost was incurred |
Knowledge Base Tables
knowledge_chunks
Stores documentation and code chunks for RAG system.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
content |
TEXT | Chunk content |
metadata |
JSONB | Source file, type, etc. |
chunk_index |
INTEGER | Position in source |
created_at |
TIMESTAMP | When chunk was created |
updated_at |
TIMESTAMP | Last modification |
knowledge_embeddings
Vector embeddings for semantic search.
| Column | Data Type | Description |
|---|---|---|
id |
UUID | Primary key |
chunk_id |
UUID | Foreign key to knowledge_chunks |
embedding |
vector(384) | Semantic embedding |
model_version |
TEXT | Embedding model used |
created_at |
TIMESTAMP | When embedding was generated |
Indexes and Constraints
Vector Search Architecture
graph LR
subgraph Query ["Query Processing"]
QT[Query Text]
QE[Query Embedding<br/>vector(384)]
end
subgraph Indexes ["Vector Indexes"]
PI[posts.embedding<br/>IVFFlat Index]
KI[knowledge_embeddings.embedding<br/>IVFFlat Index]
end
subgraph Results ["Search Results"]
PR[Similar Posts<br/>Cosine Distance]
KR[Similar Knowledge<br/>Cosine Distance]
end
QT --> QE
QE --> PI
QE --> KI
PI --> PR
KI --> KR
classDef query fill:#e3f2fd,stroke:#1976d2,color:#000
classDef index fill:#fff3e0,stroke:#f57c00,color:#000
classDef result fill:#e8f5e8,stroke:#388e3c,color:#000
class QT,QE query
class PI,KI index
class PR,KR result
Performance Indexes
-- User lookup indexes
CREATE INDEX idx_user_identities_mastodon ON user_identities(mastodon_instance, mastodon_user_id);
CREATE INDEX idx_users_last_active ON users(last_active DESC);
-- Post retrieval indexes
CREATE INDEX idx_posts_created_at ON posts(created_at DESC);
CREATE INDEX idx_posts_mastodon ON posts(mastodon_instance, mastodon_id);
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_language ON posts(language);
-- Interaction queries
CREATE INDEX idx_interactions_user_post ON interactions(user_identity_id, post_id);
CREATE INDEX idx_interactions_created ON interactions(created_at DESC);
-- Vector similarity search
CREATE INDEX idx_posts_embedding ON posts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX idx_knowledge_embedding ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops);
-- Analytics indexes
CREATE INDEX idx_recommendations_user_created ON recommendations(user_id, created_at DESC);
CREATE INDEX idx_rankings_valid ON post_rankings(ranking_type, valid_until) WHERE valid_until > NOW();
Constraints
-- Unique constraints
ALTER TABLE user_identities ADD CONSTRAINT unique_mastodon_identity
UNIQUE (mastodon_instance, mastodon_user_id);
ALTER TABLE posts ADD CONSTRAINT unique_mastodon_post
UNIQUE (mastodon_instance, mastodon_id);
-- Foreign key constraints
ALTER TABLE interactions ADD CONSTRAINT fk_interaction_user
FOREIGN KEY (user_identity_id) REFERENCES user_identities(id) ON DELETE CASCADE;
ALTER TABLE interactions ADD CONSTRAINT fk_interaction_post
FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE;
-- Check constraints
ALTER TABLE interactions ADD CONSTRAINT check_interaction_value
CHECK (interaction_value >= 0 AND interaction_value <= 1);
ALTER TABLE posts ADD CONSTRAINT check_sentiment_range
CHECK (sentiment_score >= -1 AND sentiment_score <= 1);
ALTER TABLE posts ADD CONSTRAINT check_toxicity_range
CHECK (toxicity_score >= 0 AND toxicity_score <= 1);
Design Decisions
Why PostgreSQL?
- pgvector Extension: Native support for vector similarity search, essential for content-based recommendations
- JSONB Support: Flexible schema for metadata without sacrificing query performance
- ACID Compliance: Critical for maintaining data consistency in recommendation tracking
- Advanced Indexing: B-tree, GiST, and IVFFlat indexes for diverse query patterns
Why UUIDs?
- Distributed Generation: No central bottleneck for ID generation
- Mastodon Compatibility: Aligns with Fediverse's distributed nature
- No Information Leakage: Unlike sequential IDs, UUIDs don't reveal usage patterns
Why Separate user_identities?
- Multi-Account Support: Users often have accounts on multiple instances
- Privacy: Allows users to control data sharing per identity
- Federation: Aligns with Mastodon's federated architecture
Why JSONB for Metadata?
- Schema Flexibility: Mastodon's ActivityPub objects vary in structure
- Query Performance: PostgreSQL's JSONB indexing enables efficient queries
- Future-Proofing: New fields can be added without migrations
Migration Management
Database migrations are managed through numbered SQL files in db/migrations/:
db/migrations/
├── 001_initial_schema.sql
├── 002_add_vector_search.sql
├── 003_add_privacy_settings.sql
├── 004_add_agent_tables.sql
├── 005_add_analytics_tables.sql
├── 006_add_content_classification.sql
├── 007_add_knowledge_embeddings.sql
└── 008_optimize_indexes.sql
Apply migrations in order:
psql -U corgi_user -d corgi_db -f db/migrations/001_initial_schema.sql
Performance Considerations
- Partitioning: The
interactionstable is partitioned by month for scalability - Materialized Views: Pre-computed trending posts refresh every hour
- Connection Pooling: PgBouncer manages connection pooling for high concurrency
- Vacuum Strategy: Aggressive autovacuum on high-churn tables like
interactions - Index Maintenance: Weekly REINDEX on vector similarity indexes
Security Considerations
- Row-Level Security: Enabled on
privacy_settingsanduser_identities - Encryption: Sensitive fields use pgcrypto for at-rest encryption
- Audit Logging: All data modifications logged to
audit_logtable - Least Privilege: Application uses role with minimal required permissions
- SQL Injection: All queries use parameterized statements