Recommendation Engine

The Corgi Recommendation Engine is a hybrid algorithm that combines four scoring components to deliver personalized content recommendations. This document provides a deep dive into how the engine works, the mathematical formula behind each component, and how the components combine to produce the final ranking score.

Overview

Corgi's recommendation engine uses a weighted multi-factor approach that balances personalization with content discovery. The system analyzes user interactions, content popularity, temporal relevance, and semantic understanding to create a recommendation score between 0 and 1 for each post.

graph LR
    subgraph Input ["User & Content Data"]
        UI[User Interactions]
        PM[Post Metadata]
        PE[Post Embeddings]
        UT[User Taste Vector]
    end

    subgraph Components ["Scoring Components"]
        AP[Author Preference<br/>Weight: 0.10]
        CE[Content Engagement<br/>Weight: 0.25]
        RS[Recency Score<br/>Weight: 0.20]
        SS[Semantic Similarity<br/>Weight: 0.45]
    end

    subgraph Output ["Final Score"]
        WC[Weighted Combination]
        FS[Final Score<br/>0.0 - 1.0]
    end

    UI --> AP
    PM --> CE
    PM --> RS
    PE --> SS
    UT --> SS

    AP --> WC
    CE --> WC
    RS --> WC
    SS --> WC

    WC --> FS

    classDef input fill:#e3f2fd,stroke:#1976d2,color:#000
    classDef component fill:#f3e5f5,stroke:#7b1fa2,color:#000
    classDef output fill:#e8f5e8,stroke:#388e3c,color:#000

    class UI,PM,PE,UT input
    class AP,CE,RS,SS component
    class WC,FS output

Design Philosophy

The engine is designed with three core principles:

  1. Discovery Over Echo Chambers: With semantic similarity weighted at 45%, the engine prioritizes helping users discover new content over reinforcing existing preferences.

  2. Quality Matters: Content engagement (25%) ensures that popular, high-quality posts are surfaced while avoiding pure popularity contests.

  3. Fresh Yet Timeless: The recency component (20%) balances showing fresh content while not completely ignoring evergreen posts.

The Four Scoring Components

1. Author Preference Score (10% Weight)

The author preference score measures how much a user has historically engaged with a particular content creator. This component helps surface content from creators the user already enjoys, but with a deliberately low weight to avoid creating filter bubbles.

How It Works

def get_author_preference_score(user_interactions, author_id):
    # Find all interactions with this author's posts
    author_interactions = filter_by_author(user_interactions, author_id)

    if not author_interactions:
        return 0.1  # Baseline score for unknown authors

    # Count positive and negative signals
    positive_actions = ['like', 'favorite', 'reblog', 'bookmark']
    positive_count = count_actions(author_interactions, positive_actions)
    total_interactions = len(author_interactions)

    # Calculate preference ratio
    positive_ratio = positive_count / total_interactions

    # Apply logistic function for smooth scaling
    score = 1 / (1 + exp(-5 * (positive_ratio - 0.5)))

    return clamp(score, 0.1, 1.0)

Mathematical Formula

The score uses a logistic function to create a smooth S-curve:

\[\text{score} = \frac{1}{1 + e^{-5 \times (\text{positive\_ratio} - 0.5)}}\]

Where:

  • positive_ratio = positive interactions / total interactions with the author
  • The factor of 5 controls the steepness of the curve
  • The 0.5 offset centers the curve at 50% positive interactions
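
As a quick check of the curve, here is a minimal standalone sketch of the same formula; the ratios below are hypothetical, not drawn from real interaction data:

from math import exp

def author_preference(positive_ratio):
    # Logistic curve centered at a 50% positive ratio
    score = 1 / (1 + exp(-5 * (positive_ratio - 0.5)))
    # Clamp to the same bounds used by the engine
    return max(0.1, min(1.0, score))

print(author_preference(0.25))  # ~0.22 — mostly negative history
print(author_preference(0.50))  # 0.50  — neutral history
print(author_preference(0.90))  # ~0.88 — strongly positive history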

Why This Matters

This component ensures users see more content from creators they love, but the low weight (10%) prevents the system from becoming an echo chamber. New authors still have a chance to be discovered through other scoring components.

2. Content Engagement Score (25% Weight)

The content engagement score reflects how the broader community has interacted with a post. This helps surface quality content that resonates with users.

How It Works

def get_content_engagement_score(post_data):
    # Extract interaction counts
    likes = post_data.get('favourites_count', 0)
    boosts = post_data.get('reblogs_count', 0)
    replies = post_data.get('replies_count', 0)

    # Sum total engagement
    total_engagement = likes + boosts + replies

    if total_engagement == 0:
        return 0.0

    # Apply logarithmic scaling to prevent viral posts from dominating
    return log(total_engagement + 1) / 10.0

Mathematical Formula

The score uses logarithmic scaling to handle the wide range of engagement values:

\[\text{score} = \frac{\ln(\text{total\_engagement} + 1)}{10}\]

Where:

  • total_engagement = likes + boosts + replies
  • The +1 prevents log(0) errors
  • Division by 10 normalizes to a 0-1 range for typical engagement levels

Why This Matters

Logarithmic scaling ensures that:

  • Posts with 10 interactions score ~0.24
  • Posts with 100 interactions score ~0.46
  • Posts with 1000 interactions score ~0.69

This prevents extremely viral posts from completely dominating recommendations while still rewarding quality content.
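
The scaling is easy to verify with a small standalone sketch (not the engine's actual module, just the formula above):

from math import log

def engagement_score(likes, boosts, replies):
    # Natural log dampens the effect of very large counts
    total = likes + boosts + replies
    return 0.0 if total == 0 else log(total + 1) / 10.0

for total in (10, 100, 1000):
    print(total, round(engagement_score(total, 0, 0), 2))  # 0.24, 0.46, 0.69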

3. Recency Score (20% Weight)

The recency score ensures fresh content appears in recommendations while not completely excluding older posts that might still be relevant.

How It Works

def get_recency_score(post_data):
    # Calculate post age in days
    created_at = post_data.get('created_at')
    now = datetime.now()
    age_days = (now - created_at).total_seconds() / (24 * 3600)

    # Apply exponential decay
    decay_factor = 7.0  # Configurable, default 7 days
    recency_score = exp(-age_days / decay_factor)

    return clamp(recency_score, 0.2, 1.0)

Mathematical Formula

The score uses exponential decay:

\[\text{score} = e^{-\frac{\text{age\_days}}{\text{decay\_factor}}}\]

Where:

  • age_days = days since post creation
  • decay_factor = 7 (configurable) controls the decay rate
  • The minimum score is clamped at 0.2 to give older posts a chance

Decay Curve

  • Fresh posts (0 days): 1.00
  • 1 day old: ~0.87
  • 3 days old: ~0.65
  • 7 days old: ~0.37
  • 14 days old: ~0.20 (minimum)
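
The decay_factor setting controls how steep this curve is. A brief illustrative sketch of the formula, comparing the default of 7 days with a hypothetical faster decay of 3 days:

from math import exp

def recency_score(age_days, decay_factor=7.0):
    # Exponential decay with a 0.2 floor so older posts keep some visibility
    return max(0.2, min(1.0, exp(-age_days / decay_factor)))

for days in (0, 1, 3, 7, 14):
    default = round(recency_score(days), 2)       # 1.0, 0.87, 0.65, 0.37, 0.2
    faster = round(recency_score(days, 3.0), 2)   # 1.0, 0.72, 0.37, 0.2, 0.2
    print(days, default, faster)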

Why This Matters

This balanced approach ensures:

  • Fresh content gets priority
  • Week-old content still has reasonable visibility
  • Evergreen content isn't completely excluded

4. Semantic Similarity Score (45% Weight)

The semantic similarity score is the most heavily weighted component, using advanced NLP to understand content meaning and match it with user preferences.

How It Works

def get_semantic_similarity_score(post_data, user_taste_vector):
    # Get post embedding (384-dimensional vector)
    post_embedding = post_data.get('embedding')

    if not post_embedding or user_taste_vector is None:
        return 0.0

    # Normalize both vectors
    post_normalized = normalize(post_embedding)
    user_normalized = normalize(user_taste_vector)

    # Calculate cosine similarity
    similarity = dot_product(post_normalized, user_normalized)

    # Convert from [-1, 1] to [0, 1] range
    normalized_score = (similarity + 1) / 2

    return clamp(normalized_score, 0.0, 1.0)

Building the User Taste Vector

The user taste vector is constructed by:

  1. Collecting embeddings from posts the user positively interacted with
  2. Averaging these embeddings to create a "taste profile"
  3. Normalizing the resulting vector

def build_user_taste_vector(user_interactions):
    # Get embeddings from liked posts (last 20 interactions)
    positive_posts = filter_positive_interactions(user_interactions)
    embeddings = get_embeddings(positive_posts[-20:])

    # Average embeddings to create taste vector
    taste_vector = mean(embeddings, axis=0)

    # Normalize to unit vector
    return normalize(taste_vector)

Mathematical Formula

Cosine similarity measures the angle between vectors:

\[\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||}\]

Where:

  • A = normalized post embedding
  • B = normalized user taste vector
  • The result is in the range [-1, 1], transformed to [0, 1]
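
For illustration, here is a small NumPy sketch of the same calculation using random stand-in vectors rather than real embeddings or taste profiles:

import numpy as np

def semantic_similarity(post_embedding, taste_vector):
    # Cosine similarity between unit-normalized vectors, mapped to [0, 1]
    a = post_embedding / np.linalg.norm(post_embedding)
    b = taste_vector / np.linalg.norm(taste_vector)
    similarity = float(np.dot(a, b))  # in [-1, 1]
    return np.clip((similarity + 1) / 2, 0.0, 1.0)

# Toy 384-dimensional vectors standing in for real embeddings
rng = np.random.default_rng(0)
post, taste = rng.normal(size=384), rng.normal(size=384)
print(semantic_similarity(post, taste))  # unrelated random vectors land near 0.5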

Why This Matters

With 45% weight, semantic similarity:

  • Helps users discover content similar to what they've enjoyed
  • Works across languages (multilingual embeddings)
  • Understands context beyond keywords
  • Drives content discovery and serendipity

Combining the Scores

The final ranking score is a weighted combination of all four components:

def calculate_ranking_score(post_data, user_interactions, user_taste_vector):
    # Get individual scores
    author_score = get_author_preference_score(user_interactions, post_data['author_id'])
    engagement_score = get_content_engagement_score(post_data)
    recency_score = get_recency_score(post_data)
    semantic_score = get_semantic_similarity_score(post_data, user_taste_vector)

    # Apply weights
    weights = {
        'author': 0.10,
        'engagement': 0.25,
        'recency': 0.20,
        'semantic': 0.45
    }

    # Calculate weighted sum
    final_score = (
        weights['author'] * author_score +
        weights['engagement'] * engagement_score +
        weights['recency'] * recency_score +
        weights['semantic'] * semantic_score
    )

    return final_score
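
To make the weighting concrete, here is a short worked example with hypothetical component scores (not real data):

weights = {'author': 0.10, 'engagement': 0.25, 'recency': 0.20, 'semantic': 0.45}
scores = {'author': 0.50, 'engagement': 0.46, 'recency': 0.87, 'semantic': 0.70}

final_score = sum(weights[k] * scores[k] for k in weights)
print(round(final_score, 3))  # 0.654 — a "strong recommendation" on the scale below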

Score Distribution

The final scores typically follow this distribution:

  • 0.8 - 1.0: Exceptional matches (rare)
  • 0.6 - 0.8: Strong recommendations
  • 0.4 - 0.6: Good recommendations
  • 0.2 - 0.4: Moderate relevance
  • 0.0 - 0.2: Low relevance (rarely shown)

Algorithm Configuration

The recommendation engine is highly configurable through environment variables:

ALGORITHM_CONFIG = {
    "weights": {
        "author_preference": 0.10,    # RANKING_WEIGHT_AUTHOR
        "content_engagement": 0.25,   # RANKING_WEIGHT_ENGAGEMENT
        "recency": 0.20,             # RANKING_WEIGHT_RECENCY
        "semantic_similarity": 0.45  # RANKING_WEIGHT_SEMANTIC
    },
    "time_decay_days": 7,            # RANKING_TIME_DECAY_DAYS
    "min_interactions": 0,           # RANKING_MIN_INTERACTIONS
    "max_candidates": 100            # RANKING_MAX_CANDIDATES
}
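
One plausible way to read these variables at startup is sketched below; the exact loading code in Corgi may differ, but the variable names and defaults match the configuration above:

import os

def load_weights():
    # Fall back to the documented defaults when a variable is unset
    return {
        "author_preference": float(os.getenv("RANKING_WEIGHT_AUTHOR", "0.10")),
        "content_engagement": float(os.getenv("RANKING_WEIGHT_ENGAGEMENT", "0.25")),
        "recency": float(os.getenv("RANKING_WEIGHT_RECENCY", "0.20")),
        "semantic_similarity": float(os.getenv("RANKING_WEIGHT_SEMANTIC", "0.45")),
    }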

Tuning for Different Use Cases

For Discovery-Focused Experience (see the example configuration after this list):

  • Increase semantic_similarity to 0.50-0.60
  • Decrease author_preference to 0.05
  • Keep engagement moderate to avoid popularity bias

For Comfort-Zone Experience:

  • Increase author_preference to 0.20-0.30
  • Decrease semantic_similarity to 0.30
  • Increase recency for familiar, fresh content

For Trending Content:

  • Increase content_engagement to 0.35-0.40
  • Increase recency to 0.30
  • Balance the other components accordingly
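
Whichever profile you choose, the four weights should still sum to 1.0 so the final score stays in the 0-1 range. A hypothetical discovery-focused configuration, with illustrative values inside the ranges above, might look like this:

DISCOVERY_WEIGHTS = {
    "author_preference": 0.05,
    "content_engagement": 0.20,
    "recency": 0.20,
    "semantic_similarity": 0.55,
}
assert abs(sum(DISCOVERY_WEIGHTS.values()) - 1.0) < 1e-9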

A/B Testing Integration

The recommendation engine integrates with Corgi's A/B testing framework, allowing weights to be adjusted dynamically per user segment:

# Get user-specific configuration from A/B test
semantic_config = get_user_semantic_config(user_id)
if semantic_config:
    algorithm_config = semantic_config
else:
    algorithm_config = ALGORITHM_CONFIG

This enables:

  • Testing different weight combinations
  • Gradual rollout of algorithm changes
  • User segment-specific optimizations
  • Data-driven algorithm improvements

Performance Considerations

Caching Strategy

The engine implements multi-level caching:

  1. User taste vectors: Cached for 30 minutes
  2. Post embeddings: Cached indefinitely
  3. Final scores: Cached for 5 minutes per user
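
As a rough sketch, a per-entry TTL cache for taste vectors and final scores could look like the following; this assumes a simple in-process dictionary, whereas the actual store (for example Redis) may differ:

import time

class TTLCache:
    """Tiny in-process cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

# Taste vectors cached for 30 minutes, final scores for 5 minutes
taste_vector_cache = TTLCache(ttl_seconds=30 * 60)
score_cache = TTLCache(ttl_seconds=5 * 60)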

Query Optimization

-- Efficient candidate retrieval with indexes
SELECT post_id, author_id, content, created_at, embedding
FROM crawled_posts
WHERE created_at > (CURRENT_TIMESTAMP - INTERVAL '7 days')
ORDER BY created_at DESC
LIMIT 300;  -- 3x requested to allow filtering

Computational Complexity

  • Author preference: O(n) where n = user interactions
  • Engagement score: O(1) simple calculation
  • Recency score: O(1) simple calculation
  • Semantic similarity: O(d) where d = embedding dimension (384)
  • Total: O(n + m×d) where m = number of candidates

Summary

The Corgi Recommendation Engine represents a carefully balanced approach to content recommendation:

  • 45% Semantic Understanding: Drives content discovery through meaning
  • 25% Community Signals: Surfaces quality content
  • 20% Freshness: Keeps the feed current
  • 10% Personal History: Maintains some familiarity

This distribution prioritizes helping users discover new, relevant content while maintaining enough personalization to feel familiar. The modular design allows for easy experimentation and optimization based on user feedback and engagement metrics.

The engine's strength lies not in any single component, but in how they work together to create a recommendation experience that feels both personalized and full of pleasant surprises.