
Pure Python Data Science — Building a Social Network from Scratch

Learn how to load, clean, and analyze social network data using only pure Python. Build People You May Know & Projects You Might Like features from scratch.

1. Introduction

What Are We Building?

Imagine you just joined a startup called DevConnect — a social platform exclusively for software developers. Think of it like LinkedIn but only for coders, built in Bangalore.

Your first assignment: analyze user data and build smart recommendation features — but here's the twist — you can only use pure Python. No pandas. No NumPy. No scikit-learn. Just Python's standard library.

Sounds scary? It shouldn't. This is actually how recommendation systems worked in the early days of social networks. And mastering this makes you a far better data scientist because you truly understand what pandas and NumPy are doing under the hood.

Why Does This Matter?

  • Interviews: FAANG and startups both love asking pure Python data questions

  • Understanding: When you can build it from scratch, you own the concept

  • Lightweight systems: Not every production system needs heavy ML libraries

  • Embedded systems / APIs: Sometimes you need lean, dependency-free code

What You'll Build by the End

| Feature | Description |
|---|---|
| Data Loader | Load and parse JSON user data |
| Data Cleaner | Handle missing values, duplicates, inactive users |
| People You May Know | Mutual friend-based recommendations |
| Developers Like You | Skill/interest-based user matching |
| Projects You Might Like | Collaborative filtering on project interactions |


2. Basic Concepts

2.1 The Data Model

Before we write a single line of code, understand the data we're working with. DevConnect has two entities — users and projects:

JSON
{
  "users": [
    {
      "id": 1,
      "name": "Arjun",
      "city": "Bangalore",
      "skills": ["Python", "Django"],
      "connections": [2, 3],
      "liked_projects": [201, 202]
    },
    {
      "id": 2,
      "name": "Sneha",
      "city": "Mumbai",
      "skills": ["JavaScript", "React"],
      "connections": [1, 4],
      "liked_projects": [202, 203]
    },
    {
      "id": 3,
      "name": "Vikram",
      "city": "Hyderabad",
      "skills": ["Python", "ML"],
      "connections": [1],
      "liked_projects": [201, 204]
    },
    {
      "id": 4,
      "name": "Nisha",
      "city": "Delhi",
      "skills": ["Java", "Spring Boot"],
      "connections": [2],
      "liked_projects": [205]
    }
  ],
  "projects": [
    {"id": 201, "name": "Open Source ML Toolkit"},
    {"id": 202, "name": "React Component Library"},
    {"id": 203, "name": "Node.js REST Boilerplate"},
    {"id": 204, "name": "Python Automation Scripts"},
    {"id": 205, "name": "Java Microservices Template"}
  ]
}

2.2 Key Python Concepts You'll Use

Before diving in, here are the pure Python tools we'll rely on:

Python
# 1. json module — read/write JSON files
import json

# 2. set() — for finding intersections (mutual friends, common pages)
mutual = set([1, 2, 3]).intersection(set([2, 3, 4]))  # {2, 3}

# 3. list comprehension — filter and transform lists cleanly
clean_users = [u for u in users if u["name"].strip()]

# 4. dict.get() — safe dictionary access with default
count = suggestions.get(user_id, 0) + 1

# 5. sorted() with lambda — rank recommendations
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

# 6. collections.defaultdict — auto-initialize dict values
from collections import defaultdict
graph = defaultdict(set)

2.3 Understanding the Graph Structure

Connections in social networks are essentially graphs. When Arjun connects with Sneha, that's an edge between two nodes.

Text
Arjun (1) ——— Sneha (2)
  |               |
Vikram (3)      Nisha (4)

In pure Python, we represent this as a dictionary of sets:

Python
connection_graph = {
    1: {2, 3},   # Arjun's connections
    2: {1, 4},   # Sneha's connections
    3: {1},      # Vikram's connections
    4: {2}       # Nisha's connections
}
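Section 2.2 imported collections.defaultdict but we haven't put it to work yet. Here is a sketch of how you might build that adjacency map from the users list (the function name build_graph and the symmetric-edge step are illustrative additions — the sample data already lists both directions of every connection):

```python
from collections import defaultdict

def build_graph(users):
    """Build {user_id: set_of_connected_ids} from a list of user dicts."""
    graph = defaultdict(set)
    for user in users:
        for friend_id in user.get("connections", []):
            graph[user["id"]].add(friend_id)
            graph[friend_id].add(user["id"])  # keep the edge symmetric
    return graph

# The four users from the sample data above (connections only)
users = [
    {"id": 1, "connections": [2, 3]},
    {"id": 2, "connections": [1, 4]},
    {"id": 3, "connections": [1]},
    {"id": 4, "connections": [2]},
]
graph = build_graph(users)  # same mapping as connection_graph above
```

Because defaultdict(set) auto-creates an empty set on first access, we never need to check whether a key exists before calling .add().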

3. All Functions and Features

Let's build a complete utility module. Each function does one job cleanly.

3.1 Loading Data

Python
import json

def load_data(filepath: str) -> dict:
    """
    Load JSON data from a file and return as a Python dictionary.
    
    Args:
        filepath: Path to the JSON file
    
    Returns:
        dict: Parsed data with 'users' and 'projects' keys
    
    Note:
        Returns an empty dict (and prints an error) if the file is
        missing or the JSON is malformed.
    """
    try:
        with open(filepath, "r", encoding="utf-8") as file:
            data = json.load(file)
        print(f"✅ Loaded {len(data.get('users', []))} users and {len(data.get('projects', []))} projects.")
        return data
    except FileNotFoundError:
        print(f"❌ Error: File '{filepath}' not found.")
        return {}
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing error: {e}")
        return {}

3.2 Displaying Data

Python
def display_users(data: dict) -> None:
    """Print all users and their connections in a readable format."""
    print("\n" + "="*50)
    print("       DEVCONNECT USERS")
    print("="*50)
    
    for user in data.get("users", []):
        print(f"\n👤 {user['name']} (ID: {user['id']}) | 📍 {user.get('city', 'Unknown')}")
        print(f"   🔧 Skills     : {', '.join(user.get('skills', []))}")
        print(f"   🤝 Connections: {user.get('connections', [])}")
        print(f"   ❤️  Liked      : {user.get('liked_projects', [])}")

def display_projects(data: dict) -> None:
    """Print all available projects."""
    print("\n" + "="*50)
    print("       AVAILABLE PROJECTS")
    print("="*50)
    for project in data.get("projects", []):
        print(f"  [{project['id']}] {project['name']}")

3.3 Saving Data

Python
def save_data(data: dict, filepath: str) -> None:
    """
    Save processed data back to a JSON file.
    
    Args:
        data: Dictionary to save
        filepath: Output file path
    """
    with open(filepath, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)
    print(f"✅ Data saved to '{filepath}'")

4. Intermediate Usage

4.1 Data Cleaning — The Real World is Messy

Here's a realistic messy dataset you'd actually encounter:

JSON
{
  "users": [
    {"id": 1, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]},
    {"id": 2, "name": "", "city": "Mumbai", "skills": [], "connections": [1], "liked_projects": [202]},
    {"id": 3, "name": "Vikram", "city": "  ", "skills": ["ML", "ML"], "connections": [1, 1], "liked_projects": []},
    {"id": 4, "name": "Nisha", "city": "Delhi", "skills": ["Java"], "connections": [], "liked_projects": []},
    {"id": 5, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]}
  ],
  "projects": [
    {"id": 201, "name": "ML Toolkit"},
    {"id": 202, "name": "React Library"},
    {"id": 201, "name": "ML Toolkit Pro"},
    {"id": 203, "name": ""}
  ]
}

Problems identified:

| Issue | Who | Problem |
|---|---|---|
| Empty name | User ID 2 | name: "" |
| Whitespace city | User ID 3 | city: "  " |
| Duplicate skills | User ID 3 | ["ML", "ML"] |
| Duplicate connections | User ID 3 | [1, 1] |
| Inactive user | User ID 4 | No connections, no likes |
| Duplicate user | User ID 5 | Same as User 1 |
| Duplicate project ID | Project 201 | Two entries with same ID |
| Empty project name | Project 203 | name: "" |

4.2 Complete Data Cleaning Function

Python
def clean_data(data: dict) -> dict:
    """
    Clean the raw DevConnect data by:
    - Removing users with empty names
    - Normalizing city names (strip whitespace)
    - Removing duplicate skills and connections
    - Removing inactive users (no connections AND no likes)
    - Removing duplicate users (same name + city combo)
    - Deduplicating projects by ID
    - Removing projects with empty names
    
    Args:
        data: Raw data dictionary
    
    Returns:
        dict: Cleaned data dictionary
    """
    users = data.get("users", [])
    projects = data.get("projects", [])
    
    # --- Step 1: Remove users with empty or whitespace-only names ---
    users = [u for u in users if u.get("name", "").strip()]
    print(f"After removing empty names: {len(users)} users")
    
    # --- Step 2: Normalize city field ---
    for user in users:
        user["city"] = user.get("city", "").strip() or "Unknown"
    
    # --- Step 3: Remove duplicate skills ---
    for user in users:
        user["skills"] = list(set(user.get("skills", [])))
    
    # --- Step 4: Remove duplicate connections ---
    for user in users:
        user["connections"] = list(set(user.get("connections", [])))
    
    # --- Step 5: Remove inactive users ---
    users = [
        u for u in users
        if u.get("connections") or u.get("liked_projects")
    ]
    print(f"After removing inactive users: {len(users)} users")
    
    # --- Step 6: Remove duplicate users (same name + city) ---
    seen_identities = set()
    unique_users = []
    for user in users:
        identity = (user["name"].lower(), user["city"].lower())
        if identity not in seen_identities:
            seen_identities.add(identity)
            unique_users.append(user)
    users = unique_users
    print(f"After removing duplicate users: {len(users)} users")
    
    # --- Step 7: Deduplicate projects by ID, keep last occurrence ---
    unique_projects = {}
    for project in projects:
        if project.get("name", "").strip():  # Also remove empty project names
            unique_projects[project["id"]] = project
    projects = list(unique_projects.values())
    print(f"After deduplicating projects: {len(projects)} projects")
    
    data["users"] = users
    data["projects"] = projects
    return data

Running it:

Python
raw_data = load_data("devconnect_raw.json")
cleaned_data = clean_data(raw_data)
save_data(cleaned_data, "devconnect_clean.json")

Expected Output:

Text
✅ Loaded 5 users and 4 projects.
After removing empty names: 4 users
After removing inactive users: 3 users
After removing duplicate users: 2 users
After deduplicating projects: 2 projects
✅ Data saved to 'devconnect_clean.json'

5. Advanced Concepts

5.1 People You May Know — Mutual Connection Algorithm

This is a classic 2-hop graph traversal problem. The idea:

"Walk through my friends. For each of their friends — if I don't already know them, they're a candidate. Count how many of my friends know them = mutual count. Higher mutual count = stronger recommendation."

Python
def people_you_may_know(user_id: int, data: dict, top_n: int = 5) -> list:
    """
    Suggest users that a given user might want to connect with,
    based on mutual connections.
    
    Algorithm:
    1. Get direct connections of the target user
    2. For each direct connection, get THEIR connections
    3. Filter out: the user themselves + already-connected users
    4. Count mutual connections (score)
    5. Sort by score descending
    
    Args:
        user_id : The ID of the user we're recommending for
        data    : Cleaned data dictionary
        top_n   : How many recommendations to return
    
    Returns:
        list of (user_id, mutual_count) tuples, sorted by mutual_count desc
    """
    # Build a connection lookup: {user_id: set_of_connections}
    connection_map = {}
    for user in data["users"]:
        connection_map[user["id"]] = set(user.get("connections", []))
    
    # Validate user exists
    if user_id not in connection_map:
        print(f"❌ User ID {user_id} not found.")
        return []
    
    my_connections = connection_map[user_id]
    candidate_scores = {}
    
    # Walk 2 hops
    for friend_id in my_connections:
        friends_of_friend = connection_map.get(friend_id, set())
        
        for candidate_id in friends_of_friend:
            # Skip: myself, and people I already know
            if candidate_id == user_id or candidate_id in my_connections:
                continue
            
            # Increment mutual friend count for this candidate
            candidate_scores[candidate_id] = candidate_scores.get(candidate_id, 0) + 1
    
    # Sort by mutual count (descending), return top N
    ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

Usage and Output:

Python
data = load_data("devconnect_clean.json")  # assumes the four-user dataset from Section 2.1

# For Arjun (ID: 1) who knows Sneha (2) and Vikram (3)
results = people_you_may_know(1, data)

print("\n🤝 People You May Know:")
user_lookup = {u["id"]: u["name"] for u in data["users"]}

for candidate_id, mutual_count in results:
    name = user_lookup.get(candidate_id, "Unknown")
    print(f"  👤 {name} (ID: {candidate_id}) — {mutual_count} mutual connection(s)")
Output:

Text
🤝 People You May Know:
  👤 Nisha (ID: 4) — 1 mutual connection(s)

Why Nisha? Because Arjun → Sneha → Nisha. Sneha is Arjun's friend, and Nisha is Sneha's friend. So Nisha is 2 hops away from Arjun with 1 mutual.

5.2 Developers Like You — Skill-Based Matching

Beyond mutual connections, we can match based on shared skills. This is a form of content-based filtering.

Python
def developers_like_you(user_id: int, data: dict, top_n: int = 3) -> list:
    """
    Recommend developers with similar skill sets.
    
    Score = number of skills in common.
    
    Args:
        user_id : Target user ID
        data    : Cleaned data dictionary
        top_n   : Number of recommendations
    
    Returns:
        list of (user_id, shared_skill_count, shared_skills) tuples
    """
    skill_map = {}
    for user in data["users"]:
        skill_map[user["id"]] = set(user.get("skills", []))
    
    if user_id not in skill_map:
        return []
    
    my_skills = skill_map[user_id]
    
    if not my_skills:
        print(f"⚠️ User {user_id} has no skills listed. Cannot find matches.")
        return []
    
    scores = []
    for other_id, other_skills in skill_map.items():
        if other_id == user_id:
            continue
        
        shared = my_skills.intersection(other_skills)
        if shared:
            scores.append((other_id, len(shared), list(shared)))
    
    # Sort by number of shared skills, descending
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_n]

Usage:

Python
matches = developers_like_you(1, data)

print("\n💡 Developers Like You:")
for dev_id, score, skills in matches:
    name = user_lookup.get(dev_id, "Unknown")
    print(f"  🧑‍💻 {name} — {score} shared skill(s): {skills}")
Output:

Text
💡 Developers Like You:
  🧑‍💻 Vikram — 1 shared skill(s): ['Python']

5.3 Projects You Might Like — Collaborative Filtering

This is the collaborative filtering algorithm — the same core idea behind Netflix, Spotify, and Amazon recommendations.

Python
def projects_you_might_like(user_id: int, data: dict, top_n: int = 5) -> list:
    """
    Recommend projects based on collaborative filtering.
    
    Logic:
    - Find other users who liked some of the same projects as you (similar users)
    - Look at what ELSE those users liked that you haven't seen yet
    - Score each unseen project by: SUM of (overlap_count for each similar user who liked it)
    - Higher score = stronger recommendation
    
    Args:
        user_id : Target user ID
        data    : Cleaned data dictionary
        top_n   : Number of project recommendations to return
    
    Returns:
        list of (project_id, score) tuples sorted by score descending
    """
    # Build: {user_id: set_of_liked_project_ids}
    user_project_map = {}
    for user in data["users"]:
        user_project_map[user["id"]] = set(user.get("liked_projects", []))
    
    if user_id not in user_project_map:
        return []
    
    my_projects = user_project_map[user_id]
    project_scores = {}
    
    for other_id, their_projects in user_project_map.items():
        if other_id == user_id:
            continue
        
        # How similar is this other user to me?
        overlap = my_projects.intersection(their_projects)
        similarity_score = len(overlap)
        
        if similarity_score == 0:
            continue  # No common ground, skip this user entirely
        
        # What did they like that I haven't seen?
        unseen = their_projects - my_projects
        
        for project_id in unseen:
            # Weight recommendation by similarity score
            project_scores[project_id] = project_scores.get(project_id, 0) + similarity_score
    
    ranked = sorted(project_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

Usage:

Python
project_lookup = {p["id"]: p["name"] for p in data["projects"]}
recommendations = projects_you_might_like(1, data)

print("\n📂 Projects You Might Like:")
for proj_id, score in recommendations:
    name = project_lookup.get(proj_id, "Unknown Project")
    print(f"  ⭐ {name} (ID: {proj_id}) — Recommendation score: {score}")
Output:

Text
📂 Projects You Might Like:
  ⭐ Node.js REST Boilerplate (ID: 203) — Recommendation score: 1
  ⭐ Python Automation Scripts (ID: 204) — Recommendation score: 1

6. Real-World Use Cases

6.1 Where This Pattern Shows Up in Production

| Platform | Feature | Algorithm Used |
|---|---|---|
| LinkedIn | "People You May Know" | Mutual connections (2-hop BFS) |
| Twitter/X | "Who To Follow" | Mutual follows + interest overlap |
| GitHub | "Repositories You Might Like" | Collaborative filtering on stars |
| Netflix | "Because You Watched..." | Item-based collaborative filtering |
| Spotify | "Discover Weekly" | Matrix factorization (advanced CF) |
| Amazon | "Customers Also Bought" | Item-to-item collaborative filtering |

6.2 The Pipeline in Production

Text
[Raw Data Source] → [ETL / Cleaning] → [Feature Extraction] → [Recommendation Engine] → [API] → [Frontend]
     ↑                                                                                        ↓
[User Actions] ←————————————————————————————————————————————————————————————————————————————

In our mini project:

Text
devconnect_raw.json → clean_data() → people_you_may_know() / projects_you_might_like() → JSON output → REST API

6.3 Wrapping It All as an API Response

Python
import json

def get_full_recommendations(user_id: int, data: dict) -> dict:
    """
    Returns a complete recommendation payload for a given user.
    Suitable for serving via a REST API endpoint.
    """
    user_lookup = {u["id"]: u["name"] for u in data["users"]}
    project_lookup = {p["id"]: p["name"] for p in data["projects"]}
    
    # People recommendations
    pymk_raw = people_you_may_know(user_id, data)
    pymk = [
        {"user_id": uid, "name": user_lookup.get(uid, "?"), "mutual_connections": score}
        for uid, score in pymk_raw
    ]
    
    # Project recommendations
    pyml_raw = projects_you_might_like(user_id, data)
    pyml = [
        {"project_id": pid, "name": project_lookup.get(pid, "?"), "score": score}
        for pid, score in pyml_raw
    ]
    
    # Skill-based developer matches
    dly_raw = developers_like_you(user_id, data)
    dly = [
        {"user_id": uid, "name": user_lookup.get(uid, "?"), "shared_skills": skills, "score": score}
        for uid, score, skills in dly_raw
    ]
    
    return {
        "user_id": user_id,
        "people_you_may_know": pymk,
        "projects_you_might_like": pyml,
        "developers_like_you": dly
    }


# Usage
data = load_data("devconnect_clean.json")  # assumes the four-user dataset from Section 2.1
response = get_full_recommendations(1, data)
print(json.dumps(response, indent=2))

Output:

JSON
{
  "user_id": 1,
  "people_you_may_know": [
    {"user_id": 4, "name": "Nisha", "mutual_connections": 1}
  ],
  "projects_you_might_like": [
    {"project_id": 203, "name": "Node.js REST Boilerplate", "score": 1},
    {"project_id": 204, "name": "Python Automation Scripts", "score": 1}
  ],
  "developers_like_you": [
    {"user_id": 3, "name": "Vikram", "shared_skills": ["Python"], "score": 1}
  ]
}

This JSON is exactly what a Flask or FastAPI backend would return to a React/Vue frontend. Clean, structured, production-ready.


7. Practical Examples

7.1 Full Working Script — End to End

Python
# devconnect_engine.py
# Pure Python recommendation engine — no external libraries required

import json
from collections import defaultdict


# ─────────────────────────────────────────
#  STEP 1: I/O UTILITIES
# ─────────────────────────────────────────

def load_data(filepath):
    # Open the JSON file safely with UTF-8 encoding
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            return json.load(f)  # Parse JSON → Python dict
    except (FileNotFoundError, json.JSONDecodeError) as e:
        print(f"Error loading data: {e}")
        return {"users": [], "projects": []}

def save_data(data, filepath):
    # Write Python dict back to JSON with pretty formatting
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# ─────────────────────────────────────────
#  STEP 2: DATA CLEANING
# ─────────────────────────────────────────

def clean_data(data):
    users = data.get("users", [])
    projects = data.get("projects", [])
    
    # Filter out users with blank names
    users = [u for u in users if u.get("name", "").strip()]
    
    # Normalize each user's data
    seen = set()
    clean_users = []
    for u in users:
        # Deduplicate connections and skills using set()
        u["connections"] = list(set(u.get("connections", [])))
        u["skills"] = list(set(u.get("skills", [])))
        u["city"] = u.get("city", "").strip() or "Unknown"
        
        # Skip inactive users (nothing to analyze)
        if not u["connections"] and not u.get("liked_projects"):
            continue
        
        # Skip duplicate users (same name + city = same person)
        fingerprint = (u["name"].lower(), u["city"].lower())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        clean_users.append(u)
    
    # Deduplicate projects: last write wins for same ID
    project_map = {}
    for p in projects:
        if p.get("name", "").strip():  # Ignore projects with empty names
            project_map[p["id"]] = p
    
    data["users"] = clean_users
    data["projects"] = list(project_map.values())
    return data


# ─────────────────────────────────────────
#  STEP 3: PEOPLE YOU MAY KNOW
# ─────────────────────────────────────────

def people_you_may_know(user_id, data, top_n=5):
    # Build adjacency: {user_id → set of connected user IDs}
    conn_map = {u["id"]: set(u.get("connections", [])) for u in data["users"]}
    
    if user_id not in conn_map:
        return []
    
    my_conns = conn_map[user_id]   # My direct connections
    scores = {}                     # candidate_id → mutual count
    
    for friend in my_conns:        # For each person I know
        for candidate in conn_map.get(friend, set()):  # Their connections
            if candidate == user_id or candidate in my_conns:
                continue            # Skip: myself + people I already know
            scores[candidate] = scores.get(candidate, 0) + 1  # Count mutual
    
    # Return top N sorted by mutual count
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  STEP 4: PROJECTS YOU MIGHT LIKE
# ─────────────────────────────────────────

def projects_you_might_like(user_id, data, top_n=5):
    # Build: {user_id → set of liked project IDs}
    proj_map = {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]}
    
    if user_id not in proj_map:
        return []
    
    my_likes = proj_map[user_id]   # Projects I already like
    scores = {}                     # project_id → recommendation score
    
    for other_id, their_likes in proj_map.items():
        if other_id == user_id:
            continue
        
        # How similar is this person to me?
        overlap = len(my_likes.intersection(their_likes))
        if overlap == 0:
            continue  # Not similar at all, ignore
        
        # Projects they liked that I haven't seen
        for proj in their_likes - my_likes:
            scores[proj] = scores.get(proj, 0) + overlap  # Weight by similarity
    
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  STEP 5: DEVELOPERS LIKE YOU
# ─────────────────────────────────────────

def developers_like_you(user_id, data, top_n=3):
    # Build: {user_id → set of skills}
    skill_map = {u["id"]: set(u.get("skills", [])) for u in data["users"]}
    
    if user_id not in skill_map:
        return []
    
    my_skills = skill_map[user_id]
    results = []
    
    for other_id, their_skills in skill_map.items():
        if other_id == user_id:
            continue
        shared = my_skills.intersection(their_skills)
        if shared:
            results.append((other_id, len(shared), list(shared)))
    
    return sorted(results, key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  MAIN EXECUTION
# ─────────────────────────────────────────

if __name__ == "__main__":
    # Load and clean data
    raw = load_data("devconnect_raw.json")
    clean = clean_data(raw)
    save_data(clean, "devconnect_clean.json")
    
    # Lookup helpers
    user_names = {u["id"]: u["name"] for u in clean["users"]}
    proj_names = {p["id"]: p["name"] for p in clean["projects"]}
    
    TARGET_USER = 1
    print(f"\n📊 Recommendations for: {user_names.get(TARGET_USER, '?')}")
    print("="*55)
    
    # People You May Know
    print("\n🤝 People You May Know:")
    for uid, mutual in people_you_may_know(TARGET_USER, clean):
        print(f"   → {user_names.get(uid, '?')} ({mutual} mutual)")
    
    # Projects You Might Like
    print("\n📂 Projects You Might Like:")
    for pid, score in projects_you_might_like(TARGET_USER, clean):
        print(f"   → {proj_names.get(pid, '?')} (score: {score})")
    
    # Developers Like You
    print("\n💡 Developers Like You:")
    for uid, score, skills in developers_like_you(TARGET_USER, clean):
        print(f"   → {user_names.get(uid, '?')} | Common Skills: {skills}")

8. Edge Cases and Errors

8.1 Common Issues and Fixes

Python
# ❌ MISTAKE 1: Modifying a list while iterating over it
for user in data["users"]:
    if not user["name"]:
        data["users"].remove(user)  # BUG! Skips elements

# ✅ FIX: Use list comprehension instead
data["users"] = [u for u in data["users"] if u["name"].strip()]


# ❌ MISTAKE 2: set() doesn't preserve order (can affect rankings)
user["connections"] = list(set(user["connections"]))
# This is fine for sets of IDs, but if ORDER matters, use:
seen_ids = set()
ordered_unique = []
for cid in user["connections"]:
    if cid not in seen_ids:
        seen_ids.add(cid)
        ordered_unique.append(cid)
user["connections"] = ordered_unique


# ❌ MISTAKE 3: KeyError when user ID is missing
my_conns = conn_map[user_id]  # Crashes if user_id not in map

# ✅ FIX: Always validate first
if user_id not in conn_map:
    return []
my_conns = conn_map[user_id]


# ❌ MISTAKE 4: Division by zero in scoring
similarity = overlap / total_items  # Crashes if total_items == 0

# ✅ FIX: Guard against zero
similarity = overlap / total_items if total_items > 0 else 0


# ❌ MISTAKE 5: Assuming JSON values are always lists
connections = user["connections"]  # What if it's None or missing?

# ✅ FIX: Use .get() with a default
connections = user.get("connections") or []

8.2 Debugging Checklist

Python
def validate_data(data):
    """Run sanity checks on cleaned data before running algorithms."""
    errors = []
    
    user_ids = {u["id"] for u in data["users"]}
    project_ids = {p["id"] for p in data["projects"]}  # build once, not per user
    
    for user in data["users"]:
        # Check: all connection IDs actually exist in the dataset
        for cid in user.get("connections", []):
            if cid not in user_ids:
                errors.append(f"User {user['id']} references unknown connection: {cid}")
        
        # Check: all liked project IDs actually exist
        for pid in user.get("liked_projects", []):
            if pid not in project_ids:
                errors.append(f"User {user['id']} references unknown project: {pid}")
    
    if errors:
        print("⚠️ Data validation failed:")
        for e in errors:
            print(f"  - {e}")
    else:
        print("✅ Data validation passed.")
    
    return len(errors) == 0

9. Pro Developer Insights

9.1 Performance Tips

Python
# ✅ TIP 1: Build lookup dictionaries ONCE, use everywhere
# DON'T do this inside every function call (rebuilds every time):
for user in data["users"]:
    if user["id"] == target_id:
        ...

# DO this: O(1) lookup instead of O(n) search
user_map = {u["id"]: u for u in data["users"]}  # Build once
target = user_map.get(target_id)                  # O(1) lookup


# ✅ TIP 2: Use sets for membership checks — O(1) vs O(n)
# Slow: "if candidate in [1, 2, 3, 4, 5]"  → O(n) list scan
# Fast: "if candidate in {1, 2, 3, 4, 5}"  → O(1) hash lookup


# ✅ TIP 3: Use generators for large datasets (memory efficiency)
def active_users(data):
    """Yield users one at a time instead of loading all into memory."""
    for user in data["users"]:
        if user.get("connections") or user.get("liked_projects"):
            yield user

# Process one by one without loading entire list
for user in active_users(data):
    process(user)


# ✅ TIP 4: Cache frequent computations
def build_indexes(data):
    """Pre-compute all lookup structures once."""
    return {
        "user_by_id": {u["id"]: u for u in data["users"]},
        "project_by_id": {p["id"]: p for p in data["projects"]},
        "connections": {u["id"]: set(u.get("connections", [])) for u in data["users"]},
        "liked_projects": {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]},
        "skills": {u["id"]: set(u.get("skills", [])) for u in data["users"]},
    }

9.2 Best Practices

  • Single Responsibility: Each function does exactly one thing

  • Defensive coding: Always use .get() with defaults for dict access

  • Type hints: Add them even in pure Python for readability

  • Separation of concerns: Keep data loading, cleaning, and analysis in separate functions/modules

  • Immutability when possible: Don't mutate input data in-place; return new structures
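As a sketch of that last point: our clean_data mutates the dictionary it receives. A hypothetical clean_data_pure (the name and the deepcopy approach are our own — only the empty-name step is shown) copies the input first, so the caller's data is never changed:

```python
import copy

def clean_data_pure(data):
    """Return a cleaned copy of `data`; the caller's dict is left untouched."""
    result = copy.deepcopy(data)  # private copy — no aliasing with the input
    result["users"] = [
        u for u in result.get("users", []) if u.get("name", "").strip()
    ]
    return result

raw = {"users": [{"id": 1, "name": "Arjun"}, {"id": 2, "name": ""}]}
cleaned = clean_data_pure(raw)
assert len(raw["users"]) == 2      # original is untouched
assert len(cleaned["users"]) == 1  # copy was filtered
```

deepcopy costs extra memory, so for very large datasets you might instead build new lists from scratch rather than copy-then-filter.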

9.3 Security in Production

Python
# ✅ Validate input user_id type before processing
def safe_recommend(user_id, data):
    if not isinstance(user_id, int) or user_id <= 0:
        raise ValueError(f"Invalid user_id: {user_id}")
    return people_you_may_know(user_id, data)

# ✅ Limit file sizes before loading in production
import os
MAX_FILE_SIZE_MB = 50

def safe_load(filepath):
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File too large: {size_mb:.1f} MB")
    return load_data(filepath)

10. Comparison

Pure Python vs pandas vs Graph Libraries

| Aspect | Pure Python | pandas | NetworkX (graph lib) |
|---|---|---|---|
| Dependencies | Zero | Heavy | Moderate |
| Setup | Instant | pip install | pip install |
| Speed (small data) | Fast | Slower (overhead) | Moderate |
| Speed (large data) | Slow | Very fast | Fast |
| Memory | Efficient | Higher | High |
| Code readability | Explicit | Concise | Concise |
| Recommendation logic | Manual | Manual + apply() | Built-in graph algos |
| Production use | Small/embedded | ETL pipelines | Social network analysis |
| Best for | Learning, APIs | Data analysis | Graph-heavy features |

When to Use Pure Python

✅ Use when:

  • Learning algorithms from scratch

  • Building lean microservices

  • No external dependencies allowed (security policy)

  • Data is small (< 10,000 users)

  • Embedding logic in another system

❌ Don't use when:

  • Dataset has millions of records (use pandas + NumPy)

  • You need matrix operations (use NumPy)

  • Complex graph algorithms like shortest path (use NetworkX)

  • Time-to-market matters more than learning (use scikit-learn)


11. Data Science Perspective

11.1 What We Implemented (ML Terms)

| What We Built | ML Term | Category |
|---|---|---|
| Mutual friends count | Graph-based similarity | Collaborative Filtering |
| Shared skills matching | Content-based filtering | Content-Based Recommendation |
| Liked projects overlap | User-User Collaborative Filtering | Collaborative Filtering |
| Ranking by score | Scoring & Ranking | Retrieval System |

11.2 The Recommendation System Taxonomy

Text
Recommendation Systems
├── Collaborative Filtering
│   ├── User-User (what we built for projects)
│   └── Item-Item (what Amazon uses for "Customers also bought")
├── Content-Based Filtering
│   └── Skills matching (what we built for developers)
├── Graph-Based
│   └── Mutual connections (what we built for people)
└── Hybrid (combination of all above → Netflix, Spotify)

11.3 Extending to Real ML

Once you understand this pure Python version, the next step is:

Python
# This is what our "collaborative filtering" becomes at scale:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# User-project matrix (our liked_projects as a binary matrix)
# User 1 likes [201, 202] → [1, 1, 0, 0, 0]
# User 2 likes [202, 203] → [0, 1, 1, 0, 0]

user_matrix = np.array([
    [1, 1, 0, 0, 0],  # Arjun
    [0, 1, 1, 0, 0],  # Sneha
    [1, 0, 0, 1, 0],  # Vikram
    [0, 0, 0, 0, 1],  # Nisha
])

similarity = cosine_similarity(user_matrix)
# similarity[0] tells us how similar Arjun is to everyone else

The pure Python .intersection() count IS essentially a simplified version of cosine similarity on binary vectors.
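To see that equivalence concretely: on binary like-vectors, cosine similarity reduces to the intersection size divided by the geometric mean of the two set sizes. A minimal sketch using sets of liked project IDs (the data is hypothetical):

```python
import math

def binary_cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary vectors represented as sets of liked IDs."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

arjun = {201, 202}
sneha = {202, 203}
print(binary_cosine(arjun, sneha))  # 1 / sqrt(2 * 2) = 0.5
```

This matches the 0.5 you would get from `cosine_similarity` on the binary rows `[1, 1, 0, 0, 0]` and `[0, 1, 1, 0, 0]` above.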


12. Interview Questions

Basic

Q1: What is collaborative filtering?

A recommendation technique that suggests items to a user based on the preferences of similar users. "People who liked what you liked also liked..."

Q2: Why use set() for finding mutual connections?

Set intersection runs in O(min(len(A), len(B))) on average, and set membership checks are O(1), making it much faster than nested loops over lists.
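A minimal side-by-side comparison (the connection lists are made up):

```python
# Mutual connections via set intersection vs. a nested-loop scan
a_conns = [2, 3, 5, 7]
b_conns = [3, 7, 9]

# O(len(a) * len(b)): each "x in b_conns" check scans the whole list
mutual_slow = [x for x in a_conns if x in b_conns]

# O(min(len(a), len(b))) on average: hash-based membership checks
mutual_fast = set(a_conns) & set(b_conns)

print(sorted(mutual_fast))  # [3, 7]
```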

Q3: What does dict.get(key, default) do?

Returns the value for key if it exists, otherwise returns default. Prevents KeyError exceptions.

Q4: What is the difference between json.load() and json.loads()?

json.load(file_object) reads from a file. json.loads(string) reads from a string.
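A quick demonstration of both, using an in-memory file object so the example is self-contained:

```python
import io
import json

raw = '{"id": 1, "name": "Arjun"}'

# json.loads: parse a string already in memory
user_from_string = json.loads(raw)

# json.load: parse from any file-like object (here an in-memory one)
user_from_file = json.load(io.StringIO(raw))

assert user_from_string == user_from_file == {"id": 1, "name": "Arjun"}
```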

Intermediate

Q5: What is a 2-hop graph traversal?

Starting from node A, visit A's neighbors (1 hop), then visit the neighbors of those neighbors (2 hops). Used in "People You May Know."
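A minimal sketch of a 2-hop collection over an adjacency dict; the graph data here is hypothetical:

```python
# Hypothetical adjacency list: user_id -> direct connections
graph = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2], 5: [3]}

def two_hop(user_id: int, graph: dict) -> set:
    """Friends-of-friends who are not the user or a direct connection."""
    direct = set(graph.get(user_id, []))
    second = set()
    for friend in direct:            # 1 hop
        second.update(graph.get(friend, []))  # 2 hops
    return second - direct - {user_id}

print(two_hop(1, graph))  # {4, 5}
```

Subtracting `direct` and the user themselves is what turns raw 2-hop nodes into "people you may know" candidates.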

Q6: How would you handle a very large JSON file that doesn't fit in memory?

Use the ijson library for streaming JSON parsing, or split the file into chunks. Process records one at a time using generators.

Q7: What's the time complexity of the people_you_may_know function?

O(U × C²) where U = number of users and C = average connections per user. For dense graphs this is expensive; use BFS with visited sets for optimization.

Advanced

Q8: How would you scale this to 1 million users?

Options:
1. Pre-compute recommendations offline (batch job), cache in Redis
2. Use sparse matrices (scipy.sparse) for collaborative filtering
3. Approximate nearest neighbors (Faiss, Annoy) for skill matching
4. Graph databases (Neo4j) for the connections/mutual friend queries
5. Distributed processing with Spark for the batch computation

Q9: What's the cold start problem and how do you handle it?

New users have no history — no connections, no likes. Solutions: ask for skills/interests onboarding, recommend trending projects, or use demographic data (city, role).

Q10: Explain why we weight recommendations by overlap count in projects_you_might_like.

If User B has 5 projects in common with me and User C has only 1, User B's additional likes are more trustworthy signals. The overlap acts as a confidence weight for that user's influence on my recommendations.
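A minimal sketch of that weighting, with hypothetical like data; note how User B's extra like earns a higher score than User C's:

```python
# Hypothetical likes data: user_id -> set of liked project IDs
likes = {
    "me": {201, 202, 203},
    "B":  {201, 202, 203, 301},  # 3 projects in common with me
    "C":  {203, 302},            # 1 project in common
}

def weighted_suggestions(me: str, likes: dict) -> dict:
    """Score each unseen project by the overlap of each user who liked it."""
    scores = {}
    for other, their_likes in likes.items():
        if other == me:
            continue
        overlap = len(likes[me] & their_likes)   # confidence weight
        for project in their_likes - likes[me]:  # projects I haven't liked yet
            scores[project] = scores.get(project, 0) + overlap
    return scores

print(weighted_suggestions("me", likes))  # {301: 3, 302: 1}
```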

Scenario-Based

Q11: A user complains they keep seeing the same recommendations. What would you do?

Add diversity: after computing top-N, filter out any already shown recommendations. Add a "seen" list per user. Introduce randomness (epsilon-greedy). Decay scores over time.
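Two of those ideas, the seen-filter and epsilon-greedy randomness, can be sketched together; the function name and the 0.1 exploration rate are illustrative choices:

```python
import random

def diversify(ranked: list, seen: set, epsilon: float = 0.1) -> list:
    """Drop already-shown items, then occasionally promote a random pick."""
    fresh = [item for item in ranked if item not in seen]
    if fresh and random.random() < epsilon:
        # Epsilon-greedy: with probability epsilon, surface a random item first
        pick = random.choice(fresh)
        fresh.remove(pick)
        fresh.insert(0, pick)
    return fresh

print(diversify([101, 102, 103, 104], seen={102}))
```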

Q12: How would you A/B test whether the mutual-friends recommendation actually leads to more connections?

Split users 50/50 — Group A gets mutual-friend recommendations, Group B gets random suggestions. Track: connection acceptance rate, session length, DAU. Run for 2+ weeks for statistical significance.
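The significance check at the end could be a two-proportion z-test, which pure Python handles via `math.erf`. This is a standard normal-approximation sketch with made-up acceptance counts, not data from the article:

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-proportion z-test (normal approximation); returns (z, two-sided p)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical results: 120/1000 accepts in group A vs 90/1000 in group B
z, p = two_proportion_z(120, 1000, 90, 1000)
print(round(z, 2), round(p, 4))
```

A p-value below 0.05 would suggest the mutual-friend recommendations genuinely lift the acceptance rate.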


13. Conclusion

What You Learned

| Concept | Takeaway |
|---|---|
| Pure Python data handling | json, set, dict, and list comprehensions are extremely powerful |
| Data cleaning | Real data is always messy; clean it before analyzing |
| Graph thinking | Social networks are graphs; connections are edges |
| Collaborative filtering | Shared behavior → inferred common interest |
| Content-based filtering | Shared attributes → similar users |
| Recommendation scoring | Count/weight signals to rank suggestions |
| Production patterns | API-ready output, validation, error handling |

When to Use This Approach

Use pure Python when you're learning algorithms, building lightweight services, or working in constrained environments.

Graduate to pandas/NumPy when your dataset grows beyond 50,000 records.

Graduate to ML libraries (scikit-learn, LightFM, Surprise) when you need matrix factorization, latent factors, or advanced CF.

Final Advice

The best data scientists aren't the ones who know the most libraries. They're the ones who understand the fundamentals deeply enough to implement any algorithm from scratch when needed.

Once you can build a recommendation engine in pure Python, every pandas function, every sklearn model, every Spark job becomes something you understand — not just something you copy from Stack Overflow.

Build from scratch. Break things. Fix them. That's how you grow. 🚀


Have questions or want to extend this with a Flask API wrapper? Drop them in the comments below!
