
Pure Python Data Science — Building a Social Network from Scratch

Learn how to load, clean, and analyze social network data using only pure Python. Build People You May Know & Projects You Might Like features from scratch.

1. Introduction

What Are We Building?

Imagine you just joined a startup called DevConnect — a social platform exclusively for software developers. Think of it like LinkedIn but only for coders, built in Bangalore.

Your first assignment: analyze user data and build smart recommendation features — but here's the twist — you can only use pure Python. No pandas. No NumPy. No scikit-learn. Just Python's standard library.

Sounds scary? It shouldn't. This is actually how recommendation systems worked in the early days of social networks. And mastering this makes you a far better data scientist because you truly understand what pandas and NumPy are doing under the hood.

Why Does This Matter?

  • Interviews: FAANG and startups both love asking pure Python data questions

  • Understanding: When you can build it from scratch, you own the concept

  • Lightweight systems: Not every production system needs heavy ML libraries

  • Embedded systems / APIs: Sometimes you need lean, dependency-free code

What You'll Build by the End

| Feature | Description |
|---|---|
| Data Loader | Load and parse JSON user data |
| Data Cleaner | Handle missing values, duplicates, inactive users |
| People You May Know | Mutual friend-based recommendations |
| Developers Like You | Skill/interest-based user matching |
| Projects You Might Like | Collaborative filtering on project interactions |


2. Basic Concepts

2.1 The Data Model

Before we write a single line of code, understand the data we're working with. DevConnect has two entities — users and projects:

JSON
{
  "users": [
    {
      "id": 1,
      "name": "Arjun",
      "city": "Bangalore",
      "skills": ["Python", "Django"],
      "connections": [2, 3],
      "liked_projects": [201, 202]
    },
    {
      "id": 2,
      "name": "Sneha",
      "city": "Mumbai",
      "skills": ["JavaScript", "React"],
      "connections": [1, 4],
      "liked_projects": [202, 203]
    },
    {
      "id": 3,
      "name": "Vikram",
      "city": "Hyderabad",
      "skills": ["Python", "ML"],
      "connections": [1],
      "liked_projects": [201, 204]
    },
    {
      "id": 4,
      "name": "Nisha",
      "city": "Delhi",
      "skills": ["Java", "Spring Boot"],
      "connections": [2],
      "liked_projects": [205]
    }
  ],
  "projects": [
    {"id": 201, "name": "Open Source ML Toolkit"},
    {"id": 202, "name": "React Component Library"},
    {"id": 203, "name": "Node.js REST Boilerplate"},
    {"id": 204, "name": "Python Automation Scripts"},
    {"id": 205, "name": "Java Microservices Template"}
  ]
}

2.2 Key Python Concepts You'll Use

Before diving in, here are the pure Python tools we'll rely on:

Python
# 1. json module — read/write JSON files
import json

# 2. set() — for finding intersections (mutual friends, common pages)
mutual = set([1, 2, 3]).intersection(set([2, 3, 4]))  # {2, 3}

# 3. list comprehension — filter and transform lists cleanly
clean_users = [u for u in users if u["name"].strip()]

# 4. dict.get() — safe dictionary access with default
count = suggestions.get(user_id, 0) + 1

# 5. sorted() with lambda — rank recommendations
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

# 6. collections.defaultdict — auto-initialize dict values
from collections import defaultdict
graph = defaultdict(set)

2.3 Understanding the Graph Structure

Connections in social networks are essentially graphs. When Arjun connects with Sneha, that's an edge between two nodes.

Text
Arjun (1) ——— Sneha (2)
  |               |
Vikram (3)      Nisha (4)

In pure Python, we represent this as a dictionary of sets:

Python
connection_graph = {
    1: {2, 3},   # Arjun's connections
    2: {1, 4},   # Sneha's connections
    3: {1},      # Vikram's connections
    4: {2}       # Nisha's connections
}
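Section 2.2 imported collections.defaultdict but we haven't put it to work yet. Here is a sketch of how you might build that adjacency map from the users list (the function name build_graph and the symmetric-edge step are illustrative additions — the sample data already lists both directions of every connection):

```python
from collections import defaultdict

def build_graph(users):
    """Build {user_id: set_of_connected_ids} from a list of user dicts."""
    graph = defaultdict(set)
    for user in users:
        for friend_id in user.get("connections", []):
            graph[user["id"]].add(friend_id)
            graph[friend_id].add(user["id"])  # keep the edge symmetric
    return graph

# The four users from the sample data above (connections only)
users = [
    {"id": 1, "connections": [2, 3]},
    {"id": 2, "connections": [1, 4]},
    {"id": 3, "connections": [1]},
    {"id": 4, "connections": [2]},
]
graph = build_graph(users)  # same mapping as connection_graph above
```

Because defaultdict(set) auto-creates an empty set on first access, we never need to check whether a key exists before calling .add().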

3. All Functions and Features

Let's build a complete utility module. Each function does one job cleanly.

3.1 Loading Data

Python
import json

def load_data(filepath: str) -> dict:
    """
    Load JSON data from a file and return as a Python dictionary.
    
    Args:
        filepath: Path to the JSON file
    
    Returns:
        dict: Parsed data with 'users' and 'projects' keys
    
    Note:
        Returns an empty dict (and prints an error) if the file is
        missing or the JSON is malformed.
    """
    try:
        with open(filepath, "r", encoding="utf-8") as file:
            data = json.load(file)
        print(f"✅ Loaded {len(data.get('users', []))} users and {len(data.get('projects', []))} projects.")
        return data
    except FileNotFoundError:
        print(f"❌ Error: File '{filepath}' not found.")
        return {}
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing error: {e}")
        return {}

3.2 Displaying Data

Python
def display_users(data: dict) -> None:
    """Print all users and their connections in a readable format."""
    print("\n" + "="*50)
    print("       DEVCONNECT USERS")
    print("="*50)
    
    for user in data.get("users", []):
        print(f"\n👤 {user['name']} (ID: {user['id']}) | 📍 {user.get('city', 'Unknown')}")
        print(f"   🔧 Skills     : {', '.join(user.get('skills', []))}")
        print(f"   🤝 Connections: {user.get('connections', [])}")
        print(f"   ❤️  Liked      : {user.get('liked_projects', [])}")

def display_projects(data: dict) -> None:
    """Print all available projects."""
    print("\n" + "="*50)
    print("       AVAILABLE PROJECTS")
    print("="*50)
    for project in data.get("projects", []):
        print(f"  [{project['id']}] {project['name']}")

3.3 Saving Data

Python
def save_data(data: dict, filepath: str) -> None:
    """
    Save processed data back to a JSON file.
    
    Args:
        data: Dictionary to save
        filepath: Output file path
    """
    with open(filepath, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)
    print(f"✅ Data saved to '{filepath}'")

4. Intermediate Usage

4.1 Data Cleaning — The Real World is Messy

Here's a realistic messy dataset you'd actually encounter:

JSON
{
  "users": [
    {"id": 1, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]},
    {"id": 2, "name": "", "city": "Mumbai", "skills": [], "connections": [1], "liked_projects": [202]},
    {"id": 3, "name": "Vikram", "city": "  ", "skills": ["ML", "ML"], "connections": [1, 1], "liked_projects": []},
    {"id": 4, "name": "Nisha", "city": "Delhi", "skills": ["Java"], "connections": [], "liked_projects": []},
    {"id": 5, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]}
  ],
  "projects": [
    {"id": 201, "name": "ML Toolkit"},
    {"id": 202, "name": "React Library"},
    {"id": 201, "name": "ML Toolkit Pro"},
    {"id": 203, "name": ""}
  ]
}

Problems identified:

| Issue | Who | Problem |
|---|---|---|
| Empty name | User ID 2 | name: "" |
| Whitespace city | User ID 3 | city: "  " |
| Duplicate skills | User ID 3 | ["ML", "ML"] |
| Duplicate connections | User ID 3 | [1, 1] |
| Inactive user | User ID 4 | No connections, no likes |
| Duplicate user | User ID 5 | Same as User 1 |
| Duplicate project ID | Project 201 | Two entries with same ID |
| Empty project name | Project 203 | name: "" |

4.2 Complete Data Cleaning Function

Python
def clean_data(data: dict) -> dict:
    """
    Clean the raw DevConnect data by:
    - Removing users with empty names
    - Normalizing city names (strip whitespace)
    - Removing duplicate skills and connections
    - Removing inactive users (no connections AND no likes)
    - Removing duplicate users (same name + city combo)
    - Deduplicating projects by ID
    - Removing projects with empty names
    
    Args:
        data: Raw data dictionary
    
    Returns:
        dict: Cleaned data dictionary
    """
    users = data.get("users", [])
    projects = data.get("projects", [])
    
    # --- Step 1: Remove users with empty or whitespace-only names ---
    users = [u for u in users if u.get("name", "").strip()]
    print(f"After removing empty names: {len(users)} users")
    
    # --- Step 2: Normalize city field ---
    for user in users:
        user["city"] = user.get("city", "").strip() or "Unknown"
    
    # --- Step 3: Remove duplicate skills ---
    for user in users:
        user["skills"] = list(set(user.get("skills", [])))
    
    # --- Step 4: Remove duplicate connections ---
    for user in users:
        user["connections"] = list(set(user.get("connections", [])))
    
    # --- Step 5: Remove inactive users ---
    users = [
        u for u in users
        if u.get("connections") or u.get("liked_projects")
    ]
    print(f"After removing inactive users: {len(users)} users")
    
    # --- Step 6: Remove duplicate users (same name + city) ---
    seen_identities = set()
    unique_users = []
    for user in users:
        identity = (user["name"].lower(), user["city"].lower())
        if identity not in seen_identities:
            seen_identities.add(identity)
            unique_users.append(user)
    users = unique_users
    print(f"After removing duplicate users: {len(users)} users")
    
    # --- Step 7: Deduplicate projects by ID, keep last occurrence ---
    unique_projects = {}
    for project in projects:
        if project.get("name", "").strip():  # Also remove empty project names
            unique_projects[project["id"]] = project
    projects = list(unique_projects.values())
    print(f"After deduplicating projects: {len(projects)} projects")
    
    data["users"] = users
    data["projects"] = projects
    return data

Running it:

Python
raw_data = load_data("devconnect_raw.json")
cleaned_data = clean_data(raw_data)
save_data(cleaned_data, "devconnect_clean.json")

Expected Output:

Text
✅ Loaded 5 users and 4 projects.
After removing empty names: 4 users
After removing inactive users: 3 users
After removing duplicate users: 2 users
After deduplicating projects: 2 projects
✅ Data saved to 'devconnect_clean.json'

5. Advanced Concepts

5.1 People You May Know — Mutual Connection Algorithm

This is a classic 2-hop graph traversal problem. The idea:

"Walk through my friends. For each of their friends — if I don't already know them, they're a candidate. Count how many of my friends know them = mutual count. Higher mutual count = stronger recommendation."

Python
def people_you_may_know(user_id: int, data: dict, top_n: int = 5) -> list:
    """
    Suggest users that a given user might want to connect with,
    based on mutual connections.
    
    Algorithm:
    1. Get direct connections of the target user
    2. For each direct connection, get THEIR connections
    3. Filter out: the user themselves + already-connected users
    4. Count mutual connections (score)
    5. Sort by score descending
    
    Args:
        user_id : The ID of the user we're recommending for
        data    : Cleaned data dictionary
        top_n   : How many recommendations to return
    
    Returns:
        list of (user_id, mutual_count) tuples, sorted by mutual_count desc
    """
    # Build a connection lookup: {user_id: set_of_connections}
    connection_map = {}
    for user in data["users"]:
        connection_map[user["id"]] = set(user.get("connections", []))
    
    # Validate user exists
    if user_id not in connection_map:
        print(f"❌ User ID {user_id} not found.")
        return []
    
    my_connections = connection_map[user_id]
    candidate_scores = {}
    
    # Walk 2 hops
    for friend_id in my_connections:
        friends_of_friend = connection_map.get(friend_id, set())
        
        for candidate_id in friends_of_friend:
            # Skip: myself, and people I already know
            if candidate_id == user_id or candidate_id in my_connections:
                continue
            
            # Increment mutual friend count for this candidate
            candidate_scores[candidate_id] = candidate_scores.get(candidate_id, 0) + 1
    
    # Sort by mutual count (descending), return top N
    ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

Usage and Output:

Python
data = load_data("devconnect_clean.json")  # assumes the four-user dataset from Section 2.1

# For Arjun (ID: 1) who knows Sneha (2) and Vikram (3)
results = people_you_may_know(1, data)

print("\n🤝 People You May Know:")
user_lookup = {u["id"]: u["name"] for u in data["users"]}

for candidate_id, mutual_count in results:
    name = user_lookup.get(candidate_id, "Unknown")
    print(f"  👤 {name} (ID: {candidate_id}) — {mutual_count} mutual connection(s)")
Output:

Text
🤝 People You May Know:
  👤 Nisha (ID: 4) — 1 mutual connection(s)

Why Nisha? Because Arjun → Sneha → Nisha. Sneha is Arjun's friend, and Nisha is Sneha's friend. So Nisha is 2 hops away from Arjun with 1 mutual.

5.2 Developers Like You — Skill-Based Matching

Beyond mutual connections, we can match based on shared skills. This is a form of content-based filtering.

Python
def developers_like_you(user_id: int, data: dict, top_n: int = 3) -> list:
    """
    Recommend developers with similar skill sets.
    
    Score = number of skills in common.
    
    Args:
        user_id : Target user ID
        data    : Cleaned data dictionary
        top_n   : Number of recommendations
    
    Returns:
        list of (user_id, shared_skill_count, shared_skills) tuples
    """
    skill_map = {}
    for user in data["users"]:
        skill_map[user["id"]] = set(user.get("skills", []))
    
    if user_id not in skill_map:
        return []
    
    my_skills = skill_map[user_id]
    
    if not my_skills:
        print(f"⚠️ User {user_id} has no skills listed. Cannot find matches.")
        return []
    
    scores = []
    for other_id, other_skills in skill_map.items():
        if other_id == user_id:
            continue
        
        shared = my_skills.intersection(other_skills)
        if shared:
            scores.append((other_id, len(shared), list(shared)))
    
    # Sort by number of shared skills, descending
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_n]

Usage:

Python
matches = developers_like_you(1, data)

print("\n💡 Developers Like You:")
for dev_id, score, skills in matches:
    name = user_lookup.get(dev_id, "Unknown")
    print(f"  🧑‍💻 {name} — {score} shared skill(s): {skills}")
Output:

Text
💡 Developers Like You:
  🧑‍💻 Vikram — 1 shared skill(s): ['Python']

5.3 Projects You Might Like — Collaborative Filtering

This is the collaborative filtering algorithm — the same core idea behind Netflix, Spotify, and Amazon recommendations.

Python
def projects_you_might_like(user_id: int, data: dict, top_n: int = 5) -> list:
    """
    Recommend projects based on collaborative filtering.
    
    Logic:
    - Find other users who liked some of the same projects as you (similar users)
    - Look at what ELSE those users liked that you haven't seen yet
    - Score each unseen project by: SUM of (overlap_count for each similar user who liked it)
    - Higher score = stronger recommendation
    
    Args:
        user_id : Target user ID
        data    : Cleaned data dictionary
        top_n   : Number of project recommendations to return
    
    Returns:
        list of (project_id, score) tuples sorted by score descending
    """
    # Build: {user_id: set_of_liked_project_ids}
    user_project_map = {}
    for user in data["users"]:
        user_project_map[user["id"]] = set(user.get("liked_projects", []))
    
    if user_id not in user_project_map:
        return []
    
    my_projects = user_project_map[user_id]
    project_scores = {}
    
    for other_id, their_projects in user_project_map.items():
        if other_id == user_id:
            continue
        
        # How similar is this other user to me?
        overlap = my_projects.intersection(their_projects)
        similarity_score = len(overlap)
        
        if similarity_score == 0:
            continue  # No common ground, skip this user entirely
        
        # What did they like that I haven't seen?
        unseen = their_projects - my_projects
        
        for project_id in unseen:
            # Weight recommendation by similarity score
            project_scores[project_id] = project_scores.get(project_id, 0) + similarity_score
    
    ranked = sorted(project_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

Usage:

Python
project_lookup = {p["id"]: p["name"] for p in data["projects"]}
recommendations = projects_you_might_like(1, data)

print("\n📂 Projects You Might Like:")
for proj_id, score in recommendations:
    name = project_lookup.get(proj_id, "Unknown Project")
    print(f"  ⭐ {name} (ID: {proj_id}) — Recommendation score: {score}")
Output:

Text
📂 Projects You Might Like:
  ⭐ Node.js REST Boilerplate (ID: 203) — Recommendation score: 1
  ⭐ Python Automation Scripts (ID: 204) — Recommendation score: 1

6. Real-World Use Cases

6.1 Where This Pattern Shows Up in Production

| Platform | Feature | Algorithm Used |
|---|---|---|
| LinkedIn | "People You May Know" | Mutual connections (2-hop BFS) |
| Twitter/X | "Who To Follow" | Mutual follows + interest overlap |
| GitHub | "Repositories You Might Like" | Collaborative filtering on stars |
| Netflix | "Because You Watched..." | Item-based collaborative filtering |
| Spotify | "Discover Weekly" | Matrix factorization (advanced CF) |
| Amazon | "Customers Also Bought" | Item-to-item collaborative filtering |

6.2 The Pipeline in Production

Text
[Raw Data Source] → [ETL / Cleaning] → [Feature Extraction] → [Recommendation Engine] → [API] → [Frontend]
     ↑                                                                                        ↓
[User Actions] ←————————————————————————————————————————————————————————————————————————————

In our mini project:

Text
devconnect_raw.json → clean_data() → people_you_may_know() / projects_you_might_like() → JSON output → REST API

6.3 Wrapping It All as an API Response

Python
import json

def get_full_recommendations(user_id: int, data: dict) -> dict:
    """
    Returns a complete recommendation payload for a given user.
    Suitable for serving via a REST API endpoint.
    """
    user_lookup = {u["id"]: u["name"] for u in data["users"]}
    project_lookup = {p["id"]: p["name"] for p in data["projects"]}
    
    # People recommendations
    pymk_raw = people_you_may_know(user_id, data)
    pymk = [
        {"user_id": uid, "name": user_lookup.get(uid, "?"), "mutual_connections": score}
        for uid, score in pymk_raw
    ]
    
    # Project recommendations
    pyml_raw = projects_you_might_like(user_id, data)
    pyml = [
        {"project_id": pid, "name": project_lookup.get(pid, "?"), "score": score}
        for pid, score in pyml_raw
    ]
    
    # Skill-based developer matches
    dly_raw = developers_like_you(user_id, data)
    dly = [
        {"user_id": uid, "name": user_lookup.get(uid, "?"), "shared_skills": skills, "score": score}
        for uid, score, skills in dly_raw
    ]
    
    return {
        "user_id": user_id,
        "people_you_may_know": pymk,
        "projects_you_might_like": pyml,
        "developers_like_you": dly
    }


# Usage
data = load_data("devconnect_clean.json")  # assumes the four-user dataset from Section 2.1
response = get_full_recommendations(1, data)
print(json.dumps(response, indent=2))

Output:

JSON
{
  "user_id": 1,
  "people_you_may_know": [
    {"user_id": 4, "name": "Nisha", "mutual_connections": 1}
  ],
  "projects_you_might_like": [
    {"project_id": 203, "name": "Node.js REST Boilerplate", "score": 1},
    {"project_id": 204, "name": "Python Automation Scripts", "score": 1}
  ],
  "developers_like_you": [
    {"user_id": 3, "name": "Vikram", "shared_skills": ["Python"], "score": 1}
  ]
}

This JSON is exactly what a Flask or FastAPI backend would return to a React/Vue frontend. Clean, structured, production-ready.


7. Practical Examples

7.1 Full Working Script — End to End

Python
# devconnect_engine.py
# Pure Python recommendation engine — no external libraries required

import json
from collections import defaultdict


# ─────────────────────────────────────────
#  STEP 1: I/O UTILITIES
# ─────────────────────────────────────────

def load_data(filepath):
    # Open the JSON file safely with UTF-8 encoding
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            return json.load(f)  # Parse JSON → Python dict
    except (FileNotFoundError, json.JSONDecodeError) as e:
        print(f"Error loading data: {e}")
        return {"users": [], "projects": []}

def save_data(data, filepath):
    # Write Python dict back to JSON with pretty formatting
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# ─────────────────────────────────────────
#  STEP 2: DATA CLEANING
# ─────────────────────────────────────────

def clean_data(data):
    users = data.get("users", [])
    projects = data.get("projects", [])
    
    # Filter out users with blank names
    users = [u for u in users if u.get("name", "").strip()]
    
    # Normalize each user's data
    seen = set()
    clean_users = []
    for u in users:
        # Deduplicate connections and skills using set()
        u["connections"] = list(set(u.get("connections", [])))
        u["skills"] = list(set(u.get("skills", [])))
        u["city"] = u.get("city", "").strip() or "Unknown"
        
        # Skip inactive users (nothing to analyze)
        if not u["connections"] and not u.get("liked_projects"):
            continue
        
        # Skip duplicate users (same name + city = same person)
        fingerprint = (u["name"].lower(), u["city"].lower())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        clean_users.append(u)
    
    # Deduplicate projects: last write wins for same ID
    project_map = {}
    for p in projects:
        if p.get("name", "").strip():  # Ignore projects with empty names
            project_map[p["id"]] = p
    
    data["users"] = clean_users
    data["projects"] = list(project_map.values())
    return data


# ─────────────────────────────────────────
#  STEP 3: PEOPLE YOU MAY KNOW
# ─────────────────────────────────────────

def people_you_may_know(user_id, data, top_n=5):
    # Build adjacency: {user_id → set of connected user IDs}
    conn_map = {u["id"]: set(u.get("connections", [])) for u in data["users"]}
    
    if user_id not in conn_map:
        return []
    
    my_conns = conn_map[user_id]   # My direct connections
    scores = {}                     # candidate_id → mutual count
    
    for friend in my_conns:        # For each person I know
        for candidate in conn_map.get(friend, set()):  # Their connections
            if candidate == user_id or candidate in my_conns:
                continue            # Skip: myself + people I already know
            scores[candidate] = scores.get(candidate, 0) + 1  # Count mutual
    
    # Return top N sorted by mutual count
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  STEP 4: PROJECTS YOU MIGHT LIKE
# ─────────────────────────────────────────

def projects_you_might_like(user_id, data, top_n=5):
    # Build: {user_id → set of liked project IDs}
    proj_map = {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]}
    
    if user_id not in proj_map:
        return []
    
    my_likes = proj_map[user_id]   # Projects I already like
    scores = {}                     # project_id → recommendation score
    
    for other_id, their_likes in proj_map.items():
        if other_id == user_id:
            continue
        
        # How similar is this person to me?
        overlap = len(my_likes.intersection(their_likes))
        if overlap == 0:
            continue  # Not similar at all, ignore
        
        # Projects they liked that I haven't seen
        for proj in their_likes - my_likes:
            scores[proj] = scores.get(proj, 0) + overlap  # Weight by similarity
    
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  STEP 5: DEVELOPERS LIKE YOU
# ─────────────────────────────────────────

def developers_like_you(user_id, data, top_n=3):
    # Build: {user_id → set of skills}
    skill_map = {u["id"]: set(u.get("skills", [])) for u in data["users"]}
    
    if user_id not in skill_map:
        return []
    
    my_skills = skill_map[user_id]
    results = []
    
    for other_id, their_skills in skill_map.items():
        if other_id == user_id:
            continue
        shared = my_skills.intersection(their_skills)
        if shared:
            results.append((other_id, len(shared), list(shared)))
    
    return sorted(results, key=lambda x: x[1], reverse=True)[:top_n]


# ─────────────────────────────────────────
#  MAIN EXECUTION
# ─────────────────────────────────────────

if __name__ == "__main__":
    # Load and clean data
    raw = load_data("devconnect_raw.json")
    clean = clean_data(raw)
    save_data(clean, "devconnect_clean.json")
    
    # Lookup helpers
    user_names = {u["id"]: u["name"] for u in clean["users"]}
    proj_names = {p["id"]: p["name"] for p in clean["projects"]}
    
    TARGET_USER = 1
    print(f"\n📊 Recommendations for: {user_names.get(TARGET_USER, '?')}")
    print("="*55)
    
    # People You May Know
    print("\n🤝 People You May Know:")
    for uid, mutual in people_you_may_know(TARGET_USER, clean):
        print(f"   → {user_names.get(uid, '?')} ({mutual} mutual)")
    
    # Projects You Might Like
    print("\n📂 Projects You Might Like:")
    for pid, score in projects_you_might_like(TARGET_USER, clean):
        print(f"   → {proj_names.get(pid, '?')} (score: {score})")
    
    # Developers Like You
    print("\n💡 Developers Like You:")
    for uid, score, skills in developers_like_you(TARGET_USER, clean):
        print(f"   → {user_names.get(uid, '?')} | Common Skills: {skills}")

8. Edge Cases and Errors

8.1 Common Issues and Fixes

Python
# ❌ MISTAKE 1: Modifying a list while iterating over it
for user in data["users"]:
    if not user["name"]:
        data["users"].remove(user)  # BUG! Skips elements

# ✅ FIX: Use list comprehension instead
data["users"] = [u for u in data["users"] if u["name"].strip()]


# ❌ MISTAKE 2: set() doesn't preserve order (can affect rankings)
user["connections"] = list(set(user["connections"]))
# This is fine for sets of IDs, but if ORDER matters, use:
seen_ids = set()
ordered_unique = []
for cid in user["connections"]:
    if cid not in seen_ids:
        seen_ids.add(cid)
        ordered_unique.append(cid)
user["connections"] = ordered_unique


# ❌ MISTAKE 3: KeyError when user ID is missing
my_conns = conn_map[user_id]  # Crashes if user_id not in map

# ✅ FIX: Always validate first
if user_id not in conn_map:
    return []
my_conns = conn_map[user_id]


# ❌ MISTAKE 4: Division by zero in scoring
similarity = overlap / total_items  # Crashes if total_items == 0

# ✅ FIX: Guard against zero
similarity = overlap / total_items if total_items > 0 else 0


# ❌ MISTAKE 5: Assuming JSON values are always lists
connections = user["connections"]  # What if it's None or missing?

# ✅ FIX: Use .get() with a default
connections = user.get("connections") or []

8.2 Debugging Checklist

Python
def validate_data(data):
    """Run sanity checks on cleaned data before running algorithms."""
    errors = []
    
    user_ids = {u["id"] for u in data["users"]}
    project_ids = {p["id"] for p in data["projects"]}  # build once, not per user
    
    for user in data["users"]:
        # Check: all connection IDs actually exist in the dataset
        for cid in user.get("connections", []):
            if cid not in user_ids:
                errors.append(f"User {user['id']} references unknown connection: {cid}")
        
        # Check: all liked project IDs actually exist
        for pid in user.get("liked_projects", []):
            if pid not in project_ids:
                errors.append(f"User {user['id']} references unknown project: {pid}")
    
    if errors:
        print("⚠️ Data validation failed:")
        for e in errors:
            print(f"  - {e}")
    else:
        print("✅ Data validation passed.")
    
    return len(errors) == 0

9. Pro Developer Insights

9.1 Performance Tips

Python
# ✅ TIP 1: Build lookup dictionaries ONCE, use everywhere
# DON'T do this inside every function call (rebuilds every time):
for user in data["users"]:
    if user["id"] == target_id:
        ...

# DO this: O(1) lookup instead of O(n) search
user_map = {u["id"]: u for u in data["users"]}  # Build once
target = user_map.get(target_id)                  # O(1) lookup


# ✅ TIP 2: Use sets for membership checks — O(1) vs O(n)
# Slow: "if candidate in [1, 2, 3, 4, 5]"  → O(n) list scan
# Fast: "if candidate in {1, 2, 3, 4, 5}"  → O(1) hash lookup


# ✅ TIP 3: Use generators for large datasets (memory efficiency)
def active_users(data):
    """Yield users one at a time instead of loading all into memory."""
    for user in data["users"]:
        if user.get("connections") or user.get("liked_projects"):
            yield user

# Process one by one without loading entire list
for user in active_users(data):
    process(user)


# ✅ TIP 4: Cache frequent computations
def build_indexes(data):
    """Pre-compute all lookup structures once."""
    return {
        "user_by_id": {u["id"]: u for u in data["users"]},
        "project_by_id": {p["id"]: p for p in data["projects"]},
        "connections": {u["id"]: set(u.get("connections", [])) for u in data["users"]},
        "liked_projects": {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]},
        "skills": {u["id"]: set(u.get("skills", [])) for u in data["users"]},
    }

9.2 Best Practices

  • Single Responsibility: Each function does exactly one thing

  • Defensive coding: Always use .get() with defaults for dict access

  • Type hints: Add them even in pure Python for readability

  • Separation of concerns: Keep data loading, cleaning, and analysis in separate functions/modules

  • Immutability when possible: Don't mutate input data in-place; return new structures
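As a sketch of that last point: our clean_data mutates the dictionary it receives. A hypothetical clean_data_pure (the name and the deepcopy approach are our own — only the empty-name step is shown) copies the input first, so the caller's data is never changed:

```python
import copy

def clean_data_pure(data):
    """Return a cleaned copy of `data`; the caller's dict is left untouched."""
    result = copy.deepcopy(data)  # private copy — no aliasing with the input
    result["users"] = [
        u for u in result.get("users", []) if u.get("name", "").strip()
    ]
    return result

raw = {"users": [{"id": 1, "name": "Arjun"}, {"id": 2, "name": ""}]}
cleaned = clean_data_pure(raw)
assert len(raw["users"]) == 2      # original is untouched
assert len(cleaned["users"]) == 1  # copy was filtered
```

deepcopy costs extra memory, so for very large datasets you might instead build new lists from scratch rather than copy-then-filter.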

9.3 Security in Production

Python
# ✅ Validate input user_id type before processing
def safe_recommend(user_id, data):
    if not isinstance(user_id, int) or user_id <= 0:
        raise ValueError(f"Invalid user_id: {user_id}")
    return people_you_may_know(user_id, data)

# ✅ Limit file sizes before loading in production
import os
MAX_FILE_SIZE_MB = 50

def safe_load(filepath):
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File too large: {size_mb:.1f} MB")
    return load_data(filepath)

10. Comparison

Pure Python vs pandas vs Graph Libraries

| Aspect | Pure Python | pandas | NetworkX (graph lib) |
|---|---|---|---|
| Dependencies | Zero | Heavy | Moderate |
| Setup | Instant | pip install | pip install |
| Speed (small data) | Fast | Slower (overhead) | Moderate |
| Speed (large data) | Slow | Very fast | Fast |
| Memory | Efficient | Higher | High |
| Code readability | Explicit | Concise | Concise |
| Recommendation logic | Manual | Manual + apply() | Built-in graph algos |
| Production use | Small/embedded | ETL pipelines | Social network analysis |
| Best for | Learning, APIs | Data analysis | Graph-heavy features |

When to Use Pure Python

✅ Use when:

  • Learning algorithms from scratch

  • Building lean microservices

  • No external dependencies allowed (security policy)

  • Data is small (< 10,000 users)

  • Embedding logic in another system

❌ Don't use when:

  • Dataset has millions of records (use pandas + NumPy)

  • You need matrix operations (use NumPy)

  • Complex graph algorithms like shortest path (use NetworkX)

  • Time-to-market matters more than learning (use scikit-learn)


11. Data Science Perspective

11.1 What We Implemented (ML Terms)

| What We Built | ML Term | Category |
|---|---|---|
| Mutual friends count | Graph-based similarity | Collaborative Filtering |
| Shared skills matching | Content-based filtering | Content-Based Recommendation |
| Liked projects overlap | User-User Collaborative Filtering | Collaborative Filtering |
| Ranking by score | Scoring & Ranking | Retrieval System |

11.2 The Recommendation System Taxonomy

Text
Recommendation Systems
├── Collaborative Filtering
│   ├── User-User (what we built for projects)
│   └── Item-Item (what Amazon uses for "Customers also bought")
├── Content-Based Filtering
│   └── Skills matching (what we built for developers)
├── Graph-Based
│   └── Mutual connections (what we built for people)
└── Hybrid (combination of all above → Netflix, Spotify)

11.3 Extending to Real ML

Once you understand this pure Python version, the next step is:

Python
# This is what our "collaborative filtering" becomes at scale:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# User-project matrix (our liked_projects as a binary matrix)
# User 1 likes [201, 202] → [1, 1, 0, 0, 0]
# User 2 likes [202, 203] → [0, 1, 1, 0, 0]

user_matrix = np.array([
    [1, 1, 0, 0, 0],  # Arjun
    [0, 1, 1, 0, 0],  # Sneha
    [1, 0, 0, 1, 0],  # Vikram
    [0, 0, 0, 0, 1],  # Nisha
])

similarity = cosine_similarity(user_matrix)
# similarity[0] tells us how similar Arjun is to everyone else

The pure Python .intersection() count IS essentially a simplified version of cosine similarity on binary vectors.
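To see that equivalence concretely: on binary like-vectors, cosine similarity reduces to the intersection size divided by the geometric mean of the two set sizes. A minimal sketch using sets of liked project IDs (the data is hypothetical):

```python
import math

def binary_cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary vectors represented as sets of liked IDs."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

arjun = {201, 202}
sneha = {202, 203}
print(binary_cosine(arjun, sneha))  # 1 / sqrt(2 * 2) = 0.5
```

This matches the 0.5 you would get from `cosine_similarity` on the binary rows `[1, 1, 0, 0, 0]` and `[0, 1, 1, 0, 0]` above.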


12. Interview Questions

Basic

Q1: What is collaborative filtering?

A recommendation technique that suggests items to a user based on the preferences of similar users. "People who liked what you liked also liked..."

Q2: Why use set() for finding mutual connections?

Set intersection runs in O(min(len(A), len(B))) on average, and set membership checks are O(1), making it much faster than nested loops over lists.
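A minimal side-by-side comparison (the connection lists are made up):

```python
# Mutual connections via set intersection vs. a nested-loop scan
a_conns = [2, 3, 5, 7]
b_conns = [3, 7, 9]

# O(len(a) * len(b)): each "x in b_conns" check scans the whole list
mutual_slow = [x for x in a_conns if x in b_conns]

# O(min(len(a), len(b))) on average: hash-based membership checks
mutual_fast = set(a_conns) & set(b_conns)

print(sorted(mutual_fast))  # [3, 7]
```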

Q3: What does dict.get(key, default) do?

Returns the value for key if it exists, otherwise returns default. Prevents KeyError exceptions.

Q4: What is the difference between json.load() and json.loads()?

json.load(file_object) reads from a file. json.loads(string) reads from a string.
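A quick demonstration of both, using an in-memory file object so the example is self-contained:

```python
import io
import json

raw = '{"id": 1, "name": "Arjun"}'

# json.loads: parse a string already in memory
user_from_string = json.loads(raw)

# json.load: parse from any file-like object (here an in-memory one)
user_from_file = json.load(io.StringIO(raw))

assert user_from_string == user_from_file == {"id": 1, "name": "Arjun"}
```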

Intermediate

Q5: What is a 2-hop graph traversal?

Starting from node A, visit A's neighbors (1 hop), then visit the neighbors of those neighbors (2 hops). Used in "People You May Know."
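A minimal sketch of a 2-hop collection over an adjacency dict; the graph data here is hypothetical:

```python
# Hypothetical adjacency list: user_id -> direct connections
graph = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2], 5: [3]}

def two_hop(user_id: int, graph: dict) -> set:
    """Friends-of-friends who are not the user or a direct connection."""
    direct = set(graph.get(user_id, []))
    second = set()
    for friend in direct:            # 1 hop
        second.update(graph.get(friend, []))  # 2 hops
    return second - direct - {user_id}

print(two_hop(1, graph))  # {4, 5}
```

Subtracting `direct` and the user themselves is what turns raw 2-hop nodes into "people you may know" candidates.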

Q6: How would you handle a very large JSON file that doesn't fit in memory?

Use the ijson library for streaming JSON parsing, or split the file into chunks. Process records one at a time using generators.

Q7: What's the time complexity of the people_you_may_know function?

O(U × C²) where U = number of users and C = average connections per user. For dense graphs this is expensive; use BFS with visited sets for optimization.

Advanced

Q8: How would you scale this to 1 million users?

Options:
1. Pre-compute recommendations offline (batch job), cache in Redis
2. Use sparse matrices (scipy.sparse) for collaborative filtering
3. Approximate nearest neighbors (Faiss, Annoy) for skill matching
4. Graph databases (Neo4j) for the connections/mutual friend queries
5. Distributed processing with Spark for the batch computation

Q9: What's the cold start problem and how do you handle it?

New users have no history — no connections, no likes. Solutions: ask for skills/interests onboarding, recommend trending projects, or use demographic data (city, role).

Q10: Explain why we weight recommendations by overlap count in projects_you_might_like.

If User B has 5 projects in common with me and User C has only 1, User B's additional likes are more trustworthy signals. The overlap acts as a confidence weight for that user's influence on my recommendations.
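A minimal sketch of that weighting, with hypothetical like data; note how User B's extra like earns a higher score than User C's:

```python
# Hypothetical likes data: user_id -> set of liked project IDs
likes = {
    "me": {201, 202, 203},
    "B":  {201, 202, 203, 301},  # 3 projects in common with me
    "C":  {203, 302},            # 1 project in common
}

def weighted_suggestions(me: str, likes: dict) -> dict:
    """Score each unseen project by the overlap of each user who liked it."""
    scores = {}
    for other, their_likes in likes.items():
        if other == me:
            continue
        overlap = len(likes[me] & their_likes)   # confidence weight
        for project in their_likes - likes[me]:  # projects I haven't liked yet
            scores[project] = scores.get(project, 0) + overlap
    return scores

print(weighted_suggestions("me", likes))  # {301: 3, 302: 1}
```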

Scenario-Based

Q11: A user complains they keep seeing the same recommendations. What would you do?

Add diversity: after computing top-N, filter out any already shown recommendations. Add a "seen" list per user. Introduce randomness (epsilon-greedy). Decay scores over time.
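Two of those ideas, the seen-filter and epsilon-greedy randomness, can be sketched together; the function name and the 0.1 exploration rate are illustrative choices:

```python
import random

def diversify(ranked: list, seen: set, epsilon: float = 0.1) -> list:
    """Drop already-shown items, then occasionally promote a random pick."""
    fresh = [item for item in ranked if item not in seen]
    if fresh and random.random() < epsilon:
        # Epsilon-greedy: with probability epsilon, surface a random item first
        pick = random.choice(fresh)
        fresh.remove(pick)
        fresh.insert(0, pick)
    return fresh

print(diversify([101, 102, 103, 104], seen={102}))
```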

Q12: How would you A/B test whether the mutual-friends recommendation actually leads to more connections?

Split users 50/50 — Group A gets mutual-friend recommendations, Group B gets random suggestions. Track: connection acceptance rate, session length, DAU. Run for 2+ weeks for statistical significance.
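The significance check at the end could be a two-proportion z-test, which pure Python handles via `math.erf`. This is a standard normal-approximation sketch with made-up acceptance counts, not data from the article:

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-proportion z-test (normal approximation); returns (z, two-sided p)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical results: 120/1000 accepts in group A vs 90/1000 in group B
z, p = two_proportion_z(120, 1000, 90, 1000)
print(round(z, 2), round(p, 4))
```

A p-value below 0.05 would suggest the mutual-friend recommendations genuinely lift the acceptance rate.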


13. Conclusion

What You Learned

| Concept | Takeaway |
|---|---|
| Pure Python data handling | json, set, dict, and list comprehensions are extremely powerful |
| Data cleaning | Real data is always messy; clean it before analyzing |
| Graph thinking | Social networks are graphs; connections are edges |
| Collaborative filtering | Shared behavior → inferred common interest |
| Content-based filtering | Shared attributes → similar users |
| Recommendation scoring | Count/weight signals to rank suggestions |
| Production patterns | API-ready output, validation, error handling |

When to Use This Approach

Use pure Python when you're learning algorithms, building lightweight services, or working in constrained environments.

Graduate to pandas/NumPy when your dataset grows beyond 50,000 records.

Graduate to ML libraries (scikit-learn, LightFM, Surprise) when you need matrix factorization, latent factors, or advanced CF.

Final Advice

The best data scientists aren't the ones who know the most libraries. They're the ones who understand the fundamentals deeply enough to implement any algorithm from scratch when needed.

Once you can build a recommendation engine in pure Python, every pandas function, every sklearn model, every Spark job becomes something you understand — not just something you copy from Stack Overflow.

Build from scratch. Break things. Fix them. That's how you grow. 🚀


Have questions or want to extend this with a Flask API wrapper? Drop them in the comments below!
