1. Introduction
What Are We Building?
Imagine you just joined a startup called DevConnect — a social platform exclusively for software developers. Think of it like LinkedIn but only for coders, built in Bangalore.
Your first assignment: analyze user data and build smart recommendation features — but here's the twist — you can only use pure Python. No pandas. No NumPy. No scikit-learn. Just Python's standard library.
Sounds scary? It shouldn't. This is actually how recommendation systems worked in the early days of social networks. And mastering this makes you a far better data scientist because you truly understand what pandas and NumPy are doing under the hood.
Why Does This Matter?
Interviews: FAANG and startups both love asking pure Python data questions
Understanding: When you can build it from scratch, you own the concept
Lightweight systems: Not every production system needs heavy ML libraries
Embedded systems / APIs: Sometimes you need lean, dependency-free code
What You'll Build by the End
| Feature | Description |
|---|---|
| Data Loader | Load and parse JSON user data |
| Data Cleaner | Handle missing values, duplicates, inactive users |
| People You May Know | Mutual friend-based recommendations |
| Developers Like You | Skill/interest-based user matching |
| Projects You Might Like | Collaborative filtering on project interactions |
2. Basic Concepts
2.1 The Data Model
Before we write a single line of code, understand the data we're working with. DevConnect has three entities: users, projects, and the connections between users (stored as ID lists on each user object):
{
"users": [
{
"id": 1,
"name": "Arjun",
"city": "Bangalore",
"skills": ["Python", "Django"],
"connections": [2, 3],
"liked_projects": [201, 202]
},
{
"id": 2,
"name": "Sneha",
"city": "Mumbai",
"skills": ["JavaScript", "React"],
"connections": [1, 4],
"liked_projects": [202, 203]
},
{
"id": 3,
"name": "Vikram",
"city": "Hyderabad",
"skills": ["Python", "ML"],
"connections": [1],
"liked_projects": [201, 204]
},
{
"id": 4,
"name": "Nisha",
"city": "Delhi",
"skills": ["Java", "Spring Boot"],
"connections": [2],
"liked_projects": [205]
}
],
"projects": [
{"id": 201, "name": "Open Source ML Toolkit"},
{"id": 202, "name": "React Component Library"},
{"id": 203, "name": "Node.js REST Boilerplate"},
{"id": 204, "name": "Python Automation Scripts"},
{"id": 205, "name": "Java Microservices Template"}
]
}
2.2 Key Python Concepts You'll Use
Before diving in, here are the pure Python tools we'll rely on:
# 1. json module — read/write JSON files
import json
# 2. set() — for finding intersections (mutual friends, common pages)
mutual = set([1, 2, 3]).intersection(set([2, 3, 4])) # {2, 3}
# 3. list comprehension — filter and transform lists cleanly
clean_users = [u for u in users if u["name"].strip()]
# 4. dict.get() — safe dictionary access with default
count = suggestions.get(user_id, 0) + 1
# 5. sorted() with lambda — rank recommendations
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
# 6. collections.defaultdict — auto-initialize dict values
from collections import defaultdict
graph = defaultdict(set)
2.3 Understanding the Graph Structure
Connections in social networks are essentially graphs. When Arjun connects with Sneha, that's an edge between two nodes.
Arjun (1) ——— Sneha (2)
    |             |
Vikram (3)    Nisha (4)
In pure Python, we represent this as a dictionary of sets:
connection_graph = {
1: {2, 3}, # Arjun's connections
2: {1, 4}, # Sneha's connections
3: {1}, # Vikram's connections
4: {2} # Nisha's connections
}
3. All Functions and Features
Let's build a complete utility module. Each function does one job cleanly.
3.1 Loading Data
import json
def load_data(filepath: str) -> dict:
"""
Load JSON data from a file and return as a Python dictionary.
Args:
filepath: Path to the JSON file
Returns:
dict: Parsed data with 'users' and 'projects' keys
Raises:
FileNotFoundError: If the file doesn't exist
json.JSONDecodeError: If JSON is malformed
"""
try:
with open(filepath, "r", encoding="utf-8") as file:
data = json.load(file)
print(f"✅ Loaded {len(data['users'])} users and {len(data['projects'])} projects.")
return data
except FileNotFoundError:
print(f"❌ Error: File '{filepath}' not found.")
return {}
except json.JSONDecodeError as e:
print(f"❌ JSON parsing error: {e}")
return {}
3.2 Displaying Data
def display_users(data: dict) -> None:
"""Print all users and their connections in a readable format."""
print("\n" + "="*50)
print(" DEVCONNECT USERS")
print("="*50)
for user in data.get("users", []):
print(f"\n👤 {user['name']} (ID: {user['id']}) | 📍 {user.get('city', 'Unknown')}")
print(f" 🔧 Skills : {', '.join(user.get('skills', []))}")
print(f" 🤝 Connections: {user.get('connections', [])}")
print(f" ❤️ Liked : {user.get('liked_projects', [])}")
def display_projects(data: dict) -> None:
"""Print all available projects."""
print("\n" + "="*50)
print(" AVAILABLE PROJECTS")
print("="*50)
for project in data.get("projects", []):
print(f" [{project['id']}] {project['name']}")
3.3 Saving Data
def save_data(data: dict, filepath: str) -> None:
"""
Save processed data back to a JSON file.
Args:
data: Dictionary to save
filepath: Output file path
"""
with open(filepath, "w", encoding="utf-8") as file:
json.dump(data, file, indent=4, ensure_ascii=False)
print(f"✅ Data saved to '{filepath}'")
4. Intermediate Usage
4.1 Data Cleaning — The Real World is Messy
Here's a realistic messy dataset you'd actually encounter:
{
"users": [
{"id": 1, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]},
{"id": 2, "name": "", "city": "Mumbai", "skills": [], "connections": [1], "liked_projects": [202]},
{"id": 3, "name": "Vikram", "city": " ", "skills": ["ML", "ML"], "connections": [1, 1], "liked_projects": []},
{"id": 4, "name": "Nisha", "city": "Delhi", "skills": ["Java"], "connections": [], "liked_projects": []},
{"id": 5, "name": "Arjun", "city": "Bangalore", "skills": ["Python"], "connections": [2, 3], "liked_projects": [201]}
],
"projects": [
{"id": 201, "name": "ML Toolkit"},
{"id": 202, "name": "React Library"},
{"id": 201, "name": "ML Toolkit Pro"},
{"id": 203, "name": ""}
]
}
Problems identified:
| Issue | Who | Problem |
|---|---|---|
| Empty name | User ID 2 | Name is an empty string |
| Whitespace city | User ID 3 | City is whitespace only |
| Duplicate skills | User ID 3 | "ML" listed twice |
| Duplicate connections | User ID 3 | Connection 1 listed twice |
| Inactive user | User ID 4 | No connections, no likes |
| Duplicate user | User ID 5 | Same as User 1 |
| Duplicate project ID | Project 201 | Two entries with same ID |
| Empty project name | Project 203 | Name is an empty string |
4.2 Complete Data Cleaning Function
def clean_data(data: dict) -> dict:
"""
Clean the raw DevConnect data by:
- Removing users with empty names
- Normalizing city names (strip whitespace)
- Removing duplicate skills and connections
- Removing inactive users (no connections AND no likes)
- Removing duplicate users (same name + city combo)
- Deduplicating projects by ID
- Removing projects with empty names
Args:
data: Raw data dictionary
Returns:
dict: Cleaned data dictionary
"""
users = data.get("users", [])
projects = data.get("projects", [])
# --- Step 1: Remove users with empty or whitespace-only names ---
users = [u for u in users if u.get("name", "").strip()]
print(f"After removing empty names: {len(users)} users")
# --- Step 2: Normalize city field ---
for user in users:
user["city"] = user.get("city", "").strip() or "Unknown"
# --- Step 3: Remove duplicate skills ---
for user in users:
user["skills"] = list(set(user.get("skills", [])))
# --- Step 4: Remove duplicate connections ---
for user in users:
user["connections"] = list(set(user.get("connections", [])))
# --- Step 5: Remove inactive users ---
users = [
u for u in users
if u.get("connections") or u.get("liked_projects")
]
print(f"After removing inactive users: {len(users)} users")
# --- Step 6: Remove duplicate users (same name + city) ---
seen_identities = set()
unique_users = []
for user in users:
identity = (user["name"].lower(), user["city"].lower())
if identity not in seen_identities:
seen_identities.add(identity)
unique_users.append(user)
users = unique_users
print(f"After removing duplicate users: {len(users)} users")
# --- Step 7: Deduplicate projects by ID, keep last occurrence ---
unique_projects = {}
for project in projects:
if project.get("name", "").strip(): # Also remove empty project names
unique_projects[project["id"]] = project
projects = list(unique_projects.values())
print(f"After deduplicating projects: {len(projects)} projects")
data["users"] = users
data["projects"] = projects
return data
Running it:
raw_data = load_data("devconnect_raw.json")
cleaned_data = clean_data(raw_data)
save_data(cleaned_data, "devconnect_clean.json")
Expected Output:
✅ Loaded 5 users and 4 projects.
After removing empty names: 4 users
After removing inactive users: 3 users
After removing duplicate users: 2 users
After deduplicating projects: 2 projects
✅ Data saved to 'devconnect_clean.json'
5. Advanced Concepts
5.1 People You May Know — Mutual Connection Algorithm
This is a classic 2-hop graph traversal problem. The idea:
"Walk through my friends. For each of their friends — if I don't already know them, they're a candidate. Count how many of my friends know them = mutual count. Higher mutual count = stronger recommendation."
def people_you_may_know(user_id: int, data: dict, top_n: int = 5) -> list:
"""
Suggest users that a given user might want to connect with,
based on mutual connections.
Algorithm:
1. Get direct connections of the target user
2. For each direct connection, get THEIR connections
3. Filter out: the user themselves + already-connected users
4. Count mutual connections (score)
5. Sort by score descending
Args:
user_id : The ID of the user we're recommending for
data : Cleaned data dictionary
top_n : How many recommendations to return
Returns:
list of (user_id, mutual_count) tuples, sorted by mutual_count desc
"""
# Build a connection lookup: {user_id: set_of_connections}
connection_map = {}
for user in data["users"]:
connection_map[user["id"]] = set(user.get("connections", []))
# Validate user exists
if user_id not in connection_map:
print(f"❌ User ID {user_id} not found.")
return []
my_connections = connection_map[user_id]
candidate_scores = {}
# Walk 2 hops
for friend_id in my_connections:
friends_of_friend = connection_map.get(friend_id, set())
for candidate_id in friends_of_friend:
# Skip: myself, and people I already know
if candidate_id == user_id or candidate_id in my_connections:
continue
# Increment mutual friend count for this candidate
candidate_scores[candidate_id] = candidate_scores.get(candidate_id, 0) + 1
# Sort by mutual count (descending), return top N
ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
return ranked[:top_n]
Usage and Output:
data = load_data("devconnect_clean.json")
# Assumes the file holds the four-user dataset from section 2.1.
# For Arjun (ID: 1), who knows Sneha (2) and Vikram (3):
results = people_you_may_know(1, data)
print("\n🤝 People You May Know:")
user_lookup = {u["id"]: u["name"] for u in data["users"]}
for candidate_id, mutual_count in results:
name = user_lookup.get(candidate_id, "Unknown")
print(f" 👤 {name} (ID: {candidate_id}) — {mutual_count} mutual connection(s)")
🤝 People You May Know:
👤 Nisha (ID: 4) — 1 mutual connection(s)
Why Nisha? Because Arjun → Sneha → Nisha. Sneha is Arjun's friend, and Nisha is Sneha's friend. So Nisha is 2 hops away from Arjun with 1 mutual.
5.2 Developers Like You — Skill-Based Matching
Beyond mutual connections, we can match based on shared skills. This is a form of content-based filtering.
def developers_like_you(user_id: int, data: dict, top_n: int = 3) -> list:
"""
Recommend developers with similar skill sets.
Score = number of skills in common.
Args:
user_id : Target user ID
data : Cleaned data dictionary
top_n : Number of recommendations
Returns:
list of (user_id, shared_skill_count, shared_skills) tuples
"""
skill_map = {}
for user in data["users"]:
skill_map[user["id"]] = set(user.get("skills", []))
if user_id not in skill_map:
return []
my_skills = skill_map[user_id]
if not my_skills:
print(f"⚠️ User {user_id} has no skills listed. Cannot find matches.")
return []
scores = []
for other_id, other_skills in skill_map.items():
if other_id == user_id:
continue
shared = my_skills.intersection(other_skills)
if shared:
scores.append((other_id, len(shared), list(shared)))
# Sort by number of shared skills, descending
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_n]
Usage:
matches = developers_like_you(1, data)
print("\n💡 Developers Like You:")
for dev_id, score, skills in matches:
name = user_lookup.get(dev_id, "Unknown")
print(f" 🧑💻 {name} — {score} shared skill(s): {skills}")
💡 Developers Like You:
🧑💻 Vikram — 1 shared skill(s): ['Python']
5.3 Projects You Might Like — Collaborative Filtering
This is the collaborative filtering algorithm — the same core idea behind Netflix, Spotify, and Amazon recommendations.
def projects_you_might_like(user_id: int, data: dict, top_n: int = 5) -> list:
"""
Recommend projects based on collaborative filtering.
Logic:
- Find other users who liked some of the same projects as you (similar users)
- Look at what ELSE those users liked that you haven't seen yet
- Score each unseen project by: SUM of (overlap_count for each similar user who liked it)
- Higher score = stronger recommendation
Args:
user_id : Target user ID
data : Cleaned data dictionary
top_n : Number of project recommendations to return
Returns:
list of (project_id, score) tuples sorted by score descending
"""
# Build: {user_id: set_of_liked_project_ids}
user_project_map = {}
for user in data["users"]:
user_project_map[user["id"]] = set(user.get("liked_projects", []))
if user_id not in user_project_map:
return []
my_projects = user_project_map[user_id]
project_scores = {}
for other_id, their_projects in user_project_map.items():
if other_id == user_id:
continue
# How similar is this other user to me?
overlap = my_projects.intersection(their_projects)
similarity_score = len(overlap)
if similarity_score == 0:
continue # No common ground, skip this user entirely
# What did they like that I haven't seen?
unseen = their_projects - my_projects
for project_id in unseen:
# Weight recommendation by similarity score
project_scores[project_id] = project_scores.get(project_id, 0) + similarity_score
ranked = sorted(project_scores.items(), key=lambda x: x[1], reverse=True)
return ranked[:top_n]
Usage:
project_lookup = {p["id"]: p["name"] for p in data["projects"]}
recommendations = projects_you_might_like(1, data)
print("\n📂 Projects You Might Like:")
for proj_id, score in recommendations:
name = project_lookup.get(proj_id, "Unknown Project")
print(f" ⭐ {name} (ID: {proj_id}) — Recommendation score: {score}")
📂 Projects You Might Like:
⭐ Node.js REST Boilerplate (ID: 203) — Recommendation score: 1
⭐ Python Automation Scripts (ID: 204) — Recommendation score: 1
6. Real-World Use Cases
6.1 Where This Pattern Shows Up in Production
| Platform | Feature | Algorithm Used |
|---|---|---|
| LinkedIn | "People You May Know" | Mutual connections (2-hop BFS) |
| Twitter/X | "Who To Follow" | Mutual follows + interest overlap |
| GitHub | "Repositories You Might Like" | Collaborative filtering on stars |
| Netflix | "Because You Watched..." | Item-based collaborative filtering |
| Spotify | "Discover Weekly" | Matrix factorization (advanced CF) |
| Amazon | "Customers Also Bought" | Item-to-item collaborative filtering |
6.2 The Pipeline in Production
[Raw Data Source] → [ETL / Cleaning] → [Feature Extraction] → [Recommendation Engine] → [API] → [Frontend]
        ↑                                                                                          |
        └──────────────────────────────── [User Actions] ←─────────────────────────────────────────┘
In our mini project:
devconnect_raw.json → clean_data() → people_you_may_know() / projects_you_might_like() → JSON output → REST API
6.3 Wrapping It All as an API Response
import json
def get_full_recommendations(user_id: int, data: dict) -> dict:
"""
Returns a complete recommendation payload for a given user.
Suitable for serving via a REST API endpoint.
"""
user_lookup = {u["id"]: u["name"] for u in data["users"]}
project_lookup = {p["id"]: p["name"] for p in data["projects"]}
# People recommendations
pymk_raw = people_you_may_know(user_id, data)
pymk = [
{"user_id": uid, "name": user_lookup.get(uid, "?"), "mutual_connections": score}
for uid, score in pymk_raw
]
# Project recommendations
pyml_raw = projects_you_might_like(user_id, data)
pyml = [
{"project_id": pid, "name": project_lookup.get(pid, "?"), "score": score}
for pid, score in pyml_raw
]
# Skill-based developer matches
dly_raw = developers_like_you(user_id, data)
dly = [
{"user_id": uid, "name": user_lookup.get(uid, "?"), "shared_skills": skills, "score": score}
for uid, score, skills in dly_raw
]
return {
"user_id": user_id,
"people_you_may_know": pymk,
"projects_you_might_like": pyml,
"developers_like_you": dly
}
# Usage
data = load_data("devconnect_clean.json")
response = get_full_recommendations(1, data)
print(json.dumps(response, indent=2))
Output:
{
"user_id": 1,
"people_you_may_know": [
{"user_id": 4, "name": "Nisha", "mutual_connections": 1}
],
"projects_you_might_like": [
{"project_id": 203, "name": "Node.js REST Boilerplate", "score": 1},
{"project_id": 204, "name": "Python Automation Scripts", "score": 1}
],
"developers_like_you": [
{"user_id": 3, "name": "Vikram", "shared_skills": ["Python"], "score": 1}
]
}
This JSON is exactly what a Flask or FastAPI backend would return to a React/Vue frontend. Clean, structured, production-ready.
7. Practical Examples
7.1 Full Working Script — End to End
# devconnect_engine.py
# Pure Python recommendation engine — no external libraries required
import json
from collections import defaultdict
# ─────────────────────────────────────────
# STEP 1: I/O UTILITIES
# ─────────────────────────────────────────
def load_data(filepath):
# Open the JSON file safely with UTF-8 encoding
try:
with open(filepath, "r", encoding="utf-8") as f:
return json.load(f) # Parse JSON → Python dict
except (FileNotFoundError, json.JSONDecodeError) as e:
print(f"Error loading data: {e}")
return {"users": [], "projects": []}
def save_data(data, filepath):
# Write Python dict back to JSON with pretty formatting
with open(filepath, "w", encoding="utf-8") as f:
json.dump(data, f, indent=4, ensure_ascii=False)
# ─────────────────────────────────────────
# STEP 2: DATA CLEANING
# ─────────────────────────────────────────
def clean_data(data):
users = data.get("users", [])
projects = data.get("projects", [])
# Filter out users with blank names
users = [u for u in users if u.get("name", "").strip()]
# Normalize each user's data
seen = set()
clean_users = []
for u in users:
# Deduplicate connections and skills using set()
u["connections"] = list(set(u.get("connections", [])))
u["skills"] = list(set(u.get("skills", [])))
u["city"] = u.get("city", "").strip() or "Unknown"
# Skip inactive users (nothing to analyze)
if not u["connections"] and not u.get("liked_projects"):
continue
# Skip duplicate users (same name + city = same person)
fingerprint = (u["name"].lower(), u["city"].lower())
if fingerprint in seen:
continue
seen.add(fingerprint)
clean_users.append(u)
# Deduplicate projects: last write wins for same ID
project_map = {}
for p in projects:
if p.get("name", "").strip(): # Ignore projects with empty names
project_map[p["id"]] = p
data["users"] = clean_users
data["projects"] = list(project_map.values())
return data
# ─────────────────────────────────────────
# STEP 3: PEOPLE YOU MAY KNOW
# ─────────────────────────────────────────
def people_you_may_know(user_id, data, top_n=5):
# Build adjacency: {user_id → set of connected user IDs}
conn_map = {u["id"]: set(u.get("connections", [])) for u in data["users"]}
if user_id not in conn_map:
return []
my_conns = conn_map[user_id] # My direct connections
scores = {} # candidate_id → mutual count
for friend in my_conns: # For each person I know
for candidate in conn_map.get(friend, set()): # Their connections
if candidate == user_id or candidate in my_conns:
continue # Skip: myself + people I already know
scores[candidate] = scores.get(candidate, 0) + 1 # Count mutual
# Return top N sorted by mutual count
return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
# ─────────────────────────────────────────
# STEP 4: PROJECTS YOU MIGHT LIKE
# ─────────────────────────────────────────
def projects_you_might_like(user_id, data, top_n=5):
# Build: {user_id → set of liked project IDs}
proj_map = {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]}
if user_id not in proj_map:
return []
my_likes = proj_map[user_id] # Projects I already like
scores = {} # project_id → recommendation score
for other_id, their_likes in proj_map.items():
if other_id == user_id:
continue
# How similar is this person to me?
overlap = len(my_likes.intersection(their_likes))
if overlap == 0:
continue # Not similar at all, ignore
# Projects they liked that I haven't seen
for proj in their_likes - my_likes:
scores[proj] = scores.get(proj, 0) + overlap # Weight by similarity
return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
# ─────────────────────────────────────────
# STEP 5: DEVELOPERS LIKE YOU
# ─────────────────────────────────────────
def developers_like_you(user_id, data, top_n=3):
# Build: {user_id → set of skills}
skill_map = {u["id"]: set(u.get("skills", [])) for u in data["users"]}
if user_id not in skill_map:
return []
my_skills = skill_map[user_id]
results = []
for other_id, their_skills in skill_map.items():
if other_id == user_id:
continue
shared = my_skills.intersection(their_skills)
if shared:
results.append((other_id, len(shared), list(shared)))
return sorted(results, key=lambda x: x[1], reverse=True)[:top_n]
# ─────────────────────────────────────────
# MAIN EXECUTION
# ─────────────────────────────────────────
if __name__ == "__main__":
# Load and clean data
raw = load_data("devconnect_raw.json")
clean = clean_data(raw)
save_data(clean, "devconnect_clean.json")
# Lookup helpers
user_names = {u["id"]: u["name"] for u in clean["users"]}
proj_names = {p["id"]: p["name"] for p in clean["projects"]}
TARGET_USER = 1
print(f"\n📊 Recommendations for: {user_names.get(TARGET_USER, '?')}")
print("="*55)
# People You May Know
print("\n🤝 People You May Know:")
for uid, mutual in people_you_may_know(TARGET_USER, clean):
print(f" → {user_names.get(uid, '?')} ({mutual} mutual)")
# Projects You Might Like
print("\n📂 Projects You Might Like:")
for pid, score in projects_you_might_like(TARGET_USER, clean):
print(f" → {proj_names.get(pid, '?')} (score: {score})")
# Developers Like You
print("\n💡 Developers Like You:")
for uid, score, skills in developers_like_you(TARGET_USER, clean):
print(f" → {user_names.get(uid, '?')} | Common Skills: {skills}")
8. Edge Cases and Errors
8.1 Common Issues and Fixes
# ❌ MISTAKE 1: Modifying a list while iterating over it
for user in data["users"]:
if not user["name"]:
data["users"].remove(user) # BUG! Skips elements
# ✅ FIX: Use list comprehension instead
data["users"] = [u for u in data["users"] if u["name"].strip()]
# ❌ MISTAKE 2: set() doesn't preserve order (can affect rankings)
user["connections"] = list(set(user["connections"]))
# This is fine for sets of IDs, but if ORDER matters, use:
seen_ids = set()
ordered_unique = []
for cid in user["connections"]:
if cid not in seen_ids:
seen_ids.add(cid)
ordered_unique.append(cid)
user["connections"] = ordered_unique
# ❌ MISTAKE 3: KeyError when user ID is missing
my_conns = conn_map[user_id] # Crashes if user_id not in map
# ✅ FIX: Always validate first
if user_id not in conn_map:
return []
my_conns = conn_map[user_id]
# ❌ MISTAKE 4: Division by zero in scoring
similarity = overlap / total_items # Crashes if total_items == 0
# ✅ FIX: Guard against zero
similarity = overlap / total_items if total_items > 0 else 0
# ❌ MISTAKE 5: Assuming JSON values are always lists
connections = user["connections"] # What if it's None or missing?
# ✅ FIX: Use .get() with a default
connections = user.get("connections") or []
8.2 Debugging Checklist
def validate_data(data):
"""Run sanity checks on cleaned data before running algorithms."""
errors = []
user_ids = {u["id"] for u in data["users"]}
for user in data["users"]:
# Check: all connection IDs actually exist in the dataset
for cid in user.get("connections", []):
if cid not in user_ids:
errors.append(f"User {user['id']} references unknown connection: {cid}")
# Check: all liked project IDs actually exist
project_ids = {p["id"] for p in data["projects"]}
for pid in user.get("liked_projects", []):
if pid not in project_ids:
errors.append(f"User {user['id']} references unknown project: {pid}")
if errors:
print("⚠️ Data validation failed:")
for e in errors:
print(f" - {e}")
else:
print("✅ Data validation passed.")
return len(errors) == 0
9. Pro Developer Insights
9.1 Performance Tips
# ✅ TIP 1: Build lookup dictionaries ONCE, use everywhere
# DON'T do this inside every function call (rebuilds every time):
for user in data["users"]:
if user["id"] == target_id:
...
# DO this: O(1) lookup instead of O(n) search
user_map = {u["id"]: u for u in data["users"]} # Build once
target = user_map.get(target_id) # O(1) lookup
# ✅ TIP 2: Use sets for membership checks — O(1) vs O(n)
# Slow: "if candidate in [1, 2, 3, 4, 5]" → O(n) list scan
# Fast: "if candidate in {1, 2, 3, 4, 5}" → O(1) hash lookup
# ✅ TIP 3: Use generators for large datasets (memory efficiency)
def active_users(data):
"""Yield users one at a time instead of loading all into memory."""
for user in data["users"]:
if user.get("connections") or user.get("liked_projects"):
yield user
# Process one by one without loading entire list
for user in active_users(data):
process(user)
# ✅ TIP 4: Cache frequent computations
def build_indexes(data):
"""Pre-compute all lookup structures once."""
return {
"user_by_id": {u["id"]: u for u in data["users"]},
"project_by_id": {p["id"]: p for p in data["projects"]},
"connections": {u["id"]: set(u.get("connections", [])) for u in data["users"]},
"liked_projects": {u["id"]: set(u.get("liked_projects", [])) for u in data["users"]},
"skills": {u["id"]: set(u.get("skills", [])) for u in data["users"]},
}
9.2 Best Practices
Single Responsibility: Each function does exactly one thing
Defensive coding: Always use .get() with defaults for dict access
Type hints: Add them even in pure Python for readability
Separation of concerns: Keep data loading, cleaning, and analysis in separate functions/modules
Immutability when possible: Don't mutate input data in-place; return new structures
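As a small sketch of that last point (note that `clean_data` above does mutate its input): deep-copy first, then filter, so the caller's data survives intact. The function name here is illustrative:

```python
import copy

def drop_inactive_users(data: dict) -> dict:
    """Return a cleaned copy; the caller's dict is left untouched."""
    result = copy.deepcopy(data)
    result["users"] = [
        u for u in result["users"]
        if u.get("connections") or u.get("liked_projects")
    ]
    return result

original = {"users": [
    {"id": 1, "connections": [2], "liked_projects": []},
    {"id": 4, "connections": [], "liked_projects": []},
]}
cleaned = drop_inactive_users(original)
print(len(original["users"]), len(cleaned["users"]))  # 2 1
```

The deep copy costs memory, so it's a trade-off: for large datasets you might instead document clearly that a function mutates in place.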
9.3 Security in Production
# ✅ Validate input user_id type before processing
def safe_recommend(user_id, data):
if not isinstance(user_id, int) or user_id <= 0:
raise ValueError(f"Invalid user_id: {user_id}")
return people_you_may_know(user_id, data)
# ✅ Limit file sizes before loading in production
import os
MAX_FILE_SIZE_MB = 50
def safe_load(filepath):
size_mb = os.path.getsize(filepath) / (1024 * 1024)
if size_mb > MAX_FILE_SIZE_MB:
raise ValueError(f"File too large: {size_mb:.1f} MB")
return load_data(filepath)
10. Comparison
Pure Python vs pandas vs Graph Libraries
| Aspect | Pure Python | pandas | NetworkX (graph lib) |
|---|---|---|---|
| Dependencies | Zero | Heavy | Moderate |
| Setup | Instant | pip install pandas | pip install networkx |
| Speed (small data) | Fast | Slower (overhead) | Moderate |
| Speed (large data) | Slow | Very fast | Fast |
| Memory | Efficient | Higher | High |
| Code readability | Explicit | Concise | Concise |
| Recommendation logic | Manual | Manual + apply() | Built-in graph algos |
| Production use | Small/embedded | ETL pipelines | Social network analysis |
| Best for | Learning, APIs | Data analysis | Graph-heavy features |
When to Use Pure Python
✅ Use when:
Learning algorithms from scratch
Building lean microservices
No external dependencies allowed (security policy)
Data is small (< 10,000 users)
Embedding logic in another system
❌ Don't use when:
Dataset has millions of records (use pandas + NumPy)
You need matrix operations (use NumPy)
Complex graph algorithms like shortest path (use NetworkX)
Time-to-market matters more than learning (use scikit-learn)
11. Data Science Perspective
11.1 What We Implemented (ML Terms)
| What We Built | ML Term | Category |
|---|---|---|
| Mutual friends count | Graph-based similarity | Collaborative Filtering |
| Shared skills matching | Content-based filtering | Content-Based Recommendation |
| Liked projects overlap | User-User Collaborative Filtering | Collaborative Filtering |
| Ranking by score | Scoring & Ranking | Retrieval System |
11.2 The Recommendation System Taxonomy
Recommendation Systems
├── Collaborative Filtering
│ ├── User-User (what we built for projects)
│ └── Item-Item (what Amazon uses for "Customers also bought")
├── Content-Based Filtering
│ └── Skills matching (what we built for developers)
├── Graph-Based
│ └── Mutual connections (what we built for people)
└── Hybrid (combination of all above → Netflix, Spotify)
11.3 Extending to Real ML
Once you understand this pure Python version, the next step is:
# This is what our "collaborative filtering" becomes at scale:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# User-project matrix (our liked_projects as a binary matrix)
# User 1 likes [201, 202] → [1, 1, 0, 0, 0]
# User 2 likes [202, 203] → [0, 1, 1, 0, 0]
user_matrix = np.array([
[1, 1, 0, 0, 0], # Arjun
[0, 1, 1, 0, 0], # Sneha
[1, 0, 0, 1, 0], # Vikram
[0, 0, 0, 0, 1], # Nisha
])
similarity = cosine_similarity(user_matrix)
# similarity[0] tells us how similar Arjun is to everyone else
The pure Python .intersection() count IS essentially a simplified version of cosine similarity on binary vectors.
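You can check that equivalence in pure Python: for binary vectors, cosine similarity reduces to the intersection size divided by the geometric mean of the two set sizes. A quick sketch:

```python
import math

def cosine_from_sets(a: set, b: set) -> float:
    """Cosine similarity of two binary like-vectors,
    computed directly from their sets of liked IDs."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Arjun likes {201, 202}, Sneha likes {202, 203}
print(cosine_from_sets({201, 202}, {202, 203}))  # 0.5
```

That 0.5 matches what `cosine_similarity` returns for the rows `[1, 1, 0, 0, 0]` and `[0, 1, 1, 0, 0]` in the NumPy example above.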
12. Interview Questions
Basic
Q1: What is collaborative filtering?
A recommendation technique that suggests items to a user based on the preferences of similar users. "People who liked what you liked also liked..."
Q2: Why use set() for finding mutual connections?
Set intersection runs in O(min(len(A), len(B))), and set membership checks are O(1) on average, making both far faster than nested loops over lists.
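A rough benchmark makes the difference concrete (absolute timings vary by machine, but the ordering shouldn't):

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)

# Worst-case membership check: last element of the list
t_list = timeit.timeit(lambda: 99_999 in big_list, number=100)  # O(n) scan
t_set = timeit.timeit(lambda: 99_999 in big_set, number=100)    # O(1) hash lookup
print(t_list > t_set)  # True on any typical machine
```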
Q3: What does dict.get(key, default) do?
Returns the value for `key` if it exists, otherwise returns `default`, preventing `KeyError` exceptions.
Q4: What is the difference between json.load() and json.loads()?
`json.load(file_object)` reads JSON from an open file object; `json.loads(string)` parses a JSON string.
Intermediate
Q5: What is a 2-hop graph traversal?
Starting from node A, visit A's neighbors (1 hop), then visit the neighbors of those neighbors (2 hops). Used in "People You May Know."
Q6: How would you handle a very large JSON file that doesn't fit in memory?
Use the `ijson` library for streaming JSON parsing, or split the file into chunks and process records one at a time with generators.
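If you control the export format, a dependency-free alternative is JSON Lines (one JSON object per line), which the standard library can stream with a generator. A sketch, assuming the data has been re-exported in that format:

```python
import json

def iter_records_jsonl(filepath):
    """Stream records from a JSON Lines file (one JSON object per line).
    Only one record is held in memory at a time."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Usage sketch: count active users without loading the whole file
# active = sum(1 for u in iter_records_jsonl("users.jsonl") if u.get("connections"))
```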
Q7: What's the time complexity of the people_you_may_know function?
Building the connection map costs O(U × C), and the 2-hop walk itself is O(C²), where U is the number of users and C the average connections per user. Precomputing recommendations for every user costs O(U × C²), which gets expensive for dense graphs; cache the results, or use an explicit BFS with a visited set.
Advanced
Q8: How would you scale this to 1 million users?
Options:
1. Pre-compute recommendations offline (batch job), cache in Redis
2. Use sparse matrices (scipy.sparse) for collaborative filtering
3. Approximate nearest neighbors (Faiss, Annoy) for skill matching
4. Graph databases (Neo4j) for the connections/mutual friend queries
5. Distributed processing with Spark for the batch computation
Q9: What's the cold start problem and how do you handle it?
New users have no history — no connections, no likes. Solutions: ask for skills/interests onboarding, recommend trending projects, or use demographic data (city, role).
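A minimal sketch of the trending fallback in pure Python, using `collections.Counter` (the function name is illustrative):

```python
from collections import Counter

def trending_projects(data: dict, top_n: int = 3) -> list:
    """Cold-start fallback: rank projects by total like count across all users."""
    likes = Counter()
    for user in data["users"]:
        likes.update(user.get("liked_projects", []))
    return likes.most_common(top_n)

data = {"users": [
    {"liked_projects": [201, 202]},
    {"liked_projects": [202, 203]},
    {"liked_projects": [202]},
]}
print(trending_projects(data, 2))  # [(202, 3), (201, 1)]
```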
Q10: Explain why we weight recommendations by overlap count in projects_you_might_like.
If User B has 5 projects in common with me and User C has only 1, User B's additional likes are more trustworthy signals. The overlap acts as a confidence weight for that user's influence on my recommendations.
Scenario-Based
Q11: A user complains they keep seeing the same recommendations. What would you do?
Add diversity: after computing top-N, filter out any already shown recommendations. Add a "seen" list per user. Introduce randomness (epsilon-greedy). Decay scores over time.
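The "seen list" part is a one-line filter on top of any of the ranking functions above; a sketch with illustrative names:

```python
def filter_seen(ranked: list, seen_ids: set, top_n: int = 5) -> list:
    """Drop (id, score) recommendations the user has already been shown."""
    return [(rid, score) for rid, score in ranked if rid not in seen_ids][:top_n]

ranked = [(4, 3), (7, 2), (9, 1)]  # e.g. output of people_you_may_know
print(filter_seen(ranked, seen_ids={4}))  # [(7, 2), (9, 1)]
```

In a real system the seen set would be persisted per user (e.g. in Redis) and pruned or decayed over time.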
Q12: How would you A/B test whether the mutual-friends recommendation actually leads to more connections?
Split users 50/50 — Group A gets mutual-friend recommendations, Group B gets random suggestions. Track: connection acceptance rate, session length, DAU. Run for 2+ weeks for statistical significance.
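For the significance check, a two-proportion z-test needs only the `math` module. The counts below are made up purely for illustration:

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Z-statistic for the difference between two acceptance rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Group A: mutual-friend recs, Group B: random suggestions (hypothetical counts)
z = two_proportion_z(420, 5000, 350, 5000)
print(round(z, 2))  # |z| > 1.96 means significant at the 5% level
```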
13. Conclusion
What You Learned
| Concept | Takeaway |
|---|---|
| Pure Python data handling | json, sets, dicts, and sorted() cover a lot of real data work |
| Data cleaning | Real data is always messy; clean it before analyzing |
| Graph thinking | Social networks are graphs; connections are edges |
| Collaborative filtering | Shared behavior → inferred common interest |
| Content-based filtering | Shared attributes → similar users |
| Recommendation scoring | Count/weight signals to rank suggestions |
| Production patterns | API-ready output, validation, error handling |
When to Use This Approach
✅ Use pure Python when you're learning algorithms, building lightweight services, or working in constrained environments.
✅ Graduate to pandas/NumPy when your dataset grows beyond 50,000 records.
✅ Graduate to ML libraries (scikit-learn, LightFM, Surprise) when you need matrix factorization, latent factors, or advanced CF.
Final Advice
The best data scientists aren't the ones who know the most libraries. They're the ones who understand the fundamentals deeply enough to implement any algorithm from scratch when needed.
Once you can build a recommendation engine in pure Python, every pandas function, every sklearn model, every Spark job becomes something you understand — not just something you copy from Stack Overflow.
Build from scratch. Break things. Fix them. That's how you grow. 🚀
Have questions or want to extend this with a Flask API wrapper? Drop them in the comments below!