ranjeet258/Movie-Recommender-System

🎬 Movie Recommender System

A content-based filtering recommender that suggests similar movies using NLP and cosine similarity on the TMDB 5000 dataset.


Pipeline Overview

CSV Files → Merge → Feature Extraction → Tags Column → Stemming → CountVectorizer → Cosine Similarity → recommend() → Pickle → Web App


Phases & Methods

① Data Ingestion

| Step | Method | Detail |
|---|---|---|
| Load datasets | pd.read_csv() | tmdb_5000_movies.csv + tmdb_5000_credits.csv |
| Merge | pd.merge(on='title') | Join on movie title |
| Select columns | DataFrame slicing | movie_id, title, overview, genres, keywords, cast, crew |
| Clean | dropna(inplace=True) | Remove rows with missing values |
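
The ingestion steps above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: the two small DataFrames are made-up stand-ins for the CSV files (which would normally be loaded with pd.read_csv()), and the reset_index() call is an addition that is not listed in the table.

```python
import pandas as pd

# Toy stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv (values are made up).
movies = pd.DataFrame({
    "title": ["Avatar", "Spectre", "Untitled"],
    "overview": ["A marine on Pandora.", "Bond faces Spectre.", None],
    "genres": ['[{"id": 28, "name": "Action"}]'] * 3,
    "keywords": ['[{"id": 1, "name": "future"}]'] * 3,
})
credits = pd.DataFrame({
    "movie_id": [19995, 206647, 1],
    "title": ["Avatar", "Spectre", "Untitled"],
    "cast": ["[]"] * 3,
    "crew": ["[]"] * 3,
})

# Join on title, keep only the columns used downstream, drop incomplete rows.
merged = movies.merge(credits, on="title")
merged = merged[["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]]
# Assumption: resetting the index keeps row labels equal to row positions,
# which later lines up with the rows of the similarity matrix.
merged = merged.dropna().reset_index(drop=True)

print(merged.shape)  # (2, 7)
```

Note that merging on title produces extra rows whenever two films share a title, so the merged row count can exceed the number of rows in the movies file.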

② Feature Engineering

| Column | Method | Logic |
|---|---|---|
| genres | convert() + ast.literal_eval() | Extract all genre names from JSON string |
| keywords | convert() + ast.literal_eval() | Extract all keyword names |
| cast | convert3() | Top 3 actors only |
| crew | find_director() | Extract only job == 'Director' |
| overview | lambda x: x.split() | Tokenize into word list |
| Space removal | str.replace(" ", "") | "Sam Mendes" → "SamMendes" (prevents token splitting) |
| tags | List concatenation + " ".join() | Combine all 5 columns into one string per movie |
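
One way the helpers named in the table might look is sketched below. The function names convert(), convert3() and find_director() come from the table, but their bodies here are assumptions, strip_spaces() is a hypothetical helper added for the example, and all input strings are made up.

```python
import ast

def convert(text):
    """Return every 'name' value from a JSON-like list-of-dicts string."""
    return [item["name"] for item in ast.literal_eval(text)]

def convert3(text):
    """Return only the first three 'name' values (top-billed cast)."""
    return [item["name"] for item in ast.literal_eval(text)[:3]]

def find_director(text):
    """Return the names of crew members whose job is 'Director'."""
    return [item["name"] for item in ast.literal_eval(text) if item["job"] == "Director"]

def strip_spaces(names):
    """Hypothetical helper: collapse each multi-word name into one token."""
    return [name.replace(" ", "") for name in names]

# Made-up row in the same shape as the TMDB columns.
overview = "A cryptic message arrives"
genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
keywords = '[{"id": 1, "name": "secret agent"}]'
cast = ('[{"name": "Actor One"}, {"name": "Actor Two"}, '
        '{"name": "Actor Three"}, {"name": "Actor Four"}]')
crew = '[{"job": "Director", "name": "Sam Mendes"}, {"job": "Producer", "name": "Some Producer"}]'

# Concatenate the five token lists, then join into one tags string.
tokens = (overview.split()
          + strip_spaces(convert(genres))
          + strip_spaces(convert(keywords))
          + strip_spaces(convert3(cast))
          + strip_spaces(find_director(crew)))
tags = " ".join(tokens)
print(tags)
```

ast.literal_eval() works here because the strings contain only Python-compatible literals; a string containing JSON null, true or false would need json.loads() instead.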

③ NLP & Vectorization

| Step | Library | Detail |
|---|---|---|
| Stemming | nltk.stem.porter.PorterStemmer | Reduce words to root form ("loved" → "love") |
| Vectorization | sklearn CountVectorizer | max_features=5000, stop_words='english' |
| Output Matrix | NumPy .toarray() | Shape: (4806 movies × 5000 features); dense array, mostly zeros |
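
The stemming and vectorization steps can be sketched as follows. The stem() wrapper and the three tag strings are assumptions made for illustration; the CountVectorizer parameters are the ones listed in the table.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

ps = PorterStemmer()

def stem(text):
    """Stem every whitespace-separated token and rejoin into one string."""
    return " ".join(ps.stem(word) for word in text.split())

# Made-up tag strings standing in for the real tags column.
tags = [
    "loved the action and the dancing",
    "loving action heroes",
    "a quiet romantic drama",
]
stemmed = [stem(t) for t in tags]

cv = CountVectorizer(max_features=5000, stop_words="english")
sparse_counts = cv.fit_transform(stemmed)  # SciPy sparse matrix of term counts
vectors = sparse_counts.toarray()          # dense NumPy array, one row per movie

print(vectors.shape[0])        # 3 rows, one per tag string
print(sorted(cv.vocabulary_))  # the learned vocabulary of stems
```

fit_transform() returns a SciPy sparse matrix, and .toarray() converts it to a dense array. scikit-learn's cosine_similarity() also accepts the sparse matrix directly, so the dense conversion is optional if memory is a concern.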

④ Similarity Computation

| Step | Method | Detail |
|---|---|---|
| Distance metric | sklearn cosine_similarity() | Chosen over Euclidean; robust in high dimensions |
| Output | 4806 × 4806 matrix | similarity[i][j] = score between movie i and movie j |
| Diagonal | Always = 1.0 | Each movie is 100% similar to itself |
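
The properties in this table can be checked on a small example. The 3 × 4 count matrix below is made up; only the cosine_similarity() call reflects the step described above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors: 3 "movies" x 4 vocabulary terms (values are made up).
vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 2],
])

# Pairwise cosine similarity between every pair of rows.
similarity = cosine_similarity(vectors)

print(similarity.shape)         # (3, 3): one row and one column per movie
print(np.round(similarity, 3))  # diagonal is 1.0; movies 0 and 2 share no terms, so 0.0
```

The matrix is symmetric, so similarity[i][j] and similarity[j][i] hold the same score.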

Why Cosine, not Euclidean? In high-dimensional sparse vectors, Euclidean distance is distorted by vector magnitude. Cosine similarity measures the angle between vectors — direction, not length — making it ideal for bag-of-words representations.
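
A small worked example makes the difference concrete. The three count vectors are made up, and the two helper functions are plain-Python sketches rather than the scikit-learn implementations.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)

short_tags = [1, 2, 0]    # word counts for a short description
long_tags = [10, 20, 0]   # same word mix, ten times as many words
other_tags = [0, 1, 2]    # different word mix, similar length

print(round(cosine(short_tags, long_tags), 3))      # 1.0
print(round(cosine(short_tags, other_tags), 3))     # 0.4
print(round(euclidean(short_tags, long_tags), 3))   # 20.125
print(round(euclidean(short_tags, other_tags), 3))  # 2.449
```

Euclidean distance ranks other_tags as closer to short_tags than long_tags is, purely because of vector length. Cosine similarity ranks long_tags as the better match, since it uses exactly the same mix of words.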

⑤ Recommendation & Deployment

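def recommend(movie):
    # Look up the row position (not the index label) so it stays aligned
    # with the similarity matrix even if dropna left gaps in the labels
    positions = (new_df['title'] == movie).to_numpy().nonzero()[0]
    if len(positions) == 0:            # unknown title
        print(f"'{movie}' not found")
        return
    index = positions[0]
    distances = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])
    for i in distances[1:6]:           # skip [0] = self
        print(new_df.iloc[i[0]].title)

| Output | File | Contents |
|---|---|---|
| Movie data | movies.pkl | new_df with columns movie_id, title, tags |
| Similarity | similarity.pkl | 4806 × 4806 cosine matrix |
| App | Web (Streamlit/Flask) | Loads .pkl files, serves recommendations |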
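
The serialization step might look like the sketch below. The file names match the table; the tiny DataFrame and matrix are made-up stand-ins, and writing to a temporary directory is an assumption made so the example runs anywhere.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for the real new_df and similarity objects.
new_df = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "tags": ["x y", "y z"]})
similarity = np.array([[1.0, 0.5], [0.5, 1.0]])

out_dir = tempfile.mkdtemp()  # stand-in for the project directory
movies_path = os.path.join(out_dir, "movies.pkl")
similarity_path = os.path.join(out_dir, "similarity.pkl")

# Notebook side: write both artifacts.
with open(movies_path, "wb") as f:
    pickle.dump(new_df, f)
with open(similarity_path, "wb") as f:
    pickle.dump(similarity, f)

# Web app side: load them once at startup.
with open(movies_path, "rb") as f:
    movies = pickle.load(f)
with open(similarity_path, "rb") as f:
    loaded_similarity = pickle.load(f)

print(movies.equals(new_df), np.array_equal(loaded_similarity, similarity))
```

A 4806 × 4806 matrix of float64 values takes roughly 185 MB, so similarity.pkl is a large file. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.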

Key Design Decisions

  • Top-3 cast only — avoids over-weighting movies with large ensemble casts
  • Space removal in names: "Sam Mendes" becomes one token "SamMendes", not two unrelated words
  • Porter Stemmer — normalises tense/plurality before vectorization
  • 5000 max features — balances vocabulary coverage vs. dimensionality
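
The effect of space removal on tokenization can be seen with a default CountVectorizer. The vocabulary() helper is hypothetical, and the second director name is chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

def vocabulary(docs):
    """Return the sorted list of tokens CountVectorizer learns from docs."""
    cv = CountVectorizer()
    cv.fit(docs)
    return sorted(cv.vocabulary_)

with_spaces = vocabulary(["Sam Mendes", "Sam Raimi"])
without_spaces = vocabulary(["SamMendes", "SamRaimi"])

print(with_spaces)     # ['mendes', 'raimi', 'sam']
print(without_spaces)  # ['sammendes', 'samraimi']
```

With spaces left in, the two directors share the token "sam" and would look partially similar. With spaces removed, each name becomes its own distinct feature.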

Tech Stack

pandas · numpy · ast · nltk · scikit-learn · pickle


Files

Movie_Recommender_System.ipynb   # Main notebook
movies.pkl                       # Serialized movie DataFrame
similarity.pkl                   # Serialized similarity matrix

About

Movie Recommender System – A Python‑based recommendation app that takes a movie name as input and suggests 5 similar movies using content‑based filtering on movie features like genre, cast, and description
