
A REST API in Python that scrapes data for the top 250 movies from the IMDb website and stores it in a database. Every request serves only the latest data.


Smiley-nrk/IMDb-API


IMDb API

This is a REST API that serves details of the top 250 movies fetched from the IMDb website. Movies can be searched by name or description, and the full list can be sorted by duration, name, release date, or rating.

Details Fetched:

  • Title (Name)
  • Summary (Description)
  • Rank
  • Rating
  • Duration
  • Release Date
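The fields above, together with the sort and search options, can be pictured with a small in-memory sketch. The field types (duration in minutes, release date as a `date`), the `Movie` class, and the helper names `sort_movies` and `search_movies` are assumptions for illustration and are not taken from the project's code. The case-insensitive substring match is also an assumption about how search behaves.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Movie:
    # One record per movie; types are illustrative assumptions.
    title: str
    summary: str
    rank: int
    rating: float
    duration_minutes: int
    release_date: date


# Maps the sortBy values used by the API to a sort key on the record.
SORT_KEYS = {
    "duration": lambda m: m.duration_minutes,
    "name": lambda m: m.title.lower(),
    "releaseDate": lambda m: m.release_date,
    "rating": lambda m: m.rating,
}


def sort_movies(movies: List[Movie], sort_by: str, desc: bool = False) -> List[Movie]:
    """Return a new list ordered by the requested key."""
    return sorted(movies, key=SORT_KEYS[sort_by], reverse=desc)


def search_movies(
    movies: List[Movie],
    name: Optional[str] = None,
    description: Optional[str] = None,
) -> List[Movie]:
    """Case-insensitive substring match on title and/or summary."""
    results = list(movies)
    if name is not None:
        results = [m for m in results if name.lower() in m.title.lower()]
    if description is not None:
        results = [m for m in results if description.lower() in m.summary.lower()]
    return results
```

In the real service the sorting and filtering may well be delegated to MongoDB queries instead; the sketch only shows the intended behaviour.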

Demo on Youtube:

A demo execution video is available at: https://youtu.be/d9zeXq-qmp8

Features:

- Fetch all 250 movie details (http://localhost:5000/movies/all)
	. sorted by Duration
		http://localhost:5000/movies/all?sortBy=duration
		http://localhost:5000/movies/all?sortBy=duration&desc=1
		
	. sorted by Name
		http://localhost:5000/movies/all?sortBy=name
		http://localhost:5000/movies/all?sortBy=name&desc=1
		
	. sorted by Release Date
		http://localhost:5000/movies/all?sortBy=releaseDate
		http://localhost:5000/movies/all?sortBy=releaseDate&desc=1
		
	. sorted by Rating
		http://localhost:5000/movies/all?sortBy=rating
		http://localhost:5000/movies/all?sortBy=rating&desc=1
	
- Fetch movie details based on search
	. searched by Name
		http://localhost:5000/movie?name=godfather 
		
	. searched by Description
		http://localhost:5000/movie?desc=machine
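The URLs listed above follow a regular pattern, so a client can build them from parameters. The following is a minimal sketch, assuming the server runs at `http://localhost:5000` as in the examples; the helper names `movies_all_url` and `movie_search_url` are hypothetical and not part of the project.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"  # host and port taken from the examples above


def movies_all_url(sort_by: Optional[str] = None, descending: bool = False) -> str:
    """Build a /movies/all URL with optional sortBy and desc=1 parameters."""
    params = {}
    if sort_by is not None:
        params["sortBy"] = sort_by
    if descending:
        params["desc"] = 1
    query = urlencode(params)
    return f"{BASE_URL}/movies/all" + (f"?{query}" if query else "")


def movie_search_url(name: Optional[str] = None, description: Optional[str] = None) -> str:
    """Build a /movie URL; note that here 'desc' carries the description text."""
    params = {}
    if name is not None:
        params["name"] = name
    if description is not None:
        params["desc"] = description
    query = urlencode(params)
    return f"{BASE_URL}/movie" + (f"?{query}" if query else "")
```

Note that `desc` means "descending order" on `/movies/all` but "description" on `/movie`, which is why the two helpers use different argument names.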

Authentication:

Architecture

here

Implementation Detail

The implementation is in Python, using BeautifulSoup and Requests for scraping, MongoDB as the database, and RabbitMQ as the message queue.

The list of the top 250 movies is available at: https://www.imdb.com/chart/top?ref_=nv_mv_250 . However, some movie attributes are not available on this page; for those, each movie's own page has to be visited. In scraping terms, this means 251 HTTP(S) requests are needed (one for the chart plus one per movie), which makes the process slow.

So the first design decision is not to send 251 requests on every API call. Instead, the data is stored in the DB and served from there. To decide when the DB data needs updating, every API call sends only one HTTP(S) request to IMDb and scrapes the list of movie ranks and titles. That list is compared with the data in the DB. If the two are inconsistent, the data stored in the DB is updated. RabbitMQ is used for the communication between the DBService and the Scraper Service during this update.
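The freshness check described above can be sketched as follows. This is a rough outline under stated assumptions, not the project's actual code: the four callables are hypothetical stand-ins for the single chart scrape, the MongoDB reads, and the RabbitMQ-triggered refresh, and the sketch assumes the refresh has completed before the data is read back.

```python
from typing import Callable, Iterable, List, Tuple

RankedTitle = Tuple[int, str]  # (rank, title) as scraped from the chart page


def is_stale(scraped: Iterable[RankedTitle], stored: Iterable[RankedTitle]) -> bool:
    """Return True when the chart's rank/title list differs from the stored one."""
    return sorted(scraped) != sorted(stored)


def serve_movies(
    fetch_chart: Callable[[], List[RankedTitle]],          # hypothetical: one request to the chart page
    load_stored_index: Callable[[], List[RankedTitle]],    # hypothetical: rank/title pairs from the DB
    request_refresh: Callable[[List[RankedTitle]], None],  # hypothetical: ask the scraper to update the DB
    load_movies: Callable[[], list],                       # hypothetical: full movie records from the DB
) -> list:
    """Handle one API call: check freshness with a single scrape, then serve from the DB."""
    scraped = fetch_chart()
    if is_stale(scraped, load_stored_index()):
        # Only in this case are the per-movie pages scraped again.
        request_refresh(scraped)
    return load_movies()
```

Passing the dependencies in as callables keeps the decision logic testable without a network connection, a database, or a message broker.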

Flow chart:

The rough flow chart is available here

Helpful Links:
