Mid-term Project – 2021
IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content.
Data
For the mid-term project (mtp), you are required to handle a 7 GB dataset. In particular, you will work with the subsets of IMDb data available at datasets-imdbws. This data are publicly available and refreshed daily.
IMDb provides a detailed description of the files available here. In particular, you can access the following files:
- title.akas.tsv.gz
- title.basics.tsv.gz
- title.crew.tsv.gz
- title.episode.tsv.gz
- title.principals.tsv.gz
- title.ratings.tsv.gz
- name.basics.tsv.gz
All these files are gzipped and in tsv (tab-separated-values) format.
Tasks
For the MTP you are required to:
- Import the above tables into your
localhost1 via SQL (psql, PgAdmin4) and/or python (psycopg2). - Set appropriate data types and constraints2 per
each column (n.b.: you may also modify columns with postgre built-in
functions) – use
SQLonly; - Propose at least 3 meaningful views (CREATE
VIEW) leveraging on
JOINs and/or
Aggregations – use
SQLonly; - [Optional] Apply indexes to enhance the performance of your queries (use
EXPLAIN ANALYZE to
show performance improvements) – use
SQLonly.
Some useful references
Deliverables
By June 18th (8:00 PM, London time), groups are required to upload:
- SQL or Python scripts;
- Entity-Relationship Diagram (accepted format: .png);
- Supporting documentation (accepted format: .md, .docx, or .pdf generated via LaTeX) containing:
- a detailed justification of your design choices;
- a clear and concise description of the view insights;
- [optional] a concise description of the indexing benefits.
The supporting documentation should not exceed 1,500 words.
Notes
1: To speed up the process, you may look at COPY FROM PROGRAM combined with wget && gzip -dc. Or via psycopg2, you may be interested in execute_batch.
2: Being a subset of the IMDb data, you may encounter missing values in some fields you want to set as a reference for a Foreign Key. To solve these issues (“quick-and-dirty” solution), you may delete non-matched inserts. Given the number of lines involved (>12M and in one case >40M), you may speed up the process using indexes.