Mid-term Project – 2021

IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content.

Data

For the mid-term project (mtp), you are required to handle a 7 GB dataset. In particular, you will work with the subsets of IMDb data available at datasets-imdbws. This data are publicly available and refreshed daily.

IMDb provides a detailed description of the files available here. In particular, you can access the following files:

All these files are gzipped and in tsv (tab-separated-values) format.

Tasks

For the MTP you are required to:

Import the above tables into your localhost¹ via SQL (psql, PgAdmin4) and/or python (psycopg2).
Set appropriate data types and constraints² per each column (n.b.: you may also modify columns with postgre built-in functions) – use SQL only;
Propose at least 3 meaningful views (CREATE VIEW) leveraging on JOINs and/or Aggregations – use SQL only;
[Optional] Apply indexes to enhance the performance of your queries (use EXPLAIN ANALYZE to show performance improvements) – use SQL only.

Some useful references

Deliverables

By June 18th (8:00 PM, London time), groups are required to upload:

SQL or Python scripts;
Entity-Relationship Diagram (accepted format: .png);
Supporting documentation (accepted format: .md, .docx, or .pdf generated via LaTeX) containing:
- a detailed justification of your design choices;
- a clear and concise description of the view insights;
- [optional] a concise description of the indexing benefits.

The supporting documentation should not exceed 1,500 words.

Notes

1: To speed up the process, you may look at COPY FROM PROGRAM combined with wget && gzip -dc. Or via psycopg2, you may be interested in execute_batch.

2: Being a subset of the IMDb data, you may encounter missing values in some fields you want to set as a reference for a Foreign Key. To solve these issues (“quick-and-dirty” solution), you may delete non-matched inserts. Given the number of lines involved (>12M and in one case >40M), you may speed up the process using indexes.