Skip to the content.

Mid-term Project – 2021

IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content.

Data

For the mid-term project (mtp), you are required to handle a 7 GB dataset. In particular, you will work with the subsets of IMDb data available at datasets-imdbws. This data are publicly available and refreshed daily.

IMDb provides a detailed description of the files available here. In particular, you can access the following files:

All these files are gzipped and in tsv (tab-separated-values) format.

Tasks

For the MTP you are required to:

  1. Import the above tables into your localhost1 via SQL (psql, PgAdmin4) and/or python (psycopg2).
  2. Set appropriate data types and constraints2 per each column (n.b.: you may also modify columns with postgre built-in functions) – use SQL only;
  3. Propose at least 3 meaningful views (CREATE VIEW) leveraging on JOINs and/or Aggregations – use SQL only;
  4. [Optional] Apply indexes to enhance the performance of your queries (use EXPLAIN ANALYZE to show performance improvements) – use SQL only.

Some useful references

Deliverables

By June 18th (8:00 PM, London time), groups are required to upload:

The supporting documentation should not exceed 1,500 words.


Notes

1: To speed up the process, you may look at COPY FROM PROGRAM combined with wget && gzip -dc. Or via psycopg2, you may be interested in execute_batch.

2: Being a subset of the IMDb data, you may encounter missing values in some fields you want to set as a reference for a Foreign Key. To solve these issues (“quick-and-dirty” solution), you may delete non-matched inserts. Given the number of lines involved (>12M and in one case >40M), you may speed up the process using indexes.