Final Course Project – 2020

For the final course project, you are required to work with a small collection of Twitter data gathered during Game 3 of the 2018 NBA Finals between the Cleveland Cavaliers and the Golden State Warriors. The data are publicly available at this link. There, you can find a detailed description of the data and the following files:

TweetsNBA.csv: 51,425 observations, 44 columns (27.24 MB);
locations.csv: 4,136 observations, 3 columns (122.64 KB);
TweetsNBA.json: 51,425 observations, more than 100 fields (348.56 MB).

For further information on the data structure and content, check the Tweet Data Dictionary.

Note:

You can ignore the locations.csv file;
On Microsoft Teams (Assignments->Files), you can find:
- psql_cleaned.json: a cleaned version of TweetsNBA.json to load into PostgreSQL
- mongo_cleaned.json: a cleaned version of TweetsNBA.json to load into MongoDB

Task

Your task is to prepare the data for later analysis by a senior data scientist. In particular, you are required to (A) clean and structure the raw data, and (B) provide some useful insights.

A: To perform data cleaning, structuring, and manipulation, you can use either PostgreSQL or MongoDB. The expected result is a well-designed dataset that follows the data modeling approach of the DBMS you choose (relational for PostgreSQL, document-oriented for MongoDB).

B: To provide useful insights, you can use either SQL or the MongoDB query language. The expected result is a set of descriptive statistics that depict interesting trends or noteworthy characteristics of the data.
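As a rough illustration of steps A and B, the sketch below splits a raw tweet object into flat tweet, user, and hashtag records and then derives a few descriptive statistics from them. It is only a plain-Python stand-in for the idea: in the project itself, the same work is expected to be done in PostgreSQL or MongoDB. The field names (id_str, user, entities.hashtags, lang) follow the usual Twitter tweet object layout, but whether the cleaned files keep exactly these names is an assumption. The function names normalize_tweet and describe, and the sample records, are hypothetical.

```python
from collections import Counter


def normalize_tweet(tweet):
    """Split one raw tweet object into a flat tweet row, a user row, and hashtags."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    tweet_row = {
        "tweet_id": tweet.get("id_str"),
        "user_id": user.get("id_str"),
        "created_at": tweet.get("created_at"),
        "lang": tweet.get("lang"),
        "text": tweet.get("text"),
    }
    user_row = {
        "user_id": user.get("id_str"),
        "screen_name": user.get("screen_name"),
    }
    # Lowercase hashtags so that "NBAFinals" and "nbafinals" are counted together.
    hashtags = [
        h["text"].lower()
        for h in (entities.get("hashtags") or [])
        if h.get("text")
    ]
    return tweet_row, user_row, hashtags


def describe(tweets):
    """Compute simple descriptive statistics over an iterable of raw tweets."""
    tweets_per_lang = Counter()
    hashtag_counts = Counter()
    users = {}
    for raw in tweets:
        tweet_row, user_row, hashtags = normalize_tweet(raw)
        tweets_per_lang[tweet_row["lang"]] += 1
        hashtag_counts.update(hashtags)
        users[user_row["user_id"]] = user_row  # keep one row per user
    return {
        "tweets_per_lang": dict(tweets_per_lang),
        "top_hashtags": hashtag_counts.most_common(3),
        "distinct_users": len(users),
    }


# Made-up sample records, not taken from the actual dataset.
sample = [
    {"id_str": "1", "lang": "en", "text": "What a game!",
     "user": {"id_str": "10", "screen_name": "fan_a"},
     "entities": {"hashtags": [{"text": "NBAFinals"}, {"text": "Cavs"}]}},
    {"id_str": "2", "lang": "es", "text": "Gran partido",
     "user": {"id_str": "11", "screen_name": "fan_b"},
     "entities": {"hashtags": [{"text": "nbafinals"}]}},
    {"id_str": "3", "lang": "en", "text": "Overtime?",
     "user": {"id_str": "10", "screen_name": "fan_a"}},
]

summary = describe(sample)
print(summary)
```

Keeping users and hashtags separate from tweets mirrors a normalized relational design. With MongoDB, one might instead keep the hashtags embedded in each tweet document and compute the same counts with an aggregation pipeline.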

The senior data scientist should be able to (i) clearly understand your data handling strategy, (ii) easily interact with the processed data, and (iii) gain useful knowledge about the dataset. You can also provide additional analysis with PySpark (e.g., you may want to use the MLlib library). In doing so, you can also suggest future research directions based on your analysis.

References

Working with NBA and Twitter data is not uncommon in management disciplines. You may want to look at the following references:

Deliverables

By July 17 (8:00 PM, London time), groups are required to upload:

SQL/JS/Python scripts;
Supporting documentation (accepted formats: .md, .docx, or .pdf generated via LaTeX) containing:
- a detailed justification of your design choices;
- a clear and concise description of the insights drawn from the descriptive statistics;
- a clear and concise description of any further insights and results obtained by analyzing the data with PySpark (not mandatory).