Final course project – 2022

For the final course project (fcp), you will handle a real-world dataset. In particular, you are required to store, manipulate, and analyze GitHub data for PyTorch and TensorFlow. Both projects are examples of open source software (OSS) development.

OSS challenges common managerial assumptions about how organizations are structured and how they function (Gulati et al., 2012), and it has attracted researchers from several disciplines (e.g., information systems, management, and sociology). The OSS phenomenon is also highly relevant from a business perspective: consider the Python project or the father of them all, Linux (powering NASA projects, Chrome OS, Android hardware, and the largest share of servers worldwide). The OSS experience keeps offering fresh business and research insights and may guide us in the next phase of organizing technologies based on remote work.

Tasks

You are required to choose your preferred DBMS – PostgreSQL or MongoDB – and:

  1. Clean, manipulate, and structure data. The expected result is a well-designed dataset that complies with the specific approach of the chosen DBMS.
  2. Provide valuable descriptive insights. The expected result is a set of descriptive statistics that depicts some interesting trends or noteworthy data characteristics.
  3. [optional] Perform an insightful data analysis. For example, you can analyse the modularity of the source code or the features of issues getting the community’s attention. You may want to skim through the reference list provided to get some inspiration.

To perform tasks 1 and 2, you need to use either SQL or MQL (MongoDB Query Language). Alternatively, if you prefer using Python, you can rely on psycopg2 or pymongo. For task 3, you should use PySpark (e.g., you may want to use the MLlib library shipped with PySpark).
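As an illustration of the kind of descriptive query task 2 asks for, here is a minimal sketch that counts commits and distinct authors per repository and month. It is not a prescribed solution. The table name `commits`, its columns (`repo`, `author`, `committed_at`), and the sample rows are hypothetical and do not come from the course data. To keep the sketch runnable without a database server, it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. With psycopg2, the placeholder style would be `%s` instead of `?`, and the month would more naturally be derived with PostgreSQL's `date_trunc`.

```python
import sqlite3

# Hypothetical sample rows: (repo, author, committed_at).
# The real field names are those of the course data, not these.
SAMPLE_COMMITS = [
    ("pytorch", "alice", "2021-07-03"),
    ("pytorch", "bob", "2021-07-15"),
    ("pytorch", "alice", "2021-08-01"),
    ("tensorflow", "carol", "2021-07-09"),
    ("tensorflow", "carol", "2021-08-20"),
]


def commits_per_month(rows):
    """Return (repo, month, n_commits, n_authors) tuples, ordered by repo and month."""
    # In-memory SQLite database, used here only as a stand-in for PostgreSQL.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE commits (repo TEXT, author TEXT, committed_at TEXT)"
        )
        conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", rows)
        cur = conn.execute(
            """
            SELECT repo,
                   substr(committed_at, 1, 7) AS month,
                   COUNT(*)                   AS n_commits,
                   COUNT(DISTINCT author)     AS n_authors
            FROM commits
            GROUP BY repo, substr(committed_at, 1, 7)
            ORDER BY repo, month
            """
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in commits_per_month(SAMPLE_COMMITS):
        print(row)
```

The same grouping logic carries over to MQL as an aggregation pipeline with a `$group` stage, if you choose MongoDB.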

Data

The fcp is based on commit and issue data for the PyTorch (see note 1) and TensorFlow projects hosted on GitHub.

The data can be retrieved at this link. In particular, you can find:

data       content         time frame               size
gitData    commit history  2021-07-01 - 2021-12-31  7.84 GB
gitIssues  issues history  2021-01-01 - 2021-12-31  150 MB

Data are stored in both csv and json formats.
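Since the same records are available in two formats, a first cleaning step (task 1) might load either format into plain dictionaries and normalise the values. The sketch below is only one possible approach, using the Python standard library. The field names (`issue_id`, `state`, `title`) and the sample contents are hypothetical, and it assumes the JSON files hold an array of objects, which may not match the actual layout of the course files.

```python
import csv
import io
import json

# Hypothetical samples; the real files and field names may differ.
CSV_SAMPLE = "issue_id,state,title\n1,open, Crash on import \n2,closed,\n"
JSON_SAMPLE = '[{"issue_id": 1, "state": "open", "title": " Crash on import "}]'


def read_csv_records(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects")
    return records


def normalise(record):
    """Strip whitespace from string values and map empty strings to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    for rec in read_csv_records(CSV_SAMPLE):
        print(normalise(rec))
    for rec in read_json_records(JSON_SAMPLE):
        print(normalise(rec))
```

Note that CSV values always arrive as strings (`"1"`), while JSON preserves numeric types (`1`), so type conversion is part of the cleaning work. For the larger commit files, passing an open file object to `csv.DictReader` and iterating row by row avoids loading everything into memory at once.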

If you are interested in expanding the data collected, you can consider the following libraries:

gitData

Here is a list of fields and a brief synopsis:

gitIssues

Here is a list of fields and a brief synopsis:

Deliverables

By July 22nd (4:00 PM, London time), groups have to upload:

References

Here you can find some academic articles dealing with open source software:

Here you can find some further readings:

An amazing documentary on the early stages of open source:


Notes

1: For PyTorch, the commit data also contain information on the child repositories: