Final course project – 2022

For the final course project (fcp), you will work with a real-world dataset. In particular, you are required to store, manipulate, and analyze GitHub data for PyTorch and TensorFlow. Both projects are examples of open source software (OSS) development.

OSS challenges common managerial assumptions about how organizations are structured and how they function (Gulati et al., 2012), and it has attracted researchers from several disciplines (e.g., information systems, management, and sociology). The OSS phenomenon is also extremely relevant from a business perspective. For example, think of the Python project or the father of them all, Linux (powering NASA projects, Chrome OS, Android hardware, and the largest share of servers worldwide). The OSS experience keeps offering fresh business and research insights and may guide us in the next phase of organizing based on remote work.

Tasks

You are required to choose your preferred DBMS – PostgreSQL or MongoDB – and:

1. Clean, manipulate, and structure data. The expected result is a well-designed dataset that complies with the specific approach of the chosen DBMS.
2. Provide valuable descriptive insights. The expected result is a set of descriptive statistics that depicts some interesting trends or noteworthy data characteristics.
3. [optional] Perform an insightful data analysis. For example, you can analyse the modularity of the source code or the features of issues getting the community’s attention. You may want to skim through the reference list provided to get some inspiration.

To perform tasks 1 and 2, you need to use either SQL or MQL (MongoDB Query Language). Alternatively, if you prefer Python, you can use psycopg2 or pymongo. For task 3, you should use PySpark (e.g., you may want to use its MLlib library).
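As a rough illustration of how tasks 1 and 2 might look from Python, the sketch below loads a few commit rows into a table and computes per-project statistics with SQL. It uses the standard-library sqlite3 module purely as a stand-in so that it runs without a database server; psycopg2 follows the same DB-API pattern (connect, obtain a cursor, execute), but its placeholders are %s rather than ?. The table name commits, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (hash, project_name, insertions, deletions).
SAMPLE_COMMITS = [
    ("a1", "pytorch", 10, 2),
    ("b2", "pytorch", 5, 5),
    ("c3", "tensorflow", 7, 0),
]


def commits_per_project(rows):
    """Load rows into an in-memory table and aggregate them per project."""
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
    try:
        conn.execute(
            "CREATE TABLE commits ("
            "hash TEXT PRIMARY KEY, project_name TEXT, "
            "insertions INTEGER, deletions INTEGER)"
        )
        # Parameterised insert; with psycopg2 the placeholders would be %s.
        conn.executemany("INSERT INTO commits VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT project_name, COUNT(*), SUM(insertions + deletions) "
            "FROM commits GROUP BY project_name ORDER BY project_name"
        )
        return cursor.fetchall()
    finally:
        conn.close()


stats = commits_per_project(SAMPLE_COMMITS)
print(stats)
```

Each returned tuple holds the project name, the number of commits, and the total number of changed lines.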

Data

The fcp is based on commit and issue data for the PyTorch¹ and TensorFlow projects hosted on GitHub.

Commits tell the history of a repository and how it came to be the way it currently is.
Issues help to track ideas, feedback, tasks, or bugs for work on GitHub.

The data can be retrieved at this link. In particular, you can find:

data	content	time frame	size
gitData	commit history	2021-07-01 - 2021-12-31	7.84 GB
gitIssues	issues history	2021-01-01 - 2021-12-31	150 MB

Data are stored in both CSV and JSON formats.
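Because gitData is several gigabytes in size, reading it in one go may not fit in memory. The sketch below shows one possible way to stream the CSV version in batches using only the standard library. It assumes the file has a header row whose column names match the field names; the batch size and the raised field size limit are arbitrary choices, and iter_batches is a hypothetical helper name.

```python
import csv
import io
from itertools import islice

# Fields such as source_code or diff can exceed the csv module's default
# per-field limit (131072 characters), so raise it. The value is an assumption.
csv.field_size_limit(2**31 - 1)


def iter_batches(handle, batch_size=1000):
    """Yield lists of row dictionaries, at most batch_size rows each."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory stand-in for the real file, which would instead be opened
# with open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "hash,project_name\na1,pytorch\nb2,pytorch\nc3,tensorflow\n"
)
batches = list(iter_batches(sample, batch_size=2))
print([len(batch) for batch in batches])
```

Each batch can then be cleaned and written to the chosen DBMS before the next one is read, which keeps memory use bounded.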

If you are interested in expanding the data collected, you can consider the following libraries:

Commits: PyDriller
Issues: PyGithub

gitData

Here is a list of fields and a brief synopsis:

hash: hash of the commit
msg: commit message
author_name (Developer): commit author name
committer_name (Developer): commit committer name
author_date: authored date
author_timezone: author timezone (expressed as an offset from UTC, in seconds)
committer_date: commit date
committer_timezone: commit timezone (expressed as an offset from UTC, in seconds)
branches: List of branches that contain this commit
in_main_branch: True if the commit is in the main branch
merge: True if the commit is a merge commit
parents: list of the commit parents
project_name: project name
deletions: number of deleted lines in the commit (as shown from --shortstat).
insertions: number of added lines in the commit (as shown from --shortstat).
lines: total number of added + deleted lines in the commit (as shown from --shortstat).
files: number of files changed in the commit (as shown from --shortstat).
old_path: old path of the file (can be None if the file is added)
new_path: new path of the file (can be None if the file is deleted)
filename: the filename only (e.g., given a path-like string such as “/Users/dspadini/pydriller/myfile.py”, the value is “myfile.py”)
change_type: type of the change: can be Added, Deleted, Modified, or Renamed.
diff: diff of the file as Git presents it (e.g., starting with @@ xx,xx @@).
diff_parsed: diff parsed into a dictionary containing the added and deleted lines. The dictionary has 2 keys, “added” and “deleted”, each containing a list of (int, str) tuples corresponding to (line number in the file, actual line).
added_lines: number of lines added
deleted_lines: number of lines removed
source_code: source code of the file (can be None if the file is deleted or only renamed)
source_code_before: source code of the file before the change (can be None if the file is added or only renamed)
nloc: Lines Of Code (LOC) of the file
complexity: Cyclomatic Complexity of the file
token_count: Number of Tokens of the file
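The field list mixes commit-level attributes (hash through files) with file-level attributes (old_path through token_count). Assuming each record in the data describes one modified file, with the commit-level fields repeated on every record of the same commit (an assumption about the export, not something stated here), a relational design could split them into two tables. A minimal sketch of that split follows; split_row, normalise, and the commit_hash key are hypothetical names.

```python
COMMIT_FIELDS = [
    "hash", "msg", "author_name", "committer_name", "author_date",
    "author_timezone", "committer_date", "committer_timezone", "branches",
    "in_main_branch", "merge", "parents", "project_name",
    "deletions", "insertions", "lines", "files",
]
FILE_FIELDS = [
    "old_path", "new_path", "filename", "change_type", "diff", "diff_parsed",
    "added_lines", "deleted_lines", "source_code", "source_code_before",
    "nloc", "complexity", "token_count",
]


def split_row(row):
    """Split one flat record into (commit_record, file_record)."""
    commit = {name: row.get(name) for name in COMMIT_FIELDS}
    # The file record keeps the commit hash so it can act as a foreign key.
    file_change = {"commit_hash": row.get("hash")}
    file_change.update({name: row.get(name) for name in FILE_FIELDS})
    return commit, file_change


def normalise(rows):
    """Return unique commits keyed by hash, plus one entry per file change."""
    commits, file_changes = {}, []
    for row in rows:
        commit, file_change = split_row(row)
        commits.setdefault(commit["hash"], commit)  # keep first occurrence
        file_changes.append(file_change)
    return commits, file_changes


# Hypothetical records: one commit touching two files.
rows = [
    {"hash": "a1", "project_name": "pytorch", "filename": "x.py", "nloc": "10"},
    {"hash": "a1", "project_name": "pytorch", "filename": "y.py", "nloc": "3"},
]
commits, file_changes = normalise(rows)
print(len(commits), len(file_changes))
```

Fields missing from a record are stored as None. A document-oriented design could instead embed the file changes inside each commit document.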

gitIssues

Here is a list of fields and a brief synopsis:

title: issue title
state: state of the issue, either open or closed (all issues collected are closed)
body: contents of the issue
user: user name
user_id: unique identifier for the user
created_at: when the issue was created
updated_at: when the issue was last updated
closed_at: when the issue was closed
assignees: users that this issue is assigned to
labels: labels associated with this issue
reactions: can be one of +1, -1, laugh, confused, heart, hooray, rocket, eyes
n_comments: number of comments for the issue
closed_by: user that closed the issue
comment_id: unique identifier for the comment
comment_created_at: when the comment was created
comment_updated_at: when the comment was last updated
comment_user_id: id of the commenting user
comment_user: name of the commenting user
comment_text: contents of the comment
project: project name
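The comment_* fields suggest that each record describes one comment, with the issue-level fields repeated across the comments of the same issue; this is an assumption about the export. Under that assumption, a document-oriented design might nest the comments inside a single issue document. Since the field list contains no issue number, the sketch below groups records by a surrogate key (project, title, created_at), which is also an assumption and may not be unique in the real data. The names issue_key and to_documents are hypothetical.

```python
ISSUE_FIELDS = [
    "title", "state", "body", "user", "user_id", "created_at", "updated_at",
    "closed_at", "assignees", "labels", "reactions", "n_comments",
    "closed_by", "project",
]
# Maps flat comment fields to shorter names inside the nested document.
COMMENT_FIELDS = {
    "comment_id": "id",
    "comment_created_at": "created_at",
    "comment_updated_at": "updated_at",
    "comment_user_id": "user_id",
    "comment_user": "user",
    "comment_text": "text",
}


def issue_key(row):
    """Assumed surrogate key: the field list has no issue number."""
    return (row.get("project"), row.get("title"), row.get("created_at"))


def to_documents(rows):
    """Group flat comment-level records into one document per issue."""
    documents = {}
    for row in rows:
        key = issue_key(row)
        if key not in documents:
            documents[key] = {name: row.get(name) for name in ISSUE_FIELDS}
            documents[key]["comments"] = []
        # Issues without comments are assumed to carry an empty comment_id.
        if row.get("comment_id"):
            documents[key]["comments"].append(
                {new: row.get(old) for old, new in COMMENT_FIELDS.items()}
            )
    return list(documents.values())


# Hypothetical records: one issue with two comments, one with none.
rows = [
    {"project": "pytorch", "title": "Bug A", "created_at": "2021-01-02",
     "comment_id": "1", "comment_text": "Can reproduce."},
    {"project": "pytorch", "title": "Bug A", "created_at": "2021-01-02",
     "comment_id": "2", "comment_text": "Fixed."},
    {"project": "tensorflow", "title": "Bug B", "created_at": "2021-03-04",
     "comment_id": ""},
]
docs = to_documents(rows)
print([(doc["title"], len(doc["comments"])) for doc in docs])
```

The resulting list of dictionaries has the shape that pymongo's insert_many accepts. A relational design would instead keep issues and comments in separate tables.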

Deliverables

By July 22nd (4:00 PM, London time), groups have to upload:

SQL, JS, or Python scripts;
Supporting documentation (accepted format: .md, .docx, or .pdf) containing:
- a detailed justification of your design choices;
- a clear and concise description of the insights drawn from the descriptive statistics obtained;
- [optional] a clear and concise description of further insights and results obtained by analyzing the data through PySpark.

References

Here you can find some academic articles dealing with open source software:

Here you can find some further readings:

An amazing documentary on the early stages of open source:

Notes

1: For PyTorch, the commit data also contain information on the following child repositories:

audio
vision
xla
serve
torchrec
text