Getting Started
Setting up PaReco
To set up and test the PaReco tool on your local computer, follow the steps below:
Directory Structure
```
.
├── LICENSE              # MIT license
├── README.md
├── docs                 # documentation files as markdowns
├── mkdocs.yml           # mkdocs configuration file
├── requirements.txt     # python development packages
├── tokens-example.txt   # GitHub API tokens file, renamed to `tokens.txt` in production
├── PaReco.py            # main application entrypoint
└── src
    ├── bin              # shell script to install the OS dependencies
    ├── constants        # application-wide global and constant variables
    ├── core             # files at the heart of the tool's classification
    ├── utils            # helper functions used in the main files
    ├── legacy           # backup of the original version of PaReco
    ├── notebooks        # experiments and analysis
    └── tests            # test cases
```
Get the code
The easiest way is to use the `git clone` command:
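For example (a sketch with a placeholder URL; substitute the actual PaReco repository URL):

```bash
# Clone the repository and enter its root directory (placeholder URL)
git clone https://github.com/<user-or-org>/PaReco.git
cd PaReco
```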
Dependencies
PaReco consists of two categories of dependencies: (i) OS-specific dependencies and (ii) development dependencies. The OS-specific dependency is `libmagic`. To install this dependency on Ubuntu/Debian or macOS, run the shell script in the `bin` directory.
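For example (the script name below is hypothetical; check `src/bin` for the actual file):

```bash
# Install libmagic via the provided script, run from the repository root
# (hypothetical script name -- use the one shipped in src/bin)
sh src/bin/install_dependencies.sh
```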
Now, let us install the development dependencies.
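Typically this is done with `pip` from the repository root:

```bash
# Install the Python development packages listed in requirements.txt
pip install -r requirements.txt
```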
Note: PaReco has been tested on Python >= 3.7.
Testing PaReco
GitHub Tokens
To access private repositories, and to get a higher rate limit, GitHub tokens are needed.
They can be set in the `tokens.txt` file or by directly inserting them in the `token_list` in the notebooks. GitHub tokens are a MUST to run the code, because of the high number of requests made to the GitHub API. The tokens in the `tokens.txt` file are separated only by commas. The user can add as many tokens as needed; a minimum of 5 tokens is recommended to safely execute the code and to make sure that the rate limit is not reached for any single token.
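For example, a `tokens.txt` with three tokens would contain a single comma-separated line (placeholder values shown, not real tokens):

```
ghp_tokenA,ghp_tokenB,ghp_tokenC
```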
Using Notebooks
There are 3 notebooks that are at the heart of this tool: `/legacy/src/getData.ipynb`, `/legacy/src/classify.ipynb` and `/legacy/src/analysis.ipynb` (see the launch sketch after this list):
* `getData.ipynb` extracts the pull request data from the GitHub API and stores it in `Repos_prs`.
* After which, `classify.ipynb` extracts, for each pull request, the `diffs` of each modified/added/removed file, and classifies the `hunks` and `files`.
* Then `analysis.ipynb` does the last classification for the `patch`, calculates the total classification per repository, and plots the results.
* Finally, `timeLag.ipynb` calculates the technical lag for each `patch`.
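These notebooks are meant to be run in the order listed above. A minimal launch sketch (assuming Jupyter is installed and that the paths above are relative to the repository root; adjust them to your layout):

```bash
# Execute the pipeline notebooks in order (paths assumed from the list above)
jupyter nbconvert --to notebook --execute legacy/src/getData.ipynb
jupyter nbconvert --to notebook --execute legacy/src/classify.ipynb
jupyter nbconvert --to notebook --execute legacy/src/analysis.ipynb
```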
Examples
The folder `/legacy/Examples` contains a Jupyter notebook that can be used to quickly run the tool and classify one or more pull requests for two variant repositories. A simple class `PaReco` exists that does the classification. Running the notebook for source and target variant `mrbrdo/rack-webconsole -> codegram/rack-webconsole` and pull request 2 will give an output as:
```
mrbrdo/rack-webconsole -> codegram/rack-webconsole
Pull request nr ==> 2
File classifications ==>
    lib/rack/webconsole/assets.rb
        Operation - MODIFIED
        Class - ED
Patch classification ==> ED
```
Additionally, the classification distribution will also be plotted.
To use this notebook for classification, you need to have:

* The `source` and `target` repository names,
* The list of pull requests,
* The cut-off date.
NB: Create the `Repos_files` and `Repos_results` directories before running the examples.
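For example, from the directory the example notebook runs in (assumed to be where the tool expects these folders):

```bash
# Create the directories the examples expect before running them
mkdir -p Repos_files Repos_results
```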