ghuser.io's database scripts
This repository provides scripts to update the database for the ghuser.io Reframe app. The database consists of JSON files. The production data is stored on AWS. The scripts expect the database at ~/data, which can be overridden by setting the GHUSER_DBDIR environment variable.
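As a minimal sketch of that lookup, the database directory could be resolved as follows in Node.js. The helper name `resolveDbDir` is hypothetical and is not taken from the scripts themselves.

```javascript
const os = require('os');
const path = require('path');

// Hypothetical helper: returns GHUSER_DBDIR if it is set and non-empty,
// otherwise falls back to ~/data.
function resolveDbDir(env = process.env) {
  return env.GHUSER_DBDIR || path.join(os.homedir(), 'data');
}

console.log(resolveDbDir());
```

Passing the environment as a parameter keeps the helper easy to test without touching the real `process.env`.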
The fetchBot calls these scripts. It runs every few days on an EC2 instance.
API keys can be created here.
$ npm install
Start tracking a user
$ ./addUser.js USER
Stop tracking a user
$ ./rmUser.js USER "you asked us to remove your profile in https://github.com/ghuser-io/ghuser.io/issues/666"
Refresh and clean data for all tracked users
$ export GITHUB_CLIENT_ID=0123456789abcdef0123
$ export GITHUB_CLIENT_SECRET=0123456789abcdef0123456789abcdef01234567
$ export GITHUB_USERNAME=AurelienLourot
$ export GITHUB_PASSWORD=********
$ ./fetchAndCalculateAll.sh
GitHub API key found.
GitHub credentials found.
...
/home/ubuntu/data/users
2654 users
largest: gdi2290.json (26 KB)
total: 5846 KB
/home/ubuntu/data/contribs
largest: orta.json (144 KB)
total: 14 MB
/home/ubuntu/data/repos
112924 repos
65706 significant repos
largest: jlord/patchwork.json (712 KB)
total: 203 MB
/home/ubuntu/data/repoCommits
largest: CocoaPods/Specs.json (3965 KB)
total: 397 MB
/home/ubuntu/data/orgs
11072 orgs
largest: google-certified-mobile-web-specialists.json (445 B)
total: 3520 KB
/home/ubuntu/data/nonOrgs.json: 252 KB
/home/ubuntu/data/meta.json: 49 B
total: 623 MB
=> 240 KB/user
real 449m19.774s
user 15m52.644s
sys 2m21.976s
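The per-user figure in the sample output can be checked from the totals it reports. The sketch below assumes that 1 MB is counted as 1024 KB, which is the convention that reproduces the printed value.

```javascript
// Figures copied from the sample output.
const totalMB = 623;
const users = 2654;

// Assumption: 1 MB = 1024 KB.
const kbPerUser = Math.round((totalMB * 1024) / users);

console.log(`=> ${kbPerUser} KB/user`); // => 240 KB/user
```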
Several scripts form a pipeline for updating the database. Here is the data flow:
[ ./addUser.js myUser ]   [ ./rmUser.js myUser ]
                 │             │
                 v             v
              ┌───────────────────┐
              │ users/myuser.json │<───────────┐
              └────────────────┬──┘ │─┐        │
                └──────────────│────┘ │        │                    ╔════════╗
                  └────┬───────│──────┘        │                    ║ GitHub ║
                       │       │               │                    ╚════╤═══╝
                       │       v               │                         │
                       │   [ ./fetchUserDetailsAndContribs.js myUser ]<──┤
                       │                                                 │
                       ├────────────>[ ./fetchOrgs.js ]<─────────────────┤
                       │               ^          ^                      │
                       │               │          │                      │
                       │               v          v                      │
                       │ ┌──────────────┐   ┌─────────────────┐          │
                       │ │ nonOrgs.json │   │ orgs/myOrg.json │─┐        │
                       │ └──────────────┘   └─────────────────┘ │─┐      │
                       │                      └─────────────────┘ │      │
                       │                        └──────────┬──────┘      │
                       │                                   │             │
                       ├──>[ ./fetchRepos.js ]<──────────────────────────┘
                       │            ^                      │
                       │            │                      │
                       │            v                      │
                       │ ┌───────────────────────────┐     │
                       │ │ repo*/myOwner/myRepo.json │─┐   │
                       │ └───────────────────────────┘ │─┐ │
                       │   └───────────────────────────┘ │ │
                       │     └────┬──────────────────────┘ │
                       │          │                        │
                       │          │        ┌───────────────┘
                       │          │        │
                       v          v        v
                 [ ./calculateContribsAndMeta.js ]
                            │                 │
                            v                 v
          ┌──────────────────────┐      ┌───────────┐
          │ contribs/myuser.json │─┐    │ meta.json │
          └──────────────────────┘ │─┐  └───────────┘
            └──────────────────────┘ │
              └──────────────────────┘
NOTES:
- These scripts also delete unreferenced data.
- Instead of calling each of these scripts directly, you can call ./fetchAndCalculateAll.sh, which will orchestrate them.
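The data flow implies an execution order: a script should run after every other script that writes a file it reads. The sketch below derives that order from a table of reads and writes transcribed from the diagram. It is an illustration only and does not reflect the actual contents of ./fetchAndCalculateAll.sh; the `scripts` table and the `pipelineOrder` function are hypothetical names.

```javascript
// Reads and writes per script, transcribed from the data flow diagram.
// 'repo*' is the diagram's own label for the repo-related directories.
const scripts = {
  'fetchUserDetailsAndContribs.js': { reads: ['users'], writes: ['users'] },
  'fetchOrgs.js': { reads: ['users', 'nonOrgs', 'orgs'], writes: ['nonOrgs', 'orgs'] },
  'fetchRepos.js': { reads: ['users', 'repo*'], writes: ['repo*'] },
  'calculateContribsAndMeta.js': {
    reads: ['users', 'orgs', 'repo*'],
    writes: ['contribs', 'meta'],
  },
};

// Orders scripts so that each one runs after all other scripts
// that write something it reads (a simple topological sort).
function pipelineOrder(table) {
  const names = Object.keys(table);
  const deps = new Map(names.map(name => [name, new Set()]));

  for (const reader of names) {
    for (const writer of names) {
      if (reader === writer) continue; // a script updating its own files is not a dependency
      const feeds = table[writer].writes.some(f => table[reader].reads.includes(f));
      if (feeds) deps.get(reader).add(writer);
    }
  }

  const order = [];
  const done = new Set();
  while (order.length < names.length) {
    const ready = names.filter(
      name => !done.has(name) && [...deps.get(name)].every(d => done.has(d))
    );
    if (ready.length === 0) throw new Error('cyclic dependency between scripts');
    for (const name of ready) {
      done.add(name);
      order.push(name);
    }
  }
  return order;
}

console.log(pipelineOrder(scripts).join(' -> '));
```

With the table above, fetchUserDetailsAndContribs.js comes first, fetchOrgs.js and fetchRepos.js follow in either order, and calculateContribsAndMeta.js runs last.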
The production JSON files are currently stored on S3 and exposed to the front end over HTTPS, e.g.
users/brillout.json
nonOrgs.json
orgs/reframejs.json
repos/reframejs/reframe.json
repoCommits/reframejs/reframe.json
contribs/brillout.json
meta.json
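The layout above maps GitHub names to relative file paths in a regular way. The helpers below are a hypothetical sketch of that mapping, not code from this repository. They assume that user logins are lowercased in file names, as "users/myuser.json" for "myUser" in the data flow suggests, and the base URL is a placeholder because the real one is not given here.

```javascript
// Hypothetical helpers mapping GitHub names to the relative paths listed above.
// Assumption: user logins are lowercased in file names.
const userPath = login => `users/${login.toLowerCase()}.json`;
const contribsPath = login => `contribs/${login.toLowerCase()}.json`;
const orgPath = org => `orgs/${org}.json`;
const repoPath = fullName => `repos/${fullName}.json`;
const repoCommitsPath = fullName => `repoCommits/${fullName}.json`;

// Placeholder base URL, for illustration only.
const baseUrl = 'https://example.com/db/';

console.log(new URL(userPath('brillout'), baseUrl).href);
console.log(new URL(repoPath('reframejs/reframe'), baseUrl).href);
```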
Every few days a backup named YYYY-MM-DD.tar.gz containing all the JSON files is created, e.g. 2018-10-07.tar.gz.
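For illustration, a backup file name in this format can be derived from a date as sketched below. The helper name `backupName` is hypothetical, and taking the date in UTC is an assumption; the source does not say which time zone the backup job uses.

```javascript
// Hypothetical helper: formats a date as YYYY-MM-DD.tar.gz.
// Assumption: the date is taken in UTC.
function backupName(date = new Date()) {
  return `${date.toISOString().slice(0, 10)}.tar.gz`;
}

console.log(backupName(new Date(Date.UTC(2018, 9, 7)))); // 2018-10-07.tar.gz
```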
Thanks go to these wonderful people (emoji key):
- Aurelien Lourot 💬 💻 📖 👀
- Charles 💻 📖 🤔
- Romuald Brillout 🤔
This project follows the all-contributors specification. Contributions of any kind welcome!