gitcollector collects and stores git repositories.
gitcollector is the source{d} tool for downloading and updating git repositories at large scale. To that end, it uses a custom repository storage file format called siva, which is optimized for saving storage space and keeping repositories up to date.
The project is in a preliminary stable stage and under active development.
A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.
Rooted repositories have a few particularities that you should know to work with them effectively:
- They have no HEAD reference.
- All references are of the following form: {REFERENCE_NAME}/{REMOTE_NAME}. For example, the reference refs/heads/master of the remote foo would be /refs/heads/master/foo.
- Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
- A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
- The root commit for a repository is obtained following the first parent of each commit from HEAD.
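The root commit rule above can be reproduced with plain git. The following is a minimal sketch, not gitcollector's own implementation: the function names root_commit and same_root are hypothetical helpers introduced here, and the sketch assumes the git command line client is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the root commit of the repository at path $1
# (default: current directory). rev-list --first-parent walks from HEAD
# following only the first parent of each commit; the last line printed
# is the commit that has no parent left to follow.
root_commit() {
    git -C "${1:-.}" rev-list --first-parent HEAD | tail -n 1
}

# Hypothetical helper: succeed when the repositories at $1 and $2 have the
# same root commit, which is the condition for them to share one rooted
# repository. The body runs in a subshell so a and b stay local.
same_root() (
    a=$(root_commit "$1")
    b=$(root_commit "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
)
```

For instance, calling same_root with the paths of a repository and one of its forks should succeed, since a fork normally keeps the initial commit of the repository it was forked from.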
gitcollector is used through the download subcommand (currently the only subcommand):
Usage:
  gitcollector [OPTIONS] download [download-OPTIONS]

Help Options:
  -h, --help                   Show this help message

[download command options]
      --library=               path where download to [$GITCOLLECTOR_LIBRARY]
      --bucket=                library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
      --tmp=                   directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
      --workers=               number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
      --half-cpu               set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
      --no-updates             don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
      --no-forks               github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
      --orgs=                  list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
      --excluded-repos=        list of repos to exclude separated by comma [$GITCOLLECTOR_EXCLUDED_REPOS]
      --token=                 github token [$GITHUB_TOKEN]
      --metrics-db=            uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
      --metrics-db-table=      table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
      --metrics-sync-timeout=  timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]

Log Options:
      --log-level=[info|debug|warning|error]  Logging level (default: info) [$LOG_LEVEL]
      --log-format=[text|json]                log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
      --log-fields=                           default fields for the logger, specified in json [$LOG_FIELDS]
      --log-force-format                      ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]

Usage example, --library and --orgs are always required:
gitcollector download --library=/path/to/repos/directory --orgs=src-d
To collect repositories from several github organizations:
gitcollector download --library=/path/to/repos/directory --orgs=src-d,bblfsh
Note that all the download command options are also configurable with environment variables.
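As an illustration of the environment variable configuration, the sketch below sets the two required options through the variables named in the help output instead of flags. The library path is a placeholder, and the check for the binary in PATH is an addition made here so the script fails gently when gitcollector is not installed.

```shell
#!/bin/sh
# Equivalent of --library and --orgs, using the variable names listed in
# the help output. The path is a placeholder to adapt to your setup.
export GITCOLLECTOR_LIBRARY=/path/to/repos/directory
export GITHUB_ORGANIZATIONS=src-d,bblfsh

# Optional settings, also taken from the help output:
# export GITHUB_TOKEN=...            # github token
# export GITCOLLECTOR_WORKERS=4      # number of workers

# Run the download only when the binary is available (guard added for
# this sketch, not part of gitcollector itself).
if command -v gitcollector >/dev/null 2>&1; then
    gitcollector download
else
    echo "gitcollector not found in PATH" >&2
fi
```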
A new gitcollector Docker image is uploaded to Docker Hub on each release. To use it:
docker run --rm --name gitcollector_1 \
    -e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
    -e "GITHUB_TOKEN=foo" \
    -v /path/to/repos/directory:/library \
    srcd/gitcollector:latest

Note that you must mount a local directory into the specific container path shown in -v /path/to/repos/directory:/library. This directory is where the repositories will be downloaded, stored as rooted repositories in the siva file format.
GPL v3.0, see LICENSE
