Methodology
Benchmark and security harness design
The Benchmark and Security pages report results. This page documents how those numbers are produced — the test corpus, the harness, the metrics, and the threat model — so the figures can be read with the right priors.
§1. Design principles
The harness is designed around three principles. Fairness: every tool receives the same target served by the same web-server instance, with the same starting state. Completeness: evaluation covers all recoverable Git artifacts (source code, reflogs, stashes, commits, branches, remotes, tags), not just whether a clone succeeds. Reproducibility: the test repository is generated from a fixed random seed, and every component — tool images, server containers, scoring code — is version-controlled.
§2. Test repository
A ground-truth Git repository is generated programmatically
(benchmark/generate.py,
seed = 0). It contains 2–16 random commits, 2–8 random
branches, 15 well-known branches (master, main,
dev, release, …), up to 16 semantic-version
tags, 2–16 stashes, files in both the staging area and working
directory, and a PHP LFI script. A manifest categorises
every file by feature, which is what makes the per-feature
accuracy column on the benchmark page meaningful.
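As a rough illustration of why a fixed seed makes the corpus reproducible, the sketch below draws the randomized counts from a private generator. It is not the contents of benchmark/generate.py: the function name `draw_repo_shape`, the dictionary keys, and the lower bound for tags are assumptions made for the example.

```python
import random

def draw_repo_shape(seed: int = 0) -> dict:
    """Draw the randomized counts for a test repository from a fixed seed."""
    # A private Random instance avoids any dependence on global RNG state.
    rng = random.Random(seed)
    return {
        "commits": rng.randint(2, 16),   # randint bounds are inclusive
        "branches": rng.randint(2, 8),
        "stashes": rng.randint(2, 16),
        "tags": rng.randint(0, 16),      # "up to 16"; lower bound is assumed
    }

shape = draw_repo_shape(0)
same_shape = draw_repo_shape(0)  # same seed, so the same counts
```

Because every draw comes from the seeded generator, two runs with the same seed produce the same repository shape, which is the property the per-feature manifest depends on.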
§3. Web-server scenarios
The repository is served over HTTP under five configurations,
each a Docker Compose service
(test/docker/):
- Apache httpd with DirectoryIndex enabled (directory listing on)
- Apache httpd with Options -Indexes (directory listing off)
- Nginx with autoindex on
- Nginx with autoindex off (default)
- PHP 7.2 + Apache with a Local File Inclusion entry point
§4. Tool execution
Each tool runs in its own Docker container
(benchmark/tools/)
behind a unified interface (run.sh <url> <output-dir>).
A 300-second timeout is enforced. The server is restarted before
each run so the access-log line count (one of the metrics) is
directly attributable to the tool under test.
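One way to enforce a per-run timeout is sketched below with Python's standard `subprocess` module. This is an assumption about how such a wrapper could look, not the harness's actual code: the helper name `run_tool` and the returned fields are hypothetical, and the measured time is plain process wall-clock, a simplification of the Duration metric.

```python
import subprocess
import sys
import time

def run_tool(cmd: list[str], timeout_s: float = 300.0) -> dict:
    """Run one tool invocation; report its status and wall-clock seconds."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
        status = "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        status = "timeout"
    return {"status": status, "seconds": time.monotonic() - start}

# Stand-in commands; a real call would pass the tool's run.sh, URL and output dir.
quick = run_tool([sys.executable, "-c", "pass"], timeout_s=30)
slow = run_tool([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```

When the command launches a Docker container, killing the client process may not stop the container itself, so a real wrapper would likely also need to stop or remove the container after a timeout.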
§5. Scoring metrics
Recovered files are compared against the manifest via MD5
checksums (benchmark/compare.py).
The four reported metrics are:
- Recovery Rate = correct files / total files (per feature, and aggregate)
- Feature Support = binary, per feature, per (tool, scenario)
- Duration = wall-clock seconds from container start to first idle
- HTTP Requests = server log line count attributable to the tool
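The Recovery Rate definition above can be sketched as follows. The manifest layout (relative path mapped to a feature name and an expected MD5 digest) and the function names are assumptions for illustration; the real format used by benchmark/compare.py may differ.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def md5_of(path: Path) -> str:
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def recovery_rates(manifest: dict, recovered_dir: Path):
    """manifest maps relative path -> (feature, expected MD5 hex digest)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rel_path, (feature, expected) in manifest.items():
        total[feature] += 1
        candidate = recovered_dir / rel_path
        # A file counts only if it exists and its digest matches exactly.
        if candidate.is_file() and md5_of(candidate) == expected:
            correct[feature] += 1
    per_feature = {f: correct[f] / total[f] for f in total}
    n = sum(total.values())
    aggregate = sum(correct.values()) / n if n else 0.0
    return per_feature, aggregate

# Tiny demonstration with made-up files and feature names.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "a.txt").write_bytes(b"alpha")
    (out / "b.txt").write_bytes(b"tampered")  # present but wrong content
    manifest = {
        "a.txt": ("commits", hashlib.md5(b"alpha").hexdigest()),
        "b.txt": ("stashes", hashlib.md5(b"beta").hexdigest()),
        "c.txt": ("stashes", hashlib.md5(b"gamma").hexdigest()),  # never recovered
    }
    per_feature, aggregate = recovery_rates(manifest, out)
```

In this demonstration a corrupted file and a missing file both count as not recovered, so a tool gets no credit for partially correct output.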
§6. Threat model (security suite)
The security suite inverts the benchmark assumption: the
pillager (the tool under test) is the victim. Each test is a malicious .git/ directory
crafted to attack the tool during recovery: code execution
via core.fsmonitor, arbitrary file write via path
traversal, SSRF via redirect, and known Git CVEs delivered as
submodules or LFS objects. A test is a FAIL
if the malicious server attains code execution, arbitrary file
write, or SSRF against the tool's container during recovery; a
PASS if the tool refused or contained the
payload. Findings against unpatched tools are anonymized in
Security as
Tool A…F until coordinated disclosure completes.
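The PASS/FAIL rule reduces to a simple predicate, sketched here. The `Observation` type and its field names are hypothetical; how the suite actually detects each outcome is not described on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """What the malicious server achieved against the tool's container."""
    code_execution: bool = False
    arbitrary_file_write: bool = False
    ssrf: bool = False

def verdict(obs: Observation) -> str:
    """FAIL if any attack outcome was observed, otherwise PASS."""
    compromised = obs.code_execution or obs.arbitrary_file_write or obs.ssrf
    return "FAIL" if compromised else "PASS"

clean = verdict(Observation())
traversal = verdict(Observation(arbitrary_file_write=True))
```

A single observed outcome is enough for a FAIL; a tool passes only when none of the three is attained.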
§7. Environment
Results on this site are produced by the harness running on Linux + Docker. Each run records its harness commit, generation timestamp, and random seed into the JSON output, so any plot you see can be traced to an exact reproducible configuration. The benchmark also runs weekly via GitHub Actions.
Reproduction instructions: /reproduce.