Methodology

Benchmark and security harness design

The Benchmark and Security pages report results. This page documents how those numbers are produced — the test corpus, the harness, the metrics, and the threat model — so the figures can be read with the right priors.

§1. Design principles

The harness is designed around three principles. Fairness: every tool receives the same target served by the same web-server instance, with the same starting state. Completeness: evaluation covers all recoverable Git artifacts (source code, reflogs, stashes, commits, branches, remotes, tags), not just whether a clone succeeds. Reproducibility: the test repository is generated from a fixed random seed, and every component — tool images, server containers, scoring code — is version-controlled.

§2. Test repository

A ground-truth Git repository is generated programmatically (benchmark/generate.py, seed = 0). It contains 2–16 random commits, 2–8 random branches, 15 well-known branches (master, main, dev, release, …), up to 16 semantic-version tags, 2–16 stashes, files in both the staging area and working directory, and a PHP LFI script. A manifest categorises every file by feature, which is what makes the per-feature accuracy column on the benchmark page meaningful.
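To illustrate why a fixed seed makes the corpus reproducible, here is a minimal sketch of seeded plan generation. It is not the contents of benchmark/generate.py: the function name `plan_repository`, the dictionary keys, and the reading of "up to 16" tags as 0 to 16 are all assumptions made for the example. Only the numeric ranges come from the description above.

```python
import random


def plan_repository(seed: int = 0) -> dict:
    """Draw the random sizes of the test repository from a seeded RNG.

    Hypothetical helper: names and structure are illustrative only.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return {
        "commits": rng.randint(2, 16),   # randint bounds are inclusive
        "branches": rng.randint(2, 8),
        "stashes": rng.randint(2, 16),
        "tags": rng.randint(0, 16),      # "up to 16" read as 0..16 (assumption)
    }


if __name__ == "__main__":
    # The same seed always yields the same plan.
    print(plan_repository(0) == plan_repository(0))
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the draw sequence independent of any other code that touches the global generator.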

§3. Web-server scenarios

The repository is served over HTTP under five configurations, each a Docker Compose service (test/docker/):

  1. Apache httpd with Options +Indexes (directory listing on)
  2. Apache httpd with Options -Indexes (directory listing off)
  3. Nginx with autoindex on
  4. Nginx with autoindex off (default)
  5. PHP 7.2 + Apache with a Local File Inclusion entry point

§4. Tool execution

Each tool runs in its own Docker container (benchmark/tools/) behind a unified interface (run.sh <url> <output-dir>). A 300-second timeout is enforced. The server is restarted before each run so the access-log line count (one of the metrics) is directly attributable to the tool under test.
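The timeout and the log-based request count described above could be sketched as follows. This is an assumption-laden illustration, not the harness itself: `run_tool` and `count_requests` are hypothetical names, the command is passed in by the caller rather than built from run.sh, and one access-log line is assumed to correspond to one request.

```python
import subprocess
from pathlib import Path

TIMEOUT_SECONDS = 300  # the limit stated in the text


def run_tool(cmd: list[str], timeout: float = TIMEOUT_SECONDS) -> dict:
    """Run one tool invocation and report its exit status or a timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return {"timed_out": False, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return {"timed_out": True, "returncode": None}


def count_requests(access_log: Path) -> int:
    """Count lines in an access log (assumed: one line per request)."""
    with access_log.open("r", encoding="utf-8", errors="replace") as fh:
        return sum(1 for _ in fh)
```

Because the server is restarted before each run, the log starts empty, so a plain line count is enough and no before/after subtraction is needed.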

§5. Scoring metrics

Recovered files are compared against the manifest via MD5 checksums (benchmark/compare.py). Four metrics are reported.
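As an illustration of the checksum comparison only (not of the metric definitions), the sketch below computes a per-feature accuracy from a manifest. The manifest layout, mapping a relative path to a dictionary with `md5` and `feature` keys, is an assumption, as are the function names. The real benchmark/compare.py may be organised differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def per_feature_accuracy(manifest: dict, recovered_dir: Path) -> dict:
    """Fraction of manifest files recovered byte-for-byte, grouped by feature.

    Assumed manifest shape: {"rel/path": {"md5": "...", "feature": "..."}}
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rel_path, entry in manifest.items():
        totals[entry["feature"]] += 1
        candidate = recovered_dir / rel_path
        if candidate.is_file() and md5_of(candidate) == entry["md5"]:
            hits[entry["feature"]] += 1
    return {feature: hits[feature] / totals[feature] for feature in totals}
```

MD5 is used here purely to detect differing content against a trusted manifest, not as a security control.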

§6. Threat model (security suite)

The security suite inverts the benchmark assumption: the pillager, that is the tool under test, is the victim. Each test serves a malicious .git/ directory crafted to attack the tool during recovery: code execution via core.fsmonitor, arbitrary file write via path traversal, SSRF via redirect, and known Git CVEs delivered as submodules or LFS objects. A test is a FAIL if the malicious server attains code execution, arbitrary file write, or SSRF against the tool's container during recovery, and a PASS if the tool refused or contained the payload. Findings against unpatched tools are anonymized on the Security page as Tool A…F until coordinated disclosure completes.
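To make the path-traversal case concrete, the sketch below shows one way a recovery tool might contain a server-supplied path before writing to disk. It is a generic containment check under stated assumptions, not the logic of any tool in the suite or of the security harness. The function name `resolve_inside` is hypothetical.

```python
import os


def resolve_inside(output_dir: str, untrusted_rel: str) -> str:
    """Return the absolute write target, or raise if it escapes output_dir.

    Hypothetical helper: untrusted_rel stands for any path taken from
    server-controlled data, such as an index entry or a ref name.
    """
    root = os.path.realpath(output_dir)
    # realpath collapses ".." segments and follows existing symlinks.
    target = os.path.realpath(os.path.join(root, untrusted_rel))
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"path escapes output directory: {untrusted_rel!r}")
    return target
```

An absolute `untrusted_rel` makes `os.path.join` discard `root`, so the resolved target falls outside the output directory and is rejected by the same comparison.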

§7. Environment

Results on this site are produced by the harness running on Linux + Docker. Each run records its harness commit, generation timestamp, and random seed into the JSON output, so any plot you see can be traced to an exact reproducible configuration. The benchmark also runs weekly via GitHub Actions.
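The provenance fields mentioned above could be recorded along these lines. The key names (`harness_commit`, `generated_at`, `seed`) and the function `run_metadata` are assumptions for illustration. The actual JSON schema is not specified on this page.

```python
import json
from datetime import datetime, timezone


def run_metadata(harness_commit: str, seed: int) -> dict:
    """Build the provenance record attached to one run's results.

    Hypothetical field names; the commit is supplied by the caller.
    """
    return {
        "harness_commit": harness_commit,
        "generated_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
        "seed": seed,
    }


if __name__ == "__main__":
    # "<commit-sha>" is a placeholder, not a real commit identifier.
    print(json.dumps(run_metadata("<commit-sha>", 0), indent=2))
```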

Reproduction instructions: /reproduce.