
How Deploy Risk is Scored

The 13 research-validated signals Koalr uses to predict incident probability before each deployment.

Every deployment gets a risk score from 0–100. Scores ≥ 70 are HIGH risk. Scores 40–69 are MEDIUM. Below 40 is LOW.

The score is a weighted sum of 13 factors derived from published software engineering research (Hassan 2009, Bird 2011, Nagappan & Zeller 2010, McIntosh 2015, and others). Factor weights are per-organization and auto-tune over time as Koalr learns which signals actually predict incidents in your codebase.
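A minimal sketch of the weighted sum and banding, using the default weights from the tier table below. The dictionary keys are hypothetical identifiers, not Koalr's internal factor names:

```python
# Default per-factor weights from the tier table on this page.
# Each factor produces a 0-100 sub-score; weights sum to 1.0.
DEFAULT_WEIGHTS = {
    "change_entropy": 0.11, "author_file_expertise": 0.10,
    "minor_contributors": 0.09, "change_burst": 0.08,
    "review_coverage": 0.08, "historical_failures": 0.08,
    "file_count": 0.07, "patch_coverage": 0.07, "slo_burn_rate": 0.07,
    "change_size": 0.06, "semantic_risk": 0.06,
    "schema_migration": 0.08, "timing": 0.05,
}

def composite_score(factor_scores: dict[str, float]) -> float:
    """Weighted sum of per-factor scores, each on a 0-100 scale."""
    return sum(DEFAULT_WEIGHTS[name] * score
               for name, score in factor_scores.items())

def risk_band(score: float) -> str:
    """Map the 0-100 composite score to the documented bands."""
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    return "LOW"
```

Because the weights sum to 1.0, a deployment scoring 50 on every factor gets a composite of exactly 50 (MEDIUM).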

Factor Tiers

Factors are organized into three tiers by predictive strength:

| Tier | Factors | Default weight each |
| --- | --- | --- |
| Tier 1 — Highest signal | Change Entropy, Author File Expertise, Minor Contributors, Change Burst | 8–11% |
| Tier 2 — Moderate signal | Review Coverage, Historical Failures, File Count, Patch Coverage, SLO Burn Rate | 7–8% |
| Tier 3 — Supporting signal | Change Size, Semantic Risk, Schema Migration, Timing | 5–8% |

Tier 1 Factors

Change Entropy (11%)

Hassan 2009 — consistently the top predictor in replications across Vista/Win7, Linux, Eclipse.

Entropy measures how spread out changes are across the codebase's historical change patterns. A PR that modifies files that have never been changed together scores high entropy — these cross-cutting changes are harder to reason about and more likely to introduce unexpected interactions.

Computed by Koalr from your repository's git history. Higher entropy → higher score.
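The entropy term itself is the normalized Shannon entropy from Hassan 2009. How Koalr combines it with historical co-change patterns is not specified here, so this sketch covers only the entropy of a single change's distribution across files:

```python
import math

def normalized_entropy(lines_per_file: dict[str, int]) -> float:
    """Shannon entropy of how a change's lines spread across files,
    normalized to [0, 1] (Hassan 2009 style).  1.0 = perfectly even
    spread across all touched files, 0.0 = concentrated in one file."""
    total = sum(lines_per_file.values())
    n = len(lines_per_file)
    if total == 0 or n < 2:
        return 0.0
    probs = [c / total for c in lines_per_file.values() if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(n)  # divide by max possible entropy
```

A change touching four files evenly scores 1.0; one concentrated in a single file scores 0.0.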

Author File Expertise (10%)

Bird 2011, Rahman 2011 — file-level expertise is more predictive than repo-level experience.

Measures how familiar the PR author is with each specific file being changed, based on historical commit activity. An experienced engineer changing files they've never touched before scores higher risk than a junior engineer changing files they own.

Scores from 0 (deep expertise in all changed files) to 100 (no prior history with any changed file).
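One plausible formulation, assuming expertise is the author's share of each changed file's historical commits, averaged across the PR (Koalr's exact aggregation is not documented here):

```python
def expertise_score(author_commits: dict[str, int],
                    total_commits: dict[str, int],
                    changed_files: list[str]) -> float:
    """0 = author wrote every historical commit to every changed file,
    100 = author has no prior history with any changed file."""
    if not changed_files:
        return 0.0
    shares = []
    for f in changed_files:
        total = total_commits.get(f, 0)
        shares.append(author_commits.get(f, 0) / total if total else 0.0)
    avg_share = sum(shares) / len(shares)
    return (1.0 - avg_share) * 100.0
```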

Minor Contributors (9%)

Bird 2011 — single highest-correlating metric in the Vista/Windows 7 bug study.

A file with many authors, most of whom contributed only a small fraction of the changes, is significantly more defect-prone. This captures "ownership diffusion" — when no single person deeply understands a file's full evolution.

| Minor contributor density | Score |
| --- | --- |
| ≤ 10% of authors are minor | 10 |
| 11–30% | 35 |
| 31–50% | 60 |
| > 50% | 85 |
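The density and banding can be sketched as follows. The 5% "minor contributor" cutoff follows Bird 2011; whether Koalr uses the same threshold is an assumption:

```python
def minor_contributor_density(commits_by_author: dict[str, int],
                              minor_threshold: float = 0.05) -> float:
    """Fraction of a file's authors whose share of its commits falls
    below the threshold (5% per Bird 2011; assumed here)."""
    total = sum(commits_by_author.values())
    if total == 0:
        return 0.0
    minor = sum(1 for c in commits_by_author.values()
                if c / total < minor_threshold)
    return minor / len(commits_by_author)

def minor_contributor_score(density: float) -> int:
    """Banding from the table above."""
    if density <= 0.10:
        return 10
    if density <= 0.30:
        return 35
    if density <= 0.50:
        return 60
    return 85
```

A file with one dominant owner and three drive-by contributors (96/2/1/1 commits) has a density of 0.75 and scores 85.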

Change Burst (8%)

Nagappan & Zeller 2010 — 91.1% precision, 92.0% recall at predicting post-release defects.

A "change burst" occurs when the same files are modified repeatedly in a short window (≤ 14 days). Files under active churn are in an unstable state where any individual change is riskier — tests may not have caught up, the codebase may be inconsistent, and reviewers have context fatigue.


Tier 2 Factors

Review Coverage (8%)

McIntosh 2015 — files without review have up to 5× more post-release defects.

| Review state | Score |
| --- | --- |
| Approved (≥ 1 approver) | 5 |
| Reviewed but not approved | 40 |
| No reviews | 90 |

CI status is also factored in — a failed CI check adds 20 points to this score.
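Combining the table and the CI penalty (capping at 100 is an assumption, since the factor scale is 0–100):

```python
def review_coverage_score(approvals: int, reviews: int,
                          ci_failed: bool) -> int:
    """Banding from the table above, plus the flat 20-point CI penalty."""
    if approvals >= 1:
        base = 5
    elif reviews >= 1:
        base = 40
    else:
        base = 90
    if ci_failed:
        base += 20
    return min(base, 100)  # assumed cap at the factor's 0-100 scale
```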

Historical Failures (8%)

The repository's change failure rate (CFR) over the past 90 days — the percentage of deployments that resulted in an incident within 4 hours.

Score = min(CFR × 100, 100).

A repo with a 45% CFR gets a historical failure score of 45. This grounds the risk model in your codebase's actual track record.

File Count (7%)

Mockus & Weiss 2000 — "diffusion breadth" of a change predicts defects.

| Files changed | Score |
| --- | --- |
| ≤ 3 | 5 |
| 4–10 | 25 |
| 11–20 | 50 |
| 21–40 | 75 |
| > 40 | 95 |

Patch Coverage (7%)

Inozemtseva 2014 — coverage of the changed lines specifically is more predictive than overall repo coverage.

Koalr uses patch coverage — the percentage of the lines actually added or changed in this PR that are covered by tests — rather than overall repository coverage. A PR that adds untested code to a well-covered codebase is riskier than overall coverage suggests.

| Patch coverage | Score |
| --- | --- |
| ≥ 80% | 5 |
| 60–79% | 25 |
| 40–59% | 50 |
| 20–39% | 75 |
| < 20% | 95 |
| Unknown | 50 |

Requires Codecov or SonarCloud integration. Falls back to overall repo coverage if patch coverage is unavailable.

SLO Burn Rate (7%)

Google SRE Workbook — deploying into a service already consuming its error budget accelerates failure.

If the target service's SLO error budget is burning faster than normal, a new deployment carries higher risk. Koalr reads burn rate from your observability integration.

| SLO burn rate multiplier | Score |
| --- | --- |
| < 1.5× | 10 |
| 1.5–3× | 45 |
| 3–10× | 70 |
| > 10× | 95 |

Tier 3 Factors

Change Size (6%)

Total lines added + deleted. Weighted lower than File Count because raw line count is less predictive than the entropy/diffusion signals above.

| Lines changed | Score |
| --- | --- |
| ≤ 50 | 5 |
| 51–200 | 20 |
| 201–500 | 45 |
| 501–1000 | 70 |
| > 1000 | 95 |

Semantic Risk (6%)

Koalr's LLM (Claude) analyzes the PR diff for semantic patterns associated with deployment risk: authentication changes, cryptographic updates, dependency version bumps, data migration logic, and API contract changes. Returns a 0–100 risk score based on change type classification.

This is a unique Koalr signal with no equivalent in research literature — weighted conservatively (6%) while more data is collected.

Schema Migration (8%)

Any PR touching .sql files, Prisma migration directories, Alembic versions, or Flyway scripts is automatically scored at 85+. Schema changes are categorical risk — they are often irreversible, affect data integrity, and require careful sequencing with application code.

This factor also acts as a hard gate (see below).
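Detection can be sketched as path-pattern matching against the tooling named above. The patterns here are hypothetical; Flyway scripts (`V1__init.sql` and similar) are caught by the generic `.sql` rule:

```python
import re

# Hypothetical path patterns for the migration tooling named above.
MIGRATION_PATTERNS = [
    re.compile(r"\.sql$", re.IGNORECASE),       # raw SQL, incl. Flyway scripts
    re.compile(r"(^|/)prisma/migrations/"),     # Prisma migration directories
    re.compile(r"(^|/)alembic/versions/"),      # Alembic version files
]

def touches_schema_migration(changed_files: list[str]) -> bool:
    """True if any changed path matches a known migration pattern."""
    return any(p.search(f) for f in changed_files for p in MIGRATION_PATTERNS)
```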

Timing (5%)

Eyolfson 2011 — validated predictor, but lower coefficient than code-structure signals.

| Time window | Score |
| --- | --- |
| Normal hours (Mon–Thu, 6 am–8 pm local) | 10 |
| Late night (10 pm–6 am) | 60 |
| Friday afternoon (after 4 pm) | 75 |
| Weekend | 80 |
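A sketch of the banding in local time. Two details are assumptions, since the table does not specify them: when windows overlap (e.g. a late-night weekend deploy), the riskiest applicable band wins, and gaps in the table (e.g. Friday morning) fall back to the lowest band:

```python
from datetime import datetime

def timing_score(ts: datetime) -> int:
    """Banding from the table above, evaluated in the deployer's
    local time.  Overlap precedence and gap handling are assumptions."""
    wd, hour = ts.weekday(), ts.hour   # Monday = 0 ... Sunday = 6
    scores = [10]                      # default / normal hours
    if wd >= 5:                        # Saturday or Sunday
        scores.append(80)
    if wd == 4 and hour >= 16:         # Friday after 4 pm
        scores.append(75)
    if hour >= 22 or hour < 6:         # 10 pm - 6 am
        scores.append(60)
    return max(scores)                 # riskiest applicable band wins
```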

Hard Gates

In addition to the weighted score, Koalr applies hard gate rules that override the composite score:

  • Schema migration detected — any PR touching SQL/migration files is automatically flagged to at least 80/100, regardless of other factors. Requires DBA review annotation before proceeding.
  • Active incident in this service — if an incident was opened in the past 4 hours for this service, deployment risk is raised to at least 90/100.
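The gates above raise the composite but never lower it, which can be sketched as a pair of `max` clamps:

```python
def apply_hard_gates(score: float,
                     touches_migration: bool,
                     incident_in_last_4h: bool) -> float:
    """Hard gates raise (never lower) the composite score."""
    if touches_migration:
        score = max(score, 80.0)   # schema migration floor
    if incident_in_last_4h:
        score = max(score, 90.0)   # active-incident floor
    return min(score, 100.0)
```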

Weight Auto-Tuning

Once your organization has 30+ deployment outcomes logged (success vs. incident), Koalr runs weekly logistic regression on your historical data to recalibrate factor weights to your specific codebase. Your codebase is unique — the research-based defaults are a strong starting point, but your production data makes the model more accurate over time.
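The retuning idea can be sketched as fitting a logistic regression of incident outcome on per-factor scores, then renormalizing the risk-increasing coefficients into weights that sum to 1. Plain gradient descent stands in for whatever estimator and regularization Koalr actually uses:

```python
import numpy as np

def retune_weights(factor_matrix: np.ndarray,
                   incidents: np.ndarray,
                   steps: int = 2000,
                   lr: float = 0.1) -> np.ndarray:
    """Fit logistic regression of outcome (1 = incident, 0 = success)
    on per-factor scores, then turn the positive coefficients into
    normalized weights.  A sketch only; not Koalr's actual estimator."""
    X = factor_matrix / 100.0          # factor scores rescaled to [0, 1]
    y = incidents.astype(float)
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted incident prob
        w -= lr * (X.T @ (p - y) / n)            # gradient step on weights
        b -= lr * (p - y).mean()                 # gradient step on bias
    pos = np.clip(w, 0.0, None)        # keep only risk-increasing factors
    return pos / pos.sum() if pos.sum() else np.full(k, 1.0 / k)
```

On synthetic data where only the first factor drives incidents, the fitted weight vector concentrates on that factor.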

View current weights and tuning history at Delivery → Deploy Risk → Weight Tuning.


Accuracy Tracking

Koalr computes precision, recall, F1, and ROC-AUC against your logged outcomes. The model is considered accurate when F1 ≥ 0.75. View at Delivery → Deploy Risk → Accuracy.