How Deploy Risk is Scored
The 13 research-validated signals Koalr uses to predict incident probability before each deployment.
Every deployment gets a risk score from 0–100. Scores ≥ 70 are HIGH risk. Scores 40–69 are MEDIUM. Below 40 is LOW.
The score is a weighted sum of 13 factors derived from published software engineering research (Hassan 2009, Bird 2011, Nagappan & Zeller 2010, McIntosh 2015, and others). Factor weights are per-organization and auto-tune over time as Koalr learns which signals actually predict incidents in your codebase.
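As a rough illustration of the weighted sum and the risk bands, here is a minimal Python sketch. The weights are the default percentages listed for each factor on this page. The function names, the dictionary keys, and the treatment of fractional scores between 69 and 70 as MEDIUM are assumptions made for the example, not Koalr's actual implementation.

```python
# Default weights copied from the per-factor percentages on this page.
# Key names are hypothetical identifiers chosen for this sketch.
DEFAULT_WEIGHTS = {
    "change_entropy": 0.11,
    "author_file_expertise": 0.10,
    "minor_contributors": 0.09,
    "change_burst": 0.08,
    "review_coverage": 0.08,
    "historical_failures": 0.08,
    "file_count": 0.07,
    "patch_coverage": 0.07,
    "slo_burn_rate": 0.07,
    "change_size": 0.06,
    "semantic_risk": 0.06,
    "schema_migration": 0.08,
    "timing": 0.05,
}


def composite_score(factor_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-factor scores, each expected in the range 0-100."""
    total = sum(weights[name] * factor_scores[name] for name in weights)
    return round(total, 1)


def risk_band(score):
    """Map a 0-100 score to the bands described above."""
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"  # assumption: fractional scores below 70 stay MEDIUM
    return "LOW"
```

Because the default weights sum to 100%, a deployment whose factors all score 50 also gets a composite score of 50 under this sketch.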
Factor Tiers
Factors are organized into three tiers by predictive strength:
| Tier | Factors | Default weight each |
|---|---|---|
| Tier 1 — Highest signal | Change Entropy, Author File Expertise, Minor Contributors, Change Burst | 8–11% |
| Tier 2 — Moderate signal | Review Coverage, Historical Failures, File Count, Patch Coverage, SLO Burn Rate | 7–8% |
| Tier 3 — Supporting signal | Change Size, Semantic Risk, Schema Migration, Timing | 5–8% |
Tier 1 Factors
Change Entropy (11%)
Hassan 2009 — consistently the top predictor in replications across Vista/Win7, Linux, Eclipse.
Entropy measures how widely a change is scattered relative to the repository's historical change patterns. A PR that modifies files that have never been changed together scores high entropy. Such cross-cutting changes are harder to reason about and more likely to introduce unexpected interactions.
Computed by Koalr from your repository's git history. Higher entropy → higher score.
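This page does not spell out Koalr's exact entropy computation. For intuition only, the sketch below computes a normalized Shannon entropy over how a change is distributed across files, in the spirit of Hassan's change-complexity metric. The function name and the use of lines changed per file as the input are assumptions, and the sketch ignores the co-change history that Koalr derives from git.

```python
import math


def normalized_change_entropy(lines_changed_per_file):
    """Shannon entropy of the change distribution, scaled to the range 0-1.

    `lines_changed_per_file` maps a file path to the lines changed in it.
    A value near 0 means the change is concentrated in one file; a value
    near 1 means it is spread evenly across all touched files.
    """
    total = sum(lines_changed_per_file.values())
    n = len(lines_changed_per_file)
    if total == 0 or n < 2:
        return 0.0
    entropy = 0.0
    for count in lines_changed_per_file.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    # Divide by the maximum possible entropy for n files.
    return entropy / math.log2(n)
```

A change split evenly across two files yields 1.0 here, while a change that puts 99 of 100 lines in a single file yields a value below 0.1.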
Author File Expertise (10%)
Bird 2011, Rahman 2011 — file-level expertise is more predictive than repo-level experience.
Measures how familiar the PR author is with each specific file being changed, based on historical commit activity. An experienced engineer changing files they've never touched before scores higher risk than a junior engineer changing files they own.
Scores from 0 (deep expertise in all changed files) to 100 (no prior history with any changed file).
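One way to picture this 0 to 100 scale is sketched below. The saturation point (10 prior commits counting as full familiarity) and the simple averaging across files are hypothetical choices for illustration. The page only states that the score is based on historical commit activity.

```python
def expertise_score(prior_commits_by_file, saturation=10):
    """Return 0 (deep expertise everywhere) to 100 (no history anywhere).

    `prior_commits_by_file` maps each changed file to the number of earlier
    commits the PR author made to it. `saturation` is a hypothetical cutoff
    at which the author counts as fully familiar with a file.
    """
    if not prior_commits_by_file:
        return 0.0  # assumption: an empty change carries no expertise risk
    familiarity = [
        min(commits / saturation, 1.0)
        for commits in prior_commits_by_file.values()
    ]
    return round(100 * (1 - sum(familiarity) / len(familiarity)), 1)
```

Under these assumptions, an author who owns one of two changed files and has never touched the other scores 50.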
Minor Contributors (9%)
Bird 2011 — single highest-correlating metric in the Vista/Windows 7 bug study.
A file with many authors, most of whom contributed only a small fraction of the changes, is significantly more defect-prone. This captures "ownership diffusion" — when no single person deeply understands a file's full evolution.
| Minor contributor density | Score |
|---|---|
| ≤ 10% of authors are minor | 10 |
| 11–30% | 35 |
| 31–50% | 60 |
| > 50% | 85 |
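The table above can be sketched as a small lookup. The 5% cutoff for what counts as a "minor" contributor is borrowed from Bird et al.'s definition and is an assumption here, since this page does not state Koalr's threshold. The function name and input shape are also hypothetical.

```python
def minor_contributor_score(commits_by_author, minor_threshold=0.05):
    """Score a file from the share of its authors who are minor contributors.

    `commits_by_author` maps an author to their commit count on the file.
    An author is "minor" when their share of commits is below
    `minor_threshold` (assumed 5%, following Bird et al.).
    """
    total = sum(commits_by_author.values())
    if total == 0:
        return 10  # assumption: no history falls in the lowest band
    minor = sum(
        1 for commits in commits_by_author.values()
        if commits / total < minor_threshold
    )
    density = 100 * minor / len(commits_by_author)
    if density <= 10:
        return 10
    if density <= 30:
        return 35
    if density <= 50:
        return 60
    return 85
```

Fractional densities that fall between the table's integer bands, such as 10.5%, land in the next band up in this sketch.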
Change Burst (8%)
Nagappan & Zeller 2010 — 91.1% precision, 92.0% recall at predicting post-release defects.
A "change burst" occurs when the same files are modified repeatedly in a short window (≤ 14 days). Files under active churn are in an unstable state where any individual change is riskier — tests may not have caught up, the codebase may be inconsistent, and reviewers have context fatigue.
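To make the 14-day window concrete, the following sketch finds the largest number of changes to a file inside any window of that length. The `min_changes=3` trigger is a hypothetical value, since this page does not say how many modifications constitute a burst or how the burst maps to a 0 to 100 score.

```python
from datetime import datetime, timedelta


def max_changes_in_window(commit_times, window_days=14):
    """Largest number of commits that fall inside any `window_days` span."""
    times = sorted(commit_times)
    window = timedelta(days=window_days)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Slide the left edge forward until the window fits again.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def is_in_burst(commit_times, min_changes=3, window_days=14):
    """Hypothetical rule: a burst is `min_changes` or more within the window."""
    return max_changes_in_window(commit_times, window_days) >= min_changes
```

For example, commits on January 1, 5, and 10 count as a burst under these assumptions, while monthly commits do not.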
Tier 2 Factors
Review Coverage (8%)
McIntosh 2015 — files without review have up to 5× more post-release defects.
| Review state | Score |
|---|---|
| Approved (≥ 1 approver) | 5 |
| Reviewed but not approved | 40 |
| No reviews | 90 |
CI status is also factored in — a failed CI check adds 20 points to this score.
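A minimal sketch of the review table combined with the CI adjustment follows. Capping the result at 100 is an assumption, since an unreviewed PR with failing CI would otherwise reach 110. The function and parameter names are hypothetical.

```python
def review_coverage_score(approvals, reviews, ci_failed):
    """Base score from the review state, plus 20 points for a failed CI check."""
    if approvals >= 1:
        base = 5
    elif reviews >= 1:
        base = 40  # reviewed but not approved
    else:
        base = 90  # no reviews at all
    if ci_failed:
        base += 20
    return min(base, 100)  # assumption: keep the factor on the 0-100 scale
```

Under this sketch an approved PR with failing CI scores 25, which is still well below an unreviewed PR with passing CI.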
Historical Failures (8%)
The repository's change failure rate (CFR) over the past 90 days — the percentage of deployments that resulted in an incident within 4 hours.
Score = min(CFR × 100, 100), with CFR expressed as a fraction between 0 and 1.
A repo with a 45% CFR gets a historical failure score of 45. This grounds the risk model in your codebase's actual track record.
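The calculation can be sketched from raw timestamps as follows. The function name, the input shape, and the choice to return 0 when there are no deployments in the lookback period are assumptions. The 90-day lookback and 4-hour incident window come from the description above.

```python
from datetime import datetime, timedelta


def historical_failure_score(deploy_times, incident_times, now,
                             lookback_days=90, incident_window_hours=4):
    """Change failure rate over the lookback period, scaled to 0-100."""
    cutoff = now - timedelta(days=lookback_days)
    window = timedelta(hours=incident_window_hours)
    recent = [d for d in deploy_times if cutoff <= d <= now]
    if not recent:
        return 0.0  # assumption: no recent deployments means no signal
    # A deployment "failed" if any incident opened within the window after it.
    failed = sum(
        1 for d in recent
        if any(d <= i <= d + window for i in incident_times)
    )
    cfr = failed / len(recent)
    return min(round(cfr * 100, 1), 100.0)
```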
File Count (7%)
Mockus & Weiss 2000 — "diffusion breadth" of a change predicts defects.
| Files changed | Score |
|---|---|
| ≤ 3 | 5 |
| 4–10 | 25 |
| 11–20 | 50 |
| 21–40 | 75 |
| > 40 | 95 |
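Bucketed tables like this one map naturally onto a sorted-threshold lookup. The sketch below uses Python's standard `bisect` module. The constant and function names are hypothetical, and the same pattern would apply to the other range tables on this page.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of each band, and the score for each band.
FILE_COUNT_BOUNDS = [3, 10, 20, 40]
FILE_COUNT_SCORES = [5, 25, 50, 75, 95]


def file_count_score(files_changed):
    """Look up the score for a number of changed files."""
    # bisect_left returns the index of the first bound >= files_changed,
    # or len(bounds) when the count exceeds every bound.
    return FILE_COUNT_SCORES[bisect_left(FILE_COUNT_BOUNDS, files_changed)]
```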
Patch Coverage (7%)
Inozemtseva 2014 — coverage of the changed lines specifically is more predictive than overall repo coverage.
Koalr uses patch coverage — the percentage of the lines actually added or changed in this PR that are covered by tests — rather than overall repository coverage. A PR that adds untested code to a well-covered codebase is riskier than overall coverage suggests.
| Patch coverage | Score |
|---|---|
| ≥ 80% | 5 |
| 60–79% | 25 |
| 40–59% | 50 |
| 20–39% | 75 |
| < 20% | 95 |
| Unknown | 50 |
Requires Codecov or SonarCloud integration. Falls back to overall repo coverage if patch coverage is unavailable.
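The table and the fallback rule can be sketched together as shown below. Coverage values are percentages, `None` stands for "not available", and the function name is hypothetical. How Koalr retrieves the values from Codecov or SonarCloud is not shown.

```python
def patch_coverage_score(patch_coverage=None, repo_coverage=None):
    """Score from patch coverage, falling back to overall repo coverage."""
    # Compare against None so that a genuine 0% patch coverage is kept.
    coverage = patch_coverage if patch_coverage is not None else repo_coverage
    if coverage is None:
        return 50  # the "Unknown" row
    if coverage >= 80:
        return 5
    if coverage >= 60:
        return 25
    if coverage >= 40:
        return 50
    if coverage >= 20:
        return 75
    return 95
```

Note that a PR with 0% patch coverage in a repository with 90% overall coverage still scores 95 in this sketch, because the fallback only applies when patch coverage is missing.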
SLO Burn Rate (7%)
Google SRE Workbook — deploying into a service already consuming its error budget accelerates failure.
If the target service's SLO error budget is burning faster than normal, a new deployment carries higher risk. Koalr reads burn rate from your observability integration.
| SLO burn rate multiplier | Score |
|---|---|
| < 1.5× | 10 |
| 1.5–3× | 45 |
| 3–10× | 70 |
| > 10× | 95 |
Tier 3 Factors
Change Size (6%)
Total lines added + deleted. Weighted lower than File Count because raw line count is less predictive than the entropy/diffusion signals above.
| Lines changed | Score |
|---|---|
| ≤ 50 | 5 |
| 51–200 | 20 |
| 201–500 | 45 |
| 501–1000 | 70 |
| > 1000 | 95 |
Semantic Risk (6%)
Koalr's LLM (Claude) analyzes the PR diff for semantic patterns associated with deployment risk: authentication changes, cryptographic updates, dependency version bumps, data migration logic, and API contract changes. Returns a 0–100 risk score based on change type classification.
This is a unique Koalr signal with no equivalent in research literature — weighted conservatively (6%) while more data is collected.
Schema Migration (8%)
Any PR touching .sql files, Prisma migration directories, Alembic versions, or Flyway scripts is automatically scored at 85 or higher. Schema changes carry categorical risk: they are often irreversible, affect data integrity, and require careful sequencing with application code.
This factor also acts as a hard gate (see below).
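A path-based detector for these file types might look like the sketch below. The directory patterns (`prisma/migrations`, `alembic/versions`, `db/migration`) are the conventional default layouts for Prisma, Alembic, and Flyway and are assumptions here. Projects can configure other locations, and Koalr's actual matching rules are not documented on this page.

```python
from pathlib import PurePosixPath

# Conventional default directories; real projects may configure others.
MIGRATION_DIR_PAIRS = [
    ("prisma", "migrations"),
    ("alembic", "versions"),
    ("db", "migration"),  # common Flyway location
]


def _has_dir_pair(parts, pair):
    """True if `pair` appears as two consecutive directories in the path."""
    dirs = parts[:-1]  # drop the file name itself
    return any(dirs[i:i + 2] == pair for i in range(len(dirs) - 1))


def touches_schema(path):
    """Heuristic check for a single changed file path."""
    p = PurePosixPath(path)
    if p.suffix.lower() == ".sql":
        return True
    return any(_has_dir_pair(p.parts, pair) for pair in MIGRATION_DIR_PAIRS)


def pr_touches_schema(changed_paths):
    """True if any changed file in the PR looks like a schema change."""
    return any(touches_schema(path) for path in changed_paths)
```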
Timing (5%)
Eyolfson 2011 — validated predictor, but lower coefficient than code-structure signals.
| Time window | Score |
|---|---|
| Normal hours (Mon–Thu, 6 am–8 pm local) | 10 |
| Late night (10 pm–6 am) | 60 |
| Friday afternoon (after 4 pm) | 75 |
| Weekend | 80 |
Hard Gates
In addition to the weighted score, Koalr applies hard gate rules that override the composite score:
- Schema migration detected: any PR touching SQL or migration files has its composite score raised to at least 80/100, regardless of other factors. A DBA review annotation is required before the deployment proceeds.
- Active incident in this service: if an incident was opened for this service in the past 4 hours, the composite score is raised to at least 90/100.
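The two gates act as floors on the composite score, which can be sketched as follows. The function and parameter names are hypothetical, and the DBA review annotation requirement is not modeled.

```python
def apply_hard_gates(composite, schema_migration_detected,
                     hours_since_last_incident=None):
    """Raise the composite score to the floor required by each active gate."""
    score = composite
    if schema_migration_detected:
        score = max(score, 80)
    # None means no incident has been opened for this service.
    if hours_since_last_incident is not None and hours_since_last_incident <= 4:
        score = max(score, 90)
    return score
```

Because the gates only ever raise the score, a deployment that already scores 95 is unaffected by either rule.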
Weight Auto-Tuning
Once your organization has 30+ deployment outcomes logged (success vs. incident), Koalr runs weekly logistic regression on your historical data to recalibrate factor weights to your specific codebase. Your codebase is unique — the research-based defaults are a strong starting point, but your production data makes the model more accurate over time.
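To illustrate the idea, the sketch below fits a logistic regression with plain gradient descent and turns the coefficients into weights. It is not Koalr's tuning pipeline. The learning rate, the epoch count, scaling factor scores to the range 0 to 1, and especially the rule for converting coefficients to weights (clip negatives to zero, then normalize) are all assumptions for the example.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    X is a list of rows of factor scores scaled to 0-1; y holds 1 for a
    deployment that caused an incident and 0 otherwise.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def coefficients_to_weights(coefs):
    """Hypothetical mapping: clip negative coefficients, then normalize."""
    positive = [max(c, 0.0) for c in coefs]
    total = sum(positive)
    if total == 0:
        return [1.0 / len(coefs)] * len(coefs)
    return [c / total for c in positive]
```

With this mapping, a factor that separates incidents from successes in the logged outcomes ends up with a larger share of the total weight than one that does not.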
View current weights and tuning history at Delivery → Deploy Risk → Weight Tuning.
Accuracy Tracking
Koalr computes precision, recall, F1, and ROC-AUC against your logged outcomes. The model is considered accurate when F1 ≥ 0.75. View these metrics at Delivery → Deploy Risk → Accuracy.
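For reference, these metrics can be computed from logged outcomes as in the sketch below. Treating a score of 70 or above (the HIGH band) as a predicted incident is an assumption for the example, since this page does not state which threshold Koalr uses when computing precision and recall.

```python
def precision_recall_f1(predicted, actual):
    """Standard metrics from parallel lists of booleans (or 0/1 values)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(scores, actual):
    """Probability that a random incident outscores a random success."""
    pos = [s for s, a in zip(scores, actual) if a]
    neg = [s for s, a in zip(scores, actual) if not a]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Example: risk scores and whether each deployment led to an incident.
scores = [85, 72, 40, 30, 75, 20]
actual = [1, 1, 1, 0, 0, 0]
predicted = [s >= 70 for s in scores]  # assumed threshold: HIGH band
precision, recall, f1 = precision_recall_f1(predicted, actual)
```

In this example precision, recall, and F1 all equal 2/3, which would fall short of the 0.75 accuracy bar.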