How Deploy Risk is Scored

The research-validated signals Koalr uses to predict incident probability before each deployment.

Every deployment gets a risk score from 0 to 100. Scores ≥ 70 are HIGH risk, scores of 40–69 are MEDIUM, and scores below 40 are LOW.
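The bands above can be sketched as a small lookup. This is an illustrative sketch, not Koalr's code: the function name `risk_level` is hypothetical, and fractional scores between 69 and 70 are assumed to fall into MEDIUM.

```python
def risk_level(score: float) -> str:
    """Map a 0-100 deploy risk score to the band described above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    return "LOW"
```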

The score is a weighted sum of 33 factors across five categories, derived from published software engineering research (Hassan 2009, Bird 2011, Nagappan & Zeller 2010, McIntosh 2015, and others). Factor weights are per-organization and auto-tune over time as Koalr learns which signals actually predict incidents in your codebase.
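As a rough illustration of a weighted sum over factor scores, the sketch below combines per-factor scores (each 0–100) with fractional weights. The function name, the dictionary layout, and the choice to renormalise by the weights of the factors actually present are assumptions made for this example, not documented Koalr behaviour.

```python
def composite_score(factor_scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 factor scores.

    Assumption: weights are renormalised over the factors that are
    present, so a missing factor does not drag the score toward zero.
    """
    total_weight = sum(weights[name] for name in factor_scores)
    if total_weight == 0:
        raise ValueError("no weighted factors available")
    weighted = sum(score * weights[name] for name, score in factor_scores.items())
    return weighted / total_weight


# Example using the default Tier 1 weights listed in this document;
# the factor scores themselves are made up for illustration.
tier1_weights = {
    "change_entropy": 0.11,
    "author_file_expertise": 0.10,
    "minor_contributors": 0.09,
    "change_burst": 0.08,
}
example_scores = {
    "change_entropy": 80,
    "author_file_expertise": 40,
    "minor_contributors": 60,
    "change_burst": 20,
}
example_result = composite_score(example_scores, tier1_weights)
```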

Signal Categories

Category	Signals	Weight
Change characteristics	Entropy, Size, File Count, File Churn Rate, DDL Migration, Dependency Changes	30%
Author expertise	File Familiarity, Module Ownership, Historical Rework Rate	25%
Review coverage	Review Depth, Reviewer Familiarity, CODEOWNERS Compliance, Time-to-Approval	20%
Test coverage	Coverage Delta on Changed Files, New Code Coverage, Test-to-Code Ratio	15%
Deployment context	Timing, Deploy Frequency, SLO Burn Rate, Concurrent Incidents, Rollback Frequency, AI Authorship, Cross-Service Blast Radius	10%

Factor Tiers

Factors are organized into three tiers by predictive strength:

Tier	Factors	Default weight each
Tier 1 — Highest signal	Change Entropy, Author File Expertise, Minor Contributors, Change Burst	8–11%
Tier 2 — Moderate signal	Review Coverage, CODEOWNERS Compliance, Historical Failures, File Count, Patch Coverage, SLO Burn Rate	6–8%
Tier 3 — Supporting signal	Change Size, Semantic Risk, Schema Migration, Timing, AI Authorship, Rollback Frequency, Concurrent Incidents	3–8%

Tier 1 Factors

Change Entropy (11%)

Hassan 2009 — consistently the top predictor in replications across Vista/Win7, Linux, Eclipse.

Entropy measures how widely a change is scattered relative to the codebase's historical co-change patterns. A PR that modifies files that have never been changed together scores high entropy. Such cross-cutting changes are harder to reason about and more likely to introduce unexpected interactions.

Computed by Koalr from your repository's git history. Higher entropy → higher score.
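Koalr's exact entropy computation is not described here. For intuition only, the sketch below computes a normalised Shannon entropy over how a change's modified lines are distributed across files, in the spirit of Hassan's measure. It does not model the co-change history described above, and the function name and input shape are hypothetical.

```python
import math


def change_entropy(lines_changed_per_file: dict) -> float:
    """Normalised Shannon entropy (0.0-1.0) of a change's line distribution.

    0.0 means the change is concentrated in one file; 1.0 means it is
    spread evenly across all touched files.
    """
    counts = [n for n in lines_changed_per_file.values() if n > 0]
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts)
    return entropy / math.log2(len(counts))
```

A change split evenly between two files yields 1.0 under this sketch, while a change that puts nearly all of its lines in one file yields a value close to 0.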

Author File Expertise (10%)

Bird 2011, Rahman 2011 — file-level expertise is more predictive than repo-level experience.

Measures how familiar the PR author is with each specific file being changed, based on historical commit activity. An experienced engineer changing files they've never touched before scores higher risk than a junior engineer changing files they own.

Scores from 0 (deep expertise in all changed files) to 100 (no prior history with any changed file).

Minor Contributors (9%)

Bird 2011 — single highest-correlating metric in the Vista/Windows 7 bug study.

A file with many authors, most of whom contributed only a small fraction of the changes, is significantly more defect-prone. This captures "ownership diffusion" — when no single person deeply understands a file's full evolution.
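One way to compute the share of minor contributors for a file is sketched below. The 5% cutoff for what counts as a "minor" contributor is an assumption for illustration (the text above says only "a small fraction"), and the function name is hypothetical.

```python
MINOR_SHARE_CUTOFF = 0.05  # assumed cutoff, not a documented Koalr value


def minor_contributor_pct(commits_by_author: dict, cutoff: float = MINOR_SHARE_CUTOFF) -> float:
    """Percentage of a file's authors who made less than `cutoff` of its commits."""
    total = sum(commits_by_author.values())
    if total == 0 or not commits_by_author:
        return 0.0
    minor = sum(1 for commits in commits_by_author.values() if commits / total < cutoff)
    return minor / len(commits_by_author) * 100
```

The resulting percentage is the density value that the table below maps to a score.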

Minor contributor density	Score
≤ 10% of authors are minor	10
11–30%	35
31–50%	60
> 50%	85

Change Burst (8%)

Nagappan & Zeller 2010 — 91.1% precision, 92.0% recall at predicting post-release defects.

A "change burst" occurs when the same files are modified repeatedly in a short window (≤ 14 days). Files under active churn are in an unstable state where any individual change is riskier — tests may not have caught up, the codebase may be inconsistent, and reviewers have context fatigue.
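A minimal sketch of burst detection follows, assuming a file counts as bursting when it has at least `min_changes` modifications in the 14 days before the deploy. The 14-day window comes from the text above; the threshold of 3 and the function name are hypothetical.

```python
from datetime import datetime, timedelta

BURST_WINDOW = timedelta(days=14)  # window stated in the text above
MIN_CHANGES = 3                    # hypothetical threshold, not a documented value


def in_change_burst(change_times, deploy_time, min_changes=MIN_CHANGES):
    """Return True if the file was modified at least `min_changes` times
    in the 14 days leading up to `deploy_time`."""
    recent = [
        t for t in change_times
        if timedelta(0) <= deploy_time - t <= BURST_WINDOW
    ]
    return len(recent) >= min_changes
```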

Tier 2 Factors

Review Coverage (8%)

McIntosh 2015 — files without review have up to 5× more post-release defects.

Review state	Score
Approved (≥ 1 approver)	5
Reviewed but not approved	40
No reviews	90

CI status is also factored in — a failed CI check adds 20 points to this score.
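The review-state table and the CI adjustment can be combined as in the sketch below. Capping the result at 100 is an assumption (an unreviewed PR with failed CI would otherwise reach 110), and the state labels and function name are hypothetical.

```python
REVIEW_STATE_SCORES = {
    "approved": 5,    # at least one approver
    "reviewed": 40,   # reviewed but not approved
    "none": 90,       # no reviews
}
CI_FAILURE_PENALTY = 20


def review_coverage_score(review_state: str, ci_failed: bool) -> int:
    """Score from the review-state table, plus 20 points when CI has failed."""
    score = REVIEW_STATE_SCORES[review_state]
    if ci_failed:
        score += CI_FAILURE_PENALTY
    return min(score, 100)  # assumption: factor scores are capped at 100
```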

Historical Failures (8%)

The repository's change failure rate (CFR) over the past 90 days — the percentage of deployments that resulted in an incident within 4 hours.

Score = min(CFR × 100, 100), where CFR is expressed as a fraction between 0 and 1.

A repo with a 45% CFR gets a historical failure score of 45. This grounds the risk model in your codebase's actual track record.
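As a worked version of the formula, the sketch below derives CFR from deployment counts and applies the cap. Returning 0 when there is no deployment history is an assumption, and the function and parameter names are hypothetical.

```python
def historical_failure_score(incident_deploys: int, total_deploys: int) -> float:
    """min(CFR * 100, 100), with CFR = incident_deploys / total_deploys."""
    if total_deploys == 0:
        return 0.0  # assumption: no history contributes no signal
    cfr = incident_deploys / total_deploys
    return min(cfr * 100, 100)
```

With 45 incident-causing deployments out of 100, CFR is 0.45 and the score is 45, matching the example above.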

File Count (7%)

Mockus & Weiss 2000 — "diffusion breadth" of a change predicts defects.

Files changed	Score
≤ 3	5
4–10	25
11–20	50
21–40	75
> 40	95

Patch Coverage (7%)

Inozemtseva 2014 — coverage of the changed lines specifically is more predictive than overall repo coverage.

Koalr uses patch coverage — the percentage of the lines actually added or changed in this PR that are covered by tests — rather than overall repository coverage. A PR that adds untested code to a well-covered codebase is riskier than overall coverage suggests.

Patch coverage	Score
≥ 80%	5
60–79%	25
40–59%	50
20–39%	75
< 20%	95
Unknown	50

Requires Codecov or SonarCloud integration. Falls back to overall repo coverage if patch coverage is unavailable.
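The table and fallback rule above can be sketched as follows. Coverage values are percentages, `None` stands for "unavailable", and the function name is hypothetical. Values that fall between the table's integer bands (for example 79.5%) are assigned to the lower band here, which is an assumption.

```python
from typing import Optional


def patch_coverage_score(patch_coverage: Optional[float],
                         repo_coverage: Optional[float] = None) -> int:
    """Score patch coverage, falling back to repo coverage, then to Unknown (50)."""
    coverage = patch_coverage if patch_coverage is not None else repo_coverage
    if coverage is None:
        return 50  # "Unknown" row of the table
    if coverage >= 80:
        return 5
    if coverage >= 60:
        return 25
    if coverage >= 40:
        return 50
    if coverage >= 20:
        return 75
    return 95
```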

SLO Burn Rate (7%)

Google SRE Workbook — deploying into a service already consuming its error budget accelerates failure.

If the target service's SLO error budget is burning faster than normal, a new deployment carries higher risk. Koalr reads burn rate from your observability integration.

SLO burn rate multiplier	Score
< 1.5×	10
1.5–3×	45
3–10×	70
> 10×	95

Tier 3 Factors

Change Size (6%)

Total lines added + deleted. Weighted lower than File Count because raw line count is less predictive than the entropy/diffusion signals above.

Lines changed	Score
≤ 50	5
51–200	20
201–500	45
501–1000	70
> 1000	95

Semantic Risk (6%)

Koalr's LLM (Claude) analyzes the PR diff for semantic patterns associated with deployment risk: authentication changes, cryptographic updates, dependency version bumps, data migration logic, and API contract changes. Returns a 0–100 risk score based on change type classification.

This is a unique Koalr signal with no equivalent in research literature — weighted conservatively (6%) while more data is collected.

Schema Migration (8%)

Any PR touching .sql files, Prisma migration directories, Alembic versions, or Flyway scripts is automatically scored at 85+. Schema changes are treated as a categorical risk: they are often irreversible, affect data integrity, and require careful sequencing with application code.

This factor also acts as a hard gate (see below).
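Path-based detection of schema changes might look like the sketch below. The directory markers are assumptions about common project layouts, not Koalr's actual matching rules, and the function name is hypothetical. Flyway scripts are typically `.sql` files, so the suffix check is assumed to cover them.

```python
SQL_SUFFIX = ".sql"
# Hypothetical directory markers; real projects may lay these out differently.
MIGRATION_DIR_MARKERS = ("prisma/migrations/", "alembic/versions/")


def touches_schema_migration(changed_paths) -> bool:
    """Return True if any changed path looks like a schema migration."""
    for path in changed_paths:
        normalized = path.replace("\\", "/").lower()
        if normalized.endswith(SQL_SUFFIX):
            return True
        if any(marker in normalized for marker in MIGRATION_DIR_MARKERS):
            return True
    return False
```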

Timing (5%)

Eyolfson 2011 — validated predictor, but lower coefficient than code-structure signals.

Time window	Score
Normal hours (Mon–Thu, 6 am–8 pm local)	10
Late night (10 pm–6 am)	60
Friday afternoon (after 4 pm)	75
Weekend	80

AI Authorship (4%)

PRs containing commits with Co-authored-by trailers from known AI coding assistants (GitHub Copilot, Cursor, Claude Code, Codeium, CodeWhisperer) receive an authorship risk signal. AI-generated code is associated with 2.3× higher rework rates (DORA 2025 research). This is an informational signal — it raises risk enough to prompt additional review, not to block.
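A minimal sketch of trailer detection is shown below. Matching on lowercase substrings of the assistant names listed above is an assumption and can misfire on human co-authors with similar names; the marker list and function name are hypothetical.

```python
import re

# Substrings derived from the assistants named above; matching rule is assumed.
AI_ASSISTANT_MARKERS = ("copilot", "cursor", "claude", "codeium", "codewhisperer")
TRAILER_RE = re.compile(r"^co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def has_ai_coauthor(commit_message: str) -> bool:
    """Return True if any Co-authored-by trailer names a listed AI assistant."""
    for value in TRAILER_RE.findall(commit_message):
        lowered = value.lower()
        if any(marker in lowered for marker in AI_ASSISTANT_MARKERS):
            return True
    return False
```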

Rollback Frequency (4%)

The percentage of recent deployments to this service that were rolled back within 24 hours. A high rollback rate suggests instability that new deployments are likely to encounter.

Concurrent Incidents (3%)

If an active incident is open in this service or a dependent service at deploy time, risk is elevated. Deploying into a degraded system has higher failure probability.

Cross-Service Blast Radius (5%)

Koalr maps your service dependency graph to estimate how far impact from a failed deployment can propagate. Changes to shared infrastructure — platform packages, API gateways, auth libraries, database clients — carry elevated risk because a defect affects not just the changed service but all downstream dependents.

The blast radius estimate is expressed as an impact tier:

Impact tier	Affected service count	Score boost
LOW	1 service	0
MEDIUM	2–4 services	+5 pts
HIGH	5–9 services	+10 pts
CRITICAL	10+ services	+18 pts

Scores with LOW confidence (single-file PRs with insufficient history) skip this signal to avoid spurious elevation. The service dependency graph is built from your GitHub integration's import graph and Koalr's co-deployment history.
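The impact tiers and the low-confidence rule can be expressed as in the sketch below. The function name and the `low_confidence` flag are hypothetical; returning `None` stands for "signal skipped".

```python
from typing import Optional, Tuple


def blast_radius(affected_services: int,
                 low_confidence: bool = False) -> Optional[Tuple[str, int]]:
    """Return (impact tier, score boost in points), or None when skipped."""
    if low_confidence:
        return None  # signal is skipped for LOW-confidence scores
    if affected_services >= 10:
        return ("CRITICAL", 18)
    if affected_services >= 5:
        return ("HIGH", 10)
    if affected_services >= 2:
        return ("MEDIUM", 5)
    return ("LOW", 0)
```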

New Signals (v2.0+)

These signals were added in the 32-signal release and are weighted at the conservative end while more outcome data is gathered:

CODEOWNERS Compliance (7%)

Whether the PR has received approval from all required CODEOWNERS for the changed file paths. Non-compliant PRs (missing required owner approval) score 70–90 on this signal.

Reviewer Familiarity (6%)

Whether the approving reviewers have historical commit context in the changed files. A review from an engineer unfamiliar with the code area is less effective than a review from the file owner.

Historical Rework Rate (6%)

The PR author's personal rework rate over the past 90 days — what percentage of their code is rewritten within 21 days. Authors with high personal rework rates produce riskier changes.

New Code Coverage (5%)

Coverage on lines that are new additions in this PR specifically (from SonarCloud new_coverage metric). New code with < 60% coverage scores high on this signal.

Test-to-Code Ratio (4%)

The ratio of test lines added to implementation lines added in this PR. PRs that add significant implementation without corresponding tests score higher risk.

Dependency Changes (3%)

PRs modifying package.json, go.mod, requirements.txt, Cargo.toml, or similar dependency files are flagged. If known CVEs are found in the updated dependencies (via Snyk/GitHub Dependabot data), the score is elevated further.

Hard Gates

In addition to the weighted score, Koalr applies hard gate rules that override the composite score:

Schema migration detected: any PR touching SQL/migration files is automatically raised to at least 80/100, regardless of other factors. A DBA review annotation is required before the deployment proceeds.
Active incident in this service: if an incident was opened in the past 4 hours for this service, deployment risk is raised to at least 90/100.
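Both gates act as floors on the composite score, which can be sketched as follows. The function and parameter names are hypothetical; the 80 and 90 floors come from the rules above.

```python
SCHEMA_MIGRATION_FLOOR = 80
ACTIVE_INCIDENT_FLOOR = 90


def apply_hard_gates(composite: float,
                     schema_migration: bool,
                     recent_incident: bool) -> float:
    """Raise the composite score to the floor of each gate that applies."""
    score = composite
    if schema_migration:
        score = max(score, SCHEMA_MIGRATION_FLOOR)
    if recent_incident:
        score = max(score, ACTIVE_INCIDENT_FLOOR)
    return score
```

A gate never lowers a score: a composite of 95 stays at 95 even when both gates apply.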

Weight Auto-Tuning

Once your organization has 30+ deployment outcomes logged (success vs. incident), Koalr runs weekly logistic regression on your historical data to recalibrate factor weights to your specific codebase. Your codebase is unique — the research-based defaults are a strong starting point, but your production data makes the model more accurate over time.
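Koalr's regression pipeline is not documented here. For intuition only, the sketch below fits a logistic regression by batch gradient descent in plain Python and turns the coefficients into weights. Several parts are assumptions: factor scores are scaled to 0–1, negative coefficients are clipped to zero, the remainder is normalised to sum to 1, and defaults are kept when fewer than 30 outcomes exist. All names are hypothetical.

```python
import math

MIN_OUTCOMES = 30  # threshold stated in the text above


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression coefficients with batch gradient descent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus actual
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def retune_weights(X, y, default_weights):
    """Return recalibrated weights, or the defaults if data is insufficient."""
    if len(y) < MIN_OUTCOMES:
        return list(default_weights)
    coefs, _ = fit_logistic(X, y)
    positive = [max(c, 0.0) for c in coefs]  # assumption: drop negative signals
    total = sum(positive)
    if total == 0:
        return list(default_weights)
    return [c / total for c in positive]
```

Here each row of `X` holds one deployment's factor scores and `y` holds 1 for an incident and 0 for a success.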

View current weights and tuning history on the Deploy Risk page, where weight tuning appears inline on the main page.

Accuracy Tracking

Koalr computes precision, recall, F1, and ROC-AUC against your logged outcomes. The model is considered accurate when F1 ≥ 0.75. View at Delivery → Deploy Risk → Accuracy.
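Precision, recall, and F1 can be computed from logged outcomes as in the sketch below. Treating a HIGH-risk score as a "predicted incident" is an assumption made for this example, as are the function names; ROC-AUC is omitted for brevity.

```python
F1_TARGET = 0.75  # accuracy threshold stated above


def classification_metrics(predicted, actual):
    """Return (precision, recall, f1) for parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def is_accurate(predicted, actual) -> bool:
    """True when F1 meets the 0.75 target."""
    return classification_metrics(predicted, actual)[2] >= F1_TARGET
```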