Skill v1.0.1 (current)
Automated scan: 100/100
majiayu000/claude-skill-registry-data/slo-implementation-tringo0108-z-command
3 files
Details
Published: May 14, 2026 at 08:37 PM
Content Hash: sha256:55c3873b2ce37ab8...
Git SHA: 6c0be08ba74a
Bump Type: patch
Files (1 file, 8.4 KB)
SKILL.md · 330 lines · 8.4 KB · active
version: "1.0.1"
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
When to Use
- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals
SLI/SLO/SLA Hierarchy
```
SLA (Service Level Agreement)
    ↓ Contract with customers
SLO (Service Level Objective)
    ↓ Internal reliability target
SLI (Service Level Indicator)
    ↓ Actual measurement
```
Defining SLIs
Common SLI Types
1. Availability SLI
```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```
2. Latency SLI
```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
3. Durability SLI
```promql
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```
Reference: See references/slo-definitions.md
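The ratio-style SLIs above all reduce to "good events divided by total events". As a minimal sketch of that arithmetic outside Prometheus, the helper below is hypothetical (its name and the zero-traffic convention are assumptions, not part of this skill):

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return good_events / total_events as a fraction between 0 and 1.

    Assumption: with no events at all the SLI is reported as 1.0.
    (A PromQL division of zero by zero would yield NaN instead.)
    """
    if total_events < 0 or good_events < 0 or good_events > total_events:
        raise ValueError("need 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


# Availability: requests that did not return a 5xx status.
availability = ratio_sli(good_events=999_500, total_events=1_000_000)

# Latency: requests that fell in the le="0.5" histogram bucket.
latency = ratio_sli(good_events=992_000, total_events=1_000_000)

print(f"availability={availability:.4%} latency={latency:.4%}")
```

The request counts are made-up illustration values; in practice the numerator and denominator come from the counters queried above.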
Setting SLO Targets
Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
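The downtime figures in the table follow from multiplying the period by `1 - SLO`. The sketch below reproduces that calculation, assuming a 30-day month and a 365-day year (which is what the table's numbers imply); `allowed_downtime` is an illustrative name, not part of this skill:

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over a period."""
    error_budget = 1 - slo_percent / 100  # e.g. 99.9 -> 0.001
    return period * error_budget


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: 365-day year

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(slo, MONTH)
    per_year = allowed_downtime(slo, YEAR)
    print(f"{slo}%: {per_month} per month, {per_year} per year")
```

Note that the period matters: over a 28-day window, a 99.9% target allows about 40.3 minutes rather than 43.2.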
Choose Appropriate SLOs
Consider:
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
Example SLOs:
```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```
Error Budget Calculation
Error Budget Formula
Error Budget = 1 - SLO Target
Example:
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
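The example numbers can be checked with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and the 30-day month is an assumption taken from the 43.2-minute figure.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month


def error_budget_report(slo_target: float, observed_error_rate: float,
                        period_minutes: float = MINUTES_PER_MONTH) -> dict:
    """Summarize an error budget.

    Both rates are fractions, e.g. 0.999 for a 99.9% SLO and 0.0005
    for a 0.05% observed error rate.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget_fraction = 1 - slo_target  # Error Budget = 1 - SLO Target
    return {
        "budget_minutes": budget_fraction * period_minutes,
        "consumed_minutes": observed_error_rate * period_minutes,
        "remaining_percent": (1 - observed_error_rate / budget_fraction) * 100,
    }


report = error_budget_report(slo_target=0.999, observed_error_rate=0.0005)
for key, value in report.items():
    print(f"{key}: {value:.1f}")
```

A 100% target is rejected because it leaves no budget to divide by.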
Error Budget Policy
```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```
Reference: See references/error-budget.md
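One way to apply the policy programmatically is a threshold lookup. The sketch below assumes each action takes effect once the remaining budget falls to or below its listed threshold; the policy itself does not say how values between thresholds are handled, so that interpretation is an assumption.

```python
# (threshold_percent, action), ordered from most to least severe.
# Assumption: an action applies when remaining budget <= its threshold.
POLICY = [
    (0, "Feature freeze, focus on reliability"),
    (10, "Freeze non-critical changes"),
    (50, "Consider postponing risky changes"),
    (100, "Normal development velocity"),
]


def policy_action(remaining_budget_percent: float) -> str:
    """Return the action for the given remaining error budget (in percent)."""
    for threshold, action in POLICY:
        if remaining_budget_percent <= threshold:
            return action
    # Values above 100% should not occur; fall back to the least severe action.
    return POLICY[-1][1]


for remaining in (65, 50, 8, -3):
    print(f"{remaining}% remaining -> {policy_action(remaining)}")
```

A negative remaining budget (an overspent budget) maps to the feature freeze.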
SLO Implementation
Prometheus Recording Rules
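```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (1 = consuming budget exactly at the allowed pace)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Same burn rate over the longer windows that the alerting rules reference
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```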
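The burn-rate expression divides the observed error ratio by the error ratio the SLO allows. The sketch below mirrors that arithmetic with plain numbers; the function names and the zero-traffic convention are assumptions made for illustration.

```python
def burn_rate(good_requests: int, total_requests: int,
              slo_target: float = 0.999) -> float:
    """Observed error ratio divided by the allowed error ratio.

    1.0 means the budget is being spent exactly at the sustainable pace;
    values above 1.0 exhaust it before the SLO window ends.
    Assumption: a window with no traffic is treated as burning no budget.
    """
    if total_requests == 0:
        return 0.0
    error_ratio = 1 - good_requests / total_requests
    return error_ratio / (1 - slo_target)


def compliance(sli_ratio: float, slo_target: float = 0.999) -> int:
    """1 when the SLI meets the target, else 0 (like `>= bool` in PromQL)."""
    return int(sli_ratio >= slo_target)


# 144 failures out of 10,000 requests against a 99.9% target.
print(burn_rate(good_requests=9_856, total_requests=10_000))
print(compliance(0.9995), compliance(0.9985))
```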
SLO Alerting Rules
```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```
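The "2% in 1 hour" and "5% in 6 hours" comments come from a simple relation: the fraction of budget consumed equals burn rate times alert window divided by SLO period. A sketch of that relation follows; the function names are illustrative, and the 30-day (720-hour) period is an assumption inferred from the thresholds.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    slo_period_hours: float) -> float:
    """Fraction of the whole error budget spent during the alert window."""
    return burn_rate * window_hours / slo_period_hours


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_period_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget in the window."""
    return budget_fraction * slo_period_hours / window_hours


PERIOD_30D = 30 * 24  # assumption: thresholds were derived for a 30-day period
PERIOD_28D = 28 * 24  # the window used by the SLI queries in this document

print(budget_consumed(14.4, 1, PERIOD_30D))      # fast burn
print(budget_consumed(6, 6, PERIOD_30D))         # slow burn
print(burn_rate_threshold(0.02, 1, PERIOD_28D))  # 2% in 1 hour over 28 days
```

With a 28-day window the same 14.4x and 6x thresholds consume slightly more of the budget (about 2.1% and 5.4%), so teams that want exactly 2% and 5% may prefer to recompute the thresholds for their own period.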
SLO Dashboard
Grafana Dashboard Structure:
```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```
Example Queries:
```promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
* 28
/ (1 - sli:http_availability:ratio)
* (1 - 0.999)
```
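The days-until-exhausted query multiplies the remaining budget fraction by the window length and divides by the burn rate (error ratio over allowed error ratio). A sketch of the same arithmetic, with hypothetical names and explicit handling of two edge cases the query leaves implicit:

```python
import math


def days_until_exhausted(remaining_percent: float, sli_ratio: float,
                         slo_target: float = 0.999,
                         window_days: float = 28) -> float:
    """Estimate days until the error budget is spent at the observed error rate.

    Assumptions: an already exhausted budget returns 0.0, and a window with
    no errors returns infinity (the PromQL version would divide by zero).
    """
    if remaining_percent <= 0:
        return 0.0
    error_ratio = 1 - sli_ratio
    if error_ratio <= 0:
        return math.inf
    burn = error_ratio / (1 - slo_target)
    return (remaining_percent / 100) * window_days / burn


# Half the budget left while burning at half the sustainable pace.
print(days_until_exhausted(remaining_percent=50, sli_ratio=0.9995))
```

Because the SLI here is averaged over the full window, the estimate reacts slowly to a recent spike; a shorter-window burn rate would give a more responsive projection.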
Multi-Window Burn Rate Alerts
```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```
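To see why pairing a long and a short window cuts false positives, the condition can be evaluated on sample burn rates. This sketch copies the thresholds from the rule above; the function name and the sample values are made up for illustration.

```python
from typing import Mapping


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """Evaluate the multi-window condition on burn rates keyed by window."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


# A spike that has already ended: the long window is still elevated,
# but the short window has recovered, so no alert fires.
recovered = {"5m": 0.5, "30m": 1.0, "1h": 20.0, "6h": 4.0}

# An ongoing incident: both the long and the short window are elevated.
ongoing = {"5m": 25.0, "30m": 22.0, "1h": 20.0, "6h": 4.0}

print(should_alert(recovered), should_alert(ongoing))
```

The long window shows that a meaningful share of budget was spent; the short window confirms the problem is still happening.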
SLO Review Process
Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
Best Practices
- Start with user-facing services
- Use multiple SLIs (availability, latency, etc.)
- Set achievable SLOs (don't aim for 100%)
- Implement multi-window alerts to reduce noise
- Track error budget consistently
- Review SLOs regularly
- Document SLO decisions
- Align with business goals
- Automate SLO reporting
- Use SLOs for prioritization
Reference Files
- assets/slo-template.md - SLO definition template
- references/slo-definitions.md - SLO definition patterns
- references/error-budget.md - Error budget calculations