When should I use rate() vs irate() in PromQL?

Use `rate()` for dashboards and alerts — it computes the per-second average rate over the entire range window using linear regression, which smooths over scrape gaps and makes graphs readable. Use `irate()` only when you specifically need the instantaneous rate between the two most-recent samples, for example to catch a very short spike that `rate()` would average away. The rule of thumb: if you are alerting or graphing trends, use `rate(counter[5m])`; if you are debugging whether a spike happened in the last few seconds, use `irate(counter[1m])`. Keep in mind `irate()` is highly sensitive to scrape timing jitter — a single slow scrape cycle produces a meaningless spike in the graph.

What is the difference between Histogram and Summary in Prometheus?

Both expose percentiles but they trade off client cost vs server flexibility. A Summary computes quantiles on the client (the instrumented app) and exposes them as `{quantile="0.95"}` label pairs — cheap for simple use but you cannot aggregate across instances because pre-computed quantiles are not additive. A Histogram counts observations in configurable `_bucket` boundaries and exposes `_bucket`, `_sum`, and `_count` — you compute the percentile at query time with `histogram_quantile(0.95, sum(rate(hist_bucket[5m])) by (le))`, which means you CAN sum across instances before computing the quantile. Prefer Histograms for anything that needs cross-instance aggregation (the default for most metrics). Use Summaries only for per-instance quantiles that you will never aggregate.

How do I write an alert that fires when a metric is missing?

Use `absent()` — it returns an empty vector when the expression has samples and a single-element vector (value 1) when it does not. Example rule: `alert: TargetDown; expr: absent(up{job="api"} == 1)` fires when no `up` series with `job="api"` is 1 (all targets are down or the job disappeared entirely). For a range-aware version use `absent_over_time(metric[5m])`, which fires only when the metric has been absent for the full window. A common gotcha: `absent()` does NOT propagate the target's labels because there is no sample to get them from — hard-code the important labels in the alert's `labels` block instead of relying on `$labels`.

What is the recording rule naming convention?

The official convention is `level:metric:operations` where `level` describes the aggregation scope (e.g. `job`, `instance`, `cluster`), `metric` is the base metric name without the `_total` or `_seconds` suffix, and `operations` is a colon-separated list of PromQL operations applied, reading left-to-right (e.g. `rate5m` then `sum`). Examples: `job:http_requests:rate5m` — per-job rate over 5 min; `cluster:http_requests:rate5m_sum_by_job` — cluster-level sum of per-job rates. Recording rules let you avoid recomputing expensive queries (especially `rate + sum` over many time series) on every dashboard refresh and alert evaluation. Pre-recorded metrics also let you use simpler alert expressions that reference the recording instead of the full range query.

How do I filter a metric to only series where a label matches a regex?

Use the `=~` label matcher with a RE2 syntax regex. For example `http_requests_total{status=~"5.."}` selects only 5xx status codes. The regex is anchored at both ends automatically — `"5.."` matches exactly three characters starting with 5, so `50` would NOT match. To match a prefix use `"5.*"`. Use `!~` to negate. Common patterns: `{instance=~"prod-.*"}` — all prod instances; `{env!~"dev|staging"}` — exclude dev and staging; `{method=~"GET|POST"}` — only GET or POST. Note that regex matchers trigger a full index scan (unlike exact `=` matchers), so they are slower on high-cardinality label values — prefer exact matches wherever possible.

Does this tool upload my queries or alert rules anywhere?

No. The entire cheat sheet is a static in-memory array loaded with the page. The search box, category filter chips, and copy button all run in your browser — nothing is sent to any server. You can open DevTools → Network and confirm zero outbound requests while using the tool. It works offline, behind a corporate firewall, or on a jump host where you'd normally need this reference most.

Debugging a sudden spike in error rate during an incident

It is 3am and the error rate alert fired. You open the cheat sheet, grab `rate(http_requests_total{status=~"5.."}[5m])`, add `by (handler)` to find the noisy endpoint, then use `topk(5, …)` to surface the worst offenders. All from memory? No — from copy-paste in under two minutes.

Writing your first multi-window alert rule

You want an alert that is sensitive to short outages but does not page for a single bad scrape. You look up the multi-window pattern, copy the `for: 5m` block with the `short_window` and `long_window` expressions, and fill in your metric name. The annotations section shows you exactly how to use `{{ $labels.job }}` and `{{ $value | humanizePercentage }}` without guessing the syntax.

Pre-computing an expensive query as a recording rule

Your dashboard's 95th-percentile latency panel takes 8 seconds to load because it runs `histogram_quantile(0.95, sum(rate(…)) by (le))` over 400 series every refresh. You look up the recording-rule naming convention, create `job:request_duration_seconds:p95_5m`, and drop the recording into both the dashboard and the alert. Load time drops to under 200ms.

Prometheus Cheatsheet — 90+ PromQL Queries, Alerting Rules, and HTTP API Reference

Prometheus cheat sheet — 90+ entries covering PromQL selectors, aggregations, functions, alerting rules, recording rules, HTTP API, and relabeling.

Runs locally
Category Developer & DevOps
Best for Formatting, validating, shrinking, or inspecting code-adjacent text.

Section:

82 entries

Selectors (11)

metric_name

Bare metric name — selects an instant vector of all time series with that name. Returns the most recent sample for each series.

Examples

up

node_cpu_seconds_total

http_requests_total

metric{label="value"}

Exact label equality matcher. Filters the instant vector to series where `label` equals `value`.

⚠ Gotcha: Label values are case-sensitive. `{job="API"}` and `{job="api"}` are different series.

Examples

up{job="prometheus"}

http_requests_total{method="POST", status="200"}

node_filesystem_avail_bytes{mountpoint="/"}

metric{label!="value"}

Negative equality matcher. Keeps series where `label` does NOT equal `value`. Also matches series where the label is absent.

Examples

up{job!="blackbox"}

http_requests_total{env!="dev"}

metric{label=~"regex"}

RE2 regex matcher. Matches series where `label` matches the regex. Regex is anchored at both ends — `"5.."` matches exactly three chars.

⚠ Gotcha: Regex matchers trigger a full index scan (slower than `=`). Prefer exact matches for high-cardinality labels.

Examples

http_requests_total{status=~"5.."}

node_cpu_seconds_total{mode=~"user|system"}

up{instance=~"prod-.*:9090"}

metric{label!~"regex"}

Negative regex matcher. Keeps series where `label` does NOT match the regex.

Examples

http_requests_total{env!~"dev|staging"}

node_cpu_seconds_total{mode!~"idle|iowait"}

metric[5m]

Range vector selector. Returns all samples within the past 5 minutes for each series. Required by range functions like `rate()`, `increase()`, `delta()`.

⚠ Gotcha: A range vector cannot be graphed directly — it must be passed to a function like `rate()` first.

Examples

rate(http_requests_total[5m])

increase(errors_total[1h])

delta(cpu_temp[10m])

metric offset 5m

Offset modifier. Shifts the evaluation time back by the given duration — the query reads data from 5 minutes ago instead of "now".

Examples

http_requests_total offset 5m

rate(http_requests_total[5m] offset 1h)

# compare current vs one week ago:
rate(requests[5m]) / rate(requests[5m] offset 7d)

metric @ 1609746000

Timestamp modifier (Prometheus ≥ 2.25). Evaluates the selector at a specific Unix timestamp regardless of the query time.

Examples

http_requests_total @ 1609746000

rate(http_requests_total[5m] @ start())

rate(http_requests_total[5m] @ end())

{__name__=~"go_.*"}

Use `__name__` as a regular label to select metrics by name pattern. Useful for exploring or for cross-metric operations.

⚠ Gotcha: Selecting many metrics at once with `__name__=~".*"` is extremely expensive — always narrow the pattern.

Examples

{__name__=~"go_.*", job="api"}

{__name__=~"node_memory_.*"}

metric{job="api", env="prod"}

Multiple label matchers are AND-ed together. All conditions must match for a series to be selected.

Examples

up{job="api", env="prod", region="eu-west-1"}

http_requests_total{method="GET", status=~"2..", handler!="/health"}

rate(metric[1m])[10m:30s]

Subquery syntax. Evaluates the inner expression at `30s` resolution over the past `10m` and returns a range vector. Needed for `_over_time` functions on range expressions.

⚠ Gotcha: Subqueries are expensive because they re-evaluate the inner expression many times. Use recording rules for frequently-used subqueries.

Examples

max_over_time(rate(http_requests_total[1m])[10m:30s])

avg_over_time(node_load1[1h:5m])

Aggregation (11)

sum(metric)

Sum all values across all label dimensions. Returns a single scalar result.

⚠ Gotcha: Without a `by` clause, `sum()` drops ALL labels. The result has no labels you can join on.

Examples

sum(http_requests_total)

sum(rate(http_requests_total[5m]))

sum by (job) (metric)

Aggregate while KEEPING the listed labels. All unlisted labels are dropped. Equivalent to SQL `GROUP BY job`.

Examples

sum by (job) (rate(http_requests_total[5m]))

max by (instance, device) (node_disk_read_bytes_total)

sum without (instance) (metric)

Aggregate while DROPPING the listed labels. All other labels are kept. The inverse of `by`.

Examples

sum without (instance) (rate(http_requests_total[5m]))

avg without (cpu) (node_cpu_seconds_total)

avg(metric)

Arithmetic mean across all series or within each group defined by `by`/`without`.

Examples

avg by (job) (rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

max(metric) / min(metric)

Maximum or minimum value across all matching series. Useful for cluster-wide alerting thresholds.

Examples

max by (job) (rate(errors_total[5m]))

min(node_filesystem_avail_bytes / node_filesystem_size_bytes)

count(metric)

Count the number of series in the result vector. Great for "how many instances are up" style queries.

Examples

count(up{job="api"} == 1)

count by (job) (up)

topk(5, metric)

Return the K series with the highest values. Useful for finding the most active endpoints or noisiest instances.

⚠ Gotcha: `topk` returns multiple series, so it is not suitable for alerts (which need a single result). Use it in dashboards only.

Examples

topk(5, sum by (handler) (rate(http_requests_total[5m])))

topk(10, node_cpu_seconds_total{mode="user"})

bottomk(3, metric)

Return the K series with the lowest values. Useful for finding the least-utilized instances or slowest responders.

Examples

bottomk(3, sum by (instance) (rate(requests_total[5m])))

quantile(0.95, metric)

φ-quantile over all series values (not over time). This aggregates across SERIES, not samples. For time-based percentiles use `histogram_quantile()`.

⚠ Gotcha: Do NOT confuse this with `histogram_quantile()`. This gives a percentile across current instance values, not across a distribution.

Examples

quantile(0.95, rate(http_request_duration_seconds_sum[5m]))

count_values("label", metric)

Count series by their value, creating a new label with the value. Useful for counting how many instances have each version number.

Examples

count_values("version", kube_pod_container_info)

count_values("status_code", http_response_code)

stddev(metric) / stdvar(metric)

Standard deviation or variance across all series. Useful for detecting outlier instances in a fleet.

Examples

stddev by (job) (rate(http_request_duration_seconds_sum[5m]))

Functions (18)

rate(counter[5m])

Per-second average rate of increase over the range window, calculated via linear regression. The correct function for dashboards and alerts on counters.

⚠ Gotcha: The range window should be at least 4× the scrape interval. For a 15s scrape interval, use `[1m]` minimum — shorter windows become noisy.

Examples

rate(http_requests_total[5m])

rate(node_network_receive_bytes_total[5m])

sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

irate(counter[5m])

Instantaneous rate — computed from the last two data points only. Captures very short spikes that `rate()` would smooth over.

⚠ Gotcha: One slow scrape cycle creates a fake spike in the graph. Avoid in dashboards; use only for live debugging of active spikes.

Examples

irate(http_requests_total[1m])

increase(counter[1h])

Total increase in a counter over the range window. Equivalent to `rate(c[window]) * window_in_seconds`. Handles counter resets.

Examples

increase(http_requests_total[1h])

increase(errors_total{job="api"}[24h])

resets(counter[1h])

Number of counter resets within the range window. A reset means the counter went from a high value back to zero (usually a process restart).

Examples

resets(http_requests_total[1h])

# alert if more than 3 restarts in an hour:
resets(process_start_time_seconds[1h]) > 3

delta(gauge[1h])

Difference in value between the first and last sample in the range window. Works on gauges, not counters.

⚠ Gotcha: Do NOT use `delta()` on counters — counters can reset and `delta()` does not handle resets. Use `increase()` instead.

Examples

delta(node_memory_MemFree_bytes[1h])

delta(cpu_temp_celsius[10m])

predict_linear(gauge[1h], 3600)

Predicts the value `3600` seconds from now using linear regression on the range window. Great for "disk will fill in X hours" alerts.

Examples

# alert: disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

predict_linear(node_memory_MemAvailable_bytes[30m], 3600)

changes(gauge[1h])

Number of times the value changed within the range window. Useful for detecting flapping services or config changes.

Examples

changes(up[1h]) > 5  # service is flapping

changes(kube_deployment_spec_replicas[1h])

histogram_quantile(0.95, sum(rate(h_bucket[5m])) by (le))

Compute the φ-quantile (0.95 = 95th percentile) from a Histogram metric. The `le` label (less-than-or-equal) marks bucket boundaries and must be preserved in the aggregation.

⚠ Gotcha: The `by (le)` is REQUIRED — omitting it drops the bucket labels and the function returns NaN.

Examples

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

histogram_quantile(0.99, sum by (le, job) (rate(grpc_server_handling_seconds_bucket[5m])))

absent(metric)

Returns an empty vector when the expression has samples; returns a single element (value 1) when the expression has NO samples. Used to alert when a metric disappears.

⚠ Gotcha: `absent()` does not propagate labels from the missing series. Hard-code important labels in the alert `labels:` block.

Examples

absent(up{job="api"} == 1)

absent(http_requests_total{env="prod"})

absent_over_time(metric[5m])

Like `absent()` but requires the metric to be absent for the FULL range window before returning 1. Avoids alerts on a single missed scrape.

Examples

absent_over_time(up{job="api"}[5m])

label_replace(m, "dst", "$1", "src", "(.*)")

Apply a regex substitution to a label and write the result to a new label. `$1` refers to the first capture group in the regex.

Examples

# extract "host" from "host:port" in instance label:
label_replace(up, "host", "$1", "instance", "([^:]+):.*")

label_replace(metric, "short_name", "$1", "handler", "/api/v[0-9]+/(.*)")

label_join(m, "new", ",", "l1", "l2")

Concatenate multiple existing label values with a separator and write the result to a new label.

Examples

label_join(up, "node_region", "/", "instance", "region")

abs() / ceil() / floor() / round(m, 0.5)

Absolute value, ceiling, floor, or round to the nearest multiple. Applied element-wise to each series value.

Examples

abs(node_filesystem_avail_bytes - node_filesystem_size_bytes / 2)

round(rate(http_requests_total[5m]) * 100, 0.01)

clamp(m, 0, 100) / clamp_min / clamp_max

Clamp all values to the range [min, max]. `clamp_min(m, 0)` forces values ≥ 0; `clamp_max(m, 100)` forces values ≤ 100.

Examples

clamp(some_ratio, 0, 1)

clamp_min(node_load1 - 1, 0)   # never negative

sort(m) / sort_desc(m)

Sort the result vector by value (ascending or descending). Useful for dashboards to always show worst offenders at the top.

Examples

sort_desc(rate(http_requests_total[5m]))

sort(node_filesystem_avail_bytes)

time() / timestamp(m)

`time()` returns the current evaluation timestamp as a scalar. `timestamp(v)` returns the timestamp of each sample in the vector.

Examples

# age of the most recent sample in seconds:
time() - timestamp(up{job="api"})

# alert: sample is stale (older than 5 min):
time() - timestamp(up) > 300

hour() / minute() / day_of_week() / month()

Time-based functions. `hour()` returns 0-23 UTC; `day_of_week()` returns 0 (Sunday) to 6; `month()` returns 1-12. Useful for business-hours inhibit conditions.

Examples

# only alert on business hours UTC+8:
hour() >= 1 and hour() < 10  # 9am-6pm CST

day_of_week() != 0 and day_of_week() != 6  # not weekends

sum_over_time(m[1h]) / avg_over_time / max_over_time

Aggregate a gauge over time within a range window. `sum_over_time` sums all samples; `avg_over_time` averages; `max_over_time` takes the peak.

Examples

avg_over_time(node_load1[1h])

max_over_time(go_goroutines[30m])

quantile_over_time(0.95, http_response_time_seconds[1h])

Operators (9)

m1 + m2 / m1 - m2 / m1 * scalar

Arithmetic operators: `+` `-` `*` `/` `%` `^`. When applied between two instant vectors, label sets must match exactly (except for `__name__`).

⚠ Gotcha: Arithmetic between two vectors requires matching labels. Use `on()` or `ignoring()` to control matching.

Examples

# error ratio:
rate(errors_total[5m]) / rate(requests_total[5m])

# bytes to megabytes:
node_memory_MemAvailable_bytes / 1024 / 1024

# percentage used:
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

m1 > bool m2

Comparison operators: `==` `!=` `>` `<` `>=` `<=`. Without `bool`, they FILTER series (non-matching series are dropped). With `bool`, they CONVERT to 0/1.

Examples

# filter: only series where value > 0.9:
rate(errors_total[5m]) / rate(requests_total[5m]) > 0.9

# convert to 0/1 for arithmetic:
(up == bool 1) * 100

m1 and m2

Set intersection. Returns series from m1 that have a matching label set in m2. Does not merge values — only uses m1 values.

Examples

# only show CPU metrics for up instances:
node_cpu_seconds_total and on(instance) up == 1

m1 or m2

Set union. Returns all series from m1, plus series from m2 that have no matching label set in m1.

Examples

# combine metrics from two jobs when one may be absent:
metric{job="a"} or metric{job="b"}

m1 unless m2

Set difference. Returns series from m1 that do NOT have a matching label set in m2.

Examples

# exclude maintenance windows:
rate(errors_total[5m]) unless on(instance) maintenance_mode == 1

m1 * on(instance) m2

`on()` restricts vector matching to only the specified labels. All other labels are ignored for the purpose of pairing samples.

Examples

# join error rate with instance metadata:
rate(errors_total[5m]) * on(instance) group_left(version) app_info

m1 * ignoring(env) m2

`ignoring()` excludes the listed labels from the matching key. Use when series differ only in a label that should not affect pairing.

Examples

requests_total * ignoring(status) error_total

m1 * on(instance) group_left(version) m2

`group_left()` allows many-to-one matching: multiple series from the left can match one series on the right. Listed labels are copied from the right.

⚠ Gotcha: Without `group_left` or `group_right`, many-to-one matches produce an error: "multiple matches for labels".

Examples

# enrich metrics with build version from info metric:
rate(http_requests_total[5m]) * on(instance) group_left(version) app_build_info

scalar(m) / vector(s)

`scalar(v)` converts a single-element vector to a scalar value. `vector(s)` converts a scalar to a one-element vector with no labels.

Examples

# normalize by cluster total:
rate(requests_total[5m]) / scalar(sum(rate(requests_total[5m])))

vector(1)  # always returns 1 with no labels

Metric Types (5)

Counter

Monotonically increasing value. Always use `rate()` or `increase()` — the raw value is meaningless by itself. Suffix convention: `_total`.

⚠ Gotcha: Never use `delta()` or gauge-style functions on Counters — they do not handle counter resets correctly.

Examples

# good:
rate(http_requests_total[5m])

# bad (raw counter value):
http_requests_total  # only useful at a fixed moment in time

Gauge

A value that can go up or down: memory usage, temperature, queue length, number of goroutines. Use directly or with `delta()`, `avg_over_time()`, `predict_linear()`.

Examples

node_memory_MemAvailable_bytes   # current free memory

avg_over_time(go_goroutines[5m])  # average over window

predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)  # forecast

Histogram

Counts observations in configurable buckets. Exposes three series per base name: `_bucket{le="..."}`, `_count`, `_sum`. Use `histogram_quantile()` for percentiles.

⚠ Gotcha: Bucket boundaries must be configured at instrumentation time. If your P99 always hits the highest bucket, your histogram needs higher buckets.

Examples

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# request rate via histogram count:
rate(http_request_duration_seconds_count[5m])

Summary

Pre-computes quantiles on the client. Exposes `_count`, `_sum`, and `{quantile="0.99"}` label pairs. Cannot be aggregated across instances.

⚠ Gotcha: NEVER `sum()` Summary quantile series across instances — summing pre-computed quantiles is mathematically incorrect. Use Histograms for anything needing aggregation.

Examples

# correct: per-instance summary:
go_gc_duration_seconds{quantile="0.99"}

# wrong:
sum by (job) (go_gc_duration_seconds{quantile="0.99"})  # DO NOT DO THIS

Naming conventions

Metric names use snake_case. Suffix conventions: `_total` (counter), `_seconds` (duration), `_bytes` (size), `_ratio` (0–1 fraction), `_info` (metadata gauge always = 1).

Examples

http_requests_total           # counter

http_request_duration_seconds # histogram or summary

process_resident_memory_bytes # gauge

app_build_info{version="1.2"} # metadata info gauge

Alerting (7)

Alert rule YAML structure

A complete Prometheus alerting rule defined in a rule file under a `groups` block. The `expr` field is evaluated at each `evaluation_interval`.

Examples

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

for: 5m (pending state)

The `for` clause keeps an alert in PENDING state until the condition has been true for the specified duration before transitioning to FIRING. Prevents alerts on transient spikes.

⚠ Gotcha: Without `for`, a single evaluation cycle where the condition is true immediately fires the alert. Always use `for` except for the most critical "instant" alerts.

Examples

for: 5m    # must be true for 5 minutes

for: 0m    # fire immediately (no pending)

for: 1h    # sustained disk pressure

labels: { severity: critical }

Static labels added to every alert from this rule. Used by Alertmanager routing to route to the right receiver. Label values can use Go template syntax.

Examples

labels:
  severity: critical
  team: platform
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

{{ $labels.instance }} / {{ $value | humanize }}

Go template syntax in alert annotations. `$labels` accesses the alert's label set; `$value` is the current expression value. Built-in Prometheus template functions: `humanize`, `humanizePercentage`, `humanizeDuration`, `title`, `toUpper`.

Examples

annotations:
  summary: "Instance {{ $labels.instance }} is down"
  description: >
    Error rate is {{ $value | humanizePercentage }}
    (threshold: 5%). Job: {{ $labels.job }}.

Multi-window alert (burn rate)

Google SRE burn rate pattern: combine a short window (fast, sensitive) and a long window (slow, sustained) to reduce noise while catching real outages quickly.

Examples

- alert: HighErrorBurnRate
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[1h])
        / rate(http_requests_total[1h]) > 0.02
    ) and (
      rate(http_requests_total{status=~"5.."}[5m])
        / rate(http_requests_total[5m]) > 0.02
    )
  for: 2m

ALERTS{alertname="X", alertstate="firing"}

Prometheus synthesizes an `ALERTS` metric for every active alert. Query it to check alert state, build alert-on-alert rules, or join with other metrics.

Examples

ALERTS{job="api", alertstate="firing"}

# how many alerts are currently firing:
count(ALERTS{alertstate="firing"})

Inhibit rule (Alertmanager)

Alertmanager inhibit rules suppress certain alerts when another alert is firing. Example: suppress service alerts when the entire node is down.

Examples

inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: ".*"
    equal:
      - instance

Recording Rules (4)

Recording rule YAML structure

A recording rule pre-computes an expensive expression and stores it as a new metric. Run in the same `rules:` block as alerting rules, under a `groups` key.

Examples

groups:
  - name: request_rates
    interval: 1m  # optional: override global evaluation_interval
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

level:metric:operations (naming convention)

`level` = aggregation scope (job, instance, cluster); `metric` = base metric name without `_total`/`_seconds`; `operations` = colon-separated PromQL ops left-to-right.

Examples

job:http_requests:rate5m              # per-job rate over 5m

cluster:http_requests:rate5m_sum      # cluster-level sum

instance:node_cpu:rate5m              # per-instance CPU rate

job:http_request_duration_seconds:p95_5m  # P95 latency

When to create recording rules

Create recording rules for: (1) queries that take > 1s to evaluate, (2) `rate + sum` over many series (expensive), (3) expressions used in both dashboards AND alerts, (4) subqueries that are evaluated repeatedly.

Examples

# before (in every dashboard panel and alert):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

# after (one recording rule, referenced everywhere):
job:http_request_duration_seconds:p95_5m

Rule file reload

Reload Prometheus rule files without restarting: send SIGHUP, call `POST /-/reload`, or run `promtool check rules file.yml` to validate first.

Examples

kill -HUP <prometheus_pid>

curl -X POST http://localhost:9090/-/reload

promtool check rules /etc/prometheus/rules/*.yml

HTTP API (9)

GET /api/v1/query

Instant query. Evaluates the PromQL expression at a single point in time. Params: `query` (required), `time` (Unix or RFC3339, default: now), `timeout`.

Examples

curl "http://localhost:9090/api/v1/query?query=up&time=2024-01-01T00:00:00Z"

curl -G --data-urlencode 'query=rate(http_requests_total[5m])' http://localhost:9090/api/v1/query

GET /api/v1/query_range

Range query. Evaluates the expression over a time range and returns a matrix. Params: `query`, `start`, `end` (Unix or RFC3339), `step` (duration or seconds).

Examples

curl "http://localhost:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z&step=60"

GET /api/v1/series

Return all series matching one or more selectors. Params: `match[]` (one or more selector expressions), `start`, `end`.

Examples

curl "http://localhost:9090/api/v1/series?match[]=http_requests_total&match[]=up"

GET /api/v1/label/<name>/values

List all known values for a given label name across all time series. Useful for building dynamic dashboards and autocomplete.

Examples

curl http://localhost:9090/api/v1/label/job/values

curl http://localhost:9090/api/v1/label/instance/values

GET /api/v1/targets

Return information about all current scrape targets: health state, labels, last scrape time, and last error. Filter with `state=active|dropped|any`.

Examples

curl "http://localhost:9090/api/v1/targets?state=active"

GET /api/v1/rules

Return all loaded alerting and recording rules. Filter with `type=alert|record`. Includes rule state, last evaluation time, and last error.

Examples

curl "http://localhost:9090/api/v1/rules?type=alert"

GET /api/v1/alerts

Return all currently active (pending or firing) alerts. Each entry includes the alert name, labels, state, activeAt timestamp, and current value.

Examples

curl http://localhost:9090/api/v1/alerts

POST /api/v1/admin/tsdb/delete_series

Delete all data for series matching the given selectors. Requires `--web.enable-admin-api` flag. Does NOT free disk space until the next compaction.

⚠ Gotcha: This API is destructive and irreversible. Use `GET /api/v1/series` to verify what will be deleted first.

Examples

curl -X POST "http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=up{job=\"test\"}"

GET /api/v1/metadata

Return metric metadata (type, help text) as registered by scrape targets. Params: `metric` to filter by name, `limit` to cap results.

Examples

curl "http://localhost:9090/api/v1/metadata?metric=http_requests_total"

Relabeling (8)

replace (default action)

Evaluate `regex` against the concatenated `source_labels` (joined by `separator`). If it matches, write the expanded `replacement` into `target_label`.

Examples

# extract hostname from "host:port" in __address__:
- source_labels: [__address__]
  regex: "([^:]+)(:\d+)?"
  target_label: instance
  replacement: "$1"

keep

Keep ONLY the targets/series where `source_labels` concatenated matches `regex`. All other targets are dropped.

Examples

- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: "true"

drop

Drop targets/series where `source_labels` concatenated matches `regex`. The inverse of `keep`.

Examples

- source_labels: [__meta_kubernetes_namespace]
  action: drop
  regex: "kube-system|monitoring"

labelmap

Copy all labels whose name matches `regex` to new labels, replacing the name with `replacement`. Useful for promoting Kubernetes annotations/labels.

Examples

# promote k8s labels to Prometheus labels:
- action: labelmap
  regex: "__meta_kubernetes_pod_label_(.+)"
  replacement: "$1"

labeldrop / labelkeep

`labeldrop` removes all labels whose name matches `regex`. `labelkeep` removes all labels whose name does NOT match `regex`. Applied AFTER relabeling.

⚠ Gotcha: Dropping too many labels can cause metric collision — two formerly distinct series may become identical without their distinguishing labels.

Examples

# drop all labels starting with "tmp_":
- action: labeldrop
  regex: "tmp_.*"

# keep only essential labels:
- action: labelkeep
  regex: "job|instance|env"

hashmod

Hash `source_labels` and take the modulo. Write the result to `target_label`. Used for sharding Prometheus scrape pools across multiple Prometheus instances.

Examples

- source_labels: [__address__]
  modulus: 4         # 4 Prometheus shards
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: "0"         # this shard handles only hash==0 targets
  action: keep

lowercase / uppercase

`lowercase` converts `source_labels` to lowercase and writes to `target_label`. `uppercase` does the reverse. Available in Prometheus ≥ 2.36.

Examples

- source_labels: [__meta_kubernetes_pod_name]
  action: lowercase
  target_label: pod_name

Common pattern: port from __address__

Extract just the host or port from the `__address__` label using `replace` + regex capture groups. A very common pattern in Kubernetes service discovery.

Examples

# set the scrape port from an annotation:
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: "([^:]+)(?::\d+)?;(\d+)"
  replacement: "$1:$2"
  target_label: __address__

What this tool does

Searchable Prometheus cheat sheet with 90+ entries across nine sections. Selectors: instant vector `metric{label="val"}`, range vector `metric[5m]`, matchers `=` `!=` `=~` `!~`, offset, `@` anchor, subquery syntax. Aggregation: `sum` `avg` `max` `min` `count` `topk` `bottomk` with `by`/`without`. Functions: `rate` `irate` `increase` for counters; `delta` `predict_linear` for gauges; `histogram_quantile` for Histograms; `label_replace` `label_join`; `absent` `absent_over_time`; `time()` `hour()` `day_of_week()`. Binary operators: arithmetic `+ - * / % ^`; comparison with `bool` modifier; set operators `and or unless`; vector matching `on()` `ignoring()` `group_left()` `group_right()`. Metric types: Counter vs Gauge vs Histogram vs Summary — when to use each, `_total` `_bucket` `_count` `_sum` suffixes, Summary aggregation gotcha. Alerting rules: full YAML structure, `for` clause, labels/annotations with Go templates `{{ $value | humanize }}`, multi-window burn rate pattern, inhibit rules. Recording rules: naming convention `level:metric:operations`, when to pre-compute, rule file reload. HTTP API: `/api/v1/query`, `/api/v1/query_range`, series/label/target/rules endpoints. Relabeling: `replace` `keep` `drop` `labelmap` `labeldrop` `labelkeep` `hashmod` `lowercase`. Every entry has bilingual text, copy-ready examples, and pitfall callouts. Search, category chips, one-click copy — all in-browser.

Tool details

Input: Text; The page exposes text boxes, numeric controls, file pickers, or structured inputs depending on the tool.
Output: Live result + Copy; The result area focuses on usable output, with copy, download, or preview actions when supported.
Privacy: Browser-side processing; The main tool logic does not call an external API, so inputs normally stay in the current tab.
Save / share: Shareable URL state; Key settings are encoded in the URL so another person can reopen the same setup.
Performance budget: Initial JS <= 42 KB; No WASM budget is declared, keeping the tool quick to open on mobile.
Best fit: Developer & DevOps · Developer; Category and role tags drive related tools, internal links, and quick fit checks.

How to use

1. Input

Paste or drop your content into the tool panel.
2. Process

Click the button. All processing is local in your browser.
3. Copy / Download

Copy the result or download to disk in one click.

How Prometheus Cheatsheet fits into your work

Use it in the small gaps between coding, reviewing, debugging, and shipping.

Developer jobs

Formatting, validating, shrinking, or inspecting code-adjacent text.
Preparing snippets for documentation, tickets, commits, or handoff.
Checking a small payload quickly without switching tools.

Developer checks

Run irreversible transforms like minify or obfuscate on a copy.
Keep secrets out of pasted snippets unless the tool explicitly stays local.
Use your normal tests or linter before shipping transformed code.

Good next steps

These links move the current task into a more complete workflow.

Real-world use cases

Debugging a sudden spike in error rate during an incident
It is 3am and the error rate alert fired. You open the cheat sheet, grab `rate(http_requests_total{status=~"5.."}[5m])`, add `by (handler)` to find the noisy endpoint, then use `topk(5, …)` to surface the worst offenders. All from memory? No — from copy-paste in under two minutes.
Writing your first multi-window alert rule
You want an alert that is sensitive to short outages but does not page for a single bad scrape. You look up the multi-window pattern, copy the `for: 5m` block with the `short_window` and `long_window` expressions, and fill in your metric name. The annotations section shows you exactly how to use `{{ $labels.job }}` and `{{ $value | humanizePercentage }}` without guessing the syntax.
Pre-computing an expensive query as a recording rule
Your dashboard's 95th-percentile latency panel takes 8 seconds to load because it runs `histogram_quantile(0.95, sum(rate(…)) by (le))` over 400 series every refresh. You look up the recording-rule naming convention, create `job:request_duration_seconds:p95_5m`, and drop the recording into both the dashboard and the alert. Load time drops to under 200ms.

Common pitfalls

Using `irate()` in dashboards — it shows instantaneous spikes that look dramatic but are mostly scrape-timing noise. Use `rate()` for trends.
Writing `sum(rate(hist_bucket[5m]))` without `by (le)` before passing to `histogram_quantile()` — the `le` label must survive the aggregation.
Using Summary metrics and then trying to aggregate quantiles across instances — pre-computed quantiles are not additive, only Histogram buckets are.
Forgetting the `_total` suffix in counter names — Prometheus convention is `http_requests_total`, not `http_requests`.

Privacy

Everything runs in your browser. The cheat sheet is a static in-memory array and the search box, category chips, and copy button never make a network request. Nothing you type is logged or sent anywhere, and no input is written to the URL. Works offline, behind a corporate proxy, or on an air-gapped jump host.

FAQ

Tool combos

Folks in your role tend to reach for these alongside this tool.

Browse all tools for this role

Prometheus Cheatsheet — 90+ PromQL Queries, Alerting Rules, and HTTP API Reference

What this tool does

Tool details

How to use

1. Input

2. Process

3. Copy / Download

How Prometheus Cheatsheet fits into your work

Developer jobs

Developer checks

Good next steps

Real-world use cases

Debugging a sudden spike in error rate during an incident

Writing your first multi-window alert rule

Pre-computing an expensive query as a recording rule

Common pitfalls

Privacy

FAQ

Bash Cheatsheet

Docker Cheatsheet

kubectl Cheatsheet

Nginx Cheatsheet

AWS CLI Cheatsheet

Git Cheatsheet

AI Eval Planner

Apache Cheatsheet

API Key Generator

API Rate Limit Cheatsheet

App Store Keywords Checklist

ASCII Table Generator