Prometheus cheat sheet — 90+ entries covering PromQL selectors, aggregations, functions, alerting rules, recording rules, HTTP API, and relabeling.
metric_name
Bare metric name: selects an instant vector of all time series with that name. Returns the most recent sample for each series.
up
node_cpu_seconds_total
http_requests_total
metric{label="value"}
Exact label equality matcher. Filters the instant vector to series where `label` equals `value`.
⚠ Gotcha: Label values are case-sensitive. `{job="API"}` and `{job="api"}` are different series.
up{job="prometheus"}
http_requests_total{method="POST", status="200"}
node_filesystem_avail_bytes{mountpoint="/"}
metric{label!="value"}
Negative equality matcher. Keeps series where `label` does NOT equal `value`. Also matches series where the label is absent.
up{job!="blackbox"}
http_requests_total{env!="dev"}
metric{label=~"regex"}
RE2 regex matcher. Matches series where `label` matches the regex. The regex is anchored at both ends, so `"5.."` matches exactly three characters.
⚠ Gotcha: Regex matchers trigger a full index scan (slower than `=`). Prefer exact matches for high-cardinality labels.
http_requests_total{status=~"5.."}
node_cpu_seconds_total{mode=~"user|system"}
up{instance=~"prod-.*:9090"}
metric{label!~"regex"}
Negative regex matcher. Keeps series where `label` does NOT match the regex.
http_requests_total{env!~"dev|staging"}
node_cpu_seconds_total{mode!~"idle|iowait"}
metric[5m]
Range vector selector. Returns all samples within the past 5 minutes for each series. Required by range functions like `rate()`, `increase()`, `delta()`.
⚠ Gotcha: A range vector cannot be graphed directly — it must be passed to a function like `rate()` first.
rate(http_requests_total[5m])
increase(errors_total[1h])
delta(cpu_temp[10m])
metric offset 5m
Offset modifier. Shifts the evaluation time back by the given duration, so the query reads data from 5 minutes ago instead of "now".
http_requests_total offset 5m
rate(http_requests_total[5m] offset 1h)
# compare current vs one week ago:
rate(requests[5m]) / rate(requests[5m] offset 7d)
metric @ 1609746000
Timestamp modifier (Prometheus ≥ 2.25). Evaluates the selector at a specific Unix timestamp regardless of the query time.
http_requests_total @ 1609746000
rate(http_requests_total[5m] @ start())
rate(http_requests_total[5m] @ end())
{__name__=~"go_.*"}
Use `__name__` as a regular label to select metrics by name pattern. Useful for exploring or for cross-metric operations.
⚠ Gotcha: Selecting many metrics at once with `__name__=~".*"` is extremely expensive, so always narrow the pattern.
{__name__=~"go_.*", job="api"}
{__name__=~"node_memory_.*"}
metric{job="api", env="prod"}
Multiple label matchers are AND-ed together. All conditions must match for a series to be selected.
up{job="api", env="prod", region="eu-west-1"}
http_requests_total{method="GET", status=~"2..", handler!="/health"}
rate(metric[1m])[10m:30s]
Subquery syntax. Evaluates the inner expression at `30s` resolution over the past `10m` and returns a range vector. Needed for `_over_time` functions on range expressions.
⚠ Gotcha: Subqueries are expensive because they re-evaluate the inner expression many times. Use recording rules for frequently-used subqueries.
max_over_time(rate(http_requests_total[1m])[10m:30s])
avg_over_time(node_load1[1h:5m])
sum(metric)
Sum all values across all label dimensions. Returns an instant vector with a single element (not a scalar) that carries no labels.
⚠ Gotcha: Without a `by` clause, `sum()` drops ALL labels. The result has no labels you can join on.
sum(http_requests_total)
sum(rate(http_requests_total[5m]))
sum by (job) (metric)
Aggregate while KEEPING the listed labels. All unlisted labels are dropped. Equivalent to SQL `GROUP BY job`.
sum by (job) (rate(http_requests_total[5m]))
max by (instance, device) (node_disk_read_bytes_total)
sum without (instance) (metric)
Aggregate while DROPPING the listed labels. All other labels are kept. The inverse of `by`.
sum without (instance) (rate(http_requests_total[5m]))
avg without (cpu) (node_cpu_seconds_total)
avg(metric)
Arithmetic mean across all series or within each group defined by `by`/`without`.
avg by (job) (rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
max(metric) / min(metric)
Maximum or minimum value across all matching series. Useful for cluster-wide alerting thresholds.
max by (job) (rate(errors_total[5m]))
min(node_filesystem_avail_bytes / node_filesystem_size_bytes)
count(metric)
Count the number of series in the result vector. Great for "how many instances are up" style queries.
count(up{job="api"} == 1)
count by (job) (up)
topk(5, metric)
Return the K series with the highest values. Useful for finding the most active endpoints or noisiest instances.
⚠ Gotcha: `topk` picks its K series independently at every evaluation step, so a range query (graph) can show more than K series over the whole time range, and the selected set can change between alert evaluations. It is best suited to instant queries and tables.
topk(5, sum by (handler) (rate(http_requests_total[5m])))
topk(10, node_cpu_seconds_total{mode="user"})
bottomk(3, metric)
Return the K series with the lowest values. Useful for finding the least-utilized instances or slowest responders.
bottomk(3, sum by (instance) (rate(requests_total[5m])))
quantile(0.95, metric)
φ-quantile over all series values (not over time). This aggregates across SERIES, not samples. For a percentile over time use `quantile_over_time()`; for a percentile of a distribution recorded in a Histogram use `histogram_quantile()`.
⚠ Gotcha: Do NOT confuse this with `histogram_quantile()`. This gives a percentile across current instance values, not across a distribution.
quantile(0.95, rate(http_request_duration_seconds_sum[5m]))
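To make "across SERIES, not samples" concrete, here is a minimal Python sketch of a φ-quantile over one value per series, using linear interpolation between neighbouring ranks. The function name and the sample values are illustrative assumptions, not Prometheus code.

```python
import math


def quantile_across_series(phi: float, values: list[float]) -> float:
    """phi-quantile over one current value per series (not over time)."""
    if not values:
        return math.nan
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    ordered = sorted(values)
    # Fractional rank inside the sorted values, then interpolate linearly.
    rank = phi * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical per-instance latencies: one value per series at a single instant.
per_instance_latency = [0.10, 0.12, 0.11, 0.95, 0.13]
median = quantile_across_series(0.5, per_instance_latency)
worst = quantile_across_series(1.0, per_instance_latency)
```

One slow instance barely moves the median here, which is why this aggregator is handy for spotting fleet outliers rather than for request-level percentiles.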
count_values("label", metric)
Count series by their value, creating a new label with the value. Useful for counting how many instances have each version number.
count_values("version", build_version)
count_values("status_code", http_response_code)
stddev(metric) / stdvar(metric)
Standard deviation or variance across all series. Useful for detecting outlier instances in a fleet.
stddev by (job) (rate(http_request_duration_seconds_sum[5m]))
rate(counter[5m])
Per-second average rate of increase over the range window, computed from the first and last samples in the window (adjusted for counter resets and extrapolated to the window edges). The correct function for dashboards and alerts on counters.
⚠ Gotcha: The range window should be at least 4× the scrape interval. For a 15s scrape interval, use `[1m]` minimum, because shorter windows become noisy.
rate(http_requests_total[5m])
rate(node_network_receive_bytes_total[5m])
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
irate(counter[5m])
Instantaneous rate, computed from the last two data points only. Captures very short spikes that `rate()` would smooth over.
⚠ Gotcha: One slow scrape cycle creates a fake spike in the graph. Avoid in dashboards; use only for live debugging of active spikes.
irate(http_requests_total[1m])
increase(counter[1h])
Total increase in a counter over the range window. Equivalent to `rate(c[window]) * window_in_seconds`. Handles counter resets.
increase(http_requests_total[1h])
increase(errors_total{job="api"}[24h])
resets(counter[1h])
Number of counter resets within the range window. Any decrease between two consecutive samples counts as a reset (usually a process restart).
resets(http_requests_total[1h])
# alert if more than 3 restarts in an hour (the start time is a gauge that changes on restart, so use changes()):
changes(process_start_time_seconds[1h]) > 3
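How the counter functions above cope with resets can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names are made up for illustration, and the real `rate()` and `increase()` additionally extrapolate to the edges of the range window, which is left out here.

```python
def increase_with_resets(samples: list[float]) -> float:
    """Total counter increase, treating any drop as a restart from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # counter restarted: it counted from 0 up to cur
        else:
            total += cur - prev   # normal monotonic growth
    return total


def count_resets(samples: list[float]) -> int:
    """Number of times the value decreased between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def simple_rate(samples: list[float], timestamps: list[float]) -> float:
    """Per-second rate between first and last sample (no extrapolation)."""
    return increase_with_resets(samples) / (timestamps[-1] - timestamps[0])


# Hypothetical scrapes every 15s; the process restarts before the 4th scrape.
values = [100.0, 130.0, 160.0, 20.0, 50.0]
times = [0.0, 15.0, 30.0, 45.0, 60.0]
total_increase = increase_with_resets(values)   # 30 + 30 + 20 + 30
reset_count = count_resets(values)
per_second = simple_rate(values, times)
```

A plain "last minus first" on the same samples would give -50, which is exactly why `delta()` is the wrong tool for counters.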
delta(gauge[1h])
Difference in value between the first and last sample in the range window. Works on gauges, not counters.
⚠ Gotcha: Do NOT use `delta()` on counters — counters can reset and `delta()` does not handle resets. Use `increase()` instead.
delta(node_memory_MemFree_bytes[1h])
delta(cpu_temp_celsius[10m])
predict_linear(gauge[1h], 3600)
Predicts the value `3600` seconds from now using linear regression on the range window. Great for "disk will fill in X hours" alerts.
# alert: disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
predict_linear(node_memory_MemAvailable_bytes[30m], 3600)
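The regression behind `predict_linear()` can be illustrated with an ordinary least-squares fit. The sketch below is an assumption-level model of the idea (fit a line through the samples, read it off `horizon` seconds after the evaluation time); the function name and sample data are hypothetical.

```python
def predict_linear_sketch(samples: list[tuple[float, float]],
                          eval_time: float, horizon: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated
    `horizon` seconds after `eval_time`."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    # Measure time relative to the evaluation time so the intercept is "now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon


# Hypothetical free-space gauge losing 2 bytes per second.
free_bytes = [(0.0, 1000.0), (60.0, 880.0), (120.0, 760.0), (180.0, 640.0)]
in_two_minutes = predict_linear_sketch(free_bytes, eval_time=180.0, horizon=120.0)
```

Because the fit uses every sample in the window, a longer range smooths out short bursts, at the cost of reacting more slowly to a real change in trend.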
changes(gauge[1h])
Number of times the value changed within the range window. Useful for detecting flapping services or config changes.
changes(up[1h]) > 5 # service is flapping
changes(kube_deployment_spec_replicas[1h])
histogram_quantile(0.95, sum(rate(h_bucket[5m])) by (le))
Compute the φ-quantile (0.95 = 95th percentile) from a Histogram metric. The `le` label (less-than-or-equal) marks bucket boundaries and must be preserved in the aggregation.
⚠ Gotcha: The `by (le)` is REQUIRED. Omitting it drops the bucket boundaries, so the function has nothing to interpolate over and cannot return a meaningful quantile.
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum by (le, job) (rate(grpc_server_handling_seconds_bucket[5m])))
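Why the `le` buckets matter becomes clearer with a small Python sketch of the interpolation for classic (cumulative) buckets. It is a simplified illustration, not the Prometheus source: the function name and bucket data are assumptions, and edge cases such as negative bounds or φ outside (0, 1] are handled only minimally.

```python
import math


def histogram_quantile_sketch(phi: float,
                              buckets: list[tuple[float, float]]) -> float:
    """buckets: (le upper bound, cumulative count) pairs, including +Inf."""
    if not 0 < phi <= 1:
        raise ValueError("this sketch only handles 0 < phi <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan               # without bounds or a +Inf bucket: no answer
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = phi * total
    # First bucket whose cumulative count reaches the requested rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        return buckets[-2][0]         # rank falls in +Inf: cap at highest finite bound
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical request-duration buckets (seconds).
duration_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile_sketch(0.95, duration_buckets)
p99 = histogram_quantile_sketch(0.99, duration_buckets)
```

In this data the P99 lands in the `+Inf` bucket and is capped at `1.0`, the highest finite boundary, which is the symptom to look for when a histogram needs higher buckets.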
absent(metric)
Returns an empty vector when the expression has samples; returns a single element (value 1) when the expression has NO samples. Used to alert when a metric disappears.
⚠ Gotcha: `absent()` can only carry over labels from the equality matchers of a plain selector (`absent(up{job="api"})` yields `{job="api"}`). Regex matchers and wrapped expressions such as `up == 1` yield no labels, so hard-code important labels in the alert `labels:` block.
absent(up{job="api"} == 1)
absent(http_requests_total{env="prod"})
absent_over_time(metric[5m])
Like `absent()` but requires the metric to be absent for the FULL range window before returning 1. Avoids alerts on a single missed scrape.
absent_over_time(up{job="api"}[5m])
label_replace(m, "dst", "$1", "src", "(.*)")
Apply a regex substitution to a label and write the result to a new label. `$1` refers to the first capture group in the regex.
# extract "host" from "host:port" in instance label:
label_replace(up, "host", "$1", "instance", "([^:]+):.*")
label_replace(metric, "short_name", "$1", "handler", "/api/v[0-9]+/(.*)")
label_join(m, "new", ",", "l1", "l2")
Concatenate multiple existing label values with a separator and write the result to a new label.
label_join(up, "node_region", "/", "instance", "region")
abs() / ceil() / floor() / round(m, 0.5)
Absolute value, ceiling, floor, or round to the nearest multiple. Applied element-wise to each series value.
abs(node_filesystem_avail_bytes - node_filesystem_size_bytes / 2)
round(rate(http_requests_total[5m]) * 100, 0.01)
clamp(m, 0, 100) / clamp_min / clamp_max
Clamp all values to the range [min, max]. `clamp_min(m, 0)` forces values ≥ 0; `clamp_max(m, 100)` forces values ≤ 100.
clamp(some_ratio, 0, 1)
clamp_min(node_load1 - 1, 0) # never negative
sort(m) / sort_desc(m)
Sort the result vector by value (ascending or descending). Useful for dashboards to always show worst offenders at the top.
sort_desc(rate(http_requests_total[5m]))
sort(node_filesystem_avail_bytes)
time() / timestamp(m)
`time()` returns the current evaluation timestamp as a scalar. `timestamp(v)` returns the timestamp of each sample in the vector.
# age of the most recent sample in seconds:
time() - timestamp(up{job="api"})
# alert: sample is getting old (older than 2 min; with the default 5m lookback an instant selector stops returning older samples, so a 300s threshold would never match):
time() - timestamp(up) > 120
hour() / minute() / day_of_week() / month()
Time-based functions. `hour()` returns 0-23 UTC; `day_of_week()` returns 0 (Sunday) to 6; `month()` returns 1-12. Useful for business-hours inhibit conditions.
# only alert on business hours UTC+8 (9am-6pm CST is 01:00-10:00 UTC):
hour() >= 1 and hour() < 10
day_of_week() != 0 and day_of_week() != 6 # not weekends
sum_over_time(m[1h]) / avg_over_time / max_over_time
Aggregate a gauge over time within a range window. `sum_over_time` sums all samples; `avg_over_time` averages; `max_over_time` takes the peak.
avg_over_time(node_load1[1h])
max_over_time(go_goroutines[30m])
quantile_over_time(0.95, http_response_time_seconds[1h])
m1 + m2 / m1 - m2 / m1 * scalar
Arithmetic operators: `+` `-` `*` `/` `%` `^`. When applied between two instant vectors, label sets must match exactly (except for `__name__`).
⚠ Gotcha: Arithmetic between two vectors requires matching labels. Use `on()` or `ignoring()` to control matching.
# error ratio:
rate(errors_total[5m]) / rate(requests_total[5m])
# bytes to megabytes:
node_memory_MemAvailable_bytes / 1024 / 1024
# fraction used (multiply by 100 for a percentage):
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
m1 > bool m2
Comparison operators: `==` `!=` `>` `<` `>=` `<=`. Without `bool`, they FILTER series (non-matching series are dropped). With `bool`, they CONVERT to 0/1.
# filter: only series where value > 0.9:
rate(errors_total[5m]) / rate(requests_total[5m]) > 0.9
# convert to 0/1 for arithmetic:
(up == bool 1) * 100
m1 and m2
Set intersection. Returns series from m1 that have a matching label set in m2. Does not merge values; only m1 values are used.
# only show CPU metrics for up instances:
node_cpu_seconds_total and on(instance) up == 1
m1 or m2
Set union. Returns all series from m1, plus series from m2 that have no matching label set in m1.
# combine metrics from two jobs when one may be absent:
metric{job="a"} or metric{job="b"}
m1 unless m2
Set difference. Returns series from m1 that do NOT have a matching label set in m2.
# exclude maintenance windows:
rate(errors_total[5m]) unless on(instance) maintenance_mode == 1
m1 * on(instance) m2
`on()` restricts vector matching to only the specified labels. All other labels are ignored for the purpose of pairing samples.
# join error rate with instance metadata:
rate(errors_total[5m]) * on(instance) group_left(version) app_info
m1 * ignoring(env) m2
`ignoring()` excludes the listed labels from the matching key. Use when series differ only in a label that should not affect pairing.
requests_total * ignoring(status) error_total
m1 * on(instance) group_left(version) m2
`group_left()` allows many-to-one matching: multiple series from the left can match one series on the right. Listed labels are copied from the right.
⚠ Gotcha: Without `group_left` or `group_right`, many-to-one matches produce an error: "multiple matches for labels".
# enrich metrics with build version from info metric:
rate(http_requests_total[5m]) * on(instance) group_left(version) app_build_info
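The many-to-one pairing described above can be modelled as a keyed join. The Python sketch below is illustrative only: the `Series` alias, function name, and sample data are assumptions, and it covers just the `* on(...) group_left(...)` case.

```python
Series = tuple[dict[str, str], float]


def multiply_group_left(left: list[Series], right: list[Series],
                        on: list[str], copy: list[str]) -> list[Series]:
    """left * on(on) group_left(copy) right, for instant vectors."""
    one_side: dict[tuple[str, ...], Series] = {}
    for labels, value in right:
        key = tuple(labels.get(name, "") for name in on)
        if key in one_side:
            # The "one" side must be unique per matching key.
            raise ValueError(f"multiple matches on the right side for {key}")
        one_side[key] = (labels, value)

    result: list[Series] = []
    for labels, value in left:
        match = one_side.get(tuple(labels.get(name, "") for name in on))
        if match is None:
            continue                   # unmatched left series are dropped
        right_labels, right_value = match
        merged = dict(labels)          # keep every label of the "many" side
        for name in copy:
            if name in right_labels:
                merged[name] = right_labels[name]
        result.append((merged, value * right_value))
    return result


# Hypothetical data: two handlers per instance, one info series per instance.
request_rates = [
    ({"instance": "a:8080", "handler": "/login"}, 2.5),
    ({"instance": "a:8080", "handler": "/search"}, 7.0),
    ({"instance": "b:8080", "handler": "/login"}, 1.0),
]
build_info = [
    ({"instance": "a:8080", "version": "1.2"}, 1.0),
    ({"instance": "b:8080", "version": "1.3"}, 1.0),
]
enriched = multiply_group_left(request_rates, build_info,
                               on=["instance"], copy=["version"])
```

Since an info metric always has the value 1, the multiplication leaves the rates untouched and only the copied `version` label changes.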
scalar(m) / vector(s)
`scalar(v)` converts a single-element vector to a scalar value (it returns NaN if the vector does not have exactly one element). `vector(s)` converts a scalar to a one-element vector with no labels.
# normalize by cluster total:
rate(requests_total[5m]) / scalar(sum(rate(requests_total[5m])))
vector(1) # always returns 1 with no labels
Counter
Monotonically increasing value. Always use `rate()` or `increase()`; the raw value is meaningless by itself. Suffix convention: `_total`.
⚠ Gotcha: Never use `delta()` or gauge-style functions on Counters, because they do not handle counter resets correctly.
# good:
rate(http_requests_total[5m])
# bad (raw counter value, only useful at a fixed moment in time):
http_requests_total
Gauge
A value that can go up or down: memory usage, temperature, queue length, number of goroutines. Use directly or with `delta()`, `avg_over_time()`, `predict_linear()`.
node_memory_MemAvailable_bytes # current free memory
avg_over_time(go_goroutines[5m]) # average over window
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) # forecast
Histogram
Counts observations in configurable buckets. Exposes three series per base name: `_bucket{le="..."}`, `_count`, `_sum`. Use `histogram_quantile()` for percentiles.
⚠ Gotcha: Bucket boundaries must be configured at instrumentation time. If your P99 always hits the highest bucket, your histogram needs higher buckets.
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# request rate via histogram count:
rate(http_request_duration_seconds_count[5m])
Summary
Pre-computes quantiles on the client. Exposes `_count`, `_sum`, and `{quantile="0.99"}` label pairs. Cannot be aggregated across instances.
⚠ Gotcha: NEVER `sum()` Summary quantile series across instances, because summing pre-computed quantiles is mathematically incorrect. Use Histograms for anything needing aggregation.
# correct: per-instance summary:
go_gc_duration_seconds{quantile="0.99"}
# wrong:
sum by (job) (go_gc_duration_seconds{quantile="0.99"}) # DO NOT DO THIS
Naming conventions
Metric names use snake_case. Suffix conventions: `_total` (counter), `_seconds` (duration), `_bytes` (size), `_ratio` (0–1 fraction), `_info` (metadata gauge always = 1).
http_requests_total # counter
http_request_duration_seconds # histogram or summary
process_resident_memory_bytes # gauge
app_build_info{version="1.2"} # metadata info gauge
Alert rule YAML structure
A complete Prometheus alerting rule defined in a rule file under a `groups` block. The `expr` field is evaluated at each `evaluation_interval`.
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
for: 5m (pending state)
The `for` clause keeps an alert in PENDING state until the condition has been true for the specified duration before transitioning to FIRING. Prevents alerts on transient spikes.
⚠ Gotcha: Without `for`, a single evaluation cycle where the condition is true immediately fires the alert. Always use `for` except for the most critical "instant" alerts.
for: 5m # must be true for 5 minutes
for: 0m # fire immediately (no pending)
for: 1h # sustained disk pressure
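The pending-to-firing transition can be pictured as a tiny state machine driven by rule evaluations. The Python sketch below is a simplified model with hypothetical names; it assumes evenly spaced evaluations and ignores details such as `keep_firing_for` or state restoration after a restart.

```python
def alert_states(condition: list[bool], interval_s: int, for_s: int) -> list[str]:
    """State after each evaluation, given whether `expr` returned a result."""
    states: list[str] = []
    active_since = None                # time the condition first became true
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            active_since = None        # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluations every 60s with `for: 2m`; the condition drops out once.
timeline = alert_states([True, True, True, False, True, True],
                        interval_s=60, for_s=120)
immediate = alert_states([True], interval_s=60, for_s=0)
```

The single false evaluation in the middle sends the alert back to the start of its pending period, which is what filters out transient spikes.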
labels: { severity: critical }
Static labels added to every alert from this rule. Used by Alertmanager to route the alert to the right receiver. Label values can use Go template syntax.
labels:
  severity: critical
  team: platform
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
{{ $labels.instance }} / {{ $value | humanize }}
Go template syntax in alert annotations. `$labels` accesses the alert's label set; `$value` is the current expression value. Built-in Prometheus template functions: `humanize`, `humanizePercentage`, `humanizeDuration`, `title`, `toUpper`.
annotations:
  summary: "Instance {{ $labels.instance }} is down"
  description: >
    Error rate is {{ $value | humanizePercentage }}
    (threshold: 5%). Job: {{ $labels.job }}.
Multi-window alert (burn rate)
Google SRE burn rate pattern: combine a short window (fast, sensitive) and a long window (slow, sustained) to reduce noise while catching real outages quickly.
- alert: HighErrorBurnRate
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[1h])
        / rate(http_requests_total[1h]) > 0.02
    ) and (
      rate(http_requests_total{status=~"5.."}[5m])
        / rate(http_requests_total[5m]) > 0.02
    )
  for: 2m
ALERTS{alertname="X", alertstate="firing"}
Prometheus synthesizes an `ALERTS` metric for every active alert. Query it to check alert state, build alert-on-alert rules, or join with other metrics.
ALERTS{job="api", alertstate="firing"}
# how many alerts are currently firing:
count(ALERTS{alertstate="firing"})
Inhibit rule (Alertmanager)
Alertmanager inhibit rules suppress certain alerts when another alert is firing. Example: suppress service alerts when the entire node is down. Newer Alertmanager releases prefer `source_matchers` / `target_matchers` over the `source_match` / `target_match_re` keys shown here.
inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: ".*"
    equal:
      - instance
Recording rule YAML structure
A recording rule pre-computes an expensive expression and stores it as a new metric. Recording rules live in the same rule files as alerting rules, in the `rules:` list of a group under the `groups` key.
groups:
  - name: request_rates
    interval: 1m # optional: override global evaluation_interval
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
level:metric:operations (naming convention)
`level` = aggregation level and the labels kept in the output (job, instance, cluster); `metric` = the metric name, with `_total` stripped from counters when `rate()` or `irate()` is applied; `operations` = the operations applied to the metric, newest first, joined with underscores.
job:http_requests:rate5m # per-job rate over 5m
cluster:http_requests:rate5m # cluster-level sum (`_sum` is omitted when other operations are present)
instance:node_cpu:rate5m # per-instance CPU rate
job:http_request_duration_seconds:p95_5m # P95 latency
When to create recording rules
Create recording rules for: (1) queries that take > 1s to evaluate, (2) `rate + sum` over many series (expensive), (3) expressions used in both dashboards AND alerts, (4) subqueries that are evaluated repeatedly.
# before (in every dashboard panel and alert):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
# after (one recording rule, referenced everywhere):
job:http_request_duration_seconds:p95_5m
Rule file reload
Reload Prometheus rule files without restarting: send SIGHUP or call `POST /-/reload` (the HTTP endpoint requires the `--web.enable-lifecycle` flag). Run `promtool check rules file.yml` first to validate the files.
kill -HUP <prometheus_pid>
curl -X POST http://localhost:9090/-/reload
promtool check rules /etc/prometheus/rules/*.yml
GET /api/v1/query
Instant query. Evaluates the PromQL expression at a single point in time. Params: `query` (required), `time` (Unix or RFC3339, default: now), `timeout`.
curl "http://localhost:9090/api/v1/query?query=up&time=2024-01-01T00:00:00Z"
curl -G --data-urlencode 'query=rate(http_requests_total[5m])' http://localhost:9090/api/v1/query
GET /api/v1/query_range
Range query. Evaluates the expression over a time range and returns a matrix. Params: `query`, `start`, `end` (Unix or RFC3339), `step` (duration or seconds).
curl "http://localhost:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z&step=60"
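PromQL expressions contain characters such as `[`, `(` and `"` that must be URL-encoded before they go into a query string. A minimal Python sketch of building a `query_range` URL with the standard library follows; the helper name and the base URL are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode


def query_range_url(base: str, query: str, start: str, end: str, step: str) -> str:
    """Build a /api/v1/query_range URL with safely encoded parameters."""
    params = {"query": query, "start": start, "end": end, "step": step}
    # urlencode percent-encodes brackets, parentheses, colons, quotes, etc.
    return f"{base}/api/v1/query_range?{urlencode(params)}"


url = query_range_url(
    "http://localhost:9090",               # assumed local Prometheus address
    "rate(http_requests_total[5m])",
    "2024-01-01T00:00:00Z",
    "2024-01-01T01:00:00Z",
    "60s",
)
print(url)
```

The resulting string can be passed to any HTTP client; the response is JSON with the matrix under `data.result`.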
GET /api/v1/series
Return all series matching one or more selectors. Params: `match[]` (one or more selector expressions), `start`, `end`. With curl, pass `-g` so that `[]` is not interpreted as a glob pattern.
curl -g "http://localhost:9090/api/v1/series?match[]=http_requests_total&match[]=up"
GET /api/v1/label/<name>/values
List all known values for a given label name across all time series. Useful for building dynamic dashboards and autocomplete.
curl http://localhost:9090/api/v1/label/job/values
curl http://localhost:9090/api/v1/label/instance/values
GET /api/v1/targets
Return information about all current scrape targets: health state, labels, last scrape time, and last error. Filter with `state=active|dropped|any`.
curl "http://localhost:9090/api/v1/targets?state=active"
GET /api/v1/rules
Return all loaded alerting and recording rules. Filter with `type=alert|record`. Includes rule state, last evaluation time, and last error.
curl "http://localhost:9090/api/v1/rules?type=alert"
GET /api/v1/alerts
Return all currently active (pending or firing) alerts. Each entry includes the alert name, labels, state, activeAt timestamp, and current value.
curl http://localhost:9090/api/v1/alerts
POST /api/v1/admin/tsdb/delete_series
Delete all data for series matching the given selectors. Requires the `--web.enable-admin-api` flag. The data is only marked as deleted; disk space is reclaimed by a later compaction or by calling the `clean_tombstones` admin endpoint.
⚠ Gotcha: This API is destructive and irreversible. Use `GET /api/v1/series` to verify what will be deleted first.
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=up{job="test"}'
GET /api/v1/metadata
Return metric metadata (type, help text) as registered by scrape targets. Params: `metric` to filter by name, `limit` to cap results.
curl "http://localhost:9090/api/v1/metadata?metric=http_requests_total"
replace (default action)
Evaluate `regex` against the concatenated `source_labels` (joined by `separator`). If it matches, write the expanded `replacement` into `target_label`.
# extract hostname from "host:port" in __address__:
- source_labels: [__address__]
  regex: '([^:]+)(:\d+)?'
  target_label: instance
  replacement: "$1"
keep
Keep ONLY the targets/series where `source_labels` concatenated matches `regex`. All other targets are dropped.
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: "true"
drop
Drop targets/series where `source_labels` concatenated matches `regex`. The inverse of `keep`.
- source_labels: [__meta_kubernetes_namespace]
  action: drop
  regex: "kube-system|monitoring"
labelmap
Copy all labels whose name matches `regex` to new labels, replacing the name with `replacement`. Useful for promoting Kubernetes annotations/labels.
# promote k8s labels to Prometheus labels:
- action: labelmap
  regex: "__meta_kubernetes_pod_label_(.+)"
  replacement: "$1"
labeldrop / labelkeep
`labeldrop` removes all labels whose name matches `regex`. `labelkeep` removes all labels whose name does NOT match `regex`. Relabel rules run in the order they are listed, so place these after any rule that still needs the labels.
⚠ Gotcha: Dropping too many labels can cause metric collision: two formerly distinct series may become identical without their distinguishing labels.
# drop all labels starting with "tmp_":
- action: labeldrop
  regex: "tmp_.*"
# keep only essential labels (include __name__, otherwise the metric name itself is dropped in metric_relabel_configs):
- action: labelkeep
  regex: "__name__|job|instance|env"
hashmod
Hash `source_labels` and take the modulo. Write the result to `target_label`. Used for sharding Prometheus scrape pools across multiple Prometheus instances.
- source_labels: [__address__]
  modulus: 4 # 4 Prometheus shards
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: "0" # this shard handles only hash==0 targets
  action: keep
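To preview how targets would spread across shards, the hashing step can be approximated in Python. The sketch assumes, as far as the Prometheus source is understood here, that `hashmod` takes the MD5 of the joined source label values and reduces the last 8 bytes modulo `modulus`; treat the exact byte handling as an assumption and verify against a real instance before relying on specific shard numbers. The helper name and addresses are made up.

```python
import hashlib
from collections import Counter


def hashmod_sketch(source_values: list[str], modulus: int,
                   separator: str = ";") -> int:
    """Approximate the hashmod relabel action for one target."""
    joined = separator.join(source_values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    # Assumption: the low 8 bytes are read as a big-endian unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


# Hypothetical target addresses spread over 4 shards.
targets = [f"10.0.0.{i}:9100" for i in range(1, 41)]
shard_of = {addr: hashmod_sketch([addr], modulus=4) for addr in targets}
per_shard = Counter(shard_of.values())
print(sorted(per_shard.items()))
```

Each target lands in exactly one shard, and the same address always maps to the same shard, which is what lets every Prometheus instance apply the same config with a different `keep` value.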
lowercase / uppercase
`lowercase` converts the concatenated `source_labels` to lowercase and writes the result to `target_label`. `uppercase` converts to uppercase instead. Available in Prometheus ≥ 2.36.
- source_labels: [__meta_kubernetes_pod_name]
  action: lowercase
  target_label: pod_name
Common pattern: port from __address__
Rewrite the host or port of the `__address__` label using `replace` + regex capture groups. A very common pattern in Kubernetes service discovery.
# set the scrape port from an annotation:
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: '([^:]+)(?::\d+)?;(\d+)'
  replacement: "$1:$2"
  target_label: __address__
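The port-rewrite rule above can be dry-run in Python before it goes into a scrape config. This sketch assumes the default `;` separator and uses `fullmatch` to mimic the fully anchored relabel regex; Python's `re` is not RE2, but this particular pattern behaves the same in both. The helper name is hypothetical.

```python
import re

# Same pattern as the relabel rule: host, optional old port, ";", new port.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the new __address__ value, or the old one if the regex misses."""
    joined = f"{address};{port_annotation}"      # source_labels joined by ";"
    match = ADDRESS_PORT.fullmatch(joined)
    if match is None:
        return address                           # replace is a no-op without a match
    return match.expand(r"\1:\2")                # equivalent of "$1:$2"


with_old_port = rewrite_address("10.0.0.5:8080", "9102")
without_port = rewrite_address("10.0.0.5", "9102")
no_annotation = rewrite_address("10.0.0.5:8080", "")
```

When the annotation is missing, the joined value ends in a bare `;`, the regex does not match, and the original address is kept.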
Searchable Prometheus cheat sheet with 90+ entries across nine sections. Selectors: instant vector `metric{label="val"}`, range vector `metric[5m]`, matchers `=` `!=` `=~` `!~`, offset, `@` anchor, subquery syntax. Aggregation: `sum` `avg` `max` `min` `count` `topk` `bottomk` with `by`/`without`. Functions: `rate` `irate` `increase` for counters; `delta` `predict_linear` for gauges; `histogram_quantile` for Histograms; `label_replace` `label_join`; `absent` `absent_over_time`; `time()` `hour()` `day_of_week()`. Binary operators: arithmetic `+ - * / % ^`; comparison with `bool` modifier; set operators `and or unless`; vector matching `on()` `ignoring()` `group_left()` `group_right()`. Metric types: Counter vs Gauge vs Histogram vs Summary — when to use each, `_total` `_bucket` `_count` `_sum` suffixes, Summary aggregation gotcha. Alerting rules: full YAML structure, `for` clause, labels/annotations with Go templates `{{ $value | humanize }}`, multi-window burn rate pattern, inhibit rules. Recording rules: naming convention `level:metric:operations`, when to pre-compute, rule file reload. HTTP API: `/api/v1/query`, `/api/v1/query_range`, series/label/target/rules endpoints. Relabeling: `replace` `keep` `drop` `labelmap` `labeldrop` `labelkeep` `hashmod` `lowercase`. Every entry has bilingual text, copy-ready examples, and pitfall callouts. Search, category chips, one-click copy — all in-browser.
It is 3am and the error rate alert fired. You open the cheat sheet, grab `rate(http_requests_total{status=~"5.."}[5m])`, add `by (handler)` to find the noisy endpoint, then use `topk(5, …)` to surface the worst offenders. All from memory? No — from copy-paste in under two minutes.
You want an alert that is sensitive to short outages but does not page for a single bad scrape. You look up the multi-window pattern, copy the expression that combines a short (5m) and a long (1h) window together with its `for` clause, and fill in your metric name. The annotations section shows you exactly how to use `{{ $labels.job }}` and `{{ $value | humanizePercentage }}` without guessing the syntax.
Your dashboard's 95th-percentile latency panel takes 8 seconds to load because it runs `histogram_quantile(0.95, sum(rate(…)) by (le))` over 400 series every refresh. You look up the recording-rule naming convention, create `job:http_request_duration_seconds:p95_5m`, and reference the recorded series in both the dashboard and the alert. Load time drops to under 200ms.
Using `irate()` in dashboards — it shows instantaneous spikes that look dramatic but are mostly scrape-timing noise. Use `rate()` for trends.
Writing `sum(rate(hist_bucket[5m]))` without `by (le)` before passing to `histogram_quantile()` — the `le` label must survive the aggregation.
Using Summary metrics and then trying to aggregate quantiles across instances — pre-computed quantiles are not additive, only Histogram buckets are.
Forgetting the `_total` suffix in counter names — Prometheus convention is `http_requests_total`, not `http_requests`.
Everything runs in your browser. The cheat sheet is a static in-memory array and the search box, category chips, and copy button never make a network request. Nothing you type is logged or sent anywhere, and no input is written to the URL. Works offline, behind a corporate proxy, or on an air-gapped jump host.