Edge Cluster Monitoring with Kube-Prometheus-Stack, Thanos, and Cilium
In my latest homelab project, I set out to bring observability to the two edge Kubernetes clusters I maintain. One runs on a trio of Raspberry Pi nodes, the other on Oracle Cloud’s always-free tier with four nodes. Both are tight on resources and not well-suited to running a full-blown Kube-Prometheus-Stack. My goal was simple: centralize monitoring into my more capable "production" cluster—where compute and storage are far more plentiful—while keeping the edge clusters as lightweight as possible. That meant rethinking how Prometheus and Thanos were deployed, and figuring out just how minimal an edge monitoring footprint could be while still getting meaningful metrics.
Architecture
When it comes to forwarding metrics from edge clusters to a central store, there are two common approaches with Thanos: running the Thanos Sidecar alongside Prometheus, or pushing metrics via remote write to a central Thanos Receive endpoint. With the sidecar, Prometheus writes blocks locally and the sidecar uploads them to object storage, where the Thanos Store Gateway and Compactor pick them up later; Thanos Query reaches the freshest data by talking to the sidecar directly. This buffers metrics through spotty connectivity and keeps Prometheus fully self-contained. Remote write to Thanos Receive, on the other hand, pushes metrics in near real time to the central cluster, skipping long-term local storage entirely. While this can reduce resource usage on edge clusters, it comes with trade-offs: remote write doesn't deduplicate like the sidecar path does, and dropped samples or missing labels can affect alerting and long-term queries. In my setup, I went with the Thanos Sidecar model for reliability, better deduplication, and compatibility with object storage, which also made block compaction and retention easier to manage centrally.
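For contrast, the remote-write approach I didn't take would have looked roughly like this in an edge cluster's kube-prometheus-stack values. This is only a sketch: the Receive hostname is a placeholder, not something from my setup.

prometheus:
  prometheusSpec:
    externalLabels:
      cluster: edge
    # Stream samples to a central Thanos Receive instead of uploading
    # blocks via a sidecar. The hostname is a placeholder; Receive
    # accepts remote write on port 19291 at /api/v1/receive by default.
    remoteWrite:
      - url: https://thanos-receive.${SECRET_DOMAIN}/api/v1/receive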
Here's what my solution ended up looking like (RPI cluster removed for simplicity):

Cilium
One of the most seamless parts of this project was leveraging Cilium ClusterMesh to enable cross-cluster discovery of Thanos Sidecars. Since ClusterMesh establishes native Kubernetes service connectivity across clusters, I was able to share the kube-prometheus-stack-thanos-discovery service between my edge clusters and the central cluster simply by annotating it in each cluster with service.cilium.io/global: "true". This meant that my central Thanos Query could automatically discover and query sidecars in remote clusters without requiring extra networking plumbing, VPNs, or manual service entries. ClusterMesh essentially made my separate Kubernetes clusters behave like one large mesh, and that simplicity was a huge win for keeping the Thanos integration lightweight and dynamic.
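For reference, the shared service only needs that annotation on every cluster. A trimmed-down sketch of the headless discovery service that kube-prometheus-stack creates looks something like this (selector and ports are simplified, so treat it as illustrative rather than the chart's exact output):

apiVersion: v1
kind: Service
metadata:
  name: kube-prometheus-stack-thanos-discovery
  namespace: monitoring
  annotations:
    # Tells Cilium ClusterMesh to merge endpoints for this service
    # from every cluster that defines it with the same name/namespace.
    service.cilium.io/global: "true"
spec:
  clusterIP: None  # headless, so individual sidecar endpoints stay addressable
  selector:
    app.kubernetes.io/name: prometheus
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc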
⚠️ Warning
As of Cilium 1.17, EndpointSlice synchronization across clusters is still in beta and must be explicitly enabled, otherwise cross-cluster discovery of headless services like this one won't work!
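Enabling it is a Cilium Helm values change on every cluster in the mesh. A minimal sketch, assuming the flag is still named clustermesh.enableEndpointSliceSynchronization in your release (double-check the ClusterMesh docs for your Cilium version):

# Cilium Helm values (sketch) - verify the flag name against the
# Cilium docs for your version before relying on it.
clustermesh:
  useAPIServer: true
  enableEndpointSliceSynchronization: true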
Deploying Kube-Prometheus-Stack
For all my deployments, I am using Flux, hence the HelmRelease CRD. The full source is always available in my homelab repository (you can find my Grafana values there, if you're interested).
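The HelmReleases below show only the values, with the chart source omitted for brevity. With Flux, that part typically looks something like the following; the repository name, namespace, and interval here are illustrative, not copied from my repo.

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community  # assumed HelmRepository name
        namespace: flux-system
  # values: ...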
Central Cluster
In this deployment, we're going to enable Thanos Ruler, Prometheus, and Alertmanager.
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  values:
    defaultRules:
      create: true
      rules:
        etcd: false
    alertmanager:
      config:
        global:
          slack_api_url: "${SECRET_PROMETHEUS_DISCORD_ALERTS_WEBHOOK}"
          resolve_timeout: 5m
        receivers:
          - name: "null"
          - name: "pushover"
            pushover_configs:
              - send_resolved: true
                user_key: "${SECRET_PUSHOVER_USER_KEY}"
                token: "${SECRET_ALERTMANAGER_PUSHOVER_TOKEN}"
                title: |-
                  [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if ne .CommonAnnotations.summary ""}}{{ .CommonAnnotations.summary }}{{ else }}{{ .CommonLabels.alertname }}{{ end }}
                message: >-
                  {{ range .Alerts -}}
                  **Alert:** {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
                  **Description:** {{ if ne .Annotations.description ""}}{{ .Annotations.description }}{{else}}N/A{{ end }}
                  **Details:**
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                  {{ end }}
          - name: "discord"
            slack_configs:
              - channel: "#prometheus-alerts"
                icon_url: https://avatars3.githubusercontent.com/u/3380462
                username: "prom-alert-bot"
                send_resolved: true
                title: |-
                  [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if ne .CommonAnnotations.summary ""}}{{ .CommonAnnotations.summary }}{{ else }}{{ .CommonLabels.alertname }}{{ end }}
                text: >-
                  {{ range .Alerts -}}
                  **Alert:** {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
                  **Description:** {{ if ne .Annotations.description ""}}{{ .Annotations.description }}{{else}}N/A{{ end }}
                  **Details:**
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                  {{ end }}
        route:
          group_by: ["alertname", "job"]
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 6h
          receiver: "null"
          routes:
            - receiver: "null"
              matchers:
                # matchers in a single route are ANDed, so these two alert
                # names need to be combined into one regex matcher
                - alertname =~ "InfoInhibitor|Watchdog"
            - receiver: "discord"
              match_re:
                severity: critical|warning|error
              continue: true
            - receiver: "pushover"
              match_re:
                severity: critical|warning|error
              continue: true
        inhibit_rules:
          - source_match:
              severity: "critical"
            target_match:
              severity: "warning"
            equal: ["alertname", "namespace"]
      ingress:
        enabled: true
        pathType: Prefix
        ingressClassName: "traefik"
        hosts:
          - &host-alert-manager "alert-manager.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-alert-manager
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: "ceph-block"
              resources:
                requests:
                  storage: 6Gi
    nodeExporter:
      enabled: true
    grafana:
      enabled: false
    kube-state-metrics:
      metricLabelsAllowlist:
        - "pods=[*]"
        - "deployments=[*]"
        - "persistentvolumeclaims=[*]"
      prometheus:
        monitor:
          enabled: true
          relabelings:
            - action: replace
              regex: ^(.*)$
              replacement: $1
              sourceLabels: ["__meta_kubernetes_pod_node_name"]
              targetLabel: kubernetes_node
    kubelet:
      enabled: true
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate labels
          - action: keep
            sourceLabels: ["__name__"]
            regex: (apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|authentication_token|cadvisor_version|container_blkio|container_cpu|container_fs|container_last|container_memory|container_network|container_oom|container_processes|container|csi_operations|disabled_metric|get_token|go|hidden_metric|kubelet_certificate|kubelet_cgroup|kubelet_container|kubelet_containers|kubelet_cpu|kubelet_device|kubelet_graceful|kubelet_http|kubelet_lifecycle|kubelet_managed|kubelet_node|kubelet_pleg|kubelet_pod|kubelet_run|kubelet_running|kubelet_runtime|kubelet_server|kubelet_started|kubelet_volume|kubernetes_build|kubernetes_feature|machine_cpu|machine_memory|machine_nvm|machine_scrape|node_namespace|plugin_manager|prober_probe|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scrape_duration|scrape_samples|scrape_series|storage_operation|volume_manager|volume_operation|workqueue)_(.+)
          - action: replace
            sourceLabels: ["node"]
            targetLabel: instance
          # Drop high cardinality labels
          - action: labeldrop
            regex: (uid)
          - action: labeldrop
            regex: (id|name)
          - action: drop
            sourceLabels: ["__name__"]
            regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    kubeApiServer:
      enabled: true
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: (aggregator_openapi|aggregator_unavailable|apiextensions_openapi|apiserver_admission|apiserver_audit|apiserver_cache|apiserver_cel|apiserver_client|apiserver_crd|apiserver_current|apiserver_envelope|apiserver_flowcontrol|apiserver_init|apiserver_kube|apiserver_longrunning|apiserver_request|apiserver_requested|apiserver_response|apiserver_selfrequest|apiserver_storage|apiserver_terminated|apiserver_tls|apiserver_watch|apiserver_webhooks|authenticated_user|authentication|disabled_metric|etcd_bookmark|etcd_lease|etcd_request|field_validation|get_token|go|grpc_client|hidden_metric|kube_apiserver|kubernetes_build|kubernetes_feature|node_authorizer|pod_security|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scrape_duration|scrape_samples|scrape_series|serviceaccount_legacy|serviceaccount_stale|serviceaccount_valid|watch_cache|workqueue)_(.+)
          # Drop high cardinality labels
          - action: drop
            sourceLabels: ["__name__"]
            regex: (apiserver|etcd|rest_client)_request(|_sli|_slo)_duration_seconds_bucket
          - action: drop
            sourceLabels: ["__name__"]
            regex: (apiserver_response_sizes_bucket|apiserver_watch_events_sizes_bucket)
    kubeControllerManager:
      enabled: true
      endpoints: &cp
        - 172.16.20.105
        - 172.16.20.106
        - 172.16.20.107
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: "(apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|attachdetach_controller|authenticated_user|authentication|cronjob_controller|disabled_metric|endpoint_slice|ephemeral_volume|garbagecollector_controller|get_token|go|hidden_metric|job_controller|kubernetes_build|kubernetes_feature|leader_election|node_collector|node_ipam|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|pv_collector|registered_metric|replicaset_controller|rest_client|retroactive_storageclass|root_ca|running_managed|scrape_duration|scrape_samples|scrape_series|service_controller|storage_count|storage_operation|ttl_after|volume_operation|workqueue)_(.+)"
    kubeScheduler:
      enabled: true
      endpoints: *cp
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: "(apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|authenticated_user|authentication|disabled_metric|go|hidden_metric|kubernetes_build|kubernetes_feature|leader_election|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scheduler|scrape_duration|scrape_samples|scrape_series|workqueue)_(.+)"
    kubeProxy:
      enabled: false
    kubeEtcd:
      enabled: true
      endpoints: *cp
      service:
        enabled: true
        port: 2381
        targetPort: 2381
      serviceMonitor:
        scheme: https
        insecureSkipVerify: false
        serverName: localhost
        caFile: /etc/prometheus/secrets/etcd-certs/etcd-ca.crt
        certFile: /etc/prometheus/secrets/etcd-certs/etcd-client.crt
        keyFile: /etc/prometheus/secrets/etcd-certs/etcd-client-key.key
    prometheus:
      ingress:
        enabled: true
        pathType: Prefix
        ingressClassName: "traefik"
        hosts:
          - &host-prometheus "prometheus.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-prometheus
      thanosService:
        enabled: true
        annotations:
          service.cilium.io/global: "true"
      thanosServiceMonitor:
        enabled: true
      thanosIngress:
        enabled: true
        pathType: Prefix
        ingressClassName: "traefik"
        hosts:
          - &host-thanos-sidecar "thanos-sidecar.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-thanos-sidecar
      prometheusSpec:
        replicas: 3
        replicaExternalLabelName: __replica__
        externalLabels:
          cluster: prod
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        probeSelectorNilUsesHelmValues: false
        retention: 2d
        retentionSize: 50GB
        enableAdminAPI: true
        walCompression: true
        ruleSelector:
          matchLabels:
            role: some-fake-nonexistent-role
        secrets:
          - etcd-certs
        thanos:
          image: quay.io/thanos/thanos:v0.38.0
          # renovate: datasource=docker depName=quay.io/thanos/thanos
          version: "v0.38.0"
          objectStorageConfig:
            secret:
              type: S3
              config:
                insecure: true
                # bucket: ""
                # endpoint: ""
                # region: ""
                # access_key: ""
                # secret_key: ""
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-block
              resources:
                requests:
                  storage: 60Gi
    additionalPrometheusRulesMap:
      oom-rules:
        groups:
          - name: oom
            rules:
              - alert: OomKilled
                annotations:
                  summary: Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
                expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
                labels:
                  severity: critical
    thanosRuler:
      enabled: true
      ingress:
        enabled: true
        pathType: Prefix
        ingressClassName: "traefik"
        hosts:
          - &host-thanos-ruler "thanos-ruler.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-thanos-ruler
      thanosRulerSpec:
        replicas: 3
        image:
          # renovate: datasource=docker depName=quay.io/thanos/thanos
          registry: quay.io
          repository: thanos/thanos
          tag: v0.38.0
        alertmanagersConfig:
          secret:
            alertmanagers:
              - api_version: v2
                static_configs:
                  - kube-prometheus-stack-alertmanager:9093
                scheme: http
                timeout: 10s
        objectStorageConfig:
          secret:
            type: S3
            config:
              insecure: true
              # bucket: ""
              # endpoint: ""
              # region: ""
              # access_key: ""
              # secret_key: ""
        queryEndpoints:
          ["dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local"]
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: "ceph-block"
              resources:
                requests:
                  storage: 6Gi
  valuesFrom:
    # Thanos Sidecar
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.bucket
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_NAME
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.endpoint
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_HOST
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.region
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_REGION
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.access_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_ACCESS_KEY_ID
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.secret_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_SECRET_ACCESS_KEY
    # Thanos Ruler
    - targetPath: thanosRuler.thanosRulerSpec.objectStorageConfig.secret.config.bucket
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_NAME
    - targetPath: thanosRuler.thanosRulerSpec.objectStorageConfig.secret.config.endpoint
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_HOST
    - targetPath: thanosRuler.thanosRulerSpec.objectStorageConfig.secret.config.region
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_REGION
    - targetPath: thanosRuler.thanosRulerSpec.objectStorageConfig.secret.config.access_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_ACCESS_KEY_ID
    - targetPath: thanosRuler.thanosRulerSpec.objectStorageConfig.secret.config.secret_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_SECRET_ACCESS_KEY
Some highlights from the values above:
- I'm exposing the Thanos discovery service globally by adding the service.cilium.io/global: "true" annotation in prometheus.thanosService.annotations
- I'm setting my cluster label in prometheus.prometheusSpec.externalLabels
- I'm preventing Prometheus from evaluating rules by giving prometheus.prometheusSpec.ruleSelector a fake label. I do this because I still want PrometheusRules to be created; they will be mounted into Thanos Ruler instead (see the sketch below)
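As an illustration (this is a hypothetical rule, not one from my repo), a PrometheusRule like the one below gets created in the central cluster, skipped by Prometheus thanks to the fake ruleSelector, and evaluated by Thanos Ruler against the global, multi-cluster view:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: edge-node-exporter-down
  namespace: monitoring
spec:
  groups:
    - name: edge-nodes
      rules:
        - alert: EdgeNodeExporterDown
          # Thanos Ruler sees metrics from every cluster, so the cluster
          # external label tells us which edge cluster is affected.
          expr: up{job="node-exporter", cluster!="prod"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: node-exporter on {{ $labels.instance }} in cluster {{ $labels.cluster }} is down.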
Edge Cluster
My edge cluster configuration is quite simple:
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  values:
    defaultRules:
      create: false
    alertmanager:
      enabled: false
    nodeExporter:
      enabled: true
    grafana:
      enabled: false
    kube-state-metrics:
      metricLabelsAllowlist:
        - "pods=[*]"
        - "deployments=[*]"
        - "persistentvolumeclaims=[*]"
      prometheus:
        monitor:
          enabled: true
          relabelings:
            - action: replace
              regex: ^(.*)$
              replacement: $1
              sourceLabels: ["__meta_kubernetes_pod_node_name"]
              targetLabel: kubernetes_node
    kubelet:
      enabled: true
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate labels
          - action: keep
            sourceLabels: ["__name__"]
            regex: (apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|authentication_token|cadvisor_version|container_blkio|container_cpu|container_fs|container_last|container_memory|container_network|container_oom|container_processes|container|csi_operations|disabled_metric|get_token|go|hidden_metric|kubelet_certificate|kubelet_cgroup|kubelet_container|kubelet_containers|kubelet_cpu|kubelet_device|kubelet_graceful|kubelet_http|kubelet_lifecycle|kubelet_managed|kubelet_node|kubelet_pleg|kubelet_pod|kubelet_run|kubelet_running|kubelet_runtime|kubelet_server|kubelet_started|kubelet_volume|kubernetes_build|kubernetes_feature|machine_cpu|machine_memory|machine_nvm|machine_scrape|node_namespace|plugin_manager|prober_probe|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scrape_duration|scrape_samples|scrape_series|storage_operation|volume_manager|volume_operation|workqueue)_(.+)
          - action: replace
            sourceLabels: ["node"]
            targetLabel: instance
          # Drop high cardinality labels
          - action: labeldrop
            regex: (uid)
          - action: labeldrop
            regex: (id|name)
          - action: drop
            sourceLabels: ["__name__"]
            regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    kubeApiServer:
      enabled: true
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: (aggregator_openapi|aggregator_unavailable|apiextensions_openapi|apiserver_admission|apiserver_audit|apiserver_cache|apiserver_cel|apiserver_client|apiserver_crd|apiserver_current|apiserver_envelope|apiserver_flowcontrol|apiserver_init|apiserver_kube|apiserver_longrunning|apiserver_request|apiserver_requested|apiserver_response|apiserver_selfrequest|apiserver_storage|apiserver_terminated|apiserver_tls|apiserver_watch|apiserver_webhooks|authenticated_user|authentication|disabled_metric|etcd_bookmark|etcd_lease|etcd_request|field_validation|get_token|go|grpc_client|hidden_metric|kube_apiserver|kubernetes_build|kubernetes_feature|node_authorizer|pod_security|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scrape_duration|scrape_samples|scrape_series|serviceaccount_legacy|serviceaccount_stale|serviceaccount_valid|watch_cache|workqueue)_(.+)
          # Drop high cardinality labels
          - action: drop
            sourceLabels: ["__name__"]
            regex: (apiserver|etcd|rest_client)_request(|_sli|_slo)_duration_seconds_bucket
          - action: drop
            sourceLabels: ["__name__"]
            regex: (apiserver_response_sizes_bucket|apiserver_watch_events_sizes_bucket)
    kubeControllerManager:
      enabled: true
      endpoints: &cp
        - 172.16.20.111
        - 172.16.20.112
        - 172.16.20.113
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: "(apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|attachdetach_controller|authenticated_user|authentication|cronjob_controller|disabled_metric|endpoint_slice|ephemeral_volume|garbagecollector_controller|get_token|go|hidden_metric|job_controller|kubernetes_build|kubernetes_feature|leader_election|node_collector|node_ipam|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|pv_collector|registered_metric|replicaset_controller|rest_client|retroactive_storageclass|root_ca|running_managed|scrape_duration|scrape_samples|scrape_series|service_controller|storage_count|storage_operation|ttl_after|volume_operation|workqueue)_(.+)"
    kubeScheduler:
      enabled: true
      endpoints: *cp
      serviceMonitor:
        metricRelabelings:
          # Remove duplicate metrics
          - action: keep
            sourceLabels: ["__name__"]
            regex: "(apiserver_audit|apiserver_client|apiserver_delegated|apiserver_envelope|apiserver_storage|apiserver_webhooks|authenticated_user|authentication|disabled_metric|go|hidden_metric|kubernetes_build|kubernetes_feature|leader_election|process_cpu|process_max|process_open|process_resident|process_start|process_virtual|registered_metric|rest_client|scheduler|scrape_duration|scrape_samples|scrape_series|workqueue)_(.+)"
    kubeProxy:
      enabled: false
    kubeEtcd:
      enabled: true
      endpoints: *cp
      service:
        enabled: true
        port: 2381
        targetPort: 2381
      serviceMonitor:
        scheme: https
        insecureSkipVerify: false
        serverName: localhost
        caFile: /etc/prometheus/secrets/etcd-certs/etcd-ca.crt
        certFile: /etc/prometheus/secrets/etcd-certs/etcd-client.crt
        keyFile: /etc/prometheus/secrets/etcd-certs/etcd-client-key.key
    prometheus:
      ingress:
        enabled: true
        pathType: Prefix
        ingressClassName: "nginx"
        hosts:
          - &host-prometheus "prometheus.staging.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-prometheus
      thanosService:
        enabled: true
        annotations:
          service.cilium.io/global: "true"
      thanosServiceMonitor:
        enabled: true
      thanosIngress:
        enabled: true
        pathType: Prefix
        ingressClassName: "nginx"
        hosts:
          - &host-thanos-sidecar "thanos-sidecar.staging.${SECRET_DOMAIN}"
        tls:
          - hosts:
              - *host-thanos-sidecar
      prometheusSpec:
        replicas: 1
        scrapeInterval: 60s
        evaluationInterval: 60s
        replicaExternalLabelName: __replica__
        externalLabels:
          cluster: sj
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        probeSelectorNilUsesHelmValues: false
        retention: 2d
        retentionSize: 20GB
        enableAdminAPI: true
        walCompression: true
        secrets:
          - etcd-certs
        thanos:
          image: quay.io/thanos/thanos:v0.38.0
          # renovate: datasource=docker depName=quay.io/thanos/thanos
          version: "v0.38.0"
          objectStorageConfig:
            secret:
              type: S3
              config:
                insecure: false
                # bucket: ""
                # endpoint: ""
                # region: ""
                # access_key: ""
                # secret_key: ""
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              resources:
                requests:
                  storage: 20Gi
  valuesFrom:
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.bucket
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_NAME
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.endpoint
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_HOST
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.region
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_REGION
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.access_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_ACCESS_KEY_ID
    - targetPath: prometheus.prometheusSpec.thanos.objectStorageConfig.secret.config.secret_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_SECRET_ACCESS_KEY
Note that I disable default rule creation and Alertmanager: all of my rules are evaluated centrally, and I want to keep that overhead off the edge clusters.
If you're curious about how I enable etcd scraping on my Talos cluster, see this post.
Deploying Thanos
Finally, we'll deploy the rest of the Thanos components to the central cluster using the Bitnami Thanos chart. I chose to keep Thanos Ruler out of this release and in Kube-Prometheus-Stack instead, because there the Prometheus Operator helpfully converts all PrometheusRules into ConfigMaps that Ruler automatically ingests.
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: thanos
spec:
  values:
    global:
      security:
        allowInsecureImages: true
    image:
      registry: quay.io
      repository: thanos/thanos
      tag: v0.38.0
    objstoreConfig:
      type: s3
      config:
        insecure: true
    queryFrontend:
      enabled: true
      resourcesPreset: "none"
      replicaCount: 3
      ingress:
        enabled: true
        ingressClassName: traefik
        hostname: &host thanos.${SECRET_DOMAIN}
        tls: true
        extraTls:
          - hosts:
              - *host
    query:
      enabled: true
      resourcesPreset: "none"
      replicaCount: 3
      replicaLabel: ["__replica__"]
      dnsDiscovery:
        sidecarsService: kube-prometheus-stack-thanos-discovery
        sidecarsNamespace: monitoring
    bucketweb:
      enabled: true
      resourcesPreset: "none"
      replicaCount: 3
    compactor:
      enabled: true
      resourcesPreset: "none"
      concurrency: 4
      extraFlags:
        - --delete-delay=30m
      retentionResolutionRaw: 30d
      retentionResolution5m: 60d
      retentionResolution1h: 90d
      persistence:
        enabled: true
        storageClass: ceph-block
        size: 50Gi
    storegateway:
      enabled: true
      resourcesPreset: "none"
      replicaCount: 3
      persistence:
        enabled: true
        storageClass: ceph-block
        size: 20Gi
    ruler:
      enabled: false
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
  valuesFrom:
    - targetPath: objstoreConfig.config.bucket
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_NAME
    - targetPath: objstoreConfig.config.endpoint
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_HOST
    - targetPath: objstoreConfig.config.region
      kind: ConfigMap
      name: thanos-bucket
      valuesKey: BUCKET_REGION
    - targetPath: objstoreConfig.config.access_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_ACCESS_KEY_ID
    - targetPath: objstoreConfig.config.secret_key
      kind: Secret
      name: thanos-bucket
      valuesKey: AWS_SECRET_ACCESS_KEY
With that, we're practically done! Many Grafana dashboards, including the built-in Kubernetes views, handle the cluster label for you, so you can start visualizing right away!