How to enable monitoring
The Canonical Observability Stack (COS) is a set of tools that facilitates gathering, processing, and visualising telemetry signals generated by workloads inside and outside of Juju, and setting up alerts on them.
The etcd charm can be integrated with COS to get monitoring dashboards in Grafana, metrics and alert rules in Prometheus, and log aggregation in Loki.
Deploy COS
Deploy the cos-lite bundle in a Kubernetes controller by following the MicroK8s guide.
Since the etcd charm is deployed directly on a cloud infrastructure environment (rather than on Kubernetes), the endpoints of the COS relations must be offered for cross-model integration using the offers-overlay. For instructions on how to do this, see the section Deploy the COS Lite bundle with overlays of the MicroK8s guide.
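For reference, a typical invocation looks like the following. This is only a sketch: it assumes the overlay file published in the cos-lite bundle repository, whose location may change.
curl -L https://raw.githubusercontent.com/canonical/cos-lite-bundle/main/overlays/offers-overlay.yaml -O
juju deploy cos-lite --trust --overlay ./offers-overlay.yaml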
Once the COS bundle is deployed, you should have Grafana and Prometheus running in your Kubernetes cluster.
The output of juju status should look similar to the following:
Model Controller Cloud/Region Version SLA Timestamp
cos k8s microk8s/localhost 3.6.4 unsupported 12:26:22Z
App Version Status Scale Charm Channel Rev Address Exposed Message
alertmanager 0.27.0 active 1 alertmanager-k8s latest/stable 154 10.152.183.216 no
catalogue active 1 catalogue-k8s latest/stable 80 10.152.183.174 no
grafana 9.5.3 active 1 grafana-k8s latest/stable 138 10.152.183.234 no
loki 2.9.6 active 1 loki-k8s latest/stable 186 10.152.183.31 no
prometheus 2.52.0 active 1 prometheus-k8s latest/stable 232 10.152.183.55 no
traefik 2.11.0 active 1 traefik-k8s latest/stable 232 10.152.183.116 no Serving at 125.121.179.159
Unit Workload Agent Address Ports Message
alertmanager/0* active idle 10.1.153.187
catalogue/0* active idle 10.1.153.177
grafana/0* active idle 10.1.153.189
loki/0* active idle 10.1.153.188
prometheus/0* active idle 10.1.153.190
traefik/0* active idle 10.1.153.186 Serving at 125.121.179.159
Offer interfaces from the COS model
While still in the COS model, offer the COS interfaces for cross-model integration with the machine model that you are using to deploy the etcd charm:
juju offer grafana:grafana-dashboard grafana-dashboards
juju offer loki:logging loki-logging
juju offer prometheus:receive-remote-write prometheus-receive-remote-write
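You can optionally verify that the offers exist by listing them from the COS model:
juju offers
Each offer should report the endpoint it exposes and the number of connections.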
Consume offers on the etcd model¶
Switch to the machine model running etcd and run the following commands to consume the COS offers:
juju switch <machine_model_name>
juju consume k8s:admin/cos.prometheus-receive-remote-write
juju consume k8s:admin/cos.loki-logging
juju consume k8s:admin/cos.grafana-dashboards
In the commands above, k8s refers to the controller where the COS Lite bundle is deployed, and cos refers to the COS model.
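If your controller, model, or offer names differ, adjust the commands accordingly. The general form is:
juju consume <controller>:<user>/<model>.<offer-name>
where <user> is admin for a locally bootstrapped controller.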
Deploy grafana-agent
The Grafana agent is a lightweight, open-source telemetry collector that runs on your host and forwards metrics and logs to the observability stack. On machine clouds it is deployed as a subordinate charm that runs alongside each etcd unit.
To deploy the Grafana agent, run the following command:
juju deploy grafana-agent --base ubuntu@24.04
Once the Grafana agent is deployed, integrate it with the etcd charm by running the following command:
juju integrate grafana-agent charmed-etcd
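To confirm that the subordinate relation was created, you can optionally inspect the relations in the model:
juju status --relations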
Once the model stabilises, you should see the Grafana agent running as a subordinate unit alongside each etcd unit. Run juju status to see the status of the etcd model. The output should look similar to the following:
Model Controller Cloud/Region Version SLA Timestamp
etcd vm localhost/localhost 3.6.4 unsupported 12:33:25Z
App Version Status Scale Charm Channel Rev Exposed Message
charmed-etcd active 3 charmed-etcd 0 no
grafana-agent blocked 3 grafana-agent latest/edge 457 no Missing ['grafana-cloud-config']|['grafana-dashboards-provider']|['logging-consumer']|['send-remote-write'] for cos-a...
Unit Workload Agent Machine Public address Ports Message
charmed-etcd/0* active idle 0 10.8.159.214
grafana-agent/3 blocked idle 10.8.159.214 Missing ['grafana-cloud-config']|['grafana-dashboards-provider']|['logging-consumer']|['send-remote-write'] for cos-a...
charmed-etcd/1 active idle 1 10.8.159.42
grafana-agent/2 blocked idle 10.8.159.42 Missing ['grafana-cloud-config']|['grafana-dashboards-provider']|['logging-consumer']|['send-remote-write'] for cos-a...
charmed-etcd/2 active idle 2 10.8.159.235
grafana-agent/0* blocked idle 10.8.159.235 Missing ['grafana-cloud-config']|['grafana-dashboards-provider']|['logging-consumer']|['send-remote-write'] for cos-a...
Machine State Address Inst id Base AZ Message
0 started 10.8.159.214 juju-604e01-0 ubuntu@24.04 Running
1 started 10.8.159.42 juju-604e01-1 ubuntu@24.04 Running
2 started 10.8.159.235 juju-604e01-2 ubuntu@24.04 Running
Currently, grafana-agent is blocked because it is missing the required relations. To unblock the grafana-agent units, run the following commands:
juju integrate grafana-agent prometheus-receive-remote-write
juju integrate grafana-agent loki-logging
juju integrate grafana-agent grafana-dashboards
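Instead of polling juju status manually, you can optionally wait for the agent to settle with juju wait-for (available in Juju 3.x), for example:
juju wait-for application grafana-agent --query='status=="active"'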
Once the model stabilises, the grafana-agent units should become active. They periodically scrape metrics from the etcd units and forward them, along with logs, to Prometheus and Loki in the COS model. Run juju status to see the status of the etcd model. The output should look similar to the following:
Model Controller Cloud/Region Version SLA Timestamp
etcd vm localhost/localhost 3.6.4 unsupported 12:36:27Z
SAAS Status Store URL
grafana-dashboards active k8s admin/cos.grafana-dashboards
loki-logging active k8s admin/cos.loki-logging
prometheus-receive-remote-write active k8s admin/cos.prometheus-receive-remote-write
App Version Status Scale Charm Channel Rev Exposed Message
charmed-etcd active 3 charmed-etcd 0 no
grafana-agent active 3 grafana-agent latest/edge 457 no tracing: off
Unit Workload Agent Machine Public address Ports Message
charmed-etcd/0* active idle 0 10.8.159.214
grafana-agent/3 active executing 10.8.159.214 tracing: off
charmed-etcd/1 active idle 1 10.8.159.42
grafana-agent/2 active executing 10.8.159.42 tracing: off
charmed-etcd/2 active idle 2 10.8.159.235
grafana-agent/0* active executing 10.8.159.235 tracing: off
Machine State Address Inst id Base AZ Message
0 started 10.8.159.214 juju-604e01-0 ubuntu@24.04 Running
1 started 10.8.159.42 juju-604e01-1 ubuntu@24.04 Running
2 started 10.8.159.235 juju-604e01-2 ubuntu@24.04 Running
Access the Grafana dashboard
To list the web endpoints exposed by COS through Traefik, switch back to the COS model and run the following commands:
juju switch k8s:cos
juju run traefik/0 show-proxied-endpoints --format=yaml \
| yq '."traefik/0".results."proxied-endpoints"' \
| jq
The output lists the endpoints proxied by Traefik and should look similar to the following:
{
"traefik": {
"url": "http://125.121.179.159"
},
"prometheus/0": {
"url": "http://125.121.179.159/cos-prometheus-0"
},
"loki/0": {
"url": "http://125.121.179.159/cos-loki-0"
},
"alertmanager": {
"url": "http://125.121.179.159/cos-alertmanager"
},
"catalogue": {
"url": "http://125.121.179.159/cos-catalogue"
}
}
To get the Grafana dashboard URL and the admin password, run the following command:
juju run grafana/leader get-admin-password --model cos
The output should look similar to the following:
admin-password: GfPP7OBmvJOe
url: http://125.121.179.159/cos-grafana
You can now access the Grafana dashboard using the URL and admin password.
Access the etcd dashboard
A default dashboard for etcd is available in Grafana. To access the etcd dashboard, follow these steps:
1. Head to the Grafana dashboard URL.
2. Log in using the admin username and password.
3. Head to the dashboards section and search for the etcd dashboard.
The etcd dashboard displays the metrics collected by the Grafana agent from the etcd units.
Metrics collected by the Grafana agent
The metrics exposed by etcd are detailed in the upstream etcd documentation.
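To inspect a specific metric outside of Grafana, you can query Prometheus directly through its HTTP API, using the Traefik-proxied endpoint obtained earlier (replace the address below, taken from the example output above, with your own). For example, the following query checks whether each etcd server currently reports a leader:
curl -s http://125.121.179.159/cos-prometheus-0/api/v1/query --data-urlencode 'query=etcd_server_has_leader' | jq '.data.result'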
Default alerts for the etcd charm
The etcd charm comes with default alert rules, which are forwarded to Prometheus through the Grafana agent and evaluated against the metrics it collects. The default alerts are detailed in the etcd documentation.
They include alerts for the following (a way to verify that the rules are loaded in Prometheus is shown after the list):
etcdMembersDown
Description: Alerts when etcd cluster members are down.
Trigger: Triggers if any etcd member is down or if there are excessive network peer sent failures.
Severity: warning.
Duration: 20 minutes.
Details: Detects if the count of running etcd instances is zero or if the rate of network peer sent failures exceeds a threshold, indicating communication issues.
etcdInsufficientMembers
Description: Alerts when the etcd cluster has an insufficient number of members.
Trigger: Triggers if the number of running etcd members is less than the quorum needed.
Severity: critical.
Duration: 3 minutes.
Details: Checks if the alive members are less than the majority required for the etcd cluster to function correctly.
etcdNoLeader
Description: Alerts when an etcd cluster member has no leader.
Trigger: Triggers when an etcd member reports that it has no leader.
Severity: critical.
Duration: 1 minute.
Details: Indicates a critical issue where an etcd instance is unable to find a leader, signifying cluster instability.
etcdHighNumberOfLeaderChanges
Description: Alerts when the etcd cluster has a high number of leader changes.
Trigger: Triggers if the number of leader changes exceeds a threshold within a specified time frame.
Severity: warning.
Duration: 5 minutes.
Details: Detects frequent leader elections, which can indicate performance or stability issues.
etcdHighNumberOfFailedGRPCRequests
Description: Alerts when a high percentage of gRPC requests fail on an etcd instance.
Trigger: Triggers if the failure rate of gRPC requests exceeds 1%.
Severity: warning.
Duration: 10 minutes.
Details: Monitors gRPC request failures, indicating potential communication or processing problems.
etcdHighNumberOfFailedGRPCRequests
Description: Alerts when a high percentage of gRPC requests fail on an etcd instance.
Trigger: Triggers if the failure rate of gRPC requests exceeds 5%.
Severity: critical.
Duration: 5 minutes.
Details: A more severe version of the previous alert, indicating a significant and urgent problem with gRPC request handling.
etcdGRPCRequestsSlow
Description: Alerts when etcd gRPC requests are slow.
Trigger: Triggers if the 99th percentile of gRPC request latency exceeds a threshold.
Severity: critical.
Duration: 10 minutes.
Details: Detects high latency in gRPC requests, impacting performance.
etcdMemberCommunicationSlow
Description: Alerts when etcd cluster member communication is slow.
Trigger: Triggers if the 99th percentile of member communication latency exceeds a threshold.
Severity: warning.
Duration: 10 minutes.
Details: Indicates slow communication between etcd members, which can affect cluster performance.
etcdHighNumberOfFailedProposals
Description: Alerts when the etcd cluster has a high number of proposal failures.
Trigger: Triggers if the rate of failed proposals exceeds a threshold.
Severity: warning.
Duration: 15 minutes.
Details: Monitors failures in proposing changes to the etcd cluster, indicating potential issues with consensus.
etcdHighFsyncDurations
Description: Alerts when etcd cluster 99th percentile fsync durations are too high.
Trigger: Triggers if the 99th percentile of fsync durations exceeds a threshold.
Severity: warning.
Duration: 10 minutes.
Details: Indicates high latency in writing data to disk, potentially impacting performance and data durability. Threshold of 0.5 seconds.
etcdHighFsyncDurations
Description: Alerts when etcd cluster 99th percentile fsync durations are too high.
Trigger: Triggers if the 99th percentile of fsync durations exceeds a higher threshold.
Severity: critical.
Duration: 10 minutes.
Details: A more severe version of the previous alert, indicating a more critical issue with disk write latency. Threshold of 1 second.
etcdHighCommitDurations
Description: Alerts when etcd cluster 99th percentile commit durations are too high.
Trigger: Triggers if the 99th percentile of commit durations exceeds a threshold.
Severity: warning.
Duration: 10 minutes.
Details: Monitors latency in committing transactions to the etcd backend, impacting performance.
etcdDatabaseQuotaLowSpace
Description: Alerts when the etcd cluster database is running full.
Trigger: Triggers if the database size exceeds a percentage of the defined quota.
Severity: critical.
Duration: 10 minutes.
Details: Indicates that the etcd database is approaching its storage limit, which can lead to write failures.
etcdExcessiveDatabaseGrowth
Description: Alerts when the etcd cluster database is growing very fast.
Trigger: Triggers if the predicted database size exceeds the quota within a specified time frame.
Severity: warning.
Duration: 10 minutes.
Details: Predicts potential disk space exhaustion based on recent database growth.
etcdDatabaseHighFragmentationRatio
Description: Alerts when the etcd database size in use is less than 50% of the actual allocated storage.
Trigger: Triggers when the in use database size is less than 50% of the total database size and the in use database size is over 100MB.
Severity: warning.
Duration: 10 minutes.
Details: Indicates high fragmentation, suggesting the need for defragmentation to reclaim disk space.
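To verify that these alert rules have been loaded, you can list the alerting rules known to Prometheus through its HTTP API, again via the Traefik-proxied endpoint (replace the address below with your own):
curl -s http://125.121.179.159/cos-prometheus-0/api/v1/rules | jq '[.data.groups[].rules[] | select(.type == "alerting" and (.name | startswith("etcd"))) | {name, state}]'
The output lists each etcd alert together with its current state (inactive, pending, or firing).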