Using the `datadog` backend class, you can query any metrics available in Datadog to create an SLO.
```yaml
backends:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
```
The following methods are available to compute SLOs with the `datadog` backend:

- `good_bad_ratio` for computing good / bad metrics ratios.
- `query_sli` for computing SLIs directly with Datadog.
- `query_slo` for getting the SLO value from the Datadog SLO endpoint.
Optional arguments to configure Datadog are documented in the Datadog `initialize` method. You can pass them in the `backend` section, such as specifying `api_host: api.datadoghq.eu` in order to use the EU site.
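For example, a backend configured for the EU site could look like the following sketch; `api_host` is simply forwarded to the Datadog client along with any other `initialize` argument:

```yaml
backends:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
    api_host: api.datadoghq.eu  # optional: target the EU site instead of the default US one
```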
The `good_bad_ratio` method is used to compute the ratio between two metrics:
- Good events, i.e. events we consider 'good' from the user perspective.
- Bad or valid events, i.e. events we consider either 'bad' from the user perspective, or all events we consider 'valid' for the computation of the SLO.
This method is often used for availability SLOs, but can be used for other purposes as well (see examples).
Config example:

```yaml
backend: datadog
method: good_bad_ratio
service_level_indicator:
  filter_good: app.requests.count{http.path:/, http.status_code_class:2xx}
  filter_valid: app.requests.count{http.path:/}
```
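For context, here is a minimal sketch of how that SLI section could sit inside a complete SLO configuration; the metadata values and the 0.99 goal are illustrative, and the layout assumes the generator's v2 `ServiceLevelObjective` format:

```yaml
apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: app-availability        # illustrative name
  labels:
    service_name: app
    feature_name: home
    slo_name: availability
spec:
  description: 99% of home page requests return a 2xx status
  backend: datadog
  method: good_bad_ratio
  service_level_indicator:
    filter_good: app.requests.count{http.path:/, http.status_code_class:2xx}
    filter_valid: app.requests.count{http.path:/}
  goal: 0.99
```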
The `query_sli` method is used to query the needed SLI directly from Datadog: Datadog's query language is powerful enough to compute ratios natively.
This makes it flexible enough to accept any Datadog SLI computation, and it can reduce the number of queries made to Datadog.
```yaml
backend: datadog
method: query_sli
service_level_indicator:
  expression: sum:app.requests.count{http.path:/, http.status_code_class:2xx} / sum:app.requests.count{http.path:/}
```
The `query_slo` method is used to query the needed SLO directly from Datadog: Datadog has SLO objects that you can refer to in your config by providing their `slo_id`.
This lets you reuse SLOs already defined in Datadog and reduces the number of queries made to Datadog.
To query the value of a Datadog SLO, simply add a `slo_id` field in the `service_level_indicator` section:
```yaml
backend: datadog
method: query_slo
service_level_indicator:
  slo_id: ${DATADOG_SLO_ID}
```
Complete SLO samples using the `datadog` backend are available in `samples/datadog`. Check them out!
The `datadog` exporter allows exporting SLO metrics to the Datadog API.
```yaml
exporters:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
```
Optional arguments to configure Datadog are documented in the Datadog `initialize` method. As with the backend, you can pass them in the exporter section, such as specifying `api_host: api.datadoghq.eu` in order to use the EU site.
Optional fields:

- `metrics`: [optional] `list` - List of metrics to export (see docs).
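As a sketch, restricting the export to a single renamed metric could look like the following; `error_budget_burn_rate` is one of the generator's default metric names, and the `name` / `alias` keys assume the shared metrics-exporter format described in the metrics docs:

```yaml
exporters:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
    metrics:
      # export only the error budget burn rate, under a custom metric name
      - name: error_budget_burn_rate
        alias: custom.slo.error_budget_burn_rate
```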
The `distribution_cut` method is not currently implemented for Datadog.
The reason for this is that Datadog distributions (or histograms) do not conform to what histograms should be (see old issue), i.e. a set of configurable bins, each providing the number of events falling into it.
Standard histogram representations (see Wikipedia) already implement this, but the approach Datadog took is to pre-compute (client-side) or post-compute (server-side) percentiles, resulting in a different metric for each percentile that holds the percentile value instead of the number of events in the corresponding bin.
This implementation has a couple of advantages, such as making it easy to query and graph the value of the 99th, 95th, or 50th percentiles; but it makes it effectively very hard to compute a standard SLI, since it is not possible to see how many requests fall into each bin: knowing, say, that the 95th-percentile latency is 300 ms tells you nothing about how many requests completed faster than a 250 ms threshold, so there is no way to know how many good and bad events there are.
Three options can be considered to implement this:
- Add support for gostatsd's Timer histogram implementation in `datadog-agent`.
OR
- Implement support for standard histograms where bucketization is configurable and where it's possible to query the number of events falling into each bucket.
OR
- Design an implementation that tries to reconstitute the original distribution by approximating it as a Gaussian and estimating its parameters. This is a complex and time-consuming approach that gives approximate results, and it is not a straightforward problem (see StackExchange thread).