The DevOps 2.2 Toolkit

Which tool should we choose?

All the tools we listed are (or were) good on their own merits. They differ in many respects and are similar in others.

Nagios and Sensu served us well in the past. They were designed in a different era and are based on principles that are considered obsolete today. They work well with static clusters and with monolithic applications and services running in predefined locations. The metrics they store (or the lack of them) are not suitable for more complex decision making. We would have a hard time using them as a means to accomplish our goal of operating a scheduler like Docker Swarm running in an auto-scalable cluster. Among the solutions we explored, they are the first ones we should discard. Two are out; three are left to choose from.

The dot-separated metrics format used by Graphite is limiting. Replacing elements of a metric path with asterisks (*) is often inadequate for proper filtering, grouping, and other operations. Its query language, when compared with those of InfluxDB and Prometheus, is the main reason we'll discard it.
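To make the limitation more concrete, here is a minimal Python sketch that contrasts filtering dot-separated paths with wildcards against filtering series that carry named labels. It does not use Graphite or Prometheus; the series names, the label names (`env`, `node`, `service`), and the helper functions are all hypothetical and exist only for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical dot-separated series: the meaning of each element
# depends purely on its position in the path.
graphite_series = [
    "prod.node1.api.http.requests",
    "prod.node2.api.http.requests",
    "prod.node1.db.http.requests",
    "test.node1.api.http.requests",
]

# The same hypothetical series expressed with named labels.
labeled_series = [
    {"name": "http_requests", "env": "prod", "node": "node1", "service": "api"},
    {"name": "http_requests", "env": "prod", "node": "node2", "service": "api"},
    {"name": "http_requests", "env": "prod", "node": "node1", "service": "db"},
    {"name": "http_requests", "env": "test", "node": "node1", "service": "api"},
]


def match_path(pattern: str, path: str) -> bool:
    """Match a path element by element; '*' stands for one whole element."""
    pattern_parts = pattern.split(".")
    path_parts = path.split(".")
    return len(pattern_parts) == len(path_parts) and all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts)
    )


def select_paths(pattern: str) -> list:
    """Filter dot-separated series; the caller must know each position."""
    return [s for s in graphite_series if match_path(pattern, s)]


def select_labels(**wanted: str) -> list:
    """Filter labeled series by name; element order does not matter."""
    return [
        s for s in labeled_series
        if all(s.get(key) == value for key, value in wanted.items())
    ]


def count_by(label: str) -> Counter:
    """Group labeled series by any label without knowing its position."""
    return Counter(s[label] for s in labeled_series)


print(select_paths("prod.*.api.http.requests"))
print(select_labels(env="prod", service="api"))
print(count_by("service"))
```

With paths, every query has to skip over the positions it does not care about, and grouping by a single element means knowing its index in every series. With labels, the same selection and grouping refer to elements by name.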

We're left with InfluxDB and Prometheus as finalists, and the differences between them are minor.

InfluxDB and Prometheus are similar in many ways, so the choice is not going to be an easy one. Truth be told, we cannot make a wrong decision. Whichever of the two we choose, the choice will be based on slight differences.

If we did not limit ourselves to open source solutions as the only candidates, the InfluxDB Enterprise version could be the winner due to its scalability. However, we will discard InfluxDB in favor of Prometheus, which provides a more complete solution. More importantly, Prometheus is slowly becoming the de facto standard, at least when working with schedulers. It is the preferred solution in Kubernetes. Docker (and therefore Swarm) is soon going to expose its metrics in Prometheus format. That, in itself, is the tipping point that should make us lean slightly more towards Prometheus.
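For readers who have not seen the Prometheus format mentioned above, the following Python sketch renders one metric in the text-based exposition format: a `# HELP` line, a `# TYPE` line, and one sample per label set. The metric name `example_http_requests_total`, its labels, and its values are hypothetical and are not metrics that Docker exposes; the `render_metric` helper is likewise an illustration, not part of any Prometheus library.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, help_text: str, metric_type: str, samples: list) -> str:
    """Render one metric family as Prometheus exposition text.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            rendered = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
            )
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples; a real exporter would read these from the service.
output = render_metric(
    "example_http_requests_total",
    "Total number of HTTP requests.",
    "counter",
    [
        ({"service": "api", "node": "node1"}, 1027),
        ({"service": "api", "node": "node2"}, 3),
    ],
)
print(output, end="")
```

Any system that serves plain text in this shape over HTTP can be scraped by Prometheus, which is why a scheduler adopting the format matters for the choice of tool.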

The decision is made. We'll use Prometheus to store metrics, to query them, and to trigger alerts.