Deploying Prometheus stack
We'll start by cloning the vfarcic/docker-flow-monitor repository from https://github.com/vfarcic/docker-flow-monitor. It contains all the scripts and Docker stacks we'll use throughout this chapter.
git clone \
https://github.com/vfarcic/docker-flow-monitor.git
cd docker-flow-monitor
Before we create a Prometheus service, we need a cluster. It will consist of three nodes created with Docker Machine.
Feel free to skip the commands that follow if you already have a working Swarm cluster.
chmod +x scripts/dm-swarm.sh

./scripts/dm-swarm.sh

eval $(docker-machine env swarm-1)
The dm-swarm.sh script created the nodes and joined them into a Swarm cluster.
Now we can create the first Prometheus service. We'll start small and slowly move toward a more robust solution.
We'll deploy the stack defined in stacks/prometheus.yml. It is as follows:
version: "3"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
As you can see, it is as simple as it can get. It specifies the image and the port that should be opened.
Let's deploy the stack.
docker stack deploy \
    -c stacks/prometheus.yml \
    monitor
Please wait a few moments until the image is pulled and deployed. You can monitor the status by executing the docker stack ps monitor command.
Let's confirm that the Prometheus service is indeed up and running.
open "http://$(docker-machine ip swarm-1):9090"
You should see the Prometheus graph screen.
Let's take a look at the configuration.
open "http://$(docker-machine ip swarm-1):9090/config"
You should see the default config, which does not define much more than intervals and internal scraping. In its current state, Prometheus is not very useful, so we'll have to spice it up a bit.
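For reference, the default configuration shipped with the image looks approximately like the snippet that follows. The exact values and layout may differ between Prometheus versions, so treat it as a sketch rather than a verbatim copy.

```yaml
# Approximate default Prometheus configuration (values vary by version)
global:
  scrape_interval: 15s     # how often targets are scraped
  evaluation_interval: 15s # how often rules are evaluated

scrape_configs:
  # Prometheus scrapes its own metrics endpoint
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
```

The only target it knows about is itself, which explains why the out-of-the-box experience is so bare.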
We should start fine-tuning Prometheus. There are quite a few ways we can do that.
We can create a new Docker image that extends the one we used and adds our own configuration file. That solution has the distinct advantage of being immutable and, hence, very reliable. Since a Docker image cannot be changed, we can guarantee that the configuration is exactly as we want it to be, no matter where we deploy it. If the service fails, Swarm will reschedule it and, since the configuration is baked into the image, it'll be preserved. The problem with that approach is that it is not suitable for a microservices architecture. If Prometheus has to be reconfigured with every new service (or at least those that expose metrics), we would need to build the image quite often and tie that build to the CD processes executed for the services we're developing. This approach is suitable only for a relatively static cluster and monolithic applications. Discarded!
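For illustration, such an image could be built from a Dockerfile similar to the one below. The config file name is hypothetical; the destination path matches the location the official image expects.

```dockerfile
# Hypothetical image that bakes a custom configuration
# into the official Prometheus image
FROM prom/prometheus

# Overwrite the default configuration with our own
COPY prometheus.yml /etc/prometheus/prometheus.yml
```

Every change to the configuration would require rebuilding and redeploying this image, which is precisely why we discarded the approach.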
What would be the alternative approach?
We can enter a running Prometheus container, modify its configuration, and reload it. While this allows a higher level of dynamism, it is not fault-tolerant. If Prometheus fails, Swarm will reschedule it, and all the changes we made will be lost. Besides fault tolerance, modifying a config in a running container poses additional problems when running it as a service inside a cluster. We need to find out the node it is running on, SSH into it, figure out the ID of the container, and only then can we exec into it, modify the config, and send a reload request. While those steps are not overly complicated and can be scripted, they would add unnecessary operational complexity. Discarded!
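The steps described above would look approximately like the commands that follow. The node name is illustrative, and the reload mechanism (SIGHUP) may differ between Prometheus versions.

```shell
# Find out which node runs the service (e.g., swarm-2)
docker service ps monitor_prometheus

# SSH into that node
docker-machine ssh swarm-2

# Figure out the ID of the container and exec into it
CONTAINER_ID=$(docker ps -q -f name=monitor_prometheus)
docker exec -it $CONTAINER_ID sh

# Modify the config, then signal Prometheus to reload it
vi /etc/prometheus/prometheus.yml
kill -HUP 1
```

Even scripted, every one of those steps would have to be repeated after each rescheduling, since the container's filesystem starts fresh.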
Among other reasons, we discarded the previous solution because it is not fault-tolerant.
We could mount a network volume to the service. That would solve persistence, but it would still leave the problem created by the dynamic nature of a cluster. We would still, potentially, need to change the configuration and reload Prometheus every time a new service is deployed or updated.
From the operational perspective, this solution is simpler than the previous one. We do not need to find out the node the container is running on, SSH into it, figure out the ID of the container, exec into it, and modify the config. Instead, we can alter the file on the network drive and send a reload request to Prometheus. While a network drive simplifies the process, it does not make it as dynamic and independent from the services as it should be. We would still need to make sure that the deployment pipeline of each service has the steps required to reconfigure Prometheus. By doing that, we would break one of our objectives: that our services contain all the information about themselves. Instead, we'd need to adapt the pipeline of each service to specify the targets, alerts, and other information we might need before reconfiguring Prometheus. We'll discard this solution as well.
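A stack using a network volume could, hypothetically, look like the one below. The volume name and the driver are illustrative; in practice, the driver would be whatever networked storage is available in the cluster.

```yaml
version: "3"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    volumes:
      # Hypothetical network volume holding the configuration
      - prom_conf:/etc/prometheus

volumes:
  prom_conf:
    driver: cloudstor # illustrative; any networked volume driver works
```

The configuration would survive rescheduling, but someone (or something) would still have to edit the file and trigger a reload whenever a service changes.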
What other options do we have? If we're looking for an out-of-the-box solution that uses the official Prometheus image, all our options are exhausted. But we are engineers. We are used to extending other people's solutions and adapting them to suit our needs. Let's not limit our options, and try to design a solution that would suit us well.