As a Site Reliability Engineer, my primary directive is to make sure systems are performing well for my customers and to take corrective action when they are not. In the middle of an incident, I often find myself looking at various application logs to determine what is happening in the system at any given time.
When a system is performing well, there is usually no reason to log verbosely. Once you find yourself searching through the logs, though, it quickly becomes apparent that ERROR-level logs are not always sufficient.
This must be a common problem, as Google’s Site Reliability Engineering book also mentions how powerful it can be to change log levels in a running application. The benefit is obvious: you have a system that is not behaving as it should, and you need to know why. Restarting or redeploying the system could mask the issue or make things worse. Leaving an impaired system running and increasing the granularity of its logs can help uncover the issue.
So, how do you build that?
Let’s create an example rooted in reality. We’ll build a simple HTTP web server in Go with two endpoints:
The /-/config endpoint will accept GET requests and display the currently loaded configuration. The /-/reload endpoint will accept POST requests and trigger a reload of the logging configuration.
Once we’ve got those handlers defined, we need to register them with our server and start it. We will also define an initial logger (using Logrus) so that we have something to start with. I am using Logrus because Go’s standard log package does not support log levels.
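As a rough illustration of the two handlers and their registration, here is a minimal sketch using only the standard library. The names configHandler, reloadHandler, newMux, and status are hypothetical, the response bodies are placeholders, and the Logrus logger is left out so the sketch has no third-party dependencies. To stay runnable without binding a port, main sends in-memory requests through the mux instead of starting the server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// configHandler serves GET /-/config. The body is a placeholder; a real
// handler would write out the currently loaded configuration.
func configHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	fmt.Fprintln(w, "current configuration goes here")
}

// reloadHandler serves POST /-/reload. The reload itself is left as a stub.
func reloadHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	fmt.Fprintln(w, "reload triggered")
}

// newMux registers both handlers on a fresh ServeMux.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/-/config", configHandler)
	mux.HandleFunc("/-/reload", reloadHandler)
	return mux
}

// status sends one in-memory request through the mux and returns the
// response code, so the routes can be checked without opening a port.
func status(method, path string) int {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(status(http.MethodGet, "/-/config"))  // 200
	fmt.Println(status(http.MethodPost, "/-/config")) // 405
	fmt.Println(status(http.MethodPost, "/-/reload")) // 200
	// A real server would instead call something like:
	//   log.Fatal(http.ListenAndServe("localhost:8080", newMux()))
}
```

Rejecting the wrong HTTP method with a 405 keeps a stray GET from triggering a reload by accident.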
Now that we have a functional HTTP server, we need to move on to the core of our code: configuring the logger. We will perform that configuration in YAML. The configurable settings will be the log format (text or json), whether colors are enabled or disabled (text format only), and the verbosity level of the logs.
Let’s define two structs: one for our logging configuration and one for other configuration-related items we might want to add in the future.
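The two structs could be sketched as follows. The type names, field names, and YAML keys are assumptions made for illustration. The struct tags are the kind a YAML library would read, but nothing is parsed here, so the block compiles with the standard library alone.

```go
package main

import "fmt"

// Logging holds the three logging settings. Field names and YAML keys
// are assumptions, not taken from the original code.
type Logging struct {
	Format string `yaml:"format"` // "text" or "json"
	Colors bool   `yaml:"colors"` // only meaningful for the text format
	Level  string `yaml:"level"`  // verbosity, for example "info" or "debug"
}

// Config is the top-level document. Future, non-logging settings would be
// added as sibling fields next to Logging.
type Config struct {
	Logging Logging `yaml:"logging"`
}

func main() {
	cfg := Config{Logging: Logging{Format: "text", Colors: true, Level: "info"}}
	fmt.Printf("%+v\n", cfg) // {Logging:{Format:text Colors:true Level:info}}
}
```

Nesting the logging settings under their own struct means later additions do not change the shape of the existing keys.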
We will also create a config.yaml file. An example configuration will look like this:
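A minimal file along these lines would cover the three settings. The key names (logging, format, colors, level) are assumptions for illustration and may differ from the author's actual file.

```yaml
# Hypothetical layout; key names are assumed, not taken from the original.
logging:
  format: text   # text or json
  colors: true   # text format only
  level: info    # verbosity of the logs
```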
In order to reload the configuration, we’ll need to read the file from the machine, parse it from YAML into our structs and adjust the logger accordingly.
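The "adjust the logger accordingly" step might look roughly like this. Logrus and a YAML parser are third-party modules, so this sketch swaps in the standard library's log/slog package (Go 1.21 or later) and starts from an already-parsed struct. It is not the article's Logrus code. The trimmed-down Logging type and the functions buildLogger, apply, and debugEnabled are hypothetical names, and the colors setting is omitted because slog has no equivalent.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"sync/atomic"
)

// Logging is a trimmed-down, assumed version of the logging settings.
type Logging struct {
	Format string // "text" or "json"
	Level  string // for example "debug", "info", "warn", "error"
}

// current holds the active logger so that it can be replaced at runtime.
var current atomic.Pointer[slog.Logger]

// buildLogger validates cfg and returns a logger that writes to stderr.
func buildLogger(cfg Logging) (*slog.Logger, error) {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(cfg.Level)); err != nil {
		return nil, fmt.Errorf("invalid level %q: %w", cfg.Level, err)
	}
	opts := &slog.HandlerOptions{Level: lvl}
	switch cfg.Format {
	case "text":
		return slog.New(slog.NewTextHandler(os.Stderr, opts)), nil
	case "json":
		return slog.New(slog.NewJSONHandler(os.Stderr, opts)), nil
	default:
		return nil, fmt.Errorf("unknown format %q", cfg.Format)
	}
}

// apply swaps in a new logger only when cfg is valid, so a broken
// configuration leaves the running logger untouched.
func apply(cfg Logging) error {
	l, err := buildLogger(cfg)
	if err != nil {
		return err
	}
	current.Store(l)
	return nil
}

// debugEnabled reports whether the active logger would emit DEBUG records.
// It must only be called after a successful apply.
func debugEnabled() bool {
	return current.Load().Enabled(context.Background(), slog.LevelDebug)
}

func main() {
	if err := apply(Logging{Format: "text", Level: "info"}); err != nil {
		fmt.Println("initial configuration rejected:", err)
		return
	}
	fmt.Println(debugEnabled()) // false

	err := apply(Logging{Format: "json", Level: "verbose"})
	fmt.Println(err != nil, debugEnabled()) // true false

	_ = apply(Logging{Format: "json", Level: "debug"})
	fmt.Println(debugEnabled()) // true
}
```

Validating before swapping matters during an incident: a typo in the configuration file should produce an error, not silence the logs.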
If we start our server using go run main.go, we will get a message that our server is starting on localhost:8080. We are ready to send requests!
View the currently loaded configuration:
$ curl localhost:8080/-/config
Change the configuration file and trigger a reload (you’ll have to send more requests to see the changes take effect):
$ curl -X POST localhost:8080/-/reload
Do less work
Another key principle of Site Reliability is not performing extra work. Having our server reload a configuration file if it hasn’t changed falls into the “extra work” category. Let’s update our code to only update the logging configuration if the file has changed.
How do we know if the configuration changed?
To ensure we are only performing necessary work, let’s compute and store a hash of the currently loaded configuration internally.
We can add this hashing functionality to our existing loadLoggingConfiguration function. When the /-/reload endpoint receives a request, a hash of the “new” configuration will be computed and compared with the hash of the current configuration. If the hashes match, the configuration hasn’t changed and no extra work is performed. If they do not match, the logging configuration is updated.
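The comparison described above can be sketched in isolation. This is one possible shape, not the article's loadLoggingConfiguration: the configState type and its update method are hypothetical, SHA-256 is an assumed choice of hash, and the step that actually reconfigures the logger is passed in as a callback.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// configState remembers the hash of the last configuration that was applied.
type configState struct {
	mu   sync.Mutex
	hash [sha256.Size]byte
	set  bool // false until the first configuration has been applied
}

// update hashes raw and calls apply only when the hash differs from the
// stored one. It reports whether the configuration was applied.
func (s *configState) update(raw []byte, apply func([]byte) error) (bool, error) {
	sum := sha256.Sum256(raw)

	s.mu.Lock()
	defer s.mu.Unlock()

	if s.set && sum == s.hash {
		return false, nil // unchanged: skip the extra work
	}
	if err := apply(raw); err != nil {
		return false, err // keep the old hash so a retry is not skipped
	}
	s.hash, s.set = sum, true
	return true, nil
}

func main() {
	var st configState
	noop := func([]byte) error { return nil }

	for _, raw := range []string{"level: info", "level: info", "level: debug"} {
		changed, _ := st.update([]byte(raw), noop)
		fmt.Println(changed) // true, then false, then true
	}
}
```

Storing the new hash only after apply succeeds means a failed reload can be retried with the same file contents.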