Graylog at Scale

Julian Tabel
New Work Development
12 min read · Mar 13, 2020

How do you handle 150,000 log messages per second? And how can you make those logs accessible to users? Logging is one of the three pillars of observability. Depending on the number of log records you create, you will have to think about a system that can store your logs and make searching and visualising them easy. We at XING have been using Graylog for this purpose for about five to six years now, and we recently upgraded our infrastructure to match our growing demand on the system.

Starting Situation

When we started thinking about upgrading and renewing our Graylog infrastructure, we were facing several difficulties with the system:

  • Response Time: The overall response time of Graylog was poor. The web interface felt sluggish and took several seconds to load our stream list. This was mainly caused by the Graylog servers’ CPUs being fully utilised.
  • Lost Messages: With the overloaded Graylog nodes came another problem. The journal was filling up regularly and we lost messages in the process. If you are using such a system for alerting and monitoring, this is far from ideal.
  • Slow Searches: What good are stored logs if you can’t retrieve them? Search performance was barely acceptable, and searches over extended time frames in particular put a heavy load on the system. This was partly caused by the cluster still running on spinning disks.
  • Service Outage: On several occasions we had a partial or complete service disruption of our Graylog cluster, either due to excessive garbage collection on the Elasticsearch cluster or because the Graylog servers became unresponsive under high load. Overall, the system was considered very unstable.

Planning the system

When we started planning to improve the situation, our first idea was to take some recently freed-up hardware equipped with SSDs and replace the old Elasticsearch cluster with it. This alone would have worked without a doubt, but we also saw a chance to improve the situation even further. Our old Graylog cluster was running on version 2.2.4, while the most recent release had already passed 3.0 at that point. Since we did not have replicas set up in our Elasticsearch cluster for performance reasons, upgrading the individual components was not as straightforward as one would hope. Elasticsearch in particular would have required planned downtime to upgrade from version 2.4.4 to 6.8.4. The risk was simply too high to take while we still relied on that very system.

Our next thought was therefore to build a new cluster in parallel and migrate to it. This is the way we went in the end, since we saw several advantages. First, we could start from scratch and fine-tune some performance settings while the system was still in development. Second, we could use the most recent versions of every component: we switched to Graylog 3.0 and Elasticsearch 6.8, used OpenJDK 11 where possible (Graylog itself only supports Java up to version 8), used a more recent version of MongoDB and, of course, also updated the operating system. Last but not least, we could rethink how we wanted our developers to interact with the cluster. Until then, everybody had created streams as they liked, which caused uncontrollable performance degradation of the system. To prevent this from happening again, we needed a way of allowing developers to interact with Graylog in a more controlled fashion.

What we ended up with is an Elasticsearch cluster with 20 dedicated data nodes, each holding 8TB of data on SSDs in a RAID5 configuration. The servers have 192GB of memory, roughly 32GB of which are dedicated to Elasticsearch. The 5 master nodes run on VMs, and the whole cluster is split across two data centres. With this setup we could also use replicas in a sensible way by storing primaries and replicas in different data centres.
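For illustration, keeping the two copies of a shard in different data centres can be done with Elasticsearch’s shard allocation awareness. The snippet below is only a sketch of that idea; the attribute name and data-centre values are assumptions, not taken from our actual configuration.

```yaml
# elasticsearch.yml -- minimal sketch of data-centre-aware shard allocation.
# The attribute name "datacenter" and the values "dc1"/"dc2" are illustrative.

# Tag each node with the data centre it runs in (value differs per host):
node.attr.datacenter: dc1

# Make the allocator aware of the attribute so the copies of a shard are
# spread across data centres:
cluster.routing.allocation.awareness.attributes: datacenter

# Optional forced awareness: never place both copies in the same data centre,
# even if the other data centre is unavailable:
cluster.routing.allocation.awareness.force.datacenter.values: dc1,dc2
```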

Graylog ended up with dedicated hardware as well. Before, it was a mix of hardware and VMs, basically anything we could get our hands on. With the new hardware we now have a homogeneous environment again and can better plan ahead if we need more performance. It also meant an upgrade to the Graylog journal, since we did not need the disks in the Graylog servers for anything else. Each node now has 150GB of journal; with 10 nodes in total, that means 1.5TB of journal available. This way we can tolerate a complete outage of the Elasticsearch cluster for a couple of hours without losing messages; they would simply be ingested later.
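The journal size itself is a single setting in Graylog’s server.conf (message_journal_max_size). A minimal Ansible sketch of that change could look like the following; the paths and play layout are illustrative and not our actual deployment code.

```yaml
# Sketch only: raise the on-disk journal limit on every Graylog node.
# message_journal_max_size caps the local journal that buffers messages
# whenever Elasticsearch cannot keep up.
- name: Allow the Graylog journal to grow to 150GB per node
  hosts: graylog
  become: true
  tasks:
    - name: Set message_journal_max_size in server.conf
      lineinfile:
        path: /etc/graylog/server/server.conf
        regexp: '^#?message_journal_max_size'
        line: 'message_journal_max_size = 150gb'
      notify: restart graylog-server

  handlers:
    - name: restart graylog-server
      service:
        name: graylog-server
        state: restarted
```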

New Graylog Cluster Overview

Monitoring the system

With the new system in place, we also wanted some insight into how it was behaving. With the old cluster we had metrics stored in Graphite, but since our Graphite installation had significant storage constraints, we could not store everything. With the new infrastructure we also considered newer systems for monitoring Graylog. We had recently started using Thanos as a long-term storage solution for Prometheus metrics, and since we already had experience with Prometheus, we wanted to give this a shot.

Graylog dashboard in Grafana

The dashboard above is what we came up with. The screenshot only shows a part of it, but it gives a good indication of how Graylog is currently behaving. We have also added further drill-downs into Graylog and Elasticsearch to see where problems occur and to get an early indication when we need more capacity. We use the official Graylog Prometheus exporter and an Elasticsearch exporter from JustWatch. We also added a small Ruby app to gather retention statistics, basically looking up the creation_date of the oldest index of a given index set. With that we have all our metrics in one place, and thanks to Alertmanager we can also trigger alerts based on those metrics. All of this has replaced our Graphite setup and some monitoring checks in Icinga2.
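As an example of the kind of alerting this enables, the rule below is a sketch of a retention alert. The metric name graylog_oldest_index_age_seconds is a hypothetical name for the value such a small retention exporter could publish, and the threshold is purely illustrative.

```yaml
# prometheus-rules.yml -- sketch of a retention alert.
# graylog_oldest_index_age_seconds is a hypothetical metric (age of the oldest
# index in an index set); the 7-day threshold is illustrative.
groups:
  - name: graylog-retention
    rules:
      - alert: GraylogRetentionBelowOneWeek
        expr: max by (index_set) (graylog_oldest_index_age_seconds) < 7 * 24 * 3600
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Retention for index set {{ $labels.index_set }} has dropped below one week"
```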

Automating the system

Now that we had a running system and sufficient metrics for it, we wanted to prevent what had happened to our old setup: over time, a lot of streams and other objects had accumulated without control, some of which were not even used anymore. Even if a stream does not contain any messages, it is still matched against all incoming messages and takes up CPU time. To prevent this from happening again, we needed more control over the system, so we decided to build a workflow that allows product teams (our main customers) to create a pull request against a central configuration repository. Once we merge the pull request, the changes are automatically deployed to our Graylog cluster. For this purpose we developed a set of Ansible modules that interact with the Graylog API. It currently supports the management of extractors, inputs, pipelines, pipeline rules, event definitions, streams and roles, with the potential to add more functionality in the future.

This has several advantages for us. Most importantly, we have more control over what is going on in the cluster. Before, developers could create streams and alerts directly in Graylog, which is of course a great way to speed teams up: they can easily test new stream rules or add a new alert without depending on a central team having time to configure it for them. But it also leads to legacy configuration in the system. Sometimes people leave or change teams and a log stream is forgotten; others might not want to clean it up because they do not know whether it is still needed. Especially with streams, this puts considerable unnecessary load on the system, since the stream rules are processed regardless. A central code repository for our Graylog configuration does not necessarily prevent legacy configuration, but it lets us push for and help people with more efficient ways of routing their messages, like pipelines.
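To give an idea of what a change in that repository looks like, here is a sketch of a playbook managing a single stream. The module name graylog_stream and its parameters are hypothetical stand-ins for our internal modules, and the endpoint, token variable and team name are made up.

```yaml
# Sketch of a team's pull request against the configuration repository.
# "graylog_stream" and its parameters are hypothetical stand-ins for our
# internal Ansible modules that talk to the Graylog REST API.
- name: Manage Graylog configuration for the checkout team
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Ensure the team's stream exists and routes by application name
      graylog_stream:
        endpoint: "https://graylog.example.com/api"
        api_token: "{{ graylog_api_token }}"
        title: "checkout-service"
        index_set: "applications"
        rules:
          - field: "application"
            type: "exact"
            value: "checkout-service"
        remove_matches_from_default_stream: true
        state: present
```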

A second benefit of using Ansible in combination with storing the cluster configuration in Git is that it allowed us to automate setting up Graylog clusters. In a worst-case scenario, for example, we could set up a new Graylog cluster and just apply the Ansible code to have it configured and processing messages within minutes. We still take backups of our MongoDB cluster, but this way we can also easily spin up a new cluster with the same settings, for example when we just want to test a new version. We can spin up a small cluster, configure it the same way as our production cluster with relatively little work, perform some tests on it and destroy it again.

Putting the Ansible deployment into our CI/CD system now also allows us to test changes before they go live, at least on a semantic level. Ansible performs a dry run on each pull request, so developers get feedback on their changes. If everything is good, we just need to merge; if not, we can collaborate to fix the issue and roll out the new configuration as soon as we know it works. We still keep some parts out of this, mainly very visual things like dashboards and views. For us it didn’t make sense to describe dashboards in code, where the definition would be very abstract, so people can still create those directly via the web UI. Overall, adoption is slowly spreading through the company, and one of the heavy users of alerts on the old system has already contacted us directly for help with their migration.
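Conceptually, the pipeline boils down to running Ansible in check mode on pull requests and applying the playbook on merge. The snippet below is only an illustration; the article does not say which CI/CD system we use, and the file and branch names are assumptions.

```yaml
# Hypothetical CI configuration (GitLab-CI-style syntax used for illustration).
# Merge requests get a dry run; the default branch gets the real deployment.
stages:
  - validate
  - deploy

dry-run:
  stage: validate
  script:
    - ansible-playbook graylog.yml --check --diff
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

deploy:
  stage: deploy
  script:
    - ansible-playbook graylog.yml
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
```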

Tuning the system

As you would expect, a change of this magnitude does not come without difficulties. Besides some delayed hardware deliveries, we also faced some issues with the system itself. Since we were changing pretty much every part of the system, we actually expected this. But it was a great learning experience.

Once we started slowly ingesting data into Graylog, we finally had some metrics to see whether the system would hold up. The first issue we faced was garbage collection on Elasticsearch. Since we now had 20 instead of 15 nodes, but were also keeping replicas this time, there was more pressure on the Java heap. We couldn’t easily increase it, since that would negatively impact performance. But we were also running Java 11, which meant we could test the newer G1GC garbage collector.

As you can see, the time spent on garbage collection looked bad, going up to over 150ms on average. After we switched to G1GC at around 11:00 we immediately saw a drop. Of course we forgot one server, so the green line shows the difference this made quite nicely. With that, garbage collection was no longer a big issue. The switch did come with one caveat: increased CPU usage on our Elasticsearch nodes. But since they were under-utilised anyway, this was and is not an issue.
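The switch itself is a change to the GC flags in Elasticsearch’s jvm.options. The sketch below shows one way to roll it out; it is not our actual playbook, and the G1 tuning values are the ones commonly recommended for Elasticsearch on JDK 11 rather than values quoted here.

```yaml
# Sketch only. On Elasticsearch 6.8 the GC flags live in
# /etc/elasticsearch/jvm.options; the default CMS flags
# (-XX:+UseConcMarkSweepGC and friends) have to be removed or commented out
# separately before G1GC takes effect.
- name: Switch the Elasticsearch data nodes from CMS to G1GC
  hosts: elasticsearch_data
  become: true
  tasks:
    - name: Add G1GC flags to jvm.options
      blockinfile:
        path: /etc/elasticsearch/jvm.options
        marker: "## {mark} G1GC FLAGS (managed)"
        block: |
          -XX:+UseG1GC
          -XX:G1ReservePercent=25
          -XX:InitiatingHeapOccupancyPercent=30
      notify: restart elasticsearch

  handlers:
    - name: restart elasticsearch
      service:
        name: elasticsearch
        state: restarted
```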

So now that Elasticsearch was no longer a bottleneck due to garbage collection, we could scale up the traffic. Once we had the new hardware set up for the Graylog nodes, we sent 100% of our log traffic to the new system. At first everything looked very good: Graylog processed even the peak message rates and journal utilisation was very low. However, looking deeper into the system, we saw about 75,000 indexing failures per day. Compared to the roughly 6 billion messages ingested per day this might not seem like much, but it could mean losing important information. The majority of those failures looked like this:

RemoteTransportException[[elasticsearch-data-2][X.X.X.X:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elasticsearch-data-2][X.X.X.X:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@50d9ba94 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@26f665c[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 303907660]]];

This was bad. Not only was Elasticsearch the bottleneck again, but it was something we could not easily fix. You might think we could simply have increased the queue size of the bulk thread pool, but that would only have made the problem worse and shifted it back to the heap: each request in the queue takes up heap space, and a bigger queue doesn’t help process the requests any faster.

write rejections on a typical day
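This is also something worth alerting on early. The sketch below is based on the thread pool rejection counter exposed by the Elasticsearch exporter mentioned above; the exact metric and label names can differ between exporter versions, so treat it purely as an illustration.

```yaml
# Sketch of an alert on bulk/write rejections. The metric name follows the
# Prometheus elasticsearch_exporter (elasticsearch_thread_pool_rejected_count);
# the pool-name label varies between versions, so no filter is applied here.
groups:
  - name: elasticsearch-indexing
    rules:
      - alert: ElasticsearchBulkRejections
        expr: sum(rate(elasticsearch_thread_pool_rejected_count[5m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch is rejecting indexing requests; Graylog output may fail"
```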

We had several options on the table, and some of them seemed pretty drastic. At this point we had already released the cluster to our developers, so any major change would have required announcements and maintenance work. Doing that on a system we had just declared production-ready did not seem like a good idea. One option was to increase the write capacity of our Elasticsearch cluster by changing the disk setup from RAID5 to RAID0. This would have increased write capacity a lot, but we would have had to rely on Elasticsearch alone for fault tolerance. We are confident that Elasticsearch can handle this, but we also have quite a big data set per node: filling up an empty node takes about 1.5 days, and during that time we would have no fault tolerance. That is quite a big risk, and in the end it was the reason we decided not to implement this yet.

In the end we decided to test a bigger bulk size for the requests Graylog sends to Elasticsearch. The default is to send a bulk request every second or as soon as 500 messages have accumulated, whichever comes first. We never hit the one-second limit, as 500 messages came in much faster, but increasing the bulk request size from 500 to 1000 messages had the impact we needed. The requests take up more space in the heap, but in the end that is far better than losing messages.
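In Graylog this is controlled by output_batch_size (with output_flush_interval covering the time-based flush). A sketch of the change as an Ansible role task follows; the path is the usual package location, and the restart handler is assumed to exist elsewhere in the role.

```yaml
# Role task sketch; the "restart graylog-server" handler is assumed to be
# defined in the role. output_batch_size is the maximum number of messages
# per bulk request (default 500).
- name: Raise the Elasticsearch output batch size to 1000 messages
  lineinfile:
    path: /etc/graylog/server/server.conf
    regexp: '^#?output_batch_size'
    line: 'output_batch_size = 1000'
  notify: restart graylog-server
```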

write rejections after the change

We immediately saw the rejections drop to zero. We still had a few indexing failures, but those were due to type mismatches and needed to be fixed by the teams sending the data.

Summary and next steps

The new Graylog cluster now processes about 150,000 msg/s, with peaks even going above that. In total we ingest about 5.5 billion messages per day, with some days going up to 6.8 billion. This amounts to a maximum of 13TB of data per day, giving us about 1.5 weeks of retention. This fluctuates a bit, because our traffic patterns show a lot less traffic on weekends, so retention is usually highest at the beginning of the week.

Looking into the future of Graylog here at New Work SE, we still have some way to go. Right now we are in the migration phase from the old cluster to the new one. This will be finished by the end of February 2020 and will free up a lot of storage and CPU resources. We have some ideas on how to leverage at least the storage to extend the new cluster and increase the retention of our logs, but as of now we have no concrete plans or timelines for that. We also continue to monitor the system to see whether we run into performance problems and need to change something about the setup. Ordering hardware always comes with some lead time, but at least now we can predict the growth of the system better.

One major step in the near future is the introduction of SRE practices for our Graylog infrastructure. We have come up with SLIs and SLOs to measure the availability of our system, and with the help of our internal SRE team we will implement them to get a better understanding of how the system is actually perceived by its users. The metrics we have collected so far are rather technical and show us where the system is suffering, but not whether there is an actual problem for users. Measuring more user-centric metrics feels like a step in the right direction, and we will continue to tweak them in the future. We also want to see whether we can open source our Ansible modules for managing Graylog. They have helped us tremendously and we think others might benefit from them as well.

In summary: we spent almost half a year planning and implementing this new Graylog cluster, including two months with two members of our team focusing solely on Graylog. Now we have a stable cluster that we can confidently provide to our teams, and we can more easily predict hardware requirements thanks to better observability and control over the resources required to ingest logs. Looking back at the old cluster, we have made significant improvements and learned a lot about logging with Graylog at this scale.
