Rakuten Platform as a Service and Apache Kafka
Rakuten has been running an internal Platform-as-a-Service (PaaS) for over 4 years. Rakuten application teams use our PaaS not only for testing but also for running production scale services. Because of the power of PaaS, we’ve been enabling them great productivity. For example, they can release their application and scale them out horizontally when needed using a single command.
We use Cloud Foundry (CF), an open source PaaS project as the basis of our platform. Since many Rakuten services now use our PaaS, we run it at a very large scale and operate one of the biggest CF deployment in the world (See exact numbers and how we operate it here).
As we grew quickly, we faced many challenges in scaling of our platform. Many of these challenges are because we’re still running CF version 1. We are currently upgrading to CF version 2. As part of our upgrade, we are rebuilding our log pipeline. The log pipeline is one of our original custom components. It collects all logs from the CF components and sends them to other stakeholders and services such as object storage for long-term archiving, Elasticsearch/Logstash/Kibana (ELK) for visualization, Hadoop cluster for data analysis and so on (among CF users, logsearch is a popular choice for this kind of purpose).
For the core of this new log pipeline, we use Apache Kafka. In this blog post, I will describe the history of our log pipeline, the problems we faced and how we address these problems in v2.
Why do we need a log pipeline?
Why we need a custom log pipeline? Why is it not enough to use what is included in CF by default (loggregator)? There are 3 main reasons for building this pipeline:
- To enable PaaS users to debug their application with more than the
cf logscommand can do. With
cf logs, users can view the recent logs of their application or tail real time logs. It’s not possible for users to see logs expired from the CF log buffer, or to use their favorite logging system like Elasticsearch/Kibana for visualization and analysis.
- To make it easy for PaaS operators (us) to debug CF components and help users debug an application problem.
- To provide our logs to other teams like data analysis team.
To meet these requirements we have been running our log pipeline since the early days of our PaaS.
What kind of logs do we handle?
We handles 2 types of logs:
- Component logs - logs from CF components (e.g., router logs)
- Application logs - logs from user applications (e.g., rails logs)
Currently, we’re running around 2,000 user application instances on our PaaS. They generate around 5TB logs per day.
Implementation of log pipeline v1
The below picture shows the simple architecture of the log pipeline in v1.
At the initial stage of the log pipeline we used to send component logs to a syslog server via rsyslog. Compared with the volume of component logs, the volume of application logs was quite large. So we prepared GlusterFS and sent all application logs there via fluentd.
As adoption of our PaaS increased, users started asking to send logs to other log storage systems run by specific teams or our Big Data Hadoop clusters. For log routing from GlusterFS, we used fluentd. Over time, log volumes became quite large and it became difficult to store everything on GlusterFS. So we started to send logs to LeoFS via a batch process.
This setup works but we have had some difficulty with our operation:
- Logs are stored on different servers, so we need to login to different servers for debugging different types of problems. It was challenging to merge them back, especially when some parts are missing.
- Because we use an old version of fluentd (it became an important hub of our pipeline and now it is difficult to upgrade) sometimes it gets stuck, its buffers fills up and the problem eventually propagates upstream. In general, this kind of propagation happens in push-based messaging systems. Because we need to provide logs to other teams, if their system gets stuck it can also affects our system.
After upgrading to CF v2, we expect 10 times the log volume as we have now. To scale to that volume and to make operation easier, we need to address these problems.
Implementation of log pipeline v2
The picture below shows our new log pipeline. The types of logs we handle is same as in v1: component logs and application logs. In v2, all logs are sent to Apache Kafka and then from there the logs are consumed by different components.
To publish component logs to Kafka, we use rsyslog and its module omkafka. rsyslog is installed by default and is simple to configure. Regarding application logs, CF has a component which aggregates all application logs (loggregator) and provides a WebSocket endpoint for consuming logs from outside of CF. We consume logs from that endpoint and publish the logs to Kafka (we will open source this tool soon).
To store logs for long-term availability, we persist the logs on LeoFS. To move the logs into the LeoFS, we use Secor from Pinterest. LeoFS and Secor are both Amazon S3 API compatible. To make the logs searchable, we send a copy of the logs on Elasticsearch using Logstash (it has nice plugins and support for Kafka). This is still work-in-progress, but we are planing to allow application users to write their own consumers to pull logs from Kafka and send their logs to their favorite data store.
By putting Kafka into the core of our log pipeline, we have gained many benefits:
- Simple. It’s much simpler than the old pipeline. Everything goes to Kafka. When we build something new component we just send the logs to Kafka.
- Flexible. Kafka consumers use a pull model. To add or remove backend systems we don’t touch Kafka itself. Now it’s easy to meet the new log routing request from users.
- Reliable. No more log propagation problems. If consumers are suffering from problem consuming logs, Kafka itself is unaffected. After consumers recover, they resume to consumption from their last position.
We are currently deploying Cloud Foundry in one of the Rakuten data centers. We are planning to use public cloud providers with our DC in a hybrid cloud configuration so applications running in the Rakuten DC can easily burst out to public clouds during traffic peaks. In this case, we need to think about log aggregation between multiple DCs. In this scenario, we can again rely on Kafka. The following picture is rough blueprint of this project.
The local Kafka aggregates all logs within its DC and the global Kafka collects all logs from the remote Kafkas. We are still planing for this, we will write another post about this in the future here.
In this post, we described how we use Apache Kafka with our PaaS. Our previous pipeline was not flexible enough to provide logs to the other teams or safely operate during outages of external systems. By leveraging Kafka, our architecture is much simpler than before and we can expect to easily scale out to accommodate future data volume.
Also, we are hiring! If you are interested in working as a Platform-as-a-Service engineer, please contact us here