Perimeter: An Egress Traffic Controller


Perimeter is an egress traffic controller designed for distributed systems. It ensures that all outgoing requests comply with the rate limits of external systems. This article provides an overview of why we needed a rate limiting solution, the architecture of Perimeter, and how it is used at WareIQ to observe and manage outgoing API traffic.

Understanding WareIQ's platform and the need for Perimeter

WareIQ is a logistics middleware that connects merchants to various sales channels, warehouse management systems and delivery partners. The WareIQ platform interacts with these third-party systems via APIs to maintain a single source of truth and provide data consistency for our customers' e-commerce order details. Currently, WareIQ makes over 1 million API requests daily across our partners, and this number is climbing swiftly as we onboard more clients and partners.

The WareIQ platform is composed of multiple microservices. Because each microservice can call the same external partner simultaneously for different functionalities, we noticed an increasing number of request failures caused by hitting the rate limits of these external partners. This led us to realize that rate limiting solely at the individual microservice level is insufficient. We needed a centralized traffic controller that shapes the egress traffic generated across all our microservices, so that we stay within the bounds of each external partner's rate limits.

Suppose we have a service that fetches orders from Shopify, which has a rate limit of 2 requests per second. If 10 instances of this service are running and each one limits itself to 2 requests per second, the combined outgoing rate can reach 20 requests per second in the worst case. However, even when all 10 instances are making requests simultaneously, the limit we must adhere to is still 2 requests per second in total. This is where Perimeter comes into play.
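The arithmetic behind the Shopify example can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes that every instance throttles itself to the partner's full limit and that the partner rejects everything above its limit.

```python
def worst_case_aggregate_rate(instances: int, per_instance_limit: float) -> float:
    """Combined outgoing rate when every instance sends at its own local limit."""
    return instances * per_instance_limit


def rejected_share(aggregate_rate: float, partner_limit: float) -> float:
    """Fraction of requests above the partner's limit (assumed to be rejected)."""
    if aggregate_rate <= partner_limit:
        return 0.0
    return (aggregate_rate - partner_limit) / aggregate_rate


# Figures from the example above: 10 instances, partner limit of 2 requests/second.
aggregate = worst_case_aggregate_rate(instances=10, per_instance_limit=2)
print(aggregate)                     # 20
print(rejected_share(aggregate, 2))  # 0.9
```

Under these assumptions, per-instance limiting alone could let nine out of ten requests run into the partner's limit in the worst case, which is why the limit has to be enforced in one place.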
Perimeter ensures that the total rate limit is not exceeded across all instances, keeping operations smooth and efficient. By implementing Perimeter, we centralized rate limiting, so we can manage our API requests effectively and stay within the rate limits set by our external partners.

Requirements and research

Addressing rate limiting at the network level is an efficient solution, but it presents certain challenges. We could not afford to drop any request that hit a rate limit. If a request exceeds the rate limit, it must be queued and processed soon afterwards, so that no request is lost.

In our research into network-layer proxies for rate limiting, we considered Envoy Proxy and Nginx. However, we found that neither meets our requirements. Specifically, both Envoy and Nginx drop requests that exceed the rate limit instead of queuing them for later processing. This behavior does not align with our need to honor all outgoing requests.

Another requirement was that the configured rate limits be dynamic: it should be possible to update them on the fly without restarting the service, and the service should react to configuration changes in real time. Meeting this requirement with a network-layer solution would mean writing a lot of custom code on top of the existing tools, making the solution complex and difficult to maintain.

Since an application-layer approach best fits these crucial requirements, we decided to build Perimeter as an application-layer rate limiter.

Understanding Perimeter

WareIQ uses a microservices architecture, and these services are managed in a Kubernetes cluster. All key services are written in Python. Perimeter sits between these microservices and the external systems. Since Perimeter is a critical component, there were two questions we needed to answer before we started building it:

How do we ensure that the rate limits are enforced correctly?
We implemented a token bucket algorithm to enforce rate limits. The token bucket is a widely used rate limiting algorithm. It works by adding tokens to a bucket at a fixed rate. When a request comes in, a token is removed from the bucket. If there are no tokens in the bucket, the request waits until a token becomes available. This ensures that the rate limit is enforced correctly while requests are still processed in a timely manner. Tokens are added to the bucket based on configurations saved in a PostgreSQL database, which allows us to change the rate limits on the fly without restarting the service.

How do we ensure Perimeter is highly available?

All our services run in a Kubernetes cluster, and Perimeter is deployed as another service in the cluster. Although multiple replicas of Perimeter are desirable for high availability, they bring up a new challenge: how do we ensure that the rate limits are enforced correctly across all replicas? We decided to park this problem for the future and make the single-replica Perimeter service as robust and fault-tolerant as possible.

Perimeter is a single central service, deployed as a Kubernetes Deployment with a single replica. This ensures that all requests pass through the same instance of Perimeter and that the rate limits are enforced correctly. One instance of Perimeter is meant to be available at all times; if it goes down, the Kubernetes Deployment spins up a new instance immediately. Additionally, as a fallback, we have updated all our services to retry on rate limit errors. This ensures that even if Perimeter goes down, the services continue to function, albeit with a higher failure rate.

Since Perimeter runs as a single instance, the service itself needed to be fast and lightweight.
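To make the token bucket behaviour described earlier in this section concrete, here is a minimal Python sketch. It is not Perimeter's actual code: the class name, the `capacity` parameter and the choice to start with a full bucket are assumptions made for illustration. It also computes the refill from elapsed time instead of running a background task that adds tokens, which is a common variant with the same observable behaviour. The clock and sleep functions are injectable so the logic can be exercised without real waiting, and the sketch is not thread-safe.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket: callers wait for a token instead of being rejected."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self._clock = clock
        self._sleep = sleep
        self._tokens = capacity       # assumption: the bucket starts full
        self._last = clock()

    def _refill(self) -> None:
        # Add the tokens earned since the last refill, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def acquire(self) -> float:
        """Wait until one token is available, consume it, return seconds waited."""
        waited = 0.0
        while True:
            self._refill()
            if self._tokens >= 1:
                self._tokens -= 1
                return waited
            # Sleep just long enough for the missing fraction of a token.
            delay = (1 - self._tokens) / self.rate
            self._sleep(delay)
            waited += delay


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=2)   # 2 requests per second
    for n in range(3):
        print(f"request {n} waited {bucket.acquire():.2f}s")
```

With a rate of 2 tokens per second and a capacity of 2, the first two requests pass immediately and the third waits about half a second, so no request is dropped.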
We chose to write Perimeter in Golang, as it is known for its speed and efficiency in handling concurrent requests.

Components of Perimeter

Perimeter has four main components:

Configurations: Rate limits are saved in a PostgreSQL database. Perimeter reads these configurations and creates a Beat for each one. A Beat is a goroutine that adds tokens to the bucket at a fixed rate. These configurations are updated periodically, and Beats are created, updated, or destroyed based on the configurations.

Beats: Each Beat is responsible for adding tokens to the bucket at a fixed rate. When a request comes in, the Beat checks whether there are enough tokens in the bucket to process it. If there are, the request is processed and a token is removed from the bucket. If there are not, the request is queued and processed as soon as enough tokens are available.

Structured Logging: Perimeter logs all incoming requests and the rate limits that are enforced. This allows us to monitor the traffic and verify that the rate limits are enforced correctly.

Alerts: We have set up alerts on various metrics to notify us when thresholds are exceeded, enabling us to respond swiftly and maintain system performance.

Perimeter utilizes the blocking behavior of Go channels to queue requests when the rate limit is exceeded. This ensures that no requests are dropped and that all requests are processed as soon as the rate limit allows.

Testing Perimeter's Performance and Reliability

To ensure Perimeter can handle the requirements of our system, we conducted a variety of tests to evaluate its performance and reliability. We built our own load simulator, a simple Python script that sends requests to Perimeter at a configurable rate or load. Additionally, we incorporated our own custom services, each with different rate limits, to test Perimeter's flexibility and enforcement capabilities.
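A load simulator of the kind described above could look roughly like the following Python sketch. It is not the script used at WareIQ: the function names are hypothetical, the URL passed to `make_http_sender` would be whatever endpoint Perimeter exposes (not specified in this article), and requests are sent sequentially, so a slow response lowers the achieved rate. The sender, clock and sleep functions are injectable, which keeps the pacing logic testable without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def make_http_sender(url: str) -> Callable[[int], int]:
    """Return a sender that GETs `url` and yields the HTTP status code."""
    def send(_request_no: int) -> int:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # 4xx/5xx responses are raised as HTTPError
    return send


def run_load(send: Callable[[int], int], rate_per_sec: float, total: int,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Call send(i) `total` times, paced at `rate_per_sec`; return the results."""
    interval = 1.0 / rate_per_sec
    start = clock()
    results = []
    for i in range(total):
        # Schedule each request against the start time so pacing does not drift.
        delay = (start + i * interval) - clock()
        if delay > 0:
            sleep(delay)
        results.append(send(i))
    return results


if __name__ == "__main__":
    # Stub sender standing in for a real HTTP call through Perimeter.
    statuses = run_load(lambda i: 200, rate_per_sec=20, total=5)
    print(statuses)
```

Pointing several such runs with different rates at the same limiter is one way to check that requests above the configured limit are delayed and not dropped.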
We spun up Docker containers corresponding to some of our microservices and triggers, enabling us to simulate various network conditions and configurations. This comprehensive testing approach ensured that Perimeter could handle the load and enforce rate limits correctly across diverse scenarios.

Perimeter in Action

Perimeter makes our egress traffic more consistent and is now used to monitor and manage outgoing API traffic at WareIQ. It has enabled us to identify and resolve several previously undetected issues related to external APIs within our system. We have established alerts on different metrics, helping us respond to issues quickly.

Future Work

Rate Limiting Types: Currently, Perimeter only supports rate limiting based on the number of requests per rate period. We are working on adding more rate limiting types, such as API-level and service-level rate limiting.

Request Analytics: We are adding request analytics to Perimeter. This will allow us to monitor traffic and identify patterns, helping us detect anomalies and respond quickly.

Enhanced Alerting: We are improving our alerting capabilities. This will allow us to set up alerts on more metrics and get notified when thresholds are crossed, so that issues can be resolved proactively.

Conclusion

Perimeter is a critical component in ensuring that WareIQ's egress traffic adheres to the rate limits of external partners, enhancing system reliability and performance. By implementing Perimeter, we have centralized rate limiting, managed our API requests effectively, and ensured compliance with external rate limits. As we continue to develop Perimeter, we are focused on adding new features and improving existing capabilities to meet our growing needs.

August 07, 2024
