Spring Microservices In Action

All systems eventually experience failure. How we respond to that failure is one of the critical points to consider when building resilient distributed systems.

Resilience is the ability of an application to function despite failures of system components and recover from failure situations.

When a service fails outright, the failure is easy to detect and route around. A poorly performing service is harder to identify, and it can trigger a cascading effect that ripples through the entire system and brings down multiple applications.

What is client-side resiliency?

Client-side resiliency is an approach to protecting a client from failing or poorly performing remote resources. Four patterns (client-side load balancing, circuit breaker, fallback, and bulkhead) can be implemented in the client (the microservice) so that it fails fast and does not tie up application resources such as thread pools and network bandwidth.

Client-side load balancing: To implement client-side load balancing, the client is responsible for obtaining the list of all existing service instances from a service discovery agent. If the client-side load balancer detects a poorly behaving instance, it can eliminate that service instance from the pool of available services.

Circuit Breaker: The circuit breaker pattern keeps track of all calls to a remote resource. When a certain number of calls fail, the circuit breaker switches to failing fast and prevents future calls to the failing remote resource.

Fallback Processing: When a service called by the client (the service consumer) fails to respond, the client executes an alternative code path to carry out the work. Instead of interrupting the process by throwing an exception, fallback processing involves retrieving data from another data source or queuing the client's request.

Bulkheads: When a service interacts with multiple remote resources, the calls to each resource should be isolated with their own resources. Otherwise, one slow remote resource call can lead to cascading failure and bring down the whole system. Each remote resource can be assigned its own thread pool so that one slow service call doesn't cause a bottleneck.
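The client-side load balancing idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name RoundRobinLoadBalancer, the evict method, and the instance addresses are hypothetical, and a real client would obtain the instance list from a service discovery agent rather than from a hard-coded list.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin client-side load balancer (illustration only).
class RoundRobinLoadBalancer {
    private final List<String> instances;
    private final AtomicInteger position = new AtomicInteger();

    // Assumption: the caller passes in the instances it got from service discovery.
    RoundRobinLoadBalancer(List<String> discoveredInstances) {
        this.instances = new CopyOnWriteArrayList<>(discoveredInstances);
    }

    // Returns the next instance in rotation, working on a snapshot of the pool.
    String next() {
        List<String> snapshot = List.copyOf(instances);
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no healthy instances available");
        }
        int index = Math.floorMod(position.getAndIncrement(), snapshot.size());
        return snapshot.get(index);
    }

    // Removes a poorly behaving instance from the pool of available services.
    void evict(String instance) {
        instances.remove(instance);
    }

    public static void main(String[] args) {
        RoundRobinLoadBalancer lb = new RoundRobinLoadBalancer(
                List.of("product-1:8080", "product-2:8080", "product-3:8080"));
        System.out.println(lb.next());
        lb.evict("product-2:8080"); // pretend a health check flagged this instance
        System.out.println(lb.next());
    }
}
```

How an instance is judged to be "poorly behaving" (error counts, latency, health checks) is left out on purpose; the sketch only shows that the decision is made on the client side.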

Implementing Resilience4j

Resilience4j is a lightweight fault tolerance library (previously, Hystrix was commonly used for this purpose). It offers the following patterns:

Circuit Breaker: Stops sending requests while an invoked service is not responding.
Retry: Retries a request when a service temporarily fails.
Bulkhead: Avoids overload by limiting the number of outgoing concurrent service requests.
Rate Limiter: Limits the number of requests that can be received within a given time period.
Fallback: Provides an alternative path for handling failing requests.

We can apply multiple patterns to the same method call by combining the annotations.

Setting up the environment to use Spring Cloud and Resilience4j
To implement Resilience4j patterns, we need to set up our project and import the Resilience4j dependency via Maven or Gradle.

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-ratelimiter</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-retry</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-bulkhead</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-cache</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-timelimiter</artifactId>
    <version>${resilience4jVersion}</version>
</dependency>
<!-- or include all of them at once with 'resilience4j-all' -->

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

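The AOP dependency is added so that the Spring AOP aspects behind the Resilience4j annotations can run. Note that the annotations and the application.yml configuration used below come from Resilience4j's Spring Boot integration (for example, the resilience4j-spring-boot2 starter), which also needs to be on the classpath.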

Implementing The Circuit Breaker Pattern

The pattern monitors remote calls so that it can fail fast and prevent further requests to an unavailable remote resource. The circuit breaker is implemented as a finite state machine with three normal states: closed, open, and half-open.

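Resilience4j records call outcomes in a sliding window (count-based or time-based) and uses it to calculate the failure rate in the closed state. In the half-open state, the rate is calculated over the permitted number of trial calls. Older versions of the library used a ring bit buffer for this.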

Closed: The initial state. Requests pass through while the failure rate stays below the threshold. Once the minimum number of calls has been recorded and the failure rate reaches or exceeds the threshold, the circuit breaker opens.

Open: All calls are rejected (fail-fast) with a CallNotPermittedException. After the configured wait duration expires, the circuit breaker switches to the half-open state and allows a limited number of requests to check whether the service is still unavailable.

Half-Open: In the half-open state, the circuit breaker lets a configured number of calls through to the remote resource to evaluate a new failure rate. If this new failure rate is at or above the threshold, it returns to the open state and the wait duration starts again. Otherwise, it switches to the closed state.
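The three states above can be sketched as a small state machine in plain Java. This is a simplified, single-threaded illustration and not how Resilience4j is implemented: the class SimpleCircuitBreaker, its constructor parameters, and the use of IllegalStateException in place of CallNotPermittedException are all assumptions made for the example. The clock is passed in so that the wait duration can be controlled when experimenting.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified sketch of the closed/open/half-open state machine (hypothetical names).
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // calls evaluated in CLOSED
    private final double failureThreshold; // e.g. 0.5 means 50%
    private final long waitInOpenMillis;   // time to stay OPEN
    private final int halfOpenCalls;       // trial calls evaluated in HALF_OPEN
    private final LongSupplier clock;      // injectable time source in milliseconds

    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureThreshold,
                         long waitInOpenMillis, int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    State state() { return state; }

    <T> T call(Supplier<T> remoteCall) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < waitInOpenMillis) {
                // Fail fast without touching the remote resource.
                throw new IllegalStateException("circuit open: call not permitted");
            }
            transitionTo(State.HALF_OPEN); // wait expired: allow trial calls
        }
        try {
            T result = remoteCall.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    // Stores the outcome and re-evaluates the failure rate once enough calls exist.
    private void record(boolean failed) {
        int limit = (state == State.HALF_OPEN) ? halfOpenCalls : windowSize;
        outcomes.addLast(failed);
        if (outcomes.size() > limit) {
            outcomes.removeFirst();
        }
        if (outcomes.size() < limit) {
            return; // not enough recorded calls yet
        }
        long failures = outcomes.stream().filter(Boolean::booleanValue).count();
        if ((double) failures / outcomes.size() >= failureThreshold) {
            transitionTo(State.OPEN);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED);
        }
    }

    private void transitionTo(State next) {
        state = next;
        outcomes.clear();
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker cb =
                new SimpleCircuitBreaker(4, 0.5, 5_000, 2, System::currentTimeMillis);
        for (int i = 0; i < 4; i++) {
            try {
                cb.call(() -> { throw new RuntimeException("remote failure"); });
            } catch (RuntimeException e) {
                // expected: the simulated remote call fails
            }
        }
        System.out.println(cb.state()); // OPEN after four recorded failures
    }
}
```

The sketch ignores concurrency, slow-call rates, and time-based windows, all of which a production library has to handle.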

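The @CircuitBreaker annotation marks the Java methods that Resilience4j should manage. Through Spring AOP, every call to an annotated method passes through the circuit breaker, which records the outcome and rejects calls while the circuit is open. Unlike Hystrix, the Resilience4j circuit breaker does not run the call in a dedicated thread pool; thread isolation is the job of the bulkhead pattern.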

@FeignClient(name = "product-service/product")
public interface ProductService {

    @PutMapping("/reduceQuantity/{id}")
    @CircuitBreaker(name = "productService", fallbackMethod = "productFallbackMethod")
    ResponseEntity<Void> reduceQuantity(@PathVariable("id") long productId,
                                        @RequestParam long quantity);

    // The fallback must return the same type as the protected method.
    default ResponseEntity<Void> productFallbackMethod(Exception e) {
        throw new CustomException("PRODUCT_UNAVAILABLE", "UNAVAILABLE", 500);
    }
}

The @CircuitBreaker annotation takes a name, which identifies the circuit breaker instance to configure, and the name of a fallback method to invoke when the call fails or the circuit breaker is open.

Customizing The Circuit Breaker

The circuit breaker can be customized by adding several parameters to application.yml, bootstrap.yml, or the service configuration file.

resilience4j:
  circuitbreaker:
    instances:
      productService:
        failureRateThreshold: 50 # percent
        minimumNumberOfCalls: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 5s
        permittedNumberOfCallsInHalfOpenState: 3
        slidingWindowSize: 10
        slidingWindowType: COUNT_BASED

The circuit breaker's behavior can be customized through these parameters. We can define the failure rate at which it switches to the open state, how long it waits in the open state before switching to the half-open state, and more. To learn more about the parameters, visit the following link:
https://resilience4j.readme.io/docs/circuitbreaker#create-and-configure-a-circuitbreaker

Fallback Processing

To implement a fallback strategy that intercepts a service failure, we add a fallbackMethod attribute to the @CircuitBreaker annotation and define the fallback method in the same class as the method protected by the circuit breaker. This is demonstrated in the previous example.

On fallbacks:

If you're only logging an error in a fallback, it's better to use a try-catch block around your service call to handle the exception and include the logging there.
If you call another service in a fallback, consider protecting that call with its own @CircuitBreaker, because the second service can fail too.
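The fallback idea can be illustrated without any library as a helper that tries a primary call and switches to an alternative code path when it fails. This is a minimal sketch, assuming a hypothetical helper named withFallback; it is not the mechanism Resilience4j uses to resolve fallbackMethod.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper showing fallback processing (illustration only).
class FallbackExample {
    // Runs the primary call; on failure, hands the exception to the fallback.
    static <T> T withFallback(Supplier<T> primary, Function<RuntimeException, T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.apply(e);
        }
    }

    public static void main(String[] args) {
        // The primary call simulates an unavailable service; the fallback
        // returns a value from another source, such as a local cache.
        String price = FallbackExample.<String>withFallback(
                () -> { throw new IllegalStateException("product-service unavailable"); },
                e -> "cached-price");
        System.out.println(price); // prints "cached-price"
    }
}
```

Passing the exception to the fallback mirrors the extra exception parameter of a fallback method and lets the alternative path decide how to react to different failures.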

Implementing The Bulkhead Pattern

By default, the same thread pool is used to handle all remote resource calls. A performance problem in one of these remote resources can result in all of the threads in the Java container being maxed out and waiting to process work while new requests back up.

The bulkhead pattern prevents the eventual crash of the Java container by giving each remote resource its own, separate pool of resources. This way, a single misbehaving service saturates only its own pool and stops processing only its own requests, while calls to other resources continue. Two different bulkhead implementations are available in Resilience4j:

Semaphore: In this isolation approach, the number of concurrent requests to the service is limited. It starts rejecting requests once the limit is reached. By default, the semaphore approach is used in Resilience4j.
Fixed Thread Pool: This approach uses a bounded queue and a fixed thread pool. It dedicates a separate thread pool to the service and uses only that pool for its calls. It rejects a request if both the pool and the queue are full.
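The semaphore approach can be sketched with java.util.concurrent.Semaphore. This is only an illustration of the idea, not Resilience4j's code: the class SemaphoreBulkhead is hypothetical, its two constructor arguments loosely correspond to the maxConcurrentCalls and maxWaitDuration settings, and IllegalStateException stands in for the library's own exception type.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead (illustration only).
class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Runs the call if a permit is free within the wait time; otherwise rejects it.
    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for bulkhead", e);
        }
        if (!acquired) {
            throw new IllegalStateException("bulkhead full: call rejected");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always free the slot, even if the call failed
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(20, 10);
        System.out.println(bulkhead.<String>execute(() -> "quantity reduced"));
    }
}
```

Because the caller's own thread runs the call, this variant limits concurrency without adding threads; the fixed thread pool variant would instead hand the call to a dedicated executor.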

Resilience4j lets us customize the behavior of both bulkhead types. The configuration parameters for the semaphore and the fixed thread pool variants, respectively, are shown below.

resilience4j.bulkhead:
  instances:
    bulkheadService:
      maxWaitDuration: 10ms
      maxConcurrentCalls: 20

resilience4j.thread-pool-bulkhead:
  instances:
    bulkheadService:
      maxThreadPoolSize: 1
      coreThreadPoolSize: 1
      queueCapacity: 1
      keepAliveDuration: 20ms

    @Bulkhead(name = "bulkheadService", fallbackMethod = "productFallbackMethod")
    ResponseEntity<Void> reduceQuantity(@PathVariable("id") long productId,
                                        @RequestParam long quantity);

    // The fallback must return the same type as the protected method.
    default ResponseEntity<Void> productFallbackMethod(Exception e) {
        throw new CustomException("PRODUCT_UNAVAILABLE", "UNAVAILABLE", 500);
    }

To implement the bulkhead pattern in Resilience4j, we add the @Bulkhead annotation to the method and configure the maximum number of concurrent calls allowed and the maximum time a call may wait to enter the bulkhead. For further information, you can visit the following link:
https://resilience4j.readme.io/docs/bulkhead#create-and-configure-a-ThreadPoolBulkhead

Implementing The Retry Pattern

The retry pattern involves making multiple attempts to communicate with a service after a failure, such as a network disruption. Its primary purpose is to obtain the desired response by invoking the same service repeatedly until it succeeds or the maximum number of attempts is reached.

It is implemented by marking a method with the @Retry annotation and configuring the maximum number of retry attempts and the wait duration between attempts.
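The two parameters mentioned above, the maximum number of attempts and the wait duration between them, are enough to sketch the pattern in plain Java. The helper below is hypothetical and only illustrates the idea; it retries on any RuntimeException, whereas a real setup would usually restrict retries to specific, transient exceptions.

```java
import java.util.function.Supplier;

// Hypothetical retry helper (illustration only).
class RetryExample {
    // Invokes the call up to maxAttempts times, waiting between failed attempts.
    static <T> T retry(int maxAttempts, long waitMillis, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(waitMillis); // no wait after the final attempt
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during retry wait", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The simulated service fails twice and succeeds on the third attempt.
        String response = RetryExample.<String>retry(3, 100, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("temporary network disruption");
            }
            return "response";
        });
        System.out.println(response + " after " + calls[0] + " attempts");
    }
}
```

Retries are only safe for operations that can be repeated without side effects, which is worth keeping in mind before applying the pattern to calls such as reduceQuantity.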

Implementing The Rate Limiter Pattern

The rate limiter pattern limits the incoming requests to a service in a given timeframe. If the request attempts exceed the limit defined by the rate limiter, the excess calls are not permitted.

To implement this pattern, we mark a method with the @RateLimiter annotation, set a limit on the number of requests per period, and define the period after which the limit is refreshed. Resilience4j provides two implementations of the rate limiter: SemaphoreBasedRateLimiter and AtomicRateLimiter.
For further information, you can visit the following link: https://resilience4j.readme.io/docs/ratelimiter
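The limit and refresh period described above can be sketched as a simple fixed-window counter. This is a simplification for illustration and is neither of the two Resilience4j implementations: the class FixedWindowRateLimiter is hypothetical, and the clock is injected only so that the refresh period can be controlled when experimenting.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window rate limiter (illustration only).
class FixedWindowRateLimiter {
    private final int limitForPeriod;       // calls permitted per period
    private final long refreshPeriodMillis; // length of one period
    private final LongSupplier clock;       // injectable time source in milliseconds
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long refreshPeriodMillis, LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true if the call is permitted in the current period.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= refreshPeriodMillis) {
            windowStart = now; // a new period begins: refresh the limit
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false; // limit reached for this period
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter =
                new FixedWindowRateLimiter(2, 1_000, System::currentTimeMillis);
        for (int i = 1; i <= 3; i++) {
            System.out.println("call " + i + " permitted: " + limiter.tryAcquire());
        }
    }
}
```

Note that the counter tracks the total number of calls per period and knows nothing about how many of them are still running, which is exactly what separates it from a bulkhead.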

The critical distinction between the bulkhead and rate limiter patterns lies in their functions: the bulkhead limits the number of concurrent calls at any given moment, whereas the rate limiter controls the number of total calls within a specified time period.