Hystrix: How To Handle Cascading Failures In Microservices
Problem Statement: One of our microservices (say X) depends on a third-party service (say Y) for its functionality. We observed that when Y became unhealthy, every request from X involving a call to Y suffered increased response times, because X kept calling Y repeatedly without handling the failures. X's threads were busy processing these slow requests, which increased CPU usage and left fewer free threads to process other requests. Eventually the service became unresponsive in production, leading to an outage for the business.

Solution: We used the Netflix Hystrix library to handle external-service failures, so that our application does not waste resources continuously calling an unhealthy external service; it skips the call based on configured threshold parameters, keeping application threads and overall health in an efficient state. We maintain a dedicated Hystrix thread pool for external calls with a maximum size of 10 threads, which limits the impact if the external service is unhealthy. The circuit breaker is set to open within 10 seconds if 60% of requests fail; the circuit remains open for 5 seconds, then moves to the half-open state, and from there returns to closed or back to open depending on whether the next request succeeds or fails. The exact configuration appears later in this post.

What Can Go Wrong in a Microservice Architecture?

There are many moving components in a microservice architecture, and hence more points of failure. Failures can be caused by a variety of reasons: errors and exceptions in code, releases of new code, bad deployments, hardware failures, data center failures, poor architecture, lack of unit tests, communication over an unreliable network, dependent services, and so on.

Why Do You Need to Make Services Resilient?

The problem with distributed applications is that they communicate over a network, which is unreliable. Hence you need to design your microservices to be fault-tolerant and to handle failures gracefully. In a microservice architecture there might be a dozen services talking to each other, so you need to ensure that one failed service does not bring down the entire system.

Circuit Breaker Pattern 

You wrap a protected function call in a circuit breaker object, which monitors for failures. Once failures reach a certain threshold, the circuit breaker trips, and all further calls return immediately with an error, a response from an alternative service, or a default message, without the protected call being made at all. This keeps the system responsive and ensures threads are not stuck waiting on an unresponsive call.

The Different States of the Circuit Breaker

The circuit breaker has three distinct states: Closed, Open, and Half-Open (a minimal sketch of the transitions follows this list):

  • Closed – When everything is normal, the circuit breaker remains in the closed state and all calls pass through to the services. When the number of failures exceeds a predetermined threshold the breaker trips, and it goes into the Open state.
  • Open – The circuit breaker returns an error for calls without executing the function.
  • Half-Open – After a timeout period, the circuit switches to a half-open state to test if the underlying problem still exists. If a single call fails in this half-open state, the breaker is once again tripped. If it succeeds, the circuit breaker resets back to the normal, closed state.
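
To make these transitions concrete, here is a minimal, illustrative sketch in Java. It is not Hystrix's actual implementation (Hystrix uses a rolling window and an error percentage rather than a simple failure count), and the names and thresholds are ours:

enum State { CLOSED, OPEN, HALF_OPEN }

class SimpleCircuitBreaker {
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    private final int failureThreshold;   // failures before the breaker trips
    private final long sleepWindowMillis; // how long to stay OPEN before trying again

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // sleep window elapsed: let one trial call through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        state = State.CLOSED; // trial call succeeded: reset to normal
        failures = 0;
    }

    synchronized void onFailure() {
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN; // trip (or re-trip) the breaker
            openedAt = System.currentTimeMillis();
        }
    }
}

A caller checks allowRequest() before making the protected call and reports the outcome via onSuccess() or onFailure(). Hystrix handles all of this for you, as described next.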

What Is Hystrix?

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services, and third-party libraries in a distributed environment. It helps stop cascading failures and enables resilience in complex distributed systems, where failure is inevitable.

How Does Hystrix Accomplish Its Goals?

Hystrix does this by:

  • Wrapping all calls to external systems (or “dependencies”) in a HystrixCommand or HystrixObservableCommand object, which typically executes on a separate thread.
  • Timing out calls that take longer than the thresholds you define.
  • Maintaining a small thread pool (or semaphore) for each dependency; if it becomes full, requests destined for that dependency are immediately rejected instead of queued up.
  • Measuring successes, failures (exceptions thrown by the client), timeouts, and thread rejections.
  • Tripping a circuit breaker to stop all requests to a particular service for a period of time, either manually or automatically if the error percentage for the service exceeds the threshold.
  • Performing fallback logic when a request fails, is rejected, times out, or short-circuits (a minimal command sketch follows this list).
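
To illustrate the first and last points, here is a sketch of a call to service Y wrapped in a plain HystrixCommand. CallYCommand, callServiceY, and the "ServiceY" group key are illustrative names; the Spring annotation style shown later in this post achieves the same effect:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CallYCommand extends HystrixCommand<String> {
    private final String request;

    public CallYCommand(String request) {
        super(HystrixCommandGroupKey.Factory.asKey("ServiceY"));
        this.request = request;
    }

    @Override
    protected String run() throws Exception {
        // Runs on a Hystrix-managed thread; may time out or be rejected
        return callServiceY(request);
    }

    @Override
    protected String getFallback() {
        // Invoked on failure, timeout, rejection, or an open circuit
        return "default-response";
    }

    private String callServiceY(String request) throws Exception {
        // Placeholder for the real remote call to Y
        throw new Exception("Y is down");
    }
}

A call site then runs new CallYCommand("req").execute() for a synchronous result, or queue() for a Future.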

Using Hystrix with a Spring Boot Application

Add the following dependency to your POM file:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
    <version>{latest-version}</version>
</dependency>

For the version, refer to Spring Cloud Starter Netflix 2.0.1.RELEASE.

Add the @EnableCircuitBreaker annotation to enable the Hystrix circuit breaker for your application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;

@SpringBootApplication
@EnableCircuitBreaker
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

One way to wrap a function call with a Hystrix command is shown below (a sketch: externalClient and InteractionData are illustrative names, and AsyncResult makes the call run asynchronously on the configured Hystrix thread pool):

public static final String DISPOSE_KEY = "disposeKey";
public static final String DISPOSE_POOL = "disposePool";

@HystrixCommand(commandKey = DISPOSE_KEY, threadPoolKey = DISPOSE_POOL)
public Future<Void> disposeCall(InteractionData interactionData) {
    return new AsyncResult<Void>() {
        @Override
        public Void invoke() {
            externalClient.dispose(interactionData); // runs on the disposePool thread
            return null;
        }
    };
}
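
Callers get the Future back immediately while the wrapped call runs on the disposePool thread, so a slow or failing Y no longer ties up the caller's thread. Fallback logic can be attached through the fallbackMethod attribute; a minimal sketch with a synchronous command (method names are illustrative):

@HystrixCommand(commandKey = DISPOSE_KEY, threadPoolKey = DISPOSE_POOL, fallbackMethod = "disposeFallback")
public String dispose(String payload) {
    return externalClient.send(payload); // hypothetical remote call
}

// Same signature as the command; invoked on failure, timeout, rejection, or an open circuit
public String disposeFallback(String payload) {
    return "skipped"; // e.g. log and return a safe default
}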

Hystrix Properties

Example of setting properties in a configuration file:

hystrix.command.disposeKey.circuitBreaker.sleepWindowInMilliseconds=5000
hystrix.command.disposeKey.circuitBreaker.requestVolumeThreshold=5
hystrix.command.disposeKey.circuitBreaker.errorThresholdPercentage=60
hystrix.command.disposeKey.execution.isolation.thread.timeoutInMilliseconds=10000

hystrix.threadpool.disposePool.maxQueueSize=10
hystrix.threadpool.disposePool.queueSizeRejectionThreshold=10
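
The sleep window, request volume threshold, error percentage, and timeout above correspond to the behaviour described in the solution. Note that the number of threads in the pool itself is controlled by coreSize (its default is 10); a one-line sketch, assuming the same disposePool key:

hystrix.threadpool.disposePool.coreSize=10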

Check out the Hystrix Configuration

Hystrix Dashboard

The Hystrix Dashboard allows you to monitor Hystrix metrics in real time.
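
One common way to run it in a Spring Boot app is with the spring-cloud-starter-netflix-hystrix-dashboard dependency and the @EnableHystrixDashboard annotation; a sketch (the dashboard is then pointed at a service's hystrix.stream metrics endpoint):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.hystrix.dashboard.EnableHystrixDashboard;

@SpringBootApplication
@EnableHystrixDashboard
public class DashboardApplication {
    public static void main(String[] args) {
        SpringApplication.run(DashboardApplication.class, args);
    }
}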

Check out the Hystrix Dashboard