High Availability

Introduction

All services - whether composable or not - can benefit from the JSLEE high availability (HA) system. To enable HA for a service, add a "ha" section to the service’s configuration section in the executing JSLEE server.json file.

The "ha" section of a service determines how that service handles all message delivery. The HA layer acts as a pass-through proxy for every message the service sends.

For example, an SMPP endpoint may be configured as follows to fail over between SMS-GW instances when the local instance is down. For detailed configuration documentation, see the high availability configuration documentation.

{
  "handler": "nz.co.nsquared.slee.smpp.SMPPVerticle",
  "instance-count": 1,
  "configuration": {
    "message-handler": "EventBus",
    "endpoints": [
      {
        "name": "smpp1",
        "port": 4001,
        "server": true,
        "authentication-address": "myauth",
        "processing-address": "ha-sms"
      }
    ],
    "ha": {
      "circuit-breakers": [
        {
          "name": "to cluster on failure",
          "failures-before-open": 5,
          "half-open-delay-ms": 60000,
          "failure-count-rolling-window-ms": 1000,
          "maximum-retries": 0
        } 
      ],
      "routing": [
        {
          "match-address": "smsgw",
          "distribute-to": "local:smsgw",
          "circuit-breaker": {
            "name": "to cluster on failure",
            "on-failure": {
              "distribute-to": "any:smsgw"
            }
          }
        }     
      ]
    }
  }
}

To enable high-availability for a service, one or more circuit breaker templates must be defined using the circuit-breakers configuration section. The circuit breaker templates are instantiated as required by the sender of a message, with independent circuit breaker instances for each configured route destination.

A circuit breaker template defines how to respond to errors in message delivery. Because the HA layer is a pass-through for every message sent by the source service, it also receives the reply to each of those messages and evaluates each reply for failure handling according to the following rules.

Note that it is the sender of the message that handles the failure, not the destination.

  1. A permanent error code received by the HA layer will always be propagated back to the source of the message. The source in this context is the software code in the service which generated the message.
  2. Any Java exception received during message delivery will be propagated back to the source of the message. All such exceptions will occur within the sender’s thread, and this is a special case of a permanent error being received.
  3. A timeout while waiting for a response to a sent message triggers circuit breaker handling. The sender raises the timeout when no reply is received from the recipient. Note that this case covers many classes of failure at the remote end.
  4. A temporary error code received from the remote end will trigger circuit breaker handling.
  5. If a destination for a message is not available (e.g. the service is shut down) then this is treated as a recipient failure and retry handling is skipped - instead failure handling is immediately invoked. This ensures that on service shutdown failover can be engaged as quickly as possible.
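
The five rules above amount to a mapping from the kind of reply the HA layer sees to the action it takes. The following is a minimal sketch of that mapping only. The enum and class names (ReplyKind, HaAction, ReplyClassifier) are hypothetical and are not part of the JSLEE API.

```java
// Hypothetical names throughout: none of these types are part of the JSLEE API.
enum ReplyKind {
    SUCCESS, PERMANENT_ERROR, EXCEPTION, TIMEOUT, TEMPORARY_ERROR, DESTINATION_UNAVAILABLE
}

enum HaAction {
    DELIVER_REPLY,        // normal reply, passed back to the source
    PROPAGATE_TO_SOURCE,  // rules 1 and 2: the HA layer does not intervene
    RETRY_OR_FAIL,        // rules 3 and 4: circuit breaker handling (retry, then failure handling)
    FAIL_IMMEDIATELY      // rule 5: skip retries and invoke failure handling at once
}

final class ReplyClassifier {
    /** Maps the kind of reply seen by the sender to the action described in the rules above. */
    static HaAction classify(ReplyKind kind) {
        switch (kind) {
            case SUCCESS:
                return HaAction.DELIVER_REPLY;
            case PERMANENT_ERROR:
            case EXCEPTION:
                return HaAction.PROPAGATE_TO_SOURCE;
            case TIMEOUT:
            case TEMPORARY_ERROR:
                return HaAction.RETRY_OR_FAIL;
            case DESTINATION_UNAVAILABLE:
                return HaAction.FAIL_IMMEDIATELY;
            default:
                throw new IllegalArgumentException("Unknown reply kind: " + kind);
        }
    }
}
```

How the real HA layer distinguishes permanent from temporary error codes is not shown here; the sketch assumes that classification has already been made.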

Each circuit breaker template determines how the circuit breaker responds to temporary failures. (Permanent failures are always propagated back up the chain, and the HA layer does not deal with such failures directly.)

  1. Retry handling. On a temporary failure the circuit breaker may retry the request by resending it to the same destination. The maximum number of times a request is retried can be controlled. If the retry count is exhausted, a message is considered failed.
  2. Failure handling. For failed requests - either due to the retry count being exceeded, or due to the circuit being open - failure handling is triggered.
  3. The circuit breaker will open after too many failures. While a circuit breaker is open the circuit breaker will immediately fail any send request without actually trying to send the message.
  4. The circuit breaker will half-open after a configured period. In the half-open state a single request will be attempted. On success the circuit will “close”, allowing normal behaviour again. However, if that single request also fails (with a temporary failure) the circuit will reopen.
  5. While closed, the circuit breaker will send all messages received to their destination. Failures while closed are handled using retry and then failure handling.
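
The closed, open and half-open behaviour described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the JSLEE implementation: the class name CircuitBreakerSketch and its methods are hypothetical, the three constructor arguments stand in for failures-before-open, half-open-delay-ms and failure-count-rolling-window-ms, and the sketch assumes the circuit opens when the failure count within the window reaches the configured threshold. Time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: not the JSLEE implementation.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failuresBeforeOpen;   // stands in for "failures-before-open"
    private final long halfOpenDelayMs;     // stands in for "half-open-delay-ms"
    private final long rollingWindowMs;     // stands in for "failure-count-rolling-window-ms"
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private State state = State.CLOSED;
    private long openedAtMs;
    private boolean probeInFlight;

    CircuitBreakerSketch(int failuresBeforeOpen, long halfOpenDelayMs, long rollingWindowMs) {
        this.failuresBeforeOpen = failuresBeforeOpen;
        this.halfOpenDelayMs = halfOpenDelayMs;
        this.rollingWindowMs = rollingWindowMs;
    }

    /** Returns true if a send may be attempted at time nowMs. */
    boolean allowSend(long nowMs) {
        // After the configured delay an open circuit becomes half-open.
        if (state == State.OPEN && nowMs - openedAtMs >= halfOpenDelayMs) {
            state = State.HALF_OPEN;
            probeInFlight = false;
        }
        if (state == State.CLOSED) {
            return true;
        }
        // Half-open: allow exactly one probe request.
        if (state == State.HALF_OPEN && !probeInFlight) {
            probeInFlight = true;
            return true;
        }
        return false; // open, or half-open with the probe already in flight
    }

    /** A successful half-open probe closes the circuit again. */
    void onSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;
            failureTimes.clear();
        }
    }

    /** Records a temporary failure observed at time nowMs. */
    void onTemporaryFailure(long nowMs) {
        if (state == State.OPEN) {
            return; // already open; nothing further to track
        }
        if (state == State.HALF_OPEN) {
            open(nowMs); // the single probe failed, so reopen
            return;
        }
        failureTimes.addLast(nowMs);
        // Discard failures that have fallen out of the rolling window.
        while (!failureTimes.isEmpty() && nowMs - failureTimes.peekFirst() > rollingWindowMs) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= failuresBeforeOpen) {
            open(nowMs);
        }
    }

    private void open(long nowMs) {
        state = State.OPEN;
        openedAtMs = nowMs;
        failureTimes.clear();
    }

    State state() {
        return state;
    }
}
```

Retry handling is deliberately left out of this sketch, since the retry count belongs to each individual message rather than to the circuit breaker.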

Circuit Breaker Instances

Each object in the circuit-breakers section of the configuration file is a template. During message processing, the JSLEE HA layer will instantiate a dedicated circuit breaker instance for each unique message destination / route pair.

A unique message destination / route pair is identified using a key with the following form:

route.match-address + destination address

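Where route.match-address is the match-address value of the route that matched the message, and destination address is the address the message is distributed to (for example local:smsgw).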

This allows each destination for each route to be handled by a unique circuit breaker instance. Each circuit breaker instance tracks:

  1. The current state of the circuit breaker - CLOSED, OPEN or HALF_OPEN.
  2. Number of actual failures that have occurred within the failure period - where the failure period is the last n milliseconds (where n is configurable).
  3. If the circuit is HALF_OPEN, when to retry.

Note that because each route destination has a separate circuit breaker, the addresses local:smsgw and any:smsgw track their failures and state separately. However, as any:smsgw covers all nodes, per-node configuration may be useful where greater control is required.

Note that each message has its own retry count, which is tracked only for the delivery of that specific message.
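
The way a template is turned into one instance per route and destination pair can be sketched with a map keyed on the two values. The class names below are hypothetical and not part of the JSLEE API, and the "|" separator in the key is an assumption made for readability. The text above only states that the key combines the route match-address and the destination address.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these classes are illustrative and not part of the JSLEE API.
final class BreakerRegistry {
    /** Per-destination state, mirroring the three tracked items listed above. */
    static final class Instance {
        final String templateName;   // the circuit breaker template this instance was created from
        String state = "CLOSED";     // CLOSED, OPEN or HALF_OPEN
        int failuresInWindow = 0;    // failures within the configured failure period
        long retryAtMs = 0L;         // when to retry next

        Instance(String templateName) {
            this.templateName = templateName;
        }
    }

    private final Map<String, Instance> instances = new HashMap<>();

    /** Builds the instance key. The "|" separator is an assumption of this sketch. */
    static String key(String matchAddress, String destination) {
        return matchAddress + "|" + destination;
    }

    /** Returns the existing instance for the pair, creating it from the template on first use. */
    Instance instanceFor(String matchAddress, String destination, String templateName) {
        return instances.computeIfAbsent(key(matchAddress, destination),
                k -> new Instance(templateName));
    }

    int size() {
        return instances.size();
    }
}
```

With the example configuration, requesting instances for ("smsgw", "local:smsgw") and ("smsgw", "any:smsgw") yields two separate instances from the same template, so a failure recorded against the local destination does not affect the state of the cluster-wide one.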

Route Selection

While each circuit breaker is named and independent, the routing list is matched in the order written. Which route is used, if any, depends on which matches first. This means that the most specific route must be listed first.
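
First-match selection, and why ordering matters, can be sketched as follows. The class names are hypothetical, and the sketch assumes for illustration that a route matches when the message address starts with the route's match value. The actual JSLEE matching rule is not described in this section and may differ.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: prefix matching is an assumption, not the documented JSLEE rule.
final class RouteSelector {
    static final class Route {
        final String matchAddress;
        final String distributeTo;

        Route(String matchAddress, String distributeTo) {
            this.matchAddress = matchAddress;
            this.distributeTo = distributeTo;
        }
    }

    /** Returns the first route in list order that matches the address, if any. */
    static Optional<Route> select(List<Route> routes, String address) {
        for (Route route : routes) {
            if (address.startsWith(route.matchAddress)) {
                return Optional.of(route);
            }
        }
        return Optional.empty();
    }
}
```

Under this assumed rule, a general route listed ahead of a more specific one would match every address the specific route was meant for, so the specific route would never be selected.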

Circuit Breaker Selection and Configuration Override

Each route may name a circuit breaker template to use for the destinations to which it sends messages. The route’s circuit-breaker section can override any field configured in the template with a route-specific value. In particular, this section can define an on-failure configuration unique to the route, which is then used on failure by every circuit breaker instance the route creates.
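
The override behaviour can be pictured as overlaying the route's circuit-breaker fields on a copy of the named template. The following sketch assumes both are held as simple key and value maps. The class and method names are hypothetical and are not part of the JSLEE API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: illustrates the override idea only, not the JSLEE implementation.
final class BreakerConfigMerge {
    /**
     * Returns the effective configuration for a route: every template field,
     * with any field present in the route's circuit-breaker section taking precedence.
     * Neither input map is modified.
     */
    static Map<String, Object> effective(Map<String, Object> template,
                                         Map<String, Object> routeOverride) {
        Map<String, Object> merged = new HashMap<>(template);
        merged.putAll(routeOverride);
        return merged;
    }
}
```

For the example configuration, the template supplies failures-before-open, half-open-delay-ms, failure-count-rolling-window-ms and maximum-retries, while the route contributes its own on-failure entry.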