Other Patterns
Compensating Transaction Pattern
Problem
If an operation that implements eventual consistency spans several heterogeneous data stores, undoing the steps in the operation requires visiting each data store in turn. The work performed in every data store must be undone reliably to prevent the system from remaining inconsistent. In a service-oriented architecture (SOA) environment, an operation could invoke an action in a service and cause a change in the state held by that service. To undo the operation, this state change must also be undone. This can involve invoking the service again and performing another action that reverses the effects of the first.
Solution
The solution is to implement a compensating transaction. The steps in a compensating transaction must undo the effects of the steps in the original operation. It must be an intelligent process that takes into account any work done by concurrent instances. This process will usually be application specific, driven by the nature of the work performed by the original operation.
A common approach is to use a workflow to implement an eventually consistent operation that requires compensation. As the original operation proceeds, the system records information about each step and how the work performed by that step can be undone. If the operation fails at any point, the workflow rewinds through the steps it’s completed and performs the work that reverses each step. Note that a compensating transaction might not have to undo the work in the exact reverse order of the original operation, and it might be possible to perform some of the undo steps in parallel. This approach is similar to the Sagas strategy discussed in Clemens Vasters’ blog.
A compensating transaction is also an eventually consistent operation and it could also fail. The system should be able to resume the compensating transaction at the point of failure and continue. It might be necessary to repeat a step that’s failed, so the steps in a compensating transaction should be defined as idempotent commands.
In some cases it might not be possible to recover from a step that has failed except through manual intervention. In these situations the system should raise an alert and provide as much information as possible about the reason for the failure.
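To make the workflow idea concrete, here is a minimal sketch in Python. The class, step names, and simulated failure are illustrative assumptions, not part of any particular framework; the point is that each completed step records its own undo action, and compensation rewinds those actions, most recent first.

```python
# Minimal sketch: each completed step records how to undo itself, and
# compensate() rewinds the recorded actions in reverse order.
# All names here are illustrative, not a real framework API.

class CompensatingWorkflow:
    def __init__(self):
        self._undo_log = []  # (step name, undo callable), most recent last

    def run_step(self, name, action, compensation):
        """Execute a step, then record how to undo it."""
        action()
        self._undo_log.append((name, compensation))

    def compensate(self):
        """Undo completed steps, most recent first. Undo actions should be
        idempotent so a failed compensation can be resumed and retried."""
        while self._undo_log:
            name, undo = self._undo_log[-1]
            try:
                undo()
            except Exception as exc:
                # Can't recover automatically: raise an alert with as much
                # information as possible and leave the remaining log intact.
                raise RuntimeError(f"Manual intervention needed to undo '{name}'") from exc
            self._undo_log.pop()


def charge_card():
    raise IOError("payment service unavailable")  # simulated failure


wf = CompensatingWorkflow()
try:
    wf.run_step("reserve-flight", lambda: print("flight reserved"),
                lambda: print("flight cancelled"))
    wf.run_step("reserve-hotel", lambda: print("hotel reserved"),
                lambda: print("hotel cancelled"))
    wf.run_step("charge-card", charge_card,
                lambda: print("charge refunded"))
except IOError:
    wf.compensate()  # cancels the hotel, then the flight
```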
Considerations
- It might not be easy to determine when a step in an operation that implements eventual consistency has failed. A step might not fail immediately, but instead could block. It might be necessary to implement some form of time-out mechanism.
- Compensation logic isn’t easily generalized. A compensating transaction is application specific. It relies on the application having sufficient information to be able to undo the effects of each step in a failed operation.
- You should define the steps in a compensating transaction as idempotent commands. This enables the steps to be repeated if the compensating transaction itself fails. A minimal sketch of an idempotent compensating command appears after this list.
- The infrastructure that handles the steps in the original operation, and the compensating transaction, must be resilient. It must not lose the information required to compensate for a failing step, and it must be able to reliably monitor the progress of the compensation logic.
- A compensating transaction doesn’t necessarily return the data in the system to the state it was in at the start of the original operation. Instead, it compensates for the work performed by the steps that completed successfully before the operation failed. The order of the steps in the compensating transaction doesn’t necessarily have to be the exact opposite of the steps in the original operation.
- Placing a short-term timeout-based lock on each resource that’s required to complete an operation, and obtaining these resources in advance, can help increase the likelihood that the overall activity will succeed.
- Consider using retry logic that is more forgiving than usual to minimize failures that trigger a compensating transaction.
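As a sketch of the idempotency consideration above: the compensating command below is safe to repeat because cancelling an already-cancelled reservation is a no-op. The reservation store and identifiers are hypothetical stand-ins for a real data store.

```python
# Sketch of an idempotent compensating command: running it more than once has
# the same effect as running it once, so it can safely be retried if the
# compensating transaction itself fails part-way through.
# The in-memory dict stands in for a real data store.

reservations = {"res-42": {"status": "confirmed"}}

def cancel_reservation(reservation_id):
    """Undo step: cancelling a missing or already-cancelled reservation
    is a no-op rather than an error."""
    reservation = reservations.get(reservation_id)
    if reservation is None or reservation["status"] == "cancelled":
        return  # already undone; nothing to do
    reservation["status"] = "cancelled"

cancel_reservation("res-42")
cancel_reservation("res-42")  # repeating the command has no further effect
```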
Usage
Use this pattern only for operations that must be undone if they fail. If possible, design solutions to avoid the complexity of requiring compensating transactions.
Retry Pattern
If an application detects a failure when it tries to send a request to a remote service, it can handle the failure using the following strategies:
- Cancel. If the fault indicates that the failure isn’t transient, or that the operation is unlikely to be successful if repeated, the application should cancel the operation and report an exception.
- Retry. If the specific fault reported is unusual or rare, it might have been caused by unusual circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application can retry the failing request immediately because the same failure is unlikely to be repeated.
- Retry after delay. If the fault is caused by one of the more commonplace connectivity or busy failures, the network or service might need a short period while the connectivity issues are corrected or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.
For the more common transient failures, the period between retries should be chosen to spread requests from multiple instances of the application as evenly as possible.
If the request still fails, the application can wait and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts, until some maximum number of requests have been attempted. The delay can be increased incrementally or exponentially.
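A minimal sketch of this retry loop follows, assuming a TransientError stands in for whatever exception the remote client actually raises for transient faults; the delay grows either incrementally or exponentially, with a little random jitter to spread retries from multiple application instances.

```python
# Sketch of retrying with an increasing delay between attempts. The parameter
# names and TransientError are illustrative assumptions.

import random
import time

class TransientError(Exception):
    """Stand-in for the exception a remote client raises for transient faults."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, exponential=True):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up once the maximum number of attempts is reached
            if exponential:
                delay = base_delay * (2 ** (attempt - 1))  # exponential increase
            else:
                delay = base_delay * attempt               # incremental increase
            time.sleep(delay + random.uniform(0, base_delay))  # jitter spreads requests
```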
An aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that’s running close to or at capacity. This retry policy could also affect the responsiveness of the application if it’s continually trying to perform a failing operation. If a request still fails after a significant number of retries, it’s better for the application to prevent further requests going to the same resource for a period and simply report a failure immediately. When the period expires, the application can tentatively allow one or more requests through to see whether they’re successful.
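The behaviour described above, stop sending requests for a period and then tentatively probe again, is essentially a circuit breaker. A minimal sketch, with illustrative thresholds and names:

```python
# Minimal sketch of a gate that fails fast for a period after repeated
# failures, then lets requests through again to probe the service.
# Thresholds and names are illustrative, not taken from any library.

import time

class FailureGate:
    def __init__(self, failure_threshold=5, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.blocked_until = 0.0

    def call(self, operation):
        if time.monotonic() < self.blocked_until:
            raise RuntimeError("failing fast: requests blocked for a period")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Stop sending requests for a while, then tentatively allow
                # calls through again to see whether they succeed.
                self.blocked_until = time.monotonic() + self.open_seconds
                self.failures = 0
            raise
        self.failures = 0
        return result
```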
The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies.
An application should log the details of faults and failing operations; this information is useful to operators. Log early failures as informational entries and only the failure of the last retry attempt as an actual error.
- Consider whether the operation is idempotent. If it is, it’s inherently safe to retry. Otherwise, retries could cause the operation to be executed more than once, with unintended side effects.
- Consider how retrying an operation that’s part of a transaction will affect the overall transaction consistency. Fine-tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.
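One way to wrap remote calls in a per-service retry policy while following the logging guidance above is sketched below; the decorator, policy values, and service function are hypothetical.

```python
# Sketch: a decorator applies a per-service retry policy to a remote call,
# logging intermediate failures as informational entries and only the final
# failure as an error. Policy values and names are illustrative.

import logging
import time

log = logging.getLogger("remote-calls")

def with_retry_policy(max_attempts=3, delay=1.0):
    def decorate(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        log.error("%s failed after %d attempts: %s",
                                  func.__name__, attempt, exc)
                        raise
                    log.info("%s attempt %d failed (%s); retrying",
                             func.__name__, attempt, exc)
                    time.sleep(delay)
        return wrapper
    return decorate

@with_retry_policy(max_attempts=5, delay=0.5)  # policy tuned for this service
def query_inventory_service():
    ...  # idempotent read, so it is inherently safe to retry
```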
If a service is frequently unavailable or busy, it’s often because the service has exhausted its resources. You can reduce the frequency of these faults by scaling out the service.
The retry policy should be tuned to match the business requirements of the application and the nature of the failure.
- For some noncritical operations, it’s better to fail fast rather than retry several times and impact the throughput of the application.
- For a batch application, it might be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
It’s useful for the retry policy to adjust the time between retry attempts based on the type of the exception.
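A sketch of how a policy might vary the decision and the delay by exception type follows; the exception classes are placeholders for whatever the remote client library actually raises.

```python
# Sketch: pick the retry strategy and delay from the type of exception.
# The exception classes are placeholders, not real library types.

class AuthenticationError(Exception): pass   # not transient
class NetworkGlitchError(Exception): pass    # rare, transient
class ServiceBusyError(Exception): pass      # commonplace busy fault

def delay_for(exc, attempt):
    """Return seconds to wait before the next attempt, or None to give up."""
    if isinstance(exc, AuthenticationError):
        return None                      # cancel: retrying won't help
    if isinstance(exc, NetworkGlitchError):
        return 0.0                       # retry immediately
    if isinstance(exc, ServiceBusyError):
        return min(30.0, 2 ** attempt)   # retry after an increasing delay
    return None                          # unknown fault: don't retry
```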
Ensure that all retry code is fully tested against a variety of failure conditions. Check that it doesn’t severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.
Implement retry logic only where the full context of a failing operation is understood. This is especially important when a failing operation invokes downstream operations that have their own retry policies, because nested retries can multiply the number of attempts and add long delays.
Throttling Pattern
There are many strategies available for handling varying load in the cloud, depending on the business goals for the application. One strategy is to use autoscaling to match the provisioned resources to the user needs at any given time. This has the potential to consistently meet user demand while optimizing running costs. However, provisioning additional resources isn’t immediate, so if demand grows quickly there can be a window of time where there’s a resource deficit. An alternative strategy to autoscaling is to allow applications to use resources only up to a limit, and then throttle them when this limit is reached. The system should monitor how it’s using resources so that, when usage exceeds the threshold, it can throttle requests from one or more users. This enables the system to continue functioning and meet any service level agreements (SLAs) that are in place.
The system could implement several throttling strategies, including:
- Rejecting requests from an individual user who’s already accessed system APIs more than n times per second over a given period of time. This requires the system to meter the use of resources for each tenant or user running an application (a minimal sketch of this kind of metering appears after this list).
- Disabling or degrading the functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources.
- Using load leveling to smooth the volume of activity, or deferring operations being performed on behalf of lower-priority applications or tenants.
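As a sketch of the metering strategy in the first bullet: a sliding-window counter per user, throttling once the limit is exceeded. In practice the counters would live in a shared store and the rejection would surface to clients as a specific error code such as HTTP 429; the names and limits here are illustrative.

```python
# Sketch of per-user request metering with a sliding window. Once a user
# exceeds the limit, further requests are throttled until the window clears.
# Limits and names are illustrative; real counters would live in shared storage.

import time
from collections import defaultdict, deque

class RateThrottle:
    def __init__(self, max_requests=100, window_seconds=1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)   # user id -> recent request times

    def allow(self, user_id):
        now = time.monotonic()
        recent = self.history[user_id]
        while recent and now - recent[0] > self.window:
            recent.popleft()                # drop requests outside the window
        if len(recent) >= self.max_requests:
            return False                    # throttled: caller should back off and retry later
        recent.append(now)
        return True

throttle = RateThrottle(max_requests=5, window_seconds=1.0)
for i in range(7):
    print(i, "allowed" if throttle.allow("user-1") else "throttled")
```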
Considerations:
- Throttling should be considered early in the application design process.
- Throttling must be performed quickly. The system must be capable of detecting an increase in activity and react accordingly. The system must also be able to revert to its original state quickly after the load has eased. This requires that the appropriate performance data is continually captured and monitored.
- If a service needs to temporarily deny a user request, it should return a specific error code (for example, HTTP 429 Too Many Requests) so the client application understands that the refusal to perform the operation is due to throttling. The client application can wait for a period before retrying the request.
- Throttling can be used as a temporary measure while a system autoscales. In some cases it’s better to simply throttle, rather than to scale, if a burst in activity is sudden and isn’t expected to be long lived. But if resource demands grow very quickly, the system might not be able to continue functioning, even when throttled.