by Guido Schmutz and Ronald van Luttikhuizen
An Introduction to Fault Handling in a Service-Oriented Environment
November 2012
It is one thing to design and code the "happy flow" of your automated business processes and services. It is another thing to deal with unwanted, unexpected situations that might occur in your processes and services. This article, the first in a four-part series, will dive into fault handling and prevention in an environment based on Service-Oriented Architecture (SOA) and Business Process Management (BPM) principles. You will learn about the different types of faults that can occur and how fault handling in an SOA environment differs from fault handling in traditional systems. We will investigate what can go wrong in such environments based on a case study of an Order-to-Cash business process. For each of these problems you will learn about the out-of-the-box capabilities in Oracle Service Bus and Oracle SOA Suite that can be applied to prevent faults from happening and to deal with them when they do occur.
A fault can be defined as something that is unusual and happens outside the normal and expected operational activity or "happy flow" of a process. Faults in IT-systems can be categorized into the following types:
However hard we try, it is impossible to prevent all faults from happening. When faults do occur, it is important to detect them and to have effective fault handling measures in place to deal with them and recover from them. Don't just ignore faults and hope nothing bad happens. The differences between these approaches are shown in the following comics from Geek and Poke.
(Presented under the Creative Commons Attribution 3.0 Unported License.)
The remainder of this article focuses on preventing and handling technical and business faults in the context of an SOA environment.
Business faults and technical faults differ in a number of areas. The following table illustrates the differences between these types of faults.
Business Faults... | Technical Faults... |
---|---|
indicate a failure to meet a particular business requirement | indicate a runtime failure of a hardware or software component |
are often expected and can be defined in service interfaces | are often unexpected and are not defined in service interfaces |
have business value | have no business value |
are faults that service clients can recover from | are faults that service clients cannot (easily) recover from; they are dependent on others to fix the underlying issue |
The following sections each provide an example of such a fault.
Consider a customer who orders some products on a new web shop for the first time. The web shop application depends on the CRM system to retrieve the customer's details and preferences. In this case, the CRM system doesn't know the customer and returns a business fault to the web shop application indicating that the customer is unknown. This is a business fault that can be expected, and you can recover from it by, for example, redirecting the new customer to a webpage where he can complete his registration. After the registration, business continues as usual.
The following snippet shows the business fault as returned by the CRM system's Web Service. The fault indicates that a particular customer is not found in the backend system.
<soap:Envelope>
  <soap:Header/>
  <soap:Body>
    <soap:Fault>
      <faultcode>CST-1234</faultcode>
      <faultstring>Customer not found</faultstring>
      <detail>
        <CustomerNotFoundFault>
          <CustomerName>John Doe</CustomerName>
        </CustomerNotFoundFault>
      </detail>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
This particular fault, CustomerNotFoundFault, is specified in the Web Service interface so the service's clients can expect it. The WSDL for the message above is partially shown in the following snippet.
<wsdl:definitions name="CustomerService" targetNamespace="..." ...>
  ...
  <wsdl:message name="FindCustomerRequestMsg">
    <wsdl:part name="FindCustomerRequest"
               element="tns:FindCustomerRequest"/>
  </wsdl:message>
  <wsdl:message name="FindCustomerResponseMsg">
    <wsdl:part name="FindCustomerResponse"
               element="tns:FindCustomerResponse"/>
  </wsdl:message>
  <wsdl:message name="CustomerNotFoundFaultMsg">
    <wsdl:part name="Error"
               element="tns:CustomerNotFoundFault"/>
  </wsdl:message>
  <wsdl:portType name="CustomerServicePortType">
    <wsdl:operation name="FindCustomer">
      <wsdl:input name="FindCustomerRequest"
                  message="tns:FindCustomerRequestMsg"/>
      <wsdl:output name="FindCustomerResponse"
                   message="tns:FindCustomerResponseMsg"/>
      <wsdl:fault name="CustomerNotFoundFault"
                  message="tns:CustomerNotFoundFaultMsg"/>
    </wsdl:operation>
  </wsdl:portType>
  ...
</wsdl:definitions>
A fault is just another message being returned by the service operation, and should be used whenever a business fault is signaled. This is much better than adding error codes or flags to the "normal" response message, as it forces the consumer to detect and deal with the fault situation.
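To make the contrast between a dedicated fault and an error flag concrete, here is a minimal Python sketch. It is an illustration only: `CustomerNotFoundFault`, `find_customer`, `find_customer_with_flag`, `place_order`, and the in-memory `_CUSTOMERS` store are hypothetical stand-ins for the CRM service and its consumer, not part of any Oracle API.

```python
from dataclasses import dataclass


class CustomerNotFoundFault(Exception):
    """Business fault, analogous to the fault declared in the WSDL above."""

    def __init__(self, customer_name: str):
        super().__init__(f"Customer not found: {customer_name}")
        self.customer_name = customer_name


@dataclass
class Customer:
    name: str
    email: str


# Hypothetical in-memory stand-in for the CRM backend.
_CUSTOMERS = {"Jane Roe": Customer("Jane Roe", "jane@example.com")}


def find_customer_with_flag(name: str) -> dict:
    """Error-flag style: nothing forces the caller to inspect 'status'."""
    customer = _CUSTOMERS.get(name)
    if customer is None:
        return {"status": "NOT_FOUND", "customer": None}
    return {"status": "OK", "customer": customer}


def find_customer(name: str) -> Customer:
    """Fault style: an unknown customer interrupts the normal flow."""
    customer = _CUSTOMERS.get(name)
    if customer is None:
        raise CustomerNotFoundFault(name)
    return customer


def place_order(name: str) -> str:
    """Consumer that recovers from the expected business fault."""
    try:
        customer = find_customer(name)
    except CustomerNotFoundFault as fault:
        return f"redirect to registration for {fault.customer_name}"
    return f"order accepted for {customer.name}"
```

With the flag variant, a consumer that forgets to check `status` carries on with an empty customer. With the fault variant, the consumer either handles the fault or the call fails visibly.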
Let's revisit the same scenario, only this time the CRM system is unreachable due to unplanned downtime. You cannot easily recover from this error, and you need to wait for IT operations or the hosting provider to restore the system to a normal operational condition. In that case, one option for handling the fault, and it's not a good one, is to ask the customer to retry his order in a few hours.
The following snippet shows a technical fault returned by the service infrastructure when a service on an external system is not reachable. The fault indicates that a technical error occurred on the server.
<soap:Envelope>
  <soap:Header/>
  <soap:Body>
    <soap:Fault>
      <faultcode>S:Server</faultcode>
      <faultstring>Could not connect to 127.0.0.1:443</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
The fault is returned as a generic SOAP fault and is not explicitly specified in the Web Service interface.
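A service client can use this distinction to decide how to react. The following Python sketch, based on the standard `xml.etree.ElementTree` module, treats a fault as a business fault only when its `detail` element contains a fault declared in the service interface. It assumes SOAP 1.1 and adds the envelope namespace declaration that the snippets above omit. `classify_fault` and `DECLARED_FAULTS` are hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (assumed; the article's snippets omit it).
NAMESPACES = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}

# Faults this hypothetical client knows from the service interface.
DECLARED_FAULTS = {"CustomerNotFoundFault"}


def classify_fault(envelope_xml: str) -> str:
    """Return 'business', 'technical', or 'none' for a SOAP 1.1 response."""
    root = ET.fromstring(envelope_xml)
    fault = root.find("soap:Body/soap:Fault", NAMESPACES)
    if fault is None:
        return "none"
    detail = fault.find("detail")
    if detail is not None:
        for child in detail:
            # Drop a '{namespace}' prefix from the tag, if there is one.
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name in DECLARED_FAULTS:
                return "business"
    # Anything not declared in the interface is treated as technical.
    return "technical"


BUSINESS_FAULT = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>CST-1234</faultcode>
      <faultstring>Customer not found</faultstring>
      <detail>
        <CustomerNotFoundFault>
          <CustomerName>John Doe</CustomerName>
        </CustomerNotFoundFault>
      </detail>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
"""

TECHNICAL_FAULT = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>S:Server</faultcode>
      <faultstring>Could not connect to 127.0.0.1:443</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
"""
```

Here `classify_fault(BUSINESS_FAULT)` yields `"business"` and `classify_fault(TECHNICAL_FAULT)` yields `"technical"`. A real client would typically rely on generated stubs rather than parsing envelopes by hand.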
Be careful not to expose implementation details when returning technical faults to service clients; do not just return the technical fault of the underlying system. Such faults may include sensitive information, including connection details, driver versions, used credentials, and operating system details which can be exploited by hackers. This best practice is known as exception shielding.
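Exception shielding can be sketched in a few lines of Python. This is one possible approach, not the mechanism used by Oracle Service Bus or Oracle SOA Suite: the `shield` decorator, the `ServiceFault` class, and the idea of handing the client a log reference are assumptions made for this example.

```python
import functools
import logging
import uuid

logger = logging.getLogger("customer_service")


class ServiceFault(Exception):
    """Generic fault that is safe to return to service clients."""

    def __init__(self, code: str, message: str, reference: str):
        super().__init__(message)
        self.code = code
        self.message = message
        self.reference = reference


def shield(operation):
    """Wrap an operation so internal error details never reach the client."""

    @functools.wraps(operation)
    def wrapper(*args, **kwargs):
        try:
            return operation(*args, **kwargs)
        except ServiceFault:
            raise  # already safe to expose
        except Exception:
            reference = uuid.uuid4().hex
            # The full details, including the stack trace, stay in the log.
            logger.exception("Technical fault, reference %s", reference)
            # 'from None' keeps the original exception out of the fault chain.
            raise ServiceFault(
                "Server", "An internal error occurred.", reference
            ) from None

    return wrapper


@shield
def find_customer(name: str) -> dict:
    # Hypothetical backend failure whose message leaks connection details.
    raise ConnectionError("Could not connect to crm-db:1521 as user crm_admin")
```

The client only sees the generic message and a reference that operations staff can use to look up the full error in the server log.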
The goal of every developer should be to create unbreakable systems. The extent to which that goal can be achieved depends on the success of efforts to handle and manage expected and unexpected exception conditions. Object-oriented languages such as C++ and Java provide an efficient way of handling exceptions using constructs such as try, catch, and finally. In an SOA, most of what is available at the language level is still valid and usable for creating elementary services.
However, SOA raises different challenges when orchestrating services and creating composite applications. Figure 1 illustrates a typical SOA and BPM environment. This environment includes:
Figure 1: Typical SOA and BPM environment
The following aspects of such an environment have an impact on the way you should implement fault prevention and handling:
In the rest of this article, and in subsequent articles in this series, you will learn what patterns and out-of-the-box features can be used to implement effective fault prevention and handling in an SOA environment.
In parts two and three of this article series we will delve into the fault prevention and fault handling capabilities of the most important SOA building blocks of Oracle SOA Suite: the SCA infrastructure (with its service components such as BPEL and Mediator) and Oracle Service Bus. We will present these capabilities using a scenario that is complex enough to show some real-life error situations.
Figure 2 shows this scenario, an order process that is implemented on Oracle SOA Suite 11g. We are using the Trivadis integration blueprint notation, as presented in Service Oriented Architecture: An Integration Blueprint [See Sources]. The left side shows the process steps from the moment the order request is received until the order is processed. The right side shows all the external systems that the process application interacts with to complete orders. Such systems include the application in which clients can order products, two different credit card service providers to bill clients, a product database, the order processing application, and a history service to store completed instances of the process. The middle lanes show the integration of the process with the backend systems using services that are invoked from the process and exposed by Oracle Service Bus 11g.
The Order Processing system is a legacy application, which can only be integrated with our system by using queues.
Figure 2: Order Process implemented on Oracle SOA Suite 11g
The scenario includes the following steps:
Now let's take a look at some of the fault situations that can arise in our scenario. Thanks to Murphy's Law, quite a lot can go wrong, some of it expected and some unexpected. The following image shows the scenario, but this time overlaid with some potential problems.
Figure 3: Possible Fault Situations
The following situations can occur, which we either have to avoid or handle:
The next two articles in the series will discuss how these faults can be prevented and/or handled using the capabilities of Oracle Service Bus and Oracle SOA Suite.
Various patterns can be applied to prevent faults from happening or to handle them when they cannot be prevented. The following table lists patterns that improve the fault prevention and handling capabilities of your software. These patterns should be seen from the perspective of the service provider. In other words, in order for services to provide added value and good quality-of-service, the service provider is responsible for implementing the fault handling patterns, thereby relieving the service consumers of this task.
You can read more about these patterns in Patterns for Fault Tolerant Software [See Sources].
Action | Prevention or Handling | Description |
---|---|---|
Inaction | Handling | Simply ignore the request |
Balk | Handling | Admit failure |
Guarded suspension | Handling | Suspend execution until conditions for correct execution are established |
Provisional action | Prevention | Pretend to perform the request, but do not commit until success is granted |
Alternative action | Prevention | Perform an acceptable alternative; e.g. automatic failover by the service provider |
Rollback | Handling | Try to proceed, but on failure, undo the effects of the failed action |
Retry | Prevention | Repeatedly attempt a failed action after recovering from failed attempts. In case of a successful retry the service consumer is not aware of any failure. In case of unsuccessful retry the fault is passed back to the consumer |
Appeal to higher authority | Handling | Ask someone to apply judgment and steer the software to an acceptable resolution |
Resign | Handling | Minimize damage, write log information, then signal definite and safe failure |
Compensation | Handling | Undo activities by executing the opposite actions in a reverse order |
Exception shielding | Handling | Hide implementation details in returned fault messages for security reasons |
Share the load | Prevention | Have multiple instances for a system so the unavailability of one instance can be handled by other instances |
Heartbeat | Prevention | Periodically check the availability of the system to detect failures in an early stage |
Throttling | Prevention | Manage the number of messages that are sent to systems by using a queuing mechanism |
In Part 2 of this series you will learn how to use out-of-the-box features of Oracle Service Bus 11g to prevent faults and to expose reliable and robust services to your service consumers.
Guido Schmutz is Technology Manager for SOA and Emerging Trends at Trivadis and an Oracle ACE Director.
Ronald van Luttikhuizen is Managing Partner and Architect at Vennster and an Oracle ACE Director.