Fault Handling and Prevention - Part 1

by Guido Schmutz and Ronald van Luttikhuizen

An Introduction to Fault Handling in a Service-Oriented Environment

November 2012

Introduction

It is one thing to design and code the "happy flow" of your automated business processes and services. It is another thing to deal with unwanted, unexpected situations that might occur in your processes and services. This article, the first in a four-part series, will dive into fault handling and prevention in an environment based on Service-Oriented Architecture (SOA) and Business Process Management (BPM) principles. You will learn about the different types of faults that can occur and how fault handling in an SOA environment differs from fault handling in traditional systems. We will investigate what can go wrong in such environments based on a case study of an Order-to-Cash business process. For each of these problems you will learn about the out-of-the-box capabilities in Oracle Service Bus and Oracle SOA Suite that can be applied to prevent faults from happening and to deal with them when they do occur.

What is fault handling?

A fault can be defined as something that is unusual and happens outside the normal and expected operational activity or "happy flow" of a process. Faults in IT-systems can be categorized into the following types:

  • Technical errors - Faults caused by errors in the underlying infrastructure or middleware components on which applications run. Examples are network errors, server failures, corrupt disks, full tablespaces, and so on.
  • Software errors - Faults caused by programming errors in custom-made applications, faults in 3rd party software libraries that are used, software faults and bugs in packaged applications, etc. Think of division by zero, infinite loops, memory leaks, null pointer exceptions, and so on.
  • Faulty operation by users - Faults caused by human errors when using IT-systems. Examples of such faults are entering a wrong credit card number, accidentally switching the to and from date when booking flight tickets, ordering the wrong eBook in a web shop, and so on.
  • Exceptional business behavior - Faults caused by a failure to meet a certain business rule. For example, a customer with a bad credit rating, a new customer that wants to purchase something but is unknown in the CRM system, an invoice with an incorrect invoice amount.

Compared to "business as usual," dealing with faults is expensive and time-consuming, since fault handling often involves human work and expertise: the CRM department might need to call a customer, IT operations might need to increase tablespaces, the Financial department might need to compare statements of work and invoices, and so on. Fault handling should therefore focus on prevention first:

  • Technical errors can be prevented by installing and configuring infrastructure and middleware in a robust and (possibly) redundant fashion. Hardening and active monitoring help to maintain quality of the infrastructure.
  • Software errors can be prevented by applying software development methodologies and best practices such as pair-programming, collegial reviews, and test-driven development.
  • Faulty user operations can be prevented by applying good User Experience (UX) techniques and methodologies so that IT systems are well designed, easy-to-use, and that applications provide the information that users need.
  • Business faults can be prevented by clearly specifying to stakeholders what conditions apply. Stakeholders can be customers, suppliers, employees, and so on.

However hard we try, it is impossible to prevent all faults from happening. When faults do occur, it is important to detect them and to have effective fault handling measures in place to deal with and recover from them. Don't just ignore faults and hope nothing bad happens. The differences between these approaches are shown in the following comics from Geek and Poke.

 

(Presented under the Creative Commons Attribution 3.0 Unported License.)

The remainder of this article focuses on preventing and handling technical and business faults in the context of a SOA-environment.

Business Faults versus Technical Faults

Business faults and technical faults differ in a number of areas. The following table illustrates the differences between these types of faults.

| Business Faults... | Technical Faults... |
|---|---|
| indicate a failure to meet a particular business requirement | indicate a runtime failure of a hardware or software component |
| are often expected and can be defined in service interfaces | are often unexpected and are not defined in service interfaces |
| have business value | have no business value |
| are faults that service clients can recover from | are faults that service clients cannot (easily) recover from; they are dependent on others to fix the underlying issue |

The following sections each provide an example of such a fault.

Business Faults

Consider a customer who orders some products on a new web shop for the first time. The web shop application depends on the CRM system to retrieve the customer's details and preferences. In this case, the CRM system doesn't know the customer and throws a business fault to the web shop application indicating that the customer is unknown. This is a business fault that can be expected, and you can recover from it by, for example, redirecting the new customer to a webpage where he can complete his registration. After registration, business continues as usual.

The following snippet shows the business fault as returned by the CRM system's Web Service. The fault indicates that a particular customer is not found in the backend system.

 


<soap:Envelope>
  <soap:Header/>
  <soap:Body>
    <soap:Fault>
      <faultcode>CST-1234</faultcode>
      <faultstring>Customer not found</faultstring>
      <detail>
        <CustomerNotFoundFault>
          <CustomerName>John Doe</CustomerName>
        </CustomerNotFoundFault>
      </detail>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>

This particular fault, CustomerNotFoundFault, is specified in the Web Service interface so the service's clients can expect it. The WSDL for the message above is partially shown in the following snippet.



<wsdl:definitions name="CustomerService" targetNamespace="..." ...>
  ...
  <wsdl:message name="FindCustomerRequestMsg">
    <wsdl:part name="FindCustomerRequest"
               element="tns:FindCustomerRequest"/>
  </wsdl:message>
  <wsdl:message name="FindCustomerResponseMsg">
    <wsdl:part name="FindCustomerResponse"
               element="tns:FindCustomerResponse"/>
  </wsdl:message>
  <wsdl:message name="CustomerNotFoundFaultMsg">
    <wsdl:part name="Error"
               element="tns:CustomerNotFoundFault"/>
  </wsdl:message>
  <wsdl:portType name="CustomerServicePortType">
    <wsdl:operation name="FindCustomer">
      <wsdl:input name="FindCustomerRequest"
                  message="tns:FindCustomerRequestMsg"/>
      <wsdl:output name="FindCustomerResponse"
                   message="tns:FindCustomerResponseMsg"/>
      <wsdl:fault name="CustomerNotFoundFault"
                  message="tns:CustomerNotFoundFaultMsg"/>
    </wsdl:operation>
  </wsdl:portType>
  ...
</wsdl:definitions>

A fault is just another message being returned by the service operation, and should be used whenever a business fault is signaled. This is much better than adding error codes or flags to the "normal" response message, as it forces the consumer to detect and deal with the fault situation.
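
The same idea can be illustrated outside of WSDL. The following is a minimal Python sketch, not part of the original scenario, in which the business fault is raised as a typed exception instead of being hidden in a flag on the normal response. The names `CustomerNotFoundFault`, `find_customer`, `place_order`, and the in-memory customer table are hypothetical and only mirror the example above.

```python
class CustomerNotFoundFault(Exception):
    """Business fault: the customer is unknown to the (hypothetical) CRM system."""

    def __init__(self, customer_name):
        super().__init__(f"Customer not found: {customer_name}")
        self.customer_name = customer_name


# Hypothetical in-memory stand-in for the CRM backend.
KNOWN_CUSTOMERS = {"Jane Roe": {"name": "Jane Roe", "preferred_shipping": "express"}}


def find_customer(name):
    """Return the customer record or raise the declared business fault."""
    try:
        return KNOWN_CUSTOMERS[name]
    except KeyError:
        # Signal the fault explicitly instead of returning a flag such as
        # {"found": False}, which a caller could silently ignore.
        raise CustomerNotFoundFault(name) from None


def place_order(customer_name):
    """The consumer has to decide what to do with the fault."""
    try:
        customer = find_customer(customer_name)
    except CustomerNotFoundFault:
        # Recover by sending the new customer to the registration page.
        return "redirect:/register"
    return f"order accepted for {customer['name']}"
```

A consumer that forgets to handle `CustomerNotFoundFault` fails loudly, whereas a forgotten check on an error flag would let the order continue with missing customer data.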

Technical Faults

Let's revisit the same scenario, only this time the CRM system is unreachable due to unplanned downtime. You cannot easily recover from this error, and need to wait for IT operations or the hosting provider to restore the system to a normal operational condition. In that case one option for handling the fault, and it's not a good one, is to ask the customer to retry his order in a few hours.

The following snippet shows a technical fault returned by the service infrastructure when a service on an external system is not reachable. The fault indicates that a technical error occurred on the server.



<soap:Envelope>
  <soap:Header/>
  <soap:Body>
    <soap:Fault>
      <faultcode>S:Server</faultcode>
      <faultstring>Could not connect to 127.0.0.1:443</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>

The fault is returned as a generic SOAP fault and is not explicitly specified in the Web Service interface.
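
A service client can use this difference to decide how to react. The sketch below, written in Python with the standard `xml.etree.ElementTree` module, treats a fault that carries a typed `detail` child as a business fault and any other fault as a technical one. This rule is a simplifying assumption made for illustration, since technical faults may also carry detail elements in practice. The function name `classify_fault` is hypothetical, and the SOAP 1.1 namespace declaration has been added to the sample messages so that they parse.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

BUSINESS_FAULT = f"""<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>
    <soap:Fault>
      <faultcode>CST-1234</faultcode>
      <faultstring>Customer not found</faultstring>
      <detail>
        <CustomerNotFoundFault>
          <CustomerName>John Doe</CustomerName>
        </CustomerNotFoundFault>
      </detail>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""

TECHNICAL_FAULT = f"""<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>
    <soap:Fault>
      <faultcode>S:Server</faultcode>
      <faultstring>Could not connect to 127.0.0.1:443</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""


def classify_fault(envelope_xml):
    """Return (kind, fault_name) for a SOAP 1.1 fault, or None if there is no fault."""
    root = ET.fromstring(envelope_xml)
    fault = root.find(f"{{{SOAP_NS}}}Body/{{{SOAP_NS}}}Fault")
    if fault is None:
        return None
    # In SOAP 1.1 the detail element is an unqualified child of Fault.
    detail = fault.find("detail")
    if detail is not None and len(detail) > 0:
        # Assumption: a typed detail entry marks a fault declared in the contract.
        return ("business", detail[0].tag)
    return ("technical", None)
```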

Be careful not to expose implementation details when returning technical faults to service clients; do not just return the technical fault of the underlying system. Such faults may include sensitive information, including connection details, driver versions, used credentials, and operating system details which can be exploited by hackers. This best practice is known as exception shielding.
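
As a rough illustration of exception shielding, the following Python sketch logs the full technical error on the provider side and returns only a generic fault plus a correlation identifier to the client. The function names `shield`, `call_backend`, and `find_customer_shielded`, the host name, and the shape of the returned fault are all assumptions made for this example, not features of a particular product.

```python
import logging
import uuid

logger = logging.getLogger("fault-shield")


def shield(exc):
    """Log the full technical error internally and return a sanitized fault."""
    correlation_id = str(uuid.uuid4())
    # Full details (hosts, users, driver messages) stay in the provider's log.
    logger.error("fault %s: %r", correlation_id, exc)
    return {
        "faultcode": "Server",
        "faultstring": "A technical error occurred. Please try again later.",
        # Lets support staff find the matching log entry without leaking details.
        "correlation_id": correlation_id,
    }


def call_backend():
    # Hypothetical backend failure whose message leaks connection details.
    raise ConnectionError("Could not connect to db01.internal:1521 as user crm_app")


def find_customer_shielded():
    """Service operation that never passes raw backend errors to the client."""
    try:
        return call_backend()
    except Exception as exc:
        return shield(exc)
```

The correlation identifier is a design choice rather than a requirement: it keeps the fault useful for troubleshooting while the sensitive details remain on the server.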

Fault Handling in SOA versus Traditional Systems

The goal of every developer should be to create unbreakable systems. The extent to which that goal can be achieved depends on the success of efforts to handle and manage expected and unexpected exception conditions. Object-oriented languages such as C++ and Java provide an efficient way of handling exceptions using constructs such as try, catch, and finally. In a SOA, most of what is available at the language level is still valid and usable for creating elementary services.

However, SOA raises different challenges when orchestrating services and creating composite applications. Figure 1 illustrates a typical SOA and BPM environment. This environment includes:

  • Providers of services. Services that an organization uses are provided by internal systems and applications such as packaged applications, commercial off-the-shelf (COTS) software, custom-built applications, client/server applications, and other software components. External organizations can also act as service providers; consider a trading partner that offers invoicing services.
  • Exposed services. The functionality of service providers is exposed as small building blocks that provide well-described functionality that is easily accessible, i.e. services. Services can be exposed through an intermediary such as an Enterprise Service Bus, although this is not necessarily the case.
  • Consumers of services. The services that you expose are used by a variety of consumers. The services can be offered to your trading partners such as suppliers and customers (external). Services can be orchestrated together with manual activities into business processes that your organization implements. This can be achieved using BPM or Case Management platforms. Finally, user interfaces such as Portals and Mobile Devices consume services to enable end users to do their jobs. User interfaces typically also interact with BPM and Case Management platforms so the manual tasks that need to be executed are visualized together with the information that end users need to complete these tasks.

 

Figure 1: Typical SOA and BPM environment

The following aspects of such an environment have an impact on the way you should implement fault prevention and handling:

  • Services can be asynchronous or fire-and-forget, and processes might be long-running. This means that work is not executed in a single transaction that can be rolled back. You need other mechanisms to undo changes.
  • Services and processes contain timed events that might or might not occur. You need to ensure that such events occur within a certain time period and that messages don't get lost, so that process instances are not left waiting forever.
  • Services are autonomous building blocks with their own functionality and should not be dependent on other services. Whenever possible, faults should be handled in the service in which they occurred.
  • Services are used in larger units of work, such as composite applications and processes. Fault recovery should also be addressed in a similar scope, beyond that of an individual service.
  • There are multiple service consumers, often also outside your own organization. You cannot implement fault handling logic that is specific to only one service consumer. The logic should be applicable for all (future) consumers of the service.
  • Typically, a SOA environment consists of heterogeneous and external components, which lends even greater importance to standards for fault handling.
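
The first point in this list, undoing work that cannot be rolled back in a single transaction, is commonly addressed with compensation. The following Python sketch shows the idea under simplifying assumptions: each step is paired with a compensating action, and when a step fails the compensations of the already completed steps run in reverse order. The helper `run_with_compensation` and the order-related step names are hypothetical, and a real implementation would also have to deal with compensations that fail themselves.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; on failure undo completed steps in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            # Only steps that finished successfully need to be undone later.
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()
        raise


# Hypothetical steps that record what happened instead of calling real services.
log = []


def reserve_stock():
    log.append("reserve stock")


def release_stock():
    log.append("release stock")


def bill_card():
    log.append("bill card")


def refund_card():
    log.append("refund card")


def send_order():
    raise RuntimeError("Order Processing system unavailable")


def cancel_order():
    log.append("cancel order")
```

Running the three steps in the order reserve, bill, send leaves the log with the two completed actions followed by their compensations in reverse order; `cancel_order` is never called because `send_order` did not complete.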

In the rest of this article—and in subsequent articles in this series—you will learn what patterns and out-of-the-box features can be used to implement effective fault prevention and handling in a SOA-environment.

Scenario

In parts two and three of this article series we will delve into the fault prevention and fault handling capabilities of the most important SOA building blocks of Oracle SOA Suite: the SCA infrastructure (with its service components such as BPEL and Mediator) and Oracle Service Bus. We will present these capabilities using a scenario that is complex enough to show some real-life error situations.

Figure 2 shows this scenario, an order process that is implemented on Oracle SOA Suite 11g. We are using the Trivadis integration blueprint notation, as presented in Service Oriented Architecture: An Integration Blueprint [See Sources].

The left side shows the process steps from the moment where the order request is received until the order is processed. The right side shows all the external systems that the process application interacts with to complete orders. Such systems include the application in which clients can order products, two different credit card service providers to bill clients, a product database, the order processing application, and a history service to store completed instances of the process. The middle lanes show the integration of the process with the backend systems using services that are invoked from the process and exposed by Oracle Service Bus 11g.

The Order Processing system is a legacy application, which can only be integrated with our system by using queues.

 

Figure 2: Order Process implemented on Oracle SOA Suite 11g

The scenario includes the following steps:

  1. The application invokes the Order process through a synchronous SOAP call.
  2. The BPEL process calls the ProductService on the OSB to get the product details.
  3. The OSB service uses a Database adapter to access the data in the product database.
  4. The BPEL process invokes the CreditCardService on the OSB to bill the customer's credit card.
  5. The OSB service decides which CreditCardService provider the request should be sent to, based on the credit card type.
  6. A successful or fault response is returned to the application.
  7. If the process was successful so far, it continues asynchronously by invoking another service on the OSB.
  8. The OSB service sends the order to the Order Processing system by placing it in the request queue.
  9. The Order Processing system constantly dequeues the orders and processes them.
  10. Depending on the result of the processing, a success or fault message is placed in the response queue.
  11. A service on the OSB dequeues the messages from the response queue and invokes the BPEL Order process.
  12. The BPEL Order process, which is waiting for the callback, continues.
  13. The BPEL Order process invokes the HistoryService on the OSB to archive relevant data of the order process.
  14. The OSB service invokes the Web Service provided by the Order History system.
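
Steps 7 to 12 form an asynchronous request and callback exchange, in which the process waits for a message that may never arrive. The Python sketch below illustrates the general idea of bounding that wait with a timeout. It uses the standard `queue` module as a stand-in for the response queue; the names `CallbackTimeout` and `wait_for_callback` are hypothetical, and the sketch does not represent how BPEL or Oracle Service Bus implement this.

```python
import queue


class CallbackTimeout(Exception):
    """Raised when no response arrives within the allowed time."""


def wait_for_callback(response_queue, timeout_seconds):
    """Return the next response message, or raise CallbackTimeout."""
    try:
        # Block until a message is available or the timeout expires.
        return response_queue.get(timeout=timeout_seconds)
    except queue.Empty:
        raise CallbackTimeout(
            f"no response within {timeout_seconds} s"
        ) from None
```

With an explicit timeout the waiting process instance can escalate or compensate instead of waiting forever.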

Now let's take a look at some of the fault situations we can get in our scenario. Thanks to Murphy's Law there are quite a lot of issues that can happen, some expected and some unexpected. The following image shows the scenario, but this time overlaid with some potential problems.

 

Figure 3: Possible Fault Situations

The following situations can occur, which we either have to avoid or handle:

  1. The product database is an older system with known scalability limits. If we send too many requests to it, it might no longer perform well or might even crash completely.
  2. The network between the external service providers and us is not stable, so we have to deal with very short network interruptions (usually below a second).
  3. A business fault is raised if the credit card provided is not valid. Each of the two credit card providers defines its own fault in the service contract.
  4. The second credit card provider does not guarantee 7x24 availability with one single service instance. It therefore provides a second service instance, running in parallel to the first instance.
  5. Responses from the Order Processing system sometimes get lost. Remember, it's a legacy system and nobody wants to touch it and investigate the problem.
  6. The Order Processing system returns a fault if the product is no longer available.
  7. The Order History system is not always available.
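
Situations 2 and 4 can be approached in a generic way by retrying after short interruptions and by failing over to a second instance. The following Python sketch combines both under stated assumptions: endpoints are plain callables, only `ConnectionError` counts as a transient technical fault, and every other exception (for example a business fault such as an invalid credit card) is passed straight through. The function name and its parameters are hypothetical, and the sketch is not how Oracle Service Bus implements these features.

```python
import time


def call_with_retry_and_failover(endpoints, request,
                                 attempts_per_endpoint=3, delay_seconds=0.5):
    """Try each endpoint in order, retrying transient errors before failing over."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return endpoint(request)
            except ConnectionError as exc:
                # Transient technical fault: remember it and try again.
                last_error = exc
                if attempt < attempts_per_endpoint - 1:
                    time.sleep(delay_seconds)
    # All instances exhausted: pass the last technical fault to the caller.
    raise last_error
```

Retrying is only safe when the operation is idempotent; for a billing call the provider would have to detect duplicate requests.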

The next two articles in the series will discuss how these faults can be prevented and/or handled using the capabilities of Oracle Service Bus and Oracle SOA Suite.

Fault Prevention and Recovery Strategies

Various patterns can be applied to prevent faults from happening or to handle them when they cannot be prevented. The following table lists patterns that improve the fault prevention and handling capabilities of your software. These patterns should be seen from the perspective of the service provider. In other words, in order for services to provide added value and good quality of service, the service provider is responsible for implementing the fault handling patterns, thereby relieving the service consumers of this task.

You can read more about these patterns in Patterns for Fault Tolerant Software [See Sources].

| Action | Prevention or Handling | Description |
|---|---|---|
| Inaction | Handling | Simply ignore the request |
| Balk | Handling | Admit failure |
| Guarded suspension | Handling | Suspend execution until conditions for correct execution are established |
| Provisional action | Prevention | Pretend to perform the request, but do not commit until success is guaranteed |
| Alternative action | Prevention | Perform an acceptable alternative, e.g. automatic failover by the service provider |
| Rollback | Handling | Try to proceed, but on failure, undo the effects of the failed action |
| Retry | Prevention | Repeatedly attempt a failed action after recovering from failed attempts. If a retry succeeds, the service consumer is not aware of any failure. If the retries fail, the fault is passed back to the consumer |
| Appeal to higher authority | Handling | Ask someone to apply judgment and steer the software to an acceptable resolution |
| Resign | Handling | Minimize damage, write log information, then signal definite and safe failure |
| Compensation | Handling | Undo activities by executing the opposite actions in reverse order |
| Exception shielding | Handling | Hide implementation details in returned fault messages for security reasons |
| Share the load | Prevention | Have multiple instances of a system so the unavailability of one instance can be handled by other instances |
| Heartbeat | Prevention | Periodically check the availability of the system to detect failures at an early stage |
| Throttling | Prevention | Manage the number of messages that are sent to systems by using a queuing mechanism |
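
To make one of these patterns concrete, the following Python sketch shows a very simple form of throttling: messages are buffered in a queue and at most a fixed number are forwarded per interval, so a backend with limited scalability is never flooded. The class name `Throttle`, its methods, and the idea of calling `drain()` once per interval from a timer are assumptions for illustration only.

```python
from collections import deque


class Throttle:
    """Buffer messages and release at most max_per_interval per drain() call."""

    def __init__(self, send, max_per_interval):
        if max_per_interval < 1:
            raise ValueError("max_per_interval must be at least 1")
        self._send = send
        self._max = max_per_interval
        self._pending = deque()

    def submit(self, message):
        """Accept a message immediately; delivery happens later."""
        self._pending.append(message)

    def drain(self):
        """Call once per interval (e.g. from a timer); returns the number sent."""
        sent = 0
        while self._pending and sent < self._max:
            self._send(self._pending.popleft())
            sent += 1
        return sent

    def backlog(self):
        """Number of messages still waiting to be delivered."""
        return len(self._pending)
```

For example, with `max_per_interval=2`, five submitted messages are delivered over three intervals in their original order.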

In Part 2 of this series you will learn how to use out-of-the-box features of Oracle Service Bus 11g to prevent faults and to expose reliable and robust services to your service consumers.

Sources

  1. Service Oriented Architecture: An Integration Blueprint
     Packt Publishing
     Guido Schmutz, Peter Welkenbach, Daniel Liebhart
     ISBN-10: 184968104X | ISBN-13: 9781849681049

  2. Patterns for Fault Tolerant Software
     Wiley Software Patterns Series
     Robert Hanmer
     ISBN-10: 0470319798 | ISBN-13: 978-0470319796

About the Authors

Guido Schmutz is Technology Manager for SOA and Emerging Trends at Trivadis and an Oracle ACE Director. 

Ronald van Luttikhuizen is Managing Partner and Architect at Vennster and an Oracle ACE Director.