by Jean-Pierre Dijcks
Published January 2012
Understanding a big data infrastructure by looking at a typical use case.
I often get asked about "big data," and more often than not we seem to be talking at different levels of abstraction and understanding. Words such as real time and advanced analytics show up, and we are instantly talking about products, which is typically not a good idea.
So let's try to step back and look at what big data means from a use-case perspective, and then we can map the use case into a usable, high-level infrastructure picture. As we walk through all this, you will, I hope, start to see a pattern and understand how words such as real time and analytics fit in.
Rather than inventing something from scratch, I've looked at the keynote use case describing Smartmall.
Figure 1. Smartmall
The idea behind Smartmall is often referred to as multichannel customer interaction, meaning "How can I interact with customers who are in my brick-and-mortar store via their smartphones?" Rather than requiring customers to whip out their smartphones to browse prices on the internet, we would like to drive their behavior proactively.
The goals of Smartmall are straightforward:
In terms of technologies you would be looking at the following:
In terms of data sets, you would want to have at least the following:
A picture is worth a thousand words, so Figure 2 shows both the real-time decision-making infrastructure and the batch data processing and model generation (analytics) infrastructure.
Figure 2. Example Infrastructure
The first step, and arguably the most important piece of data, is the identification of a customer. Step 1, in this case, is the fact that a user with a smartphone walks into a mall. Identifying that user triggers the lookups in step 2a and step 2b in a user-profile database.
We will discuss this a little more later, but in general this is a database that leverages an indexed structure to do fast and efficient lookups. Once we find the actual customer, we feed the profile of this customer into our real-time expert system (step 3).
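To make the step 2 lookup concrete, here is a minimal sketch using the Oracle NoSQL Database key/value API; the store name, helper host, and key layout are assumptions for illustration only.

```java
import java.util.Arrays;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.ValueVersion;

public class ProfileLookup {
    public static void main(String[] args) {
        // Connect to the store; store name and helper host are assumptions.
        KVStoreConfig config = new KVStoreConfig("smartmall", "kvhost01:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // Hypothetical key layout: major path = /profile/<customerId>.
        String customerId = "cust-0042";
        Key key = Key.createKey(Arrays.asList("profile", customerId));

        // A single indexed lookup: this is the fast path described above.
        ValueVersion vv = store.get(key);
        if (vv != null) {
            byte[] profileBytes = vv.getValue().getValue();
            System.out.println("Found profile, " + profileBytes.length + " bytes");
        }
        store.close();
    }
}
```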
The models in the expert system (custom-built or COTS software) evaluate the offers and the profile and determine what action to take (for example, send a coupon). All this happens in real time, keeping in mind that Websites do this in milliseconds and our Smartmall would probably be OK doing it in a second or so.
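The article doesn't prescribe a particular engine, so here is a hypothetical, hand-rolled version of one such rule to show the shape of step 3; the profile fields, offer structure, and the rule itself are invented.

```java
// A hypothetical, hand-rolled decision rule; a real expert engine
// (custom-built or COTS) would evaluate many such rules, externalized
// from the code, against the looked-up profile.
public class CouponDecision {

    // Minimal stand-ins for the profile fetched in step 2 and the active offers.
    record Profile(String customerId, String favoriteCategory, int visitsThisMonth) {}
    record Offer(String category, String couponCode) {}

    static String decide(Profile p, Iterable<Offer> activeOffers) {
        for (Offer o : activeOffers) {
            // Example rule: match the customer's favorite category, and
            // only target customers who visit at least twice a month.
            if (o.category().equals(p.favoriteCategory()) && p.visitsThisMonth() >= 2) {
                return o.couponCode();   // the action to take (step 3)
            }
        }
        return null; // no action for this customer
    }
}
```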
To build accurate models—and this is where many of the typical big data buzzwords come in—we add a batch-oriented massive-processing farm into the picture. The lower half of Figure 3 shows how we leverage a set of components that includes Apache Hadoop and the Apache Hadoop Distributed File System (HDFS) to create a model of buying behavior. Traditionally, we would leverage a database (or data warehouse [DW]) for this. We still do, but we now leverage an infrastructure before the database/data warehouse to go after more data and to continuously re-evaluate all the data.
Figure 3. Creating a Model of Buying Behavior
A word on the data sources. One key element is point-of-sale (POS) data (in the relational database), which you want to link to customer information (either from your Web store, from cell phones, or from loyalty cards). The NoSQL database with customer profiles in Figure 2 and Figure 3 shows the Web store element. It is very important to make sure this multichannel data is integrated (and deduplicated, but that is a different topic) with your Web browsing, purchasing, searching, and social media data.
Once the data linking and data integration is done, you can figure out the behavior of an individual. In essence, big data allows microsegmentation at the person level—in effect, for every one of your millions of customers!
The final goal of all this is to build a highly accurate model that is placed within the real-time decision engine. The goal of the model is directly linked to the business goals mentioned earlier. In other words: how do you send a coupon, while the customer is still in the mall, that gets that customer into your store to spend money?
Now, how do you implement this with real products and how does your data flow within this ecosystem? The answer is shown in the following sections.
Step 1: Collect Data
To look up data, collect it, and make decisions on it, you need a distributed system. Because the devices essentially never stop sending data, you need to be able to load the data (collect or acquire it) without much delay. That is done at the collection points shown in Figure 4, which are also the place to evaluate the data for real-time decisions. We will come back to the collection points later.
Figure 4. Collection Points
The data from the collection points flows into the Hadoop cluster, which, in our case, is a big data appliance. You would also feed other data into this appliance. The social feeds shown in Figure 4 would come from a data aggregator (typically a company) that sorts out relevant hashtags, for example. You then use a tool such as Flume or Scribe to load the data into Hadoop.
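As a minimal sketch of that load step, the snippet below writes an aggregated feed into HDFS with the standard Hadoop FileSystem API; in production, Flume or Scribe would stream this continuously, and the namenode URI and paths here are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FeedLoader {
    public static void main(String[] args) throws Exception {
        // Namenode URI and target path are placeholders.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path target = new Path("/feeds/social/2012-01-15/part-0000");
        try (FSDataOutputStream out = fs.create(target)) {
            // One JSON record per line, standing in for an aggregator feed.
            out.writeBytes("{\"tag\":\"#sale\",\"user\":\"u123\",\"ts\":1326585600}\n");
        }
        fs.close();
    }
}
```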
Step 2: Collate and Move the Data
The next step is to add data (social feeds, user profiles, and any other data required to make the results relevant to analysis) and to start collating, interpreting, and understanding the data.
Figure 5. Collating and Interpreting the Data
For instance, add user profiles to the social feeds and add the location data to build a comprehensive understanding of an individual user and the patterns associated with this user. Typically, this is done using Apache Hadoop MapReduce. The user profiles are batch-loaded from the Oracle NoSQL Database via a Hadoop InputFormat interface and, thus, added to the MapReduce data sets.
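A reduce-side join is the classic MapReduce way to do this collation. The sketch below joins feed records and exported profile records on a customer ID; the tab-separated record layout and the input/output paths are assumptions, and a real job would read the profiles through the NoSQL database's InputFormat rather than from a file export.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProfileFeedJoin {

    // Assumes tab-separated lines whose first field is the customer ID.
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            if (fields.length == 2) {
                ctx.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // All records for one customer arrive at the same reducer; collate them.
    public static class CollateReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text customerId, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text r : records) {
                joined.append(r).append('|');
            }
            ctx.write(customerId, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "profile-feed-join");
        job.setJarByClass(ProfileFeedJoin.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(CollateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Both inputs share the same layout in this sketch; paths are placeholders.
        FileInputFormat.addInputPath(job, new Path("/feeds/social"));
        FileInputFormat.addInputPath(job, new Path("/profiles/export"));
        FileOutputFormat.setOutputPath(job, new Path("/joined/customer-view"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```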
To combine all this with the POS data, customer relationship management (CRM) data, and all sorts of other transactional data, you would use Oracle Big Data Connectors to efficiently move the reduced data into the Oracle Database. Then you have a comprehensive view of the data that you can go after, either by using Oracle Exalytics or business intelligence (BI) tools or—and this is the interesting piece—via things such as data mining.
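The connectors handle that movement at scale; purely as an illustration of the last hop, here is a stripped-down JDBC stand-in that batches reduced rows into a staging table. The connection string, credentials, and table layout are hypothetical, and the Oracle JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ReducedDataLoader {
    public static void main(String[] args) throws Exception {
        // Connection details and table layout are hypothetical.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/orcl", "mall", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO customer_view (customer_id, collated_data) VALUES (?, ?)")) {
            ps.setString(1, "cust-0042");
            ps.setString(2, "feed|pos|profile");
            ps.addBatch();
            // ...one batch entry per reduced record from the Hadoop output...
            ps.executeBatch();
        }
    }
}
```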
Figure 6. Moving the Reduced Data
Step 3: Analyze the Data
That last phase—here called "analyze"—creates the data mining models and statistical models that are used to produce the right coupons. These models are the real crown jewels, because they allow you to make decisions in real time based on very accurate predictions. The models go into the collection and decision points to act on real-time data, as shown in Figure 7.
Figure 7. Analyzing the Data
In Figure 7, you see the gray model being utilized in the Expert Engine. That model describes and predicts the behavior of an individual customer and, based on those predictions, determines what action to take.
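To give a feel for what utilizing such a model means at the decision point, here is a toy logistic-scoring sketch; the features and weights are invented, and in practice they would be exported from the mined model rather than hard-coded.

```java
// A toy scoring function standing in for the mined model; the weights
// here are invented, and a real system would load them from the
// exported data mining model rather than hard-coding them.
public class ResponseModel {

    // Hypothetical features extracted from the customer profile.
    static double score(int visitsThisMonth, double avgBasketSize, boolean sawOfferBefore) {
        double z = -2.0
                 + 0.35 * visitsThisMonth
                 + 0.01 * avgBasketSize
                 + (sawOfferBefore ? -0.8 : 0.0);
        return 1.0 / (1.0 + Math.exp(-z));   // logistic: P(customer redeems coupon)
    }

    public static void main(String[] args) {
        double p = score(4, 55.0, false);
        // Only act when the predicted redemption probability is high enough.
        if (p > 0.5) {
            System.out.println("send coupon (p=" + p + ")");
        }
    }
}
```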
The description above is an end-to-end look at "big data" and real-time decisions. Big data allows us to leverage tremendous amounts of data and processing resources to arrive at accurate models. It also allows us to determine all sorts of things that we were not expecting, which creates more-accurate models and also new ideas, new business, and so on.
You can implement the entire solution shown here using the Oracle Big Data Appliance on Oracle technology. Then you'll just need to find a few people who understand the programming models to create those crown jewels.