Analyzing Data with Oracle Big Data Manager Notebook

section 0 Before You Begin

In this 10-minute tutorial, you learn how to view, manipulate, and analyze HDFS data in Oracle Big Data Manager Notebook.

Background

This is the second tutorial in the Working with Oracle Big Data Manager series. Read the tutorials in this series sequentially.

What Do You Need?

  • Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
  • Access to Oracle Big Data Manager on a non-secure Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
  • The required sign-in credentials for Oracle Big Data Manager.
  • Read/Write privileges to the /user/demo HDFS directory.
  • Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.

section 1 Access the Oracle Big Data Manager Console

  1. Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
    Description of the illustration bdcs-console.png
  2. In the row for the cluster, click Manage this service Manage icon, and then select Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
    Description of the illustration select-bdm-console.png

section 5 Analyze the Loaded Data in Oracle Big Data Manager Notebook

In this section, you add the third-party spark-csv library to Oracle Big Data Manager so that Spark can parse .csv files. You also import a note into Oracle Big Data Manager Notebook. This note contains several paragraphs that reference the .csv data files that you copied into the /user/demo HDFS directory. Finally, you run the imported note.
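
For orientation, the sketch below shows roughly what the spark-csv package lets a Spark paragraph do. This is a minimal sketch under assumptions, not the imported note's actual code: the file name, header option, and inferred column types are placeholders.

    %spark
    // Minimal sketch (Spark 1.x / Scala 2.10 API): parse a hypothetical .csv file
    // using the com.databricks:spark-csv package that you add in the steps below.
    val sample = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // assumption: the file has a header row
      .option("inferSchema", "true")   // infer column types from the data
      .load("hdfs:///user/demo/sample.csv")  // hypothetical file name

    sample.printSchema()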

  1. On the Oracle Big Data Manager page, click the Notebook tab.
    Description of the illustration notebook-tab.png
  2. Add the Databricks spark-csv library to the Oracle Big Data Manager Notebook to enable Spark to read .csv files. Click the Menu drop-down list, and then select Interpreter. The Interpreters page is displayed.
    Description of the illustration notebook-menu.png
  3. Scroll down to the spark interpreter section, and then click edit.
    Description of the illustration spark-interpreter.png
  4. Scroll down to the Dependencies section, enter the following Maven artifact in the artifact field, and then click Save.

    com.databricks:spark-csv_2.10:1.5.0

    Description of the illustration dependencies.png

    A Do you want to update this interpreter and restart with new settings? confirmation message is displayed. Click OK.

  5. Right-click the copy_data_from_http_to_hdfs.json file, select Save link as from the context menu, and then save it to your local machine.
  6. On the Notebook tab banner, click Home Home icon. In the Notebook section, click Import note.
    Description of the illustration import-note.png

    The Import new note dialog box is displayed.

  7. In the Import AS field, enter Copy Data from http to HDFS. By default, the name of the imported note is the same as the original note, but you can override it by providing a new name in this field. Click the Choose a JSON here icon. In the Open dialog box, navigate to the local directory that contains the copy_data_from_http_to_hdfs.json file, and then select the file.
    Description of the illustration import-new-note.png

    The Copy Data from http to HDFS note is imported and displayed in the list of available notes in the Notebook.

  8. Click the Copy Data from http to HDFS note to view it. The initial status of each paragraph in the note is READY, which indicates that the paragraph has not been executed yet.
    Description of the illustration display-note.png
  9. The first paragraph uses the %md Markdown interpreter to generate static HTML from Markdown plain text. The second paragraph imports some Spark libraries.
  10. The Load and Select HDFS Data paragraph uses the %spark Spark interpreter to create two dataframes. The first dataframe references all of the .csv files in the /user/demo HDFS directory (using the * wildcard character) and is stored in the df1 variable. The second dataframe selects some of the columns from the first dataframe and is stored in the df2 variable. A sketch of what these Spark and SQL paragraphs might look like appears after this list.
    Description of the illustration create-dataframes.png

    Note: You can reference the df1 and df2 variables anywhere in this Note.

  11. The Register Dataframes as Temporary Tables paragraph registers the df1 and df2 dataframes as the temporary tables taxi and taxi_summary, respectively. You can run SQL queries on these temporary tables.
    Description of the illustration register-dataframes.png
  12. The View All Taxi Data paragraph uses the %sql interpreter, which enables you to execute a Spark SQL query. The query in this paragraph displays the data in all rows and columns of the taxi table in a tabular format.
    Description of the illustration data-table.png
  13. The Group Trips by Duration paragraph groups the individual trips by trip duration, and then counts the number of trips in each group. The taxi data is displayed in a Bar Chart format.
    Description of the illustration data-chart.png
  14. The View Dataset Summary paragraph also uses the %sql interpreter to execute a Spark SQL query. The query displays all rows in the taxi_summary table in a tabular format.
    Description of the illustration view-dataset-summary.png
  15. Click Run all paragraphs Run icon on the note's toolbar to run all paragraphs in this note.
    Description of the illustration run-paragraphs.png

    A Run all paragraphs confirmation message is displayed. Click OK. When a paragraph executes successfully, its status changes to FINISHED.
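
For reference, the core paragraphs of the imported note follow the general shape sketched below. This is a hedged sketch only: the real note defines the exact read options and column names (trip_duration and the other column names here are placeholders), so treat it as an outline of the technique rather than the note's literal code.

    %spark
    // Read every .csv file in /user/demo into a dataframe (Spark 1.x with spark-csv).
    val df1 = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // assumption: the files include a header row
      .option("inferSchema", "true")   // infer column types from the data
      .load("hdfs:///user/demo/*.csv")

    // Keep a subset of columns; these column names are placeholders.
    val df2 = df1.select("trip_duration", "pickup_datetime", "fare_amount")

    %spark
    // Register the dataframes as temporary tables so %sql paragraphs can query them.
    df1.registerTempTable("taxi")
    df2.registerTempTable("taxi_summary")

    %sql
    -- View All Taxi Data: every row and column of the taxi table.
    SELECT * FROM taxi

    %sql
    -- Group Trips by Duration: count trips per duration (placeholder column name).
    SELECT trip_duration, COUNT(*) AS trip_count FROM taxi GROUP BY trip_duration

    %sql
    -- View Dataset Summary: every row of the taxi_summary table.
    SELECT * FROM taxi_summary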


Next Tutorial

Creating a Personal Dashboard in Oracle Big Data Manager