Before You Begin
In this 10-minute tutorial, you learn how to view, manipulate, and analyze HDFS data in Oracle Big Data Manager Notebook.
Background
This is the second tutorial in the Working with Oracle Big Data Manager series. Read them sequentially.
- Copying Data from an HTTP(S) Server with Oracle Big Data Manager
- Analyzing Data with Oracle Big Data Manager Notebook
- Creating a Personal Dashboard in Oracle Big Data Manager
What Do You Need?
- Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
- Access to Oracle Big Data Manager on a non-secure Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- The required sign-in credentials for Oracle Big Data Manager.
- Read/Write privileges to the /user/demo HDFS directory.
- Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.
Access the Oracle Big Data Manager Console
- Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
  Description of the illustration bdcs-console.png
- In the row for the cluster, click Manage this service, and then click Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
  Description of the illustration select-bdm-console.png
Analyze the Loaded Data in Oracle Big Data Manager Notebook
In this section, you add a third-party spark_csv library to Oracle Big Data Manager to parse .csv files in Spark. You also import a note into Oracle Big Data Manager Notebook. This note contains several paragraphs that reference the .csv data files that you copied into the /user/demo HDFS directory. Finally, you run the imported note.
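The spark_csv library is added to the interpreter as a Maven artifact, and this tutorial uses the coordinate com.databricks:spark-csv_2.10:1.5.0. As a rough aid to reading that string, the following plain-Python sketch splits a coordinate into its group ID, artifact ID, and version, and extracts the Scala version suffix (here 2.10) that Scala libraries conventionally append to the artifact ID. The MavenCoordinate type and the parse_coordinate function are hypothetical helpers written for this illustration; they are not part of Oracle Big Data Manager, Zeppelin, or Maven.

```python
from typing import NamedTuple, Optional


class MavenCoordinate(NamedTuple):
    """Hypothetical helper type for illustration only."""
    group_id: str
    artifact_id: str
    version: str
    scala_version: Optional[str]  # e.g. "2.10" when the artifact ID ends in "_2.10"


def parse_coordinate(coordinate: str) -> MavenCoordinate:
    """Split a 'groupId:artifactId:version' string into its parts."""
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError("expected groupId:artifactId:version, got %r" % coordinate)
    group_id, artifact_id, version = parts

    # Scala libraries conventionally end the artifact ID with "_<scala version>".
    _, separator, suffix = artifact_id.rpartition("_")
    scala_version = suffix if separator and suffix[:1].isdigit() else None

    return MavenCoordinate(group_id, artifact_id, version, scala_version)


if __name__ == "__main__":
    # The coordinate used in this tutorial.
    print(parse_coordinate("com.databricks:spark-csv_2.10:1.5.0"))
```

The Scala suffix is worth checking because a library built for one Scala version generally needs to match the Scala version that the Spark interpreter was built with.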
- On the Oracle Big Data Manager page, click the Notebook tab.
- Add Databricks' spark_csv library to the Oracle Big Data Manager Notebook to enable Spark to read .csv files. Click the Menu drop-down list, and then select Interpreter. The Interpreters page is displayed.
- Scroll down to the spark interpreter section, and then click edit.
- Scroll down to the Dependencies section, enter the following Maven artifact in the artifact field, and then click Save.
  com.databricks:spark-csv_2.10:1.5.0
  A "Do you want to update this interpreter and restart with new settings?" confirmation message is displayed. Click OK.
- Right-click the copy_data_from_http_to_hdfs.json file, select Save link as from the context menu, and then save it to your local machine.
- On the Notebook tab banner, click Home. In the Notebook section, click Import note. The Import new note dialog box is displayed.
- In the Import AS field, enter Copy Data from http to HDFS. By default, the name of the imported note is the same as the original note, but you can override it by providing a new name in this field. Click the Choose a JSON here icon. In the Open dialog box, navigate to the local directory that contains the copy_data_from_http_to_hdfs.json file, and then select the file. The Copy Data from http to HDFS note is imported and displayed in the list of available notes in the Notebook.
  Description of the illustration import-new-note.png
- Click the Copy Data from http to HDFS note to view it. The initial status of each paragraph in the note is READY, which indicates that the paragraph has not been executed yet.
- The first paragraph uses the %md Markdown interpreter to generate static HTML from Markdown plain text. The second paragraph imports some Spark libraries.
- The Load and Select HDFS Data paragraph uses the %spark Spark interpreter to create two dataframes. The first dataframe references all of the .csv files in the /user/demo HDFS directory (using the * wildcard character). This dataframe is stored in the df1 variable. The second dataframe selects some of the columns from the first dataframe. This dataframe is stored in the df2 variable.
  Description of the illustration create-dataframes.png
  Note: You can reference the df1 and df2 variables anywhere in this note.
- The Register Dataframes as Temporary Tables paragraph registers the df1 and df2 dataframes as the temporary tables taxi and taxi_summary, respectively. You can run SQL queries on these temporary tables.
  Description of the illustration register-dataframes.png
- The View All Taxi Data paragraph uses the %sql interpreter, which enables you to execute a Spark SQL query. The query in this paragraph displays all rows and columns of the taxi table in a tabular format.
  Description of the illustration data-table.png
- The Group Trips by Duration paragraph groups the individual trips by trip duration, and then counts the number of trips in each group. The taxi data is displayed in a Bar Chart format.
  Description of the illustration data-chart.png
- The View Dataset Summary paragraph uses the %sql interpreter, which enables you to execute a Spark SQL query. The query displays all rows in the taxi_summary table in a tabular format.
  Description of the illustration view-dataset-summary.png
- Click Run all paragraphs on the Note's toolbar to run all paragraphs in this note. A Run all paragraphs confirmation message is displayed. Click OK. When a paragraph executes successfully, its status changes to FINISHED.
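The Group Trips by Duration paragraph groups trips by duration and counts the trips in each group. The note does this with a Spark SQL query against the taxi table, and this page shows that query only as a screenshot. As a rough illustration of the same group-and-count logic outside the notebook, here is a minimal plain-Python sketch. The sample rows, the trip_duration column name, and the count_trips_by_duration function are assumptions made for this example; they are not taken from the tutorial's dataset or from the note itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The real taxi files and their column names
# are not shown in this tutorial, so these are illustrative only.
SAMPLE_CSV = """trip_id,trip_duration
1,5
2,12
3,5
4,7
5,12
6,5
"""


def count_trips_by_duration(csv_text, duration_column="trip_duration"):
    """Return {duration: trip count}, analogous to GROUP BY with COUNT(*)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(int(row[duration_column]) for row in reader)
    # Sort by duration so the result reads like an ordered bar chart axis.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for duration, trips in count_trips_by_duration(SAMPLE_CSV).items():
        print(duration, trips)
```

Each key in the result corresponds to one bar in the chart, and each value to that bar's height. In the notebook, Spark performs this aggregation across all of the .csv files loaded into the taxi table rather than over an in-memory string.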