Before You Begin
In this 15-minute tutorial, you learn how to import a Zeppelin note from a GitHub repository into
Oracle Big Data Manager Notebook. You also learn how to load a comma-separated-value
(.csv)
file from the Hadoop Distributed File System (HDFS) and an Oracle database table
as new paragraphs into the imported note. Finally, you run the note.
Background
This is the third tutorial in the Integrating GitHub and Oracle Database with Oracle Big Data Manager series. Read them sequentially.
- Registering a GitHub Repository as a Storage Provider with Oracle Big Data Manager
- Registering an Oracle Database as a Storage Provider with Oracle Big Data Manager
- Analyzing Data from Multiple Sources with Oracle Big Data Manager
What Do You Need?
- Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
- Access to Oracle Big Data Manager on an Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- The required sign in credentials for Oracle Big Data Manager.
- Read/Write privileges to the
/user/demo
HDFS directory. - Complete the first two tutorials in this series: The Registering a GitHub Repository as a Storage Provider with Oracle Big Data Manager and the Registering an Oracle Database as a Storage Provider with Oracle Big Data Manager tutorials.
- Access to an Oracle database and the required login credentials.
- Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.
Access the Oracle Big Data Manager Console
- Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
- In the row for the cluster, click Manage this service , and then click Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
Import a Zeppelin Note from GitHub into Oracle Big Data Manager Notebook
In this section, you import a Zeppelin note from a GitHub repository into Oracle Big Data Manager Notebook.
- On the Oracle Big Data Manager page, click the Data tab.
- In the Data explorer section, select Github (github)
from the Storage drop-down list. The
bdm-notebook-demo
GitHub repository that you registered in the first tutorial is displayed in the Name column. Double-click the repository name and navigate to themaster
branch. This branch contains theNotebook.json
Zeppelin note that you will import into Oracle Big Data Manager Notebook. - Right-click the
Notebook.json
file. From the context menu, select Notebook, and then Import as Notebook Note. The Import as Notebook Note dialog box is displayed. - Enter
HDFS and Oracle Database Data Analysis
in the Note name field, and then click Create. TheHDFS and Oracle Database Data Analysis
note is imported and displayed in Notebook. The initial status of each paragraph in the note isREADY
which indicates that the paragraph has not been executed yet. - The empty paragraph at the top of the note uses the
%md
Markdown interpreter to generate static html from Markdown plain text. Click Run this paragraph(Shift+Enter) on the paragraph's toolbar to display detailed information about this note.
Import Oracle JDBC Driver to Oracle Big Data Manager Notebook
In this section, you add the Oracle JDBC driver to the Spark interpreter classpath in Oracle Big Data Manager. This enables Spark to connect to the Oracle database.
- On the Oracle Big Data Manager page, click the Notebook tab.
- Click the Menu drop-down list, and then select Interpreter. The Interpreters page is displayed.
- Scroll-down to the spark interpreter section, and then click edit.
- Scroll-down to the Dependencies section, enter the following path to the Oracle JDBC driver in the artifact field to allow the Spark interpreter to create a DataFrame that connects to the Oracle database, and then click Save.
/usr/lib/oracle/12.1/client64/lib/ojdbc7.jar
A Do you want to update this interpreter and restart with new settings? confirmation message is displayed. Click OK.
Load an HDFS File as a Paragraph into the Imported Note
In this section, you upload the edges.csv
file to HDFS and load the file as a new
paragraph into the imported note. The edges.csv
file contains two columns. Each column represents a student id. The
two student ids in each row represent two students who are collaborating on a specific class project.
- Right-click the edges.csv file, select Save link as... from the context menu, and then save it to your local machine.
- Upload the
edges.csv
file to HDFS. On the Oracle Big Data Manager page, click the Data tab. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the/user/demo
directory, and then click File upload on the toolbar. - In the Files upload dialog box, click Choose files
to upload. In the Open dialog box, navigate to your local directory
that contains the
edges.csv
file, and then select the file. Theedges.csv
file is displayed in the Name column. Click Upload. - Drill-down on the Details section in the Files upload
dialog box to display the progress of the file upload operation. When the file is uploaded
successfully to HDFS, the Upload has finished message is displayed. Click
Close to close the dialog box. The
edges.csv
file is now displayed in the/user/demo
HDFS directory. - Right-click the
edges.csv
file. From the context menu, select Notebook, and then Load as general file. The Load file in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview. - In the Paragraph settings wizard page, enter
Load Data from HDFS
in the Paragraph name field. Selectspark
from the Interpreter drop-down list, and then click Next. - In the Note settings wizard page, select Add to existing note
in the Note field. Select the
HDFS and Oracle Database Data Analysis
note from the Select note drop-down list. Select Top of note from the New paragraph(s) position drop-down list, and then click Next. - In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow . If you are satisfied with the settings, click Finish to display the updated note. The new Load Data from HDFS paragraph is displayed at the top of the updated note.
- The new Load Data from HDFS paragraph uses the
%spark
Spark interpreter to create a DataFrame that references theedges.csv
file that you uploaded earlier to the/user/demo
HDFS directory. This DataFrame is stored in thefileDataFrame
variable. Rename this variable tographText.
The Process the Data paragraph in this note references thegraphText
variable.val graphText = sc.textFile("hdfs:/user/demo/edges.csv")
Load an Oracle Database Table into the Imported Note
In this section, you load the USERS
Oracle database table that you created in the
second tutorial as a new paragraph into the imported note. The USERS
table contains
information about each student's name, id, and class category.
- On the Oracle Big Data Manager page, click the Data tab. In the Data
explorer section, select
Oracle database (oracle-database-demo)
from the Storage drop-down list. Navigate to your schema that contains theUSERS
table that you created earlier. In this example, we created theUSERS
table in the BDM_DEMO schema. The table is displayed in the Name column. - Right-click the
USERS
table. From the context menu, select Notebook, and then Load table. The Load table in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview. - In the Paragraph settings wizard page, enter
Load Data from Oracle Database Table
in the Paragraph name field. Selectspark
from the Interpreter drop-down list, and then click Next. - In the Note settings wizard page, select Add to existing note
in the Note field. Select the
HDFS and Oracle Database Data Analysis
note from the Select note drop-down list. Select Top of note from the New paragraph(s) position drop-down list, and then click Next. - In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow . If you are satisfied with the settings, click Finish to display updated note. Two new paragraphs are added at the top of the note.
- Click the title of the top new paragraph. The title becomes active in read/write mode. Rename the title to
Load Data from Oracle Database Table.
Rename the title of the second new paragraph toCount Number of Students.
When you load a relational database table into a note, Oracle Big Data Manager creates two paragraphs. The first paragraph creates the
jdbcDF
DataFrame which references theUSERS
table in your Oracle database. It also registers thejdbcDF
DataFrame as a temporary table namedUSERS.
You can run SQL queries on this temporary table. The second paragraph uses the%sql
interpreter which enables you to execute a Spark SQL query. This simple query counts the number of rows in theUSERS
table. - Make sure that the url field references your host name and port # where your Oracle database is running. Enter your Oracle database username in the empty user field. Enter your Oracle database password in the empty password field.
Run the Note
In this section, you run the updated note and review the output.
- Click Run all paragraphs on the Note's toolbar to run all paragraphs in this note. A Run all paragraphs
confirmation message is displayed. Click OK. When a paragraph executes
successfully, its status changes to
FINSIHED.
- The output of the Count Number of Students paragraph is displayed in the default table format in the result section at the bottom of the paragraph.
- The Process the Data paragraph performs most of the analysis on the dataset.
It joins the HDFS data with the Oracle database table data. The code accesses the two
id
columns from theedges.csv
file, and then uses theUSERS
table as a look-up table to find the associated names and class categories for those student ids. The code then registers the joined data as a new temporary table namedFRIENDS
. This new table contains six columns which represent the two students collaborating on a project from theedges.csv
file. Finally, the code counts the number total projects, same-class projects, and cross-class projects. These values are used and displayed in the Project Collaboration Statistics paragraph. - The output of the Projects Collaboration Details paragraph is displayed in the default table format in the result section at the bottom of the paragraph. The following image shows the partial output.
- The output of the Projects Collaboration Statistics paragraph is displayed in the result section at the bottom of the paragraph.
Note: The results of the SQL query are returned in an arbitrary
order unless you use the order by
clause followed by the column(s) name; therefore, your results might not match the result shown in the image.