Analyzing Data with Oracle Big Data Manager Notebook

section 0 Before You Begin

In this 10-minute tutorial, you learn how to view, manipulate, and analyze HDFS data in Oracle Big Data Manager Notebook.

Background

This is the second tutorial in the Working with Oracle Big Data Manager series. Read the tutorials in this series sequentially.

What Do You Need?

  • Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
  • Access to Oracle Big Data Manager on a non-secure Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
  • The required sign-in credentials for Oracle Big Data Manager.
  • Read/Write privileges to the /user/demo HDFS directory.
  • Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.

section 1 Access the Oracle Big Data Manager Console

  1. Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
    Description of the illustration bdcs-console.png
  2. In the row for the cluster, click Manage this service Manage icon, and then select Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
    Description of the illustration select-bdm-console.png

section 5 Analyze the Loaded Data in Oracle Big Data Manager Notebook

In this section, you add the third-party spark-csv library to Oracle Big Data Manager so that Spark can parse .csv files. You also import a note into Oracle Big Data Manager Notebook. This note contains several paragraphs that reference the .csv data files that you copied into the /user/demo HDFS directory. Finally, you run the imported note.
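
For orientation, the sketch below shows roughly what the spark-csv package lets a Spark paragraph do. This is a minimal sketch under assumptions, not the imported note's actual code: the file name, header option, and inferred column types are placeholders.

    %spark
    // Minimal sketch (Spark 1.x / Scala 2.10 API): parse a hypothetical .csv file
    // using the com.databricks:spark-csv package that you add in the steps below.
    val sample = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // assumption: the file has a header row
      .option("inferSchema", "true")   // infer column types from the data
      .load("hdfs:///user/demo/sample.csv")  // hypothetical file name

    sample.printSchema()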

  1. On the Oracle Big Data Manager page, click the Notebook tab.
    Description of the illustration notebook-tab.png
  2. Add the Databricks spark-csv library to the Oracle Big Data Manager Notebook to enable Spark to read .csv files. Click the Menu drop-down list, and then select Interpreter. The Interpreters page is displayed.
    Description of the illustration notebook-menu.png
  3. Scroll down to the spark interpreter section, and then click edit.
    Description of the illustration spark-interpreter.png
  4. Scroll down to the Dependencies section, enter the following Maven artifact in the artifact field, and then click Save.

    com.databricks:spark-csv_2.10:1.5.0

    Description of the illustration dependencies.png

    A Do you want to update this interpreter and restart with new settings? confirmation message is displayed. Click OK.

  5. Right-click the copy_data_from_http_to_hdfs.json file, select Save link as from the context menu, and then save it to your local machine.
  6. On the Notebook tab banner, click Home Home icon. In the Notebook section, click Import note.
    Description of the illustration import-note.png

    The Import new note dialog box is displayed.

  7. In the Import AS field, enter Copy Data from http to HDFS. By default, the name of the imported note is the same as the original note, but you can override it by providing a new name in this field. Click the Choose a JSON here icon. In the Open dialog box, navigate to the local directory that contains the copy_data_from_http_to_hdfs.json file, and then select the file.
    Description of the illustration import-new-note.png

    The Copy Data from http to HDFS note is imported and displayed in the list of available notes in the Notebook.

  8. Click the Copy Data from http to HDFS note to view it. The initial status of each paragraph in the note is READY, which indicates that the paragraph has not been executed yet.
    Description of the illustration display-note.png
  9. The first paragraph uses the %md Markdown interpreter to generate static HTML from Markdown plain text. The second paragraph imports some Spark libraries.
  10. The Load and Select HDFS Data paragraph uses the %spark Spark interpreter to create two dataframes. The first dataframe references all of the .csv files in the /user/demo HDFS directory (using the * wildcard character) and is stored in the df1 variable. The second dataframe selects some of the columns from the first dataframe and is stored in the df2 variable. A sketch of what these Spark and SQL paragraphs might look like appears after this list.
    Description of the illustration create-dataframes.png

    Note: You can reference the df1 and df2 variables anywhere in this Note.

  11. The Register Dataframes as Temporary Tables paragraph registers the df1 and df2 dataframes as the temporary tables taxi and taxi_summary, respectively. You can run SQL queries on these temporary tables.
    Description of the illustration register-dataframes.png
  12. The View All Taxi Data paragraph uses the %sql interpreter, which enables you to execute a Spark SQL query. The query in this paragraph displays the data in all rows and columns of the taxi table in a tabular format.
    Description of the illustration data-table.png
  13. The Group Trips by Duration paragraph groups the individual trips by trip duration, and then counts the number of trips in each group. The taxi data is displayed in a Bar Chart format.
    Description of the illustration data-chart.png
  14. The View Dataset Summary paragraph also uses the %sql interpreter to execute a Spark SQL query. The query displays all rows in the taxi_summary table in a tabular format.
    Description of the illustration view-dataset-summary.png
  15. Click Run all paragraphs Run icon on the note's toolbar to run all paragraphs in this note.
    Description of the illustration run-paragraphs.png

    A Run all paragraphs confirmation message is displayed. Click OK. When a paragraph executes successfully, its status changes to FINISHED.
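
For reference, the core paragraphs of the imported note follow the general shape sketched below. This is a hedged sketch only: the real note defines the exact read options and column names (trip_duration and the other column names here are placeholders), so treat it as an outline of the technique rather than the note's literal code.

    %spark
    // Read every .csv file in /user/demo into a dataframe (Spark 1.x with spark-csv).
    val df1 = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // assumption: the files include a header row
      .option("inferSchema", "true")   // infer column types from the data
      .load("hdfs:///user/demo/*.csv")

    // Keep a subset of columns; these column names are placeholders.
    val df2 = df1.select("trip_duration", "pickup_datetime", "fare_amount")

    %spark
    // Register the dataframes as temporary tables so %sql paragraphs can query them.
    df1.registerTempTable("taxi")
    df2.registerTempTable("taxi_summary")

    %sql
    -- View All Taxi Data: every row and column of the taxi table.
    SELECT * FROM taxi

    %sql
    -- Group Trips by Duration: count trips per duration (placeholder column name).
    SELECT trip_duration, COUNT(*) AS trip_count FROM taxi GROUP BY trip_duration

    %sql
    -- View Dataset Summary: every row of the taxi_summary table.
    SELECT * FROM taxi_summary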


Next Tutorial

Creating a Personal Dashboard in Oracle Big Data Manager