Exercise Overview

In this tutorial, you will learn how to use the Workflow Builder to orchestrate multiple operations. You will create a workflow, move data between sources, add a dataflow to prepare and clean a dataset, and finally publish the dataset to an accelerated data layer for consumption by business intelligence tools or further analysis.

Exercise #1


Objectives:

  1. Copy a dataset from NFS to HDFS
  2. Run a dataflow on the dataset
  3. Copy the output dataset to Jethro
WF101 Example 1 Overview
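
Before you begin, it may help to see the three objectives above as a single pipeline. The Python sketch below is purely illustrative: the helper functions are hypothetical stand-ins for the Workflow Builder components you will wire together in this exercise, and they are not part of any real Aunsight API.

    # Illustrative only: hypothetical stand-ins for Workflow Builder components.
    def copy_dataset(source: str, target: str, write_mode: str = "overwrite") -> None:
        """Stand-in for the Copy Dataset component."""
        print(f"copy {source!r} -> {target!r} (write mode: {write_mode})")

    def run_dataflow(dataflow: str, throw_on_failure: bool, overwrite: bool) -> None:
        """Stand-in for the Run Tokamak Dataflow component."""
        print(f"run {dataflow!r} (throw_on_failure={throw_on_failure}, overwrite={overwrite})")

    # The three objectives, in order:
    copy_dataset("Northwind Orders Data (Example 3)",     # source record on NFS
                 "Workflow 101 HDFS Landing Dataset")     # target record on HDFS
    run_dataflow("Workflow101 Example1", throw_on_failure=True, overwrite=True)
    copy_dataset("Workflow Example Dataflow Output",      # dataflow output in HDFS
                 "Northwind Orders Aggregated Jethro")    # accelerated layer (Jethro)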

A) Create a new Workflow

  1. After logging in to Aunsight and selecting the relevant organization, click on Workflows in the menu bar on the left side of the screen.
  2. Click the plus button to create a new workflow.
Create a new workflow
  3. Enter a descriptive name for your workflow, such as Workflow101 Example. In the tags field, enter wf-example1. When finished, click Create.
Name workflow
  4. You should be at the Workflow page for the workflow you just created. Click the Modify button, which will bring you to the Workflow Builder main screen.
Modify Workflow
  5. You will see two components, Build Aunsight Context and Workflow Parameters. For simplicity, you may delete the Workflow Parameters component by clicking on it to highlight it and then pressing the red trashcan button in the right side panel.
Delete component

B) Move data from source file system (NFS) into target file system for processing (HDFS)

Copy from NFS to HDFS
  1. On the right side panel, click Components and select Get Atlas Record (you can use the search field to find it quickly). The Get Atlas Record component pulls metadata about your dataset from the centralized Aunsight platform. You will need two of these components for the next step, so after adding the first one, repeat this step to add a second Get Atlas Record component.
Get Atlas Record
  2. You should now have two of these components in your workflow; name one NFS Record for the original dataset located in NFS, and the other HDFS Record for the new dataset that will reside in HDFS.
Name components
  3. From the Build Aunsight Context component, connect the outbound Context port to the inbound Context port of each Get Atlas Record component by clicking and dragging a line from the context port on the Build Aunsight Context block to the context port on each Get Atlas Record block.
    TIP: To disconnect components, select one of the connected pair. Then on the right side panel, scroll to the Inputs or Outputs section and click on the red delete sign next to the context you are trying to remove. (Hint: Hovering the mouse over the delete sign will highlight the connection that will be deleted.)
  4. Select the Get Atlas Record component named NFS Record. Then on the right side panel, scroll to the Inputs section and click on the Edit (pen and paper) symbol next to ID.
Connect components
  5. Using the search screen that pops up, select Northwind Orders Data (Example 3).
  6. Create a dataset (see the tutorial on copying datasets) called Workflow 101 HDFS Landing Dataset and tag it with wf-example1.
  7. Go back to editing your workflow by clicking Modify.
Select dataset
  8. From the Components panel, add Copy Dataset and name it Move data from NFS to HDFS.
Copy Dataset
  9. Connect the outbound record port from the NFS Record component to the source inbound port on Copy Dataset, and connect the outbound record port from the HDFS Record component to the target inbound port on Copy Dataset.
  10. On Copy Dataset, select Options on the right side panel.
Connect Atlas records
  11. Under Write Mode, select Overwrite and click Submit. You could now run this workflow, and it would move data from the source record to the target record, crossing from NFS to HDFS. If you don't select Overwrite, the workflow will error when the target dataset is not empty (see the sketch below).
Overwrite
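
To make the Write Mode behavior concrete, here is a plain-Python model of what Overwrite implies for a copy; the copy_records helper and the sample rows are made up and only mirror the rule described in the step above.

    # Hypothetical model of the Copy Dataset Write Mode option (not Aunsight code).
    def copy_records(source, target, write_mode="error"):
        if target and write_mode != "overwrite":
            # Without Overwrite, copying into a non-empty target dataset fails.
            raise RuntimeError("target dataset is not empty; set Write Mode to Overwrite")
        # With Overwrite, the target's existing contents are replaced by the source.
        return list(source)

    hdfs_landing = ["stale row"]                   # target already has data
    nfs_orders = ["order 10248", "order 10249"]    # source records on NFS

    hdfs_landing = copy_records(nfs_orders, hdfs_landing, write_mode="overwrite")
    print(hdfs_landing)  # ['order 10248', 'order 10249']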

C) Set up a Dataflow to run within the Workflow

Run Dataflow and save
  1. On the right side panel, click Components and select Get Dataflow (name it Get Groupby) and Run Tokamak Dataflow (name it Run Groupby).
Get dataflow and run dataflow
  2. On Get Dataflow, connect the context outbound port from Build Aunsight Context to the inbound context port.
Connect context
  3. Select Get Dataflow and open the id field on the right side panel. Select the Workflow101 Example1 dataflow, which has been pre-built for you (or see this tutorial to build it yourself).
Select dataflow
  4. On Run Dataflow, select Options from the right side panel.
Choose options
  5. Set Throw on Failure to True and Overwrite to True, and then click Submit.
Set
  6. Connect the Get Dataflow dataflow outbound port to the Run Tokamak Dataflow dataflow inbound port.
Connect Get Dataflow to Run Dataflow
  7. From Components on the right sidebar, select Get Resource and Get Organization.
Get Organization and Get Resource
  8. Connect the resource outbound port on Get Resource to the resource inbound port on Run Tokamak Dataflow.
Connect Get Resource to Run Dataflow
  9. Connect the organization outbound port on Get Organization to the organization inbound port on Get Resource.
Connect Get Organization to Get Resource
  10. Connect the context outbound port from Build Aunsight Context to the context inbound port on both Get Resource and Get Organization.
Connect Build Aunsight Context to both Get Resource and Get Organization
  11. On Get Resource, using the right side panel, enter shared_hdh_oozie in the id field.
    NOTE: This retrieves a reference to the Hadoop resource that will run the dataflow. The value can be found on the Resources tab and references the HDFS resource that your organization has set up. Contact Aunalytics support if you need help determining what this is.
Set Get Resource ID
  12. On Get Organization, enter your organization’s UUID into the id field. The UUID appears in the URL string following “#organization” and can also be obtained from your system administrator or Aunalytics support.
Add organization UUID
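If you prefer to grab the UUID programmatically, the short snippet below pulls it out of a browser URL of the kind described above. The example URL and the separator after “#organization” are assumptions for illustration; the real address in your browser may differ.

    import re

    # Hypothetical example URL; per the step above, the organization UUID
    # appears after "#organization" in the browser's address bar.
    url = "https://example-aunsight-host/#organization/123e4567-e89b-12d3-a456-426614174000"

    match = re.search(r"#organization\W*([0-9a-fA-F-]{36})", url)
    if match:
        print("organization UUID:", match.group(1))
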
  13. Connect the target outbound port on Copy Dataset to the wait inbound port on Get Dataflow. This creates a dependency so the copy happens before the dataflow executes (illustrated in the sketch below). You can read more about our workflow execution model in the Aunsight documentation.
Connect Copy Dataset to Get Dataflow
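The wait port is essentially a sequencing edge: a component will not start until the component feeding its wait port has finished. The sketch below illustrates that ordering idea with made-up step names; it models only the dependency ordering, not Aunsight's actual scheduler.

    from graphlib import TopologicalSorter

    # Made-up step names standing in for this workflow's components.
    # The wait connection you just made corresponds to get_dataflow
    # depending on copy_nfs_to_hdfs.
    dependencies = {
        "copy_nfs_to_hdfs": set(),
        "get_dataflow": {"copy_nfs_to_hdfs"},   # wait port: copy finishes first
        "run_groupby": {"get_dataflow"},
        "copy_to_jethro": {"run_groupby"},      # added later, in section D
    }

    for step in TopologicalSorter(dependencies).static_order():
        print("run:", step)
    # run: copy_nfs_to_hdfs
    # run: get_dataflow
    # run: run_groupby
    # run: copy_to_jethro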

D) Publish data to accelerator

Publish to Jethro
  1. Create two more Get Atlas Record components and a Copy Dataset component. Rename one of the Get Atlas Record components Get Dataflow Output, and the other Get Jethro.
Create Copy Dataset and Get Atlas Records components
  2. On the Get Dataflow Output component, edit the id and select Workflow Example Dataflow Output. (To learn how to create an empty dataset, see this section of the Dataset Creation tutorial.)
Select Dataflow output
  3. On the Get Jethro component, edit the id and select Northwind Orders Aggregated Jethro.
Get Dataflow output
  4. Connect the context outbound port from Build Aunsight Context to both Get Atlas Record inbound context ports.
Connect to Build Aunsight Context
  5. Connect the record outbound port from the Get Dataflow Output component to the source inbound port on the last Copy Dataset.
Get Dataflow Output to Copy Dataset
  6. Connect the record outbound port from the Get Jethro component to the target inbound port on the last Copy Dataset.
Get Jethro to Copy Dataset as Target
  7. On the last Copy Dataset, select Options, set Write Mode to Overwrite, and click Submit.
Set Copy Dataset to Overwrite
  8. Connect the Run Groupby job outbound port to the wait inbound port on the last Copy Dataset.
Copy Dataset wait for Run Groupby
  9. Click Save.
Save Workflow
  10. Return to the Workflow page.
Return to Workflow page
  11. Click Run.
  12. Click Submit Job.

Congratulations! Your job is now running.

Return to Run page and Submit Job