Dataflow Builder - Overview

Overview

The Dataflow Builder in Aunsight lets users organize, transform, and format data for use in reporting, BI, or advanced analytics. It abstracts and visualizes complex data operations, making it easier to prepare data and to automate that preparation as data is added or updated. Users can leverage powerful computing engines like Spark and MapReduce through an easy-to-use drag-and-drop interface, without having to write Java code to perform basic operations. The builder combines popular operations such as table joins, deduplication, arithmetic, aggregations, and string/date manipulations in a single place, reducing the number of tools you need to get the job done.

What can you do with Dataflow Builder?

Fix Data

Format Data

Filter Data

Integrate Data Sources

Build Features

Main Page

When a dataflow is created or an existing dataflow is selected from the list on the left side of the screen, the main page will display on the right. From this page, you can view general information about the dataflow and its history, duplicate or delete the dataflow, or set watch notifications.

  1. The Details tab contains general information about the dataflow: its name and ID, creation and last updated dates, tags, inputs and outputs, and context information.
  2. The Versions tab contains a record of previous versions and gives you the ability to view, delete, or revert to a previous version.
  3. The Jobs tab lists the previous runs of the dataflow along with the current state of each run (in progress, completed, or failed).
  4. The Run tab allows you to configure and submit a dataflow job.

Modify Dataflow

After clicking Modify from the main page, you can view or edit the details of the dataflow itself.

  1. The middle portion of the page is a visual representation of the dataflow. The dataflow runs from top to bottom, and each operation is displayed as a box with the resultant dataset directly underneath. Connections between inputs and outputs are shown by arrows connecting the operations.
  2. The menu on the right-hand side has two tabs. The Operations tab lets you add an operation from a list of all available operations; operations are grouped by category, and there is also a search function. The Details tab shows the details of the selected operation and lets you view or edit its title and description, arguments, inputs, and outputs.
  3. The left side of the screen has three tabs. The first tab shows all the datasets linked in the dataflow, and allows you to import new datasets. The second tab shows the schema of the selected dataset. The last tab is a search feature.

Operations

Operations are grouped into multiple categories. A short description of each category is below.
More detailed information about each individual operation can be found in the Aunsight documentation.

Dataset


These operations are related to retrieving, renaming, and storing the datasets themselves.

Join


These operations bring multiple datasets together, whether through lookup, outer joins, or cartesian crossing.
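
For readers who want a concrete picture of the three join styles named above, here is a minimal conceptual sketch in plain Python. It is not Aunsight's API or the implementation behind the Join operations; the function names (lookup, full_outer_join, cross) and the sample fields are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical sample data; the field names are illustrative only.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 3, "total": 40.0},
]


def lookup(left, right, key):
    """Keep every left row; attach fields from the first matching right row, if any."""
    index = {}
    for row in right:
        index.setdefault(row[key], row)  # first match wins
    return [{**l, **index.get(l[key], {})} for l in left]


def full_outer_join(left, right, key):
    """Keep rows from both sides; unmatched rows lack the other side's fields."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    joined, matched = [], set()
    for l in left:
        matches = right_by_key.get(l[key], [])
        if matches:
            matched.add(l[key])
            joined.extend({**l, **r} for r in matches)
        else:
            joined.append(dict(l))
    for k, rows in right_by_key.items():
        if k not in matched:
            joined.extend(dict(r) for r in rows)
    return joined


def cross(left, right):
    """Cartesian product: pair every left row with every right row."""
    return list(product(left, right))


enriched_orders = lookup(orders, customers, "customer_id")
all_rows = full_outer_join(customers, orders, "customer_id")
pairs = cross(customers, orders)
```

In this sketch a lookup never changes the number of rows on the left side, an outer join keeps unmatched rows from both sides, and a cross grows as the product of the two row counts, which is why cartesian crossing is usually reserved for small datasets.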

Group


These operations combine individual rows of a dataset together into a collection based on an aggregate field or fields.

Collection


These operations allow you to interact with aggregations across grouped rows of data. You can compute sums or means or get a field based on a selected sorting method.
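
The relationship between the Group and Collection categories can be illustrated with a small plain-Python sketch: grouping first gathers rows into collections, and collection-style operations then reduce each collection to summary values. This is only a conceptual model, not Aunsight's implementation; the helper names (group_by, summarize) and the sample fields are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; the field names are illustrative only.
sales = [
    {"region": "east", "amount": 10.0, "day": 1},
    {"region": "east", "amount": 30.0, "day": 2},
    {"region": "west", "amount": 5.0, "day": 1},
]


def group_by(rows, field):
    """Group step: gather rows that share the same value of `field` into a collection."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return dict(groups)


def summarize(groups):
    """Collection step: compute a sum, a mean, and a field picked by a sort order."""
    return {
        key: {
            "total": sum(r["amount"] for r in rows),
            "mean": mean(r["amount"] for r in rows),
            # Take `amount` from the row with the largest `day` in the collection.
            "latest_amount": max(rows, key=lambda r: r["day"])["amount"],
        }
        for key, rows in groups.items()
    }


groups = group_by(sales, "region")
summary = summarize(groups)
```

Here `summary["east"]` holds a total of 40.0, a mean of 20.0, and a latest amount of 30.0, since the east collection contains two rows and the row with the highest `day` has an amount of 30.0.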

Field


These operations manipulate the columns or fields of a dataset, including the ability to add, select, convert, rearrange, or remove fields from the dataset as required.
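
As a rough illustration of what field-level operations do, the following plain-Python sketch adds, converts, and selects fields on a list of records. It is a conceptual model only, not Aunsight's API; the helper names (add_field, convert_field, select_fields, remove_fields) and the sample fields are hypothetical.

```python
# Hypothetical sample data; the field names are illustrative only.
people = [
    {"id": "1", "first": "Ada", "last": "Lovelace"},
    {"id": "2", "first": "Grace", "last": "Hopper"},
]


def add_field(rows, name, fn):
    """Add a new field whose value is computed from each row."""
    return [{**r, name: fn(r)} for r in rows]


def convert_field(rows, name, cast):
    """Convert an existing field to another type."""
    return [{**r, name: cast(r[name])} for r in rows]


def select_fields(rows, names):
    """Keep only the named fields, in the given order."""
    return [{n: r[n] for n in names} for r in rows]


def remove_fields(rows, names):
    """Drop the named fields and keep everything else."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]


# Each step returns a new list, mirroring how each operation yields a new dataset.
step1 = add_field(people, "full_name", lambda r: f"{r['first']} {r['last']}")
step2 = convert_field(step1, "id", int)
result = select_fields(step2, ["full_name", "id"])
without_names = remove_fields(step2, {"first", "last"})
```

Selecting fields in a different order than they appear in the input is also how this sketch models rearranging columns.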

Row


These operations alter a dataset at the record level to create, explode, filter, pivot, sort, or append rows of data.
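
A few of the row-level operations listed above can be sketched in plain Python as follows. This is a conceptual illustration, not Aunsight's implementation; the helper names (filter_rows, sort_rows, explode_rows, append_rows) and the sample fields are hypothetical, and the exact behavior of the real operations (for example, how explode treats empty values) may differ.

```python
# Hypothetical sample data; the field names are illustrative only.
events = [
    {"user": "a", "tags": ["x", "y"], "score": 3},
    {"user": "b", "tags": [], "score": 9},
    {"user": "c", "tags": ["z"], "score": 5},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [r for r in rows if predicate(r)]


def sort_rows(rows, field, descending=False):
    """Order rows by the value of one field."""
    return sorted(rows, key=lambda r: r[field], reverse=descending)


def explode_rows(rows, field):
    """Emit one output row per item in a list-valued field.

    Assumption of this sketch: rows whose list is empty produce no output rows.
    """
    return [{**r, field: item} for r in rows for item in r[field]]


def append_rows(first, second):
    """Stack the rows of one dataset after the rows of another."""
    return list(first) + list(second)


high = filter_rows(events, lambda r: r["score"] >= 4)
ranked = sort_rows(high, "score", descending=True)
exploded = explode_rows(events, "tags")
combined = append_rows(ranked, [{"user": "d", "tags": [], "score": 1}])
```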

Operation Details

The detail pane on the right side of the screen displays the following information:

  1. Name of the operation
  2. Input(s)
  3. Argument(s)
  4. Output(s)

Expression Builder

The Expression Builder allows users to write Pig expressions through the interface via drop-down lists.

Dataset Details

Manage Datasets

Dataset Schema

Search Datasets