Automate failed dataset ingestions using Amazon QuickSight

Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people you work with, wherever they are. Users can import data into SPICE (Super-fast, Parallel, In-memory Calculation Engine) in scheduled or programmatic ways.

When setting up QuickSight dataset refresh, you can email owners when a refresh has failed, but there is no way to natively retry them, because refreshes can fail due to transient errors.

In this post, we show how to use AWS CloudFormation to deploy all the necessary resources to automate the retry of the ingestion of a failed dataset refresh. This can help speed up the time to have the data available to the users by either completing the refresh successfully or providing more information on the cause of the failure to the dataset owner.

Additionally, QuickSight assets can be monitored using Amazon CloudWatch metrics. QuickSight developers and administrators can use these metrics to observe and respond to the availability and performance of their QuickSight ecosystem in near-real time.

Overview of solution

We use CloudWatch metrics to capture near-real-time events from QuickSight, and set up a CloudWatch metric alarm and an Amazon EventBridge rule that targets an AWS Step Functions state machine that starts a new ingestion of the failed dataset refresh.

The following diagram illustrates the architecture used for this post. The Step Functions state machine and related AWS Lambda functions are deployed through a CloudFormation template.

Solution Architecture

In the following sections, we demonstrate how to find the dataset ID that you want to monitor, then deploy the CloudFormation template, review the created resources, test the solution, and clean up to avoid unnecessary charges.

Prerequisites

To complete this walkthrough, you need the following:

  • An AWS account
  • A subscription to QuickSight Enterprise Edition
  • An AWS Identity and Access Management (IAM) role with permissions to CloudWatch metrics, CloudWatch alarms, EventBridge, QuickSight, Step Functions, and Lambda
  • Basic knowledge of the services used and of Python

Find the dataset ID that you want to monitor

Complete the following steps to locate your dataset ID:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose the dataset you want to monitor.
  3. From the URL on the browser, copy the part between data-sets/ and /view. It will look similar to the following URL:

https://us-west-2.quicksight.aws.amazon.com/sn/data-sets/4712aba2-6ecc-4521-b009-deb5b276a5f6/view

If you don’t have a dataset available for testing, you can create one following these steps where we create an Amazon Athena table. We use this table to create a QuickSight dataset that we import into SPICE. To simulate a refresh failure, we drop the table, which causes the dataset refresh to fail. This failure triggers the state machine we created, retrying the refresh automatically.

  1. On the Athena console, create a table. For this post, we use CTAS in Athena to create a simple table called foo_bar.

CTAS for foo_bar table

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Create dataset

  1. Under Create dataset, choose Athena.
  2. Enter a name under Data source name, and choose the Athena workgroup you used to create the table.
  3. Choose Create data source.

Create data source

  1. In the Choose your table section, specify your catalog, database, and table.
  2. Choose Edit/Preview data.

Edit/Preview data

  1. Under Query mode, choose SPICE.
  2. Choose Save & Publish, and Cancel, to return to the dataset view.

Dataset view

  1. Open the dataset from this view by choosing its name.

On the Refresh tab, you can see the status of the refresh when the dataset was created, which is Completed.

Refresh tab

  1. From the browser URL, get the dataset ID (as illustrated earlier).

Deploy the resources

In this step, we create the Step Functions state machine and Lambda functions to manage the automated ingestion after a failure. Complete the following steps:

  1. Open the AWS CloudFormation console.
  2. Choose Launch Stack to open the template on the console.

Launch Stack

  1. Choose Next.

Create Stack

  1. Under Parameters, enter the QuickSight dataset ID from the previous step for DataSetID.
  2. Choose Next.

Specify stack details

  1. On the next page, choose Next.
  2. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Submit.

The stack creation takes around 2 minutes to complete.

Review resources

Complete the following steps to review your resources:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack that you created. If you didn’t change the stack name, it will be called automate-failed-ingestion-blog.
  3. Choose the Resources tab. Here you will see all the resources created by the CloudFormation stack.
  4. Open the resource of type AWS::CloudWatch::Alarm by choosing the physical ID. This will open the resource in a new tab.
  5. Review the details of the CloudWatch alarm. Under View EventBridge rule, you will see the pattern used as the basis of the EventBridge rule.
  6. On your browser, navigate back to the tab where the CloudFormation console is open.
  7. Open the resource of type AWS::Events::Rule by choosing the physical ID. This will open the resource in a new tab.
  8. Under Event pattern, you can find a similar event to the one described in the EventBridge rule. The only difference is that we added the state value to ALARM, so the targets are not triggered also when the status changes to OK.
  9. On your browser, navigate back to the tab where the CloudFormation console is open.
  10. Open the resource of type AWS::StepFunctions::StateMachine by choosing the physical ID. This will open the resource in a new tab.
  11. Choose Edit to open the Step Functions state machine definition.
  12. You can select the Lambda functions and choose View function to open the Lambda functions.

Step Functions state machine logic

When the state machine runs, the first Lambda function in the Check Previous Ingestions state is run to check the status of the previous ingestions. If the latest status is completed, the state machine will send the status of COMPLETED and will end through the Success state. If the Lambda function detects that there have been more than a specific number (six by default) of failed attempts on the refresh, it will send the status of TOO_MANY_FAILURES and will end through the Failed state. The number of retries is configurable on the Lambda function.

State machine logic 1/2

If the status is neither COMPLETED or TOO_MANY_FAILURES, then the state machine moves through the Default state to run the next Lambda function in the Start New Ingestion state. Then the flow will wait for 60 seconds before checking the status of the ingestion by running the Lambda function in the Get Ingestion Status state. If this Lambda function returns the status COMPLETED, then the state machine will end through the Success state. If the status is FAILED, it will end through the Failed state. If the status is neither of those, it will wait again for 60 seconds and check again.

The type of refresh that is started will be the same as the last refresh. Because there is a possibility that an error will also happen when we are editing a dataset, which will create an error with refresh type EDIT, the code will not retry an ingestion at this point because it’s not an automated refresh process that is failing.

State machine logic 2/2

Test the solution

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack that you created. If you didn’t change the stack name, it will be called automate-failed-ingestion-blog.
  3. Choose the Resources tab. Here you will see all the resources created by the CloudFormation stack.
  4. Open the resource of type AWS::CloudWatch::Alarm by choosing the physical ID. This will open the resource in a new tab.
  5. Choose the Resources tab and choose the link specified for the Physical ID of the resource which has Local ID as DataSourceIngestionFailed.

This will open the CloudWatch alarm that is related to the dataset we are monitoring.

Cloudwatch Alarm

  1. Make the source of your dataset unavailable for a refresh, or, if you are using the dataset we created for testing, drop the table you created.

Drop foo_bar

  1. From the QuickSight dataset, choose Refresh Now, choose Full Refresh, and choose Refresh.

The dataset refresh will show as Failed.

Dataset refresh tab

  1. Wait a minute and check the status of the CloudWatch alarm again.

It will be in state In alarm.

Cloudwatch Alarm

  1. On the Resources tab of the CloudFormation stack, choose the link specified in the Physical ID of the resource that has Local ID as QuickSightRestartIngestionStateMachine.

This will open the Step Functions state machine that runs the retry logic.

  1. Choose the ID under Name to view the latest run of the state machine.

In this run, the ingestion was retried but it failed, that is why the status is set to Failed.

State machine status graph

  1. To enable email alerts, navigate to your QuickSight console and choose Datasets in the navigation pane.
  2. Choose the dataset that you want to enable the feature.
  3. On the Refresh tab select Email owners when a refresh fails.

Dataset refresh tab Email owners when a refresh fails

If you have chosen this option on the dataset, the owner will get a notification of all the failures via email.

Email with failed refresh status

Clean up

To avoid incurring future charges, delete the resources you created:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack that you created. If you didn’t change the stack name, it will be called automate-failed-ingestion-blog.
  3. Choose Delete.

This will delete all the resources created by the stack.

Conclusion

In this walkthrough, you successfully created all the resources needed to automatically retry the ingestion of a failed refresh on a QuickSight data source. We showed how these resources are related to each other and how this solution can improve the QuickSight user experience by ensuring the data is refreshed in an automated fashion and improve the QuickSight operator experience by automating their manual work when a dataset refresh fails.

To learn more, visit Amazon QuickSight. Join the QuickSight Community to ask, answer, and learn with others.


About the Author

Andres Castro is a Global Solutions Architect in Global Financial Services. He has been working in the consulting and financial sector for the past 25 as a DevOps Engineer, Finance Data Solutions Architect, and Cloud Engineer before joining AWS. He is passionate about business intelligence, data governance, data analytics, and everything cloud.


This is a companion discussion topic for the original entry at https://aws.amazon.com/blogs/business-intelligence/automate-failed-dataset-ingestions-using-amazon-quicksight/