How can we use the primary key feature to avoid duplication during incremental refresh?

I noticed that QuickSight now supports defining a primary key for each dataset. Can this be used to prevent data duplication during incremental refresh? This is a capability already supported in many BI tools like Power BI and Tableau.

If it’s not currently supported, is this functionality on the roadmap? If so, is there an estimated timeline for its availability? Thanks

p.s. We are using Athena for dataset ingestion

Hello Edison,

First of all, welcome to the QuickSight community!

I am not sure whether a primary key or unique key column can prevent data duplication during incremental refresh, but it should prevent duplicate records by ensuring that the dataset contains only unique entries for that column.

This unique key acts as a global sorting key for the dataset and optimizes query generation for table visuals. When a user creates a table visual in QuickSight and adds the unique key column to the value field well, data is sorted from left to right up to the unique key column. All columns to the right of the unique key column are ignored in the sort order. Tables that don’t contain a unique key are sorted based on the order in which the columns appear in the dataset.

The following limitations apply to unique keys:

  • Unique keys are only supported for unaggregated tables.
  • If a dataset column is used for column level security (CLS), the column can’t also be used as the unique key.

See reference here

Another public reference (not from AWS): Amazon QuickSight Launches Unique Key for Datasets

Hope this gives some idea.

Cheers,
Deep

Hi @Deep, thanks for your quick response. I have read the documentation you provided and have a question. After enabling Unique Keys for the dataset, during dataset ingestion:

  • If any duplicates are detected, QuickSight will prompt you with an error notification.
  • You’ll need to resolve any errors, such as eliminating duplicate entries or adjusting the dataset, before proceeding.

I’m wondering if QuickSight can identify which records or unique keys are duplicated in the response. If so, I could use that information to trigger an S3 deletion and remove the old files to avoid duplication.

Hello Edison,

I have not tried it myself, but I think that when QuickSight detects duplicates during dataset ingestion with Unique Keys enabled, it will show an error but may not provide detailed information about which specific records are duplicated. The error message is likely generic, indicating only that duplicates were found.

To handle this situation, you could potentially:

  1. Export your dataset to a temporary location
  2. Use other AWS services such as Athena or Glue to identify the duplicate records by querying the data directly
  3. Once identified, programmatically remove the duplicate files from S3 (see the sketch after this list)
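
I haven’t run this against your setup, but here is a minimal sketch of steps 2 and 3, assuming an Athena table named my_table with a unique key column order_id (both hypothetical names) and using Athena’s built-in "$path" pseudo-column to find which S3 files the duplicated rows came from:

```python
import time
import boto3

# --- assumptions (illustrative names, not from this thread) ---
ATHENA_DATABASE = "analytics"                 # hypothetical Athena database
ATHENA_OUTPUT   = "s3://my-athena-results/"   # hypothetical query-result location
UNIQUE_KEY      = "order_id"                  # hypothetical unique key column

# List duplicated keys together with the S3 file each copy came from.
# "$path" is Athena's built-in pseudo-column holding the source object's path.
FIND_DUPES_SQL = f"""
SELECT {UNIQUE_KEY}, "$path" AS source_file
FROM my_table
WHERE {UNIQUE_KEY} IN (
    SELECT {UNIQUE_KEY} FROM my_table
    GROUP BY {UNIQUE_KEY} HAVING COUNT(*) > 1
)
ORDER BY {UNIQUE_KEY}
"""

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    """Start an Athena query and block until it finishes, returning its execution id."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return qid

def duplicate_files(qid: str) -> set:
    """Collect the distinct S3 paths that contain duplicated keys."""
    paths = set()
    first_page = True
    for page in athena.get_paginator("get_query_results").paginate(QueryExecutionId=qid):
        rows = page["ResultSet"]["Rows"]
        if first_page:
            rows = rows[1:]  # the first page includes a header row
            first_page = False
        for row in rows:
            paths.add(row["Data"][1]["VarCharValue"])
    return paths

if __name__ == "__main__":
    qid = run_query(FIND_DUPES_SQL)
    for path in duplicate_files(qid):
        # path looks like s3://bucket/prefix/file.parquet; review before deleting anything.
        print("duplicate source file:", path)
        # Once you've decided which copy is the stale one, it could be removed with:
        # bucket, key = path.replace("s3://", "").split("/", 1)
        # boto3.client("s3").delete_object(Bucket=bucket, Key=key)
```

The query output gives you the offending keys and the files they live in; deciding which copies count as “old” before deleting is still up to your pipeline.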

Alternative approach:

  • Use AWS Glue or AWS Lambda to pre-process your data before ingesting into QuickSight
  • Implement deduplication logic that maintains a record of processed files
  • Only then load the deduplicated data into QuickSight

This way, you can handle duplicates proactively rather than relying on QuickSight’s error messages; a rough sketch of that pre-processing approach follows.
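
Again, not something I have run myself, but as a sketch: a Lambda triggered by new files landing in a raw bucket, keeping a simple manifest of already-processed keys in S3 and only copying unseen files into the curated prefix that Athena/QuickSight reads (all bucket and key names below are made up for illustration):

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# --- assumptions (illustrative names, not from this thread) ---
MANIFEST_BUCKET = "my-etl-state"         # where the list of already-processed files lives
MANIFEST_KEY    = "processed-files.json"
CURATED_BUCKET  = "my-curated-data"      # the bucket Athena / QuickSight actually reads
CURATED_PREFIX  = "orders/"

def load_manifest() -> set:
    """Read the set of source keys already copied to the curated prefix."""
    try:
        body = s3.get_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)["Body"].read()
        return set(json.loads(body))
    except s3.exceptions.NoSuchKey:
        return set()

def save_manifest(seen: set) -> None:
    """Persist the updated set of processed keys back to S3."""
    s3.put_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(sorted(seen)).encode("utf-8"))

def handler(event, context):
    """Triggered by an S3 PUT notification on the raw landing bucket."""
    seen = load_manifest()
    copied = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if key in seen:
            continue  # this file was already ingested once: skip it
        s3.copy_object(
            Bucket=CURATED_BUCKET,
            Key=CURATED_PREFIX + key.split("/")[-1],
            CopySource={"Bucket": bucket, "Key": key},
        )
        seen.add(key)
        copied += 1
    save_manifest(seen)
    return {"copied": copied}
```

This deduplicates at the file level, which matches the “record of processed files” idea; if you also need row-level deduplication, a Glue/Spark job could drop duplicate rows on the key column before writing to the curated prefix.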

Hope this gives some idea.

Cheers,
Deep