I noticed that QuickSight now supports defining a primary key for each dataset. Can this be used to prevent data duplication during incremental refresh? This is a capability already supported in many BI tools like Power BI and Tableau.
If it’s not currently supported, is this functionality on the roadmap? If so, is there an estimated timeline for its availability? Thanks
First of all, welcome to the QuickSight community!
I am not sure whether a primary key or unique key column can prevent data duplication during incremental refresh, but it should prevent duplicate records by ensuring that the dataset contains only unique entries for that column.
This unique key acts as a global sorting key for the dataset and optimizes query generation for table visuals. When a user creates a table visual in QuickSight and adds the unique key column to the value field well, data is sorted from left to right up to the unique key column. All columns to the right of the unique key column are ignored in the sort order. Tables that don’t contain a unique key are sorted based on the order that the columns appear in the dataset.
The following limitations apply to unique keys:
Unique keys are only supported for unaggregated tables.
If a dataset column is used for column level security (CLS), the column can’t also be used as the unique key.
Hi @Deep, thanks for your quick response. I have read the documentation you provided and have a question. After enabling Unique Keys for the dataset, during dataset ingestion:
If any duplicates are detected, QuickSight will prompt you with an error notification.
You’ll need to resolve any errors, such as eliminating duplicate entries or adjusting the dataset, before proceeding.
I’m wondering if QuickSight can identify which records or unique keys are duplicated in the response. If so, I could use that information to trigger an S3 deletion and remove the old files to avoid duplication.
I have not tried it myself, but I think that when QuickSight detects duplicates during dataset ingestion with Unique Keys enabled, it will show an error but may not provide detailed information about which specific records are duplicated. The error message is likely generic, indicating only that duplicates were found.
To handle this situation, you could potentially:
Export your dataset to a temporary location
Use other AWS services like Athena or Glue to identify the duplicate records by querying the data directly
Once identified, you can then programmatically remove the duplicate files from S3 (a rough sketch of this approach follows the list)
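Here is a minimal sketch of that workflow, assuming the raw files are already cataloged in an Athena table; the database, table, bucket, column names, and results location below are hypothetical placeholders, not anything QuickSight provides. It uses Athena's `"$path"` pseudo-column to find which S3 objects contain duplicated key values and then deletes those objects. Keep in mind that deleting a whole file removes every record it contains, so review the affected files before running the deletion step.

```python
import time
import boto3

# Hypothetical names -- replace with your own Athena database/table,
# unique-key column, and Athena query-results location.
ATHENA_DATABASE = "analytics_db"
ATHENA_TABLE = "orders_raw"
UNIQUE_KEY = "order_id"
RESULTS_LOCATION = "s3://my-athena-results/dedup/"

athena = boto3.client("athena")
s3 = boto3.client("s3")


def find_duplicate_files():
    """Return the S3 files ("$path") that contain unique-key values
    appearing more than once across the whole table."""
    query = f"""
        SELECT {UNIQUE_KEY}, "$path" AS source_file
        FROM {ATHENA_TABLE}
        WHERE {UNIQUE_KEY} IN (
            SELECT {UNIQUE_KEY}
            FROM {ATHENA_TABLE}
            GROUP BY {UNIQUE_KEY}
            HAVING COUNT(*) > 1
        )
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": ATHENA_DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified; production code should
    # use backoff and handle failures more gracefully).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # First row is the header; pagination is omitted for brevity.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return {row["Data"][1]["VarCharValue"] for row in rows[1:]}


def delete_old_files(paths):
    """Delete the S3 objects that hold duplicate records."""
    for path in paths:
        bucket, key = path.replace("s3://", "").split("/", 1)
        s3.delete_object(Bucket=bucket, Key=key)


if __name__ == "__main__":
    delete_old_files(find_duplicate_files())
```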
Alternative approach:
Use AWS Glue or AWS Lambda to pre-process your data before ingesting into QuickSight
Implement deduplication logic that maintains a record of processed files (a rough sketch follows this list)
Only then load the deduplicated data into QuickSight
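As a rough sketch of that pre-processing idea (the DynamoDB table name, S3 prefixes, and key column below are assumptions, not QuickSight-specific), a Lambda function triggered by new raw files could record each unique key it has already accepted, drop rows whose key was loaded before, and write the cleaned file to the prefix your QuickSight dataset refreshes from:

```python
import csv
import io
import boto3

# Hypothetical resources -- adjust to your environment: a DynamoDB table
# ("processed_keys") that records unique keys already loaded, and a
# "clean/" S3 prefix that the QuickSight dataset points at.
s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
seen_keys_table = dynamodb.Table("processed_keys")

UNIQUE_KEY = "order_id"
CLEAN_PREFIX = "clean/"


def lambda_handler(event, context):
    """Triggered by an S3 put on the raw prefix: drop rows whose unique
    key has already been ingested, record the new keys, and write the
    deduplicated file to the prefix QuickSight refreshes from."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))

    fresh_rows = []
    for row in reader:
        try:
            # The condition makes the put fail if the key already exists,
            # so duplicates across files are skipped.
            seen_keys_table.put_item(
                Item={"key_value": row[UNIQUE_KEY]},
                ConditionExpression="attribute_not_exists(key_value)",
            )
            fresh_rows.append(row)
        except seen_keys_table.meta.client.exceptions.ConditionalCheckFailedException:
            continue

    if fresh_rows:
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(fresh_rows)
        s3.put_object(
            Bucket=bucket,
            Key=CLEAN_PREFIX + key.split("/")[-1],
            Body=out.getvalue().encode("utf-8"),
        )
```

With something like this in place, the S3 location QuickSight points at only ever contains deduplicated rows, so the Unique Keys check becomes a safety net rather than the primary deduplication mechanism.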
This way, you can handle duplicates proactively rather than relying on QuickSight’s error messages.