Delete duplicate rows appended by incremental refresh

Hello QuickSight Team,

Context:
I am using incremental refresh for a dataset of 25M rows.
Because of the incremental refreshes, the dataset has grown to ~65-70M rows, which caused a data size limit failure, as I am on a limited budget.

What I need:

  • I want the appended duplicate rows to be deleted automatically.
  • I currently use a max(rank) function to filter the data down to the latest modified date, but that is not enough; I need the duplicate rows themselves to be deleted.

Note:
Previously I had raised a feature request:

Can I get an update on this?

Thanks a lot,
Ajinkya

Hi @ajinkya_ghodake,
what data source are you using?
BG

I am currently using RDS as my data source.
I also have Snowflake as a source.

You are loading from RDS to SPICE, right?
Would it be possible to reload the full data without the last rows, either by custom SQL or by a filter within the dataset (latest modified date > xy)?
I’m not aware of a function to “unload” data out of SPICE.
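For instance, something like this as the custom SQL for a one-off full reload (just a sketch; the cutoff value is a placeholder):

select <column_list>
from <my_table>
where modified_at <= '<cutoff_date>'; -- reload everything except the recently duplicated rows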


I am using a custom SQL query to load my data:
select <column_list> from <my_table>;

Incremental refresh condition:
modified_at > Yesterday (24 hours)

For rows that already exist in SPICE outside this incremental range, the refresh appends new copies instead of updating them (i.e. duplicate rows).
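Conceptually, each refresh cycle behaves something like this (my understanding of the behaviour, not actual QuickSight internals):

-- rows fetched and appended by the incremental refresh
select <column_list>
from <my_table>
where modified_at > current_date - interval '1 day';
-- older versions of these rows (with a modified_at outside the window)
-- are already in SPICE and are not removed, so they remain as duplicates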

I have added a filter in my analysis to remove these duplicates, using a rank function on the modified_at field.

My only problem is the duplicate rows that keep being appended after every refresh cycle.
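For reference, the rank filter above is roughly equivalent to this SQL (a sketch with placeholder names; even pushing it into the custom SQL would not remove the rows already appended in SPICE):

select <column_list>
from (
  select <column_list>,
         row_number() over (partition by <id_column> order by modified_at desc) as rn
  from <my_table>
) ranked
where rn = 1; -- keep only the latest version of each row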

Do you have a sample at the dataset level, where I can see what you get and what you are looking for?

Do you have duplicates because there are multiple modified_at dates?

ID | Value | Modified_at
2  | 11    | 2023-10-01
1  | 15    | 2023-10-10
2  | 10    | 2023-10-10
1  | 16    | 2023-10-11

@ajinkya_ghodake I understand you are trying to clean up data already ingested into a SPICE dataset. I don’t think there is a direct way to achieve that.

If your SPICE dataset grew from 25M rows to ~65-70M, it could be because the source dataset has a retention period set. In order to sync the SPICE dataset to its source, you would want to do a full refresh as required to re-baseline it. This would take care of the duplicate scenario as well. Happy to help if you have follow-up questions.

Can we raise a feature request to automatically do a full refresh every, let’s say, 7-15 days?
It would really help us tackle this issue.

You may set up two refresh schedules for the same dataset, for instance, a weekly full refresh and an hourly incremental refresh.

Can you mark it as “Solution” if you are happy with this approach? Thanks!

Damn!!!
I never knew that we could configure two different refresh schedules.

Thanks a lot @royyung