Delete duplicate rows appended by incremental refresh

Hello QuickSight Team,

Context:
I am using incremental refresh for a dataset of 25M rows.
Because of the incremental refreshes, the dataset has grown to ~65-70M rows, which caused a data size limit failure, as I am on a limited budget.

What I need:

  • I want the appended duplicate rows to be deleted automatically.
  • I currently use a max(rank) function to filter the data down to the latest modified date, but that is not enough; I need the duplicate rows themselves to be deleted.

Note:
Previously I had raised a feature request:

Can I get an update on this?

Thanks a lot,
Ajinkya

Hi @ajinkya_ghodake,
what data source are you using?
BG

I am currently using RDS as my data source.
I also have Snowflake as a source.

You are loading from RDS to SPICE, right?
Would it be possible to reload the full data without the last rows, either by custom SQL or by a filter within the dataset (latest modified date > xy)?
I’m not aware of a function to “unload” data out of SPICE.
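For instance, something like this as the custom SQL for a one-off full reload (just a sketch; the cutoff value is a placeholder):

select <column_list>
from <my_table>
where modified_at <= '<cutoff_date>'; -- reload everything except the recently duplicated rows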


I am using a custom SQL query to load my data:
select <column_list> from <my_table>;

Incremental refresh condition:
modified_at > Yesterday (24 hours)

For rows that already exist in SPICE outside this incremental range, the refresh appends new copies instead of updating them (i.e. duplicate rows).
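Conceptually, each refresh cycle behaves something like this (my understanding of the behaviour, not actual QuickSight internals):

-- rows fetched and appended by the incremental refresh
select <column_list>
from <my_table>
where modified_at > current_date - interval '1 day';
-- older versions of these rows (with a modified_at outside the window)
-- are already in SPICE and are not removed, so they remain as duplicates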

I have added a filter in my analysis to remove these duplicates, using a rank function on the modified_at field.

My only problem is the duplicate rows that keep being appended after every refresh cycle.
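For reference, the rank filter above is roughly equivalent to this SQL (a sketch with placeholder names; even pushing it into the custom SQL would not remove the rows already appended in SPICE):

select <column_list>
from (
  select <column_list>,
         row_number() over (partition by <id_column> order by modified_at desc) as rn
  from <my_table>
) ranked
where rn = 1; -- keep only the latest version of each row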

Do you have a sample at the dataset level, where I can see what you get and what you are looking for?

Do you have duplicates because there are multiple modified_at dates?

ID | Value | Modified_at
2  | 11    | 2023-10-01
1  | 15    | 2023-10-10
2  | 10    | 2023-10-10
1  | 16    | 2023-10-11

@ajinkya_ghodake I understand you are trying to clean up data already ingested into a SPICE dataset. I don’t think there is a direct way to achieve that.

If your SPICE dataset grew from 25M rows to ~65-70M, it could be because the source dataset has a retention period set. In order to sync the SPICE dataset to its source, you would want to do a full refresh as required to re-baseline it. This would take care of the duplicate scenario as well. Happy to help if you have follow-up questions.

Can we raise a feature request to automatically do a full refresh every, let’s say, 7-15 days?
It would really help us tackle this issue.

You may set up two refresh schedules for the same dataset, for instance, a weekly full refresh and an hourly incremental refresh.

Can you mark it as “Solution” if you are happy with this approach? Thanks!

Damn!!!
I never knew that we could configure two different refresh schedules.

Thanks a lot @royyung