Data synced in SPICE memory using greatest function though incremental refresh duplicates old data instead of overriding it for latest updates

boneogtm · July 20, 2023, 8:15am

We have a dataset created in quick-sight using custom SQL and data is loaded in SPICE memory. Our data source is Redshift server-less.
We are using the greatest function to sync latest data into SPICE as we are using JOIN between multiple tables in a quick-sight dataset.

Our goal is,
Sync latest data of the parent/child table from Redshift to SPICE using incremental refresh. So we are using the updated date of parent and child both tables along with the greatest function. so whether parent or child got updated it reflects changes of either into quick-sight

Issue we face,
We found data duplicates into SPICE memory instead of updating existing for the ones which already exists into SPICE and due to some update we sync them again using incremental refresh

Exception,
We need to only sync updates from Redshift along with newly created between time range (E.g last 24 hours). And for updated ones it should update existing loaded data into SPICE with new updates and inserts newly created one

Let us know if we are missing anything in our implementation approach.

Thanks!

DeepakS · July 20, 2023, 12:37pm

@boneogtm, can you confirm 1/ what is refresh frequency (hourly/daily etc) 2/if you are observing the duplicates for the data that was updated within last 24 hours or prior to last 24 hours since 24hours is your look back window. Most likely you will find duplicates for the data that was updated outside the look back window. To address this you can do full refresh (may be nightly). This youtube video will you build understanding of SPICE incremental refresh process and accordingly you can update the refresh strategy 2022 QuickSight Learning Series: Keep your data fresh with incremental SPICE refresh - YouTube

boneogtm · July 20, 2023, 12:46pm

Thanks for you suggestion @DeepakS

Here is some more details on our use case,

Our refresh frequency is daily (we sync data only one time daily)
We sync data with lookback window 24 hours with updated/newly created during last 24 hours.

Full refresh is one of the option for us but issue is with COST of loading full data from Redshift serverless to SPICE each day. As we have data in GBs (E.g 30+ GB) currently and going forward it will increase more with passing days.

So due to considering cost it could incur we can not prefer Full refresh option in our case.

DeepakS · July 20, 2023, 7:20pm

@boneogtm, thank you for clarification. Can you confirm if the duplicate values you are observing are because of the data that is being updated outside 24-hour window. If it is, do you know data is typically updated within last few days or can be updated for any day going back in months. In first case if data is updated generally within last few days then you can increase the look back window to 3-5 days.In latter use case, you want to consider full refresh may be weekly depending on importance of the issue.

boneogtm · July 21, 2023, 10:10am

@DeepakS Duplicate values we observe because of lookback window of 24 hours.

Actually we are using JOIN between multiple tables in dataset and we have created and updated date in each table. So, when anything gets updated in any of table we consider it and update data in SPICE.

Since quick-sight only supports single date field for incremental refresh. We are using greatest function so we put created and updated date of each table in greatest method and we settled up incremental refresh on it.

But instead of updating existing rows those are getting duplicated in SPICE. When we full refresh it data seems accurate but that is not possible every time for us to do full refresh.

And we don’t know any specific window in which any of parent/joined tables will gets updated (E.g 3-5 days etc.)

So let us know, If there is some way to avoid such duplication in incremental refresh (without complete full refresh)

Koushik_Muthanna · July 21, 2023, 10:50am

If the behaviour of your datasets is similar to my explanation below , then as @DeepakS said, a periodic full refresh is required to avoid duplication.

Example on incremental refresh

Incremental refresh process creates a checkpoint on mutable and immutable records.

SPICE Dataset on 26/07/2022 [Consider incremental refresh has run on 26/07/2022, lookback window of 2 days on column Last Update Date]

ID	Create Date	Last Update Date	Mutable/Immutable
1	7/21/2022	7/21/2022	Immutable
2	7/25/2022	7/25/2022	Mutable
3	7/25/2022	7/25/2022	Mutable
4	7/26/2022	7/26/2022	Mutable

Source Table 27/07/2022 [Lets say Redshift]
Record ID 5 is a new record inserted
Record ID 1 is an old record updated

ID	Create Date	Last Update Date
1	7/21/2022	7/27/2022
2	7/25/2022	7/25/2022
3	7/25/2022	7/25/2022
4	7/26/2022	7/26/2022
5	7/27/2022	7/27/2022

The checkpoint meaning when the previous incremental refresh executed for QuickSight is ( 26/07/2022 )
When an incremental refresh call is executed on 27/07/2022 , it will bring all records > 24rd July 2022
26-07-2022 - 2 days = 24-07-2022

SPICE Dataset on 27/07/2022

ID	Create Date	Last Update Date	Mutable/Immutable
1	7/21/2022	7/21/2022	Immutable
2	7/25/2022	7/25/2022	Mutable
3	7/25/2022	7/25/2022	Mutable
4	7/26/2022	7/26/2022	Mutable
1	7/21/2022	7/27/2022	Mutable
5	7/27/2022	7/27/2022	Mutable

Note : Duplication for Record ID 1 exists because the lookback window is for 2 days. The Record ID 1 falls outside of the lookback window , hence the updated record id in source system is appended as a new record in SPICE ( the existing record is not updated ) .

The new checkpoint for QuickSight is (27/07/2022)

What is not happening ( Key matching ) :
Example Source Table 27/07/2022: Record ID 1 which was created on 21st July 2022 and had an update on 27th july
A typical database merge query would be
Select * from source_table where last_update_date >= 2 days ( dateadd ( day , -2 , getdate () )

when matched then update
when not matched then insert
when not matched by source then delete

What is actually happening for SPICE ( No key matching ) :
Get all records within the lookback window
Delete all mutable records from 24-07-2022 and after , append newly queried data.
Here Record ID 1 with last update date (27/07/2022) is a new row . The existing Record ID 1 with last update date ( 21/07/2022 ) is outside of the lookback window.

Regards,
Koushik

boneogtm · July 21, 2023, 1:55pm

thanks @Koushik_Muthanna for detailed explanation.

Yes, the behaviour of our datasets is similar to your explanation.

In our case we don’t know window period when particular record will gets updated in our system. It can be in updated in few days, weeks and months too.

So we are not sure how exactly we need to setup lookback window in quick-sight for updating latest data in SPICE.

Also, It is technically possible in quick-sight if we need to check in Redshift serverless for inserted/updated data of last 24 hours but we can setup lookback window to 26 weeks (E.g 6 months) as we assume most of the day will gets updated within it. (Still there might be chances of duplicate rows but less than compared to current one)

We don’t need to query in Redshift serverless for more than 24 hours data due to cost it can incur.

Can you confirm if that can be possible in Quick-sight(SPICE) ?

duncan · October 20, 2023, 1:42am

Hello @boneogtm !

I believe for your use case, considering that the refresh window is not know on your dataset, it would be best to complete the refresh window in the way suggest above. That being said if you have found a workaround please share it with the community.

I am going to archive this topic since it has been open for an extended period of time. If you would still like assistance with this, or if you have a new QuickSight question, feel free to post a new topic in the community so you will be at the top of the priority list for a response from one of our QuickSight experts. Thank you!