SPICE incremental refresh with created snapshot_date column

Hi Team, our use case is that we have a source data which update record for each Id, thus in that source table there is only one latest record rather than historical data.

source table on 5/21/2024 midnight:
id last_update_date
111 5/20/2024 00:00:00
222 5/20/2024 00:00:00
333 5/20/2024 00:00:00

source table on 5/22/2024 midnight(111,222 got update):
id last_update_date
111 5/21/2024 00:00:00
222 5/21/2024 00:00:00
333 5/20/2024 00:00:00

source table on 5/23/2024 midnight(333 got update):
id last_update_date
111 5/21/2024 00:00:00
222 5/21/2024 00:00:00
333 5/22/2024 00:00:00

Now, since we want to showcase historical metrics, we want to leverage incremental refresh to append data daily. We creat a new column : DATE_FORMAT(current_timestamp, ‘%Y-%m-%d %H:%i:%s’) AS snapshot_date,
this is the timestamp when the refresh actually happens

Expected dataset:
|Id |last_update_date|Snapshot|
|111|5/20/2024|5/21/24 0:00:00|
|222|5/20/2024|5/21/24 0:00:00|
|333|5/20/2024|5/21/24 0:00:00|
|111|5/21/2024|5/22/24 0:00:00|
|222|5/21/2024|5/22/24 0:00:00|
|333|5/20/2024|5/22/24 0:00:00|
|111|5/21/2024|5/23/24 0:00:00|
|222|5/21/2024|5/23/24 0:00:00|
|333|5/22/2024|5/23/24 0:00:00|

However if we set incremental refresh on last_update_date, look back window 1 day, refresh daily, we get this:
|Id |last_update_date|Snapshot|
|111|5/20/2024|5/21/24 0:00:00|
|222|5/20/2024|5/21/24 0:00:00|
|333|5/20/2024|5/21/24 0:00:00|
|111|5/21/2024|5/22/24 0:00:00|
|222|5/21/2024|5/22/24 0:00:00|
|333|5/22/2024|5/23/24 0:00:00|

if we set incremental refresh on above new created snapshot_date column, look back window 1 day, refresh daily, we get whole data fully replaced.

if we set incremental refresh on above new created snapshot_date column, look back window 1 hr, refresh daily, we also get whole data fully replaced, not appending(not sure why in this case previous day data all get deleted, although look back window is just 1 hour).

Can you advise how can we achieve above expected dataset?

Thank you!

Hi @ruiqyang - The incremental should work with last_update_date correctly. did you added the snapshot_date at the start of the refresh.

Hi @David_Wong - did you see any issues with the incremental logic here. Please advise.

Regards - Sanjeeb

@ruiqyang ,

The incremental refresh is working correctly .
Here is a post where I have explained it a bit more in detail ( Data synced in SPICE memory using greatest function though incremental refresh duplicates old data instead of overriding it for latest updates - #6 by Koushik_Muthanna ) .
Youtube vide on how it work : https://www.youtube.com/watch?v=jl_91ahT-P8

source table on 5/21/2024 midnight:
id last_update_date
111 5/20/2024 00:00:00
222 5/20/2024 00:00:00
333 5/20/2024 00:00:00

source table on 5/22/2024 midnight(111,222 got update):
id last_update_date
111 5/21/2024 00:00:00
222 5/21/2024 00:00:00
333 5/20/2024 00:00:00

Based on incremental refresh
|Id |last_update_date|Snapshot|
|111|5/20/2024|5/21/24 0:00:00|
|222|5/20/2024|5/21/24 0:00:00|
|333|5/20/2024|5/21/24 0:00:00|
|111|5/21/2024|5/22/24 0:00:00|
|222|5/21/2024|5/22/24 0:00:00|
|333|5/22/2024|5/23/24 0:00:00|

On 22nd when the incremental refresh was executed , with a lookback window of 1 day , it will get any records after 21st . 333 5/20/2024 00:00:00 is outside of this window .

Kind regards,
Koushik

1 Like

What you mean for “added the snapshot_date at the start of the refresh.”? I add that column in the Custom SQL as additional column

Thanks Koushik_Muthanna but this does not answer the question… the expected result we are looking for is

Expected dataset:

|Id |last_update_date|Snapshot|
|111|5/20/2024|5/21/24 0:00:00|
|222|5/20/2024|5/21/24 0:00:00|
|333|5/20/2024|5/21/24 0:00:00|
|111|5/21/2024|5/22/24 0:00:00|
|222|5/21/2024|5/22/24 0:00:00|
|333|5/20/2024|5/22/24 0:00:00|
|111|5/21/2024|5/23/24 0:00:00|
|222|5/21/2024|5/23/24 0:00:00|
|333|5/22/2024|5/23/24 0:00:00|

That’s why we try create the snapshot_date column and use it for incremental refresh date anchor. rather than last_update_date.

How to achieve above expected dataset?

Another question is why using snapshot_date column for incremental refresh is not working as expected?
Why it is replacing whole dataset with latest snapshot_date than appending?(See example in original question)

Hi @ruiqyang - Ok got it. We need to replicate this issue to understand why this is providing expected result. In parallel, can you please raise a ticket to AWS team as well so that they can check the background refresh logs. To raise the ticket, please follow the below link - Creating support cases and case management - AWS Support

Regards - Sanjeeb

Hi @ruiqyang - I am suspecting ( not 100% sure) when you change the incremental refresh condition, it replaced all the row and the subsequent refresh should add as increments. Can you check the subsequent refresh outcomes.

Regards - Sanjeeb

Another question is why using snapshot_date column for incremental refresh is not working as expected?
Why it is replacing whole dataset with latest snapshot_date than appending?(See example in original question)

If you have access to the SQL query which has been executed in the backend, that should easily answer what is happening.

What I am expecting is happening when you add a snapshot date in custom sql is, that snapshot date is applied to all rows within your source table and this falls within lookback window. So all rows are ingested.

From you screenshot , you had 226300 rows. Then the latest import had 228104 . So 1804 new records. But it imported everything because every row had the same snapshotdate.

Using incremental I don’t know if you can achieve the desired result.

You can test

Assuming your source table is a snapshot of daily data, then the snapshotdate should get updated in the base table.
The incremental refresh is then running on the snapshotdate.

Detailed Explanation - Keep your data fresh with incremental SPICE refresh