Incremental Refresh to create historical dataset

Daniel_Dubravec · October 6, 2022, 3:19pm

Hi,

I am having difficulties with incremental refresh on spice dataset. To the point where I am not sure if its the right tool for what I am trying to achieve.
What I am trying to do:

a) I have data sitting on my redshift, that gets refreshed 1x per day and it always displays only current data available for that day. So on refresh in Redshift, it overwrites previous entries. It doesn’t have any snapshot built in.
b) My plan is to utilize incremental refresh to create SPICE dataset from above table that will store snapshotted data from said days.

So when I query data today, it will get data with timestamp of 6-10-2022. When I refresh it tomorrow, I want to have 6-10-2022 there, but add 7-10-2022. Again, when I refresh on 8-10-2022, I want to see 6,7,8th Octobers data. Gradually I want to build this dataset to be able to do historical analysis on it.
I tried several lookback window configurations, but it always seem to overwrite the data with current days data. I used
current_date as snapshot_date,
in my DDL.
So question is: How should I set up my look back window to achieve this?
Thanks,
Daniel

saunakchandra · October 6, 2022, 4:06pm

Incremental refresh in QuickSight works by a look back window on a timestamp column in the database, last_updated_timestamp in a data warehouse (DW) table in many implementations. During incremental refresh QuickSight looks back the window configured during setup, deletes the rows in SPICE dataset during the look back window, add rows in SPICE from the database rows during the look back window.

If I understand it right you want to store the historical snapshot in SPICE to save space in the Redshift DW. Please note that SPICE capacity is limited 500 GB compressed in Enterprise edition at the moment and not a replacement of DW storage.

However there are ways to model the data in the DW to avoid full refresh in SPICE while updating with the latest data in DW. There are few sophisticated DW models you can review, especially transaction fact and accumulating fact. Periodic snapshot model has the space constraint irrespective of where being implemented. Here are some reads from dimensional modeling guru Ralph Kimball Dimensional Modeling Techniques - Kimball Group

Daniel_Dubravec · October 6, 2022, 6:40pm

Hi,

Thanks for reply.
My concern is not redshift space, nor SPICE space.

I don’t control the redshift cluster the data is on, so I’d need to talk to Data Engineering to set up a job for me to snapshot data each day to new table.

My rationale is that I want to do this on my own, with as little outside help as possible.

Thanks,

Daniel

saunakchandra · October 7, 2022, 1:24am

Can you elaborate more on why you want to keep the snapshot days in the SPICE? A sample structure of the data might be helpful to understand the use case what you are trying to achieve.

Daniel_Dubravec · October 7, 2022, 7:04am

Sure.

So my dataset captures current status of Financial Data on certain partner accounts. Its my custom SQL, but main data source is table that refreshes every day.

Mock up:

|Vendor|ORG|Amount|sys_create_date|
|Vendor1|US|55000 |7/10/2022|
|Vendor2|US|44000 |7/10/2022|

sys_create_date changes every day as new data is pulled in. So today it is 7/10/2022, tomorrow it will be 8/10/2022.

What I want to achieve is to have this dataset incrementally grow on SPICE. So tomorrows refresh would add 2 new rows, while keeping previous ones:

|Vendor|ORG|Amount|sys_create_date|
|Vendor1|US|55000 |7/10/2022|
|Vendor2|US|44000 |7/10/2022|
|Vendor1|US|54000 |8/10/2022|
|Vendor2|US|47000 |8/10/2022|

And so forth, for at least amount of past 30 days.

Does that make sense?

Thx,

Daniel

royyung · October 7, 2022, 1:56pm

Thanks Daniel for your detail explanation with clear examples!
it is a typical incremental refresh use case, you may setup incremental refresh according to below screenshot. You need to define a date column to identify data update date(e.g. sys_create_date in your case). In “Window size”, you need to define how many days/week/hour data need to be refreshed. For instance, last 40 days in my example, QuickSight will remove last 40days data in SPICE and ingest again from data source
Screenshot 2022-10-07 at 14.49.45
Screenshot 2022-10-07 at 14.50.46

Ref document: Refreshing SPICE data - Amazon QuickSight

Kristin · October 9, 2022, 2:39am

Hi @Daniel_Dubravec. Did Roy’s solution help? I am marking his reply as, “Solution,” but let us know if this is not resolved. And thanks for posting your questions on the community Q&A Forum!

Daniel_Dubravec · October 11, 2022, 8:46am

Hi Kristin, Roy,

I’ve tried to implement it, but each time QS pulls data in, it will just replace the yesterdays data with todays and erases it. I suspect I will need to add one more column to indicate change, like day count, or something similar.

Any thoughts?

Thanks,

Daniel

royyung · October 11, 2022, 9:45am

When you configure incremental refresh, there is a window size(in screenshot), QuickSight removes old data according to this window size. If you want to keep old dat in SPICE, you may reduce the size of this window size(for instance, down to 1 hour). In this case, QuickSight keeps historical data in SPICE(Oct7 data in your example) and ingests new data(Oct8).
However, SPICE is not a managed database, it is a fast storage designed for QuickSight visualisation. If historical data is only stored in QuickSight, there is a risk that historical data will lost if you need to do a Full data refresh in QuickSight.
Therefore, suggest to store the historical data in other storage such as database, S3 rather than only in QuickSight SPICE

Kristin · October 11, 2022, 4:48pm

@Daniel_Dubravec Thanks for the update. Let us know if you have any other questions or if you are all set.