QuickSight Incremental Data Refreshes

I'm testing incremental refreshes on large datasets, with the goal of reducing the query load on our database cluster and shortening data ingestion time. However, I'm noticing two issues with the current setup:

  1. The query still runs against and scans the entire timeframe specified in the custom SQL, not just the lookback window.
  2. The incremental load deletes and reprocesses only the most recent data. This will eventually result in the SPICE ingestion hitting the upper-bound row limit.

A)
I would like to understand the correct usage of custom SQL (CSQL) queries for incremental refreshes.
For example :
I have a dataset with custom SQL that I'm using to test incremental data refresh in QuickSight. The CSQL itself has some date filters with a lookback of 30 days. I configured the incremental refresh with a 7-day lookback. In the custom SQL I have specified a timestamp field (UTC). The incremental refresh appears to process only the records within the lookback window; however, the query runs for the entire timeframe.

B)
Is there a way for the incremental processing to delete data records for earlier timeframes so that we don't run into the issue of hitting the upper-bound limits?


@alexresh ,

When you use custom SQL, the query sent to the database includes the entire SQL statement, and any filters are applied outside of that custom SQL (in the example below, the filter is the lookback window configured for the incremental refresh).

Example custom SQL (Athena as the data source):

with a as (
  select t.run_date, round(random() * 9 + 1) as count_value, 'A' as category
  from (select sequence(cast('2023-07-01' as date), cast(now() as date) + interval '1' month, interval '1' DAY) dates),
       unnest(dates) as t(run_date)
)
select run_date, extract(year from run_date) as year, extract(month from run_date) as month, extract(day from run_date) as day
from a
where extract(day from run_date) >= 21

With incremental refresh configured for this dataset, the following is the query that is executed in Athena:

/* QuickSight 48d1782a-7291-4b64-abca-8e9945528da1 */
SELECT "run_date",
	"year",
	"month",
	"day"
FROM (
		with a as (
			select t.run_date,
				round(random() * 9 + 1) as count_value,
				'A' as category
			from (
					select sequence(
							cast('2023-07-01' as date),
							cast(now() as date) + interval '1' month,
							interval '1' DAY
						) dates
				),
				unnest(dates) as t(run_date)
		)
		select run_date,
			extract(
				year
				from run_date
			) as year,
			extract(
				month
				from run_date
			) as month,
			extract (
				day
				from run_date
			) as day
		from a
		where extract(
				day
				from run_date
			) >= 21
	) AS "dynamic_date_spice"
WHERE "run_date" > from_unixtime(
		cast(
			substr(cast(1689324062421 as varchar), 1, 10) AS bigint
		)
	)

Data changes within the lookback window are taken into account. Notice that QuickSight wraps the custom SQL in an outer `WHERE` clause on the timestamp field; the inner query is unchanged, which is why the database still scans the entire timeframe.
Older data will not be deleted. Depending on how quickly your data volume grows, you will need to handle this on your end so that the SPICE limit of 1 TB or 1 billion rows is not hit during ingestion.
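To illustrate the generated predicate: QuickSight passes the lookback boundary as a millisecond epoch (here `1689324062421`), and the wrapper converts it to a timestamp by keeping only the first 10 digits (whole seconds). You can evaluate just that expression in Athena:

```sql
-- Evaluate the lookback boundary from the generated query above.
-- substr keeps the first 10 digits of the millisecond epoch, i.e. whole seconds.
SELECT from_unixtime(
         cast(substr(cast(1689324062421 AS varchar), 1, 10) AS bigint)
       ) AS lookback_start
-- 1689324062 s corresponds to 2023-07-14 08:41:02 UTC,
-- the start of the lookback window for that refresh run
```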

Regards,
Koushik


For the second question:
"B)
Is there a way for the incremental processing to delete data records for earlier timeframes so that we don't run into the issue of hitting the upper-bound limits?"

You can use a full refresh and add a filtering condition in the custom SQL so that only the latest records are kept.
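As a sketch of that suggestion (the table name, column name, and 90-day retention window are assumptions; adjust to your schema), the custom SQL would keep only a rolling window, and the dataset would then be scheduled for a full refresh:

```sql
-- Hypothetical retention filter: keep only the last 90 days of data.
-- With a scheduled FULL refresh, rows older than the window are dropped
-- from SPICE on every refresh, bounding dataset growth.
SELECT *
FROM my_events              -- hypothetical source table
WHERE event_ts >= date_add('day', -90, current_timestamp)
```

Note that this trades away the ingestion-time benefit of incremental refresh, since every scheduled run re-ingests the whole retention window.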


Similar request here for B: Remove Old Data During Refresh

Summarizing the replies above:
a) Even though incremental refresh is set up, the database query remains unaffected, so the execution time stays roughly the same from the database's perspective.
b) Incremental refresh has the limitation that the dataset will eventually hit the 1 TB or 1-billion-row SPICE limit, so this has to be accounted for when setting up the refresh.

For B: is there a plan to address this feature request? Remove Old Data During Refresh