Rows removed during incremental refresh

I have a database table where new rows are appended and there is a CreateTimestamp column that indicates when each row was added. Existing rows in the table are never updated. In theory this sounds like an ideal candidate for incremental refresh in QuickSight but for some reason QuickSight is removing rows from the dataset during the incremental refresh.

I created a test dataset with just an Id column and the CreateTimestamp column. I used the CreateTimestamp column for my lookback window and the window size is 1 day.
Incremental refresh - dataset

Incremental refresh - configuration

I duplicated the dataset and used full refresh on the duplicate to compare the results of the incremental and full refreshes. The visual on the left shows the count of rows per day and the one on the right shows the count per minute.

The incremental refresh completed successfully and QuickSight doesn’t show any rows dropped (although it is weird that it shows the same number in the “Ingested rows” and “Dataset rows” columns).

However, when I look a the data, I see that the number of rows for Oct 23 went from 8758 to 4662. Looking more closely at the number of rows per minute, it appears that QuickSight removed all rows between 4:44am and 4:01pm.

The dataset where I did a full refresh has thousands of rows during that time period. Does anybody know why QuickSight removed those rows?

My understanding was that QuickSight was supposed to remove all rows in the lookback window from the SPICE dataset and ingest all rows in the lookback window from the source. Did I misunderstand that or did QuickSight do something weird here? The full refresh and the incremental refresh finished just 1 minute apart, so I was expecting the number of rows in the 2 datasets to be about the same.

1 Like

@Kristin
I used the “spice” tag but I don’t know if there is a better tag for questions related to incremental refresh. There’s currently no tag for incremental refresh.

hi David,
from what I tested in my end, this logic should be correct “QuickSight was supposed to remove all rows in the lookback window from the SPICE dataset and ingest all rows in the lookback window from the source.”
It is weird that data between 4:44am and 4:01pm was removed. I am wondering whether it will be caused by timezone difference. any data after 4:01pm are considered to be future time. can you try reducing the window to 1 hour and see the results?
another case may be the data after 4:01pm are ingested after incremental refresh

Hi Roy,

Sorry for the confusion. The time in the source database is in UTC but the refresh history shows EDT. The incremental refresh completed at 7:45 PM UTC on Oct 23 and the latest timestamp in the data is Oct 23, 2022 7:32 PM UTC. The dataset doesn’t contain any timestamp in the future.

If the incremental refresh started at 7:45 PM UTC on Oct 23 with a lookback window of 1 day, does it mean the anchor point was Oct 22 at 0:00 AM UTC or Oct 22 at 7:45 PM UTC?

I’ll do another test using 1 hour as my lookback window next.

1 Like

Hi David,

If the incremental refresh started at 7:45 PM UTC on Oct 23 with a lookback window of 1 day, does it mean the anchor point was Oct 22 at 0:00 AM UTC or Oct 22 at 7:45 PM UTC?

In one of the questions I had covered incremental refresh : Scheduled incremental refresh of dataset is not working - #3 by Koushik_Muthanna
When you would have setup an incremental refresh, initially there would be a full refresh. The incremental refresh anchor point would then be ( Oct 23rd 7:45 PM -1 day (24 hours) = Oct 22nd 7:45 PM )

1 Like

Hi Roy,

I repeated my test with different window sizes: 1 hour, 2 hours, 1 week, 1 day and 2 days.

It works fine for 1 hour, 2 hours and 1 week but not for 1 day and 2 days.

Just like the first time I did the test using a window size of 1 day, some rows got dropped:

There’s data missing between Oct 24 5:45pm and Oct 24 7:03pm:

Some problem with a window size of 2 days:

There’s data missing between Oct 24 8:06pm and Oct 25 12:53am, and between Oct 23 5:16pm and Oct 24 12:49am.

Hi Koushik,

Thanks for the clarification about the anchor point. I’m still not able to figure out why many rows are getting dropped during the incremental refresh. The refresh history doesn’t show any rows dropped but there is definitely data missing from the dataset after the incremental refresh.

just want to make sure the IDs are unique? and the count is not a “Distinct count”?

The count that I’m showing is not a distinct count. All the IDs in the source table are also unique.

@David_Wong I am think whether the rows were missed in ingestion or they just not show up in the table. After incremental refresh, there is a number showing “ingested rows”. if you check the number, does it include the missed rows?
if i take below image as an example, can we do a count in the data source and see whether data between Oct23,4:45am and Oct23, 4:01pm are included in the 10,925 rows?
sorry for too many questions, thanks for your patience and deep dive to evaluate and solve the problem :slight_smile:

image

Hi Roy,

I’m glad to provide any additional information to help figure out the issue.

The refresh history shows the correct number of rows ingested during the incremental refresh (10,925) but the rows are somehow not in the analysis.

The anchor point used in the incremental refresh was Oct 22 7:45pm and the latest date at the time was Oct 23 7:32pm. When I perform a full refresh and filter the analysis by those dates, I get 10,925 rows.

The refresh history for the incremental refresh shows the same number but for some reason only 6,829 rows are visible in the analysis in that date range.

You can see here that this is the only filter that I have:

When I query the source, there are 3903 rows between Oct 23 4:45am and Oct 23 4:01pm. Those are the rows missing from the analysis.
Incremental refresh - incremental - count

as the ingested rows are showing 10,925. looks like the rows were ingested but not showing in the analysis. may I know what is the total number of rows if you disable the filter for the table “Incremental Refresh”?
can you also try to put columns “Id” and “CreateTimestamp” as dimension and list out all rows. and see how many rows in total?
like below example, I put all rows into “Group by” in Field wells
image

Hi Roy,

Here’s a screenshot showing the total number of rows when the filter is disabled.

Here is another screenshot showing all the rows after the incremental refresh.

Thanks,
David

I think we need more information on the backend to figure out the issue.
I would recommend filing a case with AWS Support where we can dive into the details so that we can help you further. Here are the steps to open a support case. If your company has someone who manages your AWS account, you might not have direct access to AWS Support and will need to raise an internal ticket to your IT team or whomever manages your AWS account. They should be able to open an AWS Support case on your behalf. Hope this helps!