I have been using QuickSight successfully where I have datasets that are a single source. I am getting issues when I join multiple sources into one dataset.
I have a json file in S3 which is updated hourly. This is my call data. I have a dynamodb-athena table which is my customer data, which is also updated hourly.
I have these sources as individual datasets, and I have another dataset where I join the two sources together.
The individual datasets are reliable. Despite the odd occassion when the hourly refresh fails, the datasets work fine. I have an issue with the joined dataset, where a refresh unexpectedly ‘skips’ data. This is not considered a failure and I am not alerted. I only find out there is problem when my dashboard produces incorrect results, and I check the datasets.
The only information provided on the cause of the issue is a note in the dataset which says “Some of the rows scanned from the source tables were skipped”.
The really strange thing is the dataset never recovers the ‘lost’ data. Full refreshes continue to produce the problem. However, if I recreate the dataset from scratch, there is no problem, and all of the data is present.
For this reason, it appears to me like some sort of permanently corrupted SPICE cache. Or some sort of cache that is preventing the data from appearing during the next full refresh.
I have now experienced this issue 3 times, and each time I have had to recreate the dataset to fix the problem.
Has anyone experienced anything like this? I have been looking into whether I can force a deletion or clear down of the SPICE cache, but have not found a way to do this. It wouldn’t solve the problem anyway but would at least indicate if the cache is causing the error.
I have also been wondering if the joined dataset has any relation to the individual datasets. For example, could a RACE condition occur, where one of the underlying datasets is still refreshing when the joined dataset is trying to do so?