During refresh, I am getting occasional “Data Ingestion Failures” with the reason given: “A query to the data source timed out waiting for a response”.
My dataset performs a full refresh hourly using a SQL query to pull in a RDS dataset (roughly 10GB in size), and then joins that with three small tables. When a failure happens, the next refresh often takes close to 10 hours to complete, while normally a refresh takes about 10 minutes.
Any advice much appreciated.
Hi @JMthinkrf - This is a very strange. When the ingestion process failed, new ingestion is taking time, looks like internally SPICE engine is doing some operation. To do the investigate in details, it will be better if you can raise a ticket to AWS Customer support team so that they can analyze this issue. To raise a ticket, please follow the link - Creating support cases and case management - AWS Support
Regards - Sanjeeb
1 Like
Thanks @Sanjeeb2022. Unfortunately, we are still on the basic support plan, so can’t raise a ticket.
I agree, I think the SPICE engine is the issue for me here. I think it’s rooted in the fact that each refresh is ingesting over 7M rows, which is in turn caused by the following SQL query that is fetching the whole base dataset:
select
*
from sa_records_view
where “sa_records_view” is the base Aurora dataset.
1 Like
Hi @JMthinkrf - Not a problem. Let me request some of our expert to help you on this.
Hi @AnwarAli @EnriqueS @namysore @knamburi - Can you please help on this. When there is a failure in the SPICE ingestion, the next ingestion is taking hours of time. Please provide some details on the back ground of this issue.
Hi @Kristin - fyi, @JMthinkrf has basic support so not able to raise a support ticket, looking forward some help from community.
Regards - Sanjeeb
Hello @JMthinkrf , here we will need to have a bit of context of the status (CPU load, disk IO, availability) of your source database at the time of the (failed) refresh (May 28, 2023 at 10:59 PM EDT).
Also configuring the slow query log in aurora could be useful as it seems that the query takes a long time to complete or it is not even able to be submitted.
Please share with us the status of the instance, to check if it can be this the cause for the issue.
Thanks.
1 Like
Hi @EnriqueS - Thanks, I agreed with you. But the symptom is bit different here. When the ingestion is failed, the next ingestion is running for hour. In ideal cases, the ingestion took very less time. See the screenshot attached by @JMthinkrf .
Hi @JMthinkrf - Please correct me if I am wrong.
Regards - Sanjeeb
1 Like
If there is a failure on the database (or it had to be restarted) the cache may be empty.
Also all the “long” updates are during the night, it would be interesting to know if:
- The database is scaled down during the night
- Is the database an serverless database? Aurora supports this and this can explain the initial failure and subsequent slow load
- Is the database running any ETLs/stored procedures during the hours when the load was slow? This could be also affecting the time it takes to ingest data
Another important thing to know (sorry for missing it before) is why the ingestion failed. @JMthinkrf can you please provide the output of the describe-ingestion CLI command for one of the Failed
ingestions.
You can list all the ingestions (to choose the id of one of the failed ingestions) by using list-ingestion CLI command.
Will wait for your reply to take a look to this info and try to pinpoint the root cause of the issue.
Kind regards.
2 Likes
Thanks @EnriqueS and @Sanjeeb2022 . We found the issue. Turns out the problem was not directly in the dataset being refreshed. That was just a symptom. Within that dataset was a query to another dataset which itself was experiencing problems. It was the query that was failing, but it was showing up as a refresh failure. Once the other dataset was fixed, our problems went away.
2 Likes
Great to hear @JMthinkrf . Please mark your observation as solution so that it can help other community member.
Have a great week ahead.
Regards - Sanjeeb
1 Like