QuickSight Data Ingestion Failure

I often get this error message from quicksight. What are some causes for this? I have a spice capacity of 40 gb remaining for the quicksight account. Can you suggest any common fixes for this?

Hi @Prajakta - It looks like the SPICE refresh is taking too long. Can you please give some more details on the type of source - is it a relational database like RDS, or Athena-based? Also, are you doing a full refresh or an incremental one? If you are using custom SQL, check the performance of the SQL as well.

Hi @David_Wong @sagmukhe - Any advice on this from your side?

Regards - Sanjeeb

@Prajakta - Thank you for posting your query. As @Sanjeeb2022 mentioned, please share more details so that we can get more insight into the issue. Also, in the meantime, you can go through the below links and see if they help in any way.

@sagmukhe @Sanjeeb2022 thank you for replying.

The type of data source is MySQL, so it is RDS. I’m trying to do a full refresh, but I’m limiting the data to only 3 months.
I’m also using a custom SQL query. It ran previously, but now it is giving me a lot of problems. I have just unselected unnecessary columns in the dataset, and the refresh has failed.

@Sanjeeb2022 what do you mean by checking the performance of the SQL?

This is my dataset refresh history

In terms of spice capacity, this is how much I have

I think the problem here is the query timing out, but I’m not sure.
As for raising a ticket: I’m using the Enterprise edition of QuickSight, but I don’t have permissions in AWS to raise a ticket.
[screenshot]

Hi @Prajakta - Can you please run the same SQL from any SQL client tool using a CTE (Common Table Expression) and note the timing? If you can share the SQL, we can have a look as well. Looking at your past SPICE refresh times, 9 hours is very high in my experience.
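
A minimal sketch of this idea, assuming MySQL 8.0+ (which supports CTEs); the table and column names below are placeholders, not the actual schema:

```sql
-- Hypothetical example: wrap the dataset's custom SQL in a CTE and time it
-- from a client. Table/column names are placeholders.
WITH recent_messages AS (
    SELECT message_id, lead_id, message_date
    FROM   messages
    WHERE  message_date >= CURRENT_DATE - INTERVAL 3 MONTH
)
SELECT COUNT(*) FROM recent_messages;
-- Most clients (MySQL Workbench, the mysql CLI) report the elapsed time
-- per statement, which you can compare against the SPICE refresh duration.
```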

Regards - Sanjeeb

Hi @Prajakta - Also, if you are only interested in 3 months of data, can you create another table, load only those 3 months, and share the record count and size of the table? This is just to understand the problem and find a suitable resolution.
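
A rough sketch of this, assuming MySQL and placeholder table/column names:

```sql
-- Hypothetical example: materialize only the 3-month slice, then check
-- its row count and approximate size. Names are placeholders.
CREATE TABLE messages_3mo AS
SELECT *
FROM   messages
WHERE  message_date >= CURRENT_DATE - INTERVAL 3 MONTH;

SELECT COUNT(*) AS row_count FROM messages_3mo;

-- Approximate on-disk size from the data dictionary
SELECT ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM   information_schema.tables
WHERE  table_name = 'messages_3mo';
```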

Regards - Sanjeeb

@Sanjeeb2022, ideally I would want to be able to see all the data, from April until today, but I’m not able to do that because the dataset fails to load. So I decided to get just 3 months so that it will at least load in QuickSight.

Currently, I’m running the SQL query in MySQL Workbench and I have set the timeout to 10 hours, so as soon as it executes, I will be able to provide the count. I will update once it has executed.

I’m also taking in only the necessary columns of the dataset, but since it is a huge dataset (the initial refresh was 32,853,459 rows and around 21 GB for 3 months), it takes a lot of time. But QuickSight is supposed to take up to 500 million rows on the Enterprise edition.
Any ideas to reduce the size or do some dimensional modeling?

Hi @Prajakta - Thanks for the details. You are correct that QuickSight SPICE can store up to 500M rows or 100 GB (I believe), and you have more than 50 GB available (from the above screenshot). A couple of questions -

  1. Can you create a cut-down version of the table at the ETL layer and then do the reporting? Do we really need all the granular data, or can some summarization be done at the SQL level?
  2. Can you also check the explain plan for the query? Can we create some indexes for faster selects?
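
For point 2, a sketch of what checking the explain plan could look like in MySQL (placeholder query; `EXPLAIN FORMAT=TREE` requires MySQL 8.0.16+):

```sql
-- Hypothetical example: inspect how the engine executes the dataset query.
-- Table/column names are placeholders.
EXPLAIN FORMAT=TREE
SELECT m.message_id, m.lead_id, m.message_date
FROM   messages m
JOIN   leads    l ON l.lead_id = m.lead_id
WHERE  m.message_date >= '2023-08-01';
-- A "Table scan on m" line in the output suggests the filter column is not
-- indexed and a full scan is being performed.
```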

We need some business context so that we can design this table and bring the right data to QS for reporting.

Hi @David_Wong - This is an interesting use case; any suggestions from your side?

Regards - Sanjeeb

Hi @Prajakta,

How long does the full query take to run in Workbench?

To reduce the refresh duration, I would suggest breaking your dataset into multiple smaller ones and then joining them in SPICE. For example, you can create one dataset for your fact table and one dataset for each of your dimension tables. This would also reduce the performance impact on your database server.
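
As an illustration of this approach (with hypothetical table names, since the real schema isn't shown), each QuickSight dataset would use its own small custom SQL, and the join would be defined in SPICE rather than in the query:

```sql
-- Hypothetical example; names are placeholders.
-- Dataset 1 (fact): messages, date-limited
SELECT message_id, lead_id, message_date
FROM   messages
WHERE  message_date >= CURRENT_DATE - INTERVAL 3 MONTH;

-- Dataset 2 (dimension): leads, small enough to load in full
SELECT lead_id, lead_name, account_id
FROM   leads;
```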

Hi @David_Wong @Sanjeeb2022, I cannot break the dataset into facts and dimensions, as each of my dimensions is already a separate table. I have to join these tables later or pass parameters through the dataset.

All the columns I’m trying to bring from the message table are important to me, and there are only 10 of them. I’m trying to get just 3 months of data, which is 13M rows, and it takes 2 hours in Workbench to run.

Now, since I have so many load problems, I tried using parameters on the dataset. I have tried explaining the dataset and parameter issues over here,

The problem I’m facing now is that the parameter is integer type and cannot show the entire list.
[screenshot]

Please let me know if you can help, thanks a lot!

Hi @Prajakta - I expect you have many distinct values for the Lead parameter; that is why it may take time to display. Did you try creating an index on the column(s) in the WHERE clause (if you have any)?

Hi @David_Wong - What’s your suggestion on this?

Also, I would suggest raising a ticket with the AWS customer support team so that they can look into this issue as well. To raise a request, please follow the below link - Creating support cases and case management - AWS Support

Regards - Sanjeeb

@Sanjeeb2022 So you mean creating an index on the table in the database itself? I do not have permission to do that; I can only view data from the databases, not alter them. Or do you mean creating an index while selecting it from the custom SQL?

Hi @Prajakta - It should be an index at the database level. This will improve the performance of the SQL when you execute it via any client tool.
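
For reference, a sketch of what such an index could look like (placeholder names; this requires DDL privileges on the database, so a DBA would need to run it):

```sql
-- Hypothetical example: index the columns used in the WHERE clause and joins.
CREATE INDEX idx_messages_message_date ON messages (message_date);
-- Or a composite index if the query filters on date and joins on lead_id:
CREATE INDEX idx_messages_date_lead ON messages (message_date, lead_id);
```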

Regards - Sanjeeb

Thanks @Sanjeeb2022 for the clarification. I just checked, and we do have an index on lead_id in the message table. message_id is the primary key, so do you think creating another index on message_id would help? As far as I know, a primary key is already an index.

[screenshot]

What can be done in this case?

Hi @Prajakta - Can you please share your SQL query and the explain plan for it? The explain plan will tell us how the query flows through the DB engine and the cost of the SQL as well.

The SQL syntax of EXPLAIN may change depending on the database, so please check that before executing.
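
For example (placeholder query and table names), the syntax differs between MySQL and PostgreSQL:

```sql
-- MySQL: estimated plan, or (8.0.18+) actual execution with timings
EXPLAIN SELECT * FROM messages WHERE message_date >= '2023-08-01';
EXPLAIN ANALYZE SELECT * FROM messages WHERE message_date >= '2023-08-01';

-- PostgreSQL: runs the query and reports actual row counts and timings
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM messages WHERE message_date >= '2023-08-01';
```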

Regards - Sanjeeb

@Sanjeeb2022 , I apologise for the late response.

Here is the query plan using explain.

Hi @Prajakta - Is it possible to share the SQL and the count for each table? Performance issues in SQL can take some time to figure out and require detailed analysis and an understanding of the data as well.
Also, what is the database type (is it MySQL or Postgres)?

Regards - Sanjeeb

I have 4 tables, and I’m now limiting the messages data to 1 month (earlier it was 3 months, but the query still fails for 1 month), so from Aug 1, 2023 to Aug 31, 2023.
The counts for all the tables are:
Leads: 448,834
Messages: 5,013,341
Accounts: 1,106 (all time)
Customers: 9,968,106 (all time)
I realise that there is a lot of data in the customers table, but for now at least, there is no way for me to filter it, as that table does not have a date column.

But since I’m limiting the date on the message_date column and taking only Aug 1, 2023 to Aug 31, 2023, will it have any effect on the number of customers?
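
To illustrate the mechanics (with placeholder join keys, since the real schema isn't shown): if customers only enter the result through an inner join to the date-filtered messages, the filter does reduce the customer rows returned:

```sql
-- Hypothetical example: only customers with at least one message in the
-- window appear in the result; the rest of the 9.9M customers are dropped.
SELECT c.customer_id, m.message_id, m.message_date
FROM   messages  m
JOIN   customers c ON c.customer_id = m.customer_id
WHERE  m.message_date BETWEEN '2023-08-01' AND '2023-08-31';
```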

Hi @Prajakta - It depends on the business requirement and the model of the table. If you want to see 1 month of messages w.r.t. customers, then you are likely taking the right approach.

Regards - Sanjeeb

Yes, that’s the case. For this, either changing the database would help or some tuning needs to be done. Thanks for all your help!