Do different types of datasources take longer for datasets to refresh?

ineedqshelp · September 7, 2023, 10:40pm

I have two datasets, one is about 3x the size of the other. The smaller dataset has an Athena datasource, the larger one is Aurora. The smaller one takes almost the same amount of time to do a full refresh, is Athena just slower at refreshing, or is there something wrong with the SQL query for it that’s causing it to take so long?

Sanjeeb2022 · September 8, 2023, 5:02am

Hi @ineedqshelp - It depends upon how you have set up your athena table? Did you store the files in Parquet format. If not, you can do it to enhance the performance. Also you can test the sql query in Athena console to see the performance.

Regards - Sanjeeb

ineedqshelp · September 8, 2023, 8:01pm

The SQL query in athena times out (takes longer than 15 minutes), but I can see it if I add a limit clause. How do I see if it’s in Parquet format?

neelay · September 9, 2023, 12:35am

@ineedqshelp generally these files has {file}.parquet ext.

This problem most probably has nothing to do with the data transfer.

It is most likely tied to “which engine takes how long to find the result”. Once the result is processed, data transfer from data source to spice is more or less the same for each MB, (may vary depending upon the region, and other stuff), but it won’t be much if everything is hosted on AWS.

e.g. top 10 products purchased in the current month’s 10M orders is a very small dataset, but it will take a longer time to find these 10 products in 10M rows (as it requires aggregation) vs. 10000 rows from 1 or more tables without any aggregation.

If you are asking about how to optimize the Athena query then that is a different question.

I hope this makes sense