How to speed up a query

Hi guys,
I create a dataset with this query

select machineid as machinecode, timestamp, l3dspmaxvalue, t041, t040, t038, t043, t044, t039, t042, t030,
      date_parse(substring(timestamp,1, 19),'%Y-%m-%d %H:%i:%s') as timestamp_parsed, 
      year,month,day,
      lag(timestamp) over (partition by machinecode,year,month,day order by timestamp asc) as prev_timestamp, 
      speed, activealarms, t001, t002, t003, t004, t005, t006
from prod_datalake.machinesmetrics

my datasource has more than 165 milions records and it take a very long time to be executed (more than 60seconds). I would like to add filters on the dashboard not in the dataset. Is it correct?

My environment is this

S3 → Glue → Athena → Quicksight.

Within Glue I already create a partition index, without any improvements. I partially solve my problem creating a SPICE dataset, but its size is 50giga limited to 120 days instead otherwise I receive a SQL_EXCEPTION (timeout) while executing the first full refresh.

How can I improve its performance?
Best regards
Enrico

You’ll definitely need to use spice with that amount of data.

What are you partitioning by? If it’s too many values then it takes too long to query.

You should look to partition by values that you can use as a filter in your sql statement.

Mh, can you please explain better? I’m a newbie in AWS, so some concepts are still “unclear” to me.

My Glue has a schema of 428 fields where last 4 the partition keys.

My query does not have any filter within the sql statement (and I would like to keep it without filters), so in my analysis I can execute filters as I need

“Codice macchina” and “Intervallo data” are my filters (machine code and date intervals) that works on dataset, but not work directly in my partition columns. Should I create filters that use the specific partition columns?

I well understand how spice works, and I agree with you that I need to use it. What I’m not able to do is run a full refresh of the query without any filters, because I always get a SQL_EXCEPTION (timeout). Setting a filters on the dataset of 120 days the full refresh works.

Should I increase the QUOTA from 30 mins to 60 mins? I need the full refresh works only the first time, once done I will create an incremental schedulation of 2 hours, so very very small and quick to execute

Best regards
Enrico

Hi,

Data should be include the mandatory fields/columns.
Fist of all you need to look at the data and remove the unnecessary
column.

Direct is only designed for the small table like in thousonds or less.

Your dataset is millions transaction.

so I will recommend you to use the SPICE storage. if you are new, when you creating dataset in QS you must select the SFPICE option not the direct query. Both are visible when you creating dataset.

Later on you can configure the incremental refresh on the dataset.

In this way you can handle you datasets.

regards,
Naveed

Hi,
what is the best practise with datasets so huge? I would like to create an 2 hours incremental refresh, but I don’t know what happen if I don’t start with a full refresh. How can I set up the spice in order to work correctly?

Best regards
Enrico

Hi,

SPICE is very intelligent, I am running 15 plus clients on it with millions of row.
you can configure incremental refresh with confidence :wink:

if your backend data not change so don’t worry the for the full refresh.
if any change in you historical data, you configure complete refresh once a day and twice etc. with incremental refresh.

Same approach I adopted working perfect :slight_smile:

Regards,
Naveed Ali

Still confused… My backend data won’t change. If I create the spice with all my data (more than 165milions records) every time I change the dataset the Manual Refresh (edit) fails to execute because of a timeout. Setting a filter of 120 day it will work. My whole datasource in S3 has more than 1 years of data.

120 days create a dataset with 70 milions rows, that is less than 1/2 of the whole data. What should I do to spice the entire dataset?

Best regards
Enrico

Hi @egobbo ,

Amazon Athena default timeout is 30 minutes , you can request for a quota increase Service Quotas - Amazon Athena.

Kind Regards,
Koushik