Joining 2 SPICE datasets to create a new one takes too much time

Hi There,
I am trying to join two datasets out of which the left one is a huge fact table with almost half a billion records and the right one is just a lookup of values with a handful of records. Joining these 2 in SPICE, a refresh is taking around 3 hours.
Reason I can’t do it at the source is that my lookup is a SCD2 but I need to pick the latest value from it for all the records on fact and doing it at the source(GLUE DB) would mean reloading the full table each time.
Thanks

Hi @ashu.ap - Welcome to AWS QuickSight community and thanks for posting the question. are these tables are physicalize in any relational database? If yes, can you see the explain plan of the joining query. Possibly you may need to create an index for faster query performance.

Please give more details on the source of the data, and joining conditions so that it will help in guiding the solution.

Regards - Sanjeeb

1 Like

I have created 2 datasets in SPICE first from these tables and then joining these datasets into another one in SPICE using a left join. The underlying tables for these datasets are in a glue DB but it does not matter as we have already loaded the data into SPICE

1 Like

Thanks @ashu.ap - Thanks for the details. if both the data sets are in SPICE, then joining should be done faster. Can you check joining conditions whether the data type of join fields are same or not.

Also the SPICE refresh which is taking time, collect the ingestion id and raise a ticket to AWS Customer support team so that they can view the logs backend and give details why it is taking more time. To raise a ticket to AWS, please follow the link - Creating support cases and case management - AWS Support. If you do not have access to create ticket, request your admin to help you on that.

Regards - Sanjeeb

1 Like

Thanks for the prompt reply

1 Like

Hi @ashu.ap ,

Can you
1/share a screenshot with an example of the huge fact table and the lookup table join ?
2/If the lookup table is supposed to pick the latest value , Is this like max date for records ?
3/screenshot of an example record and what you expect to see after the join happens ?

A few other questions
4/What is the total number of records after join ?
5/ You are glue as a metadata catalog , are you loading this data using the following method S3 > Athena > QuickSight or is it S3 > QuickSight (using manifest file ) ?
6/What is the frequency of reloading the huge fact table into SPICE if there are updates happening ? ( Is it a full refresh everytime ? )

Kind regards,
Koushik

2 Likes

Hi @Koushik_Muthanna information attached below:

  1. Can’t share exact details but it is like any other fact, some FK columns followed by some measures. It has around 40 columns and half a billion rows.
  2. Query for lookup dataset in SPICE already gets the latest record and there would only be one matching record per row of fact.
  3. Join fact and lookup on lookup_key and pick the lookup_value and add it to columns in fact.
  4. Same as there were in the fact, ~half a billion.
  5. We are materializing a table in a GLUE DB and then reading data using Athena Query into the SPICE.
  6. Right now the dataset is an on demand refresh but in future it would be a daily incremental load.

Thanks
Ashutosh

Hi @ashu.ap

You have a lot of data which means it will take sometime to load.

The only thing I can think of is trying to incrementally refreshing to limit the amount of additional data that needs to be imported.