How the backend works when joining datasets

hoyeon · September 25, 2024, 12:15am

On the dataset page, users can join multiple QuickSight datasets with inner/left/right/full joins.
What happens when users join two datasets that use different query modes, one SPICE and one Direct Query, in the UI?
Does QuickSight run a SQL query for the joins? Does it run a direct query on the underlying database for the Direct Query dataset and then join the result with the SPICE data?

I would like to know what happens on the backend side for this join.

shravya · September 25, 2024, 2:15am

Hi @hoyeon,

Yes, QuickSight effectively runs a SQL-like query in the background for Direct Query datasets. However, it doesn’t run a SQL query in the traditional sense for SPICE datasets since SPICE is an in-memory engine. The join happens after QuickSight retrieves the Direct Query data and combines it with SPICE-stored data in-memory.
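To make the two steps concrete, here is a minimal conceptual sketch in Python. It is not QuickSight's actual implementation, which is not public. An in-memory SQLite database stands in for the Direct Query source, a list of dictionaries stands in for the data held in memory, and all table, column, and function names are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for the Direct Query source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 12, 15.0)],
)

# Hypothetical stand-in for rows that were already imported into memory.
customers_in_memory = [
    {"customer_id": 10, "name": "Ana"},
    {"customer_id": 11, "name": "Ben"},
]


def left_join_orders_with_customers(connection, customers):
    """Fetch rows with SQL, then left-join them against in-memory rows."""
    # Step 1: run a SQL query against the source database.
    rows = connection.execute(
        "SELECT order_id, customer_id, amount FROM orders ORDER BY order_id"
    ).fetchall()

    # Step 2: join in memory using a lookup table keyed on the join column.
    customers_by_id = {c["customer_id"]: c for c in customers}
    joined = []
    for order_id, customer_id, amount in rows:
        match = customers_by_id.get(customer_id)
        joined.append(
            {
                "order_id": order_id,
                "amount": amount,
                # Left join: keep the order even when no customer matches.
                "name": match["name"] if match else None,
            }
        )
    return joined


joined = left_join_orders_with_customers(conn, customers_in_memory)
```

The lookup table makes the in-memory step a single pass over the fetched rows, which is one common way to join without SQL.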

Just a reference → Joining data - Amazon QuickSight

Let me know if that helps.

Thank you,
Shravya

hoyeon · September 25, 2024, 3:48pm

When users view a dashboard built from a dataset that joins Direct Query datasets with SPICE datasets, does QuickSight always run each Direct Query dataset first and then combine the results in memory?
What is the timeout limit for joins?
If it is not SQL, what language is used to query the in-memory data? I am curious how it retrieves data efficiently enough to render dashboards in seconds.

shravya · September 28, 2024, 11:52am

Hello @hoyeon

Thank you for the follow-up questions.

Timeout: QuickSight enforces a 2-minute timeout for visual generation. If the Direct Query exceeds this time, the visual may fail to load.

I would suggest having a look at these blogs; they are very insightful and provide the required answers.

Let me know if these help.

Thank you,
Shravya

Sanjeeb2022 · September 28, 2024, 1:23pm

In addition to what @shravya said: when you join a SPICE dataset with a Direct Query dataset, the result is always stored in SPICE. Please check this as well.

Regards - Sanjeeb

hoyeon · September 30, 2024, 3:14pm

@Sanjeeb2022
Q1: If the joined output is always stored in SPICE, when is the data refreshed for a Direct Query dataset that is used in a join?

Q2: If a dataset has RLS rules enabled and is joined with another dataset, does the joined dataset also carry over and apply those RLS rules to the derived output?
Does the join logic take the base dataset's RLS rules into account?

hoyeon · October 1, 2024, 12:46am

Here are my findings.

CLS-enabled datasets cannot be used in joins.
RLS-enabled SPICE datasets are changed to Direct Query mode in joins. Joining an RLS-enabled dataset with other datasets fails. (Error message: Can’t create join. Joining tables from different sources requires SPICE, which this dataset isn’t permitted to use. To continue, you’ll need to remove the join.)

hoyeon · October 12, 2024, 12:02am

@Sanjeeb2022
If the child dataset is SPICE, when is it refreshed? Should users schedule a refresh?

When the child dataset is Direct Query, i.e., both parent datasets are Direct Query, does it refresh automatically when users open the parent dataset?

This is the default query mode. If you choose this option, the data for this dataset automatically refreshes when you open an associated dataset, analysis, or dashboard.

Sanjeeb2022 · October 12, 2024, 4:02am

Hi @hoyeon - When the child dataset is in SPICE, yes, you have to schedule a refresh or refresh it manually before using it in the parent dataset. When the child dataset is Direct Query, no refresh is required, as the query always connects to the source database.

If your final dataset is in SPICE mode (a combination of child datasets), the best approach is to create a Lambda function and refresh it through the QuickSight API.
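A minimal sketch of such a Lambda function is below, assuming Python with boto3 and the QuickSight `create_ingestion` and `describe_ingestion` calls. The event shape (`account_id`, `dataset_ids`, `poll_seconds`) and the helper names are my own assumptions, not a QuickSight convention. It refreshes the datasets in the order given, so list the child datasets first and the final dataset last.

```python
import time
import uuid


def start_refresh(client, account_id, dataset_id):
    """Start a full SPICE refresh and return the generated ingestion ID."""
    ingestion_id = str(uuid.uuid4())
    client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
        IngestionType="FULL_REFRESH",
    )
    return ingestion_id


def wait_for_refresh(client, account_id, dataset_id, ingestion_id,
                     poll_seconds=10, max_polls=60):
    """Poll the ingestion until it completes, fails, or we give up."""
    for _ in range(max_polls):
        response = client.describe_ingestion(
            AwsAccountId=account_id,
            DataSetId=dataset_id,
            IngestionId=ingestion_id,
        )
        status = response["Ingestion"]["IngestionStatus"]
        if status == "COMPLETED":
            return status
        if status in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Refresh of {dataset_id} ended as {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Refresh of {dataset_id} did not finish in time")


def lambda_handler(event, context, client=None):
    """Refresh each dataset in order: children first, final dataset last.

    Assumed (hypothetical) event shape:
    {"account_id": "...", "dataset_ids": ["child-id", "parent-id"]}
    """
    if client is None:
        # Imported here so the helpers can be exercised with a stub client.
        import boto3
        client = boto3.client("quicksight")

    account_id = event["account_id"]
    poll_seconds = event.get("poll_seconds", 10)
    for dataset_id in event["dataset_ids"]:
        ingestion_id = start_refresh(client, account_id, dataset_id)
        wait_for_refresh(client, account_id, dataset_id, ingestion_id,
                         poll_seconds=poll_seconds)
    return {"refreshed": event["dataset_ids"]}
```

Keep in mind that a Lambda invocation can run for at most 15 minutes, so for large datasets it may be better to start the refresh in one invocation and check the status in a later one. The function's role also needs permission for `quicksight:CreateIngestion` and `quicksight:DescribeIngestion`.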

Regards - Sanjeeb