Joining datasets based on ranges

JMthinkrf · June 16, 2023, 3:25pm

I am hoping to join two datasets but I need to do so based upon a range. For example, assume I have the following dataset #1:
LOWER UPPER COLOUR
1 50 Red
51 63 Green
75 100 Yellow
125 500 Purple

Dataset#2 might look like:
NAME VALUE
George 23
Frank 82
Henry 71
Sally 222

And the joined dataset then should look like this:
NAME VALUE COLOUR
George 23 Red
Frank 82 Yellow
Henry 71
Sally 222 Purple

My actual datasets are much more complicated than these, and there are more fields in the join (which aren’t ranges), but hopefully, that gets the idea across.

Is this possible within Quicksight? My only option right now is through a very long calculated field, which is growing longer by the day.

Sanjeeb2022 · June 16, 2023, 3:32pm

Hi @JMthinkrf - What is the source of this data set? If it is relational, can we create custom field which will have one value w.r.t range and also put same logic to have custom field in data set 2. Then you can have a inner join for both data sets.

If the data source is relational, custom sql will really helpful to create derive fields which can be used as join in the data set.

Hi @ArunSanthosh @eperts - Any advise on this.

Regards - Sanjeeb

David_Wong · June 16, 2023, 5:28pm

@Sanjeeb2022

I’m not a SQL expert but wouldn’t a query like this work?

SELECT
Table2.Name,
Table2.Value,
Table1.Colour
FROM Table2
INNER JOIN Table1
ON Table2.Value BETWEEN Table1.Lower AND Table1.Upper

If it’s a valid query and the data is in Excel or CSV files, this can be done in Athena, right?

Sanjeeb2022 · June 16, 2023, 5:55pm

Hi @David_Wong - Good one not sure whether it will throw an error or not, however I will test it and get back to you. As per my knowledge join should have always equal or not equal, never see a between clause

My idea is to have a derive column at run time using custom sql which will give one column and that can be used as joining condition.

I agree with you, if it is a valid query, then athena can work.

Hi @JMthinkrf - Is it possible to check whether the sql is valid or not.

Regards - Sanjeeb

David_Wong · June 16, 2023, 6:00pm

@Sanjeeb2022
You can have greater than or less than in a join clause, so if BETWEEN is not allowed, we can always change it to >= and <=.

Sanjeeb2022 · June 16, 2023, 6:03pm

Thanks @David_Wong . Let’s wait for the outcome from @JMthinkrf . If the data volume is less, it will good , if the data volume is high, it may take some time as it will do a full scan.

However it is a cool solution.

Regards - Sanjeeb

JMthinkrf · June 16, 2023, 6:21pm

Thanks for the ideas @Sanjeeb2022 and @David_Wong. I’m not an SQL expert myself, but I will try out David’s suggestion next week and post here success or failure. With the SQL concept, I can probably simplify Dataset #1 into a continuous range, and then just use a < operator to select, filling in the blanks in the table with “null”.

JMthinkrf · June 19, 2023, 9:11pm

I put together something based more or less on @Sanjeeb2022 's first suggestion:

Created a dataset with an inner join of the two datasets, with dataset2 as the primary
Created a calculated field that is true if VALUE was between LOWER and UPPER, or if COLOUR was null (so that the Henry 71 entry from dataset2 is still included)
Filtered the dataset on whether the calculated field is true.

I haven’t experimented yet with the full larger dataset, but it seems to be working so far.

Sanjeeb2022 · June 20, 2023, 7:44am

Thank you @JMthinkrf for the confirmation, glad it is working fine. Can you please marked the suggestion as solution so that it will help other community members as well.

Have a great week ahead.

Regards - Sanjeeb