Joining datasets based on ranges

I am hoping to join two datasets but I need to do so based upon a range. For example, assume I have the following dataset #1:
LOWER UPPER COLOUR
1 50 Red
51 63 Green
75 100 Yellow
125 500 Purple

Dataset#2 might look like:
NAME VALUE
George 23
Frank 82
Henry 71
Sally 222

And the joined dataset then should look like this:
NAME VALUE COLOUR
George 23 Red
Frank 82 Yellow
Henry 71
Sally 222 Purple

My actual datasets are much more complicated than these, and there are more fields in the join (which aren’t ranges), but hopefully, that gets the idea across.

Is this possible within Quicksight? My only option right now is through a very long calculated field, which is growing longer by the day.

Hi @JMthinkrf - What is the source of this data set? If it is relational, can we create custom field which will have one value w.r.t range and also put same logic to have custom field in data set 2. Then you can have a inner join for both data sets.

If the data source is relational, custom sql will really helpful to create derive fields which can be used as join in the data set.

Hi @ArunSanthosh @eperts - Any advise on this.

Regards - Sanjeeb

@Sanjeeb2022

I’m not a SQL expert but wouldn’t a query like this work?

SELECT
Table2.Name,
Table2.Value,
Table1.Colour
FROM Table2
INNER JOIN Table1
ON Table2.Value BETWEEN Table1.Lower AND Table1.Upper

If it’s a valid query and the data is in Excel or CSV files, this can be done in Athena, right?

1 Like

Hi @David_Wong - Good one not sure whether it will throw an error or not, however I will test it and get back to you. As per my knowledge join should have always equal or not equal, never see a between clause :slight_smile:

My idea is to have a derive column at run time using custom sql which will give one column and that can be used as joining condition.

I agree with you, if it is a valid query, then athena can work.

Hi @JMthinkrf - Is it possible to check whether the sql is valid or not.

Regards - Sanjeeb

@Sanjeeb2022
You can have greater than or less than in a join clause, so if BETWEEN is not allowed, we can always change it to >= and <=.

1 Like

Thanks @David_Wong . Let’s wait for the outcome from @JMthinkrf . If the data volume is less, it will good , if the data volume is high, it may take some time as it will do a full scan.

However it is a cool solution.

Regards - Sanjeeb

Thanks for the ideas @Sanjeeb2022 and @David_Wong. I’m not an SQL expert myself, but I will try out David’s suggestion next week and post here success or failure. With the SQL concept, I can probably simplify Dataset #1 into a continuous range, and then just use a < operator to select, filling in the blanks in the table with “null”.

1 Like

I put together something based more or less on @Sanjeeb2022 's first suggestion:

  1. Created a dataset with an inner join of the two datasets, with dataset2 as the primary
  2. Created a calculated field that is true if VALUE was between LOWER and UPPER, or if COLOUR was null (so that the Henry 71 entry from dataset2 is still included)
  3. Filtered the dataset on whether the calculated field is true.

I haven’t experimented yet with the full larger dataset, but it seems to be working so far.

1 Like

Thank you @JMthinkrf for the confirmation, glad it is working fine. Can you please marked the suggestion as solution so that it will help other community members as well.

Have a great week ahead.

Regards - Sanjeeb