Dataset used in another dataset - Keys not recognized

Hi all,

I am trying to bring some data into SPICE to be used in another dataset. The problem is that the data doesn’t join properly once in SPICE, whereas if I load everything together in a single dataset, it works.

Here is my set-up.
A big Daily Activity table containing a Product ID foreign key, type Integer.
A small Product table whose primary key is the Product ID, again type Integer.
Now, if I create my dataset with the big Daily Activity table as the main table, left join the Product table on the Product ID, and load everything into SPICE, everything works as expected.
But if instead I create a SPICE dataset with the Daily Activity table only, save and load it, and then use it to create a second SPICE dataset where I left join the Product table on the Product ID in exactly the same way as above, some IDs do not get joined, as if they had no match, and the fields end up as NULLs.
In other words in the first example I’d have Daily_Activity.FK_ProductId 123 matching Product.PK_ProductID 123, while in the second case Product.PK_ProductID for that row would be NULL.
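Just so we are on the same page, this is the plain left-join behaviour I would expect in both set-ups; a minimal Python sketch with toy data (the names are illustrative, not the real schema):

```python
# Toy left join: every activity row is kept; unmatched keys get NULL (None).
activity = [
    {"FK_ProductID": 123, "Amount": 100},
    {"FK_ProductID": 999, "Amount": 50},   # no match in the product table
]
product = {123: "Gesha"}  # PK_ProductID -> Description

joined = []
for row in activity:
    pk = row["FK_ProductID"]
    joined.append({
        **row,
        "PK_ProductID": pk if pk in product else None,  # NULL when unmatched
        "Description": product.get(pk),
    })
```

The point being: a left join should never drop activity rows, it should only leave the product-side columns NULL where there is no match.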

I did this twice, without joining any other tables, and I always got the same result. I double-checked: the type is always Integer and the values look fine. I also made sure to check the same IDs, to confirm that in one case they join as expected and in the other they do not.

I googled, searched the older posts (a bit difficult given the lack of distinctive keywords), and checked the official documentation, but I couldn’t find anything; my case is not in the “Known limitations” list as far as I can see, and my situation seems quite a common one (using a dataset in another dataset and joining on a key). I am sorry if this was already asked or if the solution is trivial.

Any idea why this is happening, by any chance?

Just to clarify, I’d like to follow the dataset-in-another-dataset approach so that the bigger table can first be loaded incrementally, and then everything gets a full refresh; as I see it, this would require more space, but the aim is to make it faster and less taxing on the DB.

Thank you in advance for your help.

This may not apply to your situation, but in some of our source systems, the business key in the “small product” table is padded with leading zeros up to a length of 8, while the same business key in the “large activity” table isn’t. Go figure. (The source application is ancient!) Anyhow, the source application deals with it somehow (probably not joining via SQL, but issuing two different SQL select commands). When that data makes its way over to the data warehouse and into SPICE, we have to supply the leading zeros before the data will match up.
Summary: be on the lookout for subtle data differences between the two columns. The eye sees the issue but the brain ignores the difference. Good luck.
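In case anyone hits the same thing: a quick fix in our experience is to normalize both key columns to the same width before joining. A rough Python sketch (the width of 8 and the function name are just assumptions from our system, not anything QuickSight-specific):

```python
# Normalize string keys that may carry leading zeros, so that
# "160252" and "00160252" compare equal after normalization.
def normalize_key(key: str, width: int = 8) -> str:
    stripped = key.strip().lstrip("0")
    # An all-zero (or empty) key normalizes to a run of zeros.
    return stripped.zfill(width) if stripped else "0" * width
```

With that in place, `normalize_key("160252")` and `normalize_key("00160252")` both come out as `"00160252"`, so the join keys line up.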

Hello @cgreenacre,
First of all thank you for taking the time to reply!
I had already thought about leading zeroes, but unfortunately that is not the case here. No IDs have leading zeroes; an example of an ID that is recognized in one case but not in the other is 160252. And as mentioned, both columns are type Integer, so I have no clue why they wouldn’t match.

Also, I am assuming they fail to match only in some of my activity rows. I did a quick check with a couple of analyses, but honestly QS is not behaving as expected: I created a table with FK_ProductID as the first column, PK_ProductID as the second, then the amount. I would expect something like this:

AT.FK_ProductID | PT.PK_ProductID | AT.Amount
160252          | NULL            | 100
160252          | 160252          | 200

In other words, I would expect the sum of the amount in the activity table to be split between the rows matching an ID in the product table and the non-matching (NULL) rows.
Instead I can only see the rows where the join succeeded; the other rows are not shown at all.
This is worrying because not only is the join not working properly, but the data is not all displayed in the analysis, which makes me question the results QuickSight is providing.
NB: I checked the non-SPICE dataset against the original table and the numbers match, so I know for a fact that the SPICE version has wrong figures.
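To make the expectation concrete, here is a sketch of the aggregation I have in mind (toy data, assumed column names): the total per FK_ProductID should include the NULL-joined rows too, so dropping them understates the sum.

```python
from collections import defaultdict

# Left-joined rows as I'd expect to see them in SPICE (toy data):
# one row matched the product table, one did not.
rows = [
    {"FK_ProductID": 160252, "PK_ProductID": None,   "Amount": 100},
    {"FK_ProductID": 160252, "PK_ProductID": 160252, "Amount": 200},
]

totals = defaultdict(int)
for r in rows:
    totals[r["FK_ProductID"]] += r["Amount"]  # NULL-joined rows still count
```

Here `totals[160252]` is 300; if the NULL-joined row were silently dropped, the analysis would report 200 instead, which is exactly the kind of discrepancy I am seeing.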

Thanks again!


With a left join (Big Daily Activity Table.Product_ID = Product_Table.Product_ID), what is the record count once imported into SPICE?

In both scenarios below, I would expect the record count to be the same:

1. Big Daily Activity Table joined with Product Table on product_id
2. Big Daily Activity Table dataset and Product Table dataset, then joined on product_id

Screenshots:
In my example the record count is 7 (the Activity table has 7 rows; the left join finds matching records in the Product table, which has 2 rows).
Activity joined with Product

Activity (Use in a new dataset)


Hello @Koushik_Muthanna,
Thank you for helping out!

I created two analyses, one per set-up, and in each I created a table with AT.FK_Product_ID and aggregated the amounts. Only one product ID is missing in the SPICE version, not sure why, but still, as per my last message, QuickSight is simply ignoring the non-joined rows, and in fact the sum of the amount for some products does not match at all (in the SPICE version it is lower, never higher, so I assume it is ignoring the non-matching rows).

To summarize: the Activity Table is fully imported from my primary source into SPICE.
I then use it in another dataset and join the Product Table from my primary source on an ID (Integer, no leading zeroes, nothing fancy).
In the preview I can still see the original activity rows, but for some Product IDs the columns from the Product Table side are NULL. Something like this:

AT.Date    | AT.FK_ProductID | AT.Amount | PT.PK_ProductID | PT.Description
01/04/2022 | 160252          | 100       | NULL            | NULL
01/04/2022 | 160253          | 200       | 160253          | Gesha

Once I create an analysis, I find out that only one product is missing entirely (very small numbers, so probably no joined rows at all for it), while the others are there with lower aggregated amounts, probably because QuickSight is ignoring the non-joined rows. And indeed, if I create a table using both Product_ID fields, as explained in my previous message, only the non-NULL rows are shown.
So even though I am only using fields from my main Activity Table, QuickSight is ignoring the rows with no match.

In the non-SPICE set-up, instead, everything works like a charm.

I hope that checking in an analysis was fine; I was not sure whether I should check the count on the dataset page, but I don’t know where I’d find that information (my non-SPICE set-up also has other tables joined at the moment, but I can recreate another dataset to test).

Thanks again.


Thanks for providing the details. The easiest way to understand what is happening is to compare row counts across the different tests with the same joins.

Screenshot below shows where to find the count of records for a dataset.

Additionally, if you know a specific product is missing, filtering the data to that product (in the two tests you are performing) and then testing will also make troubleshooting easier.


Hello @Koushik_Muthanna,

After a lot of trial and error I came to the conclusion that the data is actually joined but the dataset preview is not working correctly. The numbers look fine in my analysis (sorry for the confusion), but the preview shows some IDs as not joined even though they are.
It is annoying, but I can live with that; I guess it is some bug.

Unfortunately I also stumbled into another issue: I cannot schedule an incremental refresh (no data gets updated). I found an older post with the same issue, so I am not sure whether I should write a new post here or write directly to the support email address.

In short: I set up an incremental refresh of my SPICE dataset; it is just the Activity Table, no joins, nothing else. There are two date columns: ORDER_DATE and ORDER_CREATED, both UTC; the first is DATE type and the second is TIMESTAMP. So you would get:

21/04/2022 | 21/04/2022 16:00:00 | 1234 | 100
21/04/2022 | 21/04/2022 16:10:00 | 1234 | 200

I tried using each of them to create a scheduled incremental refresh, following the instructions to the letter.
E.g.: use ORDER_DATE, window size 7 days, run every 1 hour. Nothing.
I tried changing the column, the window size, running it every 30 minutes, and changing the timezone it runs at to UTC (just in case), but each time the refresh runs at the specified time and no rows are added or updated.
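For reference, this is my understanding of how the look-back window should behave (a sketch of the concept, not QuickSight’s actual implementation): on each run, rows whose ORDER_DATE falls inside the window should be re-ingested.

```python
from datetime import date, timedelta

def rows_in_window(rows, run_date, window_days=7):
    # Rows whose ORDER_DATE falls inside the look-back window are the ones
    # an incremental refresh should delete and re-ingest on each run.
    cutoff = run_date - timedelta(days=window_days)
    return [r for r in rows if r["ORDER_DATE"] >= cutoff]

rows = [
    {"ORDER_DATE": date(2022, 4, 21), "Amount": 100},
    {"ORDER_DATE": date(2022, 4, 10), "Amount": 200},  # outside a 7-day window
]
refreshed = rows_in_window(rows, run_date=date(2022, 4, 22))
```

By that logic, a run on 22/04 with a 7-day window should pick up the 21/04 rows, so I would expect at least those to be refreshed; instead nothing is.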

I have run out of options and am starting to think it is another bug.


Thank you for letting us know. I would recommend filing a case with AWS Support, where we can dive into the details and help you further. Here are the steps to open a support case (Creating support cases and case management - AWS Support). If your company has someone who manages your AWS account, you might not have direct access to AWS Support and will need to raise an internal ticket to your IT team or whoever manages your AWS account. They should be able to open an AWS Support case on your behalf. Hope this helps!