Merging 2 S3 sources on QuickSight

I have a QS dashboard already built on an S3 file. This means I cannot rebuild the data source from scratch without impacting my stakeholders. To solve this I have uploaded a second object to S3, which I merge with the primary source using a “Left Join” on one key common to both sources.

The first merge seemed fine and the dashboards were not affected at all. The problem emerged after a new data refresh, for which the “Left Join” seems to fail and some records from the primary table get dropped.

Please Note:

  • Neither the primary nor the secondary source has any calculated attributes in data preparation.
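
For clarity, the merge configured in QuickSight is conceptually equivalent to the SQL sketch below; the table names (primary_tbl, secondary_tbl) and the key column (join_key) are hypothetical placeholders for the actual sources and key.

```sql
-- Conceptual equivalent of the join configured in QuickSight data prep.
-- primary_tbl, secondary_tbl, and join_key are hypothetical placeholders.
SELECT
  p.*,               -- every row and column of the primary source should be kept
  s.extra_attribute  -- columns brought in from the secondary source
FROM primary_tbl AS p
LEFT JOIN secondary_tbl AS s
  ON p.join_key = s.join_key;
```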

Hi,
Are you familiar with the ability to replace a dataset? If I were in your shoes I would start by duplicating my analysis just to be safe. Then rebuild the dataset so it’s exactly as you want it. Go into your duplicate analysis and click the pencil icon next to “Dataset”, and when you select the ellipsis next to your dataset name you will see the option to “Replace”. Then select your newly rebuilt dataset; as long as it contains all the columns the dashboard uses from your old dataset, you shouldn’t have any difficulties. The only catch is that you can’t use the duplicate analysis to replace an existing dashboard. So once you’re confident it works the way you want, delete the duplicate, go back to the original analysis, and replace the dataset there.

(Screenshots: steps 2–4 of the dataset replacement flow.)

When you say

“Left Join” seems to fail and some records from the primary table get dropped

do you mean that the ingestion does not import all rows? If so, did you check the errors that were reported?

Thank you @rbrady, I am trying this solution right now. Thanks for your detailed guidance; I will circle back with my results.

Hey @rbrady. It seems that rebuilding the source from scratch and replacing it did the trick. I will keep monitoring both analyses (old + new) before taking this as the accepted solution. Thanks a lot for the help!

I’m glad it worked out, happy to help!

The sources merged fine on the first attempt, but the issue reappeared with the next day’s AM scheduled refresh.
The individual sources refreshed as expected. The primary source loads the expected number of rows (198,460), while the merged dataset, a “left join” of these two S3 inputs with the primary table on the left, loads only 197,072 rows. The secondary source had no problems during ingestion.

  1. Primary load:
    primary_wo_errors (screenshot)
  2. Merged S3 load:
    s3_merged_w_errors (screenshot)
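
By definition, a left join returns at least one row for every row of the left-hand table, so the merged dataset should never load fewer rows than the primary source; the gap between 197,072 and 198,460 means rows are being dropped during the merge/ingest. As a sanity check, once both files are queryable as Athena tables, the expected merged row count can be computed directly. Table and column names below are hypothetical placeholders:

```sql
-- Sanity check: the row count the left join itself should produce.
-- primary_tbl, secondary_tbl, and join_key are hypothetical placeholders
-- for the two S3 files exposed as Athena tables and their common key.
-- The result should be >= 198,460; it only exceeds that figure if
-- join_key is duplicated in secondary_tbl.
SELECT COUNT(*) AS expected_merged_rows
FROM primary_tbl AS p
LEFT JOIN secondary_tbl AS s
  ON p.join_key = s.join_key;
```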

Can you check the errors as well? That would help identify the underlying issue: Troubleshooting skipped row errors - Amazon QuickSight

Following those guidelines, there is no option on this error screen to troubleshoot further.
troubleshoot_dataset_error (screenshot)
It seems the import completed with no detected errors.

Unfortunately we don’t show skipped rows when joining across data sources. It is a miss on our side and we have it on our roadmap.

In the meantime, to troubleshoot, can you create a single dataset using just the primary source S3 file, import it into SPICE, and check whether any rows are skipped? If so, it should have a “View error summary” link to show which rows were skipped.

@cpenaa, create an Athena table from the S3 file, and then use custom SQL to left join the two tables together. You might be able to get all the data that way.
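
To illustrate the suggestion, here is a sketch of the DDL for registering one of the S3 files as an Athena table; the column list, delimiter, and bucket path are assumptions and would need to match the real file. With both files registered this way, the left join itself can go into the custom SQL editor when building the QuickSight dataset on the Athena data source.

```sql
-- Hypothetical DDL to expose the secondary S3 file as an Athena table.
-- Column names/types, the CSV layout, and the S3 location are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS secondary_tbl (
  join_key    string,
  attribute_1 string,
  attribute_2 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/secondary/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```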

Thanks for your recommendation @wwwyyy and for the visibility on your roadmap. I use the primary S3 source on a daily basis and it never suffers “skipped rows” or any other issue, so I strongly suspect a defect in the tool when merging 2 S3 sources directly in QS. Moreover, sometimes the data does get merged correctly, so after monitoring this issue for 2 days I can only conclude the behavior is arbitrary.
I have already contacted the QS help center and they will reach out to me soon.

Hey @Ying_Wang thank you for your recommendation, it inspired me to find a workaround solution.

Current State
This is a cross-account (A/B) AWS problem, sorry for not specifying that before. I generate the data in AWS account A and it is ingested by account B, so I cannot directly produce the output Athena table and consume it in QuickSight.

Possible solution
Create 2 tables in Athena based on the primary and secondary S3 sources. Merge them and produce a new S3 object using an Athena CTAS query, and schedule this process to run. Finally, I will just need to connect QuickSight to this new S3 object with the merged table.
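
A rough sketch of the CTAS step, assuming the two Athena tables from the previous step already exist; the output table name, selected columns, and S3 location are hypothetical placeholders:

```sql
-- Hypothetical CTAS that materializes the left join back to S3 as
-- delimited text so the merged output can be consumed as a single object.
-- Table, column, and bucket names are placeholders.
CREATE TABLE merged_tbl
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://your-bucket/merged/'
) AS
SELECT
  p.*,
  s.attribute_1,
  s.attribute_2
FROM primary_tbl AS p
LEFT JOIN secondary_tbl AS s
  ON p.join_key = s.join_key;
```

Note that CTAS requires the target location to be empty, so a scheduled run would typically drop merged_tbl and clear the s3://your-bucket/merged/ prefix before re-creating it.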

I will keep this post updated once done.

Great plan! You can try this: Use Amazon Athena and Amazon QuickSight in a cross-account environment | AWS Big Data Blog
