Alternate strategies to dataset UNION

Hi,

I’m trying to work out how to solve a few problems with growing datasets. Being able to UNION datasets would probably solve them, but since that isn’t available I’m looking for alternative approaches.

We have a few SPICE datasets which are based on Athena tables. They are used to back daily analytics dashboards for our customers.

The SPICE refreshes take a long time (so we occasionally hit timeouts and have to look into them), and because the datasets cover all our customers we refresh them multiple times a day for the different timezones, so each customer gets their analytics for yesterday’s business day.
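For reference, triggering those full SPICE refreshes programmatically looks roughly like this (a minimal boto3 sketch; the account ID and dataset IDs are placeholders, and I’ve left out the retry/alerting handling we’d want in practice):

```python
import boto3
import uuid

# Minimal sketch: kick off a full SPICE refresh for each dataset that needs
# updating for a given timezone window. IDs below are placeholders.
quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "123456789012"                          # placeholder AWS account ID
DATASET_IDS = ["analytics-emea", "analytics-apac"]   # placeholder dataset IDs

for dataset_id in DATASET_IDS:
    response = quicksight.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=dataset_id,
        IngestionId=str(uuid.uuid4()),   # must be unique per refresh
        IngestionType="FULL_REFRESH",
    )
    print(dataset_id, response["IngestionStatus"])
```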

Ideally I’d like to have different datasets for different regions (or a subset of a region when we have a large volume of data from it), but manage that efficiently.

I was thinking that UNIONing the datasets together (if it were possible) would mean we could refresh a region dataset’s SPICE, then the “parent” dataset’s SPICE, and use a single analysis/dashboard for all our customers (it’s filtered using RLS).

I’ve looked at some of the ways of joining datasets together at the QS level (multiple joins from a linking table to the sub-datasets), but QS complained that the joined dataset couldn’t have more than x rows, so I don’t think that will work in our case as the sub-datasets are all going to be quite large.

We’ve thought about just having multiple datasets, one for each region, but that would mean replicating the analyses/dashboards and coordinating which customer gets which dashboard connected to which dataset (and making sure that process stays in sync, so extra testing etc.). I’m assuming there is no way of dynamically changing the dataset at runtime for a dashboard.

We could move away from SPICE and use direct query, but then the analysis would be really slow. We display results in a table for CSV download, so every time the customer changes the date range or goes to a different page of the table it reruns the direct query. That makes it both slow to use and expensive, as I think we end up fetching the data from Athena over and over.

So looking for guidance on what creative ways there might be to manage this :slight_smile:

Hi @nlp - One analysis supports multiple datasets, though a visual within the analysis can map to only one dataset. So one option you have here is to use different visuals in your analysis for the different datasets.
Another option is to preprocess your data using AWS Glue to apply rules before loading into QuickSight.
How many different datasets do you have? Would this approach work for you?

Hi @awsvig,

Thanks for the ideas. I think separate visuals end up being very complicated.

We have used multiple visuals to show alternate versions of a table, layering the tables on top of each other with parameters to control which visual is shown, but maintaining those sorts of reports quickly becomes a pain (you have to know that there are multiple tables layered on top of each other and remember, when you change one, to check whether the same change is needed in the others). Trying to do that for datasets refreshed for different timezones (even if we started with just four timezone groups around the world) would become horrible to manage manually, and once we have to programmatically manage the analysis, the advantage of QuickSight letting us create analyses/dashboards quickly starts to fade.

We already use Glue for most of the table creation. We could create specific Glue jobs that would make this data slightly better to ingest into QuickSight, but that means running the Glue jobs and ingesting all the data from Athena multiple times a day - possibly for each new report we create. I guess we could make the Glue jobs clever enough to only process the data we need updated by timezone (or use Hudi tables), but having to create Glue jobs or new tables each time we want to create a report just seems overly complicated and a slow way to get reports out to customers.
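For what it’s worth, this is the kind of “clever” job I was imagining - a rough Glue (PySpark) sketch that only reads the partitions for one timezone/region group; the database, table, bucket, and partition column names are made up for illustration:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Rough sketch: only read and rewrite the partitions for one timezone/region
# group instead of reprocessing the whole table on every run.
# Database, table, bucket, and column names are illustrative placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "region_group"])  # e.g. "emea"

glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="daily_events",
    # Push the partition filter down so Glue only lists/reads matching partitions
    push_down_predicate=f"region_group = '{args['region_group']}'",
)

# ...apply whatever shaping rules QuickSight needs here...

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": f"s3://our-curated-bucket/quicksight/{args['region_group']}/",
        "partitionKeys": ["business_date"],
    },
    format="parquet",
)
```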

I’m currently thinking that replicating the dataset and dashboard as part of our deployment process might be what we’ll have to do for now (we manage the rollout of dashboards across dev/qa/prod through asset bundles, storing the JSON in GitHub along the way, so we can change IDs/replicate if needed): create the different “region” versions of the dataset, connect the replica dashboards to them, and then coordinate the dashboard “region” group permissions so the correct customers get their region’s “set” of dashboards. Still messy, but it moves most of the complication away from having to be handled in the analysis. It will mean some coordination on our end to set the right permissions, but we can probably hide that away so the front-end report selector isn’t aware that there are multiple “region” groups. That way we can hopefully contain all the logic in one or two places in our code and not require extra changes outside of QS every time we deploy a new report.
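For anyone following along, the replication step would lean on the asset bundle APIs we already use for dev/qa/prod promotion - roughly like this boto3 sketch (job IDs, ARNs, file names, and the group name are placeholders, and I’ve skipped the polling and error handling):

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
ACCOUNT_ID = "123456789012"  # placeholder

# 1. Export the "template" dashboard and its dependencies as an asset bundle.
quicksight.start_asset_bundle_export_job(
    AwsAccountId=ACCOUNT_ID,
    AssetBundleExportJobId="export-daily-analytics",
    ResourceArns=[
        f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:dashboard/daily-analytics"
    ],
    IncludeAllDependencies=True,
    ExportFormat="QUICKSIGHT_JSON",
)
# ...poll describe_asset_bundle_export_job until it completes, download the
# bundle from the returned DownloadUrl, and rewrite the IDs for the region copy...

# 2. Import the rewritten bundle to create the per-region copy.
with open("daily-analytics-emea.qs", "rb") as bundle:
    quicksight.start_asset_bundle_import_job(
        AwsAccountId=ACCOUNT_ID,
        AssetBundleImportJobId="import-daily-analytics-emea",
        AssetBundleImportSource={"Body": bundle.read()},
    )

# 3. Grant the regional customer group access to its copy of the dashboard.
quicksight.update_dashboard_permissions(
    AwsAccountId=ACCOUNT_ID,
    DashboardId="daily-analytics-emea",
    GrantPermissions=[
        {
            "Principal": f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:group/default/emea-customers",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:QueryDashboard",
                "quicksight:ListDashboardVersions",
            ],
        }
    ],
)
```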

Anyway, thanks again for the ideas.
