I have a dataset that combines a bunch of other data sources, all using left joins. When I add a new data source, I have the option to add it as a data source or as a dataset. What are the best practices with respect to this?
@jm_fen, your data source simply represents the origin of your dataset; it serves no purpose beyond that.
The best choice depends on the data type and the security protocols in place.
For instance, consider a scenario with one schema containing 15 tables in a relational database. This setup would correspond to 15 datasets linked to a singular data source.
Sometimes we create multiple schemas within the same application, for valid reasons. If those reasons are justified, it reflects a sound design and follows best practices; otherwise, it can lead to confusion.
Similar principles apply when crafting datasets for use in QuickSight.
If you are retrieving various files from a shared S3 bucket, all pertaining to the same application, it’s advisable to consolidate them under one data source (S3 bucket). Subsequently, you can designate distinct datasets for each respective file.
Thank you @neelay although I am not sure I follow you completely. When you say
"If you are retrieving various files from a shared S3 bucket, all pertaining to the same application, it’s advisable to consolidate them under one data source (S3 bucket). "
I have multiple files in the same S3 bucket but the only way I can get it to work is to create individual manifest files for each.
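For anyone hitting the same wall: a per-file manifest is just a small JSON document pointing at one object. Here is a minimal sketch of generating one, based on my understanding of the QuickSight S3 manifest format (bucket and key names are made-up placeholders):

```python
# Sketch: build a QuickSight S3 manifest for a single CSV file.
# Bucket/key names are made-up examples, not real resources.
import json

def make_manifest(bucket: str, key: str) -> str:
    """Return a QuickSight S3 manifest (JSON string) for one CSV file."""
    manifest = {
        "fileLocations": [
            {"URIs": [f"s3://{bucket}/{key}"]}
        ],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "containsHeader": "true",
        },
    }
    return json.dumps(manifest, indent=2)

print(make_manifest("my-data-bucket", "timeseries/file01.csv"))
```

Each file gets its own manifest like this, which is why the count of manifests tracks the count of files.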
I probably misunderstood you then. If these are CSV files, I think the only way to do this is to have a separate data source per S3 file.
If you have an Excel file, I think you can do 1 QS-DataSource, and treat each sheet as a table.
If it were not S3, you could have RDS as your data source. Then you can have 1 QS-DataSource per schema and n QS-Datasets, which could be different queries and tables of that schema.
Thank you @neelay. I have a feeling I do not have the optimal structure for what I am trying to achieve. Any thoughts?
I have 12 data files which are updated on a regular basis. They are generated with Python and pushed up to an S3 bucket, each in its own sub-folder. I also have 12 manifest files, one for each, so 12 data sources altogether. And with S3, each one creates its own dataset. I have tried combining them with fewer manifest files, but I always get weird results.
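For context, the push side of my pipeline looks roughly like this (the bucket name, folder layout, and file names below are made-up placeholders for my actual ones; the boto3 upload calls are commented out since they need AWS credentials):

```python
# Rough sketch of the pipeline: 12 files, each in its own sub-folder,
# each paired with its own manifest. All names are placeholders.
import json

BUCKET = "my-data-bucket"  # placeholder
DATASETS = [f"dataset_{i:02d}" for i in range(1, 13)]

def object_keys(name: str) -> tuple:
    """Return (data key, manifest key) for one dataset's sub-folder."""
    return f"{name}/{name}.csv", f"{name}/{name}.manifest.json"

def manifest_body(name: str) -> str:
    """Minimal QuickSight S3 manifest pointing at this dataset's CSV."""
    data_key, _ = object_keys(name)
    return json.dumps({
        "fileLocations": [{"URIs": [f"s3://{BUCKET}/{data_key}"]}],
        "globalUploadSettings": {"format": "CSV", "containsHeader": "true"},
    })

# Upload step (requires boto3 and AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# for name in DATASETS:
#     data_key, manifest_key = object_keys(name)
#     s3.upload_file(f"out/{name}.csv", BUCKET, data_key)
#     s3.put_object(Bucket=BUCKET, Key=manifest_key, Body=manifest_body(name))
```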
9 of the 12 files are all time-series data, and I combined them into one combined dataset. In that combined dataset I add each one as a data source, although I am not sure whether that is better than adding it as a dataset. Since the source files on S3 are updated frequently, I thought the data source would update more easily.
That’s beyond QuickSight.
But the answer depends on which parts of the process are under your control and which are not. Assuming you have full control of the process, a few notes:
Each visual element can only be linked to a single dataset.
When preparing your dataset for QuickSight, keep reporting, visual, and chart requirements in mind.
For instance, in one of my projects, I employed six distinct datasets, each feeding into its corresponding tab on the dashboard. Sometimes, a more streamlined approach involves directly pulling data from the database to QuickSight, especially if you’re proficient in SQL and working with a well-defined problem.
Feel free to reach out separately if you require additional assistance; I’d be more than happy to provide guidance.
Hi @jm_fen - It depends on your requirements. If the new dataset is required for your reporting and visualization, you can always create the dataset, add it, and do the join as you have done before; the new fields then become part of the new customized dataset. We took this approach for some of our use cases and it worked perfectly.
Hi @jm_fen - the main benefit of using existing datasets in a new dataset is that you can leverage calculated fields and other changes from that existing dataset. Think of it as a parent/child relationship where one team manages a dataset that has a set of key metrics, and I want to use that dataset but join some other data to it. In the future if the owners of that original dataset were to make changes/additions, my new dataset will automatically inherit them.
In your case I don't think there is much of a difference between creating a dataset for each S3 file and then joining all those existing datasets into a new one, or adding them as new data sources at the time of creating your 'end state' dataset - the result is going to be the same. S3 and flat files wind up having a 1:1:1 relationship between data source, table, and dataset, whereas with other sources that support SQL you can reuse the same data source to add multiple tables.
An alternative that will simplify things for you is to sit Athena on top of your S3 files, which will create a table for each in the Glue Data Catalog. Then when you pick Athena as your source in QuickSight, you will see all the tables and can add multiple tables to be joined more efficiently. It also opens up the possibility of using Custom SQL against that data, since Athena is a SQL query engine.
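As an illustration of what that Custom SQL could look like (table and column names below are hypothetical; substitute whatever names the Glue crawler assigned to your S3 folders), several of your time-series tables could be combined in one statement:

```sql
-- Hypothetical table/column names; replace with the Glue catalog
-- table names created from your S3 sub-folders.
SELECT a.event_date,
       a.metric_a,
       b.metric_b,
       c.metric_c
FROM series_a a
LEFT JOIN series_b b ON a.event_date = b.event_date
LEFT JOIN series_c c ON a.event_date = c.event_date
```

That single query can then back one QuickSight dataset instead of nine separate joined sources.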