Loading very large datasets into QuickSight

Hi QuickSight community,
I have a very large table containing more than 100 million records stored in an Amazon Aurora PostgreSQL database. I want to load this data into QuickSight to create visualizations, but I am unable to load such a huge table directly. I would like to know a better way to load such a big dataset into QuickSight. Could Athena with S3 be a possible option?
I'm attaching the schema of the large table for reference:

Hi @tamilmani

Welcome to the community!

Yes, Amazon Athena with S3 can be a viable option for loading large datasets into Amazon QuickSight, especially for tables like yours with more than 100 million records.

You can export the data from your Amazon Aurora PostgreSQL database to Amazon S3, typically as partitioned Apache Parquet. Storing the data in S3 as partitioned, compressed Parquet significantly reduces both storage costs and the amount of data scanned per query, which improves query performance.
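For illustration, here is a minimal boto3 sketch of one export route, the RDS snapshot export feature, which writes the snapshot to S3 as Apache Parquet. Every ARN, bucket, and identifier below is a placeholder, and if you need Hive-style partitioning by a specific column you would typically rewrite the files with a Glue job instead.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Export an existing cluster snapshot to S3; RDS writes the data as Parquet.
# All identifiers, ARNs, and bucket names are placeholders for illustration.
response = rds.start_export_task(
    ExportTaskIdentifier="orders-export-2024-06",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster-snapshot:orders-snapshot",
    S3BucketName="my-analytics-bucket",
    S3Prefix="aurora-exports/orders",
    IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export-role",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
    # Export only the large table instead of the whole snapshot.
    ExportOnly=["mydb.public.orders"],
)
print(response["Status"])
```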

Once the data is in S3, you can use Amazon Athena to query it. Athena is a serverless query service that allows you to run SQL queries directly on data stored in S3.
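As a sketch (assuming the Parquet files sit under a hypothetical s3://my-analytics-bucket/aurora-exports/orders/ prefix and are partitioned by a hypothetical order_date column), you can register them as an external table with boto3 and then load the partitions:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run(statement: str) -> None:
    """Submit a statement to Athena and wait for it to finish."""
    qid = athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return
        time.sleep(2)

# Hypothetical table, columns, and S3 locations for illustration only.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
PARTITIONED BY (order_date STRING)
STORED AS PARQUET
LOCATION 's3://my-analytics-bucket/aurora-exports/orders/'
""")

# Discover the partitions that already exist under the S3 prefix.
run("MSCK REPAIR TABLE analytics.orders")
```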

QuickSight can connect to Athena as a data source, so you can create datasets for visualizations either in Direct Query mode, without loading the entire dataset, or by importing the data into SPICE, QuickSight's in-memory engine.
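If you prefer to script this, the same setup can be done through the QuickSight API. A minimal sketch with placeholder account and table details; switching ImportMode between DIRECT_QUERY and SPICE picks between the two modes:

```python
import boto3

qs = boto3.client("quicksight", region_name="us-east-1")
account_id = "123456789012"  # placeholder

# Register Athena as a QuickSight data source (placeholder IDs throughout).
ds = qs.create_data_source(
    AwsAccountId=account_id,
    DataSourceId="athena-analytics",
    Name="Athena analytics",
    Type="ATHENA",
    DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
)

# Build a dataset on top of the Athena table; use "SPICE" to import instead.
qs.create_data_set(
    AwsAccountId=account_id,
    DataSetId="orders-dataset",
    Name="Orders",
    ImportMode="DIRECT_QUERY",
    PhysicalTableMap={
        "orders": {
            "RelationalTable": {
                "DataSourceArn": ds["Arn"],
                "Schema": "analytics",
                "Name": "orders",
                "InputColumns": [
                    {"Name": "order_id", "Type": "INTEGER"},
                    {"Name": "customer_id", "Type": "INTEGER"},
                    {"Name": "amount", "Type": "DECIMAL"},
                    {"Name": "order_date", "Type": "STRING"},
                ],
            }
        }
    },
)
```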

Please refer to the data source limitations documentation below; it might be helpful for you.


Thanks @Xclipse, that makes sense. However, even if you’re querying data through Athena, is there still a risk that the queried data could exceed QuickSight’s 100 million row limit? Or does using a more efficient format like Parquet reduce the number of rows sufficiently to avoid this issue?

Hi @tamilmani,

The number of rows in your dataset depends on the data itself; neither Athena nor QuickSight, nor a format like Parquet, reduces the row count. Parquet uses a columnar format, though, so the amount of data scanned by Athena can be greatly reduced if your queries pull only the columns needed for your dataset, and if you also use compression, the storage costs can be significantly reduced.
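As a small illustration of the column-pruning point (hypothetical table and column names, continuing the sketch above), selecting only the needed columns and filtering on the partition column keeps the scanned data small:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Only order_id and amount are read from the Parquet files, and only the
# partitions matching the order_date filter are scanned at all.
athena.start_query_execution(
    QueryString="""
        SELECT order_id, amount
        FROM analytics.orders
        WHERE order_date >= '2024-01-01'
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```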

You can also use incremental refreshes to manage the amount of data stored in SPICE if you do not need all 100 million rows available for querying.
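Incremental refresh can be configured in the QuickSight console or through the API. A hedged sketch using put_data_set_refresh_properties, assuming a SPICE dataset and a hypothetical order_date column as the look-back window:

```python
import boto3

qs = boto3.client("quicksight", region_name="us-east-1")

# Configure the SPICE dataset to refresh incrementally, re-ingesting only the
# last 7 days of rows based on the order_date column (placeholder IDs).
qs.put_data_set_refresh_properties(
    AwsAccountId="123456789012",
    DataSetId="orders-dataset",
    DataSetRefreshProperties={
        "RefreshConfiguration": {
            "IncrementalRefresh": {
                "LookbackWindow": {
                    "ColumnName": "order_date",
                    "Size": 7,
                    "SizeUnit": "DAY",
                }
            }
        }
    },
)
```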

Hope this helps!


Thanks @andres007 this helps!