How to Append Datasets

paruanil · July 2, 2024, 5:53am

I upload excel files as my dataset, but I am soon to run out of excel row limit. I tried to split and upload the files, but I do not see the option to append the files, only join/merge option. Is there any solution so that I can upload two files and append them perhaps? Or may be another better solution?

I also tried to use a helper column in all files and merge, but the second data file gets add on the right instead of getting added after the last row.

Xclipse · July 3, 2024, 3:04am

Hi @paruanil, it sounds like you’re facing limitations with Excel row limits and looking for a way to efficiently manage and append large datasets in Amazon QuickSight.

Here’s a detailed approach to address this issue:

Approach: Converting Excel Files to CSV and Using S3 with Amazon Athena

Step 1: Convert Excel Files to CSV

Convert Each Excel File to CSV:

Open each Excel file and save it as a CSV file. This can be done manually or via a script if you have many files.

Step 2: Upload CSV Files to S3

Create an S3 Bucket:

Create an S3 bucket (if you don’t have one already) to store your CSV files.

Upload CSV Files:

Upload all the converted CSV files to the S3 bucket. You can organize them in folders if needed.

Step 3: Create an Athena Table

Configure AWS Athena:

Go to the AWS Athena console.
Create a new database if you don’t have one.

Create an Athena Table for CSV Files:

Write a DDL statement to create a table in Athena that reads from the S3 bucket. Here’s an example:

Pseudocode (Syntax may vary)

CREATE EXTERNAL TABLE IF NOT EXISTS my_dataset (
    column1 STRING,
    column2 STRING,
    column3 STRING,
    ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = ',',
    'field.delim' = ','
)
LOCATION 's3://your-bucket/path-to-csv-files/'
TBLPROPERTIES ('has_encrypted_data'='false');

Adjust the column definitions to match your data schema.

Partitioning (Optional):

If your data is large and you want to improve query performance, consider partitioning the Athena table by a logical key, such as date.

Step 4: Use Athena Table in QuickSight

Connect QuickSight to Athena:

In QuickSight, create a new dataset.
Choose Athena as the data source and select the database and table you created.

SPICE (Optional):

If you’re using SPICE for faster performance, remember to refresh SPICE after updating the data in S3.
Amazon QuickSight launches SPICE Incremental Refresh

Benefits of This Approach

Scalability: You can manage and query large datasets without hitting row limits.
Automation: New data added to the S3 bucket will be immediately available in Athena and subsequently in QuickSight (with SPICE refresh if used).
Flexibility: You can append data easily by simply uploading new CSV files to the S3 bucket.

Alternative Simple Approach

If you prefer a simpler method and don’t need the scalability of Athena, you can directly append CSV files:

Combine CSV Files Locally:

Use a script to combine your CSV files locally into a single large CSV file.

Upload to QuickSight:

Upload this combined CSV file directly to QuickSight.

Example Script to Combine CSV Files (Python)

Pseudocode (Syntax may vary)

import pandas as pd
import glob

# Get list of CSV files
csv_files = glob.glob('path/to/csv/files/*.csv')

# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in csv_files])

# Export to a single CSV file
combined_csv.to_csv('path/to/output/combined.csv', index=False)

By following these approaches, you can efficiently manage large datasets and ensure smooth operation within Amazon QuickSight.

Did this solution work for you? I am marking this reply as, “Solution,” but let us know if this is not resolved. Thanks for posting your questions on the QuickSight Community!

paruanil · July 3, 2024, 5:54am

Hi,
I’m sort of new to all this, how do I check if have access to S3, Athena etc? Where do I start on them?