How to Append Datasets

I upload Excel files as my dataset, but I will soon hit the Excel row limit. I tried to split and upload the files, but I do not see an option to append the files, only a join/merge option. Is there any way to upload two files and append them, or perhaps a better solution altogether?

I also tried using a helper column in all files and merging, but the second data file gets added on the right instead of being appended after the last row.

Hi @paruanil, it sounds like you’re facing limitations with Excel row limits and looking for a way to efficiently manage and append large datasets in Amazon QuickSight.

Here’s a detailed approach to address this issue:

Approach: Converting Excel Files to CSV and Using S3 with Amazon Athena

Step 1: Convert Excel Files to CSV

  1. Convert Each Excel File to CSV:
  • Open each Excel file and save it as a CSV file. This can be done manually or via a script if you have many files.
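The batch conversion mentioned above can be sketched with pandas (a minimal example, not part of the original reply — it assumes `.xlsx` files, that the `openpyxl` engine is installed, and that the data lives in the first sheet; the directory paths are placeholders):

```python
from pathlib import Path

import pandas as pd

def excel_dir_to_csv(src_dir: str, dst_dir: str):
    """Convert every .xlsx file in src_dir to a .csv file in dst_dir."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xlsx in sorted(Path(src_dir).glob('*.xlsx')):
        # Read the first sheet; pandas uses the openpyxl engine for .xlsx files
        df = pd.read_excel(xlsx)
        csv_path = out_dir / (xlsx.stem + '.csv')
        df.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```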

Step 2: Upload CSV Files to S3

  1. Create an S3 Bucket:
  • Create an S3 bucket (if you don’t already have one) to store your CSV files.
  2. Upload CSV Files:
  • Upload all the converted CSV files to the S3 bucket. You can organize them into folders if needed.
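If you have many files, the upload step can be scripted with boto3 (a hedged sketch, not part of the original reply — the bucket name and prefix are placeholders, and the actual upload requires AWS credentials to be configured; the key mapping is split out so it can be checked without touching S3):

```python
from pathlib import Path

def s3_keys_for(local_dir: str, prefix: str) -> dict:
    """Map each local CSV file to the S3 key it will be uploaded under."""
    return {
        str(p): f"{prefix.rstrip('/')}/{p.name}"
        for p in sorted(Path(local_dir).glob('*.csv'))
    }

def upload_csvs(local_dir: str, bucket: str, prefix: str = 'quicksight-data') -> None:
    """Upload every CSV in local_dir to s3://bucket/prefix/ (needs AWS credentials)."""
    import boto3  # imported here so the key mapping above works without boto3 installed
    s3 = boto3.client('s3')
    for path, key in s3_keys_for(local_dir, prefix).items():
        s3.upload_file(path, bucket, key)
```

Keeping all files under one prefix matters because the Athena table in the next step points at that prefix, so every file uploaded there is automatically part of the table.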

Step 3: Create an Athena Table

  1. Configure AWS Athena:
  • Go to the AWS Athena console.
  • Create a new database if you don’t have one.
  2. Create an Athena Table for CSV Files:
  • Write a DDL statement to create a table in Athena that reads from the S3 bucket. Here’s an example:

Pseudocode (Syntax may vary)

CREATE EXTERNAL TABLE IF NOT EXISTS my_dataset (
    column1 STRING,
    column2 STRING,
    column3 STRING,
    ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = ',',
    'field.delim' = ','
)
LOCATION 's3://your-bucket/path-to-csv-files/'
TBLPROPERTIES (
    'has_encrypted_data' = 'false',
    'skip.header.line.count' = '1'  -- skip the header row in each CSV file
);
  • Adjust the column definitions to match your data schema.
  3. Partitioning (Optional):
  • If your data is large and you want to improve query performance, consider partitioning the Athena table by a logical key, such as date (for example, adding a `PARTITIONED BY (dt STRING)` clause to the DDL and organizing the S3 folders accordingly).

Step 4: Use Athena Table in QuickSight

  1. Connect QuickSight to Athena:
  • In QuickSight, create a new dataset.
  • Choose Athena as the data source and select the database and table you created.
  2. SPICE (Optional):
  • Import the dataset into SPICE for faster query performance, and set up a refresh schedule so that new data landing in S3 is picked up automatically.

Benefits of This Approach

  • Scalability: You can manage and query large datasets without hitting row limits.
  • Automation: New data added to the S3 bucket will be immediately available in Athena and subsequently in QuickSight (with SPICE refresh if used).
  • Flexibility: You can append data easily by simply uploading new CSV files to the S3 bucket.

Alternative Simple Approach

If you prefer a simpler method and don’t need the scalability of Athena, you can directly append CSV files:

  1. Combine CSV Files Locally:
  • Use a script to combine your CSV files locally into a single large CSV file.
  2. Upload to QuickSight:
  • Upload this combined CSV file directly to QuickSight.

Example Script to Combine CSV Files (Python)

Pseudocode (Syntax may vary)

import glob

import pandas as pd

# Get a sorted list of CSV files (sorting keeps the append order predictable)
csv_files = sorted(glob.glob('path/to/csv/files/*.csv'))

# Read each file and stack the rows vertically; all files must share the same
# columns, and ignore_index avoids duplicated row labels in the result
combined_csv = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

# Export to a single CSV file
combined_csv.to_csv('path/to/output/combined.csv', index=False)

By following these approaches, you can efficiently manage large datasets and ensure smooth operation within Amazon QuickSight.

Did this solution work for you? I am marking this reply as “Solution,” but let us know if this does not resolve the issue. Thanks for posting your question on the QuickSight Community!


Hi,
I’m sort of new to all this. How do I check if I have access to S3, Athena, etc.? Where do I start with them?