How to Append Datasets

I upload excel files as my dataset, but I am soon to run out of excel row limit. I tried to split and upload the files, but I do not see the option to append the files, only join/merge option. Is there any solution so that I can upload two files and append them perhaps? Or may be another better solution?

I also tried to use a helper column in all files and merge, but the second data file gets add on the right instead of getting added after the last row.

Hi @paruanil, it sounds like you’re facing limitations with Excel row limits and looking for a way to efficiently manage and append large datasets in Amazon Quick Sight.

Here’s a detailed approach to address this issue:

Approach: Converting Excel Files to CSV and Using S3 with Amazon Athena

Step 1: Convert Excel Files to CSV

  1. Convert Each Excel File to CSV:
  • Open each Excel file and save it as a CSV file. This can be done manually or via a script if you have many files.

Step 2: Upload CSV Files to S3

  1. Create an S3 Bucket:
  • Create an S3 bucket (if you don’t have one already) to store your CSV files.
  1. Upload CSV Files:
  • Upload all the converted CSV files to the S3 bucket. You can organize them in folders if needed.

Step 3: Create an Athena Table

  1. Configure AWS Athena:
  • Go to the AWS Athena console.
  • Create a new database if you don’t have one.
  1. Create an Athena Table for CSV Files:
  • Write a DDL statement to create a table in Athena that reads from the S3 bucket. Here’s an example:

Pseudocode (Syntax may vary)

CREATE EXTERNAL TABLE IF NOT EXISTS my_dataset (
    column1 STRING,
    column2 STRING,
    column3 STRING,
    ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = ',',
    'field.delim' = ','
)
LOCATION 's3://your-bucket/path-to-csv-files/'
TBLPROPERTIES ('has_encrypted_data'='false');
  • Adjust the column definitions to match your data schema.
  1. Partitioning (Optional):
  • If your data is large and you want to improve query performance, consider partitioning the Athena table by a logical key, such as date.

Step 4: Use Athena Table in Quick Sight

  1. Connect Quick Sight to Athena:
  • In Quick Sight, create a new dataset.
  • Choose Athena as the data source and select the database and table you created.
  1. SPICE (Optional):

Benefits of This Approach

  • Scalability: You can manage and query large datasets without hitting row limits.
  • Automation: New data added to the S3 bucket will be immediately available in Athena and subsequently in Quick Sight (with SPICE refresh if used).
  • Flexibility: You can append data easily by simply uploading new CSV files to the S3 bucket.

Alternative Simple Approach

If you prefer a simpler method and don’t need the scalability of Athena, you can directly append CSV files:

  1. Combine CSV Files Locally:
  • Use a script to combine your CSV files locally into a single large CSV file.
  1. Upload to Quick Sight:
  • Upload this combined CSV file directly to Quick Sight.

Example Script to Combine CSV Files (Python)

Pseudocode (Syntax may vary)

import pandas as pd
import glob

# Get list of CSV files
csv_files = glob.glob('path/to/csv/files/*.csv')

# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in csv_files])

# Export to a single CSV file
combined_csv.to_csv('path/to/output/combined.csv', index=False)

By following these approaches, you can efficiently manage large datasets and ensure smooth operation within Amazon Quick Sight.

Did this solution work for you? I am marking this reply as, “Solution,” but let us know if this is not resolved. Thanks for posting your questions on the Quick Sight Community!

Hi,
I’m sort of new to all this, how do I check if have access to S3, Athena etc? Where do I start on them?