SQL Being Generated from New Data Prep Experience Not What is Expected

We created a new Dataset using the new Data Prep Experience that only brings in 35 columns of the 74 in a table. Using that Dataset, we are trying to create a SPICE Dataset with only fifteen of the columns and with filters applied, but the SQL we see in Athena being generated is all 74 columns and no filtering and thus the build is failing with the following message so we do not understand why the SQL is not correct:

DATA_SET_SIZE_LIMIT_EXCEEDED

Error details:

Input size of secondary tables is over limit. Size limit of secondary tables is 21474836480. Actual size of secondary tables is 56031334944. (maxSize = 21474836480)

Hi. Thanks you for your question! :slight_smile: Many of our community Experts and folks from our team are on holiday right now. I asked “My Assistant” (universal chat) in Quick, and got this back. @Brett Could you review and validate the answer below on Monday after you return from holiday?

From “My Assistant”:

”Understanding the Problem

Based on your description, you:

  1. Created a dataset using the New Data Prep Experience that brings in 35 columns from a 74-column table

  2. Are trying to create a SPICE dataset with only 15 of those columns and filters applied

  3. But the SQL generated in Athena is querying all 74 columns with no filtering

The error indicates your secondary tables exceed the 20GB limit (actual size: ~52GB vs. limit: 20GB).

Root Cause

This appears to be a workflow configuration issue where the Select Columns and Filter steps in your Data Prep workflow are not being properly applied to the generated SQL query. The transformations you’ve configured in the visual interface aren’t translating into the actual query execution.

Troubleshooting Steps

1. Verify Your Workflow Steps

In the New Data Prep Experience, ensure you have:

  • An Input step connecting to your Athena table

  • A Select Columns step that explicitly selects only your 15 desired columns

  • A Filter step with your filter conditions applied

  • These steps should be connected in sequence

Important: Unselected columns in the Select Columns step should be excluded from subsequent steps.

2. Check Step Order and Connections

  • Verify that your Filter and Select Columns steps are before any final output

  • Ensure the workflow connectors (arrows) properly link each step

  • The Preview Pane should show only your 15 columns after the Select Columns step

3. Publish and Save the Dataset

  • After configuring your workflow, you must Save and Publish the dataset

  • Until published, transformations may not be fully applied to query generation

  • Once published in the new experience, you cannot switch back to legacy [106]

4. Consider Using the Legacy Experience

If the issue persists, you can switch to the legacy data preparation experience:

  • Click “Switch to legacy experience” button (top right of data prep page)

  • This allows you to verify if column selection and filtering work correctly there

  • Note: Once saved and published in the new experience, you cannot switch back [108]

5. Alternative: Use Custom SQL

As a workaround, you can use Custom SQL in your dataset creation:

  • When creating the dataset, choose “Use custom SQL”

  • Write a SQL query that explicitly selects only your 15 columns with WHERE clauses for filtering

  • This ensures the exact SQL you want is executed [102]

Example:


sql

SELECT column1, column2, ..., column15
FROM your_table
WHERE your_filter_conditions

Data Preparation Limits Context

Your error is specifically hitting the Secondary Tables Size limit of 20GB.

In the New Data Prep Experience:

  • Primary tables are those with maximum size in a workflow

  • All other tables are considered secondary tables

  • Secondary tables combined cannot exceed 20GB

Recommended Immediate Action

  1. Switch to Custom SQL as a temporary workaround to explicitly control which columns are queried

  2. Apply filters in the SQL to reduce row count

  3. If you need the visual workflow, contact AWS Support to report this as a potential bug where Select Columns and Filter steps aren’t being applied to the generated SQL”

Hi @TraderRef

In order to understand the issue you may need to provide the detais of the two datasets including the filters applied.

Regards,

Giri

OK, @Giridhar.Prabhu, I tried a different approach with this. I first used a data source to only bring in three columns and filtered those columns with two filters, and published it. The SQL shown in Athena queried all the columns of the data source and did not show any filtering, but the dataset did refresh successfully.

I then added the same data source, selected only fifteen columns, filtered those with the exact same filter as the first stream except I only used one of the filters from the first stream because that’s exactly what I needed to do. Then, I added an inner join between the the two streams on one of the like columns and published it. The SQL in Athena showed two queries, both choosing all the columns and no filtering, but after a period of time, I got the email that the dataset failed to refresh.

Here is a pic of the steps I took attempting to create the dataset, with the SQL generated for both queries at the end:

I tried the old data prep and it does seem to send the correct SQL to Athena. Also, when I create the dataset using custom SQL, I get the correct results, which returns just over 111 million rows. I need to do a UNION between the results of both streams so that it looks something like the following:

I forgot to mention earlier that I get results when I run a stream just to get the fifteen columns filtered on type_of_indicator does not equal 3. I also get results when I run a stream to get diagnosis_code_1 filtered on type_of_indicator does not equal 3 and anomaly_flag equals 1. However, when I try to join them on diagnosis_code_1, that is when it fails. I do wonder, though, if there is an issue because I am not getting DISTINCT diagnosis_code_1 in the stream querying for type_of_indicator does not equal 3 and anomaly_flag equals 1, and that is causing so many records. My custom SQL has DISTINCT diagnosis_code_1.

Hi @TraderRef

I don’t have much exposure creating datasets from S3 bucket. But I did try creating a new dataset using the new experience and compared to Custom SQL or old experience it performed badly (20 seconds in Custom SQL vs 1 minute 15 seconds using new dataset experience) with my data size being 6000 rows.

I would suggest you raise a ticket with AWS. They can review and suggest some options. Since this is new we may expect a few issues and fixes along the way.

I have provided a lot of detail about the process, so does anyone have any insight into this?

Hi @TraderRef,

Just checking back in since we haven’t heard from you in a bit. I wanted to see if the guidance shared earlier helped resolve your question, or if you found a solution in the meantime.

Now from what you have said, it sounds like what you are trying to do properly works in the old data prep experience compared to the new experience. Submitting a support ticket could give you some more insight into the type of error you are running into. To do that I would refer to this resource (Case management - AWS Support). Moreover, it should also be noted that the new data prep experience is still new and constantly being improved upon. From my experience, I have run into some bugs and errors that Quick is still ironing out.

So, my recommendation is to use the old data prep experience for the timing being. Quick still gives you access to use the old data prep experience for a reason and you know definitively that what you are trying to do works there.

If you still have any additional questions related to your initial post, feel free to share them. Otherwise, any update you’re able to provide within the next 3 business days would be helpful for the community.

Thank you

Hi @TraderRef,

Since I haven’t received any further updates from you, I’ll treat this inquiry as complete for now. If you have any additional questions, feel free to create a new post in the community and link this discussion for context.

Thank you

I have a case open with AWS Support to help address this issue. I have another case with them for a different issue, but both are related to the new Data Prep Experience, which really does not seem to be working as expected. I did read something about there is some functionality not yet available to GovCloud, so this might be some of what is happening with the issues I have encountered.