Joining tables in Athena vs. in QuickSight

I’m using a SPICE dataset with an underlying Athena view. The view is fairly big (CUDOS) and includes a join with some internal tables. The end result is that I get 150 GB+ of records when ingestion succeeds, but most of the time it fails with a query timeout.

Will this be any better if the join is done within QS, and if so, why? Is there any other way to reduce the dataset size? I have already reduced the fields to the minimum needed, and only the last 7 months of records are retrieved from CUR.

Thanks,

Hello @Leon,

I do not think performing the joins inside QuickSight will improve performance, but it might be worth trying if the problem is that the joins in the Athena view are not optimized.

Have you looked at the views and tried some of the Athena performance optimizations? You can find some ideas in this blog.

Hope this helps!


@andres007 Just tried to do the joins within QS and got the same query timeout error.

Strangely, when I updated the query in Athena with LIMIT 100000000 at the end, the refresh succeeded, and with less ingestion time - 46 mins. Previously, without the LIMIT, it took about 1.4 hours to ingest 125,545,000 records and rarely succeeded.

@Leon

Athena writes the results to S3 once a SQL query is submitted and completed. It produces a single CSV. Based on the information you have provided, the final output file takes time to write. QuickSight then starts the ingestion of this data into SPICE.

How long does the Athena query take to complete for the SQL which you run?
Apart from that, performance tuning is more relevant when dealing with large amounts of data (Performance tuning in Athena - Amazon Athena).

Regarding the failing ingestion:
When the data refresh into SPICE happens, does the Athena query first succeed?
What is the error message for the SPICE dataset which failed to refresh?


hi @Koushik_Muthanna,

I tried to run the query in Athena by itself and got a run time of 2 h 29 min along with the message “This query ran against the “test” database, unless qualified by the query.”.

I guess in this case raising the DML quota is no use, as the query itself will time out anyway?

Hi @Leon - As per the details, Athena is taking more than 2 hours to return the result. Since the data volume is high, I would suggest creating a Glue ETL job (using Spark), storing the final result set in a new Athena table in Parquet format, and then doing the reporting on that final Athena table.

Also, if you can share the SQL query which is taking time, some optimization techniques can possibly be explored.
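As a lighter-weight alternative to a full Glue job, an Athena CTAS statement can materialize the heavy result as Parquet in one step. A minimal sketch (the table name cur_summary and the S3 path are hypothetical, and the column list is only an illustrative subset of the real query):

```sql
-- Materialize the expensive result once as Parquet, then point the
-- QuickSight dataset at this table instead of the view.
CREATE TABLE cur_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/cur_summary/',  -- placeholder path
  partitioned_by = ARRAY['billing_period']              -- partition column must be last in the SELECT
) AS
SELECT
  line_item_usage_account_id,
  line_item_product_code,
  sum(line_item_unblended_cost) AS unblended_cost,
  bill_billing_period_start_date AS billing_period
FROM cur
WHERE bill_billing_period_start_date >= date_trunc('month', current_timestamp) - INTERVAL '7' MONTH
GROUP BY line_item_usage_account_id, line_item_product_code, bill_billing_period_start_date;
```

The SPICE refresh would then read compact columnar Parquet instead of re-running the full join every time; the table can be rebuilt on a schedule (e.g. by a scheduled query or a small Glue job).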

Regards - Sanjeeb

Can you elaborate some more on this?

The following is the query, which is largely based on the CUDOS query with a few more joins with tables coming from S3.

SELECT
  UPPER(service.application_name) application_name
, service.account_name
, service.environment "Environment"
, "line_item_unblended_cost"
, "line_item_usage_start_date"
, (CASE WHEN (service.is_active = false) THEN CONCAT(service.account_alias, '(inactive)') ELSE service.account_alias END) "AWS Tenancy accounts"
, "bill_billing_period_start_date" "billing_period"
, (CASE WHEN ("date_trunc"('month', "line_item_usage_start_date") >= ("date_trunc"('month', current_timestamp) - INTERVAL  '3' MONTH)) THEN "date_trunc"('day', "line_item_usage_start_date") ELSE "date_trunc"('month', "line_item_usage_start_date") END) "usage_date"
, "bill_payer_account_id" "payer_account_id"
, "line_item_usage_account_id" "linked_account_id"
, "line_item_resource_id"
, "cost_category_a_m_s_l_z" "cost_category_ams"
, "resource_tags_user_billing" "Tag_Billing"
, "line_item_line_item_type" "charge_type"
, (CASE WHEN ("line_item_line_item_type" = 'DiscountedUsage') THEN 'Running_Usage' WHEN ("line_item_line_item_type" = 'SavingsPlanCoveredUsage') THEN 'Running_Usage' WHEN ("line_item_line_item_type" = 'Usage') THEN 'Running_Usage' ELSE 'non_usage' END) "charge_category"
, (CASE WHEN ("savings_plan_savings_plan_a_r_n" <> '') THEN 'SavingsPlan' WHEN ("reservation_reservation_a_r_n" <> '') THEN 'Reserved' WHEN ("line_item_usage_type" LIKE '%Spot%') THEN 'Spot' ELSE 'OnDemand' END) "purchase_option"
, "line_item_product_code" "product_code"
, "product_product_name" "product_name"
, (CASE WHEN (("bill_billing_entity" = 'AWS Marketplace') AND (NOT ("line_item_line_item_type" LIKE '%Discount%'))) THEN "Product_Product_Name" WHEN ("product_servicecode" = '') THEN "line_item_product_code" ELSE "product_servicecode" END) "service"
, "product_product_family" "product_family"
, "line_item_usage_type" "usage_type"
, "line_item_operation" "operation"
, "line_item_line_item_description" "item_description"
, "line_item_usage_amount" "line_item_usage_amount"
, "pricing_unit" "pricing_unit"
, "product_region" "region"
, (CASE WHEN ((("line_item_usage_type" LIKE '%Spot%') AND ("line_item_product_code" = 'AmazonEC2')) AND ("line_item_line_item_type" = 'Usage')) THEN "split_part"("line_item_line_item_description", '.', 1) ELSE "product_instance_type_family" END) "instance_type_family"
, (CASE WHEN ((("line_item_usage_type" LIKE '%Spot%') AND ("line_item_product_code" = 'AmazonEC2')) AND ("line_item_line_item_type" = 'Usage')) THEN "split_part"("line_item_line_item_description", ' ', 1) ELSE "product_instance_type" END) "instance_type"
, (CASE WHEN ((("line_item_usage_type" LIKE '%Spot%') AND ("line_item_product_code" = 'AmazonEC2')) AND ("line_item_line_item_type" = 'Usage')) THEN "split_part"("split_part"("line_item_line_item_description", ' ', 2), '/', 1) ELSE "product_operating_system" END) "platform"
, "product_tenancy" "tenancy"
, "product_physical_processor" "processor"
, "product_processor_features" "processor_features"
, "product_database_engine" "database_engine"
, "product_group" "product_group"
, "product_from_location" "product_from_location"
, "product_to_location" "product_to_location"
, "product_current_generation" "current_generation"
, "product_volume_api_name" "product_volume_api_name"
, "bill_billing_entity" "billing_entity"
, "sum"((CASE WHEN ("line_item_line_item_type" = 'SavingsPlanCoveredUsage') THEN "line_item_usage_amount" WHEN ("line_item_line_item_type" = 'DiscountedUsage') THEN "line_item_usage_amount" WHEN ("line_item_line_item_type" = 'Usage') THEN "line_item_usage_amount" ELSE 0 END)) "usage_quantity"
, "sum"("line_item_unblended_cost") "unblended_cost"
, "sum"((CASE WHEN ("line_item_line_item_type" = 'SavingsPlanCoveredUsage') THEN "savings_plan_savings_plan_effective_cost" WHEN ("line_item_line_item_type" = 'SavingsPlanRecurringFee') THEN ("savings_plan_total_commitment_to_date" - "savings_plan_used_commitment") WHEN ("line_item_line_item_type" = 'SavingsPlanNegation') THEN 0 WHEN ("line_item_line_item_type" = 'SavingsPlanUpfrontFee') THEN 0 WHEN ("line_item_line_item_type" = 'DiscountedUsage') THEN "reservation_effective_cost" WHEN ("line_item_line_item_type" = 'RIFee') THEN ("reservation_unused_amortized_upfront_fee_for_billing_period" + "reservation_unused_recurring_fee") WHEN (("line_item_line_item_type" = 'Fee') AND ("reservation_reservation_a_r_n" <> '')) THEN 0 ELSE "line_item_unblended_cost" END)) "amortized_cost"
FROM
  ((cur
INNER JOIN service_registration_request service ON (cur.line_item_usage_account_id = service.account_id))
INNER JOIN application_master am ON (service.application_name = am.name))
WHERE ((am.status = 'Approved') AND (("bill_billing_period_start_date" >= ("date_trunc"('month', current_timestamp) - INTERVAL  '7' MONTH)) AND (CAST("concat"("year", '-', "month", '-01') AS date) >= ("date_trunc"('month', current_date) - INTERVAL  '7' MONTH))))
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39

Hi @Leon - Thanks for the SQL. Are you saying this SELECT statement is taking 2 hours to run in Athena? Also, what is cur in the query - is it a table or a CTE?

If you can give the approximate data volume of each table we can also think of alternatives to select filtered data.

Regards - Sanjeeb

@Sanjeeb2022

If you can give the approximate data volume of each table we can also think of alternatives to select filtered data.

How do I do that?

Yes, this is the query that is taking 2+ hours and still causing the timeout. cur is the partitioned table whose location is the S3 bucket where the CUR data is stored. The query is based on the query from here: https://github.com/aws-samples/aws-cudos-framework-deployment/blob/main/cid/builtin/core/data/queries/cid/summary_view.sql.


Hi @Leon - Can you please extract the record count of the tables via select count(1) from Table Name? [Replace Table Name with each table used in the query.]

If the Athena query is taking 2+ hours, we definitely need to optimize this at the Athena level, not the QS level.
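One Athena-level thing worth checking (an assumption on my part, since the cur DDL isn't shown here): the predicate CAST(concat(year, '-', month, '-01') AS date) wraps the year/month partition columns in functions, which can prevent partition pruning and force a scan of every partition. Comparing the partition columns directly may help, for example:

```sql
-- Sketch, assuming year and month are string partition columns as in a
-- typical CUR table; the literal values are placeholders for a 7-month window.
WHERE (year = '2024')
   OR (year = '2023' AND CAST(month AS integer) >= 6)
```

Whether this applies depends on how the table is actually partitioned, so it's worth comparing the "data scanned" figure in the Athena console before and after.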

Regards - Sanjeeb


With the following query, I get 1,997,781,686:

select count(1)
FROM
  ((cur
INNER JOIN service_registration_request service ON (cur.line_item_usage_account_id = service.account_id))
INNER JOIN application_master am ON (service.application_name = am.name))
WHERE ((am.status = 'Approved') AND (("bill_billing_period_start_date" >= ("date_trunc"('month', current_timestamp) - INTERVAL  '7' MONTH)) AND (CAST("concat"("year", '-', "month", '-01') AS date) >= ("date_trunc"('month', current_date) - INTERVAL  '7' MONTH))))
  • Troubleshooting possibilities (Workshop Studio, search for “Timeout”)
  • 1,997,781,686 (~1.9 billion records): the SPICE limit is 1 TB or 1 billion records, whichever is reached first, so this will fail during the ingestion process anyway.

I will mark this as the ‘Solution’, as the data limits and query runtime have to be solved at the source before ingesting into SPICE.


Thanks @Leon. Can you please provide the results of the below queries?

select count(1) from cur;
select count(1) from service_registration_request;
select count(1) from application_master;

One observation: your service and am tables are joined together. Can we create a table which already has the service and am join done, and then bring that dataset together with cur?
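That pre-join could be sketched as a CTAS (the table name approved_services is made up for illustration):

```sql
-- Join the two dimension tables once, filtering to approved applications
-- here, so the big cur query only needs a single join afterwards.
CREATE TABLE approved_services AS
SELECT service.*
FROM service_registration_request service
INNER JOIN application_master am ON service.application_name = am.name
WHERE am.status = 'Approved';
```

If service_registration_request and application_master turn out to be small dimension tables, the gain is mainly a simpler query plan (one join against cur instead of two) rather than less data scanned.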

Regards - Sanjeeb

I get the following:
cur: 8,940,807,217
service_registration_request: 234
application_master: 118

You mean instead of joining the two tables separately with cur, joining with a single table which already has the two tables joined together? What kind of difference will that make?