Best practices for working with large, disparate datasets?

Hello,

I have a question about general workflow and data structuring for my use case. Let me go into a bit of detail:

We are a fitness business that has customers who check in using passes or memberships. We are trying to answer various questions about our customers using data collected from our PoS system, which resides on AWS. We have Quicksight author access and nothing else (no direct database access, aside from the connection to Quicksight).

Basically, we have a customers table and four tables that relate to customers: transactions, check ins, passes/memberships, and geography/census data. These tables will help us create calculations to answer questions like: what is the lifetime value of a customer, what is a customer’s average check-ins per week, what is their most recent pass/membership purchased, that sort of thing.

When I first started using QuickSight to pull these answers out, my first instinct was to make an “omni-table” and just left join everything to customers. This makes creating the appropriate calculated fields very messy, as there is a lot of de-duplicating to do when doing aggregations. My second approach was to create individual datasets that join customers with each individual table for every area where we want to answer questions, do the calculations at the dataset level, and then join all of this back into customers so we can build our omni-table without the mess. Unfortunately, I discovered that QuickSight just redoes the dataset-level calculations on the newly joined data, so we are back to square one.

Your first question might be: why do we need to join everything together in the first place? The answer is that I may want to make a visual that shows me the relationship between, say, a customer’s favored day of the week to check in and their conversion velocity (the number of passes bought before buying a membership). The data for these two calculations resides in two different tables, and trying to calculate both in the same joined table becomes very cumbersome. And this is just one example; basically, I have questions that need to relate any of the related tables to any of the others, or even questions that need to draw from three different related tables.

My question to the QuickSight community is this: how would you best go about constructing and analyzing this data in QuickSight? Is the “omni-table” the right call, and I just need to get really good at accounting for all the records? Is there some other way that my inexperience isn’t letting me see? Or is QuickSight even the right tool for the job? Would connecting a different service be a better solution?

Thank you for hearing me out and thanks in advance for any responses.

Hi @B_Burgess,

If I’m understanding you correctly, in the first approach, you’re joining your tables first and then aggregating. In your second approach, you’re aggregating first and then joining the resulting aggregated tables. Is that right?

Like you said, the first approach requires a lot of deduplicating when performing aggregations. Did you have any success deduplicating the data using LAC-A functions?

With the second approach, you can create multiple datasets at different levels of granularity. For example, you can create a dataset that calculates the weekly average check-ins per customer and then join that dataset with your customers dataset. It’s “cleaner” than the first approach but you end up with more datasets to manage.
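To make that example concrete, a per-customer weekly average could be sketched roughly like the expression below, where checkinID and checkinDate are stand-ins for whatever your check ins table actually calls those columns: the distinct check-ins per customer divided by the number of distinct weeks in which that customer checked in.

distinct_count({checkinID}, [{customerID}]) / distinct_count(truncDate('WK', {checkinDate}), [{customerID}])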

You’re correct, those are the two approaches we’ve taken. I didn’t have much success using LAC-A functions to deduplicate, but that might just be my inexperience. I might take another crack at it soon, but it’s a daunting task.

For the second approach, what we found is that the aggregations from the “fine-grain” datasets are not passed in as values. Instead, it seems QuickSight joins the tables and then recomputes those aggregate fields on the joined data. I tested this by making the same visual in both the fine-grain dataset and the dataset that has everything joined in. Despite the visuals plotting the same fields from each dataset, the joined dataset had massively inflated values, even though the calculated fields came solely from the fine-grain dataset. See the pics below: the top pic is the joined dataset and the bottom is the fine-grain dataset (which is correct).


Unless I am doing something wrong when joining the datasets or creating the calculations, I don’t really know how to approach this problem. Any advice would be greatly appreciated.

Hi @B_Burgess,

Do you have a one-to-many relationship between the two tables that you’re joining? That would inflate your values.

Take a look at use case #2 in this article. It’s about using LAC-A functions to deduplicate data.

Thank you for linking this article, it is helping me think about deduplication in a much cleaner way. I started working on deduplicating the calculated fields in a single dataset this morning and things are going okay so far. I’ll just outline my thought process below so others can see and hopefully benefit.

We have a one-to-many relationship in every table that we are relating to our “customers” table. Specifically, we have the following tables we are working with:

transactions: a table that records the purchases each customer makes and when.

check ins: a table that records when customers visit our business.

passes: items customers buy and have attached to their account to be able to check in.

membership: same as passes, except memberships have a start and end date and persist after check-ins.

To use LAC-A functions to deduplicate the resulting joined table, I will need to add the unique identifiers from each joined table to the partition list, in addition to my desired partitioning fields.

For example, if I wanted to calculate the lifetime value of a single customer (sum of all successful transactions, partitioned by customer ID), I would need to use a function like the following:

LTVbyCustomer:

sum(ifelse({transactionStatus} = "success", {transactionAmount}, 0), [{customerID}, {passID}, {membershipID}, {checkinID}])

In the above, customerID is the desired partitioning field, while passID, membershipID, and checkinID are there for deduplication purposes.

I also noticed the article mentions that this method of deduplicating is recommended for LAC-A functions only. I have some calculations where I need the field to become a dimension in a visual, so I will need to use a LAC-W function. What precautions do I need to take? For instance, one dimension I need to calculate is the number of passes a customer buys before buying a membership. This “conversion velocity” number is very important to me for identifying customer segments, so it needs to be a dimension I can group by in a visual. Doing the above with an “Over” function and the PRE_FILTER argument doesn’t seem to yield any errors or incorrect values, so why is LAC-A recommended for deduplication and not LAC-W as well?

Do you have a transactionID in your dataset? The calculated field you need in this example should be something like this:

sum(min({transactionAmount}, [{transactionID}]))

You need to use the field which uniquely identifies your transaction in the partition. Sometimes it’s not one single field. Depending on your dataset, it could be a combination of multiple fields.
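For instance, if a transaction were only uniquely identified by the combination of transactionID and a (hypothetical) lineItemID, you would list both fields in the partition, something like:

sum(min({transactionAmount}, [{transactionID}, {lineItemID}]))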

Let’s say your dataset looks like this:

transactionID  passID  transactionAmount
1              11      100
1              12      100
2              21      300
2              22      300

The min function is what deduplicates the data. The min of transactionAmount grouped by transactionID gives you {100, 300}. QuickSight then performs the sum grouped by the dimension(s) in your visual. If you add the calculated field to a KPI, the sum of the resulting min values gives you 400.

Including the condition based on transactionStatus:

sum(min(ifelse({transactionStatus} = "success", {transactionAmount}, 0), [{transactionID}]))

If you use LAC-W instead, your calculated field would look like this:

minOver(ifelse({transactionStatus} = "success", {transactionAmount}, 0), [{transactionID}], PRE_AGG)

LAC-W is like adding a calculated column to your dataset without grouping, so your data ends up looking like this:

transactionID  passID  transactionAmount  minOver
1              11      100                100
1              12      100                100
2              21      300                300
2              22      300                300

If you then do a sum on top of that, you get 100 + 100 + 300 + 300 = 800.

To account for the duplicates, you have to divide by the number of times each transactionID appears.

minOver(ifelse({transactionStatus} = "success", {transactionAmount}, 0), [{transactionID}], PRE_AGG) /
countOver(ifelse({transactionStatus} = "success", {transactionID}, null), [{transactionID}], PRE_AGG)
transactionID  passID  transactionAmount  minOver divided by countOver
1              11      100                50
1              12      100                50
2              21      300                150
2              22      300                150

If you then do a sum on top of that, you get 50 + 50 + 150 + 150 = 400.

As you can see, it’s much easier to deduplicate data using LAC-A. However, in some cases you have no choice but to use LAC-W, e.g., if you need to use the calculated field as a dimension in your visual (LAC-A isn’t allowed as a dimension).
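As a very rough sketch of the conversion-velocity dimension you mentioned (passID and customerID are placeholders for your actual columns), you could count each customer’s distinct passes at PRE_AGG and group by the result; the “bought before the first membership” condition would still need an ifelse on your pass and membership date fields:

distinctCountOver({passID}, [{customerID}], PRE_AGG)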

Thank you, this is a great explanation of the deduplication problem and LAC-A and LAC-W solutions. I have been working on my dataset using the fully joined table (which is very big in SPICE but still under our limit), and these tips are helpful.

I am marking this as the solution for the time being.