Only selecting specific columns when joining datasets

rohit_SB · December 19, 2023, 7:39am

Good day everyone! I’m looking to join two datasets (tables) together. The first table consists of all of the useful information I want. The second table consists of four columns that I need for my analysis. The datasets are an Amazon OpenSearch type of dataset and have been converted to SPICE for better visualising and faster query. The tables have been left joined.

However, here’s the issue. Table 1 consists of the Order ID’s along with other useful data and table 2 consists of the order items per order ID along with the supplier information that I need for my analysis. The issue here is that each ID is now counted N times when I’m calculating the total cost, profit, etc. The order items field is not necessary in my analysis. Is there a way to exclude columns when joining two SPICE datasets?

ErikG · December 19, 2023, 9:24am

Hi @rohit_SB
sure you can exclude columns.
grafik
But based on what you describe it will not solve the fact that you have duplicate rows after the join. Thats your problem, right?
BR

rohit_SB · December 20, 2023, 2:49am

Yup. For for further info, each order ID can consist of several order items, which can have multiple ID’s. How do I ensure that the info for Order ID doesn’t get replicated? As of now, each row consists of 1 item with its ID, order ID, client ID, etc.

David_Wong · December 20, 2023, 3:32am

Hi @rohit_SB,

You have to address the duplicates at the analysis level by using level-aware calculations. Take a look at use case #2 in this article.

rohit_SB · December 20, 2023, 3:32am

@ErikG Here’s an example:

This table consists of two merged datasets: order ID and order items. A single order ID can consist of various order items. The issue is that the only value i really need is the single GMV value from the gmv column (1210.52). However, since the database is built of several rows, it’s causing the GMV to get repeated which causes the actual value to be significantly higher. How do I solve this? I can’t separate the tables as I need some of the key information that lies in the order items dataset, so that’s not possible either. Could a possible solution be to exclude the item ID column altogether to ensure it doesn’t cause the order ID to get repeated?

Edit: @David_Wong Would LAC help with this issue where the key metric I need has been duplicated? Thanks a lot!

David_Wong · December 20, 2023, 3:44am

@rohit_SB
Yes, LAC-A will help here. Excluding the item ID column won’t help because that won’t remove the rows that are introducing duplicates in your dataset.

rohit_SB · December 20, 2023, 7:12am

Using Example #2, LAC-A uses grouping to calculate the sum. The issue is that the value I need is a singular value of GMV from the column. How would I do that? Using a function like max?
Edit: Apologies, misread the case

sagmukhe · December 20, 2023, 7:46am

@rohit_SB - Yes, that would solve your case. As @David_Wong and @ErikG mentioned you need to use LAC with any of the aggregate function to eliminate the additional rows. Since the value of “gmv” is not going to change for a given Order ID, you can use any of the aggregate function i.e. min, max, avg. etc., to tackle this. The objective of using LAC is not to actually aggregate the value rather eliminating the unwanted rows which are coming up due to the varied granularity of the datasets. Hope this helps! Thank you!

rohit_SB · December 21, 2023, 7:24am

@ErikG @David_Wong @sagmukhe that was super helpful, thank you guys! i’ll mark the solution