Jump to content

Featured Replies

Posted

KQL best practices for joins and filters

 

For Power BI developers but not only

 

 

 

 

Summary

 

 

Many users who try ADX in direct query mode encounter errors right away.

 

The errors complain about lack of memory.

 

If the tables are small enough, it may work but still performance will not be as advertised on TV.

 

The reason in most cases is the behavior of joins in ADX as they are created by PBI.

 

In this article I’ll show different approaches to joining tables as used by PBI for related tables or as can be expressed in KQL in general.

 

I created a special table in the help cluster with 31 million rows that is big enough to demonstrate the performance differences between the variations.

 

Worst vanilla case

 

 

All the examples use a model with three tables. One is a fact and two are dimensions, for dates and customers.

 

In PBI the model will look like this:

 

 

 

mediumvv2px400.png.2a03bcf4d604e676cc086c0b220888b8.png

 

 

 

If you try to build a report with two slicers and a simple table, you’ll see this error:

 

mediumvv2px400.png.e43b8df7d5ca7ce71a5bb01373b4deb6.png

 

 

 

This is the situation in the attached example worst case.pbix.

 

I forced the relationships to be 1:M and not M:M because the experienced PBI practitioner will try to avoid M:M and will find away to do that. Actually````````` 1:M relationship is part of the problem.

 

The KQL statement created by PBI will look like (Simplified for readability)

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

// Lacks memory

 

DuplicateFact

 

| join kind=leftouter Dates on $left.DateKey == $right.Datekey

 

| join kind=leftouter Customers on CustomerKey

 

| where RegionCountryName == "Canada"

 

| where FiscalYear == 2008

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

You can run these examples by following the link to web, and see the error.

 

If we limit the size of the fact table by ~ 50% it will run but with very high memory consumption.

 

 

 

 

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

// Very high memry with limited size fact table

 

DuplicateFact

 

| where SalesAmount > 100

 

| join kind=leftouter Dates on $left.DateKey == $right.Datekey

 

| join kind=leftouter Customers on CustomerKey

 

| where RegionCountryName == "Canada"

 

| where FiscalYear == 2008

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

Statistics :

 

mediumvv2px400.png.104fea79b317a01f74bcddc027e3a5b7.png

 

 

 

 

 

First improvement – inner join

 

 

A simple change will make things better.

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

// High memory but still works with inner join

 

DuplicateFact

 

| join kind=inner Dates on $left.DateKey == $right.Datekey

 

| join kind=inner Customers on CustomerKey

 

| where RegionCountryName == "Canada"

 

| where FiscalYear == 2008

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

Stats:

 

mediumvv2px400.png.78fedaebdc2033a17e649a3ad49b8fc5.png

 

 

 

Notice the high memory utilization but it works.

 

In PBI, you can get inner joins in one of two ways:

 

  • M:M relationships with single direction filtering.
  • 1:M relationships with assume referential integrity checked.

 

Both ways are acceptable but you should avoid leftouter or rightouter joins.

 

See the attached file referential integrity.pbix

 

The query still consumes 3.46 seconds of CPU which is pretty high.

 

If you try to move the position of the filters , you’ll see no change because with inner joins, we know to move the filters during optimization

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

// Same memory as before

 

DuplicateFact

 

| join kind=inner (Dates | where FiscalYear == 2008) on $left.DateKey == $right.Datekey

 

| join kind=inner (Customers | where RegionCountryName == "Canada") on CustomerKey

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

 

 

Using lookup instead of join

 

By using lookup instead of join we make the real difference

 

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

DuplicateFact

 

| lookup kind=inner Customers on CustomerKey

 

| lookup kind=inner Dates on $left.DateKey == $right.Datekey

 

| where RegionCountryName == "Canada"

 

| where FiscalYear == 2008

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

mediumvv2px400.png.429df28ad9accfaf4b4495c7b88723bb.png

 

 

 

 

 

Notice the memory utilization going down from 1.15GB to 20.38MB

 

 

 

 

 

 

 

If you don’t make sure that the joins will be inner joins, CPU cost will more than double.

 

 

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

Customers

 

| join kind=rightouter hint.strategy=broadcast

 

(Dates

 

| join kind=rightouter hint.strategy=broadcast DuplicateFact on $left.Datekey==$right.DateKey)

 

on CustomerKey

 

| where CityName =="Paris"

 

| where FiscalYear ==2007

 

| summarize count(),sum(SalesAmount) by Education

 

 

 

mediumvv2px400.png.8c4a0d8baf852d574f028da2d97f651b.png

 

 

 

 

 

 

 

 

 

How to achieve the lookup effect from PBI?

 

 

We can’t convince PBI to generate lookup instead of joins, but we can get the same effect.

 

First, we need to understand what the magic behind lookup is.

 

LargeTable kind=inner lookup Smalltable on key1

 

Is executed as:

 

Smalltable | join kind=inner hint.strategy=broadcast LargetTable on key1

 

The reverse order of the tables + the hint are making the difference.

 

A full equivalent example not using lookup is:

 

Execute in [Web] [Desktop] [cluster('help.kusto.windows.net').database('ContosoSales')]

 

Customers

 

| join hint.strategy=broadcast kind=inner

 

(Dates

 

| join hint.strategy=broadcast kind=inner DuplicateFact on $left.Datekey == $right.DateKey)

 

on CustomerKey

 

| where RegionCountryName == "Canada"

 

| where FiscalYear == 2008

 

| summarize sum(SalesAmount) by StateProvinceName, FiscalHalfYear

 

 

 

mediumvv2px400.png.e4479bd850cc27dfe9596b91103799df.png

 

 

 

 

 

Getting this query pattern in PB is achieved by adding a setting to the 4th parameter in the AzureDataExplorer.Contents function in each query for the dimensions.

 

= AzureDataExplorer.Contents("help", null, null, [isDimension=true])

 

See the attached file is dimension true.pbix

 

Continue reading...

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...