DanyHoter
Filtering and visualizing Kusto data in local time
Best practices and latest optimizations
Summary
Datetime values in Kusto (aka ADX/KQL database in Fabric) are assumed to be in UTC.
There are good reasons why you should always keep it this way.
On the other hand, in many cases you want to visualize the datetime values in a specific time zone and filter the data using values expressed in local time.
Our documentation includes a note about converting datetime values from UTC to local time.
The note is correct, but applying it carelessly can lead to severe performance degradation.
Using datetime_utc_to_local over a large number of values can be very costly for two reasons:
- The function itself is very “expensive”
- Filtering on local time ranges is very inefficient because it will not use indices on the base datetime values and the query will probably scan the entire table.
We recently implemented optimizations that make some of these scenarios much more efficient.
This article shows examples that benefit from the optimizations and some cases that should still be avoided.
A separate section is dedicated to handling local time in Power BI.
Examples of using datetime_utc_to_local
The examples use the RawSysLogs table in the SampleLogs database on the help cluster.
The main metric I’ll use to compare the queries is the CPU seconds consumed by the engine.
All queries can be found in the attached file UtcToLocal.zip.
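Two conversion functions appear throughout: datetime_utc_to_local() and datetime_local_to_utc(). As a quick reminder of what they return, here is a one-line illustration of my own, using the same "NZ" time zone as the examples (not one of the measured queries):
// Quick illustration of the two conversion functions (zone "NZ", as in the examples below)
print utc_now = now(),
      nz_local = datetime_utc_to_local(now(), "NZ"),
      back_to_utc = datetime_local_to_utc(datetime_utc_to_local(now(), "NZ"), "NZ")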
Base query
// Base query using UTC time - 0.375 seconds
RawSysLogs
| where timestamp between(datetime(2023-1-2)..6h)
| extend current_tasks_count=tolong(fields.current_tasks_count)
| extend current_workers_count=tolong(fields.current_workers_count)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(),round(avg(active_workers_count),2),round(avg(current_workers_count),2),
round(avg(current_tasks_count),2) by bin(timestamp,1h)
The number of rows in this period is 3.2 million, and the base query uses 0.375 CPU seconds.
From the statistics, we can see that only 1 extent out of 102 is scanned, which means that the filter on timestamp is very effective in limiting the scanned data.
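If you want to reproduce these numbers, the row count for the window can be checked directly with a quick query of my own (not part of the comparison):
// Row count in the same 6-hour window (verification only, not one of the measured queries)
RawSysLogs
| where timestamp between(datetime(2023-1-2)..6h)
| count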
Filtering on UTC + forcing calculation of datetime_utc_to_local
// Filtering on UTC time + shifting to local only the filtered data (3.2 million rows) - 0.7 seconds
RawSysLogs
| where timestamp between(datetime(2023-1-3)..6h)
| extend Localtime=datetime_utc_to_local( timestamp,"NZ")
| extend current_tasks_count=tolong(fields.current_tasks_count)
| extend current_workers_count=tolong(fields.current_workers_count)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(),round(avg(active_workers_count),2),round(avg(current_workers_count),2),
round(avg(current_tasks_count),2) by bin(Localtime,1h)
This query takes 0.7 CPU seconds.
The difference is the cost of running the conversion function on all the filtered rows.
This cost is significant.
The number of scanned extents is the same as in the base query because the filtering is done using the actual timestamp.
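Since the NZ offset is a whole number of hours (UTC+12, or UTC+13 during daylight saving), a further variation of my own is to bin on the UTC timestamp and convert only the resulting bucket values, so the conversion function runs on a handful of rows instead of millions. The results are identical as long as the window does not cross a daylight-saving transition:
// Variation (not measured above): bin in UTC, convert only the hourly bucket values
// Valid when the zone offset is a whole number of hours and the window does not cross a DST change
RawSysLogs
| where timestamp between(datetime(2023-1-3)..6h)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(), round(avg(active_workers_count),2) by bin(timestamp,1h)
| extend Localtime=datetime_utc_to_local(timestamp,"NZ")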
Filtering on local time
// Filter on datetime shifted to local after optimization - 5.07 seconds
RawSysLogs
| where datetime_utc_to_local( timestamp,"NZ") between(datetime(2023-1-2)..6h)
| extend Localtime=datetime_utc_to_local( timestamp,"NZ")
| extend current_tasks_count=tolong(fields.current_tasks_count)
| extend current_workers_count=tolong(fields.current_workers_count)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(),round(avg(active_workers_count),2),round(avg(current_workers_count),2),
round(avg(current_tasks_count),2) by bin(Localtime,1h)
This query filters the data based on a value shifted to a specific time zone.
Until recently, this query took many times this amount of CPU.
We introduced an optimization that wraps the filter with another filter on the original timestamp, with some buffer around it, so that the results are not affected but the entire table is not scanned.
The cost of this query is 5.07 CPU seconds, and it scans 5 out of 102 extents.
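Conceptually, the optimization behaves as if a coarse pre-filter on the raw timestamp column had been wrapped around the local-time filter. The sketch below illustrates the idea with an arbitrary one-day buffer; the buffer the engine actually uses is internal and may differ:
// Illustration only: what the optimization effectively does (the 1d buffer is a made-up value)
RawSysLogs
| where timestamp between((datetime(2023-1-2) - 1d) .. (datetime(2023-1-2) + 6h + 1d)) // coarse pre-filter on the raw UTC timestamp
| where datetime_utc_to_local(timestamp,"NZ") between(datetime(2023-1-2)..6h)           // the original, exact local-time filter
// ... rest of the query unchanged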
We are considering limiting the buffer further to scan even less data.
Filtering on local time using a calculated column
// Worst case, filtering on a calculated column - 80 seconds
RawSysLogs
| extend Localtime=datetime_utc_to_local( timestamp,"NZ")
| where Localtime between(datetime(2023-1-2)..6h)
| extend current_tasks_count=tolong(fields.current_tasks_count)
| extend current_workers_count=tolong(fields.current_workers_count)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(),round(avg(active_workers_count),2),round(avg(current_workers_count),2),
round(avg(current_tasks_count),2) by bin(Localtime,1h)
The new optimization does not kick in in this case, so the entire table is scanned. The cost in this case is 80 CPU seconds.
Part of the high cost is running the conversion function on the entire dataset, which has ~900 million rows.
Shifting the time range from local to UTC
// Best case, shifting the range and filtering on UTC - 0.66 seconds
RawSysLogs
| where timestamp between(datetime_local_to_utc(datetime(2023-1-2),"NZ")..6h)
| extend Localtime=datetime_utc_to_local( timestamp,"NZ")
| extend current_tasks_count=tolong(fields.current_tasks_count)
| extend current_workers_count=tolong(fields.current_workers_count)
| extend active_workers_count=tolong(fields.active_workers_count)
| summarize count(),round(avg(active_workers_count),2),round(avg(current_workers_count),2),
round(avg(current_tasks_count),2) by bin(Localtime,1h)
This is the recommended version.
Instead of shifting the timestamp of every row in the table, only the boundary of the selected range is shifted, and the filter runs on the original timestamp column.
The time range is assumed to be in local time and is shifted to UTC.
The cost of this version is 0.66 CPU seconds.
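If you want to reuse this pattern interactively, the constants can be pulled out into let statements. This is a sketch of my own, equivalent to the query above and a step towards the Power BI function shown later:
// The recommended pattern with the range and time zone as let constants
let TimeZone = "NZ";
let StartLocal = datetime(2023-1-2);   // start of the range, expressed in local time
let Window = 6h;
RawSysLogs
| where timestamp between(datetime_local_to_utc(StartLocal, TimeZone) .. Window)
| extend Localtime = datetime_utc_to_local(timestamp, TimeZone)
| summarize count() by bin(Localtime, 1h)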
Summary of timings:
Case | CPU seconds | Ratio to base
--- | --- | ---
Base, all in UTC | 0.375 | 1
Base + shift to local for summarizing | 0.7 | 1.9
Filter on timestamp shifted from UTC to local | 5.07 | 13.5
Filter on a calculated local-time column | 80 | 213
Shift the range and filter on the base timestamp | 0.66 | 1.8
How to implement the best solution in Power BI
The best way to implement this practice in Power BI is to create a function that receives the time range as parameters, shifts it to UTC, filters the table, and shifts the timestamps of the filtered rows from UTC back to the same time zone.
The function QuerySyslogs can be run from the ContosoSales database on the help cluster.
The table used is the same table used in the previous examples.
QuerySyslogs(StartTimeLocal:datetime,Timespan:string="1h",TimeZone:string="NZ")
{
// Strings are easier to handle in Power BI, so the parameter is converted to a timespan
let Timespant=totimespan(Timespan);
database('SampleLogs').RawSysLogs
// Filtering on the start time shifted from local to UTC
| where timestamp between(datetime_local_to_utc(StartTimeLocal,TimeZone)..Timespant)
// Shifting the datetime values only on the filtered rows
| extend TimeLocal=datetime_utc_to_local(timestamp,TimeZone)
// Calculating the granularity based on the size of the time range
| extend BinSize=case(Timespant > 3d, 1d, Timespant > 12h, 1h, Timespant > 3h, 10m, 1m)
// Bin size as string is used in the visual
| extend BinSizeName=case(Timespant > 3d, "1d", Timespant > 12h, "1h", Timespant > 3h, "10m", "1m")
| summarize Logs=count(),FirstDate=min(timestamp),LastDate=max(timestamp) by bin(TimeLocal, BinSize),BinSizeName
}
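Using the signature above, a direct invocation of the function looks like this; the result set is what the Power BI visual consumes:
// Example call: a 12-hour window starting at midnight on January 2, 2023, New Zealand time
QuerySyslogs(datetime(2023-1-2), "12h", "NZ")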
The attached PBIX file uses this function and three dynamic parameters to select the starting point, the size of the range, and the time zone.
If you select different time zones, you’ll see the base range used changing accordingly.