Configuring auto_explain for Advanced Query Analysis

Thread starter: mahend
Debugging stored procedures in PostgreSQL can present challenges, particularly due to the lack of integrated debugging tools. Developers may encounter difficulties when dealing with complex code structures, such as nested loops and dynamic SQL, especially when addressing performance issues. In the absence of built-in debugging features like breakpoints, developers often resort to inserting manual debug statements or using RAISE NOTICE for logging. While effective, this approach can be time-consuming, particularly with large datasets.



Performance monitoring can also be challenging, as intermittent issues may not be easily isolated using standard tools like EXPLAIN. The complexity increases when procedures interact with other database objects, such as indexes and constraints. This process of analyzing code, testing performance, and refining logic requires a systematic and patient approach to identify and resolve performance bottlenecks.



However, by enabling the auto_explain module, you can gain detailed insights into query execution. Configuring the auto_explain server parameter allows you to log comprehensive details, including timing, buffer usage, and verbose plan information. Reviewing these logs provides valuable information for identifying performance bottlenecks, making the process of optimizing stored procedures more manageable and effective.



Detailed Explanation of auto_explain Configuration


The auto_explain module in PostgreSQL is a powerful tool for diagnosing query performance issues by automatically logging execution plans of slow queries. Properly configuring auto_explain can significantly enhance your ability to troubleshoot and optimize complex queries and stored procedures. Here’s a detailed guide on configuring auto_explain effectively:



Note: Adding auto_explain to the shared_preload_libraries parameter requires a restart of the PostgreSQL server to take effect.
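As a sketch of that step on a self-managed server (on Azure Database for PostgreSQL, shared_preload_libraries is changed through the portal's Server parameters blade instead of by editing configuration files):

```sql
-- postgresql.conf (self-managed server); requires a restart:
--   shared_preload_libraries = 'auto_explain'

-- Or, equivalently, from a superuser session:
ALTER SYSTEM SET shared_preload_libraries = 'auto_explain';
-- The change is written to postgresql.auto.conf and takes effect
-- only after the PostgreSQL server is restarted. If other libraries
-- are already preloaded, list them together, comma-separated.
```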




Once the auto_explain module is enabled by adding it to the shared_preload_libraries and restarting the PostgreSQL server, other auto_explain parameters can be adjusted dynamically. This means you can configure parameters such as auto_explain.log_min_duration and auto_explain.log_analyze without needing to restart the server. This flexibility allows you to tune the behavior of auto_explain in real-time based on your performance monitoring and diagnostic needs.
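A minimal sketch of this dynamic tuning for the current session (the 500 ms threshold is illustrative; these settings are superuser-only by default):

```sql
-- Optional: load the module for this session only
-- (not needed if it is already in shared_preload_libraries;
-- LOAD typically requires superuser privileges)
LOAD 'auto_explain';

-- Log plans only for queries running longer than 500 ms
SET auto_explain.log_min_duration = '500ms';

-- Include actual run-time statistics in the logged plans
SET auto_explain.log_analyze = on;
```

No restart is needed; the new values apply immediately to subsequent queries in the session.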




Customize behavior with the following settings to tailor the level of detail captured in the logs:



auto_explain.log_min_duration

  • Purpose: This parameter controls the minimum duration a query must take to be logged by auto_explain.
  • Usage:
    • Setting this to a specific value (in milliseconds) means that only queries exceeding this duration will be logged. For instance, setting it to 100 will log only queries that take longer than 100 milliseconds.
    • If you set it to 0, every query will be logged, regardless of how quickly it executes. This is useful in a development environment where you want to capture all activity, but in a production environment, you may want to set a higher threshold to focus on slower queries.

Notes: The parameter considers only the time spent executing the query (ExecutorRun); it does not include the time taken for parsing or planning. If a query is slow because of planning overhead, that will not be reflected in the auto_explain logs.


auto_explain.log_analyze

  • Purpose: When this is set to true, the logged plan includes the actual run-time statistics that EXPLAIN ANALYZE produces.
  • Details:
    • It records actual execution statistics for each node of the plan, including the time spent in, and rows produced by, every operation.
    • This helps in identifying where time is being spent within the query execution, offering deeper insights than a standard EXPLAIN, which shows only the planner's estimates.
  • Consideration:
    • Resource Intensity: Enabling log_analyze can be resource-intensive. Since EXPLAIN ANALYZE requires the database to time each operation, it adds overhead to the execution of each query.
    • All Queries Executed with EXPLAIN ANALYZE: When log_analyze is enabled, PostgreSQL runs all queries with EXPLAIN ANALYZE, regardless of whether they meet the log_min_duration threshold. This is because the system cannot predict upfront how long a query will take. Only after the query completes does PostgreSQL compare its actual duration to the log_min_duration threshold. If the duration exceeds this threshold, the query, along with its runtime statistics, is logged. If not, the statistics are discarded, but the overhead of collecting them has already been incurred.
    • Overhead of Timing Operations: Collecting runtime statistics involves overhead, particularly due to system clock readings, which can vary depending on the system's clock source. This can add significant resource consumption, especially on high-traffic systems.

auto_explain.log_buffers

  • Purpose: When enabled, this logs buffer usage during query execution.
  • Details:
    • This setting helps you understand how much data is being read from or written to disk buffers, which is crucial for analyzing I/O performance.
    • For example, if a query shows high buffer usage, it may indicate that the query is performing a lot of disk I/O, which could be a performance bottleneck, especially if the data is not cached in memory.

auto_explain.log_timing

  • Purpose: Controls whether per-node timing information is collected when auto_explain.log_analyze is enabled. It is on by default and has no effect unless log_analyze is also enabled.
  • Details:
    • With timing enabled, the logged plan captures the time spent in each operation within the query execution, allowing you to pinpoint where delays are occurring.
    • This is particularly useful for complex queries where the execution time is not evenly distributed across different operations.
    • Disabling it still records actual row counts, but skips the system clock reads, substantially reducing the run-time overhead of log_analyze.
  • Considerations:
    • Performance Overhead: Detailed timing information introduces some performance overhead. The need to read the system clock frequently during query execution can affect overall query performance, particularly in high-throughput environments.
    • Increased Log Volume: Enabling this setting will generate more log data, which can impact log management and storage. This is especially relevant for production systems where log volume needs to be carefully managed.
    • Environment Suitability: It’s best to use this feature in development or testing environments rather than production, where the overhead and increased log volume might be less acceptable.

auto_explain.log_triggers

  • Purpose: This option logs details about triggers fired during query execution.
  • Details:
    • Triggers can have a significant impact on performance, especially if they involve complex operations or if they are fired frequently.
    • Logging trigger activity can help you understand their role in query performance, making it easier to optimize both the query and the associated triggers.

auto_explain.log_verbose

  • Purpose: When set to true, this provides a more detailed output of the execution plan.
  • Details:
    • The verbose output is equivalent to the VERBOSE option of EXPLAIN: it adds output column lists, schema-qualified object names, and other details omitted from a standard EXPLAIN output.
    • This level of detail is particularly useful for diagnosing complex queries where understanding the exact nature of the execution plan is critical.

auto_explain.log_wal

  • Purpose: This parameter logs information about Write-Ahead Logging (WAL) during query execution.
  • Details:
    • Captures details on WAL activity, which can be useful for understanding how much data is being written to the WAL and how it impacts performance.
    • Helps in diagnosing performance issues related to write operations and understanding the impact of transactions on WAL.

auto_explain.log_settings

  • Purpose: Includes relevant configuration parameters in the logged plan; specifically, planner-affecting settings whose values differ from their built-in defaults.
  • Details:
    • This parameter helps you track the configuration used for each query, making it easier to understand and reproduce the conditions under which specific performance characteristics were observed.
    • Useful for debugging issues and ensuring that the correct settings are applied consistently.
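If the parameters above should apply only to a particular database or role rather than server-wide, PostgreSQL's standard configuration scoping can be used. A sketch, run as a superuser; the names reporting and batch_etl are hypothetical:

```sql
-- Apply auto_explain settings to every new connection to the
-- "reporting" database (values shown are illustrative)
ALTER DATABASE reporting SET auto_explain.log_min_duration = '1s';
ALTER DATABASE reporting SET auto_explain.log_buffers = on;

-- Or only for sessions of a specific role
ALTER ROLE batch_etl SET auto_explain.log_min_duration = '250ms';
```

This keeps the instrumentation focused on the workload being diagnosed while leaving the rest of the server untouched.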

Example: Optimizing Query Performance Using auto_explain in PostgreSQL

Prerequisites

  • Log Analytics Workspace: Ensure you have a Log Analytics workspace created. If not, create one in the same region as your PostgreSQL server to minimize latency and costs.
  • Permissions: Ensure you have the necessary permissions to configure diagnostic settings for the PostgreSQL server and access the Log Analytics workspace.

To create a Log Analytics Workspace in Azure, you can refer to the official Microsoft documentation. Here’s a direct link to the guide that walks you through the process: Create a Log Analytics Workspace



Configure Diagnostic Settings

Navigate to Your PostgreSQL Server
  • In the left-hand menu, select All services.
  • Search for and select Azure Database for PostgreSQL servers.
  • Choose the PostgreSQL server you want to configure.

Open Diagnostic Settings
  • In the PostgreSQL server blade, scroll down to the Monitoring section.
  • Click on Diagnostic settings.

Add a Diagnostic Setting
  • Click on + Add diagnostic setting.
  • Provide a name for your diagnostic setting.

Select Logs and Metrics
  • In the Diagnostic settings pane, you’ll see options to configure logs and metrics.
  • Logs: Check the logs you want to collect. Typical options include PostgreSQLLogs (server logs) and ErrorLogs (error logs).
  • Metrics: Check the metrics you want to collect if needed.

Send to Log Analytics
  • Under Destination details, select Send to Log Analytics.
  • Choose your Log Analytics workspace from the drop-down menu. Make sure it is in the same region as your PostgreSQL server.
  • If you don’t see your workspace, ensure it’s in the same region and refresh the list.

Review and Save
  • Review your settings.
  • Click Save to apply the diagnostic settings.



Download Server Logs for Offline Analysis


Downloading server logs from the Azure portal provides several benefits. It enables offline analysis with local tools, supports long-term archival for compliance and backup, and facilitates detailed troubleshooting. Additionally, it allows for seamless integration with other datasets and simplifies sharing logs with team members or auditors, enhancing overall data management and accessibility.


Instructions for Downloading PostgreSQL Server Logs



In addition to configuring diagnostic settings to send logs to a Log Analytics workspace, you can download server logs directly from the Azure portal. Follow these steps:

Access Logs
  • Navigate to the Azure portal.
  • Go to your PostgreSQL server and select the Logs section under Monitoring.

Download Logs
  • Click on the Download option to export logs to your local system.


SQL Script: Table Creation and Data Insertion

-- Create the stores table
CREATE TABLE IF NOT EXISTS stores (
    id SMALLINT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    location VARCHAR(100)
);

-- Create the sales_stats table
CREATE TABLE IF NOT EXISTS sales_stats (
    store_id INTEGER NOT NULL,
    store_name TEXT COLLATE pg_catalog."default" NOT NULL,
    sales_time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    daily_sales DOUBLE PRECISION,
    customer_count INTEGER,
    average_transaction_value DOUBLE PRECISION,
    PRIMARY KEY (store_id, sales_time)
);

-- Insert sample data into the stores table
INSERT INTO stores (id, name, location)
VALUES
    (1, 'Main Street Store', '123 Main St'),
    (2, 'Mall Outlet', '456 Mall Rd'),
    (3, 'Downtown Boutique', '789 Downtown Ave');

-- Create or replace the procedure to generate sales statistics
CREATE OR REPLACE PROCEDURE public.generate_sales_stats(
    IN start_date TIMESTAMP WITHOUT TIME ZONE,
    IN end_date TIMESTAMP WITHOUT TIME ZONE
)
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO sales_stats (store_id, store_name, sales_time, daily_sales, customer_count, average_transaction_value)
    SELECT
        stores.id AS store_id,
        stores.name AS store_name,
        s1.time AS sales_time,
        random() * 10000 AS daily_sales,
        (random() * 200 + 50)::INTEGER AS customer_count,
        random() * 200 + 20 AS average_transaction_value
    FROM generate_series(
        start_date,
        end_date,
        INTERVAL '50 second'
    ) AS s1(time)
    CROSS JOIN (
        SELECT id, name FROM stores
    ) stores
    ORDER BY
        stores.id,
        s1.time;
END;
$$;

Configure the following server parameters:

  • auto_explain.log_analyze = ON
  • auto_explain.log_buffers = ON
  • auto_explain.log_min_duration = 5000
  • auto_explain.log_verbose = ON
  • auto_explain.log_wal = ON
  • auto_explain.log_timing = ON
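After saving the server parameters, the effective values can be verified from any session. A quick sanity check (the parameters are visible only once the module has been loaded):

```sql
-- Show all auto_explain settings currently in effect
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'auto_explain%'
ORDER BY name;

-- Or check a single parameter
SHOW auto_explain.log_min_duration;
```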



Execute Procedure

-- Execute the procedure to generate sales statistics
CALL generate_sales_stats('2010-01-01 00:00:00', '2024-01-01 23:59:59');

Access Logs

  • In the PostgreSQL server blade, look for the Monitoring section in the left-hand menu.
  • Click on Logs to open the Log Analytics workspace associated with your server. This may redirect you to the Log Analytics workspace where logs are stored.

Run KQL Query



-- Filter execution plans for queries taking more than 10 seconds
AzureDiagnostics
| where Message contains " plan"
| extend DurationMs = todouble(extract("duration: (\\d+\\.\\d+) ms", 1, Message))
| where isnotnull(DurationMs) and DurationMs > 10000
| project TimeGenerated, Message, DurationMs
| order by DurationMs desc

-- Filter logs for a specific instance using the LogicalServerName_s
AzureDiagnostics
| where LogicalServerName_s == "myServerInstance" // Filter for the specific instance
| where Message contains " plan"
| extend DurationMs = todouble(extract("duration: (\\d+\\.\\d+) ms", 1, Message))
| where isnotnull(DurationMs) and DurationMs > 10000
| project TimeGenerated, Message, DurationMs
| order by DurationMs desc

-- Queries with Sort nodes that have used disk-based sorting methods, where the actual total execution time is over 70% of the total statement time, and that have read or written at least X temporary blocks.
AzureDiagnostics
| where Message contains "execution plan" // Filter for logs containing execution plans
| extend
SortMethod = extract("Sort Method: (\\w+)", 1, Message), // Extract the sort method
TotalExecutionTimeMs = todouble(extract("total execution time: (\\d+\\.\\d+) ms", 1, Message)), // Extract total execution time
DiskSortTimeMs = todouble(extract("disk sort time: (\\d+\\.\\d+) ms", 1, Message)), // Extract disk-based sort time
TmpBlksRead = todouble(extract("tmp_blks_read: (\\d+)", 1, Message)), // Extract temporary blocks read
TmpBlksWritten = todouble(extract("tmp_blks_written: (\\d+)", 1, Message)) // Extract temporary blocks written
| where
isnotnull(TotalExecutionTimeMs) and
isnotnull(DiskSortTimeMs) and
DiskSortTimeMs > 0 and // Ensure disk-based sorting occurred
(DiskSortTimeMs / TotalExecutionTimeMs) > 0.70 and // Check if disk sort time is over 70% of total execution time
(TmpBlksRead > X or TmpBlksWritten > X) // Replace X with the threshold value for temporary blocks
| project
TimeGenerated,
Message,
SortMethod,
TotalExecutionTimeMs,
DiskSortTimeMs,
TmpBlksRead,
TmpBlksWritten
| order by TotalExecutionTimeMs desc


The first KQL query filters logs to find entries containing query execution plans, extracts and converts the duration of these queries to milliseconds, and then focuses on those that took longer than 10 seconds. It displays the timestamp, the log message, and the execution duration. For more details on the AzureDiagnostics table, refer to the Azure Monitor documentation.



Here is a detailed execution plan extracted from the log analytics workspace:


2024-08-08 20:14:51 UTC-66b52434.3e45-LOG: duration: 194804.905 ms plan:
Query Text: INSERT INTO miner_stats (miner_id, miner_name, stime, cpu_usage, average_mhs, temperature, fan_speed)
SELECT
miners.id AS miner_id,
miners.name AS miner_name,
s1.time AS stime,
random() * 100 AS cpu_usage,
(random() * 4 + 26) * miners.graphic_cards AS average_mhs,
random() * 40 + 50 AS temperature,
random() * 100 AS fan_speed
FROM generate_series(
'2000-10-14 00:00:00',
'2024-08-30 23:59:59',
INTERVAL '59 second') AS s1(time)
CROSS JOIN (
SELECT
id,
name,
graphic_cards
FROM miners
) miners
ORDER BY
miners.id,
s1.time;
Insert on public.miner_stats (cost=153024.30..204774.30 rows=0 width=0) (actual time=194804.893..194804.897 rows=0 loops=1)
Buffers: shared hit=39107638 read=97 dirtied=395027 written=399443, temp read=496743 written=497630
WAL: records=38317668 fpi=1 bytes=4253262922
-> Subquery Scan on "*SELECT*" (cost=153024.30..204774.30 rows=900000 width=76) (actual time=75088.036..113372.899 rows=38317668 loops=1)
Output: "*SELECT*".miner_id, "*SELECT*".miner_name, "*SELECT*".stime, "*SELECT*".cpu_usage, "*SELECT*".average_mhs, "*SELECT*".temperature, "*SELECT*".fan_speed
Buffers: shared hit=2, temp read=496743 written=497630
-> Result (cost=153024.30..191274.30 rows=900000 width=100) (actual time=75088.030..103714.013 rows=38317668 loops=1)
Output: miners.id, miners.name, s1."time", (random() * '100'::double precision), (((random() * '4'::double precision) + '26'::double precision) * (miners.graphic_cards)::double precision), ((random() * '40'::double precision) + '50'::double precision), (random() * '100'::double precision)
Buffers: shared hit=2, temp read=496743 written=497630
-> Sort (cost=153024.30..155274.30 rows=900000 width=70) (actual time=75088.021..90570.249 rows=38317668 loops=1)
Output: miners.id, miners.name, s1."time", miners.graphic_cards
Sort Key: miners.id, s1."time"
Sort Method: external merge Disk: 1249832kB
Buffers: shared hit=2, temp read=496743 written=497630
-> Nested Loop (cost=0.00..11281.25 rows=900000 width=70) (actual time=1621.592..12365.782 rows=38317668 loops=1)
Output: miners.id, miners.name, s1."time", miners.graphic_cards
Buffers: shared hit=2, temp read=28065 written=28065
-> Function Scan on pg_catalog.generate_series s1 (cost=0.00..10.00 rows=1000 width=8) (actual time=1621.564..2913.084 rows=12772556 loops=1)
Output: s1."time"
Function Call: generate_series('2000-10-14 00:00:00+00'::timestamp with time zone, '2024-08-30 23:59:59+00'::timestamp with time zone, '00:00:59'::interval)
Buffers: shared hit=1, temp read=28065 written=28065
-> Materialize (cost=0.00..23.50 rows=900 width=62) (actual time=0.000..0.000 rows=3 loops=12772556)
Output: miners.id, miners.name, miners.graphic_cards
Buffers: shared hit=1
-> Seq Scan on public.miners (cost=0.00..19.00 rows=900 width=62) (actual time=0.016..0.017 rows=3 loops=1)
Output: miners.id, miners.name, miners.graphic_cards
Buffers: shared hit=1



Lack of Support for Cancelled Queries:

One notable limitation of auto_explain is that it does not capture queries that are cancelled before they complete. Here’s what that means:

  • What It Logs: auto_explain is designed to log the execution plans and performance statistics of queries that finish running and meet specific criteria, such as a minimum execution time.
  • What It Misses: If a query is interrupted or cancelled—whether due to a timeout, user intervention, or some other reason—it won’t be logged by auto_explain. This means you won’t have detailed information about queries that were terminated prematurely, which can be crucial for diagnosing issues.
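A quick way to observe this limitation, sketched with a hypothetical statement_timeout (assuming auto_explain.log_min_duration = 1000, i.e. a 1-second threshold):

```sql
-- Cancel any statement that runs longer than 2 seconds
SET statement_timeout = '2s';

-- This would run for about 10 seconds, far beyond the 1-second
-- logging threshold, but it is cancelled after 2 seconds and no
-- plan appears in the log, because auto_explain logs only
-- queries that run to completion.
SELECT pg_sleep(10);
```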

Conclusion


Debugging and optimizing stored procedures in PostgreSQL, particularly with complex queries and large datasets, can be a daunting task. However, using tools like the auto_explain module significantly enhances the ability to diagnose performance issues by automatically logging detailed execution plans and runtime statistics. Configuring auto_explain to capture various metrics, such as query duration, buffer usage, and WAL activity, provides deep insights into the performance characteristics of queries.

By combining these detailed logs with additional strategies such as indexing, memory tuning, and using third-party monitoring tools, developers can effectively pinpoint and resolve performance bottlenecks. This methodical approach not only aids in optimizing stored procedures but also in maintaining overall database performance.



For additional insights on optimizing PostgreSQL performance, you can refer to our previous blog on optimizing query performance with work_mem.
