Effectively troubleshoot latency in SQL Server Transactional replication: Part 2

Arslanov · Jul 17, 2024

Let us continue our troubleshooting by checking threads in this part.

Step 4.1. Troubleshoot latency in Log Reader agent’s reader thread

Firstly, define the level of reader thread latency by running below query in Publisher server.
Code:
```
sp_replcounters
GO
```
The above shows reader thread replicating on average 115 transactions per second and more than 7.5mln transactions are waiting to be replicated to the distribution database. On average, transactions are waiting 134880secs to be replicated, which is high latency.

Run below in publisher server and find session id of Log Reader’ reader thread:

Code:

SELECT 
   SessionId    = s.session_id, 
   App          = ISNULL(s.program_name, N'')
    FROM sys.dm_exec_sessions s
WHERE s.program_name LIKE '%LogReader%'

Place the session id to the below Event session and create event session. Run the session for about 5mins:

Code:

CREATE EVENT SESSION [LogReaderMonitor] ON SERVER
  ADD EVENT sqlos.wait_completed(
      ACTION(package0.callstack)
         WHERE ([sqlserver].[session_id]=(123))), ---Change session id here
  ADD EVENT sqlos.wait_info_external(
      ACTION(package0.callstack)
         WHERE (([opcode]=('End')) AND ([sqlserver].[session_id]=(123)))), ---Change session id here
  ADD EVENT sqlserver.rpc_completed(
      ACTION(package0.callstack)
         WHERE ([sqlserver].[session_id]=(123)))  ---Change session id here
ADD TARGET package0.event_file(SET filename=N'C:\Temp\logreader_reader_track',max_file_size=(256),max_rollover_files=(5))
WITH (MAX_MEMORY=8192 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

Investigate the event file. For example, below you can confirm the activities with the same GUID with sequence ids. As you can see, initially, we are spending time for memory allocation and then sp_replcmds is finishing after 789 microseconds.
Note: duration of wait_info_external is in milliseconds while rpc_completed is in microseconds.

If wait time is high compared to CPU time, check wait type and troubleshoot accordingly. For example, on the above example we faced MEMORY_ALLOCATION_EXT wait_type but duration is 0. So, we are not waiting.

If CPU time is higher, this means log thread is running but latency is being observed because you have high load. High load can be caused by several causes:

Large batch of replicated transactions: large batch of transactions are the main cause of latency in reader thread performance. Check number of commands and transactions in agent statistics from verbose logs we obtained in Step 3.1.b. If the number of commands is significantly high compared to the number of transactions, it is possible that large transactions are being replicated. For example, as below:

If Reader Latency is caused by large number of pending commands, waiting for the Log Reader to catch up may be the best short-term solution. Long-term options include replicating batches during non-peak time.

Large number of non-replication transactions: A transaction log with a high percentage of non-replicated transaction will cause latency as the Log Reader scans over transaction to be ignored. You can check whether this problem exists by looking at Log Reader agent history we checked in Step 3.1.a. For example, in the below log reader history, we can see more than 5mln rows are being scanned but only 142 rows have been marked for replication.
In this case, ensure constant transaction log truncation and try to perform maintenance activities offline.

High number of VLFs: A large number of Virtual Log Files (VLFs) can contribute to long running read times. For the number of VLFs, execute the following command. Counts in 100K+ may be contributing to Log Reader Reader-Thread performance problems.
SELECT COUNT (DISTINCT vlf_sequence_number) FROM sys.dm_db_log_info ( PublisherDBID )

Step 4.2. Troubleshoot latency in Log Reader agent’s writer thread

By using log reader’s history log (refer to Step 3.1.a), you can get last transaction sequence number and delivery rate, latency information. If you do not observe latency in reader thread (Step 4.1), this means the latency rate is mainly by writer thread:

You can use below command to check transaction (xact_seqno) at where we are currently:
Code:
```
-- Get publisher db id
USE distribution
GO

SELECT * FROM dbo.MSpublisher_databases
-- Get commands we are at
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
GO

BEGIN TRAN
USE distribution
GO

EXEC Sp_browsereplcmds
@xact_seqno_start = 'xact_seqno',
@xact_seqno_end = 'xact_seqno',
@publisher_database_id = PUBLISHERDB_ID
COMMIT TRAN
GO
```

Run below in publisher server and find session id of Log Reader’ writer thread:

Code:

SELECT 
       SessionId    = s.session_id, 
       App            = ISNULL(s.program_name, N'')
FROM sys.dm_exec_sessions s
WHERE s.program_name LIKE '%LogReader%'

Check whether any blocking happening with this session:

sp_who2

Then run the below query in distributor server by changing the session id to log reader session id:

Code:

CREATE EVENT SESSION [logreader_writer_track] ON SERVER
  ADD EVENT sqlos.wait_completed(
    ACTION(package0.callstack)
    WHERE ([sqlserver].[session_id]=(64))),  -- Change session id to log reader writer session id
  ADD EVENT sqlos.wait_info_external(
    ACTION(package0.callstack)
    WHERE (([opcode]=('End')) AND ([sqlserver].[session_id]=(64)))), -- Change session id to log reader writer session id
  ADD EVENT sqlserver.sp_statement_completed(  
ACTION(package0.event_sequence,sqlserver.plan_handle,sqlserver.session_id,sqlserver.transaction_id)
    WHERE ([sqlserver].[session_id]=(64)))  -- Change session id to log reader writer session id
ADD TARGET package0.event_file(SET filename=N'C:\Temp\logreader_writer_track',max_file_size=(256),max_rollover_files=(5))
WITH (MAX_MEMORY=8192 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)

Investigate the collected event logs. For example, below you can confirm the activities with the same GUID with sequence ids. Each of this activity is writer threads attempt to write replication logs to distribution database. As you can see, we spent very little time (nearly 0) initially for MEMORY_ALLOCATION_EXT, then select statement is finishing after 39 microseconds.

Compare wait time (duration – cpu_time) and cpu_time. If wait time is high compared to CPU time, check wait_type and troubleshoot accordingly. For example, above we face MEMORY_ALLOCATION_EXT wait_type. If CPU time is high, you can investigate the execution plan by using corresponding plan_handle for time consuming query which you can get from above event logs:

SELECT * FROM sys.dm_exec_query_plan(PLAN_HANDLE)

Step 4.3. Troubleshoot latency in Distribution agent’s reader thread

To find the session id for Distribution agent, you need to find if it is a Push or Pull Subscription. In the case of push subscription, run below command in distributor server. In the case of Pull subscription, run below command in subscriber server.

Code:

SELECT 
       SessionId    = s.session_id, 
       App            = ISNULL(s.program_name, N'') 
FROM sys.dm_exec_sessions s LEFT OUTER JOIN sys.dm_exec_connections c ON (s.session_id = c.session_id)
WHERE (select text from sys.dm_exec_sql_text(c.most_recent_sql_handle)) LIKE '%sp_MSget_repl_command%'

Check whether any blocking happening with this session:

sp_who2

Then run the below query in distributor server by changing the session id to distribution agent session id:

Code:

CREATE EVENT SESSION [distributor_writer_track] ON SERVER
ADD EVENT sqlos.wait_completed(
    ACTION(package0.callstack)
    WHERE ([sqlserver].[session_id]=(64))),  -- Change session id to dist agent session id
ADD EVENT sqlos.wait_info_external(
    ACTION(package0.callstack)
    WHERE (([opcode]=('End')) AND ([sqlserver].[session_id]=(64)))), -- Change session id to dist agent session id
ADD EVENT sqlserver.sp_statement_completed(
   ACTION(package0.event_sequence,sqlserver.plan_handle,sqlserver.session_id,sqlserver.transaction_id)
    WHERE ([sqlserver].[session_id]=(64)))  -- Change session id to dist agent session id
ADD TARGET package0.event_file(SET filename=N'C:\Temp\distributor_reader_track',max_file_size=(256),max_rollover_files=(5))
WITH (MAX_MEMORY=8192 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)

Investigate the collected event logs. For example, below you can confirm the activities with the same GUID with sequence ids. Each of this activity is reader threads attempt to read replication logs. As you can see, we spent very little time (nearly 0) initially for MEMORY_ALLOCATION_EXT, then select statement is finishing after 29 microseconds.
Compare wait time (duration – cpu_time) and cpu_time. If wait time is high compared to CPU time, check wait_type and troubleshoot accordingly. For example, above we face MEMORY_ALLOCATION_EXT wait_type. If CPU time is high, you can investigate the execution plan by using corresponding plan_handle for time consuming query which you can get from above event logs:

SELECT * FROM sys.dm_exec_query_plan(PLAN_HANDLE)

High CPU time can often mean there is a high load which can be caused by large batch of replicated transactions. You can compare the number of commands and transactions by using the below query.

Code:

SELECT count(c.xact_seqno) as CommandCount, count(DISTINCT t.xact_seqno) as TransactionCount
FROM MSrepl_commands c with (nolock)
LEFT JOIN msrepl_transactions t with (nolock)
on t.publisher_database_id = c.publisher_database_id and t.xact_seqno = c.xact_seqno
WHERE c.publisher_database_id = 1 --Change to target database id here

For the past days’ statistics, you can leverage below command:

Code:

USE distribution
select t.publisher_database_id, t.xact_seqno,
max(t.entry_time) as EntryTime, count(c.xact_seqno) as
CommandCount, count(DISTINCT t.xact_seqno) as TransactionCount
into #results
FROM MSrepl_commands c with (nolock)
LEFT JOIN msrepl_transactions t with (nolock)
on t.publisher_database_id = c.publisher_database_id
and t.xact_seqno = c.xact_seqno
GROUP BY t.publisher_database_id, t.xact_seqno
SELECT publisher_database_id
,datepart(year, EntryTime) as Year
,datepart(month, EntryTime) as Month
,datepart(day, EntryTime) as Day
,datepart(hh, EntryTime) as Hour
,sum(CommandCount) as CommandCountPerTimeUnit
,sum(TransactionCount) as TransactionCountPerTimeUnit
FROM #results
GROUP BY publisher_database_id
,datepart(year, EntryTime)
,datepart(month, EntryTime)
,datepart(day, EntryTime)
,datepart(hh, EntryTime)
ORDER BY publisher_database_id, Month, Day, Hour

As you see, I am executing one command per transaction making TransactionCount nearly equal to CommandCount.

Step 4.4. Troubleshoot latency in Distribution agent’s writer thread

Find the session id and App name for Distribution agent by inserting your publication name to WHERE clause below:

Code:

SELECT 
       SessionId    = s.session_id, 
       App            = ISNULL(s.program_name, N'')
    FROM sys.dm_exec_sessions s
WHERE s.program_name LIKE '%publish%'
GO

Check whether there is blocking for the above session id(s):

sp_who2

Create event session by inserting app name:

Code:

CREATE EVENT SESSION [distributor_writer_track] ON SERVER
ADD EVENT sqlos.wait_completed(
    ACTION(package0.callstack,sqlserver.session_id,sqlserver.sql_text)
    WHERE ([sqlserver].[client_app_name]=N'SQLVM4-TRANSACR_AdventureWorksLT_test_table_pub' AND [package0].[greater_than_uint64]([duration],(0)))),
ADD EVENT sqlos.wait_info_external(
    ACTION(package0.callstack,sqlserver.session_id,sqlserver.sql_text)
    WHERE ([sqlserver].[client_app_name]=N'SQLVM4-TRANSACR_AdventureWorksLT_test_table_pub' AND [package0].[greater_than_uint64]([duration],(0)))),
ADD EVENT sqlserver.sp_statement_completed(
   ACTION(package0.event_sequence,sqlserver.plan_handle,sqlserver.session_id,sqlserver.transaction_id)
    WHERE ([sqlserver].[client_app_name]=N'SQLVM4-TRANSACR_AdventureWorksLT_test_table_pub'))
ADD TARGET package0.event_file(SET filename=N'C:\Temp\distributor_writer_track',max_file_size=(5),max_rollover_files=(5))
WITH (MAX_MEMORY=8192 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

Investigate the collected event logs. For example, below you can confirm the activities with the same GUID with sequence ids as statement levels. Can you find any high duration at any of the statements?

Compare wait time (duration – cpu_time) and cpu_time. If wait time is high compared to CPU time, check wait_type and troubleshoot accordingly. For example, above we face NETWORK_IO wait_type. If CPU time is high, you can investigate the execution plan by using corresponding plan_handle for time consuming query which you can get from above event logs:

SELECT * FROM sys.dm_exec_query_plan(PLAN_HANDLE)

Supervisor: Collin Benkler, Sr EE for SQL Server in Microsoft

Continue reading...

Effectively troubleshoot latency in SQL Server Transactional replication: Part 2

Arslanov

Step 4.1. Troubleshoot latency in Log Reader agent’s reader thread​

Step 4.2. Troubleshoot latency in Log Reader agent’s writer thread​

Step 4.3. Troubleshoot latency in Distribution agent’s reader thread​

Step 4.4. Troubleshoot latency in Distribution agent’s writer thread​

Step 4.1. Troubleshoot latency in Log Reader agent’s reader thread

Step 4.2. Troubleshoot latency in Log Reader agent’s writer thread

Step 4.3. Troubleshoot latency in Distribution agent’s reader thread

Step 4.4. Troubleshoot latency in Distribution agent’s writer thread