Jump to content

Effectively troubleshoot latency in SQL Server Transactional replication: Part 1


Recommended Posts

Guest Arslanov
Posted

[HEADING=1]High level transactional replication architecture[/HEADING]

 

The initial stage of transactional replication is initializing the subscriber. Although this can be done via backup, the typical approach generating a snapshot by the Snapshot Agent and storing it in the snapshot folder. It’s C:\Program Files\Microsoft SQL Server\<INST>\MSSQL\ReplData by default and configurable. Then, the Distribution Agent transfers the snapshot to the subscriber.

 

Afterwards, incremental changes in the published database are tracked and replicated to subscriber database. This replication process is done in three phases:

 

  1. Transactions are marked "for replication" in the transaction log.
  2. The Log Reader Agent reader thread scans through the transaction log using sp_replcmds and looks for transactions that are marked "for replication." These transactions are then saved to the distribution database by the Log Reader agent writer thread using sp_MSadd_replcmds.
  3. The Distribution Agent reader thread scans through the distribution database using sp_MSget_repl_commands. Then, by using the distribution writer thread, this agent connects to the subscriber to apply those changes to the subscriber using sp_MSupd..., sp_MSins..., and sp_Msdel_* (where the “*” denotes the schema and object name of the published article)..

 

989x437vv2.png.1839db69febadd512c38ed8c288988e8.png

 

[HEADING=3]

Figure 1. Transactional Replication architecture that shows the locations of each thread and agent in the case of Remote distributor and pull subscription case

[/HEADING]

[HEADING=1]Troubleshooting steps[/HEADING]

 

The following graph shows the process we use to troubleshoot. We troubleshoot by dividing the process into two parts.

 

687x653vv2.png.033b835a42d68107164e106668382e84.png

 

[HEADING=2]Step 1. Get information about “Big Picture”[/HEADING]

 

Before you dive into solving any issue, you need to fully understand the type of environment you have as there might have been changes you are unaware of. An easy way to do that is to run script SQLServer/Script Replication Topology at master · sqlserver-parikh/SQLServer (github.com) which gives output like below.

 

600x222vv2.png.69228c9202caec5a7eb7b8470942771c.png

 

[HEADING=2]Step 2. Get tracer tokens[/HEADING]

 

After confirming the environment, insert tracer tokens and identify where we are stuck. Tokens can be inserted via Replication Monitor:

 

652x119vv2.png.6716beefa86a4cd76548ece15bc80d18.png

 

 

 

For historical tracer token results, you can run below query in distributor and compare with current results, being the last row the last result:

 

 

 

 

 

USE Distribution  
SELECT p.publication_id, p.publication, agent_id,  
Datediff(s,t.publisher_commit,t.distributor_commit) as 'Time To Dist (sec)',   
Datediff(s,t.distributor_commit,h.subscriber_commit) as 'Time To Sub (sec)'  
FROM MStracer_tokens t  
JOIN MStracer_history h  
ON t.tracer_id = h.parent_tracer_id  
JOIN MSpublications p 
ON p.publication_id = t.publication_id 

 

 

 

 

 

NOTE –

 

“distribution” is the default for the distribution database. Be sure to change this if you configured a different name for the distribution database. Additionally, this history is cleaned up any time replication upgrade scripts are executed, there is a change to the distributor configuration, sp_MShistory_cleanup (depending on retention duration specified), or sp_MSdelete_tracer_history is executed (again, depending on parameters used).

 

504x68vv2.png.b26692a3717c7704ebac67d486777606.png

 

If you observe latency or “Pending” status in “Publisher to Distributor”, the issue is with Log Reader agent (refer to Step 3.1). If the latency is seen in “Distributor to Subscriber” as the screenshot above, the issue is with Distribution agent (refer to Step 3.2).

 

[HEADING=2]Step 3.1. Troubleshoot latency in Log Reader agent[/HEADING]

  1. Check agent history table for any errors by specifically paying attention to comments and error text columns:
     
    USE distribution 
    
    SELECT a.name AS agent_name, 
          CASE [runstatus]  
                 WHEN 1 THEN 'Start' 
                 WHEN 2 THEN 'Succeed' 
                 WHEN 3 THEN 'In progress' 
                 WHEN 4 THEN 'Idle' 
                 WHEN 5 THEN 'Retry' 
                 WHEN 6 THEN 'Fail' 
          END AS Status 
         ,[start_time] 
         ,h.[time] -- The time the message is logged. 
         ,[duration]  --The duration, in seconds, of the message session. 
         ,[comments] 
         ,h.[xact_seqno] -- The last processed transaction sequence number. 
         ,[delivery_time] -- The time first transaction is delivered. 
         ,[delivered_transactions] --The total number of transactions delivered in the session. 
         ,[delivered_commands] -- The total number of commands delivered in the session. 
         ,[average_commands] -- The average number of commands delivered in the session. 
         ,[delivery_rate] -- The average delivered commands per second. 
         ,[delivery_latency] -- The latency between the command entering the published database and being entered into the distribution database. In milliseconds. 
         ,[error_id] -- The ID of the error in the MSrepl_error system table. 
         ,e.error_text -- error text 
     FROM [distribution].[dbo].[MSlogreader_history] h 
     JOIN MSlogreader_agents a 
     ON a.id = h.agent_id 
     LEFT JOIN MSrepl_errors e 
     ON e.id = h.error_id 
    ORDER BY h.time DESC


     
    Furthermore, 5-min interval performance statistics have been added to the history table.
     
    628x98vv2.jpg.e10419c1ed6b07c449ee63c47329cd23.jpg
    If stats state=1, both reader and writer thread of Log Reader agent are performing as expected. If state=2, writer thread is taking a long time to write changes to distribution database. In this case, you should investigate, writer thread (Step 4.2). State=3 means the reader thread is taking a long time scanning the replicated changes from the transaction log and this thread should be investigated (Step 4.1). For example, below, the writer thread is causing latency as reader thread waited for it for 300 seconds to free queue buffer for new replicated data.629x88vv2.png.f4496da787a9f2ef8a464d414edd1132.png
    Ref: Statistics for Log Reader and Distribution agents - SQL Server | Microsoft Learn

  2. Sometimes the history table is not enough to resolve latency issues. In this case, you should enable verbose logging for detailed logs (-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel 3 ). https://learn.microsoft.com/en-US/sql/relational-databases/replication/troubleshoot-tran-repl-errors?view=sql-server-ver16#enable-verbose-logging-on-any-agent
    You can investigate the verbose detailed logs for any errors. Particularly pay attention to "Status" logs.517x84vv2.png.e52d593a771e04c0621e3101c4c4ae9f.png
    Furthermore, verbose logging provides 5-min interval Log Reader agent statistics as below. Check the Fetch time (reader thread performance) and Write time (writer thread performance) for any latency.
    507x280vv2.png.e7a9566d1a2b3360cbd097eeba687f38.png

 

If you cannot find any error logs but you detect latencies in either reader or writer thread, go to Step 4 and check the corresponding thread. For example, in the above example, we detected high latencies in Fetch time compared to Write time. So, the issue is probably with reader thread (refer to Step 4.1 in Part 2).

 

[HEADING=2]Step 3.2. Troubleshoot latency in Distribution agent[/HEADING]

  1. Check Distribution agent history for any errors by specifically paying attention to comments and error text:
    USE distribution 
    
    SELECT a.name AS agent_name, 
          CASE [runstatus]  
               WHEN 1 THEN 'Start' 
               WHEN 2 THEN 'Succeed' 
               WHEN 3 THEN 'In progress' 
               WHEN 4 THEN 'Idle' 
               WHEN 5 THEN 'Retry' 
               WHEN 6 THEN 'Fail' 
          END AS Status 
         ,[start_time] 
         ,h.[time] -- The time the message is logged. 
         ,[duration] --The duration, in seconds, of the message session. 
         ,[comments] 
         ,h.[xact_seqno] -- The last processed transaction sequence number. 
         ,[current_delivery_rate] -- The average number of commands delivered per second since the last history entry. 
         ,[current_delivery_latency] --The latency between the command entering the distribution database and being applied to the Subscriber since the last history entry. In milliseconds. 
         ,[delivered_transactions] --The total number of transactions delivered in the session. 
         ,[delivered_commands] -- The total number of commands delivered in the session. 
         ,[average_commands] -- The average number of commands delivered in the session. 
         ,[delivery_rate] -- The average delivered commands per second. 
         ,[delivery_latency] -- The latency between the command entering the distribution database and being applied to the Subscriber. In milliseconds.
         ,[total_delivered_commands] -- The total number of commands delivered since the subscription was created. 
         ,[error_id] -- The ID of the error in the MSrepl_error system table. 
         ,e.error_text -- error text 
     FROM MSdistribution_history h 
     JOIN MSdistribution_agents a 
     ON a.id = h.agent_id 
     LEFT JOIN MSrepl_errors e 
     ON e.id = h.error_id 
     ORDER BY h.time DESC


    Furthermore, 5-min interval performance statistics have been added to the history table. 640x100vv2.jpg.928d67e8aefccef0acdbce88fc657dec.jpg
    If stats state=1, both reader and writer thread of Log Reader agent are performing as expected. If state=2, writer thread is taking a long time to write changes to distribution database. In this case, you should investigate, writer thread (Step 4.2). State=3 means the reader thread is taking a long time scanning the replicated changes from the transaction log and this thread should be investigated (Step 4.1). For example, below, the writer thread is causing latency as reader thread waited for it for 300 seconds to free queue buffer for new replicated data.
    636x89vv2.png.913a8618c56d22d97c5f14cce5ed3633.png
    Ref: Statistics for Log Reader and Distribution agents - SQL Server | Microsoft Learn

  2. Sometimes the history table is not enough to resolve latency issues. In this case, you should enable verbose logging for detailed logs (-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel 3 ). Find errors with transactional replication - SQL ServerYou can investigate the verbose detailed logs for any errors. Particularly pay attention to "Status" logs.
    597x97vv2.png.115bdca6eb149250a6d74b213c5a15e4.png
    Furthermore, verbose logging provides distribution agent statistics as below. Check the Fetch time (reader thread performance) and Write time (writer thread performance) for any latency.
    504x367vv2.png.dfe22a4204bac8da65d9b4c32e417153.png

 

If you cannot find any error logs but you detect latencies in either reader or writer thread, go to Step 4 and check the corresponding thread. For example, in the above example, we detected high latencies in Reader thread compared to Writer thread. Therefore, the issue is with reader thread (Step 4.3). Let us continue doing the next steps in Part 2!!

 

Supervisor: Collin Benkler, Sr. EE for SQL Server in Microsoft.

 

Continue reading...

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...