Creating and Monitoring Read Replicas in Azure Postgres Flexible Server

TyBecker · Sep 26, 2023

Why use a read replica

Azure Postgres Flexible Server is the latest Postgres offering sitting on optimized Linux based infrastructure. This fully managed service offers true open-source PostgreSQL with added benefits including built in resiliency, performance monitoring and easy scaling.

One of the newest, and most widely adopted features of Flexible Servers is the built in read replica capability. Introduced in March 2023, this feature enables your database to replicate out to read replica instances, and have them fully managed as part of the service. Having the replication process managed ensures healthy operation and prevents on of the most common Postgres issues I see which is replication slots improperly managed causing database outages (more details on this can be found in my Understanding Replication Slots blog entry)

In this post, we’ll explore the read replica feature of Azure Postgres Flexible Server and how it adds value to your solution architecture.

To start, read replicas provide three main benefits:

Resiliency: Provide disaster recovery (DR) capabilities that do not impact primary database performance and enable faster recovery time objectives (RTO) than a traditional database backup. In the event of a primary database failure the replica server can be promoted to a standalone server and handle application write and read requests. This can be extended out to geo replicas which span Azure regions and provide even more resiliency to your database architecture.

Performance: Offload query and read loads from the primary database to a secondary. This helps protect primary server resources and is often used to back operational reporting solutions.

Scalability: Read replicas can help you scale your database reads by allowing you to distribute read traffic across multiple replicas using each replicas respective connection string. This can help you handle more read traffic without having to scale up your primary database.

Creating a read replica

When a replica is created, an asynchronous (async) connection is established. An async connection is used to ensure there is no negative performance impact to the primary database. Due to the nature of async connections there is a possibility of data loss when failures occur, and newly committed data has not been replicated to the read replica server.

Azure portal deployment

Log in to the Azure Portal, and navigate to Azure Database for PostgreSQL servers blade. Select the database server you want to enable a read replica on, go into the Replication blade and select ‘Add Replica’.

Note: Read Replicas are currently supported for the General Purpose and Memory Optimized server compute tiers, Burstable server compute tier is not supported.

You will notice that several options are grayed out. These are options that are inherited from the primary server and can not be customized during the read replica creation process. You can select the same region as the primary server, or another region to make the replica geo redundant. You can deploy a replica into

Note: coming in the next few months will be the ability to specify a different Azure Resource Group for read replica servers.

After selecting Create, the deployment process will begin. This can take minutes or several hours depending on a few criteria:

Same region replicas will use a database snapshot, so the time of creation does not depend on the size of the database. This method uses disk-snapshot to create the read replica, similar to how PITR works, then syncs the remaining delta with the primary using the physical replication process. With this method, most of the provision time is spent creating the compute instance compared to migrating the data.

Geo replicas will use a base backup of the primary server which is moved from the primary region to the replica region over the Azure network, and therefore size of the database will have a direct impact on the replication deployment duration. Once a backup is copied it will be restored and transaction logs replayed to bring the replica up to date with the primary.

While the primary database is deploying a replica, it will be in an “updating” state. You will notice some control plane options are not available (this is also true when deleting or promoting a replica). These options will become available again once the replication is complete. During this time the primary database is fully functional and online and available for normal operation (both read and write).

Once complete the read replica will be tied to the primary with an asynchronous physical replication process managed by the service.

Azure CLI deployment

Azure Postgres Flexible Servers can also be deployed through the Azure CLI (Command Line Interface) or infrastructure as code (IaC) solutions like ARM, Bicep, Terraform, or Ansible.

Before deployment you need to have the following prerequisites met:

Install Azure CLI 2.0
An Azure Database for PostgreSQL Flexible Server to be the primary server.

Login to Azure from the command line and run the following to deploy a Read Replica:

Check to ensure the Flexible Server has capacity for additional read replicas:

az postgres flexible-server configuration show --resource-group <yourRG> --server-name <primarydb> --name azure.replication_support

Create the replica with a public endpoint using following command:

az postgres flexible-server replica create --replica-name testReplicaServer -g testGroup --source-server testserver --zone 3 --location testLocation

To create a read replica that is vnet injected use the following command with additional parameters:

az postgres flexible-server replica create --replica-name testReplicaServer -g testGroup \

source-server testserver --zone 3 --location testLocation \

vnet newVnet --subnet newSubnet \

address-prefixes 172.0.0.0/16 --subnet-prefixes 172.0.0.0/24 \

private-dns-zone testDNS.postgres.database.azure.com

Additional details on the CLI command including parameters and usage see the Microsoft Learn Document on flexible-server replica commands.

How read replication works

The good news is that the Azure Postgres is a PaaS service and manages the replication components as part of the service. However, it is still important to understand at a high level, how the replication works under the covers of the service. In Postgres, replication is initiated by creating a replication slot on the primary database server. This instructs the primary server to retain write ahead log (WAL) files which are a transactional record of events in the database.

Once the slot is created on the primary server, a streaming connection is configured and initiated from the replica server and an asynchronous connection is established. The primary server uses the WAL Sender process to send log activities to the replica which has a WAL Receiver process running.

With Azure Read Replicas an asynchronous physical replication is established. Physical replications will replicate any data or structural DDL changes from the primary to the replica server.

When the replication slot is created, it ensures that the primary database WAL files are retained, until the replica has a chance to copy those to the replica server. If a replica server goes down (planned or otherwise), the primary server will keep collecting and retaining WAL files knowing that eventually the replica will need them.

Idle replication slots can become problematic for the primary server as it will hold WAL files until the database server storage is full. In Azure we protect against this by empowering the service to manage the creation and deletion of replication slots, as well as providing monitoring metrics such as Storage Percentage and Replication Lag that may indicate an issue with the replication process.

Monitoring Replicas

When using any form of replication from a Postgres database, it is important to implement proper monitoring and alerting. This helps ensure you are notified if there are any network, data or service-related issues.

Key replication related metrics include:

Storage Percentage – helps indicate if replication slots are not being used and the primary is filling up with WAL files. A good starting value is 80% to be notified if the database storage starts to fill up and risk impacting database operations. Storage can be scaled up, not down, so it is important to plan storage scaling accordingly.
Max Physical Replication Lag – Maximum lag across all asynchronous physical replication slots. Apply this metric to the primary server.
Read Replica Lag - Read replica lag in seconds. Apply this metric to the replica server with a value around 5 minutes (300 seconds). Replication is typically millisecond based, so any replication that is delayed into minutes for a sustained period indicates something serious is impacting database performance.

Viewing metrics can be done in the Azure portal, under the Metrics blade of the Azure Postgres Database:

Alerts can also be created to send notifications in the event these metrics exceed a threshold value. These alerts can be triggered and sent to a custom distribution group.

Alerts can be static by providing a threshold value, or dynamic by enabling machine learning algorithms to continually learn patterns and notify based on anomalies outside what it determines to be normal operating conditions.

Promoting and Failover

Promoting a replica server will decouple it from the primary database and enable write capabilities to the database. The database will apply all pending logs and promote the replica to a write capable database independent of the primary database.

Once a replica is promoted it can not be made a replica again. If you want to go back to the old primary region, you will need to create a new replica from the promoted server, this includes a new replica name. If you want to use the same original database server name, you will need to delete to original primary, and replicate from the promoted server to the original region with the original name.

…..

Postgres Management Views

Like most databases, Postgres has a set of system management views. These views can be used as an additional source to monitor the replication processes. The following queries can be run to check the status of replication. These can be used in addition to the built in Azure metrics to give more insight into the status of the replication process.

-- Check the replication state with all details.

SELECT *

FROM pg_stat_replication

WHERE sync_state = ';

Note: the ‘async’ filter is used to ensure we only see read replicas and filter out HA (high availability) replications which will use a ‘sync’ connection state:

-- Check to make sure the slots are being used

SELECT *

FROM pg_replication_slots;

-- Check status of ongoing replication and check for lag

SELECT application_name, write_lag, flush_lag, flush_lag, state

FROM pg_stat_replication

WHERE sync_state = ';

Why can’t I just setup my own Postgres native replication?

You can create your own custom replication processes with Azure Postgres Databases. These custom replication processes can be logical replications using Postgres native replication, or an extensions such as pglogical. The built-in read replica process that Azure Postgres Flexible Server uses, is a physical replication which simplifies the replication procedure.

Physical replications differ from logical replications in that they support database structural changes (DDL commands) to be replicated to the replica server. Additional limitations of logical replication can be found in the Postgres Logical Replication documentation.

There is an administrative overhead to setting up, maintaining, and monitoring your own Postgres replication process. Azure Postgres Database provides a built-in replication offering as part of the managed service to make it easy to deploy, and monitor replications. The steps to create and remove replicas are simplified and backend processes like replication slots are handled automatically when you use the built-in read replica feature.

Final Thoughts

Read replicas are a powerful feature of Azure Postgres Flexible Server that can help you improve the performance, scalability, and availability of your application. By offloading read traffic from your primary database and distributing it across multiple replicas, you can reduce latency and improve the response time for read requests. Read replicas are one direction replication from the primary, and they can not handle write requests until promoted and decoupled from the primary server. They are not a replacement for backups, as they do not provide any point in time recovery. However, if you’re using Azure Postgres Flexible Server, we highly recommend taking advantage of the read replica feature, to make your administrative life easier, and to optimize the performance of your application.

Links to Resources:

Creating and Monitoring Read Replicas in Azure Postgres Flexible Server

TyBecker