High-availability PostgreSQL cluster based on Patroni, HAProxy and Keepalived


Recently I was given a task: to build the most reliable PostgreSQL 9.6 cluster possible. The idea was to get a cluster that survives the loss of any server, or even several servers, and can bring servers back into operation automatically after a failure. While planning the cluster I studied many articles, both the core PostgreSQL documentation and various howtos (including on Habr), tried to set up a standard cluster with repmgr, and experimented with pgpool. In general it worked, but I periodically had problems with failover, manual intervention was required to recover from outages, and so on. So I decided to look for more options. Eventually (I can't remember exactly where) I found a link to Zalando's wonderful Patroni project, and everything took off from there.

Introduction

Patroni is a Python daemon that automates the maintenance of PostgreSQL clusters with different replication types and automatic role switching. Its particular beauty, in my opinion, is that a distributed DCS store (Zookeeper, etcd and Consul are supported) is used to keep the cluster state up to date and to run master elections. The cluster can therefore be integrated into almost any system: you can always find out who the master is at any given moment, along with the status of all servers, by querying the DCS or Patroni itself over HTTP. In short, it's just beautiful :)
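To illustrate the idea, here is a minimal sketch of how a client could pick out the master from the JSON that each node's Patroni REST endpoint returns. The responses below are hypothetical examples (real ones would come from `GET http://node:8008/`); the exact JSON fields beyond `role` vary by Patroni version.

```python
import json

# Hypothetical response bodies from each node's Patroni REST endpoint.
# In practice these would be fetched over HTTP; they are inlined here
# to keep the sketch self-contained.
responses = {
    "cluster-pgsql-01.local": '{"state": "running", "role": "replica"}',
    "cluster-pgsql-02.local": '{"state": "running", "role": "master"}',
    "cluster-pgsql-03.local": '{"state": "running", "role": "replica"}',
}

def find_master(responses):
    """Return the node whose Patroni instance reports the master role."""
    for node, body in responses.items():
        if json.loads(body).get("role") == "master":
            return node
    return None

print(find_master(responses))  # -> cluster-pgsql-02.local
```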

I tested how Patroni behaves: I killed the master and other servers and loaded databases of various sizes (a ~25 GB database is automatically brought up from scratch over a 10-gigabit network in a few minutes), and overall I really liked the project. After fully implementing the scheme described below, I ran tests with a simple benchmark that connected to the database at a single address while I took down every element of the cluster (the master server, haproxy, keepalived). The delay in handing the role over to a new master was a couple of seconds. When the former master returns to the cluster, or a new server is added, no role switch takes place. To automate cluster deployment and the addition of new servers, I decided to use the familiar Ansible (links to the resulting roles are at the end of the article). Consul, which we already use, acts as the DCS.

This article has two main goals: to show PostgreSQL users that a wonderful thing like Patroni exists (there are practically no mentions of it on the Internet in general, or on GitHub in particular), and at the same time to share some experience of using Ansible through a simple example, for those who are just starting out with it. I will explain everything as a walkthrough of the Ansible roles and playbooks. Those who do not use Ansible can transfer the steps to their favourite automated server-management tool, or perform them manually. Since most of the YAML scripts are long, I will wrap them in spoilers. The story is divided into two parts: preparing the servers, and deploying the cluster itself. Those familiar with Ansible will find nothing new in the first part, so I recommend they go straight to the second.

Part 1

For this example I am using CentOS 7 based virtual machines. The VMs are deployed from a template that is periodically updated (kernel, system packages), but that topic is beyond the scope of this article. I will only note that no application or server software is installed on the virtual machines in advance. Any cloud resources, for example from AWS, DO, Vscale, etc., will do just as well: for them there are dynamic-inventory scripts and Ansible integrations, or you can bolt on Terraform, so the entire process of creating and removing servers can be automated from scratch. First you need to create an inventory of the resources for Ansible. Mine (as by default) lives in /etc/ansible. Create the inventory in the /etc/ansible/hosts file:

[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local

We use the internal domain zone .local, hence the server names. Next, each server needs to be prepared by installing all the required components and working tools. For this purpose we create a playbook in /etc/ansible/tasks:

/etc/ansible/tasks/essentialsoftware.yml

---

- name: Install essential software
  yum: name={{ item }} state=latest
  tags: software
  with_items:
   - ntpdate
   - bzip2
   - zip
   - unzip
   - openssl-devel
   - mc
   - vim
   - atop
   - wget
   - mytop
   - screen
   - net-tools
   - rsync
   - psmisc
   - gdb
   - subversion
   - htop
   - bind-utils
   - sysstat
   - nano
   - iptraf
   - nethogs
   - ngrep
   - tcpdump
   - lm_sensors
   - mtr
   - s3cmd
   - psmisc
   - gcc
   - git
   - python2-pip
   - python-devel

- name: install the 'Development tools' package group
  yum:
    name: "@Development tools"
    state: present

The essential package set creates a familiar working environment on any server. The Development tools group, the various -devel libraries and Python are needed by pip to build the Python modules used with PostgreSQL. We use virtual machines based on VMware ESXi, and for ease of administration they need to run a VMware agent. For this we will launch the open vmtoolsd agent and describe its installation in a separate playbook (since not all of our servers are virtual, this task may not be needed for some of them):

/etc/ansible/tasks/open-vm-tools.yml

---

- name: Install open VM tools for VMWARE
  yum: name={{ item }} state=latest
  tags: open-vm-tools
  with_items:
   - open-vm-tools

- name: VmWare service start and enabling
  service: name=vmtoolsd.service state=started enabled=yes
  tags: open-vm-tools

To finish preparing a server for the installation of the main software, in our case the following steps are required:

1) configure time synchronization with NTP
2) install and run the Zabbix agent for monitoring
3) roll out the required SSH keys and authorized_keys.

To avoid bloating the article with details unrelated to the cluster itself, I will only briefly quote the Ansible playbooks that perform these tasks:

NTP:

/etc/ansible/tasks/ntpd.yml

---
    - name: setting default timezone
      set_fact:
        timezone: name=Europe/Moscow
      when: timezone is not defined

    - name: setting TZ
      timezone: name={{ timezone }}
      when: timezone is defined
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd

    - name: Configurating cron for ntpdate
      cron: name="ntpdate" minute="*/5" job="/usr/sbin/ntpdate pool.ntp.org"
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd

    - name: ntpd stop and disable
      service: name=ntpd state=stopped enabled=no
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd
      ignore_errors: yes

    - name: crond restart and enabled
      service: name=crond state=restarted enabled=yes
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd

First we check whether a custom time zone is set for the server; if not, the Moscow time zone is set (most of our servers use it). We do not use ntpd because of time-drift problems on ESXi virtual machines, after which ntpd refuses to synchronize time (and `tinker panic 0` doesn't help). So we simply run the ntpdate client from cron every five minutes. Zabbix agent:

/etc/ansible/tasks/zabbix.yml

---

    - name: set zabbix ip external
      set_fact:
        zabbix_ip: 132.xx.xx.98
      tags: zabbix

    - name: set zabbix ip internal
      set_fact:
        zabbix_ip: 192.168.xx.98
      when: ansible_all_ipv4_addresses | ipaddr('192.168.0.0/16')
      tags: zabbix

    - name: Import Zabbix3 repo
      yum: name=http://repo.zabbix.com/zabbix/3.0/rhel/7/x86_64/zabbix-release-3.0-1.el7.noarch.rpm state=present
      tags: zabbix

    - name: Remove old zabbix
      yum: name=zabbix2* state=absent
      tags: zabbix

    - name: Install zabbix-agent software
      yum: name={{ item }} state=latest
      tags: zabbix
      with_items:
        - zabbix-agent
        - zabbix-release

    - name: Creates directories
      file: path={{ item }}  state=directory
      tags:
      - zabbix
      - zabbix-mysql
      with_items:
        - /etc/zabbix/externalscripts
        - /etc/zabbix/zabbix_agentd.d
        - /var/lib/zabbix

    - name: Copy scripts
      copy: src=/etc/ansible/templates/zabbix/{{ item }} dest=/etc/zabbix/externalscripts/{{ item }} owner=zabbix group=zabbix  mode=0755
      tags: zabbix
      with_items:
        - netstat.sh
        - iostat.sh
        - iostat2.sh
        - iostat_collect.sh
        - iostat_parse.sh
        - php_workers_discovery.sh

    - name: Copy .my.cnf
      copy: src=/etc/ansible/files/mysql/.my.cnf dest=/var/lib/zabbix/.my.cnf owner=zabbix group=zabbix  mode=0700
      tags:
      - zabbix
      - zabbix-mysql

    - name: remove default configs
      file: path={{ item }} state=absent
      tags: zabbix
      with_items:
        - /etc/zabbix_agentd.conf
        - /etc/zabbix/zabbix_agentd.conf

    - name: put zabbix-agentd.conf to default place
      template: src=/etc/ansible/templates/zabbix/zabbix_agentd.tpl dest=/etc/zabbix_agentd.conf owner=zabbix group=zabbix force=yes
      tags: zabbix

    - name: link zabbix-agentd.conf to /etc/zabbix
      file: src=/etc/zabbix_agentd.conf dest=/etc/zabbix/zabbix_agentd.conf state=link
      tags: zabbix

    - name: zabbix-agent start and enable
      service: name=zabbix-agent state=restarted enabled=yes
      tags: zabbix

When installing Zabbix, the agent config is rolled out from a template; only the server address needs to change. Servers inside our network talk to 192.168.x.98, while servers without access to it use the public address of the same server. Distributing SSH keys and configuring SSH has been moved to a separate role, which can be found, for example, on Ansible Galaxy. There are many options, and the changes are fairly trivial, so I see no point in quoting all of its contents here. It's time to roll the resulting configuration onto the servers. In general I install all the components and the cluster itself in one pass, already with the full config, but for the purposes of this tutorial it seems better to split it into two steps, matching the chapters. Create a playbook for the server group:
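The internal-vs-external address selection in the Zabbix playbook (the `ipaddr('192.168.0.0/16')` condition) can be sketched with the standard `ipaddress` module; the two Zabbix addresses below are placeholders, not the real ones:

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("192.168.0.0/16")
ZABBIX_INTERNAL = "192.168.0.98"   # placeholder for the internal server address
ZABBIX_EXTERNAL = "132.0.0.98"     # placeholder for the public server address

def pick_zabbix_server(host_addresses):
    """Mirror the playbook logic: use the internal Zabbix address only if the
    host has at least one IPv4 address inside the internal network."""
    if any(ipaddress.ip_address(a) in INTERNAL_NET for a in host_addresses):
        return ZABBIX_INTERNAL
    return ZABBIX_EXTERNAL

print(pick_zabbix_server(["10.1.2.3", "192.168.5.7"]))  # -> 192.168.0.98
print(pick_zabbix_server(["203.0.113.4"]))              # -> 132.0.0.98
```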

/etc/ansible/cluster-pgsql.yml

---
- hosts: pgsql

  pre_tasks:
    - name: Setting system hostname
      hostname: name="{{ ansible_host }}"

    - include: tasks/essentialsoftware.yml
    - include: tasks/open-vm-tools.yml
    - include: tasks/ntpd.yml

  post_tasks:
    - include: tasks/zabbix.yml

  roles:
     - ssh.role
     - ansible-role-patroni

We start processing all servers:

~# ansible-playbook cluster-pgsql.yml --skip-tags patroni

The preparation is over.

Part 2

We proceed to deploying the cluster itself. Since setting up the cluster involves many actions (installing PostgreSQL and all the components, and uploading individual configs for them), I split the whole process into a separate role. Roles in Ansible group sets of related tasks, which makes scripts easier to write and maintain. I took the role template for installing Patroni from https://github.com/gitinsky/ansible-role-patroni, for which thanks to its author. For my own purposes I reworked it and added my own haproxy and keepalived playbooks. My roles live in the /etc/ansible/roles directory. Create a directory for the new role, with subdirectories for its components:

~# mkdir /etc/ansible/roles/ansible-role-patroni/tasks
~# mkdir /etc/ansible/roles/ansible-role-patroni/templates

In addition to PostgreSQL, our cluster consists of the following components:

1) haproxy, to monitor the servers' status and redirect requests to the master server;
2) keepalived, to provide a single entry point to the cluster: a virtual IP.

All the playbooks performed by this role are listed in the file that Ansible launches by default:

/etc/ansible/roles/ansible-role-patroni/tasks/main.yml

- include: postgres.yml
- include: haproxy.yml
- include: keepalived.yml

Next we describe the individual tasks. The first playbook installs PostgreSQL 9.6 from the project's native repository, plus the additional packages Patroni requires, and then clones Patroni itself from GitHub:

/etc/ansible/roles/ansible-role-patroni/tasks/postgres.yml

---

- name: Import Postgresql96 repo
  yum: name=https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-centos96-9.6-3.noarch.rpm state=present
  tags: patroni
  when: install is defined

- name: Install PGsql96
  yum: name={{ item }} state=latest
  tags: patroni
  with_items:
    - postgresql96
    - postgresql96-contrib
    - postgresql96-server
    - python-psycopg2
    - repmgr96
  when: install is defined

- name: checkout patroni
  git: repo=https://github.com/zalando/patroni.git dest=/opt/patroni
  tags: patroni
  when: install is defined

- name: create /etc/patroni
  file: state=directory dest=/etc/patroni
  tags: patroni
  when: install is defined

- name: put postgres.yml
  template: src=postgres0.yml dest=/etc/patroni/postgres.yml backup=yes
  tags: patroni
  when: install is defined

- name: install python packages
  pip: name={{ item }}
  tags: patroni
  with_items:
    - python-etcd
    - python-consul
    - dnspython
    - boto
    - mock
    - requests
    - six
    - kazoo
    - click
    - tzlocal
    - prettytable
    - PyYAML
  when: install is defined

- name: put patroni.service systemd unit
  template: src=patroni.service dest=/etc/systemd/system/patroni.service backup=yes
  tags: patroni
  when: install is defined

- name: Reload daemon definitions
  command: /usr/bin/systemctl daemon-reload
  tags: patroni

- name: restart
  service: name=patroni state=restarted enabled=yes
  tags: patroni

In addition to installing the software, this playbook also uploads the configuration for the current Patroni server and a systemd unit for starting the daemon, and then starts the Patroni daemon. The config template and the systemd unit must be in the templates directory inside the role. The Patroni config template:

/etc/ansible/roles/ansible-role-patroni/templates/postgres.yml.j2

name: {{ patroni_node_name }}
scope: &scope {{ patroni_scope }}

consul:
  host: consul.services.local:8500

restapi:
  listen: 0.0.0.0:8008
  connect_address: {{ ansible_default_ipv4.address }}:8008
  auth: 'username:{{ patroni_rest_password }}'

bootstrap:
  dcs:
    ttl: &ttl 30
    loop_wait: &loop_wait 10
    maximum_lag_on_failover: 1048576 # 1 megabyte in bytes
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        archive_mode: "on"
        wal_level: hot_standby
        archive_command: mkdir -p ../wal_archive && cp %p ../wal_archive/%f
        max_wal_senders: 10
        wal_keep_segments: 8
        archive_timeout: 1800s
        max_replication_slots: 5
        hot_standby: "on"
        wal_log_hints: "on"

pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator 192.168.0.0/16 md5
  - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: {{ ansible_default_ipv4.address }}:5432
  data_dir: /var/lib/pgsql/9.6/data
  pg_rewind:
    username: superuser
    password: {{ patroni_postgres_password }}
  pg_hba:
  - host all all 0.0.0.0/0 md5
  - hostssl all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: {{ patroni_replicator_password }}
    network:  192.168.0.0/16
  superuser:
    username: superuser
    password: {{ patroni_postgres_password }}
  admin:
    username: admin
    password: {{ patroni_postgres_password }}
  restore: /opt/patroni/patroni/scripts/restore.py

Since each server in the cluster requires an individual Patroni configuration, the config is kept as a Jinja2 template (the postgres0.yml.j2 file), and the template step makes Ansible render this template, substituting variable values taken from a separate description for each server. Variables common to the whole cluster go directly into the inventory, which now takes the following form:
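What the `template` step does can be illustrated with a toy substitution function (the real rendering is performed by Jinja2, which supports far more than plain `{{ var }}` markers):

```python
import re

def render(template, variables):
    """Toy stand-in for Ansible's `template` step: replace every
    {{ var }} marker with the value from a per-host variable dict."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables[m.group(1)]),
                  template)

template = "name: {{ patroni_node_name }}\nscope: {{ patroni_scope }}"
host_vars = {"patroni_node_name": "cluster_pgsql_01",
             "patroni_scope": "cluster-pgsql"}

print(render(template, host_vars))
# name: cluster_pgsql_01
# scope: cluster-pgsql
```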

/etc/ansible/hosts

[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local

[pgsql:vars]
patroni_scope: "cluster-pgsql"
patroni_rest_password: flsdjkfasdjhfsd
patroni_postgres_password: flsdjkfasdjhfsd
patroni_replicator_password: flsdjkfasdjhfsd
cluster_virtual_ip: 192.xx.xx.125
And a separate set for each server, in the host_vars/<server_name> directory:

/etc/ansible/host_vars/cluster-pgsql-01.local/main.yml

patroni_node_name: cluster_pgsql_01
keepalived_priority: 99

Let me explain what some of the variables are for:

- patroni_scope - the cluster name used when registering with Consul;
- patroni_node_name - the server name used when registering with Consul;
- patroni_rest_password - the password for the Patroni HTTP interface (required to send cluster-changing commands);
- patroni_postgres_password - the password for the postgres user, set when Patroni creates a new database;
- patroni_replicator_password - the password for the replicator user, on whose behalf replication to the slaves is performed.

This file also lists some other variables used in other playbooks or roles, in particular SSH settings (keys, users), the server's time zone, the server's priority in the keepalived cluster, etc. The configuration for the remaining servers is the same; only the server name and priority change (for example, 99-100-101 for three servers). Installing and configuring haproxy:

/etc/ansible/roles/ansible-role-patroni/tasks/haproxy.yml

---

- name: Install haproxy
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - haproxy
  with_items:
    - haproxy
  when: install is defined

- name: put config
  template: src=haproxy.cfg.j2 dest=/etc/haproxy/haproxy.cfg backup=yes
  tags:
    - patroni
    - haproxy

- name: restart and enable
  service: name=haproxy state=restarted enabled=yes
  tags:
    - patroni
    - haproxy

Haproxy is installed on every host; its config references all the PostgreSQL servers, checks which of them is currently the master, and sends requests to it. The check uses a wonderful Patroni feature: its REST interface. When you request the URL server:8008 (8008 is the default port), Patroni returns a JSON report on the cluster state, and also indicates via the HTTP response code whether this server is the master: if it is, it answers with code 200; if not, with code 503. I strongly recommend reading the Patroni documentation; the HTTP interface is quite interesting, and it also allows you to force role switches and manage the cluster. The same can be done with the patronictl console utility from the Patroni distribution. The haproxy configuration is quite simple:
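The routing logic this gives us is easy to model: only the node whose REST endpoint answers 200 passes haproxy's `http-check expect status 200`, so only the master ever receives traffic. A minimal sketch:

```python
# Current roles of the three nodes, as Patroni would report them.
roles = {"cluster-pgsql-01.local": "replica",
         "cluster-pgsql-02.local": "master",
         "cluster-pgsql-03.local": "replica"}

def rest_status(role):
    """Patroni answers the haproxy httpchk probe with 200 on the master
    and 503 on replicas."""
    return 200 if role == "master" else 503

# haproxy keeps only the servers whose check passes; with this scheme
# that is always exactly the current master.
eligible = [node for node, role in roles.items() if rest_status(role) == 200]
print(eligible)  # -> ['cluster-pgsql-02.local']
```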

/etc/ansible/roles/ansible-role-patroni/templates/haproxy.cfg.j2

global
maxconn 800

defaults
log global
mode tcp
retries 2
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s

frontend ft_postgresql
bind *:5000
default_backend postgres-patroni

backend postgres-patroni
  option httpchk

  http-check expect status 200
  default-server inter 3s fall 3 rise 2

  # one server line is rendered per cluster node
{% for host in groups['pgsql'] %}
  server {{ hostvars[host]['patroni_node_name'] }} {{ host }}:5432 maxconn 300 check port 8008
{% endfor %}

With this configuration haproxy listens on port 5000 and forwards traffic to the master server. The status check runs every 3 seconds; 3 failed responses (code 503) are needed to mark a server down, and 2 successful responses (code 200) to bring it back up. At any moment you can contact any haproxy directly, and it will correctly route the traffic to the master server.

Patroni also ships with a template for the confd daemon and an example of integrating it with etcd, which lets you rewrite the haproxy config dynamically when servers are removed or added. For now I am building a fairly static cluster; excessive automation here, IMHO, could lead to unforeseen problems.

We did not want clients to need custom logic, live server tracking, and so on, so we create a single entry point to the cluster with keepalived. The keepalived daemons talk to their neighbours via the VRRP protocol; one of them is elected as the main one (the priority is set in the config and templated from the keepalived_priority variable in each server's host_vars), and it takes over the virtual IP address. The other daemons wait patiently. If the current main keepalived server dies for some reason or signals a fault to its neighbours, a re-election occurs and the next-highest-priority server takes over the virtual IP.

To guard against haproxy crashes, the keepalived daemons run the check command "killall -0 haproxy" once a second. It returns code 0 if a haproxy process exists, and 1 if it does not. If haproxy disappears, keepalived signals a VRRP fault and gives up the virtual IP, which is immediately picked up by the next-priority server with a live haproxy. Installing and configuring keepalived:
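The election described above can be sketched as follows. This is a simplified model of keepalived's behaviour: with a negative `weight` on the vrrp_script, a failing check lowers the node's effective priority, and the node with the highest effective priority holds the VIP.

```python
def effective_priority(base_priority, haproxy_alive, weight=20):
    """Simplified model: a failed 'killall -0 haproxy' check lowers the
    node's VRRP priority by the script weight."""
    return base_priority if haproxy_alive else base_priority - weight

def elect_vip_holder(nodes):
    """nodes: {name: (base_priority, haproxy_alive)} -> node that wins the
    VRRP election and takes the virtual IP."""
    return max(nodes, key=lambda n: effective_priority(*nodes[n]))

# Priorities 101-100-99, as in the host_vars example (values are illustrative).
nodes = {"cluster-pgsql-01": (101, True),
         "cluster-pgsql-02": (100, True),
         "cluster-pgsql-03": (99, True)}
print(elect_vip_holder(nodes))  # -> cluster-pgsql-01

nodes["cluster-pgsql-01"] = (101, False)  # haproxy died on the top node
print(elect_vip_holder(nodes))  # -> cluster-pgsql-02
```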

/etc/ansible/roles/ansible-role-patroni/tasks/keepalived.yml

---

- name: Install keepalived
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - keepalived
  with_items:
    - keepalived
  when: install is defined

- name: put alert script
  template: src=alert.sh.j2 dest=/usr/local/sbin/alert.sh backup=yes mode=755
  tags:
    - patroni
    - keepalived
  when: install is defined

- name: put config
  template: src=keepalived.conf.j2 dest=/etc/keepalived/keepalived.conf backup=yes
  tags:
    - patroni
    - keepalived

- name: restart and enable
  service: name=keepalived state=restarted enabled=yes
  tags:
    - patroni
    - keepalived

In addition to installing keepalived, this playbook also copies a simple script for sending alerts to Telegram. The script takes the message as an argument and simply calls the Telegram API with curl. To send notifications you only need to put your own bot token and Telegram group ID into this script. The keepalived configuration is described as a Jinja2 template:
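For the curious, a sketch of the request such a script would build, using the Telegram Bot API's sendMessage method. The token and chat id here are placeholders, and the real script issues the request with curl rather than Python:

```python
from urllib.parse import quote

BOT_TOKEN = "123456:ABC-DEF"   # hypothetical bot token
CHAT_ID = "-1001234567890"     # hypothetical group chat id

def alert_url(message):
    """Build the Telegram Bot API sendMessage URL for a given alert text."""
    return ("https://api.telegram.org/bot%s/sendMessage?chat_id=%s&text=%s"
            % (BOT_TOKEN, CHAT_ID, quote(message)))

print(alert_url("cluster_pgsql_01 became MASTER"))
```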

/etc/ansible/roles/ansible-role-patroni/templates/keepalived.conf.j2

global_defs {
   router_id {{ patroni_node_name }}
}

vrrp_script chk_haproxy {
        script "killall -0 haproxy"
        interval 1
        weight -20
        debug
        fall 2
        rise 2
}

vrrp_instance {{ patroni_node_name }} {
        interface ens160
        state BACKUP
        virtual_router_id 150
        priority {{ keepalived_priority }}
        authentication {
            auth_type PASS
            auth_pass secret_for_vrrp_auth
        }
        track_script {
                chk_haproxy weight 20
        }
        virtual_ipaddress {
                {{ cluster_virtual_ip }}/32 dev ens160
        }
        notify_master "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became MASTER'"
        notify_backup "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became BACKUP'"
        notify_fault "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became FAULT'"

}

The variables patroni_node_name, cluster_virtual_ip and keepalived_priority are filled in with the corresponding data from host_vars. The keepalived config also invokes the script that sends status-change messages to the Telegram channel. We roll the full cluster configuration out to the servers:

~# ansible-playbook cluster-pgsql.yml

Since Ansible is idempotent, i.e. it performs steps only if they have not been done before, you can run the playbook without any extra parameters. If you do not want to wait, or you are sure the servers are fully prepared, you can run ansible-playbook with the -t patroni switch; then only the steps from the Patroni role will be executed.

Note that I do not assign server roles (master or slave) explicitly. This configuration creates an empty database, and the first server configured simply becomes the master. When new servers are added, Patroni sees through the DCS that a cluster master already exists, automatically copies the database from the current master, and attaches the slave to it. If a slave is started that has fallen behind the master for some time, Patroni automatically brings it back in sync using pg_rewind. We make sure that all the servers have started and chosen their roles:

~# journalctl -f -u patroni

Afterword

The only thing this cluster lacks for complete happiness, in my opinion, is connection pooling and proxying of read queries to all the slaves to improve read performance, with inserts and updates going only to the master. In a configuration with asynchronous replication, offloading reads can produce unexpected results if a slave lags behind the master; this must be taken into account. Streaming (asynchronous) replication does not guarantee cluster-wide consistency at every moment; for that you need synchronous replication. In that mode the master waits for confirmation that transactions have been copied and applied on the slaves, which slows the database down. But if transaction loss is unacceptable (for example, in some financial applications), synchronous replication is your choice. Patroni supports all the options; if synchronous replication suits you better, you only need to change a few fields in the Patroni configs. The various replication modes are covered very well in the Patroni documentation.

Someone will probably suggest pgpool, which, in fact, covers all of this system's functionality by itself. It can monitor databases, proxy requests and expose a virtual IP, as well as pool client connections. Yes, it can do all of that. But in my opinion the scheme with Patroni is much more transparent (of course, that is just my opinion), and while experimenting with pgpool I ran into strange behaviour of its watchdog and virtual addresses, which I did not debug too deeply before deciding to look for another solution. Of course, the problem may well have been on my end, and I plan to return to testing pgpool later. In any case, though, pgpool cannot fully automatically manage a cluster, introduce new servers and (especially) return failed ones, or work with a DCS. In my opinion, that is Patroni's most interesting functionality.
Thank you for your attention; I will be glad to hear suggestions for improving this solution further, and to answer questions in the comments.
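The lag concern above is exactly what the `maximum_lag_on_failover: 1048576` setting in the Patroni config addresses during failover: replicas that have fallen too far behind are not promoted. A simplified model of that filter:

```python
MAX_LAG_BYTES = 1048576  # matches maximum_lag_on_failover in the Patroni config

def failover_candidates(replicas):
    """replicas: {name: WAL lag in bytes}. Simplified model: Patroni excludes
    from the election any replica whose lag exceeds maximum_lag_on_failover."""
    return [name for name, lag in replicas.items() if lag <= MAX_LAG_BYTES]

print(failover_candidates({"cluster-pgsql-02": 120_000,
                           "cluster-pgsql-03": 5_000_000}))
# -> ['cluster-pgsql-02']
```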
