Unless marked as a reprint, all articles on this site are original. Reprinted from love wife & love life, Roger's Oracle & MySQL technical blog.
MySQL Group Replication has been available for two years now. While it is not yet fully mature and stable, from a technical standpoint I believe it is the direction things are heading; for cross-IDC disaster recovery in particular, MGR copes very well. By comparison, PXC is much weaker, and in my own testing PXC's performance also lags behind MGR's.
MGR supports both multi-primary and single-primary modes. In multi-primary mode every node accepts both reads and writes, much like a classic MySQL dual-master setup. Since replication between MySQL nodes is logical replication, multi-primary writes inevitably raise the problem of conflict resolution. In our experience multi-primary mode still has quite a few issues, so we generally advise customers to run single-primary, which is more reliable.
Single-primary mode, however, raises a question of its own: data consistency is solved, but what about business continuity? Suppose Node1 is the current primary and Node2 and Node3 are read-only (slave) nodes, so all writes go to Node1. If Node1 fails and the primary role switches to Node2, what should the application do? Surely not change its connection IP by hand?
Of course, some people still pair MGR with Keepalived, which exposes a VIP to the application; after a node switchover the VIP drifts to the new primary. But Keepalived is a fairly old piece of software with its share of problems, so these days people increasingly prefer service discovery with Consul instead.
And this is not limited to the MGR architecture: with MHA, currently the most widely deployed architecture, quite a few people are also trying MHA + Consul.
I will not go into Consul's basics and internals here. Below is a simple test of mine, for reference.
1. Create the required directories on every node
[root@killdb tmp]# mkdir -p /etc/consul.d/
[root@killdb tmp]# mkdir -p /opt/consul
[root@killdb tmp]# mkdir -p /opt/consul/shell
2. Consul ships as a single binary: unpack it and it is ready to use, no installation required. Steps omitted.
3. The JSON configuration on the Consul server side:
[root@killdb consul.d]# cat server.json
{
  "data_dir": "/opt/consul",
  "datacenter": "dc1",
  "log_level": "INFO",
  "server": true,
  "node_name": "killdb",
  "bootstrap_expect": 1,
  "bind_addr": "172.16.29.160",
  "client_addr": "172.16.29.160",
  "ui": true
}
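Before starting the agent, the whole configuration directory can be sanity-checked with Consul's built-in validator; a minimal sketch, assuming the binary is on the PATH and the directory matches the one created above:

```shell
# Validate every config file under /etc/consul.d/ before starting the
# agent; exits non-zero and names the offending file on a syntax error.
consul validate /etc/consul.d/
```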
4. Each MGR node is configured as follows:
[root@mysqldb1 consul.d]# cat client.json
{
  "data_dir": "/opt/consul",
  "enable_script_checks": true,
  "bind_addr": "172.16.29.154",      -- the address of this MGR node
  "retry_join": ["172.16.29.160"],
  "retry_interval": "30s",
  "rejoin_after_leave": true,
  "start_join": ["172.16.29.160"],
  "node_name": "mgr_mysqldb1"        -- the name of this MGR node
}

(The "--" annotations are explanatory only; JSON has no comments, so remove them in the actual files.)

[root@mysqldb1 consul.d]# cat w-mgr-test-enmotech.json
{
  "service": {
    "name": "w-mgr-test-enmotech",
    "tags": ["enmotech-3306"],
    "address": "172.16.29.154",
    "meta": { "meta": "for my service" },
    "port": 3306,
    "enable_tag_override": false,
    "checks": [
      {
        "args": ["/opt/consul/shell/check_mysql_mgr_master.sh"],
        "interval": "10s"
      }
    ]
  }
}

[root@mysqldb1 consul.d]# cat r-mgr-test-enmotech.json
{
  "service": {
    "name": "r-mgr-test-enmotech",
    "tags": ["enmotech-3306"],
    "address": "172.16.29.154",
    "meta": { "meta": "for my service" },
    "port": 3306,
    "enable_tag_override": false,
    "checks": [
      {
        "args": ["/opt/consul/shell/check_mysql_mgr_slave.sh"],
        "interval": "1s"
      }
    ]
  }
}
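The `checks` blocks above rely on Consul's script-check convention: the agent runs the script on the given interval and maps its exit code to a health state (0 = passing, 1 = warning, anything else = critical), and only passing instances appear in DNS answers. A minimal sketch of that mapping (the helper name is mine, for illustration only):

```shell
#!/bin/bash
# Consul maps a script check's exit code to a health state:
#   0 -> passing, 1 -> warning, any other code -> critical.
# This hypothetical helper mirrors that mapping.
status_from_exit() {
    case "$1" in
        0) echo "passing"  ;;
        1) echo "warning"  ;;
        *) echo "critical" ;;
    esac
}

status_from_exit 0   # -> passing
status_from_exit 2   # -> critical
```

This is why the check scripts below exit 2 when the node does not hold the role being checked: a critical check removes the node from that service's DNS answers.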
5. Deploy the two MySQL status-check scripts above on every MGR node
The scripts below are adapted from ones found on the Internet; feel free to modify them.
[root@mysqldb1 shell]# cat check_mysql_mgr_master.sh
#!/bin/bash
# Passes (exit 0) only when this node is the MGR primary.
port=3306
user="root"
password="enmotech"
comm="/usr/bin/mysql -u$user -hlocalhost -P$port -p$password"
value=$($comm -Nse "select 1")
primary_member=$($comm -Nse "select variable_value from performance_schema.global_status where VARIABLE_NAME='group_replication_primary_member'")
server_uuid=$($comm -Nse "select variable_value from performance_schema.global_variables where VARIABLE_NAME='server_uuid'")

# Is the MySQL instance alive at all?
if [ -z "$value" ]
then
    echo "mysql $port is down....."
    exit 2
fi

# Is this node ONLINE in the group?
node_state=$($comm -Nse "select MEMBER_STATE from performance_schema.replication_group_members where MEMBER_ID='$server_uuid'")
if [ "$node_state" != "ONLINE" ]
then
    echo "MySQL $port state is not online...."
    exit 2
fi

# Is this node the primary?
if [ "$server_uuid" == "$primary_member" ]
then
    echo "MySQL $port Instance is master ........"
    exit 0
else
    echo "MySQL $port Instance is slave ........"
    exit 2
fi

[root@mysqldb1 shell]# cat check_mysql_mgr_slave.sh
#!/bin/bash
# Passes (exit 0) when this node is a secondary, or when it is the
# only surviving member of the group.
port=3306
user="root"
password="enmotech"
comm="/usr/bin/mysql -u$user -hlocalhost -P$port -p$password"
value=$($comm -Nse "select 1")
primary_member=$($comm -Nse "select variable_value from performance_schema.global_status where VARIABLE_NAME='group_replication_primary_member'")
server_uuid=$($comm -Nse "select variable_value from performance_schema.global_variables where VARIABLE_NAME='server_uuid'")

# Is the MySQL instance alive at all?
if [ -z "$value" ]
then
    echo "mysql $port is down....."
    exit 2
fi

# Is this node ONLINE in the group?
node_state=$($comm -Nse "select MEMBER_STATE from performance_schema.replication_group_members where MEMBER_ID='$server_uuid'")
if [ "$node_state" != "ONLINE" ]
then
    echo "MySQL $port state is not online...."
    exit 2
fi

# Is this node a secondary?
if [ "$server_uuid" != "$primary_member" ]
then
    echo "MySQL $port Instance is slave ........"
    exit 0
else
    node_num=$($comm -Nse "select count(*) from performance_schema.replication_group_members")
    # If no secondary is left, register the primary for reads as well.
    if [ "$node_num" -eq 1 ]
    then
        echo "MySQL $port Instance is slave ........"
        exit 0
    else
        echo "MySQL $port Instance is master ........"
        exit 2
    fi
fi
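The role decision in the master-check script can be exercised without a live MySQL instance by factoring it into a pure function. This hypothetical harness (the function name is mine, not part of the original scripts) mirrors that branch logic:

```shell
#!/bin/bash
# Pure-logic mirror of check_mysql_mgr_master.sh's decision:
#   critical if the node is not ONLINE in the group,
#   master   if its server_uuid matches group_replication_primary_member,
#   slave    otherwise.
check_master_logic() {
    local node_state="$1" server_uuid="$2" primary_member="$3"
    if [ "$node_state" != "ONLINE" ]; then
        echo "critical"
        return 2
    fi
    if [ "$server_uuid" = "$primary_member" ]; then
        echo "master"
        return 0
    fi
    echo "slave"
    return 2
}

check_master_logic ONLINE aaaa-1111 aaaa-1111   # -> master
check_master_logic ONLINE aaaa-1111 bbbb-2222   # -> slave
```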
6. Start the Consul server
For production I recommend a Consul cluster of at least three server nodes; for this simulation a single node will do.
[root@killdb ~]# /etc/consul.d/consul agent -config-dir=/etc/consul.d -enable-script-checks > /opt/consul/consul.log &
[1] 19187
[root@killdb ~]#
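For the three-server production setup recommended above, the startup would change little; a hedged sketch, where the 161/162 addresses are hypothetical peers and each server's server.json carries "bootstrap_expect": 3:

```shell
# Run on each of the three Consul servers (addresses are examples:
# 172.16.29.160, 172.16.29.161, 172.16.29.162). With
# "bootstrap_expect": 3 in server.json, a leader is elected only
# once all three servers have joined.
consul agent -config-dir=/etc/consul.d \
    -retry-join=172.16.29.160 \
    -retry-join=172.16.29.161 \
    -retry-join=172.16.29.162 \
    > /opt/consul/consul.log &
```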
7. Start the Consul agent on each MGR node
./consul agent -config-dir=/etc/consul.d -enable-script-checks > /opt/consul/consul.log &
After a normal start you will see output similar to the following:
==> Starting Consul agent...
==> Joining cluster...
    Join completed. Synced with 1 initial agents
==> Consul agent running!
           Version: 'v1.4.0'
           Node ID: '1b1f724e-a57a-01f7-2bed-661faea0f3d1'
         Node name: 'mgr_mysqldb1'
        Datacenter: 'dc1' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 172.16.29.154 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/12/27 17:05:07 [WARN] agent: Node name "mgr_mysqldb1" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/12/27 17:05:07 [INFO] serf: Ignoring previous leave in snapshot
    2018/12/27 17:05:07 [INFO] serf: EventMemberJoin: mgr_mysqldb1 172.16.29.154
    2018/12/27 17:05:07 [INFO] serf: Attempting re-join to previously known node: killdb: 172.16.29.160:8301
    2018/12/27 17:05:07 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2018/12/27 17:05:07 [WARN] agent/proxy: running as root, will not start managed proxies
    2018/12/27 17:05:07 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2018/12/27 17:05:07 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2018/12/27 17:05:07 [INFO] agent: (LAN) joining: [172.16.29.160]
    2018/12/27 17:05:07 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s os packet scaleway softlayer triton vsphere
    2018/12/27 17:05:07 [INFO] agent: Joining LAN cluster...
    2018/12/27 17:05:07 [INFO] agent: (LAN) joining: [172.16.29.160]
    2018/12/27 17:05:07 [INFO] serf: EventMemberJoin: killdb 172.16.29.160
    2018/12/27 17:05:07 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2018/12/27 17:05:07 [INFO] agent: started state syncer
    2018/12/27 17:05:07 [INFO] consul: adding server killdb (Addr: tcp/172.16.29.160:8300) (DC: dc1)
    2018/12/27 17:05:07 [INFO] serf: Re-joined to previously known node: killdb: 172.16.29.160:8301
    2018/12/27 17:05:07 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2018/12/27 17:05:07 [INFO] agent: Join LAN completed. Synced with 1 initial agents
    2018/12/27 17:05:07 [INFO] agent: Synced service "r-mgr-test-enmotech"
    2018/12/27 17:05:07 [INFO] agent: Synced service "w-mgr-test-enmotech"
    2018/12/27 17:05:07 [INFO] agent: Deregistered service "r-mgr-test-enmotech.com"
    2018/12/27 17:05:07 [INFO] agent: Deregistered service "w-mgr-test-enmotech.com"
    2018/12/27 17:05:07 [INFO] agent: Deregistered check "service:r-mgr-test-enmotech.com"
    2018/12/27 17:05:07 [INFO] agent: Deregistered check "service:w-mgr-test-enmotech.com"
    2018/12/27 17:05:09 [INFO] agent: Synced check "service:r-mgr-test-enmotech"
    2018/12/27 17:05:16 [WARN] agent: Check "service:w-mgr-test-enmotech" is now critical
    2018/12/27 17:05:26 [WARN] agent: Check "service:w-mgr-test-enmotech" is now critical
    ......
8. Once all nodes are up, you can inspect the membership of the whole Consul cluster:
[root@killdb ~]# /etc/consul.d/consul members -http-addr='172.16.29.160:8500'
Node          Address             Status  Type    Build  Protocol  DC   Segment
killdb        172.16.29.160:8301  alive   server  1.4.0  2         dc1  <all>
mgr_mysqldb1  172.16.29.154:8301  alive   client  1.4.0  2         dc1  <default>
mgr_mysqldb2  172.16.29.132:8301  alive   client  1.4.0  2         dc1  <default>
mgr_mysqldb3  172.16.29.133:8301  alive   client  1.4.0  2         dc1  <default>
[root@killdb ~]#
9. Verification
-- Check DNS resolution of the service names
[root@killdb ~]# dig @172.16.29.160 -p 8600 w-mgr-test-enmotech.service.consul

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.0.2.el6_4.6 <<>> @172.16.29.160 -p 8600 w-mgr-test-enmotech.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10320
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;w-mgr-test-enmotech.service.consul. IN A

;; ANSWER SECTION:
w-mgr-test-enmotech.service.consul. 0 IN A 172.16.29.154

;; ADDITIONAL SECTION:
w-mgr-test-enmotech.service.consul. 0 IN TXT "consul-network-segment="

;; Query time: 4 msec
;; SERVER: 172.16.29.160#8600(172.16.29.160)
;; WHEN: Sat Oct 7 04:07:45 2017
;; MSG SIZE rcvd: 104

[root@killdb ~]# dig @172.16.29.160 -p 8600 r-mgr-test-enmotech.service.consul

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.0.2.el6_4.6 <<>> @172.16.29.160 -p 8600 r-mgr-test-enmotech.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13148
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;r-mgr-test-enmotech.service.consul. IN A

;; ANSWER SECTION:
r-mgr-test-enmotech.service.consul. 0 IN A 172.16.29.133
r-mgr-test-enmotech.service.consul. 0 IN A 172.16.29.132

;; ADDITIONAL SECTION:
r-mgr-test-enmotech.service.consul. 0 IN TXT "consul-network-segment="
r-mgr-test-enmotech.service.consul. 0 IN TXT "consul-network-segment="

;; Query time: 4 msec
;; SERVER: 172.16.29.160#8600(172.16.29.160)
;; WHEN: Sat Oct 7 04:06:53 2017
;; MSG SIZE rcvd: 156
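Besides A records, Consul's DNS interface also serves SRV records, which carry the registered service port (3306 here) as well as the target node, so clients that can consume SRV records need not hard-code the port. A sketch, using the same test server as above:

```shell
# Query the SRV record for the write service; the answer includes
# priority, weight, the service port (3306) and the target node.
dig @172.16.29.160 -p 8600 w-mgr-test-enmotech.service.consul SRV
```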
As you can see, the write service currently resolves to 154 and the read service to 132 and 133. Next let's stop 154 and see whether the write service fails over to another node:
-- Stop MGR node1
mysql> show variables like '%hostname%';
+---------------+----------+
| Variable_name | Value    |
+---------------+----------+
| hostname      | mysqldb1 |
+---------------+----------+
1 row in set (0.00 sec)

mysql> stop group_replication;
Query OK, 0 rows affected (9.36 sec)

mysql> shutdown;
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye
-- Check the services again
[root@killdb ~]# dig @172.16.29.160 -p 8600 w-mgr-test-enmotech.service.consul

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.0.2.el6_4.6 <<>> @172.16.29.160 -p 8600 w-mgr-test-enmotech.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63202
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;w-mgr-test-enmotech.service.consul. IN A

;; ANSWER SECTION:
w-mgr-test-enmotech.service.consul. 0 IN A 172.16.29.133

;; ADDITIONAL SECTION:
w-mgr-test-enmotech.service.consul. 0 IN TXT "consul-network-segment="

;; Query time: 4 msec
;; SERVER: 172.16.29.160#8600(172.16.29.160)
;; WHEN: Sat Oct 7 04:09:57 2017
;; MSG SIZE rcvd: 104

[root@killdb ~]# dig @172.16.29.160 -p 8600 r-mgr-test-enmotech.service.consul

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.0.2.el6_4.6 <<>> @172.16.29.160 -p 8600 r-mgr-test-enmotech.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49407
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;r-mgr-test-enmotech.service.consul. IN A

;; ANSWER SECTION:
r-mgr-test-enmotech.service.consul. 0 IN A 172.16.29.132

;; ADDITIONAL SECTION:
r-mgr-test-enmotech.service.consul. 0 IN TXT "consul-network-segment="

;; Query time: 4 msec
;; SERVER: 172.16.29.160#8600(172.16.29.160)
;; WHEN: Sat Oct 7 04:10:02 2017
;; MSG SIZE rcvd: 104
The write service now resolves to a different node: the failover worked. Clearly, if this is combined with DNS so that applications access the MGR cluster through these domain names, the business-continuity problem is all but solved.
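For applications to use these names transparently, the host resolver must forward the .consul domain to Consul's DNS port. One common approach is dnsmasq forwarding; a sketch under the assumption that dnsmasq is installed and the Consul server address matches this test:

```shell
# /etc/dnsmasq.d/consul.conf -- forward *.consul queries to Consul's
# DNS interface; all other names keep using the normal resolver.
echo 'server=/consul/172.16.29.160#8600' > /etc/dnsmasq.d/consul.conf
systemctl restart dnsmasq

# The application then simply connects to the service name; the
# resolved address follows the MGR primary automatically:
mysql -h w-mgr-test-enmotech.service.consul -P 3306 -uroot -p
```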