
We are running two instances of Percona Server 5.7.20-18-log (in the meantime updated to 5.7.21-21-log) in a master-master (dbm1 / dbm2) replication setup. The application uses only dbm1 for reads and writes, so dbm2 serves purely as a cold standby. For the past few weeks, replication on dbm2 has been stopping several times per week with the error message "unknown error reading log event on the master".

Here is the full slave status (dbm2) from the last incident:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.6.233.101
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.007091
          Read_Master_Log_Pos: 71137
               Relay_Log_File: mysqld-relay-bin.001603
                Relay_Log_Pos: 71350
        Relay_Master_Log_File: mysql-bin.007091
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 71137
              Relay_Log_Space: 71605
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'unknown error reading log event on the master; the first event 'mysql-bin.007003' at 43445, the last event read from './mysql-bin.007091' at 71137, the last byte read from './mysql-bin.007091' at 71137.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 7cb8eccb-e1a0-11e7-9d96-545edb2e572a
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 180308 06:41:25
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

We usually address this with a full restore of dbm2 using XtraBackup, but after a few days the problem on dbm2 reappears. According to our monitoring, both servers have plenty of resources left (and no I/O wait); what we can see is that the failures often coincide with the application running import jobs.
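For reference, a sketch of what such an XtraBackup rebuild typically looks like (hostnames, the `/var/lib/mysql.restore` staging directory, and credentials are placeholder assumptions; options are from Percona XtraBackup 2.4, and Ubuntu 14.04 uses upstart's `service` rather than systemd):

```shell
# On dbm1: stream a full backup to dbm2 (hypothetical staging directory)
xtrabackup --backup --stream=xbstream --target-dir=/tmp \
  | ssh dbm2 "xbstream -x -C /var/lib/mysql.restore"

# On dbm2: apply the redo log, swap the datadir, restart
xtrabackup --prepare --target-dir=/var/lib/mysql.restore
service mysql stop
mv /var/lib/mysql /var/lib/mysql.old && mv /var/lib/mysql.restore /var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
service mysql start

# Re-point replication from the coordinates recorded by the backup
cat /var/lib/mysql/xtrabackup_binlog_info
mysql -e "CHANGE MASTER TO MASTER_HOST='10.6.233.101', MASTER_USER='replicator',
          MASTER_LOG_FILE='<file>', MASTER_LOG_POS=<pos>; START SLAVE;"
```

The `<file>` and `<pos>` values come straight from `xtrabackup_binlog_info`, which is why the restored replica picks up exactly where the backup snapshot was taken.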

Both servers are located in the same datacenter and communicate over the internal network.

More details:

OS: Ubuntu 14.04.5 LTS
Kernel: OpenVZ VPS (Kernel 2.6.32-042stab127.2)
CPU: 8 cores Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 16 GB
Percona: 5.7.20-18-log Percona Server (GPL), Release '18', Revision '7ce54a6deef'

my.cnf:

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[isamchk]
key_buffer_size = 16M

[mysqld]
basedir = /usr
bind_address = *
binlog_cache_size = 1M
binlog_format = mixed
bulk_insert_buffer_size = 64M
datadir = /var/lib/mysql
expire_logs_days = 2
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
innodb_buffer_pool_size = 12G
innodb_doublewrite = OFF
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 1000
innodb_log_file_size = 256M
innodb_read_io_threads = 8
innodb_write_io_threads = 8
join_buffer_size = 244K
key_buffer_size = 100M
log-bin = mysql-bin
log-error = /var/log/mysql/error.log
log_slow_verbosity = full
long_query_time = 1
max_allowed_packet = 16M
max_binlog_files = 200
max_binlog_size = 100M
max_connections = 500
max_heap_table_size = 128M
max_relay_log_size = 256M
max_slowlog_files = 1
max_slowlog_size = 4G
myisam_sort_buffer_size = 64M
optimizer_switch = index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=off,block_nested_loop=off,batched_key_access=on,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on
pid-file = /var/run/mysqld/mysqld.pid
port = 3306
query_cache_limit = 1M
query_cache_size = 128M
query_cache_type = ON
read_buffer_size = 244K
relay_log = mysqld-relay-bin
server_id = 1
skip-external-locking
skip_name_resolve = ON
slow_query_log = ON
slow_query_log_file = /var/log/mysql/mysqld.slowlog
socket = /var/run/mysqld/mysqld.sock
sort_buffer_size = 2047K
sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
thread_cache_size = 8
thread_stack = 256K
tmp_table_size = 128M
tmpdir = /tmp
user = mysql
userstat = ON

[mysqld-5.0]
myisam-recover = BACKUP

[mysqld-5.1]
myisam-recover = BACKUP

[mysqld-5.5]
myisam-recover = BACKUP

[mysqld-5.6]
myisam-recover-options = BACKUP

[mysqld-5.7]
myisam-recover-options = BACKUP

[mysqld_safe]
log-error = /var/log/mysql/error.log
nice = 0
socket = /var/run/mysqld/mysqld.sock

[mysqldump]
max_allowed_packet = 16M
quick
quote-names

Any ideas on how to mitigate the root cause of this type of replication error?

Edit to answer questions from @wilson-hauck:

MySQLTuner report from dbm1

>> MySQLTuner 1.7.9 - Major Hayden <[email protected]>
>> Bug reports, feature requests, and downloads at mysqltuner.com
>> Run with '--help' for additional options and output filtering
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.7.20-18-log
[OK] Operating on 64-bit architecture

-------- Log file Recommendations ------------------------------------------------------------------
[--] Log file: /var/log/mysql/error.log(0B)
[OK] Log file /var/log/mysql/error.log exists
[OK] Log file /var/log/mysql/error.log is readable.
[!!] Log file /var/log/mysql/error.log is empty
[OK] Log file /var/log/mysql/error.log is smaller than 32 Mb
[OK] /var/log/mysql/error.log doesn't contain any warning.
[OK] /var/log/mysql/error.log doesn't contain any error.
[--] 0 start(s) detected in /var/log/mysql/error.log
[--] 0 shutdown(s) detected in /var/log/mysql/error.log

-------- Storage Engine Statistics -----------------------------------------------------------------
[--] Status: +ARCHIVE +BLACKHOLE +CSV -FEDERATED +InnoDB +MEMORY +MRG_MYISAM +MyISAM +PERFORMANCE_SCHEMA
[--] Data in MEMORY tables: 46M (Tables: 2)
[--] Data in MyISAM tables: 9G (Tables: 368)
[--] Data in InnoDB tables: 12G (Tables: 131)
[OK] Total fragmented tables: 0

-------- Security Recommendations ------------------------------------------------------------------
[OK] There are no anonymous accounts for any database users
[OK] All database users have passwords assigned
[!!] There is no basic password file list!

-------- CVE Security Recommendations --------------------------------------------------------------
[--] Skipped due to --cvefile option undefined

-------- Performance Metrics -----------------------------------------------------------------------
[--] Up for: 6d 19h 49m 31s (3B q [5K qps], 7M conn, TX: 5877G, RX: 1160G)
[--] Reads / Writes: 94% / 6%
[--] Binary logging is enabled (GTID MODE: OFF)
[--] Physical Memory     : 16.0G
[--] Max MySQL memory    : 13.8G
[--] Other process memory: 160.2M
[--] Total buffers: 12.4G global + 3.0M per thread (500 max threads)
[--] P_S Max memory usage: 72B
[--] Galera GCache Max memory usage: 0B
[OK] Maximum reached memory usage: 13.1G (82.10% of installed RAM)
[!!] Maximum possible memory usage: 13.8G (86.35% of installed RAM)
[OK] Overall possible memory usage with other process is compatible with memory available
[OK] Slow queries: 0% (31K/3B)
[OK] Highest usage of available connections: 53% (266/500)
[OK] Aborted connections: 0.13% (9830/7378007)
[!!] Query cache may be disabled by default due to mutex contention.
[OK] Query cache efficiency: 53.3% (1B cached / 3B selects)
[!!] Query cache prunes per day: 93123042
[OK] Sorts requiring temporary tables: 0% (261K temp sorts / 112M sorts)
[OK] No joins without indexes
[OK] Temporary tables created on disk: 12% (4M on disk / 35M total)
[OK] Thread cache hit rate: 88% (883K created / 7M connections)
[!!] Table cache hit rate: 0% (2K open / 360K opened)
[OK] Open file limit used: 28% (1K/5K)
[OK] Table locks acquired immediately: 99% (1B immediate / 1B locks)
[OK] Binlog cache memory access: 100.00% (47747694 Memory / 47748535 Total)

-------- Performance schema ------------------------------------------------------------------------
[--] Memory used by P_S: 72B
[--] Sys schema is installed.

-------- ThreadPool Metrics ------------------------------------------------------------------------
[--] ThreadPool stat is enabled.
[--] Thread Pool Size: 8 thread(s).
[!!] thread_pool_size between 16 and 36 when using InnoDB storage engine.

-------- MyISAM Metrics ----------------------------------------------------------------------------
[OK] Key buffer used: 100.0% (104M used / 104M cache)
[OK] Key buffer size / total MyISAM indexes: 100.0M/3.0G
[OK] Read Key buffer hit rate: 99.6% (41B cached / 152M reads)
[!!] Write Key buffer hit rate: 14.2% (1B cached / 242M writes)

-------- InnoDB Metrics ----------------------------------------------------------------------------
[--] InnoDB is enabled.
[--] InnoDB Thread Concurrency: 0
[OK] InnoDB File per table is activated
[!!] InnoDB buffer pool / data size: 12.0G/12.8G
[!!] Ratio InnoDB log file size / InnoDB Buffer pool size (4.16666666666667 %): 256.0M * 2/12.0G should be equal 25%
[!!] InnoDB buffer pool instances: 8
[--] Number of InnoDB Buffer Pool Chunk : 96 for 8 Buffer Pool Instance(s)
[OK] Innodb_buffer_pool_size aligned with Innodb_buffer_pool_chunk_size & Innodb_buffer_pool_instances
[OK] InnoDB Read buffer efficiency: 100.00% (55125068296 hits/ 55125632976 total)
[OK] InnoDB Write log efficiency: 92.59% (140220341 hits/ 151440226 total)
[OK] InnoDB log waits: 0.00% (0 waits / 11219885 writes)

-------- AriaDB Metrics ----------------------------------------------------------------------------
[--] AriaDB is disabled.

-------- TokuDB Metrics ----------------------------------------------------------------------------
[--] TokuDB is disabled.

-------- XtraDB Metrics ----------------------------------------------------------------------------
[--] XtraDB is disabled.

-------- RocksDB Metrics ---------------------------------------------------------------------------
[--] RocksDB is disabled.

-------- Spider Metrics ----------------------------------------------------------------------------
[--] Spider is disabled.

-------- Connect Metrics ---------------------------------------------------------------------------
[--] Connect is disabled.

-------- Galera Metrics ----------------------------------------------------------------------------
[--] Galera is disabled.

-------- Replication Metrics -----------------------------------------------------------------------
[--] Galera Synchronous replication: NO
[--] No replication slave(s) for this server.
[--] Binlog format: MIXED
[--] XA support enabled: ON
[--] Semi synchronous replication Master: Not Activated
[--] Semi synchronous replication Slave: Not Activated
[!!] This replication slave is running with the read_only option disabled.
[OK] This replication slave is up to date with master.

-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
    Reduce your overall MySQL memory footprint for system stability
    Increasing the query_cache size over 128M may reduce performance
    Increase table_open_cache gradually to avoid file descriptor limits
    Read this before increasing table_open_cache over 64: This is MyISAM only table_cache scalability problem, InnoDB not affected. See more details here: This bug already fixed in MySQL 5.7.9 and newer MySQL versions. Beware that open_files_limit (5000) variable should be greater than table_open_cache (2000)
    Thread pool size for InnoDB usage (8)
    Read this before changing innodb_log_file_size and/or innodb_log_files_in_group:
Variables to adjust:
    query_cache_size (=0)
    query_cache_type (=0)
    query_cache_size (> 128M) [see warning above]
    table_open_cache (> 2000)
    thread_pool_size between 16 and 36 for InnoDB usage
    innodb_buffer_pool_size (>= 12G) if possible.
    innodb_log_file_size should be (=1G) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
    innodb_buffer_pool_instances(=12)

MySQLTuner report from dbm2

>> MySQLTuner 1.7.9 - Major Hayden <[email protected]>
>> Bug reports, feature requests, and downloads at mysqltuner.com
>> Run with '--help' for additional options and output filtering
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.7.20-18-log
[OK] Operating on 64-bit architecture

-------- Log file Recommendations ------------------------------------------------------------------
[--] Log file: /var/log/mysql/error.log(0B)
[OK] Log file /var/log/mysql/error.log exists
[OK] Log file /var/log/mysql/error.log is readable.
[!!] Log file /var/log/mysql/error.log is empty
[OK] Log file /var/log/mysql/error.log is smaller than 32 Mb
[OK] /var/log/mysql/error.log doesn't contain any warning.
[OK] /var/log/mysql/error.log doesn't contain any error.
[--] 0 start(s) detected in /var/log/mysql/error.log
[--] 0 shutdown(s) detected in /var/log/mysql/error.log

-------- Storage Engine Statistics -----------------------------------------------------------------
[--] Status: +ARCHIVE +BLACKHOLE +CSV -FEDERATED +InnoDB +MEMORY +MRG_MYISAM +MyISAM +PERFORMANCE_SCHEMA
[--] Data in InnoDB tables: 12G (Tables: 131)
[--] Data in MEMORY tables: 46M (Tables: 2)
[--] Data in MyISAM tables: 9G (Tables: 368)
[!!] Total fragmented tables: 2

-------- Security Recommendations ------------------------------------------------------------------
[OK] There are no anonymous accounts for any database users
[OK] All database users have passwords assigned
[!!] There is no basic password file list!

-------- CVE Security Recommendations --------------------------------------------------------------
[--] Skipped due to --cvefile option undefined

-------- Performance Metrics -----------------------------------------------------------------------
[--] Up for: 6d 21h 4m 15s (89K q [0.150 qps], 39K conn, TX: 325M, RX: 52G)
[--] Reads / Writes: 0% / 100%
[--] Binary logging is enabled (GTID MODE: OFF)
[--] Physical Memory     : 20.0G
[--] Max MySQL memory    : 13.8G
[--] Other process memory: 166.5M
[--] Total buffers: 12.4G global + 3.0M per thread (500 max threads)
[--] P_S Max memory usage: 72B
[--] Galera GCache Max memory usage: 0B
[OK] Maximum reached memory usage: 12.4G (61.86% of installed RAM)
[OK] Maximum possible memory usage: 13.8G (69.08% of installed RAM)
[OK] Overall possible memory usage with other process is compatible with memory available
[OK] Slow queries: 2% (1K/89K)
[OK] Highest usage of available connections: 0% (3/500)
[!!] Aborted connections: 24.95% (9904/39702)
[!!] Query cache may be disabled by default due to mutex contention.
[!!] Query cache efficiency: 0.0% (0 cached / 29K selects)
[OK] Query cache prunes per day: 0
[!!] Sorts requiring temporary tables: 7837% (26K temp sorts / 342 sorts)
[OK] No joins without indexes
[OK] Temporary tables created on disk: 0% (392 on disk / 1M total)
[OK] Thread cache hit rate: 99% (3 created / 39K connections)
[!!] Table cache hit rate: 14% (1K open / 7K opened)
[OK] Open file limit used: 13% (657/5K)
[OK] Table locks acquired immediately: 100% (30M immediate / 30M locks)
[OK] Binlog cache memory access: 0% (0 Memory / 0 Total)

-------- Performance schema ------------------------------------------------------------------------
[--] Memory used by P_S: 72B
[--] Sys schema is installed.

-------- ThreadPool Metrics ------------------------------------------------------------------------
[--] ThreadPool stat is enabled.
[--] Thread Pool Size: 8 thread(s).
[!!] thread_pool_size between 16 and 36 when using InnoDB storage engine.

-------- MyISAM Metrics ----------------------------------------------------------------------------
[OK] Key buffer used: 99.9% (104M used / 104M cache)
[OK] Key buffer size / total MyISAM indexes: 100.0M/3.0G
[OK] Read Key buffer hit rate: 98.8% (6B cached / 73M reads)
[!!] Write Key buffer hit rate: 14.9% (1B cached / 186M writes)

-------- InnoDB Metrics ----------------------------------------------------------------------------
[--] InnoDB is enabled.
[--] InnoDB Thread Concurrency: 0
[OK] InnoDB File per table is activated
[!!] InnoDB buffer pool / data size: 12.0G/12.7G
[!!] Ratio InnoDB log file size / InnoDB Buffer pool size (4.16666666666667 %): 256.0M * 2/12.0G should be equal 25%
[!!] InnoDB buffer pool instances: 8
[--] Number of InnoDB Buffer Pool Chunk : 96 for 8 Buffer Pool Instance(s)
[OK] Innodb_buffer_pool_size aligned with Innodb_buffer_pool_chunk_size & Innodb_buffer_pool_instances
[OK] InnoDB Read buffer efficiency: 99.99% (2954768806 hits/ 2955078519 total)
[OK] InnoDB Write log efficiency: 95.93% (111839144 hits/ 116582132 total)
[OK] InnoDB log waits: 0.00% (0 waits / 4742988 writes)

-------- AriaDB Metrics ----------------------------------------------------------------------------
[--] AriaDB is disabled.

-------- TokuDB Metrics ----------------------------------------------------------------------------
[--] TokuDB is disabled.

-------- XtraDB Metrics ----------------------------------------------------------------------------
[--] XtraDB is disabled.

-------- RocksDB Metrics ---------------------------------------------------------------------------
[--] RocksDB is disabled.

-------- Spider Metrics ----------------------------------------------------------------------------
[--] Spider is disabled.

-------- Connect Metrics ---------------------------------------------------------------------------
[--] Connect is disabled.

-------- Galera Metrics ----------------------------------------------------------------------------
[--] Galera is disabled.

-------- Replication Metrics -----------------------------------------------------------------------
[--] Galera Synchronous replication: NO
[--] This server is acting as master for 1 server(s).
[--] Binlog format: MIXED
[--] XA support enabled: ON
[--] Semi synchronous replication Master: Not Activated
[--] Semi synchronous replication Slave: Not Activated
[!!] This replication slave is not running but seems to be configured.

-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
      OPTIMIZE TABLE `REDACTED`.`REDACTED`; -- can free 191.703922271729 MB
      OPTIMIZE TABLE `REDACTED`.`REDACTED`; -- can free 143.067314147949 MB
    Total freed space after theses OPTIMIZE TABLE : 334.771236419678 Mb
    Reduce or eliminate unclosed connections and network issues
    Increase table_open_cache gradually to avoid file descriptor limits
    Read this before increasing table_open_cache over 64: This is MyISAM only table_cache scalability problem, InnoDB not affected. See more details here: This bug already fixed in MySQL 5.7.9 and newer MySQL versions. Beware that open_files_limit (5000) variable should be greater than table_open_cache (2000)
    Thread pool size for InnoDB usage (8)
    Read this before changing innodb_log_file_size and/or innodb_log_files_in_group:
Variables to adjust:
    query_cache_size (=0)
    query_cache_type (=0)
    query_cache_limit (> 1M, or use smaller result sets)
    sort_buffer_size (> 1M)
    read_rnd_buffer_size (> 256K)
    table_open_cache (> 2000)
    thread_pool_size between 16 and 36 for InnoDB usage
    innodb_buffer_pool_size (>= 12G) if possible.
    innodb_log_file_size should be (=1G) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
    innodb_buffer_pool_instances(=12)

ulimits of dbm1

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514886
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514886
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

ulimits of dbm2

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030587
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1030587
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

SHOW GLOBAL STATUS @ dbm1: https://gist.github.com/adriankirchner/cfffedfcc0699798e6c9276b09094ee1

SHOW GLOBAL STATUS @ dbm2: https://gist.github.com/adriankirchner/b021d8943f59c9a2448caf5dba6f89af

SHOW GLOBAL VARIABLES @ dbm1: https://gist.github.com/adriankirchner/b644072f063f80c2f41c604d5551eb18

SHOW GLOBAL VARIABLES @ dbm2: https://gist.github.com/adriankirchner/fa0ce51eb30a7368a30242fc9acba93b

Unfortunately iostat does not work inside OpenVZ containers. Just to be on the safe side, I increased the RAM on both containers from 16 GB to 20 GB.
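iostat fails because OpenVZ does not virtualize /proc/diskstats, but the kernel's per-process I/O counters are usually still visible inside the container; a rough sketch of approximating mysqld's throughput during an import job (assumes a single mysqld process and a 2.6.32-era kernel with CONFIG_TASK_IO_ACCOUNTING):

```shell
# Cumulative bytes the mysqld process has pushed to/from storage;
# sample the counters twice and diff to estimate bytes/second
PID=$(pidof mysqld)
grep -E 'read_bytes|write_bytes' /proc/$PID/io
sleep 10
grep -E 'read_bytes|write_bytes' /proc/$PID/io
```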

The monitoring history for replication failures can be found here: https://i.sstatic.net/18Dff.png

Update 2018-04-30: Wilson provided us with a number of performance tweaks, which we are slowly rolling out to this configuration - thank you very much for this! In the meantime we updated both servers to Percona 5.7.21-21-log, but the replication errors still occur. When I dig into the binlog position stated in the error log ([..] the last event read from './mysql-bin.007091' at 71137, the last byte read from './mysql-bin.007091' at 71137.), this is the statement causing the error:

# at 71049
#180308  6:41:25 server id 1  end_log_pos 71137 CRC32 0x8bbeaab8  Query  thread_id=61875722  exec_time=0  error_code=0
SET TIMESTAMP=1520487685/*!*/;
COMMIT
/*!*/;
# at 71137   <---- position from the error message
#180308  6:41:25 server id 1  end_log_pos 71202 CRC32 0x87513662  Anonymous_GTID  last_committed=140  sequence_number=141  rbr_only=no
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
# at 71202
#180308  6:41:25 server id 1  end_log_pos 71289 CRC32 0xda97e104  Query  thread_id=61875722  exec_time=0  error_code=0
SET TIMESTAMP=1520487685/*!*/;
BEGIN
/*!*/;
# at 71289
#180308  6:41:25 server id 1  end_log_pos 71631 CRC32 0x7c566dd1  Query  thread_id=61875722  exec_time=0  error_code=0
SET TIMESTAMP=1520487685/*!*/;
update some_tables set some_col = 'some_value' where some_pk = 'some_value'   <-- REDACTED
/*!*/;
# at 71631
#180308  6:41:25 server id 1  end_log_pos 71719 CRC32 0x37a2a3a9  Query  thread_id=61875722  exec_time=0  error_code=0
SET TIMESTAMP=1520487685/*!*/;
COMMIT
/*!*/;
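The excerpt above can be reproduced on the master with something like the following (the start position 71049 is simply the transaction boundary just before the failing offset; the file name and path match this incident's error message):

```shell
# Decode the events surrounding the failing position 71137;
# --verify-binlog-checksum also validates each event's CRC32,
# which helps rule out on-disk corruption of the binlog itself
mysqlbinlog --start-position=71049 --stop-position=71719 \
  --verify-binlog-checksum /var/lib/mysql/mysql-bin.007091
```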

So the statement at the failing position is SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/; - I verified this with the binlog of another replication outage. There are very few reports of this:
- https://bugs.mysql.com/bug.php?id=85480
- https://groups.google.com/forum/#!topic/percona-discussion/u1k85oY5DB8

Both reports, however, involve servers operating with GTID_MODE=ON (we use GTID_MODE=OFF) and show entirely different error messages in the slave status.

  • For MASTER and SLAVE, individually, please post in pastebin.com. Additional information request: RAM on your host servers - are they each 16GB? Current complete my.cnf/ini; text results of A) SHOW GLOBAL STATUS; B) SHOW GLOBAL VARIABLES. Optional, very helpful information if available: htop OR top for most active apps; ulimit -a for a linux/unix list of limits; iostat -x when the system is busy for an idea of IOPS by device; df -h for a linux/unix free space list by device; free -m for a linux/unix free memory report; complete MySQLTuner.com report for analysis. Commented Mar 15, 2018 at 1:10
  • FROM dbm2 (Ubuntu OS), please post to pastebin.com or here: A) iostat -x when the system is busy for an idea of IOPS by device, B) df -h for a linux/unix free space list by device (includes hda & sda (SSD) filesystem type clues), C) free -m for a linux/unix free memory report, D) cat /proc/meminfo (includes VmallocUsed). Analysis of dbm2 is almost complete; this information will help. Commented May 21, 2018 at 15:01
  • I think this is an issue in MySQL itself. Please see the following for details: wild-growth.blogspot.com/2018/09/… Commented Sep 18, 2018 at 16:06

2 Answers


This error is usually caused by a hardware glitch or by a bug so obscure that it hasn't been found in over a decade. You haven't mentioned anything about the hardware, so the obvious first questions are whether it has ECC memory, whether you are seeing any MCE errors in your syslog, and whether sensors show overheating on any of the components. If you always see this on the same server(s) (but not ALL servers), then failing hardware is a very likely explanation.
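A few commands worth running to check for that (a sketch; since this is an OpenVZ VPS, these have to run on the hardware node rather than inside the container, and log file names and the lm-sensors package may differ by distribution):

```shell
# Machine-check exceptions logged by the kernel
grep -i 'mce\|machine check' /var/log/syslog /var/log/kern.log

# ECC corrected-error counters, if the edac kernel modules are loaded
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null

# Temperatures and voltages (requires the lm-sensors package)
sensors
```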

The optimal solution is obviously to fully re-initialize the replica, as that will avoid any risk of anomalies creeping in.

Having said that, a full rebuild is also clearly unpalatable if you have terabytes of data, so there may be a way to restore replication without rebuilding the slave completely. Use mysqlbinlog on the master to identify where the transactions begin, and look at the transactions adjacent to positions 71137 (your Exec_Master_Log_Pos) and 43445 (where the error says it choked on reading the binlog). There is a very good chance that the positions listed are erroneous and that no transaction in the binlog actually starts at those positions.

If that is the case, you can reset your replication coordinates on the slave to the last transaction mysqlbinlog lists before the position the slave thinks it is at. You can find more details on exactly what to do in this MySQL troubleshooting article. There is risk associated with doing this (if the coordinates got corrupted, there is no telling what else got corrupted), but if a full re-initialisation is prohibitively expensive right now, this approach might buy you some time, at least until you can verify (e.g. using pt-table-checksum) whether you have to rebuild the entire replica data set.
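A sketch of that recovery path (the file/position values shown are the ones from this incident and must be replaced with the last valid transaction boundary mysqlbinlog reports; the pt-table-checksum account and options are placeholder assumptions, and the checksum user needs more privileges than a bare replication account):

```shell
# On dbm2: re-point the IO thread at a known-good transaction boundary
mysql -e "STOP SLAVE;
          CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.007091', MASTER_LOG_POS=71137;
          START SLAVE;"

# Afterwards, run from the master side to verify the replica's data;
# --no-check-binlog-format is needed because this setup uses binlog_format=mixed
pt-table-checksum h=10.6.233.101,u=checksum_user --no-check-binlog-format
```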


Suggestions for your my.cnf-ini [mysqld] section to consider

innodb_buffer_pool_size=8G      # from 12G; only 16G on server and MyISAM involved
#sort_buffer_size=2047K         # lead with # for default
#read_buffer_size=244K          # lead with # for default
#join_buffer_size=244K          # lead with # for default
thread_cache_size=16            # from 8 to avoid threads_created as often
query_cache_min_res_unit=512    # from 4096 to reduce QC RAM reqd per SELECT stored

Please spend 2 minutes at MySQLCalculator.com to recognize the extreme RAM footprint your current my.cnf/ini variables would require - only 16G is available.
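The arithmetic behind that footprint can be approximated by hand from the posted my.cnf (a back-of-the-envelope sketch; the ~3 MB per-thread figure is MySQLTuner's rounded estimate of the sort/join/read buffers plus thread stack, and smaller global caches are ignored):

```shell
# Global buffers in MB: buffer pool 12G + query cache 128M + key buffer 100M + binlog cache 1M
global_mb=$(( 12 * 1024 + 128 + 100 + 1 ))
# Worst case: all 500 allowed connections each allocate ~3MB of per-thread buffers
per_thread_mb=3
echo $(( global_mb + per_thread_mb * 500 ))   # prints 14017 (MB), i.e. the ~13.8G MySQLTuner warns about
```

Dropping the buffer pool to 8G, as suggested above, takes roughly 4G off that worst case and leaves headroom for the MyISAM key cache and OS page cache.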

Also, please post a complete MySQLTuner report; it will include the version of MySQL you are using, data space used and table counts by ENGINE, cache utilization, threads created, and other helpful information.

Please post complete results of A) ulimit -a, so we know the limits your OS is working with, and B) iostat -x for extended IOPS reads/writes to your drives when busy.

  • Until yesterday the replication was stable. But then the replication crashed again with: Got fatal error 1236 from master when reading data from binary log: 'unknown error reading log event on the master; the first event 'mysql-bin.009853' at 74393, the last event read from './mysql-bin.010340' at 1238, the last byte read from './mysql-bin.010340' at 1238.' I'll add the information you asked for. Commented Apr 17, 2018 at 10:11
  • @adrian How many days was replication stable? Thanks for posting the requested available information. Will pursue analysis during the next week and offer suggestions as details emerge. A TON of info to analyze. THANKS. Commented Apr 17, 2018 at 17:44
  • @Adrian Please review this link's information from Percona technician dated October 8, 2014 percona.com/blog/2014/10/08/… When will you have time to use Chat of serverfault.com with me? I am available next 72 hours, then next availability will be after May 2, 2018. Commented Apr 19, 2018 at 10:55
  • @Adrian could you chat at this time? Commented Apr 20, 2018 at 5:33
  • Hey @Wilson, sorry for the late answer. Thank you very much for your time and effort. I added you on Skype - feel free to ping me when you have the time. I already knew the blog post from Percona, but unfortunately it only deals with known fatal errors. Before my last comment, the replication ran fine for about a month. But as you can see in the following screenshot, the replication survived only about 6 days after the last time we fixed it: i.sstatic.net/18Dff.png Commented Apr 23, 2018 at 12:00
