[05:37:05] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) es2029 (es3) and es2030 (es1) have been cloned and have been included into dbctl. However, as it is Friday, I won't pool them today, I will slowly pool them on Monday after enabling notif...
[05:57:28] DBA: Failover s6 master, db1093 to db1131 - https://phabricator.wikimedia.org/T263227 (Marostegui)
[05:57:52] DBA: Failover s6 master, db1093 to db1131 - https://phabricator.wikimedia.org/T263227 (Marostegui) p:Triage→Medium a:Marostegui
[05:58:03] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Marostegui) a:Kormat
[06:03:34] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) Open→Resolved Closing per the internal email thread. If this happens again we'll reopen and contact Dell again.
[06:03:40] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) >>! In T262247#6442725, @Papaul wrote: > The log on says "It has been corrected by h/w and requires no further action" so i don't think this will be enough to replace the memory because it is not...
[09:00:20] DBA, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Kormat)
[10:17:53] https://twitter.com/ShlomiNoach/status/1306886325954580481
[10:43:46] I think I am going to generate an initial release of wmfbackups debian package for better testing, even if there is a lot of work to do still
[11:04:22] DBA, Growth-Structured-Tasks, Growth-Team: Add a link engineering: Determine format for accessing and storing link recommendations - https://phabricator.wikimedia.org/T261411 (Marostegui) In addition to @joe's questions I would like to know a bit more about how this table would be used, especially if...
[11:39:30] DBA: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Marostegui)
[11:43:21] DBA: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Marostegui) There was a buffer overflow detected.... ` Sep 18 11:10:32 db2125 mysqld[3189]: It is possible that mysqld could use up to Sep 18 11:10:32 db2125 mysqld[3189]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_...
[11:54:38] DBA, Patch-For-Review: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Marostegui) p:Triage→Medium I want to try setting this for this host once it is recloned, to see how it goes: ` innodb_change_buffering=inserts ` MariaDB recommended trying that there's a suspect that...
[11:55:09] DBA, Patch-For-Review: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Kormat) a:Kormat
[11:55:16] DBA, Patch-For-Review, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Kormat)
[12:16:31] DBA, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Marostegui) For the record this is Mariadb's comment : ` This is a known problem also in Oracle MySQL, since at least MySQL 5.5.
It happened very infrequently, and we did not find out a way to trigger it. Like I wro...
[12:49:35] DBA, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202009181242_kormat_3990.log`.
[13:10:05] DBA, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] ` and were **ALL** successful.
[14:28:39] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Papaul) @Marostegui any day that works for you works for me as well
[14:31:52] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) Thank you @Papaul - I will have it ready by Monday
[15:32:14] DBA, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Kormat) Machine has been reimaged, and the db restored from backup. Puppet is disabled, `innodb_change_buffering=inserts` has been added to `/etc/my.cnf`, and the machine is currently catching up on replication.
[15:34:11] DBA, User-Kormat: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 (Marostegui) Thank you so much Stevie for taking care of this!. And for posterity: ` root@db2125.codfw.wmnet[(none)]> show global variables like '%change_buffering'; +-------------------------+---------+ | Variable_...
[15:40:54] jynus: marostegui I have been writing my reports of the load on s8 in here: https://phabricator.wikimedia.org/T246415#6474175
[15:41:01] This might be of an interest to you
[15:41:19] Thank you Amir1 I will check on Monday
[15:44:04] Amir1: one thing that I'm not sure you are aware of is that "special replicas" can have a positive impact on performance if done well, but can also have a negative impact on availability
[15:44:24] it was a huge problem to maintain specific optimizations just for a subset of servers
[15:44:36] yeah, It's a trade-off
[15:46:04] that doesn't have any hard consequences, but so it is considered in the ticket
[15:46:53] for optimization, one of the things I didn't like much is how static is our load balancing, it is not based on resulting latency
[15:48:58] on a scale of what are good and bad uses of load groups: great- dumps, bad: unique partitioning structure that limits failover possibilities
[19:01:34] I think one way to do this is term store (wbt_* tables) because none of them join with tables from the outside, so we can slowly move it to a dedicated section in future and to gain lots of storage