[00:41:51] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[00:43:27] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[05:10:35] DBA, Operations, ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Thank you Papaul
[05:36:00] DBA, Growth-Structured-Tasks, Growth-Team: Add a link engineering: Determine format for accessing and storing link recommendations - https://phabricator.wikimedia.org/T261411 (Marostegui) >>! In T261411#6492401, @Tgr wrote: >>>! In T261411#6486319, @Marostegui wrote: >> What I don't really see is us...
[05:51:15] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:51:33] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:54:35] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:56:16] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:56:35] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui) Waiting 24h with mysql stopped before proceeding with the decommissioning
[05:59:05] DBA, Wikidata, Wikidata-Campsite, Patch-For-Review: Investigate indexes of wb_changes - https://phabricator.wikimedia.org/T262856 (Marostegui) Open→Resolved @Ladsgroup I am going to close this - once you've got all the changes merged, let's create a normal #blocked-on-schema-change task....
[06:18:37] DBA: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (Marostegui) Sum up of hosts with the setting changed to `inserts`: s1: db2071 db2085 db2116 s2: db2108 s3: db2109 s4: db2106 s5: db2089 s6: db2087 s7: db2087 s8: db2085 db2081
[06:21:25] DBA: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (Marostegui)
[06:59:19] DBA, decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[06:59:22] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[07:27:29] DBA, Growth-Structured-Tasks, Growth-Team: Add a link engineering: Determine format for accessing and storing link recommendations - https://phabricator.wikimedia.org/T261411 (kostajh) @MGerlach and I spoke on Friday and he will adjust the output of his tool to provide structured data, e.g. instead o...
[07:43:43] marostegui: i'm repooling db2125 now. i reimaged+recloned it on friday, and it ran fine over the weekend with no load at least
[07:43:58] \o/
[07:43:59] thanks
[07:44:11] oh yeah, i should reenable notifications for it. doing now.
[07:44:42] haha I was just pushing the change
[07:44:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/630535
[07:45:01] kormat: you merge or I merge? ^
[07:45:06] go for it :)
[07:45:10] ok!
[07:45:28] done
[07:45:35] i'll do the puppet runs to have it take effect
[07:46:04] thanks!
[08:17:04] DBA, Patch-For-Review, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 (Marostegui) >>! In T239238#6491538, @Kormat wrote: > > dbctl instance db1109 set-candidate-master false ^ this should be true? The rest lo...
[08:18:43] marostegui: good catch, fixed :)
[08:18:48] \o/
[08:19:19] marostegui: i'm going to revert this, if that sounds ok: https://phabricator.wikimedia.org/T263842#6493983
[08:19:46] (when the s5 replication issues happened i added the dump/vslow host to the mw load groups to help things recover)
[08:19:54] kormat: sounds good, thanks
[08:34:45] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Open→Resolved Alright, the host is fully back in service now, so resolving this again :)
[08:46:51] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui)
[08:50:30] DBA, Operations, netops, ops-eqiad, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi) @Cmjohnson the console port is still not responding, could you please have a look before today's maintenance? As we still need to configure the switch (and m...
[10:42:19] marostegui: can orchestrator display user-defined metadata about instances?
[10:42:39] usecase: i'm looking at s2/codfw in tendril, trying to figure out what host i can add to dump/vslow while i do a schema change
[10:43:04] and while i can see the QPS for all hosts, it's impossible to know which are even in dbctl, or which are dedicated hosts, etc
[10:43:25] yeah, that is a super-typical "inventory" use case indeed
[10:44:06] kormat: user-defined as: being able to add text or whatever as a description or something similar?
[10:44:28] marostegui: i'm thinking along the lines of tags, with key/values
[10:44:39] kormat: not sure, but I will add it to the doc
[10:44:41] so maybe `dedicated:true` or `backup_source:true`
[10:44:46] cheers
[10:45:01] or even a list of the mw load groups a host is in
[10:45:07] kormat: it can definitely get stuff like: use this slave as primary master in case of failure, or: never use this host...
[10:45:08] that would all be great
[10:45:12] so maybe it has some flexibility about it
[10:45:17] Adding it to the doc
[10:45:19] maybe custom features could be added as patches if it is easily extensible
[10:45:22] 👍
[10:45:25] and not supported
[10:45:47] I think I saw a mention of plugins somewhere?
[10:47:19] sobanski: not sure about it, but I have added it to the matrix of needs on the observability doc
[10:47:22] thanks for the suggestion kormat
[10:54:24] kormat: just write 2 backends for cumin, one for the dbctl stuff and the other for orchestrator and then query from there :D
[10:54:27] * volans hides
[10:55:02] kormat: while you're at it, why not solve global warming as well?
[10:55:29] and covid please
[11:02:39] DBA, Operations, Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) I have checked the previous week's backups (22nd Sept) to see if there was anything existing for any of the involved data, at least on the PK (...
[12:13:33] marostegui: looking at https://noc.wikimedia.org/db.php, i see a number of db instances having weight 1. is that normal?
[12:14:33] kormat: yeah, vslow on big sections tend to have 1
[12:15:01] so they are checked for lag by MW but not used for traffic, as sometimes slow connections/dumps can overload them
[12:15:25] ahh, i see
[12:17:52] DBA, Data-Services: Prepare and check storage layer for arbcom_ruwiki - https://phabricator.wikimedia.org/T262832 (Urbanecm) @Marostegui The database was just created.
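(editor's aside) The key/value tag idea kormat sketches above could be modeled like this; a minimal in-memory Python sketch of the concept only, not orchestrator's actual metadata API, and the hostnames and tag names below are illustrative:

```python
# Sketch of the proposed instance tags (e.g. `dedicated:true`,
# `backup_source:true`) for answering inventory questions like
# "which hosts are in dbctl?" — hypothetical model, not real metadata.

class Instance:
    def __init__(self, name, tags=None):
        self.name = name
        self.tags = tags or {}   # free-form key/value tags

def find_by_tag(instances, key, value=None):
    """Return instances carrying a tag, optionally matching its value."""
    return [
        i for i in instances
        if key in i.tags and (value is None or i.tags[key] == value)
    ]

fleet = [
    Instance("db2108", {"section": "s2", "in_dbctl": "true"}),
    Instance("db2097", {"section": "s2", "backup_source": "true"}),
    Instance("db2104", {"section": "s2", "in_dbctl": "true",
                        "groups": "dump,vslow"}),
]

# Dedicated hosts (not pooled in dbctl) are simply those missing the tag:
spares = [i.name for i in fleet if "in_dbctl" not in i.tags]
```

A list-valued tag (like the mw load groups mentioned above) fits the same shape by storing a comma-separated or list value under one key.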
[12:18:58] DBA, Data-Services: Prepare and check storage layer for arbcom_ruwiki - https://phabricator.wikimedia.org/T262832 (Marostegui) a:Marostegui Thanks - going to check if it has been filtered correctly
[12:21:11] DBA, Data-Services, User-Urbanecm: Prepare and check storage layer for arbcom_ruwiki - https://phabricator.wikimedia.org/T262832 (Marostegui) Open→Resolved a:Marostegui→Urbanecm The database exists on s5, but not on labsdb or sanitarium hosts. Closing this! Thanks @Urbanecm! ` # for...
[12:44:50] DBA, Operations, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris) A preliminary incident report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/2020...
[13:12:06] DBA, Operations, netops, ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Cmjohnson) @ayounsi I am not able to get the console to work on the new switch, it's plugged in, I verified it worked by connecting to the current asw in d4 and get th...
[13:36:16] * sobanski stepping away to do some shopping
[13:37:01] sobanski: don't forget the kasztanki sweets!
[13:38:44] marostegui: I'll buy some and send you pictures
[13:39:02] sobanski: I have around 1kg at home still!
[14:23:30] DBA, Operations, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) The sequence of events within the transaction that failed is interesting and it definitely didn't...
[14:23:44] DBA, Operations, netops, ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi)
[14:43:48] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Kormat tell @Marostegui to not break the host again :)
[14:45:00] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Papaul)
[14:48:46] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) >>! In T260670#6498765, @Papaul wrote: > @Kormat tell @Marostegui to not break the host again :) hahah - reminder: you are the on...
[15:02:07] DBA, Operations, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (jcrespo) These are the logs of the blocks (2 inserts and 1 update?) the timestamps would be close to (but not...
[15:11:12] DBA, Operations, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) Yes, they are those, this is the order of events on the binlog for the ipblock table on that IP th...
[15:12:25] kormat: was db1114 the one involved on the d4 change I believe?
[15:12:35] marostegui: yep
[15:12:58] kormat: I am seeing it on icinga, not sure why, as it looks downtimed :-/
[15:13:21] marostegui: see #-operations. there was some issue with its network, we just replaced its sfp
[15:13:39] i'm bringing mariadb back up now
[15:13:42] ah nice, thank you :)
[15:14:00] I wonder why it still shows up, as it has the downtime icon
[15:14:20] spite, most likely
[15:14:36] XDD
[15:14:42] that's why i show up every day, at least
[15:14:46] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Papaul)
[15:16:16] marostegui: could it be that the host caught up on replication before i stopped mariadb again for the SFP replacement?
[15:17:02] kormat: from what I can see the host was downtimed but not its services, or something like that
[15:17:07] cause all the services do not look downtimed
[15:17:10] ohh
[15:17:27] that would explain it
[15:17:44] sre.hosts.cookbook, what have you done
[15:18:02] er, sre.hosts.downtime cookbook
[15:18:13] volans: does ^ not downtime the services too?
[15:20:13] kormat: eh? it does
[15:20:18] looks like it does
[15:20:22] I am seeing the history for the host
[15:20:34] looks like it expired: [2020-09-28 14:59:57] SERVICE DOWNTIME ALERT: db1114;Check size of conntrack table;STOPPED; Service has exited from a period of scheduled downtime
[15:20:39] but the icon remained on icinga
[15:21:16] So yeah, it alerted after the downtime etc
[15:21:17] Service Critical[2020-09-28 15:03:50] SERVICE ALERT: db1114;MariaDB read only s8;CRITICAL;SOFT;1;Could not connect to localhost:3306
[15:21:25] icinga says the downtime is until 15:40
[15:21:30] which is in 20mins
[15:21:49] yeah, but for the services: [2020-09-28 14:59:57] SERVICE DOWNTIME ALERT: db1114;mysqld processes;STOPPED; Service has exited from a period of scheduled downtime
[15:22:14] weeird
[15:24:05] at 12 (UTC) i ran `sudo -H cookbook sre.hosts.downtime -H3 -r "rack switch upgrade T196487" 'db1114*'`
[15:24:06] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[15:24:28] so the service exiting at that time is legit
[15:24:36] where's the other downtime from, i wonder
[15:24:45] the other one is from arzhel I think?
[15:24:52] ahh, i see
[15:24:55] ok, that explains things
[15:24:55] maybe he only downtimed the host itself and not the services?
[15:24:58] volans: unping! unping!
[15:25:05] (was it too late?)
[15:26:54] marostegui: thanks for spotting that
[15:27:02] host is repooled, and everything should be good now
[15:27:11] always happy to nerd snipe people
[15:27:14] like cdanis or volans
[15:30:21] lol
[15:42:11] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson)
[15:43:48] DBA, Operations, ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Papaul) a:Papaul→Marostegui just after 1 month we received this server, we have already a bad disk. Disk replaced.
[15:45:28] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Papaul)
[15:50:09] DBA, Operations, ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Thanks @Papaul is the disk blinking there? I still don't see it on the OS.
[15:53:01] DBA, Operations, ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) ` Time: Mon Sep 28 15:39:48 2020 Event Description: PD 02(e0x20/s2) Path 500056b34b011fc2 reset (Type 03) Time: Mon Sep 28 15:39:48 2020 Event Description: Removed: PD 02(e0x20/s2) Time: Mon...
[15:55:51] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson)
[15:59:22] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm...
[16:03:11] DBA, Operations, ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) @Papaul after putting the disk back in, I am seeing the same errors on the controller: ` [1764225.764609] megaraid_sas 0000:af:00.0: 1103 (654623787s/0x0004/CRIT) - Enclosure PD 20(c None/p1)...
[16:59:35] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1150.eqia...
[17:02:12] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm...
[17:13:07] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1150.eqiad.wmnet'] `
[17:20:08] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wmnet ` The log can be f...
[17:39:34] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson)
[17:42:57] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` and were **ALL** successful.
[17:43:36] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson) Open→Resolved @Marostegui @jcrespo All yours
[17:49:56] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (jcrespo) Thank you very much for your help, Cmjohnson!!!
[18:19:53] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (daniel) @Krinkle I think there are two parts to this. In my mind, the groups used in code are basically hints to the DB layer that a given cluster m...
[18:29:42] hello everybody, it is super late but I am trying anyway - anybody still online for a question about how to reduce the binlog size?
[18:32:57] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (daniel) A quick inventory of DB groups used in core, based on some ad-hoc grep runs: {P12819}
[19:22:39] elukey: what's the problem?
[19:23:29] depending if it has been shipped to all its replicas you can tell mysql to delete the binlogs up to a certain point in time
[19:24:35] DBA, CheckUser, Stewards-and-global-tools, I18n, and 2 others: Incomplete i18n for log entries in CheckUser - https://phabricator.wikimedia.org/T41013 (Umherirrender) For DBA to hopefully can make a decision: The new fields are all nullable and stay null until the next stage for the migration sc...
[19:24:47] but then they will keep growing again, of course you can also change the config to reduce the amount of time they are stored, but that reduces the downtime a replica can be before it has to be reimported from a more fresh dump/backup
[23:35:27] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 3.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[23:37:01] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
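(editor's aside) The two options jcrespo describes for elukey's binlog question at 19:23/19:24 correspond to standard MariaDB statements; a hedged sketch, where the binlog file name and retention value are examples only, not production settings:

```sql
-- Before deleting anything, confirm what the replicas still need:
SHOW SLAVE HOSTS;     -- on the primary: which replicas are attached
SHOW BINARY LOGS;     -- list current binlog files and their sizes

-- One-off cleanup: delete binlogs already shipped to all replicas,
-- up to a given file or point in time:
PURGE BINARY LOGS TO 'mysql-bin.001234';
PURGE BINARY LOGS BEFORE '2020-09-27 00:00:00';

-- Ongoing retention: shorter retention frees disk, but (as noted above)
-- reduces how long a replica can be down before it needs a reimport
-- from a fresh dump/backup:
SET GLOBAL expire_logs_days = 7;   -- example value, not a recommendation
```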