[01:48:20] DBA, Data-Services, cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (BrownHairedGirl) >>! In T262239#6458850, @Marostegui wrote: > For what is worth, the last maintenance that needs to happen on s1...
[03:24:54] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 31.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[03:25:14] PROBLEM - MariaDB sustained replica lag on db1149 is CRITICAL: 11 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[03:28:50] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[03:29:08] RECOVERY - MariaDB sustained replica lag on db1149 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[04:57:25] DBA, DC-Ops, Operations, ops-eqiad: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (Marostegui)
[04:57:42] DBA, DC-Ops, Operations, ops-eqiad: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (Marostegui) p:Triage→Medium
[05:28:09] DBA, Data-Services, cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (Marostegui) >>! In T262239#6464980, @BrownHairedGirl wrote: >>>! In T262239#6458850, @Marostegui wrote: >> For what is worth, the...
[06:01:49] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) Ongoing transfer: es2015 -> es2031
[06:16:15] I am forcing a rerun of etcd backups, will check what logs say if they fail again
[06:16:45] ok
[06:18:45] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), and 2 others: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Marostegui) Thanks for the heads up
[06:58:50] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) - 22nd Jul 2019 server delivered to the DC - 18th Aug 2020: server crashes with the following errors - server totally unresponsive...
[07:04:20] alex giving good news re: m2 maintenance
[07:04:37] I think only db2078:m2 lagged as it is the deepest in m2 hierarchy
[07:05:27] should go back to normal any time soon: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&var-server=db2078&var-port=13322
[07:35:44] marostegui: on icinga i see that db1096+db2098 have warnings for replica io+sql, but they haven't paged, which seems odd
[07:36:02] kormat: those are backups+m2 I believe
[07:36:22] kormat: you mean db1095 and db2098?
[07:36:49] oops, yes
[07:37:01] kormat: db1095 and db2098 are backups, so we stop replication for the snapshots
[07:37:08] so that's expected, you can safely ignore it
[07:37:24] ohh, right. thanks :)
[07:39:01] threads only page (or get critical) if stopped with an error
[07:39:33] if stopped manually, we let lag be the one that warns of an issue
[07:39:49] this is because in the past it was typical to stop replication on hosts that did not replicate anything
[07:40:33] in a way, replication being stopped (unless because of an error) is not an issue, lag is
[07:41:02] e.g. when stopped on purpose for a backup
[07:41:58] jynus: why doesn't lag fire if the slave thread is stopped?
[07:42:07] lag fires
[07:42:15] I just acked it manually
[07:42:24] every time the backups run?
[07:42:36] in an ideal world, lag would be configured for backup hosts with a 2-3h grace
[07:43:01] I am waiting for your refactoring to be able to do that :-D
[07:43:11] hah, i see :)
[07:43:22] (it also has disabled notifications, I only did it now because we were around)
[07:44:37] kormat: please look at backlog of patches for wmfmariadbpy, I think they are starting to pile up 0:-D
[07:45:15] the m2 stuff is a one-time thing due to ongoing maintenance
[07:45:19] on otrs
[07:58:17] this is as per alex's suggestion, but feel free to comment https://gerrit.wikimedia.org/r/c/operations/puppet/+/627745
[08:06:58] how did transfer go when done manually?
[08:08:05] maybe what breaks is not the source or target hosts, but the control from cumin?
[08:19:46] <_joe_> heads up: I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/625583, which should fix https://phabricator.wikimedia.org/T224589#5597603 in future installations
[08:42:11] jynus: it went ok
[08:46:25] I will hot-deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/627745 on all affected hosts (m1, m2, m5)
[08:46:56] <_joe_> jynus: ok to merge your change?
[08:47:01] yes
[08:47:24] hot here, in context, means with a live mysql session
[08:49:29] thank you
[08:55:35] DBA, DC-Ops, Operations, ops-eqiad: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (Marostegui) mysql stopped on db1121 (sanitarium master)
[08:55:50] DBA, DC-Ops, Operations, ops-eqiad: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (Marostegui) mysql stopped on db1123 (s3 master), db1093 (s6 master) and db1109 (s8 master)
[08:58:29] I think it is just better and simpler to do it on phab, too
[09:05:44] kormat: the same stall that could happen to you days ago may be happening now
[09:05:49] Re: gerrit
[09:06:09] i'm trying to remember what happened there
[09:06:16] (and failing)
[09:07:24] * jynus knows exactly what to do
[09:07:36] * jynus searches for gerrit on wikitech and tries to find something
[09:07:41] :-)
[09:33:05] DBA, Growth-Structured-Tasks, Growth-Team: Add a link engineering: Determine format for accessing and storing link recommendations - https://phabricator.wikimedia.org/T261411 (Joe) A couple clarifying questions: - your idea is to store this in parsercache? - did you estimated the amount of storage...
[10:34:18] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) es2027 is fully pooled into es3 es2028 is fully pooled into es1
[10:34:47] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[13:02:08] marostegui: related to what we just talked about: https://i.imgur.com/HNVUodr.png
[13:30:36] sobanski: :D
[13:31:36] rotfl
[13:31:53] but hey, I heard that 2020 is really the year of linux on the desktop
[13:42:01] It is on my desktop, and that's all that matters.
[14:24:03] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), and 2 others: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Mholloway) I should mention that we'll enable on testwiki first and...
[14:25:29] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), and 2 others: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Marostegui) I am fine closing this, but please once it is released e...
[14:36:59] DBA, Performance-Team, WikimediaDebug, Patch-For-Review: Additional database user for XHGui administration - https://phabricator.wikimedia.org/T260640 (Marostegui) Open→Resolved I am going to consider this resolved - please reopen if you need something else Thanks!
[15:04:55] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), and 2 others: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Mholloway) Will do. I'll leave this open for now, so that we rememb...
[15:10:31] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (RobH)
[16:17:16] DBA, DC-Ops, Operations, ops-eqiad: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (Marostegui) The on-site work here was fully done or is there anything pending that requires power changes? :)
[17:27:08] marostegui: db1131 is it down and can be moved anytime?
[17:28:10] cmjohnson1: no, but I can put it down tomorrow if you like? (or any other day that works for you)
[17:29:27] put it down tomorrow I will move it after PDU
[17:30:11] cmjohnson1: excellent, I will have it ready for you. if you give me an ip I can also change the network/interfaces so it boots up with the new one straightaway
[17:30:39] let me see if I can do that...we have new IPD system
[17:30:46] IP assignment
[17:30:48] upps
[17:31:19] sure, just leave it here or in the task and I will do that on the OS before powering it down
[17:31:29] can you do the DNS change though?
[18:14:19] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson)
[18:14:22] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson) racked
[18:39:09] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (wiki_willy) Thanks @Marostegui for the outline and @Papaul for the verbal context. The email's sent out to our account rep for escalation, so...
[23:20:10] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui Papaul, I understand that the last Bios update didn't fix the issue you were experiencing, however these recommendati...