[03:17:31] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) @Marostegui the Next time you have this problem, open the first 1GB NIC and change the setting from None to PXE and do t...
[05:15:11] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2026.codfw.wmnet...
[05:23:28] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) >>! In T260373#6441588, @Papaul wrote: > @Marostegui the Next time you have this problem, > > open the first 1GB NIC...
[05:55:14] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2026.codfw.wmnet'] ` and were **ALL** successful.
[05:56:49] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) es2026 got installed correctly: ` root@es2026:~# free -g ; df -hT /srv total used free...
[06:04:42] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) I have given it most of the vg remaining size: ` root@es2026:~# pvs PV VG Fmt Attr PSize PFree /dev/...
[06:05:35] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui)
[06:13:00] For today's PDU maintenance, there's nothing that requires mysql stopping I think: https://phabricator.wikimedia.org/T261452#6417340 none of those are masters or labsdb related
[06:13:03] The proxies aren't active either
[06:13:31] Well, maybe db1106 we can stop it
[06:13:34] it is sanitarium's s1 master
[06:13:54] I am going to stop it now so we can forget about it for today
[06:14:14] s1 labsdb is delayed anyways due to the on-going alter on the master
[06:29:42] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) Stalled→Open
[06:58:01] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[07:29:17] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for jawikivoyage - https://phabricator.wikimedia.org/T260482 (jhsoby) Search is still not working on the Japanese Wikivoyage. I believe it could be related to this task (but I'm not quite sure, since I don't really know...
[07:33:48] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for jawikivoyage - https://phabricator.wikimedia.org/T260482 (jcrespo) @jhsoby this task only relates to cloud infrastructure- it won't make search (or anything else) work on a wiki. I suggest you reopen T260320 and rep...
[08:45:30] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui)
[08:46:05] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) p:Triage→Medium
[08:52:47] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[09:23:14] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[09:23:46] ^ expected
[09:26:38] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[09:32:57] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[09:39:42] hurm. i told db1141 to reboot, but it's stuck at this for the last 10 mins:
[09:39:43] `(2 of 5) A stop job is running for …d5b725068569 (9min 52s / no limit)`
[09:40:13] uh?
[09:40:32] which job, umount, mariadb?
[09:40:39] was mysql stopped correctly?
[09:40:41] jynus: who can say
[09:40:43] marostegui: yes
[09:40:58] `systemctl stop mariadb` and `umount /srv` both completed successfully
[09:41:09] interesting...maybe it is the exporter?
[09:41:30] it looks like some/all of the stop jobs are for /dev/sda2
[09:41:41] kormat: I guess at that point the login service is already shut down so you cannot even log back in from the idrac, no?
[09:41:41] I've only seen these 2 taking a lot of time
[09:41:42] which is..
swap
[09:41:46] marostegui: correct
[09:41:58] Oh, swap could be indeed
[09:42:28] theoretically, yes, but it shouldn't be in use, especially after stopping mariadb
[09:42:29] i guess if the server had a full swap partition, it might be paging it all back in
[09:42:47] and then flush it :D
[09:42:54] let me see the stats for the host
[09:43:23] yeaah, look at this
[09:43:31] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&refresh=5m&var-server=db1141&var-datasource=thanos&var-cluster=mysql
[09:43:52] it seems very odd to me that we don't have a graph showing swap usage on that dashboard
[09:43:54] indeed
[09:43:55] kormat: hah, so then yeah, it is repaging all the stuff
[09:44:20] kormat: that, and showing io activity rather than "disk utilization"
[09:45:11] traffic and saturation as alerting tools won't work if the right metrics are not chosen indeed
[09:45:34] or USE, whatever you want to call it
[09:48:18] * kormat nods
[09:49:51] if I have the time, I will create a better dashboard and propose it as a better replacement, otherwise use it as "IOPS and memory useful metrics for databases"
[09:50:26] https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&from=now-3h&to=now&var-server=db1141&var-datasource=thanos&var-cluster=mysql says swap had 7G used
[09:50:43] checking top memory offenders right now I can see: dbstores (memory leak) and db1115?
[09:51:03] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=12
[09:51:40] memory leak?
[09:51:53] I meant that as a (memory leak?)
question
[09:51:59] ah, ok
[09:52:03] not asserting it
[09:52:24] memory for those was reduced but they still are using a lot
[09:52:24] Those hosts have pretty big queries I think, so god knows what the buffers are doing there
[09:52:41] I can reduce them a bit further if needed, I will check
[09:52:50] it could just be activity, but last time I checked, there were no queries ongoing
[09:53:01] when it happened last time
[09:53:28] if it is queries not returning memory back it would be still a case of memory leak
[09:53:39] idk
[09:53:59] it could be just them being loaded at the beginning of the month
[09:54:29] extended DT for db1141, as 30 minutes isn't going to cut it.
[09:54:51] is db1141 a normal core host?
[09:55:02] i'm really wondering if this is going to succeed. 25mins so far to page in 7G from swap, and write out to disk. on ssds.
[09:55:18] jynus: yep
[09:55:33] kormat: maybe for the next one we can try a swapoff before the shutdown and see what happens
[09:55:40] SGTM
[09:56:06] to be fair, it is the first time I think this happened to me because of swap
[09:56:41] let's also not discard hidden hw malfunction
[09:56:49] if swapoff completes quickly, then there's no harm in doing it before shutdown
[09:56:59] so i'm fine to include it in the script as standard
[09:57:34] +1
[09:59:15] kormat: when it boots, could you check the swappiness of the server?
[09:59:48] just checked the hw logs, nothing of note
[10:00:16] thinking it could be misconfigured is my only sane reason for this
[10:00:29] https://phabricator.wikimedia.org/P12518
[10:01:43] kormat: it might be on syslog once it is back
[10:01:50] marostegui: good point
[10:03:24] jynus: `vm.swappiness = 0`
[10:03:49] DBA, Operations, netops, ops-eqiad, User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi)
[10:03:59] that's expected
[10:04:08] DBA, Operations, netops, ops-eqiad, User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi) Postponed to Sept. 17th, 1pm Eastern, 17:00 UTC
[10:05:25] marostegui: nothing in syslog
[10:06:18] DBA, Operations, netops, ops-eqiad, User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Marostegui) Everything ok from the DB point of view. All the DB hosts in D4 can have a hard downtime, nothing will be impacted from our side.
[10:07:14] kormat: yeah, maybe syslog was already shut down :(
[10:07:33] seems likely
[10:08:15] kormat: maybe try another reboot now and see what happens?
[10:11:03] i note that db1142 also has a full swap partition
[10:11:26] https://phabricator.wikimedia.org/P12518#69988
[10:11:26] Interesting, let's see if a swapoff prevents that on reboot
[10:12:46] i'll do the reboot procedure on db1142 once db1141 has recovered. i've added `swapoff -a` to the pre-steps
[10:12:53] sweet thanks
[10:21:40] question, is there any meta-task with db maintenance while on switchover?
[10:21:59] T243318
[10:22:00] T243318: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318
[10:22:03] all the subtasks
[10:22:04] thanks
[10:22:19] that way I can read it without asking you all the time, thank you
[10:35:18] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[10:45:41] can you confirm data on db1133 is not needed?
[10:45:51] as in, even not as a backup?
[10:45:54] jynus: yep
[10:46:22] ok
[10:47:25] sorry for so much confirmation, you know I am paranoid about double checking before wiping a server
[10:47:38] no worries, better be safe than sorry
[11:01:11] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202009081101_jynus_9028.log`.
[11:37:39] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1133.eqiad.wmnet'] ` and were **ALL** successful.
[11:53:45] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Marostegui)
[12:04:05] alright - trying to reboot db1142 now
[12:05:01] fingers crossed
[12:06:51] `swapoff -a` is managing about 6MB/s
[12:07:04] 200MB left to go
[12:07:23] so maybe that was it with db1141
[12:07:38] stopping mariadb freed up about 6.5G of the swap
[12:12:13] reboot success
[12:12:20] \o/
[12:20:15] DBA, DC-Ops, Operations, ops-eqiad: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (Jclark-ctr) starting maintenance do not expect any outages will be disconnecting pdu`s in about 1 hour
[13:11:32] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Papaul) The log on says "It has been corrected by h/w and requires no further action" so i don't think this will be enough to replace the memory because it is not saying that there is an error but there were...
[13:12:52] DBA, Operations, ops-codfw: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) Excellent, makes sense @Papaul Right now it is not a good moment to depool an s3 host due to some on-going investigations. I will ping you once we are ready to depool this host and get it upgrad...
[13:22:53] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) @Jclark-ctr I understand that is hpe's response. What is //your// advice regarding followup steps, close this due to "no actionable"?
[13:34:29] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Dell Tech Support via r1hhmgz5xjn6.0b-gampeak.na98.bnc.salesforce.com 8:30 AM (3 minutes ago) to me ** Please Do Not Change Subj...
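Note: the pre-reboot steps mentioned in the channel (`systemctl stop mariadb`, `umount /srv`, and the newly added `swapoff -a`) could be sketched roughly as below. This is only an illustration, not the actual reboot script used here: the function names, the `DRY_RUN` switch, and the exact ordering of `umount` and `swapoff` are assumptions, and the default 6 MB/s drain rate is just the figure reported for db1142.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the real reboot script.

# Print the amount of swap in use (kB), read from a meminfo-style file.
# Defaults to /proc/meminfo when no path is given.
swap_used_kb() {
    awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' "${1:-/proc/meminfo}"
}

# Rough estimate, in whole seconds, of how long swapoff needs to drain
# the given kB of swap at a given rate in MB/s (assumed default: 6).
estimate_swapoff_seconds() {
    used_kb=$1
    rate_mb_s=${2:-6}
    echo $(( used_kb / 1024 / rate_mb_s ))
}

# Print the pre-reboot steps (DRY_RUN=1, the default) or run them in order,
# stopping at the first failure. The order is an assumption.
pre_reboot_steps() {
    for cmd in "systemctl stop mariadb" "umount /srv" "swapoff -a"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: $cmd"
        else
            $cmd || return 1
        fi
    done
}
```

Running `estimate_swapoff_seconds "$(swap_used_kb)"` before a reboot gives a rough idea of whether the scheduled downtime is long enough; stopping mariadb first matters because, as seen on db1142, that alone released most of the swap.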
[13:45:56] marostegui: there is an icinga alert on dbproxy1016
[13:46:05] jynus: yes, but that is not an active proxy
[13:46:12] so I am not even looking at that yet
[13:46:14] ok, I was just asking about it
[13:46:32] jynus: [15:44:48] jynus: there are no active masters on row d, neither active proxies
[13:46:40] I haven't checked standby ones yet
[13:46:42] but it could be down
[13:46:47] and not be on row d
[13:46:53] because the replica was down
[13:46:53] no
[13:46:57] dbproxy1016 is on row D
[13:47:01] anyway, not important right now
[13:47:01] but it is not active
[13:47:03] I didn't know that
[13:47:11] so I was notifying you
[14:09:28] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui please see below Hello Papaul, After looking over the TSR and the link you'd sent me regarding the troubleshooting for t...
[14:30:15] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul the host is depooled, we can power it off for you whenever you like
[14:33:06] DBA, Data-Services, Platform Team Initiatives (API Gateway), cloud-services-team (Kanban): Prepare and check storage layer for api.wikimedia.org - https://phabricator.wikimedia.org/T246946 (nskaggs) a:nskaggs
[15:17:56] can I ack disabled alerts on es2026 for better icinga readability?
[15:18:28] (normally this is not needed, but I am trying to remove noise from unhandled)
[15:20:00] sure
[15:20:19] pc1010 is fine
[15:20:21] only catching up
[15:25:08] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[15:33:08] DBA, DC-Ops, Operations, ops-eqiad: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (Jclark-ctr)
[15:33:31] DBA, DC-Ops, Operations, ops-eqiad: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (Jclark-ctr)
[15:48:27] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (jcrespo)
[15:51:07] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (jcrespo) As far as service setup, db1133 was correctly reimaged into buster and populated with a backup of enwiki, and started replicating. This also tested backup at the same time. Tendril and zarcillo were updat...
[16:08:17] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Dell mentioned that it is something to do with the OS and requested the sosreport. since we can not share that information with them i...
[16:16:37] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul there is nothing really on the OS that we've seen that could cause these crashes. What we did on both crashes is the same:...
[16:46:17] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[16:53:31] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (bd808)
[16:54:12] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (Marostegui) This is expected, there is maintenance going on on eqiad's master
[16:55:10] DBA, Goal: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (Papaul)
[16:56:14] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (Marostegui) For reference: {T254462} And once the above is done, we'll also start with MCR schema changes,which means they'll also get lag: {T238966} There is not...
[16:58:49] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[17:08:50] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (ST47) Thank you Manuel! Apparently I just didn't know what to search for. What is "MCR"?
[17:25:42] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (Marostegui) MCR stands for Multi Content Revision (https://mediawiki.org/wiki/Requests_for_comment/Multi-Content_Revisions) It involves altering the huge revision...
[17:57:45] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[20:45:38] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[21:44:53] DBA, Data-Services: enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (Krinkle) Wait, does this mean all Toolforge tools are frozen/broken as of 24 hours ago and will not receive any updates for the next few days? Anecdotally, that ind...
[21:44:56] DBA, Data-Services, Toolforge, cloud-services-team (Kanban): enwiki database replicas appear to be lagged and are falling further behind - https://phabricator.wikimedia.org/T262239 (Krinkle) p:Triage→Unbreak!
[21:45:46] DBA, Data-Services, cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (Krinkle)
[21:45:52] DBA, Data-Services, cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (Krinkle)
[21:45:57] DBA, Data-Services, cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (Krinkle)
[22:34:19] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)