[00:21:29] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:39:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/989513 [00:39:07] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/989513 (owner: 10TrainBranchBot) [00:39:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:51:48] (03PS1) 10Eevans: sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) [00:51:50] (03PS1) 10Eevans: sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) [00:51:52] (03PS1) 10Eevans: sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) [00:57:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/989513 (owner: 
10TrainBranchBot) [01:10:05] (03PS1) 10Eevans: sessionstore: configure new hosts to reuse /srv [puppet] - 10https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402) [01:35:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:17:09] (03CR) 10Gergő Tisza: [C: 03+1] Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: 10Tim Starling) [02:39:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:34] !log decommissioning cassandra, restbase2014-{a,b,c} — T352469 [02:51:38] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469 [02:51:52] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469 [02:55:43] (03PS1) 10Andrew Bogott: OpenStack trove: disable online_volume_resize, thus fixing volume resize [puppet] - 10https://gerrit.wikimedia.org/r/989635 [02:58:06] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack trove: disable online_volume_resize, thus fixing volume resize [puppet] - 10https://gerrit.wikimedia.org/r/989635 (owner: 10Andrew Bogott) [03:09:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:55:19] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [05:35:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:13] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) [06:09:55] (03PS1) 10Marostegui: db2180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989637 (https://phabricator.wikimedia.org/T354506) [06:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2180 T354506', diff saved to https://phabricator.wikimedia.org/P54589 and previous config saved to /var/cache/conftool/dbconfig/20240111-061039-marostegui.json [06:11:25] (03CR) 10Marostegui: [C: 03+2] db2180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989637 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [06:12:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2180.codfw.wmnet with OS bookworm [06:14:57] (03CR) 10Marostegui: "The thing with this is...if we start including misc clusters in the DC switchover (which I strongly think we should), this would break as " [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [06:18:12] (03PS1) 10Marostegui: installserver: Do not reimage db1247 [puppet] - 10https://gerrit.wikimedia.org/r/989638 [06:23:12] (03CR) 10Marostegui: [C: 
03+2] installserver: Do not reimage db1247 [puppet] - 10https://gerrit.wikimedia.org/r/989638 (owner: 10Marostegui) [06:28:11] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) [06:31:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: host reimage [06:34:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: host reimage [06:46:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:48:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:54:37] (03PS1) 10Marostegui: Revert "db2180: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989597 [06:54:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2180.codfw.wmnet with OS bookworm [06:56:30] (03CR) 10Marostegui: [C: 03+2] Revert "db2180: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989597 (owner: 10Marostegui) [06:57:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54590 and previous config saved to /var/cache/conftool/dbconfig/20240111-065747-root.json [06:58:47] (03PS1) 10Marostegui: db2180: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/989642 (https://phabricator.wikimedia.org/T354506) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0700) [07:00:05] kormat, marostegui, and 
Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0700). [07:01:42] (03CR) 10Marostegui: [C: 03+2] db2180: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/989642 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [07:10:17] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:12:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54591 and previous config saved to /var/cache/conftool/dbconfig/20240111-071252-root.json [07:17:07] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T354506', diff saved to https://phabricator.wikimedia.org/P54592 and previous config saved to /var/cache/conftool/dbconfig/20240111-072146-marostegui.json [07:22:34] (03PS1) 10Marostegui: db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989643 (https://phabricator.wikimedia.org/T354506) [07:23:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bookworm [07:23:55] (03CR) 10Marostegui: [C: 03+2] db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989643 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [07:27:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54593 and 
previous config saved to /var/cache/conftool/dbconfig/20240111-072757-root.json [07:31:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:36:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:40:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:43:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54594 and previous config saved to /var/cache/conftool/dbconfig/20240111-074302-root.json [07:45:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:07] (03PS1) 10Slyngshede: Temporarily remove RAID MD alerts. 
[alerts] - 10https://gerrit.wikimedia.org/r/989645 [07:51:00] (03CR) 10Slyngshede: "Right now I think the best course of action is to disable the RAID alert and then we can work on a solution for those cases where alerts ne" [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede) [07:55:20] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [07:58:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54595 and previous config saved to /var/cache/conftool/dbconfig/20240111-075807-root.json [07:58:19] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/989549 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0800). [08:00:05] tzatziki: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:05:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:06:49] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:08:33] I'll take a look at phab [08:11:49] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:52] heading over to -security [08:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54596 and previous config saved to /var/cache/conftool/dbconfig/20240111-081311-root.json [08:28:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54598 and previous config saved to /var/cache/conftool/dbconfig/20240111-082816-root.json [08:29:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bookworm [08:35:32] (03CR) 10Brouberol: [C: 03+1] eventschemas: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) 
(owner: 10Muehlenhoff) [08:41:37] (03PS1) 10Effie Mouzeli: ipoid: temporary fix for cronjobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/989648 [08:41:51] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) >>! In T352974#9449656, @ABran-WMF wrote: > Maybe it also has something to do with: > >>>! In T352974#9441563, @ABran-WMF wrote: >>... [08:42:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [08:45:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [08:54:43] I am going to do the Gerrit upgrade, it will be unavailable while I am performing the maintenance [08:54:47] * hashar grabs a coffee [08:57:05] (03PS1) 10Marostegui: Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989601 [08:58:12] marostegui: ^ :) [08:58:21] Gerrit is going down soonish [08:58:27] hashar: yeah no problem [08:58:36] hashar: It will take a bit for me to be able to merge it - thanks though! [08:58:42] Oh did the back port happen? 
[08:58:47] (03CR) 10Hashar: [C: 03+2] Gerrit 3.6.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987498 (https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [08:59:11] Sorry, I’m on a delayed flight (hoped to be home by backport time) [08:59:23] (03Merged) 10jenkins-bot: Gerrit 3.6.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987498 (https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [08:59:43] foks: it starts at 8:00 UTC or one hour ago [09:00:05] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0900) [09:00:28] hashar: yeah, though I don’t see chatter around it here [09:01:27] urbanecm: maybe you’re the person to reach :) [09:03:10] foks: if you mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/987424/, that wasn't backported [09:03:31] oh [09:04:06] Should i move it to another window? [09:04:26] We need to run the scripts very soon [09:05:13] foks: if you want it to be backported, yes :). you'd also want to upload a cherry-pick of the patch for the wmf.X branches you'd need this on for backport to be possible. [09:05:49] hashar: please let us know when gerrit is properly back :) [09:06:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1201.eqiad.wmnet with OS bookworm [09:07:44] The election begins on Tuesday and the scripts take 2-3 days to compile the voter list [09:08:16] urbanecm: (sorry, my airplane WiFi dropped out) - I see. I don’t know how to do that. But I will try tomorrow. [09:08:24] (Later today, UTC) [09:08:36] Thanks for the tip. [09:08:50] foks: there's a button for it in gerrit :). i can show you someday. [09:09:13] Ah cool. I’ll explore :) [09:09:19] foks: i can probably backport this for you in a couple of hours and leave it for you to run the scripts, if that'd be helpful? 
[09:09:46] urbanecm: that would be very helpful if possible [09:10:08] !log gerrit: `ssh -p 29418 gerrit.wikimedia.org gerrit copy-approvals` # T309870 [09:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:12] T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870 [09:10:35] foks: sure! i hope your flight doesn't get delayed any more :) [09:10:45] c448fc67 waiting .... 09:10:01.845 com.google.gerrit.server.approval.RecursiveApprovalCopier$$Lambda$395177/0x00007fcf2f8e1508@8fbe670 [09:10:50] * hashar whistles while code is working [09:10:56] urbanecm: me too. :) I apparently land at 3am local time. :( [09:11:28] better than not departing at all though! :) [09:11:35] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10ayounsi) Thanks, let's try the non-intrusive actions first, so re-seating the line-card. I'd expect the other linecards to show the same error if the issue was on the CB0 side, so it might be worth pushing back a bit on... [09:15:13] urbanecm: very true! [09:16:16] Gerrit is still performing some preliminary migration task (copy-approvals) [09:18:01] i'm not planning to do any deployments now :) [09:21:04] !log Stopping Gerrit [09:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:47] !log hashar@deploy2002 Started deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 [09:21:51] T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870 [09:22:04] of course scap fails ... 
[09:22:14] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 (duration: 00m 27s) [09:22:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:23:24] !log hashar@deploy2002 Started deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 [09:23:31] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 (duration: 00m 07s) [09:24:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:42] (ProbeDown) firing: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:25:36] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:14] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:28:06] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:11] (JobUnavailable) firing: (3) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:30:40] I am restarting Gerrit and will check it [09:31:10] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:32:44] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:44] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:22] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:35] !log Gerrit restarted and it's reindexing all changes T309870 [09:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:39] T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870 [09:33:58] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:11] (JobUnavailable) firing: (5) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:31] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:42] gerrit looks .. different [09:34:42] (ProbeDown) resolved: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:51] I have resumed Gerrit monitoring [09:37:07] hashar: are we good to go? 
[09:37:30] and I don't get why jinxer-wm noticed some issues [09:37:37] anyway, still checking [09:39:00] (03CR) 10JMeybohm: [C: 03+1] ipoid: temporary fix for cronjobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/989648 (owner: 10Effie Mouzeli) [09:39:37] !log Gerrit back up and operational, now running version 3.6.8 [09:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:45] ok thank you hashar [09:39:47] effie: Gerrit looks fine to me now [09:40:06] of course I might have missed something, but it looks like the basics are working [09:40:07] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: temporary fix for cronjobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/989648 (owner: 10Effie Mouzeli) [09:41:03] (03Merged) 10jenkins-bot: ipoid: temporary fix for cronjobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/989648 (owner: 10Effie Mouzeli) [09:41:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:43:24] (03CR) 10Volans: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede) [09:44:52] (03CR) 10Vgutierrez: [C: 03+2] lvs::realserver::ipip: Report errors on MSS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/989459 (https://phabricator.wikimedia.org/T354721) (owner: 10Vgutierrez) [09:48:38] (03CR) 10Marostegui: [C: 03+2] Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989601 (owner: 10Marostegui) [09:48:40] hashar: is there a way to tell the new gerrit to show the names of the people that +1 a patch without having to hover the +1 (that has also a bug given that the tooltip goes over the 
popup hiding some parts of it :D ) [09:48:52] (example https://gerrit.wikimedia.org/r/c/operations/alerts/+/989645 [09:49:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54599 and previous config saved to /var/cache/conftool/dbconfig/20240111-094928-root.json [09:49:35] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430) [09:49:48] (03CR) 10Effie Mouzeli: [C: 03+2] Add ipoid to the service mesh [puppet] - 10https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) (owner: 10Kamila Součková) [09:49:52] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [09:50:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:50:53] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [09:51:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:51:43] volans: ahhh yeah that is annoying :) [09:52:06] the votes have been moved up in the `Reviewers` list [09:52:17] so you get each of the reviewers listed together with their votes [09:52:27] ahhh now I 
see them, I missed them at first sight [09:52:33] too used to check the box below [09:52:46] yeah :\ [09:53:22] the idea of the Submit Requirements is letting one who might give the missing votes before a change get submitted [09:53:30] given Google has developers submitting the changes directly [09:53:33] whereas we rely on CI [09:53:47] the interface might well change next week again when I upgrade to 3.7 [09:53:49] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:54:06] (03CR) 10Volans: [V: 03+1 C: 03+1] Temporarily remove RAID MD alerts. [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede) [09:54:38] hashar: mmh but if I also V+1 then there is no difference in the +1 close to my name ^^^ [09:54:42] (03PS7) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) [09:54:56] anyway, I guess we'll need to adapt and get used to the new UI :D [09:55:24] not your fault :) [09:55:55] (03CR) 10Volans: [C: 03+1] Temporarily remove RAID MD alerts. [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede) [09:55:58] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10ayounsi) >>! In T352893#9450804, @akosiaris wrote: > I 've been fearing this and started thinki... 
[09:57:59] (03PS8) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) [09:58:38] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:59:02] (03PS1) 10Santiago Faci: Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746 [09:59:40] (03CR) 10Santiago Faci: [V: 03+2 C: 03+2] Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746 (owner: 10Santiago Faci) [09:59:54] (03PS9) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) [09:59:56] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [10:00:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:00:22] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:00:39] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:00:43] (03Merged) 10jenkins-bot: Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746 (owner: 10Santiago Faci) [10:01:20] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.238 second response time https://wikitech.wikimedia.org/wiki/Docker [10:03:17] !log kharlan@deploy2002 
helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:03:59] !log sfaci@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [10:04:10] !log sfaci@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [10:04:15] (03CR) 10Kamila Součková: mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [10:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54600 and previous config saved to /var/cache/conftool/dbconfig/20240111-100433-root.json [10:06:41] (03CR) 10Effie Mouzeli: [C: 03+2] service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [10:11:35] (03PS2) 10ArielGlenn: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) [10:12:03] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [10:12:08] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [10:13:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: druid::public::worker [10:19:13] (03PS1) 10Muehlenhoff: Switch druid::public::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989731 (https://phabricator.wikimedia.org/T349619) [10:19:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54601 and previous config saved to /var/cache/conftool/dbconfig/20240111-101938-root.json [10:22:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch druid::public::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989731 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:23:40] (03PS1) 10Kosta Harlan: ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733 [10:25:49] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733 (owner: 10Kosta Harlan) [10:26:02] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [10:26:36] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [10:26:39] (03Merged) 10jenkins-bot: ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733 (owner: 10Kosta Harlan) [10:28:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: druid::public::worker [10:29:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:29:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:40] (03PS1) 10Hashar: gerrit: add trailing slash to gerrit.canonicalWebUrl [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) [10:30:44] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:31:05] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:31:23] !log installing exim4 security updates [10:31:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:32:40] hashar, I like the new Gerrit interface <3. Thanks for all the work you do and the "Merge Conflict" thing in the previous version is now in the "Status" column which I can hide and not see it again :) [10:32:40] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [10:34:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54602 and previous config saved to /var/cache/conftool/dbconfig/20240111-103443-root.json [10:39:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:47:21] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10cmooney) p:05Triage→03Medium [10:47:26] (03PS1) 10Majavah: P:mail::smarthost: support DKIM dual-signing [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) [10:48:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1077/co" [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [10:48:49] (03CR) 10Hashar: "Puppet compiler:" 
[puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [10:49:11] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [10:49:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54603 and previous config saved to /var/cache/conftool/dbconfig/20240111-104948-root.json [10:50:05] (03PS1) 10Majavah: Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112) [10:51:07] (03CR) 10Jelto: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1078/co" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [10:52:01] (03CR) 10Jelto: [V: 03+1 C: 03+2] gerrit: add trailing slash to gerrit.canonicalWebUrl [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [10:52:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:54:18] !log installing Linux 5.10.205 updates on Bullseye hosts [10:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1100). 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1100) [11:01:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:01:06] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney) >>! In T352893#9452446, @ayounsi wrote: >>>! In T352893#9450804, @akosiaris wrote: >>... [11:02:19] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (10cmooney) [11:02:46] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [11:03:40] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) 05Open→03Resolved All work completed on this, lvs2014 made active for several hours and no issues. 
[11:04:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:04:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54604 and previous config saved to /var/cache/conftool/dbconfig/20240111-110453-root.json [11:05:24] * Lucas_WMDE will not be around for today’s backport window btw [11:05:44] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:05:48] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517) [11:06:10] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517) (owner: 10Peter Fischer) [11:07:01] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517) (owner: 10Peter Fischer) [11:08:37] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) Traffic has now been re-routed over the new link. Old interfaces from mr1-codfw to asw-a1-codfw has been disabled, as have the sub-interf... 
[11:15:10] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:12] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54605 and previous config saved to /var/cache/conftool/dbconfig/20240111-111958-root.json [11:23:18] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [11:31:13] (SwiftObjectCountSiteDisparity) firing: (2) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:36:35] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [11:39:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:44:57] (03CR) 10FNegri: [C: 03+1] Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112) 
(owner: 10Majavah) [11:45:12] (03CR) 10Majavah: [V: 03+2 C: 03+2] Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [11:49:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:49:15] (03PS1) 10Marostegui: db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989745 (https://phabricator.wikimedia.org/T354506) [11:49:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2124 T354506', diff saved to https://phabricator.wikimedia.org/P54606 and previous config saved to /var/cache/conftool/dbconfig/20240111-114930-marostegui.json [11:49:34] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [11:50:44] (03CR) 10Marostegui: [C: 03+2] db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989745 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [11:50:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2124.codfw.wmnet with OS bookworm [11:51:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:52:16] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 
(https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:52:27] (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:52:32] shoot [11:52:34] sorry kamila_ [11:52:46] I misclicked [11:52:51] It's going live :p [11:52:58] Okay :-D [11:53:33] Probably a good thing to do given the above alert [11:55:20] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [11:56:34] (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:57:18] (03Merged) 10jenkins-bot: mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:57:40] (03PS1) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [11:59:54] !log installing Python 2.7 security updates on Bullseye [11:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:31] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:00:54] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:01:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:03:22] (03PS2) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [12:05:42] (03PS11) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) [12:05:44] (03PS11) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [12:06:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1079/co" [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [12:07:47] (03PS1) 10Btullis: Switch all spark images to use Java 8 as their base JDK/JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) [12:08:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2124.codfw.wmnet with reason: host reimage [12:09:06] (03CR) 10Btullis: [C: 04-1] "Setting to -1 for now, since it depends on this being approved and built:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: 10Btullis) [12:10:38] (03PS12) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) [12:10:40] (03PS12) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [12:11:42] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2124.codfw.wmnet with reason: host reimage [12:15:11] (03PS1) 10Marostegui: Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989753 [12:20:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:20:52] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:22:46] (03PS13) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) [12:22:48] (03PS13) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [12:23:13] (03CR) 10Majavah: [C: 04-2] "Need to check what this does to unsubscribe requirements etc." [puppet] - 10https://gerrit.wikimedia.org/r/971893 (owner: 10Majavah) [12:24:31] (03PS14) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) [12:24:33] (03PS14) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [12:29:12] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:32:23] PROBLEM - Disk space on lists1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops [12:33:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host db2124.codfw.wmnet with OS bookworm [12:34:05] (03PS4) 10Ayounsi: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) [12:36:23] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/989769 [12:37:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:39:13] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [12:40:04] (03CR) 10Marostegui: [C: 03+2] Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989753 (owner: 10Marostegui) [12:40:11] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10ayounsi) > The problem remains that the switch name is not going to be enough to know what to... 
[12:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54607 and previous config saved to /var/cache/conftool/dbconfig/20240111-124028-root.json [12:42:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:24] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney) >>! In T352893#9452969, @ayounsi wrote: > Yep, I mentioned it in the loooong Gerrit CR... [12:45:37] (03PS1) 10Majavah: toolforge: wheel of misfortune: remove redundant defaults [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) [12:46:17] jouncebot: now [12:46:17] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [12:46:22] (03PS1) 10Ladsgroup: mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808 [12:46:22] jouncebot: next [12:46:23] In 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1300) [12:46:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1080/co" [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah) [12:47:10] !log Restarting Gerrit to apply config change https://gerrit.wikimedia.org/r/c/operations/puppet/+/989735/ # T206049 [12:47:14] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:15] T206049: Gitiles project landing pages should have an anonymous clone URL - https://phabricator.wikimedia.org/T206049 [12:47:22] (03CR) 10Cathal Mooney: "LGTM, seems using LLDP seems the easiest way forward, bailing out should protect us from unlikely edge-cases. Hard coding the vlans is fi" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [12:47:31] (03CR) 10Cathal Mooney: [C: 03+1] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [12:50:13] PROBLEM - MegaRAID on db1157 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:50:14] ACKNOWLEDGEMENT - MegaRAID on db1157 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354854 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:50:21] 10SRE, 10ops-eqiad: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10ops-monitoring-bot) [12:51:05] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) Do we have some spare disks? 
[12:52:03] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) p:05Triage→03High This is a primary master, so we should replace the disk sooner rather than later [12:52:47] RECOVERY - Disk space on lists1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops [12:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54608 and previous config saved to /var/cache/conftool/dbconfig/20240111-125533-root.json [12:57:34] (03PS22) 10Brouberol: global_config: list IPs of hadoop master/workers and kerberos nodes [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1300) [13:07:56] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) [13:10:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54609 and previous config saved to /var/cache/conftool/dbconfig/20240111-131038-root.json [13:11:25] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) 05Open→03Resolved The packages have been rebuilt and appear to install fine on snapshot1014. 
[13:17:09] (03CR) 10Clément Goubert: "Couple nits inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:18:08] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10ayounsi) > Once agreed it probably makes sense to remove profile::pybal::override_bgp_med from the puppet class, and replace it with some... [13:19:22] (03PS3) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [13:22:48] (03CR) 10Ayounsi: [C: 03+1] hiera: add new netboxdev:attachments user [puppet] - 10https://gerrit.wikimedia.org/r/989529 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon) [13:23:05] (03CR) 10Ayounsi: [C: 03+1] hiera: add fake swift passwords for netbox_dev user [labs/private] - 10https://gerrit.wikimedia.org/r/989531 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon) [13:25:28] (03CR) 10Ayounsi: [C: 03+1] "Thanks !" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [13:25:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54610 and previous config saved to /var/cache/conftool/dbconfig/20240111-132543-root.json [13:26:58] (03PS1) 10Hashar: wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959) [13:29:05] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:29:41] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:31:12] (SwiftObjectCountSiteDisparity) firing: (3) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [13:32:42] 10SRE, 10Infrastructure-Foundations, 10serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 (10MoritzMuehlenhoff) [13:32:52] 10SRE, 10Infrastructure-Foundations, 10serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:36:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10cmooney) >>! In T354839#9453034, @ayounsi wrote: > On the implementation I'm wondering if instead of introducing a new BGP community, we... 
[13:36:57] (03CR) 10Btullis: Add base production images containing Java 8 JDK and JRE (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:39:12] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:17] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [13:40:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54611 and previous config saved to /var/cache/conftool/dbconfig/20240111-134048-root.json [13:41:28] !log installing xerces-c security updates [13:41:29] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 7.838 second response time https://wikitech.wikimedia.org/wiki/Docker [13:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:43] (03PS1) 10Effie Mouzeli: services_proxy: Add ipoid to the service mesh (fix) [puppet] - 10https://gerrit.wikimedia.org/r/989829 (https://phabricator.wikimedia.org/T325147) [13:46:17] (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: Add ipoid to the service mesh (fix) [puppet] - 10https://gerrit.wikimedia.org/r/989829 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [13:46:31] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:47:37] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response 
time https://wikitech.wikimedia.org/wiki/Swift [13:48:06] (03CR) 10Hashar: [C: 03+2] wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959) (owner: 10Hashar) [13:48:38] (03Merged) 10jenkins-bot: wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959) (owner: 10Hashar) [13:49:11] !log hashar@deploy2002 Started deploy [gerrit/gerrit@af34477]: wm-zuul-status: add SCHEDULED for pending check run - T348959 [13:49:17] T348959: Verify ChecksAPI changes between Gerrit 3.5 and 3.6 - https://phabricator.wikimedia.org/T348959 [13:49:19] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@af34477]: wm-zuul-status: add SCHEDULED for pending check run - T348959 (duration: 00m 07s) [13:54:47] (03CR) 10FNegri: [C: 03+1] "Thanks for spotting this!" [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah) [13:55:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54612 and previous config saved to /var/cache/conftool/dbconfig/20240111-135553-root.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1400). [14:00:04] koi and apergos: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] o/ [14:00:26] good afternoon! here for my patch, when my turn comes. 
[14:01:42] (03CR) 10Majavah: [V: 03+1 C: 03+2] toolforge: wheel of misfortune: remove redundant defaults [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah) [14:02:37] 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [14:04:41] I wonder who is running the deployment window today [14:05:09] TheresNoTime & urbanecm have been online today [14:05:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1081/co" [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [14:05:34] i'd prefer not to deploy [14:05:55] apergos: might be up to you to take an extra window given you're here [14:06:31] I have stepped back from running these, while we schedule a retrospective on the past 2.5 years of trainings, and while my team and duties have changed [14:07:04] and it's a bit sketchy to run a window and self-deploy too [14:07:35] I'm pretty sure people have self-deployed and taken patches before [14:07:59] I'm not sure what's sketchy about offering to help someone else after you've deployed yours [14:08:32] it's the first part: run the window and self-deploy one's patch, that I don't love. but in any case: not now my role, it's on hold [14:09:41] I'm not sure who's around then. The windows are a best effort of volunteers, and not that many deployers do them.
[14:09:52] taavi: May do [14:10:04] But I don't really think it's anyone's role [14:10:13] Just nice people being helpful [14:10:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54613 and previous config saved to /var/cache/conftool/dbconfig/20240111-141058-root.json [14:11:05] I've got nothing better to be doing [14:11:06] * Reedy looks [14:12:25] (03CR) 10Reedy: [C: 03+2] add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [14:12:42] people's names get added to the window as deployers, via a process. what I mean is, it's not just random luck of the draw. if someone or several someones can't make it for a slot, that's how it is. life (and other work) happens. but then maybe we need to figure that out better. [14:13:11] (03Merged) 10jenkins-bot: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [14:14:47] Thanks Reedy [14:15:13] apergos: oh yes better needs to be done and releng know that and are thinking about it [14:15:31] I was talking to Tyler about his plans not so long back [14:15:37] hence (in part) the retro that I will be involved in [14:19:01] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 90% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976224 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [14:21:39] !log reedy@deploy2002 Synchronized wmf-config/InitialiseSettings.php: T205347 (duration: 07m 41s) [14:21:44] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [14:22:03] is that out to production, complete? 
[14:22:07] yeah [14:22:13] ok lemme just do the quick check [14:23:37] yep it's there in the edge login domains when I log in [14:23:39] thanks! [14:24:14] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:24:47] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:25:00] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:25:10] (03PS3) 10Reedy: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [14:25:13] (03PS3) 10Reedy: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang) [14:25:34] (03CR) 10Reedy: [C: 03+2] zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang) [14:25:44] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:25:53] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) I've moved a bit further on the testing part. @MoritzMuehlenhoff showed me [[ https://github.com/ikapelyukhin/go-x509-issuer-name-doe... 
[14:25:59] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:26:23] (03Merged) 10jenkins-bot: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang) [14:26:42] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:27:45] Reedy: I was about to say I could deploy https://gerrit.wikimedia.org/r/988482 but it looks like you're on it already? [14:27:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:28:11] Tchanders: Yeah, I'll grab it :) [14:28:33] thanks for taking the window, Ree dy [14:28:42] Reedy: Thank you! [14:29:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:30:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:30:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:30:17] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:30:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54614 and previous config saved to /var/cache/conftool/dbconfig/20240111-143034-marostegui.json [14:30:37] Tchanders: I left a message in Slack, it seems like the localhost:6035 URL is not working yet from mwmaint, but it might just need time to propagate. In the meantime, I think it makes sense to continue syncing the config patch. [14:30:38] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:30:40] hi Reedy, I see my patch got merged, and is it ok for me to test it now? [14:30:50] koi: it's going through the deploy train :) [14:31:05] (03PS2) 10Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) [14:31:29] wow, so when will it be deployed? [14:31:29] Reedy: There's nothing to test for mine btw (following what kostajh said, we can just go ahead) [14:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54615 and previous config saved to /var/cache/conftool/dbconfig/20240111-143143-marostegui.json [14:31:47] koi: when the train runs.... [14:31:52] uh, command [14:32:01] can take ~10 mins [14:34:00] (03CR) 10Ayounsi: "It's live on netbox-next."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [14:34:31] (03PS1) 10Jelto: trafficserver: switch design.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) [14:34:33] (03PS5) 10Klausman: amd_rocm Prometheus script: Handle a few new metrics [puppet] - 10https://gerrit.wikimedia.org/r/989833 [14:34:35] (03PS1) 10Jelto: miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) [14:36:15] !log reedy@deploy2002 Synchronized wmf-config/: T344398 (duration: 07m 25s) [14:36:19] T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398 [14:36:22] koi: it's live now [14:37:46] (03PS4) 10Reedy: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [14:37:50] (03CR) 10Reedy: [C: 03+2] ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [14:38:18] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) p:05Triage→03Medium [14:38:41] (03Merged) 10jenkins-bot: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [14:38:57] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [14:39:03] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - 
https://phabricator.wikimedia.org/T352883 (10cmooney) [14:39:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:13] 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [14:39:25] 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [14:39:33] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [14:39:38] kostajh: Tchanders, the envoy listener is working on mwmaint, at least curling it works [14:39:41] 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10cmooney) [14:39:53] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [14:39:54] ty [14:40:01] 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [14:40:09] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney) [14:41:10] Hey, I haven't deployed in a few years and I'd like to get back into it. I could use a good refresher on the tooling/environment/etc. Any suggestions about where to start / who to talk to? 
[14:41:46] stephanebisson: releng do do training, if you think you need it [14:41:57] (03CR) 10Jelto: [C: 04-1] "needs approval from design team first" [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:42:05] (03CR) 10Jelto: [C: 04-1] "needs approval from design team first" [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:42:07] Beyond that, you can just volunteer to "have a go" and ask for someone to be around for support :) [14:42:45] stephanebisson: https://wikitech.wikimedia.org/wiki/Deployments/Training [14:43:19] Wonderful [14:44:06] 10SRE, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10A_smart_kitten) Adding SRE as from what I've read it seems like it might be a relevant team here. Apologies if this is incorrect. [14:45:57] 10SRE, 10Infrastructure-Foundations: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354871 (10cmooney) p:05Triage→03Medium [14:46:08] 10SRE, 10Infrastructure-Foundations: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354871 (10cmooney) [14:46:10] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [14:46:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54616 and previous config saved to /var/cache/conftool/dbconfig/20240111-144649-marostegui.json [14:47:33] Reedy I'll go through the docs and possibly give it a go. Could you be available in support (with advance notice)? 
[14:47:52] (03PS6) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [14:48:05] 10SRE, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10cmooney) p:05Triage→03Medium [14:49:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:32] (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli) [14:51:11] !log reedy@deploy2002 Synchronized wmf-config/: T325147 (duration: 06m 43s) [14:51:14] T325147: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 [14:51:31] Note that deployment trainings are on hold for the moment, Reedy and stephanebisson, while we re-evaluate the program [14:53:41] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) Updated firmware per Dell's request, cleared logs, and re-sent a new TSR report. Waiting for response. 
[14:54:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:31] (03CR) 10Brouberol: [C: 03+2] spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [14:57:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:00:26] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Jclark-ctr) @Marostegui Server is out of warranty. Replaced Disk with ssd from recently decom server [15:01:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54617 and previous config saved to /var/cache/conftool/dbconfig/20240111-150156-marostegui.json [15:01:59] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) Thank you for being so fast! I can see the disk rebuilding now: ` Raw Size: 1.746 TB [0xdf8fe2b0 Sectors] Firmware state: =====> Rebuild <===== Media Type: Solid State Device Drive T... 
[15:03:53] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Jclark-ctr) a:03Jclark-ctr [15:04:19] PROBLEM - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:23] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:06:39] 10SRE, 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T354765 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced cable [15:10:23] (03PS1) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841 [15:17:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54618 and previous config saved to /var/cache/conftool/dbconfig/20240111-151702-marostegui.json [15:17:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [15:17:09] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:17:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [15:17:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54619 and previous config saved to /var/cache/conftool/dbconfig/20240111-151724-marostegui.json [15:17:49] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:19:21] RECOVERY - PyBal backends health 
check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54620 and previous config saved to /var/cache/conftool/dbconfig/20240111-151934-marostegui.json [15:23:45] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 183160693 was successfully submitted. [15:24:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:17] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:30:29] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:32:42] (03PS3) 10Muehlenhoff: mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 [15:34:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:41] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54621 and previous config saved to /var/cache/conftool/dbconfig/20240111-153441-marostegui.json [15:38:28] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) @ayounsi ` Hello Papaul Thanks for re-seating the FPC, at this point the next step will be doing the manual switch over of the RE to test the CB, I am aware that there are several FPCs on the device, the issue c... [15:39:55] (03PS8) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [15:40:34] (03CR) 10Arnaudb: [C: 03+1] mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 (owner: 10Muehlenhoff) [15:41:33] (03CR) 10MVernon: [V: 03+2 C: 03+2] hiera: add fake swift passwords for netbox_dev user [labs/private] - 10https://gerrit.wikimedia.org/r/989531 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon) [15:41:34] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cache::upload [15:41:42] (03CR) 10MVernon: [C: 03+2] hiera: add new netboxdev:attachments user [puppet] - 10https://gerrit.wikimedia.org/r/989529 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon) [15:41:59] (03PS1) 10Cwhite: Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 [15:42:11] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10Volans) [15:43:13] (03CR) 10CI reject: [V: 04-1] Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 (owner: 10Cwhite) [15:43:39] (03PS1) 10Muehlenhoff: Switch cache/upload to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989845 
(https://phabricator.wikimedia.org/T349619) [15:44:03] (03CR) 10Muehlenhoff: Configure ACLs for reprepro upload queue (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [15:45:17] (03CR) 10Muehlenhoff: [C: 03+2] Switch cache/upload to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989845 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:47:13] (03PS2) 10Cwhite: Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 [15:47:14] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe [15:47:23] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:51] 10SRE, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10cmooney) [15:48:55] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [15:48:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:05] (03CR) 10FNegri: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [15:49:13] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:mail::smarthost: support DKIM dual-signing [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [15:49:15] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_netboxdev:attachments.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54622 and previous config saved to /var/cache/conftool/dbconfig/20240111-154947-marostegui.json [15:50:06] I'm doing a roll-restart of the swift frontends, and the cookbook is meant to downtime them... [15:50:30] Emperor: thanks for sharing :) [15:51:06] so I'm not sure why we're getting alerts here :( [15:51:13] (SwiftObjectCountSiteDisparity) resolved: (3) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:52:15] Emperor: does the cookbook check for icinga being optimal before proceeding? 
or removes the downtime and goes ahead blindly [15:52:46] it does look to run checks [15:54:20] 10SRE, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878 (10cmooney) p:05Triage→03Medium [15:55:19] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:21] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [15:55:26] 10SRE, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878 (10cmooney) [15:55:28] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [15:58:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe [15:59:18] !log restart pybal on lvs4010 [15:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:31] RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54623 and previous config saved to /var/cache/conftool/dbconfig/20240111-160454-marostegui.json [16:04:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:05:05] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - 
https://phabricator.wikimedia.org/T354336 [16:05:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:05:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54624 and previous config saved to /var/cache/conftool/dbconfig/20240111-160516-marostegui.json [16:05:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:07:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cache::upload [16:07:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54625 and previous config saved to /var/cache/conftool/dbconfig/20240111-160725-marostegui.json [16:11:42] 10SRE-swift-storage, 10Patch-For-Review: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The new account is created for you. 
[16:15:02] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [16:15:40] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [16:16:00] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [16:19:36] (03CR) 10David Caro: [C: 03+1] "LGTM (for future reviews, if the change is small and I had approved it earlier, feel free to merge unless you want a re-review, if so, ple" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [16:19:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:22:09] (03PS1) 10Bking: WIP: Add new data platform team to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) [16:22:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54626 and previous config saved to /var/cache/conftool/dbconfig/20240111-162231-marostegui.json [16:23:20] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:23:33] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:24:15] (03PS1) 10Btullis: Add data for the new an-master100[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/989901 
(https://phabricator.wikimedia.org/T332573) [16:26:34] (03PS2) 10Btullis: Add data for the new an-master100[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) [16:29:25] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 197, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:37] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [16:36:09] (03CR) 10FNegri: dologmsg: standardize logging format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [16:37:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54628 and previous config saved to /var/cache/conftool/dbconfig/20240111-163738-marostegui.json [16:39:06] (03PS1) 10Thcipriani: Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 [16:41:09] RECOVERY - MegaRAID on db1157 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:42:13] (03CR) 10Thcipriani: [C: 03+2] Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani) [16:42:46] (03Merged) 10jenkins-bot: Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani) [16:44:34] (03PS1) 10Volans: .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905 [16:44:36] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989906 [16:46:15] (MediaWikiLatencyExceeded) firing: Average latency 
high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:49:05] (03CR) 10Btullis: "With this change, we can make the hadoop-yarn-resourcemanager service start and run as a standby on an-master100[3-4]." [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [16:49:34] (03CR) 10Volans: [C: 03+2] .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905 (owner: 10Volans) [16:49:44] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:50:03] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989906 (owner: 10Volans) [16:51:13] (03Merged) 10jenkins-bot: .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905 (owner: 10Volans) [16:51:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:51:38] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 
10https://gerrit.wikimedia.org/r/989906 (owner: 10Volans) [16:52:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54629 and previous config saved to /var/cache/conftool/dbconfig/20240111-165244-marostegui.json [16:52:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:52:48] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:53:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:53:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54630 and previous config saved to /var/cache/conftool/dbconfig/20240111-165305-marostegui.json [16:54:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54631 and previous config saved to /var/cache/conftool/dbconfig/20240111-165414-marostegui.json [16:55:21] (03PS9) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [16:56:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [16:56:23] (03PS1) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) [16:57:02] (03PS1) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694) [16:57:23] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 
(10Marostegui) 05Open→03Resolved RAID back to Optimal! Thank you! [16:57:50] (03CR) 10CI reject: [V: 04-1] Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol) [16:58:48] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1084/co" [puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol) [16:59:09] (03PS10) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:25] (03CR) 10Btullis: [C: 03+2] "I'm listed as an approver for this group, so I'm confident in adding +2." 
[puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol) [17:06:50] (03PS11) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [17:07:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [17:09:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54632 and previous config saved to /var/cache/conftool/dbconfig/20240111-170920-marostegui.json [17:11:04] (03PS12) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [17:11:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [17:12:13] (03CR) 10Hashar: "Sorry, it looks like I have messed up the fork :-(" [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani) [17:17:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:21:16] (03PS13) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [17:21:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [17:22:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:22:27] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [17:24:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54633 and previous config saved to /var/cache/conftool/dbconfig/20240111-172427-marostegui.json [17:31:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:36:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:38:36] (03CR) 10Btullis: "This can be abandoned. 
It was implemented in: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989908" [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol) [17:39:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54634 and previous config saved to /var/cache/conftool/dbconfig/20240111-173933-marostegui.json [17:39:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:39:38] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:39:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54635 and previous config saved to /var/cache/conftool/dbconfig/20240111-173955-marostegui.json [17:40:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:41:19] (03PS4) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [17:42:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54636 and previous config saved to /var/cache/conftool/dbconfig/20240111-174204-marostegui.json [17:50:54] (03CR) 10Hashar: Add base production images containing Java 8 JDK and JRE (032 comments) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [17:57:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P54637 and previous config saved to /var/cache/conftool/dbconfig/20240111-175710-marostegui.json [18:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1800) [18:05:58] nothing from me this week [18:08:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:12:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P54638 and previous config saved to /var/cache/conftool/dbconfig/20240111-181217-marostegui.json [18:13:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:15:04] (03Abandoned) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 
10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol) [18:21:08] (03PS1) 10Ilias Sarantopoulos: WIP:ml-services: deploy falcon 7b on GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) [18:23:14] !log deploying gerrit to remove devsat survey (no restart needed) [18:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:21] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit2002 only) [18:25:26] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit2002 only) (duration: 00m 05s) [18:26:14] 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10RobH) [18:26:41] 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10RobH) [18:27:01] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit primary: gerrit.wikimedia.org) [18:27:08] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit primary: gerrit.wikimedia.org) (duration: 00m 07s) [18:27:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54639 and previous config saved to /var/cache/conftool/dbconfig/20240111-182723-marostegui.json [18:27:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:27:27] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:27:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: 
Maintenance [18:27:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54640 and previous config saved to /var/cache/conftool/dbconfig/20240111-182745-marostegui.json [18:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54641 and previous config saved to /var/cache/conftool/dbconfig/20240111-182859-marostegui.json [18:44:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P54643 and previous config saved to /var/cache/conftool/dbconfig/20240111-184405-marostegui.json [18:47:00] 10SRE, 10Thumbor, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10Aklapper) [18:52:55] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896 (10RobH) [18:53:16] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896 (10RobH) [18:59:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P54644 and previous config saved to /var/cache/conftool/dbconfig/20240111-185912-marostegui.json [19:00:04] jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1900). Please do the needful. 
[19:00:24] o/ [19:02:16] (03PS1) 10Cathal Mooney: Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) [19:03:01] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089) [19:03:03] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [19:03:52] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [19:05:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:06:10] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:11:35] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.13 refs T350089 [19:11:44] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [19:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54645 and previous config saved to /var/cache/conftool/dbconfig/20240111-191418-marostegui.json [19:14:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance [19:14:34] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:14:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance [19:14:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P54646 and previous config saved to /var/cache/conftool/dbconfig/20240111-191440-marostegui.json [19:16:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T354336)', diff saved to https://phabricator.wikimedia.org/P54647 and previous config saved to /var/cache/conftool/dbconfig/20240111-191650-marostegui.json [19:17:10] 10SRE, 10Traffic: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (10A_smart_kitten) Admittedly I’m inexperienced here (and so may well be missing something), but in T354858, I received 429 error... [19:19:09] (03PS14) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [19:20:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [19:31:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P54649 and previous config saved to /var/cache/conftool/dbconfig/20240111-193156-marostegui.json [19:34:11] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:35] (03PS1) 10Gehel: Add Guillaume to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989922 (https://phabricator.wikimedia.org/T353694) [19:41:43] (03PS15) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [19:42:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) 
(owner: 10Bking) [19:46:18] (03PS3) 10Houseblaster: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) [19:47:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P54651 and previous config saved to /var/cache/conftool/dbconfig/20240111-194703-marostegui.json [19:49:43] (03PS16) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [19:50:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [19:53:37] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) @ayounsi: I never saw a blocker come on on this, so we're good to go ahead and disconnect the cross connections at each site for this correct? 
[19:54:28] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10ayounsi) yep, we're well past "after monday" :) [19:55:21] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:56:15] (03PS17) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) [19:56:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [19:56:38] (03PS6) 10Houseblaster: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) [19:56:48] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) [19:56:55] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) [19:58:04] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) As the disconnects will reference contract IDs and disconnect fees, each site's disconnect has been put in its own S4 space subtask. 
[19:58:45] (03PS1) 10Ryan Kemper: s/ alue/value [puppet] - 10https://gerrit.wikimedia.org/r/989924 [19:59:47] (03PS2) 10Ryan Kemper: s/ alue/value [puppet] - 10https://gerrit.wikimedia.org/r/989924 [20:00:10] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@07f5320]: (no justification provided) [20:00:25] (03CR) 10Majavah: wdqs-test: Enable PKI (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [20:00:38] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@07f5320]: (no justification provided) (duration: 00m 27s) [20:01:54] (03PS3) 10Ryan Kemper: Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/989924 [20:02:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T354336)', diff saved to https://phabricator.wikimedia.org/P54652 and previous config saved to /var/cache/conftool/dbconfig/20240111-200209-marostegui.json [20:02:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [20:02:25] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:02:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [20:02:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance [20:02:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance [20:02:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54653 and previous config saved to /var/cache/conftool/dbconfig/20240111-200253-marostegui.json [20:05:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54654 and previous config saved to /var/cache/conftool/dbconfig/20240111-200502-marostegui.json [20:09:27] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:30] (03CR) 10Ryan Kemper: [C: 03+2] Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/989924 (owner: 10Ryan Kemper) [20:13:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:17:07] (03PS18) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [20:18:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:18:53] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [20:20:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P54655 and previous config saved to /var/cache/conftool/dbconfig/20240111-202008-marostegui.json [20:23:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:35:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P54656 and previous config saved to /var/cache/conftool/dbconfig/20240111-203514-marostegui.json [20:36:15] (03PS4) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [20:36:17] (03PS7) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [20:38:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:42:08] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.402% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:47:08] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.402% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54657 and previous config saved to /var/cache/conftool/dbconfig/20240111-205021-marostegui.json [20:50:23] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [20:50:26] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:50:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [20:53:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:54:07] (03PS2) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841 [20:56:37] (03PS5) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [20:57:12] (03PS8) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T2100). [21:00:05] jan_drewniak and houseblaster: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:04:33] houseblaster: hi, looks like it's only your patch &my script on the backport window today [21:04:59] given it's just a config change, I can deploy your patches [21:05:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:06:29] houseblaster: will you be around to test the config change? [21:08:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:08:42] (03PS9) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [21:08:49] I will [21:09:37] (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli) [21:09:48] (03PS6) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [21:09:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster) [21:09:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [21:10:40] (03Merged) 
10jenkins-bot: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster) [21:11:19] (03PS7) 10Jdrewniak: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [21:11:22] (other than opening wikitech and reading it... :P ) [21:11:31] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [21:11:33] ops sry wrong chan [21:11:39] add ping I forgot: jan_drewniak [21:12:16] (03Merged) 10jenkins-bot: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [21:12:40] (03PS7) 10Effie Mouzeli: modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [21:12:42] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]] [21:12:56] T354013: Request to remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on English Wikipedia - https://phabricator.wikimedia.org/T354013 [21:12:56] T341388: Allow thanking bots - https://phabricator.wikimedia.org/T341388 [21:13:49] (03PS10) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [21:14:27] !log jdrewniak@deploy2002 jdrewniak and houseblaster: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:36] (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli) [21:14:45] (03PS11) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 [21:15:01] (03PS8) 10Effie Mouzeli: modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [21:16:55] houseblaster: np, the changes are ready to test on mwdebug [21:17:32] (03PS3) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841 [21:18:41] font tag change working [21:20:06] and thanking bots is working, too [21:20:22] houseblaster: ok great, continuing with sync [21:20:32] !log jdrewniak@deploy2002 jdrewniak and houseblaster: Continuing with sync [21:26:26] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]] (duration: 13m 43s) [21:26:38] T354013: Request to remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on English Wikipedia - https://phabricator.wikimedia.org/T354013 [21:26:39] T341388: Allow thanking bots - https://phabricator.wikimedia.org/T341388 [21:27:03] houseblaster: alrighty, all done :) [21:28:30] thank you! [21:31:58] oky, and I just ran my maintenance script on prod and Wikipedia has not burned down :P [21:33:06] jan_drewniak: remember to !log the script run? [21:33:43] taavi: sorry it's the first time I've done that, how do I log it? [21:35:05] jan_drewniak: type !log followed by a short summary of what you did (commands ran, phab tasks, etc.) 
[21:35:29] https://wikitech.wikimedia.org/w/index.php?title=Tool:Stashbot#!log_processing [21:36:30] !log https://phabricator.wikimedia.org/T349337#9454773 running maintenance script to delete unnecessary user preferences. [21:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:51] taavi: thanks! I'll keep that in mind [21:38:48] (03CR) 10Bking: [V: 03+2 C: 03+1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [21:40:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:44:06] (03CR) 10Dzahn: phabricator: use same db server regardless of DC of phab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [21:46:23] (03CR) 10Bking: [V: 03+2 C: 03+2] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:14:23] 10SRE, 10Thumbor, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10A_smart_kitten) [22:17:33] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:55] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:20:25] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: 
PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:20:35] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:39] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:03] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:22:17] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:22:55] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:01] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:23] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:31:39] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, 
args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:32:51] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:08] 10SRE, 10Traffic: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (10Tgr) @A_smart_kitten usually what happens is that the first few users get a HTTP 500, then the throttling logic detects that u... [22:37:37] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:37:47] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:09] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:42:59] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:46] (03PS5) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [22:44:14] (ProbeDown) firing: (6) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:29] (03CR) 10Btullis: Add base production images containing Java 8 JDK and JRE (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [22:45:43] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:46:21] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:47] (03CR) 10Btullis: [C: 03+2] Add Guillaume to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989922 (https://phabricator.wikimedia.org/T353694) (owner: 10Gehel) [23:00:09] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:53] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:02:01] (03PS1) 10DDesouza: research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989965 (https://phabricator.wikimedia.org/T352583) [23:29:09] (03CR) 10Houseblaster: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster) [23:34:12] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:54:47] (03PS1) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [23:55:21] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk