[00:21:29] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:03] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:39:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989513
[00:39:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989513 (owner: TrainBranchBot)
[00:39:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:51:48] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore1004 (new) [puppet] - https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402)
[00:51:50] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore1005 (new) [puppet] - https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402)
[00:51:52] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore1006 (new) [puppet] - https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402)
[00:57:54] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989513 (owner: TrainBranchBot)
[01:10:05] <wikibugs>	 (PS1) Eevans: sessionstore: configure new hosts to reuse /srv [puppet] - https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402)
[01:35:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:17:09] <wikibugs>	 (CR) Gergő Tisza: [C: +1] Disable SameSite legacy cookies [mediawiki-config] - https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: Tim Starling)
[02:39:11] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:34] <urandom>	 !log decommissioning cassandra, restbase2014-{a,b,c} — T352469
[02:51:38] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469
[02:51:52] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469
[02:55:43] <wikibugs>	 (PS1) Andrew Bogott: OpenStack trove: disable online_volume_resize, thus fixing volume resize [puppet] - https://gerrit.wikimedia.org/r/989635
[02:58:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack trove: disable online_volume_resize, thus fixing volume resize [puppet] - https://gerrit.wikimedia.org/r/989635 (owner: Andrew Bogott)
[03:09:11] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:55:19] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[05:35:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:13] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[06:09:55] <wikibugs>	 (PS1) Marostegui: db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989637 (https://phabricator.wikimedia.org/T354506)
[06:10:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2180 T354506', diff saved to https://phabricator.wikimedia.org/P54589 and previous config saved to /var/cache/conftool/dbconfig/20240111-061039-marostegui.json
[06:11:25] <wikibugs>	 (CR) Marostegui: [C: +2] db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989637 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:12:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2180.codfw.wmnet with OS bookworm
[06:14:57] <wikibugs>	 (CR) Marostegui: "The thing with this is...if we start including misc clusters in the DC switchover (which I strongly think we should), this would break as " [puppet] - https://gerrit.wikimedia.org/r/989537 (owner: Dzahn)
[06:18:12] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db1247 [puppet] - https://gerrit.wikimedia.org/r/989638
[06:23:12] <wikibugs>	 (CR) Marostegui: [C: +2] installserver: Do not reimage db1247 [puppet] - https://gerrit.wikimedia.org/r/989638 (owner: Marostegui)
[06:28:11] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[06:31:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[06:34:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[06:46:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:48:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:54:37] <wikibugs>	 (PS1) Marostegui: Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989597
[06:54:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2180.codfw.wmnet with OS bookworm
[06:56:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989597 (owner: Marostegui)
[06:57:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54590 and previous config saved to /var/cache/conftool/dbconfig/20240111-065747-root.json
[06:58:47] <wikibugs>	 (PS1) Marostegui: db2180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/989642 (https://phabricator.wikimedia.org/T354506)
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0700).
[07:01:42] <wikibugs>	 (CR) Marostegui: [C: +2] db2180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/989642 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[07:10:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:12:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54591 and previous config saved to /var/cache/conftool/dbconfig/20240111-071252-root.json
[07:17:07] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T354506', diff saved to https://phabricator.wikimedia.org/P54592 and previous config saved to /var/cache/conftool/dbconfig/20240111-072146-marostegui.json
[07:22:34] <wikibugs>	 (PS1) Marostegui: db1201: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989643 (https://phabricator.wikimedia.org/T354506)
[07:23:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bookworm
[07:23:55] <wikibugs>	 (CR) Marostegui: [C: +2] db1201: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989643 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[07:27:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54593 and previous config saved to /var/cache/conftool/dbconfig/20240111-072757-root.json
[07:31:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:36:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:40:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:43:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54594 and previous config saved to /var/cache/conftool/dbconfig/20240111-074302-root.json
[07:45:07] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:07] <wikibugs>	 (PS1) Slyngshede: Temporarily remove RAID MD alerts. [alerts] - https://gerrit.wikimedia.org/r/989645
[07:51:00] <wikibugs>	 (CR) Slyngshede: "Right now I think the best course of action is to disable the RAID alert and then we can work on a solution for those cases where alerts ne" [alerts] - https://gerrit.wikimedia.org/r/989645 (owner: Slyngshede)
[07:55:20] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:58:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54595 and previous config saved to /var/cache/conftool/dbconfig/20240111-075807-root.json
[07:58:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/989549 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0800).
[08:00:05] <jouncebot>	 tzatziki: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:06:49] <jinxer-wm>	 (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:08:33] <jelto>	 I'll take a look at phab
[08:11:49] <jinxer-wm>	 (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:11:52] <jelto>	 heading over to -security
[08:13:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54596 and previous config saved to /var/cache/conftool/dbconfig/20240111-081311-root.json
[08:28:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54598 and previous config saved to /var/cache/conftool/dbconfig/20240111-082816-root.json
[08:29:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bookworm
[08:35:32] <wikibugs>	 (CR) Brouberol: [C: +1] eventschemas: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[08:41:37] <wikibugs>	 (PS1) Effie Mouzeli: ipoid: temporary fix for cronjobs [deployment-charts] - https://gerrit.wikimedia.org/r/989648
[08:41:51] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) >>! In T352974#9449656, @ABran-WMF wrote: > Maybe it also has something to do with: >  >>>! In T352974#9441563, @ABran-WMF wrote: >>...
[08:42:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage
[08:45:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage
[08:54:43] <hashar>	 I am going to do the Gerrit upgrade, it will be unavailable while I am performing the maintenance
[08:54:47] * hashar grabs a coffee
[08:57:05] <wikibugs>	 (PS1) Marostegui: Revert "db1201: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989601
[08:58:12] <hashar>	 marostegui: ^ :)
[08:58:21] <hashar>	 Gerrit is going down soonish
[08:58:27] <marostegui>	 hashar: yeah no problem
[08:58:36] <marostegui>	 hashar: It will take a bit for me to be able to merge it - thanks though!
[08:58:42] <foks>	 Oh, did the backport happen?
[08:58:47] <wikibugs>	 (CR) Hashar: [C: +2] Gerrit 3.6.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.6) - https://gerrit.wikimedia.org/r/987498 (https://phabricator.wikimedia.org/T309870) (owner: Hashar)
[08:59:11] <foks>	 Sorry, I’m on a delayed flight (hoped to be home by backport time)
[08:59:23] <wikibugs>	 (Merged) jenkins-bot: Gerrit 3.6.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.6) - https://gerrit.wikimedia.org/r/987498 (https://phabricator.wikimedia.org/T309870) (owner: Hashar)
[08:59:43] <hashar>	 foks: it started at 8:00 UTC, one hour ago
[09:00:05] <jouncebot>	 hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T0900)
[09:00:28] <foks>	 hashar: yeah, though I don’t see chatter around it here
[09:01:27] <foks>	 urbanecm: maybe you’re the person to reach :)
[09:03:10] <urbanecm>	 foks: if you mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/987424/, that wasn't backported
[09:03:31] <foks>	 oh
[09:04:06] <foks>	 Should I move it to another window?
[09:04:26] <foks>	 We need to run the scripts very soon
[09:05:13] <urbanecm>	 foks: if you want it to be backported, yes :). you'd also want to upload a cherry-pick of the patch for the wmf.X branches you'd need this on for backport to be possible. 
[09:05:49] <effie>	 hashar: please let us know when gerrit is properly back :)
[09:06:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1201.eqiad.wmnet with OS bookworm
[09:07:44] <foks>	 The election begins on Tuesday and the scripts take 2-3 days to compile the voter list
[09:08:16] <foks>	 urbanecm: (sorry, my airplane WiFi dropped out) - I see. I don’t know how to do that. But I will try tomorrow.
[09:08:24] <foks>	 (Later today, UTC)
[09:08:36] <foks>	 Thanks for the tip.
[09:08:50] <urbanecm>	 foks: there's a button for it in gerrit :). i can show you someday. 
[09:09:13] <foks>	 Ah cool. I’ll explore :)
[09:09:19] <urbanecm>	 foks: i can probably backport this for you in a couple of hours and leave it for you to run the scripts, if that'd be helpful?
[09:09:46] <foks>	 urbanecm: that would be very helpful if possible
[09:10:08] <hashar>	 !log gerrit: `ssh -p 29418 gerrit.wikimedia.org gerrit copy-approvals` # T309870
[09:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:12] <stashbot>	 T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870
[09:10:35] <urbanecm>	 foks: sure! i hope your flight doesn't get delayed any more :)
[09:10:45] <hashar>	 c448fc67 waiting .... 09:10:01.845      com.google.gerrit.server.approval.RecursiveApprovalCopier$$Lambda$395177/0x00007fcf2f8e1508@8fbe670
[09:10:50] * hashar whistles while code is working
[09:10:56] <foks>	 urbanecm: me too. :) I apparently land at 3am local time. :(
[09:11:28] <urbanecm>	 better than not departing at all though! :)
[09:11:35] <wikibugs>	 SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi) Thanks, let's try the non-intrusive actions first, so re-seating the line-card.  I'd expect the other linecards to show the same error if the issue was on the CB0 side, so it might be worth pushing back a bit on...
[09:15:13] <foks>	 urbanecm: very true!
[09:16:16] <hashar>	 Gerrit is still performing some preliminary migration task (copy-approvals)
[09:18:01] <urbanecm>	 i'm not planning to do any deployments now :)
[09:21:04] <hashar>	 !log Stopping Gerrit
[09:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:47] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870
[09:21:51] <stashbot>	 T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870
[09:22:04] <hashar>	 of course scap fails ...
[09:22:14] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 (duration: 00m 27s)
[09:22:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:23:24] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870
[09:23:31] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e099b0b]: Gerrit to version 3.6.8 # T309870 (duration: 00m 07s)
[09:24:31] <jinxer-wm>	 (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:24:42] <jinxer-wm>	 (ProbeDown) firing: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:36] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:14] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:28:06] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:30:40] <hashar>	 I am restarting Gerrit and will check it
[09:31:10] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:32:44] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:44] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:22] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:35] <hashar>	 !log Gerrit restarted and it's reindexing all changes T309870
[09:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:39] <stashbot>	 T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870
[09:33:58] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:11] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:34:31] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:34:42] <jelto>	 gerrit looks .. different 
[09:34:42] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:35:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:36:51] <hashar>	 I have resumed Gerrit monitoring
[09:37:07] <effie>	 hashar: are we good to go?
[09:37:30] <hashar>	 and I don't get why jinxer-wm noticed some issues
[09:37:37] <hashar>	 anyway, still checking
[09:39:00] <wikibugs>	 (CR) JMeybohm: [C: +1] ipoid: temporary fix for cronjobs [deployment-charts] - https://gerrit.wikimedia.org/r/989648 (owner: Effie Mouzeli)
[09:39:37] <hashar>	 !log Gerrit back up and operational, now running version 3.6.8
[09:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:45] <effie>	 ok thank you hashar 
[09:39:47] <hashar>	 effie: Gerrit looks fine to me now
[09:40:06] <hashar>	 of course I might have missed something, but it looks like the basics are working
[09:40:07] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] ipoid: temporary fix for cronjobs [deployment-charts] - https://gerrit.wikimedia.org/r/989648 (owner: Effie Mouzeli)
[09:41:03] <wikibugs>	 (Merged) jenkins-bot: ipoid: temporary fix for cronjobs [deployment-charts] - https://gerrit.wikimedia.org/r/989648 (owner: Effie Mouzeli)
[09:41:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:43:24] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/989645 (owner: Slyngshede)
[09:44:52] <wikibugs>	 (CR) Vgutierrez: [C: +2] lvs::realserver::ipip: Report errors on MSS monitoring [puppet] - https://gerrit.wikimedia.org/r/989459 (https://phabricator.wikimedia.org/T354721) (owner: Vgutierrez)
[09:48:38] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1201: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989601 (owner: Marostegui)
[09:48:40] <volans>	 hashar: is there a way to tell the new gerrit to show the names of the people that +1 a patch without having to hover over the +1? (the hover also has a bug: the tooltip goes over the popup, hiding some parts of it :D )
[09:48:52] <volans>	 (example: https://gerrit.wikimedia.org/r/c/operations/alerts/+/989645 )
[09:49:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54599 and previous config saved to /var/cache/conftool/dbconfig/20240111-094928-root.json
[09:49:35] <wikibugs>	 (PS1) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430)
[09:49:48] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Add ipoid to the service mesh [puppet] - https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) (owner: Kamila Součková)
[09:49:52] <wikibugs>	 (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430) (owner: Kosta Harlan)
[09:50:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:50:53] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989726 (https://phabricator.wikimedia.org/T351430) (owner: Kosta Harlan)
[09:51:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[09:51:43] <hashar>	 volans: ahhh yeah that is annoying :)
[09:52:06] <hashar>	 the votes have been moved up in the `Reviewers` list
[09:52:17] <hashar>	 so you get each of the reviewers listed together with their votes
[09:52:27] <volans>	 ahhh now I see them, I missed them at first sight
[09:52:33] <volans>	 too used to checking the box below
[09:52:46] <hashar>	 yeah :\
[09:53:22] <hashar>	 the idea of the Submit Requirements is letting one know who might give the missing votes before a change gets submitted
[09:53:30] <hashar>	 given Google has developers submitting the changes directly
[09:53:33] <hashar>	 whereas we rely on CI
[09:53:47] <hashar>	 the interface might well change next week again when I upgrade to 3.7
[09:53:49] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[09:54:06] <wikibugs>	 (03CR) 10Volans: [V: 03+1 C: 03+1] Temporarily remove RAID MD alerts. [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede)
[09:54:38] <volans>	 hashar: mmh but if I also V+1 then there is no difference in the +1 close to my name ^^^
[09:54:42] <wikibugs>	 (03PS7) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147)
[09:54:56] <volans>	 anyway, I guess we'll need to adapt and get used to the new UI :D
[09:55:24] <volans>	 not your fault :)
[09:55:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] Temporarily remove RAID MD alerts. [alerts] - 10https://gerrit.wikimedia.org/r/989645 (owner: 10Slyngshede)
[09:55:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10ayounsi) >>! In T352893#9450804, @akosiaris wrote: > I 've been fearing this and started thinki...
[09:57:59] <wikibugs>	 (03PS8) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147)
[09:58:38] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[09:59:02] <wikibugs>	 (03PS1) 10Santiago Faci: Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746
[09:59:40] <wikibugs>	 (03CR) 10Santiago Faci: [V: 03+2 C: 03+2] Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746 (owner: 10Santiago Faci)
[09:59:54] <wikibugs>	 (03PS9) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147)
[09:59:56] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[10:00:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:00:22] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[10:00:39] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:00:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Deploying to staging to test the fix with production data" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989746 (owner: 10Santiago Faci)
[10:01:20] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.238 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:03:17] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:03:59] <logmsgbot>	 !log sfaci@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[10:04:10] <logmsgbot>	 !log sfaci@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[10:04:15] <wikibugs>	 (03CR) 10Kamila Součková: mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[10:04:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54600 and previous config saved to /var/cache/conftool/dbconfig/20240111-100433-root.json
[10:06:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli)
[10:11:35] <wikibugs>	 (03PS2) 10ArielGlenn: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347)
[10:12:03] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[10:12:08] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[10:13:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: druid::public::worker
[10:19:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch druid::public::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989731 (https://phabricator.wikimedia.org/T349619)
[10:19:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54601 and previous config saved to /var/cache/conftool/dbconfig/20240111-101938-root.json
[10:22:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch druid::public::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989731 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:23:40] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733
[10:25:49] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733 (owner: 10Kosta Harlan)
[10:26:02] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[10:26:36] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[10:26:39] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Remove testing cronjobs from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/989733 (owner: 10Kosta Harlan)
[10:28:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: druid::public::worker
[10:29:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:29:11] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:30:40] <wikibugs>	 (03PS1) 10Hashar: gerrit: add trailing slash to gerrit.canonicalWebUrl [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049)
[10:30:44] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[10:31:05] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[10:31:23] <moritzm>	 !log installing exim4 security updates
[10:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:01] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:32:40] <xSavitar>	 hashar, I like the new Gerrit interface <3. Thanks for all the work you do. The "Merge Conflict" thing from the previous version is now in the "Status" column, which I can hide and not see again :)
[10:32:40] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[10:34:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54602 and previous config saved to /var/cache/conftool/dbconfig/20240111-103443-root.json
[10:39:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:47:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10cmooney) p:05Triage→03Medium
[10:47:26] <wikibugs>	 (03PS1) 10Majavah: P:mail::smarthost: support DKIM dual-signing [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112)
[10:48:31] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1077/co" [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[10:48:49] <wikibugs>	 (03CR) 10Hashar: "Puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[10:49:11] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[10:49:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54603 and previous config saved to /var/cache/conftool/dbconfig/20240111-104948-root.json
[10:50:05] <wikibugs>	 (03PS1) 10Majavah: Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112)
[10:51:07] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1078/co" [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[10:52:01] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gerrit: add trailing slash to gerrit.canonicalWebUrl [puppet] - 10https://gerrit.wikimedia.org/r/989735 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[10:52:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:54:18] <moritzm>	 !log installing Linux 5.10.205 updates on Bullseye hosts
[10:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:05] <jouncebot>	 mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1100).
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1100)
[11:01:04] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney) >>! In T352893#9452446, @ayounsi wrote: >>>! In T352893#9450804, @akosiaris wrote: >>...
[11:02:19] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (10cmooney)
[11:02:46] <icinga-wm>	 PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[11:03:40] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) 05Open→03Resolved All work completed on this, lvs2014 made active for several hours and no issues.
[11:04:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:04:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54604 and previous config saved to /var/cache/conftool/dbconfig/20240111-110453-root.json
[11:05:24] * Lucas_WMDE will not be around for today’s backport window btw
[11:05:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:05:48] <wikibugs>	 (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517)
[11:06:10] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517) (owner: 10Peter Fischer)
[11:07:01] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989739 (https://phabricator.wikimedia.org/T354517) (owner: 10Peter Fischer)
[11:08:37] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) Traffic has now been re-routed over the new link.  Old interfaces from mr1-codfw to asw-a1-codfw have been disabled, as have the sub-interf...
[11:15:10] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:15:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:19:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54605 and previous config saved to /var/cache/conftool/dbconfig/20240111-111958-root.json
[11:23:18] <icinga-wm>	 RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[11:31:13] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: (2) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[11:36:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[11:39:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:44:57] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[11:45:12] <wikibugs>	 (03CR) 10Majavah: [V: 03+2 C: 03+2] Add fake wmcs-rsa DKIM keys for Cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/989738 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[11:49:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:49:15] <wikibugs>	 (03PS1) 10Marostegui: db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989745 (https://phabricator.wikimedia.org/T354506)
[11:49:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2124 T354506', diff saved to https://phabricator.wikimedia.org/P54606 and previous config saved to /var/cache/conftool/dbconfig/20240111-114930-marostegui.json
[11:49:34] <stashbot>	 T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[11:50:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989745 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui)
[11:50:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2124.codfw.wmnet with OS bookworm
[11:51:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:52:16] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[11:52:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[11:52:32] <claime>	 shoot
[11:52:34] <claime>	 sorry kamila_
[11:52:46] <claime>	 I misclicked
[11:52:51] <claime>	 It's going live :p
[11:52:58] <kamila_>	 Okay :-D
[11:53:33] <kamila_>	 Probably a good thing to do given the above alert
[11:55:20] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:56:34] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[11:57:18] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[11:57:40] <wikibugs>	 (03PS1) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[11:59:54] <moritzm>	 !log installing Python 2.7 security updates on Bullseye
[11:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:31] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[12:00:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:01:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:03:22] <wikibugs>	 (03PS2) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[12:05:42] <wikibugs>	 (03PS11) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112)
[12:05:44] <wikibugs>	 (03PS11) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893
[12:06:54] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1079/co" [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[12:07:47] <wikibugs>	 (03PS1) 10Btullis: Switch all spark images to use Java 8 as their base JDK/JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777)
[12:08:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2124.codfw.wmnet with reason: host reimage
[12:09:06] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] "Setting to -1 for now, since it depends on this being approved and built:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: 10Btullis)
[12:10:38] <wikibugs>	 (03PS12) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112)
[12:10:40] <wikibugs>	 (03PS12) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893
[12:11:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2124.codfw.wmnet with reason: host reimage
[12:15:11] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989753
[12:20:33] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:20:52] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:22:46] <wikibugs>	 (03PS13) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112)
[12:22:48] <wikibugs>	 (03PS13) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893
[12:23:13] <wikibugs>	 (03CR) 10Majavah: [C: 04-2] "Need to check what this does to unsubscribe requirements etc." [puppet] - 10https://gerrit.wikimedia.org/r/971893 (owner: 10Majavah)
[12:24:31] <wikibugs>	 (03PS14) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112)
[12:24:33] <wikibugs>	 (03PS14) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893
[12:29:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:32:23] <icinga-wm>	 PROBLEM - Disk space on lists1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops
[12:33:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2124.codfw.wmnet with OS bookworm
[12:34:05] <wikibugs>	 (03PS4) 10Ayounsi: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893)
[12:36:23] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/989769
[12:37:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:39:13] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi)
[12:40:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/989753 (owner: 10Marostegui)
[12:40:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10ayounsi) >  The problem remains that the switch name is not going to be enough to know what to...
[12:40:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54607 and previous config saved to /var/cache/conftool/dbconfig/20240111-124028-root.json
[12:42:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:42:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney) >>! In T352893#9452969, @ayounsi wrote: > Yep, I mentioned it in the loooong Gerrit CR...
[12:45:37] <wikibugs>	 (03PS1) 10Majavah: toolforge: wheel of misfortune: remove redundant defaults [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430)
[12:46:17] <hashar>	 jouncebot: now
[12:46:17] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[12:46:22] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808
[12:46:22] <hashar>	 jouncebot: next
[12:46:23] <jouncebot>	 In 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1300)
[12:46:33] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1080/co" [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah)
[12:47:10] <hashar>	 !log Restarting Gerrit to apply config change https://gerrit.wikimedia.org/r/c/operations/puppet/+/989735/ # T206049
[12:47:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:15] <stashbot>	 T206049: Gitiles project landing pages should have an anonymous clone URL - https://phabricator.wikimedia.org/T206049
[12:47:22] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM, using LLDP seems the easiest way forward, bailing out should protect us from unlikely edge-cases.  Hard coding the vlans is fi" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi)
[12:47:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi)
[12:50:13] <icinga-wm>	 PROBLEM - MegaRAID on db1157 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:50:14] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1157 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354854 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:50:21] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10ops-monitoring-bot)
[12:51:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) Do we have some spare disks?
[12:52:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) p:05Triage→03High This is a primary master, so we should replace the disk sooner rather than later
[12:52:47] <icinga-wm>	 RECOVERY - Disk space on lists1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops
[12:55:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54608 and previous config saved to /var/cache/conftool/dbconfig/20240111-125533-root.json
[12:57:34] <wikibugs>	 (03PS22) 10Brouberol: global_config: list IPs of hadoop master/workers and kerberos nodes [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1300)
[13:07:56] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui)
[13:10:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54609 and previous config saved to /var/cache/conftool/dbconfig/20240111-131038-root.json
[13:11:25] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) 05Open→03Resolved The packages have been rebuilt and appear to install fine on snapshot1014.
[13:17:09] <wikibugs>	 (03CR) 10Clément Goubert: "Couple nits inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:18:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10ayounsi) > Once agreed it probably makes sense to remove profile::pybal::override_bgp_med from the puppet class, and replace it with some...
[13:19:22] <wikibugs>	 (03PS3) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[13:22:48] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] hiera: add new netboxdev:attachments user [puppet] - 10https://gerrit.wikimedia.org/r/989529 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon)
[13:23:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] hiera: add fake swift passwords for netbox_dev user [labs/private] - 10https://gerrit.wikimedia.org/r/989531 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon)
[13:25:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Thanks !" [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney)
[13:25:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54610 and previous config saved to /var/cache/conftool/dbconfig/20240111-132543-root.json
[13:26:58] <wikibugs>	 (03PS1) 10Hashar: wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959)
[13:29:05] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:29:41] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:31:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: (3) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[13:32:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 (10MoritzMuehlenhoff)
[13:32:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:36:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (10cmooney) >>! In T354839#9453034, @ayounsi wrote: > On the implementation I'm wondering if instead of introducing a new BGP community, we...
[13:36:57] <wikibugs>	 (03CR) 10Btullis: Add base production images containing Java 8 JDK and JRE (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:39:12] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:17] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[13:40:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54611 and previous config saved to /var/cache/conftool/dbconfig/20240111-134048-root.json
[13:41:28] <moritzm>	 !log installing xerces-c security updates
[13:41:29] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 7.838 second response time https://wikitech.wikimedia.org/wiki/Docker
[13:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:43] <wikibugs>	 (03PS1) 10Effie Mouzeli: services_proxy: Add ipoid to the service mesh (fix) [puppet] - 10https://gerrit.wikimedia.org/r/989829 (https://phabricator.wikimedia.org/T325147)
[13:46:17] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: Add ipoid to the service mesh (fix) [puppet] - 10https://gerrit.wikimedia.org/r/989829 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli)
[13:46:31] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:47:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:48:06] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959) (owner: 10Hashar)
[13:48:38] <wikibugs>	 (03Merged) 10jenkins-bot: wm-zuul-status: add SCHEDULED for pending check run [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989828 (https://phabricator.wikimedia.org/T348959) (owner: 10Hashar)
[13:49:11] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@af34477]: wm-zuul-status: add SCHEDULED for pending check run - T348959
[13:49:17] <stashbot>	 T348959: Verify ChecksAPI changes between Gerrit 3.5 and 3.6 - https://phabricator.wikimedia.org/T348959
[13:49:19] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@af34477]: wm-zuul-status: add SCHEDULED for pending check run - T348959 (duration: 00m 07s)
[13:54:47] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "Thanks for spotting this!" [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah)
[13:55:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54612 and previous config saved to /var/cache/conftool/dbconfig/20240111-135553-root.json
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1400).
[14:00:04] <jouncebot>	 koi and apergos: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:21] <koi>	 o/
[14:00:26] <apergos>	 good afternoon!  here for my patch, when my turn comes.
[14:01:42] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] toolforge: wheel of misfortune: remove redundant defaults [puppet] - 10https://gerrit.wikimedia.org/r/989807 (https://phabricator.wikimedia.org/T354430) (owner: 10Majavah)
[14:02:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[14:04:41] <apergos>	 I wonder who is running the deployment window today
[14:05:09] <RhinosF1>	 TheresNoTime & urbanecm have been online today
[14:05:31] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1081/co" [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[14:05:34] <urbanecm>	 i'd prefer not to deploy
[14:05:55] <RhinosF1>	 apergos: might be up to you to take an extra window given you're here
[14:06:31] <apergos>	 I have stepped back from running these, while we schedule a retrospective on the past 2.5 years of trainings, and while my team and duties have changed
[14:07:04] <apergos>	 and it's a bit sketchy to run a window and self-deploy too 
[14:07:35] <RhinosF1>	 I'm pretty sure people have self-deployed and taken patches before
[14:07:59] <RhinosF1>	 I'm not sure what's sketchy about offering to help someone else after you've deployed yours
[14:08:32] <apergos>	 it's the first part: run the window and self-deploy one's patch, that I don't love. but in any case: not now my role, it's on hold
[14:09:41] <RhinosF1>	 I'm not sure who's around then. The windows are a best effort by volunteers, as not that many deployers do them.
[14:09:52] <RhinosF1>	 taavi: May do
[14:10:04] <RhinosF1>	 But I don't really think it's anyone's role
[14:10:13] <RhinosF1>	 Just nice people being helpful
[14:10:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54613 and previous config saved to /var/cache/conftool/dbconfig/20240111-141058-root.json
[14:11:05] <Reedy>	 I've got nothing better to be doing
[14:11:06] * Reedy looks
[14:12:25] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn)
[14:12:42] <apergos>	 people's names get added to the window as deployers, via a process. what I mean is, it's not just random luck of the draw. if someone or several someones can't make it for a slot, that's how it is. life (and other work) happens. but then maybe we need to figure that out better.
[14:13:11] <wikibugs>	 (03Merged) 10jenkins-bot: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn)
[14:14:47] <RhinosF1>	 Thanks Reedy
[14:15:13] <RhinosF1>	 apergos: oh yes better needs to be done and releng know that and are thinking about it
[14:15:31] <RhinosF1>	 I was talking to Tyler about his plans not so long back
[14:15:37] <apergos>	 hence (in part) the retro that I will be involved in
[14:19:01] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 90% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976224 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[14:21:39] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/InitialiseSettings.php: T205347 (duration: 07m 41s)
[14:21:44] <stashbot>	 T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347
[14:22:03] <apergos>	 is that out to production, complete?
[14:22:07] <Reedy>	 yeah
[14:22:13] <apergos>	 ok lemme just do the quick check
[14:23:37] <apergos>	 yep it's there in the edge login domains when I log in
[14:23:39] <apergos>	 thanks!
[14:24:14] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:24:47] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:25:00] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:25:10] <wikibugs>	 (03PS3) 10Reedy: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan)
[14:25:13] <wikibugs>	 (03PS3) 10Reedy: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang)
[14:25:34] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang)
[14:25:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:25:53] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) I've moved a bit further on the testing part. @MoritzMuehlenhoff showed me [[ https://github.com/ikapelyukhin/go-x509-issuer-name-doe...
[14:25:59] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:26:23] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) (owner: 10Stang)
[14:26:42] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:27:45] <Tchanders>	 Reedy: I was about to say I could deploy https://gerrit.wikimedia.org/r/988482 but it looks like you're on it already?
[14:27:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:28:11] <Reedy>	 Tchanders: Yeah, I'll grab it :)
[14:28:33] <apergos>	 thanks for taking the window, Reedy
[14:28:42] <Tchanders>	 Reedy: Thank you!
[14:29:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:29:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:30:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:30:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:30:17] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:30:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:30:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54614 and previous config saved to /var/cache/conftool/dbconfig/20240111-143034-marostegui.json
[14:30:37] <kostajh>	 Tchanders: I left a message in Slack, it seems like the localhost:6035 URL is not working yet from mwmaint, but it might just need time to propagate. In the meantime, I think it makes sense to continue syncing the config patch.
[14:30:38] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:30:40] <koi>	 hi Reedy, I see my patch got merged, and is it ok for me to test it now?
[14:30:50] <Reedy>	 koi: it's going through the deploy train :)
[14:31:05] <wikibugs>	 (03PS2) 10Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529)
[14:31:29] <koi>	 wow, so when will it be deployed?
[14:31:29] <Tchanders>	 Reedy: There's nothing to test for mine btw (following what kostajh said, we can just go ahead)
[14:31:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54615 and previous config saved to /var/cache/conftool/dbconfig/20240111-143143-marostegui.json
[14:31:47] <Reedy>	 koi: when the train runs....
[14:31:52] <Reedy>	 uh, command
[14:32:01] <Reedy>	 can take ~10 mins
[14:34:00] <wikibugs>	 (03CR) 10Ayounsi: "It's live on netbox-next." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi)
[14:34:31] <wikibugs>	 (03PS1) 10Jelto: trafficserver: switch design.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791)
[14:34:33] <wikibugs>	 (03PS5) 10Klausman: amd_rocm Prometheus script: Handle a few new metrics [puppet] - 10https://gerrit.wikimedia.org/r/989833
[14:34:35] <wikibugs>	 (03PS1) 10Jelto: miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791)
[14:36:15] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/: T344398 (duration: 07m 25s)
[14:36:19] <stashbot>	 T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398
[14:36:22] <Reedy>	 koi: it's live now
[14:37:46] <wikibugs>	 (03PS4) 10Reedy: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan)
[14:37:50] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan)
[14:38:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) p:05Triage→03Medium
[14:38:41] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan)
[14:38:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[14:39:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10cmooney)
[14:39:11] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[14:39:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[14:39:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[14:39:38] <claime>	 kostajh: Tchanders, the envoy listener is working on mwmaint, at least curling it works
[14:39:41] <wikibugs>	 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10cmooney)
[14:39:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[14:39:54] <koi>	 ty
[14:40:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[14:40:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10cmooney)
[14:41:10] <stephanebisson>	 Hey, I haven't deployed in a few years and I'd like to get back into it. I could use a good refresher on the tooling/environment/etc. Any suggestions about where to start / who to talk to?
[14:41:46] <Reedy>	 stephanebisson: releng do do training, if you think you need it
[14:41:57] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "needs approval from design team first" [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:42:05] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "needs approval from design team first" [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:42:07] <Reedy>	 Beyond that, you can just volunteer to "have a go" and ask for someone to be around for support :)
[14:42:45] <Reedy>	 stephanebisson: https://wikitech.wikimedia.org/wiki/Deployments/Training
[14:43:19] <stephanebisson>	 Wonderful
[14:44:06] <wikibugs>	 10SRE, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10A_smart_kitten) Adding SRE as from what I've read it seems like it might be a relevant team here. Apologies if this is incorrect.
[14:45:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354871 (10cmooney) p:05Triage→03Medium
[14:46:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354871 (10cmooney)
[14:46:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[14:46:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54616 and previous config saved to /var/cache/conftool/dbconfig/20240111-144649-marostegui.json
[14:47:33] <stephanebisson>	 Reedy I'll go through the docs and possibly give it a go. Could you be available in support (with advance notice)?
[14:47:52] <wikibugs>	 (03PS6) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[14:48:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10cmooney) p:05Triage→03Medium
[14:49:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli)
[14:51:11] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/: T325147 (duration: 06m 43s)
[14:51:14] <stashbot>	 T325147: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147
[14:51:31] <apergos>	 Note that deployment trainings are on hold for the moment, Reedy and stephanebisson, while we re-evaluate the program
[14:53:41] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) Updated firmware per Dell's request, cleared logs, resent new TSR report. Waiting for response.
[14:54:11] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:31] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol)
[14:57:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:00:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Jclark-ctr) @Marostegui Server is out of warranty. Replaced disk with an SSD from a recently decommissioned server
[15:01:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54617 and previous config saved to /var/cache/conftool/dbconfig/20240111-150156-marostegui.json
[15:01:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) Thank you for being so fast! I can see the disk rebuilding now:  `     Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]     Firmware state: =====> Rebuild <=====     Media Type: Solid State Device     Drive T...
[15:03:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Jclark-ctr) a:03Jclark-ctr
[15:04:19] <icinga-wm>	 PROBLEM - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:06:39] <wikibugs>	 10SRE, 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T354765 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced cable
[15:10:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841
[15:17:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T354336)', diff saved to https://phabricator.wikimedia.org/P54618 and previous config saved to /var/cache/conftool/dbconfig/20240111-151702-marostegui.json
[15:17:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[15:17:09] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[15:17:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[15:17:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54619 and previous config saved to /var/cache/conftool/dbconfig/20240111-151724-marostegui.json
[15:17:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:19:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:19:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54620 and previous config saved to /var/cache/conftool/dbconfig/20240111-151934-marostegui.json
[15:23:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 183160693 was successfully submitted.
[15:24:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:25:17] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:28:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:30:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:32:42] <wikibugs>	 (03PS3) 10Muehlenhoff: mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721
[15:34:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54621 and previous config saved to /var/cache/conftool/dbconfig/20240111-153441-marostegui.json
[15:38:28] <wikibugs>	 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) @ayounsi  ` Hello Papaul  Thanks for re-seating the FPC, at this point the next step will be doing the manual switch over of the RE to test the CB, I am aware that there are several FPCs on the device, the issue c...
[15:39:55] <wikibugs>	 (03PS8) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[15:40:34] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 (owner: 10Muehlenhoff)
[15:41:33] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] hiera: add fake swift passwords for netbox_dev user [labs/private] - 10https://gerrit.wikimedia.org/r/989531 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon)
[15:41:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cache::upload
[15:41:42] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: add new netboxdev:attachments user [puppet] - 10https://gerrit.wikimedia.org/r/989529 (https://phabricator.wikimedia.org/T354766) (owner: 10MVernon)
[15:41:59] <wikibugs>	 (03PS1) 10Cwhite: Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877
[15:42:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10Volans)
[15:43:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 (owner: 10Cwhite)
[15:43:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cache/upload to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989845 (https://phabricator.wikimedia.org/T349619)
[15:44:03] <wikibugs>	 (03CR) 10Muehlenhoff: Configure ACLs for reprepro upload queue (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[15:45:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cache/upload to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/989845 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:47:13] <wikibugs>	 (03PS2) 10Cwhite: Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877
[15:47:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe
[15:47:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:48:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10cmooney)
[15:48:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[15:48:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:49:05] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[15:49:13] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:mail::smarthost: support DKIM dual-signing [puppet] - 10https://gerrit.wikimedia.org/r/989736 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[15:49:15] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_netboxdev:attachments.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54622 and previous config saved to /var/cache/conftool/dbconfig/20240111-154947-marostegui.json
[15:50:06] <Emperor>	 I'm doing a roll-restart of the swift frontends, and the cookbook is meant to downtime them...
[15:50:30] <sukhe>	 Emperor: thanks for sharing :)
[15:51:06] <Emperor>	 so I'm not sure why we're getting alerts here :(
[15:51:13] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) resolved: (3) MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[15:52:15] <volans>	 Emperor: does the cookbook check that icinga is optimal before proceeding, or does it remove the downtime and go ahead blindly?
[15:52:46] <Emperor>	 it does look to run checks
[15:54:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878 (10cmooney) p:05Triage→03Medium
[15:55:19] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:21] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:55:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878 (10cmooney)
[15:55:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney)
[15:58:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe
[15:59:18] <sukhe>	 !log restart pybal on lvs4010
[15:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:31] <icinga-wm>	 RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T354336)', diff saved to https://phabricator.wikimedia.org/P54623 and previous config saved to /var/cache/conftool/dbconfig/20240111-160454-marostegui.json
[16:04:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[16:05:05] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:05:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[16:05:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54624 and previous config saved to /var/cache/conftool/dbconfig/20240111-160516-marostegui.json
[16:05:58] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[16:07:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cache::upload
[16:07:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54625 and previous config saved to /var/cache/conftool/dbconfig/20240111-160725-marostegui.json
[16:11:42] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The new account is created for you.
[16:15:02] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans)
[16:15:40] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans)
[16:16:00] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans)
[16:19:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM (for future reviews, if the change is small and I had approved it earlier, feel free to merge unless you want a re-review, if so, ple" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri)
[16:19:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[16:22:09] <wikibugs>	 (03PS1) 10Bking: WIP: Add new data platform team to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578)
[16:22:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54626 and previous config saved to /var/cache/conftool/dbconfig/20240111-162231-marostegui.json
[16:23:20] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:23:33] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:24:15] <wikibugs>	 (03PS1) 10Btullis: Add data for the new an-master100[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573)
[16:26:34] <wikibugs>	 (03PS2) 10Btullis: Add data for the new an-master100[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573)
[16:29:25] <icinga-wm>	 RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 197, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:37] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[16:36:09] <wikibugs>	 (03CR) 10FNegri: dologmsg: standardize logging format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri)
[16:37:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54628 and previous config saved to /var/cache/conftool/dbconfig/20240111-163738-marostegui.json
[16:39:06] <wikibugs>	 (03PS1) 10Thcipriani: Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904
[16:41:09] <icinga-wm>	 RECOVERY - MegaRAID on db1157 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:42:13] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani)
[16:42:46] <wikibugs>	 (03Merged) 10jenkins-bot: Remove banner for 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani)
[16:44:34] <wikibugs>	 (03PS1) 10Volans: .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905
[16:44:36] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989906
[16:46:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:49:05] <wikibugs>	 (03CR) 10Btullis: "With this change, we can make the hadoop-yarn-resourcemanager service start and run as a standby on an-master100[3-4]." [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[16:49:34] <wikibugs>	 (03CR) 10Volans: [C: 03+2] .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905 (owner: 10Volans)
[16:49:44] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[16:50:03] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989906 (owner: 10Volans)
[16:51:13] <wikibugs>	 (03Merged) 10jenkins-bot: .wmfconfig: update config for releases [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989905 (owner: 10Volans)
[16:51:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:51:38] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.3.3 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/989906 (owner: 10Volans)
[16:52:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54629 and previous config saved to /var/cache/conftool/dbconfig/20240111-165244-marostegui.json
[16:52:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[16:52:48] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:53:00] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[16:53:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54630 and previous config saved to /var/cache/conftool/dbconfig/20240111-165305-marostegui.json
[16:54:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54631 and previous config saved to /var/cache/conftool/dbconfig/20240111-165414-marostegui.json
[16:55:21] <wikibugs>	 (03PS9) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[16:56:13] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[16:56:23] <wikibugs>	 (03PS1) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694)
[16:57:02] <wikibugs>	 (03PS1) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694)
[16:57:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1157 - https://phabricator.wikimedia.org/T354854 (10Marostegui) 05Open→03Resolved RAID back to Optimal! Thank you!
[16:57:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol)
[16:58:48] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1084/co" [puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol)
[16:59:09] <wikibugs>	 (03PS10) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[17:00:05] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:25] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "I'm listed as an approver for this group, so I'm confident in adding +2." [puppet] - 10https://gerrit.wikimedia.org/r/989908 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol)
[17:06:50] <wikibugs>	 (03PS11) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[17:07:21] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[17:09:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54632 and previous config saved to /var/cache/conftool/dbconfig/20240111-170920-marostegui.json
[17:11:04] <wikibugs>	 (03PS12) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[17:11:30] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[17:12:13] <wikibugs>	 (03CR) 10Hashar: "Sorry, it looks like I have messed up the fork :-(" [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/989904 (owner: 10Thcipriani)
[17:17:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:21:16] <wikibugs>	 (03PS13) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[17:21:36] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[17:22:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:22:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[17:24:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54633 and previous config saved to /var/cache/conftool/dbconfig/20240111-172427-marostegui.json
[17:31:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:36:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:38:36] <wikibugs>	 (03CR) 10Btullis: "This can be abandoned. It was implemented in: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989908" [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol)
[17:39:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T354336)', diff saved to https://phabricator.wikimedia.org/P54634 and previous config saved to /var/cache/conftool/dbconfig/20240111-173933-marostegui.json
[17:39:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:39:38] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[17:39:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:39:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54635 and previous config saved to /var/cache/conftool/dbconfig/20240111-173955-marostegui.json
[17:40:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:41:19] <wikibugs>	 (03PS4) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[17:42:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54636 and previous config saved to /var/cache/conftool/dbconfig/20240111-174204-marostegui.json
[17:50:54] <wikibugs>	 (03CR) 10Hashar: Add base production images containing Java 8 JDK and JRE (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[17:57:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P54637 and previous config saved to /var/cache/conftool/dbconfig/20240111-175710-marostegui.json
[18:00:05] <jouncebot>	 bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1800)
[18:05:58] <bd808>	 nothing from me this week
[18:08:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:12:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P54638 and previous config saved to /var/cache/conftool/dbconfig/20240111-181217-marostegui.json
[18:13:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:15:04] <wikibugs>	 (03Abandoned) 10Brouberol: Add DPE SRE individiual users to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989907 (https://phabricator.wikimedia.org/T353694) (owner: 10Brouberol)
[18:21:08] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: WIP:ml-services: deploy falcon 7b on GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870)
[18:23:14] <thcipriani>	 !log deploying gerrit to remove devsat survey (no restart needed)
[18:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:21] <logmsgbot>	 !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit2002 only)
[18:25:26] <logmsgbot>	 !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit2002 only) (duration: 00m 05s)
[18:26:14] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10RobH)
[18:26:41] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10RobH)
[18:27:01] <logmsgbot>	 !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit primary: gerrit.wikimedia.org)
[18:27:08] <logmsgbot>	 !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@376b3e5]: Remove devsat survey banner in 3.6 (gerrit primary: gerrit.wikimedia.org) (duration: 00m 07s)
[18:27:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T354336)', diff saved to https://phabricator.wikimedia.org/P54639 and previous config saved to /var/cache/conftool/dbconfig/20240111-182723-marostegui.json
[18:27:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:27:27] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:27:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:27:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54640 and previous config saved to /var/cache/conftool/dbconfig/20240111-182745-marostegui.json
[18:29:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54641 and previous config saved to /var/cache/conftool/dbconfig/20240111-182859-marostegui.json
[18:44:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P54643 and previous config saved to /var/cache/conftool/dbconfig/20240111-184405-marostegui.json
[18:47:00] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10Aklapper)
[18:52:55] <wikibugs>	 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896 (10RobH)
[18:53:16] <wikibugs>	 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896 (10RobH)
[18:59:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P54644 and previous config saved to /var/cache/conftool/dbconfig/20240111-185912-marostegui.json
[19:00:04] <jouncebot>	 jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T1900). Please do the needful.
[19:00:24] <jeena>	 o/
[19:02:16] <wikibugs>	 (03PS1) 10Cathal Mooney: Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809)
[19:03:01] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089)
[19:03:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot)
[19:03:52] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989918 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot)
[19:05:33] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:06:10] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:11:35] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.13  refs T350089
[19:11:44] <stashbot>	 T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089
[19:14:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54645 and previous config saved to /var/cache/conftool/dbconfig/20240111-191418-marostegui.json
[19:14:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[19:14:34] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[19:14:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[19:14:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T354336)', diff saved to https://phabricator.wikimedia.org/P54646 and previous config saved to /var/cache/conftool/dbconfig/20240111-191440-marostegui.json
[19:16:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T354336)', diff saved to https://phabricator.wikimedia.org/P54647 and previous config saved to /var/cache/conftool/dbconfig/20240111-191650-marostegui.json
[19:17:10] <wikibugs>	 10SRE, 10Traffic: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (10A_smart_kitten) Admittedly I’m inexperienced here (and so may well be missing something), but in T354858, I received 429 error...
[19:19:09] <wikibugs>	 (03PS14) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[19:20:43] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[19:31:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P54649 and previous config saved to /var/cache/conftool/dbconfig/20240111-193156-marostegui.json
[19:34:11] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:34:35] <wikibugs>	 (03PS1) 10Gehel: Add Guillaume to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989922 (https://phabricator.wikimedia.org/T353694)
[19:41:43] <wikibugs>	 (03PS15) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[19:42:51] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[19:46:18] <wikibugs>	 (03PS3) 10Houseblaster: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013)
[19:47:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P54651 and previous config saved to /var/cache/conftool/dbconfig/20240111-194703-marostegui.json
[19:49:43] <wikibugs>	 (03PS16) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[19:50:26] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[19:53:37] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) @ayounsi: I never saw a blocker come up on this, so we're good to go ahead and disconnect the cross connections at each site for this, correct?
[19:54:28] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10ayounsi) yep, we're well past "after monday" :)
[19:55:21] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[19:56:15] <wikibugs>	 (03PS17) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555)
[19:56:31] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[19:56:38] <wikibugs>	 (03PS6) 10Houseblaster: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388)
[19:56:48] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH)
[19:56:55] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH)
[19:58:04] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424 (10RobH) As the disconnects will reference contract IDs and disconnect fees, each site's disconnect has been put into its own S4 space subtask.
[19:58:45] <wikibugs>	 (03PS1) 10Ryan Kemper: s/ alue/value [puppet] - 10https://gerrit.wikimedia.org/r/989924
[19:59:47] <wikibugs>	 (03PS2) 10Ryan Kemper: s/ alue/value [puppet] - 10https://gerrit.wikimedia.org/r/989924
[20:00:10] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@07f5320]: (no justification provided)
[20:00:25] <wikibugs>	 (03CR) 10Majavah: wdqs-test: Enable PKI (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[20:00:38] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@07f5320]: (no justification provided) (duration: 00m 27s)
[20:01:54] <wikibugs>	 (03PS3) 10Ryan Kemper: Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/989924
[20:02:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T354336)', diff saved to https://phabricator.wikimedia.org/P54652 and previous config saved to /var/cache/conftool/dbconfig/20240111-200209-marostegui.json
[20:02:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[20:02:25] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[20:02:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[20:02:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[20:02:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[20:02:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54653 and previous config saved to /var/cache/conftool/dbconfig/20240111-200253-marostegui.json
[20:05:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54654 and previous config saved to /var/cache/conftool/dbconfig/20240111-200502-marostegui.json
[20:09:27] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:30] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/989924 (owner: 10Ryan Kemper)
[20:13:43] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:17:07] <wikibugs>	 (03PS18) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[20:18:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:18:53] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[20:20:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P54655 and previous config saved to /var/cache/conftool/dbconfig/20240111-202008-marostegui.json
[20:23:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:35:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P54656 and previous config saved to /var/cache/conftool/dbconfig/20240111-203514-marostegui.json
[20:36:15] <wikibugs>	 (03PS4) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[20:36:17] <wikibugs>	 (03PS7) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[20:38:47] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:42:08] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.402% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:47:08] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.402% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:50:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T354336)', diff saved to https://phabricator.wikimedia.org/P54657 and previous config saved to /var/cache/conftool/dbconfig/20240111-205021-marostegui.json
[20:50:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[20:50:26] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[20:50:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[20:53:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:54:07] <wikibugs>	 (03PS2) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841
[20:56:37] <wikibugs>	 (03PS5) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[20:57:12] <wikibugs>	 (03PS8) 10Effie Mouzeli: (WIP) modules/app: update to job 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240111T2100).
[21:00:05] <jouncebot>	 jan_drewniak and houseblaster: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:04:33] <jan_drewniak>	 houseblaster: hi, looks like it's only your patch & my script in the backport window today
[21:04:59] <jan_drewniak>	 given it's just a config change, I can deploy your patches
[21:05:37] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:27] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:06:29] <jan_drewniak>	 houseblaster: will you be around to test the config change?
[21:08:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:08:42] <wikibugs>	 (03PS9) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[21:08:49] <houseblaster>	 I will
[21:09:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli)
[21:09:48] <wikibugs>	 (03PS6) 10Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[21:09:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster)
[21:09:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster)
[21:10:40] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster)
[21:11:19] <wikibugs>	 (03PS7) 10Jdrewniak: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster)
[21:11:22] <fabfur>	 (other than opening wikitech and reading it... :P )
[21:11:31] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster)
[21:11:33] <fabfur>	 ops sry wrong chan
[21:11:39] <houseblaster>	 add ping I forgot: jan_drewniak
[21:12:16] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster)
[21:12:40] <wikibugs>	 (03PS7) 10Effie Mouzeli: modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[21:12:42] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]]
[21:12:56] <stashbot>	 T354013: Request to remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on English Wikipedia - https://phabricator.wikimedia.org/T354013
[21:12:56] <stashbot>	 T341388: Allow thanking bots - https://phabricator.wikimedia.org/T341388
[21:13:49] <wikibugs>	 (03PS10) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[21:14:27] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and houseblaster: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:14:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli)
[21:14:45] <wikibugs>	 (03PS11) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852
[21:15:01] <wikibugs>	 (03PS8) 10Effie Mouzeli: modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[21:16:55] <jan_drewniak>	 houseblaster: np, the changes are ready to test on mwdebug
[21:17:32] <wikibugs>	 (03PS3) 10Effie Mouzeli: modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841
[21:18:41] <houseblaster>	 font tag change working
[21:20:06] <houseblaster>	 and thanking bots is working, too
[21:20:22] <jan_drewniak>	 houseblaster: ok great, continuing with sync
[21:20:32] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and houseblaster: Continuing with sync
[21:26:26] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:985647|InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (T354013)]], [[gerrit:984288|InitialiseSettings.php: Allow thanking bots (T341388)]] (duration: 13m 43s)
[21:26:38] <stashbot>	 T354013: Request to remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on English Wikipedia - https://phabricator.wikimedia.org/T354013
[21:26:39] <stashbot>	 T341388: Allow thanking bots - https://phabricator.wikimedia.org/T341388
[21:27:03] <jan_drewniak>	 houseblaster: alrighty, all done :) 
[21:28:30] <houseblaster>	 thank you!
[21:31:58] <jan_drewniak>	 okay, and I just ran my maintenance script on prod and Wikipedia has not burned down :P
[21:33:06] <taavi>	 jan_drewniak: remember to !log the script run?
[21:33:43] <jan_drewniak>	 taavi: sorry it's the first time I've done that, how do I log it?
[21:35:05] <taavi>	 jan_drewniak: type !log followed by a short summary of what you did (commands run, phab tasks, etc.)
[21:35:29] <taavi>	 https://wikitech.wikimedia.org/w/index.php?title=Tool:Stashbot#!log_processing
[21:36:30] <jan_drewniak>	 !log https://phabricator.wikimedia.org/T349337#9454773 running maintenance script to delete unnecessary user preferences.
[21:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:51] <jan_drewniak>	 taavi:  thanks! I'll keep that in mind
[21:38:48] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[21:40:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:44:06] <wikibugs>	 (03CR) 10Dzahn: phabricator: use same db server regardless of DC of phab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn)
[21:46:23] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking)
[22:14:23] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-production-error: Error accessing File:KlimtDieJungfrau.jpg after it was moved to the Main Page on enwiki - https://phabricator.wikimedia.org/T354858 (10A_smart_kitten)
[22:17:33] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:20:25] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:20:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:22:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:22:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:31:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:32:51] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:08] <wikibugs>	 10SRE, 10Traffic: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (10Tgr) @A_smart_kitten usually what happens is that the first few users get a HTTP 500, then the throttling logic detects that u...
[22:37:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:37:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:42:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:46] <wikibugs>	 (03PS5) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[22:44:14] <jinxer-wm>	 (ProbeDown) firing: (6) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:45:29] <wikibugs>	 (03CR) 10Btullis: Add base production images containing Java 8 JDK and JRE (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[22:45:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:46:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:46:47] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add Guillaume to the analyics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/989922 (https://phabricator.wikimedia.org/T353694) (owner: 10Gehel)
[23:00:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:02:01] <wikibugs>	 (03PS1) 10DDesouza: research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989965 (https://phabricator.wikimedia.org/T352583)
[23:29:09] <wikibugs>	 (03CR) 10Houseblaster: InitialiseSettings.php: disallow obsolete HTML in signatures (enwiki) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster)
[23:34:12] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:54:47] <wikibugs>	 (03PS1) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[23:55:21] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk