[00:10:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host backup2011.codfw.wmnet with OS bullseye
[00:10:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 3 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host backup2011.codfw.wmnet with OS bullseye
[00:15:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage
[00:18:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage
[00:22:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (Jclark-ctr)
[00:23:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (Jclark-ctr)
[00:25:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[00:28:07] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Jclark-ctr)
[00:29:07] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Jclark-ctr) a:Jclark-ctr
[00:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:35:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:37:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:37:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2010.codfw.wmnet with OS bullseye
[00:37:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2010.codfw.wmnet with OS bullseye completed: - back...
[00:39:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910531
[00:39:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910531 (owner: TrainBranchBot)
[00:56:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910531 (owner: TrainBranchBot)
[01:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:19:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage
[01:22:31] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[01:22:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage
[01:24:31] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[01:26:46] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (UOzurumba)
[01:39:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:41:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:41:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2011.codfw.wmnet with OS bullseye
[01:41:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2011.codfw.wmnet with OS bullseye completed: - back...
[01:42:31] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[01:43:14] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (Papaul)
[01:44:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (Papaul) Open→Resolved a:Papaul @jcrespo all yours
[01:45:31] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[01:49:17] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (sgrabarczuk)
[02:02:32] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[02:03:32] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:49] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:36:11] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:39] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:32:53] <icinga-wm_>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:42:21] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:27:17] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[05:29:16] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[05:47:17] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[05:50:19] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230421T0600)
[06:07:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[06:08:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[06:09:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[07:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230421T0700)
[07:00:43] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:19] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:09:29] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:10:25] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.413 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:10:57] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:01] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:27:24] <wikibugs>	 (PS1) Elukey: amd-gpu-tester: workaround to unblock image upload [docker-images/production-images] - https://gerrit.wikimedia.org/r/910743
[07:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:50:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:55:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:20:39] <wikibugs>	 SRE-swift-storage, MediaWiki-File-management, MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Ladsgroup) That file didn't have a reupload since 2009. My guess is that this is the rsvg problem...
[08:27:53] <icinga-wm_>	 RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[08:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:35:35] <Amir1>	 !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https 
[08:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:45] <icinga-wm_>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[08:51:17] <icinga-wm_>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[09:03:30] <Amir1>	 !log finish of the wikibase populate sites table
[09:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[09:27:32] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[09:29:31] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[09:47:31] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[09:50:32] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[10:07:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[10:08:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[10:09:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[10:25:59] <wikibugs>	 (PS1) Majavah: tlsproxy: Fix Nginx reload when cfssl certs get renewed [puppet] - https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181)
[10:28:59] <wikibugs>	 (PS2) Majavah: tlsproxy: Fix Nginx reload when cfssl certs get renewed [puppet] - https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181)
[10:29:25] <wikibugs>	 (CR) Majavah: "https://phabricator.wikimedia.org/P47268" [puppet] - https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181) (owner: Majavah)
[11:21:15] <duesen>	 I am getting ready to do some monkey-patching on mwdebug2001 to investigate https://phabricator.wikimedia.org/T335183. Any objections?
[11:26:33] <wikibugs>	 (PS1) Samtar: labstore: Add text-to-speech project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184)
[11:33:09] <wikibugs>	 (CR) Majavah: [C: +1] "GID matches and the syntax is correct." [puppet] - https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184) (owner: Samtar)
[11:33:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] labstore: Add text-to-speech project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184) (owner: Samtar)
[11:39:13] <wikibugs>	 (PS1) Joal: Refactor dumps::web::fetches::analytics::job [puppet] - https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167)
[11:39:37] <wikibugs>	 (CR) CI reject: [V: -1] Refactor dumps::web::fetches::analytics::job [puppet] - https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: Joal)
[11:44:00] <wikibugs>	 (CR) Joal: "Hi Andrew," [puppet] - https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: Joal)
[11:51:35] <wikibugs>	 (PS2) Joal: Refactor dumps::web::fetches::analytics::job [puppet] - https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167)
[11:56:12] <duesen>	 !log monkey-patching Ib11a871ff on mwdebug2001 to investigate T335183
[11:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:18] <stashbot>	 T335183: VisualEditor is low to load on Hebrew Wikipedia - https://phabricator.wikimedia.org/T335183
[11:57:34] <duesen>	 gah! "bash: patch: command not found"
[11:57:44] <duesen>	 What's the proper way to patch stuff on mwdebug, then?
[12:00:19] <duesen>	 I guess I have to apply the patch on the deployment host and scap pull to debug, then
[12:15:05] <wikibugs>	 (PS1) Majavah: Add RealMe to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/910767 (https://phabricator.wikimedia.org/T324535)
[12:15:07] <wikibugs>	 (PS1) Majavah: Add $wmgUseRealMe [mediawiki-config] - https://gerrit.wikimedia.org/r/910768 (https://phabricator.wikimedia.org/T324535)
[12:15:09] <wikibugs>	 (PS1) Majavah: Enable RealMe on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/910769 (https://phabricator.wikimedia.org/T324535)
[12:17:18] <duesen>	 yep, that worked
[12:18:19] <duesen>	 !log reverted monkey-patch, mwdebug2001 and deploy2002 are back to wmf/1.41.0-wmf.5 (T335183)
[12:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:25] <stashbot>	 T335183: VisualEditor is low to load on Hebrew Wikipedia - https://phabricator.wikimedia.org/T335183
[12:22:33] <wikibugs>	 (CR) Joal: analytics: Add purge job for webrequest data loss reports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[12:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:32:17] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[13:34:20] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[13:52:18] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[13:55:16] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[14:09:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:12:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[14:13:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[14:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[14:23:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:28:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:37:15] <icinga-wm_>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:38:51] <icinga-wm_>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:51:51] <icinga-wm_>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:53:29] <icinga-wm_>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:57:11] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748)
[14:59:13] <Lucas_WMDE>	 I’m tempted to do an emergency deploy for ^ (mainly to unbreak the dumps, cc apergos), not sure if warranted… anyone around who has thoughts?
[14:59:59] <wikibugs>	 (CR) Superpes15: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/910018 (https://phabricator.wikimedia.org/T335090) (owner: Anzx)
[15:03:42] <apergos>	 basically, mediawikiwiki xml dumps are broken until that fix goes in
[15:03:50] <apergos>	 it's just that wiki
[15:08:00] <legoktm>	 Lucas_WMDE: +1 from me
[15:08:23] <apergos>	 hey lego
[15:08:28] <Lucas_WMDE>	 hey
[15:08:34] <Lucas_WMDE>	 for the change in general or for deploying it today too?
[15:08:43] <Lucas_WMDE>	 (reluctant to ping t.hcipriani and j.nuche ^^)
[15:08:51] <legoktm>	 deploying it, I guess I should actually review the change
[15:09:06] <legoktm>	 er wait
[15:09:23] <legoktm>	 does namespaceDupes fix this situation? or is it the other way around
[15:09:39] <wikibugs>	 (CR) Majavah: "This broke connectivity to gerrit-replica because the IP does not match what's defined in the operations/homer/public repo and so caused T" [puppet] - https://gerrit.wikimedia.org/r/909794 (owner: Dzahn)
[15:09:39] <Lucas_WMDE>	 hm
[15:09:42] <Lucas_WMDE>	 worth a try I guess
[15:09:47] <Lucas_WMDE>	 since it has a dry-run mode anyways
[15:09:49] <legoktm>	 		$this->addDescription( 'Find and fix pages affected by namespace addition/removal' );
[15:09:51] <legoktm>	 yeah
[15:09:54] <legoktm>	 give that a try first
[15:10:04] <legoktm>	 since we want to get rid of the namespace anyways (AIUI)
[15:10:17] <apergos>	 I think that's right about the namespace, we want it gone
[15:10:24] <Lucas_WMDE>	 “0 pages to fix, 0 were resolvable.”
[15:10:28] <Lucas_WMDE>	 (0 links to fix too)
[15:11:21] <Lucas_WMDE>	 perhaps cleanupTitles would work but I’m not sure running that on mediawikiwiki is a good idea
[15:11:30] <Lucas_WMDE>	 so far I gathered that that’s a maint script best used on small wikis
[15:12:49] <legoktm>	 oh yeah no, there's a bug about how running it is unsafe I think
[15:13:59] <Lucas_WMDE>	 eek, ok
[15:14:22] <wikibugs>	 (CR) Legoktm: [C: +1] "Guess namespaceDupes can't handle this case. OK to temporarily restore to get the pages out of there then." [mediawiki-config] - https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: Lucas Werkmeister (WMDE))
[15:17:58] <Lucas_WMDE>	 alright, I’ll do the emergency deploy hue and cry then
[15:18:08] <apergos>	 heh
[15:18:31] <apergos>	 there's no task yet for this, is using https://phabricator.wikimedia.org/T124748 good enough do you think?
[15:18:46] <Lucas_WMDE>	 I would use that and T335130
[15:18:47] <stashbot>	 T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130
[15:18:57] <apergos>	 right
[15:18:59] <Lucas_WMDE>	 since that’s ^ the same error message right? just on a second wiki
[15:19:15] <apergos>	 well no, in this case we don't get the error message, just an error out
[15:19:20] <Lucas_WMDE>	 ah ok
[15:19:25] <apergos>	 no exception is thrown, which is obviously not great
[15:20:59] <Lucas_WMDE>	 thcipriani: jnuche: help! I’d like to do an emergency deploy for https://gerrit.wikimedia.org/r/910780 – context is T124748 and T335130: XML dumps for mediawikiwiki (only that wiki) are currently broken
[15:20:59] <stashbot>	 T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[15:21:17] <Lucas_WMDE>	 (that second task talks about collabwiki, where we fixed the error yesterday, but it turns out it affects mediawikiwiki too)
[15:23:52] <apergos>	 not sure if anyone US-based is going to be around
[15:24:12] <Lucas_WMDE>	 :/
[15:24:26] <apergos>	 because of the US holiday
[15:27:53] <Lucas_WMDE>	 maybe hashar from releng is around?
[15:27:58] <Lucas_WMDE>	 and perhaps Amir1 from SRE
[15:28:59] <legoktm>	 you have two roots supervising :p (though I'm probably going to go afk soon)
[15:29:24] <Lucas_WMDE>	 I was trying to follow https://wikitech.wikimedia.org/wiki/Deployments/Emergencies :'D
[15:29:35] <Lucas_WMDE>	 though I’m not sure I’ve followed it for all Friday deploys in the past tbh
[15:30:05] <apergos>	 I run one of the backport deployment windows every Thursday. but yeah emergency procedures are a separate thing.
[15:30:29] <jnuche>	 hi Lucas_WMDE, I don't have access to my work computer right now
[15:30:46] <jnuche>	 from my side you can go ahead and backport
[15:30:56] <Lucas_WMDE>	 ok thanks, and sorry to disturb you
[15:31:08] <jnuche>	 no worries at all :)
[15:31:12] <Lucas_WMDE>	 I think if I don’t hear anyone objecting I’ll go ahead in, let’s say 15 minutes
[15:32:23] <apergos>	 I'd ask in the -security back channel too, sometimes people follow that more closely as there's less noise comparatively
[15:35:32] <Lucas_WMDE>	 good idea, done
[15:36:05] <apergos>	 ty
[15:39:27] <apergos>	 so the plan would be, revert, delete the pages, re-revert? or ...?  just to clarify
[15:39:35] <legoktm>	 I need to go afk now, good luck
[15:39:44] <Lucas_WMDE>	 thanks, see you
[15:39:51] <Lucas_WMDE>	 apergos: my plan was just to deploy that one config change and leave it at that
[15:39:54] <apergos>	 see you later
[15:39:56] <apergos>	 ah
[15:40:14] <Lucas_WMDE>	 I didn’t get the impression that the config variable being false for mediawikiwiki was super important or urgent, it felt more like a cleanup
[15:40:24] <Lucas_WMDE>	 so I would think it’s fine to leave it un-cleaned up over the weekend
[15:40:29] <apergos>	 makes sense
[15:40:34] <Lucas_WMDE>	 and then I don’t have to make the decision whether the files are ok to delete or not ^
[15:40:35] <Lucas_WMDE>	 * ^^
[15:40:52] <Lucas_WMDE>	 but if you think we should delete and rerevert I’m ok with that too
[15:41:01] <apergos>	 I can delete stuff over there too but yeah without any info, ehhh
[15:41:14] <apergos>	 nah, do the minimum, given it's a long weekend for some
[15:42:06] <Lucas_WMDE>	 ok
[15:42:17] <Lucas_WMDE>	 might be easier to decide on deletion once we can actually see the page :D
[15:42:32] <Lucas_WMDE>	 I just tried getText but it doesn’t work, it has a --revision option but still insists on also having a title that matches
[15:43:33] <apergos>	 yuck
[15:43:40] <apergos>	 I suppose it would as a fail-safe
[15:44:51] <Lucas_WMDE>	 sheesh, 330k php notices in logstash
[15:44:57] <Lucas_WMDE>	 (unrelated, just, that’s a lot)
[15:45:25] <Lucas_WMDE>	 alright, I’ll go ahead
[15:45:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: Lucas Werkmeister (WMDE))
[15:46:32] <apergos>	 want a second pair of eyes on the logs or you ok?
[15:46:46] <wikibugs>	 (Merged) jenkins-bot: Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: Lucas Werkmeister (WMDE))
[15:46:49] <Lucas_WMDE>	 I think I’m ok
[15:47:04] <Lucas_WMDE>	 also both the ?curid URL and the dump command can be tested on mwdebug, so that’s convenient at least
[15:47:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]]
[15:47:20] <stashbot>	 T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130
[15:47:21] <stashbot>	 T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[15:48:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[15:49:06] <Lucas_WMDE>	 `mwscript dumpBackup mediawikiwiki --current --start 377536 --end 377537` works on mwdebug2002 now, where it previously errored
[15:49:28] <Lucas_WMDE>	 and https://www.mediawiki.org/w/index.php?curid=1117504 also loads properly on mwdebug2001
[15:50:22] <Lucas_WMDE>	 hm, the mwdebug logstash is totally empty, that’s unusual
[15:50:41] <Lucas_WMDE>	 I would expect at least one error from when I retried dumpBackup before scap deployed the change there
[15:51:51] <Lucas_WMDE>	 I’ll deploy anyways
[15:51:55] <Lucas_WMDE>	 syncing now
[15:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:53:52] <apergos>	 okey dokey
[15:54:25] <Lucas_WMDE>	 hm
[15:55:05] <Lucas_WMDE>	 don’t think I need to worry about that KubernetesAPILatency at the moment
[15:57:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]] (duration: 10m 01s)
[15:57:21] <stashbot>	 T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130
[15:57:22] <stashbot>	 T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[15:58:09] <Lucas_WMDE>	 ok, now https://www.mediawiki.org/w/index.php?curid=1117504 / https://www.mediawiki.org/wiki/Data:Data_demo.tab works without mwdebug too
[15:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:58:21] <Lucas_WMDE>	 I think that’s a success
[15:58:26] * Lucas_WMDE done
[15:58:52] <Lucas_WMDE>	 ah, and Andre replied on the task, nice
[15:59:31] <apergos>	 checking
[15:59:50] <apergos>	 ah good good
[16:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:12:35] <apergos>	 I'll be around for some hours yet, mostly idling but looking in from time to time in the channel, just in case
[16:12:41] <apergos>	 thanks a lot for the fix, yet again
[16:16:53] <wikibugs>	 (03CR) 10BryanDavis: "Could this be causing T335197? Apparently this change does not match the IPv4 for gerrit-replica.wikimedia.org that is configured in Homer" [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn)
[16:16:57] <Lucas_WMDE>	 np, hope it works
[16:17:02] <Lucas_WMDE>	 I’m probably signing off fairly soon
[16:17:57] <wikibugs>	 (03CR) 10BryanDavis: cloudgw: fix IP address for gerrit-replica.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn)
[16:18:24] <apergos>	 sounds about right, it's Friday evening after all
[16:18:29] <apergos>	 go have a weekend
[16:18:36] <Lucas_WMDE>	 ^^
[16:25:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:30:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:37:49] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:41] <icinga-wm_>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:42:15] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[17:32:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[17:34:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[17:34:49] <icinga-wm_>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:19] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:51:08] <wikibugs>	 (03PS1) 10Dzahn: Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/910717
[17:52:06] <wikibugs>	 (03PS2) 10Dzahn: Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/910717 (https://phabricator.wikimedia.org/T335197)
[17:52:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/910717 (https://phabricator.wikimedia.org/T335197) (owner: 10Dzahn)
[17:52:31] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[17:55:33] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder)
[18:09:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:12:33] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[18:13:33] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[18:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[18:25:29] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw
[18:27:59] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw
[18:36:45] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:38:27] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:21] <wikibugs>	 (03PS4) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[18:50:47] <wikibugs>	 (03PS5) 10Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[18:53:35] <wikibugs>	 (03CR) 10Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki conf (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[18:55:49] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:56:56] <wikibugs>	 (03PS4) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[18:56:57] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:58:49] <wikibugs>	 (03PS6) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:01:42] <wikibugs>	 (03PS7) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:01:44] <wikibugs>	 (03PS5) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:03:23] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:03:55] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:34] <wikibugs>	 (03PS8) 10Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:13:48] <wikibugs>	 (03PS6) 10Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:16:39] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:17:11] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:55] <wikibugs>	 (03PS7) 10Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:23:31] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:24:35] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:25:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:35:29] <icinga-wm_>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:29] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:41:50] <wikibugs>	 (03CR) 10Daimona Eaytoy: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki conf (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[19:45:11] <wikibugs>	 (03PS9) 10Cmelo: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:47:17] <wikibugs>	 (03CR) 10Cmelo: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[19:48:18] <wikibugs>	 (03CR) 10Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[19:51:23] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 03+1] metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[19:51:30] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 03+1] Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[20:25:49] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10putnik) @Ladsgroup I know that it hasn't been reuploaded since 2009. And it is reall...
[20:33:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[21:37:17] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[21:39:17] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[21:57:17] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[21:57:28] <wikibugs>	 (03PS1) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[21:57:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe)
[22:00:17] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder)
[22:03:43] <wikibugs>	 (03PS2) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[22:05:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe)
[22:06:58] <wikibugs>	 (03CR) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe)
[22:08:23] <wikibugs>	 (03PS3) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[22:09:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:17:18] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[22:18:17] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[22:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[22:38:01] <wikibugs>	 (03PS1) 10Samtar: InitialiseSettings-labs: Set Phonos config on testwiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787)
[22:43:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787) (owner: 10Samtar)
[22:44:07] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings-labs: Set Phonos config on testwiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787) (owner: 10Samtar)
[23:58:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown