[00:10:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host backup2011.codfw.wmnet with OS bullseye [00:10:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 3 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host backup2011.codfw.wmnet with OS bullseye [00:15:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage [00:18:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage [00:22:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) [00:23:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10Jclark-ctr) [00:25:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jclark-ctr) [00:28:07] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [00:29:07] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) a:03Jclark-ctr [00:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:35:27] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:37:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:37:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2010.codfw.wmnet with OS bullseye [00:37:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2010.codfw.wmnet with OS bullseye completed: - back... 
[00:39:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910531 [00:39:21] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910531 (owner: 10TrainBranchBot) [00:56:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910531 (owner: 10TrainBranchBot) [01:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [01:19:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage [01:22:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [01:22:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage [01:24:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [01:26:46] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10UOzurumba) [01:39:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:41:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" 
[01:41:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2011.codfw.wmnet with OS bullseye [01:41:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2011.codfw.wmnet with OS bullseye completed: - back... [01:42:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [01:43:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Papaul) [01:44:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Papaul) 05Open→03Resolved a:03Papaul @jcrespo all yours [01:45:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [01:49:17] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10sgrabarczuk) [02:02:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:03:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:49] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:36:11] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:32:53] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:42:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:27:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [05:29:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [05:47:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [05:50:19] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230421T0600) [06:07:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:08:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230421T0700) [07:00:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:01:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:09:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:10:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.413 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:10:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:11:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:27:24] (03PS1) 10Elukey: amd-gpu-tester: workaround to unblock image upload [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910743 [07:45:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:55:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:39] 10SRE-swift-storage, 10MediaWiki-File-management, 10MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) That file didn't have a reupload since 2009. My guess is that this is the rsvg problem... [08:27:53] RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [08:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:35:35] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [08:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [08:51:17] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:03:30] !log finish of the wikibase populate sites table [09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - 
https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:27:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:29:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:47:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [09:50:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [10:07:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:08:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:25:59] (03PS1) 10Majavah: tlsproxy: Fix Nginx reload when cfssl certs get renewed [puppet] - 10https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181) [10:28:59] (03PS2) 10Majavah: tlsproxy: Fix Nginx reload when cfssl certs get renewed [puppet] - 10https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181) [10:29:25] (03CR) 10Majavah: "https://phabricator.wikimedia.org/P47268" [puppet] - 10https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181) (owner: 10Majavah) [11:21:15] I am getting ready to do some monkey-patching on mwdebug2001 to investigate 
https://phabricator.wikimedia.org/T335183. Any objections? [11:26:33] (03PS1) 10Samtar: labstore: Add text-to-speech project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184) [11:33:09] (03CR) 10Majavah: [C: 03+1] "GID matches and the syntax is correct." [puppet] - 10https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184) (owner: 10Samtar) [11:33:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore: Add text-to-speech project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/910760 (https://phabricator.wikimedia.org/T335184) (owner: 10Samtar) [11:39:13] (03PS1) 10Joal: Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) [11:39:37] (03CR) 10CI reject: [V: 04-1] Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [11:44:00] (03CR) 10Joal: "Hi Andrew," [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [11:51:35] (03PS2) 10Joal: Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) [11:56:12] !log monkey-patching Ib11a871ff on mwdebug2001 to investigate T335183 [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:18] T335183: VisualEditor is low to load on Hebrew Wikipedia - https://phabricator.wikimedia.org/T335183 [11:57:34] gah! "bash: patch: command not found" [11:57:44] What's the proper way to patch stuff on mwdebug, then? 
[12:00:19] I guess I have to apply the patch on the deployment host and scap pull to debug, then [12:15:05] (03PS1) 10Majavah: Add RealMe to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910767 (https://phabricator.wikimedia.org/T324535) [12:15:07] (03PS1) 10Majavah: Add $wmgUseRealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910768 (https://phabricator.wikimedia.org/T324535) [12:15:09] (03PS1) 10Majavah: Enable RealMe on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910769 (https://phabricator.wikimedia.org/T324535) [12:17:18] yep, that worked [12:18:19] !log reverted monkey-patch, mwdebug2001 and deploy2002 are back to wmf/1.41.0-wmf.5 (T335183) [12:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:25] T335183: VisualEditor is low to load on Hebrew Wikipedia - https://phabricator.wikimedia.org/T335183 [12:22:33] (03CR) 10Joal: analytics: Add purge job for webrequest data loss reports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu) [12:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:32:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:34:20] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) 
[13:52:18] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [13:55:16] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [14:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:12:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [14:13:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [14:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:37:15] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:38:51] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:51:51] PROBLEM - SSH on 
bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:53:29] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:57:11] (03PS1) 10Lucas Werkmeister (WMDE): Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) [14:59:13] I’m tempted to do an emergency deploy for ^ (mainly to unbreak the dumps, cc apergos), not sure if warranted… anyone around who has thoughts? [14:59:59] (03CR) 10Superpes15: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910018 (https://phabricator.wikimedia.org/T335090) (owner: 10Anzx) [15:03:42] basically, mediawikiwiki xml dumps are broken until that fix goes in [15:03:50] it's just that wiki [15:08:00] Lucas_WMDE: +1 from me [15:08:23] hey lego [15:08:28] hey [15:08:34] for the change in general or for deploying it today too? [15:08:43] (reluctant to ping t.hcipriani and j.nuche ^^) [15:08:51] deploying it, I guess I should actually review the change [15:09:06] er wait [15:09:23] does namespaceDupes fix this situation? 
or is it the other way around [15:09:39] (03CR) 10Majavah: "This broke connectivity to gerrit-replica because the IP does not match what's defined in the operations/homer/public repo and so caused T" [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [15:09:39] hm [15:09:42] worth a try I guess [15:09:47] since it has a dry-run mode anyways [15:09:49] $this->addDescription( 'Find and fix pages affected by namespace addition/removal' ); [15:09:51] yeah [15:09:54] give that a try first [15:10:04] since we want to get rid of the namespace anyways (AIUI) [15:10:17] I think that's right about the namespace, we want it gone [15:10:24] “0 pages to fix, 0 were resolvable.” [15:10:28] (0 links to fix too) [15:11:21] perhaps cleanupTitles would work but I’m not sure running that on mediawikiwiki is a good idea [15:11:30] so far I gathered that that’s a maint script best used on small wikis [15:12:49] oh yeah no, there's a bug about how running it is unsafe I think [15:13:59] eek, ok [15:14:22] (03CR) 10Legoktm: [C: 03+1] "Guess namespaceDupes can't handle this case. OK to temporarily restore to get the pages out of there then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: 10Lucas Werkmeister (WMDE)) [15:17:58] alright, I’ll do the emergency deploy hue and cry then [15:18:08] heh [15:18:31] there's no task yet for this, is using https://phabricator.wikimedia.org/T124748 good enough do you think? [15:18:46] I would use that and T335130 [15:18:47] T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130 [15:18:57] right [15:18:59] since that’s ^ the same error message right? just on a second wiki [15:19:15] well no in this case we don't get the error message, just an error out [15:19:20] ah ok [15:19:25] no exception is thrown, which is obviously not great [15:20:59] thcipriani: jnuche: help! 
I’d like to do an emergency deploy for https://gerrit.wikimedia.org/r/910780 – context is T124748 and T335130: XML dumps for mediawikiwiki (only that wiki) are currently broken [15:20:59] T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [15:21:17] (that second task talks about collabwiki, where we fixed the error yesterday, but it turns out it affects mediawikiwiki too) [15:23:52] not sure if anyone US-based is going to be around [15:24:12] :/ [15:24:26] because of the US holiday [15:27:53] maybe hashar from releng is around? [15:27:58] and perhaps Amir1 from SRE [15:28:59] you have two roots supervising :p (though I'm probably going to go afk soon) [15:29:24] I was trying to follow https://wikitech.wikimedia.org/wiki/Deployments/Emergencies :'D [15:29:35] though I’m not sure I’ve followed it for all Friday deploys in the past tbh [15:30:05] I run one of the backport deployment windows every Thursday. but yeah emergency procedures are a separate thing. [15:30:29] hi Lucas_WMDE, I don't have access to my work computer right now [15:30:46] from my side you can go ahead and backport [15:30:56] ok thanks, and sorry to disturb you [15:31:08] no worries at all :) [15:31:12] I think if I don’t hear anyone objecting I’ll go ahead in, let’s say 15 minutes [15:32:23] I'd ask in the -security back channel too, sometimes people follow that more closely as there's less noise comparatively [15:35:32] good idea, done [15:36:05] ty [15:39:27] so the plan would be, revert, delete the pages, re-revert? or ...? 
just to clarify [15:39:35] I need to go afk now, good luck [15:39:44] thanks, see you [15:39:51] apergos: my plan was just to deploy that one config change and leave it at that [15:39:54] see you later [15:39:56] ah [15:40:14] I didn’t get the impression that the config variable being false for mediawikiwiki was super important or urgent, it felt more like a cleanup [15:40:24] so I would think it’s fine to leave it un-cleaned up over the weekend [15:40:29] makes sense [15:40:34] and then I don’t have to make the decision whether the files are ok to delete or not ^ [15:40:35] * ^^ [15:40:52] but if you think we should delete and rerevert I’m ok with that too [15:41:01] I can delete stuff over there too but yeah without any info, ehhh [15:41:14] nah, do the minimum, given it's a long weekend for some [15:42:06] ok [15:42:17] might be easier to decide on deletion once we can actually see the page :D [15:42:32] I just tried getText but it doesn’t work, it has a --revision option but still insists on also having a title that matches [15:43:33] yuck [15:43:40] I suppose it would as a fail-safe [15:44:51] sheesh, 330k php notices in logstash [15:44:57] (unrelated, just, that’s a lot) [15:45:25] alright, I’ll go ahead [15:45:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: 10Lucas Werkmeister (WMDE)) [15:46:32] want a second pair of eyes on the logs or you ok? 
[15:46:46] (03Merged) 10jenkins-bot: Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910780 (https://phabricator.wikimedia.org/T124748) (owner: 10Lucas Werkmeister (WMDE)) [15:46:49] I think I’m ok [15:47:04] also both the ?curid URL and the dump command can be tested on mwdebug, so that’s convenient at least [15:47:13] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]] [15:47:20] T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130 [15:47:21] T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [15:48:48] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:49:06] `mwscript dumpBackup mediawikiwiki --current --start 377536 --end 377537` works on mwdebug2002 now, where it previously errored [15:49:28] and https://www.mediawiki.org/w/index.php?curid=1117504 also loads properly on mwdebug2001 [15:50:22] hm, the mwdebug logstash is totally empty, that’s unusual [15:50:41] I would expect at least one error from when I retried dumpBackup before scap deployed the change there [15:51:51] I’ll deploy anyways [15:51:55] syncing now [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:52] okey dokey [15:54:25] hm [15:55:05] don’t think I need to worry about that 
KubernetesAPILatency at the moment [15:57:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:910780|Set wmgUseGraphWithJsonNamespace = true for mediawikiwiki (T124748 T335130)]] (duration: 10m 01s) [15:57:21] T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130 [15:57:22] T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [15:58:09] ok, now https://www.mediawiki.org/w/index.php?curid=1117504 / https://www.mediawiki.org/wiki/Data:Data_demo.tab works without mwdebug too [15:58:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:21] I think that’s a success [15:58:26] * Lucas_WMDE done [15:58:52] ah, and Andre replied on the task, nice [15:59:31] checking [15:59:50] ah good good [16:03:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:12:35] I'll be around for some hours yet, mostly idling but looking in from time to time in the channel, just in case [16:12:41] thanks a lot for the fix, yet again [16:16:53] (03CR) 10BryanDavis: "Could this be causing T335197? 
Apparently this change does not match the IPv4 for gerrit-replica.wikimedia.org that is configured in Homer" [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [16:16:57] np, hope it works [16:17:02] I’m probably signing off fairly soon [16:17:57] (03CR) 10BryanDavis: cloudgw: fix IP address for gerrit-replica.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [16:18:24] sounds about right, it's Friday evening after all [16:18:29] go have a weekend [16:18:36] ^^ [16:25:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:37:49] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:41] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:42:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is 
CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[17:32:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[17:34:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[17:34:49] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:51:08] (PS1) Dzahn: Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/910717
[17:52:06] (PS2) Dzahn: Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/910717 (https://phabricator.wikimedia.org/T335197)
[17:52:31] (CR) Dzahn: [C: +2] Revert "cloudgw: fix IP address for gerrit-replica.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/910717 (https://phabricator.wikimedia.org/T335197) (owner: Dzahn)
[17:52:31] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[17:55:33] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[18:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org
is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:12:33] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[18:13:33] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[18:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[18:25:29] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw
[18:27:59] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw
[18:36:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:38:27] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:21] (PS4) Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[18:50:47] (PS5) Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki config [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[18:53:35] (CR)
Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki conf (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[18:55:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:56:56] (PS4) Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[18:56:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:58:49] (PS6) Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:01:42] (PS7) Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:01:44] (PS5) Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:03:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:03:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:34] (PS8) Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki config [mediawiki-config] - https://gerrit.wikimedia.org/r/910055
(https://phabricator.wikimedia.org/T334088)
[19:13:48] (PS6) Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:16:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:17:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:55] (PS7) Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[19:23:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:24:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:35:29] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:41:50] (CR) Daimona Eaytoy: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki conf (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:45:11] (PS9) Cmelo: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[19:47:17] (CR) Cmelo: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:48:18] (CR) Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:51:23] (CR) Daimona Eaytoy: [C: +1] metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:51:30] (CR) Daimona Eaytoy: [C: +1] Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[20:25:49] SRE-swift-storage, Commons,
MediaWiki-File-management, MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (putnik) @Ladsgroup I know that it hasn't been reuploaded since 2009. And it is reall...
[20:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[21:37:17] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[21:39:17] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[21:57:17] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[21:57:28] (PS1) Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[21:57:51] (CR) CI reject: [V: -1] profile:toolforge:harbor: setup blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe)
[22:00:17] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder)
[22:03:43] (PS2) Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[22:05:47] (CR) CI reject: [V: -1]
profile:toolforge:harbor: setup blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe)
[22:06:58] (CR) Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe)
[22:08:23] (PS3) Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165)
[22:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:17:18] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[22:18:17] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[22:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[22:38:01] (PS1) Samtar: InitialiseSettings-labs: Set Phonos config on testwiki.beta [mediawiki-config] - https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787)
[22:43:23] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787) (owner: Samtar)
[22:44:07] (Merged) jenkins-bot: InitialiseSettings-labs: Set Phonos config on testwiki.beta [mediawiki-config] -
https://gerrit.wikimedia.org/r/910800 (https://phabricator.wikimedia.org/T332787) (owner: Samtar)
[23:58:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown