[00:05:17] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:38:39] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956861
[00:38:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956861 (owner: TrainBranchBot)
[00:41:15] <wikibugs>	 (PS1) RLazarus: hieradata: Add kubeconfig files for mw-script [puppet] - https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553)
[00:43:46] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43274/console" [puppet] - https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[00:44:25] <wikibugs>	 (CR) RLazarus: [V: +1 C: +2] hieradata: Add kubeconfig files for mw-script [puppet] - https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[00:46:40] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:52:55] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956861 (owner: TrainBranchBot)
[00:53:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-backup: support removal of unhandled image backups [puppet] - https://gerrit.wikimedia.org/r/954131 (owner: Andrew Bogott)
[01:14:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:15:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:17:34] <wikibugs>	 (PS4) Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947912
[01:17:45] <wikibugs>	 (Abandoned) Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947912 (owner: Krinkle)
[01:36:06] <urandom>	 !log starting RESTBase/Cassandra node rebuilds, cassandra-c/row D — T331713
[01:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:10] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[02:07:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:16] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[02:17:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:16] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[02:31:50] <wikibugs>	 (PS1) RLazarus: admin_ng: Add mw-script namespace [deployment-charts] - https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553)
[02:33:48] <wikibugs>	 SRE, Traffic, Epic, User-notice: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (Shizhao)
[02:37:33] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:49] <wikibugs>	 (CR) RLazarus: [C: +2] admin_ng: Add mw-script namespace [deployment-charts] - https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[02:46:20] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Add mw-script namespace [deployment-charts] - https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[02:54:28] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:56:15] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:57:41] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[02:58:22] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[02:58:54] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[03:03:03] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[03:03:54] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[03:04:32] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[03:42:46] <icinga-wm>	 PROBLEM - Disk space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 32561 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops
[03:51:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:51:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:53:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:58:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:01:44] <icinga-wm>	 PROBLEM - Disk space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 63650 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops
[04:03:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:42:36] <icinga-wm>	 PROBLEM - Disk space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 66237 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops
[05:05:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2011,2014].codfw.wmnet,pc1011.eqiad.wmnet with reason: Pre swichover tasks
[05:05:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2011,2014].codfw.wmnet,pc1011.eqiad.wmnet with reason: Pre swichover tasks
[05:05:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Pre swichover tasks
[05:06:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Pre swichover tasks
[05:06:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Pre swichover tasks
[05:06:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Pre swichover tasks
[05:10:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Pre swichover tasks
[05:11:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Pre swichover tasks
[05:21:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Pre swichover tasks
[05:22:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Pre swichover tasks
[05:23:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Pre swichover tasks
[05:23:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Pre swichover tasks
[05:51:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:53:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:00:08] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0600)
[06:00:08] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0600).
[06:08:49] <wikibugs>	 (CR) Ayounsi: [C: +1] Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: Cathal Mooney)
[06:18:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (ayounsi) @CDanis  Is that still needed now that we have NEL?
[06:18:55] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (ayounsi)
[06:22:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Pre swichover tasks
[06:22:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Pre swichover tasks
[06:26:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Pre swichover tasks
[06:27:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Pre swichover tasks
[06:37:48] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:34] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (MoritzMuehlenhoff) Looks good
[06:39:00] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (MoritzMuehlenhoff) Looks good
[06:56:00] <wikibugs>	 (PS4) KartikMistry: Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[06:56:34] <wikibugs>	 (CR) Muehlenhoff: "Looks good, two nits inline." [puppet] - https://gerrit.wikimedia.org/r/956836 (owner: Slyngshede)
[07:00:04] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0700).
[07:00:04] <jouncebot>	 abijeet and houseofm: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:33] * kart_ will deploy abijeet's change
[07:00:57] <abijeet>	 o/
[07:01:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[07:01:45] <wikibugs>	 (CR) Muehlenhoff: "Looks good (the commit message is misleading, though: piuparts has been in Debian for almost twenty years)" [puppet] - https://gerrit.wikimedia.org/r/956968 (owner: BCornwall)
[07:01:55] <wikibugs>	 (Merged) jenkins-bot: Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[07:02:45] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]]
[07:02:50] <stashbot>	 T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445
[07:04:20] <logmsgbot>	 !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:04:52] <kart_>	 abijeet: can you test the patch with mwdebug now?
[07:04:58] <abijeet>	 kart_, checking
[07:05:00] <hashar>	 hello :)
[07:05:23] <hashar>	 Mohd and I are doing a backport training, so we will deploy the second change
[07:05:31] <hashar>	 unless you want to join in the meeting?
[07:06:11] <abijeet>	 kart_, looks good
[07:06:20] <kart_>	 abijeet: nice. Deploying..
[07:06:24] <logmsgbot>	 !log kartik@deploy1002 abi and kartik: Continuing with sync
[07:06:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org
[07:09:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org
[07:12:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:13:02] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]] (duration: 10m 17s)
[07:13:12] <stashbot>	 T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445
[07:14:49] <wikibugs>	 (CR) Filippo Giunchedi: "Code LGTM, though it'll need to be applied to a profile common to both frontends (logstash + OS) and backends (data nodes, OS only). For e" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:16:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (hashar) While doing the backport & config training this morning with Mohd (T345186), we found out he has no access to the deployment server since he ha...
[07:16:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (hashar)
[07:16:56] <hashar>	 kart_: if you are done can we proceed? :)
[07:17:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:17:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[07:21:18] <kart_>	 hashar: sorry. Please go ahead.
[07:21:32] <hashar>	 kart_: we are doing it, thank you! :)
[07:21:32] <wikibugs>	 (PS6) Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798)
[07:21:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[07:22:01] <wikibugs>	 (CR) Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:22:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: Mhorsey)
[07:23:42] <wikibugs>	 (Merged) jenkins-bot: Enable Campaign Events email feature [mediawiki-config] - https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: Mhorsey)
[07:24:08] <wikibugs>	 (PS4) Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909
[07:25:01] <wikibugs>	 (PS5) Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909
[07:27:30] <wikibugs>	 (PS6) Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909
[07:29:12] <hashar>	 scap backport magically detects it is a beta cluster only change and happilly skips the sync :))
[07:29:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "We should probably also move the project to gitlab, where we have an easy way to set up the testing pipeline." [software/purged] - https://gerrit.wikimedia.org/r/957362 (owner: Fabfur)
[07:30:53] <kart_>	 hashar: that's nice!
[07:31:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909 (owner: Muehlenhoff)
[07:32:32] <hashar>	 !log Backport & config deployment window completed.
[07:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet
[07:36:59] <wikibugs>	 (PS2) Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/956367
[07:37:04] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43275/console" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:38:32] <wikibugs>	 (PS22) Slyngshede: P:idm allow for installation via Debian packages. [puppet] - https://gerrit.wikimedia.org/r/956836
[07:39:08] <wikibugs>	 (CR) Muehlenhoff: Enable cumin hosts to reach the opensearch API on logstash clusters (2 comments) [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:39:44] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43276/console" [puppet] - https://gerrit.wikimedia.org/r/956836 (owner: Slyngshede)
[07:40:53] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43277/console" [puppet] - https://gerrit.wikimedia.org/r/956836 (owner: Slyngshede)
[07:41:45] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/956367 (owner: Muehlenhoff)
[07:44:06] <wikibugs>	 (PS7) Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798)
[07:44:30] <wikibugs>	 (CR) CI reject: [V: -1] Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:44:31] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host debmonitor2003.codfw.wmnet
[07:44:51] <wikibugs>	 (PS2) Fabfur: add support for unix sockets [software/purged] - https://gerrit.wikimedia.org/r/957362
[07:44:53] <wikibugs>	 (CR) Brouberol: "Thanks for the review. I take it `src_sets` can contain variables/defs, and `srange` has to contain ip/ip ranges?" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:45:30] <wikibugs>	 (PS8) Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798)
[07:45:51] <wikibugs>	 (CR) Fabfur: add support for unix sockets (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/957362 (owner: Fabfur)
[07:48:58] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43278/console" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:50:02] <wikibugs>	 (PS2) JMeybohm: Remove conf2* from etcd client srv records [dns] - https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010)
[07:53:20] <wikibugs>	 (CR) JMeybohm: [C: +2] Remove conf2* from etcd client srv records [dns] - https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) (owner: JMeybohm)
[07:54:20] <wikibugs>	 (PS2) Volans: decorators: fix set_tries [software/spicerack] - https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134)
[07:56:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors
[07:56:51] <wikibugs>	 (CR) Brouberol: "Thank you so much for the assistance!" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:56:52] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors
[07:56:53] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors
[07:56:57] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors
[07:56:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors
[07:57:02] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors
[07:58:11] <wikibugs>	 (PS2) Majavah: nginx::status_site: allow multiple instances [puppet] - https://gerrit.wikimedia.org/r/956068
[07:58:13] <wikibugs>	 (PS3) Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - https://gerrit.wikimedia.org/r/956069
[07:59:13] <wikibugs>	 (CR) Slyngshede: [V: +1] P:idm allow for installation via Debian packages. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956836 (owner: Slyngshede)
[08:00:06] <jouncebot>	 jnuche and hashar: That opportune time is upon us again. Time for a MediaWiki train - Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0800).
[08:01:04] <wikibugs>	 (PS23) Slyngshede: P:idm allow for installation via Debian packages. [puppet] - https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721)
[08:02:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:02:18] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[08:02:26] <jnuche>	 andre: morning, ready to continue with the train today? :)
[08:02:56] <wikibugs>	 (CR) Brouberol: [C: +2] Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[08:03:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ship it :-)" [puppet] - https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721) (owner: Slyngshede)
[08:03:15] <andre>	 jnuche: argh I am running late. Can do, sure, one moment!
[08:03:38] <jnuche>	 no hurries!
[08:04:21] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] Switch pybals from conf2 to conf1 [puppet] - https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: JMeybohm)
[08:06:56] <icinga-wm>	 RECOVERY - Disk space on restbase1027 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops
[08:07:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:13:00] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:13:28] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal
[08:14:02] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:15:31] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728)
[08:15:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[08:15:50] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[08:16:20] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[08:16:20] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[08:16:26] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal
[08:16:38] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:19:04] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:19:04] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[08:19:06] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations: Degraded RAID on netmon1003 - https://phabricator.wikimedia.org/T346275 (Peachey88)
[08:19:06] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:19:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:19:20] <wikibugs>	 (03PS1) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721)
[08:20:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:20:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43279/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[08:21:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:23:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:23:40] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956068 (owner: 10Majavah)
[08:24:11] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.26  refs T343728
[08:24:15] <stashbot>	 T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[08:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah)
[08:28:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:29:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[08:30:22] <wikibugs>	 (03PS3) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367
[08:31:22] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) Plan for testing rollout of Debian packages:  Upgrade test to Bookworm:  **Pre-update:**   - Set idm-test1001 in maintenance mode   - Merge patc...
[08:32:10] <wikibugs>	 (03CR) 10David Caro: "Would be nice to have some tests :/, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[08:33:31] <wikibugs>	 (03PS4) 10Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069
[08:33:58] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] nginx::status_site: allow multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/956068 (owner: 10Majavah)
[08:34:51] <wikibugs>	 (03PS5) 10Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069
[08:34:57] <wikibugs>	 (03CR) 10Majavah: prometheus::nginx_exporter: manage nginx status site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah)
[08:35:22] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, nice!" [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah)
[08:35:51] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] dynamicproxy: improve connection error pages [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah)
[08:36:12] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:36:25] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[08:36:49] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah)
[08:36:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah)
[08:37:00] <vgutierrez>	 ^^ is that you jayme? :)
[08:37:10] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:37:29] <jayme>	 vgutierrez: just restarted the secondary lvs's - so probably yes
[08:37:32] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[08:37:39] <vgutierrez>	 yeah.. I must have missed your !log line
[08:37:56] <jayme>	 no, I did not send it because stupid
[08:38:32] <jayme>	 !log restarted secondary lvs in codfw, eqsin, ulsfo
[08:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[08:39:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah)
[08:40:11] <jayme>	 vgutierrez: will restart the primaries now
[08:40:34] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 97 connections established with conf1007.eqiad.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal
[08:40:44] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5006 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[08:40:56] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/955291 (https://phabricator.wikimedia.org/T345702) (owner: 10Jbond)
[08:41:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[08:41:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/928477 (owner: 10Majavah)
[08:43:00] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:43:05] <jayme>	 !log restarting primary lvs in codfw, eqsin, ulsfo
[08:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf1007.eqiad.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[08:43:30] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf1007.eqiad.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal
[08:43:44] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:45:28] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:45:44] <wikibugs>	 (03PS2) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721)
[08:45:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on A:aqs-eqiad
[08:45:54] <jayme>	 !log restarting confd fleet wide
[08:45:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/output/956904/43280/" [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[08:45:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:10] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:46:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:46:34] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:50:16] <wikibugs>	 (03PS1) 10Slyngshede: IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721)
[08:53:58] <godog>	 !log +50 to prometheus eqiad k8s-staging
[08:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:16] <wikibugs>	 (03PS1) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[08:56:50] <jinxer-wm>	 (ThanosRuleIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown
[08:57:27] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[08:57:33] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:58:34] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Regarding the other domains (the ones not part of *.wikimedia.org), only `test.m.wikidata.org` and `m.wikifunctions.org` DNS records are configured.  What should we do wit...
[08:59:54] <btullis>	 !log running build-production-images on build2001 for T344910
[08:59:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:57] <stashbot>	 T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910
[09:02:50] <wikibugs>	 (03PS3) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721)
[09:04:16] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[09:04:42] <wikibugs>	 (03PS2) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[09:05:26] <wikibugs>	 (03PS3) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[09:05:42] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=swift.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:05:55] <marostegui>	 woot
[09:06:50] <jinxer-wm>	 (ThanosRuleIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown
[09:07:04] <moritzm>	 !log installing qemu security updates on ganeti-test
[09:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:33] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:07:42] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff)
[09:09:50] <icinga-wm>	 RECOVERY - Disk space on restbase1026 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops
[09:10:42] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:10:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[09:11:30] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] add support for unix sockets (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur)
[09:16:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[09:17:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[09:18:21] <volans>	 marostegui: any insight?
[09:19:04] <marostegui>	 volans: not sure what you are asking
[09:19:36] <volans>	 nothing, see private :)
[09:20:42] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:22:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[09:24:38] <wikibugs>	 (03PS3) 10Fabfur: add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362
[09:25:33] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10MoritzMuehlenhoff)
[09:27:12] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10MoritzMuehlenhoff) Plan looks good to me.
[09:29:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/956905/43281/" [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[09:30:43] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723)
[09:30:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[09:31:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez)
[09:31:30] <wikibugs>	 (03CR) 10Fabfur: add support for unix sockets (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur)
[09:32:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org
[09:33:05] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723)
[09:33:21] <wikibugs>	 (03PS1) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309)
[09:36:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org
[09:36:38] <wikibugs>	 (03CR) 10FNegri: designate nova_fixed_multi: create A record using project_id and project_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[09:39:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org
[09:40:33] <wikibugs>	 (03PS3) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143)
[09:41:21] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez)
[09:42:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/956906/43283/" [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[09:43:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org
[09:43:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Following this patch and the resolution of" [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[09:44:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey)
[09:44:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet
[09:47:03] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Marostegui)
[09:47:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:48:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140)
[09:48:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:48:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi)
[09:49:20] <jayme>	 !log restarted navtiming on webperf2003 to pick up changed etcd service records
[09:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: sync
[09:50:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: sync
[09:51:33] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: sync
[09:51:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync
[09:52:20] <elukey>	 !log remove the 'mediawiki.revision-score' stream from eventstreams public API - T342116
[09:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:23] <stashbot>	 T342116: Deprecate mediawiki revision-score stream - https://phabricator.wikimedia.org/T342116
[09:52:45] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[09:53:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:55:29] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet
[09:56:02] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Degraded RAID on netmon1003 - https://phabricator.wikimedia.org/T346275 (10fgiunchedi) 05Open→03Invalid Nothing to do, host was reimaged:  ` netmon1003:~$ cat /proc/mdstat  Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid...
[09:59:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] mtail: Record bad requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[09:59:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1005.wikimedia.org
[09:59:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:00:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:00:06] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1000).
[10:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1000)
[10:00:32] <icinga-wm>	 RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:01:07] <jinxer-wm>	 (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:01:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:01:51] <claime>	 Hmm parsoid what's happening to you
[10:02:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:02:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:02:35] <volans>	 looking
[10:02:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:03:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:03:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1005.wikimedia.org
[10:03:33] <volans>	 are they recovered already?
[10:03:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1006.wikimedia.org
[10:03:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:04:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:04:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:04:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:06:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: sync
[10:06:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync
[10:06:52] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rancid: fix log dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136)
[10:07:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:07:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:07:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1006.wikimedia.org
[10:10:02] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2004.codfw.wmnet with OS bullseye
[10:10:08] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2004.codfw.wmnet with OS bullseye
[10:11:07] <jinxer-wm>	 (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:11:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:14:08] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:18:18] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db2194 [puppet] - https://gerrit.wikimedia.org/r/957686
[10:18:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on A:aqs-eqiad
[10:19:07] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db2194 [puppet] - https://gerrit.wikimedia.org/r/957686 (owner: Marostegui)
[10:20:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org
[10:24:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org
[10:25:42] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2004.codfw.wmnet with reason: host reimage
[10:27:24] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1137.eqiad.wmnet with OS bullseye
[10:27:30] <wikibugs>	 SRE, Cloud-VPS, User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) Requests are typically only coming in about 5 every 10 mins at this stage.  @aborrero I did notice that the...
[10:28:12] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2004.codfw.wmnet with reason: host reimage
[10:30:06] <wikibugs>	 (PS1) Elukey: services: disable Changeprop's ORES Cache stream [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116)
[10:36:35] <wikibugs>	 (PS1) Majavah: hieradata: set authdns_servers for eqiad1/codfw1dev [puppet] - https://gerrit.wikimedia.org/r/957688
[10:37:39] <wikibugs>	 (PS1) Elukey: Lower ores.wikimedia.org's TTL to 5M [dns] - https://gerrit.wikimedia.org/r/957689
[10:37:41] <wikibugs>	 (PS1) Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/957690
[10:38:27] <wikibugs>	 SRE, Cloud-VPS, User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (taavi) >>! In T346177#9166102, @cmooney wrote: > Might it be hardcoded some places still?  Instances getting  NAT'd...
[10:41:04] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: host reimage
[10:43:25] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: host reimage
[10:47:56] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:53:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:54:24] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:54:57] <wikibugs>	 (PS1) Muehlenhoff: Extend the maps restart cookbook to also handle reboots [cookbooks] - https://gerrit.wikimedia.org/r/957696
[10:56:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:01:25] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Observability-Logging, Wikimedia-Logstash, observability, Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (brouberol) FYI, I have been working on writi...
[11:02:06] <wikibugs>	 (PS2) Muehlenhoff: Extend the maps restart cookbook to also handle reboots [cookbooks] - https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855)
[11:02:21] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch
[11:03:21] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (MoritzMuehlenhoff)
[11:04:52] <brouberol>	 !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch - T344798
[11:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:56] <stashbot>	 T344798: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798
[11:08:01] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:08:15] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1137.eqiad.wmnet with OS bullseye
[11:09:13] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) Open→Resolved We are now serving 5% of global traffic from mw-on-k8s. Resolving.
[11:12:18] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch
[11:13:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet
[11:14:59] <wikibugs>	 (PS1) Volans: tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - https://gerrit.wikimedia.org/r/957701
[11:15:01] <wikibugs>	 (PS1) Volans: decorators: extend documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/957702
[11:17:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet
[11:17:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet
[11:21:28] <wikibugs>	 (CR) Volans: "eluk" [software/pywmflib] - https://gerrit.wikimedia.org/r/957702 (owner: Volans)
[11:21:39] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch
[11:22:55] <volans>	 lol, fat fingers
[11:24:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet
[11:24:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet
[11:24:30] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] Lower ores.wikimedia.org's TTL to 5M [dns] - https://gerrit.wikimedia.org/r/957689 (owner: Elukey)
[11:25:22] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@e8a6ae4]: Disable wikifeeds announcements healthcheck
[11:26:42] <wikibugs>	 (PS1) BBlack: haproxy: reduce varnish maxconn to 10k [puppet] - https://gerrit.wikimedia.org/r/957704 (https://phabricator.wikimedia.org/T310609)
[11:27:42] <wikibugs>	 (CR) BBlack: [C: +2] esams: set frontend memory reservation to 170 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952866 (owner: BBlack)
[11:28:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet
[11:30:48] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: set authdns_servers for eqiad1/codfw1dev [puppet] - https://gerrit.wikimedia.org/r/957688 (owner: Majavah)
[11:31:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet
[11:34:33] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on idm-test1001.wikimedia.org with reason: upgrade to Bookwork
[11:34:57] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on idm-test1001.wikimedia.org with reason: upgrade to Bookwork
[11:35:30] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@e8a6ae4]: Disable wikifeeds announcements healthcheck (duration: 10m 08s)
[11:36:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/956367 (owner: Muehlenhoff)
[11:36:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet
[11:37:02] <wikibugs>	 (CR) Slyngshede: [C: +2] P:idm allow for installation via Debian packages. [puppet] - https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721) (owner: Slyngshede)
[11:37:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:37:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:37:14] <volans>	 hnowlan: ahem...
[11:37:21] <hnowlan>	 ffff
[11:37:27] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch
[11:37:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:46] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (cmooney) @taavi @aborrero that's not a bad plan of action at all.  In terms of step 4 I'm not sure we need to hold off, but in general there is no...
[11:37:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:10] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:38:12] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:38:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:30] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:41:56] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:42:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:20] <wikibugs>	 (PS1) Muehlenhoff: Revert "ganeti: Avoid Ferm-specific syntax" [puppet] - https://gerrit.wikimedia.org/r/957719
[11:42:33] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:43:05] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@8eb62f2]: Revert "Disable wikifeeds announcements healthcheck"
[11:43:54] <wikibugs>	 (PS1) Majavah: acme_chief: Make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/957720
[11:43:56] <wikibugs>	 (PS1) Majavah: acme_chief: remove backwards compat [puppet] - https://gerrit.wikimedia.org/r/957721
[11:43:58] <wikibugs>	 (CR) BBlack: [C: +2] fe_mem_gb_reserved: merge esams settings [nop] [puppet] - https://gerrit.wikimedia.org/r/957343 (owner: BBlack)
[11:46:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert "ganeti: Avoid Ferm-specific syntax" [puppet] - https://gerrit.wikimedia.org/r/957719 (owner: Muehlenhoff)
[11:46:30] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042)
[11:46:34] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:53] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042)
[11:47:04] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: Arturo Borrero Gonzalez)
[11:47:29] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: Arturo Borrero Gonzalez)
[11:47:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:48:45] <wikibugs>	 SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (aborrero)
[11:49:07] <wikibugs>	 SRE, Cloud-VPS, User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (aborrero) Open→Resolved a:aborrero thanks everyone involved in the debugging and fix.
[11:49:17] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@8eb62f2]: Revert "Disable wikifeeds announcements healthcheck" (duration: 06m 12s)
[11:49:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:06] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:50:17] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43287/console" [puppet] - https://gerrit.wikimedia.org/r/957720 (owner: Majavah)
[11:50:38] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (aborrero)
[11:53:41] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=rest-gateway.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:53:57] <vgutierrez>	 hmmm hnowlan ^^
[11:54:05] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm-test1001.wikimedia.org with OS bookworm
[11:54:15] <volans>	 vgutierrez: we're already on it, -sre
[11:54:16] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm
[11:54:20] <vgutierrez>	 volans: sorry :)
[11:55:00] <volans>	 no prob :)
[11:56:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet
[11:56:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:38] <hnowlan>	 ^ that wasn't me :|
[11:56:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:57:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:57:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:57:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:57:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:57:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:57:33] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:58:18] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:58:22] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:58:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:58:41] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1200)
[12:00:20] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:00:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:01:26] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.misc-clusters.roll-restart-restbase rolling restart_daemons on A:restbase-canary
[12:01:46] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/957724
[12:02:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:03:13] <TheresNoTime>	 ah, ^ will be why https://www.mediawiki.org/api/rest_v1/page/html/Project%20talk%3AMastodon?redirect=false is 502ing then
[12:03:45] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2004.codfw.wmnet with OS bullseye
[12:03:52] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2004.codfw.wmnet with OS bullseye completed: - conf2004 (**WARN**)   - Downtimed on Icinga/Alertmanager   - Disab...
[12:03:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:03:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:04:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet
[12:04:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:05:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:05:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:06:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:06:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957724 (owner: 10Muehlenhoff)
[12:06:57] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage
[12:06:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:08:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:08:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:09:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:09:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:10:00] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage
[12:10:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:11:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:11:41] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[12:11:46] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.misc-clusters.roll-restart-restbase (exit_code=1) rolling restart_daemons on A:restbase-canary
[12:12:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:12:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:12:36] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10aborrero) p:05Triage→03High
[12:13:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:13:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet
[12:14:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:16:11] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: sync
[12:16:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:17:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:17:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] firewall::service: Fix logic error in passing srange/drange to nftables [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:17:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync
[12:17:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet
[12:18:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:42] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:19:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:20:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:22:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:22:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:43] <wikibugs>	 (03PS1) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565)
[12:24:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] firewall::service: Fix logic error in passing srange/drange to nftables [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:26:31] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:26:41] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:26:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:27:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:27:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:27:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:27:35] <wikibugs>	 (03PS1) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565)
[12:28:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:42] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: (3) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:29:03] <icinga-wm>	 RECOVERY - SSH on sretest1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:29:13] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:34] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (owner: 10Elukey)
[12:29:46] <wikibugs>	 (03PS1) 10JMeybohm: Update _etcd-server-ssl._tcp.v3.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/957729 (https://phabricator.wikimedia.org/T332010)
[12:30:15] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:30:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:04] <wikibugs>	 (03CR) 10DCausse: "we might still need a flag to disable the jobqueue based updates" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[12:32:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update _etcd-server-ssl._tcp.v3.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/957729 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm)
[12:36:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans)
[12:37:58] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] services: disable Changeprop's ORES Cache stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey)
[12:38:23] <wikibugs>	 (03CR) 10Volans: [C: 03+2] tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans)
[12:39:19] <wikibugs>	 (03PS1) 10Slyngshede: P:idm move directory creation. [puppet] - 10https://gerrit.wikimedia.org/r/957730
[12:40:34] <wikibugs>	 (03PS2) 10Slyngshede: P:idm move directory creation. [puppet] - 10https://gerrit.wikimedia.org/r/957730
[12:41:13] <wikibugs>	 (03PS2) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565)
[12:41:15] <wikibugs>	 (03PS2) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565)
[12:42:21] <wikibugs>	 (03Merged) 10jenkins-bot: tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans)
[12:43:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi)
[12:44:11] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43288/console" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede)
[12:44:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede)
[12:45:21] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm move directory creation. [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede)
[12:45:23] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43289/console" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede)
[12:46:07] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) cloudservices1006 is now replying to DNS auth queries in the 185.15.56.162 address, which will later be handed to cloudservices1005:  `l...
[12:46:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:05] <wikibugs>	 (03CR) 10Elukey: decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans)
[12:48:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:48:33] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[12:53:56] <wikibugs>	 (03CR) 10Volans: "reply inline" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans)
[12:56:39] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm-test1001.wikimedia.org with OS bookworm
[12:56:48] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed: - idm-t...
[13:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1300).
[13:00:06] <jouncebot>	 houseofm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:42] * TheresNoTime is here
[13:01:03] <wikibugs>	 (03PS1) 10Slyngshede: P:idm enable bitu uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/957731
[13:01:06] <TheresNoTime>	 but that patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/956447/) is beta-only and already deployed?
[13:02:11] <TheresNoTime>	 (cc HouseOfM)
[13:02:13] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43290/console" [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede)
[13:04:00] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[13:05:03] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43291/console" [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede)
[13:05:14] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm enable bitu uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede)
[13:08:15] <icinga-wm>	 PROBLEM - memcached socket on mw2444 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[13:10:55] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1138.eqiad.wmnet with OS bullseye
[13:11:43] <wikibugs>	 (03PS1) 10Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330)
[13:11:43] <volans>	 claime: FYI ^^^ mw2444, it's very sluggish on ssh
[13:11:54] <claime>	 volans: that server...
[13:12:04] <claime>	 https://phabricator.wikimedia.org/T345884
[13:12:12] <moritzm>	 I think that was just a temporary blip
[13:12:14] <claime>	 It's been a pain for a while
[13:12:25] <moritzm>	 I upgraded packages there which were missed when the server was down
[13:12:28] <moritzm>	 and it's currently depooled
[13:12:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1139.eqiad.wmnet with OS bullseye
[13:12:40] <moritzm>	 so that should recover soon
[13:12:50] <claime>	 moritzm: it is until we can call it stable, yeah
[13:13:18] <claime>	 It's had its CPU changed
[13:13:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet
[13:13:34] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet
[13:13:36] <volans>	 it's a lemon then :D
[13:14:21] <moritzm>	 !log installing aom security updates 
[13:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:33] <icinga-wm>	 PROBLEM - Check systemd state on conf2004 is CRITICAL: CRITICAL - degraded: The following units failed: etcd-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:59] <icinga-wm>	 RECOVERY - Check systemd state on conf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:13] <claime>	 Hmm memcached seems like it's not configured correctly on that server
[13:19:14] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Decrease max_connections to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957735
[13:19:16] <wikibugs>	 (03PS1) 10Kamila Součková: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330)
[13:19:41] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2006.codfw.wmnet with OS bullseye
[13:19:44] <wikibugs>	 (03CR) 10Elukey: decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans)
[13:19:47] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2006.codfw.wmnet with OS bullseye
[13:20:02] <claime>	 moritzm: I think memcached got updated, puppet didn't run immediately after and didn't drop the override back
[13:20:20] <claime>	 yeah, confirmed
[13:21:00] <moritzm>	 ack, indeed
[13:21:06] <wikibugs>	 (03Abandoned) 10BBlack: haproxy: reduce varnish maxconn to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957704 (https://phabricator.wikimedia.org/T310609) (owner: 10BBlack)
[13:21:11] <icinga-wm>	 RECOVERY - memcached socket on mw2444 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached
[13:21:19] <wikibugs>	 (03PS3) 10Bking: site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039)
[13:23:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] fe_mem_gb_reserved:170 for all single-backend [puppet] - 10https://gerrit.wikimedia.org/r/957344 (owner: 10BBlack)
[13:23:36] <wikibugs>	 (03PS5) 10Stevemunene: [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[13:24:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[13:24:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140)
[13:25:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for aom [puppet] - 10https://gerrit.wikimedia.org/r/957738
[13:25:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi)
[13:25:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hint for aom [puppet] - 10https://gerrit.wikimedia.org/r/957738
[13:25:54] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1139.eqiad.wmnet with reason: host reimage
[13:26:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi)
[13:26:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi)
[13:27:14] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] varnish: Decrease max_connections to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957735 (owner: 10Vgutierrez)
[13:28:19] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[13:28:21] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1139.eqiad.wmnet with reason: host reimage
[13:30:09] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "it would probably be best to remove this after purged is switched to an UDS" [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack)
[13:31:30] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good, but to be clear this doesn't impact our beta cluster (en.wikipedia.beta.wmflabs.org) but a specific instance in the traffic WM" [puppet] - 10https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack)
[13:31:58] <moritzm>	 !log installing libwebp security updates on bookworm
[13:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:02] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org
[13:32:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for aom [puppet] - 10https://gerrit.wikimedia.org/r/957738 (owner: 10Muehlenhoff)
[13:35:21] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2006.codfw.wmnet with reason: host reimage
[13:38:23] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2006.codfw.wmnet with reason: host reimage
[13:39:30] <godog>	 !log issue test alertmanager librenms alert - T346318
[13:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:33] <stashbot>	 T346318: Fix librenms/alertmanager integration - https://phabricator.wikimedia.org/T346318
[13:40:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:41:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:43:08] <wikibugs>	 (03PS2) 10BBlack: fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - 10https://gerrit.wikimedia.org/r/957352
[13:47:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:47:52] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] beta: haproxy->varnish single UDS config [puppet] - 10https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack)
[13:47:55] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43294/console" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[13:49:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:50:17] <wikibugs>	 (03CR) 10BBlack: varnish: remove TCP monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack)
[13:50:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:50:53] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] fe_mem_gb_reserved:170 for all single-backend [puppet] - 10https://gerrit.wikimedia.org/r/957344 (owner: 10BBlack)
[13:51:28] <wikibugs>	 (03PS15) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160
[13:55:28] <wikibugs>	 (03CR) 10Btullis: "The IP addresses for the flink clusters in the deployments section don't look right. They seem to be 1.2.3.4/32 which is presumably just a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse)
[13:55:58] <logmsgbot>	 !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: (no justification provided)
[13:56:10] <logmsgbot>	 !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: (no justification provided) (duration: 00m 11s)
[13:57:33] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1138.eqiad.wmnet with OS bullseye
[13:58:00] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2006.codfw.wmnet with OS bullseye
[13:58:06] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2006.codfw.wmnet with OS bullseye completed: - conf2006 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disab...
[13:58:47] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Varnish: listen on only 1x UDS [puppet] - 10https://gerrit.wikimedia.org/r/957346 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack)
[13:59:35] <icinga-wm>	 PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-alerts.service,librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:39] <wikibugs>	 (03CR) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse)
[14:03:25] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur)
[14:03:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur)
[14:07:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:58] <wikibugs>	 (03CR) 10Btullis: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse)
[14:12:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "This looks good to me, but you might still wish for more eyes first, given that it's an admin_ng change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse)
[14:12:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:14:02] <wikibugs>	 (03PS2) 10Jforrester: [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747)
[14:14:13] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester)
[14:15:14] <wikibugs>	 (03Merged) 10jenkins-bot: [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester)
[14:16:40] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply
[14:17:05] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[14:17:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:51] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[14:18:13] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2005.codfw.wmnet with OS bullseye
[14:18:21] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2005.codfw.wmnet with OS bullseye
[14:18:32] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[14:18:40] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[14:19:20] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[14:21:48] <wikibugs>	 (03PS1) 10JMeybohm: Revert "Remove conf2* from etcd client srv records" [dns] - 10https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010)
[14:22:21] <wikibugs>	 (03PS1) 10JMeybohm: Revert "Switch pybals from conf2 to conf1" [puppet] - 10https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010)
[14:22:33] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:39] <wikibugs>	 (03PS2) 10Volans: decorators: extend documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702
[14:23:09] <wikibugs>	 (03CR) 10Volans: "addressed comment" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans)
[14:23:39] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1027.eqiad.wmnet with OS bullseye
[14:24:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1027.eqiad.wmnet with OS bullseye
[14:25:57] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 10 NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43297/console" [puppet] - 10https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm)
[14:26:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/957724 (owner: 10Muehlenhoff)
[14:27:33] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:27:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1028.eqiad.wmnet with OS bullseye
[14:27:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1029.eqiad.wmnet with OS bullseye
[14:27:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye
[14:27:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye
[14:27:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1028.eqiad.wmnet with OS bullseye
[14:27:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1029.eqiad.wmnet with OS bullseye
[14:27:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye
[14:27:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye
[14:32:28] <moritzm>	 !log installing qemu security updates on ganeti-test cluster
[14:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:41] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2005.codfw.wmnet with reason: host reimage
[14:33:56] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[14:34:09] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Jdforrester-WMF)
[14:36:48] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2005.codfw.wmnet with reason: host reimage
[14:37:15] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] OpenSSL 3 compat for update-ocsp script [puppet] - 10https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: 10BBlack)
[14:37:24] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1027.eqiad.wmnet with reason: host reimage
[14:38:42] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Vgutierrez) 05Resolved→03Open a:05Jhancock.wm→03None not sure why I've been pinged in this task but anyways, the new disk needs to be added to the RAID, as it's still degraded: `/de...
[14:39:19] <wikibugs>	 (03PS4) 10Bking: site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039)
[14:39:23] <wikibugs>	 (03CR) 10Bking: site.pp: add new search-loader hostnames (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:40:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1027.eqiad.wmnet with reason: host reimage
[14:40:35] <wikibugs>	 (03CR) 10Bking: [C: 03+2] site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:41:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1028.eqiad.wmnet with reason: host reimage
[14:41:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1029.eqiad.wmnet with reason: host reimage
[14:41:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2004.codfw.wmnet
[14:42:11] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Decrease max_connections to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957735 (owner: 10Vgutierrez)
[14:43:06] <vgutierrez>	 !log varnish: decrease max_connections to 10k per backend server globally
[14:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1028.eqiad.wmnet with reason: host reimage
[14:45:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2004.codfw.wmnet
[14:45:50] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10bking)
[14:46:05] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10bking) Thanks Moritz...closing on our board.
[14:46:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1029.eqiad.wmnet with reason: host reimage
[14:47:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader2002.codfw.wmnet
[14:47:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:47:51] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[14:48:02] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957744 (https://phabricator.wikimedia.org/T346055)
[14:48:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357)
[14:48:48] <wikibugs>	 (03PS2) 10AOkoth: wmnet: add ticket-test -> vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/957322
[14:50:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2005.codfw.wmnet
[14:50:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader2002.codfw.wmnet - bking@cumin1001"
[14:50:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader1002.eqiad.wmnet
[14:50:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:51:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.wikimedia.org with reason: test before full decom
[14:51:28] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.wikimedia.org with reason: test before full decom
[14:51:30] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader2002.codfw.wmnet - bking@cumin1001"
[14:51:30] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:51:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader2002.codfw.wmnet on all recursors
[14:51:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader2002.codfw.wmnet on all recursors
[14:51:35] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fd552e4c-12f5-4380-9775-a70e560609fd) set by cmooney@cumin1001 for 2:00:00 on 1 h...
[14:52:01] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader2002.codfw.wmnet - bking@cumin1001"
[14:52:51] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader2002.codfw.wmnet - bking@cumin1001"
[14:53:11] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] wmnet: add ticket-test -> vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/957322 (owner: 10AOkoth)
[14:53:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2005.codfw.wmnet
[14:54:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet
[14:55:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host search-loader2002.codfw.wmnet with OS bullseye
[14:55:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[14:55:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[14:58:13] <wikibugs>	 (03Abandoned) 10BBlack: Add dumps mapping to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack)
[14:58:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet
[14:58:21] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[14:58:21] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:58:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors
[14:58:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors
[14:58:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:58:36] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2005.codfw.wmnet with OS bullseye
[14:58:46] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2005.codfw.wmnet with OS bullseye completed: - conf2005 (**WARN**)   - Downtimed on Icinga/...
[14:58:50] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi)
[15:00:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:01:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:01:08] <wikibugs>	 (03PS1) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027)
[15:01:44] <icinga-wm>	 RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:03:17] <wikibugs>	 (03PS1) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027)
[15:03:25] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357)
[15:04:04] <wikibugs>	 (03PS3) 10BBlack: fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - 10https://gerrit.wikimedia.org/r/957352
[15:05:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet
[15:06:21] <wikibugs>	 (03PS1) 10Herron: dispatch::web: add ensure param and ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937)
[15:06:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: donate: Move into dedicated docroot [puppet] - 10https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055)
[15:06:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] donate: Move into dedicated docroot [puppet] - 10https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055) (owner: 10Alexandros Kosiaris)
[15:07:06] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:07:06] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:07:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors
[15:07:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors
[15:07:17] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host search-loader1002.eqiad.wmnet
[15:08:02] <bblack>	 !log cp[45]*: restart varnish frontends in all ulsfo + eqsin nodes for memory size changes ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/957344 ), slowly over the next 24h via cumin
[15:09:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader1002.eqiad.wmnet
[15:09:41] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[15:09:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet
[15:10:11] <wikibugs>	 (03PS1) 10Andrew Bogott: backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - 10https://gerrit.wikimedia.org/r/957752
[15:10:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - 10https://gerrit.wikimedia.org/r/957752 (owner: 10Andrew Bogott)
[15:11:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:12:10] <wikibugs>	 (03PS2) 10Andrew Bogott: backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - 10https://gerrit.wikimedia.org/r/957752
[15:12:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:12:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:12:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors
[15:12:53] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors
[15:13:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:13:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1031.eqiad.wmnet with OS bullseye
[15:13:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1030.eqiad.wmnet with OS bullseye
[15:13:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[15:13:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[15:13:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - 10https://gerrit.wikimedia.org/r/957752 (owner: 10Andrew Bogott)
[15:13:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:13:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1028.eqiad.wmnet with OS bullseye
[15:14:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1028.eqiad.wmnet with OS bullseye completed: - kubernetes1028 (**WARN*...
[15:14:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader1002.eqiad.wmnet - bking@cumin1001"
[15:14:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:14:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1029.eqiad.wmnet with OS bullseye
[15:14:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1029.eqiad.wmnet with OS bullseye completed: - kubernetes1029 (**WARN*...
[15:15:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host search-loader1002.eqiad.wmnet with OS bullseye
[15:15:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:15:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1027.eqiad.wmnet with OS bullseye
[15:15:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1027.eqiad.wmnet with OS bullseye completed: - kubernetes1027 (**WARN*...
[15:15:16] <wikibugs>	 (03PS2) 10JMeybohm: Revert "Remove conf2* from etcd client srv records" [dns] - 10https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010)
[15:16:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[15:16:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1054.eqiad.wmnet with OS bullseye
[15:17:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1055.eqiad.wmnet with OS bullseye
[15:17:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye
[15:17:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye
[15:17:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye
[15:17:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye
[15:17:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye
[15:17:27] <logmsgbot>	 !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: (no justification provided)
[15:17:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye
[15:17:32] <logmsgbot>	 !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: (no justification provided) (duration: 00m 05s)
[15:17:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye
[15:17:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye
[15:17:59] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43300/console" [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[15:19:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on search-loader2002.codfw.wmnet with reason: host reimage
[15:20:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "Remove conf2* from etcd client srv records" [dns] - 10https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm)
[15:22:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[15:22:11] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[15:23:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on search-loader2002.codfw.wmnet with reason: host reimage
[15:24:34] <jayme>	 !log restarted navtiming on webperf2003 to pick up changed etcd service records
[15:24:39] <jayme>	 !log restarting confd fleet wide
[15:25:37] <wikibugs>	 (03CR) 10Andrew Bogott: designate nova_fixed_multi: create A record using project_id and project_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[15:25:39] <wikibugs>	 (03PS1) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[15:26:44] <wikibugs>	 (03PS6) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[15:26:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on search-loader1002.eqiad.wmnet with reason: host reimage
[15:27:09] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10RobH) Email sent, cc'd @aborrero so they can stay apprised of progress.  If this...
[15:27:40] <wikibugs>	 (03PS3) 10Andrew Bogott: designate nova_fixed_multi: create A recs using project_id and project_name [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158)
[15:27:42] <wikibugs>	 (03PS7) 10Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158)
[15:28:52] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10aborrero)
[15:29:10] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10aborrero) fixing typo, it should be `ns0.openstack.eqiad1.wikimediacloud.org`
[15:29:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage
[15:30:26] <wikibugs>	 (03PS2) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[15:30:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye
[15:31:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye
[15:31:15] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on search-loader1002.eqiad.wmnet with reason: host reimage
[15:31:55] <wikibugs>	 (03CR) 10Muehlenhoff: dispatch: remove puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[15:32:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage
[15:32:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage
[15:32:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[15:34:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage
[15:36:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[15:36:11] <wikibugs>	 (03PS3) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[15:36:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Revert "Switch pybals from conf2 to conf1" [puppet] - 10https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm)
[15:36:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host search-loader2002.codfw.wmnet with OS bullseye
[15:36:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host search-loader2002.codfw.wmnet
[15:36:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage
[15:36:51] <urbanecm>	 jouncebot: nowandnext
[15:36:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[15:36:51] <jouncebot>	 In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1600)
[15:37:25] <wikibugs>	 (03PS1) 10Urbanecm: listTaskCounts: Push total task counts to statsd for all tasks [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204)
[15:37:31] <wikibugs>	 (03PS4) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[15:37:55] <jayme>	 !log running puppet on lvs[2011-2014].codfw.wmnet,lvs[5004-5006].eqsin.wmnet,lvs[4008-4010].ulsfo.wmnet
[15:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:27] <wikibugs>	 (03PS1) 10Urbanecm: linkTaskCounts: Stop producing per-topic statsd data [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210)
[15:38:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage
[15:38:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[15:38:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: 10Urbanecm)
[15:39:57] <wikibugs>	 (03PS5) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[15:40:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[15:41:18] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[15:41:23] <wikibugs>	 (03CR) 10Herron: dispatch: remove puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[15:42:35] <jayme>	 !log restarting secondary lvs in codfw, eqsin, ulsfo
[15:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:58] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[15:43:32] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:43:52] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:43:55] <wikibugs>	 (03PS1) 10Bking: search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039)
[15:44:04] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[15:44:34] <jayme>	 !log restarting primary lvs in codfw, eqsin, ulsfo
[15:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] listTaskCounts: Push total task counts to statsd for all tasks [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[15:45:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] linkTaskCounts: Stop producing per-topic statsd data [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: 10Urbanecm)
[15:45:28] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[15:45:44] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:46:22] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[15:46:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[15:47:14] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[15:47:19] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[15:47:30] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: 10Urbanecm)
[15:47:52] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]]
[15:47:57] <stashbot>	 T345210: Stop sending per-topic task counts to statsd/Grafana - https://phabricator.wikimedia.org/T345210
[15:47:57] <stashbot>	 T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204
[15:47:59] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host search-loader1002.eqiad.wmnet with OS bullseye
[15:47:59] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host search-loader1002.eqiad.wmnet
[15:47:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye
[15:48:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye
[15:48:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye
[15:48:06] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[15:48:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye
[15:48:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye
[15:48:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye
[15:48:40] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:48:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:13] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[15:49:25] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[15:49:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2003.codfw.wmnet with OS bullseye
[15:49:33] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye completed: -...
[15:49:55] <wikibugs>	 10SRE, 10Traffic, 10GitLab (Project Migration): Move purged repository from Gerrit to GitLab - https://phabricator.wikimedia.org/T346305 (10Aklapper) + #gitlab-migration
[15:51:14] <wikibugs>	 10SRE, 10Traffic, 10GitLab (Project Migration): Move purged repository from Gerrit to GitLab - https://phabricator.wikimedia.org/T346305 (10Aklapper)
[15:51:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:51:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2005.codfw.wmnet
[15:51:58] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2005.codfw.wmnet
[15:52:14] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet
[15:52:15] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet
[15:52:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2006.codfw.wmnet
[15:52:25] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2006.codfw.wmnet
[15:53:01] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1139.eqiad.wmnet with OS bullseye
[15:53:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:53:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) 05Open→03Resolved @MatthewVernon Hey I really tried to make this work as JBOD, but the hardware just doesn't work that way. I did wha...
[15:53:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:54:02] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43304/console" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[15:54:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:54:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:54:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1056.eqiad.wmnet with OS bullseye
[15:55:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye completed: - kubernetes1056 (**WARN*...
[15:55:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:55:09] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357)
[15:55:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1054.eqiad.wmnet with OS bullseye
[15:55:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye completed: - kubernetes1054 (**WARN*...
[15:55:29] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]] (duration: 07m 37s)
[15:55:34] <stashbot>	 T345210: Stop sending per-topic task counts to statsd/Grafana - https://phabricator.wikimedia.org/T345210
[15:55:34] <stashbot>	 T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204
[15:55:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:55:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1055.eqiad.wmnet with OS bullseye
[15:55:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[15:55:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye completed: - kubernetes1055 (**PASS*...
[15:56:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:56:57] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[15:57:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye
[15:57:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye
[15:59:54] <rzl>	 dancy: I'm in a meeting that might run a couple minutes over, but I see it :) be right with you
[15:59:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10JMeybohm)
[16:00:05] <jouncebot>	 jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:15] <RhinosF1>	 rzl: can I also ask for a merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/956813
[16:00:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[16:00:20] <RhinosF1>	 (It should be very simple)
[16:00:32] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10JMeybohm) 05Open→03Resolved This is done and clients (confd/pybal) are back on the cluster.  I tried to capture the process here (minus the need to add a new SAN to the cergen cert whi...
[16:00:56] <RhinosF1>	 I'm on my way home so can't test but it's very very simple
[16:01:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[16:01:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:01:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:55] <wikibugs>	 (03PS1) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154)
[16:02:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042)
[16:03:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[16:03:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[16:03:24] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1031.eqiad.wmnet with OS bullseye
[16:03:25] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "failed in reimage script said manually run it - robh@cumin1001 - T342533"
[16:03:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1030.eqiad.wmnet with OS bullseye
[16:03:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[16:03:34] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[16:03:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[16:03:48] <stashbot>	 T342533: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533
[16:04:11] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "failed in reimage script said manually run it - robh@cumin1001 - T342533"
[16:04:24] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43305/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:04:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[16:04:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:05:32] <wikibugs>	 (03PS2) 10Cathal Mooney: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[16:06:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[16:07:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[16:08:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[16:09:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (10RobH)
[16:09:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:10:12] <wikibugs>	 (03PS1) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368)
[16:10:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (10RobH) p:05Triage→03Medium
[16:10:27] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397
[16:10:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10JMeybohm) conf2 nodes are on bullseye now and the metrics do look better now, as expected
[16:10:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH)
[16:10:46] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398
[16:10:58] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43306/console" [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy)
[16:11:00] <wikibugs>	 (03PS2) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397
[16:11:28] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[16:11:39] <wikibugs>	 (03PS2) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368)
[16:11:46] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43307/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:11:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse)
[16:11:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398 (owner: 10Andrea Denisse)
[16:12:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans)
[16:12:06] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:12:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH)
[16:12:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye
[16:12:17] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398 (owner: 10Andrea Denisse)
[16:12:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye completed: - kubernetes1053 (**WARN*...
[16:12:31] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:12:36] <dancy>	 rzl: I am around if you need me.
[16:12:47] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[16:12:49] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse)
[16:12:53] <wikibugs>	 (03PS3) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368)
[16:12:59] <robh>	 third time's the charm
[16:12:59] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43308/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:13:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[16:14:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:14:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1041.eqiad.wmnet with OS bullseye
[16:14:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1042.eqiad.wmnet with OS bullseye
[16:14:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:14:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1041.eqiad.wmnet with OS bullseye
[16:14:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1042.eqiad.wmnet with OS bullseye
[16:15:02] <rzl>	 dancy: I was double-checking whether semicolons work that way on an ExecStart line but of course they do :) merging now
[16:15:21] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1 C: 03+2] Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy)
[16:15:59] <rzl>	 (famously it's *not* a shell, which trips people up sometimes)
[16:16:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[16:16:59] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326)
[16:17:03] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update - volans@cumin1001"
[16:17:19] <icinga-wm>	 PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-discovery-all.service,librenms-poller-all.service,librenms-poller-all.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:52] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update - volans@cumin1001"
[16:18:17] <rzl>	 RhinosF1: the SRE who's most familiar with wikistats is on leave -- if you're able to get a code review from someone who knows what you're changing, I would very much prefer that :) but if that's impossible, let me know
[16:18:23] <wikibugs>	 (03Abandoned) 10Cathal Mooney: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[16:18:44] <RhinosF1>	 rzl: yes it's me looking after wikistats while they are off.
[16:18:45] <rzl>	 dancy: merged and deployed to all three gitlab hosts, test at will
[16:18:55] <RhinosF1>	 I'm cleaning up things that have been long broken
[16:19:44] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH)
[16:19:57] <rzl>	 yeah, I appreciate that! but it's still best to have a second pair of eyes on anything, and I'm not informed enough to do that for you
[16:20:04] <RhinosF1>	 Ok
[16:20:15] <RhinosF1>	 I will try and poke Arnold, he knows bits
[16:20:16] <dancy>	 rzl: The timer will run again in 10 minutes.  I'll keep an eye on it.
[16:20:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Extend the maps restart cookbook to also handle reboots (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff)
[16:20:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:21:09] <rzl>	 RhinosF1: okay sounds good -- if it turns out there's no one and you're completely stuck, let me know
[16:21:15] <RhinosF1>	 Will do
[16:21:19] <rzl>	 but in that case I will insist on you being around to at least test it :)
[16:21:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326) (owner: 10Cathal Mooney)
[16:21:22] <denisse>	 !log Failing over from netmon2002 (codfw) to netmon1003 (eqiad).
[16:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:38] <RhinosF1>	 rzl: yes, sadly the stupid traffic has spoilt my plan
[16:21:54] <RhinosF1>	 And decided to make my journey home much longer than normal
[16:23:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:23:33] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326) (owner: 10Cathal Mooney)
[16:23:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:23:52] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1056.eqiad.wmnet with OS bullseye
[16:23:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[16:26:05] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:26:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye executed with errors: - kubernetes...
[16:27:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage
[16:27:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage
[16:28:59] <wikibugs>	 (03CR) 10Vgutierrez: aptrepo: Add Bookworm HAProxy third party repos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:30:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:30:20] <dancy>	 rzl: Doh!  The next run failed.  Looking into it.
[16:30:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:31:00] <wikibugs>	 (03PS2) 10BCornwall: package_builder: add piuparts package [puppet] - 10https://gerrit.wikimedia.org/r/956968
[16:31:14] <rzl>	 dancy: I'm around if you need anything
[16:31:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage
[16:31:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage
[16:32:16] <wikibugs>	 (03CR) 10BCornwall: package_builder: add piuparts package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall)
[16:32:50] <wikibugs>	 (03PS3) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397
[16:32:52] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:07] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2] Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse)
[16:33:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage
[16:34:31] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: rancid-differ.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage
[16:36:33] <wikibugs>	 (03PS2) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951)
[16:36:57] <wikibugs>	 10SRE, 10Growth-Team, 10Graphite: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371 (10Urbanecm_WMF)
[16:37:01] <godog>	 the netmon1003 failures are expected
[16:37:09] <jinxer-wm>	 (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[16:37:13] <wikibugs>	 (03PS2) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154)
[16:37:19] <wikibugs>	 (03CR) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:37:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] rancid: fix log dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi)
[16:39:30] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[16:42:10] <wikibugs>	 (03PS6) 10Btullis: [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:42:41] <denisse>	 !incidents
[16:42:42] <sirenbot>	 4045 (UNACKED)  EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw)
[16:42:42] <sirenbot>	 4044 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[16:42:42] <sirenbot>	 4043 (RESOLVED)  ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[16:42:42] <sirenbot>	 4042 (RESOLVED)  PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[16:42:42] <sirenbot>	 4041 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[16:42:43] <sirenbot>	 4039 (RESOLVED)  HaproxyUnavailable cache_text global sre ()
[16:42:43] <sirenbot>	 4038 (RESOLVED)  VarnishUnavailable global sre (varnish-text)
[16:42:43] <sirenbot>	 4040 (RESOLVED)  PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[16:42:43] <sirenbot>	 4037 (RESOLVED)  [7x] ProbeDown sre (probes/service)
[16:42:44] <sirenbot>	 4036 (RESOLVED)  db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[16:42:44] <sirenbot>	 4035 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[16:42:50] <denisse>	 !ack 4045
[16:42:50] <sirenbot>	 4045 (ACKED)  EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw)
[16:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:44:03] <wikibugs>	 (03CR) 10Bking: [C: 03+2] search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[16:45:51] <wikibugs>	 (03PS1) 10Ahmon Dancy: Sync ldap/ops into GitLab repos/sre group (v2) [puppet] - 10https://gerrit.wikimedia.org/r/957775 (https://phabricator.wikimedia.org/T343035)
[16:46:06] <dancy>	 rzl: Another attempt at https://gerrit.wikimedia.org/r/c/operations/puppet/+/957775
[16:46:29] <dancy>	 rzl: I'm open to suggestions on how to do it right.
[16:46:51] <rzl>	 ah damn I bet you're right
[16:46:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:47:04] <rzl>	 let's try this, but if it doesn't work, wrapping the whole thing in sh -c is the coward's easy way out :)
[16:47:13] <denisse>	 !incidents
[16:47:13] <sirenbot>	 4045 (ACKED)  EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw)
[16:47:13] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Sync ldap/ops into GitLab repos/sre group (v2) [puppet] - 10https://gerrit.wikimedia.org/r/957775 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy)
[16:47:13] <sirenbot>	 4044 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[16:47:13] <sirenbot>	 4043 (RESOLVED)  ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[16:47:14] <sirenbot>	 4042 (RESOLVED)  PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[16:47:14] <sirenbot>	 4041 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[16:47:14] <sirenbot>	 4039 (RESOLVED)  HaproxyUnavailable cache_text global sre ()
[16:47:14] <sirenbot>	 4038 (RESOLVED)  VarnishUnavailable global sre (varnish-text)
[16:47:14] <sirenbot>	 4040 (RESOLVED)  PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[16:47:15] <sirenbot>	 4037 (RESOLVED)  [7x] ProbeDown sre (probes/service)
[16:47:15] <sirenbot>	 4036 (RESOLVED)  db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[16:47:16] <sirenbot>	 4035 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[16:47:28] <dancy>	 rzl: Agreed
[16:47:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:48:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:48:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1042.eqiad.wmnet with OS bullseye
[16:48:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1042.eqiad.wmnet with OS bullseye completed: - kubernetes1042 (**PAS...
[16:48:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:48:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:48:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1041.eqiad.wmnet with OS bullseye
[16:48:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1041.eqiad.wmnet with OS bullseye completed: - kubernetes1041 (**PAS...
[16:49:12] <denisse>	 Hi rzl, sorry for pinging you but I saw you online. Do you think there's something we should do regarding the EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) alert?
[16:49:21] <denisse>	 I'm looking at our docs regarding etcd.
[16:49:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:49:55] <rzl>	 denisse: I think those hosts were just upgraded to bullseye so I'm immediately suspicious :) let me see what I can find out
[16:50:00] <rzl>	 jayme: I don't suppose you're still online?
[16:51:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:52:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:27] <wikibugs>	 (03CR) 10Btullis: "I fixed the CI issues and I updated the commit message to try to add a bit of clarity." [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:53:31] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:53:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1040.eqiad.wmnet with OS bullseye
[16:53:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye completed: - kubernetes1040 (**PAS...
[16:53:44] <rzl>	 denisse: I'm reading up on what I can but I'm not an etcd expert, sorry
[16:53:49] <rzl>	 dancy: in the meantime, puppet's done
[16:53:57] <dancy>	 thx.. Watching.
[16:54:10] <dancy>	 next run in 6 minutes
[16:55:44] <denisse>	 rzl: No worries, it's fine.
[16:56:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1043.eqiad.wmnet with OS bullseye
[16:56:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1043.eqiad.wmnet with OS bullseye
[16:56:52] <rzl>	 denisse: so, conf2005 is the host running etcdmirror, meaning it's responsible for replication between eqiad and codfw
[16:57:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1044.eqiad.wmnet with OS bullseye
[16:57:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1044.eqiad.wmnet with OS bullseye
[16:58:04] <rzl>	 I don't immediately see the cause, but the answer is yes, we should treat this as serious
[16:58:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1045.eqiad.wmnet with OS bullseye
[16:58:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1045.eqiad.wmnet with OS bullseye
[16:58:58] <rzl>	 might need to escalate to either _joe_ or akosiaris or jayme even though it's their evening, but let me keep digging and see what I can find
[17:00:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1700)
[17:00:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye
[17:00:08] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye
[17:00:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**WARN*...
[17:00:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye
[17:00:23] <dancy>	 rzl: Fixed! Thanks for your help.
[17:00:23] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:30] <rzl>	 dancy: \i/
[17:00:33] <rzl>	 er, \o/
[17:00:43] <dancy>	 hehe
[17:00:49] <dancy>	 I like \i/
[17:00:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye
[17:01:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye
[17:02:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye
[17:02:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye
[17:03:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1048.eqiad.wmnet with OS bullseye
[17:03:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1048.eqiad.wmnet with OS bullseye
[17:03:48] <wikibugs>	 (PS2) AOkoth: vrts: add ticket-test on wikimedia.org [dns] - https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027)
[17:03:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1049.eqiad.wmnet with OS bullseye
[17:04:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1049.eqiad.wmnet with OS bullseye
[17:04:58] <wikibugs>	 SRE, SRE-Access-Requests: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (RobH) Open→Resolved
[17:05:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039
[17:05:21] <stashbot>	 T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039
[17:05:23] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039
[17:10:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:10:16] <_joe_>	 rzl: what's going on with etcd?
[17:10:22] <_joe_>	 can I help?
[17:10:48] <rzl>	 _joe_: we got paged for replication on conf2005 -- looks like it was a downtime expiring but I'm not sure what state it's in
[17:11:12] <volans>	 I might be wrong but I wonder if it's a monitoring issue, the mirror unit is up and the last logs show replication
[17:11:12] <_joe_>	 rzl: so first order of business is understanding if the cluster is used by clients right now
[17:11:23] <volans>	 the alert is expr: 'up{job="etcdmirror"} != 1'
[17:11:35] <volans>	 the unit is called etcdmirror-conftool-eqiad-wmnet.service
[17:11:49] <_joe_>	 volans: that never changed
[17:11:59] <rzl>	 yeah, I was looking at logs for the systemd unit and it had some failures earlier but is healthy now
[17:12:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage
[17:12:06] <bblack>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/957395
[17:12:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage
[17:12:18] <bblack>	 ^ was a change pushed earlier, to revert not using conf2*, so I think conf2* is now in use
[17:12:32] <_joe_>	 bblack: yeah and replication works
[17:12:45] <_joe_>	 we're trying to understand why monitoring thinks otherwise
[17:12:46] <volans>	 let's write something to etcd and see if it replicates, but from logs it looks like it's healthy
[17:12:55] <_joe_>	 let me take a look
[17:13:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[17:13:31] <_joe_>	 volans: damn I'm on my half-setup laptop... just depool a mw appserver in codfw
[17:13:33] <_joe_>	 then repool it
[17:13:43] <_joe_>	 you should see it in the logs for etcdmirror
[17:14:18] <denisse>	 I was looking at the Wiki and I think this is the issue with etcd. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication
[17:14:43] <volans>	 yes replicated immediately
[17:14:44] <volans>	 INFO: Replicating key /conftool/v1/pools/codfw/appserver/nginx/mw2384.codfw.wmnet
[17:14:44] <rzl>	 Sep 14 17:14:31 conf2005 etcdmirror-conftool-eqiad-wmnet[7607]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/appserver/nginx/mw2384.codfw.wmnet at index 2457350
[17:14:47] <rzl>	 👍
[17:15:01] <volans>	 same for the pool
[17:15:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage
[17:15:14] <volans>	 so yeah I'd say monitoring problem, not a real problem
[17:15:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:15:34] <volans>	 from thanos:
[17:15:35] <volans>	 job:up:avail{job="etcdmirror", prometheus="ops", site="codfw"}
[17:15:37] <volans>	 0
[17:15:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage
[17:16:00] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (cmooney)
[17:16:00] <denisse>	 If it's a monitoring problem I'll file a task for it.
[17:16:03] <volans>	 up{cluster="etcd", instance="conf2005:8000", job="etcdmirror", prometheus="ops", site="codfw"}
[17:16:03] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (cmooney) Open→Resolved Change is now live on the ORG servers when I...
[17:16:07] <volans>	 that's 0
[17:16:27] <rzl>	 yeah, went from 1 to 0 at 14:18 and stayed 0
[17:16:37] <volans>	 ferm?
[17:16:50] <rzl>	 which tracks with https://sal.toolforge.org/log/ac8OlIoBxE1_1c7shMTD from SAL
[17:16:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[17:17:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage
[17:18:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage
[17:18:16] <volans>	 rzl: do you know if prometheus checks locally or remotely?
[17:18:26] <volans>	 for this case
[17:18:36] <_joe_>	 I found the issue
[17:18:38] <rzl>	 I don't know, sorry
[17:18:47] <_joe_>	 the web interface of etcdmirror is broken on bullseye
[17:18:59] <rzl>	 ahh okay
[17:19:17] <_joe_>	 curl localhost:8000
[17:19:23] <denisse>	 Oh, interesting.
[17:19:30] <volans>	 lol     <h1>Request did not return bytes</h1>
[17:19:36] <_joe_>	 volans:  I didn't check that immediately because you said the lag was ok?
[17:19:44] <_joe_>	 volans: yes something changed in twisted for sure
[17:19:52] <volans>	 I said the "log", not "lag" :D
[17:19:57] <_joe_>	 sigh
[17:19:57] <volans>	 sorry
[17:19:58] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:05] <_joe_>	 ok anyways
[17:20:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[17:20:12] <rzl>	 any objection to just downtiming then? sounds like a working-hours kind of issue -- the only problem is we won't have replication alerts overnight
[17:20:32] <_joe_>	 rzl: yeah I think it's a pretty critical issue if it breaks though
[17:20:42] <rzl>	 ye true
[17:20:46] <_joe_>	 it can lead to all kinds of split-brain situations
[17:21:07] <_joe_>	 I would advise to move back the client SRV records at least to just eqiad 
[17:21:12] <_joe_>	 if you downtime it
[17:21:23] <_joe_>	 pybal having a server not depooled, we can live with
[17:21:53] <rzl>	 haven't we reimaged eqiad already though? we'd just have the same problem there, right
[17:22:01] <denisse>	 One question, so if I understand correctly this is not an issue impacting our users, right?
[17:22:13] <volans>	 how didn't we notice? all etcd are on bullseye
[17:22:16] <rzl>	 denisse: correct, but it means if another issue came up that did impact our users, we wouldn't know about it
[17:22:28] <rzl>	 volans: etcdmirror only runs on one host, in the replica cluster
[17:22:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:22:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage
[17:22:54] <volans>	 rzl: sure, but I thought we would check the web interface when migrating from one OS to another :D
[17:23:00] <volans>	 beside the mirror
[17:23:06] <_joe_>	 volans: because this is the first time we run etcdmirror on bullseye
[17:23:15] <volans>	 the web interface is mirror-specific?
[17:23:48] <_joe_>	 it's part of etcdmirror, yes
[17:23:52] <volans>	 ok
[17:24:02] <_joe_>	 it's offering prom metrics and some local-consumable stats
[17:24:23] <_joe_>	 it's a 50 line file to fix https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/etcd-mirror/+/refs/heads/master/etcdmirror/rest.py
[17:24:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage
[17:24:43] <volans>	 twisted...
[17:25:07] <_joe_>	 they probably changed the method name from render_GET
[17:25:37] <volans>	 we probably went from 18.9 to 20.3
[17:27:33] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:27:49] <wikibugs>	 (CR) Urbanecm: "this is now ready" [puppet] - https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: Urbanecm)
[17:28:59] <volans>	 _joe_: probably adding .encode('utf-8') might do it
[17:29:10] <volans>	 was it running with python2 before?
[17:29:18] <_joe_>	 volans: yes
[17:29:22] <_joe_>	 and yes
[17:29:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1046.eqiad.wmnet with OS bullseye
[17:29:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye executed with errors: - kubernetes...
[17:29:35] <_joe_>	 it has been ported to python3 by alex
[17:29:43] <_joe_>	 clearly this was missing
[17:29:47] <rzl>	 oh ugh I bet you're right, I was digging through twisted release notes but that's almost certainly it
[17:30:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:30:11] <_joe_>	 rzl: yeah I saw the docs for render_GET and it's expected to return bytes
[17:30:16] <volans>	 https://stackoverflow.com/a/48320880
[17:30:19] <_joe_>	 it's ofc not explained properly
[17:30:21] <_joe_>	 but yes
[17:30:32] <rzl>	 any objection if I reach in and hot-patch it on conf2005 to see what happens? can't break it any worse than it's broken
[17:30:37] <rzl>	 if that works I'll send a puppet patch
[17:30:38] <volans>	 you can even skip the 'utf-8' if you want as it's the default
[17:30:42] <_joe_>	 rzl: go on
[17:30:53] <_joe_>	 rzl: it's not a puppet patch
[17:30:59] <_joe_>	 etcd-mirror is a deb package :)
[17:31:02] <volans>	 rzl: it's a debian package
[17:31:10] <_joe_>	 but yes hotpatch it for now
[17:31:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:31:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1043.eqiad.wmnet with OS bullseye
[17:31:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1043.eqiad.wmnet with OS bullseye completed: - kubernetes1043 (**PAS...
[17:31:46] <rzl>	 oh even better
[17:32:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:33:29] * volans going afk
[17:33:52] <rzl>	 volans: thanks <3
[17:34:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:34:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1044.eqiad.wmnet with OS bullseye
[17:34:16] <rzl>	 okay restarting etcdmirror
[17:34:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1044.eqiad.wmnet with OS bullseye completed: - kubernetes1044 (**PAS...
[17:34:37] <wikibugs>	 (CR) Volans: [C: +2] decorators: extend documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/957702 (owner: Volans)
[17:35:38] <denisse>	 volans _joe_ rzl : Thanks for the help!!
[17:35:39] <denisse>	 <3
[17:36:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage
[17:38:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye
[17:38:08] <wikibugs>	 (Merged) jenkins-bot: decorators: extend documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/957702 (owner: Volans)
[17:38:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**PASS*...
[17:38:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:38:17] <rzl>	 okay the good news is we're not getting 500s for everything, the bad news is we're getting 404s for everything
[17:38:35] <rzl>	 $ curl localhost:8000/lag
[17:38:36] <rzl>	 The desired url b'/lag' was not found
[17:39:01] <rzl>	 smells like that should be a string and not a bytes so we're missing a decode() somewhere else, I'll dig around
[17:39:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:39:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1049.eqiad.wmnet with OS bullseye
[17:39:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1049.eqiad.wmnet with OS bullseye completed: - kubernetes1049 (**PAS...
[17:40:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage
[17:40:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:40:11] <rzl>	 (although I would have expected that to happen in library code...)
[17:41:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:41:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1048.eqiad.wmnet with OS bullseye
[17:41:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1048.eqiad.wmnet with OS bullseye completed: - kubernetes1048 (**PAS...
[17:41:43] <_joe_>	 rzl: right? but twisted gonna twist
[17:42:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye
[17:42:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye
[17:42:25] <_joe_>	 rzl: I would decode the path before line 27
[17:42:31] <rzl>	 yeah I just got there too
[17:42:45] <rzl>	 hot take: this is a very silly problem
[17:43:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage
[17:44:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye
[17:44:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye
[17:46:09] <wikibugs>	 (PS3) Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190)
[17:46:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage
[17:46:17] <denisse>	 rzl: It's funny how some silly problems can have such a devastating impact. :o
[17:46:57] <denisse>	 Well, not really funny, mostly interesting.
[17:48:21] <rzl>	 and we're back!
[17:48:23] <wikibugs>	 (PS4) Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190)
[17:48:39] <rzl>	 curl localhost:8000/metrics is working for me, alert should clear on the next scrape
[17:49:05] <denisse>	 rzl: Thank you so much for your help!!
[17:49:59] <rzl>	 _joe_ and v.olans get all the credit for debugging it, I just fixed what they found :)
[17:50:17] <wikibugs>	 (CR) Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ (2 comments) [puppet] - https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) (owner: Hokwelum)
[17:50:21] <denisse>	 Thanks to the 3 of you for your help and support!! <3
[17:52:10] <jinxer-wm>	 (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[17:52:33] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:52:35] <rzl>	 🎉
[17:52:48] <rzl>	 following up with a proper patch now
[17:53:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:54:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:55:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:56:54] <wikibugs>	 (PS7) Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233)
[17:56:57] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:56:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1045.eqiad.wmnet with OS bullseye
[17:57:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1045.eqiad.wmnet with OS bullseye completed: - kubernetes1045 (**PAS...
[17:59:35] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: Jcrespo)
[17:59:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[18:01:06] <wikibugs>	 (PS1) RLazarus: Python3 fixes: return bytes from render_GET, and accept a bytes path [software/etcd-mirror] - https://gerrit.wikimedia.org/r/957784
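
The two Python 3 fixes named in the patch subject above (return bytes from render_GET, and accept a bytes path) can be illustrated with a minimal standalone sketch. This is not the etcd-mirror code: `FakeRequest`, `StatsResource`, `normalize_path`, the route table, and the `etcdmirror_lag` payload are all hypothetical names made up for illustration, and the sketch deliberately avoids importing Twisted so it runs anywhere. It only assumes what the log shows: the handler receives the path as bytes (`b'/lag'`) and must hand back bytes.

```python
from typing import Union


def normalize_path(path: Union[bytes, str]) -> str:
    """Return the request path as str, decoding it if it arrives as bytes."""
    if isinstance(path, bytes):
        return path.decode("utf-8")
    return path


class FakeRequest:
    """Hypothetical stand-in for the web framework's request object."""

    def __init__(self, path: bytes) -> None:
        self.path = path


class StatsResource:
    """Hypothetical resource; routes and payloads are invented for this sketch."""

    routes = {"/lag": "0", "/metrics": "etcdmirror_lag 0\n"}

    def render_GET(self, request: FakeRequest) -> bytes:
        # Fix 2: decode the path before the route lookup, otherwise
        # b"/lag" never matches the str key "/lag" and everything 404s.
        path = normalize_path(request.path)
        body = self.routes.get(path)
        if body is None:
            body = "The desired url {} was not found".format(path)
        # Fix 1: the handler must return bytes, not str.
        # str.encode() defaults to UTF-8, so the argument can be omitted.
        return body.encode()
```

For example, `StatsResource().render_GET(FakeRequest(b"/lag"))` returns `b"0"` in this sketch, whereas skipping the decode step would fall through to the not-found branch.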
[18:02:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:02:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[18:03:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:03:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1046.eqiad.wmnet with OS bullseye
[18:03:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye completed: - kubernetes1046 (**PAS...
[18:03:52] <wikibugs>	 (CR) Jcrespo: "This is ready for review, more context at: https://phabricator.wikimedia.org/T346233#9167913" [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: Jcrespo)
[18:06:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[18:07:54] <wikibugs>	 (PS8) Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233)
[18:18:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye
[18:19:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye completed: - kubernetes1053 (**PASS*...
[18:20:02] <wikibugs>	 (CR) BBlack: [C: +2] fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - https://gerrit.wikimedia.org/r/957352 (owner: BBlack)
[18:24:40] <bblack>	 !log cp107[56],cp202[78],cp600[19]: (one host from each cluster, at 3 sites): restarting varnish-frontend spaced out over the next ~hour for memory tweaks.
[18:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:51] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney) p:Triage→Medium
[18:25:03] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:25:08] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (cmooney)
[18:26:42] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:27:04] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@7160e27]: Deploy latest DAGs to analytics Airflow instance T340861
[18:27:13] <stashbot>	 T340861: Implement a backfill job for the dumps hourly table - https://phabricator.wikimedia.org/T340861
[18:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:27:45] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@7160e27]: Deploy latest DAGs to analytics Airflow instance T340861 (duration: 00m 40s)
[18:27:51] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:31:07] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:31:38] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:31:57] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:32:19] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney)
[18:34:14] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[18:34:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1051.eqiad.wmnet with OS bullseye
[18:34:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye completed: - kubernetes1051 (**WARN*...
[18:35:01] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (cmooney) Btw I'm assuming pdns is actually generating all of these packets.  I'm not very familiar with the overall setup and how designate pushes out changes to the t...
[18:35:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye
[18:35:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye
[18:35:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye
[18:35:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye
[18:37:13] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:37:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:38:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[18:38:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye
[18:38:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:38:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye completed: - kubernetes1050 (**WARN*...
[18:39:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye
[18:39:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye
[18:41:25] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43309/console" [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[18:44:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:44:15] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[18:44:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:45:05] <urandom>	 !log retrying Cassandra bootstrap of restbase1030-c — T331713
[18:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:09] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[18:46:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:49:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:51:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[18:51:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[18:52:02] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T346387 (10phaultfinder)
[18:52:32] <urandom>	 !log stopping bootstrap of restbase1030-c — T331713
[18:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:37] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[18:53:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[18:54:06] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:54:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[18:54:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:56:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[18:57:59] <urandom>	 !log initiating `removenode`, ID=627fe8e9-d298-43b3-a1a2-7c8a3f01370b (restbase1030-c) — T331713
[18:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:02] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[18:58:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[18:59:02] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) Update on current progress on above steps:  ~~1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).~~ ~~2. Update bo...
[18:59:05] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney)
[19:01:06] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - 10https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) (owner: 10Hokwelum)
[19:06:09] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252)
[19:06:21] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859)
[19:08:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:08:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10cmooney) Do we have any way to measure its impact?  I had a quick look at available Prometheus metrics and didn't see much corresponding to icmp (but may ha...
[19:09:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:10:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1051.eqiad.wmnet with OS bullseye
[19:11:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye
[19:11:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye completed: - kubernetes1051 (**PASS*...
[19:11:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**PASS*...
[19:14:24] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Extend the maps restart cookbook to also handle reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff)
[19:14:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye
[19:14:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye completed: - kubernetes1050 (**PASS*...
[19:20:22] <urandom>	 !log rolling Cassandra restart, RESTBase/row-B — T331713
[19:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:25] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[19:20:36] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-14,19,21,24].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[19:27:36] <wikibugs>	 (03PS1) 10Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852)
[19:27:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[19:29:26] <wikibugs>	 (03Merged) 10jenkins-bot: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[19:29:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) > some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't count packets from our own internal systems...
[19:31:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) https://grafana.wikimedia.org/d/000000513/ping-offload might be a good starting point (might need some updates/tweaking to get the exact data you wan...
[19:32:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10cmooney) >>! In T345809#9168116, @BBlack wrote: >> some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't coun...
[19:49:05] <wikibugs>	 (03PS2) 10Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852)
[20:00:05] <jouncebot>	 TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T2000).
[20:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:29] <MatmaRex>	 hi
[20:00:58] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43310/console" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[20:05:55] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-14,19,21,24].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[20:06:00] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[20:06:26] <MatmaRex>	 anyone around to deploy?
[20:06:55] <wikibugs>	 (03PS1) 10Krinkle: graphite: Remove temporary blackhole for wanobjectcache hex-like stats [puppet] - 10https://gerrit.wikimedia.org/r/957797 (https://phabricator.wikimedia.org/T178531)
[20:07:25] <wikibugs>	 (03PS1) 10Herron: remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937)
[20:13:10] <MatmaRex>	 any deployers?
[20:13:21] <RhinosF1>	 TheresNoTime: ^
[20:13:48] <RhinosF1>	 brennen, thcipriani; ^
[20:14:44] <brennen>	 MatmaRex: let me make sure i have decent connectivity to the deployment server
[20:15:10] <MatmaRex>	 thanks
[20:15:11] <thcipriani>	 I can deploy
[20:15:22] <thcipriani>	 sorry, missed the ping, thanks for the extra ping RhinosF1 :)
[20:15:52] <RhinosF1>	 thcipriani: jouncebot never pinged you
[20:15:58] <RhinosF1>	 I assumed that was deliberate
[20:16:00] <RhinosF1>	 But I guess not
[20:16:30] <RhinosF1>	 Looks like not brennen was
[20:16:34] <thcipriani>	 I have a meeting ping for this one, I think not pinging brennen was a bad find and replace on my part
[20:16:35] <RhinosF1>	 No idea if you were ever there
[20:16:44] <brennen>	 i took myself off the window today. :)
[20:16:52] <thcipriani>	 MatmaRex: are these fine to go together?
[20:17:02] <thcipriani>	 brennen: so sneaky :P
[20:17:06] <brennen>	 i keep trying to back away from this window
[20:17:26] <thcipriani>	 backport windows are sticky
[20:17:42] <MatmaRex>	 thcipriani: yeah
[20:17:46] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński)
[20:18:04] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński)
[20:20:11] <urandom>	 !log rolling Cassandra restart, RESTBase/row-C — T331713
[20:20:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:15] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[20:20:35] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[20:23:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński)
[20:24:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński)
[20:25:40] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[20:26:44] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[20:28:48] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[20:32:04] <wikibugs>	 (03Merged) 10jenkins-bot: Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński)
[20:32:07] <wikibugs>	 (03Merged) 10jenkins-bot: ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński)
[20:32:25] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]]
[20:32:31] <stashbot>	 T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859
[20:32:31] <stashbot>	 T346252: "Caught exception of type UnexpectedValueException" from visual diff when viewing non-wikitext diffs - https://phabricator.wikimedia.org/T346252
[20:33:56] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:34:23] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10BTullis) >>! In T345726#9158579, @RLazarus wrote: > Hi @joanna_borun -- does this need Infrastructure F...
[20:34:30] <thcipriani>	 ^ MatmaRex both are on mwdebug machines, check please :)
[20:34:55] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[20:34:59] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[20:36:01] <MatmaRex>	 thcipriani: VE change looks good, DT change we'll see in the logs
[20:37:37] <MatmaRex>	 (so we're good to proceed with both)
[20:37:50] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] "We have decided to take a different route for this now, so this patch can either be abandoned or refactored. Rather than move the user/gro" [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[20:38:41] <thcipriani>	 MatmaRex: thanks for checking, going
[20:38:44] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Continuing with sync
[20:44:01] <wikibugs>	 (03CR) 10Btullis: [WIP] admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[20:45:01] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] (duration: 12m 35s)
[20:45:06] <stashbot>	 T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859
[20:45:07] <stashbot>	 T346252: "Caught exception of type UnexpectedValueException" from visual diff when viewing non-wikitext diffs - https://phabricator.wikimedia.org/T346252
[20:45:08] <thcipriani>	 ^ MatmaRex should be live everywhere
[20:45:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:25] <MatmaRex>	 thanks thcipriani!
[20:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:46:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:47:36] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Brycehughes) @aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems related to this. Is there any way we can fix t...
[20:47:41] <wikibugs>	 (03CR) 10Stevemunene: [WIP] admin: Create analytics-wmde system user and airflow admin group (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[20:47:56] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[20:47:59] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[20:48:54] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10Brycehughes)
[20:50:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:51:49] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10Brycehughes) @aborrero @cmooney I'm wondering if T346177 was resolved prematurely, since most if not all of the Toolforge tools are failing to resolve now. Any ch...
[20:57:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1032.eqiad.wmnet with OS bullseye
[20:57:13] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:57:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1032.eqiad.wmnet with OS bullseye
[20:57:35] <icinga-wm>	 RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:57:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:58:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1033.eqiad.wmnet with OS bullseye
[20:58:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1033.eqiad.wmnet with OS bullseye
[20:59:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:59:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye
[20:59:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye
[20:59:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:00:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1035.eqiad.wmnet with OS bullseye
[21:00:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1035.eqiad.wmnet with OS bullseye
[21:01:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[21:01:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye
[21:02:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1037.eqiad.wmnet with OS bullseye
[21:02:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1037.eqiad.wmnet with OS bullseye
[21:02:28] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475)
[21:02:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[21:03:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye
[21:03:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1039.eqiad.wmnet with OS bullseye
[21:03:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye
[21:06:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:06:48] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper)
[21:07:33] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper)
[21:11:49] <wikibugs>	 (03PS1) 10Dduvall: gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803
[21:11:53] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper)
[21:12:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803 (owner: 10Dduvall)
[21:12:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (although it appears to me that the dispatch::web and in turn the dispatch::ldap_sync classes can also be removed?)" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[21:13:40] <ryankemper>	 !log T345475 Beginning process to bring 3 new hosts `wdqs202[3-5]` into service. Merged https://gerrit.wikimedia.org/r/957802 and running puppet on hosts
[21:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:45] <stashbot>	 T345475: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475
[21:13:50] <wikibugs>	 (03PS2) 10Dduvall: gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803
[21:14:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage
[21:15:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage
[21:15:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage
[21:15:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) nginx.service Failed on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage
[21:17:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage
[21:19:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage
[21:20:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage
[21:24:11] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[21:24:14] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[21:24:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage
[21:25:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:54] <wikibugs>	 (PS4) Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:26:19] <wikibugs>	 (CR) CI reject: [V: -1] elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:26:21] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:26:45] <urandom>	 !log rolling Cassandra restart, RESTBase/row-D — T331713
[21:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:04] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[12,17-18,23,26-27].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[21:27:25] <wikibugs>	 (PS5) Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:27:51] <wikibugs>	 (CR) CI reject: [V: -1] elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:30:08] <wikibugs>	 (PS6) Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:30:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:32:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:32:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1033.eqiad.wmnet with OS bullseye
[21:32:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1033.eqiad.wmnet with OS bullseye completed: - kubernetes1033 (**PAS...
[21:33:42] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:33:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes...
[21:34:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[21:34:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:34:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:34:27] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:34:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes...
[21:34:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage
[21:35:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:35:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:35:35] <wikibugs>	 (PS7) Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:35:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye
[21:35:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes...
[21:37:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:38:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage
[21:38:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:38:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1037.eqiad.wmnet with OS bullseye
[21:38:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1037.eqiad.wmnet with OS bullseye completed: - kubernetes1037 (**PAS...
[21:39:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:40:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:40:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1039.eqiad.wmnet with OS bullseye
[21:40:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye completed: - kubernetes1039 (**PAS...
[21:41:04] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:41:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:41:13] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:42:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:42:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1035.eqiad.wmnet with OS bullseye
[21:42:12] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:42:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1035.eqiad.wmnet with OS bullseye completed: - kubernetes1035 (**PAS...
[21:48:04] <wikibugs>	 (PS8) Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:49:38] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43313/console" [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:50:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:50:51] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[21:51:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bookworm
[21:52:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:52:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:53:53] <wikibugs>	 (CR) Bking: [C: +1] elastic: remove elastic10[48-52] from site.pp [puppet] - https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: Bking)
[21:54:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:55:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:55:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1032.eqiad.wmnet with OS bullseye
[21:55:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1032.eqiad.wmnet with OS bullseye completed: - kubernetes1032 (**PAS...
[21:58:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm) @Jclark-ctr or @VRiley-WMF  can you check these servers' eth ports. they either aren't connected or might be connected to the wrong port on the switch. thank you...
[21:59:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[22:01:21] <wikibugs>	 (PS2) Krinkle: graphite: Remove temporary blackhole for wanobjectcache hex-like stats [puppet] - https://gerrit.wikimedia.org/r/957797 (https://phabricator.wikimedia.org/T178531)
[22:05:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye
[22:06:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye
[22:06:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye
[22:06:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye
[22:06:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye
[22:06:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye
[22:11:55] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage
[22:14:58] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage
[22:16:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:20:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage
[22:21:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage
[22:21:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage
[22:21:43] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[12,17-18,23,26-27].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[22:21:47] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[22:24:36] <jinxer-wm>	 (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage
[22:27:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage
[22:27:33] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:29:36] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:29:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage
[22:30:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:32:33] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:37:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:38:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:40:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:43:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:44:09] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bookworm
[22:45:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:45:22] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1034.eqiad.wmnet with OS bullseye
[22:45:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:45:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1030.eqiad.wmnet with OS bullseye
[22:45:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye completed: - kubernetes1034 (**WAR...
[22:45:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye completed: - kubernetes1030 (**PAS...
[22:47:06] <jinxer-wm>	 (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:47:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:47:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:51:00] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:59:00] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:59:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:03:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:03:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1031.eqiad.wmnet with OS bullseye
[23:03:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye completed: - kubernetes1031 (**PAS...
[23:08:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[23:09:57] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bookworm
[23:10:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye
[23:10:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye
[23:10:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bookworm
[23:11:45] <wikibugs>	 (PS1) Andrew Bogott: Put cloudvirt200[4-6]-dev into service [puppet] - https://gerrit.wikimedia.org/r/957834 (https://phabricator.wikimedia.org/T342459)
[23:12:55] <urandom>	 !log rolling Cassandra restart, RESTBase/eqiad/row-A — T331713
[23:12:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:59] <stashbot>	 T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[23:13:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[16,19-21,28,31].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[23:13:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Put cloudvirt200[4-6]-dev into service [puppet] - https://gerrit.wikimedia.org/r/957834 (https://phabricator.wikimedia.org/T342459) (owner: Andrew Bogott)
[23:15:32] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:17:13] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:18:28] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:23:57] <wikibugs>	 SRE, Cloud-VPS, Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (cmooney) @Brycehughes that issue was resolved however there have been other changes made.  They should not have caused any issues, but I can't guarantee the probl...
[23:24:53] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[23:26:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[23:26:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[23:27:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[23:27:27] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[23:30:22] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[23:32:17] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[23:34:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[23:49:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1056.eqiad.wmnet with OS bullseye
[23:49:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye completed: - kubernetes1056 (**PASS*...
[23:54:45] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:54:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:58:03] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status