[00:05:17] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:38:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956861 [00:38:45] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956861 (owner: 10TrainBranchBot) [00:41:15] (03PS1) 10RLazarus: hieradata: Add kubeconfig files for mw-script [puppet] - 10https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553) [00:43:46] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43274/console" [puppet] - 10https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [00:44:25] (03CR) 10RLazarus: [V: 03+1 C: 03+2] hieradata: Add kubeconfig files for mw-script [puppet] - 10https://gerrit.wikimedia.org/r/957375 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [00:46:40] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:32] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:52:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956861 (owner: 
10TrainBranchBot) [00:53:00] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:11:21] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: support removal of unhandled image backups [puppet] - 10https://gerrit.wikimedia.org/r/954131 (owner: 10Andrew Bogott) [01:14:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:15:50] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:34] (03PS4) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 [01:17:45] (03Abandoned) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (owner: 10Krinkle) [01:36:06] !log starting RESTBase/Cassandra node rebuilds, cassandra-c/row D — T331713 [01:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:10] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [02:07:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:16] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [02:17:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:16] (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [02:31:50] (03PS1) 10RLazarus: admin_ng: Add mw-script namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553) [02:33:48] 10SRE, 10Traffic, 10Epic, 10User-notice: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Shizhao) [02:37:33] (JobUnavailable) firing: (9) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:49] (03CR) 10RLazarus: [C: 03+2] admin_ng: Add mw-script namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [02:46:20] (03Merged) 10jenkins-bot: admin_ng: Add mw-script namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/957377 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [02:54:28] !log rzl@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [02:56:15] !log rzl@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[02:57:41] !log rzl@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [02:58:22] !log rzl@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [02:58:54] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [03:03:03] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [03:03:54] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [03:04:32] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [03:42:46] PROBLEM - Disk space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 32561 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops [03:51:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:51:38] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:53:04] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:58:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:01:44] PROBLEM - Disk space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 63650 MB (3% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops [04:03:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:42:36] PROBLEM - Disk space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 66237 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops [05:05:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2011,2014].codfw.wmnet,pc1011.eqiad.wmnet with reason: Pre swichover tasks [05:05:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2011,2014].codfw.wmnet,pc1011.eqiad.wmnet with reason: Pre swichover tasks [05:05:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Pre swichover tasks [05:06:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Pre swichover tasks [05:06:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Pre swichover tasks [05:06:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Pre swichover tasks [05:10:57] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Pre swichover tasks [05:11:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Pre swichover tasks [05:21:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Pre swichover tasks [05:22:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Pre swichover tasks [05:23:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Pre swichover tasks [05:23:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Pre swichover tasks [05:51:42] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:53:10] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:00:08] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0600) [06:00:08] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0600). 
[06:08:49] (03CR) 10Ayounsi: [C: 03+1] Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [06:18:11] 10SRE, 10Infrastructure-Foundations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10ayounsi) @CDanis Is that still needed now that we have NEL? [06:18:55] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10ayounsi) [06:22:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Pre swichover tasks [06:22:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Pre swichover tasks [06:26:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Pre swichover tasks [06:27:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Pre swichover tasks [06:37:48] (JobUnavailable) firing: (8) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:38:34] 10SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10MoritzMuehlenhoff) Looks good [06:39:00] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (10MoritzMuehlenhoff) Looks good [06:56:00] (03PS4) 10KartikMistry: Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [06:56:34] (03CR) 10Muehlenhoff: "Looks good, two nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [07:00:04] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0700). [07:00:04] abijeet and houseofm: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:33] * kart_ will deploy abijeet's change [07:00:57] o/ [07:01:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:01:45] (03CR) 10Muehlenhoff: "Looks good (the commit message is misleading, though: piuparts has been in Debian for almost twenty years)" [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall) [07:01:55] (03Merged) 10jenkins-bot: Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:02:45] !log kartik@deploy1002 Started scap: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]] [07:02:50] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:04:20] !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug 
kubernetes deployment (accessible via k8s-experimental XWD option) [07:04:52] abijeet: can you test the patch with mwdebug now? [07:04:58] kart_, checking [07:05:00] hello :) [07:05:23] Mohd and I are doing a backport training, so we will deploy the second change [07:05:31] unless you want to join in the meeting? [07:06:11] kart_, looks good [07:06:20] abijeet: nice. Deploying.. [07:06:24] !log kartik@deploy1002 abi and kartik: Continuing with sync [07:06:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org [07:09:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org [07:12:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:13:02] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956807|Enable MinT translation service on MediaWiki - rollout #4 (T341445)]] (duration: 10m 17s) [07:13:12] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:14:49] (03CR) 10Filippo Giunchedi: "Code LGTM, though it'll need to be applied to a profile common to both frontends (logstash + OS) and backends (data nodes, OS only). For e" [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:16:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10hashar) While doing the backport & config training this morning with Mohd (T345186), we found out he has no access to the deployment server since he ha... 
[07:16:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10hashar) [07:16:56] kart_: if you are done can we proceed? :) [07:17:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:17:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [07:21:18] hashar: sorry. Please go ahead. [07:21:32] kart_: we are doing it, thank you! :) [07:21:32] (03PS6) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [07:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [07:22:01] (03CR) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:22:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: 10Mhorsey) [07:23:42] (03Merged) 10jenkins-bot: Enable Campaign Events email feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: 10Mhorsey) [07:24:08] (03PS4) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 [07:25:01] (03PS5) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 [07:27:30] (03PS6) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster 
[puppet] - 10https://gerrit.wikimedia.org/r/955909 [07:29:12] scap backport magically detects it is a beta cluster only change and happily skips the sync :)) [07:29:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We should probably also move the project to gitlab, where we have an easy way to set up the testing pipeline." [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [07:30:53] hashar: that's nice! [07:31:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 (owner: 10Muehlenhoff) [07:32:32] !log Backport & config deployment window completed. [07:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [07:36:59] (03PS2) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367 [07:37:04] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43275/console" [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:38:32] (03PS22) 10Slyngshede: P:idm allow for installation via Debian packages.
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [07:39:08] (03CR) 10Muehlenhoff: Enable cumin hosts to reach the opensearch API on logstash clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:39:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43276/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [07:40:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43277/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [07:41:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff) [07:44:06] (03PS7) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [07:44:30] (03CR) 10CI reject: [V: 04-1] Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:44:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host debmonitor2003.codfw.wmnet [07:44:51] (03PS2) 10Fabfur: add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 [07:44:53] (03CR) 10Brouberol: "Thanks for the review. I take it `src_sets` can contain variables/defs, and `srange` has to contain ip/ip ranges?" 
[puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:45:30] (03PS8) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [07:45:51] (03CR) 10Fabfur: add support for unix sockets (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [07:48:58] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43278/console" [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:50:02] (03PS2) 10JMeybohm: Remove conf2* from etcd client srv records [dns] - 10https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) [07:53:20] (03CR) 10JMeybohm: [C: 03+2] Remove conf2* from etcd client srv records [dns] - 10https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [07:54:20] (03PS2) 10Volans: decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) [07:56:48] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors [07:56:51] (03CR) 10Brouberol: "Thank you so much for the assistance!" 
[puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [07:56:52] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors [07:56:53] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors [07:56:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors [07:56:58] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors [07:57:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors [07:58:11] (03PS2) 10Majavah: nginx::status_site: allow multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/956068 [07:58:13] (03PS3) 10Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069 [07:59:13] (03CR) 10Slyngshede: [V: 03+1] P:idm allow for installation via Debian packages. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [08:00:06] jnuche and hashar: That opportune time is upon us again. Time for a MediaWiki train - Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T0800). [08:01:04] (03PS23) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721) [08:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:02:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [08:02:26] andre: morning, ready to continue with the train today? :) [08:02:56] (03CR) 10Brouberol: [C: 03+2] Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [08:03:07] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [08:03:15] jnuche: argh I am running late. Can do, sure, one moment! [08:03:38] no hurries! 
[08:04:21] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [08:06:56] RECOVERY - Disk space on restbase1027 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1027&var-datasource=eqiad+prometheus/ops [08:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:13:00] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:13:28] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [08:14:02] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [08:15:31] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728) [08:15:33] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot) [08:15:50] PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal 
[08:16:20] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957665 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot) [08:16:20] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [08:16:26] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [08:16:38] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [08:19:04] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:19:04] PROBLEM - PyBal connections to etcd on lvs5006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [08:19:06] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Degraded RAID on netmon1003 - https://phabricator.wikimedia.org/T346275 (10Peachey88) [08:19:06] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:19:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:19:20] (03PS1) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 
10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) [08:20:14] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:20:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43279/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [08:21:42] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:23:40] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956068 (owner: 10Majavah) [08:24:11] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.26 refs T343728 [08:24:15] T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728 [08:24:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:25:32] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah) 
[08:28:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:29:29] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [08:30:22] (03PS3) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367 [08:31:22] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookwork - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) Plan for testing rollout of Debian packages: Upgrade test to Bookworm: **Pre-update:** - Set idm-test1001 in maintenance mode - Merge patc... [08:32:10] (03CR) 10David Caro: "Would be nice to have some tests :/, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [08:33:31] (03PS4) 10Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069 [08:33:58] (03CR) 10Majavah: [C: 03+2] nginx::status_site: allow multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/956068 (owner: 10Majavah) [08:34:51] (03PS5) 10Majavah: prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069 [08:34:57] (03CR) 10Majavah: prometheus::nginx_exporter: manage nginx status site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah) [08:35:22] (03CR) 10David Caro: [C: 03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah) [08:35:51] (03CR) 10Majavah: [V: 03+1 C: 03+2] dynamicproxy: improve connection error pages [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah) [08:36:12] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:25] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:36:49] (03CR) 10David Caro: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah) [08:36:57] (03CR) 10Majavah: [C: 03+2] prometheus::nginx_exporter: manage nginx status site [puppet] - 10https://gerrit.wikimedia.org/r/956069 (owner: 10Majavah) [08:37:00] ^^ is that you jayme? :) [08:37:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:37:29] vgutierrez: just restarted the secondary lvs's - so probably yes [08:37:32] RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [08:37:39] yeah.. 
I must missed your !log line [08:37:56] no, I did not send it because stupid [08:38:32] !log restarted secondary lvs in codfw, eqsin, ulsfo [08:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:00] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [08:39:36] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah) [08:40:11] vgutierrez: will restart the primaries now [08:40:34] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 97 connections established with conf1007.eqiad.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [08:40:44] RECOVERY - PyBal connections to etcd on lvs5006 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [08:40:56] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/955291 (https://phabricator.wikimedia.org/T345702) (owner: 10Jbond) [08:41:28] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [08:41:48] (03CR) 10David Caro: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/928477 (owner: 10Majavah) [08:43:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:43:05] !log restarting primary lvs in codfw, eqsin, ulsfo [08:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:26] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf1007.eqiad.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [08:43:30] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf1007.eqiad.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [08:43:44] RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [08:45:28] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:45:44] (03PS2) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) [08:45:50] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on A:aqs-eqiad [08:45:54] !log restarting confd fleet wide [08:45:55] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/output/956904/43280/" [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [08:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:10] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 
connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:46:14] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:46:34] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [08:50:16] (03PS1) 10Slyngshede: IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) [08:53:58] !log +50 to prometheus eqiad k8s-staging [08:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:16] (03PS1) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [08:56:50] (ThanosRuleIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown [08:57:27] (03CR) 10Btullis: [V: 03+2 C: 03+2] Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [08:57:33] (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:58:34] 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Regarding the other domains (the ones not part of *.wikimedia.org), only `test.m.wikidata.org` and `m.wikifunctions.org` DNS records are 
configured. What should we do wit... [08:59:54] !log running build-production-images on build2001 for T344910 [08:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:57] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 [09:02:50] (03PS3) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) [09:04:16] (03CR) 10Brouberol: [C: 03+2] sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [09:04:42] (03PS2) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [09:05:26] (03PS3) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [09:05:42] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=swift.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:05:55] woot [09:06:50] (ThanosRuleIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown [09:07:04] !log installing qemu security updates on ganeti-test [09:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:33] (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:07:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff) [09:09:50] RECOVERY - Disk space on restbase1026 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops [09:10:42] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:10:58] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [09:11:30] (03CR) 10Vgutierrez: [C: 04-1] add support for unix sockets (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [09:16:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [09:17:11] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [09:18:21] marostegui: any insight? 
[09:19:04] volans: not sure what you are asking [09:19:36] nothing, see private :) [09:20:42] (ATSBackendErrorsHigh) resolved: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:22:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [09:24:38] (03PS3) 10Fabfur: add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 [09:25:33] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10MoritzMuehlenhoff) [09:27:12] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10MoritzMuehlenhoff) Plan looks good to me. [09:29:53] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/956905/43281/" [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [09:30:43] (03PS1) 10Vgutierrez: varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) [09:30:51] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [09:31:09] (03CR) 10CI reject: [V: 04-1] varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez) [09:31:30] (03CR) 10Fabfur: add support for unix sockets (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [09:32:39] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [09:33:05] (03PS2) 10Vgutierrez: varnish: Fix thread_pool_max on esams, eqsin, ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) [09:33:21] (03PS1) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) [09:36:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [09:36:38] (03CR) 10FNegri: designate nova_fixed_multi: create A record using project_id and project_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [09:39:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [09:40:33] (03PS3) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) [09:41:21] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez) [09:42:31] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/956906/43283/" [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [09:43:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [09:43:56] (03CR) 10Filippo Giunchedi: "Following this patch and the resolution of" [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [09:44:11] (03CR) 10Elukey: [C: 03+2] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 
(https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [09:44:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet [09:47:03] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Marostegui) [09:47:26] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:48:02] (03PS1) 10Filippo Giunchedi: benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140) [09:48:52] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:48:52] (03CR) 10Elukey: [C: 03+1] benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [09:49:20] !log restarted navtiming on webperf2003 to pick up changed etcd service records [09:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:52] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: sync [09:50:03] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: sync [09:51:33] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: sync [09:51:52] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync [09:52:20] !log remove the 'mediawiki.revision-score' stream form eventstreams public API - T342116 [09:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:23] T342116: Deprecate mediawiki revision-score 
stream - https://phabricator.wikimedia.org/T342116 [09:52:45] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [09:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:55:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet [09:56:02] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:42] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Degraded RAID on netmon1003 - https://phabricator.wikimedia.org/T346275 (10fgiunchedi) 05Open→03Invalid Nothing to do, host was reimaged: ` netmon1003:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid... 
[09:59:13] (03CR) 10Vgutierrez: [C: 03+1] mtail: Record bad requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [09:59:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1005.wikimedia.org [09:59:39] (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:00:04] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:00:06] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1000). 
[10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1000) [10:00:32] RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:01:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:01:26] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [10:01:26] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:01:30] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:01:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:01:51] Hmm 
parsoid what's happening to you [10:02:12] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:02:35] looking [10:02:50] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:03:00] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:03:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1005.wikimedia.org [10:03:33] are they recovered already? 
[10:03:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1006.wikimedia.org [10:03:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:04:26] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:04:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:04:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [10:04:46] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:06:15] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [10:06:30] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [10:06:52] (03PS1) 10Filippo Giunchedi: rancid: fix log dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136) [10:07:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:07:36] 
RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:07:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1006.wikimedia.org [10:10:02] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2004.codfw.wmnet with OS bullseye [10:10:08] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2004.codfw.wmnet with OS bullseye [10:11:07] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:14:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:18:18] (03PS1) 10Marostegui: install_server: Do not reimage db2194 [puppet] - 10https://gerrit.wikimedia.org/r/957686 [10:18:32] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on A:aqs-eqiad [10:19:07] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2194 [puppet] - 10https://gerrit.wikimedia.org/r/957686 (owner: 10Marostegui) [10:20:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [10:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [10:25:42] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2004.codfw.wmnet with reason: host reimage [10:27:24] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1137.eqiad.wmnet with OS bullseye [10:27:30] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) Requests are typically only coming in about 5 every 10 mins at this stage. @aborrero I did notice that the... 
[10:28:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2004.codfw.wmnet with reason: host reimage [10:30:06] (03PS1) 10Elukey: services: disable Changeprop's ORES Cache stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) [10:36:35] (03PS1) 10Majavah: hieradata: set authdns_servers for eqiad1/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/957688 [10:37:39] (03PS1) 10Elukey: Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 [10:37:41] (03PS1) 10Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957690 [10:38:27] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10taavi) >>! In T346177#9166102, @cmooney wrote: > Might it be hardcoded some places still? Instances getting NAT'd... 
[10:41:04] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: host reimage [10:43:25] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: host reimage [10:47:56] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:53:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:54:24] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:54:57] (03PS1) 10Muehlenhoff: Extend the maps restart cookbook to also handle reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 [10:56:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:01:25] 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 
10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10brouberol) FYI, I have been working on writi... [11:02:06] (03PS2) 10Muehlenhoff: Extend the maps restart cookbook to also handle reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) [11:02:21] !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [11:03:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (10MoritzMuehlenhoff) [11:04:52] !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch - T344798 [11:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:56] T344798: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 [11:08:01] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:08:15] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1137.eqiad.wmnet with OS bullseye [11:09:13] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert) 05Open→03Resolved We are now serving 5% of global traffic from mw-on-k8s. Resolving. 
[11:12:18] !log brouberol@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [11:13:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [11:14:59] (03PS1) 10Volans: tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 [11:15:01] (03PS1) 10Volans: decorators: extend documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 [11:17:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [11:17:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [11:21:28] (03CR) 10Volans: "eluk" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [11:21:39] !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch [11:22:55] lol, fat fingers [11:24:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [11:24:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet [11:24:30] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (owner: 10Elukey) [11:25:22] !log hnowlan@deploy1002 Started deploy [restbase/deploy@e8a6ae4]: Disable wikifeeds announcements healthcheck [11:26:42] (03PS1) 10BBlack: haproxy: reduce varnish maxconn to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957704 (https://phabricator.wikimedia.org/T310609) [11:27:42] (03CR) 10BBlack: [C: 03+2] esams: set frontend memory reservation to 170 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952866 (owner: 10BBlack) [11:28:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet [11:30:48] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+2] hieradata: set authdns_servers for eqiad1/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/957688 (owner: 10Majavah) [11:31:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [11:34:33] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on idm-test1001.wikimedia.org with reason: upgrade to Bookwork [11:34:57] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on idm-test1001.wikimedia.org with reason: upgrade to Bookwork [11:35:30] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@e8a6ae4]: Disable wikifeeds announcements healthcheck (duration: 10m 08s) [11:36:06] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff) [11:36:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [11:37:02] (03CR) 10Slyngshede: [C: 03+2] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [11:37:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:14] hnowlan: ahem... 
[11:37:21] ffff [11:37:27] !log brouberol@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch [11:37:36] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:38] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:46] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) @taavi @aborrero that's not a bad plan of action at all. In terms of step 4 I'm not sure we need to hold off, but in general there is no... [11:37:52] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:52] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:52] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[11:37:54] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:54] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:54] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:54] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:54] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:54] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:55] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:55] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:56] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:04] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:04] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:10] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [11:38:12] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:38:14] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:26] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:30] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:02] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:02] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:02] PROBLEM - restbase endpoints health on 
restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:04] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:06] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:06] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:18] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:18] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) htt [11:39:18] itech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:30] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:41:56] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:42:07] (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:07] (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:20] (03PS1) 10Muehlenhoff: Revert "ganeti: Avoid Ferm-specific syntax" [puppet] - 10https://gerrit.wikimedia.org/r/957719 [11:42:33] (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:43:05] 
!log hnowlan@deploy1002 Started deploy [restbase/deploy@8eb62f2]: Revert "Disable wikifeeds announcements healthcheck" [11:43:54] (03PS1) 10Majavah: acme_chief: Make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/957720 [11:43:56] (03PS1) 10Majavah: acme_chief: remove backwards compat [puppet] - 10https://gerrit.wikimedia.org/r/957721 [11:43:58] (03CR) 10BBlack: [C: 03+2] fe_mem_gb_reserved: merge esams settings [nop] [puppet] - 10https://gerrit.wikimedia.org/r/957343 (owner: 10BBlack) [11:46:18] (03CR) 10Muehlenhoff: [C: 03+2] Revert "ganeti: Avoid Ferm-specific syntax" [puppet] - 10https://gerrit.wikimedia.org/r/957719 (owner: 10Muehlenhoff) [11:46:30] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) [11:46:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:53] (03PS2) 10Arturo Borrero Gonzalez: cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) [11:47:04] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [11:47:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [11:47:56] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:48:45] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [11:49:07] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) 05Open→03Resolved a:03aborrero thanks everyone involved in the debugging and fix. [11:49:17] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@8eb62f2]: Revert "Disable wikifeeds announcements healthcheck" (duration: 06m 12s) [11:49:38] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:49:40] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [11:49:40] wikimedia.org/wiki/Services/Monitoring/restbase [11:50:06] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:50:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43287/console" [puppet] - 
10https://gerrit.wikimedia.org/r/957720 (owner: 10Majavah) [11:50:38] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10aborrero) [11:53:41] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=rest-gateway.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:53:57] hmmm hnowlan ^^ [11:54:05] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm-test1001.wikimedia.org with OS bookworm [11:54:15] vgutierrez: we're already on it, -sre [11:54:16] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm [11:54:20] volans: sorry :) [11:55:00] no prob :) [11:56:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [11:56:12] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:38] ^ that wasn't me :| [11:56:48] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:50] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:50] RECOVERY - restbase endpoints health on 
restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:50] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:57:02] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:57:07] (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:07] (ProbeDown) firing: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:12] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:57:14] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:57:33] (JobUnavailable) firing: (12) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:58:18] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:58:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:32] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:58:32] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:58:41] (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1200) [12:00:20] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:00:54] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:01:26] !log hnowlan@cumin1001 START - Cookbook sre.misc-clusters.roll-restart-restbase rolling restart_daemons on A:restbase-canary [12:01:46] (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/957724 [12:02:00] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:08] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:50] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:56] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:03:13] ah, ^ will be why https://www.mediawiki.org/api/rest_v1/page/html/Project%20talk%3AMastodon?redirect=false is 502ing then [12:03:45] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2004.codfw.wmnet with OS bullseye [12:03:52] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2004.codfw.wmnet with OS bullseye completed: - conf2004 (**WARN**) - Downtimed on Icinga/Alertmanager - Disab... 
[12:03:52] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:03:58] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:04:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [12:04:30] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:05:10] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:05:36] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:06:12] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:06:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957724 (owner: 10Muehlenhoff) [12:06:57] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [12:06:58] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} 
(Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:08:02] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:08:54] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:09:18] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:09:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: make pdns auth listen on the new ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/957722 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:10:00] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [12:10:31] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:11:41] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for
test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:11:41] (03CR) 10Sergio Gimeno: [C: 03+1] growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [12:11:46] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.misc-clusters.roll-restart-restbase (exit_code=1) rolling restart_daemons on A:restbase-canary [12:12:09] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:12:09] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:12:33] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:12:36] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10aborrero) p:05Triage→03High [12:13:35] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:13:59] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [12:14:15] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:53] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:16:11] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [12:16:45] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:17:03] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:17:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] firewall::service: Fix logic error in passing srange/drange to nftables [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:17:34] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync 
[12:17:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [12:18:03] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:11] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:42] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:19:43] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [12:19:43] s returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:53] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:22:41] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:43] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:43] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:49] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:22:53] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:53] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:01] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:09] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:09] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:11] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:11] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:15] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:17] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:19] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[12:23:33] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:39] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:43] (03PS1) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [12:24:59] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:39] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:51] (03CR) 10Muehlenhoff: [C: 03+2] firewall::service: Fix logic error in passing srange/drange to nftables [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:26:31] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:26:41] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:26:49] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:49] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:07] (ProbeDown) resolved: (2) Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:35] (03PS1) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) [12:28:35] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:35] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:42] (ATSBackendErrorsHigh) resolved: (3) ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:29:03] RECOVERY - SSH on sretest1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:29:13] RECOVERY - Check systemd state on sretest1001 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:34] (03CR) 10Ladsgroup: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (owner: 10Elukey) [12:29:46] (03PS1) 10JMeybohm: Update _etcd-server-ssl._tcp.v3.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/957729 (https://phabricator.wikimedia.org/T332010) [12:30:15] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:30:23] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:04] (03CR) 10DCausse: "we might still need a flag to disable the jobqueue based updates" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [12:32:08] (03CR) 10JMeybohm: [C: 03+2] Update _etcd-server-ssl._tcp.v3.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/957729 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [12:36:45] (03CR) 10Elukey: [C: 03+1] tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans) [12:37:58] (03CR) 10Kevin Bazira: [C: 03+1] services: disable Changeprop's ORES Cache stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [12:38:23] (03CR) 10Volans: [C: 03+2] tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans) [12:39:19] (03PS1) 10Slyngshede: P:idm move directory creation. [puppet] - 10https://gerrit.wikimedia.org/r/957730 [12:40:34] (03PS2) 10Slyngshede: P:idm move directory creation. 
[puppet] - 10https://gerrit.wikimedia.org/r/957730 [12:41:13] (03PS2) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [12:41:15] (03PS2) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) [12:42:21] (03Merged) 10jenkins-bot: tox.ini: use sphinx-build instead of setup.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957701 (owner: 10Volans) [12:43:46] (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: drop messages with dt == '-' [puppet] - 10https://gerrit.wikimedia.org/r/957682 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [12:44:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43288/console" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede) [12:44:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede) [12:45:21] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm move directory creation. [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede) [12:45:23] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43289/console" [puppet] - 10https://gerrit.wikimedia.org/r/957730 (owner: 10Slyngshede) [12:46:07] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) cloudservices1006 is now replying to DNS auth queries in the 185.15.56.162 address, which will later be handed to cloudservices1005: `l... 
[12:46:55] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:05] (03CR) 10Elukey: decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [12:48:21] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:33] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [12:53:56] (03CR) 10Volans: "reply inline" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [12:56:39] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm-test1001.wikimedia.org with OS bookworm [12:56:48] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed: - idm-t... [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1300). [13:00:06] houseofm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:42] * TheresNoTime is here [13:01:03] (03PS1) 10Slyngshede: P:idm enable bitu uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/957731 [13:01:06] but that patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/956447/) is beta-only and already deployed? [13:02:11] (cc HouseOfM) [13:02:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43290/console" [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede) [13:04:00] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [13:05:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43291/console" [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede) [13:05:14] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm enable bitu uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/957731 (owner: 10Slyngshede) [13:08:15] PROBLEM - memcached socket on mw2444 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:10:55] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1138.eqiad.wmnet with OS bullseye [13:11:43] (03PS1) 10Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) [13:11:43] claime: FYI ^^^ mw2444, it's very sluggish on ssh [13:11:54] volans: that server... 
[13:12:04] https://phabricator.wikimedia.org/T345884 [13:12:12] I think that was just a temporary blip [13:12:14] It's been a pain for a while [13:12:25] I upgraded packages there which were missed when the server was down [13:12:28] and it's currently depooled [13:12:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1139.eqiad.wmnet with OS bullseye [13:12:40] so that should recover soon [13:12:50] moritzm: it is until we can call it stable, yeah [13:13:18] It's had its CPU changed [13:13:34] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet [13:13:34] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet [13:13:36] it's a lemon then :D [13:14:21] !log installing aom security updates [13:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:33] PROBLEM - Check systemd state on conf2004 is CRITICAL: CRITICAL - degraded: The following units failed: etcd-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:59] RECOVERY - Check systemd state on conf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:13] Hmm memcached seems like it's not configured correctly on that server [13:19:14] (03PS1) 10Vgutierrez: varnish: Decrease max_connections to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957735 [13:19:16] (03PS1) 10Kamila Součková: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) [13:19:41] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2006.codfw.wmnet with OS bullseye [13:19:44] (03CR) 10Elukey: decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [13:19:47] 10SRE, 10serviceops: Migrate conf2* hosts to 
bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2006.codfw.wmnet with OS bullseye [13:20:02] moritzm: I think memcached got updated, puppet didn't run immediately after and didn't drop the override back [13:20:20] yeah, confirmed [13:21:00] ack, indeed [13:21:06] (03Abandoned) 10BBlack: haproxy: reduce varnish maxconn to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957704 (https://phabricator.wikimedia.org/T310609) (owner: 10BBlack) [13:21:11] RECOVERY - memcached socket on mw2444 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [13:21:19] (03PS3) 10Bking: site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) [13:23:15] (03CR) 10Vgutierrez: [C: 03+1] fe_mem_gb_reserved:170 for all single-backend [puppet] - 10https://gerrit.wikimedia.org/r/957344 (owner: 10BBlack) [13:23:36] (03PS5) 10Stevemunene: [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [13:24:08] (03CR) 10CI reject: [V: 04-1] [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [13:24:48] (03PS1) 10Filippo Giunchedi: benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) [13:25:30] (03PS1) 10Muehlenhoff: Add library hint for aom [puppet] - 10https://gerrit.wikimedia.org/r/957738 [13:25:37] (03CR) 10Elukey: [C: 03+1] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [13:25:39] (03PS2) 
10Muehlenhoff: Add library hint for aom [puppet] - 10https://gerrit.wikimedia.org/r/957738 [13:25:54] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1139.eqiad.wmnet with reason: host reimage [13:26:26] (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [13:26:38] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] benthos: bump parallelism for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/957737 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [13:27:14] (03CR) 10BBlack: [C: 03+1] varnish: Decrease max_connections to 10k [puppet] - 10https://gerrit.wikimedia.org/r/957735 (owner: 10Vgutierrez) [13:28:19] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:28:21] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1139.eqiad.wmnet with reason: host reimage [13:30:09] (03CR) 10Vgutierrez: [C: 03+1] "it would probably be best to remove this after purged is switched to an UDS" [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:31:30] (03CR) 10Vgutierrez: [C: 03+1] "looks good, but to be clear this doesn't impact our beta cluster (en.wikipedia.beta.wmflabs.org) but a specific instance in the traffic WM" [puppet] - 10https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:31:58] !log installing libwebp security updates on bookworm [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [13:32:43] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for aom [puppet] - 
10https://gerrit.wikimedia.org/r/957738 (owner: 10Muehlenhoff) [13:35:21] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2006.codfw.wmnet with reason: host reimage [13:38:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2006.codfw.wmnet with reason: host reimage [13:39:30] !log issue test alertmanager librenms alert - T346318 [13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:33] T346318: Fix librenms/alertmanager integration - https://phabricator.wikimedia.org/T346318 [13:40:29] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:41:31] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:42:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:43:08] (03PS2) 10BBlack: fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - 10https://gerrit.wikimedia.org/r/957352 [13:47:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:47:52] (03CR) 10BBlack: [C: 03+2] beta: haproxy->varnish single UDS config [puppet] - 10https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:47:55] 
(03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43294/console" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [13:49:25] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:17] (03CR) 10BBlack: varnish: remove TCP monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:50:37] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:53] (03CR) 10BBlack: [C: 03+2] fe_mem_gb_reserved:170 for all single-backend [puppet] - 10https://gerrit.wikimedia.org/r/957344 (owner: 10BBlack) [13:51:28] (03PS15) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [13:55:28] (03CR) 10Btullis: "The IP addresses for the flink clusters in the deployments section don't look right. 
They seem to be 1.2.3.4/32 which is presumably just a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [13:55:58] !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: (no justification provided) [13:56:10] !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: (no justification provided) (duration: 00m 11s) [13:57:33] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1138.eqiad.wmnet with OS bullseye [13:58:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2006.codfw.wmnet with OS bullseye [13:58:06] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2006.codfw.wmnet with OS bullseye completed: - conf2006 (**PASS**) - Downtimed on Icinga/Alertmanager - Disab... [13:58:47] (03CR) 10BBlack: [C: 03+2] Varnish: listen on only 1x UDS [puppet] - 10https://gerrit.wikimedia.org/r/957346 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:59:35] PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-alerts.service,librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:39] (03CR) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [14:03:25] (03CR) 10Vgutierrez: [C: 03+1] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [14:03:45] (03CR) 10BBlack: [C: 03+1] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [14:07:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:58] (03CR) 10Btullis: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [14:12:28] (03CR) 10Btullis: [C: 03+1] "This looks good to me, but you might still wish for more eyes first, given that it's an admin_ng change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [14:12:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:02] (03PS2) 10Jforrester: [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) [14:14:13] (03CR) 10Jforrester: [C: 03+2] [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester) [14:15:14] (03Merged) 10jenkins-bot: [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester) [14:16:40] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [14:17:05] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:17:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:51] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:18:13] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host conf2005.codfw.wmnet with OS bullseye [14:18:21] SRE, serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host conf2005.codfw.wmnet with OS bullseye [14:18:32] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:18:40] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:19:20] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:21:48] (PS1) JMeybohm: Revert "Remove conf2* from etcd client srv records" [dns] - https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010) [14:22:21] (PS1) JMeybohm: Revert "Switch pybals from conf2 to conf1" [puppet] - https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010) [14:22:33] (JobUnavailable) firing: (10) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:39] (PS2) Volans: decorators: extend documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/957702 [14:23:09] (CR) Volans: "addressed comment" [software/pywmflib] - https://gerrit.wikimedia.org/r/957702 (owner: Volans) [14:23:39] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host 
kubernetes1027.eqiad.wmnet with OS bullseye [14:24:03] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1027.eqiad.wmnet with OS bullseye [14:25:57] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 10 NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43297/console" [puppet] - https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010) (owner: JMeybohm) [14:26:34] (CR) Muehlenhoff: [C: +2] ganeti: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/957724 (owner: Muehlenhoff) [14:27:33] (JobUnavailable) firing: (6) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:27:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1028.eqiad.wmnet with OS bullseye [14:27:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1029.eqiad.wmnet with OS bullseye [14:27:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye [14:27:43] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye [14:27:46] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1028.eqiad.wmnet with OS bullseye [14:27:48] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1029.eqiad.wmnet with OS bullseye [14:27:50] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye [14:27:54] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye [14:32:28] !log installing qemu security updates on ganeti-test cluster [14:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:41] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2005.codfw.wmnet with reason: host reimage [14:33:56] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (kamila) [14:34:09] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Jdforrester-WMF) [14:36:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2005.codfw.wmnet with reason: host reimage [14:37:15] (CR) BBlack: [C: +2] OpenSSL 3 compat for update-ocsp script [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: BBlack) [14:37:24] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:31] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1027.eqiad.wmnet with reason: 
host reimage [14:38:42] SRE, ops-codfw, serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Vgutierrez) Resolved→Open a:Jhancock.wm→None not sure why I've been pinged in this task but anyways, the new disk needs to be added to the RAID, as it's still degraded: `/de... [14:39:19] (PS4) Bking: site.pp: add new search-loader hostnames [puppet] - https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) [14:39:23] (CR) Bking: site.pp: add new search-loader hostnames (2 comments) [puppet] - https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) (owner: Bking) [14:40:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1027.eqiad.wmnet with reason: host reimage [14:40:35] (CR) Bking: [C: +2] site.pp: add new search-loader hostnames [puppet] - https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) (owner: Bking) [14:41:14] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1028.eqiad.wmnet with reason: host reimage [14:41:21] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1029.eqiad.wmnet with reason: host reimage [14:41:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2004.codfw.wmnet [14:42:11] (CR) Vgutierrez: [C: +2] varnish: Decrease max_connections to 10k [puppet] - https://gerrit.wikimedia.org/r/957735 (owner: Vgutierrez) [14:43:06] !log varnish: decrease max_connections to 10k per backend server globally [14:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1028.eqiad.wmnet with reason: host reimage [14:45:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM 
testvm2004.codfw.wmnet [14:45:50] SRE, Data-Platform-SRE, Infrastructure-Foundations, vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (bking) [14:46:05] SRE, Data-Platform-SRE, Infrastructure-Foundations, vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (bking) Thanks Moritz...closing on our board. [14:46:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1029.eqiad.wmnet with reason: host reimage [14:47:46] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader2002.codfw.wmnet [14:47:47] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:47:51] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (kamila) [14:48:02] (PS1) Alexandros Kosiaris: Add /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - https://gerrit.wikimedia.org/r/957744 (https://phabricator.wikimedia.org/T346055) [14:48:45] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [14:48:48] (PS2) AOkoth: wmnet: add ticket-test -> vrts1002 [dns] - https://gerrit.wikimedia.org/r/957322 [14:50:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2005.codfw.wmnet [14:50:48] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader2002.codfw.wmnet - bking@cumin1001" [14:50:52] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader1002.eqiad.wmnet [14:50:53] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:51:14] !log cmooney@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cloudservices1005.wikimedia.org with reason: test before full decom [14:51:28] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.wikimedia.org with reason: test before full decom [14:51:30] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader2002.codfw.wmnet - bking@cumin1001" [14:51:30] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:30] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader2002.codfw.wmnet on all recursors [14:51:34] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader2002.codfw.wmnet on all recursors [14:51:35] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fd552e4c-12f5-4380-9775-a70e560609fd) set by cmooney@cumin1001 for 2:00:00 on 1 h... 
[14:52:01] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader2002.codfw.wmnet - bking@cumin1001" [14:52:51] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader2002.codfw.wmnet - bking@cumin1001" [14:53:11] (CR) AOkoth: [C: +2] wmnet: add ticket-test -> vrts1002 [dns] - https://gerrit.wikimedia.org/r/957322 (owner: AOkoth) [14:53:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2005.codfw.wmnet [14:54:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet [14:55:22] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host search-loader2002.codfw.wmnet with OS bullseye [14:55:48] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:55:53] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [14:58:13] (Abandoned) BBlack: Add dumps mapping to cache_upload [puppet] - https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: BBlack) [14:58:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet [14:58:21] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [14:58:21] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:21] !log bking@cumin1001 START - Cookbook 
sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors [14:58:25] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors [14:58:33] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:58:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2005.codfw.wmnet with OS bullseye [14:58:46] SRE, serviceops, Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host conf2005.codfw.wmnet with OS bullseye completed: - conf2005 (**WARN**) - Downtimed on Icinga/... [14:58:50] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi) [15:00:01] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:01:03] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:01:08] (PS1) AOkoth: vrts: add ticket-test on wikimedia.org [dns] - https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) [15:01:44] RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:45] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:03:17] (PS1) AOkoth: ats: add ticket-test [puppet] - https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) [15:03:25] (PS2) Arturo 
Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [15:04:04] (PS3) BBlack: fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - https://gerrit.wikimedia.org/r/957352 [15:05:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [15:06:21] (PS1) Herron: dispatch::web: add ensure param and ensure => absent [puppet] - https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) [15:06:28] (PS1) Alexandros Kosiaris: donate: Move into dedicated docroot [puppet] - https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055) [15:06:56] (CR) CI reject: [V: -1] donate: Move into dedicated docroot [puppet] - https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055) (owner: Alexandros Kosiaris) [15:07:06] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:07:06] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:06] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors [15:07:10] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors [15:07:17] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host search-loader1002.eqiad.wmnet [15:08:02] !log cp[45]*: restart varnish frontends in all ulsfo + eqsin nodes for memory size changes ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/957344 ), slowly over the next 24h via cumin [15:09:40] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host search-loader1002.eqiad.wmnet [15:09:41] 
!log bking@cumin1001 START - Cookbook sre.dns.netbox [15:09:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet [15:10:11] (PS1) Andrew Bogott: backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - https://gerrit.wikimedia.org/r/957752 [15:10:41] (CR) CI reject: [V: -1] backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - https://gerrit.wikimedia.org/r/957752 (owner: Andrew Bogott) [15:11:48] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:12:10] (PS2) Andrew Bogott: backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - https://gerrit.wikimedia.org/r/957752 [15:12:50] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:12:50] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:50] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache search-loader1002.eqiad.wmnet on all recursors [15:12:53] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-loader1002.eqiad.wmnet on all recursors [15:13:21] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:13:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1031.eqiad.wmnet with OS bullseye [15:13:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1030.eqiad.wmnet with OS bullseye [15:13:45] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [15:13:49] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [15:13:53] (CR) Andrew Bogott: [C: +2] backy2: apply David's patch to fix sqlalchemy >=1.4 [puppet] - https://gerrit.wikimedia.org/r/957752 (owner: Andrew Bogott) [15:13:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:13:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1028.eqiad.wmnet with OS bullseye [15:14:07] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1028.eqiad.wmnet with OS bullseye completed: - kubernetes1028 (**WARN*... 
[15:14:08] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM search-loader1002.eqiad.wmnet - bking@cumin1001" [15:14:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:14:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1029.eqiad.wmnet with OS bullseye [15:14:19] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1029.eqiad.wmnet with OS bullseye completed: - kubernetes1029 (**WARN*... [15:15:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host search-loader1002.eqiad.wmnet with OS bullseye [15:15:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:15:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1027.eqiad.wmnet with OS bullseye [15:15:12] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1027.eqiad.wmnet with OS bullseye completed: - kubernetes1027 (**WARN*... 
[15:15:16] (PS2) JMeybohm: Revert "Remove conf2* from etcd client srv records" [dns] - https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010) [15:16:31] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr) [15:16:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1054.eqiad.wmnet with OS bullseye [15:17:04] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1055.eqiad.wmnet with OS bullseye [15:17:05] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye [15:17:11] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye [15:17:20] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye [15:17:26] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye [15:17:27] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye [15:17:27] !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: (no justification provided) [15:17:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye [15:17:32] !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: (no justification 
provided) (duration: 00m 05s) [15:17:33] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye [15:17:39] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye [15:17:59] (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43300/console" [puppet] - https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: Herron) [15:19:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:15] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on search-loader2002.codfw.wmnet with reason: host reimage [15:20:37] (CR) JMeybohm: [C: +2] Revert "Remove conf2* from etcd client srv records" [dns] - https://gerrit.wikimedia.org/r/957394 (https://phabricator.wikimedia.org/T332010) (owner: JMeybohm) [15:22:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [15:22:11] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [15:23:25] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on search-loader2002.codfw.wmnet with 
reason: host reimage [15:24:34] !log restarted navtiming on webperf2003 to pick up changed etcd service records [15:24:39] !log restarting confd fleet wide [15:25:37] (CR) Andrew Bogott: designate nova_fixed_multi: create A record using project_id and project_name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [15:25:39] (PS1) Herron: dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [15:26:44] (PS6) Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [15:26:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on search-loader1002.eqiad.wmnet with reason: host reimage [15:27:09] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimedia.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (RobH) Email sent, cc'd @aborrero so they can stay apprised of progress. If this... 
[15:27:40] (PS3) Andrew Bogott: designate nova_fixed_multi: create A recs using project_id and project_name [puppet] - https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) [15:27:42] (PS7) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) [15:28:52] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (aborrero) [15:29:10] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (aborrero) fixing typo, it should be `ns0.openstack.eqiad1.wikimediacloud.org` [15:29:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [15:30:26] (PS2) Herron: dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [15:30:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye [15:31:14] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye [15:31:15] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on search-loader1002.eqiad.wmnet 
with reason: host reimage [15:31:55] (CR) Muehlenhoff: dispatch: remove puppetization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: Herron) [15:32:16] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage [15:32:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage [15:32:37] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [15:34:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [15:36:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [15:36:11] (PS3) Herron: dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [15:36:22] (CR) JMeybohm: [V: +1 C: +2] Revert "Switch pybals from conf2 to conf1" [puppet] - https://gerrit.wikimedia.org/r/957395 (https://phabricator.wikimedia.org/T332010) (owner: JMeybohm) [15:36:25] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host search-loader2002.codfw.wmnet with OS bullseye [15:36:25] !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host search-loader2002.codfw.wmnet [15:36:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage [15:36:51] jouncebot: nowandnext [15:36:51] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [15:36:51] In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1600) [15:37:25] (PS1) 
Urbanecm: listTaskCounts: Push total task counts to statsd for all tasks [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) [15:37:31] (PS4) Herron: dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [15:37:55] !log running puppet on lvs[2011-2014].codfw.wmnet,lvs[5004-5006].eqsin.wmnet,lvs[4008-4010].ulsfo.wmnet [15:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:27] (PS1) Urbanecm: linkTaskCounts: Stop producing per-topic statsd data [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) [15:38:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage [15:38:47] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: Urbanecm) [15:38:57] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: Urbanecm) [15:39:57] (PS5) Herron: dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [15:40:24] PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [15:41:18] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal 
[15:41:23] (03CR) 10Herron: dispatch: remove puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [15:42:35] !log restarting secondary lvs in codfw, eqsin, ulsfo [15:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:58] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [15:43:32] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:43:52] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:43:55] (03PS1) 10Bking: search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039) [15:44:04] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:44:34] !log restarting primary lvs in codfw, eqsin, ulsfo [15:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:04] (03CR) 10CI reject: [V: 04-1] listTaskCounts: Push total task counts to statsd for all tasks [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [15:45:06] (03CR) 10CI reject: [V: 04-1] linkTaskCounts: Stop producing per-topic statsd data [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: 10Urbanecm) [15:45:28] RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 
connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [15:45:44] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:22] RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:46:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [15:47:14] (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [15:47:19] (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957396 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [15:47:30] (03CR) 10Urbanecm: [V: 03+2] "failed with:" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957758 (https://phabricator.wikimedia.org/T345210) (owner: 10Urbanecm) [15:47:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]] [15:47:57] T345210: Stop sending per-topic task counts to statsd/Grafana - https://phabricator.wikimedia.org/T345210 [15:47:57] T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204 [15:47:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host search-loader1002.eqiad.wmnet with OS bullseye [15:47:59] !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host search-loader1002.eqiad.wmnet 
[15:47:59] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye [15:48:01] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye [15:48:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye [15:48:06] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [15:48:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye [15:48:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye [15:48:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye [15:48:40] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:48:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:13] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:49:25] !log 
jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [15:49:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2003.codfw.wmnet with OS bullseye [15:49:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye completed: -... [15:49:55] 10SRE, 10Traffic, 10GitLab (Project Migration): Move purged repository from Gerrit to GitLab - https://phabricator.wikimedia.org/T346305 (10Aklapper) + #gitlab-migration [15:51:14] 10SRE, 10Traffic, 10GitLab (Project Migration): Move purged repository from Gerrit to GitLab - https://phabricator.wikimedia.org/T346305 (10Aklapper) [15:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:58] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2005.codfw.wmnet [15:51:58] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2005.codfw.wmnet [15:52:14] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet [15:52:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet [15:52:25] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2006.codfw.wmnet [15:52:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2006.codfw.wmnet [15:53:01] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
an-worker1139.eqiad.wmnet with OS bullseye [15:53:15] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:53:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) 05Open→03Resolved @MatthewVernon Hey I really tried to make this work as JBOD, but the hardware just doesn't work that way. I did wha... [15:53:59] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:54:02] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43304/console" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [15:54:48] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:54:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:54:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1056.eqiad.wmnet with OS bullseye [15:55:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye completed: - kubernetes1056 (**WARN*... 
[15:55:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:55:09] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [15:55:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1054.eqiad.wmnet with OS bullseye [15:55:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye completed: - kubernetes1054 (**WARN*... [15:55:29] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]] (duration: 07m 37s) [15:55:34] T345210: Stop sending per-topic task counts to statsd/Grafana - https://phabricator.wikimedia.org/T345210 [15:55:34] T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204 [15:55:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:55:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1055.eqiad.wmnet with OS bullseye [15:55:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [15:55:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye completed: - kubernetes1055 (**PASS*... [15:56:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:56:57] (03CR) 10Ebernhardson: [C: 03+1] search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [15:57:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye [15:57:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye [15:59:54] dancy: I'm in a meeting that might run a couple minutes over, but I see it :) be right with you [15:59:58] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10JMeybohm) [16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:15] rzl: can I also ask for a merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/956813 [16:00:17] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:00:20] (It should be very simple) [16:00:32] 10SRE, 10serviceops, 10Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10JMeybohm) 05Open→03Resolved This is done and clients (confd/pybal) are back on the cluster. I tried to capture the process here (minus the need to add a new SAN to the cergen cert whi... [16:00:56] I'm on my way home so can't test but it's very very simple [16:01:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:31] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [16:01:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:55] (03PS1) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) [16:02:42] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) [16:03:15] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [16:03:19] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [16:03:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1031.eqiad.wmnet with OS 
bullseye [16:03:25] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "failed in reimage script said manually run it - robh@cumin1001 - T342533" [16:03:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1030.eqiad.wmnet with OS bullseye [16:03:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [16:03:34] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [16:03:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... 
[16:03:48] T342533: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 [16:04:11] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "failed in reimage script said manually run it - robh@cumin1001 - T342533" [16:04:24] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43305/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:04:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [16:04:50] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:05:32] (03PS2) 10Cathal Mooney: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [16:06:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [16:07:49] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [16:08:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [16:09:24] 10SRE, 10SRE-Access-Requests: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (10RobH) [16:09:57] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:10:12] (03PS1) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) [16:10:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (10RobH) p:05Triage→03Medium [16:10:27] (03PS1) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 [16:10:32] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10JMeybohm) conf2 nodes are on bullseye now and the metrics do look better now, as expected [16:10:43] (03CR) 10CI reject: [V: 04-1] adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH) [16:10:46] (03PS1) 10Andrea Denisse: Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398 [16:10:58] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43306/console" [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy) [16:11:00] (03PS2) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to 
codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 [16:11:28] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:11:39] (03PS2) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) [16:11:46] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43307/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:11:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse) [16:11:52] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398 (owner: 10Andrea Denisse) [16:12:04] (03CR) 10Elukey: [C: 03+1] decorators: extend documentation (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [16:12:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:12:09] (03CR) 10CI reject: [V: 04-1] adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH) [16:12:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye [16:12:17] (03CR) 10Andrea Denisse: [C: 03+2] Revert "netmon: Failover from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/957398 (owner: 10Andrea Denisse) [16:12:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye completed: - kubernetes1053 (**WARN*... [16:12:31] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:12:36] rzl: I am around if you need me. [16:12:47] (03CR) 10Cathal Mooney: [C: 03+2] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [16:12:49] (03CR) 10Andrea Denisse: [C: 03+2] Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse) [16:12:53] (03PS3) 10RobH: adding cookbook to datacenter ops rights [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) [16:12:59] third time's the charm [16:12:59] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43308/console" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:13:42] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [16:14:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye [16:14:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1041.eqiad.wmnet with OS bullseye [16:14:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1042.eqiad.wmnet with OS bullseye [16:14:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host 
kubernetes1040.eqiad.wmnet with OS bullseye [16:14:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1041.eqiad.wmnet with OS bullseye [16:14:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1042.eqiad.wmnet with OS bullseye [16:15:02] dancy: I was double-checking whether semicolons work that way on an ExecStart line but of course they do :) merging now [16:15:21] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy) [16:15:59] (famously it's *not* a shell, which trips people up sometimes) [16:16:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [16:16:59] (03PS1) 10Cathal Mooney: Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326) [16:17:03] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update - volans@cumin1001" [16:17:19] PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-discovery-all.service,librenms-poller-all.service,librenms-poller-all.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:52] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update - volans@cumin1001" [16:18:17] RhinosF1: the SRE who's most familiar with 
wikistats is on leave -- if you're able to get a code review from someone who knows what you're changing, I would really much prefer that :) but if that's impossible, let me know [16:18:23] (03Abandoned) 10Cathal Mooney: wikimediacloud.org: drop 208.80.154.148 from ns0.openstack [dns] - 10https://gerrit.wikimedia.org/r/957767 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [16:18:44] rzl: yes it's me looking after wikistats while they are off. [16:18:45] dancy: merged and deployed to all three gitlab hosts, test at will [16:18:55] I'm cleaning up things that have been long broken [16:19:44] (03CR) 10Volans: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/957769 (https://phabricator.wikimedia.org/T346368) (owner: 10RobH) [16:19:57] yeah, I appreciate that! but it's still best to have a second pair of eyes on anything, and I'm not informed enough to do that for you [16:20:04] Ok [16:20:15] I will try and poke Arnold, he knows bits [16:20:16] rzl: The timer will run again in 10 minutes. I'll keep an eye on it. [16:20:41] (03CR) 10Hnowlan: [C: 03+1] Extend the maps restart cookbook to also handle reboots (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff) [16:20:56] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:21:09] RhinosF1: okay sounds good -- if it turns out there's no one and you're completely stuck, let me know [16:21:15] Will do [16:21:19] but in that case I will insist on you being around to at least test it :) [16:21:22] (03CR) 10Andrew Bogott: [C: 03+1] Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326) (owner: 10Cathal Mooney) [16:21:22] !log Failing over from netmon2002 (codfw) to netmon1003 (eqiad). 
[16:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:38] rzl: yes, sadly the stupid traffic has spoilt my plan [16:21:54] And decided to make my journey home much longer than normal [16:23:03] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:23:33] (03CR) 10Cathal Mooney: [C: 03+2] Remove manual entry for ns0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957770 (https://phabricator.wikimedia.org/T346326) (owner: 10Cathal Mooney) [16:23:41] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:23:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1056.eqiad.wmnet with OS bullseye [16:23:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [16:26:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1040.eqiad.wmnet with OS bullseye [16:26:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye executed with errors: - kubernetes... 
[16:27:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage [16:27:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage [16:28:59] (03CR) 10Vgutierrez: aptrepo: Add Bookworm HAProxy third party repos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:30:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye [16:30:20] rzl: Doh! The next run failed. Looking into it. [16:30:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye [16:31:00] (03PS2) 10BCornwall: package_builder: add piuparts package [puppet] - 10https://gerrit.wikimedia.org/r/956968 [16:31:14] dancy: I'm around if you need anything [16:31:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage [16:31:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage [16:32:16] (03CR) 10BCornwall: package_builder: add piuparts package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall) [16:32:50] (03PS3) 10Andrea Denisse: Revert "wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 [16:32:52] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:07] (03CR) 10Andrea Denisse: [V: 03+2] Revert 
"wikimedia: Failover LibreNMS from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/957397 (owner: 10Andrea Denisse) [16:33:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage [16:34:31] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: rancid-differ.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage [16:36:33] (03PS2) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) [16:36:57] 10SRE, 10Growth-Team, 10Graphite: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371 (10Urbanecm_WMF) [16:37:01] the netmon1003 failures are expected [16:37:09] (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [16:37:13] (03PS2) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) [16:37:19] (03CR) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:37:21] (03CR) 10Filippo Giunchedi: [C: 03+2] rancid: fix log dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/957685 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [16:39:30] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & 
Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [16:42:10] (03PS6) 10Btullis: [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:42:41] !incidents [16:42:42] 4045 (UNACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [16:42:42] 4044 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [16:42:42] 4043 (RESOLVED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw) [16:42:42] 4042 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad) [16:42:42] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [16:42:43] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre () [16:42:43] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text) [16:42:43] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [16:42:43] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service) [16:42:44] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged) [16:42:44] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad) [16:42:50] !ack 4045 [16:42:50] 4045 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [16:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:44:03] (03CR) 10Bking: [C: 03+2] search-loader: Move new VMs into prod role [puppet] - 10https://gerrit.wikimedia.org/r/957762 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [16:45:51] (03PS1) 
10Ahmon Dancy: Sync ldap/ops into GitLab repos/sre group (v2) [puppet] - 10https://gerrit.wikimedia.org/r/957775 (https://phabricator.wikimedia.org/T343035) [16:46:06] rzl: Another attempt at https://gerrit.wikimedia.org/r/c/operations/puppet/+/957775 [16:46:29] rzl: I'm open to suggestions on how to do it right. [16:46:51] ah damn I bet you're right [16:46:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:47:04] let's try this, but if it doesn't work, wrapping the whole thing in sh -c is the coward's easy way out :) [16:47:13] !incidents [16:47:13] 4045 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [16:47:13] (03CR) 10RLazarus: [C: 03+2] Sync ldap/ops into GitLab repos/sre group (v2) [puppet] - 10https://gerrit.wikimedia.org/r/957775 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy) [16:47:13] 4044 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [16:47:13] 4043 (RESOLVED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw) [16:47:14] 4042 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad) [16:47:14] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [16:47:14] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre () [16:47:14] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text) [16:47:14] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad) [16:47:15] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service) [16:47:15] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged) [16:47:16] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad) [16:47:28] rzl: Agreed [16:47:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: 
Host reimage - jhancock@cumin2002" [16:48:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:48:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1042.eqiad.wmnet with OS bullseye [16:48:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1042.eqiad.wmnet with OS bullseye completed: - kubernetes1042 (**PAS... [16:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:48:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:48:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1041.eqiad.wmnet with OS bullseye [16:48:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1041.eqiad.wmnet with OS bullseye completed: - kubernetes1041 (**PAS... [16:49:12] Hi rzl, sorry for pinging you but I saw you online. Do you think there's something we should do regarding the EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) alert? 
[16:49:21] I'm looking at our docs regarding etcd. [16:49:31] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:49:55] denisse: I think those hosts were just upgraded to bullseye so I'm immediately suspicious :) let me see what I can find out [16:50:00] jayme: I don't suppose you're still online? [16:51:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:52:01] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:27] (03CR) 10Btullis: "I fixed the CI issues and I updated the commit message to try to add a bit of clarity." [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:53:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:53:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1040.eqiad.wmnet with OS bullseye [16:53:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1040.eqiad.wmnet with OS bullseye completed: - kubernetes1040 (**PAS... [16:53:44] denisse: I'm reading up on what I can but I'm not an etcd expert, sorry [16:53:49] dancy: in the meantime, puppet's done [16:53:57] thx.. Watching. 
[16:54:10] next run in 6 minutes [16:55:44] rzl: No worries, it's fine. [16:56:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1043.eqiad.wmnet with OS bullseye [16:56:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1043.eqiad.wmnet with OS bullseye [16:56:52] denisse: so, conf2005 is the host running etcdmirror, meaning it's responsible for replication between eqiad and codfw [16:57:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1044.eqiad.wmnet with OS bullseye [16:57:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1044.eqiad.wmnet with OS bullseye [16:58:04] I don't immediately see the cause but, the answer is yes we should treat this as serious [16:58:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1045.eqiad.wmnet with OS bullseye [16:58:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1045.eqiad.wmnet with OS bullseye [16:58:58] might need to escalate to either _joe_ or akosiaris or jayme even though it's their evening, but let me keep digging and see what I can find [17:00:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:00:06] Deploy window MediaWiki infrastucture (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T1700) [17:00:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye [17:00:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye [17:00:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**WARN*... [17:00:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye [17:00:23] rzl: Fixed! Thanks for your help. [17:00:23] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:30] dancy: \i/ [17:00:33] er, \o/ [17:00:43] hehe [17:00:49] I like \i/ [17:00:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye [17:01:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye [17:02:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye [17:02:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye [17:03:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1048.eqiad.wmnet with OS bullseye [17:03:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1048.eqiad.wmnet with OS bullseye [17:03:48] (03PS2) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) [17:03:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1049.eqiad.wmnet with OS bullseye [17:04:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1049.eqiad.wmnet with OS bullseye [17:04:58] 10SRE, 10SRE-Access-Requests: datacenter ops group right addition: sre.puppet.sync-netbox-hiera cookbook - https://phabricator.wikimedia.org/T346368 (10RobH) 05Open→03Resolved [17:05:09] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039 [17:05:21] T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 [17:05:23] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039 [17:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:10:16] <_joe_> rzl: what's going on with etcd? [17:10:22] <_joe_> can I help? [17:10:48] _joe_: we got paged for replication on conf2005 -- looks like it was a downtime expiring but I'm not sure what state it's in [17:11:12] I might be wrong but I wonder if it's a monitoring issue, the mirror unit is up and the last logs show replication [17:11:12] <_joe_> rzl: so first order of business is understanding if the cluster is used by clients right now [17:11:23] the alert is expr: 'up{job="etcdmirror"} != 1' [17:11:35] the unit is called etcdmirror-conftool-eqiad-wmnet.service [17:11:49] <_joe_> volans: that never changed [17:11:59] yeah, I was looking at logs for the systemd unit and it had some failures earlier but is healthy now [17:12:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage [17:12:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/957395 [17:12:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage [17:12:18] ^ was a change pushed earlier, to revert not using conf2*, so I think conf2* is now in use [17:12:32] <_joe_> bblack: yeah and replication works [17:12:45] <_joe_> we're trying to understand why monitoring thinks otherwise [17:12:46] let's write something to etcd and see if it replicates, but from logs it looks like it's healthy [17:12:55] <_joe_> let me take a look [17:13:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jhancock.wm) [17:13:31] <_joe_> volans: damn I'm on my half-setup laptop... 
just depool a mw appserver in codfw [17:13:33] <_joe_> then repool it [17:13:43] <_joe_> you should see it in the logs for etcdmirror [17:14:18] I was looking at the Wiki and I think this is the issue with etcd. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication [17:14:43] yes replicated immediately [17:14:44] INFO: Replicating key /conftool/v1/pools/codfw/appserver/nginx/mw2384.codfw.wmnet [17:14:44] Sep 14 17:14:31 conf2005 etcdmirror-conftool-eqiad-wmnet[7607]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/appserver/nginx/mw2384.codfw.wmnet at index 2457350 [17:14:47] 👍 [17:15:01] same for the pool [17:15:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage [17:15:14] so yeah I'd say monitoring problem, not a real problem [17:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:15:34] from thanos: [17:15:35] job:up:avail{job="etcdmirror", prometheus="ops", site="codfw"} [17:15:37] 0 [17:15:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage [17:16:00] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) [17:16:00] If it's a monitoring problem I'll file a task for it. 
[17:16:03] up{cluster="etcd", instance="conf2005:8000", job="etcdmirror", prometheus="ops", site="codfw"} [17:16:03] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 - https://phabricator.wikimedia.org/T346326 (10cmooney) 05Open→03Resolved Change is now live on the ORG servers when I... [17:16:07] that's 0 [17:16:27] yeah, went from 1 to 0 at 14:18 and stayed 0 [17:16:37] ferm? [17:16:50] which tracks with https://sal.toolforge.org/log/ac8OlIoBxE1_1c7shMTD from SAL [17:16:51] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [17:17:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage [17:18:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage [17:18:16] rzl: do you know if prometheus checks locally or remotely? [17:18:26] for this case [17:18:36] <_joe_> I found the issue [17:18:38] I don't know, sorry [17:18:47] <_joe_> the web interface of etcdmirror is broken on bullseye [17:18:59] ahh okay [17:19:17] <_joe_> curl localhost:8000 [17:19:23] Oh, interesting. [17:19:30] lol

Request did not return bytes
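
For context: under Python 3, Twisted's web server answers with the error page above when a resource's render_GET returns a str instead of bytes. A minimal sketch of that distinction, using plain functions rather than Twisted itself; `render_broken`, `render_fixed`, `check_body` and the metric text are hypothetical names for illustration, not etcd-mirror's actual code.

```python
def render_broken() -> str:
    # Python 2 habit: hand back a native string as the HTTP body.
    return "example_metric 0\n"


def render_fixed() -> bytes:
    # Python 3: encode the body first. str.encode() defaults to UTF-8.
    return render_broken().encode()


def check_body(body) -> bytes:
    # Hypothetical stand-in for the server-side type check on the body.
    if not isinstance(body, bytes):
        return b"Request did not return bytes"
    return body
```

Passing `render_broken()` through `check_body` yields the error text, while `render_fixed()` passes through unchanged.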

[17:19:36] <_joe_> volans: I didn't check that immediately because you said the lag was ok? [17:19:44] <_joe_> volans: yes something changed in twisted for sure [17:19:52] I said the "log", not "lag" :D [17:19:57] <_joe_> sigh [17:19:57] sorry [17:19:58] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:05] <_joe_> ok anyways [17:20:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [17:20:12] any objection to just downtiming then? sounds like a working-hours kind of issue -- the only problem is we won't have replication alerts overnight [17:20:32] <_joe_> rzl: yeah I think it's a pretty critical issue if it breaks though [17:20:42] ye true [17:20:46] <_joe_> it can lead to all kinds of split-brain situations [17:21:07] <_joe_> I would advise to move back the client SRV records at least to just eqiad [17:21:12] <_joe_> if you downtime it [17:21:23] <_joe_> pybal having a server not depooled, we can live with [17:21:53] haven't we reimaged eqiad already though? we'd just have the same problem there, right [17:22:01] One question, so if I understand correctly this is not an issue impacting our users, right? [17:22:13] how didn't we notice? 
all etcd are on bullseye [17:22:16] denisse: correct, but it means if another issue came up that did impact our users, we wouldn't know about it [17:22:28] volans: etcdmirror only runs on one host, in the replica cluster [17:22:33] (JobUnavailable) firing: (4) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:22:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage [17:22:54] rzl: sure, but I thought we would check the web interface when migrating from one OS to another :D [17:23:00] beside the mirror [17:23:06] <_joe_> volans: because this is the first time we run etcdmirror on bullseye [17:23:15] the web interface is mirror-specific? [17:23:48] <_joe_> it's part of etcdmirror, yes [17:23:52] ok [17:24:02] <_joe_> it's offering prom metrics and some local-consumable stats [17:24:23] <_joe_> it's a 50 line file to fix https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/etcd-mirror/+/refs/heads/master/etcdmirror/rest.py [17:24:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage [17:24:43] twisted... 
[17:25:07] <_joe_> they probably changed the method name from render_GET [17:25:37] we probably went from 18.9 to 20.3 [17:27:33] (JobUnavailable) firing: (5) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:27:49] (03CR) 10Urbanecm: "this is now ready" [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [17:28:59] _joe_: probably adding .encode('utf-8') might do it [17:29:10] was it running with python2 before? [17:29:18] <_joe_> volans: yes [17:29:22] <_joe_> and yes [17:29:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1046.eqiad.wmnet with OS bullseye [17:29:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye executed with errors: - kubernetes... [17:29:35] <_joe_> it has been ported to python3 by alex [17:29:43] <_joe_> clearly this was missing [17:29:47] oh ugh I bet you're right, I was digging through twisted release notes but that's almost certainly it [17:30:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:30:11] <_joe_> rzl: yeah I saw the docs for render_GET and it's expected to return bytes [17:30:16] https://stackoverflow.com/a/48320880 [17:30:19] <_joe_> it's ofc not explained properly [17:30:21] <_joe_> but yes [17:30:32] any objection if I reach in and hot-patch it on conf2005 to see what happens? 
can't break it any worse than it's broken [17:30:37] if that works I'll send a puppet patch [17:30:38] you can even skip the 'utf-8' if you want as it's the default [17:30:42] <_joe_> rzl: go on [17:30:53] <_joe_> rzl: it's not a puppet patch [17:30:59] <_joe_> etcd-mirror is a deb package :) [17:31:02] rzl: it's a debian package [17:31:10] <_joe_> but yes hotpatch it for now [17:31:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:31:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1043.eqiad.wmnet with OS bullseye [17:31:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1043.eqiad.wmnet with OS bullseye completed: - kubernetes1043 (**PAS... 
[17:31:46] oh even bette [17:31:48] r [17:32:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:33:29] * volans going afk [17:33:52] volans: thanks <3 [17:34:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:34:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1044.eqiad.wmnet with OS bullseye [17:34:16] okay restarting etcdmirror [17:34:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1044.eqiad.wmnet with OS bullseye completed: - kubernetes1044 (**PAS... [17:34:37] (03CR) 10Volans: [C: 03+2] decorators: extend documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [17:35:38] volans _joe_ rzl : Thanks for the help!! [17:35:39] <3 [17:36:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage [17:38:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye [17:38:08] (03Merged) 10jenkins-bot: decorators: extend documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/957702 (owner: 10Volans) [17:38:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**PASS*... 
[17:38:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:38:17] okay the good news is we're not getting 500s for everything, the bad news is we're getting 404s for everything [17:38:35] $ curl localhost:8000/lag [17:38:36] The desired url b'/lag' was not found [17:39:01] smells like that should be a string and not a bytes so we're missing a decode() somewhere else, I'll dig around [17:39:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:39:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1049.eqiad.wmnet with OS bullseye [17:39:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1049.eqiad.wmnet with OS bullseye completed: - kubernetes1049 (**PAS... [17:40:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage [17:40:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:40:11] (although I would have expected that to happen in library code...) 
[17:41:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:41:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1048.eqiad.wmnet with OS bullseye [17:41:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1048.eqiad.wmnet with OS bullseye completed: - kubernetes1048 (**PAS... [17:41:43] <_joe_> rzl: right? but twisted gonna twist [17:42:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye [17:42:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye [17:42:25] <_joe_> rzl: I would decode the path before line 27 [17:42:31] yeah I just got there too [17:42:45] hot take: this is a very silly problem [17:43:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage [17:44:00] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye [17:44:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye [17:46:09] (03PS3) 10Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - 10https://gerrit.wikimedia.org/r/957664 
(https://phabricator.wikimedia.org/T345190) [17:46:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage [17:46:17] rzl: It's funny how some silly problems can have such a devastating impact. :o [17:46:57] Well, not really funny, mostly interesting. [17:48:21] and we're back! [17:48:23] (03PS4) 10Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - 10https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) [17:48:39] curl localhost:8000/metrics is working for me, alert should clear on the next scrape [17:49:05] rzl: Thank you so much for your help!! [17:49:59] _joe_ and v.olans get all the credit for debugging it, I just fixed what they found :) [17:50:17] (03CR) 10Hokwelum: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) (owner: 10Hokwelum) [17:50:21] Thanks to the 3 of you for your help and support!! 
<3 [17:52:10] (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [17:52:33] (JobUnavailable) firing: (5) Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:52:35] 🎉 [17:52:48] following up with a proper patch now [17:53:13] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:54:19] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:55:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:56:54] (03PS7) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [17:56:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:56:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1045.eqiad.wmnet with OS bullseye [17:57:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jhancock@cumin2002 for host kubernetes1045.eqiad.wmnet with OS bullseye completed: - kubernetes1045 (**PAS... [17:59:35] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [17:59:55] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [18:01:06] (03PS1) 10RLazarus: Python3 fixes: return bytes from render_GET, and accept a bytes path [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 [18:02:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:02:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [18:03:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:03:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1046.eqiad.wmnet with OS bullseye [18:03:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1046.eqiad.wmnet with OS bullseye completed: - kubernetes1046 (**PAS... 
[18:03:52] (03CR) 10Jcrespo: "This is ready for review, more context at: https://phabricator.wikimedia.org/T346233#9167913" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [18:06:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jhancock.wm) [18:07:54] (03PS8) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [18:18:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye [18:19:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye completed: - kubernetes1053 (**PASS*... [18:20:02] (03CR) 10BBlack: [C: 03+2] fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - 10https://gerrit.wikimedia.org/r/957352 (owner: 10BBlack) [18:24:40] !log cp107[56],cp202[78],cp600[19]: (one host from each cluster, at 3 sites): restarting varnish-frontend spaced out over the next ~hour for memory tweaks. 
[18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] 10SRE, 10Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) p:05Triage→03Medium [18:25:03] 10SRE, 10Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:25:08] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) [18:26:42] 10SRE, 10Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:27:04] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@7160e27]: Deploy latest DAGs to analytics Airflow instance T340861 [18:27:13] T340861: Implement a backfill job for the dumps hourly table - https://phabricator.wikimedia.org/T340861 [18:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:27:45] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@7160e27]: Deploy latest DAGs to analytics Airflow instance T340861 (duration: 00m 40s) [18:27:51] 10SRE, 10Cloud-VPS: cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:31:07] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. 
address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:31:38] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:31:57] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:32:19] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [18:34:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:34:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1051.eqiad.wmnet with OS bullseye [18:34:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye completed: - kubernetes1051 (**WARN*... [18:35:01] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) Btw I'm assuming pdns is actually generating all of these packets. I'm not very familiar with the overall setup and how designate pushes out changes to the t... 
[18:35:38] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye [18:35:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye [18:35:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye [18:35:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye [18:37:13] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:38:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:38:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye [18:38:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:38:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host 
kubernetes1050.eqiad.wmnet with OS bullseye completed: - kubernetes1050 (**WARN*... [18:39:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye [18:39:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye [18:41:25] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43309/console" [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [18:44:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:15] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [18:44:52] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:45:05] !log retrying Cassandra bootstrap of restbase1030-c — T331713 [18:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:09] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [18:46:18] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:49:03] 
(ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:30] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [18:51:34] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [18:52:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T346387 (10phaultfinder) [18:52:32] !log stopping bootstrap of restbase1030-c — T331713 [18:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:37] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [18:53:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [18:54:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:54:07] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [18:54:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:56:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [18:57:59] !log initiating `removenode`, ID=627fe8e9-d298-43b3-a1a2-7c8a3f01370b (restbase1030-c) — T331713 [18:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:02] T331713: Migrate restbase servers to Bullseye 
- https://phabricator.wikimedia.org/T331713 [18:58:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [18:59:02] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) Update on current progress on above steps: ~~1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).~~ ~~2. Update bo... [18:59:05] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) [19:01:06] (03CR) 10Krinkle: [C: 03+1] Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - 10https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) (owner: 10Hokwelum) [19:06:09] (03PS1) 10Bartosz Dziewoński: Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) [19:06:21] (03PS1) 10Bartosz Dziewoński: ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) [19:08:27] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:08:28] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10cmooney) Do we have any way to measure its impact?
I had a quick look at available Prometheus metrics and didn't see much corresponding to icmp (but may ha... [19:09:33] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:10:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1051.eqiad.wmnet with OS bullseye [19:11:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye [19:11:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye completed: - kubernetes1051 (**PASS*... [19:11:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**PASS*... [19:14:24] (03CR) 10Eevans: [C: 03+1] Extend the maps restart cookbook to also handle reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff) [19:14:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye [19:14:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye completed: - kubernetes1050 (**PASS*...
[19:20:22] !log rolling Cassandra restart, RESTBase/row-B — T331713 [19:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:25] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [19:20:36] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-14,19,21,24].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [19:27:36] (03PS1) 10Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) [19:27:56] (03CR) 10Cathal Mooney: [C: 03+2] Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [19:29:26] (03Merged) 10jenkins-bot: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [19:29:29] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) > some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't count packets from our own internal systems... [19:31:53] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) https://grafana.wikimedia.org/d/000000513/ping-offload might be a good starting point (might need some updates/tweaking to get the exact data you wan... [19:32:37] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10cmooney) >>! 
In T345809#9168116, @BBlack wrote: >> some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't coun... [19:49:05] (03PS2) 10Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) [20:00:05] TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230914T2000). [20:00:05] MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:29] hi [20:00:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43310/console" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [20:05:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-14,19,21,24].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [20:06:00] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [20:06:26] anyone around to deploy? [20:06:55] (03PS1) 10Krinkle: graphite: Remove temporary blackhole for wanobjectcache hex-like stats [puppet] - 10https://gerrit.wikimedia.org/r/957797 (https://phabricator.wikimedia.org/T178531) [20:07:25] (03PS1) 10Herron: remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) [20:13:10] any deployers? 
[20:13:21] TheresNoTime: ^ [20:13:48] brennen, thcipriani; ^ [20:14:44] MatmaRex: let me make sure i have decent connectivity to the deployment server [20:15:10] thanks [20:15:11] I can deploy [20:15:22] sorry, missed the ping, thanks for the extra ping RhinosF1 :) [20:15:52] thcipriani: jouncebot never pinged you [20:15:58] I assumed that was deliberate [20:16:00] But I guess not [20:16:30] Looks like not brennen was [20:16:34] I have a meeting ping for this one, I think not pinging brennen was a bad find and replace on my part [20:16:35] No idea if you were ever there [20:16:44] i took myself off the window today. :) [20:16:52] MatmaRex: are these fine to go together? [20:17:02] brennen: so sneaky :P [20:17:06] i keep trying to back away from this window [20:17:26] backport windows are sticky [20:17:42] thcipriani: yeah [20:17:46] (03CR) 10Thcipriani: [C: 03+2] Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński) [20:18:04] (03CR) 10Thcipriani: [C: 03+2] ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński) [20:20:11] !log rolling Cassandra restart, RESTBase/row-C — T331713 [20:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [20:20:35] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [20:23:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 
(https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński) [20:24:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński) [20:25:40] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [20:26:44] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [20:28:48] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [20:32:04] (03Merged) 10jenkins-bot: Don't offer visual diffs for non-wikitext pages [extensions/VisualEditor] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957399 (https://phabricator.wikimedia.org/T346252) (owner: 10Bartosz Dziewoński) [20:32:07] (03Merged) 10jenkins-bot: ThreadItemStore: Add details to row insertion exceptions [extensions/DiscussionTools] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957400 (https://phabricator.wikimedia.org/T343859) (owner: 10Bartosz Dziewoński) [20:32:25] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] [20:32:31] T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859 [20:32:31] T346252: "Caught exception of type UnexpectedValueException" from visual diff when viewing non-wikitext diffs - https://phabricator.wikimedia.org/T346252 [20:33:56] !log thcipriani@deploy1002 thcipriani and matmarex: Backport 
for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD [20:33:56] option) [20:34:23] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10BTullis) >>! In T345726#9158579, @RLazarus wrote: > Hi @joanna_borun -- does this need Infrastructure F... [20:34:30] ^ MatmaRex both are on mwdebug machines, check please :) [20:34:55] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [20:34:59] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [20:36:01] thcipriani: VE change looks good, DT change we'll see in the logs [20:37:37] (so we're good to proceed with both) [20:37:50] (03CR) 10Btullis: [C: 04-1] "We have decided to take a different route for this now, so this patch can either be abandoned or refactored. 
Rather than move the user/gro" [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [20:38:41] MatmaRex: thanks for checking, going [20:38:44] !log thcipriani@deploy1002 thcipriani and matmarex: Continuing with sync [20:44:01] (03CR) 10Btullis: [WIP] admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [20:45:01] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] (duration: 12m 35s) [20:45:06] T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859 [20:45:07] T346252: "Caught exception of type UnexpectedValueException" from visual diff when viewing non-wikitext diffs - https://phabricator.wikimedia.org/T346252 [20:45:08] ^ MatmaRex should be live everywhere [20:45:09] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:25] thanks thcipriani! 
[20:45:34] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:46:33] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:47:36] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Brycehughes) @aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems related to this. Is there any way we can fix t... [20:47:41] (03CR) 10Stevemunene: [WIP] admin: Create analytics-wmde system user and airflow admin group (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [20:47:56] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [20:47:59] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [20:48:54] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10Brycehughes) [20:50:34] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:51:49] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10Brycehughes) @aborrero @cmooney I'm wondering if 
T346177 was resolved prematurely, since most if not all of the Toolforge tools are failing to resolve now. Any ch... [20:57:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1032.eqiad.wmnet with OS bullseye [20:57:13] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:57:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1032.eqiad.wmnet with OS bullseye [20:57:35] RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:57:39] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:58:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1033.eqiad.wmnet with OS bullseye [20:58:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1033.eqiad.wmnet with OS bullseye [20:59:05] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:59:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye [20:59:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye [20:59:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1035.eqiad.wmnet with OS bullseye [21:00:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1035.eqiad.wmnet with OS bullseye [21:01:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [21:01:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye [21:02:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1037.eqiad.wmnet with OS bullseye [21:02:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1037.eqiad.wmnet with OS bullseye [21:02:28] (03PS1) 10Ryan Kemper: wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) [21:02:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [21:03:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye [21:03:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1039.eqiad.wmnet with OS bullseye [21:03:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye [21:06:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:06:48] (03CR) 10Bking: [C: 03+1] wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [21:07:33] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [21:11:49] (03PS1) 10Dduvall: gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803 [21:11:53] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: bring wdqs20[3-5] into service [puppet] - 10https://gerrit.wikimedia.org/r/957802 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [21:12:17] (03CR) 10CI reject: [V: 04-1] gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803 (owner: 10Dduvall) [21:12:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (although it appears to me that the dispatch::web and in turn the dispatch::ldap_sync classes can also be removed?)" [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [21:13:40] !log T345475 Beginning process to bring 3 new hosts `wdqs202[3-5]` into service. 
Merged https://gerrit.wikimedia.org/r/957802 and running puppet on hosts [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:45] T345475: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 [21:13:50] (03PS2) 10Dduvall: gitlab: Fix permissions of Gemfile.local [puppet] - 10https://gerrit.wikimedia.org/r/957803 [21:14:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage [21:15:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage [21:15:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage [21:15:42] (SystemdUnitFailed) firing: (2) nginx.service Failed on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage [21:17:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage [21:19:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage [21:20:42] (SystemdUnitFailed) firing: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:22:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage [21:24:11] !log 
eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[16,20,22,25].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [21:24:14] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [21:24:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage [21:25:42] (SystemdUnitFailed) firing: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:54] (03PS4) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:26:19] (03CR) 10CI reject: [V: 04-1] elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:26:21] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:26:45] !log rolling Cassandra restart, RESTBase/row-D — T331713 [21:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:04] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[12,17-18,23,26-27].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [21:27:25] (03PS5) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:27:51] (03CR) 10CI reject: [V: 04-1] elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 
(https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:30:08] (03PS6) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:30:42] (SystemdUnitFailed) resolved: (12) nginx.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:32:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:32:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1033.eqiad.wmnet with OS bullseye [21:32:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1033.eqiad.wmnet with OS bullseye completed: - kubernetes1033 (**PAS... [21:33:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye [21:33:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes... 
[21:34:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [21:34:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye [21:34:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye [21:34:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye [21:34:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes... [21:34:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage [21:35:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye [21:35:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye [21:35:35] (03PS7) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:35:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1034.eqiad.wmnet with OS bullseye [21:35:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors: - kubernetes... [21:37:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:38:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage [21:38:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:38:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1037.eqiad.wmnet with OS bullseye [21:38:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1037.eqiad.wmnet with OS bullseye completed: - kubernetes1037 (**PAS... 
[21:39:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:40:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:40:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1039.eqiad.wmnet with OS bullseye [21:40:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye completed: - kubernetes1039 (**PAS... [21:41:04] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:41:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:41:13] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:42:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:42:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1035.eqiad.wmnet with OS bullseye [21:42:12] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:42:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1035.eqiad.wmnet with OS bullseye completed: - kubernetes1035 (**PAS... 
[21:48:04] (03PS8) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:49:38] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43313/console" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:50:46] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:50:51] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [21:51:16] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bookworm [21:52:04] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:52:48] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:53:53] (03CR) 10Bking: [C: 03+1] elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [21:54:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:55:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:55:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1032.eqiad.wmnet with OS bullseye [21:55:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1032.eqiad.wmnet with OS bullseye completed: - kubernetes1032 (**PAS... [21:58:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jhancock.wm) @Jclark-ctr or @VRiley-WMF can you check these servers' eth ports. they either aren't connected or might be connected to the wrong port on the switch. thank you... [21:59:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jhancock.wm) [22:01:21] (03PS2) 10Krinkle: graphite: Remove temporary blackhole for wanobjectcache hex-like stats [puppet] - 10https://gerrit.wikimedia.org/r/957797 (https://phabricator.wikimedia.org/T178531) [22:05:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye [22:06:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye [22:06:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye [22:06:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye [22:06:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - 
https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye [22:06:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye [22:11:55] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [22:14:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [22:16:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage [22:21:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage [22:21:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage [22:21:43] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[12,17-18,23,26-27].codfw.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [22:21:47] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [22:24:36] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:25:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage [22:27:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage [22:27:33] (JobUnavailable) firing: (6) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:29:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage [22:30:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:33] (JobUnavailable) firing: (6) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:32:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:22] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:38:34] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:40:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:43:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:44:09] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bookworm [22:45:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:45:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1034.eqiad.wmnet with OS bullseye [22:45:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:45:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1030.eqiad.wmnet with OS bullseye [22:45:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye completed: - kubernetes1034 (**WAR... 
[22:45:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye completed: - kubernetes1030 (**PAS... [22:47:06] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:47:21] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:00] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:59:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:59:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:03:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host kubernetes1031.eqiad.wmnet with OS bullseye [23:03:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye completed: - kubernetes1031 (**PAS... [23:08:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [23:09:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bookworm [23:10:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye [23:10:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye [23:10:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bookworm [23:11:45] (03PS1) 10Andrew Bogott: Put cloudvirt200[4-6]-dev into service [puppet] - 10https://gerrit.wikimedia.org/r/957834 (https://phabricator.wikimedia.org/T342459) [23:12:55] !log rolling Cassandra restart, RESTBase/eqiad/row-A — T331713 [23:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:59] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [23:13:06] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[16,19-21,28,31].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [23:13:19] (03CR) 10Andrew Bogott: [C: 03+2] Put cloudvirt200[4-6]-dev into service [puppet] - 10https://gerrit.wikimedia.org/r/957834 (https://phabricator.wikimedia.org/T342459) (owner: 10Andrew Bogott) [23:15:32] 
!log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:17:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:18:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:23:57] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10cmooney) @Brycehughes that issue was resolved however there have been other changes made. They should not have caused any issues, but I can't guarantee the probl... [23:24:53] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [23:26:09] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [23:26:15] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [23:27:07] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [23:27:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [23:30:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [23:32:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [23:34:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [23:49:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1056.eqiad.wmnet with OS bullseye [23:49:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: 
Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye completed: - kubernetes1056 (**PASS*... [23:54:45] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:54:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:58:03] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status