[00:00:20] (CR) SD hehua: [C:+1] Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[00:05:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:02] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088799 (owner: TrainBranchBot)
[00:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1089281
[00:38:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1089281 (owner: TrainBranchBot)
[01:08:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1089297
[01:08:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1089297 (owner: TrainBranchBot)
[01:11:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1263:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1263 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:13:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1089281 (owner: TrainBranchBot)
[01:43:21] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1089297 (owner: TrainBranchBot)
[02:16:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1263:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1263 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:22:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1263:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1263 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1263:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1263 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:40:07] (PS5) Krinkle: Add title-case mapping to support migration to PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: Scott French)
[02:40:17] (CR) Krinkle: [C:+1] "Acknowledged" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: Scott French)
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:25] RESOLVED: SystemdUnitFailed: 
rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:58:04] (PS3) KartikMistry: Update recommendation-api to 2024-11-08-142328-production [deployment-charts] - https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T379037)
[06:29:34] SRE, Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519 (revi) NEW p:Triage→High
[06:29:45] * revi facepalm
[06:37:55] SRE, Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10308249 (revi) At least I realized I did something wrong and closed the browser before everyone got removed, but looks like those with email A to somewhe...
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:10:29] (CR) Kevin Bazira: [C:+1] Update recommendation-api to 2024-11-08-142328-production [deployment-charts] - https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T379037) (owner: KartikMistry)
[07:10:40] (CR) KartikMistry: [C:+2] Update 
recommendation-api to 2024-11-08-142328-production [deployment-charts] - https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T379037) (owner: KartikMistry)
[07:11:54] (Merged) jenkins-bot: Update recommendation-api to 2024-11-08-142328-production [deployment-charts] - https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T379037) (owner: KartikMistry)
[07:15:13] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:18:35] (CR) Giuseppe Lavagetto: [C:+2] fetch_external_clouds_vendors_nets: compatibility with conftool 4.0 [puppet] - https://gerrit.wikimedia.org/r/1083781 (https://phabricator.wikimedia.org/T376877) (owner: Giuseppe Lavagetto)
[07:27:03] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:29:08] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:31:21] (CR) Muehlenhoff: [C:+2] Deprecate system::role for dumps roles [puppet] - https://gerrit.wikimedia.org/r/1088526 (owner: Muehlenhoff)
[07:32:40] (PS1) Giuseppe Lavagetto: external_clouds_vendors: fix typo, remove "repo" parameter [puppet] - https://gerrit.wikimedia.org/r/1089603 (https://phabricator.wikimedia.org/T374723)
[07:33:41] (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1089603 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[07:34:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088325 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[07:35:48] (PS1) Slyngshede: Release v0.1.1 [software/bitu] - https://gerrit.wikimedia.org/r/1089604
[07:36:57] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4486/co" [puppet] - https://gerrit.wikimedia.org/r/1089603 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[07:37:47] (CR) Giuseppe Lavagetto: [V:+1 C:+2] external_clouds_vendors: fix typo, remove "repo" parameter [puppet] - https://gerrit.wikimedia.org/r/1089603 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[07:39:54] (PS1) Fabfur: hiera: enable haproxykafka on eqsin [puppet] - https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578)
[07:43:01] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[07:43:05] (CR) Muehlenhoff: [C:+2] Remove ganeti role from to-be-decommed servers [puppet] - https://gerrit.wikimedia.org/r/1088325 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[07:46:33] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1089604 (owner: Slyngshede)
[07:49:59] (PS1) Muehlenhoff: Sync list of servers in Hiera as well [puppet] - https://gerrit.wikimedia.org/r/1089606
[07:51:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet
[07:52:53] (CR) Muehlenhoff: [C:+2] Sync list of servers in Hiera as well [puppet] - https://gerrit.wikimedia.org/r/1089606 (owner: Muehlenhoff)
[07:52:56] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti1039 to 
ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10308319 (ops-monitoring-bot) Draining ganeti1011.eqiad.wmnet of running VMs
[07:54:40] SRE, Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10308324 (Ladsgroup) This is a massive UX problem, not you-problem. Many people accidentally clicked on it including yours truly. We can restore it from b...
[08:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T0800). nyaa~
[08:00:05] Hamishcz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:39] (PS2) Slyngshede: Netfilter: Route alerts for cloud hosts to WMCS. [alerts] - https://gerrit.wikimedia.org/r/1087434
[08:00:50] :)
[08:01:33] (PS1) Muehlenhoff: profile::mariadb::ferm_lists: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1089607
[08:03:10] (PS1) Giuseppe Lavagetto: puppetserver: do not try to validate requestctl [puppet] - https://gerrit.wikimedia.org/r/1089608 (https://phabricator.wikimedia.org/T374723)
[08:05:34] (PS2) Giuseppe Lavagetto: puppetserver: do not try to validate requestctl [puppet] - https://gerrit.wikimedia.org/r/1089608 (https://phabricator.wikimedia.org/T374723)
[08:05:34] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1089608 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[08:06:06] * Hamishcz here, in case if I missed something :)
[08:08:23] (CR) Giuseppe Lavagetto: [C:+2] puppetserver: do not try to validate requestctl [puppet] - https://gerrit.wikimedia.org/r/1089608 (https://phabricator.wikimedia.org/T374723) 
(owner: Giuseppe Lavagetto)
[08:09:01] Hamishcz: i can deploy today
[08:09:07] good morning!
[08:09:49] (CR) Urbanecm: [C:+2] Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[08:10:18] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[08:10:35] (Merged) jenkins-bot: Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[08:11:14] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1089182|Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki (T379500)]]
[08:11:17] T379500: Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki - https://phabricator.wikimedia.org/T379500
[08:13:28] urbanecm: good morning :)
[08:13:43] long time no see huh? 
[08:13:44] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10308348 (MoritzMuehlenhoff)
[08:15:49] yep
[08:15:52] still building
[08:16:43] (PS1) Giuseppe Lavagetto: Update to latest MR [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1089623
[08:16:56] np, wait for i
[08:16:57] t
[08:17:03] (PS2) Fabfur: hiera: enable haproxykafka on eqsin [puppet] - https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578)
[08:17:10] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Update to latest MR [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1089623 (owner: Giuseppe Lavagetto)
[08:17:52] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Update to latest - oblivian@cumin1002"
[08:17:54] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Update to latest - oblivian@cumin1002
[08:18:25] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Update to latest - oblivian@cumin1002
[08:18:26] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Update to latest - oblivian@cumin1002"
[08:20:47] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[08:22:45] !log urbanecm@deploy2002 urbanecm, hamishz: Backport for [[gerrit:1089182|Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki (T379500)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:22:48] T379500: Allow wgGroupsRemoveFromSelf for templateeditor, 
confirmed, and abusefilter-helper in zhwiki - https://phabricator.wikimedia.org/T379500
[08:22:51] finally
[08:22:54] Hamishcz: please test
[08:24:02] urbanecm: confirmed good for me :)
[08:24:06] great!
[08:24:07] many thanks to u
[08:24:07] !log urbanecm@deploy2002 urbanecm, hamishz: Continuing with sync
[08:26:23] (PS1) Slyngshede: C:apereo_cas Disable registry cleaner. [puppet] - https://gerrit.wikimedia.org/r/1089638
[08:26:47] (PS2) Slyngshede: C:apereo_cas disable registry cleaner [puppet] - https://gerrit.wikimedia.org/r/1089638
[08:27:50] (PS2) Varnent: Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. [mediawiki-config] - https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026)
[08:27:52] (CR) Urbanecm: [C:+2] Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. [mediawiki-config] - https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: Varnent)
[08:27:55] (PS2) Varnent: Update Wikimedia Foundation primary address. [mediawiki-config] - https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417)
[08:27:58] (CR) Urbanecm: [C:+2] Update Wikimedia Foundation primary address. [mediawiki-config] - https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417) (owner: Varnent)
[08:28:37] (Merged) jenkins-bot: Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. [mediawiki-config] - https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: Varnent)
[08:28:41] (Merged) jenkins-bot: Update Wikimedia Foundation primary address. [mediawiki-config] - https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417) (owner: Varnent)
[08:30:45] (CR) Slyngshede: [V:+1 C:+2] P:netbox remove CAS authentication leftovers. 
[puppet] - https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: Slyngshede)
[08:32:14] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1089182|Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki (T379500)]] (duration: 20m 59s)
[08:32:17] T379500: Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki - https://phabricator.wikimedia.org/T379500
[08:32:20] Hamishcz: should be live
[08:32:40] ah yes
[08:32:46] report to zhwiki now
[08:32:49] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1088628|Update Wikimedia Foundation primary address. (T379417)]], [[gerrit:1082559|Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. (T378026)]]
[08:32:52] appreciate for notification
[08:32:53] T379417: Update address for Wikimedia Foundation - https://phabricator.wikimedia.org/T379417
[08:32:53] T378026: Update favicon for Office Wiki to use general Foundation (black) favicon - https://phabricator.wikimedia.org/T378026
[08:35:00] !log urbanecm@deploy2002 urbanecm, varnent: Backport for [[gerrit:1088628|Update Wikimedia Foundation primary address. (T379417)]], [[gerrit:1082559|Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. (T378026)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:35:26] !log urbanecm@deploy2002 urbanecm, varnent: Continuing with sync
[08:40:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088628|Update Wikimedia Foundation primary address. (T379417)]], [[gerrit:1082559|Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. 
(T378026)]] (duration: 07m 15s)
[08:40:09] T379417: Update address for Wikimedia Foundation - https://phabricator.wikimedia.org/T379417
[08:40:09] T378026: Update favicon for Office Wiki to use general Foundation (black) favicon - https://phabricator.wikimedia.org/T378026
[08:40:15] done
[08:40:41] (PS2) Varnent: Update favicon for Office Wiki and remove old icon. [mediawiki-config] - https://gerrit.wikimedia.org/r/994851 (https://phabricator.wikimedia.org/T144254)
[08:40:56] (Abandoned) Urbanecm: Update favicon for Office Wiki and remove old icon. [mediawiki-config] - https://gerrit.wikimedia.org/r/994851 (https://phabricator.wikimedia.org/T144254) (owner: Varnent)
[08:47:34] (PS3) Fabfur: haproxykafka: systemd service hardening [puppet] - https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237)
[08:49:03] (CR) Fabfur: "I think we could proceed with this, test instance with these overrides in the systemd unit show no differences in performances (after remo" [puppet] - https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: Fabfur)
[08:49:23] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: Fabfur)
[08:53:39] (CR) Slyngshede: [C:+2] Block Search: Priorities form input over query params. [software/bitu] - https://gerrit.wikimedia.org/r/1087875 (https://phabricator.wikimedia.org/T378338) (owner: Slyngshede)
[08:55:57] (Merged) jenkins-bot: Block Search: Priorities form input over query params. 
[software/bitu] - https://gerrit.wikimedia.org/r/1087875 (https://phabricator.wikimedia.org/T378338) (owner: Slyngshede)
[08:59:24] (PS1) Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014)
[08:59:39] (PS2) Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014)
[09:00:16] (CR) CI reject: [V:-1] Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[09:02:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet
[09:02:43] (PS1) Slyngshede: P:idp add default empty value for redis password [puppet] - https://gerrit.wikimedia.org/r/1089653
[09:03:54] (PS3) Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014)
[09:03:55] (CR) Slyngshede: [C:+2] P:idp add default empty value for redis password [puppet] - https://gerrit.wikimedia.org/r/1089653 (owner: Slyngshede)
[09:04:03] (CR) Muehlenhoff: [C:+1] P:idp add default empty value for redis password [puppet] - https://gerrit.wikimedia.org/r/1089653 (owner: Slyngshede)
[09:04:39] (PS1) Giuseppe Lavagetto: fastapi::application: add home dir for deploy user [puppet] - https://gerrit.wikimedia.org/r/1089654 (https://phabricator.wikimedia.org/T374723)
[09:04:40] (PS1) Giuseppe Lavagetto: profile::conftool::hiddenparma: add etcd credentials [puppet] - https://gerrit.wikimedia.org/r/1089655 (https://phabricator.wikimedia.org/T374723)
[09:04:42] (PS1) Giuseppe Lavagetto: profile::conftool::hiddenparma: Switch to read-write [puppet] - 
https://gerrit.wikimedia.org/r/1089656 (https://phabricator.wikimedia.org/T374723)
[09:05:12] (PS4) Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014)
[09:06:12] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4487/co" [puppet] - https://gerrit.wikimedia.org/r/1089656 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[09:08:03] (PS5) Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014)
[09:10:11] !log remove ganeti1011 from active ganeti nodes T378921
[09:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:15] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921
[09:12:28] PROBLEM - ganeti-noded running on ganeti1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:12:44] PROBLEM - ganeti-confd running on ganeti1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[09:14:08] FIRING: ProbeDown: Service ganeti1011:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:21] (PS1) Slyngshede: P:idp default redis password to empty string [puppet] - https://gerrit.wikimedia.org/r/1089661
[09:16:08] (CR) Slyngshede: [C:+2] P:idp default redis password to empty string [puppet] - 
https://gerrit.wikimedia.org/r/1089661 (owner: Slyngshede)
[09:19:18] (CR) Giuseppe Lavagetto: [C:+2] fastapi::application: add home dir for deploy user [puppet] - https://gerrit.wikimedia.org/r/1089654 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[09:25:27] (CR) Slyngshede: [C:+2] Release v0.1.1 [software/bitu] - https://gerrit.wikimedia.org/r/1089604 (owner: Slyngshede)
[09:27:57] (Merged) jenkins-bot: Release v0.1.1 [software/bitu] - https://gerrit.wikimedia.org/r/1089604 (owner: Slyngshede)
[09:29:35] (PS1) Volans: sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351)
[09:29:37] (CR) Giuseppe Lavagetto: [C:+2] profile::conftool::hiddenparma: add etcd credentials [puppet] - https://gerrit.wikimedia.org/r/1089655 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[09:35:36] (PS1) Muehlenhoff: Add ganeti1049/ganeti1050 as Ganeti servers [puppet] - https://gerrit.wikimedia.org/r/1089665 (https://phabricator.wikimedia.org/T378921)
[09:35:59] (PS2) Giuseppe Lavagetto: profile::conftool::hiddenparma: Switch to read-write [puppet] - https://gerrit.wikimedia.org/r/1089656 (https://phabricator.wikimedia.org/T374723)
[09:36:02] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[09:42:02] (CR) Giuseppe Lavagetto: [C:+2] profile::conftool::hiddenparma: Switch to read-write [puppet] - https://gerrit.wikimedia.org/r/1089656 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[09:42:51] (CR) FNegri: "The kernel message on cloudvirt1063 says "SGX disabled by BIOS.", so IIUC on that specific host it's already disabled in the bios. 
Should " [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[09:43:01] (PS2) Volans: sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351)
[09:43:52] (CR) Volans: "@Luca: if you could check how it is on supermicro we could do both in the same patch." [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[09:48:22] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [alerts] - https://gerrit.wikimedia.org/r/1087434 (owner: Slyngshede)
[09:48:30] (CR) CI reject: [V:-1] sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[09:50:52] (PS3) Volans: sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351)
[09:52:03] RESOLVED: ProbeDown: Service ganeti1011:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:55:08] (CR) Arturo Borrero Gonzalez: "For disabling it on the kernel we may need something via puppet, and no the cookbook." [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[09:56:26] (CR) Muehlenhoff: "But if it's gets disabled on the hardware during provisioning, there's no further need to disable it on the OS side?" 
[cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[10:00:08] (PS1) Giuseppe Lavagetto: Fix UI bug [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1089674
[10:00:26] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Fix UI bug [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1089674 (owner: Giuseppe Lavagetto)
[10:00:41] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Update to latest - oblivian@cumin1002"
[10:00:44] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Update to latest - oblivian@cumin1002
[10:01:15] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Update to latest - oblivian@cumin1002
[10:01:16] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Update to latest - oblivian@cumin1002"
[10:09:29] (CR) FNegri: "I expect we would still see those kernel warnings at boot if it's disabled on the hardware. It's not a big issue, but I think is sub-optim" [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[10:10:47] (CR) Slyngshede: Netfilter: Route alerts for cloud hosts to WMCS. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1087434 (owner: Slyngshede)
[10:13:36] (CR) Arturo Borrero Gonzalez: "we may be misunderstanding the warning message, but I read it as:" [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[10:15:54] (CR) Arturo Borrero Gonzalez: [C:+1] Netfilter: Route alerts for cloud hosts to WMCS. 
(1 comment) [alerts] - https://gerrit.wikimedia.org/r/1087434 (owner: Slyngshede)
[10:20:01] (CR) Slyngshede: Netfilter: Route alerts for cloud hosts to WMCS. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1087434 (owner: Slyngshede)
[10:20:14] (Abandoned) Arturo Borrero Gonzalez: prometheus-node-kernel-panic: scan last 60d worth of messages [puppet] - https://gerrit.wikimedia.org/r/1088539 (owner: Arturo Borrero Gonzalez)
[10:21:09] (CR) Arturo Borrero Gonzalez: [C:+1] Netfilter: Route alerts for cloud hosts to WMCS. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1087434 (owner: Slyngshede)
[10:22:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:53] SRE, Thumbor, Wikimedia-Incident: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10308931 (hnowlan) This is a recurrence of T374350
[10:31:21] SRE, Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10308973 (jcrespo) Hey, @revi, no worries. Could you confirm with me a timestamp when members would be good? 2024-11-11 06:20:00. A little earlier, at 06:... 
[10:33:38] (PS1) Slyngshede: IDM: Switch to host running v0.1.1 [dns] - https://gerrit.wikimedia.org/r/1089695
[10:46:48] (PS3) Stevemunene: airflow: add airflow-wmde files [deployment-charts] - https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438)
[10:47:04] (CR) Btullis: [C:+2] [wikireplicas] Redact the abuse_filter_action table with a custom view [puppet] - https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) (owner: Btullis)
[10:55:16] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:57:38] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views
[10:58:39] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1089700
[10:59:28] (PS2) Ammarpad: contactpages: Update Affcom UserGroup application form [mediawiki-config] - https://gerrit.wikimedia.org/r/1082174 (https://phabricator.wikimedia.org/T375392)
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1100)
[11:00:14] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1089701
[11:03:58] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1089665 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[11:04:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1082174 (https://phabricator.wikimedia.org/T375392) (owner: Ammarpad)
[11:04:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:06:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 
'recommendation-api-ng' for release 'main' . [11:06:31] (03PS1) 10Slyngshede: P:idp type check Redis keys before accessing [puppet] - 10https://gerrit.wikimedia.org/r/1089702 [11:07:46] (03Abandoned) 10Slyngshede: apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146 (owner: 10Jbond) [11:08:59] (03PS2) 10Slyngshede: P:idp type check Redis keys before accessing [puppet] - 10https://gerrit.wikimedia.org/r/1089702 [11:12:55] (03PS4) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [11:25:02] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti1049/ganeti1050 as Ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/1089665 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [11:27:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1089695 (owner: 10Slyngshede) [11:28:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1089607 (owner: 10Muehlenhoff) [11:30:05] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[11:37:35] (03CR) 10Slyngshede: [C:03+2] IDM: Switch to host running v0.1.1 [dns] - 10https://gerrit.wikimedia.org/r/1089695 (owner: 10Slyngshede) [11:40:28] (03PS3) 10Elukey: sre.hosts.reimage: improve UEFI for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088590 [11:43:34] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [11:43:39] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [11:43:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-redacteddb1001.eqiad.wmnet [11:44:22] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [11:45:50] 06SRE, 06Infrastructure-Foundations, 10netops: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549 (10cmooney) 03NEW p:05Triage→03Low [11:46:21] !log btullis@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [11:47:40] (03CR) 10Elukey: [C:03+1] Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - 10https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:48:07] 06SRE, 06Infrastructure-Foundations, 10netops: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10309316 (10cmooney) a:03cmooney [11:49:46] 06SRE, 10Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10309322 (10jcrespo) I was able to pin down the undesired mass deletion starting at: ` # at 906744201 #241111 6:21:17 server id 171966607 end_log_pos 9067... 
[11:52:23] 06SRE, 10observability, 10Observability-Logging, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q2): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10309312 (10fnegri) This doesn't seem to be related to Cloud VPS. [11:54:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [11:56:05] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-redacteddb1001.eqiad.wmnet [11:56:37] PROBLEM - MariaDB Replica Lag: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 597921.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:56:39] PROBLEM - MariaDB Replica SQL: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 2 of table wikidatawiki.revision cannot be converted from type bigint to type int(10) unsigned https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:56:48] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [11:58:01] (03PS1) 10Elukey: team-sre: move irc-echo alerts to ircstream [alerts] - 10https://gerrit.wikimedia.org/r/1089714 (https://phabricator.wikimedia.org/T376014) [12:01:29] (03CR) 10Cathal Mooney: [C:03+1] "+1 from me as long as fr-tech SREs are happy. 
For the record these IPs are part of a wider block that routes to the payments firewal" [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [12:01:36] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [12:02:06] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T379337#10309357 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power supply [12:05:43] (03PS1) 10Btullis: Canary cephosd1001 to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) [12:06:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4488/co" [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [12:06:45] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1088590 (owner: 10Elukey) [12:06:51] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [12:13:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [12:14:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10309416 (10Jclark-ctr) @cmooney thanks for the list I have populated both new switches up to port 27 with sfp-t [12:14:43] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089725 (owner: 10L10n-bot) [12:15:12] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1049 [12:16:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1049 [12:16:52] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1050 [12:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1050 [12:21:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet [12:23:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2083.codfw.wmnet with OS bullseye [12:28:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet [12:29:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet [12:36:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet [12:40:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1049.eqiad.wmnet to cluster eqiad and group D [12:41:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1049.eqiad.wmnet to cluster eqiad and group D [12:42:18] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [12:48:38] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [12:54:10] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[12:57:42] jouncebot: nowandnext [12:57:42] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [12:57:42] In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1400) [13:00:13] (03PS1) 10Michael Große: wikipedias: clear link-recommendations on page save [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089739 (https://phabricator.wikimedia.org/T379522) [13:00:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [13:02:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085593 (https://phabricator.wikimedia.org/T377829) (owner: 10Máté Szabó) [13:02:50] (03Merged) 10jenkins-bot: Exclude temp account viewer autopromotions from RC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085593 (https://phabricator.wikimedia.org/T377829) (owner: 10Máté Szabó) [13:03:11] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1085593|Exclude temp account viewer autopromotions from RC (T377829)]] [13:03:14] T377829: Hide auto-promotions into the local 'checkuser-temporary-account-viewer' group in Special:RecentChanges - https://phabricator.wikimedia.org/T377829 [13:03:31] (03PS1) 10Giuseppe Lavagetto: Update dependencies to update conftool including bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1089740 [13:03:44] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Update dependencies to update conftool including bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1089740 (owner: 10Giuseppe Lavagetto) [13:04:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:04:41] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix 
bug in requestctl commit - oblivian@cumin1002" [13:04:44] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bug in requestctl commit - oblivian@cumin1002 [13:05:17] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bug in requestctl commit - oblivian@cumin1002 [13:05:19] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix bug in requestctl commit - oblivian@cumin1002" [13:05:28] !log dreamyjazz@deploy2002 mszabo, dreamyjazz: Backport for [[gerrit:1085593|Exclude temp account viewer autopromotions from RC (T377829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:35] !log dreamyjazz@deploy2002 mszabo, dreamyjazz: Continuing with sync [13:08:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [13:08:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [13:08:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:09:47] 06SRE, 10Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10309509 (10jcrespo) It seems to me not all members were deleted, while I can see some people subscribing manually, only around 300 are missing ATM. 
[13:10:19] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085593|Exclude temp account viewer autopromotions from RC (T377829)]] (duration: 07m 07s) [13:10:22] T377829: Hide auto-promotions into the local 'checkuser-temporary-account-viewer' group in Special:RecentChanges - https://phabricator.wikimedia.org/T377829 [13:10:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1305.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:10:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1306.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1307.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1308.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1309.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1306.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1310.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1050.eqiad.wmnet to cluster eqiad and group D [13:12:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1306.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [13:12:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1050.eqiad.wmnet to cluster eqiad and group D [13:13:38] 06SRE, 10Thumbor, 07Wikimedia-Incident: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10309514 (10jijiki) →14Duplicate dup:03T374350 [13:13:47] jynus: yeah, It seems like mailman sends unsubscribe request A to Z from browser request, one by one, and I force-closed the browser at around K [13:15:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1311.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:16:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1312.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:17:07] revi: cool [13:17:36] as I am guessing you are an admin there, could you handle a brief communication there after I do the restore (I was testing it, looks fine to run) [13:18:44] sure [13:18:55] will ping you when done [13:19:09] I had to do comms anyway because people got unsub notice [13:19:10] :-p [13:22:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10309536 (10MoritzMuehlenhoff) [13:22:35] !log reverting deleted rows on db1176 (mailman3) T379519 [13:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:38] T379519: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519 [13:25:27] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc1002.wikimedia.org [13:29:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1305.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:29:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1308.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:29:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1310.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:29:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1309.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:29:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1307.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:30:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:31:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1305.eqiad.wmnet with OS bookworm [13:31:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1305.eqiad.wmnet with OS... 
[13:31:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1306.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:32:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1306.eqiad.wmnet with OS bookworm [13:32:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309560 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1306.eqiad.wmnet with OS... [13:32:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1307.eqiad.wmnet with OS bookworm [13:32:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1308.eqiad.wmnet with OS bookworm [13:32:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1309.eqiad.wmnet with OS bookworm [13:32:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1307.eqiad.wmnet with OS... [13:32:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309562 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1308.eqiad.wmnet with OS... [13:32:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS... 
[13:33:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1310.eqiad.wmnet with OS bookworm [13:33:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1310.eqiad.wmnet with OS... [13:33:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1312.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:34:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1311.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:34:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:34:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1311.eqiad.wmnet with OS bookworm [13:34:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1312.eqiad.wmnet with OS bookworm [13:35:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1311.eqiad.wmnet with OS... [13:35:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1312.eqiad.wmnet with OS... 
[13:36:39] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#10309582 (10Volans) With the new requestctl web UI I think it would be very useful if the current requestctl... [13:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:38:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc1002.wikimedia.org [13:38:48] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10309585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cum... 
[13:39:10] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc2002.wikimedia.org [13:40:59] (03PS1) 10Muehlenhoff: deployment-charts: Remove irc1002/irc2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) [13:42:59] (03PS1) 10Muehlenhoff: Remove irc1002/irc2002 from wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089752 (https://phabricator.wikimedia.org/T376014) [13:44:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:45:45] (03CR) 10Andrea Denisse: [C:03+2] titan: Bring thanos raw retention to 44w [puppet] - 10https://gerrit.wikimedia.org/r/1088390 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [13:48:25] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089739 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [13:49:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309607 (10Jclark-ctr) [13:50:34] PROBLEM - Host logstash2025 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:34] PROBLEM - Host releases2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:46] PROBLEM - Host kafkamon2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:03] (03PS3) 10Zabe: snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) [13:51:40] (03CR) 10CI reject: [V:04-1] snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [13:52:00] (03CR) 10Zabe: snapshot: Remove labtestwiki from excluded wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [13:52:03] FIRING: [2x] ProbeDown: Service logstash2025:443 has failed 
probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash2025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:23] FIRING: ProbeDown: Service releases2003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:01] (03PS4) 10Zabe: snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) [13:54:42] FIRING: JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:41] !log powercycled ganeti2031 [13:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:12] PROBLEM - Host ganeti2031 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:25] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:43] (03CR) 10Btullis: airflow: add airflow-wmde files (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [13:58:45] RECOVERY - Host ganeti2031 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [13:59:08] FIRING: [3x] ProbeDown: Service ganeti2031:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1400) [14:00:05] ZhaoFJx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:28] RECOVERY - Host kafkamon2003 is UP: PING OK - Packet loss = 0%, RTA = 31.02 ms [14:00:29] RECOVERY - Host logstash2025 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [14:00:35] RECOVERY - Host releases2003 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [14:00:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [14:01:03] 06SRE, 06Infrastructure-Foundations: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye - https://phabricator.wikimedia.org/T348730#10309640 (10MoritzMuehlenhoff) Happened again on ganeti2031 today. 
[14:02:03] RESOLVED: [3x] ProbeDown: Service ganeti2031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:23] RESOLVED: ProbeDown: Service releases2003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:54] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: improve UEFI for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088590 (owner: 10Elukey) [14:03:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [14:03:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [14:04:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [14:04:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [14:04:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [14:04:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [14:04:27] I wouldn’t mind if someone else deployed today tbh [14:04:30] but I can also do it if needed [14:04:34] ZhaoFJx: are you there? 
[14:04:38] yep [14:04:42] RESOLVED: JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:04:45] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [14:04:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [14:04:55] I can deploy [14:05:06] thanks! [14:05:12] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:05:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:05:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:05:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc2002.wikimedia.org [14:05:39] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10309643 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cum... [14:06:58] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. 
[14:07:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [14:07:56] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:08:00] (03PS2) 10Zabe: zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [14:08:18] (03CR) 10Zabe: [C:03+2] zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [14:09:00] (03Merged) 10jenkins-bot: zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [14:09:01] (03CR) 10Muehlenhoff: "There are some bugs left with these options, let me first revise it before we add one more user of this interface, I'll start looking into" [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [14:09:51] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1078764|zhwiki: Allow event-organizer self remove usergroup (T376061)]] [14:09:54] T376061: Allow organizers to remove themselves from event organizer group - https://phabricator.wikimedia.org/T376061 [14:10:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [14:12:00] !log zabe@deploy2002 zabe, zhaofjx: Backport for [[gerrit:1078764|zhwiki: Allow event-organizer self remove usergroup (T376061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:24] ZhaoFJx: do you know how to test on the test servers? 
[14:12:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:52] zabe: not sure if I know [14:13:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [14:13:33] https://wikitech.wikimedia.org/wiki/WikimediaDebug [14:13:34] okay [14:13:54] that page has a guide on how to use X-Wikimedia-Debug [14:14:37] 06SRE, 10Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10309659 (10jcrespo) ` root@db1176[mailman3]> SELECT count(*) FROM member where list_id = 'commons-l.lists.wikimedia.org'; +----------+ | count(*) | +------... [14:14:42] your patch is quite simple, so in your case what you would do is enable the browser extension and navigate to https://zh.wikipedia.org/w/index.php?title=Special:%E7%BE%A4%E7%BB%84%E6%9D%83%E9%99%90&uselang=en to see whether the changes of your patch work as expected [14:14:51] (03CR) 10Elukey: [C:03+1] Remove irc1002/irc2002 from wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089752 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [14:15:08] looks good to me [14:15:13] (03CR) 10Elukey: [C:03+1] deployment-charts: Remove irc1002/irc2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [14:15:26] the permission is shown under the user group [14:15:48] cool [14:15:49] !log zabe@deploy2002 zabe, zhaofjx: Continuing with sync [14:16:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [14:16:43] 
(03CR) 10Muehlenhoff: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1089714 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:16:58] jynus: phab message ACK [14:18:07] can you send an email there? I am not sure if the preferences were recovered successfully [14:18:36] so subscription is ok, but maybe delivery is disabled or something [14:18:53] I'm writing a postmortem [14:19:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [14:20:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:20:31] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078764|zhwiki: Allow event-organizer self remove usergroup (T376061)]] (duration: 10m 40s) [14:20:34] T376061: Allow organizers to remove themselves from event organizer group - https://phabricator.wikimedia.org/T376061 [14:20:53] ZhaoFJx: patch is live [14:21:00] all right, thank you:) [14:21:41] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:21:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:21:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1305.eqiad.wmnet with OS bookworm [14:22:04] yw [14:22:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1305.eqiad.wmnet with OS book... 
[14:22:40] yeah, I think no preferences/default preferences in some list is "normal", so things should be ok [14:22:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [14:25:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:26:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:26:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1307.eqiad.wmnet with OS bookworm [14:26:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1307.eqiad.wmnet with OS book... 
[14:27:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [14:27:47] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye [14:28:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:28:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:28:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1308.eqiad.wmnet with OS bookworm [14:31:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1308.eqiad.wmnet with OS book... 
[14:31:26] I sent out my email to commons-l; expect one or two minutes for mails to be actually sent from my server :-p [14:31:56] ok, thanks, that will help me check I am correctly resubscribed [14:32:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:32:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:32:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1306.eqiad.wmnet with OS bookworm [14:32:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:33:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1306.eqiad.wmnet with OS book... 
[14:33:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309737 (10Jclark-ctr) [14:33:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:33:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1312.eqiad.wmnet with OS bookworm [14:33:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309752 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1312.eqiad.wmnet with OS book... [14:34:27] (03CR) 10Ssingh: hiera: enable haproxykafka on eqsin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:35:09] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bullseye [14:35:14] minor mistake, revi, despite my title still being DBA, I am an SRE in charge of recovery :-D, not that it matters, just FYI [14:35:33] Your message entitled "[Commons-l] Re: Sudden unsubscription: Sorry!" was successfully received by the Commons-l mailing list. 
[14:35:42] yep, I saw it [14:35:44] My clock is still stuck at 2022, you know [14:35:58] so I will close the ticket as resolved [14:36:01] (or 2020~2021) [14:36:06] (03CR) 10Fabfur: hiera: enable haproxykafka on eqsin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:36:19] (03PS3) 10Fabfur: hiera: enable haproxykafka on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:44] anyway, thanks as usual! [14:36:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:37:08] (03CR) 10Elukey: [C:03+2] team-sre: move irc-echo alerts to ircstream [alerts] - 10https://gerrit.wikimedia.org/r/1089714 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:37:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:37:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1310.eqiad.wmnet with OS bookworm [14:37:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1310.eqiad.wmnet with OS book... [14:37:43] zabe: could you deploy another thing? 
=> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1089739 [14:38:10] if not, that's ok too, I can also schedule it tomorrow [14:39:36] 06SRE, 10Wikimedia-Mailing-lists: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10309773 (10jcrespo) 05Open→03Resolved a:03jcrespo Cleanup done, I will commit the automation for reverting accidental unsubscriptions to the ops:... [14:41:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:43:45] * Lucas_WMDE still around if needed [14:44:37] !log btullis@cumin1002 END (FAIL) - Cookbook sre.presto.roll-restart-workers (exit_code=99) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [14:45:01] Lucas_WMDE: happy if we could get it out of the door if it is no hassle, but not worth disrupting your day if your focus is somewhere else right now [14:45:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309808 (10Jclark-ctr) [14:47:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10309837 (10elukey) >>! In T371400#10307915, @jhathaway wrote: > @elukey I was able to reproduce the issue, by wiping the files from the efi partition, be... [14:48:20] eh, I can deploy now [14:48:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089739 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [14:48:42] @Lucas_WMDE Cool! Then I'll add it to the calendar right away [14:48:46] ok, thanks! 
[14:49:24] (03Merged) 10jenkins-bot: wikipedias: clear link-recommendations on page save [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089739 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [14:49:40] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1089739|wikipedias: clear link-recommendations on page save (T379522)]] [14:49:43] T379522: Switch GETempLinkRecommendationSwitchTagClearHook to true at all wikis - https://phabricator.wikimedia.org/T379522 [14:51:46] !log lucaswerkmeister-wmde@deploy2002 migr, lucaswerkmeister-wmde: Backport for [[gerrit:1089739|wikipedias: clear link-recommendations on page save (T379522)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:52:02] can you test the change? [14:52:31] kinda, I can save an edit and make sure nothing breaks [14:52:35] otherwise not really [14:52:40] ok [14:52:58] (though it has been live on eswiki and frwiki already, so I'm not expecting anything) [14:55:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:55:51] @Lucas_WMDE Looks good as far as I can tell [14:56:00] !log lucaswerkmeister-wmde@deploy2002 migr, lucaswerkmeister-wmde: Continuing with sync [14:57:29] (03PS1) 10Elukey: profile::docker::reporter::report: use internal registry endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1089804 (https://phabricator.wikimedia.org/T378618) [14:57:39] (03CR) 10Ssingh: [C:03+1] hiera: enable haproxykafka on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:58:57] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage 
- jclark@cumin1002" [15:00:39] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1089739|wikipedias: clear link-recommendations on page save (T379522)]] (duration: 10m 59s) [15:00:45] T379522: Switch GETempLinkRecommendationSwitchTagClearHook to true at all wikis - https://phabricator.wikimedia.org/T379522 [15:00:51] !log UTC afternoon backport+config window done [15:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] jouncebot: now [15:00:54] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [15:00:58] ok :) [15:01:00] * Lucas_WMDE done deploying [15:01:06] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10309876 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:45] Lucas_WMDE: Thank you! 
🙏 [15:03:45] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T379233#10309882 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:03:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:03:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1309.eqiad.wmnet with OS bookworm [15:03:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309883 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS book... [15:04:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309884 (10Jclark-ctr) [15:04:32] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1311.eqiad.wmnet with OS bookworm [15:04:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1311.eqiad.wmnet with OS book... 
[15:05:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309889 (10Jclark-ctr) [15:09:16] (03PS1) 10FNegri: toolsdb: apply pinning to all debian versions [puppet] - 10https://gerrit.wikimedia.org/r/1089806 (https://phabricator.wikimedia.org/T352206) [15:09:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10309892 (10Jclark-ctr) 05Open→03Resolved a:05Clement_Goubert→03Jclark-ctr [15:11:55] (03PS1) 10D3r1ck01: PageUpdater: restore call to RevisionFromEditComplete [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089807 (https://phabricator.wikimedia.org/T379152) [15:13:22] (03CR) 10Ssingh: [C:03+1] Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:18:33] (03CR) 10Elukey: Create new lvs service kartotherian-k8s-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:20:11] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1089806 (https://phabricator.wikimedia.org/T352206) (owner: 10FNegri) [15:29:40] 06SRE, 10Wikimedia-Mailing-lists: Further improve DMARC compatibility on lists.wikimedia.org - https://phabricator.wikimedia.org/T379517#10309987 (10Reedy) [15:31:07] (03CR) 10FNegri: [C:03+2] toolsdb: apply pinning to all debian versions [puppet] - 10https://gerrit.wikimedia.org/r/1089806 (https://phabricator.wikimedia.org/T352206) (owner: 10FNegri) [15:38:29] (03PS10) 10Ssingh: hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:38:29] (03CR) 10Ssingh: "Verify that this is a NOOP for cp production hosts 
as well, given we are modifying P:cache::haproxykafka" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:40:09] (03CR) 10JMeybohm: [C:03+1] profile::docker::reporter::report: use internal registry endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1089804 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [15:43:26] (03CR) 10Arturo Borrero Gonzalez: prometheus-node-kernel-panic: ignore false warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [15:47:20] (03PS9) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [15:52:02] (03CR) 10Ssingh: [C:03+1] Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:52:11] (03CR) 10Elukey: [C:03+2] Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:54:09] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=codfw,service=kartotherian-k8s-ssl [15:55:14] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [15:55:23] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:58:48] (03PS7) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [15:58:48] (03PS1) 10Elukey: Move kartotherian-k8s-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1089817 (https://phabricator.wikimedia.org/T378944) [15:59:57] (03CR) 10Ssingh: 
[C:03+1] Move kartotherian-k8s-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1089817 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:00:01] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [16:03:16] (03CR) 10Elukey: [C:03+2] Move kartotherian-k8s-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1089817 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:05:03] (03CR) 10Alexandros Kosiaris: [C:03+1] "Merge right before a MW deploy window and it will be picked up by it. Easy to deploy as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [16:05:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:07:10] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:09:21] !log installing libarchive security updates [16:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:33] !log restart pybal on lvs1020 (secondary) to pick up new kartotherian-k8s-ssl service [16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:54] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 86 connections established with conf1007.eqiad.wmnet:4001 (min=87) https://wikitech.wikimedia.org/wiki/PyBal [16:10:10] !log restart pybal on lvs1019 (primary) to pick up new kartotherian-k8s-ssl service [16:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:10] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:12:20] RECOVERY 
- PyBal connections to etcd on lvs1019 is OK: OK: 87 connections established with conf1007.eqiad.wmnet:4001 (min=87) https://wikitech.wikimedia.org/wiki/PyBal [16:17:37] !log restart pybal on lvs2014 (secondary) to pick up new kartotherian-k8s-ssl service [16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] !log restart pybal on lvs2013 (primary) to pick up new kartotherian-k8s-ssl service [16:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:35] (03CR) 10FNegri: prometheus-node-kernel-panic: ignore false warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [16:21:54] (03CR) 10FNegri: [C:03+2] team-wmcs: aggregate kernel alerts over 24h [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [16:23:05] (03Merged) 10jenkins-bot: team-wmcs: aggregate kernel alerts over 24h [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [16:24:17] (03CR) 10Ssingh: [C:03+1] profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:24:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:29:44] (03PS1) 10Effie Mouzeli: Add replacement kafka nodes to kafka_brokers_main on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1089822 (https://phabricator.wikimedia.org/T363214) [16:30:04] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1630) [16:50:37] (03CR) 10Ssingh: "Thanks for the reviews, folks. I will wait for @jgreen@wikimedia.org's +1 before proceeding." [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [16:52:48] (03Abandoned) 10Elukey: profile::docker::reporter: use the docker-registry's internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1075879 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey) [16:53:07] (03CR) 10Elukey: [C:03+2] profile::docker::reporter::report: use internal registry endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1089804 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [16:55:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:55:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:55:25] huh [16:55:35] from RU [16:55:40] !incidents [16:55:40] 5392 (UNACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [16:55:47] !ack 5392 [16:55:47] 5392 (ACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [17:03:06] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089725 (owner: 10L10n-bot) [17:15:30] (03PS1) 10Peter Fischer: CirrusSearch: re-enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) [17:15:44] 06SRE, 06Infrastructure-Foundations, 06Traffic: NEL: don't alert on domains we don't control - https://phabricator.wikimedia.org/T349807#10310243 (10jcrespo) This happened again for another proxy/domain: https://logstash.wikimedia.org/goto/fdbf6830d7a58fccbb40681028ac5bdd [17:16:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [17:24:18] 06SRE, 06Infrastructure-Foundations, 10netops: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#10310260 (10Reedy) [17:25:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:25:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [17:31:31] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#10310274 (10Joe) The way we could do this is something as follows: * Add the CORS headers to 
superset to allo... [17:31:36] (03PS1) 10Jcrespo: Add quick and dirty script to revert accidental mailman unsubscriptions [software] - 10https://gerrit.wikimedia.org/r/1089831 (https://phabricator.wikimedia.org/T379519) [17:32:17] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#10310284 (10Joe) As an alternative, which I might actually prefer, all the process would remain server-side i... [17:33:14] (03CR) 10Jcrespo: "Terrible script, but FYI" [software] - 10https://gerrit.wikimedia.org/r/1089831 (https://phabricator.wikimedia.org/T379519) (owner: 10Jcrespo) [17:34:02] (03PS2) 10Jcrespo: Add quick and dirty script to revert accidental mailman unsubscriptions [software] - 10https://gerrit.wikimedia.org/r/1089831 (https://phabricator.wikimedia.org/T379519) [17:35:40] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:00] (03CR) 10CI reject: [V:04-1] Add quick and dirty script to revert accidental mailman unsubscriptions [software] - 10https://gerrit.wikimedia.org/r/1089831 (https://phabricator.wikimedia.org/T379519) (owner: 10Jcrespo) [17:40:44] RECOVERY - Host ganeti2042 is UP: PING WARNING - Packet loss = 50%, RTA = 36.75 ms [17:42:26] (03PS1) 10Hnowlan: thumbor: fail health check if healthy servers is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089832 (https://phabricator.wikimedia.org/T379561) [17:46:57] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove irc1002/irc2002 from wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089752 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [17:47:16] (03PS2) 10Alexandros Kosiaris: deployment-charts: Remove irc1002/irc2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [17:53:13] (03PS1) 10Giuseppe Lavagetto: Add superset links 
[software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1089834 [17:53:27] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add superset links [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1089834 (owner: 10Giuseppe Lavagetto) [17:54:42] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add superset links - oblivian@cumin1002 - T379567" [17:54:44] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add superset links - oblivian@cumin1002 - T379567 [17:54:45] T379567: Link to superset dashboard in requestctl web UI - https://phabricator.wikimedia.org/T379567 [17:55:16] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add superset links - oblivian@cumin1002 - T379567 [17:55:17] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add superset links - oblivian@cumin1002 - T379567" [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1800) [18:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T1800). 
[18:08:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089807 (https://phabricator.wikimedia.org/T379152) (owner: 10D3r1ck01) [18:12:27] FIRING: [6x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2063:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:17:27] FIRING: [17x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:22:27] RESOLVED: [17x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:24:24] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:24] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:29:24] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:24] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:46:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10310392 (10Khantstop) Hi @BTullis, I hope all is well. I’m trying to gain access to sql_lab and superset, a... 
[18:47:41] (03PS11) 10Fabfur: hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) [18:53:06] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [18:53:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [19:04:35] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10310435 (10cmooney) [19:04:48] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10310436 (10cmooney) a:03cmooney [19:08:13] (03CR) 10Ssingh: [C:03+1] "Looks good. We can ignore the deployment-pcc failure and merge this and check that it works as intended but the failure is expected." [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [19:44:14] PROBLEM - MariaDB Replica SQL: s2 on db1182 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:51:58] PROBLEM - MariaDB Replica Lag: s2 on db1182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:08:14] RECOVERY - MariaDB Replica SQL: s2 on db1182 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:08:32] I did some magic there [20:15:58] RECOVERY - MariaDB Replica Lag: s2 on db1182 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:53] (03PS1) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) [20:38:01] (03PS2) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T2100). [21:00:05] tgr and Ammar: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:01:30] o/ [21:06:25] I can deploy [21:09:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082174 (https://phabricator.wikimedia.org/T375392) (owner: 10Ammarpad) [21:10:12] (03Merged) 10jenkins-bot: contactpages: Update Affcom UserGroup application form [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082174 (https://phabricator.wikimedia.org/T375392) (owner: 10Ammarpad) [21:10:32] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1082174|contactpages: Update Affcom UserGroup application form (T375392)]] [21:10:35] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [21:11:20] tgr|away Thanks! [21:12:43] !log tgr@deploy2002 ammarpad, tgr: Backport for [[gerrit:1082174|contactpages: Update Affcom UserGroup application form (T375392)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:10] (03CR) 10Cathal Mooney: [C:03+2] Move idle-timeout under login to the dedicated login template [homer/public] - 10https://gerrit.wikimedia.org/r/1088535 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [21:16:14] (03Merged) 10jenkins-bot: Move idle-timeout under login to the dedicated login template [homer/public] - 10https://gerrit.wikimedia.org/r/1088535 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [21:21:29] (03PS1) 10Cathal Mooney: Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) [21:23:24] (03PS2) 10Cathal Mooney: Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) [21:23:47] (03PS3) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 
(https://phabricator.wikimedia.org/T378020) [21:24:59] (03CR) 10CI reject: [V:04-1] Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [21:29:03] tgr|away It does not look right to me. I am seeing two undefined message keys. Although the keys exist in codesearch (WikimediaMessages extension), I am not sure why they are not available on meta [21:30:03] some sort of message cache issue maybe? [21:30:49] though in theory scap rebuilds the message cache on the test servers [21:32:35] Here are the keys: https://codesearch.wmcloud.org/search/?q=contactpage-affcom-user-group-application-note%7Ccontactpage-affcom-user-group-intent-letter-label&files=&excludeFiles=&repos=#operations/mediawiki-config [21:32:55] yeah the code looks correct at a glance [21:33:18] I guess deploy in production and see if that works better? if not, we can always revert [21:33:31] having broken text for a few minutes shouldn't be a big deal [21:33:54] !log tgr@deploy2002 ammarpad, tgr: Continuing with sync [21:35:16] Yeah, thank you [21:38:39] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082174|contactpages: Update Affcom UserGroup application form (T375392)]] (duration: 28m 07s) [21:38:42] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [21:39:57] well that didn't work [21:40:10] let me try a manual rebuild [21:41:05] (03PS4) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) [21:42:02] so https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1082177 was reverted, and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1084285 was merged last Tuesday, so it will be deployed this week [21:42:30] I suppose I should backport it
[21:42:38] Oh, I think I see what's happening here. The messages were only merged but not deployed [21:42:59] Yess [21:43:48] (03PS1) 10Gergő Tisza: contactpage: Update AffCom contact form messages (Resubmit) [extensions/WikimediaMessages] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089864 (https://phabricator.wikimedia.org/T375392) [21:45:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089864 (https://phabricator.wikimedia.org/T375392) (owner: 10Gergő Tisza) [21:57:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241111T2200). [22:01:14] o/ still waiting for a deployment to finish. Have two more lined up if you don't need the window.
[22:02:19] scap is taking its sweet time: Retrying (Retry(total=9, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected [22:03:04] I guess it's really CI that's taking the time [22:04:40] (03Merged) 10jenkins-bot: contactpage: Update AffCom contact form messages (Resubmit) [extensions/WikimediaMessages] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089864 (https://phabricator.wikimedia.org/T375392) (owner: 10Gergő Tisza) [22:05:01] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1089864|contactpage: Update AffCom contact form messages (Resubmit) (T375392)]] [22:05:14] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [22:12:11] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:19:00] tgr|away I can see it works now [22:19:56] !log tgr@deploy2002 tgr: Backport for [[gerrit:1089864|contactpage: Update AffCom contact form messages (Resubmit) (T375392)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:20:00] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [22:20:05] (03CR) 10Gergő Tisza: [C:03+2] PageUpdater: restore call to RevisionFromEditComplete [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089807 (https://phabricator.wikimedia.org/T379152) (owner: 10D3r1ck01) [22:21:38] !log tgr@deploy2002 tgr: Continuing with sync [22:30:49] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1089864|contactpage: Update AffCom contact form messages (Resubmit) (T375392)]] (duration: 25m 48s) [22:30:53] T375392: Modify the user group application form on Meta - 
https://phabricator.wikimedia.org/T375392 [22:34:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089807 (https://phabricator.wikimedia.org/T379152) (owner: 10D3r1ck01) [22:56:29] (03Merged) 10jenkins-bot: PageUpdater: restore call to RevisionFromEditComplete [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089807 (https://phabricator.wikimedia.org/T379152) (owner: 10D3r1ck01) [22:56:48] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1089807|PageUpdater: restore call to RevisionFromEditComplete (T379152)]] [22:56:52] T379152: RevisionFromEditComplete hook no longer allows you to modify tags - https://phabricator.wikimedia.org/T379152 [22:59:11] !log tgr@deploy2002 d3r1ck01, tgr: Backport for [[gerrit:1089807|PageUpdater: restore call to RevisionFromEditComplete (T379152)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:02:52] !log tgr@deploy2002 d3r1ck01, tgr: Continuing with sync [23:08:33] !log tgr@deploy2002 scap failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--network', '--', 'purgeMessageBlobStore.php']' returned non-zero exit status 1. (scap version: 4.122.0) (duration: 11m 44s) [23:12:05] well that's new [23:13:15] 23:08:33 Wikimedia\Rdbms\DBConnectionError from line 1129 of /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: php_network_getaddresses: getaddrinfo failed: Name or service not known (WMF_MAINTENANCE_OFFLINE_placeholder) [23:14:51] purgeMessageBlobStore.php is the last scap step and the patch did not involve any message changes so... probably fine? 
[23:16:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.113s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:18:49] filed as T379589 [23:18:49] T379589: scap backport fails at purgeMessageBlobStore.php with getaddrinfo failed - https://phabricator.wikimedia.org/T379589 [23:20:06] !log UTC late deploys done [23:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.022s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded