[00:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1096721 [00:38:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1096721 (owner: 10TrainBranchBot) [00:42:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:42:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:03:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:04:37] (03CR) 10PRESELYA1: "hello" [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [01:08:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1096742 [01:08:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1096742 (owner: 10TrainBranchBot) [01:13:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1096721 (owner: 10TrainBranchBot) [01:40:01] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1096742 (owner: 10TrainBranchBot) [01:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:25:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:26:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:26:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:26:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:41:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) (owner: 10Jforrester) [02:47:49] (03CR) 10Jforrester: [C:04-1] "Semi -1." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662) (owner: 10Pppery) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:38] RECOVERY - Host parse2017 is UP: PING WARNING - Packet loss = 71%, RTA = 0.28 ms [03:07:56] PROBLEM - SSH on parse2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:14:02] PROBLEM - Host parse2017 is DOWN: PING CRITICAL - Packet loss = 100% [03:42:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:00:44] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10351489 (10fnegri) 05Resolved→03Open p:05Triage→03High This has just caused a WMCS proxy outage, beca... 
[04:28:58] (03PS1) 10Tim Starling: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) [04:29:41] (03CR) 10CI reject: [V:04-1] Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [04:42:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:43:38] (03PS2) 10Tim Starling: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) [04:44:18] (03CR) 10CI reject: [V:04-1] Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [05:05:29] (03CR) 10Pppery: "Is there some better way of doing this right now?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662) (owner: 10Pppery) [05:23:15] (03PS3) 10Tim Starling: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) [05:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:51:24] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:09:40] (03CR) 10Giuseppe Lavagetto: [C:03+2] aptrepo: add import for vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/1093875 (owner: 10Giuseppe Lavagetto) [06:25:23] (03PS1) 10Giuseppe Lavagetto: apt-updates: add new hpe key [puppet] - 10https://gerrit.wikimedia.org/r/1096964 [06:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:41:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:22] (03PS2) 10Giuseppe Lavagetto: apt-updates: fix expired keys [puppet] - 10https://gerrit.wikimedia.org/r/1096964 [07:06:14] PROBLEM - BGP 
status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:06:51] (03CR) 10Giuseppe Lavagetto: [C:03+2] apt-updates: fix expired keys [puppet] - 10https://gerrit.wikimedia.org/r/1096964 (owner: 10Giuseppe Lavagetto) [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:14:21] (03PS1) 10Giuseppe Lavagetto: aptrepo: remove temporarily pyall from updates [puppet] - 10https://gerrit.wikimedia.org/r/1097182 [07:15:31] <_joe_> !log upgrading vopsbot to 0.3.9 [07:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:50] (03CR) 10Giuseppe Lavagetto: [C:03+2] aptrepo: remove temporarily pyall from updates [puppet] - 10https://gerrit.wikimedia.org/r/1097182 (owner: 10Giuseppe Lavagetto) [07:42:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:47:20] (03PS4) 10Kosta Harlan: IPReputation: Enable everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053230 (https://phabricator.wikimedia.org/T360067) [07:47:46] !log remove ganeti7004 from active Ganeti nodes in magru02 T376737 [07:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:50] T376737: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - 
https://phabricator.wikimedia.org/T376737 [07:49:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053230 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [07:50:22] PROBLEM - ganeti-noded running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [07:50:22] PROBLEM - ganeti-confd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [07:52:06] FIRING: [13x] ProbeDown: Service ganeti7004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:53:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [07:53:44] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10351574 (10ops-monitoring-bot) Draining ganeti7003.magru.wmnet of running VMs [07:54:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [07:55:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:55:44] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [07:55:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:56:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 
days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [07:56:11] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [07:56:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [07:57:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2240.codfw.wmnet with reason: Maintenance [07:57:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2240.codfw.wmnet with reason: Maintenance [07:57:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T367781)', diff saved to https://phabricator.wikimedia.org/P71119 and previous config saved to /var/cache/conftool/dbconfig/20241125-075758-arnaudb.json [07:58:03] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:59:12] (03PS1) 10Brouberol: airflow-wmde: migrate scheduler to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097280 (https://phabricator.wikimedia.org/T380622) [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T0800). nyaa~ [08:00:05] tgr and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T367781)', diff saved to https://phabricator.wikimedia.org/P71120 and previous config saved to /var/cache/conftool/dbconfig/20241125-080010-arnaudb.json [08:00:19] most definitely not a deployment window [08:00:20] hello [08:00:23] hello [08:00:27] o/ [08:00:40] not a deployment window? 
[08:00:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain [08:00:54] kostajh: playing around with "OwO what's this, a deployment window" from jouncebot, sorry [08:01:06] ah [08:01:13] tgr|away: you could go first [08:01:40] urbanecm: are you deploying? I can do it otherwise [08:01:55] i can if needed, otherwise, feel free to go ahead [08:02:00] tgr|away / urbanecm: any idea how wgCentralAuthIpoidUrl is populated? I don't see an override in operations/mediawiki-config. And in extension.json it is set to `false` [08:02:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain [08:03:02] (03PS1) 10Brouberol: airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097281 (https://phabricator.wikimedia.org/T380622) [08:03:17] kostajh: for some reason, it's in PrivateSettings.php [08:03:21] My patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1053230?usp=dashboard is ready, but I would like to also set `wgCentralAuthIpoidUrl` to false [08:03:21] aha [08:03:50] ok, will self deploy then [08:03:59] (as to why, i have no idea, i'd expect it to be in ProductionServices.php like others) [08:04:16] urbanecm: is that something you can help me update? I'd like to set `wgCentralAuthIpoidUrl` to false (so just remove it from PrivateSettings.php) and then enable Extension:IPReputation [08:04:23] (03CR) 10JMeybohm: [C:03+1] "Cool, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [08:04:48] urbanecm: don't we have some public skeleton PrivateSettings where such variables are supposed to be listed so they are discoverable? 
[08:05:16] (03PS2) 10Fabfur: benthos: WIP for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [08:05:21] tgr|away: in theory, https://github.com/wikimedia/operations-mediawiki-config/blob/master/private/readme.php, but that appears to be heavily outdated [08:06:00] hm [08:06:16] maybe we should have a git hook reminding people, or something [08:06:49] (03CR) 10Gergő Tisza: [C:03+2] Disable more extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [08:07:32] (03Merged) 10jenkins-bot: Disable more extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [08:08:07] kostajh: sure! in theory, you should be able to update that yourself. from the deployment host, go to `/srv/mediawiki-staging/private`, edit PrivateSettings.php, commit, and then proceed with the backport as normal (it'll take your PrivateSettings.php changes with it) [08:08:39] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1094071|Disable more extensions when using the shared login domain (T373737)]] [08:08:43] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [08:09:09] kostajh: note there are two other related values set. 
[08:09:17] yeah I see them [08:09:49] ok, tgr|away let me know when you're finished please [08:10:01] (03CR) 10JMeybohm: [C:03+1] mw-api-int: add migration release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [08:10:13] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: bump images for gui and builder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [08:10:27] (03CR) 10JMeybohm: [C:03+1] wikikube: Default to containerd partition layout [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) (owner: 10Clément Goubert) [08:10:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain [08:11:15] urbanecm: I think it is safe to remove all three entries. I'll plan to do that. That should mean CentralAuth's URL for ipoid is false, and that the feature flag (CentralAuthIpoidCheckAtAccountCreation) is off, based on the default extension.json settings [08:11:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain [08:11:35] (03Merged) 10jenkins-bot: wikidata-query-gui: bump images for gui and builder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [08:11:37] then, Extension:IPReputation's defaults have the URL for IPoid and the feature flag on, and IPReputationIPoidCheckAtAccountCreationLogOnly is set to true. [08:11:41] so it should be a no-op [08:11:56] kostajh: ack, sounds good. feel free to ping me if needed when doing that. 
[08:12:57] (03PS14) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [08:13:36] (03CR) 10CI reject: [V:04-1] haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:15:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P71121 and previous config saved to /var/cache/conftool/dbconfig/20241125-081517-arnaudb.json [08:15:28] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [08:16:00] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [08:16:40] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [08:17:04] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [08:17:06] FIRING: [13x] ProbeDown: Service ganeti7004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:17] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [08:17:41] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [08:25:02] !log tgr@deploy2002 tgr: Backport for [[gerrit:1094071|Disable more extensions when using the shared login domain (T373737)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:25:19] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [08:25:44] for the record, deployment failed with https://pastebin.com/yV7PLW5P, and then https://pastebin.com/H86xifkd [08:25:50] third retry succeeded [08:26:05] both 503s, I 
couldn't reproduce either manually [08:26:24] the patch should be a noop in production, so unlikely to be related [08:27:14] (03PS15) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [08:29:35] tgr|away: ack. shall I proceed with my patch? [08:29:48] tgr|away: is this the same as T364880 ? [08:29:49] T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880 [08:29:54] just a sec [08:30:03] !log tgr@deploy2002 tgr: Continuing with sync [08:30:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P71122 and previous config saved to /var/cache/conftool/dbconfig/20241125-083024-arnaudb.json [08:30:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:31:12] (03CR) 10Jelto: [V:03+1 C:03+2] profile::auto_restarts::service: make restart time configurable [puppet] - 10https://gerrit.wikimedia.org/r/1093953 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [08:32:24] yeah the error in that task is similar [08:35:58] in/61 [08:36:05] err :) [08:36:11] (it is monday..) [08:36:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:39] 06SRE, 10Scap, 06serviceops-radar: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#10351622 (10Tgr) I ran into this twice today (not the totoro one specifically, just random testserver checks failing with an 503). Passed on the third r... 
[08:37:07] (03CR) 10Jelto: "fyi: I merged this change, it was unmerged on puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1097182 (owner: 10Giuseppe Lavagetto) [08:37:23] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not (P{cp4043.*} or P{cp4051.*}) and A:cp for 9.2.6-1wm2 [08:39:15] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094071|Disable more extensions when using the shared login domain (T373737)]] (duration: 30m 35s) [08:39:18] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [08:39:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain [08:40:39] kostajh: you are good to go [08:41:10] tgr|away: thanks [08:43:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain [08:43:57] urbanecm: I made the commit to PrivateSettings, and am backporting the config patch now [08:44:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053230 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [08:45:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T367781)', diff saved to https://phabricator.wikimedia.org/P71123 and previous config saved to /var/cache/conftool/dbconfig/20241125-084531-arnaudb.json [08:45:32] (03CR) 10Stevemunene: [C:03+1] "looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097280 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [08:45:35] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:45:42] (03Merged) 10jenkins-bot: IPReputation: Enable everywhere [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1053230 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [08:46:00] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1053230|IPReputation: Enable everywhere (T360067)]] [08:46:03] T360067: Deploy Extension:IPReputation - https://phabricator.wikimedia.org/T360067 [08:46:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [08:46:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [08:46:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [08:46:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [08:47:04] (03PS16) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [08:47:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain [08:48:20] (03CR) 10Stevemunene: [C:04-1] "Based on the previous change I5fa8eac25a3a057c93b7e82e9cbe059c70778c4a and title I think this should be on hieradata/role/common/analytic" [puppet] - 10https://gerrit.wikimedia.org/r/1097281 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [08:48:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain [08:49:46] (03CR) 10Brouberol: "oops, indeed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1097281 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [08:49:49] (03PS1) 10Brouberol: airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) [08:50:12] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1053230|IPReputation: Enable everywhere (T360067)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:50:25] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:50:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:50:43] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:27] PROBLEM - Bird Internet Routing Daemon on durum7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:51:27] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:51:43] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:25] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:52:27] RECOVERY - Bird Internet Routing Daemon on durum7001 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:52:27] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7001 is OK: OK: UP (pid=2360) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:52:38] (03PS1) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) [08:52:49] (03CR) 10Urbanecm: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [08:53:59] !log kharlan@deploy2002 kharlan: Continuing with sync [08:54:37] (03PS1) 10Urbanecm: Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) [08:58:14] (03PS3) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [08:59:23] (03CR) 10Jelto: [C:03+2] gitlab: refactor check for ssh-gitlab in restore script [puppet] - 10https://gerrit.wikimedia.org/r/1093948 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [09:01:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1053230|IPReputation: Enable everywhere (T360067)]] (duration: 15m 48s) [09:01:52] T360067: Deploy Extension:IPReputation - https://phabricator.wikimedia.org/T360067 [09:02:02] (03PS1) 10JMeybohm: Remove alert for sessionstore not running on dedicated nodes [alerts] - 10https://gerrit.wikimedia.org/r/1097311 (https://phabricator.wikimedia.org/T379599) [09:02:46] (03PS1) 10JMeybohm: Remove affinity and tolerations from sessionstore deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097312 (https://phabricator.wikimedia.org/T379599) [09:04:29] I'm done with deploying. 
Verified that log entries are appearing from Extension:IPReputation, and that `$wgCentralAuthIpoidUrl` is `false` [09:04:50] !log UTC morning deploys done [09:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:05:57] 06SRE, 10Data-Persistence-Backup, 10media-backups, 13Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10351687 (10jcrespo) {F57744909} {F57744915} [09:10:55] (03PS4) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:12:46] jouncebot: nowandnext [09:12:46] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [09:12:46] In 1 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1100) [09:12:52] (03CR) 10Ladsgroup: [C:03+2] Bump ratio of new parsercache key spec to 6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093956 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [09:13:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093956 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [09:13:23] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:13:32] !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly) [09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:40] (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 6 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1093956 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [09:13:56] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1093956|Bump ratio of new parsercache key spec to 6 (T373037)]] [09:13:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to plain [09:14:00] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [09:15:44] (03CR) 10Vgutierrez: [C:04-1] haproxy: add ring support to haproxy configuration (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:18:14] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1093956|Bump ratio of new parsercache key spec to 6 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:18:18] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:18:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain [09:19:33] (03CR) 10Arnaudb: [C:03+1] cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [09:20:27] PROBLEM - Bird Internet Routing Daemon on doh7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:20:29] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:20:31] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:20:45] PROBLEM - BGP status on asw1-b3-magru.mgmt is 
CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:22:27] RECOVERY - Bird Internet Routing Daemon on doh7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:22:29] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7001 is OK: OK: UP (pid=2386) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:22:31] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:22:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:22:45] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:25:02] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093956|Bump ratio of new parsercache key spec to 6 (T373037)]] (duration: 11m 05s) [09:25:07] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [09:28:57] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1094435 (https://phabricator.wikimedia.org/T380591) (owner: 10Brouberol) [09:29:46] (03CR) 10Btullis: [C:03+1] airflow-wmde: migrate scheduler to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097280 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [09:30:26] (03CR) 10Brouberol: [C:03+2] postgresql-airflow-analytics-test: add helmfile and configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094435 (https://phabricator.wikimedia.org/T380591) (owner: 10Brouberol) [09:30:28] (03PS17) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [09:30:33] (03CR) 10Vgutierrez: [C:04-1] benthos: add benthos for haproxy debug functions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:30:44] (03CR) 10Fabfur: haproxy: add ring support to haproxy configuration (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:31:24] 06SRE, 10Bitu, 06Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474#10351789 (10SLyngshede-WMF) 05In progress→03Resolved [09:32:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:32:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:34:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet [09:34:29] (03PS1) 10Gergő Tisza: Update private/readme.php to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097322 [09:34:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be2005 - 
https://phabricator.wikimedia.org/T370452#10351837 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: prep for prod [09:35:11] (03CR) 10CI reject: [V:04-1] Update private/readme.php to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097322 (owner: 10Gergő Tisza) [09:35:32] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1094448 (owner: 10Muehlenhoff) [09:36:14] (03CR) 10Btullis: "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1094420 (owner: 10Muehlenhoff) [09:36:18] (03CR) 10Btullis: [C:03+1] turnilo: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1094420 (owner: 10Muehlenhoff) [09:37:54] (03PS2) 10Brouberol: airflow-analytics-test: use the cloudnative PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094436 (https://phabricator.wikimedia.org/T380591) [09:37:54] (03PS1) 10Brouberol: airflow-analytics-test: add namespace to the cloudnativePG tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097323 (https://phabricator.wikimedia.org/T380622) [09:39:15] !log remove ganeti7003 from active Ganeti nodes in magru01 T376737 [09:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:19] T376737: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737 [09:41:29] PROBLEM - ganeti-confd running on ganeti7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:41:29] PROBLEM - ganeti-noded running on ganeti7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:42:12] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: add namespace to the cloudnativePG tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097323 
(https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [09:43:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:44:11] FIRING: [13x] ProbeDown: Service ganeti7003:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:53] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2005.codfw.wmnet [09:46:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet [09:47:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10351944 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: prep for prod [09:53:27] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:54:15] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti7003/7004 [puppet] - 10https://gerrit.wikimedia.org/r/1097324 [09:55:12] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10351959 (10MatthewVernon) It's worth noting here that this is causing icinga to never be happy on the new nodes -... 
[09:56:14] (03PS2) 10Arnaudb: mariadb: prod dbproxy200[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) [09:56:38] (03PS18) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [09:56:47] (03CR) 10Volans: "That's a good question. If we want to be on safe side we could just add the `-T Disable pseudo-terminal allocation.` CLI option to cu" [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [09:56:58] (03PS5) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:57:08] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:57:51] (03CR) 10Vgutierrez: [C:03+1] haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:58:30] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2005.codfw.wmnet [10:02:33] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-test: use the cloudnative PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094436 (https://phabricator.wikimedia.org/T380591) (owner: 10Brouberol) [10:02:43] (03CR) 10Vgutierrez: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:02:50] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: use the cloudnative PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094436 (https://phabricator.wikimedia.org/T380591) (owner: 10Brouberol) [10:06:05] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1005.eqiad.wmnet [10:06:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10352018 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: prep for prod [10:07:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:07:15] (03CR) 10Arnaudb: [C:03+2] mariadb: prod dbproxy200[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [10:07:25] !log extending backup1009 free filesystem [10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:06] (03PS19) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [10:10:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:10:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:11:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:11:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:14:31] 06SRE, 10bacula, 10Data-Persistence-Backup: Backup freshness: Stale: 1 (gerrit1003) [bacula ran out of available space/reciclable volumes] - https://phabricator.wikimedia.org/T380716 (10jcrespo) 03NEW [10:14:33] 06SRE, 10bacula, 10Data-Persistence-Backup: Backup freshness: Stale: 1 (gerrit1003) [bacula ran out of available space/reciclable volumes] - https://phabricator.wikimedia.org/T380716#10352049 (10jcrespo) p:05Triage→03High [10:15:35] 06SRE, 
06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10352050 (10FastLizard4) This has happened now on the [[ https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/IK7PIL... [10:16:03] (03PS1) 10Jcrespo: backups: Extending available bacula space to 100TB [puppet] - 10https://gerrit.wikimedia.org/r/1097326 (https://phabricator.wikimedia.org/T380716) [10:16:04] (03PS1) 10Gergő Tisza: SUL3: Sort overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097327 [10:16:04] (03PS1) 10Gergő Tisza: More authentication domain overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097328 [10:17:58] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1005.eqiad.wmnet [10:18:15] (03PS2) 10Gergő Tisza: SUL3: Sort overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737) [10:18:16] (03PS2) 10Gergő Tisza: More authentication domain overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) [10:19:11] FIRING: [13x] ProbeDown: Service ganeti7003:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:20] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10352059 (10cmooney) Link has been clean since the optic was replaced: {F57745141 width=600} I'll suggest to WMCS we put... 
[10:21:35] (03CR) 10David Caro: [V:03+1 C:03+2] profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - 10https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: 10Raymond Ndibe) [10:24:34] (03CR) 10Jcrespo: [C:03+2] backups: Extending available bacula space to 100TB [puppet] - 10https://gerrit.wikimedia.org/r/1097326 (https://phabricator.wikimedia.org/T380716) (owner: 10Jcrespo) [10:25:02] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not (P{cp4043.*} or P{cp4051.*}) and A:cp for 9.2.6-1wm2 [10:30:14] (03CR) 10Fabfur: [C:03+2] haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:33:23] 06SRE, 10bacula, 10Data-Persistence-Backup: Backup freshness: Stale: 1 (gerrit1003) [bacula ran out of available space/reciclable volumes] - https://phabricator.wikimedia.org/T380716#10352110 (10jcrespo) As usual, I had to do a storage daemon reload for the update to kick in (plus deleted errored volume). [10:35:46] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and not (P{cp5018.*} or P{cp5026.*}) and A:cp for 9.2.6-1wm2 [10:35:58] 06SRE, 10bacula, 10Data-Persistence-Backup: Backup freshness: Stale: 1 (gerrit1003) [bacula ran out of available space/reciclable volumes] - https://phabricator.wikimedia.org/T380716#10352116 (10jcrespo) ` JobId Type Level Files Bytes Name Status ======================================... 
[10:36:02] (03PS1) 10Giuseppe Lavagetto: aptrepo: remove old keys [puppet] - 10https://gerrit.wikimedia.org/r/1097332 [10:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:37:24] (03PS1) 10Elukey: profile::service_proxy::envoy: add tegola [puppet] - 10https://gerrit.wikimedia.org/r/1097333 (https://phabricator.wikimedia.org/T378944) [10:38:02] <_joe_> !log deleted pyall component from reprepro [10:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:07] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:39:19] (03CR) 10Kosta Harlan: Configure instrument for the Incident Reporting System (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [10:39:22] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:39:41] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10352131 (10jcrespo) >>! In T377853#10351959, @MatthewVernon wrote: > It's worth noting here that this is causing... 
[10:40:02] !log hashar@deploy2002 Started deploy [integration/docroot@d585f2b]: build: Updating cross-spawn to 7.0.6 [10:40:10] 06SRE, 10bacula, 10Data-Persistence-Backup: Backup freshness: Stale: 1 (gerrit1003) [bacula ran out of available space/reciclable volumes] - https://phabricator.wikimedia.org/T380716#10352137 (10jcrespo) 05Open→03Resolved ` [11:39:07] RECOVERY - Backup freshness on backup1001 is OK: Fresh... [10:40:13] !log hashar@deploy2002 Finished deploy [integration/docroot@d585f2b]: build: Updating cross-spawn to 7.0.6 (duration: 00m 10s) [10:40:22] (03CR) 10Máté Szabó: Configure instrument for the Incident Reporting System (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [10:42:20] (03CR) 10Kosta Harlan: Configure instrument for the Incident Reporting System (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [10:42:24] (03PS2) 10Máté Szabó: Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) [10:43:50] (03CR) 10CI reject: [V:04-1] Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [10:44:51] (03CR) 10Clément Goubert: [C:03+2] wikikube: Default to containerd partition layout [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) (owner: 10Clément Goubert) [10:45:23] (03CR) 10Máté Szabó: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [10:46:26] (03PS6) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) 
[10:46:33] (03CR) 10Jelto: [C:03+1] "lgtm! `deployment-charts/helmfile.d/services/sessionstore/values-staging.yaml` has empty values `{}` for `affinity` and `tolerations` whic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097312 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [10:46:39] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:46:46] (03CR) 10CI reject: [V:04-1] benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:48:20] (03CR) 10Jelto: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1097311 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [10:48:53] (03PS1) 10Slyngshede: Blocking: Allow multiple account managers groups [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 [10:50:03] (03CR) 10CI reject: [V:04-1] Blocking: Allow multiple account managers groups [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede) [11:13:55] We currently see a problem in the gate-and-submit pipelines for the Math extension (contint1002.wikimedia.org cannot be reached), see e.g. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/1095093 Is that a known problem? [11:16:43] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10352318 (10Ladsgroup) >>! In T379942#10346406, @Ladsgroup wrote: > My plan is to start a 16 parallel cleaners for commons thumbnails, the first one doing the clean up on containers e... 
[11:17:55] (03PS1) 10EoghanGaffney: mailman: Change task runner to operate every 12 hours [puppet] - 10https://gerrit.wikimedia.org/r/1097344 (https://phabricator.wikimedia.org/T377045) [11:18:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [11:18:42] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: update [puppet] - 10https://gerrit.wikimedia.org/r/1097343 (owner: 10Arturo Borrero Gonzalez) [11:19:56] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti7003/7004 [puppet] - 10https://gerrit.wikimedia.org/r/1097324 (owner: 10Muehlenhoff) [11:20:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [11:24:28] !log installing Linux 6.1.119 on Bookworm nodes [11:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:51] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10352341 (10MatthewVernon) We do only deploy the swift credentials to one frontend host per DC; and all the swift frontends have only 15 cores (they don't often end up CPU-bound - typ... [11:25:12] (03PS1) 10Btullis: Increase the hadoop directory max items limit on the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/1097346 (https://phabricator.wikimedia.org/T380674) [11:26:08] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10352374 (10eoghan) We've been doing some investigating over the last week, and it's a very hard problem to track down. No... 
[11:26:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4579/co" [puppet] - 10https://gerrit.wikimedia.org/r/1097346 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [11:26:40] (03PS1) 10Brouberol: airflow: disable variable management from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097347 (https://phabricator.wikimedia.org/T380727) [11:28:34] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10352383 (10MoritzMuehlenhoff) All VMs have moved away from ganeti7003/ganeti7004 and I've switched them to the insetup::infrastruct... [11:31:24] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10352382 (10cmooney) Ok the BGP downpref policy has been reverted, and we have routed traffic back running over the link.... 
[11:32:36] (03CR) 10Jelto: [C:03+1] "lgtm for troubleshooting the mailman race condition issue" [puppet] - 10https://gerrit.wikimedia.org/r/1097344 (https://phabricator.wikimedia.org/T377045) (owner: 10EoghanGaffney) [11:33:11] (03CR) 10EoghanGaffney: [C:03+2] mailman: Change task runner to operate every 12 hours [puppet] - 10https://gerrit.wikimedia.org/r/1097344 (https://phabricator.wikimedia.org/T377045) (owner: 10EoghanGaffney) [11:34:17] !log hashar@deploy2002 Installing scap version "4.128.0" for 211 hosts [11:34:31] (03PS2) 10Klausman: knative: Bump all images to latest release 1.16.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) [11:36:11] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10352404 (10Ladsgroup) >>! In T379942#10352341, @MatthewVernon wrote: > We do only deploy the swift credentials to one frontend host per DC; and all the swift frontends have only 15 c... 
[11:38:56] RECOVERY - MD RAID on wikikube-worker1256 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:39:13] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1002" [11:39:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1002" [11:39:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1256.eqiad.wmnet with OS bookworm [11:39:36] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352409 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bookworm completed: - wikikube-worker1256 (... 
[11:41:11] (03PS1) 10Btullis: Increase the heap size on the hadoop nameservers to 164 GB [puppet] - 10https://gerrit.wikimedia.org/r/1097349 (https://phabricator.wikimedia.org/T380674) [11:41:24] !log homer 'cr*eqiad*' commit 'T379454' [11:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:32] T379454: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454 [11:42:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4580/co" [puppet] - 10https://gerrit.wikimedia.org/r/1097349 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [11:42:44] PROBLEM - SSH on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:42:46] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:00] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:12] (03CR) 10Slyngshede: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede) [11:45:38] RECOVERY - SSH on ms-fe2009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:45:38] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:50] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [11:46:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:46:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:46:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T380449)', diff saved to https://phabricator.wikimedia.org/P71125 and previous config saved to /var/cache/conftool/dbconfig/20241125-114651-ladsgroup.json [11:47:14] !log hashar@deploy2002 Installing scap version "4.128.0" for 211 hosts [11:47:33] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [11:49:11] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1290.eqiad.wmnet [11:49:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1290.eqiad.wmnet [11:51:45] !log hashar@deploy2002 Installation of scap version "4.128.0" completed for 211 hosts [11:56:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1256.eqiad.wmnet [11:56:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1256.eqiad.wmnet [11:56:07] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352473 (10ops-monitoring-bot) pool host wikikube-worker1256.eqiad.wmnet by cgoubert@cumin1002 with reason: RAID ok [11:56:08] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352474 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker1256.eqiad.wmnet completed: - wikikube-worker1256.eqiad.... 
[11:57:02] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352475 (10Clement_Goubert) 05Open→03Resolved Host reimaged, RAID ok, repooled [11:57:29] (03PS2) 10Gergő Tisza: Update private/readme.php to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097322 [11:57:44] (03CR) 10JMeybohm: "It has only for tolerations and it's part of the PS already ;)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097312 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [11:58:10] (03CR) 10JMeybohm: [C:03+2] Remove alert for sessionstore not running on dedicated nodes [alerts] - 10https://gerrit.wikimedia.org/r/1097311 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [11:59:25] (03Merged) 10jenkins-bot: Remove alert for sessionstore not running on dedicated nodes [alerts] - 10https://gerrit.wikimedia.org/r/1097311 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [12:00:34] (03CR) 10JMeybohm: [C:03+2] Remove affinity and tolerations from sessionstore deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097312 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [12:01:12] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731 (10MoritzMuehlenhoff) 03NEW [12:01:20] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352520 (10MoritzMuehlenhoff) p:05Triage→03High [12:02:02] (03Merged) 10jenkins-bot: Remove affinity and tolerations from sessionstore deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097312 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [12:02:55] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker13[13-28] [puppet] - 10https://gerrit.wikimedia.org/r/1094381 (https://phabricator.wikimedia.org/T380350) (owner: 
10Clément Goubert) [12:03:15] (03CR) 10Elukey: "Quick first pass, I left some comments :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) (owner: 10Klausman) [12:03:16] !log hashar@deploy2002 Pruned MediaWiki: 1.39.0-wmf.1 (duration: 00m 37s) [12:03:56] jouncebot: nowandnext [12:03:56] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [12:03:57] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1400) [12:05:43] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352536 (10MoritzMuehlenhoff) [12:06:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet [12:06:29] !log hashar@deploy2002 Pruned MediaWiki: 1.39.0-wmf.1 (duration: 00m 40s) [12:09:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-worker2002.codfw.wmnet [12:10:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet [12:10:43] (03PS1) 10Clément Goubert: wikikube: Remove wikikube-worker1328 [puppet] - 10https://gerrit.wikimedia.org/r/1097361 (https://phabricator.wikimedia.org/T380350) [12:12:10] (03CR) 10Clément Goubert: [C:03+2] wikikube: Remove wikikube-worker1328 [puppet] - 10https://gerrit.wikimedia.org/r/1097361 (https://phabricator.wikimedia.org/T380350) (owner: 10Clément Goubert) [12:13:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-worker2002.codfw.wmnet [12:19:49] (03CR) 10FNegri: [C:03+2] wikireplica_dns: remove toolsdb and redis records [puppet] - 10https://gerrit.wikimedia.org/r/1034052 (https://phabricator.wikimedia.org/T374953) (owner: 10FNegri) [12:22:48] (03PS1) 10Brouberol: airflow: mention instance name in the email header [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1097366 [12:22:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-worker2003.codfw.wmnet [12:23:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-worker2004.codfw.wmnet [12:25:29] (03CR) 10Btullis: [C:03+1] airflow: mention instance name in the email header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097366 (owner: 10Brouberol) [12:25:34] (03PS1) 10JMeybohm: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [12:26:03] (03CR) 10Brouberol: [C:03+1] Increase the hadoop directory max items limit on the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/1097346 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [12:26:19] (03CR) 10Brouberol: [C:03+1] Increase the heap size on the hadoop nameservers to 164 GB [puppet] - 10https://gerrit.wikimedia.org/r/1097349 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [12:26:45] (03CR) 10Brouberol: [C:03+2] airflow: mention instance name in the email header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097366 (owner: 10Brouberol) [12:26:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-worker2003.codfw.wmnet [12:27:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-worker2004.codfw.wmnet [12:27:29] (03PS2) 10Brouberol: airflow: disable variable management from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097347 (https://phabricator.wikimedia.org/T380727) [12:28:32] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd and (A:cephosd) [12:29:41] (03PS2) 10JMeybohm: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [12:30:23] jouncebot: now [12:30:23] No deployments scheduled for the next 1 hour(s) and 29 
minute(s) [12:31:24] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:31:34] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:59] (03CR) 10Btullis: [V:03+1 C:03+2] Increase the hadoop directory max items limit on the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/1097346 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [12:32:32] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1179 gradually with 4 steps - Maint over [12:32:41] (03CR) 10Btullis: [V:03+1 C:03+2] Increase the heap size on the hadoop nameservers to 164 GB [puppet] - 10https://gerrit.wikimedia.org/r/1097349 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [12:32:54] 06SRE, 10SRE-swift-storage, 06Commons: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738 (10MatthewVernon) 03NEW [12:33:39] (03CR) 10Clément Goubert: [C:03+1] memcached: add mc-gp200[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092290 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [12:33:44] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and not (P{cp5018.*} or P{cp5026.*}) and A:cp for 9.2.6-1wm2 [12:34:22] (03CR) 10Clément Goubert: [C:03+1] memcached: add mc-gp100[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092280 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [12:35:06] (03PS3) 10Klausman: knative: Bump all images to latest release 1.16.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) [12:35:08] 
10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10352676 (10MatthewVernon) >>! In T379942#10352404, @Ladsgroup wrote: >>>! In T379942#10352341, @MatthewVernon wrote: >> We do only deploy the swift credentials to one frontend host p... [12:35:49] (03CR) 10Klausman: knative: Bump all images to latest release 1.16.0 (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) (owner: 10Klausman) [12:36:58] (03PS4) 10Klausman: knative: Bump all images to latest release 1.16.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) [12:37:24] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:37:34] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:51] (03CR) 10Klausman: "Good catch. 
I'll give it a whirl on minikube (once I find my notes on that again :))" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) (owner: 10Klausman) [12:39:56] (03PS1) 10Effie Mouzeli: mcrouter: update mcrouter exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097368 (https://phabricator.wikimedia.org/T380212) [12:40:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1313.eqiad.wmnet with OS bookworm [12:41:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1314.eqiad.wmnet with OS bookworm [12:41:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1315.eqiad.wmnet with OS bookworm [12:41:55] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup1010.eqiad.wmnet with reason: Reboot [12:42:09] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup1010.eqiad.wmnet with reason: Reboot [12:42:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1316.eqiad.wmnet with OS bookworm [12:42:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [12:42:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1317.eqiad.wmnet with OS bookworm [12:42:48] (03CR) 10Btullis: [C:03+1] "Looks good." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1097347 (https://phabricator.wikimedia.org/T380727) (owner: 10Brouberol) [12:42:49] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352694 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=25c79e5b-b394-4c20-ac85-265f1ab5a71d) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services wi... [12:43:00] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{kubestage100[5-6].eqiad.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-eqiad or A:ml-ser [12:43:00] ve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [12:43:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1318.eqiad.wmnet with OS bookworm [12:43:27] (03CR) 10Brouberol: [C:03+2] airflow: disable variable management from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097347 (https://phabricator.wikimedia.org/T380727) (owner: 10Brouberol) [12:43:39] (03PS1) 10Urbanecm: createExtensionTables: Use virtual domains for GrowthExperiments [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097369 (https://phabricator.wikimedia.org/T354939) [12:43:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1319.eqiad.wmnet with OS bookworm [12:44:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1320.eqiad.wmnet with OS bookworm [12:44:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
aux-k8s-worker2005.codfw.wmnet [12:45:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl2002.codfw.wmnet [12:46:24] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:46:52] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:47:35] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup1011.eqiad.wmnet with reason: Reboot [12:47:43] (03CR) 10Effie Mouzeli: [C:03+1] deployment-prep: Remove leftover hhvm config [puppet] - 10https://gerrit.wikimedia.org/r/1095282 (owner: 10Lucas Werkmeister) [12:47:49] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup1011.eqiad.wmnet with reason: Reboot [12:47:52] (03CR) 10Effie Mouzeli: [C:03+2] deployment-prep: Remove leftover hhvm config [puppet] - 10https://gerrit.wikimedia.org/r/1095282 (owner: 10Lucas Werkmeister) [12:47:53] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352717 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0c1a74c1-8a06-405c-a5fc-07dabe312239) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services wi... 
[12:48:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:48:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-worker2005.codfw.wmnet [12:48:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:49:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-ctrl2002.codfw.wmnet [12:49:22] (03CR) 10Clément Goubert: [C:03+1] mcrouter: update mcrouter exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097368 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [12:50:24] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:07] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352732 (10JMeybohm) [12:52:17] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add mc-gp200[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092290 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [12:52:24] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add mc-gp100[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092280 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [12:52:38] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352747 (10JMeybohm) [12:53:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: refresh for latest network changes [puppet] - 10https://gerrit.wikimedia.org/r/1097370 (https://phabricator.wikimedia.org/T380728) [12:53:52] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[12:53:53] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: refresh for latest network changes [puppet] - 10https://gerrit.wikimedia.org/r/1097370 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [12:54:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl2003.codfw.wmnet [12:54:51] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update mcrouter exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097368 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [12:55:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd2003.codfw.wmnet [12:56:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:56:58] (03Merged) 10jenkins-bot: mcrouter: update mcrouter exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097368 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli) [12:57:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:57:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd2004.codfw.wmnet [12:58:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{kubestage100[5-6].eqiad.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-maste [12:58:06] r-eqiad or A:ml-serve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [12:58:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
aux-k8s-ctrl2003.codfw.wmnet [12:58:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:59:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:59:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [12:59:25] (03PS1) 10Stevemunene: datahub: add datahub production index prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) [12:59:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd2003.codfw.wmnet [12:59:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:00:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd2005.codfw.wmnet [13:00:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [13:00:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1313.eqiad.wmnet with reason: host reimage [13:01:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [13:01:17] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [13:01:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [13:01:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd2004.codfw.wmnet [13:01:49] PROBLEM - Host dns7001 is DOWN: PING CRITICAL - Packet loss = 100% [13:02:01] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker1314.eqiad.wmnet with reason: host reimage [13:02:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [13:02:04] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: T376737 [13:02:17] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: T376737 [13:02:21] T376737: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737 [13:02:23] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:02:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1315.eqiad.wmnet with reason: host reimage [13:02:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Device asw1-b3-magru.mgmt.magru.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:02:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:02:35] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:43] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns7001.wikimedia.org with reason: T376737 [13:02:49] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [13:02:57] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns7001.wikimedia.org with reason: T376737 [13:03:01] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP 
CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [13:03:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti7003.magru.wmnet with reason: T376737 [13:03:19] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:03:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1316.eqiad.wmnet with reason: host reimage [13:03:28] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti7003.magru.wmnet with reason: T376737 [13:03:38] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti7004.magru.wmnet with reason: T376737 [13:03:51] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti7004.magru.wmnet with reason: T376737 [13:03:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1317.eqiad.wmnet with reason: host reimage [13:03:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd2005.codfw.wmnet [13:04:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1318.eqiad.wmnet with reason: host reimage [13:04:10] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs7003.magru.wmnet with reason: T376737 [13:04:24] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs7003.magru.wmnet with reason: T376737 [13:04:45] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: T376737 [13:04:47] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: T376737 [13:04:52] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: T376737 [13:04:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1313.eqiad.wmnet with reason: host reimage [13:05:01] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352797 (10jcrespo) [13:05:05] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352798 (10MoritzMuehlenhoff) [13:05:06] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:05:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: T376737 [13:05:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1319.eqiad.wmnet with reason: host reimage [13:05:13] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7008.magru.wmnet with reason: T376737 [13:05:27] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7008.magru.wmnet with reason: T376737 [13:05:35] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7015.magru.wmnet with reason: T376737 [13:05:49] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7015.magru.wmnet with reason: T376737 [13:05:56] PROBLEM - SSH on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:05:56] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [13:06:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1320.eqiad.wmnet with reason: host reimage [13:06:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:06:47] RECOVERY - SSH on ms-fe2009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:06:47] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [13:06:58] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [13:07:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:07:48] (03CR) 10Brouberol: [C:03+2] airflow-wmde: migrate scheduler to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097280 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [13:08:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:08:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1316.eqiad.wmnet with reason: host reimage [13:08:41] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:09:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply 
[13:09:25] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:09:29] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1317.eqiad.wmnet with reason: host reimage [13:12:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:13:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:13:51] (03PS3) 10JMeybohm: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [13:13:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1315.eqiad.wmnet with reason: host reimage [13:14:33] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352833 (10MoritzMuehlenhoff) [13:14:54] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - 
https://phabricator.wikimedia.org/T376737#10352818 (10RobH) {F57745607} {F57745609} [13:15:04] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:15:04] PROBLEM - SSH on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:15:18] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:15:38] (03CR) 10Kosta Harlan: [C:03+1] Configure instrument for the Incident Reporting System (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [13:15:54] RECOVERY - SSH on ms-fe2009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:15:55] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [13:16:08] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [13:16:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: T373579, host is WIP [13:16:26] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [13:16:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: T373579, host is WIP [13:17:04] (03PS4) 10JMeybohm: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [13:17:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1319.eqiad.wmnet with reason: host reimage [13:17:30] RESOLVED: [2x] 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:17:30] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:34] PROBLEM - BGP status on lsw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1179 gradually with 4 steps - Maint over [13:18:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:18:39] (03CR) 10Btullis: datahub: add datahub production index prefix (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: 10Stevemunene) [13:20:37] (03PS1) 10Brouberol: airflow-wmde: fix instance name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097376 (https://phabricator.wikimedia.org/T380622) [13:21:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1314.eqiad.wmnet with reason: host reimage [13:22:02] (03CR) 10Brouberol: [C:03+2] airflow-wmde: fix instance name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097376 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [13:22:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:22:58] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 412MiB (2% inode=48%): /tmp 412MiB (2% inode=48%): /var/tmp 412MiB (2% inode=48%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [13:23:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:23:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [13:24:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1313.eqiad.wmnet with OS bookworm [13:24:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:24:30] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:24:34] RECOVERY - BGP status on lsw1-f1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1320.eqiad.wmnet with reason: host reimage [13:24:58] (03CR) 10CI reject: [V:04-1] k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 (owner: 10JMeybohm) [13:25:49] (03PS5) 10JMeybohm: k8s.reboot-nodes: 
Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [13:25:54] (03PS6) 10DDesouza: Reader Survey: Deploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) [13:27:11] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{wikikube-worker[2128-2170].codfw.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-eqiad or [13:27:11] A:ml-serve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [13:27:18] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352915 (10ops-monitoring-bot) Started rebooting nodes in wikikube-codfw cluster: * wikikube-worker[2128-2170].codfw.wmnet [13:28:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1316.eqiad.wmnet with OS bookworm [13:28:17] (03PS2) 10Stevemunene: datahub: add datahub production index prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) [13:28:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1318.eqiad.wmnet with reason: host reimage [13:28:40] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{wikikube-worker[1305-1312].eqiad.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or 
A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-eqiad or [13:28:40] A:ml-serve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [13:28:48] (03CR) 10Stevemunene: datahub: add datahub production index prefix (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: 10Stevemunene) [13:28:50] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10352924 (10ops-monitoring-bot) Started rebooting nodes in wikikube-eqiad cluster: * wikikube-worker[1305-1312].eqiad.wmnet [13:29:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:29:28] (03CR) 10Btullis: [C:03+1] datahub: add datahub production index prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: 10Stevemunene) [13:30:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1317.eqiad.wmnet with OS bookworm [13:30:19] (03PS1) 10Slyngshede: Blocking: Show current user LDAP status [software/bitu] - 10https://gerrit.wikimedia.org/r/1097378 [13:30:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:30:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:06] PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:15] (03PS6) 10JMeybohm: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 [13:32:06] RECOVERY - BGP status on lsw1-b2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:06] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [13:32:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1246.eqiad.wmnet [13:32:30] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:32:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1315.eqiad.wmnet with OS bookworm [13:32:36] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1319.eqiad.wmnet with OS bookworm [13:36:30] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:42] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@006515b]: Testing the new k8s deployment [13:37:55] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@006515b]: Testing the new k8s deployment (duration: 02m 34s) [13:38:18] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [13:38:22] !log jayme@deploy2002 
helmfile [staging] DONE helmfile.d/services/sessionstore: apply [13:38:29] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [13:38:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host db1246.eqiad.wmnet [13:38:36] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:08] PROBLEM - BGP status on lsw1-b4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:04] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd and (A:cephosd) [13:40:08] RECOVERY - BGP status on lsw1-b4-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1314.eqiad.wmnet with OS bookworm [13:41:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2041.codfw.wmnet [13:42:07] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [13:42:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2042.codfw.wmnet [13:42:28] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [13:43:20] !log cordoned 
kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet - T379599 [13:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:24] T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters - https://phabricator.wikimedia.org/T379599 [13:43:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1320.eqiad.wmnet with OS bookworm [13:44:23] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [13:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:36] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [13:44:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:44:47] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: use vlan1120/vlan2120 prefix for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1097380 (https://phabricator.wikimedia.org/T380728) [13:46:04] !log deployed sessionstore to non-dedicated nodes - T379599 [13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:09] (03CR) 10D3r1ck01: "Thanks for review, I'll also wait for a signal from Krinkle before I schedule this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:47:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1318.eqiad.wmnet with OS bookworm [13:47:08] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:09] (03PS1) 10Bartosz Dziewoński: Pass context to 'revreview-pending-basic' on history page [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097381 (https://phabricator.wikimedia.org/T380519) [13:47:13] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: use vlan1120/vlan2120 prefix for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1097380 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [13:47:24] (03PS1) 10Bartosz Dziewoński: Use Contexts for Message objects in review dialog (tooltip) [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097382 (https://phabricator.wikimedia.org/T380519) [13:47:26] (03PS2) 10Giuseppe Lavagetto: aptrepo: remove old keys [puppet] - 10https://gerrit.wikimedia.org/r/1097332 [13:47:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:47:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097381 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [13:47:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/FlaggedRevs] 
(wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097382 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [13:47:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:47:46] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10353057 (10BTullis) [13:48:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2041.codfw.wmnet [13:48:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2042.codfw.wmnet [13:49:04] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:08] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:04] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2043.codfw.wmnet [13:50:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2044.codfw.wmnet [13:51:17] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the performace CPU governor on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: 10Btullis) [13:51:29] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_codfw and A:cp for 9.2.6-1wm2 [13:51:42] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_codfw and A:cp for 9.2.6-1wm2 
[13:54:21] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10353066 (10jcrespo) [13:54:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:55:12] PROBLEM - BGP status on lsw1-c4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:20] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10353072 (10RobH) So the first swap went with some issues, detailed in my followup to the ticket just now: > Support, > > Please... 
[13:56:12] RECOVERY - BGP status on lsw1-c4-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2043.codfw.wmnet [13:56:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2044.codfw.wmnet [13:56:40] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:40] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2045.codfw.wmnet [13:59:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host es2046.codfw.wmnet [13:59:23] (03CR) 10CDanis: [C:03+2] k8s: temp. enforce maximum cluster size [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1400). [14:00:05] danisztls, MatmaRex, James_F, and dbrant: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] * James_F waves. [14:00:11] o/ [14:00:15] o/ [14:01:28] hi [14:01:38] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [14:01:43] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [14:01:43] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:16] my wmf.4 backports can go out at the same time. my config patch should be a no-op. [14:03:08] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:37] o/ [14:03:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2046.codfw.wmnet [14:04:08] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:23] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10353130 (10MoritzMuehlenhoff) [14:04:32] hello [14:04:36] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:40] Lucas_WMDE: [14:04:55] and others, my deployment took longer than expected [14:05:18] I am afraid I will delay you lot a wee bit [14:05:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es2045.codfw.wmnet [14:05:36] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - 
up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:05:52] (03CR) 10Bartosz Dziewoński: [C:03+1] SUL3: Sort overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [14:06:01] ok [14:06:17] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:06:23] (03CR) 10Bartosz Dziewoński: [C:03+1] More authentication domain overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [14:09:54] this is me too [14:11:08] PROBLEM - BGP status on lsw1-d2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:08] RECOVERY - BGP status on lsw1-d2-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:17] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:13:21] * Lucas_WMDE looks at the scheduled config changes meanwhile [14:13:31] not sure we’ll get through all of them [14:13:43] (03PS1) 10CDanis: Revert "k8s: temp. 
enforce maximum cluster size" [puppet] - 10https://gerrit.wikimedia.org/r/1097389 [14:14:22] I’m guessing MatmaRex’ backports are the most important thing, followed by dbrant’s stream config [14:14:58] and then probably danisztls (no effect yet but unblocks testing), and then MatmaRex + James_F config changes (no-op cleanups) [14:15:24] my backports can both go out at the same time [14:15:26] effie: any idea how long you’ll need? just wondering if I should start gate-and-submit for the MW backports already :) [14:15:45] (IIUC they *should* be faster than usual because parallel testing was enabled for core earlier today) [14:15:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:15:48] Lucas_WMDE: we have hit another issue with eqiad which is not relevant to my deployment [14:15:50] (03PS1) 10David Caro: cloudcephmon1004: provision as mon [puppet] - 10https://gerrit.wikimedia.org/r/1097390 (https://phabricator.wikimedia.org/T364870) [14:15:53] (but idk if that also applies to the wmf branches) [14:16:11] Lucas_WMDE: and if I was not deploying, it would prolly hit your deployment :) [14:16:17] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:16:17] hm, ok [14:16:18] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10353154 (10RobH) [14:16:20] hi all, eqiad wikikube is in an unfortunate state right now, and is not able to handle 
deployments [14:17:00] ok [14:17:10] so no backport/config deployments until further notice? [14:17:32] for now yes, however we are optimistic, it will not take long [14:17:37] ack [14:17:39] good luck :) [14:18:42] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:17] uh [14:19:40] !log disable puppet and kubelet on wikikube-worker13[13-28].eqiad.wmnet for ip exhaustion T375845 [14:19:42] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:46] T375845: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845 [14:20:29] !log Manually deleting wikikube-worker13[13-20].eqiad.wmnet for ip exhaustion T375845 [14:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:22] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10353180 (10VRiley-WMF) [14:22:04] (03PS1) 10CDanis: Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 [14:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:15] (03CR) 10CI reject: [V:04-1] Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (owner: 10CDanis) [14:22:38] (03CR) 10Muehlenhoff: [C:03+1] "That sounds good to me, but we can also 
be bold and move forward, both works for me" [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [14:22:55] (03PS2) 10CDanis: Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) [14:22:58] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 418MiB (2% inode=48%): /tmp 418MiB (2% inode=48%): /var/tmp 418MiB (2% inode=48%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [14:23:06] (03CR) 10CI reject: [V:04-1] Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) (owner: 10CDanis) [14:24:57] (03PS1) 10Giuseppe Lavagetto: Add tooltips [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1097394 [14:25:43] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add tooltips [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1097394 (owner: 10Giuseppe Lavagetto) [14:26:06] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add tooltips - oblivian@cumin1002" [14:26:09] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add tooltips - oblivian@cumin1002 [14:26:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:26:19] !log prune unneeded kernels from grafana2001 [14:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:42] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - 
AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:26:45] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add tooltips - oblivian@cumin1002 [14:26:47] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add tooltips - oblivian@cumin1002" [14:27:06] (03PS5) 10Klausman: knative: Bump all images to latest release 1.12.x [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) [14:27:13] Lucas_WMDE: proceed, and I will resume my deployment after you [14:28:17] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:28:33] ok! 
[14:28:44] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097381 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [14:28:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097382 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [14:28:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:57] starting with the FlaggedRevs backports [14:31:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:31:28] (03PS3) 10Clément Goubert: Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) (owner: 10CDanis) [14:33:02] (03CR) 10CDanis: [C:03+1] Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) (owner: 10CDanis) [14:33:04] (03CR) 10Klausman: "I've backed down the version from v1.16 to v1.12.4 (3 for net-istio). 
This would still make k8s v1.28 and 1.29 the recommended versions to" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) (owner: 10Klausman) [14:33:09] (03PS1) 10Cathal Mooney: Add reverse for new IPv6 range assigned by cloud services [dns] - 10https://gerrit.wikimedia.org/r/1097397 (https://phabricator.wikimedia.org/T380174) [14:33:33] (03CR) 10Clément Goubert: [C:03+2] Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) (owner: 10CDanis) [14:34:08] PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:28] (03PS6) 10Klausman: knative: Bump all images to latest release 1.12.x [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) [14:34:31] (03CR) 10Effie Mouzeli: [C:03+1] Revert "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097392 (https://phabricator.wikimedia.org/T380350) (owner: 10CDanis) [14:35:18] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10353260 (10RobH) Ok, they introduced another mistake: > Support, > > 1HR3PZ3 shows a power supply failure (please check the pow... 
[14:36:08] RECOVERY - BGP status on lsw1-b2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:11] (03CR) 10Ssingh: [C:03+1] Add reverse for new IPv6 range assigned by cloud services [dns] - 10https://gerrit.wikimedia.org/r/1097397 (https://phabricator.wikimedia.org/T380174) (owner: 10Cathal Mooney) [14:36:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:38:24] the core backport noselenium build finished in 7m7s btw \o/ [14:38:28] (03CR) 10Cathal Mooney: [C:03+2] Add reverse for new IPv6 range assigned by cloud services [dns] - 10https://gerrit.wikimedia.org/r/1097397 (https://phabricator.wikimedia.org/T380174) (owner: 10Cathal Mooney) [14:38:31] wait [14:38:34] what am I talking about [14:38:38] it’s FlaggedRevs not core :D [14:39:13] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:39:26] Lucas_WMDE: we will deduct that from your salary [14:39:39] nooooo [14:39:46] not my precious WMF salary [14:40:00] :> [14:40:01] the split jobs in extensions seem much faster than the core jobs [14:41:22] (03Merged) 10jenkins-bot: Pass context to 'revreview-pending-basic' on history page [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097381 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [14:41:24] (03Merged) 10jenkins-bot: Use Contexts for Message objects in review dialog (tooltip) [extensions/FlaggedRevs] (wmf/1.44.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1097382 (https://phabricator.wikimedia.org/T380519) (owner: 10Bartosz Dziewoński) [14:41:32] yay [14:41:43] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1097381|Pass context to 'revreview-pending-basic' on history page (T380519)]], [[gerrit:1097382|Use Contexts for Message objects in review dialog (tooltip) (T380519)]] [14:41:58] T380519: FULLPAGENAMEE produces Special:Badtitle/Message in pending changes message - https://phabricator.wikimedia.org/T380519 [14:42:08] PROBLEM - BGP status on lsw1-b4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:55] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverse IPv6 includes to dns repo for vlan1107 - cmooney@cumin1002" [14:42:58] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [14:43:08] RECOVERY - BGP status on lsw1-b4-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:30] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverse IPv6 includes to dns repo for vlan1107 - cmooney@cumin1002" [14:44:30] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:45:26] (03PS4) 10Bking: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:45:40] (03PS4) 10Bking: opensearch: Add resource to define cross-cluster 
settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:46:13] one of the test server checks failed [14:46:14] (03PS5) 10Bking: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:46:19] expected 200 got 503 [14:46:28] retrying [14:46:32] (03PS23) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:46:53] (03PS5) 10Bking: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:47:00] (03PS6) 10Bking: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:47:06] (03PS24) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:47:17] failed again but this time a different check [14:47:29] first one was the internal.w.o check in k8s-2_of_2 [14:47:39] second one was the techconduct.w.o check in baremetal-1_of_1 [14:47:42] o_O [14:47:47] retrying again… [14:47:51] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_codfw and A:cp for 9.2.6-1wm2 [14:47:56] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1310-1312].eqiad.wmnet [14:47:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1310-1312].eqiad.wmnet [14:48:17] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Backport for 
[[gerrit:1097381|Pass context to 'revreview-pending-basic' on history page (T380519)]], [[gerrit:1097382|Use Contexts for Message objects in review dialog (tooltip) (T380519)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:48:21] T380519: FULLPAGENAMEE produces Special:Badtitle/Message in pending changes message - https://phabricator.wikimedia.org/T380519 [14:48:25] now it went through [14:48:29] MatmaRex: please test [14:48:39] anyone else have an idea about those failed checks? [14:48:47] I don’t see much to go on, haven’t found the errors in logstash yet [14:49:02] the first was on mwdebug-next, the second on mwdebug1002 [14:49:08] (03PS1) 10Cathal Mooney: Revert "Add reverse for new IPv6 range assigned by cloud services" [dns] - 10https://gerrit.wikimedia.org/r/1097401 [14:49:19] looking [14:49:27] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_codfw and A:cp for 9.2.6-1wm2 [14:49:45] (03CR) 10Ssingh: [C:03+1] "Sounds good!" [dns] - 10https://gerrit.wikimedia.org/r/1097401 (owner: 10Cathal Mooney) [14:50:04] Lucas_WMDE: all good [14:50:08] PROBLEM - BGP status on lsw1-b4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Continuing with sync [14:50:23] (03CR) 10CI reject: [V:04-1] Revert "Add reverse for new IPv6 range assigned by cloud services" [dns] - 10https://gerrit.wikimedia.org/r/1097401 (owner: 10Cathal Mooney) [14:50:24] ok, thanks! 
[14:50:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:50:34] hmm [14:50:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:50:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:50:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:51:02] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:02] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:08] RECOVERY - BGP status on lsw1-b4-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:50] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run 
[14:51:52] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:51:56] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:52:02] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:52:24] the above is a1d01a63b4aab9411af88c598cbe6a28146274ec not being merged but a revert is in place, so it should clear up after that [14:52:28] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:52:28] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is cc101a0bbb5db32aaf8fb87ccd24d70d4010c11e, dns.git is a1d01a63b4aab9411af88c598cbe6a28146274ec) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:52:47] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1309.eqiad.wmnet [14:52:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1309.eqiad.wmnet [14:54:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1309.eqiad.wmnet 
with OS bookworm [14:54:12] (03PS1) 10Novem Linguae: enwiki: add "mergehistory" to "import" user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 (https://phabricator.wikimedia.org/T380753) [14:54:34] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove vlan1107 IPv6 entries - cmooney@cumin1002" [14:56:15] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1009-1014].eqiad.wmnet [14:56:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove vlan1107 IPv6 entries - cmooney@cumin1002" [14:56:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:57:08] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:57:17] Lucas_WMDE: I think the k8s check failures are possibly because of recreate instead of rolling upgrade [14:57:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097381|Pass context to 'revreview-pending-basic' on history page (T380519)]], [[gerrit:1097382|Use Contexts for Message objects in review dialog (tooltip) (T380519)]] (duration: 15m 35s) [14:57:24] T380519: FULLPAGENAMEE produces Special:Badtitle/Message in pending changes message - https://phabricator.wikimedia.org/T380519 [14:57:35] should I continue deploying? 
[14:57:36] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:57:45] Lucas_WMDE: yeah yeah [14:57:52] dbrant: are you still there? [14:57:56] I would do your change next [14:57:58] yep [14:57:59] claime: ok thanks :) [14:59:07] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:59:14] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1094511|New stream config for Android Rabbit Holes feature. (T380107)]] [14:59:18] T380107: Android Rabbit Holes Data Instrumentation - https://phabricator.wikimedia.org/T380107 [14:59:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1009-1014].eqiad.wmnet [15:01:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:49] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:01:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:01:57] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:02:26] !log 
fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_drmrs and A:cp for 9.2.6-1wm2 [15:02:27] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:02:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:02:35] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_drmrs and A:cp for 9.2.6-1wm2 [15:03:44] !log lucaswerkmeister-wmde@deploy2002 dbrant, lucaswerkmeister-wmde: Backport for [[gerrit:1094511|New stream config for Android Rabbit Holes feature. (T380107)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:04:37] dbrant: can you test the change on mwdebug? 
[15:05:07] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:07] checking [15:05:27] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:05:37] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:05:37] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:05:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:06:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:03] RECOVERY - check if authdns-update was run after a change was 
merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:05] Lucas_WMDE: and... looks good! [15:06:07] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:06:07] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:11] !log lucaswerkmeister-wmde@deploy2002 dbrant, lucaswerkmeister-wmde: Continuing with sync [15:08:14] great, thanks! 
[15:08:18] sorry, got distracted for a sec [15:10:19] (03PS1) 10Clément Goubert: wikikube: Decommission kubernetes10[09-14] 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097405 (https://phabricator.wikimedia.org/T380027) [15:10:19] (03PS1) 10Clément Goubert: wikikube: Decommission kubernetes10[09-14] 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097406 (https://phabricator.wikimedia.org/T380027) [15:10:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [15:10:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [15:11:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T380449)', diff saved to https://phabricator.wikimedia.org/P71131 and previous config saved to /var/cache/conftool/dbconfig/20241125-151103-ladsgroup.json [15:11:09] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:17] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [15:11:47] RECOVERY - Host dns7001 is UP: PING WARNING - Packet loss = 90%, RTA = 30.32 ms [15:11:51] PROBLEM - Host 2a02:ec80:700:1:195:200:68:5 is DOWN: CRITICAL - Host Unreachable (2a02:ec80:700:1:195:200:68:5) [15:11:57] PROBLEM - Host 195.200.68.5 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:09] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:26] (03CR) 10JMeybohm: [C:03+1] wikikube: Decommission kubernetes10[09-14] 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097405 (https://phabricator.wikimedia.org/T380027) (owner: 10Clément Goubert) [15:13:27] (03CR) 10JMeybohm: [C:03+1] wikikube: Decommission 
kubernetes10[09-14] 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097406 (https://phabricator.wikimedia.org/T380027) (owner: 10Clément Goubert) [15:13:27] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decommission kubernetes10[09-14] 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097405 (https://phabricator.wikimedia.org/T380027) (owner: 10Clément Goubert) [15:14:08] (03CR) 10Cathal Mooney: [C:03+1] Disable SSH password auth on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/1091725 (https://phabricator.wikimedia.org/T379464) (owner: 10Ayounsi) [15:14:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094511|New stream config for Android Rabbit Holes feature. (T380107)]] (duration: 15m 45s) [15:15:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [15:15:04] T380107: Android Rabbit Holes Data Instrumentation - https://phabricator.wikimedia.org/T380107 [15:15:15] jouncebot: now [15:15:15] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [15:15:36] danisztls: are you still there? [15:15:42] !log robh@cumin1002 START - Cookbook sre.dns.netbox [15:15:46] Lucas_WMDE: yep [15:15:47] I’d deploy the enwiki reader survey and then close the window I think [15:15:54] and leave the no-op config cleanups for another time [15:15:56] ok! [15:15:56] thnx [15:16:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:17:19] (03CR) 10Lucas Werkmeister (WMDE): "FWIW, I feel like QuickSurveys would be a nice candidate for another separate `wmf-config/ext-*.php` file with split settings." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:17:33] (03Merged) 10jenkins-bot: Reader Survey: Deploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:17:51] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1093987|Reader Survey: Deploy on enwiki (T378660)]] [15:17:55] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [15:18:06] !log robh@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:11] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:18:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [15:19:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubernetes[1009-1014].eqiad.wmnet [15:20:11] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:56] !log lucaswerkmeister-wmde@deploy2002 dani, lucaswerkmeister-wmde: Backport for [[gerrit:1093987|Reader Survey: Deploy on enwiki (T378660)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [15:22:05] danisztls: can you test on mwdebug? [15:22:21] Lucas_WMDE: yes [15:23:04] Lucas_WMDE: looks good [15:23:08] !log lucaswerkmeister-wmde@deploy2002 dani, lucaswerkmeister-wmde: Continuing with sync [15:23:10] ok, thanks! [15:23:47] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:46] “Logstash checker Counted 950 error(s) in the last 20 seconds. The threshold is 10.” [15:25:51] “PHP Deprecated: Use of QuickSurveys survey with description parameter was deprecated in MediaWiki 1.43.” [15:25:58] ditto for question, link, and instanceTokenParameterName parameter [15:25:59] -.- [15:26:15] PROBLEM - BGP status on lsw1-c4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:32] *looks at the other surveys* [15:26:37] looks like it should now be an array of questions [15:27:06] Lucas_WMDE: I think you are right, there were changes in how QS work [15:27:11] danisztls: I think we need either a follow-up patch for the new format or a revert [15:27:15] RECOVERY - BGP status on lsw1-c4-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:31] If you can wait a few minutes I will do a follow up [15:27:40] ok [15:27:47] I’ll leave the scap open to keep the lock held [15:27:56] (unless someone else is waiting to do a deploy? let me know) [15:28:00] (03PS7) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [15:28:47] (03CR) 10Andrew Bogott: "Yep, not breaking resolv.conf is the whole idea of this patch, I just wanted to make sure it doesn't break dns because that's scary!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:28:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:29:10] and in the meantime I assume it’s fine to leave this code on the canary servers, as it’s “only” a deprecation [15:29:39] uh. but one with 14k occurrences already in logspam-watch [15:29:40] hm [15:29:45] *thinks* [15:29:54] yeah I guess that gets hit for every (uncached) enwiki page view? [15:30:36] I really should’ve looked at mwdebug logstash before continuing with that sync :/ [15:31:26] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:32:20] (03PS1) 10Brouberol: airflow-wmde: enable traffic to the airflow-search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097409 (https://phabricator.wikimedia.org/T380622) [15:32:21] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:32:49] (03PS1) 10DDesouza: Reader Survey: Fix question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097410 (https://phabricator.wikimedia.org/T378660) [15:33:11] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T380449)', diff saved to https://phabricator.wikimedia.org/P71132 and previous config saved to /var/cache/conftool/dbconfig/20241125-153354-ladsgroup.json [15:33:59] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 
[15:34:50] Lucas_WMDE: fix is ready on 1097410 [15:35:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097410 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:35:11] Lucas_WMDE: sry I should've double-checked the QS documentation [15:35:11] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:18] danisztls: thanks, trying to deploy that now [15:35:53] (03Merged) 10jenkins-bot: Reader Survey: Fix question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097410 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:36:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:36:13] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1097410|Reader Survey: Fix question (T378660)]] [15:36:17] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [15:36:48] (03CR) 10Andrew McAllister (WMDE): [C:03+1] airflow-wmde: enable traffic to the airflow-search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097409 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [15:37:07] kind of mind-boggling to think that *just the canary servers* apparently get something close to 50k enwiki requests in 15 minutes [15:37:15] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup2010.codfw.wmnet with reason: Reboot [15:37:28] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup2010.codfw.wmnet with reason: Reboot [15:37:35] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup2011.codfw.wmnet with reason: Reboot 
[15:37:37] Lucas_WMDE: that's a lot [15:37:44] yup [15:37:49] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup2011.codfw.wmnet with reason: Reboot [15:38:04] (03CR) 10Brouberol: [C:03+2] airflow-wmde: enable traffic to the airflow-search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097409 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [15:38:04] (03CR) 10Andrew Bogott: [C:03+1] cloudcephmon1004: provision as mon [puppet] - 10https://gerrit.wikimedia.org/r/1097390 (https://phabricator.wikimedia.org/T364870) (owner: 10David Caro) [15:38:05] (132k total errors in logstash but that’s inflated by it being four deprecation warnings per request, one for each parameter) [15:38:10] (03CR) 10Ssingh: "For peace of mind, running PCC on C:resolvconf, with:" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:38:13] RECOVERY - Host 195.200.68.5 is UP: PING WARNING - Packet loss = 50%, RTA = 30.49 ms [15:38:43] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:47] Lucas_WMDE: I'm not sure how to test the fix. From the front-end it looked good already. [15:38:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1309.eqiad.wmnet with OS bookworm [15:38:57] PROBLEM - Recursive DNS on 195.200.68.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:39:06] yeah, I’ll just have to look at logstash I think [15:39:14] apparently Special:BlankPage is enough to trigger the warning [15:39:27] can you test that the survey still works? 
(once it’s deployed) [15:39:35] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:40:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dani: Backport for [[gerrit:1097410|Reader Survey: Fix question (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:40:04] Lucas_WMDE: yes [15:40:15] logstash looks good so far [15:40:26] there’s some “Could not load user for revision 1” (was already there before) but no more deprecation warnings [15:40:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [15:40:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [15:41:11] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:34] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:41:35] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts kubernetes[1009-1014].eqiad.wmnet [15:41:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dani: Continuing with sync [15:41:55] I’ll go ahead and sync already [15:42:01] even if it breaks the survey, we should fix it afterwards [15:42:07] Lucas_WMDE: doesn't look good, 'yesMsg' and 'noMsg' are not optional as they were previously [15:42:08] but roll out the deprecation warning fix immediately [15:42:11] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:12] hm, wait? [15:42:20] so what does that mean?
[15:42:25] I’m still not seeing anything in logstash [15:42:29] !log robh@cumin2002 START - Cookbook sre.dns.netbox [15:42:38] Lucas_WMDE: it shows blank buttons now :) [15:42:45] ah [15:43:01] but that’s much more harmless than 158k logstash warnings ;) [15:43:13] if the survey has 0 coverage, at least [15:43:19] right? [15:43:29] it just means it needs another fix before the coverage is cranked up [15:44:09] (my main priority right now is getting logstash into a usable state again) [15:44:10] (03PS1) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [15:44:27] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:34] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decommission kubernetes10[09-14] 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/1097406 (https://phabricator.wikimedia.org/T380027) (owner: 10Clément Goubert) [15:44:37] (03PS1) 10DDesouza: Reader Survey: Fix yes/no messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097417 (https://phabricator.wikimedia.org/T378660) [15:44:49] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:45:01] Lucas_WMDE: yep, I should've tested this on labs first [15:45:10] Lucas_WMDE: I don't want to take more of your time now [15:45:55] FIRING: MaxConntrack: Max conntrack at 91.78% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [15:46:02] !log homer cr*eqiad* commit 'T380027' [15:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:06] T380027:
Decommission kubernetes10[09-14] - https://phabricator.wikimedia.org/T380027 [15:46:35] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru swaps - robh@cumin2002" [15:46:36] Lucas_WMDE: yes, 0 coverage makes it harmless [15:46:49] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:55] okay, thanks for confirming :) [15:46:55] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1091249/4581/" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:47:22] also I just noticed looking at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1097410/1/wmf-config/InitialiseSettings.php that the questions block is indented more than it should be but apparently PHPCS was fine with it 🤷 [15:47:23] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru swaps - robh@cumin2002" [15:47:24] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:33] (03CR) 10Ssingh: [C:03+1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:48:02] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:48:11] PROBLEM - BGP status on lsw1-d2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:48:34] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns7001.mgmt.magru.wmnet with chassis set policy
FORCE_RESTART and with Dell SCP reboot policy FORCED [15:48:41] (03CR) 10Ssingh: [C:03+1] "Traffic will merge this when a suitable window opens (we have magru work ongoing right now so don't want to mix it.)" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:49:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P71133 and previous config saved to /var/cache/conftool/dbconfig/20241125-154901-ladsgroup.json [15:49:16] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097410|Reader Survey: Fix question (T378660)]] (duration: 13m 02s) [15:49:20] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [15:50:11] RECOVERY - BGP status on lsw1-d2-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:55] RESOLVED: MaxConntrack: Max conntrack at 91.78% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [15:51:19] (03PS1) 10Bking: WIP: wdqs: Remove Search Platform's blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1097421 (https://phabricator.wikimedia.org/T379182) [15:52:06] danisztls: left a summary at T378660#10353724 [15:52:21] and I think that’s enough deployment for me for today ;) [15:52:45] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:52:55] !log UTC afternoon backport+config window done (apologies for the temporary flood of “Use of QuickSurveys survey” deprecation warnings – should be fixed again) [15:52:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:15] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:15] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:38] (03PS1) 10Brouberol: global_config: open port 8600 (webserver) for airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1097424 (https://phabricator.wikimedia.org/T364389) [15:54:48] I looked at logstash again just to be sure, but it looks like those messages really all came from the canary hosts [15:54:56] god knows how many more it would’ve been on a full deployment [15:55:19] out of curiosity… is it easily possible for a deployer to temporarily depool the canary hosts? [15:55:41] “we know we messed up here, let’s stop user traffic to them for a few minutes while the fix is being prepared and rolled out” [15:55:47] hmmm [15:55:48] Lucas_WMDE: thanks again Lucas_WMDE [15:56:04] danisztls: np [15:56:04] (03CR) 10Andrew McAllister (WMDE): [C:03+1] global_config: open port 8600 (webserver) for airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1097424 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [15:56:09] it was more exciting than the usual deployment at least ;) [15:56:11] PROBLEM - BGP status on lsw1-d1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:56:12] good question Lucas_WMDE [15:57:07] (03CR) 10Brouberol: [C:03+2] global_config: open port 8600 (webserver) for airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1097424 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [15:57:11] RECOVERY - BGP 
status on lsw1-d1-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:57:17] does that mean I should turn it into a feature request on phabricator? ^^ [15:57:59] Lucas_WMDE: I'm trying to think of a way that isn't "destroy the release" [15:58:32] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:00:45] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:26] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_drmrs and A:cp for 9.2.6-1wm2 [16:02:36] Lucas_WMDE: The "right way" would actually be to rollback, revert, then roll forward [16:02:45] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:04] or at least roll back the deployment [16:03:12] hm [16:03:22] does scap let me do a rollback? [16:03:31] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:32] good question :D [16:03:34] I know I could’ve (should’ve?) 
done a revert, I skipped it because it looked like a fix wouldn’t take long [16:04:01] (though I only noticed how many errors there were in logstash after making that decision… it was a lot more than I thought it would be) [16:04:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P71134 and previous config saved to /var/cache/conftool/dbconfig/20241125-160408-ladsgroup.json [16:04:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:04:24] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_drmrs and A:cp for 9.2.6-1wm2 [16:04:31] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:04:57] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{wikikube-worker[1305-1312].eqiad.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-se [16:04:57] rve-master-eqiad or A:ml-serve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [16:05:28] (03CR) 10Urbanecm: More authentication domain overrides (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [16:05:48] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:05:51] I thought scap checked logstash for errors after deploying to canaries and rolled back on its own if there were too many [16:06:15] but maybe it's only looking at errors, not warnings? 
[16:07:02] it called them errors [16:07:18] I pasted what it showed me at https://phabricator.wikimedia.org/T378660#10353724 [16:07:23] (I also still have the terminal if needed) [16:07:32] I don’t see any indication of automatic rollback there [16:07:50] yeah so maybe scap should offer a rollback option there [16:08:30] my recollection from the olden days is that rollback wasn’t possible because you wouldn’t know what to rollback to (someone would have to revert the right patch™ in /srv/mediawiki-staging first) [16:08:37] but of course under k8s we do have an old image version… [16:09:55] because absent this, you'd have to go to each mw-on-k8s helmfile directory and do a rollback with helm (not even helmfile) [16:09:58] (actually, I guess in principle you could’ve tried to restore the canary hosts to any non-canary host’s /srv contents, it just probably wasn’t set up to easily allow that. but there was a “source of truth” for a “working” version of the code available) [16:10:31] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:34] the reason is what triggers the possibility of doing a helmfile apply is the change of mediawiki image version, which would stay the same if you just exit scap [16:10:40] so no helmfile apply [16:10:57] sounds painful [16:11:01] a bit [16:11:31] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:13] (03CR) 10Sergio Gimeno: [C:03+1] [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631) (owner: 10Urbanecm) [16:15:54] (03Abandoned) 10CDanis: Revert "k8s: temp. 
enforce maximum cluster size" [puppet] - 10https://gerrit.wikimedia.org/r/1097389 (owner: 10CDanis) [16:17:18] (03PS1) 10Fabfur: site: temporary changing role to some magru hosts [puppet] - 10https://gerrit.wikimedia.org/r/1097429 (https://phabricator.wikimedia.org/T376737) [16:18:19] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:33] (03PS1) 10Brouberol: airflow: enable port 8600 to be reached from Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1097430 (https://phabricator.wikimedia.org/T364389) [16:19:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T380449)', diff saved to https://phabricator.wikimedia.org/P71138 and previous config saved to /var/cache/conftool/dbconfig/20241125-161915-ladsgroup.json [16:19:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [16:19:19] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:19] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [16:19:22] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4582/co" [puppet] - 10https://gerrit.wikimedia.org/r/1097430 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [16:19:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [16:20:02] (03CR) 10Andrew McAllister (WMDE): [C:03+1] airflow: enable port 8600 to be reached from Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1097430 (https://phabricator.wikimedia.org/T364389) (owner: 
10Brouberol) [16:23:26] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow: enable port 8600 to be reached from Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1097430 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [16:23:37] !log hashar@deploy2002 Installing scap version "4.128.0" for 211 hosts [16:25:19] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:25:36] (03CR) 10Muehlenhoff: site: temporary changing role to some magru hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097429 (https://phabricator.wikimedia.org/T376737) (owner: 10Fabfur) [16:26:19] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:14] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [16:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:57] !log hashar@deploy2002 Installation of scap version "4.128.0" completed for 211 hosts [16:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1630) [16:32:22] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:22] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:18] (03PS1) 10Scott French: php8.1: rebuild to pick up 8.1.31 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097436 [16:38:04] !log hashar@deploy2002 Installing scap version "4.129.0" for 211 hosts [16:38:13] of course [16:38:20] I reinstalled the previous version instead of upgrading... [16:39:22] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:22] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:41:02] (03CR) 10David Caro: "Did a quick test, there's three functions we use to resolve names, and only one of them actually fails if it can't resolve:" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [16:42:11] (03CR) 10David Caro: [C:03+2] cloudcephmon1004: provision as mon [puppet] - 10https://gerrit.wikimedia.org/r/1097390 (https://phabricator.wikimedia.org/T364870) (owner: 10David Caro) [16:42:23] !log uploaded php8.1 8.1.31-1+wmf11u1 to apt.w.o (16:25 UTC) [16:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:25] !log hashar@deploy2002 Installation of scap version "4.129.0" completed for 211 hosts [16:43:05] (03CR) 10Clément Goubert: [C:03+1] php8.1: rebuild to pick 
up 8.1.31 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097436 (owner: 10Scott French) [16:44:04] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:45:40] !log hashar@deploy2002 Pruned MediaWiki: 1.44.0-wmf.2 (duration: 03m 05s) [16:46:24] PROBLEM - BGP status on lsw1-d7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:47:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7001.magru.wmnet with OS bullseye [16:47:24] RECOVERY - BGP status on lsw1-d7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:49:20] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti7003.magru.wmnet [16:54:24] PROBLEM - BGP status on lsw1-d7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:55:08] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:55:24] RECOVERY - BGP status on lsw1-d7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:43] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [16:58:59] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [16:58:59] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:00] 
!log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti7003.magru.wmnet [16:59:10] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354319 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `ganeti7003.magru.wmnet` - ganeti70... [16:59:38] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7006.magru.wmnet [16:59:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) [17:01:13] (03PS1) 10Bking: wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - 10https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) [17:01:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354314 (10VRiley-WMF) a:03VRiley-WMF Would we like to proceed with replacing the CMOS battery?... 
[17:01:26] PROBLEM - BGP status on lsw1-b4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:28] RECOVERY - BGP status on lsw1-b4-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:30] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:04:01] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354356 (10Andrew) From Gerrit, @dcaro writes: > > Did a quick test, there's three functions we use to res... [17:04:06] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354358 (10fnegri) a:05fnegri→03Andrew Assigning this task to @Andrew as he's currently working on a patch. [17:05:06] (03CR) 10Scott French: [V:03+2 C:03+2] "Verified locally (built, ran entrypoint as a basic smoke test)." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097436 (owner: 10Scott French) [17:05:42] (03PS1) 10JMeybohm: Decom kubernetes[12]0[01][56] dedicates sessionstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/1097442 (https://phabricator.wikimedia.org/T379599) [17:06:40] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:07:06] (03CR) 10Clément Goubert: [C:03+1] Decom kubernetes[12]0[01][56] dedicates sessionstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/1097442 (https://phabricator.wikimedia.org/T379599) (owner: 10JMeybohm) [17:07:33] (03PS2) 10Brouberol: airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) [17:08:00] (03Abandoned) 10Brouberol: airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097281 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [17:09:28] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:00] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7006.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [17:10:16] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7006.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [17:10:17] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:18] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7006.magru.wmnet [17:10:20] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation 
tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354380 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7006.magru.wmnet` - cp7006.magru... [17:10:28] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:56] jouncebot: nowandnext [17:11:56] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [17:11:56] In 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1800) [17:11:56] In 0 hour(s) and 48 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1800) [17:13:50] (03Abandoned) 10Fabfur: site: temporary changing role to some magru hosts [puppet] - 10https://gerrit.wikimedia.org/r/1097429 (https://phabricator.wikimedia.org/T376737) (owner: 10Fabfur) [17:14:01] (03PS1) 10Hashar: scap: delete wmf branches automatically [puppet] - 10https://gerrit.wikimedia.org/r/1097444 (https://phabricator.wikimedia.org/T303828) [17:14:10] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [17:16:30] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:47] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [17:18:32] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:26] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:20:10] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 
06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354407 (10RobH) [17:21:59] (03PS1) 10BCornwall: varnish: Increase RSA cert warnings to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) [17:22:18] (03CR) 10STran: Unify IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [17:22:49] (03CR) 10Vgutierrez: [C:04-1] varnish: Increase RSA cert warnings to 100% (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:22:55] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [17:23:01] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [17:23:01] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:24:01] (03CR) 10Urbanecm: [C:03+2] Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:24:01] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354419 (10RobH) [17:24:02] (03CR) 10Urbanecm: [C:03+2] createExtensionTables: Use virtual domains for GrowthExperiments [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097369 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:24:34] PROBLEM - BGP status on lsw1-c4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: 
Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:25:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:25:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097369 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:25:34] RECOVERY - BGP status on lsw1-c4-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:28:52] (03PS2) 10BCornwall: varnish: Increase RSA cert warnings to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) [17:29:24] (03CR) 10BCornwall: varnish: Increase RSA cert warnings to 100% (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:29:25] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti7004.magru.wmnet [17:29:33] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7008.magru.wmnet [17:30:17] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:31:34] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:31:50] (03CR) 10Pppery: "Would it be better to do a `$wmgDisabledSpecialPages` key in InitializeSettings.php and then iterate over it in CommonSettings to set keys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095334 
(https://phabricator.wikimedia.org/T371662) (owner: 10Pppery) [17:32:01] (03CR) 10CI reject: [V:04-1] Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:32:11] (03CR) 10Urbanecm: Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:32:15] (03CR) 10Urbanecm: [C:03+2] Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:32:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:32:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097369 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:32:34] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:29] (03Merged) 10jenkins-bot: createExtensionTables: Use virtual domains for GrowthExperiments [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097369 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:33:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Monday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097417 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [17:34:38] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:34:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1237.eqiad.wmnet with reason: Maintenance [17:35:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1237.eqiad.wmnet with reason: Maintenance [17:35:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1237 (T380449)', diff saved to https://phabricator.wikimedia.org/P71140 and previous config saved to /var/cache/conftool/dbconfig/20241125-173511-ladsgroup.json [17:35:43] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [17:36:42] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:38:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:39:16] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7004.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [17:39:33] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7004.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" 
[17:39:33] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:39:34] PROBLEM - BGP status on lsw1-d1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:34] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti7004.magru.wmnet [17:39:42] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:39:48] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354527 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `ganeti7004.magru.wmnet` - ganeti70... [17:40:36] RECOVERY - BGP status on lsw1-d1-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:36] (03PS3) 10BCornwall: varnish: Increase RSA cert warnings to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) [17:41:53] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:53] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp7008.magru.wmnet [17:42:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:43:29] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354538 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7008.magru.wmnet` - cp7008.magru... 
[17:43:51] (03PS1) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:37] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7006.magru.wmnet with OS bullseye [17:45:09] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:45:52] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [17:46:23] (03PS2) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:46:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4588/co" [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:46:33] (03CR) 10Vgutierrez: [C:03+1] varnish: Increase RSA cert warnings to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:46:36] PROBLEM - BGP status on lsw1-d1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:44] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [17:47:35] (03PS3) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:47:36] RECOVERY - BGP status on lsw1-d1-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:48:15] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [17:48:23] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Increase RSA cert warnings to 100% [puppet] - 10https://gerrit.wikimedia.org/r/1097446 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:48:40] (03PS4) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:49:45] !log T378260 `snapshot1016.eqiad.wmnet` => manually deleted `cirrussearch-dump-s11.[timer,service] [17:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [17:49:50] T378260: Retire labtestwiki - https://phabricator.wikimedia.org/T378260 [17:49:50] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7001.magru.wmnet with OS bullseye [17:49:50] !log T378260 `snapshot1016.eqiad.wmnet` => manually 
deleted `cirrussearch-dump-s11.[timer,service]` [17:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:16] (03PS5) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:50:45] (03CR) 10SBassett: [C:03+1] "LGTM and thanks for a43c02f2b4 and 89b90ae3db as well. Happy to +2 if nobody else does." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097322 (owner: 10Gergő Tisza) [17:52:18] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [17:53:09] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a] (wcqs): Deploy 0.3.150 to WCQS [17:53:42] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:54:37] (03Merged) 10jenkins-bot: Migrate to virtual domains [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097310 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [17:54:40] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:54:41] finally [17:54:43] (03PS6) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:54:57] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1097310|Migrate to virtual domains (T354939)]], [[gerrit:1097369|createExtensionTables: Use virtual domains for GrowthExperiments (T354939)]] [17:55:01] T354939: Migrate GrowthExperiments to virtual domains - 
https://phabricator.wikimedia.org/T354939 [17:55:31] (03PS7) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:55:37] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:56:02] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a] (wcqs): Deploy 0.3.150 to WCQS (duration: 02m 53s) [17:56:35] (03PS8) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [17:57:44] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7008 [17:57:59] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7008 [17:58:09] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7004 [17:58:24] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7004 [17:59:07] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354620 (10RobH) [17:59:10] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1097310|Migrate to virtual domains (T354939)]], [[gerrit:1097369|createExtensionTables: Use virtual domains for GrowthExperiments (T354939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1800) [18:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T1800). 
[18:00:41] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:45] (03PS9) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:01:23] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:01:41] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:01:50] (03PS10) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:02:04] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [18:02:09] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [18:02:09] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:02:42] (03PS11) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:03:05] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs7003.magru.wmnet [18:03:35] (03PS12) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:03:43] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7006.magru.wmnet with OS bullseye [18:04:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host 
cp7006.magru.wmnet with OS bullseye [18:04:53] (03PS13) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:05:43] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:06:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354661 (10JMeybohm) 05Open→03Declined Oh, sorry. We forgot to update here. Lets not spen... [18:08:14] (03PS1) 10Bartosz Dziewoński: LoginCompleteHookHandler: onTempUserCreatedRedirect() should use getPrimaryInstance() [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097457 (https://phabricator.wikimedia.org/T380042) [18:08:15] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097310|Migrate to virtual domains (T354939)]], [[gerrit:1097369|createExtensionTables: Use virtual domains for GrowthExperiments (T354939)]] (duration: 13m 18s) [18:08:19] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [18:08:23] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#10354676 (10JMeybohm) I'm pretty happy with this. If it is not 100% correct, I did not notice so far: `lang=bash _cookbook_completion() { local cur cur="${COMP_WORDS[... 
[18:08:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097457 (https://phabricator.wikimedia.org/T380042) (owner: 10Bartosz Dziewoński) [18:08:27] okay, let's see how many things this breaks :) [18:08:37] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:56] !log rebuilt php8.1 production images to pick up 8.1.31 [18:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:37] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:10:54] (03PS2) 10Urbanecm: [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631) [18:10:54] (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631) (owner: 10Urbanecm) [18:13:06] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:15:37] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:16:56] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [18:17:13] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [18:17:14] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:15] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts lvs7003.magru.wmnet [18:17:28] (03PS15) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [18:17:37] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:57] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7015.magru.wmnet [18:18:16] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354737 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `lvs7003.magru.wmnet` - lvs7003.mag... 
[18:21:42] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:42] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354753 (10Andrew) Nameserver is missing from the following hosts: cn-staging-1.centralnotice-staging.eqiad1... [18:23:37] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:24:09] (03PS1) 10Krinkle: webperf: Update Scap git origin for statsv.git [puppet] - 10https://gerrit.wikimedia.org/r/1097459 [18:24:16] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:24:37] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T380449)', diff saved to https://phabricator.wikimedia.org/P71141 and previous config saved to /var/cache/conftool/dbconfig/20241125-182603-ladsgroup.json [18:26:08] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [18:26:42] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@magru - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:27:40] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7015.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [18:27:57] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7015.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [18:27:57] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:27:58] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7015.magru.wmnet [18:28:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:28:27] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354804 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7015.magru.wmnet` - cp7015.magru... 
[18:28:56] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:29:43] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [18:30:37] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:08] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:31:12] (03CR) 10Dzahn: [C:03+2] planet: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092823 (owner: 10Muehlenhoff) [18:31:37] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:01] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [18:34:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS bullseye [18:35:04] (03CR) 10Dzahn: [C:03+2] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1092823 (owner: 10Muehlenhoff) [18:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:38:37] PROBLEM - BGP status on lsw1-d7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:39:37] RECOVERY - BGP status on lsw1-d7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:40:44] (03CR) 10Dzahn: [C:03+1] "So this is a case where WMF does not own it, yet it still points to WMF name servers. 
That indicates this should be right. But it also rai" [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [18:41:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P71142 and previous config saved to /var/cache/conftool/dbconfig/20241125-184110-ladsgroup.json [18:42:19] (03CR) 10CDanis: [C:03+2] webperf: Update Scap git origin for statsv.git [puppet] - 10https://gerrit.wikimedia.org/r/1097459 (owner: 10Krinkle) [18:43:16] (03CR) 10Dzahn: [C:03+1] "lgtm. needs a second patch to actually link them to the parking template in DNS repo, right?" [puppet] - 10https://gerrit.wikimedia.org/r/1092943 (owner: 10BCornwall) [18:43:46] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354896 (10RobH) [18:43:57] (03CR) 10Dzahn: [C:03+2] peopleweb: limit envoy srange to CACHES and DEPLOYMENT_SERVERS [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:45:09] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:45:17] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7015 [18:45:33] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7015 [18:45:39] PROBLEM - BGP status on lsw1-d7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:46:39] RECOVERY - BGP status on lsw1-d7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:48:40] !log krinkle@deploy2002 Started deploy [statsv/statsv@6678d4b]: I7a8d831817: remove unused statsvr.py [18:48:49] !log krinkle@deploy2002 Finished deploy 
[statsv/statsv@6678d4b]: I7a8d831817: remove unused statsvr.py (duration: 00m 09s) [18:49:20] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354915 (10Fabfur) [18:49:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{wikikube-worker[2128-2170].codfw.wmnet} and (A:wikikube-staging-worker-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-eqiad or A:wikikube-staging-master-eqiad or A:wikikube-worker-codfw or A:wikikube-master-codfw or A:wikikube-worker-eqiad or A:wikikube-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-eqiad or A:ml-serve-worker-codfw or A:ml-serve-master-codfw or A:ml-staging-worker or A:ml-staging-master or A:dse-k8s-worker or A:dse-k8s-master or A:aux-worker or A:aux-master) [18:49:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:52:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7008.magru.wmnet with OS bullseye [18:53:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS bullseye [18:53:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097322 (owner: 10Gergő Tisza) [18:53:08] cdanis: I bet you already know but that's the scap change there [18:53:23] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs7003 [18:53:24] error running /usr/lib/git-core/git 'config' '--global' ...
[18:53:37] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs7003 [18:53:48] 'fatal: $HOME not set' ... hmmm [18:54:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [18:54:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [18:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:54:55] mutante: the scap change is the puppet failure? [18:55:19] It seemed like it.. but now it resolved.. currently running it again on one of the hosts.. [18:55:36] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354939 (10RobH) [18:56:11] cdanis: scap deploy local fails. example: wdqs2026 [18:56:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P71143 and previous config saved to /var/cache/conftool/dbconfig/20241125-185617-ladsgroup.json [18:56:54] "git lfs install" failed.. 
it's a bit confusing [18:57:01] mutante: puppet has been failing on that host for a long time https://puppetboard.wikimedia.org/node/wdqs2026.codfw.wmnet [18:57:17] same error hours ago https://puppetboard.wikimedia.org/report/wdqs2026.codfw.wmnet/93854d1676f415cda41a59ad6b654afa3b4928c7 [18:57:21] I don't think it's my fault :) [18:57:31] ooh! I guess then it's once again that we are always just slightly under the threshold of failed hosts. [18:57:41] yeah, that must be it [18:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:57:56] one more and it starts flapping [18:57:57] https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=5 [18:58:04] there's a flapping magru host [18:58:10] magru has so few hosts that one flapping triggers the alert [18:58:17] sorry, I just assumed too much because it was the last thing merged and scap [18:58:23] no, it was sensible ahah [18:58:40] makes sense since it's limited to magru, ack [18:58:49] so it's not actually THAT widespread :p [18:59:02] the site is depooled fwiw and some alerts are expected because of the ongoing maintenance [18:59:06] so no cause for worry as such [18:59:13] oh, that too? 
ok, thanks [18:59:17] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [18:59:23] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [18:59:23] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:28] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:59:35] mutante: yeah, ongoing hw maintenance (see T376737) [18:59:48] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7003.magru.wmnet with OS bookworm [18:59:52] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS... [19:00:00] there is something about prometheus and wdqs too [19:00:09] sukhe: ACK:) ty [19:01:23] some of these puppet failure alerts might also seem "new" because we had them disabled on the edge sites [19:01:33] see T379807 [19:01:34] T379807: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807 [19:01:38] aha! 
[19:01:45] cwhite put out a fix for that in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/32fcbbd018f33ed09becde377d70c32a99249a54%5E%21/#F0 [19:01:51] it's always more complex :p [19:01:53] so this is why it might seem new [19:01:57] gotcha [19:02:12] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [19:02:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7006.magru.wmnet with OS bullseye [19:02:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:03:17] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354984 (10BCornwall) [19:04:25] jouncebot: nowandnext [19:04:25] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [19:04:25] In 1 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T2100) [19:05:47] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10354992 (10BCornwall) [19:06:08] FYI, I'd like to run a test deployment after updating the php 8.1 production images (affects only mwdebug-next). [19:06:20] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7008.magru.wmnet with OS bullseye [19:06:25] I'll start that shortly unless there are any objections. 
[19:06:38] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS bullseye [19:08:35] !log swfrench@deploy2002 Started scap sync-world: Deployment to pick up new php 8.1 base images [19:10:17] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs7003 [19:10:34] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs7003 [19:10:37] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:11:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T380449)', diff saved to https://phabricator.wikimedia.org/P71144 and previous config saved to /var/cache/conftool/dbconfig/20241125-191124-ladsgroup.json [19:11:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [19:11:31] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [19:11:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [19:13:54] (03CR) 10Dzahn: [C:03+2] "looks all good :)" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:14:11] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [19:14:16] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [19:14:17] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:34] (03CR) 10Dzahn: [C:03+2] Re-add Envoy firewall config for phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1092796 (owner: 10Muehlenhoff) [19:18:13] !log swfrench@deploy2002 Finished scap sync-world: 
Deployment to pick up new php 8.1 base images (duration: 09m 37s) [19:19:36] all done on my end [19:22:45] (03PS2) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) [19:22:59] (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631) (owner: 10Urbanecm) [19:23:41] (03Merged) 10jenkins-bot: [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631) (owner: 10Urbanecm) [19:24:24] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1095126|[Growth] enwiki: Deploy Add Link to 2% of new users (T377631)]] [19:24:29] T377631: Add a link (Structured task): Release to a subset of newcomers on English Wikipedia - https://phabricator.wikimedia.org/T377631 [19:27:50] (03PS1) 10Reedy: InitialiseSettings.php: Reduce indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097474 [19:28:03] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:28:37] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1095126|[Growth] enwiki: Deploy Add Link to 2% of new users (T377631)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:29:42] !log urbanecm@deploy2002 urbanecm: Continuing with sync [19:31:44] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [19:35:27] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [19:36:23] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1095126|[Growth] enwiki: Deploy Add 
Link to 2% of new users (T377631)]] (duration: 11m 59s) [19:36:28] T377631: Add a link (Structured task): Release to a subset of newcomers on English Wikipedia - https://phabricator.wikimedia.org/T377631 [19:43:07] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7008.magru.wmnet with OS bullseye [19:43:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS bullseye [19:45:22] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10355099 (10Ladsgroup) So now 0-5 are running on ms-fe2009 (plus non commons thumbs) and 6-a are running on ms-fe2010. I'll start b-f on ms-fe2011 tomorrow. [19:47:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:48:28] (03PS2) 10DDesouza: Reader Survey: Fix yes/no messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097417 (https://phabricator.wikimedia.org/T378660) [19:48:48] (03PS1) 10Jdlrobson: Nov 26 2024: Vector 2022 Deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) [19:49:02] 06SRE, 10SRE-tools, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355108 (10bking) [19:49:07] 06SRE, 10SRE-tools, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355104 (10bking) Per IRC conversation with @dcausse , we now have [[ https://wikitech.wikimedia.org/wiki/Search/CirrusS... 
[19:49:14] (03CR) 10BCornwall: "The intention is to remove all records entirely for those domains - hold them but don't use them." [puppet] - 10https://gerrit.wikimedia.org/r/1092943 (owner: 10BCornwall) [19:49:59] 06SRE, 10SRE-tools, 10Data-Platform-SRE (2024.11.09 - 2024.11.29), 03Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355110 (10bking) [19:50:23] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase2036.codfw.wmnet [19:50:25] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase2037.codfw.wmnet [19:50:26] (03CR) 10BCornwall: "Indeed, this is a "correct" match but it appears that we wanted to keep it this way, at least for now. Do you think it best to open a phab" [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [19:50:26] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase2038.codfw.wmnet [19:52:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:54:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:56:46] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" [19:58:09] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" 
[19:58:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7003.magru.wmnet with OS bookworm [19:58:34] (03PS3) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) [19:59:04] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355125 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS boo... [19:59:21] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355138 (10RobH) [19:59:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:00:04] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7004.magru.wmnet with OS bookworm [20:00:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [20:00:11] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS... 
[20:00:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [20:00:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2115 (T380449)', diff saved to https://phabricator.wikimedia.org/P71147 and previous config saved to /var/cache/conftool/dbconfig/20241125-200031-ladsgroup.json [20:00:44] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [20:04:49] (03PS1) 10Eevans: decommission restbase202[1-3].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) [20:09:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [20:09:55] (03CR) 10Ssingh: [C:03+1] "as it pertains to just the puppet repo, looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) (owner: 10Eevans) [20:12:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [20:14:13] (03CR) 10Ssingh: [C:03+1] "modules/profile/data/profile/installserver/preseed.yaml can also be updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) (owner: 10Eevans) [20:15:33] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355192 (10RobH) [20:21:17] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: corto: only operate on applicable phabricator issues - https://phabricator.wikimedia.org/T380293#10355204 (10Eevans) 05Open→03Resolved Deployed to production as v1.0.7; Done [20:23:32] 06SRE-OnFire, 10Incident Tooling: Corto: Incident responder workflow automation (MVP) - https://phabricator.wikimedia.org/T356790#10355210 (10Eevans) 05Open→03Resolved a:03Eevans Calling this done (the tracking ticket for the MVP, not Corto). Feel free to re-open if you disagree. [20:23:50] (03CR) 10Bartosz Dziewoński: [C:04-1] "`$_ENV` may not be set depending on PHP config ( https://www.php.net/manual/en/reserved.variables.environment.php#98113 ). 
I don't know if" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) (owner: 10Gergő Tisza) [20:24:40] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye [20:26:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [20:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:31:06] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [20:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:34:35] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [20:35:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:36:00] (03PS2) 10Gergő Tisza: Allow simulating the SUL3 shared domain settings via env var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) [20:38:07] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355245 (10RobH) [20:38:35] (03PS2) 10Eevans: decommission restbase202[1-3].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1097488 
(https://phabricator.wikimedia.org/T380790) [20:38:57] (03CR) 10Eevans: "Oh, good catch; Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) (owner: 10Eevans) [20:40:04] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [20:43:00] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:43:27] (03PS1) 10Wangombe: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) [20:43:34] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs7003.magru.wmnet with OS bullseye [20:44:46] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [20:45:26] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns7001 [20:45:32] (03CR) 10Ssingh: [C:03+1] decommission restbase202[1-3].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) (owner: 10Eevans) [20:45:42] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns7001 [20:45:47] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:47:17] (03CR) 10Eevans: [C:03+2] decommission restbase202[1-3].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1097488 (https://phabricator.wikimedia.org/T380790) (owner: 10Eevans) [20:50:14] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [20:51:05] !log 
robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru reshuffle - robh@cumin2002" [20:51:05] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:51:22] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[2021-2023].codfw.wmnet [20:52:00] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355267 (10RobH) [20:53:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T380449)', diff saved to https://phabricator.wikimedia.org/P71149 and previous config saved to /var/cache/conftool/dbconfig/20241125-205320-ladsgroup.json [20:53:35] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [20:56:04] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" [20:56:30] (03CR) 10RLazarus: [C:03+2] scap: delete wmf branches automatically [puppet] - 10https://gerrit.wikimedia.org/r/1097444 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [20:57:33] !log brett@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [20:57:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7008.magru.wmnet with OS bullseye [20:58:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1010 / ganeti1013 - https://phabricator.wikimedia.org/T379612#10355295 (10VRiley-WMF) [20:59:48] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T2100). [21:00:05] danisztls, MatmaRex, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:00:17] o/ [21:00:45] hi [21:02:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [21:03:14] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2021-2023].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [21:03:30] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2021-2023].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [21:03:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:03:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[2021-2023].codfw.wmnet [21:04:09] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Host reimage - brett@cumin2002 - brett@cumin2002" [21:04:15] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Host reimage - brett@cumin2002 - brett@cumin2002" [21:07:47] I suppose I can deploy [21:08:17] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission restbase202[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T380790#10355309 (10Eevans) 
[21:08:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P71150 and previous config saved to /var/cache/conftool/dbconfig/20241125-210827-ladsgroup.json [21:08:52] danisztls: do you want the two patches deployed together or separately? [21:09:11] tgr|away: separately [21:09:30] I have to do a quick check to see if the fix is working [21:10:15] so coverage first, fix second? [21:10:55] oh nevermind, those are dependent patches [21:11:04] I was looking at the schedule page [21:11:25] TheresNoTime: fix first [21:11:29] sry [21:11:32] tgr|away: fix first [21:12:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097417 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:12:57] (03Merged) 10jenkins-bot: Reader Survey: Fix yes/no messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097417 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:13:11] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1097417|Reader Survey: Fix yes/no messages (T378660)]] [21:13:15] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:17:22] !log tgr@deploy2002 dani, tgr: Backport for [[gerrit:1097417|Reader Survey: Fix yes/no messages (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:42] danisztls: ^ [21:18:03] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testing - sukhe@cumin1002" [21:18:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:18:07] !log sukhe@cumin1002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testing - sukhe@cumin1002" [21:21:35] tgr|away: It's fine to sync but I may have to do another patch :( [21:22:10] !log tgr@deploy2002 dani, tgr: Continuing with sync [21:22:54] danisztls: do you want to do that now? [21:23:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:23:07] tgr|away: you can continue with the increase on coverage [21:23:11] I assume the other patch should wait for now? [21:23:15] ah, ok [21:23:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P71151 and previous config saved to /var/cache/conftool/dbconfig/20241125-212334-ladsgroup.json [21:25:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1010 / ganeti1013 - https://phabricator.wikimedia.org/T379612#10355334 (10VRiley-WMF) a:03VRiley-WMF [21:25:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1010 / ganeti1013 - https://phabricator.wikimedia.org/T379612#10355336 (10VRiley-WMF) [21:26:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1010 / ganeti1013 - https://phabricator.wikimedia.org/T379612#10355347 (10VRiley-WMF) 05Open→03Resolved [21:28:47] tgr|away: I've incorrectly named the messages on config but I created new messages on the wiki so it will not be an issue. 
[21:28:56] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10355353 (10mpopov) [21:29:14] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097417|Reader Survey: Fix yes/no messages (T378660)]] (duration: 16m 02s) [21:29:19] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:30:08] (03CR) 10Gergő Tisza: [C:03+2] LoginCompleteHookHandler: onTempUserCreatedRedirect() should use getPrimaryInstance() [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097457 (https://phabricator.wikimedia.org/T380042) (owner: 10Bartosz Dziewoński) [21:30:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:30:21] (03CR) 10CI reject: [V:04-1] Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:30:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs7003.magru.wmnet with OS bullseye [21:31:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [21:31:33] (03PS4) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) [21:31:58] (03PS5) 10Gergő Tisza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:32:13] (03CR) 10TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) 
[21:32:58] (03Merged) 10jenkins-bot: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:33:14] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1094054|Reader Survey: Increase coverage on enwiki (T378660)]] [21:36:38] tgr|away: thanks, can you do another patch? :) [21:37:00] sure [21:37:24] !log tgr@deploy2002 tgr, dani: Backport for [[gerrit:1094054|Reader Survey: Increase coverage on enwiki (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:37:28] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:37:30] might have to wait for the CentralAuth one though, it's already being merged [21:38:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T380449)', diff saved to https://phabricator.wikimedia.org/P71152 and previous config saved to /var/cache/conftool/dbconfig/20241125-213841-ladsgroup.json [21:38:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [21:38:48] (03PS1) 10DDesouza: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097518 (https://phabricator.wikimedia.org/T378660) [21:38:50] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449 [21:38:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [21:39:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2131 (T380449)', diff saved to https://phabricator.wikimedia.org/P71153 and previous config saved to /var/cache/conftool/dbconfig/20241125-213904-ladsgroup.json [21:39:53] tgr|away: when you prefer, 1097518 [21:40:11] (03Merged) 10jenkins-bot: LoginCompleteHookHandler: onTempUserCreatedRedirect() should use 
getPrimaryInstance() [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1097457 (https://phabricator.wikimedia.org/T380042) (owner: 10Bartosz Dziewoński) [21:40:52] danisztls: do you want to test the coverage increase? [21:42:18] tgr|away: not needed [21:42:30] !log tgr@deploy2002 tgr, dani: Continuing with sync [21:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:27] (03CR) 10Ladsgroup: [C:03+1] "LGTM. cc Zabe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [21:45:16] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7015.magru.wmnet with OS bullseye [21:46:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10355402 (10mpopov) 05Resolved→03Open a:05MatthewVernon→03None @Khantstop has rep... 
[21:46:44] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye
[21:48:09] (CR) Andrew Bogott: "I've copied that comment into the phab task as it seems worth investigating independently from this patch" [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[21:49:21] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094054|Reader Survey: Increase coverage on enwiki (T378660)]] (duration: 16m 06s)
[21:49:33] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:50:14] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1097457|LoginCompleteHookHandler: onTempUserCreatedRedirect() should use getPrimaryInstance() (T380042)]]
[21:50:18] T380042: RuntimeException: Global user does not have ID '0'. - https://phabricator.wikimedia.org/T380042
[21:54:37] !log tgr@deploy2002 tgr, matmarex: Backport for [[gerrit:1097457|LoginCompleteHookHandler: onTempUserCreatedRedirect() should use getPrimaryInstance() (T380042)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:55:00] oh! looking
[21:55:42] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[21:55:52] tgr|away: seems good. i was able to null-edit that page now while logged out
[21:56:03] !log tgr@deploy2002 tgr, matmarex: Continuing with sync
[21:58:10] (PS1) CDobbins: Update geo-maps file's US section [dns] - https://gerrit.wikimedia.org/r/1097521
[21:59:03] (PS2) CDobbins: Update geo-maps file's US section [dns] - https://gerrit.wikimedia.org/r/1097521
[22:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241125T2200)
[22:00:59] o/ we have two more deploys left if the window is not busy
[22:02:55] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097457|LoginCompleteHookHandler: onTempUserCreatedRedirect() should use getPrimaryInstance() (T380042)]] (duration: 12m 41s)
[22:03:00] T380042: RuntimeException: Global user does not have ID '0'. - https://phabricator.wikimedia.org/T380042
[22:03:55] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1097518 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[22:04:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T380449)', diff saved to https://phabricator.wikimedia.org/P71154 and previous config saved to /var/cache/conftool/dbconfig/20241125-220406-ladsgroup.json
[22:04:20] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449
[22:04:38] (Merged) jenkins-bot: Reader Survey: Increase coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/1097518 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[22:04:56] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1097518|Reader Survey: Increase coverage (T378660)]]
[22:05:00] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[22:05:17] (PS3) Gergő Tisza: SUL3: Sort overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737)
[22:05:26] (PS3) Gergő Tisza: More authentication domain overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737)
[22:05:39] (PS3) Gergő Tisza: Update private/readme.php to match production [mediawiki-config] - https://gerrit.wikimedia.org/r/1097322
[22:08:15] !log brett@cumin2002 END (FAIL) -
Cookbook sre.hosts.reimage (exit_code=99) for host cp7015.magru.wmnet with OS bullseye
[22:09:04] (Abandoned) Bking: Revert "wdqs: create wdqs-internal-[main,scholarly] roles" [puppet] - https://gerrit.wikimedia.org/r/1094554 (owner: Bking)
[22:09:11] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye
[22:09:13] tgr|away: thank you very much
[22:09:18] !log tgr@deploy2002 tgr, dani: Backport for [[gerrit:1097518|Reader Survey: Increase coverage (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:09:29] danisztls: do you need to test it?
[22:09:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage
[22:09:55] tgr|away: no
[22:11:10] RECOVERY - Host mr1-magru.oob is UP: PING WARNING - Packet loss = 33%, RTA = 132.00 ms
[22:12:24] !log tgr@deploy2002 tgr, dani: Continuing with sync
[22:13:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage
[22:18:59] (PS1) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1097535 (https://phabricator.wikimedia.org/T379333)
[22:19:04] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097518|Reader Survey: Increase coverage (T378660)]] (duration: 14m 08s)
[22:19:08] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[22:19:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P71155 and previous config saved to /var/cache/conftool/dbconfig/20241125-221913-ladsgroup.json
[22:19:35] (PS3) CDobbins: Update geo-maps file's US section [dns] - https://gerrit.wikimedia.org/r/1097521
[22:20:05] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[22:20:06] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[22:20:06] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1097322 (owner: Gergő Tisza)
[22:20:49] (Merged) jenkins-bot: SUL3: Sort overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1097327 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[22:20:52] (Merged) jenkins-bot: More authentication domain overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[22:20:54] (Merged) jenkins-bot: Update private/readme.php to match production [mediawiki-config] - https://gerrit.wikimedia.org/r/1097322 (owner: Gergő Tisza)
[22:21:09] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1097327|SUL3: Sort overrides (T373737)]], [[gerrit:1097328|More authentication domain overrides (T373737)]], [[gerrit:1097322|Update private/readme.php to match production]]
[22:21:14] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737
[22:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:36] !log tgr@deploy2002 tgr: Backport for [[gerrit:1097327|SUL3: Sort overrides (T373737)]], [[gerrit:1097328|More authentication domain overrides (T373737)]], [[gerrit:1097322|Update
private/readme.php to match production]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:26:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:27:26] !log tgr@deploy2002 tgr: Continuing with sync
[22:28:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:31:40] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs7003.magru.wmnet with OS bullseye
[22:33:59] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097327|SUL3: Sort overrides (T373737)]], [[gerrit:1097328|More authentication domain overrides (T373737)]], [[gerrit:1097322|Update private/readme.php to match production]] (duration: 12m 49s)
[22:34:03] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737
[22:34:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P71156 and previous config saved to /var/cache/conftool/dbconfig/20241125-223420-ladsgroup.json
[22:34:46] !log UTC late deploys done
[22:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:37:28] !log
bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:37:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:37:34] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[22:37:54] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:37:55] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:38:41] (PS5) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[22:38:42] (PS2) Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555)
[22:38:42] (PS2) Ryan Kemper: wdqs-internal: configure graphsplit load balancers [puppet] - https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555)
[22:38:42] (PS2) Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555)
[22:38:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling
source-only afterwards
[22:38:49] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:39:21] (CR) Gergő Tisza: More authentication domain overrides (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1097328 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[22:41:54] (CR) Subramanya Sastry: rest-gateway: order mw-api-int paths strictly (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: Hnowlan)
[22:43:00] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:43:01] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:43:05] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[22:46:16] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7015.magru.wmnet with OS bullseye
[22:48:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:48:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:48:48] T376150: Prepare 5 codfw hosts to serve wdqs-internal
from main graph - https://phabricator.wikimedia.org/T376150
[22:49:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T380449)', diff saved to https://phabricator.wikimedia.org/P71157 and previous config saved to /var/cache/conftool/dbconfig/20241125-224927-ladsgroup.json
[22:49:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2191.codfw.wmnet with reason: Maintenance
[22:49:33] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449
[22:49:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2191.codfw.wmnet with reason: Maintenance
[22:49:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2191 (T380449)', diff saved to https://phabricator.wikimedia.org/P71158 and previous config saved to /var/cache/conftool/dbconfig/20241125-224949-ladsgroup.json
[22:51:27] (PS6) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[22:51:27] (PS3) Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555)
[22:51:27] (PS3) Ryan Kemper: wdqs-internal: configure graphsplit load balancers [puppet] - https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555)
[22:51:27] (PS3) Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555)
[22:51:28] (PS1) Ryan Kemper: wdqs-internal: codfw pybal pools for graph split [puppet] - https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330)
[22:51:31] (PS1) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333)
[22:53:34] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer wikidata from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:53:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer wikidata from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:56:33] !log bking@cumin1002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer wikidata from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:56:34] !log bking@cumin1002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer wikidata from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[22:56:37] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[22:56:51] !log Import varnish-modules 0.20.0-2~deb11u1 into varnish-staging apt component
[22:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:07] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp7015.magru.wmnet lvs7003.magru.wmnet on all recursors
[23:00:20] !log brett@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) cp7015.magru.wmnet lvs7003.magru.wmnet on all recursors
[23:00:54] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp7015.magru.wmnet lvs7003.magru.wmnet cp7015.mgmt.magru.wmnet lvs7003.mgmt.magru.wmnet on all recursors
[23:01:07] !log brett@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) cp7015.magru.wmnet lvs7003.magru.wmnet cp7015.mgmt.magru.wmnet lvs7003.mgmt.magru.wmnet on all recursors
[23:01:15] !log bking@cumin1002 START - Cookbook sre.wdqs.data-transfer
[23:01:16] !log bking@cumin1002 END (FAIL) - Cookbook
sre.wdqs.data-transfer (exit_code=99)
[23:01:31] (CR) Bartosz Dziewoński: [C:+1] Allow simulating the SUL3 shared domain settings via env var [mediawiki-config] - https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) (owner: Gergő Tisza)
[23:02:01] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye
[23:09:20] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7015.magru.wmnet with OS bullseye
[23:09:53] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:09:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:09:57] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[23:10:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:10:24] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:10:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191 (T380449)', diff saved to https://phabricator.wikimedia.org/P71159 and previous config saved to /var/cache/conftool/dbconfig/20241125-231026-ladsgroup.json
[23:10:35] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449
[23:13:32] (PS1) Ryan
Kemper: sre.wdqs.data-transfer: checking puppet alias is pointless [cookbooks] - https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150)
[23:14:43] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:14:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:15:19] (CR) Andrew Bogott: "This can be abandoned, can't it?" [puppet] - https://gerrit.wikimedia.org/r/983139 (owner: David Caro)
[23:16:20] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:16:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:16:24] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[23:16:46] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:16:47] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:19:12] (PS7) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] -
https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[23:19:12] (PS2) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333)
[23:19:12] (PS4) Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555)
[23:19:12] (PS4) Ryan Kemper: wdqs-internal: configure graphsplit load balancers [puppet] - https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555)
[23:19:13] (PS4) Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555)
[23:19:23] (CR) CI reject: [V:-1] sre.wdqs.data-transfer: checking puppet alias is pointless [cookbooks] - https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper)
[23:22:58] ops-eqiad, SRE, cloud-services-team, DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805 (Andrew) NEW
[23:23:40] !log removing 2 files for legal compliance
[23:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P71160 and previous config saved to /var/cache/conftool/dbconfig/20241125-232533-ladsgroup.json
[23:38:11] (PS1) EoghanGaffney: vrts: Update mail alias generation script to bail on too many changes [puppet] - https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009)
[23:40:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P71161 and previous config saved to /var/cache/conftool/dbconfig/20241125-234040-ladsgroup.json
[23:41:58] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:41:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:42:03] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[23:44:41] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[23:46:44] (PS1) Andrew Bogott: nova policy.yaml: open os_compute_api:os-aggregates:index api [puppet] - https://gerrit.wikimedia.org/r/1097558 (https://phabricator.wikimedia.org/T380069)
[23:47:20] (PS2) Andrew Bogott: nova policy.yaml: open os_compute_api:os-aggregates:index api [puppet] - https://gerrit.wikimedia.org/r/1097558 (https://phabricator.wikimedia.org/T380069)
[23:47:38] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1097558 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott)
[23:49:05] (CR) Andrew Bogott: [C:+2] nova policy.yaml: open os_compute_api:os-aggregates:index api [puppet] - https://gerrit.wikimedia.org/r/1097558 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott)
[23:49:42] (PS2) Ryan Kemper: sre.wdqs.data-transfer: improve graph type checks [cookbooks] - https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150)
[23:49:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet ->
wdqs2018.codfw.wmnet, repooling source-only afterwards
[23:49:53] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[23:51:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:51:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:53:34] !log removing 1 file for legal compliance
[23:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:37] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet, repooling source-only afterwards
[23:54:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191 (T380449)', diff saved to https://phabricator.wikimedia.org/P71162 and previous config saved to /var/cache/conftool/dbconfig/20241125-235547-ladsgroup.json
[23:55:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance
[23:55:52] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449
[23:56:02] !log
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance
[23:56:23] (CR) CI reject: [V:-1] sre.wdqs.data-transfer: improve graph type checks [cookbooks] - https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper)