[00:01:33] (CR) Krinkle: [C:+1] "Confirmed wmgUseMathML is not used elsewhere. Confirmed it matches the default. Confirmed the default makes sense (mathml == Mathoid SVG, " [mediawiki-config] - https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) (owner: Physikerwelt)
[00:03:36] SRE, SRE-Access-Requests, LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10115579 (andrea.denisse) a:andrea.denisse
[00:05:21] (CR) Andrea Denisse: [C:+2] admin: add southparkfan to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1068073 (https://phabricator.wikimedia.org/T373518) (owner: Ssingh)
[00:12:52] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10115588 (Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/40; - member ge-1/0/40; [edit...
[00:13:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070349 (owner: TrainBranchBot)
[00:13:59] (CR) Bartosz Dziewoński: [C:-1] "Reviewed on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1069676" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069678 (https://phabricator.wikimedia.org/T371596) (owner: Gergő Tisza)
[00:17:15] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10115596 (Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/41; - member ge-1/0/41; [edit interfaces interface-ran...
[00:20:07] (CR) Krinkle: [C:+1] Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069320 (owner: Bartosz Dziewoński)
[00:24:26] (PS1) Papaul: Add DNS entries for frban2002 [dns] - https://gerrit.wikimedia.org/r/1070355
[00:26:07] (CR) Papaul: [C:+2] Add DNS entries for frban2002 [dns] - https://gerrit.wikimedia.org/r/1070355 (owner: Papaul)
[00:27:09] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10115604 (Papaul)
[00:28:22] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10115605 (Papaul) a:Papaul→Dwisehaupt @Dwisehaupt this is ready for you
[00:28:51] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10115607 (Papaul)
[00:30:01] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10115608 (Papaul) a:Papaul→Dwisehaupt @Dwisehaupt this is ready for you
[00:42:51] (PS12) Andrew Bogott: Horizon: enable OIDC auth [puppet] - https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590)
[00:42:52] (PS1) Andrew Bogott: Horizon config: set SECURE_PROXY_SSL_HEADER [puppet] - https://gerrit.wikimedia.org/r/1070356 (https://phabricator.wikimedia.org/T359590)
[00:44:11] (PS2) Andrew Bogott: Horizon config: set SECURE_PROXY_SSL_HEADER [puppet] - https://gerrit.wikimedia.org/r/1070356 (https://phabricator.wikimedia.org/T359590)
[00:44:11] (PS13) Andrew Bogott: Horizon: enable OIDC auth [puppet] - https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590)
[00:45:02] (CR) Andrew Bogott:
[C:+2] Horizon config: set SECURE_PROXY_SSL_HEADER [puppet] - https://gerrit.wikimedia.org/r/1070356 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[00:45:22] ops-codfw, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10115618 (Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] memb...
[00:47:31] ops-codfw, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10115619 (Papaul) Open→Resolved a:Papaul All the switch clean up is done
[00:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:00:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069320 (owner: Bartosz Dziewoński)
[01:00:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069321 (owner: Bartosz Dziewoński)
[01:00:51] FIRING: [8x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:57] FIRING: [2x]
SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:01:12] (PS1) Scott French: kubernetes: mw2260 revert to role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/1070364
[02:04:37] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Update iDRAC on mw2260.codfw.wmnet - https://phabricator.wikimedia.org/T373934#10115689 (Scott_French)
[02:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:27:53] (CR) Subramanya Sastry: [C:-2] "Let us backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1070326 instead of this patch." [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070319 (https://phabricator.wikimedia.org/T373920) (owner: Jforrester)
[03:00:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:31:09] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 6.873 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:31:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1
200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:37:26] (PS1) Abijeet Patro: TTMServerAid: Tell PHP that we're OK with $services starting out null [extensions/Translate] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070370 (https://phabricator.wikimedia.org/T373921)
[03:43:25] FIRING: SystemdUnitFailed: wmf_auto_restart_systemd-timesyncd.service on wikikube-worker2076:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:51] (CR) CI reject: [V:-1] TTMServerAid: Tell PHP that we're OK with $services starting out null [extensions/Translate] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070370 (https://phabricator.wikimedia.org/T373921) (owner: Abijeet Patro)
[04:41:52] (CR) Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070370 (https://phabricator.wikimedia.org/T373921) (owner: Abijeet Patro)
[04:42:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Translate] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070370 (https://phabricator.wikimedia.org/T373921) (owner: Abijeet Patro)
[04:51:11] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:53:11] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK -
up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:00:51] FIRING: [8x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:15:57] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:56:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:49] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:49] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:39] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:07] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:19] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:59:07] RECOVERY - mailman list info ssl
expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:59:17] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 7.728 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:59:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 6.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:53] !log aqu@deploy1003 Started deploy [analytics/refinery@07fd127] (thin): Regular analytics weekly train THIN [analytics/refinery@07fd1275]
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:48] !log aqu@deploy1003 Finished deploy [analytics/refinery@07fd127] (thin): Regular analytics weekly train THIN [analytics/refinery@07fd1275] (duration: 04m 55s)
[06:14:17] (PS1) Brouberol: Define a catchall monitor for pending admin_ng changes [alerts] - https://gerrit.wikimedia.org/r/1070483 (https://phabricator.wikimedia.org/T331894)
[06:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki -
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:35:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:05] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:36:01] FIRING: [3x] RedisMemoryFull: Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:46:25] hello, I have the following patch for backport in the morning window: 1070370: TTMServerAid: Tell PHP that we're OK with $services starting out null | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1070370; patches for translate extension usually take quite some time before they are merged. Might be helpful to +2 it now.
[06:54:15] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1070484
[06:56:59] (CR) Alexandros Kosiaris: [C:+1] Define a catchall monitor for pending admin_ng changes [alerts] - https://gerrit.wikimedia.org/r/1070483 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[07:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T0700).
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:34] o/
[07:07:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2011.codfw.wmnet
[07:08:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2011.codfw.wmnet
[07:09:07] SRE, SRE-Access-Requests: Requesting access to `contint-admins` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969 (ArthurTaylor) NEW
[07:09:14] !log T373095 depool kubernetes2011, kubernetes2012, kubernetes2036, kubernetes2037, wikikube-worker2037, wikikube-worker2038, mw2436, mw2437
[07:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:17] T373095: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095
[07:09:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[07:10:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[07:10:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T367781)', diff saved to https://phabricator.wikimedia.org/P68612 and previous config saved to /var/cache/conftool/dbconfig/20240904-071007-arnaudb.json
[07:10:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[07:10:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2012.codfw.wmnet
[07:10:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2012.codfw.wmnet
[07:10:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2036.codfw.wmnet
[07:11:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after
maintenance db1173 (T367781)', diff saved to https://phabricator.wikimedia.org/P68613 and previous config saved to /var/cache/conftool/dbconfig/20240904-071115-arnaudb.json
[07:11:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2036.codfw.wmnet
[07:11:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2037.codfw.wmnet
[07:12:09] urbanecm and Amir1, there for the backport?
[07:12:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2037.codfw.wmnet
[07:12:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2037.codfw.wmnet
[07:12:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2037.codfw.wmnet
[07:12:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2038.codfw.wmnet
[07:13:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2038.codfw.wmnet
[07:13:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2436.codfw.wmnet
[07:14:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2436.codfw.wmnet
[07:14:09] abijeet: can I help if no one around?
[07:14:13] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2437.codfw.wmnet
[07:14:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2437.codfw.wmnet
[07:15:45] kart_, yea, thanks
[07:16:49] abijeet: cool. Deploying.
[14:06:34] yeah, new node, puppet hasn't run on registry yet
[14:07:26] and the rename isn't done because it's a supermicro node
[14:07:38] ebernhardson: MatmaRex: could you reschedule your patches, they won't get done in this window, sorry! :)
[14:07:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:07:45] FIRING: [2x] HttpdUnreachable: httpd unavailable for deployment mw-debug at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[14:07:54] TheresNoTime: yep, no problem
[14:07:54] basically registry breaking because of unrelated work
[14:08:25] jayme: can we remove wikikube-worker2088.codfw.wmnet from the kubernetes hosts in puppet until the cookbook is fixed for it?
[14:08:30] if that race could e addressed that would be gret
[14:08:31] great
[14:08:33] yeah yeah
[14:08:42] on it
[14:08:54] !log scap failed, [[gerrit:1070561]] merged but undeployed currently
[14:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:07] or aside, don't touch wikikube hosts during MediaWiki deployment windows :]
[14:09:09] it happens when a host gets stuck in the middle of a rename for any reason
[14:09:15] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:09:17] hnowlan: I'll sync that when this is resolved, that okay?
[14:09:21] thanks claime !
[14:09:41] hashar: This is a rare failure mode because that host in particular is different
[14:09:41] TheresNoTime: yep, thank you- if it's easier I am okay to roll back for now
[14:10:32] TheresNoTime: when you are done I will push https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1070548
[14:10:36] hnowlan: do a revert of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070561 and merge that? afaict scap rolled back the testservers
[14:10:39] a backport for tonight train
[14:11:05] (PS1) Hnowlan: Revert "Allow copyuploads on test2wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070607
[14:11:26] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:11:45] RESOLVED: [2x] HttpdUnreachable: httpd unavailable for deployment mw-debug at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[14:11:53] what's the protocol on merging that ^ as regards votes?
[14:12:10] I can +2 it
[14:12:25] thanks!
[14:12:28] (CR) Samtar: [C:+2] Revert "Allow copyuploads on test2wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070607 (owner: Hnowlan)
[14:12:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:13:13] (Merged) jenkins-bot: Revert "Allow copyuploads on test2wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070607 (owner: Hnowlan)
[14:13:31] (PS8) Arnaudb: mariadb: productionize db22[21-40] [puppet] - https://gerrit.wikimedia.org/r/1068667 (https://phabricator.wikimedia.org/T373579)
[14:13:58] !log [[gerrit:1070561]] reverted, fwiw
[14:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:23] thanks for the deploy anyway :D
[14:14:37] :D
[14:15:05] !log homer lsw1-b3-codfw* commit
[14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:43] * TheresNoTime is stepping away for a bit — feel free to do your backport hashar, once that's all working I guess :)
[14:15:55] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:16:37] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2087.codfw.wmnet with OS bookworm
[14:16:38] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2087
[14:16:39] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:16:49] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving
to per-rack subnets - https://phabricator.wikimedia.org/T372878#10117589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2087.codfw.wmne...
[14:16:51] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10117590 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke...
[14:17:14] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:17:20] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2082
[14:17:34] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10117592 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor...
[14:18:31] (CR) Ladsgroup: "Commit message needs updating, otherwise look good to me." [puppet] - https://gerrit.wikimedia.org/r/1068667 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[14:19:21] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10117596 (Jclark-ctr) @Dzahn phab1005 is still continuing to fail imaging not picking up ip address for pxe booting would you be able to assist?
[14:19:32] Puppet, cloud-services-team, Cloud-VPS: Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281#10117597 (joanna_borun) p:Triage→Low
[14:20:40] (PS1) Clément Goubert: Remove wikikube-worker2088.codfw.wmnet from wikikube [puppet] - https://gerrit.wikimedia.org/r/1070610 (https://phabricator.wikimedia.org/T373982)
[14:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:22:16] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:22:18] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:24:23] (CR) Xcollazo: [C:+1] "Ah, this was simpler than I thought." [puppet] - https://gerrit.wikimedia.org/r/1070558 (https://phabricator.wikimedia.org/T373904) (owner: Btullis)
[14:25:07] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070548 (https://phabricator.wikimedia.org/T373920) (owner: Hashar)
[14:25:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2054 to wikikube-worker2088
[14:25:23] (CR) Hnowlan: [C:+1] Remove wikikube-worker2088.codfw.wmnet from wikikube [puppet] - https://gerrit.wikimedia.org/r/1070610 (https://phabricator.wikimedia.org/T373982) (owner: Clément Goubert)
[14:25:34] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:27:57] (CR) JHathaway: [C:+2] "yeah, I agree" [puppet] - https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[14:28:47] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: Renaming kubernetes2054 to wikikube-worker2088 - cgoubert@cumin1002"
[14:29:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2054 to wikikube-worker2088 - cgoubert@cumin1002"
[14:29:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:29:03] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2088
[14:29:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2088
[14:29:35] (PS1) Hnowlan: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - https://gerrit.wikimedia.org/r/1070611
[14:29:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2054 to wikikube-worker2088
[14:30:11] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2088.codfw.wmnet on all recursors
[14:30:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2088.codfw.wmnet on all recursors
[14:32:27] hashar: registry fixed
[14:32:36] claime: <3
[14:32:48] (CR) Elukey: [C:+2] sre.hosts.rename: add support for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1070600 (owner: Elukey)
[14:33:16] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 381, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:38] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2082.codfw.wmnet
[14:33:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2082.codfw.wmnet
[14:34:03] SRE-tools, cloud-services-team, Cloud-VPS, Spicerack, and 2 others: cookbooks: for --interactive flags, add an option to skip the rest -
https://phabricator.wikimedia.org/T315341#10117647 (fnegri)
[14:34:06] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10117652 (Jclark-ctr) @Dzahn disregard i figured out issue
[14:35:06] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[14:35:52] (PS3) Clément Goubert: sre.k8s.renumber-node: fix k8s_metadata scope [cookbooks] - https://gerrit.wikimedia.org/r/1070585 (owner: Hnowlan)
[14:36:11] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm
[14:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:25] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10117659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ml-lab1001...
[14:36:47] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10117662 (ABran-WMF) >>! In T373095#10110436, @ABran-WMF wrote: > [...] I'll double check the DNS indeed great catch! no DNS modification need...
[14:37:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:27] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:38:13] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2036.codfw.wmnet
[14:38:17] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2036.codfw.wmnet
[14:38:21] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10117668 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumbering for host wikikub...
[14:38:21] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:38:37] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725#10117669 (joanna_borun) p:Triage→Low
[14:38:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2036.codfw.wmnet
[14:39:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'swap masters for es1 - T373095', diff saved to https://phabricator.wikimedia.org/P68648 and previous config saved to /var/cache/conftool/dbconfig/20240904-143928-arnaudb.json
[14:39:32] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661#10117671 (joanna_borun) p:Triage→Low
[14:39:33] T373095: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095
[14:39:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2036.codfw.wmnet with OS bullseye
[14:40:08] !log
bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [14:40:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2036 [14:40:15] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:40:24] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [14:40:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:40:45] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:41:53] how is integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php74/ taking THIRTY MINUTES nowadays [14:41:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10117677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2036.cod... [14:42:02] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725#10117686 (10aborrero) 05Open→03Declined not working on this at the moment. 
[14:42:12] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#10117692 (10fnegri) [14:42:22] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#10117696 (10joanna_borun) p:05Triage→03Medium [14:42:34] or https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74/buildTimeTrend going from ~ 15 to spikes of 22 minutes [14:42:35] hmm [14:42:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10117701 (10jhathaway) >>! In T372666#10116617, @Volans wrote: > FYI Cumin's puppet backend too will need to be refactored to support structured facts. good point, do you... [14:43:07] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661#10117702 (10aborrero) 05Open→03Declined not working on this at the moment. 
[14:43:44] (03PS1) 10Ladsgroup: tables-catalog: Add another batch of mediawiki core tables [puppet] - 10https://gerrit.wikimedia.org/r/1070616 (https://phabricator.wikimedia.org/T363581) [14:44:02] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2036 - cgoubert@cumin1002" [14:44:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2036 - cgoubert@cumin1002" [14:44:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:06] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2036.codfw.wmnet 121.16.192.10.in-addr.arpa 1.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:44:07] (03CR) 10JHathaway: [C:03+2] puppet8: remove unused scap config file [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:44:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2036.codfw.wmnet 121.16.192.10.in-addr.arpa 1.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:44:10] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2036 [14:44:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2036 [14:44:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2036 [14:44:49] (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [14:44:57] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team 
(Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724#10117697 (10aborrero) 05Open→03Declined not working on this at the moment. [14:46:18] (03PS1) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [14:47:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:59] !log homer cr*codfw* commit 'T372878' [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:02] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:55:03] FIRING: [2x] KubernetesCalicoDown: mw2316.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:55:38] (03Merged) 10jenkins-bot: ParserOutput::collectMetadata: Log if given value is non-numeric and also non-string, for easier debugging, and don't fatal [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070548 (https://phabricator.wikimedia.org/T373920) (owner: 10Hashar) [14:55:43] (03PS2) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [14:55:43] (03CR) 10JHathaway: 2FA: Use username as foreign key to security token table. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1065166 (owner: 10Slyngshede) [14:56:29] (03PS2) 10Hnowlan: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 [14:56:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 379, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:48] (03CR) 10Ottomata: [C:03+1] "Thanks Jayme!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [14:58:36] oh joy [14:58:40] (03PS4) 10Clément Goubert: sre.k8s.renumber-node: fix k8s_metadata scope [cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [14:58:50] I have an unexpected "Revert "Allow copyuploads on test2wiki"" [14:58:52] (03CR) 10Vgutierrez: "that diff output isn't related to `show_diff => false` not being enforced, right?" [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [14:59:03] hnowlan: looks like we both went to deploy at the same time ? [14:59:44] oh that is from TheresNoTime earlier [14:59:52] hashar: go ahead [14:59:53] (03PS2) 10DCausse: wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 [14:59:53] (03PS2) 10DCausse: wdqs: drop deploy_mode [puppet] - 10https://gerrit.wikimedia.org/r/1070603 [15:00:01] it was the revert from the earlier mess [15:00:03] (03CR) 10Filippo Giunchedi: [C:03+1] mtail: update cpu_throttle pattern [puppet] - 10https://gerrit.wikimedia.org/r/1070606 (https://phabricator.wikimedia.org/T373995) (owner: 10Herron) [15:00:08] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070607 :) [15:00:21] doing it :) [15:00:23] thanks claime ! 
[15:00:34] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070548|ParserOutput::collectMetadata: Log if given value is non-numeric and also non-string, for easier debugging, and don't fatal (T373920)]] [15:00:48] T373920: TypeError: MediaWiki\Parser\ParserOutput::setNumericPageProperty with non-numeric value - https://phabricator.wikimedia.org/T373920 [15:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:01] (03CR) 10Herron: [C:03+2] mtail: update cpu_throttle pattern [puppet] - 10https://gerrit.wikimedia.org/r/1070606 (https://phabricator.wikimedia.org/T373995) (owner: 10Herron) [15:02:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:02:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage [15:03:02] (03PS1) 10David Caro: cloudceph: add cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) [15:03:49] (03PS2) 10David Caro: cloudceph: add cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) [15:04:16] (03PS3) 10David Caro: cloudceph: add cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) [15:04:22] (03PS3) 10Hnowlan: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 [15:04:22] !log hashar@deploy1003 hashar: Backport for [[gerrit:1070548|ParserOutput::collectMetadata: Log if given value is non-numeric and also non-string, for easier debugging, and don't fatal (T373920)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [15:04:41] !log hashar@deploy1003 hashar: Continuing with sync [15:05:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [15:06:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage [15:06:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:06:30] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3867/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro) [15:07:02] (03CR) 10JHathaway: [C:03+1] MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - 10https://gerrit.wikimedia.org/r/1066750 (owner: 10Slyngshede) [15:07:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10117845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ml-lab... [15:07:34] (03CR) 10JHathaway: [C:03+1] Management command for importing TOTP tokens from MediaWiki. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1067918 (owner: 10Slyngshede) [15:09:10] (03CR) 10JHathaway: "correct, I believe that is either a PCC bug, or feature" [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [15:09:11] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070548|ParserOutput::collectMetadata: Log if given value is non-numeric and also non-string, for easier debugging, and don't fatal (T373920)]] (duration: 08m 37s) [15:09:14] T373920: TypeError: MediaWiki\Parser\ParserOutput::setNumericPageProperty with non-numeric value - https://phabricator.wikimedia.org/T373920 [15:10:38] (03CR) 10Andrew Bogott: [C:03+1] cloudceph: add cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro) [15:11:00] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:11:02] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro) [15:11:13] (03PS1) 10David Caro: idp: add missing hiera values to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1070624 [15:11:22] RECOVERY - MD RAID on cp7015 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:11:41] \o/ [15:12:21] (03CR) 10Slyngshede: [C:03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1070624 (owner: 10David Caro) [15:12:23] I am off for a bit [15:12:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab1005.eqiad.wmnet with OS bookworm [15:12:55] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10117889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm execut... [15:12:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host phab1005.eqiad.wmnet with OS bookworm [15:13:11] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7015.magru.wmnet [15:13:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7015.magru.wmnet [15:13:40] 10ops-codfw, 06DC-Ops, 06serviceops: kubernetes2035 (renamed to wikikube-worker2087) reporting "Comm Error: Backplane 0" - https://phabricator.wikimedia.org/T374019 (10JMeybohm) 03NEW [15:13:41] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10117906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm [15:13:43] (03CR) 10David Caro: [C:03+2] idp: add missing hiera values to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1070624 (owner: 10David Caro) [15:14:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:15:15] (03CR) 10David Caro: [C:03+2] envvars-backend: update endpoint to new schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [15:15:19] !log vgutierrez@cumin1002 
conftool action : set/pooled=yes; selector: name=cp7015.magru.wmnet [15:15:56] (03CR) 10David Caro: [V:03+1 C:03+2] cloudceph: add cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070621 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro) [15:16:51] (03PS1) 10Scott French: decommission mw2267 (no changes for mw2260) [puppet] - 10https://gerrit.wikimedia.org/r/1070627 (https://phabricator.wikimedia.org/T374018) [15:16:58] (03Abandoned) 10Scott French: kubernetes: mw2260 revert to role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/1070364 (owner: 10Scott French) [15:17:23] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10117950 (10JMeybohm) [15:20:58] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2081.codfw.wmnet [15:21:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2081.codfw.wmnet [15:24:47] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2267.codfw.wmnet [15:25:21] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2267.codfw.wmnet [15:29:26] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet - https://phabricator.wikimedia.org/T371262#10118008 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:29:28] (03PS2) 10Ladsgroup: tables-catalog: Add another batch of mediawiki core tables [puppet] - 10https://gerrit.wikimedia.org/r/1070616 (https://phabricator.wikimedia.org/T363581) [15:29:32] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add another batch of mediawiki core tables [puppet] - 10https://gerrit.wikimedia.org/r/1070616 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:31:14] jouncebot: nowandnext [15:31:14] No 
deployments scheduled for the next 1 hour(s) and 28 minute(s) [15:31:15] In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T1700) [15:31:51] (03CR) 10Ladsgroup: [C:03+2] Fix bug causing review form to disappear on unreviewed pages [extensions/FlaggedRevs] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070324 (https://phabricator.wikimedia.org/T373582) (owner: 10Ladsgroup) [15:32:24] (03PS1) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:34:52] (03PS3) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [15:34:52] (03PS2) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:36:28] (03CR) 10Clément Goubert: [C:03+1] decommission mw2267 (no changes for mw2260) [puppet] - 10https://gerrit.wikimedia.org/r/1070627 (https://phabricator.wikimedia.org/T374018) (owner: 10Scott French) [15:38:19] (03PS1) 10David Caro: cloudceph: add missing mgr entry for cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070631 (https://phabricator.wikimedia.org/T374005) [15:38:20] (03PS3) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:38:37] (03CR) 10David Caro: [C:03+2] cloudceph: add missing mgr entry for cloudcephmon1005 [puppet] - 10https://gerrit.wikimedia.org/r/1070631 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro) [15:40:43] 06SRE, 06Infrastructure-Foundations, 10netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024 (10cmooney) 03NEW p:05Triage→03Medium 
[15:41:06] (03PS4) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:41:53] (03Merged) 10jenkins-bot: Fix bug causing review form to disappear on unreviewed pages [extensions/FlaggedRevs] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070324 (https://phabricator.wikimedia.org/T373582) (owner: 10Ladsgroup) [15:42:53] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@3b0d8ba]: Regular analytics weekly train [airflow-dags@3b0d8ba1] [15:43:02] (03CR) 10FNegri: [C:03+1] prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1070206 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [15:43:21] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1070324|Fix bug causing review form to disappear on unreviewed pages (T373582)]] [15:43:22] !log configure lsw1-c1-codfw interfaces for servers in advance of move T373095 [15:43:23] T373582: The review form doesn't show up for pages without any stable edit - https://phabricator.wikimedia.org/T373582 [15:43:24] (03PS4) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [15:43:24] (03PS5) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:26] T373095: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095 [15:43:41] topranks: oh I need to depool servers for you right? 
[15:43:42] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@3b0d8ba]: Regular analytics weekly train [airflow-dags@3b0d8ba1] (duration: 00m 48s) [15:43:58] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10118063 (10Dzahn) Was this a bug in the cookbook? [15:44:28] claime: I think Alex already took care of it :) [15:44:32] (03CR) 10Hnowlan: renumber-node: Add --os parameter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1070590 (owner: 10JMeybohm) [15:44:33] ah cool [15:44:34] (03CR) 10Scott French: [C:03+2] decommission mw2267 (no changes for mw2260) [puppet] - 10https://gerrit.wikimedia.org/r/1070627 (https://phabricator.wikimedia.org/T374018) (owner: 10Scott French) [15:44:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:17] claime: this was the comment, I think it covers them all: https://phabricator.wikimedia.org/T373095#10116030 [15:45:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:13] topranks: yeah, fantastic [15:46:32] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [15:47:04] 10SRE-tools, 10Cumin, 06Infrastructure-Foundations, 10Spicerack: Formalize and share the spicerack/cumin release process - https://phabricator.wikimedia.org/T276443#10118074 (10elukey) 05Open→03Resolved a:03elukey We have now https://gitlab.wikimedia.org/repos/sre/python-release that basically do... [15:47:30] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10118086 (10Dzahn) >>!
In T373485#10115801, @Krd wrote: > Why is this exemption hardcoded in a script while the addresses could have been... [15:47:50] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1070324|Fix bug causing review form to disappear on unreviewed pages (T373582)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:48:59] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [15:49:21] (03CR) 10Elukey: mediawiki: fetch active deployment host (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:49:30] (03PS5) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [15:49:30] (03PS6) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:50:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3873/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [15:50:40] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:50] (03PS5) 10Clément Goubert: sre.k8s.renumber-node: Refactor host setup depending on backend. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [15:53:26] (03PS7) 10Btullis: Add two test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [15:53:48] !log swfrench@cumin2002 START - Cookbook sre.hosts.decommission for hosts mw[2260,2267].codfw.wmnet [15:53:52] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070324|Fix bug causing review form to disappear on unreviewed pages (T373582)]] (duration: 10m 31s) [15:53:55] T373582: The review form doesn't show up for pages without any stable edit - https://phabricator.wikimedia.org/T373582 [15:54:06] (03PS1) 10Effie Mouzeli: mcrouter: double mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070633 (https://phabricator.wikimedia.org/T374025) [15:55:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2125 db2138 db2149 db2190 db2206 db2207 es2031 es2032 es2036 - T370852', diff saved to https://phabricator.wikimedia.org/P68650 and previous config saved to /var/cache/conftool/dbconfig/20240904-155459-arnaudb.json [15:55:03] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [15:55:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: network maintenance T370852 [15:56:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 10 hosts with reason: network maintenance T370852 [15:56:08] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2083.codfw.wmnet [15:56:10] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373894#10118125 (10Jhancock.wm) still alerting this morning. rebooted idrac. will check on it after some other tasks. 
[15:56:19] (03PS2) 10Effie Mouzeli: mcrouter: double mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070633 (https://phabricator.wikimedia.org/T374025) [15:56:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118127 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbering for host wikikube... [15:56:27] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2083.codfw.wmnet with OS bullseye [15:56:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2083.codf... [15:56:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2083 [15:56:45] (03CR) 10Ryan Kemper: [C:03+2] tlsproxy::localssl: Remove support for cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:56:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070634 [15:56:49] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:56:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070634 (owner: 10TrainBranchBot) [15:57:50] (03CR) 10Giuseppe Lavagetto: [C:03+1] mcrouter: double mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070633 (https://phabricator.wikimedia.org/T374025) (owner: 10Effie Mouzeli) [15:58:33] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART [15:59:52] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 27 hosts with reason: Move server uplinks codfw racks C1 [16:00:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 27 hosts with reason: Move server uplinks codfw racks C1 [16:00:48] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10118144 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6ba6c00e-f364-45da-8be3-ee80785b36c0) set by cmooney@cumin1002 for... [16:00:57] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:37] (03PS6) 10Clément Goubert: sre.k8s.renumber-node: Refactor host setup depending on backend. [cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [16:01:53] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [16:01:57] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:03:04] (03CR) 10Hnowlan: [C:03+1] sre.k8s.renumber-node: Refactor host setup depending on backend. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [16:03:59] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2083 - hnowlan@cumin1002" [16:04:14] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:15] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2260,2267].codfw.wmnet [16:04:32] (03PS6) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:04:32] (03PS8) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:05:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10118151 (10Dzahn) 05In progress→03Stalled Thanks for that update. Setting to stalled for now. This will of course be changed once we got the approval. 
Cheers [16:05:04] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: double mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070633 (https://phabricator.wikimedia.org/T374025) (owner: 10Effie Mouzeli) [16:05:53] (03PS9) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:05:58] (03Merged) 10jenkins-bot: mcrouter: double mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070633 (https://phabricator.wikimedia.org/T374025) (owner: 10Effie Mouzeli) [16:06:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3876/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:06:44] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [16:06:50] !log migrating servers in codfw rack C1 from asw-c-codfw to lsw1-c1-codfw T373095 [16:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] T373095: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095 [16:07:17] !log restarting mcrouter on codfw [16:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:21] (03PS10) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:08:58] 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, and 2 others: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#10118192 (10ABran-WMF) [16:09:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2083 - 
hnowlan@cumin1002" [16:09:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:10] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2083.codfw.wmnet 167.16.192.10.in-addr.arpa 7.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:09:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2083.codfw.wmnet 167.16.192.10.in-addr.arpa 7.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:09:14] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2083 [16:09:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3877/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:09:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2083 [16:09:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2083 [16:10:35] (03Abandoned) 10Clément Goubert: Remove wikikube-worker2088.codfw.wmnet from wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1070610 (https://phabricator.wikimedia.org/T373982) (owner: 10Clément Goubert) [16:11:22] PROBLEM - Host mw2318 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:27] (03CR) 10JMeybohm: [C:03+1] sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [16:12:02] (03CR) 10Hnowlan: [C:03+2] sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [16:12:04] (03PS7) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 
10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:12:04] (03PS11) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:12:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3878/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:13:45] !log running homer 'cr*codfw*' commit 'T374018' [16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:48] T374018: decommission mw2260.codfw.wmnet, mw2267.codfw.wmnet - https://phabricator.wikimedia.org/T374018 [16:13:53] (03CR) 10Btullis: [V:03+1 C:03+2] Lower the number of slots that the enwiki dump uses [puppet] - 10https://gerrit.wikimedia.org/r/1070558 (https://phabricator.wikimedia.org/T373904) (owner: 10Btullis) [16:14:13] (03PS14) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [16:14:13] (03PS1) 10Andrew Bogott: cloudweb2002-dev idp: change service id to be more restrictive [puppet] - 10https://gerrit.wikimedia.org/r/1070636 [16:14:48] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Refactor host setup depending on backend. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [16:15:25] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10118222 (10andrea.denisse) a:03andrea.denisse [16:16:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2036.codfw.wmnet with OS bullseye [16:16:23] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10118229 (10cmooney) Link moves completed, all servers now responding to ping again so looks ok. Unsure of exact times for each but looking at... [16:16:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2036.codfw.w... 
[16:16:42] !log homer lsw1-b8-codfw* commit 'T372878' [16:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:44] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:17:21] (03PS8) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:17:21] (03PS12) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68651 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68652 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3879/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68654 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68653 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 25%: T370852', diff saved to 
https://phabricator.wikimedia.org/P68655 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68656 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68657 and previous config saved to /var/cache/conftool/dbconfig/20240904-161806-arnaudb.json [16:18:10] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [16:18:53] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2036.codfw.wmnet [16:18:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2036.codfw.wmnet [16:18:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2036.codfw.wmnet [16:19:03] (03CR) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:19:24] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:20:03] FIRING: [3x] KubernetesCalicoDown: mw2316.codfw.wmnet is not 
running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:20:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118258 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [16:21:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:22:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10118280 (10ABran-WMF) d/p hosts are repooling [16:22:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070634 (owner: 10TrainBranchBot) [16:24:50] (03CR) 10Ottomata: "Cooooool! Some thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:25:11] (03CR) 10Ottomata: "(unresolving)" [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:26:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2083.codfw.wmnet with reason: host reimage [16:26:18] (03CR) 10Ottomata: "Oh, also. 
Is the intention for folks to have to ensure the containing directory of the file is created, or can the full directory path be" [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:26:34] (03PS1) 10Bartosz Dziewoński: Do not consume 'centralauthtoken' on api.php OPTIONS requests [extensions/CentralAuth] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070638 (https://phabricator.wikimedia.org/T373925) [16:26:51] hashar: dancy: hi, i saw you were doing some train blocker backports earlier. do you want to do https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1070638 too or should i schedule it for the normal backport window? [16:27:12] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2011.codfw.wmnet [16:27:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2011.codfw.wmnet [16:27:20] MatmaRex: I want all the fixes applied ASAP [16:27:28] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2012.codfw.wmnet [16:27:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2012.codfw.wmnet [16:27:39] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2036.codfw.wmnet [16:27:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2036.codfw.wmnet [16:27:45] Matmaxrex: If that's ready to go I can backport it now [16:27:46] PROBLEM - Host kubernetes2010 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:51] dancy: cool, that one's good to go then. 
i can verify if you deploy it [16:27:52] *Matmarex [16:27:52] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2037.codfw.wmnet [16:27:52] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Refactor host setup depending on backend. [cookbooks] - 10https://gerrit.wikimedia.org/r/1070585 (owner: 10Hnowlan) [16:27:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2037.codfw.wmnet [16:27:57] ok.. starting now [16:28:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070638 (https://phabricator.wikimedia.org/T373925) (owner: 10Bartosz Dziewoński) [16:28:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2436.codfw.wmnet [16:28:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2436.codfw.wmnet [16:28:42] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2437.codfw.wmnet [16:28:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2437.codfw.wmnet [16:29:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2037.codfw.wmnet [16:29:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2037.codfw.wmnet [16:29:12] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2038.codfw.wmnet [16:29:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2038.codfw.wmnet [16:29:29] !log T373095 repool kubernetes2011, kubernetes2012, kubernetes2036, kubernetes2037, wikikube-worker2037, wikikube-worker2038, mw2436, mw2437 [16:29:30] PROBLEM - Host kubernetes2035 is DOWN: PING CRITICAL - 
Packet loss = 100% [16:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] T373095: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095 [16:29:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2083.codfw.wmnet with reason: host reimage [16:29:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:30:02] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 455, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:32:00] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1070636 (owner: 10Andrew Bogott) [16:34:14] (03CR) 10Andrew Bogott: [C:03+2] cloudweb2002-dev idp: change service id to be more restrictive [puppet] - 10https://gerrit.wikimedia.org/r/1070636 (owner: 10Andrew Bogott) [16:34:18] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views for eswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070640 (https://phabricator.wikimedia.org/T374029) [16:35:26] (03PS1) 10Clément Goubert: sre.k8s.renumber-node: Make --os optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1070641 [16:37:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070640 (https://phabricator.wikimedia.org/T374029) (owner: 10C. 
Scott Ananian) [16:37:30] (03Merged) 10jenkins-bot: Do not consume 'centralauthtoken' on api.php OPTIONS requests [extensions/CentralAuth] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070638 (https://phabricator.wikimedia.org/T373925) (owner: 10Bartosz Dziewoński) [16:37:57] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1070638|Do not consume 'centralauthtoken' on api.php OPTIONS requests (T373925)]] [16:37:59] T373925: Cross-wiki API does not work any more - https://phabricator.wikimedia.org/T373925 [16:39:20] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2260.codfw.wmnet, mw2267.codfw.wmnet - https://phabricator.wikimedia.org/T374018#10118404 (10Scott_French) a:05Scott_French→03None [16:40:04] !log dancy@deploy1003 matmarex, dancy: Backport for [[gerrit:1070638|Do not consume 'centralauthtoken' on api.php OPTIONS requests (T373925)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:40:47] (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read Views for eswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070640 (https://phabricator.wikimedia.org/T374029) (owner: 10C. 
Scott Ananian) [16:41:17] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:41:35] dancy: looks good on mwdebug [16:41:44] ok proceeding [16:41:46] !log dancy@deploy1003 matmarex, dancy: Continuing with sync [16:42:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68658 and previous config saved to /var/cache/conftool/dbconfig/20240904-164221-arnaudb.json [16:42:25] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10118457 (10cmooney) >>! In T373096#10106969, @Dzahn wrote: > The server `phab2002` mentioned here for Collaboration Services is standby a... 
[16:42:26] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [16:42:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68659 and previous config saved to /var/cache/conftool/dbconfig/20240904-164243-arnaudb.json [16:42:46] (03PS32) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [16:43:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68660 and previous config saved to /var/cache/conftool/dbconfig/20240904-164305-arnaudb.json [16:43:14] (03CR) 10JMeybohm: [C:03+2] "oops, sorry" [cookbooks] - 10https://gerrit.wikimedia.org/r/1070641 (owner: 10Clément Goubert) [16:43:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68661 and previous config saved to /var/cache/conftool/dbconfig/20240904-164321-arnaudb.json [16:43:22] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:43:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68662 and previous config saved to /var/cache/conftool/dbconfig/20240904-164340-arnaudb.json [16:43:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68663 and previous config saved to /var/cache/conftool/dbconfig/20240904-164351-arnaudb.json [16:44:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68664 and previous 
config saved to /var/cache/conftool/dbconfig/20240904-164404-arnaudb.json [16:44:07] (03PS9) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:44:07] (03PS13) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:44:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68665 and previous config saved to /var/cache/conftool/dbconfig/20240904-164421-arnaudb.json [16:44:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68666 and previous config saved to /var/cache/conftool/dbconfig/20240904-164435-arnaudb.json [16:44:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3880/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:45:08] (03PS10) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:45:09] (03PS14) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:45:52] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3881/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:47:34] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [16:48:13] !log dancy@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1070638|Do not consume 'centralauthtoken' on api.php OPTIONS requests (T373925)]] (duration: 10m 16s) [16:48:17] T373925: Cross-wiki API does not work any more - https://phabricator.wikimedia.org/T373925 [16:48:35] (03CR) 10Scott French: [C:03+1] sre.k8s.renumber-node: Make --os optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1070641 (owner: 10Clément Goubert) [16:49:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2083.codfw.wmnet with OS bullseye [16:49:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2083.codfw.wm... [16:50:04] (03PS1) 10David Caro: cloudcephmon: use the right path for the mon keyring [puppet] - 10https://gerrit.wikimedia.org/r/1070642 [16:50:12] (03PS15) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [16:50:12] (03PS1) 10Andrew Bogott: Add some inline comments explaining about keystone resources [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) [16:51:17] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:51:25] (03PS11) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [16:51:25] (03PS15) 10Btullis: Add some test secrets to an-test-master servers 
[puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [16:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:52:00] (03PS2) 10Andrew Bogott: Add some inline comments explaining about keystone resources [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) [16:52:00] (03PS16) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [16:52:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3882/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:52:34] (03PS33) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [16:53:20] (03PS3) 10Andrew Bogott: Add some inline comments explaining about keystone resources [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) [16:53:20] (03PS17) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [16:53:41] (03CR) 10David Caro: [C:03+2] cloudcephmon: use the right path for the mon keyring [puppet] - 10https://gerrit.wikimedia.org/r/1070642 (owner: 10David Caro) [16:53:50] thanks for deploying dancy! 
[16:54:04] 06SRE, 06Infrastructure-Foundations, 10netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10118603 (10cmooney) [16:54:40] 06SRE, 06Infrastructure-Foundations, 10netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10118608 (10cmooney) [16:54:43] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:56:47] (03CR) 10FNegri: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [16:56:57] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Make --os optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1070641 (owner: 10Clément Goubert) [16:57:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68667 and previous config saved to /var/cache/conftool/dbconfig/20240904-165727-arnaudb.json [16:57:30] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [16:57:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68668 and previous config saved to /var/cache/conftool/dbconfig/20240904-165749-arnaudb.json [16:58:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68669 and previous config saved to /var/cache/conftool/dbconfig/20240904-165811-arnaudb.json [16:58:21] (03PS1) 10David Caro: Revert "cloudcephmon: use the right path for the mon keyring" [puppet] - 10https://gerrit.wikimedia.org/r/1070644 [16:58:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: T370852', 
diff saved to https://phabricator.wikimedia.org/P68670 and previous config saved to /var/cache/conftool/dbconfig/20240904-165827-arnaudb.json [16:58:36] (03CR) 10David Caro: [C:03+2] Revert "cloudcephmon: use the right path for the mon keyring" [puppet] - 10https://gerrit.wikimedia.org/r/1070644 (owner: 10David Caro) [16:58:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68671 and previous config saved to /var/cache/conftool/dbconfig/20240904-165846-arnaudb.json [16:58:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68672 and previous config saved to /var/cache/conftool/dbconfig/20240904-165857-arnaudb.json [16:59:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68673 and previous config saved to /var/cache/conftool/dbconfig/20240904-165909-arnaudb.json [16:59:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68674 and previous config saved to /var/cache/conftool/dbconfig/20240904-165926-arnaudb.json [16:59:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68675 and previous config saved to /var/cache/conftool/dbconfig/20240904-165941-arnaudb.json [16:59:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:00:04] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T1700) [17:00:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:32] !log homer lsw1-b3-codfw* commit 'T372878' [17:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:39] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [17:00:51] FIRING: [4x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:59] (03CR) 10Hnowlan: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [17:06:51] (03PS4) 10Andrew Bogott: Add some inline comments explaining about keystone resources [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) [17:06:51] (03PS18) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [17:07:17] Rolling train to group0 [17:07:30] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070647 (https://phabricator.wikimedia.org/T373640) [17:07:32] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070647 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [17:08:12] (03Merged) 
10jenkins-bot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070647 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [17:08:29] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add some inline comments explaining about keystone resources [puppet] - 10https://gerrit.wikimedia.org/r/1070643 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [17:09:21] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2083.codfw.wmnet [17:09:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2083.codfw.wmnet [17:09:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2083.codfw.wmnet [17:10:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10118739 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... 
[17:12:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68676 and previous config saved to /var/cache/conftool/dbconfig/20240904-171232-arnaudb.json [17:12:35] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [17:12:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68677 and previous config saved to /var/cache/conftool/dbconfig/20240904-171254-arnaudb.json [17:13:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68678 and previous config saved to /var/cache/conftool/dbconfig/20240904-171317-arnaudb.json [17:13:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68679 and previous config saved to /var/cache/conftool/dbconfig/20240904-171332-arnaudb.json [17:13:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68680 and previous config saved to /var/cache/conftool/dbconfig/20240904-171351-arnaudb.json [17:14:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68681 and previous config saved to /var/cache/conftool/dbconfig/20240904-171402-arnaudb.json [17:14:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68682 and previous config saved to /var/cache/conftool/dbconfig/20240904-171415-arnaudb.json [17:14:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68683 and previous config saved to 
/var/cache/conftool/dbconfig/20240904-171431-arnaudb.json [17:14:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68684 and previous config saved to /var/cache/conftool/dbconfig/20240904-171447-arnaudb.json [17:15:03] RESOLVED: [3x] KubernetesCalicoDown: mw2316.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:16:51] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.21 refs T373640 [17:16:54] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [17:17:45] (03PS2) 10Scott French: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) [17:17:46] (03PS2) 10Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) [17:17:48] (03PS2) 10Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) [17:17:49] (03PS2) 10Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) [17:18:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070648 [17:18:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070648 (owner: 10TrainBranchBot) [17:21:23] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070649 
(https://phabricator.wikimedia.org/T361346) [17:38:32] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10118893 (10andrea.denisse) 05Open→03In progress [17:43:48] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10118915 (10andrea.denisse) [17:48:21] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10118952 (10andrea.denisse) [17:48:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070648 (owner: 10TrainBranchBot) [17:50:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:50:49] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:16] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070652 (https://phabricator.wikimedia.org/T373640) [17:51:18] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070652 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [17:51:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 6.171 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:04] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070652 
(https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [17:52:24] !log dancy@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 [17:52:27] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [17:53:35] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10118977 (10andrea.denisse) Hi @Aklapper, I'm unable to remove the from the #acl-Project-admins and #acl_security Phabricator groups because I'm not an administrator of t... [17:59:13] !log dancy@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 (duration: 06m 48s) [17:59:16] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [18:00:04] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T1800) [18:03:04] (03PS1) 10Andrea Denisse: ldap: Offboard Manuel (WMDE) from WMF systems [puppet] - 10https://gerrit.wikimedia.org/r/1070653 (https://phabricator.wikimedia.org/T373927) [18:03:53] (03CR) 10Andrea Denisse: [C:03+2] ldap: Offboard Manuel (WMDE) from WMF systems [puppet] - 10https://gerrit.wikimedia.org/r/1070653 (https://phabricator.wikimedia.org/T373927) (owner: 10Andrea Denisse) [18:06:26] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10119075 (10andrea.denisse) [18:20:43] 06SRE: Upload slow - https://phabricator.wikimedia.org/T372217#10119156 (10andrea.denisse) 05Open→03Stalled Changing status as we're awaiting the user's feedback.
[18:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:28:00] (03PS1) 10Jdlrobson: Fixes: Echo icon not visible after click [skins/Vector] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070657 (https://phabricator.wikimedia.org/T373936) [18:29:13] (03PS2) 10Cathal Mooney: Update prefix-lists for new private, global IPv6 ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1070589 (https://phabricator.wikimedia.org/T330153) [18:31:30] (03CR) 10Scott French: "Thanks again in advance for the review, @cgoubert@wikimedia.org!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [18:35:19] (03PS3) 10Bking: wdqs: remove deprecated wcqs reload crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070551 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:35:56] (03PS4) 10Bking: wdqs: remove deprecated wcqs reload crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070551 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:35:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070551 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:38:25] (03CR) 10Bking: [C:03+2] wdqs: remove deprecated wcqs reload crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070551 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:38:41] (03CR) 10Bking: [V:03+2 C:03+2] wdqs: remove deprecated wcqs reload crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070551 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:49:16] (03PS3) 10Bking: wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:49:20] 
(03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:49:27] (03CR) 10CI reject: [V:04-1] wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:49:45] (03PS4) 10Bking: wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:49:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:51:46] (03CR) 10Bking: [C:03+2] wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:51:50] (03CR) 10Bking: [V:03+2 C:03+2] wdqs: drop run_tests crontask [puppet] - 10https://gerrit.wikimedia.org/r/1070587 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [18:51:57] 10ops-codfw, 06DC-Ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054 (10Dwisehaupt) 03NEW [19:00:50] (03PS3) 10DCausse: wdqs: drop deploy_mode [puppet] - 10https://gerrit.wikimedia.org/r/1070603 [19:01:05] (03PS4) 10Bking: wdqs: drop deploy_mode [puppet] - 10https://gerrit.wikimedia.org/r/1070603 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [19:01:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070603 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [19:02:43] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054#10119246 (10Dwisehaupt) [19:05:57] (03CR) 10Bking: [C:03+2] wdqs: drop deploy_mode [puppet] - 10https://gerrit.wikimedia.org/r/1070603 
(https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [19:06:30] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10119248 (10Dwisehaupt) @papaul Any time is fine. pay-lb2002 is down due to hardware error (T374054). I have powered down civi2002 and frpig2002 so t... [19:06:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070657 (https://phabricator.wikimedia.org/T373936) (owner: 10Jdlrobson) [19:07:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [19:07:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab1005.eqiad.wmnet with OS bookworm [19:07:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10119256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ml-lab1001... [19:08:01] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10119257 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm execut... 
[19:08:47] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10119259 (10Aklapper) [19:08:55] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10119260 (10Aklapper) [19:09:14] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10119262 (10Aklapper) >>! In T373927#10118977, @andrea.denisse wrote: > I'm unable to remove the from the #acl-Project-admins and #acl_security Phabricator groups because... [19:15:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host phab1005.eqiad.wmnet with OS bookworm [19:15:56] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10119271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm [19:27:20] (03PS1) 10C. Scott Ananian: ParserOutput: Turn off noisy log - we have the info we need for now [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070663 (https://phabricator.wikimedia.org/T374046) [19:28:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070663 (https://phabricator.wikimedia.org/T374046) (owner: 10C. Scott Ananian) [19:29:48] (03CR) 10Krinkle: ats: Fix issue with /api/ pointing to /w/rest.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070274 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [19:34:12] is the train still running? 
I added some info to https://phabricator.wikimedia.org/T373640#10119315 regarding T374046 [19:34:12] T374046: PHP Warning: MediaWiki\Parser\ParserOutput::collectMetadata: bad type for 'translate-is-translation', set '1' (T373920) [Called from MediaWiki\Parser\ParserOutput::collectMetadata in /srv/mediawiki/php-1.43.0-wmf.21/includes/pa - https://phabricator.wikimedia.org/T374046 [19:34:24] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1005.eqiad.wmnet with reason: host reimage [19:34:39] cscott: Train is pending your change and one other that is currently in flight. [19:34:56] ok [19:35:22] and by your change I mean https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1070548 which I will make a cherry-pick for now. [19:35:39] oh, that was already merged. [19:35:42] what's the other one.. [19:35:54] This one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1070663 [19:35:58] (03Merged) 10jenkins-bot: Fixes: Echo icon not visible after click [skins/Vector] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070657 (https://phabricator.wikimedia.org/T373936) (owner: 10Jdlrobson) [19:36:08] already cherry-picked.. excellent. [19:36:20] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1070657|Fixes: Echo icon not visible after click (T373936)]] [19:36:23] T373936: [Regression pre-wmf.21] The Alerts and the Notices icons disappear a while after the page load - https://phabricator.wikimedia.org/T373936 [19:37:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1005.eqiad.wmnet with reason: host reimage [19:38:31] !log dancy@deploy1003 jdlrobson, dancy: Backport for [[gerrit:1070657|Fixes: Echo icon not visible after click (T373936)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:39:10] jdlrobson: Lemme know when you've verified. 
[19:42:17] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10119328 (10Papaul) @Jhancock.wm let's try to do this first thing tomorrow morning and, like you said, keep the same port setup. Thanks [19:47:10] (03CR) 10Ahmon Dancy: [C:03+2] ParserOutput: Turn off noisy log - we have the info we need for now [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070663 (https://phabricator.wikimedia.org/T374046) (owner: 10C. Scott Ananian) [19:48:31] cjming: I can handle the backport window today. Train blocker fixes are taking a while to land. [19:49:57] !log dancy@deploy1003 Sync cancelled. [19:50:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070663 (https://phabricator.wikimedia.org/T374046) (owner: 10C. Scott Ananian) [19:54:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:56:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:56:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab1005.eqiad.wmnet with OS bookworm [19:56:12] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10119341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm comple...
[19:56:42] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10119343 (10Jclark-ctr) [19:56:57] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10119344 (10Jclark-ctr) 05Open→03Resolved [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T2000). nyaa~ [20:00:05] ebernhardson, MatmaRex, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:22] Hello folks. The backport window is delayed due to some train blocker processing. [20:01:38] hi [20:04:45] no worries, i have both a train blocker and a backport, so i'm around either way :) [20:11:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10119415 (10VRiley-WMF) [20:15:05] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:15:07] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:16:55] (03Merged) 10jenkins-bot: ParserOutput: Turn off noisy log - we have the info we need for now [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070663 (https://phabricator.wikimedia.org/T374046) (owner: 10C. 
Scott Ananian) [20:17:14] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1070663|ParserOutput: Turn off noisy log - we have the info we need for now (T374046)]] [20:17:17] T374046: PHP Warning: MediaWiki\Parser\ParserOutput::collectMetadata: bad type for 'translate-is-translation', set '1' (T373920) [Called from MediaWiki\Parser\ParserOutput::collectMetadata in /srv/mediawiki/php-1.43.0-wmf.21/includes/pa - https://phabricator.wikimedia.org/T374046 [20:17:31] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:17:33] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:19:21] !log dancy@deploy1003 dancy, cscott: Backport for [[gerrit:1070663|ParserOutput: Turn off noisy log - we have the info we need for now (T374046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:28] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:19:30] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:19:48] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:19:50] cscott: Please verify that all is well on testservers. [20:20:18] ok, i'll try to verify that the log volume has reduced [20:21:05] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:25:57] dancy: looks good [20:26:12] Excellent. 
Proceeding [20:26:14] !log dancy@deploy1003 dancy, cscott: Continuing with sync [20:30:48] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070663|ParserOutput: Turn off noisy log - we have the info we need for now (T374046)]] (duration: 13m 33s) [20:30:55] T374046: PHP Warning: MediaWiki\Parser\ParserOutput::collectMetadata: bad type for 'translate-is-translation', set '1' (T373920) [Called from MediaWiki\Parser\ParserOutput::collectMetadata in /srv/mediawiki/php-1.43.0-wmf.21/includes/pa - https://phabricator.wikimedia.org/T374046 [20:31:49] MatmaRex: You ready? [20:32:48] dancy: yeah, but let's do the other changes first, mine are just cleanup [20:32:55] ok [20:33:04] dancy: not sure if ebernhardson is around, but i worked with him on that patch [20:33:19] ah, thanks. [20:33:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070640 (https://phabricator.wikimedia.org/T374029) (owner: 10C. Scott Ananian) [20:34:29] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for eswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070640 (https://phabricator.wikimedia.org/T374029) (owner: 10C. Scott Ananian) [20:34:48] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1070640|Turn on Parsoid Read Views for eswikivoyage (T374029)]] [20:34:51] T374029: Deploy Parsoid Read Views to es wikivoyage - https://phabricator.wikimedia.org/T374029 [20:35:12] i think i'll reschedule my config cleanup for tomorrow, it seems you've had a busy day. but i'd like to see the change by ebernhardson go out today [20:35:30] That sounds great. 
[20:36:06] (03PS3) 10Ebernhardson: NetworkSession: Only enable for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070344 (https://phabricator.wikimedia.org/T373826) [20:36:51] !log dancy@deploy1003 cscott, dancy: Backport for [[gerrit:1070640|Turn on Parsoid Read Views for eswikivoyage (T374029)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:22] (03PS1) 10Aklapper: Weekly Phabricator data for Tech News: Add recipients, tweak params [puppet] - 10https://gerrit.wikimedia.org/r/1070667 (https://phabricator.wikimedia.org/T373952) [20:37:30] cscott: Awaiting verification [20:37:44] dancy: ok, checking! [20:39:35] dancy: looks good [20:40:00] !log dancy@deploy1003 cscott, dancy: Continuing with sync [20:40:04] Sweet [20:42:21] (the NetworkSession change isn't testable on mwdebug, instead there are links to metrics that it should affect on https://phabricator.wikimedia.org/T373826) [20:44:35] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070640|Turn on Parsoid Read Views for eswikivoyage (T374029)]] (duration: 09m 46s) [20:44:37] T374029: Deploy Parsoid Read Views to es wikivoyage - https://phabricator.wikimedia.org/T374029 [20:44:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070344 (https://phabricator.wikimedia.org/T373826) (owner: 10Ebernhardson) [20:45:14] (03PS1) 10Scott French: sre.discovery.datacenter: update EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) [20:45:40] (03Merged) 10jenkins-bot: NetworkSession: Only enable for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070344 (https://phabricator.wikimedia.org/T373826) (owner: 10Ebernhardson) [20:46:01] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]] [20:46:06] 
T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems - https://phabricator.wikimedia.org/T373826 [20:48:02] !log dancy@deploy1003 ebernhardson, dancy: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:48:07] !log dancy@deploy1003 ebernhardson, dancy: Continuing with sync [20:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:52:36] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]] (duration: 06m 34s) [20:52:41] T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems - https://phabricator.wikimedia.org/T373826 [20:54:01] !log gerrit servers: upgraded git package version [20:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:58] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070669 (https://phabricator.wikimedia.org/T373640) [20:56:02] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070669 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [20:56:44] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070669 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [20:57:10] thanks for deploying dancy [20:57:20] No problem. 
[21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240904T2100) [21:00:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:51] FIRING: [4x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:03:35] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.21 refs T373640 [21:03:38] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [21:12:37] (03CR) 10Dzahn: "thank you! agreed if we can avoid this then it's better to not even maintain it. I just saw it as a step up from maintaining it right in s" [puppet] - 10https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [21:13:41] (03CR) 10Dzahn: "sounds like a good idea to try it on new hardware. maybe set it back to WIP for now." [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [21:17:54] (03CR) 10Dzahn: "there were 3 very short spikes where a single IP got throttled for a minimal amount of time, in the last 2 days. 
but that doesn't mean it'" [puppet] - https://gerrit.wikimedia.org/r/1070025 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[21:20:37] (CR) Dzahn: "should we change wh" [puppet] - https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn)
[21:27:29] (PS1) Cathal Mooney: Fix some bugs with the move_server Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024)
[21:28:14] (PS1) JHathaway: vrts_aliases: add retry logic [puppet] - https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257)
[21:29:16] (PS2) Cathal Mooney: Fix some bugs with the move_server Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024)
[21:30:51] (CR) CI reject: [V:-1] vrts_aliases: add retry logic [puppet] - https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: JHathaway)
[21:33:02] (PS2) JHathaway: vrts_aliases: add retry logic [puppet] - https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257)
[21:33:08] (PS1) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130)
[21:47:53] (CR) Dzahn: [C:+2] "no fundamental difference here, added annotation in dashboard. still the occasional spike" [puppet] - https://gerrit.wikimedia.org/r/1070295 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[21:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:25] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:56] (PS1) Peter Fischer: Let PageEntitySerializer.canonicalPageURL accept PageReference [extensions/EventBus] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1070675 (https://phabricator.wikimedia.org/T372904)
[21:57:54] (CR) Dzahn: [C:+1] "looks good to me. let's test it" [puppet] - https://gerrit.wikimedia.org/r/1070591 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[22:10:06] (CR) Dzahn: [C:+2] Weekly Phabricator data for Tech News: Add recipients, tweak params [puppet] - https://gerrit.wikimedia.org/r/1070667 (https://phabricator.wikimedia.org/T373952) (owner: Aklapper)
[22:20:26] (CR) Dzahn: "waiting for approval, manager on vacation, but not urgent for Zoe, see ticket" [puppet] - https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: Ssingh)
[22:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:36:08] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10119784 (Papaul) The diagram below will outline the cabling of the new Fundraising network devices {F57461650}
[23:06:11] (PS1) BCornwall: corto: Add gdrive-creds.json [labs/private] - https://gerrit.wikimedia.org/r/1070679
[23:06:48] (CR) BCornwall: [V:+2 C:+2] corto: Add gdrive-creds.json [labs/private] - https://gerrit.wikimedia.org/r/1070679 (owner: BCornwall)
[23:07:41] (PS18) BCornwall: Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[23:08:04] (CR) CI reject: [V:-1] Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[23:13:39] (PS19) BCornwall: Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[23:14:01] (CR) CI reject: [V:-1] Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[23:14:11] (PS20) BCornwall: Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[23:19:17] (CR) BCornwall: [V:+1] "I've updated this to include" [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[23:27:04] (PS1) Dzahn: site: add insetup-gerrit role to gerrit2003, remove gerrit1004 hiera [puppet] - https://gerrit.wikimedia.org/r/1070680 (https://phabricator.wikimedia.org/T372804)
[23:28:52] (PS1) Scott French: [DNM] P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681
[23:29:14] (CR) CI reject: [V:-1] [DNM] P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681 (owner: Scott French)
[23:34:40] (CR) Scott French: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1070681 (owner: Scott French)
[23:36:32] (PS2) Scott French: [DNM] P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681
[23:38:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070682
[23:38:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070682 (owner: TrainBranchBot)
[23:45:32] (PS2) Dzahn: site: add insetup-gerrit role to gerrit2003, remove gerrit1004 hiera [puppet] - https://gerrit.wikimedia.org/r/1070680 (https://phabricator.wikimedia.org/T372804)
[23:47:30] SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10119911 (colewhite)
[23:47:42] (CR) BCornwall: [V:+1] "I've updated this to include a config param to specify the gdrive-creds.json file. Support for customizing that in corto is at https://git" [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[23:53:55] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10119914 (Dwisehaupt) @Papaul Could you check this lights out interface when you get a chance. I am unable to get a response on port 22 or port 443 from the bastion for frban2...
[23:53:56] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10119915 (Dwisehaupt) @Papaul Could you check this lights out interface when you get a chance. I am unable to get a response on port 22 or port 443 from the bastion for frdb...
[23:53:58] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10119916 (Dwisehaupt) @Papaul Could you check this lights out interface when you get a chance. I am unable to get a response on port 22 or port 443 from the bastion for...
[23:56:59] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1070680/3884/" [puppet] - https://gerrit.wikimedia.org/r/1070680 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)