[07:53:11] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10136511 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9c649759-0814-4af1-9b77-d1cbc2c297aa) set by jayme@cumin1002 for 1 day, 0:00:00 on 2 host(s)...
[08:25:50] serviceops, SRE: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136524 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm
[08:34:29] serviceops, MW-on-K8s, wikitech.wikimedia.org, Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10136529 (jijiki)
[08:53:31] serviceops, SRE: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm completed: - dragonfly-supernode1001 (**PASS**...
[08:54:09] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF - https://phabricator.wikimedia.org/T374442#10136573 (Clement_Goubert) Open→Resolved a:Clement_...
[10:00:58] serviceops: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10136773 (elukey)
[10:59:12] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517#10136898 (hnowlan) The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a r...
[11:25:21] I'm moving the dragonfly-supernode role to Puppet 7 (2001 is on P7 already)
[11:31:06] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517#10136980 (hnowlan) From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault: ` root@mw1451:/home/hnowlan#...
[11:43:00] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10137027 (MoritzMuehlenhoff)
[11:53:40] !log failover Equinix IXP peering on cr1-eqiad to test new port T370696
[12:43:48] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10137168 (JMeybohm) Open→Resolved a:JMeybohm Given the last replacement did again cause service disruption (even though less bad than the first replacement) I went with copying all kafk...
[12:54:16] serviceops, collaboration-services, Infrastructure-Foundations, GitLab (Integrations), and 2 others: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876#10137210 (MoritzMuehlenhoff) One other thing to consider is to run docker-repo daily (at least for Mo-F...
[12:54:46] serviceops, Wikifunctions, Abstract Wikipedia team (25Q1 (Jul–Sep)): While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF - https://phabricator.wikimedia.org/T374442#10137206 (Jdforrester-WMF) This is great, thank you!
[12:54:47] o/ quick heads up that I'm having issues with kafka-main@codfw, will try to restart some jobs to see if this helps
[13:03:01] seems to have worked, the kafka client was apparently very confused by something, claiming "Discovered transaction coordinator kafka-main2008.codfw.wmnet:9092 (id: 2003 rack: null)"
[13:03:17] id: 2003 for kafka-main2008 seems weird
[13:04:12] dcausse: that might be a result of some of the hardware replacements that jayme was working on
[13:04:51] cdanis: ok thanks, yes that might explain it
[13:05:30] dcausse: o/ we kept the old id for 2008 so the new host would pick up without issues (and without the need to move partitions etc..)
[13:06:34] now that we are on the subject.. Is there a specific reason why port 9092 is used instead of 9093? (plaintext vs TLS)
[13:06:46] elukey: ok makes sense, I think this caused one of our clients to go a bit crazy, but no big deal, a restart did the trick
[13:07:30] elukey: no, just historical reasons and the fact that the job did not allow setting ssl, I think we can now, will try to get that done
[13:08:05] dcausse: yes yes without any rush, I was curious :)
[13:10:55] indeed "my fault" - I did not yet deploy config changes to all affected services because in theory it should not be a problem ... will do in a bit
[13:34:06] serviceops: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10137362 (elukey) The poolcounter2005 host is up with Bookworm, as far as I can see it seems working fine. If serviceops could confirm that the host is working, we can easily swap 2004 with 2005 in mediawiki config.
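Editor's note on the 13:03 to 13:06 exchange above: the surprise was a broker id (2003) that does not match the number in the hostname (kafka-main2008), on the plaintext port 9092. As a minimal sketch, assuming the client log line has exactly the shape quoted at 13:03:01, a check like the following could flag such lines. The names `parse_coordinator`, `host_number` and `id_matches_hostname` are hypothetical helpers written for this note, not part of any Kafka client, and the 9092 = plaintext convention is taken only from the chat.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the log line has the shape quoted in the chat at 13:03:01.
COORDINATOR_RE = re.compile(
    r"Discovered transaction coordinator "
    r"(?P<host>[\w.-]+):(?P<port>\d+) "
    r"\(id: (?P<broker_id>\d+) rack: (?P<rack>\S+)\)"
)

# Assumption from the chat: 9092 is the plaintext listener, 9093 the TLS one.
PLAINTEXT_PORT = 9092


class Coordinator(NamedTuple):
    host: str
    port: int
    broker_id: int
    rack: Optional[str]


def parse_coordinator(line: str) -> Optional[Coordinator]:
    """Extract host, port, broker id and rack from a coordinator log line."""
    m = COORDINATOR_RE.search(line)
    if m is None:
        return None
    rack = None if m["rack"] == "null" else m["rack"]
    return Coordinator(m["host"], int(m["port"]), int(m["broker_id"]), rack)


def host_number(host: str) -> Optional[int]:
    """Return the numeric suffix of the short hostname (2008 for kafka-main2008)."""
    m = re.match(r"[a-z-]+?(\d+)(?:\.|$)", host)
    return int(m.group(1)) if m else None


def id_matches_hostname(c: Coordinator) -> bool:
    """True when the broker id equals the number embedded in the hostname."""
    return host_number(c.host) == c.broker_id


if __name__ == "__main__":
    line = ('Discovered transaction coordinator '
            'kafka-main2008.codfw.wmnet:9092 (id: 2003 rack: null)')
    c = parse_coordinator(line)
    if c is not None:
        print(c)
        print("id matches hostname:", id_matches_hostname(c))
        print("plaintext port:", c.port == PLAINTEXT_PORT)
```

A mismatch is not an error in itself: per the 13:05:30 message, the old id was kept on purpose so that partitions would not have to move. The check only makes the reused id visible.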
[13:41:59] poolcounter2005 is up and running, so far it looks good
[13:42:10] I am a bit ignorant and I don't know any specific test for it
[13:42:24] but once we are confident, we can add it to mw config
[13:46:38] afaict we don't have a procedure to copy already existing locks either, i don't know how impactful swapping an empty server would be
[14:07:28] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374380#10137505 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:18:45] claime: I think it should be as impactful as removing it from mw-config for reimage, at least in theory..
[14:19:18] there may be some temporary errors happening, the same as if one of the poolcounter nodes is rebooted
[14:20:13] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542 (JMeybohm) NEW
[14:20:24] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137567 (JMeybohm)
[14:20:26] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10137568 (JMeybohm)
[14:38:21] serviceops, DC-Ops, decommission-hardware, ops-codfw, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137625 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2003.codfw.wmnet` - kafka-main20...
[14:38:50] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting, and 2 others: ptwiki: Use backing node service instead of RESTBase on pregeneration changeprop rules - https://phabricator.wikimedia.org/T372749#10137621 (Jgiannelos) @Seddon informed me that ptwiki is used for som...
[14:39:29] serviceops, DC-Ops, decommission-hardware, ops-codfw, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137630 (JMeybohm)
[14:40:52] claime: you ok to depool the k8s hosts for today's network move?
[14:40:57] topranks: yep
[14:41:05] I'll do it just before the meeting
[14:41:07] cool... we have wikikube-ctrl2002 on the list too btw
[14:41:11] great thanks
[14:41:38] and kafka-main2003, not sure about that one
[14:41:52] decommed 2min ago :)
[14:42:06] jayme: haha nice :)
[14:42:09] much obliged
[15:06:22] topranks: all good for k8s nodes
[15:06:48] claime: thanks <3
[16:04:25] serviceops, Datacenter-Switchover: Determine switchover changes for migration of video scaling to k8s - https://phabricator.wikimedia.org/T372849#10137960 (hnowlan) At this point in time I'd say it's not out of the question that we could have mercurius up and running some jobs, but for the purposes of th...
[16:07:45] serviceops, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10137961 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[16:28:10] jayme, claime: all done with the switch link moves
[16:28:16] thanks as always :)
[16:34:14] topranks: ack, repooling
[16:51:34] serviceops, Abstract Wikipedia team, MW-on-K8s: Some wikifunctions calls end up served by mw-web - https://phabricator.wikimedia.org/T374556 (Clement_Goubert) NEW
[17:42:00] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10138390 (Scott_French) Alright, `php8.1-uuid` should now be available in component/php81. I'll add it to the...
[17:58:50] serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.09.06 - 2024.09.27), Kubernetes, Patch-For-Review: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195#10138472 (bking) Update: I deployed the latest patchset to staging and I can...
[18:22:28] serviceops, Datacenter-Switchover: Determine switchover changes for migration of video scaling to k8s - https://phabricator.wikimedia.org/T372849#10138567 (Scott_French) Thanks, @hnowlan! Got it, so while there might be some subset of transcoding jobs running via Mercurius, the simplest solution is just...
[22:04:36] serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.09.06 - 2024.09.27), Kubernetes, Patch-For-Review: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195#10139241 (bking) @RKemper and I tried to manually add the "chart-name" label...
[23:12:02] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10139345 (Jdforrester-WMF)
[23:55:30] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10139390 (Jdforrester-WMF)