[05:18:12] serviceops, Infrastructure-Foundations: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd - https://phabricator.wikimedia.org/T373819#10132541 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff The override is now fixed on the Debian archive side and bullse...
[08:23:56] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10132766 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1ecd31b5-5c44-49dc-a69c-a3104ecc9241) set by jayme@cumin1002 for 1 day, 0:00:00 on 2 host(s)...
[08:36:58] hi folks, I'm seeking a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071193 to wrap things up
[09:35:42] serviceops, SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10132942 (ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by cgoubert: for 1 hosts: mw2431.codfw.wmnet
[09:36:05] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10132945 (Clement_Goubert) >>! In T374249#10131011, @Jhancock.wm wrote: > mw2431 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/r...
[09:38:39] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10132947 (Clement_Goubert) >>! In T373591#10131008, @Jhancock.wm wrote: > mw2379 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/r...
[10:34:03] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10133209 (JMeybohm) During the current replacement with the above process (replacing broker id 2002, kafka-main2002 -> kafka-main2007) @Vgutierrez reported errors from purged: ` Sep 10 10:25:09 cp2032...
[11:30:39] serviceops, Prod-Kubernetes, Kubernetes: Improve calico-typha firewall rules - https://phabricator.wikimedia.org/T365687#10133364 (jijiki)
[11:39:59] serviceops, MediaWiki-Uploading: Large file uploads broken on (at least) group0 - https://phabricator.wikimedia.org/T374436 (hnowlan) NEW
[11:41:54] serviceops, MediaWiki-Uploading, Regression, Wikimedia-production-error: Large file uploads broken via Special:Upload - https://phabricator.wikimedia.org/T374436#10133405 (Aklapper)
[11:42:34] serviceops, MediaWiki-Uploading, Regression, Wikimedia-production-error: Large file uploads broken via Special:Upload - https://phabricator.wikimedia.org/T374436#10133403 (hnowlan)
[11:51:43] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10133430 (JMeybohm) We were this time seeing huge spikes in RTT from eventgate-main to some kafka brokers (mainly kafka-main2006 over a long period) that propagated down to high latency of POST's to e...
[11:53:16] serviceops, MW-on-K8s, Scap, Patch-For-Review: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts - https://phabricator.wikimedia.org/T366778#10133446 (akosiaris)
[12:25:16] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10133597 (Trizek-WMF)
[12:26:06] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10133599 (Trizek-WMF)
[12:32:03] swfrench-wmf: Thanks, will file a Phab task.
[12:34:24] serviceops, Abstract Wikipedia team, Wikifunctions: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF - https://phabricator.wikimedia.org/T374442 (Jdforrester-WMF) NEW
[13:52:53] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451 (JMeybohm) NEW
[13:52:57] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134068 (JMeybohm)
[13:53:02] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10134069 (JMeybohm)
[14:01:39] claime: are you stuck depooling the service ops nodes for our switch move later?
[14:01:40] https://phabricator.wikimedia.org/T373097
[14:01:54] I'm more than happy to bug someone else of course :P
[14:02:39] topranks: woops forgot, yeah can do
[14:04:55] cool thanks <3
[14:13:51] serviceops, DC-Ops, decommission-hardware, ops-codfw, Patch-For-Review: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134209 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2002.codfw.wmnet` - kafk...
[14:16:09] serviceops, DC-Ops, decommission-hardware, ops-codfw, Patch-For-Review: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134222 (JMeybohm)
[14:19:33] topranks: all done for k8s nodes
[14:19:47] jayme: kafka-main2008 isn't in production yet right?
[14:20:02] claime: nope, insetup
[14:20:23] Cool
[14:22:36] claime: much obliged!
[14:23:56] effie: potential impact from losing shard10 and shard11 hosts from codfw wancache for a minute at most?
[14:24:28] nah
[14:25:02] cool
[14:25:07] gutterpool will pick up the traffic until the shards are back
[14:26:01] ack
[14:26:23] topranks: all good for us then
[14:27:44] super thanks all!
[14:28:13] the two restbase hosts ok for a blip?
[14:28:28] restbase2025 & restbase2033
[14:34:51] I can depool them from the restbase service, cassandra won't mind though
[14:35:28] done
[14:41:00] hnowlan: thanks!
[15:08:50] serviceops, Abstract Wikipedia team, Wikifunctions: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF - https://phabricator.wikimedia.org/T374442#10134541 (Clement_Goubert) Sure, what URLs and expected HTTP codes/text would you like `...
[15:33:24] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134659 (Clement_Goubert) Yep that's ours. I'll depool the node so you can reseat when you want.
[15:33:37] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134665 (ops-monitoring-bot) depool host wikikube-worker2092.codfw.wmnet by cgoubert@cumin1002 with reason: Degraded RAID
[15:33:39] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134666 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker2092.codfw.wmnet completed: - wik...
[15:34:46] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134668 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2597914-1845-48e0-a060-39e43e562886) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host...
[15:35:33] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134670 (Clement_Goubert) Host depooled and downtimed for a week, all yours.
[15:38:09] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134676 (Jhancock.wm) reseated.
[15:47:54] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134736 (Clement_Goubert) It's not showing up in system, and still shows foreign on the RAID controller interface, but that host is part of {T358489} and should not act...
[15:58:22] serviceops, SRE, Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10134785 (Clement_Goubert)
[16:02:51] serviceops, MW-on-K8s, wikitech.wikimedia.org, Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10134805 (jijiki)
[16:03:12] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134813 (Jhancock.wm) sounds good to me.
[16:06:47] serviceops, Abstract Wikipedia team, Wikifunctions: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF - https://phabricator.wikimedia.org/T374442#10134815 (Jdforrester-WMF) Looking at https://gerrit.wikimedia.org/r/plugins/gitiles/ope...
[16:07:39] serviceops, MW-on-K8s, wikitech.wikimedia.org, Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10134797 (jijiki) In progress→Resolved Further testing completed with @Ladsgroup * [[https://logstash.wikimedia.org/goto/45728dfc...
[16:10:04] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10134821 (Clement_Goubert)
[16:16:57] claime, hnowlan: moves are all complete thanks
[16:17:05] topranks: thanks
[16:40:17] serviceops: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604#10134938 (Scott_French) Allocating: * [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports | Service port ]]: 4453 * [[ https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a...
[16:43:57] serviceops, Data-Platform-SRE, Infrastructure-Foundations, Machine-Learning-Team: Migrate the ownership of Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373526#10134961 (akosiaris) >>! In T373526#10102566, @elukey wrote: >>>! In T373526#10100630, @a...
[16:50:45] serviceops: Prepare PHP 8.1 production images - https://phabricator.wikimedia.org/T372602#10134991 (Scott_French) The 8.1 production images are now built and available on docker-registry.w.o.
[16:55:58] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10135021 (Clement_Goubert) Reset the RAID config and the disk is still in `Foreign` state, so I can't use it for a Virtual Disk. I think a replacement is in order.
[17:03:54] serviceops, Continuous-Integration-Infrastructure: Prepare PHP 8.1 production images - https://phabricator.wikimedia.org/T372602#10135065 (akosiaris) Adding @hashar and #continuous-integration-infrastructure. The packages and images that will be used to run production on php8.1 in the medium term are rea...
[17:04:08] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar): Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135070 (akosiaris) Adding @hashar and #continuous-integration-infrastructure. The packages and images that will be used to run pr...
[18:34:18] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10135293 (Jhancock.wm) made a service request with Dell. Will update when it arrives.
[18:45:41] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar): Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135312 (hashar) Thank you for the early notice @akosiaris! I am subscribing @Jdforrester-WMF who is doing a lot of updates to the...
[19:08:19] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar): Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135382 (hashar) > While working through the production image definitions for T372602, I discovered that the three extension packa...
[19:17:38] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar): Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135439 (Reedy) If I can be a pain, can `php-uuid` (see {T373752}; not currently installed, and not explicitly needed for PHP 7.4...
[19:25:51] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135536 (Scott_French) Thanks, @hashar! Great, the core set of packages you mention in T372507#10135311 sho...
[19:39:23] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135603 (hashar) > `php8.1-ast` That one is for the Phan static analyzer and we currently get it from sury....
[21:27:21] serviceops, Datacenter-Switchover: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10135894 (Dreamy_Jazz) >>! Quoting from T357547 > While stopping all maintenance scripts (01-stop-maintenance), we found a user triggered script which we fiercely killed manuall...
[22:20:48] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10135977 (Scott_French) I have a `php8.1-uuid` package (1.20-12) built and ready to go. There's one thing I'd...
[22:41:22] serviceops: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10136023 (Scott_French) Thanks for the details! Just to confirm, the request here is for `php-uuid` packages for **both** 7.4 and 8.1, correct? (in `buster-wikimedia` `component/icu67` and `bullse...
[22:59:11] serviceops: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10136047 (Reedy) I don't think we specifically need it on PHP 7.4, but it shouldn't harm anything. It would also mean we can keep things cleaner (same PHP extension list on 7.4 and 8.1 in CI and...
[23:14:03] serviceops: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10136087 (Scott_French) @Reedy - Got it, thanks for the context! I'll look into how complicated this might be on 7.4 in light of how builds for our `icu67` component work. Naively, I'm a little co...
[23:17:04] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908#10136091 (Scott_French) Open→Resolved
[23:24:22] serviceops, Datacenter-Switchover, Patch-For-Review: sre.switchdc.mediawiki cookbook should take a task-id argument - https://phabricator.wikimedia.org/T330273#10136096 (Scott_French) Open→Resolved Alright, this is now supported, and used in a couple of different ways: * The task is updat...
[23:39:44] serviceops: Prepare PHP 8.1 service images for Shellbox - https://phabricator.wikimedia.org/T374502 (Scott_French) NEW