[06:54:50] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858015 (Joe) A few thoughts on this: # I think using daemonsets is a better option...
[07:42:24] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9858139 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc-wf1002.eqiad.wmnet with OS bookworm
[07:42:32] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9858154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc-wf2002.codfw.wmnet with OS bookworm
[07:56:33] serviceops, SRE, Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253#9858230 (jijiki) >>! In T236253#9856381, @Dzahn wrote: > I talked a bit about this in #systemd IRC channel. Mostly to ask if the config is irrelevant as long as the package i...
[08:20:04] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9858341 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc-wf1002.eqiad.wmnet with OS bookworm completed: - mc-wf1002 (**PASS**) - Downtimed on Icinga...
[08:24:01] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9858358 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc-wf2002.codfw.wmnet with OS bookworm completed: - mc-wf2002 (**PASS**) - Downtimed on Icinga...
[08:38:06] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9858393 (jijiki) In progress→Resolved
[08:44:07] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858437 (jijiki) In progress→Open The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikim...
[08:45:29] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858443 (jijiki)
[08:59:23] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858506 (akosiaris) >>! In T365265#9858015, @Joe wrote: > A few thoughts on this: >...
[09:03:34] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858514 (jijiki)
[09:05:48] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858516 (jijiki)
[09:18:15] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858577 (jijiki)
[09:27:37] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9858591 (kamila) @VRiley-WMF Yes, that works, thank you! Since with moving racks it's going to take a while, could we please d...
[09:38:02] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858611 (MoritzMuehlenhoff)
[10:03:47] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858663 (Clement_Goubert) >>! In T365265#9858506, @akosiaris wrote: >>>! In T365265#...
[10:05:54] serviceops, MoveComms-Support, MW-on-K8s, SRE, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9858668 (Clement_Goubert)
[10:09:01] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858678 (akosiaris) >>! In T365265#9858663, @Clement_Goubert wrote: >>>! In T365265#...
[10:18:29] serviceops, MoveComms-Support, MW-on-K8s, SRE, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9858735 (Ladsgroup)
[10:21:03] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858744 (MoritzMuehlenhoff)
[10:45:29] serviceops, Infrastructure-Foundations, Release-Engineering-Team, Patch-For-Review: Deprecate buster-backports - https://phabricator.wikimedia.org/T362518#9858827 (Clement_Goubert) `docker-registry.wikimedia.org/wikimedia/mediawiki-services-image-suggestion-api` deleted, thanks.
[10:52:44] serviceops, Prod-Kubernetes, Kubernetes: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9858889 (Clement_Goubert) Renamed: `mw1426` to `wikikube-worker1002` `mw1427` to `wikikube-worker1003` `mw1443` to `wikikube-worker1004` `mw1490` to `wikikube-worker1007...
[10:53:24] serviceops, Prod-Kubernetes, Kubernetes: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9858897 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7674428f-f194-4d51-ae42-1bbedb9b1fde) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s...
[11:04:02] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858949 (Joe) >>! In T365265#9858506, @akosiaris wrote: >>>! In T365265#9858015, @Jo...
[11:04:16] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583 (Clement_Goubert) NEW
[11:04:39] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858964 (Clement_Goubert)
[11:04:40] serviceops, Prod-Kubernetes, Kubernetes: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9858965 (Clement_Goubert)
[11:06:02] serviceops, DC-Ops, Infrastructure-Foundations, ops-eqiad: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858966 (Clement_Goubert)
[11:06:08] serviceops, DC-Ops, Infrastructure-Foundations, ops-eqiad: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858967 (Clement_Goubert)
[11:22:11] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9858995 (Joe) After a lot of back and forth between solutions all having downsides,...
[11:22:49] serviceops, MediaWiki-Platform-Team (Radar): Enable extstore to a subset of memcached servers (experiment) - https://phabricator.wikimedia.org/T352885#9858997 (jijiki) We have enabled extstore to 6/18 servers per memcached cluster. A few observations: * System: apart from the expected disc ops and incr...
[11:26:22] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859001 (VRiley-WMF) Sure thing! We'll do it one at a time.
[11:41:52] serviceops, SRE, Data Products (Data Products Sprint 14), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9859048 (SGupta-WMF) Hi @Scott_French We are almost done coding the services...
[12:35:56] serviceops, Thumbor, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Upgrade Thumbor to bullseye - https://phabricator.wikimedia.org/T336881#9859198 (TheDJ) Can anyone update the ticket with the current state ? I believe thumbor hasn't had work done since October again and I'm not...
[12:38:07] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9859201 (MoritzMuehlenhoff)
[12:46:50] serviceops, Thumbor: Find 8 machines (4 eqiad + 4 codfw) for Thumbor - https://phabricator.wikimedia.org/T280843#9859246 (TheDJ) Open→Declined As far as I know, thumbor runs on k8s now, so provisionally closing this, but can be reopened when I'm incorrect of course.
[14:01:48] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar): mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9859780 (jijiki) I have been trying to figure out how much does the dns resolution costs.
[14:01:51] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9859778 (Jdforrester-WMF)
[14:01:52] serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9859790 (Jdforrester-WMF)
[14:02:01] serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9859792 (Jdforrester-WMF)
[14:02:17] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9859793 (Jdforrester-WMF)
[14:08:39] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9859822 (fgiunchedi) >>! In T365265#9858015, @Joe wrote: > A few thoughts on this: >...
[14:10:47] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar): mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9859824 (jijiki) I have been trying to figure out how much does the dns resolution of `'mcrouter-main.mw-mcrouter.svc.cluster.local.:4442` costs, by using xdgui. I dont...
[14:12:56] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar): mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9859832 (jijiki) Stalled→In progress
[14:16:25] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859840 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl1001.eqiad....
[14:37:48] serviceops, DC-Ops, ops-codfw: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609 (Clement_Goubert) NEW p:Triage→High
[14:49:04] serviceops, DC-Ops, ops-codfw: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9859997 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da38d2ec-3c5a-4c49-a0b8-5355aa47...
[15:13:09] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860167 (jijiki)
[15:13:13] serviceops, AQS2.0, Data Products (Data Products Sprint 14): Metrics api response sometimes returns cached 301 (from kubernetes ??) - https://phabricator.wikimedia.org/T364253#9860168 (EChukwukere-WMF) yes this can be marked as resolved. Thanks all
[15:13:14] serviceops, Cloud-Services, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860169 (jijiki) Open→In progress
[15:25:32] serviceops, DC-Ops, ops-codfw: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860268 (Jhancock.wm) when I put a faceplate on all three servers, I find the same error: The system Confi...
[15:34:06] serviceops, DC-Ops, ops-codfw, SRE: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860331 (Jhancock.wm) all servers are updated and are error free. if this happens again with any...
[15:34:53] serviceops, DC-Ops, ops-codfw, SRE: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860334 (Clement_Goubert) Thanks so much @Jhancock.wm
[15:41:51] serviceops, DC-Ops, ops-codfw, SRE: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860373 (Clement_Goubert) Open→Resolved Hosts repooled, uncordoned and set back to ac...
[17:23:08] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9860880 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[17:41:25] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9860976 (colewhite) >>! In T365265#9858015, @Joe wrote: > # As for one release per M...
[18:15:48] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861152 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[19:36:47] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861507 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[19:37:41] serviceops, SRE, Data Products (Data Products Sprint 14), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861511 (Scott_French) Thanks for the update, @SGupta-WMF - that's great! T...
[19:37:47] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861512 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[19:38:00] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861516 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[19:38:10] serviceops, SRE, Data Products (Data Products Sprint 14), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861517 (Scott_French)
[19:44:33] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861558 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[19:58:11] serviceops, Phabricator, Patch-For-Review, Technical-Debt: Investigate / remove custom downstream changes to PhabricatorClientRateLimit.php - https://phabricator.wikimedia.org/T364839#9861600 (Aklapper) @jelto Hi, would you have any input or recommendations on this? If not that's fine, I could al...
[20:00:09] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[20:48:49] serviceops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9861710 (Jclark-ctr) Open→Resolved manually updated firmware iDRAC Firmware Version 7.00.00.171 BIOS Version...
[20:52:46] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861730 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[21:10:46] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861770 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[22:00:31] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[22:01:50] serviceops, SRE: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9861981 (Jdforrester-WMF)