[01:06:53] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132300 (Papaul)
[01:08:25] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132304 (Papaul) Cluster creation complete ` root@pfw1-codfw# run show chassis cluster status Cluster ID: 1 Node Pr...
[01:11:21] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132313 (Papaul)
[07:05:07] 9
[07:05:09] err
[07:05:10] :)
[07:45:34] hello folks!
[07:45:46] I'd need to restart the anycast health-checker on dns hosts
[07:45:52] (for py3.11 upgrades)
[07:56:34] best to wait until Sukhbir is around, I synced up yesterday for the service restarts needed on DNS hosts as required by the openssl update, possibly the restart procedure might restart these anyway
[08:15:37] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132731 (dcaro)
[08:21:38] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10132749 (cmooney) Open→Resolved a:cmooney Still all looking good, there have been no logs or cases the interface reported d...
[08:24:39] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132754 (dcaro) @cmooney @VRiley-WMF Hi! I'm almost done draining the rack, we can try to find a slot startin...
[08:27:53] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132768 (dcaro)
[08:39:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10132799 (cmooney) Ok patch has been merged and things are ok for now. Hosts are configured to peer with the switch unicast I...
[08:51:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10132830 (cmooney) I'll leave this open for now, we will need to make a call on how to proceed here in general, there are two...
[09:16:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10132903 (cmooney) It seems the routed ganeti hosts actually use nftables instead. This is nice as it does allow us to match on the incoming interf...
[09:19:55] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10132909 (MoritzMuehlenhoff) >>! In T374392#10132903, @cmooney wrote: > It seems the routed ganeti hosts actually use nftables instead. This is nic...
[09:28:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on durum2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:38:16] I'm seeking reviewers for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071815 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071816
[09:38:46] * vgutierrez looking
[09:39:16] thank you vgutierrez
[09:43:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on durum2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:46:45] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10132983 (ABran-WMF) db2114 is decommed (see T362948)
[09:53:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on durum1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:57:00] FIRING: [2x] PurgedHighEventLag: High event process lag with purged on cp2032:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:02:00] FIRING: [8x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:07:00] FIRING: [12x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:12:00] FIRING: [19x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:17:00] RESOLVED: [10x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:19:00] FIRING: [2x] PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:23:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on durum1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:24:00] FIRING: [15x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:29:00] FIRING: [37x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:34:00] FIRING: [20x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:38:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on durum1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:39:00] FIRING: [29x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:43:48] RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on durum1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:44:00] FIRING: [33x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:49:00] FIRING: [19x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:54:00] FIRING: [26x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:54:44] we should start getting recoveries for purged soon
[10:59:00] FIRING: [18x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:04:00] FIRING: [15x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:06:41] alerts already resolved according to alerts.wm.o
[11:09:00] RESOLVED: [12x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:26:03] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10133598 (cmooney) To confirm the links look good, interfaces come up when enable and rx light is good: ` cmooney@re...
[13:28:48] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10133916 (cmooney) >>! In T373097#10129063, @Jelto wrote: > I depooled `gitlab-runner2003` for tomorrows maintenance...
[13:59:00] netops, Infrastructure-Foundations, SRE: Enable BFD on 'core' EBGP peerings from L3 switches to CRs - https://phabricator.wikimedia.org/T374452 (cmooney) NEW p:Triage→Low
[14:03:17] vgutierrez: I had a quick look at those options for gobgpd neighbor config
[14:04:03] nothing jumping out at me. I guess we should consider 'graceful restart' like we spoke about before. Things like route-refresh capability I assume are going to be on and enable themselves if both sides support (so should be fine with defaults)
[14:04:28] but we don't need anything special really
[14:40:07] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10134339 (cmooney) >>! In T373942#10117963, @Jgreen wrote: > The haproxy configuration part seems to work, I'm a...
[14:40:25] jhathaway: both patches merged, I have updated T372664 as well. thanks for the patches!
[14:40:25] T372664: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664
[14:40:46] of course, thanks for rolling them out sukhe
[14:50:37] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10134407 (Jgreen) >>! In T373942#10134339, @cmooney wrote: >>>! In T373942#10117963, @Jgreen wrote: >> The hapro...
[15:47:38] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134741 (ABran-WMF) db/es hosts have been depooled
[15:50:57] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134763 (cmooney) >>! In T373097#10134741, @ABran-WMF wrote: > db/es hosts have been depooled thanks for confirming!
[16:03:08] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134808 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c5ef5c49-317c-49af-b11b-61e58fe45620) set...
[16:09:35] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134820 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a5d7ae66-6b48-4bdb-8951-87b0e41404de) set...
[16:17:56] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134839 (cmooney) Move done, all migrated hosts are pinging again no issues to report.
[16:24:27] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134882 (ABran-WMF) db/es hosts are repooling
[16:25:23] win 35
[16:45:09] Traffic, Data Products, Data-Engineering: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473 (Fabfur) NEW
[17:27:55] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10135140 (bwang)
[19:48:01] Traffic: haproxykafka features - https://phabricator.wikimedia.org/T374128#10135629 (Fabfur)
[21:39:46] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10135906 (cmooney) Open→Resolved a:cmooney