[08:31:16] Traffic, SRE: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Fabfur) I started working on trafficserver package
[08:31:31] Traffic, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Vgutierrez) a:Vgutierrez→RobH lvs1013-lvs1015 have been reimaged as expected, we've been unable to reimage lvs1016
[09:10:31] Traffic, DNS, SRE, Patch-For-Review: Update learn.wiki DNS records - https://phabricator.wikimedia.org/T342509 (Vgutierrez)
[09:16:11] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[09:20:14] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) p:Triage→Medium naive test on cp4052 resulted on a rollback, ATS itself was able to fetch data from its backend servers but varnish was unable to fetch data as expect returning the usual 503 Backend fetch f...
[09:36:02] Traffic, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Fabfur) Currently the issue with the reimaging of lvs1016 is related to T342345
[09:55:38] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) We'll first make the move to 2% of traffic, then ramp up from there during the week.
[10:09:16] Traffic, SRE: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[10:42:46] Traffic, SRE, Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[10:45:07] Traffic, SRE, Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur) The change has been applied in all DCs.
Metrics shows a positive impact (as expected) on the number of sessions on port 80 for all DCs. Ex: {F37148259}
[11:05:25] Traffic, DNS, SRE, Patch-For-Review: Update learn.wiki DNS records - https://phabricator.wikimedia.org/T342509 (Vgutierrez) Open→Resolved a:Vgutierrez ` vgutierrez@dns1004:~$ dig +short learn.wiki @ns0.wikimedia.org 3.33.143.48 15.197.134.113 vgutierrez@dns1004:~$ dig +short studio.le...
[11:11:28] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) >>! In T341463#9014217, @Quiddity wrote: > Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_...
[11:11:40] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:11:52] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) In progress→Resolved
[12:32:44] Traffic, SRE: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur) Open→Resolved
[13:15:53] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) @MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swi...
[13:24:43] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) varnish healthchecks aren't happy with ATS 9.2.1: ` $ sudo -i varnishadm -n frontend backend.list Backend name Admin Probe Last updated vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c...
[13:34:34] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) varnish backend_health flags report the following: * 4 --> IPv4 connection * X --> request transmit succeeded * r --> read response failed a traffic capture shows that ATS puts the response back on the wire: {F371...
[13:42:41] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) on a 9.1.4 instance (cp4051) healthcheck looks identical in terms of payload, but TCP connection gets closed gracefully: {F37148392}
[13:47:45] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) healthcheck gets generated by our [[ https://github.com/wikimedia/operations-puppet/blob/2a54660a121812f502ace46add3f7ac97d1b4b95/modules/profile/files/trafficserver/default.lua#L58-L64 | default.lua ]], specifical...
[14:14:59] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Ladsgroup) >>! In T211661#9037678, @ori wrote: > @MatthewVernon: my understanding is that rewrite.py is currently setting expiry...
[14:21:42] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (MatthewVernon) >>! In T211661#9037959, @Ladsgroup wrote: >>>! In T211661#9037678, @ori wrote: >> @MatthewVernon: my understanding...
[14:35:30] netops, Infrastructure-Foundations, SRE: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (cmooney) a:cmooney
[14:36:22] netops, Infrastructure-Foundations, SRE, observability, good first task: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (joanna_borun)
[14:37:10] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) After checking https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.0 I've noticed: ` #8784 - Propagate proxy.config.net.sock_option_flag_in to newly accepted connections ` we set proxy.config.net.soc...
[14:39:07] netops, Infrastructure-Foundations, SRE, observability: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (joanna_borun)
[14:46:35] Traffic, Patch-For-Review: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez)
[15:03:53] sukhe: o/ lol 3.8 -> 3.99 for gdnsd :D
[15:04:01] haha
[15:04:03] well
[15:19:10] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) Right. Now I remember. The initial expiration is indeed supposed to be set by Thumbor. The necessary functionality had some...
[15:19:36] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (Vgutierrez) [[ https://grafana.wikimedia.org/goto/ukhre43Vz?orgId=1 | grafana ]] shows a regression on lua performance after the update to 9.2.1: {F37148438} I do remember talking to upstream about getting a lot of requests w...
[15:53:42] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) I also don't know how well Swift would handle 15k QPS of object metadata updates (cf T211661#8377883)
[16:50:43] (SystemdUnitCrashLoop) firing: (5) varnish-frontend-hospital.service crashloop on cp2027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:55:43] (SystemdUnitCrashLoop) firing: (6) varnish-frontend-hospital.service crashloop on cp2027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:56:32] er ?
[16:57:14] looking
[17:00:43] (SystemdUnitCrashLoop) firing: (6) varnish-frontend-hospital.service crashloop on cp2027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:05:43] (SystemdUnitCrashLoop) firing: (8) varnish-frontend-hospital.service crashloop on cp2027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:10:43] (SystemdUnitCrashLoop) resolved: (8) varnish-frontend-hospital.service crashloop on cp2027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:15:42] (SystemdUnitCrashLoop) firing: (3) varnish-frontend-hospital.service crashloop on cp6001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:20:43] (SystemdUnitCrashLoop) resolved: (8) varnish-frontend-hospital.service crashloop on cp6001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:24:42] (SystemdUnitCrashLoop) firing: (3) varnish-frontend-hospital.service crashloop on cp6013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:29:42] (SystemdUnitCrashLoop) firing: (8) varnish-frontend-hospital.service crashloop on cp6009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:39:43] (SystemdUnitCrashLoop) resolved: (8) varnish-frontend-hospital.service crashloop on cp6009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:45:12] (SystemdUnitCrashLoop) firing: (15) varnish-frontend-hospital.service crashloop on cp1075:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:46:29] forgot to give an update because lunch but this is expected because of the ATS restart
[17:50:12] (SystemdUnitCrashLoop) firing: (16) varnish-frontend-hospital.service crashloop on cp1075:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:55:12] (SystemdUnitCrashLoop) firing: (13) varnish-frontend-hospital.service crashloop on cp1075:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:00:13] (SystemdUnitCrashLoop) resolved: (7) varnish-frontend-hospital.service crashloop on cp1075:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:20:44] Traffic: varnish-frontend-hospital crash upon ATS restart - https://phabricator.wikimedia.org/T342566 (Vgutierrez)
[18:21:23] Traffic: varnish-frontend-hospital crash upon ATS restart - https://phabricator.wikimedia.org/T342566 (Vgutierrez) p:Triage→Medium
[18:33:42] (SystemdUnitCrashLoop) firing: (3) varnish-frontend-hospital.service crashloop on cp3054:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:38:42] (SystemdUnitCrashLoop) firing: (7) varnish-frontend-hospital.service crashloop on cp3050:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:43:42] (SystemdUnitCrashLoop) firing: (7) varnish-frontend-hospital.service crashloop on cp3050:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:47:15] Traffic, SRE: varnish-frontend-hospital crash upon ATS restart - https://phabricator.wikimedia.org/T342566 (Vgutierrez) `counterexample 0 Backend_health - vcl-84635598-fffa-4367-86af-05856c435a6e.be_cp3064_esams_wmnet Went sick -------H 2 3 5 0.000000 0.000000 0 Back...
[18:48:42] (SystemdUnitCrashLoop) firing: (2) varnish-frontend-hospital.service crashloop on cp3058:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:53:42] (SystemdUnitCrashLoop) resolved: (2) varnish-frontend-hospital.service crashloop on cp3058:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:27:10] Traffic, Data Products: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577 (Milimetric)