[05:04:09] <jinxer-wm>	 FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:09:09] <jinxer-wm>	 RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:26:27] <wikibugs>	 Traffic, Data-Engineering, DPE HAProxy Migration, Patch-For-Review: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10746435 (Fabfur) @JAllemandou I've prepared [[ https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/82 | this patch ]] for h...
[10:09:16] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056 (cmooney) NEW p:Triage→Medium
[10:10:18] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746764 (cmooney)
[10:12:29] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746767 (cmooney)
[10:33:18] <wikibugs>	 Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059 (Vgutierrez) NEW
[10:33:26] <wikibugs>	 Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10746846 (Vgutierrez) p:Triage→High
[10:34:12] <wikibugs>	 Traffic, Experimentation Lab: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10746847 (Vgutierrez)
[10:37:24] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746861 (cmooney)
[10:38:10] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746862 (cmooney)
[10:38:48] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746864 (cmooney)
[10:40:16] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746877 (cmooney)
[10:50:26] <topranks>	 vgutierrez: quick question, is there a reason you spec'd the Mellanox ConnectX 5 specifically?
[10:50:50] <vgutierrez>	 topranks: it was the one available from Dell at least publicly on their webpage
[10:50:58] <topranks>	 being specific myself, I'm wondering if the ConnectX 6 also supports the features you are looking at 
[10:51:02] <topranks>	 ok
[10:51:56] <topranks>	 machine-learning are looking at some beefy super-micro systems which may have ConnectX 6 in them.  I'm trying to see how best to minimise the overall number of vendors/models we end up having to support 
[10:51:59] <topranks>	 hence the question 
[10:56:03] <topranks>	 vgutierrez: do you know if ConnectX 6 supports what you need?  
[10:56:31] <topranks>	 probably won't order if Dell doesn't list it but as things move on might help to consolidate the different models in production 
[10:56:35] <vgutierrez>	 topranks: does standardization justify the potential price tag bump?
[10:56:43] <topranks>	 yes one million percent 
[10:57:44] <topranks>	 at least when we are talking  in terms of hundreds of dollars 
[10:59:35] <vgutierrez>	 dunno if that's the potential gap that we are discussing here
[10:59:36] <vgutierrez>	 but anyways
[10:59:57] <vgutierrez>	 a ConnectX 6 NIC would work for us given it's also supported by the mlx5 kernel driver
[11:20:24] <wikibugs>	 Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747041 (Ladsgroup) Open→Resolved
[11:21:53] <topranks>	 vgutierrez: thanks for confirming cheers 
[11:22:29] <topranks>	 the x6 is just the newer revision of the x5, x4, x3 etc.  they come in at similar price point afaik 
[11:30:59] <wikibugs>	 netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10747081 (cmooney) >>! In T392007#10745165, @Jclark-ctr wrote: > @RobH  we have 1 free cross connect circuit id 21996480.  but have plenty of r...
[11:41:49] <wikibugs>	 Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073 (Fabfur) NEW
[12:03:55] <wikibugs>	 Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10747191 (Vgutierrez) p:Triage→Medium take into account that we set a different list of allowed methods per cluster: ` hieradata/role/common/cache/text.yaml:    allowed_methods: '^(GET|HEAD|OPTIONS|PATCH|PO...
[12:09:49] <hnowlan>	 hello! I regret to inform you I have yet another gateway-check change I'd like to roll out later today. Would that be okay? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136676 
[12:11:33] <vgutierrez>	 hnowlan: today is kinda Friday for some of us
[12:11:50] <vgutierrez>	 hnowlan: at the same time you're on-call at the moment so feel free to break stuff ;P
[12:14:04] <hnowlan>	 vgutierrez: haha, fair on both points. 
[12:17:02] <hnowlan>	 yeah I think we'll park it until next week - it's technically my friday too
[12:17:18] <fabfur>	 👍
[12:17:19] <fabfur>	 :D 
[12:38:51] <wikibugs>	 Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747323 (Jdforrester-WMF)
[12:44:19] <wikibugs>	 Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10747338 (Vgutierrez)
[12:45:41] <sukhe>	 we missed an opportunity to roll out more things while fabfur was on-call :(
[12:58:44] <fabfur>	 thanks 
[12:58:51] <fabfur>	 I'll remember it when YOU are on-call
[12:59:13] <sukhe>	 :D
[13:50:36] <wikibugs>	 Traffic, SRE Observability, sre-alert-triage: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T392091 (LSobanski) NEW
[13:53:29] <wikibugs>	 netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094 (cmooney) NEW p:Triage→Low
[13:58:43] <wikibugs>	 netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10747774 (Jgreen)
[14:37:59] <wikibugs>	 netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748105 (cmooney)
[14:39:06] <wikibugs>	 netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748126 (cmooney) These are the two for codfw: ` ip route add vrf vrf-cloudgw blackhole 172.16.128.0/17 metric 9999 ip route add vrf vrf-cloudgw blackhole 2a02:...
[15:35:50] <wikibugs>	 Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10748454 (BCornwall) Open→In progress
[16:19:51] <jinxer-wm>	 FIRING: [4x] FermMSS: Unexpected MSS value on 198.35.26.98:443 @ ncredir4001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=ulsfo&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[16:23:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: nginx.service on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:51] <vgutierrez>	 brett: ^^
[16:24:05] <brett>	 ugh
[16:24:13] <brett>	 I was just babysitting the rollouts here
[16:25:11] <vgutierrez>	 so let's run-puppet on acme-chief after adding a non-canonical-redirect cert
[16:25:19] <vgutierrez>	 like by default :)
[16:25:19] <brett>	 I was doing that ._.
[16:25:34] <brett>	 I was running on acme-chief and then ncredir
[16:26:16] <brett>	 weird timing?
[16:28:29] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: nginx.service on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:49] <brett>	 ryankemper: Do we want to schedule some time to work on LVS service removal?
[16:29:51] <jinxer-wm>	 RESOLVED: [4x] FermMSS: Unexpected MSS value on 198.35.26.98:443 @ ncredir4001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=ulsfo&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[16:51:16] <vgutierrez>	 puppet ran in ncredir4001@2025-04-16T16:15:17.779084+00:00, 2 minutes later than in acme-chief1002 (2025-04-16T16:13:32.172196+00:00)
[16:51:21] <vgutierrez>	 but...
[16:51:27] <vgutierrez>	 Apr 16 16:13:51 acmechief1002 acme-chief-backend[3715934]: Handling pushed CSR event for non-canonical-redirect-8 / ec-prime256v1
[16:51:27] <vgutierrez>	 Apr 16 16:15:38 acmechief1002 acme-chief-backend[3715934]: Handling validated challenges event for non-canonical-redirect-8 / ec-prime256v1
[16:51:45] <vgutierrez>	 acme-chief took some time issuing non-canonical-redirect-8
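[editor's note] The timestamps quoted above can be lined up to see the race. This is a minimal sketch, assuming the journal lines (which carry no year or timezone) are from 2025-04-16 in UTC like the puppet timestamps, and that `acmechief1002` in the journal is the same host as `acme-chief1002`. The variable names are illustrative only. The "validated challenges" event is taken as the earliest point at which the real certificate could have been ready; actual issuance may have been later.

```python
from datetime import datetime, timezone

# Puppet run times, copied from the log above.
puppet_acmechief = datetime.fromisoformat("2025-04-16T16:13:32.172196+00:00")
puppet_ncredir = datetime.fromisoformat("2025-04-16T16:15:17.779084+00:00")

# acme-chief-backend journal events (assumed: year 2025, UTC).
csr_pushed = datetime(2025, 4, 16, 16, 13, 51, tzinfo=timezone.utc)
challenges_validated = datetime(2025, 4, 16, 16, 15, 38, tzinfo=timezone.utc)

# How long after the acme-chief puppet run did ncredir4001 run puppet?
puppet_gap = puppet_ncredir - puppet_acmechief

# How long did acme-chief spend between the CSR and the validated challenges?
issuance_time = challenges_validated - csr_pushed

# Positive value: ncredir4001 ran puppet before the challenges were validated.
early_by = challenges_validated - puppet_ncredir

print(f"puppet gap:     {puppet_gap.total_seconds():.1f}s")
print(f"issuance time:  {issuance_time.total_seconds():.1f}s")
print(f"ncredir early:  {early_by.total_seconds():.1f}s")
```

Under those assumptions, the puppet run on ncredir4001 landed roughly 20 seconds before the challenges were validated, which matches the "weird timing" guess earlier in the log.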
[16:52:05] <vgutierrez>	 brett: I guess we need to disable-puppet on ncredir till the certs are issued for the first time
[16:52:30] <vgutierrez>	 or just be defensive on nginx reload time and refuse to reload if TLS material doesn't look sane
[16:53:47] <brett>	 hmmmm
[16:54:05] <brett>	 I'm partial to the second one
[16:55:03] <fabfur>	 you can reuse the ExecStartPre script from HAProxy if that's useful
[16:55:05] <brett>	 maybe an override for ExecReload?
[16:56:00] <sukhe>	 fabfur's idea is nice I would say and has been reviewed and deployed, so less uplift required
[16:56:14] <brett>	 it's reloading, not restarting
[16:56:18] <vgutierrez>	 yep... a similar script can be used
[16:56:20] <fabfur>	 you just need a file with the certificate list in case
[16:57:20] <vgutierrez>	 actually nope..
[16:57:25] <vgutierrez>	 the script is not enough
[16:57:36] <vgutierrez>	 in its current state it will allow nginx to reload and fail
[16:58:03] <vgutierrez>	 we need to validate OCSP response as well
[16:58:48] <vgutierrez>	 but you could say that the haproxy one needs that too
[16:58:55] <vgutierrez>	 so we can work towards that :D
[16:59:07] <fabfur>	 yes, it would be a nice improvement, but didn't we have something for this already? 
[16:59:15] <fabfur>	 (ocsp response check)
[17:01:05] <sukhe>	 we have a separate check, yes
[17:02:06] <vgutierrez>	 fabfur: nginx is crashing cause ocsp response data isn't there
[17:02:23] <vgutierrez>	 so yeah, we need to validate it
[17:02:56] <fabfur>	 we only have a post-deployment check on icinga/alertmanager, correct? 
[17:03:08] <vgutierrez>	 yes
[17:03:11] <fabfur>	 ok
[17:05:56] <sukhe>	 hmm ok, I thought it is crashing because of a lack of certs themselves
[17:33:40] <jinxer-wm>	 FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:42:29] <vgutierrez>	 sukhe: nope.. acme-chief prevents that by generating some fake certs initially
[17:42:58] <vgutierrez>	 fake as in self-signed
[17:43:20] <sukhe>	 https://c.tenor.com/6Ju_FlRfSGUAAAAC/tenor.gif
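[editor's note] The "refuse to reload if TLS material doesn't look sane" idea discussed above could be sketched roughly as follows. This is not the existing HAProxy ExecStartPre script, whose contents are not shown in this log; the function name `tls_material_problems` and the file layout (one PEM certificate file plus one OCSP response file per certificate) are assumptions made for illustration. The check is deliberately shallow: it only confirms that the certificate file contains a PEM block and that the OCSP response file exists and is non-empty. It does not parse or verify the OCSP response, and it cannot tell a real certificate from the self-signed placeholder that acme-chief generates initially.

```python
import os

PEM_BEGIN = b"-----BEGIN CERTIFICATE-----"
PEM_END = b"-----END CERTIFICATE-----"


def tls_material_problems(cert_path, ocsp_path):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: both paths are supplied by the caller, nothing
    here is taken from the real puppet or nginx configuration.
    """
    problems = []

    # The certificate must be readable and contain at least one PEM block.
    try:
        with open(cert_path, "rb") as fh:
            pem = fh.read()
    except OSError as exc:
        problems.append(f"cannot read certificate {cert_path}: {exc}")
    else:
        if PEM_BEGIN not in pem or PEM_END not in pem:
            problems.append(f"{cert_path} has no PEM certificate block")

    # The OCSP response must exist and be non-empty; the log above says
    # nginx crashed because this data was missing.
    try:
        size = os.path.getsize(ocsp_path)
    except OSError as exc:
        problems.append(f"cannot stat OCSP response {ocsp_path}: {exc}")
    else:
        if size == 0:
            problems.append(f"OCSP response {ocsp_path} is empty")

    return problems
```

A wrapper around the reload could call this for every certificate and exit non-zero when any list is non-empty, so the reload is skipped and the running nginx keeps its previous configuration.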
[18:18:40] <jinxer-wm>	 FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:38:40] <jinxer-wm>	 RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:32:09] <ryankemper>	 brett: yes, definitely. let me get back to you guys on when; in addition to removing the `wdqs-internal` lvs stuff we'll also separately need to remove `wdqs` as well, and I still need to decide which one to tackle first
[22:43:36] <brett>	 ryankemper: okay, cool. let's do it next week since there are a lot of people on vacation this week.