[05:04:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:09:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:26:27] Traffic, Data-Engineering, DPE HAProxy Migration, Patch-For-Review: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10746435 (Fabfur) @JAllemandou I've prepared [[ https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/82 | this patch ]] for h...
[10:09:16] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056 (cmooney) NEW p:Triage→Medium
[10:10:18] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746764 (cmooney)
[10:12:29] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746767 (cmooney)
[10:33:18] Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059 (Vgutierrez) NEW
[10:33:26] Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10746846 (Vgutierrez) p:Triage→High
[10:34:12] Traffic, Experimentation Lab: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10746847 (Vgutierrez)
[10:37:24] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746861 (cmooney)
[10:38:10] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746862 (cmooney)
[10:38:48] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746864 (cmooney)
[10:40:16] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746877 (cmooney)
[10:50:26] vgutierrez: quick question, is there a reason you spec'd the Mellanox ConnectX 5 specifically?
[10:50:50] topranks: it was the one available from Dell at least publicly on their webpage
[10:50:58] being specific myself, I'm wondering if the ConnectX 6 also supports the features you are looking at
[10:51:02] ok
[10:51:56] machine-learning are looking at some beefy super-micro systems which may have ConnectX 6 in them. I'm trying to see how best to minimise the overall number of vendors/models we end up having to support
[10:51:59] hence the question
[10:56:03] vgutierrez: do you know if ConnectX 6 supports what you need?
[10:56:31] probably won't order if Dell doesn't list it but as things move on might help to consolidate the different models in production
[10:56:35] topranks: standarization justifies the potential price tag bump?
[10:56:43] yes one million percent
[10:57:44] at least when we are talking in terms of hundreds of dollars
[10:59:35] dunno if that's the potential gap that we are discussing here
[10:59:36] but anyways
[10:59:57] a ConnectX 6 NIC would work for us given it's also supported by the mlx5 kernel driver
[11:20:24] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747041 (Ladsgroup) Open→Resolved
[11:21:53] vgutierrez: thanks for confirming cheers
[11:22:29] the x6 is just the newer revision of the x5, x4, x3 etc. they come in at similar price point afaik
[11:30:59] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10747081 (cmooney) >>! In T392007#10745165, @Jclark-ctr wrote: > @RobH we have 1 free cross connect circuit id 21996480. but have plenty of r...
[11:41:49] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073 (Fabfur) NEW
[12:03:55] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10747191 (Vgutierrez) p:Triage→Medium take into account that we set a different list of allowed methods per cluster: ` hieradata/role/common/cache/text.yaml: allowed_methods: '^(GET|HEAD|OPTIONS|PATCH|PO...
[12:09:49] hello! I regret to inform you I have yet another gateway-check change I'd like to roll out later today. Would that be okay? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136676
[12:11:33] hnowlan: today is kinda Friday for some of us
[12:11:50] hnowlan: at the same time you're on-call at the moment so feel free to break stuff ;P
[12:14:04] vgutierrez: haha, fair on both points.
[12:17:02] yeah I think we'll park it until next week - it's technically my friday too
[12:17:18] 👍
[12:17:19] :D
[12:38:51] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747323 (Jdforrester-WMF)
[12:44:19] Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10747338 (Vgutierrez)
[12:45:41] we missed an opportunity to roll out more things while fabfur was on on-call :(
[12:58:44] thanks
[12:58:51] I'll remember it when YOU will be oncall
[12:59:13] :D
[13:50:36] Traffic, SRE Observability, sre-alert-triage: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T392091 (LSobanski) NEW
[13:53:29] netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094 (cmooney) NEW p:Triage→Low
[13:58:43] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10747774 (Jgreen)
[14:37:59] netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748105 (cmooney)
[14:39:06] netops, Infrastructure-Foundations, SRE: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748126 (cmooney) These are the two for codfw: ` ip route add vrf vrf-cloudgw blackhole 172.16.128.0/17 metric 9999 ip route add vrf vrf-cloudgw blackhole 2a02:...
[15:35:50] Traffic: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059#10748454 (BCornwall) Open→In progress
[16:19:51] FIRING: [4x] FermMSS: Unexpected MSS value on 198.35.26.98:443 @ ncredir4001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=ulsfo&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[16:23:25] FIRING: SystemdUnitFailed: nginx.service on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:51] brett: ^^
[16:24:05] ugh
[16:24:13] I was just babysitting the rollouts here
[16:25:11] so let's run-puppet on acme-chief after adding a non-canonical-redirect cert
[16:25:19] like by default :)
[16:25:19] I was doing that ._.
[16:25:34] I was running on acme-chief and then ncredir
[16:26:16] weird timing?
[16:28:29] RESOLVED: SystemdUnitFailed: nginx.service on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:49] ryankemper: Do we want to schedule some time to work on LVS service removal?
[16:29:51] RESOLVED: [4x] FermMSS: Unexpected MSS value on 198.35.26.98:443 @ ncredir4001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=ulsfo&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[16:51:16] puppet ran in ncredir4001@2025-04-16T16:15:17.779084+00:00, 2 minutes later than in acme-chief1002 (2025-04-16T16:13:32.172196+00:00)
[16:51:21] but...
[16:51:27] Apr 16 16:13:51 acmechief1002 acme-chief-backend[3715934]: Handling pushed CSR event for non-canonical-redirect-8 / ec-prime256v1
[16:51:27] Apr 16 16:15:38 acmechief1002 acme-chief-backend[3715934]: Handling validated challenges event for non-canonical-redirect-8 / ec-prime256v1
[16:51:45] acme-chief took some time issuing non-canonical-redirect-8
[16:52:05] brett: I guess we need to disable-puppet on ncredir till the certs are issued for the first time
[16:52:30] or just be defensive on nginx reload time and refuse to reload if TLS material doesn't look sane
[16:53:47] hmmmm
[16:54:05] I'm partial to the second one
[16:55:03] you can reuse the ExecStartPre script from HAProxy if can be useful
[16:55:05] maybe an override for ExecReload?
[16:56:00] fabfur's idea is nice I would say and has been reviewed and deployed, so less uplift required
[16:56:14] it's reloading, not restarting
[16:56:18] yep... a similar script can be used
[16:56:20] you just need a file with the certificate list in case
[16:57:20] actually nope..
[16:57:25] the script is not enough
[16:57:36] on the current state it will allow nginx to reload and fail
[16:58:03] we need to validate OCSP response as well
[16:58:48] but you could say that the haproxy one needs that too
[16:58:55] so we can work towards that :D
[16:59:07] yes, it would be a nice improvement, but didn't we have something for this already?
[16:59:15] (ocsp response check)
[17:01:05] we have a separate check, yes
[17:02:06] fabfur: nginx is crashing cause ocsp response data isn't there
[17:02:23] so yeah, we need to validate it
[17:02:56] we only have a post-deployment check on icinga/alertmanager, correct?
[17:03:08] yes
[17:03:11] ok
[17:05:56] hmm ok, I thought it is crashing because of a lack of certs themselves
[17:33:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:42:29] sukhe: nope.. acme-chief prevents that by generating some fake certs initially
[17:42:58] fake as in self-signed
[17:43:20] https://c.tenor.com/6Ju_FlRfSGUAAAAC/tenor.gif
[18:18:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:38:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:32:09] brett: yes, definitely. let me get back to you guys on when; in addition to removing the `wdqs-internal` lvs stuff we'll also separately need to remove `wdqs` as well, and I still need to decide which one to tackle first
[22:43:36] ryankemper: okay, cool. let's do it next week since there is a lot of people on vacation this week.
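
Editor's sketch (not from the log): the 16:52 to 17:02 idea of refusing an nginx reload when the TLS material "doesn't look sane", including the OCSP response, might look roughly like the Python below. This is not the existing HAProxy ExecStartPre script, whose contents are not shown in the log. The list-file format (one `cert_path ocsp_path` pair per line) and the names `check_cert` and `check_all` are assumptions made up for illustration, and the checks are shallow: they only confirm that a PEM certificate and a non-empty OCSP response file are present, without parsing or verifying either.

```python
from pathlib import Path

PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def check_cert(cert_path: Path, ocsp_path: Path) -> list[str]:
    """Return the problems found for one certificate; an empty list means OK."""
    problems = []
    if not cert_path.is_file():
        problems.append(f"missing certificate: {cert_path}")
    elif PEM_MARKER not in cert_path.read_text(errors="replace"):
        problems.append(f"no PEM certificate in: {cert_path}")
    # Shallow OCSP check: the file must exist and be non-empty.
    # The response itself is NOT parsed or verified here.
    if not ocsp_path.is_file():
        problems.append(f"missing OCSP response: {ocsp_path}")
    elif ocsp_path.stat().st_size == 0:
        problems.append(f"empty OCSP response: {ocsp_path}")
    return problems


def check_all(list_file: Path) -> list[str]:
    """Check every 'cert_path ocsp_path' pair in a (hypothetical) list file."""
    problems = []
    for line in list_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2:
            problems.append(f"malformed line: {line!r}")
            continue
        problems.extend(check_cert(Path(parts[0]), Path(parts[1])))
    return problems
```

A small wrapper that exits non-zero whenever `check_all` returns a non-empty list could then be run ahead of the reload, for example through the ExecReload override suggested at 16:55:05, so that a host still waiting on acme-chief keeps serving with its previous configuration.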