[05:02:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:07:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[06:43:29] 10netops, 06Infrastructure-Foundations: Upgrade Junos 20 switches - https://phabricator.wikimedia.org/T390813 (10ayounsi) 03NEW
[07:03:40] 10netops, 06DC-Ops, 06Infrastructure-Foundations: Upgrade management switches to Junos 21.4 - https://phabricator.wikimedia.org/T390814 (10ayounsi) 03NEW
[07:04:34] 10netops, 06Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702309 (10ayounsi) I went to open a JTAC case for the non-working msw but they're all Out Of Support, I opened {T390814} to track their upgrade.
[07:51:24] 10netops, 06Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#10702393 (10ayounsi)
[08:18:13] 10netops, 06Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702500 (10ayounsi) Opened JTAC case 2025-0402-657200 for the SRXs.
[08:37:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cp2036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:49:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cp2035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cp2036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:57:04] 10netops, 06Infrastructure-Foundations, 06SRE, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10702638 (10aborrero) >>! In T389958#10683594, @cmooney wrote: > @aborrero @taavi one thing we could maybe try, if we wanted to make progress sooner (i.e. with...
[09:04:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cp2035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:23:46] 10netops, 06Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702760 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=18436b96-18d3-4109-9dbe-088b91594c7c) set by ayounsi@cumin1002 for 0:30:00 on 1 host(s) and their services with re...
[09:56:57] 10netops, 06Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702937 (10ayounsi) JTAC asked us to reboot it. It didn't help.
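
The LVSHighRX alert above is driven by the per-interface receive counters exported on port 9100 of the LVS host. A minimal sketch of how the underlying rate could be checked by hand, assuming the standard node_exporter metric node_network_receive_bytes_total; the Prometheus endpoint below is a placeholder, not the production URL:

  # Approximate received bits/s on eno12399np0 over the last 5 minutes (hypothetical Prometheus endpoint).
  curl -sG 'http://PROMETHEUS_HOST:9090/api/v1/query' \
    --data-urlencode 'query=rate(node_network_receive_bytes_total{instance="lvs2013:9100",device="eno12399np0"}[5m]) * 8'
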
[10:06:18] 10netops, 06Data-Platform-SRE, 06Infrastructure-Foundations, 06SRE: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044#10702957 (10ayounsi)
[10:42:08] 10netops, 06Infrastructure-Foundations, 06SRE, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703121 (10cmooney) >>! In T389958#10702638, @aborrero wrote: > Yes, lets try with the static routes. Thanks! Thanks Arturo - can we arrange a window for thi...
[11:09:16] Thumbnail steps are at 60% now
[11:11:25] Amir1: when did the thumbnail deploy start? We're investigating another thing and want to understand if it's correlated
[11:12:08] March 10, 5% each day we could deploy
[11:12:31] ack, tnx
[11:13:09] I can slow down the bumps or take a break to allow regeneration (or just let it be :D)
[11:26:02] 10netops, 06Infrastructure-Foundations, 06SRE, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703215 (10aborrero) >>! In T389958#10703121, @cmooney wrote: >>>! In T389958#10702638, @aborrero wrote: >> Yes, lets try with the static routes. Thanks! > >...
[11:38:43] 06Traffic: Increased number of connections to varnish on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846 (10Vgutierrez) 03NEW
[11:38:53] 06Traffic: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10703255 (10Vgutierrez)
[11:39:06] 06Traffic: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10703256 (10Vgutierrez) p:05Triage→03High
[11:52:27] 06Traffic: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10703304 (10Vgutierrez) {F58966266} restarting varnish on cp6016 resulted on ~500 requests less per ATS instance, so 4k requests less in total
[12:01:05] 10netops, 06Infrastructure-Foundations, 06SRE, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703334 (10aborrero)
[12:15:47] 06Traffic: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10703394 (10Vgutierrez) it looks like old versions of the VCLs are piling up on varnish: ` vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp-text_drmrs' "varnishadm -n front...
[13:27:00] 10netops, 06Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10703649 (10ayounsi) Bad news, JTAC told me that gNMI is not supported on SRX300 (or any branch level SRX) nor EX4300. Some pointers : https://apps.juniper.net/feature-explorer/feature/4332...
[13:38:28] akosiaris: so as you mentioned regarding https://phabricator.wikimedia.org/P74583, let's depool a host (maybe not in esams) and apply the patch there manually
[13:40:27] <_joe_> and this is how we'll find out the bug is esams-specific :D
[13:40:54] <_joe_> the only thing that left me doubtful is
[13:41:03] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10703770 (10akosiaris) Adding #traffic, since this involves ATS
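
The "let's depool a host" step above (and the pool/repool cycles later in this session) goes through conftool. A minimal sketch of that cycle, assuming the usual confctl selection by hostname; the exact selector and the on-host pool/depool wrapper scripts may differ:

  # From a cumin host: take the cache node out of rotation (hypothetical selector).
  sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=no
  # ...hand-edit remap.config / run the reproduction...
  # Put the node back in rotation once done.
  sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=yes
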
[13:41:23] <_joe_> in multi-dc lua, we return TS_LUA_DID_REMAP
[13:41:33] <_joe_> but we don't actually change the backend
[13:41:42] <_joe_> it should not be an issue at all
[13:42:11] <_joe_> akosiaris: the load.php stuff might be because you lack en.wikipedia.org in the certificate at the ingress btw
[13:42:24] vgutierrez: done, cp3066
[13:42:29] and ofc. I can't reproduce
[13:42:32] <_joe_> uh actually not, the request should be mangled by varnish
[13:42:59] wait...
[13:43:36] akosiaris: what's the applied change?
[13:43:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133363 :?
[13:43:58] yes
[13:44:38] <_joe_> vgutierrez: you now understand why we can't wrap our heads around it
[13:45:37] this can't be traffic related, can it?
[13:46:21] I'll repool for 5 minutes, it's 3rps total for this service, it shouldn't cause an issue
[13:47:03] akosiaris: hmmm do you need multi-dc.lua in there?
[13:47:14] omg
[13:47:18] once I pooled it
[13:47:21] I now have errors?
[13:47:35] vgutierrez: not really, I just adapted
[13:47:38] given that you're using the same FQDN it looks like you can skip it entirely
[13:47:48] -ingress is temporary btw
[13:48:12] the end goal is that once this proves it works, the original one is flipped to an ingress service in service.yaml
[13:48:20] and then this patch reverted
[13:48:45] ok, this is crazy, I am depooling again and running the test 3 times in a row
[13:48:50] which is like 600 requests
[13:48:53] that's enough
[13:50:11] ok
[13:50:15] how are you reproducing at the moment?
[13:50:16] which URL?
[13:51:49] https://phabricator.wikimedia.org/T390854#10703816
[13:51:52] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10703816 (10akosiaris) `lang=bash deploy1003:~$ siege -c 2 -r 100 --no-parser --no-follow -H "Host: www.wikifunctions....
[13:51:54] I'm asking cause the two URLs on https://phabricator.wikimedia.org/P74583 aren't the same
[13:52:22] 11 requests out of 200 in the pooled case returned a 404
[13:52:29] why are you removing ts=99?
[13:52:55] cause it's immaterial for the backend case
[13:53:20] the ts=99 was an arbitrary parameter I was passing just to prove to myself I'm not crazy
[13:53:25] it doesn't change anything
[13:53:33] ack
[13:59:25] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10703865 (10akosiaris) After depooling the node and running 3 consecutive invocations of the above, no 404s observed at...
[14:00:12] I see in logs 20250402.13h56m17s CONNECT: attempt fail [CONNECTION_ERROR] to 10.2.2.75:4450 for host='cp3066.esams.wmnet' connection_result=ENET_SSL_CONNECT_FAILED [-20104] error=ENET_SSL_CONNECT_FAILED [-20104] attempts=0
[14:00:12] url='https://mw-web-ro.discovery.wmnet:4450/w/api.php?action=query&format=json&list=wikilambdaload_zobjects&wikilambdaload_zids=Z1%7CZ2%7CZ12%7CZ11%7CZ3%7CZ4%7CZ6%7CZ8%7CZ7%7CZ9%7CZ40%7CZ41%7CZ42%7CZ14%7CZ1002%7CZ881%7CZ18%7CZ60%7CZ1001%7CZ1003%7CZ1004%7CZ1005%7CZ1672%7CZ1645&wikilambdaload_language=en&wikilambdaload_get_dependencies=true'
[14:00:25] something weird is happening and sending the requests to the wrong backend
[14:01:36] and yes right after pooling the node, I see the 404s again
[14:01:47] This is starting to smell like a race condition or something
[14:02:20] vgutierrez: I've got to run an errand, I've depooled the node and left it with puppet disabled and remap.config hand edited
[14:02:51] ack
[14:02:58] I think there is enough in the task to be able to reproduce, all it requires is running pool, waiting a bit, and the siege invocation should return 404s
[14:03:12] you might want to edit ~/.siege/siege.conf to disable JSON output
[14:03:14] Date:2025-04-02 Time:14:00:21 ConnAttempts:0 ConnReuse:1 TTFetchHeaders:89 ClientTTFB:90 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:90 TotalPluginTime:0 ActivePluginTime:0 TotalTime:90 OriginServer:mw-wikifunctions-ingress.discovery.wmnet OriginServerTime:90 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:404 OriginStatus:404
[14:03:14] ReqURL:http://www.wikifunctions.org/w/api.php?action=query&format=json&list=wikilambdaload_zobjects&wikilambdaload_zids=Z1%7CZ2%7CZ12%7CZ11%7CZ3%7CZ4%7CZ6%7CZ8%7CZ7%7CZ9%7CZ40%7CZ41%7CZ42%7CZ14%7CZ1002%7CZ881%7CZ18%7CZ60%7CZ1001%7CZ1003%7CZ1004%7CZ1005%7CZ1672%7CZ1645&wikilambdaload_language=en&wikilambdaload_get_dependencies=true&vgutierrez=1 ReqHeader:User-Agent:curl/7.74.0 ReqHeader:Host:www.wikifunctions.org
[14:03:14] ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:- BerespHeader:Connection:- RespHeader:X-Cache-Int:cp3066 miss RespHeader:Backend-Timing:-
[14:03:17] makes the output more interactive
[14:03:25] that's a 404 reported by atslog-backend
[14:03:55] and this is curl getting a 404 https://www.irccloud.com/pastebin/JQl9ZU1x/
[14:16:52] akosiaris: any idea on how we can tell which k8s pod is replying?
[14:17:18] server: istio-envoy doesn't help
[14:42:13] Hmm ingress gateway is overriding Server header. Need to figure that out
[14:47:08] akosiaris: I think you have a misbehaving pod
[14:47:37] https://www.irccloud.com/pastebin/yRvXVwQD/
[14:47:54] so with -Z and --parallel-max 2 I managed to get a continuous stream of 404s
[14:48:12] given that curl reuses connections it means that basically it was hitting the same pod for all the requests
[14:49:49] <_joe_> vgutierrez: that's why I asked if we restarted the istio ingress :)
[14:50:42] Could be. They are few (9 in total) and it should be easy to test once I am back in front of a computer
[14:51:09] But only 6 should see traffic for those URLs.
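
The 404 above was spotted in the ATS backend request log on cp3066. A minimal sketch of how that stream could be filtered for the failing origin, assuming atslog-backend (mentioned above) emits the field format shown; the exact invocation and any flags are assumptions:

  # On the cp host: watch backend responses to the wikifunctions ingress and keep only the 404s.
  sudo atslog-backend | grep 'OriginServer:mw-wikifunctions-ingress.discovery.wmnet' | grep 'RespStatus:404'
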
[14:54:04] hard to tell from my PoV with an opaque applayer :)
[14:54:34] what I can see here is that the 404s return in ~80ms and the 200s in ~200ms
[14:54:48] but besides that, they all look the same from here
[15:43:03] 06Traffic, 13Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10704657 (10Fabfur)
[15:43:24] 06Traffic, 13Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10704658 (10Fabfur)
[16:06:03] 06Traffic, 13Patch-For-Review: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10704772 (10Vgutierrez) fixing reload-vcl.py produced the expected result: ` vgutierrez@cp6015:~$ sudo -i varnishadm -n frontend vcl.list availa...
[16:06:40] akosiaris: I'm re-enabling puppet on cp3066 and repooling it, let's continue with this as soon as we can tell k8s pods apart please
[16:09:56] ok
[16:21:11] vgutierrez: btw, your repro in https://www.irccloud.com/pastebin/yRvXVwQD/ isn't hitting the proper pods. It's hitting the old release, named "main". The ingress pods are different; they are named group0, group1 and group2
[16:21:38] curl --parallel-max 2 -Z -s -o /dev/null -v --connect-to www.wikifunctions.org:80:127.0.0.1:3128
[16:21:38] 'http://www.wikifunctions.org/w/api.php?action=query&format=json&list=wikilambdaload_zobjects&wikilambdaload_zids=Z1%7CZ2%7CZ12%7CZ11%7CZ3%7CZ4%7CZ6%7CZ8%7CZ7%7CZ9%7CZ40%7CZ41%7CZ42%7CZ14%7CZ1002%7CZ881%7CZ18%7CZ60%7CZ1001%7CZ1003%7CZ1004%7CZ1005%7CZ1672%7CZ1645&wikilambdaload_language=en&wikilambdaload_get_dependencies=true&vgutierrez=[1-200]'
[16:21:38] 2>&1 |grep server
[16:21:38] < server: mw-wikifunctions.eqiad.main-5df8864c9f-gwhgz
[16:21:41] etc etc
[16:22:07] so your manual patch was buggy? :)
[16:22:26] or are you trying now that I've re-enabled puppet on cp3066?
[16:22:52] I just tried it
[16:22:57] puppet got re-enabled
[16:23:05] so your manual patch got removed from cp3066
[16:23:14] ok, that explains it then
[16:41:45] so, I got the pod IPs and am running the same curl over all of them.
[16:41:58] for i in 10.67.161.7 10.67.131.71 10.67.138.30 10.67.185.188 10.67.182.168 10.67.143.84 10.67.157.175 10.67.167.58 10.67.158.25 ; do curl --parallel-max 2 -Z -s -o /dev/null -v --connect-to www.wikifunctions.org:443:${i}:4451
[16:41:58] 'https://www.wikifunctions.org/w/api.php?action=query&format=json&list=wikilambdaload_zobjects&wikilambdaload_zids=Z1%7CZ2%7CZ12%7CZ11%7CZ3%7CZ4%7CZ6%7CZ8%7CZ7%7CZ9%7CZ40%7CZ41%7CZ42%7CZ14%7CZ1002%7CZ881%7CZ18%7CZ60%7CZ1001%7CZ1003%7CZ1004%7CZ1005%7CZ1672%7CZ1645&wikilambdaload_language=en&wikilambdaload_get_dependencies=true&vgutierrez=[1-200]' |&
[16:41:58] grep 404 ; done
[16:42:09] no 404 matched
[16:42:24] 4451 rather than 30443?
[16:42:29] you're skipping something there
[16:42:32] yes, the ingress
[16:42:37] that's what you asked, right?
[16:42:48] but in any case, there is something that did change
[16:42:58] and it's... pods got redeployed due to scap
[16:43:19] so new versions of everything, lifetimes at 3h24m
[16:43:30] either there was a corrupt pod and I was the most unlucky person in the world today
[16:43:57] or... not sure I like the alternatives enough to spell some of them out. They aren't kind to me.
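
The pod IPs fed into the loop above would typically come straight from the Kubernetes API. A minimal sketch, assuming the workload lives in a namespace called mw-wikifunctions (the namespace and kubeconfig are assumptions):

  # List the wikifunctions pods together with their pod IPs and node placement.
  kubectl -n mw-wikifunctions get pods -o wide
  # Or print just the pod IPs, ready to paste into the curl loop above.
  kubectl -n mw-wikifunctions get pods -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'
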
[16:44:19] please fix the ingress so we can see in ATS the server pod that's replying
[16:44:39] that would definitely help in the future when debugging this kind of odd behaviour
[16:44:52] yeah, I'll try to find out how to do that, a first search didn't return anything promising
[16:45:09] it should be possible however, since envoy in the mesh doesn't overwrite it
[16:45:21] and the ingress is also envoy, just with a different configuration
[16:45:41] apparently in the past envoy didn't even allow that and some people were really pissed because of "audits"
[16:46:08] see https://github.com/istio/istio/issues/13861
[16:46:21] they want to remove it, but apparently they couldn't even set it for a while
[16:46:39] 06Traffic, 13Patch-For-Review: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846#10705049 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez {F58966969} traffic volume in ATS on text@drmrs got back to normal after disc...
[20:25:55] FIRING: [3x] MaxConntrack: Max conntrack at 100% on ncredir3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:27:18] hmmm
[20:27:30] Yeah, wtf
[20:29:47] something is hitting ncredir@esams badly
[20:30:10] (for ncredir standards)
[20:30:55] RESOLVED: [5x] MaxConntrack: Max conntrack at 99.84% on ncredir3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:41:44] puppet is broken on ncredir due to other changes
[20:41:50] nginx currently can't reload
[20:42:26] acme-chief's puppet is now fixed, running ncredir's
[20:44:09] I was busy fixing all that and didn't get to see what domain was being hit :/
[20:45:35] brett: there is now a new problem though, I'm afraid
[20:45:45] hm?
[20:45:51] still Failed to call refresh: '/usr/sbin/service nginx reload' returned 1
[20:46:04] the missing file was created though
[20:46:18] /etc/acmecerts/non-canonical-redirect-8/live/ec-prime256v1.ocsp is missing
[20:46:24] '/etc/acmecerts/non-canonical-redirect-8/15ab3adcc9aa47db9d358f2450d182b3' to '/etc/acmecerts/non-canonical-redirect-8/fe23ea7cf3254aedbcae412a4af0ef1c'
[20:46:36] let acme-chief issue the certificate
[20:46:46] and then run puppet on the ncredir host
[20:47:16] yeah, we're good now. Just gotta be patient
[20:49:15] confirmed no more errors on ncredir1001
[21:16:04] 06Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 (10ssingh) 03NEW
[21:52:34] 10Domains, 06Traffic: [toolforge] transfer/adopt toolsbeta.org domain to the foundation - https://phabricator.wikimedia.org/T362253#10706170 (10Andrew) p:05Triage→03Medium
[22:42:06] 06Traffic: rework ncmonitor's patch submission for ncredir - https://phabricator.wikimedia.org/T390915 (10BCornwall) 03NEW
[22:42:19] 06Traffic: rework ncmonitor's patch submission for ncredir - https://phabricator.wikimedia.org/T390915#10706271 (10BCornwall) 05Open→03In progress p:05Triage→03Low
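
On the MaxConntrack alert for ncredir3003 above: a minimal sketch of how conntrack saturation is usually confirmed on the host itself, using the standard netfilter sysctls (the conntrack CLI may not be installed everywhere, and counting source addresses this way is just one possible approach):

  # Current vs. maximum tracked connections.
  sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
  # Per-CPU stats, including packets dropped when the table is full (needs the conntrack package).
  sudo conntrack -S
  # Rough top talkers by originating source address, to see what is hammering ncredir.
  sudo conntrack -L 2>/dev/null | awk 'match($0, /src=[^ ]+/) {print substr($0, RSTART, RLENGTH)}' | sort | uniq -c | sort -rn | head
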