[03:59:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:01:17] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9857885 (stjn) The same also happened in Russian Wikipedia as I introduced https://ru.wikipedia.org/wiki/MediaWiki:Gadget-common-site.css as m...
[04:04:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:24:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:29:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:34:40] FIRING: [13x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:39:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:44:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:49:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:54:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[04:59:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[05:04:40] RESOLVED: [6x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:51:51] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858458 (Tgr) So the issue is that you have newly added the "infobox" gadget (in the sense of making it default) and edited Mobile.css (etc),...
[08:53:24] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858471 (stjn) Can you clarify why the desktop version does not have the similar problem with CSS caching, though?
[08:55:21] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858483 (Tgr) Are you sure it doesn't? I don't think there is any difference in how Mobile.css and e.g. Common.css are loaded.
[09:10:03] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858529 (stjn) Yes, I am pretty sure that is the case, desktop CSS cache was not a problem for me while testing anonymously, but mobile CSS ca...
[09:16:48] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858572 (Func) >>! In T366517#9858471, @stjn wrote: > Can you clarify why the desktop version does not have the similar problem with CSS cachi...
[09:22:20] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858583 (stjn) >>! In T366517#9858572, @Func wrote: > Nux has provided screenshots and HTML of desktops being affected, so it's not the case....
[09:24:01] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858588 (Func) That's simply because they reverted the Common.css removal.
[09:42:34] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9858621 (Nux) >>! In T366517#9858458, @Tgr wrote: > So the issue is that you have newly added the "infobox" gadget (in the sense of making it...
[10:05:54] Traffic, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9858668 (Clement_Goubert)
[10:18:29] Traffic, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9858735 (Ladsgroup)
[10:27:42] Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting: Remove long term caching and active purging for Parsoid endpoints in RESTBase - https://phabricator.wikimedia.org/T365630#9858764 (daniel) Do we have a number for how long it is acceptable for vandalism to remain visible? i...
[10:35:04] Traffic, SRE: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9858791 (cmooney) >>! In T366193#9855670, @BBlack wrote: > IMHO, the A/B set solution with a pair of anycasts, is the most elegant and simple way to achieve the best balance of resiliency and perf for our authdns. I thin...
[11:36:07] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9859042 (Tgr) Gadgets depend on user preferences (even if a module is default, you can still disable it). So either you have a URL which is th...
[13:22:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:23:07] fabfur: that's you restarting the host?
[13:23:19] yeah it is
[13:23:29] not speaking for him but because I saw the list just now
[13:23:33] yes
[13:23:41] but they should be silenced
[13:24:02] tcp-mss-clamper looks good on cp7009 so I'm guessing it should be OK on the next one
[13:24:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp7009:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=magru%20prometheus/ops&var-instance=cp7009&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:25:12] uh?
[13:25:30] why this is firing?
[13:25:46] vgutierrez@cp7009:~$ systemctl show haproxy.service |grep -i restarts
[13:25:46] NRestarts=3
[13:26:15] I'll reset manually
[13:27:02] issues reaching the OCSP responder
[13:27:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:27:56] service unavailable?
[13:28:02] 503s
[13:28:15] dunno if from the OCSP responder itself or webproxy.magru.wmnet
[13:31:07] running it manually now it works
[13:33:24] restarting haproxy manually
[13:34:30] fabfur: it also worked automatically (after 3 attempts) otherwise haproxy wouldn't be running on that host
[13:34:30] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:34:57] mmm this is apparently a shared issue
[13:37:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:39:30] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:40:22] ^^ this is me
[13:41:17] now I just run the ocsp update script (all fine apparently) on cp7010
[13:42:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:43:23] but this won't help with the alert
[13:43:29] I must restart it manually also there
[13:46:54] Traffic, Patch-For-Review, Sustainability (Incident Followup): LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9859673 (Gehel) Removing DPE SRE, let traffic decide if and what needs to be done
[13:48:28] Hey, I'm looking at moving elastic and wdqs to IPIP encap (Ref T365616 ) . Do y'all have any advice/example CRs/etc?
[13:48:29] T365616: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) - https://phabricator.wikimedia.org/T365616
[13:49:11] inflatador: that's not supported at the moment
[13:50:32] vgutierrez oh really, I thought some hosts were migrated already?
[13:50:49] inflatador: on high-traffic LVS, not low-traffic ones
[13:51:20] inflatador: it's in our radar and should be feasible soon(TM), but not right now
[13:51:26] vgutierrez pretty sure ours are considered high traffic?
[13:51:38] or are you only doing traffic-owned pools ATM?
[13:51:50] inflatador: your services run on low-traffic LVS
[13:52:10] inflatador: high-traffic1 is the LVS dedicated to the text cluster and high-traffic2 is the LVS dedicated to the upload cluster
[13:52:39] low-traffic (definitely an awful name) is the LVS that takes care of internal services
[13:52:48] vgutierrez ACK.
I'll put the ticket in waiting, if y'all wouldn't mind pinging when that is an option
[13:52:56] inflatador: will do :)
[13:54:06] Traffic, Data-Platform-SRE: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) - https://phabricator.wikimedia.org/T365616#9859717 (bking)
[13:54:31] Traffic, Data-Platform-SRE: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) - https://phabricator.wikimedia.org/T365616#9859731 (bking) Per IRC conversation with @Vgutierrez , this feature is not yet available. Tagging Traffic so they can ping us when it's ready.
[13:57:06] inflatador: BTW, are you able to depool a whole DC in the services that you want to switch to IPIP?
[13:57:23] Yes
[13:57:44] ok, I'm asking cause we cannot switch individual realservers from L2 to IPIP
[13:57:59] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp7010:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:58:09] so it needs to happen at the same time, so no way of doing it with depooling the whole DC or with some downtime
[13:58:19] *without depooling
[14:00:40] That'll work. w[cq]s can be failed over w/confctl at any time.
Production search needs a mwconfig change ahead of time, but we can work out the details
[14:02:32] err..w[cd]qs that is
[14:07:59] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp7002:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:13:00] FIRING: [3x] HAProxyRestarted: HAProxy server restarted on cp7002:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:23:00] FIRING: [4x] HAProxyRestarted: HAProxy server restarted on cp7002:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:27:25] Traffic: HAProxy must start after network is really up - https://phabricator.wikimedia.org/T366606 (Fabfur) NEW
[14:28:00] FIRING: [5x] HAProxyRestarted: HAProxy server restarted on cp7002:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:33:00] RESOLVED: [5x] HAProxyRestarted: HAProxy server restarted on cp7002:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:40:00] FIRING: [4x] HAProxyRestarted: HAProxy server restarted on cp7003:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:45:00] FIRING: [4x] HAProxyRestarted: HAProxy server restarted on cp7003:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:50:00] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp7004:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:57:14] FIRING: [4x] HAProxyRestarted: HAProxy server restarted on cp7004:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:57:59] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp7014:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[15:11:00] FIRING: [3x] HAProxyRestarted: HAProxy server restarted on cp7005:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[15:12:14] FIRING: [4x] HAProxyRestarted: HAProxy server restarted on cp7005:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[15:16:00] RESOLVED: [3x] HAProxyRestarted: HAProxy server restarted on cp7005:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:11:02] Traffic: HAProxy must start after network is really up - https://phabricator.wikimedia.org/T366606#9860591 (ops-monitoring-bot) Host rebooted by fabfur@cumin1002 with reason: Test haproxy dependencies
[16:11:36] Traffic: HAProxy must start after network is really up - https://phabricator.wikimedia.org/T366606#9860603 (ops-monitoring-bot) Host rebooted by fabfur@cumin1002 with reason: Test haproxy dependencies
[16:16:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=magru%20prometheus/ops&var-instance=cp7001&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:21:30] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=magru%20prometheus/ops&var-instance=cp7001&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[17:40:23] Traffic, DC-Ops, ops-ulsfo, SRE: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9860978 (BCornwall)
[17:50:22] Traffic: HAProxy must start after network is really up - https://phabricator.wikimedia.org/T366606#9861049 (ssingh) Open→Resolved a:ssingh On investigation, we found that (cp7001): ` [Tue Jun 4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow con...
[17:54:05] Wikimedia-Apache-configuration, Wikimedia-Site-requests: missing project handler does not handle short-form URLs properly - https://phabricator.wikimedia.org/T355018#9861075 (Pppery)
[17:54:21] Wikimedia-Apache-configuration, Wikimedia-Site-requests: missing project handler does not handle short-form URLs properly - https://phabricator.wikimedia.org/T355018#9861086 (Pppery) I added #wikimedia-apache-configuration since it seems like the problem is that something in Apache is causing https://git...
[18:19:23] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861173 (cmooney)
[18:19:44] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9861174 (cmooney)
[18:29:00] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861186 (cmooney)
[21:13:27] Wikimedia-Apache-configuration, Wikimedia-Site-requests: missing project handler does not handle short-form URLs properly - https://phabricator.wikimedia.org/T355018#9861777 (Pppery)
[21:42:15] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9861911 (Pppery)
[22:00:32] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9861977 (Nux) >>! In T366517#9859042, @Tgr wrote: > Gadgets depend on user preferences (even if a module is default, you can still disable it)...
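Note on the 13:25:46 check in the log: `systemctl show haproxy.service |grep -i restarts` printed `NRestarts=3`, meaning systemd had restarted the unit three times. `systemctl show` prints one KEY=VALUE property per line, so the same check can be scripted. The snippet below is only a rough sketch: the function names and the sample text are made up for illustration (only the `NRestarts=3` line comes from the log), and it parses captured output rather than calling systemctl itself.

```python
def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show <unit>`."""
    props: dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def restart_count(output: str) -> int:
    """Return the NRestarts property, or 0 if the property is absent."""
    return int(parse_systemctl_show(output).get("NRestarts", "0"))


if __name__ == "__main__":
    # Hypothetical sample; only the NRestarts=3 line is taken from the log.
    sample = "Id=haproxy.service\nNRestarts=3\n"
    print(restart_count(sample))
```

In practice the input would be the captured text of `systemctl show haproxy.service` on the host; a non-zero count after a reboot would match what the HAProxyRestarted alert was reporting.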