[02:49:11] Traffic, MediaWiki-Platform-Team (Radar), SecTeam-Processed, Security: SUL Integration for eventyay (Wikimania virtual event platform) - https://phabricator.wikimedia.org/T378157#11208755 (bachkhois) Yes, we fixed the user-agent. The login issue has been resolved. For `jwt.exceptions.ExpiredSigna...
[03:06:04] Traffic, MediaWiki-Platform-Team (Radar): Write Hadoop query for progres metric of unified mobile routing metric - https://phabricator.wikimedia.org/T405429#11208763 (Krinkle) >>! From the **task description**: > Write Hadoop query that measures the completion metric […] Starting in Turnilo, looking bro...
[04:08:44] Traffic, MediaWiki-Platform-Team (Radar): Write Hadoop query for progres metric of unified mobile routing metric - https://phabricator.wikimedia.org/T405429#11208800 (Krinkle) Onwards to Hadoop proper, then. I tend to start by getting a small sample from a single partition (1 hour of 1 day), so that I ca...
[10:32:22] Yo :) eqiad cp nodes are still seeing high-ish levels of non-PURGE requests to ATS https://grafana.wikimedia.org/goto/UTio5RqHR?orgId=1 While I don't think it is problematic for continuing with the mediawiki switchover today, I'm not sure why that's happening
[10:33:27] haproxy and varnish have ~0 incoming connections, and the major difference between the last switchover and this one wrt to this steady stream of request rate is that it is completely unbalanced (cf https://grafana.wikimedia.org/goto/Vd4lcgqHR?orgId=1 )
[10:33:43] Not major, just finding it weird
[10:48:20] just a heads-up, I will be merging a change to shift weights in gateway-check.lua.conf for test2wiki for testing purposes.
It's minor so I'll just be letting puppet do its thing https://gerrit.wikimedia.org/r/1190696
[12:55:35] claime: fairly "typical" in that respect sadly, and also what we have seen in the past
[12:56:20] essentially, non-TTL respecting recursors with default TTLs (!), hardcoded text-lb.eqiad lookups
[12:57:16] haproxy having 0 connections hmm
[12:57:54] https://grafana.wikimedia.org/goto/hcKiYR3NR?orgId=1 doesn't seem to be the case though I think?
[12:58:31] like eqiad as you can see has like ~90rps
[13:02:31] maybe the change since last switchover is that health probes are more-numerous due to liberica or something?
[13:02:35] or accounted differently
[13:03:53] are these checks going through 'til ATS?
[13:19:33] fabfur: no, just varnish, since if varnish is down, it can't directly talk to the backends anyway so we don't go all the way till ATS
[13:20:04] bblack: but there is no liberica in eqiad (or codfw)
[13:20:06] yet
[13:20:19] well, in drmrs/codfw, I assume we're still probing ATS separately, since they do cross-node traffic
[13:20:48] we don't have healthchecks that punch through to ATS on single-backend though? that seems a shame, as nothing but cache hits will work if ATS is malfunctioning.
[13:21:19] that's probably a deep topic to explore though, no easy answers :)
[13:22:07] I guess even in drmrs/codfw case, we wouldn't be probing it from the LB, just from varnish healthchecks
[13:22:42] yeah. there is also no per-site distinction possible there (as an obvious blocker I think, separate from the single-backend distinction)
[13:22:54] for the healthcheck I mean
[13:23:14] so the weirdest thing to me is that this phantom load isn't at all evenly distributed between machines
[13:24:02] well if it's external, that could be because it's from a very limited range of source IPs?
[13:24:09] cdanis: is that really surprising though?
[13:24:17] sukhe: with haproxy reporting 0 traffic, I think so
[13:24:33] (one or just a small handful.
beyond that hopefully hashing would push it around more)
[13:24:36] oh right, no haproxy
[13:24:44] the ATS metrics say they are GETs too
[13:24:50] but it doesn't seem to be reporting 0 though: https://grafana.wikimedia.org/goto/hcKiYR3NR?orgId=1
[13:25:11] sukhe: but it isn't reporting 700 https://grafana.wikimedia.org/goto/JdNfPgqHR?orgId=1
[13:27:06] sukhe: by ~0 I meant compared to the varnish request rate, but yes, it was a little exaggerated
[13:27:14] those are some surprisingly well-rounded numbers too, makes me wonder
[13:27:22] bblack: and they're pretty constant over time
[13:27:25] it's mysterious
[13:28:49] Naively, I would expect haproxy request rate to be superior to ATS request rate right?
[13:29:03] it should be
[13:30:50] 3x nodes ~30, 1x ~150, 1x ~160, 1x ~200, 1x ~718 (with some odd dropouts), 1x ~768
[13:31:11] it just seems very odd the numbers aren't more-natural looking
[13:31:22] which makes me think some automated scheduled requests
[13:33:33] yeah I have no idea. fabfur and I can dig deeper soon!
[13:34:46] it has to be purged
[13:34:55] or varnish I guess
[13:35:01] it's all from the local host in any case
[13:35:20] all the estab and time-wait to the ATS port is from localhost or the local IP
[13:36:40] lots of CLOSE_WAIT sockets from varnishd, it's cycling
[13:37:01] maybe varnish health probes, and they're per-VCL, and some nodes have more old VCLs still "active" than others?
[13:39:12] yeah I think that's it
[13:42:01] cp1100 - 73
[13:42:02] cp1102 - 4
[13:42:02] cp1104 - 78
[13:42:02] cp1106 - 4
[13:42:02] cp1108 - 17
[13:42:04] cp1110 - 21
[13:42:06] cp1112 - 16
[13:42:09] cp1114 - 4
[13:42:11] ^ count of "warm" vcls per eqiad text node
[13:42:28] at a glance, seems to line up with the shape of the uneven distribution?
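Editorial note: the "lines up with the shape" observation can be checked roughly with the numbers pasted above. The following is only a sketch, not something run in the channel. The log never says which node showed which request rate, so pairing the two lists by rank (smallest count with smallest rate, and so on) is an assumption, as is expanding "3x nodes ~30" into three entries of 30. The function name `rank_paired_ratios` is hypothetical.

```python
# Warm VCL counts per eqiad text node, as pasted in the channel.
warm_vcls = {
    "cp1100": 73, "cp1102": 4, "cp1104": 78, "cp1106": 4,
    "cp1108": 17, "cp1110": 21, "cp1112": 16, "cp1114": 4,
}

# Approximate non-PURGE request rates to ATS quoted in the channel
# ("3x nodes ~30, 1x ~150, ..."). The log does not say which node had
# which rate, so this list is deliberately left unlabelled.
ats_rps = [30, 30, 30, 150, 160, 200, 718, 768]


def rank_paired_ratios(counts, rates):
    """Pair counts and rates by rank (an ASSUMPTION, see the note above)
    and return the request rate per warm VCL for each pair."""
    return [rate / count for count, rate in zip(sorted(counts), sorted(rates))]


ratios = rank_paired_ratios(warm_vcls.values(), ats_rps)
for count, ratio in zip(sorted(warm_vcls.values()), ratios):
    print(f"{count:3d} warm VCLs -> ~{ratio:.1f} req/s per warm VCL")
```

Under that rank pairing, every ratio falls between about 7.5 and 10 requests per second per warm VCL, which is at least consistent with the idea that each warm VCL contributes its own steady stream of health probes. It does not prove the pairing is the real one.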
[13:45:05] I wonder if it's actually purged's persistent conn that's keeping VCLs artificially warm forever
[13:45:13] maybe it should cycle out the conn occasionally if so
[14:24:27] but shouldn't the purge requests be consistent across nodes?
[14:26:39] sukhe: it's the ATS-reported GETs that are widely varying
[14:27:14] Brandon is saying that on some cps, varnish is doing many more healthchecks to ATS because of old VCL generations
[14:27:44] yeah but I was trying to understand
[14:27:45] 09:45:05 < bblack> I wonder if it's actually purged's persistent conn that's keeping VCLs artificially warm forever
[14:28:35] that was a random guess
[14:28:37] oh the theory there is that, since purged keeps its connection open forever, that keeps that old config iteration pinned
[14:28:50] Traffic, MediaWiki-Platform-Team (Radar), SecTeam-Processed, Security: SUL Integration for eventyay (Wikimania virtual event platform) - https://phabricator.wikimedia.org/T378157#11210255 (ssingh) >>! In T378157#11208755, @bachkhois wrote: > Yes, we fixed the user-agent. The login issue has been...
[14:28:51] as to why we always end up with problematically-long lists of "warm" VCLs
[14:28:55] and still active and doing duplicate healthchecking
[14:28:59] it's probably not even a very good guess :)
[14:29:06] it's plausible!
[14:29:48] in theory the idea is that excess stale "warm" VCLs linger because we still have open client conns processing requests in them, which is ok, and you'd expect that conditional to end naturally at some reasonable point.
[14:30:09] but we're not stacking up 78x vcl reloads in a short time window, so clearly something is hanging onto some reference forever
[14:31:00] I donno, it's been forever since I've refreshed myself on the state of affairs in upstream varnish on how VCLs are meant to eventually go completely cold and get killed off
[14:31:27] there could just be varnish bugs.
or we've failed to adapt to some change that requires action on our part (or some config settings) to help move them along, or something.
[14:31:50] I don't think we have looked at that at all recently. for varnish 7, we just upgraded the code that gets the list of VCLs but IIRC we never looked at it underneath
[14:33:20] Traffic, MediaWiki-Platform-Team (Radar), SecTeam-Processed, Security: SUL Integration for eventyay (Wikimania virtual event platform) - https://phabricator.wikimedia.org/T378157#11210265 (Dzahn) Does this mean we can close this ticket?
[14:41:23] Domains, HTTPS, DNS, SRE, Traffic-Icebox: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071#11210296 (ssingh) There is work underway by Timo on unifying the mobile and desktop variants for Wikimedia projects; see T214998. There are no pl...
[14:50:17] Traffic, SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210334 (ssingh)
[15:13:20] Traffic, MediaWiki-Platform-Team (Radar), SecTeam-Processed, Security: SUL Integration for eventyay (Wikimania virtual event platform) - https://phabricator.wikimedia.org/T378157#11210455 (MarioB) So, a number of users confirmed it is working again for them. So, this can be closed. Thanks.
[15:18:44] Traffic, MediaWiki-Platform-Team (Radar), SecTeam-Processed, Security: SUL Integration for eventyay (Wikimania virtual event platform) - https://phabricator.wikimedia.org/T378157#11210479 (ssingh) Open→Resolved a:ssingh Thanks for letting us know.
[15:26:35] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210509 (ssingh) Open→Resolved a:ssingh @JKelsoteel-WMF `wikimediafoundation.org` is now verified. If you are unable to set the permissi...
[15:37:20] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11210576 (cmooney)
[15:38:01] Traffic, Documentation: Document x-cache-status header on Wikitech - https://phabricator.wikimedia.org/T404654#11210589 (aaron) >>! In T404654#11203673, @BCornwall wrote: > Thanks for the report, @aaron! I've updated that article section to include that header. Would you say it addresses concerns? T...
[15:40:43] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11210598 (cmooney) As discussed in today's meeting I believe all the cloudcephosd hosts have jumbo frames enabled on all their physical interfaces. So there should be n...
[15:41:45] Traffic, SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210610 (JKelsoteel-WMF) Resolved→Open Hi @ssingh , I just tested this with our service account (I am part of ITS), and I am still seeing the window prompting me to v...
[15:42:57] Traffic, SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210616 (ssingh) >>! In T404974#11210610, @JKelsoteel-WMF wrote: > Hi @ssingh , I just tested this with our service account (I am part of ITS), and I am still seeing the wind...
[15:43:55] Traffic, SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210617 (JKelsoteel-WMF) No problem. Here it is: google-site-verification=m7jEgoI4DOUy0u6cebxtp7oJT7s3nnNyPWgmPQmNEjc
[15:45:39] Traffic, SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210622 (ssingh) >>! In T404974#11210617, @JKelsoteel-WMF wrote: > No problem.
Here it is: google-site-verification=m7jEgoI4DOUy0u6cebxtp7oJT7s3nnNyPWgmPQmNEjc Thanks, and...
[15:49:03] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210631 (JKelsoteel-WMF) Yes, I am using the ITS service account designated to grant others access to our various GSC properties. The account is oktaser...
[15:53:55] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210655 (ssingh) >>! In T404974#11210631, @JKelsoteel-WMF wrote: > Yes, I am using the ITS service account designated to grant others access to our vari...
[15:56:32] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210662 (JKelsoteel-WMF) Understood, thank you! Going forward, as part of the process for handling these requests, once the verification is completed fo...
[15:57:46] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210667 (ssingh) >>! In T404974#11210662, @JKelsoteel-WMF wrote: > Understood, thank you! Going forward, as part of the process for handling these reque...
[15:57:53] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210668 (JKelsoteel-WMF) Confirming it is working for our service account now. 👍 Thank you!
[15:58:59] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210672 (ssingh) OK great, resolving this for now but yeah, let's add this to the docs and see how it serves us next time. Thanks and sorry for the...
[15:59:05] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210674 (ssingh) Open→Resolved
[15:59:51] Traffic, SRE, Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210677 (JKelsoteel-WMF) Thank you all! And no worries, appreciate the help!
[16:51:53] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499 (cmooney) NEW p:Triage→Medium
[17:13:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11211183 (cmooney) For reference these are the vlans / IPs currently connected: ` lvs1018 - enp94s0f0np0 - vlan1031 - 10.64.130.18/24 - private1-e1-...
[17:22:06] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11211217 (cmooney)