[05:49:02] 10netops, 10Operations: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10Marostegui)
[06:14:23] 10netops, 10Operations: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10elukey) On msw1 I see all events like the following, starting at 5:40 UTC: ` Apr 3 05:40:24 msw1-eqiad chassism[1399]: ifd_process_flaps IFD: ge-0/0/23, sent flap msg to RE, Downstate Apr 3 05:40:24 msw1-eq...
[06:17:50] 10Traffic, 10Operations, 10User-DannyS712: 503 error on enwikinews - https://phabricator.wikimedia.org/T249280 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Thanks for your report. A 503 error usually signals a transient issue. Please reopen this task if you experience this issue frequently.
[06:33:34] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10elukey)
[06:34:23] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10elukey) Interesting that ganeti1011's mgmt interface recovered, but not the others. Adding dcops to see if we can schedule in the next days/weeks a check of `msw-c6-eqiad`.
[06:39:47] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10ayounsi) p:05Triage→03High
[06:42:21] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch glitch - https://phabricator.wikimedia.org/T249309 (10ayounsi) * Check msw-c6-eqiad's status * Check msw-c6-eqiad cabling to msw1-eqiad Replace either the cable or the switch depending on what's faulty.
[06:43:36] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10ayounsi)
[07:05:52] 10netops, 10Operations: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (10ayounsi) 05Open→03Stalled p:05Medium→03Low All netflow hosts are now running FNM 1.1.4. Now waiting for upstream.
[07:13:01] 10netops, 10Operations, 10Patch-For-Review, 10User-Elukey: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10ayounsi) 05Open→03Resolved Afaik, everything is done here, thanks!
[07:25:34] \o/
[08:55:55] <_joe_> vgutierrez, ema what is the best graph to see the purge rate for cache-text in one dc?
[09:22:10] _joe_: you know what? I can't find a decent one :)
[09:22:32] _joe_: let me add PURGE to ats-cluster-view
[09:23:04] oh no wait
[09:23:50] _joe_: https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&var-site=esams&var-cache_type=varnish-text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&var-method=PURGE&fullscreen&panelId=1
[09:25:13] <_joe_> this is requests per second?
[09:25:27] <_joe_> sigh
[09:26:22] <_joe_> this is 2k times larger than the edit rate
[09:31:29] <_joe_> ema: how many nodes do we have for cache text in esams?
[09:32:52] _joe_: 8
[09:33:21] <_joe_> ok so the purges per second need to be divided by that to make sense
[09:33:28] <_joe_> still way too many
[09:34:24] _joe_: yeah, maybe this graph is better: https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=text&fullscreen&panelId=12
[10:15:13] 10Traffic, 10Operations: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (10ema)
[10:15:17] 10Traffic, 10Operations: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (10ema) p:05Triage→03Medium
[10:38:24] 10Traffic, 10Operations, 10good first task: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (10ema)
[12:36:29] 10Traffic, 10MediaWiki-Maintenance-scripts, 10MediaWiki-extensions-Maintenance, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo)
[12:37:23] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo)
[12:39:58] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10CDanis) Is `en.wikipedia.org` really the correct hostname to use for cswiki's logo?
[13:25:48] 10netops, 10Operations: IRR updates needed - https://phabricator.wikimedia.org/T235886 (10ayounsi) About: > We found that the prefixes 185.15.56.0/22 and 2a02:ec80::/29 are in use but not documented in the RIPE Database as assignments. After discussing it with John, the deeper issue might be that they are "...
[13:46:45] 10Traffic, 10Operations: cp1075 + cp1081 being Pybal-depooled/repooled frequently - https://phabricator.wikimedia.org/T249335 (10CDanis)
[13:48:56] 10Traffic, 10Operations: cp1075 + cp1081 being Pybal-depooled/repooled frequently - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) p:05Triage→03Medium
[14:07:02] the old aggregate-client-status-code dashboard has purge data too (but you still have to divide)
[14:08:04] it says last-day peak was ~4.5K/sec purges for text@eqiad (already divided)
[14:10:23] min/avg/max for the past week, divided up properly, is 623/2521/4859
[14:10:27] (purges/sec)
[14:12:54] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo) According to the documentation: > This is because the cache for /static is shared between all wikis, and the canonical form int...
[14:15:02] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo) @WDoranWMF The infrastructure side of things would probably fall under #traffic, but who would be the right maintainer of Mediawiki...
[14:19:58] 10Traffic, 10Operations: cp1075 + cp1081 being Pybal-depooled/repooled frequently - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) As can be seen in https://grafana.wikimedia.org/d/80zd3mjZk/t249335?orgId=1 it looks like there is a memory leak in ats-tls that at some point begins to hit negatively...
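(A minimal Python sketch of the per-node purge-rate arithmetic discussed above: the aggregate PURGE rate from the dashboard divided by the 8 cache-text nodes in esams. The Prometheus URL and the metric name/labels are placeholders I'm assuming for illustration; the real query is whatever the linked Grafana dashboards encode.)

    #!/usr/bin/env python3
    # Sketch: per-node PURGE rate for a cache-text cluster, assuming a
    # reachable Prometheus and a placeholder metric name (not confirmed here).
    import requests

    PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
    NODES = 8  # cache-text nodes in esams, per the conversation above

    # Aggregate PURGE request rate across the cluster (metric/labels are assumptions).
    QUERY = 'sum(rate(varnish_requests_total{cluster="cache_text",method="PURGE"}[5m]))'

    resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]

    if result:
        aggregate = float(result[0]["value"][1])
        # The dashboard sums over all nodes, so divide by the node count to get
        # the rate each individual cache host actually has to process.
        print(f"aggregate: {aggregate:.0f} purges/s, per node: {aggregate / NODES:.0f} purges/s")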
[14:20:10] 10Traffic, 10Core Platform Team, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10WDoranWMF) @jcrespo Excellent question <- which is what people say when they aren't positive what the answer is....
[14:32:20] 10Traffic, 10Core Platform Team, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo) > It sounds like High but is it UBN? I've done a very superficial triage, just pinging some teams to k...
[14:33:17] 10Traffic, 10Core Platform Team, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10BBlack) Most likely the cause is that the Varnish rule for normalizing `/static/` to the enwiki hostname hasn't...
[14:36:49] 10Traffic, 10Core Platform Team, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10jcrespo) @WDoranWMF if BBlack is right, this may not need mw code changes; we should wait for that.
[14:39:05] 10Traffic, 10Core Platform Team, 10MediaWiki-Maintenance-scripts, 10Operations: purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10BBlack) Yeah, the varnish (frontend) code for this is in `modules/varnish/templates/text-frontend.inc.vcl.erb`:...
[14:45:14] 10Traffic, 10Operations: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Vgutierrez)
[14:45:30] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10Anomie) The maintenance script seems like it should be functioning, assuming any p...
[14:47:40] 10Traffic, 10MediaWiki-Maintenance-scripts, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10BBlack) ` bblack@cumin1001:~$ sudo cumin A:cp-text 'curl -s https://en.wikipedia.o...
[14:53:00] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10Krinkle)
[15:16:59] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10BBlack) So, the backend purging queues in esams are way behind. On the one node I'm staring at...
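(A conceptual Python sketch of the /static/ normalization BBlack describes at 14:33:17 and 14:39:05; the real logic lives in Varnish VCL (modules/varnish/templates/text-frontend.inc.vcl.erb), not Python, and the example path is purely illustrative. It answers CDanis's 12:39:58 question: because /static/ is shared between all wikis and rewritten to one canonical hostname before cache lookup, purges must target that canonical form.)

    # Conceptual sketch only, not the production VCL. The idea: requests for
    # /static/ are rewritten to a single canonical host before the cache lookup,
    # so a purge issued for the en.wikipedia.org form also invalidates what
    # cswiki readers see.
    CANONICAL_STATIC_HOST = "en.wikipedia.org"  # canonical host per the ticket discussion

    def cache_key(host: str, path: str) -> str:
        """Return the (host, path) cache key after /static/ normalization."""
        if path.startswith("/static/"):
            host = CANONICAL_STATIC_HOST
        return f"{host}{path}"

    # Illustrative path (not taken from the log): both map to the same object.
    assert cache_key("cs.wikipedia.org", "/static/some-logo.png") == \
           cache_key("en.wikipedia.org", "/static/some-logo.png")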
[15:17:32] 10Traffic, 10Operations: varnishd crashes in vbf_stp_condfetch(): cp3057 and cp3061 - https://phabricator.wikimedia.org/T249344 (10ema)
[15:17:42] 10Traffic, 10Operations: varnishd crashes in vbf_stp_condfetch(): cp3057 and cp3061 - https://phabricator.wikimedia.org/T249344 (10ema) p:05Triage→03Medium
[15:28:23] journalctl -u vhtcpd.service -xn 1000000|grep Purger0|cut -d: -f1-3,8|cut -d" " -f1-6|less
[15:28:55] ema: vgutierrez: this on an esams text node shows the logged queue sizes for the purges for the ATS backend, over time
[15:29:10] you can see it catching up in the overnights, most days, but not all, and slowly getting worse
[15:29:17] some days it recovers to zero overnight, some days it doesn't
[15:29:38] bblack: do we have those stats in prometheus?
[15:30:06] I didn't find them in grafana, but we're supposed to
[15:30:53] modules/prometheus/files/usr/local/bin/prometheus-vhtcpd-stats.py
[15:31:29] I see vhtcpd_queue_size and vhtcpd_queue_max_size in prometheus... always a 0 value for every node in eqiad and esams
[15:31:44] I couldn't even get an autocomplete for vhtcpd in grafana
[15:32:19] that's odd, worked fine for me on /explore
[15:32:46] honestly, it looks outdated though
[15:33:00] as in, the python script seems to match an older version of the expected stats...
[15:33:04] hmmm
[15:33:37] the file currently looks like:
[15:33:38] start:1582152620 uptime:3775380 purgers:2 recvd:7919253941 bad:0 filtered:0
[15:33:41] Purger0: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
[15:33:44] Purger1: input:7832380879 failed:0 q_size:2331 q_mem:718320 q_max_size:129967 q_max_mem:18856476
[15:34:10] that looks wrong
[15:34:14] and it seems to be looking for an older version of that file with a different layout
[15:34:18] yeah
[15:34:23] before there were separate purger queues
[15:34:45] somebody must have updated the software but never fixed the prometheus side or something
[15:34:53] we should hunt down the daemon's author and burn him at the stake :P
[15:36:05] bblack: is this T241232?
[15:36:05] T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232
[15:37:14] 10Traffic, 10Operations, 10observability: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10CDanis)
[15:37:23] ema: possibly related, yeah. but only esams seems to be behind, so it's not *just* intrinsic to the purge rate over that socket
[15:37:28] other load seems to be a factor, too
[15:52:01] 10Traffic, 10Operations, 10observability: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10BBlack) I had a chat with the author to make sure we understand the meaning of the fields: First line: `start`: this is just the *nix...
[15:54:10] LOL
[15:57:12] so yeah
[15:57:18] bblack: 😂
[15:58:00] but still, even with rates spiking up into ~4-5K/sec... purge handling must be somewhat slow in ATS-land
[15:58:10] will spreading it over multiple ports even help?
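(A minimal Python sketch of parsing the newer multi-purger stats layout shown at 15:33:38-15:33:44, which is roughly what an updated prometheus-vhtcpd-stats.py would need to do; the stats file path is a placeholder assumption, not the real location the collector reads.)

    #!/usr/bin/env python3
    # Sketch: parse the multi-purger vhtcpd stats layout quoted above.
    import re

    STATS_FILE = "/tmp/vhtcpd.stats"  # placeholder path (assumption)

    def parse_kv(fields):
        """Turn 'key:value' tokens into an int-valued dict."""
        return {k: int(v) for k, v in (f.split(":", 1) for f in fields)}

    def parse_stats(text):
        """Parse the global first line plus one 'PurgerN:' line per purger queue."""
        lines = text.strip().splitlines()
        stats = {"global": parse_kv(lines[0].split()), "purgers": {}}
        for line in lines[1:]:
            m = re.match(r"(Purger\d+):\s+(.*)", line)
            if m:
                stats["purgers"][m.group(1)] = parse_kv(m.group(2).split())
        return stats

    if __name__ == "__main__":
        with open(STATS_FILE) as f:
            stats = parse_stats(f.read())
        # e.g. export q_size/q_mem per purger instead of the old single-queue fields
        for name, purger in stats["purgers"].items():
            print(name, purger["q_size"], purger["q_mem"])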
[15:58:39] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10BBlack) Probably-related: T241232
[15:59:23] so the purge queue is "working", it's just very behind
[15:59:50] Mar 24 07:50:20
[16:00:02] is the last time cp3050's instance actually caught up and emptied the queue completely
[16:00:32] it got close, at only ~1.4M entries, at Mar 26 06:20:20
[16:01:34] it's still varying, there are times when it goes downwards for hours, but it just can't quite catch up on a daily basis anymore
[16:04:16] eventually, if it grows indefinitely, some malloc() failure will happen and the daemon will crash out
[16:04:45] it does have Restart=on-failure
[16:07:22] but then we've lost a bunch of purges, right?
[16:07:30] if the daemon crashes, yes
[16:07:43] (or if we choose to restart it)
[16:08:41] the older version of the daemon had a memory-limit parameter, and willfully reset the queue to zero when it got too big, and then logged how many times it had to do such a reset
[16:08:58] but that seemed like a lot of fluff for the same basic result as "crash and let init restart it if it gets too big"
[16:09:36] the only real difference is that now "too big" isn't a parameter, it's just "whenever the OS decides we're too big" for ulimit or true OOM
[16:10:37] do we set any OOMScoreAdjust?
[16:11:16] I don't see that or ulimits on this, at present
[16:11:50] it's currently holding ~13GB
[16:11:58] (of purge data, on the cp3050 example)
[16:12:11] 😬
[16:12:47] one way to get it caught up would be to depool esams for a while, since it appears to catch up better under low load at night
[16:16:21] the relevant ET_NET thread on cp3050 is peaking around 66%, but sometimes substantially lower
[16:16:25] it's not saturating a CPU anyways
[16:17:52] ema: to try distributing better, would it work to attempt multiple TCP connections to the same listen port?
[16:28:35] (or of course, we could go back to finding ways to drastically reduce the insane purge rate)
[16:32:48] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10wiki_willy) a:03Cmjohnson Assigning to @Cmjohnson, since he'll be onsite today
[16:55:51] 10Traffic, 10Operations, 10Product-Infrastructure-Team-Backlog: Elevated 503 responses between 2020-03-15 and 2020-03-19 - https://phabricator.wikimedia.org/T248132 (10Mholloway) 05Open→03Declined Checked again just now and it looks like the issue was transitory. Might as well close this.
[17:10:50] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10Cmjohnson) @XioNoX the netgear switch does not have any power to it, I tried replacing the power cable and used a different power outlet and still nothing. These do not have redundant power and we...
[17:14:52] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10Cmjohnson)
[17:19:55] 10Traffic, 10Operations, 10observability: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10BBlack) We're probably going to add multiple purger connections to fan out the per-thread load from T241232 to help with T249325. I'...
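(A hedged Python sketch of the "does the queue drain overnight" check BBlack is doing by hand with the journalctl one-liner at 15:28:23: it prints a timestamp plus the Purger0 q_size from each vhtcpd journal entry. It assumes the daemon logs the same 'q_size:<n>' key:value fields as the stats file above; the exact journal line layout is an assumption, not something shown in the log.)

    #!/usr/bin/env python3
    # Sketch: extract Purger0 queue-size samples over time from the vhtcpd journal.
    import re
    import subprocess

    # Pull recent vhtcpd journal entries with ISO timestamps for easy parsing.
    out = subprocess.run(
        ["journalctl", "-u", "vhtcpd.service", "-n", "1000000", "-o", "short-iso"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Assumed line shape: "<timestamp> <host> vhtcpd[...]: Purger0: ... q_size:<n> ..."
    pattern = re.compile(r"^(\S+) .*Purger0.*q_size:(\d+)")
    for line in out.splitlines():
        m = pattern.match(line)
        if m:
            # timestamp + current backlog of queued purges for the ATS backend
            print(m.group(1), int(m.group(2)))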
[17:34:18] 10netops, 10Operations, 10ops-eqiad: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10wiki_willy)
[19:16:58] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): purgeList.php/HTCP purge doesn't seem to invalidate cache correctly - https://phabricator.wikimedia.org/T249325 (10Urbanecm) >>! In T249325#6026427, @jcrespo wrote: >> It sounds like High but is it UBN? > > I'v...
[19:57:05] 10Traffic, 10Operations, 10Page Content Service, 10Product-Infrastructure-Team-Backlog: Cache not consistently updated for PCS JS endpoint - https://phabricator.wikimedia.org/T249290 (10Pchelolo) I'm gonna retag this with #traffic and remove CPT for now, since CDN purges are not within our area of expertise...
[21:55:43] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Operations, and 7 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (10Krinkle)
[22:18:44] 10Traffic, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle)
[22:21:49] 10Traffic, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) Code steward of the core feature TBD. It's a pretty minor feature, but worth double-checking th...
[23:30:09] bblack: pretty sure the purge issues are beyond just the purgeList.php CLI
[23:30:10] https://phabricator.wikimedia.org/T249325#6028287
[23:30:18] and https://phabricator.wikimedia.org/T249290
[23:30:40] I'm not sure why we haven't yet heard that MW edits aren't being purged for page views, but it's possible people just haven't noticed yet.