[01:28:50] Traffic, Commons, MediaWiki-File-management, Operations, Thumbor: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (Krinkle)
[05:30:34] Traffic, Core Platform Team, Operations, serviceops, Performance-Team (Radar): Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Joe) a:Joe→None >>! In T250205#6056793, @daniel wrote: > @Joe You are assigned to this ticket, is this something you a...
[05:32:37] <_joe_> vgutierrez: do you have any task about harmonizing timeouts?
[05:32:51] nope
[05:32:53] IIRC
[05:33:08] I remember writing the wikitech page as a documentation effort in that front
[05:33:41] and maybe that's already outdated with the latest Envoy changes
[05:33:43] https://wikitech.wikimedia.org/wiki/HTTP_timeouts
[05:35:26] <_joe_> not really that outdated, but yes, we need to basically do a full check of the timeouts in all the stack
[05:45:04] Traffic, DBA, Operations, serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (Joe)
[08:01:15] _joe_: thanks for mentioning nrpe::monitor_systemd_unit_state, I wasn't aware of it. I think check_procs is still better in this case for the reason mentioned in the CR though
[08:01:32] <_joe_> oh I'll take a look
[08:02:14] <_joe_> check_procs is like the shittiest thing ever, it raises quite a few false positives, so in general I tend to avoid it and trust systemd to keep track of state pretty well.
[08:02:53] <_joe_> so yes, for the transition period use check_procs
[08:03:07] <_joe_> but afterwards you'd probably be better off with a check of the systemd unit
[08:03:12] yeah
[08:04:11] now that I think of it, both vhtcpd and purged would probably fail to start if the other one is running
[08:04:24] so we would get an alert anyways
[08:04:26] but maybe better to be explicit in the check for now
[08:04:28] <_joe_> they all bind to the same port?
[08:04:30] <_joe_> yes
[08:04:33] <_joe_> I agree with out
[08:04:35] <_joe_> *you
[08:06:09] _joe_: yeah, 0.0.0.0:4827 udp
[08:06:29] <_joe_> ema: you probably want to know which one is running, too
[08:08:38] _joe_: ?
[08:09:16] <_joe_> I mean with check_procs you can check which one of vhtcpd and purged is running
[08:09:34] <_joe_> you can only indirectly know from a systemd check
[08:09:52] ah, right
[08:10:32] added a TODO comment
[08:12:13] _joe_: while you're here (hehe), I've prepared a VCL patch for one of our dear actionables: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/588135/
[08:13:10] that seems much better than manually commenting out VCL on a Sunday morning
[08:21:30] <_joe_> yes
[08:38:50] ema: can https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Prometheus+jobs+reduced+availability be ACKed? I think you mentioned that alert yesterday
[08:39:49] XioNoX: hold on a second please
[08:44:04] Traffic, DBA, Operations, serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (fgiunchedi) p:Triage→Medium
[08:44:32] Traffic, Commons, MediaWiki-File-management, Operations, Thumbor: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (fgiunchedi) p:Triage→Medium
[08:57:53] XioNoX: ack'ed, thanks
[08:57:59] thanks!
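Note on the 08:04-08:06 exchange: the expectation that vhtcpd and purged cannot run side by side follows from both binding the same UDP address and port. A minimal sketch of that behaviour, assuming neither daemon sets SO_REUSEADDR or SO_REUSEPORT (the log does not confirm this). The function name is hypothetical, and a kernel-assigned loopback port stands in for 0.0.0.0:4827.

```python
import errno
import socket


def second_bind_fails(host: str = "127.0.0.1") -> bool:
    """Return True if a second UDP socket cannot bind a port the first one holds."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick a free port (stand-in for 4827).
        first.bind((host, 0))
        port = first.getsockname()[1]
        try:
            # No SO_REUSEADDR / SO_REUSEPORT is set (assumption), so this
            # second bind to the same address and port should be refused.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("second bind refused:", second_bind_fails())
```

A bind failure like this only shows that something else holds the port, which matches the point made at 08:09: a process-level check is what tells you which of the two daemons is actually running.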
[09:17:53] Traffic, Operations: pybal healthchecks reaching the applayer on specific requests - https://phabricator.wikimedia.org/T250258 (Vgutierrez)
[09:18:12] Traffic, Operations: pybal healthchecks reaching the applayer on specific requests - https://phabricator.wikimedia.org/T250258 (Vgutierrez) p:Triage→High
[10:30:34] Traffic, Core Platform Team, Operations, serviceops, Performance-Team (Radar): Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Joe)
[10:37:01] Traffic, Operations, Repository-Admins: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (ema) Open→Resolved
[12:28:52] Traffic, Operations, Pybal: pybal healthchecks reaching the applayer on specific requests - https://phabricator.wikimedia.org/T250258 (Aklapper)
[12:41:52] Next friday at around 0400Z there will be maint for IC-307235 and OGYX/120003/ZYO, noting it here in case it might cause trouble
[12:41:56] XioNoX: ^
[13:03:00] godog: thanks, that's the 2 main codfw/eqiad links, the 3rd path is through eqord, which mean there will be higher latency if the two links are going down at the same time
[13:11:09] * godog touches wood, friday 17th
[13:17:55] at least it's not a 13th
[13:42:47] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Reedy)
[14:47:10] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (fgiunchedi) p:Triage→Medium
[14:47:19] Traffic, Operations, Pybal: pybal healthchecks reaching the applayer on specific requests - https://phabricator.wikimedia.org/T250258 (Vgutierrez) p:High→Medium Lowering the priority to medium as the issue is not happening after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/589...
[15:13:51] Traffic, Operations, Pybal: pybal healthchecks reaching the applayer on specific requests - https://phabricator.wikimedia.org/T250258 (Vgutierrez)
[15:57:06] Traffic, MediaWiki-Cache, Operations, Page Content Service, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (ema) The work on [[https://gerrit.wikimedia.org/g/operations/software/purged | purged]] (T249583) is proceeding, and...
[18:56:58] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Krinkle) The same also applies to the `action=raw&ctype=text/javascript` variants for CSS/JS pages. These make up a much...
[19:22:07] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (daniel) > I propose we add a separate Title method for the subset of URLs that need purging for link updates (in other wo...
[20:07:17] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Joe) I would frankly prefer to pass a flag to getCdnUrls, and return those dependent urls only if the flag has its defaul...
[21:01:21] looks like deployment-prep's new acme-chief cert hasn't been picked up by ATS
[21:01:35] config:
[21:01:37] root@deployment-cache-text06:~# grep acmecerts /srv/trafficserver/* -r
[21:01:42] /srv/trafficserver/tls/etc/ssl_multicert.config:dest_ip=* ssl_cert_name=acmecerts/unified/live/rsa-2048.chained.crt,acmecerts/unified/live/ec-prime256v1.chained.crt ssl_key_name=acmecerts/unified/live/rsa-2048.key,acmecerts/unified/live/ec-prime256v1.key
[21:01:52] certs:
[21:01:56] root@deployment-cache-text06:~# openssl x509 -noout -text -in /etc/acmecerts/unified/live/ec-prime256v1.chained.crt | grep 'Not After'
[21:01:56] Not After : Jul 14 19:53:34 2020 GMT
[21:01:56] root@deployment-cache-text06:~# openssl x509 -noout -text -in /etc/acmecerts/unified/live/rsa-2048.chained.crt | grep 'Not After'
[21:01:56] Not After : Jul 14 19:53:53 2020 GMT
[21:02:16] reload and check local port 443:
[21:02:17] root@deployment-cache-text06:~# service trafficserver-tls reload
[21:02:17] root@deployment-cache-text06:~# openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -text | grep 'Not After'
[21:02:17] Not After : Apr 16 08:00:25 2020 GMT
[21:03:45] I could probably just restart trafficserver-tls but we should probably know what's going on here as the same issue may affect prod
[21:08:10] oh I need do_ocsp for the thing at the bottom of modules/profile/manifests/trafficserver/tls_material.pp possibly
[21:08:11] wonder why
[21:09:34] I'm not exactly sure why it's tied to do_ocsp, but really everything public should probably do ocsp.
[21:10:34] maybe weren't sure it would work elsewhere and do_ocsp is effectively acting as an "if prod" flag, too
[21:11:58] yeah
[21:12:03] not sure why it's off in my case
[21:18:53] anyway that turned out to be the trick
[21:30:59] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Ladsgroup) An idea: `Title::getCdnUrls()` can be moved to `HtmlCacheUpdater` (recently introduced class).
[21:34:11] Traffic, Core Platform Team, MediaWiki-Cache, Operations, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (daniel) Joe wrote: > I would frankly prefer to pass a flag to getCdnUrls, and return those dependent urls only if the fla...
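Note on the 21:01-21:02 check: the manual comparison of the `Not After` date on disk with the one served on port 443 could be scripted roughly as below. This is only a sketch: the helper names are hypothetical, it works on `Not After` lines already captured from the `openssl` commands shown in the log rather than running them, and it assumes an English/C locale for the month abbreviation.

```python
from datetime import datetime


def parse_not_after(line: str) -> datetime:
    """Parse a 'Not After : Jul 14 19:53:34 2020 GMT' line from `openssl x509 -text`."""
    # Everything after the first colon is the timestamp.
    _, _, value = line.partition(":")
    # %b assumes English month abbreviations (C locale).
    return datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y GMT")


def served_cert_is_stale(on_disk_line: str, served_line: str) -> bool:
    """True if the served certificate expires earlier than the one on disk,
    i.e. the server has likely not picked up the renewed certificate."""
    return parse_not_after(served_line) < parse_not_after(on_disk_line)


if __name__ == "__main__":
    on_disk = "Not After : Jul 14 19:53:34 2020 GMT"  # EC cert on disk, from the log
    served = "Not After : Apr 16 08:00:25 2020 GMT"   # cert served on localhost:443
    print("stale:", served_cert_is_stale(on_disk, served))
```

Comparing expiry dates is a cheap proxy. Comparing certificate fingerprints would be stricter, but the dates are what the log itself used.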