[00:16:25] netops, Operations: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi) Below is my proposal to add validation to our config. There are many possible ways of doing it, so feedback from @faidon or @mark is welcome! * The RPKI BGP communities are non-transitive so `community delete RPKI...
[07:20:28] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (jcrespo) For extra context, 503 errors can also happen randomly, the current stats say that 99.999512% of requests are suc...
[09:23:40] BTW, we already went through a renewal cycle of unified, wikibase && non-canonical-redirect certs by acme-chief
[09:24:08] everything went as expected, OCSP checks for wikibase didn't scream
[09:24:46] so either it works or we're not monitoring it :-P
[09:24:48] \o/
[09:25:55] so.. SSL OK - OCSP staple validity for wikiba.se has 448514 seconds left:Certificate wikiba.se contains all required SANs:Certificate wikiba.se (RSA) valid until 2019-08-25 13:00:35 +0000 (expires in 82 days)
[09:25:56] ;P
[09:26:16] duration 63d 17h 10m 16s
[09:26:50] and something similar for the ECDSA one
[09:26:51] SSL OK - OCSP staple validity for wikiba.se has 449065 seconds left:Certificate wikiba.se contains all required SANs:Certificate wikiba.se (ECDSA) valid until 2019-08-25 13:00:23 +0000 (expires in 82 days)
[09:29:08] volans: I should challenge you to a banjo duel after that comment ;P
[09:29:38] Monkey Island 3 banjo duel for reference: https://www.youtube.com/watch?v=R_PqZofBPWg
[09:30:13] rotfl
[09:30:21] I was just applying Occam's razor
[09:37:54] lol @banjo duel
[09:54:38] cmon sword fighting! https://www.youtube.com/watch?v=a_OfWP2DHms
[10:13:03] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Volans) a:Volans→None I'm leaving it back to the current clinic duty (@fsero) at this point given that it ne...
[11:52:40] Traffic, Analytics, Analytics-Kanban, Operations, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (fgiunchedi) I've added a 'top 20 backend' panel to https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X ! th...
[12:39:07] Acme-chief, HTTPS, Traffic, Operations: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (Vgutierrez) Open→Resolved
[12:39:09] HTTPS, Traffic, Operations, Goal, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (Vgutierrez)
[13:59:33] Traffic, Analytics, Analytics-Kanban, Operations, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (ema) @CDanis thank you so much for this! Very useful. Note that the Server response header will be set to `Varnish` for all synthe...
[14:14:29] Traffic, Analytics, Analytics-Kanban, Operations, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (CDanis) Indeed, thanks @ema ! I talked with @fgiunchedi some about this earlier and we tweaked the wording on the Logstash dashboa...
[14:36:51] Traffic, Operations, observability: varnish: add X-Fetch-Error response header - https://phabricator.wikimedia.org/T224994 (ema)
[14:36:56] Traffic, Operations, observability: varnish: add X-Fetch-Error response header - https://phabricator.wikimedia.org/T224994 (ema) p:Triage→Normal
[14:40:14] ema: +1 that sounds great
[14:52:36] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3035.esams.wmnet'] ` The log can be found i...
[15:06:45] Traffic, Operations, Patch-For-Review, User-notice: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (Legoktm) For Tech News: Bots and other scripts that do not set an identifiable [[https://meta.wikimedia.org/wiki/User-Agent_policy|User-Ag...
[15:20:38] cdanis: incidentally, I'm just now porting https://github.com/varnishcache/varnish-cache/pull/2450 to our 5.1.x packages, which means we won't have to do the -q 'FetchError ne "Pass delivery abandoned"' dance anymore
[15:21:07] nice!
[15:24:47] Traffic, Operations, User-notice: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (TheDJ) Not sure if it applies here, but please remember that we allow `Api-User-Agent` as an alternative to `User-Agent` for Javascript solutions.
[15:27:04] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3035.esams.wmnet'] ` and were **ALL** successful.
[15:49:53] not to be too big of a pain, but could use a review on cloudelastic LVS puppet and dns patches, not sure if doing this right: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512925/ https://gerrit.wikimedia.org/r/#/c/operations/dns/+/512924/
[15:55:34] so, there's a varnish patch in 4.1.10 fixing https://github.com/varnishcache/varnish-cache/pull/2135 by limiting backend connection retries to 1
[15:55:55] currently we fixed the problem by disabling retries altogether (gethdr_extrachance=0)
[15:56:30] well, with our patch 0010-extrachance-retries.patch, which adds gethdr_extrachance, which we set to 0 :)
[15:57:21] are we happy with our own patch and no retries, or do we want to follow upstream and port https://github.com/daghf/varnish-cache/commit/36b6558b70271193ea9000a8a7fc8fb7b33422f2 ?
[15:59:00] s/0010-extrachance-retries.patch/0001-gethdr_extrachance.patch/ actually
[16:00:34] yeah it's kind of non-trivial to evaluate that whole question heh
[16:01:33] the way I tend to see it, is that our 0-retries hackaround specifically works out ok for us (even if it's not viable for upstream/general-case), because we have the frontend retry-503-once papering over that and all other related deeper issues.
[16:01:58] but also, the longer we hang onto our custom patch, the more-incompatible we become with other upstream changes, which might make it hard to mitigate other related bugs with upstream fixes.
[16:02:34] where we are on the balance of those things, I don't have a good grasp on, but you might, having stared at the 4.x backports patches recently :)
[16:03:31] I think the fixed single-retry of their patch will slightly hurt us vs 0, but it's just going to make dumb cases dumber, it's not going to hurt good cases.
[16:03:47] err that was maybe not easy to parse usefully
[16:04:55] What I mean is, we'll get an extra amplification and timeout-extension from it (of 2x) that we don't really want or desire, but that will only apply to (a) the very rare corner case connection-close races extrachance was meant to handle and (b) dumb situations with long timeouts on output emission from backends (like the restbase scenario that originally drove us to look at this)
[16:05:32] so making already-horrible cases slightly worse in (b) isn't a huge loss, for the gain of syncing back with upstream, IMHO.
[16:05:57] mailbox lag on cp3039 - https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?orgId=1
[16:06:13] I'll go restart it
[16:06:19] matching icinga alert - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3039&service=Check+Varnish+expiry+mailbox+lag
[16:06:34] should that alert be a critical instead of a warning?
[16:06:35] we had two out of the chash
[16:06:44] mailbox lag, no
[16:06:48] 5xx's, maybe
[16:08:24] there is a certain amount of unavoidable pain involved in forward-porting 4.1.x fixes to 5.1.x (much less than back-porting from 6!)
[16:08:52] but dropping our custom patch would indeed decrease the pain we can avoid
[16:09:32] so I'd be +1 on tracking upstream on this
[16:10:31] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3045.esams.wmnet'] ` The log can be found i...
[16:10:50] ok
[16:10:57] wmf! :)
[16:10:59] err
[16:11:01] wfm :)
[16:11:13] that's muscle memory right there!
[16:12:02] yeah so tl;dr on the reimage mess - the patch was for 3045 (but had 3035 named in the commitmsg), I did all the other steps on 3035 instead, reimaging it pointlessly from varnish-be -> varnish-be.
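Editor's note on the 2x figure from the retry discussion (16:03 to 16:05): with N backend connection retries, a request whose every attempt runs into the timeout costs 1 + N attempts and 1 + N times the timeout. The sketch below only works through that arithmetic. It is not Varnish's code path; the helper `worst_case` and the 60-second timeout are hypothetical, since the log gives no actual timeout value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorstCase:
    attempts: int   # total backend fetch attempts for one client request
    seconds: float  # total time spent if every attempt hits the timeout


def worst_case(retries: int, timeout_s: float) -> WorstCase:
    """Worst-case cost of one request when every attempt times out."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = 1 + retries
    return WorstCase(attempts=attempts, seconds=attempts * timeout_s)


TIMEOUT_S = 60.0  # hypothetical value, not taken from the log

# retries=0 models the local patch setting gethdr_extrachance=0;
# retries=1 models the single retry described for the upstream fix.
no_retry = worst_case(retries=0, timeout_s=TIMEOUT_S)
one_retry = worst_case(retries=1, timeout_s=TIMEOUT_S)

print(no_retry)
print(one_retry)
print("amplification:", one_retry.attempts / no_retry.attempts)
```

The doubling applies only when an attempt actually fails in a retryable way, which matches the point made in the channel that the good cases are unaffected.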
[16:12:20] I've started the reimage of 3045, and now looking at 3035 to see if it needs any further help before repool
[16:14:24] a way to avoid the ipsec alerts is to run puppet on icinga.wikimedia.org after merging the reimage patch
[16:14:47] the reimage script does that for us
[16:14:53] but yeah...
[16:15:07] which is of course the real reason I got ipsec alerts earlier (working on one host while patching another)
[16:15:12] but I thought I had another explanation for it
[16:15:17] now it all kinda makes sense :)
[16:17:38] ongoing patching work here: https://gerrit.wikimedia.org/r/#/q/topic:patches-4.1.x+(status:open+OR+status:merged)
[16:18:15] that's most of 4.1.10, still missing the patch we just discussed and 9754715 Honor first_byte_timeout for recycled backend connections
[16:18:28] tomorrow I'll port those and 4.1.11
[16:19:06] (jerkins complains for reasons I don't understand, but I haven't really looked. Tests are all green building locally on my workstation)
[16:21:28] now it's beer o'clock, see you tomorrow (you lurkers too!)
[16:21:56] cya!
[16:25:38] netops, Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (ayounsi) 1/ outstanding alert, I think this is due to the alert being triggered right before a 3-day weekend and people not paying enough attention to active Icinga alerts....
[17:17:27] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3045.esams.wmnet'] ` and were **ALL** successful.
[18:48:41] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (fsero) Hi @Tgr :) I'm following this up, according to https://github.com/matrix-org/synapse/blob/master/docs/fede...
[19:35:07] netops, Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (herron) A syslog UDP listener on port 10514 is now running on lithium/wezen, and forwarding messages received to the Kaf...
[19:41:02] netops, Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (herron) >>! In T224128#5234630, @herron wrote: > Before moving production logs to this I think we should decide on some...
[21:18:42] haven't seen that one before https://blog.cloudflare.com/bandwidth-costs-around-the-world/
[21:19:18] They're actively calling some ISPs out "Their behavior is irrational in any competitive market and so it is not a surprise that each of these providers is a relative monopolist in their home market."
[21:19:43] "Today, however, there are six expensive networks (HiNet, Korea Telecom, Optus, Telecom Argentina, Telefonica, Telstra) that are more than an order of magnitude more expensive than other bandwidth providers around the globe and refuse to discuss local peering relationships."
[21:27:03] XioNoX: I'll try to review that CR tomorrow, feel free to ping me if I don't
[21:28:30] jbond42: no rush :) thanks!
[21:29:45] np
[21:36:32] netops, Operations, Wikimedia-Logstash, User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (ayounsi) a:ayounsi Test was successful, next step is to do the change to all devices. Note that this would be a great use of anycast. Onl...
[22:16:30] bblack: cp3035 has a failed PSU
[22:23:55] Traffic, Operations, ops-esams: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (ayounsi) p:Triage→High
[22:24:28] https://phabricator.wikimedia.org/T225035 so do we have 5 CP servers with no redundant power?
[22:37:14] yeah
[22:37:40] it started alerting because I reinstalled it, which killed whatever previous ack/downtime on the failed PSU
[22:38:00] we have a bunch of failed power supplies and also a few failed servers, esams cp cluster is a mess on hw
[22:38:36] Traffic, Operations, ops-esams: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (RobH) a:wiki_willy This system is no longer under warranty. This is unlikely, but still could possibly be, due to the power cable becoming unseated. Since no one ever opens the ESAMS racks tho...
[22:39:09] ok! I couldn't find any matching task
[22:39:31] yeah, I don't think they all even have tasks
[22:40:27] anyways, it's in the capex budget for early next FY to go replace all the things in esams, that's pretty much what we're waiting on to fix all the things there.