[03:52:48] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[04:27:15] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[04:43:26] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[05:05:14] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[05:16:38] 10Traffic, 10Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[05:24:59] 10Traffic, 10Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[05:47:43] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[05:54:39] 10Traffic, 10Operations, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) 05Open→03Resolved
[05:54:44] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez)
[05:54:48] 10Traffic, 10Operations: Get rid of nginx puppetization for cache upload - https://phabricator.wikimedia.org/T236120 (10Vgutierrez)
[07:37:06] 10Traffic, 10Operations, 10Performance-Team: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10Gilles) Confirmed on WMCS: ` HTTP/2 502 date: Tue, 22 Oct 2019 07:34:28 GMT content-type: text/html server: ATS/8.0.5 cache-control: no-store c...
[07:41:08] 10Traffic, 10Operations, 10Performance-Team: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10ema) The certificate for performance.discovery.wmnet does not include performance.wikimedia.org in SubjectAltName, hence ATS fails to connect to...
[07:56:32] 10Traffic, 10Operations: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez
[08:03:48] 10Traffic, 10Operations: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10ema)
[08:04:00] 10Traffic, 10Operations: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10ema) p:05Triage→03Normal
[08:09:07] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10ema) 05Open→03Resolved a:03ema Done, thanks for the bug report @ori!
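(Note on the T236102 fix above: the 502s happened because the backend certificate's SubjectAltName list did not cover the requested hostname. Below is a minimal, generic sketch of how to inspect which names a TLS backend's certificate covers; the host and server name are just the ones from this incident, and this is not necessarily the command that was actually run.)
    # Print the SubjectAltName list presented by a TLS backend (illustrative check only)
    echo | openssl s_client -connect performance.discovery.wmnet:443 \
        -servername performance.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'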
[08:44:56] _joe_: I'm doing something wrong, but really I can't see what. Help! :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545207/
[08:45:00] Error: Function lookup() did not find a value for the name 'profile::tlsproxy::envoy::ensure'
[08:45:11] https://puppet-compiler.wmflabs.org/compiler1001/18979/
[08:45:42] <_joe_> ema: are you sure that's the role applied to those servers in site.pp with the role() keyword?
[08:45:55] <_joe_> because that's the role that will be looked up on hiera
[08:46:50] I think it is, otherwise puppet would not try to lookup the variable in the first place?
[08:47:18] oh wait
[08:47:30] logstash1007 is role(logstash), not logstash::elasticsearch
[08:53:49] _joe_: <3
[08:54:58] <_joe_> remember: roles can include other roles
[08:55:11] <_joe_> and those do not count for hiera lookups
[09:03:43] I'm seeing a fair amount of 502s for requests like GET http://198.35.26.96/ (text-lb.ulsfo), known ?
[09:03:49] happy to followup in a task too
[09:04:16] fair amount == 1000 per five minutes
[09:05:23] where are you seeing those?
[09:05:52] but I think it makes sense to see them
[09:06:20] text-lb has two servers running ats-be (4027 and 4028) and those two will fail to establish a TLS connection using that hostname
[09:06:43] basically because I don't think that we ship the text-lb IPs on the SAN list for the TLS certificate for the mw* servers
[09:06:46] ema ^^
[09:07:23] that's correct
[09:07:37] we could return a synth 404 from varnish-fe instead
[09:07:49] even a 400
[09:08:34] but careful on how you do it... cause pybal uses Host: 127.0.0.1 for some tests
[09:08:39] interesting, now on varnish-be we return a 200 page that says "Unconfigured domain"
[09:08:55] see for example http://91.198.174.192
[09:09:11] somehow I thought the "Unconfigured domain" page was a 404
[09:09:13] hmm that's varnish-be?
[09:09:49] funny.. that doesn't get redirected to https :)
[09:10:06] I was expecting a TLS handshake error
[09:10:09] godog: thanks for opening a can of worms! Could you please file a task? :)
[09:10:45] lol, you are very welcome ema
[09:11:25] vgutierrez: I was looking at the frontend-dashboard and noticed lower availability in ulsfo, then the 5xx dashboard in logstash confirmed
[09:11:38] but yes will open a task
[09:12:29] thanks
[09:16:04] 10Traffic, 10Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10fgiunchedi)
[09:16:46] sigh that also reminds me that frontend-traffic should be updated with ats metrics too
[09:19:17] sadly we don't have the 'can of worms' token
[09:26:50] we should!
[09:54:30] I'm sure c.danis has the appropriate emoji though :-P
[10:15:20] 10Traffic, 10Operations: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10Joe) Given we have the hot-restarted now, that's probably a good idea.
[10:30:50] _joe_: is this how it works? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545225/
[10:31:03] * ema tries click-bait for CRs
[10:31:23] OCD says that it should be before the logstash line...
[10:31:27] sorry.. CDO
[10:31:29] * vgutierrez hides
[10:31:54] <_joe_> ema: sure, remember to pool the dnsdisc entries before
[10:32:10] <_joe_> the dns update
[10:32:30] <_joe_> (also it's pooled=true/false in this case, it's a on/off system)
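(Note: a minimal sketch of the "pool the dnsdisc entries before the dns update" step _joe_ describes above. The tool is not named in the conversation; conftool's confctl is assumed here, and the selector syntax is illustrative of the pooled=true/false on/off model rather than a verified command.)
    # Check and then pool the discovery (dnsdisc) entry for a service in one site
    confctl --object-type discovery select 'dnsdisc=kibana,name=eqiad' get
    confctl --object-type discovery select 'dnsdisc=kibana,name=eqiad' set/pooled=true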
[10:35:13] vgutierrez: better?
[10:35:17] _joe_: thanks
[10:37:22] <3
[10:37:27] even ❤️
[10:53:54] aaaand authdns-update failed :)
[10:53:57] lucky day today
[10:54:01] https://phabricator.wikimedia.org/P9427
[10:54:07] error: Name 'kibana.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-kibana'
[10:54:21] I'm gonna revert the dns change and go for lunch
[12:46:00] 10Traffic, 10Operations, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10jijiki) It appears we are having fetch errors, possibly due to timeouts as well mostly on two servers where we have enabled...
[12:59:48] ema: I'm going to depool esams - https://gerrit.wikimedia.org/r/c/operations/dns/+/545270
[13:00:42] XioNoX: ok, for how long roughly?
[13:01:23] ema: that one should be less than 2h, aiming for 1h max
[13:01:35] ack
[13:01:39] then another one later on or tomorrow morning depending on progress
[13:01:55] ema: CR looks good?
[13:02:18] XioNoX: do we have a task?
[13:02:59] T235805 ?
[13:02:59] T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805
[13:03:46] yeah :)
[13:04:03] XioNoX: ack, please add it to the commit log
[13:04:15] other than that the CR looks good
[13:04:38] cdanis: thx :)
[13:05:26] 10Traffic, 10Operations, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I highly suspect that's related to stricter timeouts on ats-be compared to varnish-be and atls-tls, that would...
[13:14:03] Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: bast3002.wikimedia.org,cp[3007-3008,3010,3030,3032-3036,3038-3047,3049].esams.wmnet,lvs[3001-3004].esams.wmnet,maerlant.wikimedia.org,multatuli.wikimedia.org,nescio.wikimedia.org
[13:14:05] yay
[13:15:32] and manually downtimed everything with esams in the name from the ui
[13:32:44] XioNoX: really nice, we got no icinga spam on irc \o/
[13:32:53] I didn't do antyhing yet :)
[13:33:08] ah :)
[13:33:11] lol
[13:33:29] I'm sure there will be some noise on irc :(
[13:33:48] meanwhile, to follow the effects of depooling esams it is interesting to watch:
[13:33:51] https://grafana.wikimedia.org/d/000000500/varnish-caching?refresh=15m&orgId=1&from=now-3h&to=now&var-cluster=cache_text&var-cluster=cache_upload&var-site=codfw&var-site=eqiad&var-site=ulsfo&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[13:34:24] XioNoX: did you downtimed the mgmt or need a hand?
[13:34:33] I can wrap up some python foo in few minutes
[13:40:43] volans: if only a few minutes then sure
[13:40:50] it will be useful in the future
[13:40:56] thx
[13:47:06] ok I'll look into it asp
[13:47:08] asap
[13:50:03] ema: the kibana thing is about a/a vs a/p mismatch
[13:51:02] a/a use the hieradata key "active_active: true" and go in the discovery-geo-resources file in the dns mock stuff, a/p use the hieradata key "active_active: false" and go in the discovery-metafo-resources file.
[13:52:02] (and the DYNA record similarly matches, saying either metafo or geoip)
[13:52:27] the DNS change for kibana is all geoip-based (active/active), but the puppet side had "active_active: false"
[13:52:43] oh
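(Note: to make the a/a vs a/p mismatch above concrete — the authdns-update failure means the zone's DYNA record asked for the geoip resolver while the disc-kibana resource had been rendered on the metafo side. Below is a rough consistency check under stated assumptions: the zone template name and on-host file paths are guesses for illustration, not verified.)
    # Which resolver plugin does the zone template request for the name? (repo path assumed)
    grep 'kibana' templates/wmnet   # expect something like: DYNA geoip!disc-kibana
    # Which resources file did puppet actually render the resource into? (paths assumed)
    grep -l 'disc-kibana' /etc/gdnsd/discovery-geo-resources /etc/gdnsd/discovery-metafo-resources
    # The two must agree; both follow the hieradata flag active_active: true/false.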
[13:53:22] I guess I should wait for the firefighting happening on -operations
[13:55:11] so for some reason the OS upgrade doesn't want to go through... I'll restart the VC without the upgrade, and then try it again...
[13:57:44] XioNoX: wanted to make sure you saw the librenms alert re: cr1-eqiad port util, looks like it is xe-4/3/2, NTT 10Gbps, currently using 8Gbps and looks fairly steady
[13:58:05] hopefully eqiad doesn't get too hot as we get the combined EU and US peak there
[13:58:15] yeah, see the other SRE channel :)
[14:02:19] XioNoX: {done}
[14:03:02] for 2 hours
[14:04:22] thx
[14:04:41] still waiting for things to quiet down on -operations to not add my barage of icinga
[14:05:22] XioNoX: I'll push a smaller geodns change, which I think doesn't have huge impact but is the simplest thing that could potentially help
[14:05:31] (switch default for unknown/generic clients to codfw)
[14:05:42] yeah codfw does not have much traffic
[14:05:58] thx
[14:06:07] same for ulsfo
[14:06:20] XioNoX: I was a bit optimistic, not seeing them in the UI, checking logs
[14:07:26] right, our normal geodns routing isn't based on evening out loads, it's based on attempting to get the best latency for everyone.
[14:07:44] it'd be nice to have an alternate mapfile ready to go which is based on evening loads in scenarios like this
[14:08:08] (or to have more DCs so we don't have to worry about this, or to have geodns code that can auto-balance it for us by just tweaking some simple weighting numbers)
[14:08:17] :)
[14:09:04] bblack: so, something like this for the kibana change? https://gerrit.wikimedia.org/r/#/c/operations/dns/+/545287/ Waiting for esams work to be done before merging ofc
[14:09:41] yeah
[14:13:26] bblack: alright, I'm going to restart asw-esams, should be down for ~20min max
[14:13:51] XioNoX: ack
[14:14:26] XioNoX: how long you want the downtime for the mgmt?
[14:14:46] XioNoX: wait
[14:15:04] XioNoX: sorry I'm still getting up to speed for the morning - did you handle the ns2/multatuli situation yet?
[14:15:49] I think not, because I can't reach it from the outside
[14:15:54] er, forgot about that one...
[14:16:14] it's kind of a biggy!
[14:16:30] do we still have routeability to esams/knams for the esams address space?
[14:17:07] (if so, we just need those routers to forward that IP back over transport to eqiad and hand it to ns0 basically. The nsX hosts all already listen for all of the 3x public authdns IPs)
[14:17:16] yeah, pushing the routing redirection in a few min
[14:22:54] thanks!
[14:23:58] $ dig +nsid @ns2.wikimedia.org en.wikipedia.org A|grep NSID
[14:23:59] ; NSID: 61 75 74 68 64 6e 73 31 30 30 31 (a) (u) (t) (h) (d) (n) (s) (1) (0) (0) (1)
[14:24:28] bblack: is it working?
[14:24:31] yes
[14:25:06] (dig +nsid asks the server to self-identify with a binary label, and we happen to set those to the hostnames of the underlying servers)
[14:29:45] just in case there are more people reading here — we (wmcs folks) are seeing packet loss to lvs1014 which is (ultimately) breaking a bunch of ldap things for us.
[15:00:04] XioNoX: so the basic effect is that (with esams depooled) the set of geodns changes removes ~1/3 of the total eqiad traffic, which might be enough
[15:00:16] yeah for sure
[15:00:29] we were around 8/9Gbps on NTT
[15:00:42] XioNoX: we need to undo all the icinga suppression too if we're gonna repool
[15:01:00] but otherwise I'd say let's go ahead
[15:01:09] undoing the icinga suppression is hard :(
[15:01:13] (and undo the temporary ns2 routing)
[15:01:36] they expire in 15min
[15:01:38] XioNoX: did you attach a comment with your downtime?
[15:01:53] XioNoX: do we need the asw upgraded to proceed with everything else? (is it blocking the whole week, or can we leave it for later?)
[15:02:09] I assume you're having local discussion about this all anyways, I just don't see it here
[15:02:45] we can proceed without doing it, I was using this opportunity to get everything to a better software version
[15:03:46] ok
[15:04:05] if it weren't for asw work, how soon would we need to depool again with proceeding?
[15:04:30] tomorrow morning eu time I'd say
[15:04:30] (is it worth repooling for a significant chunk of today or whatever?)
[15:04:40] yeah
[15:04:44] yeah ok so let's let the icinga suppresses expire, undo ns2 re-routing, then repool?
[15:05:05] (and then I'll back out the geodns reshuffling a little later afterwards)
[15:05:16] (and prep a better combined patch for use tomorrow)
[19:06:32] 10Traffic, 10Gerrit, 10Operations, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) ^ The reason to merge this was not a comment on the general question to enable avatars. The reason was that during T222391 we noticed an undesirable dependency. During a Ger...
[20:05:21] 10Traffic, 10Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) Of interest: all have user agent FortiGate (FortiOS 5.0) and [[ https://logstash.wikimedia.org/goto/3fa7d259cc2043eb0b56a6ae5e89298f | have appeared near simultaneously from a number of sources gl...
[20:05:41] 10Traffic, 10Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) p:05Triage→03Normal
[20:09:55] 10Traffic, 10Operations: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BBlack) p:05Triage→03Normal
[22:36:41] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH) p:05Triage→03Normal
[22:37:05] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH)