[12:43:47] are the instructions to update compiler facts here accurate? https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#FAQ
[12:44:21] I launched the oneliner locally and it's been sitting there for ~5 min now
[12:44:23] godog: AFAIK yes but there are 3 compilers now
[12:44:27] ### Syncing facts from puppetmaster1001.eqiad.wmnet
[12:44:47] and that's included in the oneliner, so yeah should be ok
[12:45:13] I'm wondering if this is the first time we run it after the puppetmaster upgrade though
[12:45:16] jbond42 ^^^
[12:46:01] thanks, I'll dig into it a little bit
[12:46:11] it's going into puppet config print localcacert
[12:46:15] in a loop or something
[12:46:25] starting a ps for 30s
[12:46:31] this should work with the new puppet masters
[12:46:36] however `puppet config print localcacert`
[12:46:50] & that is a recent change and may not have been fully tested
[12:47:24] yeah I launched /usr/local/bin/puppet-facts-export and I guess it is taking a while, going through all hosts
[12:47:49] I'll let it be, thanks
[12:51:35] it shouldn't take that long
[13:35:22] godog: did it end?
[13:36:10] s/end/completed successfully/
[13:41:08] yeah looks like it
[13:41:21] it took longer than I remembered indeed
[13:50:45] it was surely under 5 minutes IIRC
[13:52:53] I can't dig into why it is slow now, although I got https://gerrit.wikimedia.org/r/c/operations/puppet/+/552062 out to show some progress
[14:03:15] ack, thanks
[14:10:27] Sorry, made the mistake of updating my kernel and have spent the last few hours trying to get my displays working #YearOfLinuxOnDesktop. I'll take a look at the export script and make sure I haven't added a bug somewhere
[14:12:49] jbond42: but now you can enjoy the various improvements and bugfixes which no doubt the new kernel brought along!
[14:13:30] :D lol yay
[14:14:44] I would love to not have to remember to modify my kernel commandline to include 'nomodeset' iff I'm resuming from hibernate (as opposed to doing a fresh boot)
[14:14:48] #YearOfLinuxOnTheDesktop
[14:21:09] 5.3 or which update broke things?
[15:34:18] moritzm: it's more to do with my docking station which needs a binary (display-link) driver, had to switch to the unsigned kernel
[15:36:01] ah, yes
[15:53:30] godog: volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/552079 this should speed the export back up
[15:55:14] sweet, thanks jbond42
[15:56:09] thx indeed
[15:59:04] np
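A minimal sketch of the slowdown pattern discussed above, assuming the export really did re-run `puppet config print localcacert` for every host inside its loop; the helper name, the hosts.txt input, and the loop structure are assumptions, not the actual /usr/local/bin/puppet-facts-export or change 552079:

```
#!/bin/bash
# Hedged sketch: resolve the puppet setting once, outside the loop, instead of
# spawning a fresh `puppet config print` (a full Ruby process) per host.
set -euo pipefail

localcacert="$(puppet config print localcacert)"   # one call, reused below

export_facts_for_host() {
    # Hypothetical per-host step; the real script's work is not shown in the log.
    local host="$1"
    echo "exporting facts for ${host} (CA: ${localcacert})"
}

while read -r host; do
    export_facts_for_host "${host}"
done < hosts.txt   # hosts.txt: one hostname per line (assumed input)
```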
[17:57:59] heyo, we're seeing 503s for blubberoid.wikimedia.org
[17:58:18] it's hitting cp1083 which i think is ats?
[18:00:01] effie ^^^
[18:04:20] <_joe_> marxarelli: is the problem persisting?
[18:04:29] <_joe_> can you give me an example URL?
[18:11:29] _joe_: `curl -D - -s https://blubberoid.wikimedia.org/`
[18:11:58] <_joe_> marxarelli: I get a 404
[18:12:18] _joe_: ah, just did as well
[18:12:22] seems intermittent then
[18:12:57] https://www.irccloud.com/pastebin/XtbN4NF5/
[18:13:28] <_joe_> yeah I'm looking at the logs
[18:13:29] https://www.irccloud.com/pastebin/6YMn9O7t/
[18:13:38] https://phabricator.wikimedia.org/P9696
[18:13:39] <_joe_> it's from all cache servers
[18:13:51] cp1075 works and the others seem to not
[18:14:07] <_joe_> cp1075 is ATS
[18:14:09] interestingly cp1075 is text_ats
[18:14:46] <_joe_> uhm wait cdanis
[18:14:58] <_joe_> is cp1075 the backend?
[18:15:25] <_joe_> I think cp1077 is the backend
[18:15:52] no you read them in reverse order
[18:16:13] <_joe_> oh right you get the same frontend because of source-hash
[18:20:37] <_joe_> marxarelli: did you deploy a new version of blubberoid by any chance?
[18:24:42] <_joe_> I was looking at varnishlog -c -q 'ReqHeader ~ "Host: blubberoid.wikimedia.org"'
[18:25:03] <_joe_> and I see requests failing with "Backend fetch failed"
[18:26:53] <_joe_> - RespProtocol HTTP/1.1
[18:26:55] <_joe_> - RespStatus 503
[18:26:57] <_joe_> - RespReason Backend fetch failed
[18:27:21] <_joe_> curling the same url from the same host never returns an error
[18:29:14] <_joe_> now lemme check one thing
[18:29:33] is it relevant that the eqiad line for the blubberoid service in hieradata/role/common/cache/text.yaml is commented out?
[18:29:46] whereas I see on https://config-master.wikimedia.org/discovery/services.yaml that discovery thinks both eqiad and codfw are pooled for blubberoid?
[18:30:57] <_joe_> possibly!
[18:31:16] <_joe_> so, lemme see calling codfw
[18:31:40] <_joe_> nope, same correct response with curl
[18:31:48] <_joe_> at least as far as my http knowledge goes
[18:32:00] <_joe_> I see a 404 with the correct content-length
[18:33:25] <_joe_> so it seems to be a common problem to all varnishes?
[18:33:31] <_joe_> pretty strange
[18:37:44] <_joe_> eqiad is the only dc where this happens
[18:38:29] <_joe_> cdanis: I'm a bit tired to go beyond this level at this point of the day, but if this escalates ping me back
[18:39:05] <_joe_> you might want to involve ema possibly though if further expertise is needed. I'm rusty at varnish debugging
[18:39:13] yeah so
[18:39:20] I can confirm it is all the varnishbes in eqiad that have this issue
[18:39:40] just cumin'd it across all the cp-texts, talking to the backend varnish (port 3128) directly
[18:39:53] <_joe_> cdanis: uncommenting the cache route to eqiad might miraculously solve whatever issue we're seeing
[18:39:59] I think it will
[18:40:02] because
[18:40:04] for cross-site services
[18:40:08] varnish used to go via other varnishes, right?
[18:40:13] and I think that was done at the *frontend* layer?
[18:40:26] but I think that ATS does some piece of that now, and looks at DNS discovery when making that decision?
[18:40:36] anyway I was about to try that
[18:40:52] <_joe_> cdanis: yes, it should still go via the other varnishes in fact
[18:41:06] <_joe_> let me see for a sec how that is routed in the current vcl on the servers
[18:41:39] <_joe_> cdanis: hah
[18:41:42] <_joe_> - ReqHeader X-Next-Is-Cache: 1
[18:42:27] hah
[18:42:28] <_joe_> it seems somehow it tries to talk to a codfw backend and fails at it?
[18:42:56] https://gerrit.wikimedia.org/r/552111 _joe_
[18:43:55] there's also the similar and still commented-out line for docker-registry
[18:43:59] I'm not touching that yet 🙃
[18:45:11] <_joe_> no, that you should not
[18:45:24] lgtu?
[18:46:35] _joe_: no I haven't touched blubberoid in quite some time
[18:47:21] <_joe_> sorry I'm trying to understand what happened here cdanis
[18:47:24] <_joe_> and I think I know
[18:47:34] <_joe_> I doubt blubberoid worked from eqiad for some time
[18:48:11] _joe_: I just ran puppet agent on cp1085
[18:48:19] varnishbe returns 404 there now
[18:48:27] <_joe_> yeah
[18:48:30] so it seems that uncommenting did fix the issue
[18:48:51] <_joe_> what's the change in vcl?
[18:49:00] the Puppet diffs are also illustrative https://puppetboard.wikimedia.org/report/cp1085.eqiad.wmnet/7a2685b8bf2412b828935a3796ce659dc5f98c6b
[18:49:23] doing a quick spot-check for secrets and I'll paste
[18:49:27] hmm, i'm still seeing 503s intermittently
[18:49:35] marxarelli: yes, it has not hit all varnishes yet
[18:49:41] <_joe_> marxarelli: puppet hasn't run everywhere
[18:49:44] ah ok!
[18:49:52] _joe_: https://phabricator.wikimedia.org/P9697
[18:49:53] i'll be more patient :)
[18:50:33] <_joe_> cdanis: ok so apparently sending the request across cache layers doesn't work anymore or $something?
[18:51:00] _joe_: I think specifically in the case where 1) DNS Discovery doesn't match Varnish hiera and 2) ATS is not involved
[18:51:05] <_joe_> cdanis: ah damn you're running puppet everywhere?
[18:51:12] oh did you not want me to? heh
[18:51:18] sorry I had just fired it up on A:cp-text_eqiad
[18:51:21] <_joe_> I hoped to preserve one server
[18:51:25] oops
[18:51:26] <_joe_> to see what happened
[18:51:33] <_joe_> with a tcpdump
[18:51:40] <_joe_> what's not working, I mean
[18:51:44] <_joe_> anyways, problem solved
[18:52:21] thanks for the fix, cdanis, _joe_
[18:52:35] I think, if the current situation persists for much longer, there should be some check for mismatches between DNS Discovery and the hiera
[18:53:08] <_joe_> marxarelli: if you tried from the east coast earlier, we would've noticed before :P
[18:53:29] <_joe_> anyways, I'm afk
[18:53:49] good to know :P
[18:53:56] <_joe_> cdanis: I don't think that's what happened here (the mismatch causing the issue).
[18:54:11] <_joe_> but yeah this needs to be better understood
[19:00:13] _joe_: no, I am pretty sure it is that
[19:01:04] _joe_: if you remove the stanza from hiera again, the VCL on the host for the backend doesn't know about blubberoid /at all/
[19:01:21] <_joe_> it does a pass
[19:01:24] <_joe_> to the next cache
[19:01:38] <_joe_> which is defined in backend-directors.vcl to be codfw
[19:02:43] <_joe_> so somehow eqiad varnish bes fail to fetch the info from the cache backends in codfw
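The 18:39 check across the eqiad text caches might have looked roughly like the sketch below; the cumin alias is the one mentioned at 18:51, while the exact curl invocation is an assumption:

```
# Hedged sketch of the direct backend-varnish check: ask each eqiad text cache
# to fetch blubberoid from its local varnish-be (port 3128) and print only the
# HTTP status code it returns.
sudo cumin 'A:cp-text_eqiad' \
    "curl -s -o /dev/null -w '%{http_code}' -H 'Host: blubberoid.wikimedia.org' http://localhost:3128/"
```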
[19:03:29] mutante: hello! got some questions about phab1003 if you have some time (otherwise I can open a task)
[19:04:16] hm you are right
[19:06:10] XioNoX: listening to the tech talk and installing a new machine. what's up
[19:06:35] mutante: https://phabricator.wikimedia.org/T238781
[19:07:23] XioNoX: it is caused / subtask of https://phabricator.wikimedia.org/T238593
[19:07:53] it's the realtime notification feature using websockets and it's currently disabled
[19:08:13] oh wait. maybe not
[19:08:33] 22280 is the phab-ssh server
[19:08:45] i'll take a look
[19:08:48] I do see - remove ferm hole for port 22280/tcp (confirmed)
[19:09:08] where do you see that?
[19:09:16] https://phabricator.wikimedia.org/T238593#5675795
[19:09:43] I just wanted to point out significant things I'm seeing in those logs
[19:09:59] ah, yes
[19:10:13] i'm taking the ticket
[19:25:55] rules for that were disabled in ATS but not in varnish
[19:49:58] effie: should I leave puppet disabled on labweb1001/1002 or did you just miss them when you re-enabled?
[19:52:47] andrewbogott: this is odd, yes I missed that
[19:52:59] 'k
[19:52:59] but I think it should have been visible in icinga
[19:53:17] puppet should be enabled
[19:53:26] icinga says "OK: Puppet is currently disabled, not alerting. Last run 1 day ago with 0 failures"
[19:53:36] * andrewbogott enables
[19:53:51] hmm
[19:54:09] ok, thank you
[19:54:17] thanks for the hhvm cleanup!
[19:54:20] sorry about that
[19:54:22] hehe
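The consistency check suggested at 18:52 could start as something as simple as the sketch below; the layouts of both YAML files are assumptions, and a real check would parse them properly instead of grepping:

```
#!/bin/bash
# Hedged sketch: for one service, show what DNS discovery (config-master) has
# pooled next to what the cache text hiera lists, so mismatches stand out.
# Assumes a checkout of operations/puppet in the current directory.
svc="${1:-blubberoid}"

echo "== discovery (config-master) =="
curl -s https://config-master.wikimedia.org/discovery/services.yaml | grep -A 3 -- "${svc}"

echo "== hieradata/role/common/cache/text.yaml =="
grep -A 6 -- "${svc}" hieradata/role/common/cache/text.yaml
```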
[20:27:31] if anyone from serviceops is around: https://phabricator.wikimedia.org/T238789
[20:49:33] _joe_, cdanis: getting 503s from docker-registry.wikimedia.org now
[20:49:50] heh, yes, similar issue there I think
[20:51:17] kk
[20:52:13] <_joe_> and that's less solvable though
[20:52:43] <_joe_> cdanis: you can try to depool eqiad from dns for that entry maybe? (I wouldn't know how to do that though)
[20:53:13] _joe_: registry is active/passive?
[20:53:29] <_joe_> yes
[20:53:45] <_joe_> because swift replication is sigh
[20:54:46] I see docker-registry 300/10 IN DYNA metafo!disc-docker-registry
[20:54:49] so that's one of the right things
[20:55:06] so I think it just needs a conftool change
[20:55:38] <_joe_> no
[20:55:55] <_joe_> the problem is as usual in varnish that routes via codfw from eqiad
[20:56:01] <_joe_> and seems broken
[21:33:49] uh
[21:33:56] just looking at cp1077
[21:34:37] /etc/varnish/directors.frontend.vcl defines cache_local and cache_local_random to point to local backend varnish instances, adding all the hosts needed
[21:35:10] but /etc/varnish/directors.backend.vcl defines names cache_codfw and cache_codfw_random (which are referenced elsewhere, including by cross-cluster redirection) but then... doesn't fill in any physical backends for them
[21:40:25] _joe_: as you suspected, confctl does not believe any varnish-bes are pooled in codfw
[21:44:22] <_joe_> I don't know what to do there though
[21:45:19] yes indeed, nor do I understand what broke this
[21:46:39] https://tools.wmflabs.org/sal/log/AW5uTLUp0fjmsHBa0GUJ ?
[21:47:16] this isn't working for any varnishbe instance in eqiad though
[21:47:49] https://phabricator.wikimedia.org/P9699
[21:53:33] <_joe_> ats-tls is just the tls terminator
[21:54:10] <_joe_> cdanis: so I think what happened is we've transitioned codfw to ATS on the backend?
[21:54:12] yeah I was just looking for recent changes
[21:54:23] _joe_: I think I remember hearing everywhere except eqiad, although I am not sure
[21:54:47] <_joe_> ok so, it would be interesting to see when the codfw backends were depooled
[21:54:50] I suspect
[21:55:01] that we didn't plan for an active/passive service where the one active DC is codfw
[21:55:50] <_joe_> yeah varnish-be is gone in codfw
[21:56:01] <_joe_> so one thing you could try is to route via ats-be
[21:56:44] <_joe_> confctl select 'service=ats-be,dc=codfw,cluster=cache_text' get
[21:56:52] <_joe_> it's been converted apparently
[21:57:15] <_joe_> cdanis: so one solution is
[21:57:32] <_joe_> set the codfw lvs for the registry in eqiad's cache configs
[21:57:34] does ats-be accept HTTP?
[21:57:45] <_joe_> it should, yes cdanis
[21:57:52] <_joe_> you can do it both ways
[21:59:46] https://phabricator.wikimedia.org/P9700
[21:59:49] yes that works
[22:06:00] okay, I think I can propose a patch
[22:06:08] I would still like someone from traffic to review, because this is scary
[22:18:42] <_joe_> I can help I think
[22:20:38] _joe_: I think it could also wait until tomorrow, I think the only services affected are cxserver and docker-registry
[22:20:57] <_joe_> sigh
[22:21:01] <_joe_> cxserver is important
[22:23:04] well in that case see if https://gerrit.wikimedia.org/r/c/operations/puppet/+/552142 makes sense to you _joe_
[22:24:05] <_joe_> it kinda-does, but I'd prefer to patch the routes temporarily to be cross-dc
[22:24:47] how do you mean?
[22:25:08] directors.backend.vcl winds up with just cross-dc routes
[22:25:10] <_joe_> adding eqiad: in the cache directors
[22:25:18] <_joe_> but that's pii-breaking
[22:25:23] <_joe_> so let's not do that
[22:25:43] ah yes
[22:26:29] this seems closest to 'original' behavior
[22:26:40] and i think will reuse the same ipsec tunnels that still exist (??)
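For reference, the 21:40 and 21:56 conftool checks side by side; the ats-be query is quoted from the log, and the varnish-be one is the analogous query (assumed syntax):

```
# Hedged sketch: confirm the codfw text cache backends now live under ats-be
# rather than varnish-be (i.e. the varnish-be objects are gone or depooled).
sudo confctl select 'service=ats-be,dc=codfw,cluster=cache_text' get
sudo confctl select 'service=varnish-be,dc=codfw,cluster=cache_text' get
```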
[22:38:45] hm
[22:38:48] https://logstash.wikimedia.org/goto/6f192b659517086acc8639cd2a79a429
[22:38:53] I am somewhat inclined to believe that something happened today
[22:50:05] cdanis: I think that 6 cp hosts in codfw have been reimaged into text_ats today
[22:50:21] _joe_: do you know why cxserver is overridden in wikimedia_text-backend.vcl to always route to codfw btw? dns discovery implies it is pooled in both eqiad and codfw
[22:53:02] of course
[22:53:04] the same thing again
[22:53:07] depooled from eqiad
[22:53:12] it's just this is how it gets rendered
[22:53:15] ok
[22:53:31] so I will repool in eqiad
[22:54:43] <_joe_> I think there were reasons why cxserver was depooled in eqiad
[22:54:47] <_joe_> but whatever
[22:55:07] the blame on the yaml file is the same as before
[22:55:13] the depool three months ago for the k8s upgrade
[22:55:28] and dns discovery does not match, it is pooled in eqiad, so, eqiad is getting some traffic anyway
[22:55:37] <_joe_> ok let's repool now
[22:55:49] <_joe_> the important use of cxserver is probably the best
[22:56:00] <_joe_> that one hehe
[22:56:13] <_joe_> so let's fix cxserver
[22:56:26] <_joe_> docker registry can wait until tomorrow morning for ema to see
[22:57:04] +1
[22:57:46] <_joe_> as the PO, PM, POC and SPOF on the registry, I think I can take that decision :P
[22:58:32] <_joe_> esp when it's midnight
[22:59:20] dumb q _joe_ what monitoring exists for cxserver
[22:59:34] <_joe_> I guess the usual swagger based one
[22:59:42] <_joe_> sorry, openapi
[22:59:52] ok so nothing that would have noticed this, really, unless it goes via traffic instead of via the svc address
[22:59:54] which i doubt it does
[23:00:05] <_joe_> we don't
[23:01:26] ok I have merged and it works fine on cp1077 now, will let puppet proceed as normal with the rest of the eqiad cps
[23:01:46] it's dinnertime here (I know, scandalously early), I will be afk for a bit but be watching for pings
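A post-merge spot check like the 23:01 one on cp1077 might be as simple as the following; the exact command used is not in the log, so the options here are assumptions:

```
# Hedged sketch: on the cache host, hit the local backend varnish with the
# cxserver Host header and confirm it no longer answers "503 Backend fetch failed".
curl -s -D - -o /dev/null -H 'Host: cxserver.wikimedia.org' http://localhost:3128/ | head -n 1
```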