[00:49:59] hello ema et al. about 8 hours ago this code line started to break: [00:50:05] $trusted_proxies = $cache_text_nodes[$::site] + pick($cache_text_nodes["${::site}_ats"], []) [00:50:12] this now fails if $::site is "codfw" [00:50:47] somehow it is '' and that is right after "$cache_text_nodes = pick($cache_nodes['text'], {})" [00:51:27] i also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/552083/1/conftool-data/node/codfw.yaml which is kind of in that time frame and looks close.. though i dont fully see why yet [00:53:20] well.. it kind of makes sense because "This completes the conversion of text@codfw to ATS." [00:53:55] yep, i see now https://gerrit.wikimedia.org/r/c/operations/puppet/+/552083/1/hieradata/common/cache.yaml [00:59:08] but it already has the _ats suffix in "pick($cache_text_nodes["${::site}_ats"]" [01:22:37] well... https://gerrit.wikimedia.org/r/c/operations/puppet/+/552161 ..shrug [01:22:42] that fixed it for now [03:06:46] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) nice stap, playing a little bit with debug logging for a specific client ip I've been able to get the same information: `[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: wow, i really need to learn systemtap [03:11:04] pretty powerful right? [03:11:11] yeah, very impressive [03:12:14] it would be nice to be able to dump the whole request [03:12:36] but it looks clear to me that there is some issue with wikidata [03:18:53] vgutierrez: btw -- https://gerrit.wikimedia.org/r/c/operations/puppet/+/552142 fixes docker-registry.wikimedia.org which is currently broken for users hitting eqiad [03:19:49] i wanted a review as i'm not totally sure of what i'm doing here, but also it's not a service with many users (and i think internal ones don't go via traffic?), so i'm not terribly fussed about it [03:20:21] ema would be more prepared to review than me [03:20:27] k np! it can wait for that for sure [03:27:12] let's see if I can give some tough love to a mw server using KA to reproduce this [04:00:45] at the same time it's quite interesting that we are only seeing this with ats-be [04:01:04] I wonder if envoy can be at fault here [04:01:53] that could be pretty easy and secure enough to test on cp1075 considering that everything stays in the same DC [04:02:54] also... if envoy is seeing this I guess it should log something somewhere [04:02:55] ? [04:04:49] hmmm right, we don't use envoy for mw servers [04:05:03] nginx is handling TLS for those [04:05:46] oh wonderful [04:05:52] we have a lot of nginx errors as well [04:12:46] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) it looks like the issue is happening at mw servers as well, nginx (TLS termination for appservers-rw.discovery.wmnet) is screaming a lot with wikidata requests, for the request I've p... [05:15:20] _joe_: any tips debugging the mw side?
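A loop along these lines is one way to apply that kind of pressure over reused keep-alive connections (a sketch only, borrowing the Special:EntityData repro URL that shows up later in the conversation; curl only reuses its connection when all the URLs go into a single invocation):

# sketch: hammer ats-be on a cache node with the repro URL; the [1-200] glob
# makes curl send 200 requests on one client connection, each with a different
# cache-buster, so ats-be keeps its backend keep-alive connections busy
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Host: www.wikidata.org' \
  "http://127.0.0.1:3128/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=[1-200]"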
[06:36:57] <_joe_> vgutierrez: so varnish is super-picky with http responses [06:37:12] we are not talking varnish right now [06:37:13] <_joe_> so I frankly doubt something has changed there [06:37:24] <_joe_> sorry, at the apache layer [06:37:34] <_joe_> if we weren't seeing such issues in varnish [06:37:38] <_joe_> that connects to apache [06:37:50] yup, but nginx is screaming as well [06:37:55] is not only ats complaining [06:38:06] <_joe_> ats complains because nginx complains [06:38:10] the nginx instance running in the mw server [06:38:21] <_joe_> I'm saying there is something wrong there, in that nginx's config [06:38:33] <_joe_> or well some constraint those requests violate [06:38:54] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) Interesting enough, on the apache access log, that request looks OK: ` 2019-11-21T02:57:04 73916 10.64.0.62 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304 14079 GET http://www.... [06:38:55] <_joe_> do you have a repro case? For a wikidata request to that nginx? [06:39:13] I'm able to trigger it [06:39:27] looping requests [06:39:35] <_joe_> my point is: varnish-be => apache is ok, ats-be => nginx => apache does not [06:39:42] yup [06:39:50] I said something similar this APAC morning here [06:40:11] actually considering that everything is within eqiad we could point temporarily ats-be to apache [06:40:12] and see what happens [06:40:26] <_joe_> vgutierrez: that too, yes [06:40:40] funny enough, the faulty requests are 304s [06:40:48] at least this one [06:40:49] <_joe_> ohhhh [06:40:50] let me check others [06:41:06] <_joe_> maybe it's some shitty 304 but still sends the body problem? [06:41:29] <_joe_> it's not an uncommon issue - we had it in the past [06:42:56] I think so [06:43:02] 2019-11-21T02:03:29 75533 10.64.32.32 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304 14079 GET http://www.wikidata.org/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=16886 - - - 10.64.0.130 curl/7.52.1 - - - 10.64.32.32 [06:43:16] <_joe_> so try to repro the request [06:43:22] <_joe_> on the backend [06:43:30] <_joe_> I love vgutierrez=1 :D [06:43:42] it's so easy to trace across the different layers... O:) [06:44:07] and at the same time you know who to blame if you see a request like that [06:44:10] <_joe_> ok so this request fails consistently? [06:44:36] let me check against mw1330 in this case [06:45:36] <_joe_> also how do you get a 304? I guess you use some caching headers [06:46:27] so the triggering request on ats-be is this one [06:46:43] curl -H 'Host: www.wikidata.org' "http://127.0.0.1:3128/wiki/Special:EntityData/Q38646387.json?RANDOM=$RANDOM" [06:46:55] well... kindof, that one is missing the vgutierrez=1 :) [06:47:44] <_joe_> how can you get a 304 on that request? [06:48:00] <_joe_> you have no caching headers there [06:49:44] dunno [06:49:55] or ats-be is adding the caching headers [06:50:23] <_joe_> I'm trying curl --resolve www.wikidata.org:443:$(dig +short mw1261.eqiad.wmnet) 'https://www.wikidata.org/wiki/Special:EntityData/Q38646387.json' -H 'if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT' -D - and I couldn't get it to fail [06:50:29] <_joe_> this goes to nginx [06:51:07] <_joe_> makes no sense you get a 304 to that request, do you concur? 
[06:51:12] if you grep the nginx error.log on those servers you'd see a lot of events regarding wikidata [06:51:16] <_joe_> it breaks HTTP [06:52:25] <_joe_> and in fact, if I try what you pasted above on cp1075, I get a 200 ok [06:52:30] sure [06:52:39] ATS retries up to 3 times [06:52:42] before giving up [06:52:42] <_joe_> so, I'm confused [06:52:52] so you can still generate an error [06:52:56] and get a 200 [06:52:57] <_joe_> yeah my point is you should /not/ get a 304 [06:53:03] <_joe_> ever [06:53:09] so I don't see that 304 [06:53:13] ats-be doesn't get it [06:53:15] nginx doesn't get it [06:53:17] apache logs it [06:53:23] <_joe_> wtf. [06:53:43] see my latest comments on https://phabricator.wikimedia.org/T237319 [06:53:55] the last one is the 304 on apache, the previous the error logged by nginx [06:55:14] same happens with user requests [06:55:22] mw1330 [06:55:24] 2019/11/21 06:54:22 [error] 48012#48012: *167060213 upstream prematurely closed connection while reading upstream, client: 10.20.0.60, server: www.wikimedia.org, request: "GET /wiki/Special:EntityData/Q21512669.json HTTP/1.1", upstream: "http://10.64.32.32:80/wiki/Special:EntityData/Q21512669.json", host: "www.wikidata.org" [06:55:29] that's nginx [06:55:35] apache shows 2019-11-21T06:54:22 88936 10.64.32.32 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304 13881 GET http://www.wikidata.org/wiki/Special:EntityData/Q21512669.json - - - 95.216.107.40, 10.20.0.54, 10.20.0.60 - - - - 10.64.32.32 [06:55:45] fuck [06:55:51] I shouldn't paste that one [06:55:53] :/ [06:57:31] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) Why would a 304 have 14079 bytes of content-length is the first thing I'd ask. It looks like we do something wrong somewhere, but it's not reproducible consistently in my tests. @Vgutierrez... [06:58:14] <_joe_> vgutierrez: looking at the apache log lines [06:58:33] <_joe_> it's always a 304 with a non-zero content-length [06:58:43] yeah [06:58:45] <_joe_> but I don't seem able to reproduce it [06:59:04] _joe_: I do believe that reusing connections is a factor here [06:59:22] http/1.1 keep-alive [06:59:47] <_joe_> vgutierrez: I would be very very surprised if it was [07:00:02] <_joe_> http/1.1 is ats-be => nginx? [07:00:09] yes [07:00:48] <_joe_> ok that could be more relevant than keepalive [07:00:57] <_joe_> vgutierrez: AIUI the issue is at the application layer [07:01:30] I'm wondering if those 304s are being triggered by varnish-be as well [07:03:43] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) Do I need to be concerned about my IP being released during discussion // should this task be private? [07:06:15] <_joe_> vgutierrez: so [07:06:24] <_joe_> mw1330:~$ awk '{if ($4 ~ /localhost.304/ && $5 != 0) print $_ }' /var/log/apache2/other_vhosts_access.log | wc -l [07:06:26] <_joe_> 914 [07:06:34] yep I was checking the same [07:07:04] <_joe_> awk '{if ($4 ~ /localhost.304/ && $5 != 0) print $_ }' /var/log/apache2/other_vhosts_access.log | fgrep Special:EntityData | wc -l [07:07:06] <_joe_> 922 [07:07:08] <_joe_> so it's only that [07:07:17] <_joe_> now we need a repro against apache [07:09:45] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) Just confirmed we only get 304s with a non-zero content-length for requests to `Special:EntityData` on wikidata. 
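One low-tech way to see the stray body on the wire, since curl mostly hides it (a sketch: it talks to apache directly on :80, skipping nginx and ATS; 10.64.32.32 is the appserver IP from the log lines above, and netcat flag handling differs between flavours):

# speak raw HTTP/1.1 to the appserver: a correct 304 must stop at the blank
# line that ends the headers, so any JSON printed after it is the offending body
printf 'GET /wiki/Special:EntityData/Q38646387.json HTTP/1.1\r\nHost: www.wikidata.org\r\nIf-Modified-Since: Sat, 30 Mar 2019 07:05:28 GMT\r\nConnection: close\r\n\r\n' \
  | nc 10.64.32.32 80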
I still don't have a reliable repro case. [07:47:23] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) @Joe it seems pretty easy to trigger: ` vgutierrez@mw1330:~$ curl --resolve www.wikidata.org:80:10.64.32.32 "www.wikidata.org/wiki/Special:EntityData/Q38646387.json" -o /dev/null -v -... [08:19:06] <_joe_> vgutierrez: we should try the repro here https://bz.apache.org/bugzilla/show_bug.cgi?id=57198 [08:19:39] <_joe_> if that's what happens, we need to patch apache I think [08:20:07] <_joe_> I'm surprised varnish did accept that response tbh [08:20:49] _joe_: hmm it looks like the same issue [08:21:13] but we're running 2.4.25 [08:21:59] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) https://bz.apache.org/bugzilla/show_bug.cgi?id=57198 seems relevant, but I think it should be fixed in our version of apache2 (2.4.25). We should try to see what php-fpm answers using `furl... [08:22:18] <_joe_> vgutierrez: now is when I introduce you to furl [08:22:29] <_joe_> or, how to make the request to php-fpm directly [08:22:29] furl? [08:22:37] lovely [08:22:43] <_joe_> thank ori for that [08:23:17] <_joe_> I can craft a furl request, but later in the day [08:23:39] <_joe_> now I have to review otto's patches or he will hate me, I promised to do it today [08:25:46] <_joe_> you know what would be great? a script to make the same request at all the layers of the shitcake [08:26:12] <_joe_> or at least a wiki page :P [08:28:53] hmmm furl supports unix sockets? [08:29:43] <_joe_> it does IIRC [08:32:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/520025 :) [08:34:30] yup, it works [08:35:39] <_joe_> oh, lol [08:35:42] <_joe_> I added it [08:36:07] or maybe you outsource your work to third parties... [08:36:10] ;P [08:36:32] <_joe_> damn you caught me [08:42:45] _joe_: so, php-fpm returns a 304 with a big fat body [08:43:36] <_joe_> vgutierrez: now I don't remember if that's ok by the FCGI specs, but it should be [08:44:01] <_joe_> vgutierrez: try creating a tiny repro case to confirm apache is misbehaving I guess [08:44:13] meaning that fcgi or apache should be smart enough to discard the body? [08:44:21] <_joe_> yeah [08:44:26] <_joe_> one of the two should [08:44:29] <_joe_> arguably apache [08:44:52] <_joe_> my suspect is apache does when *it* decides it's a 304 [08:45:00] <_joe_> not when the fcgi server is saying so [08:45:11] <_joe_> I have no idea if that's the correct behaviour [08:47:12] hmmm https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842676 [08:49:15] again.. that's 2.4.10 [08:49:18] and we're running 2.4.25 [08:49:23] it should include that patch [08:54:28] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) reproducing it with furl shows that php-fpm returns a 304 with a body, and it looks like apache2 2.4.25-3+deb9u7 is failing to handle that [09:03:22] 10Traffic, 10Operations, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) The patch seems to be there, see https://salsa.debian.org/apache-team/apache2/blob/debian/2.4.25-3+deb9u7/modules%2Fproxy%2Fmod_proxy_fcgi.c#L663-675: `lang=C... 
[09:24:37] 10Traffic, 10Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:24:48] 10Traffic, 10Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) p:05Triage→03High [09:28:46] 10Traffic, 10Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:34:49] 10Traffic, 10DNS, 10Operations, 10SRE-tools: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (10fgiunchedi) p:05Normal→03Low >>! In T238727#5679123, @Volans wrote: > @fgiunchedi I think is fair request, but given we're in process of auto-generating all mgmt and... [09:40:46] 10Traffic, 10Operations, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:43:14] 10Traffic, 10Operations, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log ca... [10:02:01] _joe_: BTW, I cannot reproduce on mw2001 with the PoC used in the apache bug [10:02:17] *mwdebug2001 sorry [10:04:01] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) [10:04:26] 10Traffic, 10Operations, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2023.codfw.wmnet'] ` [10:04:59] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) @addshore @wmde-leszek: it seems we are causing this. We should take a look [10:05:49] <_joe_> vgutierrez: try to emit status 304 yourself [10:05:54] <_joe_> and then some text [10:07:00] bingo [10:07:03] that breaks it [10:08:48] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) so it looks like the apache fix doesn't fix every scenario. This small PoC triggers the issue: `lang=php vgutierrez: ok so I /think/ this is an apache bug [10:25:59] <_joe_> well it is [10:26:06] _joe_: I've reopened the bug [10:26:11] <_joe_> but first I want to read hwo the fcgi protocol works [10:26:21] <_joe_> vgutierrez: have you checked if it's fixed in HEAD? [10:26:40] 10Traffic, 10Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) HTTP routing of docker-registry looks good to me now: ` $ curl -s --resolve docker-registry.wikimedia.org:443:208.80.154.224 -v https://docker-registry.w... 
[10:27:04] <_joe_> ema: ahah your evil plan works [10:27:06] <_joe_> kudos [10:27:41] what can I say, I'm a proactive outside-of-the-box thinker [10:28:03] _joe_: nope [10:31:41] <_joe_> vgutierrez: I'm trying to read again mod_proxy_fcgi's code [10:32:22] the 304 check is there [10:32:28] but doesn't seem enough [10:33:28] <_joe_> that piece of code has a while look with a nested labeled goto [10:33:39] <_joe_> I feel like I'm back in the academia [10:34:33] <_joe_> elukey: you're the one here with the best knowledge of apache's code [10:40:33] _joe_ reading, can somebody write a quick summary? [10:40:52] <_joe_> elukey: https://phabricator.wikimedia.org/T237319#5681062 [10:41:06] <_joe_> with that code, apache returns a 304 with a body [10:41:16] <_joe_> which nginx doesn't like at all [10:41:51] <_joe_> now there was https://bz.apache.org/bugzilla/show_bug.cgi?id=57198 which was fixed already, but this is a slightly different case of the same problem [10:42:01] <_joe_> in this case we send apache the status code already [10:42:27] <_joe_> so I suspect the checks that are done here https://salsa.debian.org/apache-team/apache2/blob/debian/2.4.25-3+deb9u7/modules%2Fproxy%2Fmod_proxy_fcgi.c#L663-675 don't apply [10:43:51] I recall that I worked on that code previously [10:43:57] <_joe_> yes [10:44:26] <_joe_> exactly for that problem, we backported that patch IIRC [10:46:01] ahh it was for the 412 https://github.com/apache/httpd/commit/5a53fac7267665609948b54bcec2f034039e4fcc [10:46:40] and also http://svn.apache.org/viewvc?view=revision&revision=1752347 [10:46:48] <_joe_> so I think the problem is we send directly the 304 to apache [10:46:57] <_joe_> it's not determining the status itself [10:47:54] but the status seems grabbed from the FCGI record itself no? [10:48:18] <_joe_> yes [10:48:46] <_joe_> vgutierrez: do you have a furl output for that PoC? [10:49:00] <_joe_> I can generate it otherwise [10:49:02] nope [10:49:06] please go ahead [10:49:14] <_joe_> is your script still somewhere? 
[10:49:43] mwdebug2001 [10:50:18] <_joe_> I am thinking we should start apache within gdb, put breakpoints around dispatch in mod_proxy_fcgi to find out what's going on, the code is driving me crazy [10:51:09] yep I agree, I can work on it if you guys want [10:52:08] <_joe_> Status: 304 Not Modified [10:52:10] <_joe_> X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1 [10:52:12] <_joe_> test [10:52:18] <_joe_> this is what we get from the fcgi daemon [10:52:34] <_joe_> argh irc skipped the double newline [10:53:10] <_joe_> https://dpaste.de/dgzB/raw [10:56:20] I am trying to repro with 2.4.42 but so far no luck [10:56:47] hmm interesting [10:57:16] ah no wait sorry, curl is probably hiding things, didn't think about it [10:57:24] <_joe_> yeah use telnet [10:57:37] <_joe_> I'm pretty sure the bug is there in trunk too [10:57:45] <_joe_> anyways, I have to work on other things too, sorry [10:57:46] hmmm curl is screaming [10:57:51] with verbose mode elukey [10:58:09] elukey: it should say something like "Excess found in a non pipelined read: excess = 13881 url = /wiki/Special:EntityData/Q38646387.json (zero-length body)" [10:58:48] then no I can't repro with your script [11:00:03] <_joe_> yeah I can confirm too [11:00:32] <_joe_> elukey: we're also using php7.3 [11:00:47] <_joe_> if you're on sid [11:01:00] <_joe_> anyways, I'll let you two continue :P [11:01:15] ahahaha [11:02:54] vgutierrez: we could in theory add symbols for httpd on one host and debug via gdb, but it might be tedious [11:03:02] how urgent is this problem? [11:03:09] hmmm [11:03:48] just to step in proxy fcgi code and see [11:04:17] I checked the changelog and from .25 to .42 there were changes, but not related to this basic functionality [11:04:50] we should hand this to ema [11:04:56] and let him do his systemtap magic [11:05:40] is there a host that I can use to test? [11:06:09] I guess all appservers curling a specific url [11:06:17] so I could use a mw2 [11:06:36] curl --resolve www.wikidata.org:80:10.64.32.32 "www.wikidata.org/wiki/Special:EntityData/Q38646387.json" -o /dev/null -v -H 'if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT' [11:06:42] that triggers the issue on any mws appserver [11:06:46] *mw [11:06:56] that one is mw1330 [11:07:18] basically triggering a 304 on wikidata /wiki/Special:EntityData [11:08:35] yep works fine thanks, I am on mw2222 [11:09:27] <_joe_> but you should use the special repro on mwdebug2001 probably :P [11:10:32] <_joe_> btw how nginx responded to a similar bug report: https://trac.nginx.org/nginx/ticket/459 [11:12:05] lol [11:12:30] it's quite interesting that we are only seeing this with wikidata [11:13:09] 10Traffic, 10Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10akosiaris) \o/. Thanks for taking care of this! [11:44:30] I am checking debug logging from mod_proxy_fcgi [11:44:38] and I see a bit of a mess in there [11:47:03] going to add some info in the task [11:50:39] ah no ok my bad [11:56:05] trying now on mwdebug [11:56:26] 10netops, 10Operations, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10jbond) [12:04:24] vgutierrez: did you test the poc on mwdebug?
[12:05:28] yep [12:05:31] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10elukey) On mwdebug2001 this seems to be the mod proxy fcgi debug: ` Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:debug] [pid 4516:tid 1403735935526... [12:05:49] and it triggered the issue [12:06:06] is under /w/vgutierrez.php [12:06:17] nice [12:07:19] how do you curl it? [12:07:37] just hit the thing [12:07:48] yes I meant the URL [12:07:57] http://en.wikipedia.org/w/vgutierrez.php [12:08:07] ack [12:08:08] and resolve to mwdebug ip [12:10:03] on mwdebug2001? [12:10:19] yes [12:10:42] let me grab my laptop... [12:11:14] because I am still getting 404 with curl --resolve en.wikipedia.org:80:10.192.0.98 "http://en.wikipedia.org/w/vgutierrez.php" -v [12:13:17] sorry... I was getting dinner [12:13:31] https://www.irccloud.com/pastebin/4iXKUFtc/ [12:13:39] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Bugreporter) >>! In T238803#5680279, @Koavf wrote: > I object to deletion: as long as we still own the do... [12:13:57] that's from my terminal buffer [12:14:24] yep [12:14:27] I'm getting a 404 now [12:14:31] it looks like the file has been removed [12:14:35] recreating it now [12:15:04] elukey: done, try now [12:15:07] <3 [12:15:09] I'm triggering the bug right now [12:15:23] yep me too! [12:15:27] awesome [12:15:42] it's real! I'm not crazy [12:22:02] 10Traffic, 10DNS, 10Operations: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (10Bugreporter) [12:24:18] elukey: something useful on mod_proxy_fcgi debug now? [12:25:05] vgutierrez: not really, I see a regular 304 handling, but I might need to turn on more logging [12:25:16] also the Status seems to be there, 304 [12:25:21] yep [12:25:31] Status: 304, I've seen that with furl in mw1330 [12:25:48] and after all we get the right status [12:25:53] but somehow it doesn't ignore the body [12:26:35] my next step is to add the -X to httpd (so starts with one process/thread) and see what gdb says [12:27:08] I also have other things to do (and one errand), so I'll keep working on this later on in the evening probably [12:27:13] it is not super urgent right? [12:28:23] not super urgent [12:32:17] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) a:03Jdforrester-WMF [12:57:49] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) Update: given the upcoming follow-up visit to esams next week, I requested a new image from RIPE. I got it today, and it can be found in the same place, as "anchor.nl-ams-as14907-**v2**.img". [13:09:12] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10darthmon_wmde) Is there anything that we can quickly do on wikibase to fix this? if so, please advise what concretely. Thanks! 
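For the record, the PoC file itself was never pasted in channel; going by _joe_'s "emit status 304 yourself and then some text" and the furl output above, it was presumably something as small as the following (the docroot path is a guess, and 10.192.0.98 is the mwdebug2001 address from elukey's curl):

# hypothetical reconstruction of /w/vgutierrez.php: force a 304 status and then
# emit a body anyway, which mod_proxy_fcgi passes through instead of discarding
cat > /srv/mediawiki/docroot/wikipedia.org/w/vgutierrez.php <<'EOF'
<?php
http_response_code( 304 );   // php-fpm sends this as "Status: 304 Not Modified"
echo "test";
EOF

curl --resolve en.wikipedia.org:80:10.192.0.98 'http://en.wikipedia.org/w/vgutierrez.php' -v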
[13:20:21] 10netops, 10Operations, 10ops-esams: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10faidon) [13:23:39] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [13:25:53] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1077.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [13:26:01] 10netops, 10Operations, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) Could be totally different but with @jijiki we 've seen this behavior elsewhere as well. The latest installment is T238789. Per that logstash dashboard, it's the mediawiki's th... [13:28:57] mutante: thanks for taking care of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552161/ [13:36:25] akosiaris: hi! [13:36:39] akosiaris: do we know if class otrs::web is used anywhere? [13:37:12] ema: yes. on mendelevium [13:37:19] it's the OTRS server [13:37:20] ok, it sets $trusted_proxies based on cache::nodes[$site]['text'] [13:37:37] but the actual nodes are cache::nodes[$site]['text'] + cache::nodes[$site]['text_ats'] [13:37:55] and yet we haven't noticed so far, surprisingly [13:38:16] 5 [13:38:16] 6 RemoteIPHeader X-Real-IP [13:38:16] 7 RemoteIPInternalProxy <%= @trusted_proxies.join(' ') %> [13:38:16] 8 [13:38:17] sorry, cache::nodes['text'][$site] [13:38:21] the only use case I found [13:39:10] ah yes, that's ticket.wikimedia.org [13:39:21] mendelevium:/etc/apache2/sites-available/50-ticket-wikimedia-org.conf [13:39:25] the computed value is RemoteIPInternalProxy cp1077.eqiad.wmnet cp1079.eqiad.wmnet cp1081.eqiad.wmnet cp1083.eqiad.wmnet cp1085.eqiad.wmnet cp1087.eqiad.wmnet cp1089.eqiad.wmnet [13:39:33] and indeed the ats nodes aren't there [13:39:35] fixing [13:41:49] * akosiaris tries to think where X-Real-IP was used in OTRS context [13:43:46] ema: quick answer sudo cumin 'C:otrs::web' ;) [13:45:02] volans: thanks! [13:45:59] ema: I think we can remove that part [13:46:16] it doesn't look like it's used [13:47:23] even better [13:47:33] I can't find and Require ip rule that applies to that host [13:47:36] how old is that thing? [13:47:44] why did we add it? [13:47:47] * akosiaris reading git history [13:48:52] ema: ah there we go. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/242772/ [13:49:18] ema: I 'd say proceed as you were. I 'll have to dig deeper to see if T87217 still applies [13:49:19] T87217: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217 [13:49:38] as far as I know, not, but making sure [13:50:19] akosiaris: +1 on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552245/ then? [13:50:21] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1077.eqiad.wmnet'] ` and were **ALL** successful. [13:51:15] +1ed [13:51:21] ty [13:58:50] ema: from the looks of it is still being used in the apache logging (something is not entirely right there though, but alas meeting), so it stays for now. 
[13:59:14] ack [14:53:48] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1079.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [15:18:25] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1079.eqiad.wmnet'] ` and were **ALL** successful. [18:06:41] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10elukey) I have created a Docker image for Debian stretch installing apache2 (same version of the mw app servers) + php7.2-fpm from Sury's repo + the following... [18:10:17] ema: ack. thanks for the follow-up fix. so you just added "[]" as the default to fix it.. yep [18:16:36] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10SBisson) >>! In T238029#5679932, @Neil_P._Quinn_WMF wrote: > @SBisson I looked over the patch and [the schema](https://meta... [18:31:58] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Neil_P._Quinn_WMF) Thank you for the quick responses, @SBisson! >>! In T238029#5682200, @SBisson wrote: >>>! In T238029#5... [19:06:47] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10SBisson) >>! In T238029#5682265, @Neil_P._Quinn_WMF wrote: > [...] > ISO-8601 (i.e. `2019-10-02T16:15:30Z`), please. Done [22:20:32] bblack: is there any good reason why re-implementing all the authdns stuff with clush instead of doing those as cookbooks and take advantage of all the automation we have already in place? I know you want to be able to run them also if we have an issue to reach them or other things are down, but that could perfectly be a fallback way that we run once every now and then to ensure it's working [22:20:38] fine. Another approach could be to have ... [22:20:41] ... cumin installed in some/all the authdns servers with dedicated keys just for those purposes. [22:24:47] yes, mostly independence from DNS resolution failures and other such things [22:25:10] (which needs more other work too, I'd like to have puppet populating /etc/hosts on these servers with each others' IPs as well) [22:25:17] cookbooks and cumin can perfectly use IPs [22:25:28] they can already use the known hosts file as backend [22:25:31] that is populated by puppet [22:25:39] well assuming it isn't borked [22:25:46] same for /etc/hosts [22:26:18] but the source could be anything really [22:26:26] true! 
I'm not sure how we manage not-borking a puppet-deployed ip->host mapping in general for this case, which pulls from dns but doesn't break on broken dns [22:26:50] other than just redundantly putting it in as static hierdata for this case and not relying on dns at all [22:27:12] or not relying on puppet at all if you want to go down the road until the end :) [22:27:18] FYI $ sudo cumin 'K{authdns*}' [22:27:18] 4 hosts will be targeted: [22:27:19] authdns[1001,2001],authdns[1001,2001].wikimedia.org [22:27:31] ? [22:27:50] ah right the double entries, my bad [22:28:05] I don't even know [22:28:07] * volans wondering if he had removed those and is a bug or what... but unrealted [22:28:10] *unrelated [22:28:12] but that's not even the set of authdns hosts [22:28:29] (there's ganeti3003 right now as well) [22:28:37] known hosts file goes by hostname [22:28:42] doesn't know about puppet classes ofc [22:28:45] right [22:28:52] re: clush in general... [22:29:06] clush is just the CLI that ships with python3-clustershell, which I think is what cumin uses as well [22:29:15] yes [22:29:17] and since we're not using puppetdb and already have a keying mechanism for ssh [22:29:22] it just seemed simpler to use it directly [22:30:26] the alternative would be to install the cumin command onto all the authdns, and use it in a similar way to clush [22:30:52] let me put it in another way. Assuming a cookbook would use IPs and not relying on puppetdb or other dependencies, what would be the advantage of running this automation from within the authdns servers and not like the cumin hosts? [22:31:20] or re-phrasing: which failure scenario an ssh from cumin* to authdns* would fail but an ssh withing authdn* would succeed? [22:31:26] *in which [22:32:03] so, the way things operate at present, the flow of this is: [22:32:26] I know we compile on one authdns and pull from that one [22:32:51] right... 
[22:33:08] but we could have the compilation anywhere as long as we have gdnsd installed, or keep the current flow [22:33:19] so we already have to have the keys set up for the sideways git pulls, and we already have to have working ssh between them, and we already have to act on one, then have the rest pull from it [22:33:24] and have a cookbook compile on one host and go to the others in parallel to fetch [22:34:25] also all of this devolves to manual cases, or should anyways [22:35:07] I keep having to put words like "should" because I'm pretty sure even the current mechanisms are flawed in achieving those aims :) [22:35:22] lol, sure [22:35:34] but the intent was that this devolves gracefully to work in cases where almost nothing else is working right [22:36:00] where someone might log into 1/N authdnses and manually commit a fix locally and have the scripts sync that to the rest, for instance [22:36:11] and even if some are unreachable, they can be skipped [22:36:21] and even if recdns is borked everywhere and that's what you're trying to fix [22:36:24] and so-on [22:36:54] and since working recdns relies on working authdns, and everything else in the world breaks with broken recdns, it's not easy to ensure there are no accidental dependencies [22:37:41] with 3 servers, it has been acceptable that this stuff is imperfect, because you can just open 3 shells and hack the data live on all 3 pretty quickly and easily too [22:38:10] but I'd like all of this to be working much better before we have 13 of them in the next transitional stage [22:39:34] there's still a bunch of other missing bits to fix, e.g.: [22:40:09] for this kind of failure scenario it seems much more reliable to me to not rely on internal network at all, and go patch each one of them and have them compile locally independently (as opposed as everyday behaviour) [22:40:11] 1) Putting something like poolcounter in play, similar to discussed for puppet-merge (and obviously, able to put a flag on authdns-update to skip it when everything's borked and you're sure there aren't multiple operators working) [22:40:40] and have that automated from the outside (not manually ofc) [22:40:44] 2) Solving the "oh yeah this all uses the hostnames of the authdnses, so we need e.g. /etc/hosts static data too" [22:41:52] 3) We really also (for related reasons) need some real cluster-membership stuff for shooting down isolated authdns, too [22:42:37] e.g. if esams becomes segmented from the rest of our network, we'd like the esams authdns boxes to detect this and shut down authdns service to the world, so that they don't keep lying with old data while we're pushing related updates to the ones we can reach [22:43:26] can we even assume we can ssh in the first place? we might need to go via mgmt console, depending on the failure scenario [22:43:32] I've thought a bit about the cluster membership logic, it's a little nonstandard what we want, still tossing that around [22:44:30] so the ssh/network thing goes like this: [22:45:33] if we can't ssh to it from inside our network (because the site is isolated), then we really don't need to reach it for updates, and we also want it to die so that the outside world doesn't reach it either. 
[22:45:52] because it's sitting in an isolated cache DC anyways, which can't reliably serve users [22:46:09] fair [22:46:13] so we're just trying to depool the site at that point [22:46:33] and if we can depool it on all the authdns we can reach, and if it can self-detect the breakage and just shut itself down (so that outsiders don't use it for dns lookups), we're good [22:47:09] that's a whole separate problem, but basically for design reasons I don't think we have to consider authdnses that we can't reach on the network at all, for authdns-update purposes [22:47:44] that could be used as an argument for the cookbook way thought ;) [22:48:01] but we do have to consider that because of , we've broken everything else like recdns lookups and we have to push an authdns data change to fix it. [22:48:33] and broken recdns lookups I use as shorthand for "broken just about everything else" [22:48:59] assume gerrit is nonworking/unreachable, assume no hostname lookups work via normal recdns, assume puppet agents can't usefully run, etc, etc [22:49:27] sure, just ssh via ip with no reverse lookup [22:50:02] it's probably reasonable to assume that e.g. there's always an available cumin master that can reach the network. [22:50:31] or at least, that a cumin-master disaster and an authdns disaster don't coincide, since there isn't much in the way of shared potential causes [22:51:24] and, so, yes, if we had something like /etc/hosts hack or similar to reliably make the IPs work without recdns from the cumin masters (and between authdnses) [22:51:47] we could have a cumin cookbook stage through this, replacing basically the "authdns-update" layer of the functionality [22:51:56] assuming that doesn't imply any other dependencies that I'm unaware of [22:52:13] and for the manual-commit sort of thing [22:52:32] you'd still ssh to authdns1001 (or whichever) and do a manual commit, and then use some flagged/different version of the cookbook run to push? [22:53:25] I don't think it's unworkable to move this to a cookbook style of execution [22:53:41] I just don't see any big advantages to killing "authdns-update" at this stage [22:53:53] (with clush available, which is pretty simple) [22:54:05] I was wondering if we could use instead the same idea that was proposed for the autogenerated data from netbox, basically: compile on the cumin-master and expose it via HTTPS internally and have the authdns pull from it (in normal scenario) and [22:54:22] it's like a 58 line shellscript, which really could be replaced with an even shorter python script (and probably will, somewhere in here) [22:54:33] as a fallback pull from any authdns so that it could be run manually from there if needed [22:55:28] yeah but why have a separate mechanism for "fallback"? then we're not confident in our fallback. Did it break since someone last did it because of [22:55:54] if we have to build a mechanism that's supposedly-very-reliable, why not use it all the time and be confident in it? [22:56:28] sure, I still need to understand what is the failure scenario in which a cumin-master cannot talk to the authdns servers and they still can within them [22:56:42] if they can't talk to eqiad or codfw reliably [22:56:44] I donno, I could ask you the same :) [22:56:50] they can't serve traffic reliably [22:57:05] 22:54 < volans> as a fallback pull from any authdns so that it could be run manually from there if needed [22:57:10] ^ why this fallback at all? [22:57:16] what are you falling back from the failure of? 
[22:57:30] exactly, that probably is not needed [22:57:40] it was to be "backward" compatible :) [22:58:13] ok [22:59:01] so, assuming we can solve all the same problems both ways (the static IP data for the authdns hosts, and a mechanism for manual commits, etc) [22:59:35] what's the upside? [22:59:48] just standardization of tooling? [23:00:17] right now I'm not even trying to fix this part, I'm just trying to not make it worse while I work on other things. [23:00:32] integrated !log-ing, one less specialized tool/workflow to remember, could take optional advantage of other stuff if needed going forward [23:00:35] but at some point we have to fix it before I can get to those other things [23:00:51] like I dunno optionally reload the recdns if you're fixing some negative cache stuff [23:00:54] but at this stage, it's a 58 line shellscript that invokes clush and it works :) [23:01:35] also, by putting cumin and cookbooks beneath our authdns workflow, you're taking all of this reliability arguments on for all of it as well [23:02:10] a nice challenge! [23:02:13] if the code for integrated !log makes the tool blow up because of whatever's broken, and that makes cumin fail to be able to push the authdns update in a time of need, it was all for naught [23:02:36] so everything that gets touched by this workflow has to be fail-safe [23:03:11] I know and I'm not saying it's all already like that, needs some auditing [23:03:14] and testing [23:03:31] and of cousre fail-safe means different things in different contexts too [23:03:52] usually fail-safe in most contexts will mean "if something seems amiss, abort and complain loudly so we don't cause a problem" [23:04:13] in this case fail-safe means "ignore everything else less-important than might fail and make sure this thing happens" [23:04:24] indeed [23:07:33] anyways, it sounds like an interesting problem to work on, and as we rely more on cookbooks we probably want all of these things to be super-reliable anyways [23:07:46] or at least have flags for it [23:08:23] and the poolcounter thing we're talking about for e.g. puppet-merge and authdns-update, could be centrally abstracted there as well [23:08:27] yes, doesn't make too much sense to have automation if it's not reliable nor adaptable to failure scenarios [23:08:35] (which would still be a thing, since we have multiple cumin masters) [23:08:58] I'd argue puppet-merge should be moved over like this as well, they're very similar cases [23:09:52] talking about this I was thinking for a central solution for all the "sensible" repos we have [23:10:11] in which we want to have a manual step with diff before making it applied [23:10:12] do you mean for the git pull automation? [23:10:17] yeah [23:10:27] there's a standard pattern there, but also some details have been different in each case, too [23:10:33] I've at least other 4-5 repos to treat like those [23:10:59] each one woul dhave some post-action script or hook that is different [23:11:14] there's probably really two separate-but-related things to break out of these flows [23:11:27] the other approach is to do it via a merge-repo cookbook [23:11:50] or, I donno, it's tricky [23:11:52] I've still many ideas and not a real concrete proposal [23:12:04] so at least for the authdns case, there's two steps: [23:12:34] 1) The "sync from upstream (gerrit) repo" part. Which fetches the remote HEAD, does the diff/confirm, then updates local master HEAD or whatever. 
[23:12:38] I'd say 3: push the SHA1 to a known approved position, compile, distribute [23:12:51] (there's a bunch of common pattern to abstract there too, e.g. puppet-merge's yes/no/multiple, etc) [23:13:03] yes this part I was mostly talking about [23:13:23] and yeah (2) is "then make sure that same HEAD gets merged on N other machines, non-interactively" [23:13:37] right now the authdns-update pattern is weak and easily broken, though [23:13:57] it only works if you assume there aren't multiple persons operating from multiple masters running concurrent authdns-updates :/ [23:14:10] we need a distributed lock, that's for sure [23:14:12] I think the poolcounter sort of thing fixes a lot of those concerns [23:14:24] but then still, it's silly to assume remote HEAD, better to pass the SHA around [23:14:33] absolutely [23:14:35] never head [23:15:03] and then you realize you're reimplementing scap :D [23:15:06] if I weren't worried about manual commits [23:15:29] I'd say the pattern is: given a $remote_origin (e.g. gerrit) and a [set, of, servers] [23:15:56] fetch the latest HEAD from $remote_origin, review it, then use its SHA to update [set, of, servers] directly from $remote_origin using the SHA [23:16:29] maybe that's the default action of the pattern [23:16:40] but then the "fallback" pattern you want is more like: [23:17:07] the user makes a manual commit in , and notes the sha [23:17:27] then runs the thing giving it an argument of --force-origin some_server:some_hash [23:17:29] or something like that [23:17:51] but then it causes differential failure scenarios [23:18:31] the everyday command worked because all [set, of, servers] could fetch from the normal $remote_origin, but now in an emergency they're fetching from another server (the cumin master, or one of themselves), and that might have been broken for weeks because of a ferm rule change or whatever and nobody noticed. [23:19:00] so this is why I'd rather the data flow always be the same, other than the skipped parts. [23:19:30] which makes me rewind and rewrite the first idea to avoid direct fetches from [set, of, servers] from $remote_origin [23:20:09] ... [23:20:14] that got confusing quickly [23:20:27] I was thinking to always fetch from an internal "approved" checkout [23:20:44] but what I mean is, your cumin masters are effectively going to become important central git sources, if we go down this road. [23:20:49] that is replicated in eqiad/codfw, has some distributed lock to ensure we can run on both [23:21:01] those or some repo[12]001 [23:21:14] yay more parts to break :) [23:21:21] now we need those hardcoded in /etc/hosts too :) [23:21:29] but then really, we endup re-implementing basically scap just with a different transport :D [23:21:34] not really [23:21:52] deploy1001 is the internal approved checkout and rsync is the transport [23:21:54] but this may be the big design barrier, right [23:21:55] not that far [23:22:10] all of these things seem to be doing similar things, so they should maybe all be the same thing... [23:22:27] but some implementations of this have a design goal of being as simple and reliable and trustworthy as possible [23:22:42] which is at odds with a design goal of being feature-rich and user-friendly and generic and re-useable, etc... 
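A rough shape of that "default action" pattern, with every path and hostname hypothetical (this is a sketch of the idea under discussion, not the real authdns-update or a real cookbook):

# fetch a candidate commit from the upstream repo, let a human review the diff,
# then make every target check out that exact SHA, never a moving HEAD
origin='https://gerrit.wikimedia.org/r/operations/dns'           # the $remote_origin
servers='authdns1001.wikimedia.org authdns2001.wikimedia.org'    # [set, of, servers], reachable via static /etc/hosts entries

git fetch "$origin" master
sha=$(git rev-parse FETCH_HEAD)
git log -p HEAD.."$sha"           # the yes/no/abort confirmation step lives here

for h in $servers; do
    ssh "$h" "cd /srv/authdns/checkout && git fetch '$origin' master && git checkout -q $sha"
done
# ...followed by the deploy-check / reload step on each host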
[23:22:59] life is hard, I know :-) [23:23:10] there's a lot of things you'd want in an "scap" that you don't want in an "authdns-update" [23:23:24] because if dns is broken you just don't do software deploys, and ask SRE to fix the network so scap works again [23:23:45] authdns-update doesn't have that luxury, and so it must be very simple and very reliable [23:24:00] oh don't get me wrong, I don't want to rewrite scap or have those features, I'm just saying that the design we were talking about could easily resemble something like scap going forward [23:24:24] yeah I'm just saying, this is all a spectrum of related tooling [23:24:57] but I doubt we'd even want scap to be designed for this kind of reliability. cumin, maybe! [23:25:09] but even ignoring authdns-update, we still need it for a bunch of other repos [23:25:12] maybe reliability is the wrong word [23:25:21] fault tolerance? [23:25:31] resilience [23:25:48] to as many failure scenarios as possible [23:28:32] blerg [23:28:55] :-P [23:29:10] things are so complicated in this world :) [23:29:44] so yeah, back in the authdns-update world, these are currently two different layers of functionality [23:30:10] where authdns-update does the upper-level bit (sshing around the set of servers and asking them to do things, and telling them who has the authoritative git HEAD for now) [23:30:19] and authdns-git-pull does the pull/review/sync -related bits [23:30:40] and then deploy-check does all the stuff that happens after that, which is a whole other ball of wax [23:31:18] a well-designed solution on the cumin master made of a git-syncing abstraction and a cookbook could replace basically all of it up to the point of deploy-check [23:32:34] I'm having trouble wrapping my head around the tradeoffs and distinctions between the gerrit server, repo1001, and ssh+git to "local" checkouts on peer servers or the cumin master, etc [23:32:53] the whole reason authdns-update is pulling sideways is to avoid dependency on the relatively-fragile gerrit [23:33:27] if repo1001 is a more-standard git server that's just mirroring from gerrit and acting as a distribution point that's more-reliable (and locally-editable?), that has value I think [23:34:05] where does repo1001 then fall on the reliability spectrum between the "pull sideways from peer checkouts over ssh" and "gerrit" and is the only reason we need a repo1001 in the middle "java"? [23:34:19] all things to think about!
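For the "repo1001 as a git distribution point" flavour the moving parts are small; a sketch, with hypothetical paths, of mirroring gerrit and exporting it read-only (git-daemon here, though the same repo could equally be exposed over https or rsync, as comes up below):

# keep a read-only mirror of the gerrit repo on the distribution host
git clone --mirror https://gerrit.wikimedia.org/r/operations/dns /srv/git/operations-dns.git
# refresh it whenever gerrit is healthy (cron, or the update tooling itself)
git -C /srv/git/operations-dns.git remote update --prune
# export everything under /srv/git read-only to the authdns hosts
git daemon --export-all --base-path=/srv/git --reuseaddr --detach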
[23:34:30] also trust [23:34:41] we want that manual step [23:34:51] outside of gerrit [23:34:55] sure [23:35:27] what I mean is, looking at the value of a repo1001 solution, vs one that just diffs and manual-vets a given SHA1, then tells all the servers to just pull that SHA from gerrit directly [23:36:10] and I guess I'm using "repo1001 solution" to mean a real "git server", [23:36:16] :) [23:36:29] repo1001 could still be just ssh+git to local checkouts on a box named repo1001, or cumin1001 [23:36:35] I guess the only diff is to work on gerrit failure [23:37:00] and allow local modifications [23:37:04] yeah I think I'm comfortable with saying we need something to mitigate gerrit failure, because java and all that [23:37:09] and we need local mods [23:37:30] which means we need a checkout anyways, so it's not a bare repo server really anymore [23:37:37] yes [23:37:42] it could be though [23:37:46] both [23:37:56] ok [23:37:57] you local modify locally, push to the bare and export that [23:38:07] then you have to resolve the conflict when gerrit comes back up [23:38:25] but that step is needed anyway, can vary in how hard is it [23:38:31] that seems needlessly complicated though, unless there's a big perf gain or something [23:38:37] vs just having a checkout and no bare repo [23:38:44] indeed [23:39:01] and the gerrit-syncing part can just always fail in case of divergence [23:39:25] also we most likely don't want random hosts to have the ssh key to connect to the cumin hosts (or maybe it's ok for a given tear down user?) [23:39:26] if we made local edits, and we're at the point where we need to get back in sync, step 1 is go do the appropriate local "git reset --hard ..." or whatever to make it syncable again [23:39:35] yep [23:39:44] so the ssh part, yeah... [23:39:46] so at that point we need to think if pull or push [23:39:55] we could obviously also allow anonymous https transport with a bare repo [23:39:58] we can expose the git via https or rsync daemon [23:40:09] I don't think you can do anon https transport for just a non-bare clone access right? [23:40:18] or actively push it taking advantage of the cumin's key [23:40:29] oh right [23:40:32] hmmm [23:40:51] that could be interesting too.. just have cumin push to the remote clones over ssh [23:42:07] as a side note, the main reason we proposed the HTTPS transport for the autogenerated data from netbox is to be able to use that in CI (and that could be itself debatable) [23:43:35] and yes I think you need the bare to expose it via http [23:43:50] yeah [23:44:14] there's a lot of thorny issues deep in all of this [23:45:15] but yeah once we have a world with two upstream repos (netbox + ops/dns), I think the git mechanisms and deploy-check should be easily adaptable [23:45:36] (to put those contents in different places and blend them appropriately) [23:46:04] I think we probably should have some differentiation in how the update is triggered though [23:46:55] so that a deployer of an ops/dns manual change, and a pending netbox-driven git change, both going through authdns-update or its replacement... 
[23:47:11] you don't want either to be tricked into deploying the other, it gets really confusing and racy [23:47:33] yes [23:47:38] but deploy-check.py will still need to mix the contents of both on each deploy [23:48:07] but the authdns-update (or whatever) for pushing one repo's change should stick with the last-merged HEAD of the other, even if more updates are pending in some sense) [23:48:19] which we don't keep good track of at present [23:48:52] anyways, I think that can be fixed with some CLI flags for the two cases, etc [23:49:01] gotta run! :) [23:49:03] the blend though wouldn't be the $INCLUDE? [23:49:10] sure, and I should go to sleep :) [23:49:18] by blending [23:49:56] I mean: deploy-check needs to take some version of both repos, copy them into a mock /tmp/testme/etc/gdnsd/zones/ , and execute the automated check that it doesn't break, and then push that over the real /etc/gdnsd/zones/ at the end. [23:50:17] the automated check will include the "gdnsd checkconf", so the whole dataset including working $INCLUDEs has to be there [23:50:30] nite! :) [23:51:03] sure [23:51:06] right