[05:04:06] 10Traffic, 10Operations, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I've depooled cp5007 to conduct some experiments, I've captured the varnish-fe traffic with the following tcpdu...
[05:15:29] bblack: remember our talk about ATS getting responses on "closed" TCP connections? I reproduced it on cp5007: https://phabricator.wikimedia.org/T234887#5597399
[05:17:41] vgutierrez: the way I read that trace is:
[05:18:18] lines 75-79 are all "normal" (opens a conn and makes a POST request, which v-fe ACKs)
[05:18:32] 80 12.942536 10.132.0.107 → 10.132.0.107 TCP 66 8349 → 3123 [FIN, ACK] Seq=441 Ack=1 Win=44032 Len=0 TSval=2735499196 TSecr=2735498851
[05:18:35] 81 12.989195 10.132.0.107 → 10.132.0.107 TCP 66 3123 → 8349 [ACK] Seq=1 Ack=442 Win=45056 Len=0 TSval=2735499208 TSecr=2735499196
[05:19:44] ^ in this pair of packets (which is after a ~1.4s gap in traffic), ATS seems to be sending a FIN (with a re-ack of seq1, not sure there...), meaning ATS won't send more traffic
[05:20:19] but v-fe's response in 81 simply ACKs the FIN, meaning v-fe is saying "I see that you're not going to send me anything else, but I might still send you more stuff"... half-close
[05:20:33] then 30s later:
[05:20:39] 202 42.009912 10.132.0.107 → 10.132.0.107 HTTP 2577 HTTP/1.1 503 Backend fetch failed (text/html)
[05:20:42] 203 42.009921 10.132.0.107 → 10.132.0.107 TCP 54 8349 → 3123 [RST] Seq=442 Win=0 Len=0
[05:20:48] ^ v-fe sends more stuff, and ATS RSTs in response
[05:21:01] the connection was never fully closed though
[05:21:24] so ATS is failing to properly close the connection?
[05:21:29] arguably, if the reason for ATS's FIN was that the whole thing was in an error state, it should've RST'd right away
[05:21:49] ATS cannot force v-fe to cleanly-close its side other than by doing a RST
[05:22:01] so ATS aborts the transaction at that point (packet 80)
[05:22:20] it's perfectly valid for the client (ATS) to send an HTTP request and then a FIN to close the sending side, but still wait on the half-closed connection for the server (v-fe) to send a response before it also closes with its own FIN
[05:22:41] from v-fe's point of view, the above is what things look like up through 80
[05:23:01] if ATS really wants to say "this is all messed up and I don't even want a response", it should RST way back at line 80
[05:23:19] hmmm
[05:23:22] got you
[05:23:34] (or alternatively, it could half-close and pointlessly drain the response for a clean close, but then why RST at the end if you're doing that, too)
[05:23:53] but also, I thought usually that FIN on line 80 would come without an ACK there
[05:24:01] it's odd
[05:24:11] I'll check in the source code how ATS handles the transaction abort towards the origin
[05:24:25] normally the sequence is the active closer sends "FIN", then the other side sends "FIN+ACK" to close up their own side and ack the FIN
[05:24:35] or you can have "FIN", "ACK", then later the other "FIN"
[05:25:01] given that ATS already ACK'd Seq1 back on line 77
[05:25:12] I don't see why its FIN is also carrying an ACK of Seq1 as well
[05:25:16] maybe my TCP is rusty
[05:27:18] anyways, the FIN+ACK mystery like I said might be me being rusty
[05:27:45] otherwise this all looks like varnish being reasonable and ATS being... well... it's hard to say unreasonable, but certainly open to question :)
[05:28:04] oh hello bblack :)
[05:28:09] still up?
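
(One aside on the FIN+ACK "mystery": every TCP segment after the initial SYN carries the ACK flag, so a FIN essentially always shows up in a capture as [FIN, ACK] repeating the last acknowledged sequence number; packet 80 is ordinary in that respect.)

The half-close behavior discussed above can be reproduced with a minimal, self-contained Python sketch (illustrative only, not WMF code; host and port are arbitrary): the client sends a request and then FINs its write side with shutdown(SHUT_WR), and the server is still free to deliver a response on its own half of the connection afterwards.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 18080  # arbitrary local port for the demo

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            # read the request until the peer's FIN (recv() returns b'' at EOF)
            while conn.recv(4096):
                pass
            # like v-fe in the trace: our direction is still open after the
            # peer's half-close, so a late response is perfectly legitimate
            time.sleep(1)
            conn.sendall(b"HTTP/1.1 503 Backend fetch failed\r\n\r\n")

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)  # crude wait for the listener to come up

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"POST / HTTP/1.1\r\nHost: x\r\nContent-Length: 0\r\n\r\n")
    cli.shutdown(socket.SHUT_WR)  # emits the FIN: "I won't send anything else"
    print(cli.recv(4096))         # ...but the response can still arrive
```

If the client instead called close() right after shutdown() and never read, the server's late 503 would be answered by the client's kernel with an RST, which is essentially packets 202/203 in the trace.
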
[05:28:11] sort of
[05:28:16] need anything? :)
[05:28:41] just coffee :)
[05:30:03] ok I'm stepping back away then, enjoy it :)
[05:30:21] see you!
[06:49:06] alright, time for another esams depool, we're going to do some heavy lifting so not 100% sure when it will come back, but probably down all morning local time
[07:04:12] ack
[07:37:04] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[08:33:52] XioNoX: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/545385/ is sitting there if you want to use it as well, it will free up capacity in eqiad like before (shift North America load over to codfw, etc)
[08:34:07] thx!
[08:34:15] esams is pretty far into its daily ramp-up at this point, might avoid saturation, etc
[08:34:18] let me know if we start saturating something
[08:34:26] or feel free to merge it
[08:34:29] I won't be here, I'm going back to sleep, but will be back later :)
[08:35:53] ok, no pb!
[09:06:38] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[09:28:57] 10Traffic, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) p:05Normal→03Low Adding the few project tags we are using nowadays. Lowering priority since clearly w...
[09:46:49] 10Traffic, 10Operations, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) >>! In T234803#5583888, @BBlack wrote: > Notes from IRC, etc: > > The current patch (merging shortly: https://gerrit.wikimedi...
[09:54:34] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Gilles) @elitre it should be its own task, since it's a PDF failing to render and this t...
[10:21:47] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Elitre) OK, will file separately then, TY
[10:34:35] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Resolved→03Open
[10:35:33] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug in ATS: after a config reload the lua script loses the argtb
[10:44:07] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Elitre)
[10:46:18] 10HTTPS, 10Traffic, 10DBA, 10Operations, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (10jcrespo)
[13:45:37] XioNoX: is our ETA for repool "very soon", or "still a while from now"?
[13:45:45] NTT link in eqiad hot again
[13:46:09] can be repooled now even, I'm working on the mgmt network now but the prod one is good
[13:46:28] or in 30min I should be 100% good, tracking down some issues
[13:52:49] bblack: I've added https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production based on yesterday's partial success with kibana :) Can you confirm it's somewhat sane?
[13:59:46] godog: 'ATS backend availability' added https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=7&fullscreen&orgId=1&from=now-1h&to=now
[14:02:25] ema: there's no requirement to merge the changes together at the same time, just that the hieradata/conftool-data puppet commit gets merged up and applied to all authdnses (not just the one you're operating from) before the dns side commit is authdns-update'd
[14:02:45] they can be 5 minutes apart or 20 years apart, but the order matters, and the puppet agent must have run on all the authdns first
[14:03:07] ah right, the change needs to be applied to all the authdns of course (because authdns-update will apply the change to them all)
[14:03:43] (and unwinding the dnsdisc part goes in reverse - revert the dns change, authdns-update that everywhere, then revert the puppet change)
[14:04:47] I have a related question -- are there any existing examples of cross-repo CI?
[14:06:00] :)
[14:06:20] I'm not fond of the cross-repo nature of this at all
[14:06:40] but the alternatives so far are worse, all constraints considered
[14:07:04] I don't know if there are existing examples of x-repo CI
[14:07:09] yeah, agreed; I was just wondering if there was a way to check that people get the geoip/metafo thing right
[14:07:28] well, it fails if they don't
[14:07:38] so there's that :)
[14:07:46] I've noticed! :)
[14:08:04] the real obstacle IMHO for this kind of thing is that we cannot directly modify CI scripts to do custom stuff in an easy way
[14:08:31] well, we've more or less solved that for the ops/dns case, as the jenkins CI just invokes a script controlled by the repo itself
[14:08:36] so we end up having CI set up the repo and run a single command (like tox) and we try to do as much as possible within that constraint
[14:08:52] ema: good times! thanks
[14:08:55] bblack: that's not enough, you've not enabled the delta diff check for the zone validator
[14:09:04] exactly for this reason, you cannot control the repo itself easily
[14:09:06] hmmm true
[14:09:14] (checkout master, run, checkout patch, run, delta, etc...)
[14:09:42] but we can do that
[14:11:04] e.g. in deploy-check.py or similar, we could check the origin's master hash and the local clone's master hash; if they match, this is running as a final check on an already-merged change, and if they don't, assume we're CI-ing something not yet merged and do the diff stuff
[14:11:30] or something along those lines
[14:11:44] but really only jenkins has a consistent unracy view of such things
[14:12:14] doing it in the deploy-check executed via authdns-update on the end hosts would have some races (where you're deploying change X while someone's merging change X+1 over in gerrit)
[14:13:05] I guess you could figure that out empirically too, though (by checking for any parent/child relationship between the two hashes)
[14:13:18] (in the origin, I mean)
[14:14:03] sure
[14:15:16] or we can stop thinking of all of this as a unified thing
[14:15:49] the checks we have now are "functional" checks - does the final data at this commit pass sanity/preload checks, or not...
and gets executed both by CI and various deployment tools on live hosts, etc
[14:16:40] the "diff them and look for craziness in the diff" part could be a separate execution of the validator that happens only from the jenkins docker, before or after the deploy-check.py stuff
[14:16:48] in general yes, but for the zone_validator, until we fix all the backlog of warnings, having the delta would be useful IMHO
[14:17:06] the cr1-eqiad NTT link is still >8Gbps, should we repool esams or move NA traffic to codfw?
[14:17:08] oh right, it's delta-of-warnings
[14:17:24] XioNoX: ? on ready to repool?
[14:17:50] yeah esams can be repooled
[14:18:06] ok I can do that
[14:18:09] I'm still working on some mgmt issues
[14:18:28] I have it going
[14:36:13] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3007.esams.wmnet` - cp3007.esams.wmnet (**PASS**) - Downtimed host on Icin...
[14:38:08] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3008.esams.wmnet` - cp3008.esams.wmnet (**FAIL**) - Downtimed host on Icin...
[14:40:23] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3010.esams.wmnet` - cp3010.esams.wmnet (**PASS**) - Downtimed host on Icin...
[14:41:53] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ema) >>! In T208585#5599005, @ops-monitoring-bot wrote: > - **Failed to power off, manual intervention required**: Remote IPMI for cp3008.mgmt.esams.wmnet failed (exit=...
[15:30:34] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10Papaul) @robh there is no dns3003, the last server is bast3003, so only dns300[1-2]
[15:32:10] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10BBlack) confirming above - @papaul is correct. The total set of new esams Linux boxes AFAIK is: 16x caches, 3x LVS, 2x DNS, 1x Bastion, 3x Ganeti.
[15:33:25] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:33:40] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:53:27] <_joe_> heads up - gonna restart pybal on the codfw/eqiad loadbalancers
[15:53:31] <_joe_> for a lvs change
[15:57:55] 10Traffic, 10Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) [[ https://logstash.wikimedia.org/goto/0493475ebf5b04d14b38741e3c75261a | And now it's dropped off for a few hours. ]]
[16:30:52] 10Traffic, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#5597740, @hashar wrote: > Lowering priority since clearly we have no bandwidth to work on addin...
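
A rough sketch of how the ideas above (the origin-hash comparison from 14:11, the ancestry check from 14:13, and the delta-of-warnings run from 14:09/14:16) could fit together in something like deploy-check.py. All names here are illustrative assumptions, not the real ops/dns tooling: `./utils/zone_validator.py` in particular is a hypothetical entry point, and the ls-remote call assumes network access to the origin plus a clone recent enough to contain origin's master commit.

```python
import subprocess

def git(*args):
    """Run a git command in the current repo and return stripped stdout."""
    return subprocess.check_output(("git",) + args, text=True).strip()

def validator_warnings():
    """Run the (hypothetical) zone validator at the current checkout and
    collect its warning lines as a set."""
    out = subprocess.run(("./utils/zone_validator.py",),
                         capture_output=True, text=True)
    return {l for l in out.stdout.splitlines() if "WARNING" in l}

patch = git("rev-parse", "HEAD")
origin_master = git("ls-remote", "origin", "refs/heads/master").split()[0]

if patch == origin_master:
    # hashes match: final deploy-check on an already-merged change,
    # so the plain functional checks are all that's needed
    print("post-merge check, skipping warning delta")
else:
    # hashes differ: assume pre-merge CI. Empirically, origin/master
    # should be an ancestor of the patch; otherwise someone merged X+1
    # while we're looking at X (the race mentioned above).
    if subprocess.call(("git", "merge-base", "--is-ancestor",
                        origin_master, patch)) != 0:
        raise SystemExit("origin/master moved: possible merge race")
    git("checkout", "--quiet", origin_master)
    baseline = validator_warnings()            # the pre-existing backlog
    git("checkout", "--quiet", patch)
    introduced = validator_warnings() - baseline
    if introduced:
        raise SystemExit("new warnings:\n" + "\n".join(sorted(introduced)))
```

The else branch is exactly the "checkout master, run, checkout patch, run, delta" sequence sketched at 14:09:14; as noted, only jenkins can run it race-free, since on a live host origin/master can move underneath the deploy.
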
[16:35:48] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 15: cp3055: xe-5/0/15 cp3056: xe-5/0/16 cp3057: xe-5/0/17 cp3058: xe-5/0/18 cp3059: xe-5/0...
[16:37:12] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Papaul) @BBlack dns3002 racked in rack 15, switch information xe-5/0/14
[16:42:37] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:42:58] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:43:40] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:45:03] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3005 switch information xe-5/0/12
[17:24:06] https://www.internetexchangemap.com/
[17:29:11] thx, that's cool!
[18:03:44] 10netops, 10Operations, 10ops-esams: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (10ayounsi) 05Open→03Resolved Done.
[18:03:47] 10Traffic, 10netops, 10Operations, 10Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi)
[18:03:48] 10netops, 10Operations, 10ops-esams: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi)
[18:04:03] 10netops, 10Operations, 10ops-esams: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi) 05Open→03Resolved a:03ayounsi Done.