[05:04:06] 10Traffic, 10Operations, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I've depooled cp5007 to conduct some experiments, I've captured the varnish-fe traffic with the following tcpdu...
[05:15:29] bblack: remember our talk about ATS getting responses on "closed" TCP connections? I reproduced it on cp5007: https://phabricator.wikimedia.org/T234887#5597399
[05:17:41] vgutierrez: the way I read that trace is:
[05:18:18] lines 75-79 are all "normal" (opens a conn and makes a POST request, which v-fe ACKs)
[05:18:32] 80 12.942536 10.132.0.107 → 10.132.0.107 TCP 66 8349 → 3123 [FIN, ACK] Seq=441 Ack=1 Win=44032 Len=0 TSval=2735499196 TSecr=2735498851
[05:18:35] 81 12.989195 10.132.0.107 → 10.132.0.107 TCP 66 3123 → 8349 [ACK] Seq=1 Ack=442 Win=45056 Len=0 TSval=2735499208 TSecr=2735499196
[05:19:44] ^ in this pair of packets (which is after a ~1.4s gap in traffic), ATS seems to be sending a FIN (with a re-ack of seq1, not sure there...), meaning ATS won't send more traffic
[05:20:19] but v-fe's response in 81 simply ACKs the FIN, meaning v-fe is saying "I see that you're not going to send me anything else, but I might still send you more stuff"... half-close
[05:20:33] then 30s later:
[05:20:39] 202 42.009912 10.132.0.107 → 10.132.0.107 HTTP 2577 HTTP/1.1 503 Backend fetch failed (text/html)
[05:20:42] 203 42.009921 10.132.0.107 → 10.132.0.107 TCP 54 8349 → 3123 [RST] Seq=442 Win=0 Len=0
[05:20:48] ^ v-fe sends more stuff, and ATS RSTs in response
[05:21:01] the connection was never fully closed though
[05:21:24] so ATS is failing to properly close the connection?
[05:21:29] arguably, if the reason for ATS's FIN was that the whole thing was in an error state, it should've RST'd right away
[05:21:49] ATS cannot force v-fe to cleanly-close its side other than by doing a RST
[05:22:01] so ATS aborts the transaction at that point (packet 80)
[05:22:20] it's perfectly valid for the client (ATS) to send an HTTP request and then a FIN to close the sending side, but still wait on the half-closed connection for the server (v-fe) to send a response before it also closes with its own FIN
[05:22:41] from v-fe's point of view, the above is what things look like up through 80
[05:23:01] if ATS really wants to say "this is all messed up and I don't even want a response", it should RST way back at line 80
[05:23:19] hmmm
[05:23:22] got you
[05:23:34] (or alternatively, it could half-close and pointlessly drain the response for a clean close, but then why RST at the end if you're doing that, too)
[05:23:53] but also, I thought usually that FIN on line 80 would come without an ACK there
[05:24:01] it's odd
[05:24:11] I'll check in the source code how ATS handles the transaction abort towards the origin
[05:24:25] normally the sequence is the active closer sends "FIN", then the other side sends "FIN+ACK" to close up their own side and ack the FIN
[05:24:35] or you can have "FIN", "ACK", then later the other "FIN"
[05:25:01] given that ATS already ACK'd Seq1 back on line 77
[05:25:12] I don't see why its FIN is also carrying an ACK of Seq1 as well
[05:25:16] maybe my TCP is rusty
[05:27:18] anyways, the FIN+ACK mystery like I said might be me being rusty
[05:27:45] otherwise this all looks like varnish being reasonable and ATS being... well... it's hard to say unreasonable, but certainly open to question :)
[05:28:04] oh hello bblack :)
[05:28:09] still up?
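
(One aside on the FIN+ACK "mystery": every TCP segment after the initial SYN carries the ACK flag, so a FIN essentially always shows up in a capture as [FIN, ACK] repeating the last acknowledged sequence number; packet 80 is ordinary in that respect.)

The half-close behavior discussed above can be reproduced with a minimal, self-contained Python sketch (illustrative only, not WMF code; host and port are arbitrary): the client sends a request and then FINs its write side with shutdown(SHUT_WR), and the server is still free to deliver a response on its own half of the connection afterwards.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 18080  # arbitrary local port for the demo

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            # read the request until the peer's FIN (recv() returns b'' at EOF)
            while conn.recv(4096):
                pass
            # like v-fe in the trace: our direction is still open after the
            # peer's half-close, so a late response is perfectly legitimate
            time.sleep(1)
            conn.sendall(b"HTTP/1.1 503 Backend fetch failed\r\n\r\n")

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)  # crude wait for the listener to come up

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"POST / HTTP/1.1\r\nHost: x\r\nContent-Length: 0\r\n\r\n")
    cli.shutdown(socket.SHUT_WR)  # emits the FIN: "I won't send anything else"
    print(cli.recv(4096))         # ...but the response can still arrive
```

If the client instead called close() right after shutdown() and never read, the server's late 503 would be answered by the client's kernel with an RST, which is essentially packets 202/203 in the trace.
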
[05:28:11] sort of
[05:28:16] need anything? :)
[05:28:41] just coffee :)
[05:30:03] ok I'm stepping back away then, enjoy it :)
[05:30:21] see you!
[06:49:06] alright, time for another esams depool, we're going to do some heavy lifting so not 100% sure when it will come back, but probably down all morning local time
[07:04:12] ack
[07:37:04] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[08:33:52] XioNoX: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/545385/ is sitting there if you want to use it as well, it will free up capacity in eqiad like before (shift North America load over to codfw, etc)
[08:34:07] thx!
[08:34:15] esams is pretty far into its daily ramp-up at this point, might avoid saturation, etc
[08:34:18] let me know if we start saturating something
[08:34:26] or feel free to merge it
[08:34:29] I won't be here, I'm going back to sleep, but will be back later :)
[08:35:53] ok, no pb!
[09:06:38] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[09:28:57] 10Traffic, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) p:05Normal→03Low Adding the few project tags we are using nowadays. Lowering priority since clearly w...
[09:46:49] 10Traffic, 10Operations, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) >>! In T234803#5583888, @BBlack wrote: > Notes from IRC, etc: > > The current patch (merging shortly: https://gerrit.wikimedi...
[09:54:34] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Gilles) @elitre it should be its own task, since it's a PDF failing to render and this t...
[10:21:47] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Elitre) OK, will file separately then, TY
[10:34:35] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Resolved→03Open
[10:35:33] 10Traffic, 10Operations, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug in ATS: after a config reload the lua script loses the argtb
[10:44:07] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Elitre)
[10:46:18] 10HTTPS, 10Traffic, 10DBA, 10Operations, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (10jcrespo)
[13:45:37] XioNoX: is our ETA for repool "very soon", or "still a while from now"?
[13:45:45] NTT link in eqiad hot again
[13:46:09] can be repooled now even, I'm working on the mgmt network now but the prod one is good
[13:46:28] or in 30min I should be 100% good, tracking down some issues
[13:52:49] bblack: I've added https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production based on yesterday's partial success with kibana :) Can you confirm it's somewhat sane?
[13:59:46] godog: 'ATS backend availability' added https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=7&fullscreen&orgId=1&from=now-1h&to=now
[14:02:25] ema: there's no requirement to merge the changes together at the same time, just that the hieradata/conftool-data puppet commit gets merged up and applied to all authdnses (not just the one you're operating from) before the dns side commit is authdns-update'd
[14:02:45] they can be 5 minutes apart or 20 years apart, but the order matters, and the puppet agent must have run on all the authdns first
[14:03:07] ah right, the change needs to be applied to all the authdns of course (because authdns-update will apply the change to them all)
[14:03:43] (and unwinding the dnsdisc part goes in reverse - revert the dns change, authdns-update that everywhere, then revert the puppet change)
[14:04:47] I have a related question -- are there any existing examples of cross-repo CI?
[14:06:00] :)
[14:06:20] I'm not fond of the cross-repo nature of this at all
[14:06:40] but the alternatives so far are worse, all constraints considered
[14:07:04] I don't know if there are existing examples of x-repo CI
[14:07:09] yeah, agreed; I was just wondering if there was a way to check that people get the geoip/metafo thing right
[14:07:28] well, it fails if they don't
[14:07:38] so there's that :)
[14:07:46] I've noticed! :)
[14:08:04] the real obstacle IMHO for this kind of thing is that we cannot directly modify CI scripts to do custom stuff in an easy way
[14:08:31] well, we've more or less solved that for the ops/dns case, as the jenkins CI just invokes a script controlled by the repo itself
[14:08:36] so we end up having CI set up the repo and run a single command (like tox) and we try to do as much as possible within that constraint
[14:08:52] ema: good times! thanks
[14:08:55] bblack: that's not enough, you've not enabled the delta diff check for the zone validator
[14:09:04] exactly for this reason, you cannot control the repo itself easily
[14:09:06] hmmm true
[14:09:14] (checkout master, run, checkout patch, run, delta, etc...)
[14:09:42] but we can do that
[14:11:04] e.g. in deploy-check.py or similar, we could check the origin's master hash and the local clone's master hash; if they match, this is running as a final check on an already-merged change, and if they don't, assume we're CI-ing something not yet merged and do the diff stuff
[14:11:30] or something along those lines
[14:11:44] but really only jenkins has a consistent unracy view of such things
[14:12:14] doing it in the deploy-check executed via authdns-update on the end hosts would have some races (where you're deploying change X while someone's merging change X+1 over in gerrit)
[14:13:05] I guess you could figure that out empirically too, though (by checking for any parent/child relationship between the two hashes)
[14:13:18] (in the origin, I mean)
[14:14:03] sure
[14:15:16] or we can stop thinking of all of this as a unified thing
[14:15:49] the checks we have now are "functional" checks - does the final data at this commit pass sanity/preload checks, or not...
and gets executed both by CI and various deployment tools on live hosts, etc
[14:16:40] the "diff them and look for craziness in the diff" part could be a separate execution of the validator that happens only from the jenkins docker, before or after the deploy-check.py stuff
[14:16:48] in general yes, but for the zone_validator, until we fix all the backlog of warnings, having the delta would be useful IMHO
[14:17:06] the cr1-eqiad NTT link is still >8Gbps, should we repool esams or move NA traffic to codfw?
[14:17:08] oh right, it's delta-of-warnings
[14:17:24] XioNoX: ? on ready to repool?
[14:17:50] yeah esams can be repooled
[14:18:06] ok I can do that
[14:18:09] I'm still working on some mgmt issues
[14:18:28] I have it going
[14:36:13] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3007.esams.wmnet` - cp3007.esams.wmnet (**PASS**) - Downtimed host on Icin...
[14:38:08] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3008.esams.wmnet` - cp3008.esams.wmnet (**FAIL**) - Downtimed host on Icin...
[14:40:23] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3010.esams.wmnet` - cp3010.esams.wmnet (**PASS**) - Downtimed host on Icin...
[14:41:53] 10Traffic, 10Operations, 10decommission, 10ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ema) >>! In T208585#5599005, @ops-monitoring-bot wrote: > - **Failed to power off, manual intervention required**: Remote IPMI for cp3008.mgmt.esams.wmnet failed (exit=...
[15:30:34] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10Papaul) @robh there is no dns3003, the last server is bast3003, so only dns300[1-2]
[15:32:10] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10BBlack) confirming above - @papaul is correct. The total set of new esams Linux boxes AFAIK is: 16x caches, 3x LVS, 2x DNS, 1x Bastion, 3x Ganeti.
[15:33:25] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:33:40] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:53:27] <_joe_> heads up - gonna restart pybal on the codfw/eqiad loadbalancers
[15:53:31] <_joe_> for a lvs change
[15:57:55] 10Traffic, 10Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) [[ https://logstash.wikimedia.org/goto/0493475ebf5b04d14b38741e3c75261a | And now it's dropped off for a few hours. ]]
[16:30:52] 10Traffic, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#5597740, @hashar wrote: > Lowering priority since clearly we have no bandwidth to work on addin...
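
A rough sketch of how the ideas above (the origin-hash comparison from 14:11, the ancestry check from 14:13, and the delta-of-warnings run from 14:09/14:16) could fit together in something like deploy-check.py. All names here are illustrative assumptions, not the real ops/dns tooling: `./utils/zone_validator.py` in particular is a hypothetical entry point, and the ls-remote call assumes network access to the origin plus a clone recent enough to contain origin's master commit.

```python
import subprocess

def git(*args):
    """Run a git command in the current repo and return stripped stdout."""
    return subprocess.check_output(("git",) + args, text=True).strip()

def validator_warnings():
    """Run the (hypothetical) zone validator at the current checkout and
    collect its warning lines as a set."""
    out = subprocess.run(("./utils/zone_validator.py",),
                         capture_output=True, text=True)
    return {l for l in out.stdout.splitlines() if "WARNING" in l}

patch = git("rev-parse", "HEAD")
origin_master = git("ls-remote", "origin", "refs/heads/master").split()[0]

if patch == origin_master:
    # hashes match: final deploy-check on an already-merged change,
    # so the plain functional checks are all that's needed
    print("post-merge check, skipping warning delta")
else:
    # hashes differ: assume pre-merge CI. Empirically, origin/master
    # should be an ancestor of the patch; otherwise someone merged X+1
    # while we're looking at X (the race mentioned above).
    if subprocess.call(("git", "merge-base", "--is-ancestor",
                        origin_master, patch)) != 0:
        raise SystemExit("origin/master moved: possible merge race")
    git("checkout", "--quiet", origin_master)
    baseline = validator_warnings()            # the pre-existing backlog
    git("checkout", "--quiet", patch)
    introduced = validator_warnings() - baseline
    if introduced:
        raise SystemExit("new warnings:\n" + "\n".join(sorted(introduced)))
```

The else branch is exactly the "checkout master, run, checkout patch, run, delta" sequence sketched at 14:09:14; as noted, only jenkins can run it race-free, since on a live host origin/master can move underneath the deploy.
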
[16:35:48] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 15: cp3055: xe-5/0/15 cp3056: xe-5/0/16 cp3057: xe-5/0/17 cp3058: xe-5/0/18 cp3059: xe-5/0...
[16:37:12] 10Traffic, 10DNS, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Papaul) @BBlack dns3002 racked in rack 15, switch information xe-5/0/14
[16:42:37] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:42:58] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:43:40] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:45:03] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3005 switch information xe-5/0/12
[17:24:06] https://www.internetexchangemap.com/
[17:29:11] thx, that's cool!
[18:03:44] 10netops, 10Operations, 10ops-esams: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (10ayounsi) 05Open→03Resolved Done.
[18:03:47] 10Traffic, 10netops, 10Operations, 10Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi)
[18:03:48] 10netops, 10Operations, 10ops-esams: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi)
[18:04:03] 10netops, 10Operations, 10ops-esams: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi) 05Open→03Resolved a:03ayounsi Done.