[03:05:42] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) Resolved→Open and..it is DOWN again 23:03 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, ex...
[03:10:25] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) - Maintenance window: Start Date and Time: 2019-May-23 03:00 UTC End Date and Time: 2019-May-23 07:00 UTC Action and Reason: Emergency hardware work needed to restore traffic. We will rese...
[04:50:41] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) Open→Resolved 23:21 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 23:22 <+icin...
[09:35:24] Traffic, DNS, Operations: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (Maintenance_bot)
[10:48:03] Traffic, Operations: cp3031: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (Maintenance_bot)
[10:51:05] Traffic, Operations, ops-codfw: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Maintenance_bot)
[13:33:22] Traffic, DC-Ops, Operations, decommission, ops-eqiad: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[13:39:55] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[13:42:04] <_joe_> bblack: don't varnish/nginx have timeouts for requests going to restbase?
[13:43:04] they do, but they're relatively-weak enforcers
[13:43:32] <_joe_> I was looking at https://phabricator.wikimedia.org/T224222
[13:43:35] it doesn't have an overall timeout for a whole backend-facing request. It has a TTFB timeout, and it has a timeout for idleness between received bytes (of header or body or whatever)
[13:44:19] <_joe_> also, why doesn't http redirect to https?
[13:44:47] it does for me
[13:44:51] <_joe_> oh no it does, yes
[13:44:59] <_joe_> stupid new firefox
[13:45:10] does it really never time out?
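An aside on why those two timeouts are "relatively-weak enforcers": as long as the backend keeps delivering bytes within each window, neither timeout ever fires, so total response time is unbounded. A minimal Python sketch, using the varnish-be → applayer values quoted a few lines below; the slowly-trickling backend is an invented worst case for illustration, not observed data:

    # Sketch (not production code): per-byte timeouts don't bound total time.
    # Timeout values match the varnish-be -> applayer settings quoted below;
    # the trickling-backend behavior is hypothetical.
    FIRST_BYTE_TIMEOUT = 63.0     # seconds allowed before the first response byte
    BETWEEN_BYTES_TIMEOUT = 31.0  # max idle gap between subsequent bytes

    def transfer_time(gaps):
        """Total transfer time for a response whose bytes arrive after the
        given inter-byte gaps, or raise if either timeout would fire."""
        elapsed = 0.0
        for i, gap in enumerate(gaps):
            limit = FIRST_BYTE_TIMEOUT if i == 0 else BETWEEN_BYTES_TIMEOUT
            if gap > limit:
                raise TimeoutError(f"timed out after {elapsed + limit:.0f}s")
            elapsed += gap
        return elapsed

    # One byte every 30s stays under both limits indefinitely:
    print(transfer_time([30.0] * 50))  # -> 1500.0 seconds (25 minutes), no timeout fires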
[13:45:14] I have a curl running now to see
[13:46:04] there are ways that timeouts stack up non-intuitively, esp when looking at webrequest, slowlog, etc
[13:46:11] (varnishslowlog I mean, in logstash)
[13:46:28] because of internal and "external" retries, basically
[13:47:47] so for varnish-be -> applayer, it should be set to:
[13:47:49] connect_timeout: '3s'
[13:47:49] first_byte_timeout: '63s'
[13:47:49] between_bytes_timeout: '31s'
[13:48:02] varnish-fe -> varnish-be does:
[13:48:04] connect_timeout: '3s'
[13:48:05] first_byte_timeout: '65s'
[13:48:05] between_bytes_timeout: '33s'
[13:48:35] but then on any random kind of 503 failure (explicit applayer 503, backend-generated 503, frontend 503 due to timeout reaching backend, whatever)
[13:48:54] the frontend will retry the whole request through the layers once, all as part of its backend handling of a single req->resp cycle from the public pov
[13:49:23] hence ~120s timeouts depending on the scenario, possibly longer in some scenarios, virtually-indefinite if the layers are bleeding out bytes slowly over time
[13:50:48] my sample curl on that has been going for several minutes now :/
[13:51:14] I can dig deeper in a bit and actually try to live trace my curl through all the stack
[13:52:02] I vaguely recall we've had virtually-infinite timeouts before, and it was with restbase specifically
[13:52:20] I thought we had worked around that with some varnish source-patching and config, but it's been a while since I looked at it all
[13:55:23] (that had to do with a "bug" in varnish, although you can argue there's no correct behavior for all possible corner cases, with how it handles failures over a reused connection. With just the wrong backend behavior, varnish can end up internally retrying a backend request indefinitely without VCL ever seeing it as controllable/limit-able retries)
[13:56:22] the corner-case situation there in general, is that with a pool of persistent reusable connections to the backend, the backend can legally close the conn at any time (e.g. between txns, due to some configured maximum connection lifetime)
[13:57:09] so it's always possible varnish (the client) sends a new req down the pipe at the same time the other side is closing, resulting in a transient failure for that request
[13:57:24] varnish's assumption was this was a rare race, and the right answer is to loop and try again on a different conn
[13:57:47] (without allowing the typical VCL-level visibility or control of the situation)
[13:58:30] but let's say if the applayer happens to be crashing->closing on a certain request due to a bug or whatever.... well obviously you just get an infinite internal loop (or at least a few very long timeouts before giving up, depending)
[13:59:34] in general, connection re-use has poorly-defined edges in HTTP that aren't easy to resolve in a way that covers all cases sanely. This is shades of the same problems we have with nginx->varnish-fe connection sharing and error-bleed.
[14:00:28] my test curl's been hanging there for ~15 minutes now heh
[14:01:03] <_joe_> ok so
[14:01:14] <_joe_> sorry I was a bit lost in debugging at lower layers
[14:01:22] <_joe_> the request definitely hangs in citoid
[14:01:35] <_joe_> so somehow neither restbase nor varnish ever times out
[14:03:14] are they leaking bytes slowly over time? are you sure varnish isn't just reconnecting and retrying (to RB) every ~minute or two?
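To put rough numbers on the "~120s" figure: with the values quoted above, a backend that accepts the connection but never sends a byte costs one first-byte window per attempt, and the frontend's single internal retry on 503 doubles that before the client sees anything. A hedged Python sketch; it ignores the extra variations mentioned above (cross-DC hops, slow byte-bleed, other retry paths):

    # Sketch of worst-case time-to-error from the timeout values above,
    # assuming only the single frontend retry on 503 (all other paths ignored).
    VARNISH_BE = {"connect": 3, "first_byte": 63, "between_bytes": 31}  # -> applayer
    VARNISH_FE = {"connect": 3, "first_byte": 65, "between_bytes": 33}  # -> varnish-be
    FRONTEND_RETRIES_ON_503 = 1

    def worst_case_hang():
        # applayer accepts the connection but never sends a byte: whichever
        # layer's first-byte window expires first turns the attempt into a 503,
        # and the frontend then retries the whole chain once.
        per_attempt = min(VARNISH_FE["first_byte"],
                          VARNISH_BE["connect"] + VARNISH_BE["first_byte"])
        return (FRONTEND_RETRIES_ON_503 + 1) * per_attempt

    print(worst_case_hang())  # -> 130 seconds, the "~120s, possibly longer" ballpark

If the applayer instead dribbles bytes within the between-bytes windows, the same retry structure applies but each attempt is unbounded, which is the "virtually-indefinite" case described above.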
[14:03:24] I can dig some later too, after this LVS stuff
[14:04:22] <_joe_> yeah no rush
[14:04:28] <_joe_> it's just a strange effect
[14:04:34] <_joe_> I'll keep digging
[14:20:00] heh, somewhere around the 25-minute mark, my curl finally returned
[14:20:10] < HTTP/2 504
[14:20:10] < server: nginx/1.13.6
[14:20:10] < date: Thu, 23 May 2019 14:08:54 GMT
[14:20:28] with a "503 gateway timeout" short html response, clearly generated by our front-edge nginx itself, not varnish or beneath
[14:20:47] err sorry "504 gateway timeout"
[14:42:40] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org...
[14:48:17] <_joe_> bblack: restbase has a 4 minute timeout because it retries twice
[14:48:26] <_joe_> see https://phabricator.wikimedia.org/T224222#5208173
[14:48:40] <_joe_> it's scary this timeout amplification can even happen tbh
[15:02:38] yeah we had long debates about timeout amplification across the stack, a long time ago, I think even techcom (or its predecessor) was involved
[15:03:19] the TL;DR we took away from it, at least from traffic's perspective, is that it's best that everything in our stack never do retries, due to the amplification (not just timeout extension, but also query volume amplification retrying the failures, too)
[15:03:35] and we include even varnish-be / ats-be in that (shouldn't ever retry)
[15:04:43] and we made a lone exception: at the outer edge of the traffic infra (well, currently done in varnish-fe), there should be a singular retry only specifically for the 503 case (explicit, or generated by timeout, etc), but not other statuses, just to paper over any truly-transient minor issues elsewhere deeper in the stack (which we implemented).
[15:05:01] to protect public traffic from the impact of those transient issues
[15:06:07] before deciding on those things, we had crazy amplification scenarios, where 1 public request multiplied by N failing/retrying to varnish-be, then multiplied again to another varnish-be in another DC, then multiplied again talking to restbase, which multiplied it again talking to parsoid, etc, etc... total chaos explosion of requests in some failure scenarios.
[15:42:44] netops, Operations, Patch-For-Review: RPKI Validation - https://phabricator.wikimedia.org/T220669 (jbond) just watching ripe presentation and thought this may be of interest https://ripe78.ripe.net/archives/video/106
[15:47:49] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org...
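To make the "total chaos explosion" arithmetic above concrete: when everything below is failing, each layer that retries multiplies the number of requests reaching the layer beneath it (and the latency budget along with it). A rough Python sketch; apart from the frontend's single 503 retry and restbase's "retries twice", the per-layer attempt counts are invented for illustration, not the real historical values:

    # Sketch of request-volume amplification with per-layer retries, for one
    # failing public request. Layer names follow the scenario described above;
    # attempt counts other than varnish-fe and restbase are hypothetical.
    from functools import reduce

    layers = [
        ("varnish-fe", 2),             # 1 deliberate retry on 503 -> 2 attempts
        ("varnish-be, local DC", 3),   # hypothetical pre-cleanup retries
        ("varnish-be, remote DC", 3),  # hypothetical pre-cleanup retries
        ("restbase", 3),               # "retries twice" -> 3 attempts
    ]

    attempts_at_parsoid = reduce(lambda n, layer: n * layer[1], layers, 1)
    print(attempts_at_parsoid)  # -> 54 backend attempts for a single public request

The same multiplication applies to the latency budget, which is why the policy described above ended up being: no retries anywhere in the stack except the single 503 retry at the outermost edge.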
[16:31:19] Traffic, Analytics, Operations: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (CDanis)
[17:17:39] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair)
[17:34:14] netops, DC-Ops, Operations, observability: Send some LibreNMS alerts to dcops and netops only - https://phabricator.wikimedia.org/T224180 (Krenair)
[18:06:12] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ayounsi)
[19:58:57] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Andrew) >>! In T223902#5203523, @Vgutierrez wrote: > so, after a quick check you should consider several things: > * wikimedia.org is a can...
[20:02:38] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair) >>! In T223902#5209071, @Andrew wrote: >>>! In T223902#5203523, @Vgutierrez wrote: >> so, after a quick check you should consider...
[20:08:49] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Vgutierrez) Right.. that ldap service certificate it's being handled by acme-chief and as Alex explained the *.wikimedia.org limitation onl...
[21:23:16] netops, Operations, ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi) p:Triage→Normal
[21:35:01] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org...
[21:38:27] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Andrew) I am fine with changing our proposed names to things like keystone-eqiad1.wikimedia.org or keystone-eqiad1-wmcs.wikimedia.org if th...
[21:44:58] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair) Another thing to consider if we're really talking about using the prod caches is that currently those endpoints are not exposed to...
[21:58:26] Traffic, Analytics, Operations, Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (Volans) p:Triage→Normal
[22:05:14] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1004.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1001.wikimedia.org...
[22:08:01] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (BBlack) Do these belong in `wikimedia.org` at all? It seems this has already been discussed, but I guess I lack some context. The comment...
[22:33:56] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[22:36:15] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack) a:BBlack→ayounsi These are reimaged to `role(spare::system)` now. Over to @ayounsi for getting rid of all the special cases related to t...
[23:05:50] Traffic, Operations: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (HMarcus)