[02:10:37] Domains, Traffic, Operations, WikimediaUI Style Guide: Redirect design.wikimedia.org/style-guide/wiki/* to design.wikimedia.org/style-guide/ - https://phabricator.wikimedia.org/T200304 (Prtksxna) p:Triage>Normal
[02:11:30] Domains, Traffic, Operations, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (Prtksxna) >>! In T185282#4448556, @Dzahn wrote: > @Prtksxna Yes, i think so. Please feel free to create that subtask and assign...
[08:25:13] netops, Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (mark) > The eqdfw-knams needs have a lower metric than the current primary (codfw-eqiad + eqiad-esams) links so traffic from codfw to esams prefer that link. Could you explain that premise? What are we trying to optimize for?...
[09:18:58] is there any chance I could get some more pybal reviews? :)
[09:19:13] sure
[09:19:26] I'll do some today :)
[09:19:52] thank you!!
[11:07:40] Traffic, Operations, Pybal: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (mark) @ema: Has this been seen again? Does this need any work in Pybal?
[11:52:30] Traffic, Operations, Pybal: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (ema) Open>Resolved a:ema Nope, I haven't seen this since. Closing.
[12:39:32] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (Cmjohnson) @ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public vlan. Also, adding the servers to the vla...
[12:44:09] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (mark) >>! In T195923#4450204, @Cmjohnson wrote: > @ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public v...
[12:47:38] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (Cmjohnson) Thanks @mark fixing now. I looked up one other and it must've been for something else. I believe it was cp1008
[12:58:39] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (BBlack) I think they did something, as the password for mgmt ssh appears to be reset (can't get in anymore)
[13:22:04] eqiad running varnish 5.1.3-1wm9, trying alternate domains there again
[13:36:01] https://github.com/varnishcache/varnish-cache/issues/9000 -> "Varnish crashes with separate VCLs if a HEAD request has a header value containing the substring 'khp'"
[13:36:41] bblack: I did test HEAD requests extensively! :)
[13:48:00] ok cp1067 looks good so far, re-enabling puppet on text-eqiad hosts
[14:00:34] alright! Initial testing of phab and grafana through text-eqiad looks good
[14:49:43] oh interesting :)
[14:50:10] we switch to misc VCL in text_common_recv, which is called at the very end of cluster_fe_recv
[14:51:18] cluster_fe_recv does some text-specific work that conflicts with misc stuff
[14:51:30] for example: `if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }`
[14:51:54] so basically that would break every misc site with /static/ ^
[14:52:35] cluster_be_recv instead calls text_common_recv at the very beginning (and doesn't do much really anyways)
[14:53:24] we could either (a) move the call to text_common_recv before the text-specific vcl in cluster_fe_recv
[14:53:42] or (b) switch to the alternate VCL earlier on in vcl_recv
[14:54:55] drawback for (b) is that there's no other text_common_ "hook" happening before text_common_recv, so we'd have to define a new one
[14:55:06] and (a) seems easy enough to me :)
[14:57:16] netops, Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) > Could you explain that premise? What are we trying to optimize for? > > If a path with an extra hop in eqiad is the lowest latency path, that could just become our preferred path, despite not being direct? Also sinc...
[15:09:19] (a) seems fraught with potential bugs though, due to re-ordering of the bits that matter on text
[15:10:09] I'd say move the alt-vcl switcher, but it's tricky to puzzle out exactly where the new hook should go
[15:11:15] yeah
[15:11:40] the only thing that definitely has to come before it, is host-header sanitization
[15:12:30] normalize_request, which happens very early
[15:13:00] pretty much at the beginning of vcl_recv
[15:13:38] as for the rest, it may be common code, but it's probably simpler to switch VCL as early as reasonably possible
[15:14:11] maybe put a hook for vcl-switching right after "call normalize_request" (within the no-restarts block)?
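(Editor's sketch, not part of the log.) The placement proposed in the last few messages could look roughly like the fragment below. This is only an illustration under stated assumptions, not the patch that was written afterwards: `normalize_request` and the `req.restarts == 0` guard come from the discussion, while the label name `misc_label` and the Host pattern are hypothetical placeholders. `return (vcl(...))` is the label-switching action available in vcl_recv in Varnish 5.x.

```vcl
sub vcl_recv {
    if (req.restarts == 0) {
        # Host-header sanitization has to run before anything keys on Host.
        call normalize_request;

        # Hypothetical hook: hand misc-style sites to the alternate VCL
        # before any text-specific rewriting (such as the /static/ Host
        # override) can touch the request. Label and pattern are assumed.
        if (req.http.Host ~ "^(phabricator|grafana)\.wikimedia\.org$") {
            return (vcl(misc_label));
        }
    }

    # ... text-specific cluster_fe_recv logic would continue here ...
}
```

Switching this early keeps the text-specific code from ever seeing misc requests, at the cost of defining a new hook point. Whether a restart stays inside the switched-to VCL is left open by this sketch.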
[15:14:46] I guess in that thinking, I'm assuming restarts stay within the switched-to VCL
[15:28:07] bblack: is there anything except for 'return (restart)' that can cause a restart?
[15:30:38] I don't know if backend-side failures even go all the way back to a frontside restart
[15:30:54] it's hard to imagine intrinsic restarts confined within the front side, but I guess it's possible
[15:31:52] mmh
[15:32:54] s/even/ever/ above
[15:33:14] anyways, I don't see any docs on restart behavior, but it would be good to know
[15:34:08] seems simple to test artificially with some vtc for switching that's protected by restarts==0, and an explicit restart followed by differing behaviors in the two VCLs.
[15:34:57] I'd expect by design philosophy that it would stay within the switched-to VCL, since header mangling isn't reset, and thus things would be super-confusing otherwise.
[15:44:25] my understanding is that retries and restarts are entirely separate things
[15:45:20] I don't see anything in the code triggering a restart (but maybe it's too late in the day!), so given that our VCL does not directly return restart that should not happen?
[15:45:35] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (LGoto)
[15:46:40] maybe!
[15:47:06] back in the day (Varnish 3), there was no "retry", and "restart" could happen anywhere (including where we'd now see retry) and always went back to the start of vcl_recv()
[15:47:16] which is the source of all of our req.restarts conditionals
[15:47:20] here's the client FSM with restarts https://book.varnish-software.com/4.0/_images/detailed_fsm.svg
[15:47:50] but with 4 there was the front/back-side splits, where retry is a backside thing that I don't think ever goes all the way back to front-side
[15:47:52] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) Ok, so the email back from them when I woke up this AM was a bit confusing, but boils down to this: * They seem to have replaced the mainboard, and set the temp drac password as requested. * @robh...
[15:47:54] and restart is just within front
[15:47:57] and the backend one with retries: https://book.varnish-software.com/4.0/_images/detailed_fsm_backend.svg
[15:49:36] right
[15:50:02] so, you make a convincing case that we should probably just yank out all the req.restarts==0 conditionals
[15:50:29] :)
[15:50:54] but, there's a notable hole to poke in it
[15:51:42] bblack: ok, the cp5006 password is set like the others now =] im working on confirming all its hardware is right and then ill reinstall the os
[15:52:27] there's no "re", it was never installed :)
[15:52:30] but yeah ok
[15:52:41] indeed
[15:52:59] i misparsed their email was wondering how they saw raid status if they couldnt power it on
[15:53:08] but they can power it on, it just has no os. which was expected.
[15:53:56] it will probably fail puppetization, but we can handle that part
[15:55:16] ema: vcl_deliver() is actually back in the frontside where restarts happen, even though it's very late stage after the backside fetching stuff
[15:55:30] ema: and the webp patch executes a restart from vcl_deliver :)
[15:56:08] (which wouldn't be this cluster, but it's a demonstrable case where we might use similar restarts in the future, and not want to then go find where all those req.restarts conditionals should be put back at)
[15:57:54] ema: it's probably safe to assume restart doesn't switch VCLs back to the main one, and safe to put the switching hook inside req.restarts==0 between normalize_request and recv_fe_ip_processing.
[15:58:08] ema: but still, maybe worth a vtc check to be sure it doesn't need to re-switch on restart
[15:59:44] vgutierrez: thanks for the reviews :)
[15:59:57] my pleasure
[16:00:13] bblack: yes in general it seems to be a good idea to keep the restart guards in case we will be doing restarts in the future
[16:01:00] I'm happy for my sanity though that retry doesn't cause a restart
[16:07:28] once we switch to ATS backends, there's so much sanitizing/refactoring/cleanup we can do
[16:09:35] yup
[16:11:11] (or alternatively, we can ignore the cruft a few months longer and switch the frontends too)
[16:21:47] cp5006 loading installer image over pxe woot
[16:27:33] hrmm cp5006 fails at Jul 25 16:24:50 bast5001 atftpd[759]: Serving lpxelinux.0 to 10.132.0.106:2071
[16:27:45] when at Loading debian-installer/amd64/initrd.gz...
[16:28:20] ok, finally loaded
[16:28:30] that took 4 minutes to load, and its from bast5001 (local)
[16:28:36] thats odd and too long for that.
[16:35:37] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (Cmjohnson) a:Cmjohnson>RobH @RobH can you take over the installs from here. I did do production dns, please review and merge if okay. I am not seeing a physic...
[16:38:00] gotta love vcl refactoring!
[16:38:43] bblack: something like this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/447836/
[16:44:46] I have to go soon, puppet is still disabled on text (except for eqiad, where things seem stable)
[16:46:10] it should be safe to enable it on the other DCs IMHO but I'm not gonna make the decision a few minutes before leaving :)
[16:52:18] bblack: actually, varnish is still on wm8 (buggy with separate vcl) on non-eqiad, so please do not re-enable puppet where it's disabled
[16:52:55] bblack: or if it needs to be re-enabled, revert https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/447776/ first
[16:52:58] o/
[17:43:53] Traffic, Operations, Wikimedia-General-or-Unknown, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Tgr) The domain was removed in {0593daa89b07982b67121bb6d14f05974d3e5914}. I...
[17:50:08] Traffic, Operations, Wikimedia-General-or-Unknown, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Tgr) I guess short term fix is to disable thumbnail prerendering since it is...
[17:54:07] Traffic, Operations, Wikimedia-General-or-Unknown, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Imarlier) @Tgr I think that's right. Do you mind doing so?
[18:01:31] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Imarlier) Actually, I take that back. We should be abl...
[19:03:19] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Krinkle) At risk of asking the obvious – have we decide...
[19:11:43] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Tgr) 200 is the default value for that property; overri...
[19:12:48] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Anomie) >>! In T200346#4451345, @Krinkle wrote: > At ri...
[19:14:53] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Krinkle) >>! In T200346#4451362, @Anomie wrote: > "0 is...
[19:35:23] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack) a:RobH>BBlack I'll take over these from here. It's a very new hardware config we'll have to develop some puppet-level fixups for as we test how the inst...
[19:46:27] robh: I found cp5006 sitting on the usual initramfs prompt for failure to assemble md0 after the installer reboot. I'm doing puppetization, etc on it now
[19:57:39] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Imarlier) At this point, just waiting on someone with a...
[20:08:58] Traffic, netops, Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi) Looking at doing this Wednesday August 1st, 3 PM UTC, 1h expected. 1 link at a time, only on the primary of the redundant ones, and outside link maintenance.
[20:13:21] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (Krinkle) Open>Resolved a:Krinkle Tentativel...
[20:24:44] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557 (BBlack)
[20:24:47] Traffic, Operations, ops-eqsin, Patch-For-Review: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (BBlack) Open>Resolved cp5006 is now installed and puppeted and in-service, should be all fixed up assuming nothing bursts into flames in the near future.
[20:25:24] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (BBlack) p:High>Normal
[20:50:25] bblack: awesome thanks
[20:50:33] i had to pickup someone from the airport and traffic was terrible =P
[20:50:41] what should have been 90 minutes was 3 hours.
[20:51:14] bblack: ill close out our ticket for cp5006 and start one for the other cp system failure
[20:51:32] since these kinds of things are simply the cost of doing international datacenter hosting =]
[20:51:49] but man it feels odd to put in expensive smarthands tickets for already paid for warranty support
[20:51:50] heh
[20:57:09] yeah I hear you :)
[20:57:21] does a dell tech actually need smarthands?
[21:15:04] Traffic, Operations, Wikimedia-General-or-Unknown, Patch-For-Review, Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (TheDJ) > I see. So all network/transport level errors,...
[21:54:14] ema: FYI for cp1075-99: jessie installer's 3.16 kernel doesn't have the right drivers (for disk or NIC, I think), so currently jessie installs there are impossible. stretch seems to install ok so far, and probably our current runtime jessie kernel is fine too...
[21:55:01] ema: so not sure here, we can either take the plunge on moving caches towards stretch (ugh, double-packaging of various things during the transition, and slows down other timelines), or we can push for doing a fixup to the jessie installer to use a newer kernel at install time.
[21:55:15] ema: (not even sure how realistic that last option even is)
[21:58:49] mmmh strange, the additional drivers should be there, unless those boxes require some new drivers
[21:59:37] they do
[21:59:59] I don't think 3.16 + extra drivers will cut it, we actually need the newer kernel to get the drivers at all
[22:01:06] got it, modifying the kernel in the netinst should be possible but might be painful, I found this outdated guide
[22:01:09] https://wiki.debian.org/DebianInstaller/Modify/CustomKernel
[22:01:18] also not very detailed :D
[22:01:30] yeah, plus I donno if it would cause some other regression with how the installer software itself deals with module loading, etc
[22:03:22] ack
[22:05:24] a third option could be installing stretch and downgrading it immediately to jessie (before first puppet)
[22:05:37] this one too could be quite messy
[22:06:32] we have to go stretch eventually anyways, it's just always a PITA with all the custom packages we have
[22:08:58] sure, and with more than a few hosts, the less manual the process the easier it is
[22:09:35] * volans having some other weird and convoluted ideas not even worth mentioning ;)
[22:18:57] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack) What I know so far from testing on cp1075: * The various BIOS settings seem fine so far, I didn't have to change anything in BIOS or NIC or controller firmware s...