[02:10:37] <wikibugs>	 10Domains, 10Traffic, 10Operations, 10WikimediaUI Style Guide: Redirect design.wikimedia.org/style-guide/wiki/* to design.wikimedia.org/style-guide/ - https://phabricator.wikimedia.org/T200304 (10Prtksxna) p:05Triage>03Normal
[02:11:30] <wikibugs>	 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (10Prtksxna) >>! In T185282#4448556, @Dzahn wrote: > @Prtksxna Yes, i think so. Please feel free to create that subtask and assign...
[08:25:13] <wikibugs>	 10netops, 10Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10mark) > The eqdfw-knams needs have a lower metric than the current primary (codfw-eqiad + eqiad-esams) links so traffic from codfw to esams prefer that link.  Could you explain that premise? What are we trying to optimize for?...
[09:18:58] <mark>	 is there any chance I could get some more pybal reviews? :)
[09:19:13] <vgutierrez>	 sure
[09:19:26] <vgutierrez>	 I'll do some today :)
[09:19:52] <mark>	 thank you!!
[11:07:40] <wikibugs>	 10Traffic, 10Operations, 10Pybal: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (10mark) @ema: Has this been seen again? Does this need any work in Pybal?
[11:52:30] <wikibugs>	 10Traffic, 10Operations, 10Pybal: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (10ema) 05Open>03Resolved a:03ema Nope, I haven't seen this since. Closing.
[12:39:32] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) @ayounsi  I was not able to add the ports in row A to the public vlan. Can you check the following and add to public vlan. Also, adding the servers to the vla...
[12:44:09] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10mark)  >>! In T195923#4450204, @Cmjohnson wrote: > @ayounsi  I was not able to add the ports in row A to the public vlan. Can you check the following and add to public v...
[12:47:38] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) Thanks @mark fixing now.  I looked up one other and it must've been for something else. I believe it was cp1008
[12:58:39] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) I think they did something, as the password for mgmt ssh appears to be reset (can't get in anymore)
[13:22:04] <ema>	 eqiad running varnish 5.1.3-1wm9, trying alternate domains there again
[13:36:01] <bblack>	 https://github.com/varnishcache/varnish-cache/issues/9000 -> "Varnish crashes with separate VCLs if a HEAD request has a header value containing the substring 'khp'"
[13:36:41] <ema>	 bblack: I did test HEAD requests extensively! :)
[13:48:00] <ema>	 ok cp1067 looks good so far, re-enabling puppet on text-eqiad hosts
[14:00:34] <ema>	 alright! Initial testing of phab and grafana through text-eqiad looks good
[14:49:43] <ema>	 oh interesting :)
[14:50:10] <ema>	 we switch to misc VCL in text_common_recv, which is called at the very end of cluster_fe_recv 
[14:51:18] <ema>	 cluster_fe_recv does some text-specific work that conflicts with misc stuff
[14:51:30] <ema>	 for example: `if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }`
[14:51:54] <ema>	 so basically that would break every misc site with /static/ ^
[14:52:35] <ema>	 cluster_be_recv instead calls text_common_recv at the very beginning (and doesn't do much really anyways)
[14:53:24] <ema>	 we could either (a) move the call to text_common_recv before the text-specific vcl in cluster_fe_recv 
[14:53:42] <ema>	 or (b) switch to the alternate VCL earlier on in vcl_recv
[14:54:55] <ema>	 drawback for (b) is that there's no other text_common_ "hook" happening before text_common_recv, so we'd have to define a new one
[14:55:06] <ema>	 and (a) seems easy enough to me :)
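A minimal sketch of the ordering problem described above, not the actual puppet template. The backend, the label name `misc_label`, the host names, and the body of `text_common_recv` are assumptions made up for illustration. It only shows why a text-specific `/static/` host rewrite that runs before the VCL switch can break misc sites.

```vcl
vcl 4.0;

# Hypothetical backend so the sketch is self-contained.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub text_common_recv {
    # Assumed switch condition. The label must already exist, e.g.
    # created on the CLI with "vcl.label misc_label <vcl name>".
    if (req.http.host == "misc.example.org") {
        return (vcl(misc_label));
    }
}

sub cluster_fe_recv {
    # Text-specific work runs first ...
    if (req.url ~ "^/static/") {
        # Placeholder for the templated static_host value.
        set req.http.host = "static.example.org";
    }
    # ... so in this sketch a misc request for /static/ has already
    # lost its original Host header by the time the switch is evaluated.
    call text_common_recv;
}

sub vcl_recv {
    call cluster_fe_recv;
}
```

Option (a) would move `call text_common_recv;` above the rewrite; option (b) would move the switch out of `cluster_fe_recv` and into an earlier point in `vcl_recv`.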
[14:57:16] <wikibugs>	 10netops, 10Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10ayounsi) > Could you explain that premise? What are we trying to optimize for? >  > If a path with an extra hop in eqiad is the lowest latency path, that could just become our preferred path, despite not being direct? Also sinc...
[15:09:19] <bblack>	 (a) seems fraught with potential bugs though, due to re-ordering of the bits that matter on text
[15:10:09] <bblack>	 I'd say move the alt-vcl switcher, but it's tricky to puzzle out exactly where the new hook should go
[15:11:15] <ema>	 yeah
[15:11:40] <bblack>	 the only thing that definitely has to come before it, is host-header sanitization
[15:12:30] <ema>	 normalize_request, which happens very early
[15:13:00] <ema>	 pretty much at the beginning of vcl_recv
[15:13:38] <bblack>	 as for the rest, it may be common code, but it's probably simpler to switch VCL as early as reasonably possible
[15:14:11] <bblack>	 maybe put a hook for vcl-switching right after "call normalize_request" (within the no-restarts block)?
[15:14:46] <bblack>	 I guess in that thinking, I'm assuming restarts stay within the switched-to VCL
[15:28:07] <ema>	 bblack: is there anything except for 'return (restart)' that can cause a restart?
[15:30:38] <bblack>	 I don't know if backend-side failures even go all the way back to a frontside restart
[15:30:54] <bblack>	 it's hard to imagine intrinsic restarts confined within the front side, but I guess it's possible
[15:31:52] <ema>	 mmh
[15:32:54] <bblack>	 s/even/ever/ above
[15:33:14] <bblack>	 anyways, I don't see any docs on restart behavior, but it would be good to know
[15:34:08] <bblack>	 seems simple to test artificially with some vtc for switching that's protected by restarts==0, and an explicit restart followed by differing behaviors in the two VCLs.
[15:34:57] <bblack>	 I'd expect by design philosophy that it would stay within the switched-to VCL, since header mangling isn't reset, and thus things would be super-confusing otherwise.
[15:44:25] <ema>	 my understanding is that retries and restarts are entirely separate things
[15:45:20] <ema>	 I don't see anything in the code triggering a restart (but maybe it's too late in the day!), so given that our VCL does not directly return restart that should not happen?
[15:45:35] <wikibugs>	 10Traffic, 10Maps, 10Maps-Sprint, 10Operations, 10Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10LGoto)
[15:46:40] <bblack>	 maybe!
[15:47:06] <bblack>	 back in the day (Varnish 3), there was no "retry", and "restart" could happen anywhere (including where we'd now see retry) and always went back to the start of vcl_recv()
[15:47:16] <bblack>	 which is the source of all of our req.restarts conditionals
[15:47:20] <ema>	 here's the client FSM with restarts https://book.varnish-software.com/4.0/_images/detailed_fsm.svg
[15:47:50] <bblack>	 but with 4 there was the front/back-side splits, where retry is a backside thing that I don't think ever goes all the way back to front-side
[15:47:52] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) Ok, so the email back from them when I woke up this AM was a bit confusing, but boils down to this:  * They seem to have replaced the mainboard, and set the temp drac password as requested. * @robh...
[15:47:54] <bblack>	 and restart is just within front
[15:47:57] <ema>	 and the backend one with retries: https://book.varnish-software.com/4.0/_images/detailed_fsm_backend.svg
[15:49:36] <bblack>	 right
[15:50:02] <bblack>	 so, you make a convincing case that we should probably just yank out all the req.restarts==0 conditionals
[15:50:29] <ema>	 :)
[15:50:54] <bblack>	 but, there's a notable hole to poke in it
[15:51:42] <robh>	 bblack: ok, the cp5006 password is set like the others now =]  im working on confirming all its hardware is right and then ill reinstall the os
[15:52:27] <bblack>	 there's no "re", it was never installed :)
[15:52:30] <bblack>	 but yeah ok
[15:52:41] <robh>	 indeed
[15:52:59] <robh>	 i misparsed their email was wondering how they saw raid status if they couldnt power it on
[15:53:08] <robh>	 but they can power it on, it just has no os.  which was expected.
[15:53:56] <bblack>	 it will probably fail puppetization, but we can handle that part
[15:55:16] <bblack>	 ema: vcl_deliver() is actually back in the frontside where restarts happen, even though it's very late stage after the backside fetching stuff
[15:55:30] <bblack>	 ema: and the webp patch executes a restart from vcl_deliver :)
[15:56:08] <bblack>	 (which wouldn't be this cluster, but it's a demonstrable case where we might use similar restarts in the future, and not want to then go find where all those req.restarts conditionals should be put back at)
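A generic sketch of the pattern being discussed: a restart issued from `vcl_deliver` re-enters `vcl_recv`, so one-time setup needs a `req.restarts == 0` guard. This is not the webp patch itself. The backend, the 404 condition, and the `X-Fallback` header are assumptions chosen only to make the example self-contained.

```vcl
vcl 4.0;

# Hypothetical backend so the sketch is self-contained.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.restarts == 0) {
        # One-time setup: never trust a client-supplied marker.
        # Skipped on restart, so the value set in vcl_deliver survives.
        unset req.http.X-Fallback;
    }
}

sub vcl_deliver {
    # Illustrative trigger only: retry once with a marker header set.
    if (req.restarts == 0 && resp.status == 404) {
        set req.http.X-Fallback = "1";
        return (restart);
    }
}
```

Without the guard in `vcl_recv`, the restarted request would have the marker stripped again and the second pass would behave exactly like the first.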
[15:57:54] <bblack>	 ema: it's probably safe to assume restart doesn't switch VCLs back to the main one, and safe to put the switching hook inside req.restarts==0 between normalize_request and recv_fe_ip_processing.
[15:58:08] <bblack>	 ema: but still, maybe worth a vtc check to be sure it doesn't need to re-switch on restart
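A minimal sketch of the placement proposed above, assuming a new hook named `alt_vcl_switch` (hypothetical) called between `normalize_request` and `recv_fe_ip_processing` inside the no-restarts block. The sub bodies, the backend, the host name, and the label `misc_label` are placeholders, not the real templates.

```vcl
vcl 4.0;

# Hypothetical backend so the sketch is self-contained.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub normalize_request {
    # Stand-in for host-header sanitization: strip any port suffix.
    set req.http.host = regsub(req.http.host, ":[0-9]+$", "");
}

sub alt_vcl_switch {
    # Hypothetical hook: switch as early as possible once Host is clean.
    # misc_label must have been created beforehand with "vcl.label".
    if (req.http.host == "misc.example.org") {
        return (vcl(misc_label));
    }
}

sub recv_fe_ip_processing {
    # Placeholder body; the real sub does client IP processing.
    unset req.http.X-Placeholder;
}

sub vcl_recv {
    if (req.restarts == 0) {
        call normalize_request;
        call alt_vcl_switch;
        call recv_fe_ip_processing;
    }
}
```

Whether a restart stays in the switched-to VCL is still the open question here; the sketch only shows where the hook would sit, and the vtc check mentioned above would be what confirms the behavior.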
[15:59:44] <mark>	 vgutierrez: thanks for the reviews :)
[15:59:57] <vgutierrez>	 my pleasure
[16:00:13] <ema>	 bblack: yes in general it seems to be a good idea to keep the restart guards in case we will be doing restarts in the future 
[16:01:00] <ema>	 I'm happy for my sanity though that retry doesn't cause a restart
[16:07:28] <bblack>	 once we switch to ATS backends, there's so much sanitizing/refactoring/cleanup we can do
[16:09:35] <ema>	 yup
[16:11:11] <bblack>	 (or alternatively, we can ignore the cruft a few months longer and switch the frontends too)
[16:21:47] <robh>	 cp5006 loading installer image over pxe woot 
[16:27:33] <robh>	 hrmm cp5006 fails at Jul 25 16:24:50 bast5001 atftpd[759]: Serving lpxelinux.0 to 10.132.0.106:2071 
[16:27:45] <robh>	 when at Loading debian-installer/amd64/initrd.gz... 
[16:28:20] <robh>	 ok, finally loaded
[16:28:30] <robh>	 that took 4 minutes to load, and its from bast5001 (local)
[16:28:36] <robh>	 thats odd and too long for that.
[16:35:37] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) a:05Cmjohnson>03RobH @RobH can you take over the installs from here.  I did do production dns, please review and merge if okay.   I am not seeing a physic...
[16:38:00] <ema>	 gotta love vcl refactoring!
[16:38:43] <ema>	 bblack: something like this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/447836/
[16:44:46] <ema>	 I have to go soon, puppet is still disabled on text (except for eqiad, where things seem stable)
[16:46:10] <ema>	 it should be safe to enable it on the other DCs IMHO but I'm not gonna make the decision a few minutes before leaving :)
[16:52:18] <ema>	 bblack: actually, varnish is still on wm8 (buggy with separate vcl) on non-eqiad, so please do not re-enable puppet where it's disabled 
[16:52:55] <ema>	 bblack: or if it needs to be re-enabled, revert https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/447776/ first
[16:52:58] <ema>	 o/
[17:43:53] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) The domain was removed in {0593daa89b07982b67121bb6d14f05974d3e5914}. I...
[17:50:08] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) I guess short term fix is to disable thumbnail prerendering since it is...
[17:54:07] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) @Tgr I think that's right.  Do you mind doing so?
[18:01:31] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) Actually, I take that back.  We should be abl...
[19:03:19] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) At risk of asking the obvious – have we decide...
[19:11:43] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) 200 is the default value for that property; overri...
[19:12:48] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Anomie) >>! In T200346#4451345, @Krinkle wrote: > At ri...
[19:14:53] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) >>! In T200346#4451362, @Anomie wrote: > "0 is...
[19:35:23] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) a:05RobH>03BBlack I'll take over these from here.  It's a very new hardware config we'll have to develop some puppet-level fixups for as we test how the inst...
[19:46:27] <bblack>	 robh: I found cp5006 sitting on the usual initramfs prompt for failure to assemble md0 after the installer reboot.  I'm doing puppetization, etc on it now
[19:57:39] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) At this point, just waiting on someone with a...
[20:08:58] <wikibugs>	 10Traffic, 10netops, 10Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) Looking at doing this Wednesday August 1st, 3 PM UTC, 1h expected.  1 link at a time, only on the primary of the redundant ones, and outside link maintenance.
[20:13:21] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) 05Open>03Resolved a:03Krinkle Tentativel...
[20:24:44] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin, 10Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557 (10BBlack)
[20:24:47] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin, 10Patch-For-Review: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) 05Open>03Resolved cp5006 is now installed and puppeted and in-service, should be all fixed up assuming nothing bursts into flames in the near future.
[20:25:24] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10BBlack) p:05High>03Normal
[20:50:25] <robh>	 bblack: awesome thanks
[20:50:33] <robh>	 i had to pickup someone from the airport and traffic was terrible =P
[20:50:41] <robh>	 what should have been 90 minutes was 3 hours.
[20:51:14] <robh>	 bblack: ill close out our ticket for cp5006 and start one for the other cp system failure
[20:51:32] <robh>	 since these kinds of things are simply the cost of doing international datacenter hosting =]
[20:51:49] <robh>	 but man it feels odd to put in expensive smarthands tickets for already paid for warranty support
[20:51:50] <robh>	 heh
[20:57:09] <bblack>	 yeah I hear you :)
[20:57:21] <bblack>	 does a dell tech actually need smarthands?
[21:15:04] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10TheDJ) > I see. So all network/transport level errors,...
[21:54:14] <bblack>	 ema: FYI for cp1075-99: jessie installer's 3.16 kernel doesn't have the right drivers (for disk or NIC, I think), so currently jessie installs there are impossible.  stretch seems to install ok so far, and probably our current runtime jessie kernel is fine too...
[21:55:01] <bblack>	 ema: so not sure here, we can either take the plunge on moving caches towards stretch (ugh, double-packaging of various things during the transition, and slows down other timelines), or we can push for doing a fixup to the jessie installer to use a newer kernel at install time.
[21:55:15] <bblack>	 ema: (not even sure how realistic that last option even is)
[21:58:49] <volans>	 mmmh strange, the additional drivers should be there, unless those boxes require some new drivers
[21:59:37] <bblack>	 they do
[21:59:59] <bblack>	 I don't think 3.16 + extra drivers will cut it, we actually need the newer kernel to get the drivers at all
[22:01:06] <volans>	 got it, modifying the kernel in the netinst should be possible but might be painful, I found this outdated guide
[22:01:09] <volans>	 https://wiki.debian.org/DebianInstaller/Modify/CustomKernel
[22:01:18] <volans>	 also not very detailed :D
[22:01:30] <bblack>	 yeah, plus I donno if it would cause some other regression with how the installer software itself deals with module loading, etc
[22:03:22] <volans>	 ack
[22:05:24] <volans>	 a third option could be installing stretch and downgrading it immediately to jessie (before first puppet)
[22:05:37] <volans>	 this one too could be quite messy
[22:06:32] <bblack>	 we have to go stretch eventually anyways, it's just always a PITA with all the custom packages we have
[22:08:58] <volans>	 sure, and with more than a few hosts, the less manual the process the easier it is
[22:09:35] * volans having some other weird and convoluted ideas not even worth mentioning ;)
[22:18:57] <wikibugs>	 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) What I know so far from testing on cp1075: * The various BIOS settings seem fine so far, I didn't have to change anything in BIOS or NIC or controller firmware s...