[08:08:11] 10Wikimedia-Apache-configuration, 06Operations, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2392961 (10elukey) Adding some random thoughts about the following snippet of c...
[08:28:24] bblack, ema: I'm upgrading libxslt on the cp* hosts, but will skip the nginx restart; it's only used by the ngx_http_xslt module, which we don't use
[08:37:42] <_joe_> wow libxslt
[08:37:49] <_joe_> brings me back to old times
[09:08:46] moritzm: ok, thanks!
[09:45:36] 10Wikimedia-Apache-configuration, 06Operations, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2393128 (10Joe) The "got bogus version 1" is a logging bug and has been fixed i...
[09:57:55] 10Wikimedia-Apache-configuration, 06Operations, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2393152 (10elukey) With the following code snippet I obtain: ``` First test: b...
[10:28:23] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2393240 (10elukey) We finally tracked down all the sources of null/missing end timestamps coming from Varnish: 1) Varnish Pipe logs,...
[14:35:28] something interesting that I found reading https://www.varnish-cache.org/docs/trunk/reference/vsl.html
[14:35:36] VSL tag
[14:35:37] Warnings and error messages generated by the VSL API while reading the shared memory log.
[14:35:56] -- VSL timeout
[14:35:57] -- End synth
[14:36:22] this doesn't seem to indicate a timeout occurred in Varnish
[14:36:38] but something weird happened to vk while reading shm
[14:38:57] like https://github.com/varnishcache/varnish-cache/blob/a50c99f6b3883d1a58cedfe26511bfc0d30d50bb/include/vapi/vapi_options.h#L80
[14:44:18] maybe due to super slow clients
[14:45:00] I was convinced that varnish was flushing all the tags once ready to log
[14:46:00] but from what I've seen vk is not blocked by slow reqs, otherwise it wouldn't be so fast
[14:46:21] (and also analytics wouldn't see the weird seqno misalignment)
[14:46:32] so probably it is a safety net for Varnish to log
[14:47:00] I could add a flag to VK to signal this
[14:47:40] not complete as bblack was suggesting, but something like https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L891
[14:49:11] VSL_Log:*
[14:56:57] well
[14:57:07] * elukey now hides
[14:57:14] :)
[14:57:24] it wouldn't make sense for varnish to flush a partial transaction from varnishd -> shm (as in, log the start of a long request, and not the end until 1h later)
[14:57:43] but maybe it does, and that's the workaround
[14:57:58] but either way, from the VSL perspective, long-running requests are going to get the artificial timeout as shown
[14:59:54] I would have thought that the VSL timeout would have been the send_timeout (or whatever it's called)
[15:04:04] bblack: if confirmed, would it make sense to have another tag like VSL_Log:* showing the error message? (or maybe true/false)
[15:20:41] <_joe_> bblack: when you have time, your opinion on https://gerrit.wikimedia.org/r/#/c/295202/ would be appreciated
[15:22:24] _joe_: can it actually be combined with other actions in a single salt call?
[15:23:37] <_joe_> yes
[15:23:44] <_joe_> IIRC
[15:24:14] how does that look?
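The `VSL` meta-records discussed above (14:35–15:04) can be watched directly from the shared memory log. A minimal sketch, assuming a Varnish 4.x `varnishlog` on the host; the simulated excerpt below stands in for real output, since the real thing needs a running varnishd:

```shell
# Print only VSL meta-records (warnings/errors emitted by the VSL API
# itself, e.g. the "timeout" record pasted above), ungrouped:
#   varnishlog -g raw -i VSL
# Offline, the same tag filter can be applied to a saved log. Here a tiny
# simulated excerpt (tag, then message) stands in for varnishlog output:
printf '%s\n' \
  'VSL            timeout' \
  'ReqURL         /wiki/Foo' \
  'VSL            store overflow' |
awk '$1 == "VSL" { print "VSL meta-record:", $0 }'
```

This is the same kind of filtering a `VSL_Log:*` formatter flag in varnishkafka would do internally: pass the meta-record message through instead of dropping the transaction silently.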
[15:25:10] <_joe_> I don't remember, I am looking it up
[15:25:22] <_joe_> but if that's not possible, this is a building block you can use
[15:25:24] I guess what I mean is, in the interest of not adding more code to maintain (new salt python modules), if it's not significantly better than composing via cmd.run 'depool; blah; repool', what's the point?
[15:25:28] <_joe_> to write your own module
[15:26:37] I guess for me that's not just about this review, but the whole general idea of defining salt modules that execute host commands using obvious existing metadata
[15:27:08] if we've already got it in command form, which is composable for more use-cases than just salt, why define a salt wrapper for the command?
[15:27:23] <_joe_> so the point is "if we have a common pattern that we run via cmd.run, let's use a module to automate that"
[15:27:42] I'm not sure that's a good thing
[15:27:48] <_joe_> and well, salt modules are easier to make work with one another
[15:28:19] it's adding an abstraction between what command is actually invoked and what you type, but only via salt and not via host CLI or other scripting/execution methods; it's adding more python code to maintain, etc.
[15:28:51] <_joe_> bblack: well it's pretty common with all remote exec frameworks
[15:29:03] <_joe_> like fabric, ansible, etc
[15:29:21] to define python wrapper classes that semi-automate what could be just shell executions?
[15:29:22] <_joe_> you write automation code in python instead of shell scripts every other time
[15:29:37] it seems wrong to me
[15:30:12] <_joe_> bblack: and yes, you can do things like
[15:30:39] <_joe_> salt 'targets' conftool.depool,cmd.run ,conftool.pool IIRC
[15:31:02] <_joe_> now that I think of it, that's what would be most useful to have via salt
[15:31:22] <_joe_> a wrapper around cmd.run that does depooling/pooling for you around your command
[15:32:14] the benefit over: salt 'targets' cmd.run 'depool; and it's more framework lock-in. if we build up a library of salt modules for everything and start relying on them, it gets harder to move to ansible or dsh or whatever else.
[15:33:06] <_joe_> well, we said let's try to use salt a bit before we decide to ditch it
[15:33:09] why define an abstraction and create piles of new python code to maintain, to make slightly-prettier what could be executed with any remote execution as shell command sequences?
[15:33:15] <_joe_> to try to build some automation for it
[15:33:38] but this doesn't really automate anything. it replaces one short string with another short string, on different sides of a quote character
[15:34:30] <_joe_> yes, this is pretty bland of course, but it can be useful as a building block - that's what I thought
[15:34:35] I'd rather see the automation on one host be tool-agnostic. we can use it when we log into one host to do single-host things, and use it the same way via any orchestration tool that can execute commands.
[15:34:53] <_joe_> ok so you want distributed ssh and nothing else
[15:35:10] maybe I could see it if I saw the end-goal. if there was some amazing automation at the end of this chain of thinking, which requires that these be modules rather than commands
[15:35:11] <_joe_> fair enough, but that's definitely not what salt is meant for
[15:35:35] what you're automating in salt is still only within one host
[15:36:20] if you have salt (or any of many other tools that can do dsh-like things), you can already do anything on the CLI on one host, and ask to execute that in parallel or serial with batching/timeout/etc.
[15:37:13] what is the actual upside of defining a salt module for anything other than shell exec? can it orchestrate in some novel way that can't be accomplished with dsh? that would be interesting to hear about.
[15:37:46] but otherwise all of those things should be CLI tools if they're not already, so they can be used without salt too. and in this case, it already is.
[15:39:42] <_joe_> I think this merits a wider discussion, tbh, as I don't necessarily agree with the idea that we should pack everything into cli tools. Salt has some advantages (like pillars and grains, and other salt modules) that make it better than writing a framework-agnostic cli tool
[15:40:26] well, separate out the concerns though: some of salt's functionality is more about config-management-like things than execution-like things
[15:40:33] <_joe_> in the specific case, you might be right (although I'd probably like to rewrite pool/depool/drain)
[15:40:39] pillars and such might be useful for salt-as-cfg
[15:41:13] <_joe_> well they are even for salt-as-exec, as it can help you determine how to act - like having the "site" and "cluster" variables available
[15:41:34] to use as conditionals on execution, yes. as arguments even, but then couldn't they be env vars?
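The env-var idea raised in that last question can be sketched out. Everything here is hypothetical: the file path and the `WMF_*` variable names are illustrative inventions, not an existing interface; the point is only that puppet could template such a file from the same facts that feed the salt grains:

```shell
# Hypothetical /etc/wmf_env.sh as puppet might template it from the same
# data that feeds the salt grains (names and example values are made up):
export WMF_SITE="eqiad"
export WMF_CLUSTER="cache_text"
export WMF_FQDN="$(hostname -f)"

# Any shell on the host (or a plain salt cmd.run) can then use them:
echo "host ${WMF_FQDN} is in site ${WMF_SITE}, cluster ${WMF_CLUSTER}"
```

With something like that in place, the generic invocation stays a plain `salt -C 'G@cluster:cache_*' cmd.run '. /etc/wmf_env.sh; echo $WMF_SITE'`, with no module beyond cmd.run, and the same variables work for manual CLI use or any future tool.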
[15:41:46] so long as you can do this:
[15:42:07] <_joe_> bblack: yes of course, if we don't use salt we need to expose those variables to tools in another way
[15:42:22] <_joe_> and it could well be /etc/wmf_env.sh
[15:42:25] salt -C 'G@cluster:cache_*' cmd.run 'echo $grain_site' (and I'm not sure the latter is possible, but we could define the env var via puppet just like we do the salt grain)
[15:43:07] it's useful to have grains (or similar) to control the salt (or similar) tool's execution targets. and it's useful to have similar vars we can use for the CLI (which arguably should be env vars too)
[15:43:39] but I have yet to see anyone show me something novel salt can do with any of this for the execution case (rather than the config case) that justifies any module other than cmd.run
[15:43:47] <_joe_> anyways, I'm flying to wikimania tomorrow, and you're going on vacation when I get back
[15:43:56] and doing it without justification -> excess python code to maintain, and tool lock-in
[15:43:59] <_joe_> bblack: simplifying the salt calls
[15:44:10] <_joe_> but well, ofc it's tool lock-in
[15:44:25] <_joe_> although it's very easy to adapt a salt module to be a cli tool
[15:44:27] yeah, but if wrapping the CLI tool in a module makes it easier to execute, you're just working around poor CLI tool syntax or lack of relevant env vars to use.
[15:44:30] fix the tool and the env vars
[15:44:46] <_joe_> I could actually make it so that the same code could do both
[15:45:21] <_joe_> at the end of the day, creating a salt module just saves me a few lines of boilerplate cli mangling
[15:45:23] yeah but... why? all of that just to replace cmd.run 'depool; foo; repool' with conftool.depool,cmd.run 'foo',conftool.repool
[15:45:46] <_joe_> ok think of more complex command lines than the conftool example
[15:45:47] make the CLI require less boilerplate, rather than having the salt module do it
[15:46:08] what's been fixed there is site/fqdn/cluster grains -> CLI args automatically
[15:46:27] fixing that for the generic CLI case isn't easy, but it's totally possible
[15:46:44] <_joe_> as I said, env vars are easy to define and import
[15:47:06] conftool has two distinct use-cases: operating on mass wildcards of things beyond just one host, and operating on services of the local host
[15:47:29] <_joe_> sorry, needing a pause for a smoke before the meeting
[15:47:31] <_joe_> :)
[15:47:45] adding a "-s" flag that pre-restricts fqdn to itself (site/cluster is redundant, right?) fixes that, and can look at `hostname -f` internally or whatever.
[15:48:24] <_joe_> well, conftool works better the more selectors you specify
[15:48:31] ok
[15:48:35] <_joe_> as in it uses fewer resources on etcd
[15:48:41] <_joe_> but it's easy to do that
[15:49:02] <_joe_> but let's take the puppet module
[15:49:14] <_joe_> ofc we can create a cli tool called "wmfpuppet" instead
[15:49:23] still, export our grains to global env vars via puppet, and give conftool a "-s" flag to automate the notion of hardcoding name=${grain_fqdn},site=${grain_site},cluster=${grain_cluster}
[15:50:12] <_joe_> bbiab
[15:50:20] if the CLI sucks to use for any of a number of standard tools, fix it for the CLI in the general case, and then cmd.run is enough and we get the benefits elsewhere (manual CLI, future tools)
[15:51:12] this whole same argument, btw, about salt-specific python code to maintain and complexity-in-salt, vs fixing things up for the general case in a tool-agnostic way, applies to how we use puppet, too.
[15:52:13] e.g. in the puppet sslcert module, we wrote a generic python script to manage CA certificate chaining, and puppet executes it when necessary, as opposed to writing custom puppet/ruby code to manage that process
[15:52:48] there's a reasonable dividing line in there somewhere, on putting complex things in agnostic tools, and using the CM/orchestration tools' functionality on top.
[15:53:17] I don't know exactly how you define that line in terms of correctness. maybe it's a taste thing to some degree.
[15:54:09] for cfg management it's very hard to draw that line in the sand correctly, and maybe there's more wiggle room, but I think on the whole we've been putting too much in puppet-specifics.
[15:54:34] for execution tools like salt, though, I have yet to see any compelling argument for putting more than intelligently-targeted remote execution in them...
[15:55:17] <_joe_> should we talk about this when we're both back from wikimania/vacations, with a few more people maybe?
[15:55:40] <_joe_> because I see your point and still think salt modules are kind of handy :)
[15:55:53] this sounds like a great discussion for an ops offsite too :)
[15:56:10] <_joe_> it's a bit far in the future though
[15:56:34] <_joe_> anyways, ops meeting!
[15:58:08] (and so my points don't seem so anti-salt: if we had taken the more-agnostic approach in our puppetization, transitioning CM from puppet->salt would be far simpler)
[16:12:17] <_joe_> well, we said let's try to use salt a bit before we decide to ditch it
[16:12:29] IT'S BEEN THREE YEARS
[16:42:59] 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2394147 (10BBlack) 3 day log (over the weekend, basically since the last update above on the 17th): New usernames over the past 3 days: ``...
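The conftool "-s" flag proposed in the 15:47–15:49 exchange would amount to pre-filling the selectors from the local host. A rough sketch of an equivalent wrapper; the flag does not exist, the wrapper is hypothetical, and the `confctl select ... set/pooled=...` selector syntax shown here may postdate the version discussed:

```shell
# Hypothetical "operate on myself" wrapper approximating the proposed
# "conftool -s": derive the name selector from the local fqdn instead of
# hardcoding name=...,site=...,cluster=... on every invocation.
# (Needs a reachable etcd backend; purely illustrative here.)
fqdn="$(hostname -f)"
confctl select "name=${fqdn}" set/pooled=no    # depool local services
# ... do the maintenance work here ...
confctl select "name=${fqdn}" set/pooled=yes   # repool
```

Since the name selector already identifies the host uniquely, the site/cluster selectors are redundant for this use-case (they would only reduce the etcd query scope, per the 15:48 point).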
[17:09:34] 10Traffic, 06Operations, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2394235 (10BBlack)
[17:15:00] 10Traffic, 06Operations, 06Performance-Team, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2394258 (10BBlack) @ori did notification go out? We're now 7 days from cert expiry.
[17:16:37] bblack: no, I forgot. Sending it now.
[17:16:41] Sorry.
[17:18:13] np!
[17:21:26] bblack: do you expect any downtime? (should we schedule any, just in case?)
[17:23:42] no downtime, just DNS switch
[17:39:51] bblack: what would happen to any open connections?
[17:40:23] they should be fine, right? users just have to watch out for client implementations which cache DNS lookups, I suppose?
[17:55:21] well
[17:55:40] if we're getting into all the ways clients can violate standards, that's a long list :)
[17:56:19] the DNS entry has a 1H TTL, and we could terminate existing connections a few hours after the DNS switch if we want to force clients over, causing a reconnect.
[17:56:33] I assume disconnect->reconnect is pretty normal, can happen for any number of reasons
[17:57:23] there's no reason to rush on shutting down the old service though, in the case of clients that did somehow indefinitely cache the IP
[17:57:36] ori: ^
[17:58:13] (plus it's nice to have it working if we have to revert. by the time we figure that out it may be too late to renew the old cert in time, but then again most of them don't use HTTPS today anyways, apparently)
[18:00:33] https://etherpad.wikimedia.org/p/rcstream-misc
[18:00:47] LMK if that looks good (and feel free to make changes) and I'll send it out
[18:00:51] bblack: ^
[18:36:23] ori: I switched the "when" to June 23rd. 27th is my first day of vacation, I'll be out for 2 weeks.
[18:36:32] wfm, thanks
[18:36:41] otherwise, lgtm
[19:13:53] cool, sent
[19:14:21] 10Traffic, 06Operations, 06Performance-Team, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2394859 (10ori) >>! In T134871#2394258, @BBlack wrote: > @ori did notification go out? We're now 7 days from cert expiry... [19:27:22] thanks! [19:29:40] np, sorry again for letting it drop
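For the 1H-TTL reasoning in the 17:56 exchange: the TTL a resolver is still holding on the old record can be checked directly before and after the cutover. A minimal sketch (live DNS query, output varies by resolver):

```shell
# Second field of the answer is the remaining TTL the queried resolver
# will keep serving the cached record for; well-behaved clients should
# follow the DNS switch within at most that long (clients that cache
# lookups indefinitely, as noted above, are the exception).
dig +noall +answer stream.wikimedia.org
```

Polling this against the resolvers clients actually use gives a concrete "safe to terminate old connections" time a few hours after the switch, matching the plan discussed.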