[13:38:31] <_joe_>	 ema, bblack: wanna take a look at https://gerrit.wikimedia.org/r/#/c/294694/ ? I guess you do the puppet disable/enable dance as often as me
[13:45:08] <bblack>	 it would be nice if puppet would support a better version of that on its own, too
[13:46:17] <bblack>	 e.g. at the puppet level, consider enable/disable with no message just to be the empty-message, but still consider it a message.
[13:47:01] <bblack>	 then, on every disable, record the message into a list of all current reasons it's disabled (unless the same message already exists, of course).
[13:47:21] <bblack>	 and on every enable, only remove this message (which may leave the agent disabled for other reasons)
[13:48:18] <bblack>	 and then perhaps a special flag "puppet agent --enable --force" or something to wipe all outstanding disable-messages
[13:50:41] <ema>	 and --sal update SAL  
[13:50:58] <ema>	 (which would be cool to have on the pool/depool scripts too)
[13:54:03] <_joe_>	 bblack: yeah that would be nice
[13:54:29] <_joe_>	 ema: we have it on confctl but because of how tcpircbot works, I could not add it for generic hosts
[13:54:39] <bblack>	 well pool/depool wrap confctl, which in turn logs, but yeah only from palladium
[13:55:18] <bblack>	 IMHO we should swap out the confctl logging for a separate confd watcher daemon that runs somewhere and just logs etcd updates from anywhere
[13:55:48] <_joe_>	 bblack: won't that be too noisy?
[13:55:48] <bblack>	 although that would lose the (future, in confctl) ability to add a custom trailer to confctl SAL entries, like a ticket number
[13:56:06] <_joe_>	 bblack: we can create a confctl audit log
[13:56:09] <bblack>	 probably!
[13:56:16] <_joe_>	 that we can keep e.g. in mysql
[13:56:26] <bblack>	 or......
[13:56:30] <_joe_>	 or?
[13:57:08] <bblack>	 we could hack rsyslog config such that if the programname is SAL, the syslog entry is forwarded to a special syslog server that forwards to SAL and thus IRC, and have confctl log actions to syslog as SAL: foo
[13:57:14] <bblack>	 might come in handy for lots of things
[13:58:30] <_joe_>	 yes
[13:58:35] <bblack>	 logger -t SAL "SAL this from any host"
[13:58:42] <_joe_>	 that's not a bad idea
[13:59:00] <_joe_>	 now, who is going to implement it? :D
[13:59:35] <hashar>	 maybe a SOA saloid service? :D
[14:00:34] * _joe_ beats hashar with a tick wooden stick
[14:00:46] <_joe_>	 saloid, geez :P
[14:00:47] <bblack>	 saloid better implement swagger specs or else
[14:00:59] <_joe_>	 don't enable him!
[14:01:12] <hashar>	 well Bryan idea of pilling up the sal messages to elastic search is quite nice
[14:01:13] <hashar>	 https://tools.wmflabs.org/sal/production  :D
[14:01:32] <bblack>	 except then how do we log SAL messages about fixing elasticsearch? :)
[14:01:35] <hashar>	 no idea what is the whole workflow of a message though
[14:02:17] <_joe_>	 bblack: well, it's a dedicated instance
[14:02:22] <_joe_>	 in toollabs
[14:02:27] <_joe_>	 IIRC
[14:02:40] <bblack>	 oh ok, let's have prod SAL depend on labs+toollabs not being broken at any moment :)
[14:09:00] <hashar>	 or promote the labs setup to production grade
[14:29:59] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2385376 (10BBlack)
[14:33:55] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Stream: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2385381 (10BBlack) Note: based on simple test python and javascript clients, websocket client libraries tend to not support 301 redirects to HTTPS.  So we'll prob...
[14:47:41] <wikibugs>	 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2385440 (10BBlack) As discussed in email, now that we're past the first deadline date and we've been posting username lists on public wikis...
[14:49:39] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2385447 (10BBlack) Latest list of accounts still making insecure requests over the past ~24H: T136674#2385440
[16:13:23] <paravoid>	 I had proposed fedmsg before for this kind of thing
[16:23:58] <elukey>	 bblack: Hi! I left varnishlog working on cp3008 this night and I found this tag
[16:24:03] <elukey>	 -   ReqURL         /mediawiki/1.26/mediawiki-1.26.3.tar.gz
[16:24:06] <elukey>	 [...]
[16:24:27] <elukey>	 -   VSL            timeout
[16:24:27] <elukey>	 -   End            synth
[16:24:34] <elukey>	 and no Timestamp:Resp
[16:24:37] <bblack>	 ok
[16:24:44] <bblack>	 that kinda makes sense
[16:24:47] <elukey>	 yeah
[16:24:55] <bblack>	 can you save the full log from that request somewhere?
[16:25:01] <elukey>	 sure, doing it
[16:25:50] <bblack>	 I'm semi-relatedly planning to do a bunch of logged artificial testing of failure scenarios with vcl_synth and/or vcl_backend_error to figure out what's going on with our redundant synth output and X-Cache issues anyways
[16:25:59] <bblack>	 will figure out the timeout scenario while I'm at it to confirm the log entry
[16:27:20] <bblack>	 I already found one interesting related thing: vcl_backend_error is documented to only be able to return deliver or retry (and combined with other restrictions, that makes our life hard), yet the source code seems to indicate we could return "abandon" and get the behavior we want :)
[16:29:00] <bblack>	 (I think we we did our retry-503-once logic there and then "return (abandon)", it would end up going through vcl_synth or vcl_deliver where we can fix up X-Cache)
[16:29:16] <bblack>	 (and then we don't need to synthesize an errorpage in there either)
[16:30:36] <elukey>	 all right you can find the log in /home/elukey/misc_release_timeout_log.txt on cp3008
[16:30:44] <bblack>	 thanks!
[16:31:07] <elukey>	 the sleepy version of myself didn't think to use | tee 
[16:31:15] <elukey>	 so I had to copy/paste from tmux
[16:31:23] <elukey>	 if you need more data I can add other logs
[16:32:01] <elukey>	 for the moment we are seeing varnishkafka logging timestamp: - for socket.io and release.tar.gz requests, and I bet all of them are timeouts
[16:32:07] <elukey>	 without Timestamp:Resp
[16:32:41] <elukey>	 now the question is: do we want to log them have them logged by vk or shall we filter them?
[16:33:40] <bblack>	 well socket.io is the pipe case
[16:34:30] <elukey>	 yeah I already filter out there requests (~ correspondent to Upgrade websocket req) but I am still seeing socket.io related logs with timestamp -
[16:34:47] <elukey>	 so I am kinda guessing that they may are experiencing timeouts too?
[16:36:15] <bblack>	 possibly!
[16:37:38] <bblack>	 I don't really get the timeout problem in this one
[16:37:52] <bblack>	 I mean it says "timeout", but it seems like it timed out in 10ms
[16:37:59] <bblack>	 sorry ~100ms
[16:38:57] <bblack>	 also, it had already streamed the applayer response headers (200 OK, etc) before it timed out, so it wasn't able to give a real error to the user.
[16:40:16] <bblack>	 perhaps the strange fast timeout has to do with connection-reuse? e.g. this was a reused backend connection that was used for many other requests, and was already almost idle long enough to get disconnect by some idle timeout
[16:40:26] <elukey>	 yeah vk was reporting that too.. HTTP 200 and no timestamp
[16:40:33] <bblack>	 and thus some race condition had it terminate ~100ms after a new request was fired off that would've kept it alive
[16:41:31] <bblack>	 that race would probably always exist in the case that a backend had a shorter idle timeout than the varnish talking to the backend
[16:41:37] <bblack>	 (even if the backend was another varnish)
[16:42:14] <bblack>	 still, two separate problems: (1) explore how timeouts work to make sure we can handle logging them, and (2) fix whatever stupid issue is causing a 100ms timeout
[16:43:56] <elukey>	 interesting, didn't think about connection re-use
[16:44:01] <bblack>	 https://info.varnish-software.com/blog/understanding-timeouts-varnish-cache
[16:44:17] <elukey>	 (brb in a bit)
[17:03:01] <elukey>	 wow a lot of timeout settings :D
[17:13:03] <wikibugs>	 10Traffic, 06Operations: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10BBlack)
[17:13:15] <wikibugs>	 10Traffic, 06Operations: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385892 (10BBlack) p:05Triage>03Normal
[17:13:26] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10BBlack)
[17:15:32] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385900 (10BBlack)
[17:16:28] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10BBlack)
[17:54:31] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386066 (10BBlack) Other interesting references:  https://datatracker.ietf.org/doc/draft-alakuijala-brotli/ (IETF standard, seems pretty far along in the approval process) https://blog.cloudfla...
[18:11:33] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386165 (10BBlack) p:05Normal>03Low A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requests indicate brotli support in Accept-Encoding.  Not big e...
[18:20:12] <wikibugs>	 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2386183 (10Whatamidoing-WMF) I've posted notes for the newest four.
[18:21:29] <wikibugs>	 10Traffic, 06Operations, 06Wikipedia-Android-App-Backlog, 10iOS-app-Bugs: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#2386185 (10Mholloway)
[18:21:44] <wikibugs>	 10Traffic, 06Operations, 06Wikipedia-Android-App-Backlog, 06Zero, 10iOS-app-Bugs: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#2386188 (10Mholloway)
[18:33:46] <ori>	 bblack: I think you might have been counting vhtcpd reqs by mistake
[18:33:54] <ori>	 [cp1055:~] $ varnishncsa -n frontend -m 'RxRequest:^(?!PURGE)' -F "%{accept-encoding}i" | head -100000 | grep -c br
[18:33:54] <ori>	 35471
[18:34:00] <ori>	 ~35%
[18:34:03] <ori>	 pretty consistent
[18:35:38] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10ori) >>! In T137979#2386165, @BBlack wrote: > A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requests indicate brotli support in Accept-Enc...
[18:38:29] <bblack>	 ori: you're right, I wasn't thinking and just logged all AE's
[18:39:49] <bblack>	 in any case, it might be better to improve gzip first, now that I read that cloudflare thing
[18:40:07] <bblack>	 deploy their faster library, bump our compression level from 6 to 8, probably no real problem CPU-wise.
[18:40:25] <bblack>	 (since most objects are cacheable, and are cached once at the backend to store them)
[18:40:32] <bblack>	 s/cached/compressed/
[18:40:39] <ori>	 I also note that 10% of requests have the user-agent string "ApacheBench/2.3" or "Faraday v0.9.2"
[18:40:49] <bblack>	 nice
[18:42:05] <ori>	 so inspiring, volunteer load-testing
[18:43:28] <bblack>	 it's all one IP that happens to hit cp1055, some ISP in brazil
[18:44:34] <bblack>	 apparently he's benchmarking favicon
[18:44:59] <ori>	 yes, I think this is a case where gender assumptions are pretty safe
[18:45:16] <bblack>	 heh
[18:45:28] <bblack>	 also, over HTTP and getting 301s to HTTPS I think, but it stopped before I looked at that
[18:46:02] <ori>	 Faraday v0.9.2 is hitting the API
[18:46:21] <bblack>	 Farady seems like some kind of legit HTTP library 
[18:46:37] <bblack>	 https://github.com/lostisland/faraday
[18:48:30] <bblack>	 if gzip comp level 8 turns out to be a pain CPU-wise (with or without help from cf's library), an interesting patch to varnishd would be to have two different compression levels when varnish has to compress an incoming object, depending on whether it's cacheable for >X seconds or not.
[18:48:53] <bblack>	 use 6 for uncacheable pass-traffic and <5min TTL objects, use 8 or 9 for storing long-lived objects into compressed cache
[18:49:42] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10Krinkle) >>! In T137979#2386210, @ori wrote: >>>! In T137979#2386165, @BBlack wrote: >> A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requ...
[18:51:39] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386232 (10BBlack) Yeah ori's right, I didn't filter properly.  Interesting!
[19:01:06] <bblack>	 if the brotli% is that high and growing, we might plan for an eventual switch to brotli-only in varnish and decomp->recomp for gzip-only clients in nginx at the edge, if it works out on CPU
[19:02:12] <bblack>	 (it would take a while to put the pieces together anyways, and by then maybe we're closer to ~80%)
[20:31:52] <wikibugs>	 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2343854 (10Danmichaelo) Fixed CatWatchBot and hopefully the remaining tasks for DanmicholoBot
[20:53:27] <gehel>	 bblack: I'm about to merge https://gerrit.wikimedia.org/r/#/c/294390 (removal of referrer check for maps).
[20:53:52] <wikibugs>	 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2386912 (10GWicke)
[20:54:49] <gehel>	 bblack: if I read the code correctly, that requires a manual reload of the varnish for maps
[21:06:29] <bblack>	 gehel: huh?
[21:07:06] <bblack>	 gehel: nothing manual is required
[21:07:12] <gehel>	 bblack:  the removal of the referrer check (now that all blockers to that are closed) is a change in VCL
[21:07:35] <gehel>	 varnish auto reload VCL? I don't see a service notification in puppet code...$
[21:08:25] <bblack>	 VCL updates are automatic, yes
[21:08:34] <bblack>	 just puppet-merge and wait half an hour or so for puppet to run
[21:08:58] <gehel>	 ok, thanks! sorry to bother you again :(
[21:09:16] <gehel>	 why not salting puppet agent?
[21:09:52] <bblack>	 because there's no critical need?
[21:10:19] <bblack>	 if salted puppet agent on every ops/puppet commit...
[21:11:01] <bblack>	 I would have liked to see more review on puppetization as well
[21:11:58] <bblack>	 "we are confident that we can reinstall a server in a reasonable timeframe"... who's we, and where's the secret sauce? is the unpuppetized portion of the process at least documented well enough that anyone in ops has a chance of following it?
[21:17:25] <gehel>	 The "we" is me and the maps team. To be honest, the most problematic part in getting those new servers in place was my incompetence on all those components.
[21:18:08] <wikibugs>	 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387040 (10GWicke) Another possibility is that there are issues with the eventlogging instrumentation. The number...
[21:19:03] <gehel>	 There are now a couple of scripts for the data load. They need to be tested more, but at least they are a fairly good documentation of the steps to take.
[21:20:38] <wikibugs>	 10Traffic, 06Discovery, 10Kartotherian, 06Maps, and 3 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848#2387056 (10Gehel) 05Open>03Resolved
[21:21:56] <bblack>	 gehel: still, the process needs documentation, at least on wikitech, of the parts that aren't puppet-automated
[21:22:45] <gehel>	 bblack: right. I'll do as much as I can asap...
[21:23:20] <bblack>	 also, when puppet does run on maps hosts (I noticed this yesterday), it always does:
[21:23:23] <bblack>	 Notice: /Stage[main]/Postgresql::Postgis/Package[postgresql-9.4-postgis]/ensure: ensure changed 'purged' to 'present'
[21:23:26] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps2003-v4]/Exec[pass_set-replication@maps2003-v4]/returns: executed successfully
[21:23:29] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps-test2002-v4]/Exec[pass_set-replication@maps-test2002-v4]/returns: executed successfully
[21:23:32] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps-test2003-v4]/Exec[pass_set-replication@maps-test2003-v4]/returns: executed successfully
[21:23:35] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps2002-v4]/Exec[pass_set-replication@maps2002-v4]/returns: executed successfully
[21:23:38] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps2004-v4]/Exec[pass_set-replication@maps2004-v4]/returns: executed successfully
[21:23:41] <bblack>	 Notice: /Stage[main]/Role::Maps::Master/Postgresql::User[replication@maps-test2004-v4]/Exec[pass_set-replication@maps-test2004-v4]/returns: executed successfully
[21:23:44] <bblack>	 Notice: /Stage[main]/Cassandra::Metrics/Base::Service_unit[cassandra-metrics-collector]/Service[cassandra-metrics-collector]/ensure: ensure changed 'stopped' to 'running'
[21:23:54] <bblack>	 what is it constantly supposedly reinstalling a purged package and restarting a dead metrics collector? and what's with all the slow execs related to replication?
[21:24:18] <bblack>	 I don't know if anyone outside the authors has really reviewed the maps puppetization that is there
[21:24:43] <bblack>	 but a good step 1 is making sure someone else uninvolved can at least follow a documented practice to install/reinstall a working maps server
[21:25:17] <gehel>	 I am under the impression that I am this someone else...
[21:25:57] <bblack>	 no, I by "someone else" I mean "anyone else"
[21:26:39] <bblack>	 which means docs that reall work if you follow them and don't know tricky things that were learned along the way the first time
[21:28:04] <gehel>	 What I mean is that the fact that I could reinstall those servers (with some help from Alex) is a good indication that the automation is mostly working.
[21:29:16] <bblack>	 yeah but I bet I couldn't
[21:29:24] <gehel>	 Most of the difficulties I found along the way were because I did not know anything about Cassandra, almost nothing about Postgres.
[21:30:20] <bblack>	 exactly.  we shouldn't need to know deep tech details.  plenty of ops folks aren't going to have deep knowledge about cassandra and postgres.  whatever puppet can't automate should be in a document that boils down to "execute these commands blindly and it will work"
[21:32:00] <gehel>	 Agreed. And no, we're not entirely there, but I think we are close enough. That might have been a bad call...
[21:46:19] <wikibugs>	 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387115 (10Mholloway) Sure, one or both of us can look into this.
[22:04:48] <wikibugs>	 10Traffic, 10DBA, 06Labs, 06Operations, 10Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376097 (10Legoktm) >>! In T137707#2382924, @Antigng_ wrote: > Lack of hard and fast limit on read requests can be a problem, since your definition of request limit...