[00:09:23] ema: maps+misc are past the libvmod-netmapper upgrade barrier now [00:09:58] ema: using this as a salt command worked: puppet agent --disable netmapper-fixup; run-no-puppet echo; service confd stop; apt-get install libvmod-netmapper; /usr/local/sbin/varnish-frontend-restart; service confd start; puppet agent --enable [00:10:32] for misc I used "-b 3", for maps "-b 2". for upload I'm going to kick off the same, but "-b 1" and maybe run the puppet agent at the end and/or add an additional sleep there, to slow it down a bit [00:10:57] (it's tolerable without either one, but I've tested that enough already lately, may as well make things smoother for end-user perf) [00:18:36] once that's done (it's in-progress now, but it's going to take a while), what's left is the actual process of installing the kernel + dist-upgrade on all the caches and slowly rebooting them [00:19:19] can do the apt work all at once globally. for the reboots, they'll self-depool around those. just need to pick a timing and have salt kick them off serially for cluster:cache_* and it will spread them around smooth enough [00:20:02] using the "echo reboot | at now + 1 minute" hack, and some sleep-loop around salting them all? [00:20:56] I'm forgetting two important details of course: touching the pool-once file prior to reboot to make sure they self-repool [00:21:06] not sure what else, hmmmm [00:21:15] oh icinga downtime [00:22:26] and of course there's a chance we still have some stupid thing going on with esams bios where they won't reboot ok without a kick from the console.... [00:22:48] I seem to remember being worried about that last time, but then it didn't happen, presumably because things got "fixed" by the last round of reboots before that? [02:22:16] in any case, I'm out for the evening. authdns+lvs are all done. the varnish4 clusters are all past the libvmod-netmapper issue, but haven't gotten kernel upgrade or dist-upgrade or reboots [07:14:04] * volans takes notes for automation use cases ;) [08:42:58] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696081 (10elukey) Another idea! The downside of adding special rules in the main httpd config file imo... [09:27:42] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696123 (10elukey) >>! In T109226#1543444, @ori wrote: > The root cause rests with the interaction of th... [10:49:01] I'm going oldschool through cache_maps reboots sequentially [10:49:03] https://phabricator.wikimedia.org/P4171 [10:49:07] https://phabricator.wikimedia.org/P4172 [10:49:12] https://phabricator.wikimedia.org/P4173 [10:50:07] unfortunately a bit of babysitting is needed, cp2015 somehow got stuck and needed power cycling [10:57:37] hosts file sorted to space reboots through the DCs: cp1047, cp2009, cp4011, cp3004, cp1060, cp2021, ... [11:49:55] _maps reboots went fine except for cp2015 getting stuck and a few icinga false positives (strongswan and ntp) [11:50:47] I'll grab a bite and then carry on with misc [11:56:27] I checked with ema some vk errors for non webrequest instances like statsv and eventlogging (role::cache::kafka::eventlogging and role::cache::kafka::statsv). Only cache::text seems to have those roles, so probably misc and maps have old systemd units? 
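A rough sketch of the serialized reboot loop described above, assuming a hosts file pre-sorted to alternate DCs (cp1047, cp2009, cp4011, cp3004, ...). The pool-once marker path, the sleep interval and the icinga downtime step are illustrative guesses, not the exact commands that were run:

    while read -r host; do
        # schedule icinga downtime for "$host" here (left out of this sketch)
        ssh "$host" 'touch /var/lib/traffic-pool/pool-once'   # hypothetical path: tell traffic-pool to self-repool after boot
        ssh "$host" 'echo reboot | at now + 1 minute'         # the "at" hack mentioned above
        sleep 600                                             # let it depool, reboot and repool before the next one
    done < hosts.txt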
[11:57:07] there are no config files under /etc/varnishkafka other than webrequest.conf for misc and maps [13:35:20] the others (statsv, eventlogging) are specific to the text cluster [13:36:54] bblack: were they also deployed in the past in misc/maps? [13:38:10] not to the roles, no [13:38:33] but it's entirely possible to find remnants of one role in another role's host, sometimes they were re-roled without re-installing [13:41:07] ahh didn't think about this possibility [13:41:33] so we just need to do some cleanup in maps/misc [13:43:26] what's wrong? [13:44:01] did they start kafka::eventlogging or kafka:statsd daemons on reboot? [13:44:42] I think that systemd tried to bring them up but since there is no config they failed [13:44:45] elukey@cp4020:~$ sudo systemctl -a | grep varnishkafka [13:44:45] ● varnishkafka-statsv.service loaded failed failed VarnishKafka statsv varnishkafka-webrequest.service loaded active running VarnishKafka webrequest [13:44:51] ● varnishkafka-eventlogging.service loaded failed failed VarnishKafka eventlogging [13:45:07] ok [13:46:18] oh, there may be an additional confusion factor in all of this [13:46:33] in that at some point we moved custom systemd unit deployment from /etc/systemd/ to /lib/systemd/ .... [13:47:26] in the meantime, misc reboots are going fine. I'll start with text now [13:47:35] awesome :) [13:48:00] no upload yet? [13:48:12] we can do upload and text in parallel, yes [13:48:20] I was thinking, should ensure=>absent the weekly restart cron a bit before the upload reboots, and turn it back on a bit after they're all done [13:48:36] (for upload, so they don't restart backends just before/after/during all of this and add to the pain) [13:48:46] +1 [13:52:17] heh I've been staring at salt outputs about these mystery vk services for like 10 minutes now [13:52:35] I just realized the reason they're confusing is I typed /lib/systemd/systemd/... [13:53:36] so yeah, old variants exist on various maps/misc/upload cache host, both under /etc and under /lib [13:54:29] fixing up the mess via salt... [13:55:30] Hey ema, bblack. Soon there is a new (better) version of my thesis. I´ll give you both the new version as soon as it´s ready. :) [13:55:41] Snorri: nice, thanks! [13:58:42] 10Traffic, 10netops, 10DNS, 06Operations, 10ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2696596 (10faidon) 05Open>03Resolved a:03faidon Two weeks have passed and this hasn't reoccurred. I'm going to resolve this for now — we can reopen if it happens again or if we h... [13:59:43] 10Traffic, 06Operations: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2696608 (10faidon) [14:00:32] 10netops, 06Labs, 06Operations: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#2696632 (10faidon) We agreed on all of the above during the Barcelona offsite. We've preliminary agreed to attempt implementing them in tandem with the Neutron migration, which wou... 
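What "fixing up the mess via salt" mentioned above might look like in practice; the compound targeting and the unit-file paths are assumptions based on the discussion, and cache_text is deliberately excluded since it runs the real statsv/eventlogging instances:

    salt -C 'G@cluster:cache_maps or G@cluster:cache_misc or G@cluster:cache_upload' cmd.run '
        for u in varnishkafka-statsv varnishkafka-eventlogging; do
            systemctl disable $u.service
            systemctl reset-failed $u.service
            rm -f /etc/systemd/system/$u.service /lib/systemd/system/$u.service
        done
        systemctl daemon-reload'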
[14:03:31] bblack: https://gerrit.wikimedia.org/r/#/c/314560/ [14:10:41] merged :) [14:11:47] I saw I missed it by several minutes, figured I'd jump ahead of anyone making it rebase before it happened :) [14:11:51] I didn't salt puppet though [14:14:14] salting puppet [14:14:20] 10Traffic, 06Operations: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2696723 (10BBlack) 05Open>03declined Per-route congestion control is complicated, and DCTCP requires ECN support from our network gear, and may not play nice with other concurr... [14:17:02] 10Traffic, 06Operations, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693856 (10Joe) a:03Joe [14:17:10] misc done [14:24:50] elukey: re: https://gerrit.wikimedia.org/r/#/c/314336/1 ..... notice the old comment about "We want to send batches at least once a second", and I changed that wording to "not send batches more often than once per second".... [14:25:02] <_joe_> bblack, ema I'm going to upgrade conftool in ~ 20 minutes, if all goes well [14:25:17] <_joe_> anyone around to test everything is ok? [14:25:39] elukey: I think since the old number (6000) looked like something that was considered a per-cache-host peak rate in the past (of some kind)... I kinda figured the comment's wording was bad and that my new wording was what was intended... [14:25:54] elukey: but I'm not really 100% sure [14:26:51] elukey: but looking at the old file, it doesn't make sense to say "our worst/highest rate is 6000/s, so set the batch_num_messages to 6000 to batch *at least* once per second" ... what that would actually do is batch at most once per second? [14:27:24] _joe_: ema's in the midst of reboots that are using the depool/pool/confctl stuff [14:29:06] <_joe_> bblack: ok so I'll upload the package, install it on puppetmaster1001, then let ema test if if he wants to [14:29:29] <_joe_> also, I don't need to build it for precise anymore! \o/ [14:29:47] _joe_: sounds good, let's not update it on the cp hosts though [14:29:56] _joe_: yeah that sounds fine. as long as confctl doesn't change on the end-hosts in the midst of the process, we can test/upgrade after [14:29:57] <_joe_> ema: agreed [14:30:20] <_joe_> bblack: ok, I'll leave it to you to upgrade the caches [14:30:29] we don't have ensure=>latest, do we? :) [14:30:43] eh who knows [14:30:43] <_joe_> ema: I don't think we're that dumb, no [14:31:11] <_joe_> require_package('python-conftool') [14:31:17] <_joe_> we're not :P [14:31:22] great [14:31:28] bblack: so librdkafka states that batch.num.messages is the the minimum number of messages to wait for to accumulate in the local queue before sending off a message set." 
so I think that the meaning was "we create more or less 6k messages/second so vk will ship batches of events at least once a second, but if there are more events it might send more stuff [14:32:03] this is my interpretation [14:32:15] but it doesn't make a lot of sense I agree [14:32:18] well, in the context of the actual cache boxes, even back at the time this was last tuned [14:32:18] I'll start rebooting upload [14:32:43] 6000 was apparently a "this is the highest number we observe on any kind of individual cache box", but there would have been many boxes with much lower rates, too [14:33:06] exactly [14:33:15] so if it actually waits for 6000 to accumulate before sending a batch, with no other constraint, it could be a long time between batches on some caches [14:33:30] I kind of assume there's some timeout too so that it eventually sends a smaller count on a low-rate box? [14:33:52] this is a good point [14:34:58] queue.buffering.max.ms - how long to wait for batch.num.messages to fill up in the local queue. [14:35:11] the opposite extreme is codfw cache_misc machines, whose reqs/sec rate daily average under "normal" conditions is slightly under 1. [14:35:30] whereas the 9000 is a worst-case with e.g. esams+ulsfo depooled on the text cluster in eqiad [14:35:55] default kafka.queue.buffering.max.ms is 1 second [14:36:13] it would take a codfw cache_misc box well over 1h to generate 6k reqs heh [14:36:44] ah ok! [14:37:09] so the min rate, if there's traffic to log at all, is 1/s [14:37:43] and the batching only really kicks in if we're getting more than batch.num.messages/sec ? [14:38:45] yes, librdkafka ships batches of batch.num.messages if they are queued before the 1 second timeout [14:38:47] so many tunables.... [14:38:56] I wasn't aware of this config [14:38:59] really interesting [14:39:44] assuming we had some crazy DoS scenario and a cache box was getting nearly-infinite reqs/sec [14:40:10] ... [14:40:26] well I donno, it just seems strange this parameter exists at all. is there no effectively maximum to the byte size of a batch? [14:41:05] you'd think we'd still want them batched up 1/s for some efficiency reason, and so we'd want batch.num.messages to be very large, up to whenever some other limit kicks in [14:42:06] there are tons of tunables in https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md [14:42:08] (I guess this gets into "what is a batch"? is it a burst of UDP packets?) [14:42:25] but this is master and we use 0.8 IIRC [14:42:45] Maximum number of messages batched in one MessageSet. The total MessageSet size is also limited by message.max.bytes. [14:43:02] whose default is 1M [14:43:12] so that's the "other" limit [14:43:24] ah no we use librdkafka 0.9 [14:43:40] in other news, we've been quite diligent triaging stuff in the workboard so far, let's see how long it lasts :) [14:44:07] if before we reach this 9000/sec rate, the msgs in a 1s batch exceed 1 million bytes in total size, it's going to batch on that instead [14:44:30] ema: it will last as long as we're still making interesting changes to the workboard itself + 1 month :) [14:45:19] so ideally we should make one interesting change to the workboard per month and then we're good! [14:45:37] andre emailed me btw, asking basically why the hell we have a BadHerald column and such when we could just fix the Herald rules :) [14:45:56] bblack: so message.max.bytes is intended as the size of the batch right? 
[14:46:13] elukey: well, it's the maximum possible size of a batch in bytes [14:46:36] yes sorry this is what I meant :) [14:46:38] so if we hit that limit before we hit 9000 messages squeezing in, it's going to send more than 1 batch/sec regardless [14:47:30] also the default batch.num.messages is 10000 [14:48:30] in any case, now that we've had this conversation, I don't think the change is doing anything too stupid [14:50:42] nope I think it is good [14:50:53] maybe we can wait ottomata and double check with him [14:51:32] _joe_: the apache patch to kill bits in https://gerrit.wikimedia.org/r/#/c/305536/ ... merging anything with apache config scares me, do we have a special process around it? [14:54:31] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696809 (10BBlack) Side note on side note: we had some Varnish/VCL conditional code to treat hhvm and Ze... [14:55:12] <_joe_> bblack: just canary deploying, usually, but I'm sure there are some notes on the software deployment page [14:55:22] <_joe_> let me finish one thing and I'll find it [14:55:51] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696811 (10BBlack) On your repro attempts: I think the original case that was badly cached was for users... [15:01:25] 10Traffic, 06Operations: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696846 (10BBlack) [15:03:48] ema: the cp3034 hang sounds like the old problems. in at least one past reboot cycle, the majority of cp30[34]x suffered from it I think [15:04:32] I kept a few offline and went through their BIOS (etc) settings on the console manually, comparing to cp107[1-4] which are identical hardware/generation and don't have this problem, and couldn't find any meaningful diff [15:04:43] there was a theory they needed BIOS/iDRAC firmware updates, maybe [15:05:15] in this case it hanged during boot, I could see this in console: [OK [15:06:04] hmmm, as in [OK was part of systemd bringing up services? [15:06:35] bblack: same thing just happened to cp2011 if you want to take a look [15:06:52] eep [15:07:01] new race condition from the dist-upgrade? [15:07:21] ? [15:08:00] just speculating on how we've got these two machines possibly hanging after userland is starting up [15:08:11] cp2011 so far outputted a newline since I connected a few seconds ago [15:08:44] mmh [15:09:27] did you dist-upgrade them all earlier? [15:09:36] I never really saw, kind of assumed [15:09:54] yes I did [15:10:37] bblack: any news on cp2011's console? Should we just power-cycle it? [15:10:44] debugging [15:10:50] alright [15:10:58] trying to, anyways :) [15:12:49] booted fine this time [15:12:55] I didn't powercycle from racadm though [15:13:07] I did a soft powercycle from inside the "console com2" [15:13:26] with the dell-magic Esc+Shift+R, Esc+R, Esc+Shift+R [15:13:39] (it's their serial console equivalent of ctrl+alt+del) [15:13:55] TIL! [15:14:16] I was hoping it was more-likely to repro the problem than a hard powercycle [15:14:36] maybe attach some consoles ahead of the upcoming reboots and try to catch one in action? [15:14:58] (what I wouldn't give for console servers that preserve some MBs of history to scroll back through... they exist...) 
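Pulling the batching discussion above together as a quick back-of-envelope check (the per-message size is an assumed figure; the other numbers are the ones quoted in the conversation):

    msgs_per_sec=9000          # worst-case per-host rate mentioned above
    avg_msg_bytes=200          # assumed average message size, not a measured value
    max_batch_bytes=1000000    # message.max.bytes default ("1M") quoted above
    echo "$(( msgs_per_sec * avg_msg_bytes )) bytes/s vs message.max.bytes=$max_batch_bytes"
    # 1800000 > 1000000, so at peak rate the byte limit would split the 1-second batch
    # before batch.num.messages=9000 is ever reached; on low-rate boxes the 1s
    # queue.buffering.max.ms timeout dominates instead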
[15:15:54] varnish-be failed repool on cp3034, pooling manually [15:16:04] Raft Internal Error? [15:16:18] I imagine so yeah [15:16:42] Oct 06 15:05:42 cp3034 sh[2400]: ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure [15:16:54] yeah [15:17:12] journalctl -xn 100 -u traffic-pool.service [15:17:33] ^ when I execute commands like that and they show me something useful, it's one of the few times I actually like systemd [15:17:59] :) [15:18:12] I went brute force instead [15:18:13] pooled changed no => yes [15:18:13] root@cp3034:~# [15:18:17] yeah, not that [15:18:22] journalctl --since=today|grep -i raft [15:18:40] at $jobs -1 and -2, I used Cyclades brand serial console servers. They were Linux-based on the inside, running on some embedded PPC platform [15:19:15] they were bought by Avocent which was bought by emmerson, so probably this is roughly the modern equivalent, assuming they're still the same at all: http://www.emersonnetworkpower.com/en-US/Products/InfrastructureManagement/SerialConsoles/Pages/AvocentACS6000AdvancedConsoleServer.aspx [15:19:54] because it was linux on the inside, it was easy to customize things like ssh auth and ldap and and keys and blah blah [15:20:15] and they logged the serial console outputs at all times so you could review missed output on hangs [15:20:23] and supported multiple readonly connections while one was R/W, etc [15:20:52] that sounds great [15:21:33] yeah maybe we should consider looking at that or something else similar among the modern options [15:21:42] could try out a new solution for the asia DC :) [15:22:55] of course, we don't really do proper serial console servers here, it's all drac-based right? [15:23:13] I don't know if our modern boxes even have physical serial ports heh [15:23:27] catching the problem now is pretty hard, we had only 3 machines hanging out of 55 reboots [15:23:34] luckily :) [15:24:08] aaand cp3045 froze [15:24:18] did you catch it? :) [15:24:34] nope I was busy saying that it's hard to do [15:24:39] lol [15:25:04] I have tools because I was busy using my tools to complain about my broken tools [15:25:16] I missed a no in there somewhere [15:25:23] just a blank screen this time [15:26:03] there are so many different ways the reboot can fail, that we've seen over time [15:26:32] at one distant point (I think we're well past it now in kernel revs) there was a kernel bug that would hang them at the very end of kernel shutdown and prevent rebooting the hardware [15:27:05] "[OK" earlier is suspicious, though [15:27:33] (suspiciously sounds like it's making it to userland startup and failing some race/dependency issue there) [15:28:05] like the md assembly thing we faced before [15:30:02] back on the other topic: of course going back to real serial consoles and a serial console server seems like a step backwards on the scale of: http://xkcd.com/1737/ [15:30:44] we're supposed to not care about the disposition of one machine right? hook the host down icinga alert to an automatic ipmi hard reboot, and if it stays down you just throw it in the trash and move on, right? 
:) [15:31:36] uh the traffic depool unit shows progress, that's nice [15:32:00] yeah seconds of execution comes from systemd itself I think [15:32:07] the actual unit file just has a built in sleep [15:34:40] of course they're all booting fine now that I'm looking [15:43:46] <_joe_> ema: whenever you want to test conftool 0.3.1, I think it's important for restarts [15:46:46] _joe_: ok, the new feature to test is exiting with non-zero in case of errors right? [15:47:11] <_joe_> yep [15:47:37] which will cause "depool" to exit non-zero, which will cause traffic-pool systemd unit to fail to stop properly [15:47:55] but duwing a "shutdown" sequence, I don't think that failure will actually stop it from stopping varnish and the host? [15:48:09] s/duwing/during/ [15:49:02] $ confctl --quiet select name=cp3047.esams.wmnet,service='nginx' set/pooled=yes [15:49:05] ERROR:conftool:Error when trying to set/pooled=yes on name=cp3047.esams.wmnet,service=nginx [15:49:08] ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication : Insufficient credentials [15:49:11] this one exited with 1 [15:52:02] given the transient nature of the Raft issues it might be worth re-trying in case of failure? [15:55:58] 10Traffic, 06Operations, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2696978 (10BBlack) 05Resolved>03Open Of course, I spoke too soon. The stats anomaly is slowly becoming visible again, and live logging confirms these broken clien... [15:57:14] people started using xp+ie8 again? :) [15:57:31] oh no that's chrome [15:58:02] I don't think it's even chrome [15:58:15] I think it's some other strange software, and they copied the UA string of an old chrome version, or something [15:58:21] right [15:58:41] or they're reusing an old copy of chrome to execute the fetches, anyways, or something odd [15:59:23] we don't normally even try to investigate something like that "just because"... it's not worth stopping all the strange ways people might test/benchmark/whatever against us [15:59:40] but this particular case is egregious, as it seriously warps our stats for "portal" pageviews by orders of magnitude [16:00:03] (and also warps our stats on clients with bad/outdated TLS stats a bit) [16:00:11] s/TLS stats/TLS negotations/ [16:02:38] it was hard to notice the bad stats creeping back in when the workaround was reverted [16:02:54] I guess there's some slow/organic pattern to them eventually recovering from the 401s sent in the workaround [16:03:05] it's very easy to see the sharp dropoff when the workaround goes back in place, though :) [16:03:47] in the Bad Ciphers graph if you look at any <24h view here: https://grafana.wikimedia.org/dashboard/db/tls-ciphers [16:04:52] the implication is those bad requests were ~1.5% of our total global request volume just before the workaround went back in place [16:09:12] interesting [16:11:12] 26/31 cache_text rebooted, 18/39 cache_upload [16:14:27] clearly Varnish4 has a regression related rebooting :) [16:22:10] :) [16:25:37] on a slightly-more-serious note, maybe the expiry mailbox backlog (well, now that it's partially mitigated) would never be a real issue anyways, because we never leave varnishd processes alive long enough due to various maintenance and changes? 
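On the earlier suggestion of retrying around the transient Raft errors now that conftool 0.3.1 exits non-zero on failure, a repool wrapper along these lines is one option (the retry count, sleep and service selector are illustrative, not an agreed-on implementation):

    for attempt in 1 2 3 4 5; do
        confctl --quiet select "name=$(hostname -f),service=varnish-be" set/pooled=yes && break
        sleep 5
    done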
[16:26:15] at this rate it may be several weeks before we find out if the "weekly" restart cron is long enough or not heh [16:26:28] by then we'll be testing varnish5 or something [16:27:12] yeah I think testing varnish5 is certainly doable without much effort [16:30:54] cp2022 froze with some gibberish in console, powercycling [16:31:25] (didn't manage to catch it earlier) [16:32:46] so far the hosts hanging have been: cp2015 cp3034 cp3045 cp2022 [16:36:22] well, they're all newest-gen [16:36:31] but that doesn't say a whole lot statistically [16:37:07] roughly half of all our caches are that "newest-gen" hardware [16:37:25] do it's not unusual that a handful of rare failures happen to all land there [16:37:28] s/do/so/ [16:43:50] text done \o/ [17:12:08] interesting [17:12:18] cp2017 froze like this: [17:12:22] Lo [17:12:22] Debian GN [17:13:02] hmmmm [17:13:28] was there other stuff before it? it finished "normal" bootup and that was about when it was printing the first login prompt? [17:13:51] I guess those could be characters from grub output, too [17:13:56] (if no other context) [17:14:43] no other stuff and I caught it after it happened [17:15:04] hmmmm [17:16:30] well also, I think Dell does have some tiny inbuilt console output buffering (possibly accidentally, in the translation from virtual serial -> virtual vga -> ssh terminal session) [17:16:44] so the output may be unreliable when we see a fragment post-hoc [17:17:07] for instance, I just connected to cp2001's console (which is up and running fine) and hit enter when I connected to generate a new login prompt [17:17:14] the output I got is literaly: [17:17:17] [1083748.014712] [17:17:17] Debian GNU/Linux 8 cp2001 ttyS1 [17:17:17] cp2001 login: [17:17:40] the 108.... is clearly kernel/dmesg output, but the machine hasn't been up that long, so it's clearly from the previous boot [17:19:31] "Lo" is probably "Loading" that happens when grub loads a kernel [17:19:33] I think? [17:19:55] but who knows, on a stuck console all of that output could've been from ages ago in some mis-managed console buffer [17:22:40] and we're done! [17:23:05] \o/ [17:49:49] ema: should we close T131502 ? I called the goal "done", the rest is just sorting out nits as we go I think [17:49:49] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [17:49:59] and maybe start gutting the v3-compat bits from upload VCL too [17:53:28] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2697291 (10BBlack) 05Open>03declined This seems really complicated to get "right", and it's only in corner cases that it even helps us much. There's potential downsides on the pattern-ada... [18:10:00] 10Traffic, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2697362 (10ori) If I understood [[ https://github.com/facebook/hhvm/blob/235b6ed60f54fe4d1f18bc9592e4a7ea5f573b05/hph... [18:24:13] 10Traffic, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2697376 (10BBlack) In general, zlib supports a defined compression level of `0`, which means "no compression", but is... 
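A quick way to confirm the "we're done" state above, i.e. that every cache host really came back on the new kernel after the dist-upgrade and reboots (the grain/glob targeting is assumed to match the cluster:cache_* wording used earlier):

    salt -G 'cluster:cache_*' cmd.run 'uname -r; uptime'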
[18:26:55] 10Traffic, 06Operations, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2697392 (10Nemo_bis) >>! In T141786#2696978, @BBlack wrote: > At this point I think it's more likely a fake UA string from some kind of benchmarking or other software...
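For reference, the kind of live logging mentioned earlier for T141786 could be done with a VSL query on a text frontend, something like the following (the instance name and the exact filter are assumptions, not the query actually used):

    varnishncsa -n frontend -q 'ReqHeader:User-Agent ~ "Chrome/41" and ReqURL eq "/"'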