[02:31:52] <wikibugs>	 07HTTPS, 10Traffic, 10Monitoring, 06Operations, 13Patch-For-Review: adjust ssl certificate montioring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2658075 (10Dzahn) p:05Triage>03Normal
[06:00:58] <elukey>	 ema: o/ vk 1.0.12 on cp3034 looks good, if you are ok I'd upload it to reprrepro and then maybe deploy it to cache upload esams? It would be great to deploy it in all cache upload before tomorrow to see if it helps   
[07:33:21] <ema>	 elukey: cool, please go ahead!
[08:52:48] <elukey>	 vk in cache:upload esams upgraded 
[08:53:06] <elukey>	 I'll let it boil for a bit and then proceed with codfw, ulsfo and eqiad
[08:59:40] <ema>	 elukey: could you start with ulsfo perhaps?
[08:59:48] <elukey>	 sure
[08:59:57] <elukey>	 any issue with codfw?
[08:59:58] <ema>	 I'm finishing eqiad's conversion to the new storage layout and codfw is up next
[09:00:03] <elukey>	 ahhhhh
[09:00:05] <elukey>	 okok
[09:00:15] <ema>	 thanks!
[09:01:00] <elukey>	 ema: don't kill me but what if we installed in codfw begore you go ahead with the conversion?
[09:01:06] <elukey>	 to test vk's new resiliency
[09:01:12] <ema>	 elukey: sure, we can do that
[09:01:53] <elukey>	 ema: ok for me to install in codfw now?
[09:01:58] <ema>	 elukey: yep
[09:13:50] <elukey>	 vk upgraded in upload codfw :)
[09:14:01] <ema>	 elukey: yay!
[09:19:16] <ema>	 alright so eqiad is now running the storage experiment
[09:27:11] <ema>	 for codfw I've prepared a set of patches starting with https://gerrit.wikimedia.org/r/#/c/312208 and ending with https://gerrit.wikimedia.org/r/#/c/312212
[09:49:15] <wikibugs>	 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2658527 (10Gehel)
[12:05:57] <elukey>	 ema: is it ok if I proceed with vk's upgrade in ulsfo and eqiad?
[12:06:04] <elukey>	 (upload)
[12:23:54] <ema>	 elukey: yes
[12:41:12] <ema>	 starting the storage conversion in codfw
[12:44:38] <elukey>	 I've just finished the vk deploy in upload
[12:46:42] <ema>	 elukey: nice
[12:48:43] <elukey>	 I missed cp2017.codfw.wmnet
[12:48:52] <elukey>	 if it is ok I'll update it, otherwise I'll wait
[12:49:02] <ema>	 elukey: go for it
[12:49:16] <ema>	 I'm currently working on cp2002 and cp2005
[12:49:46] <elukey>	 super, watching on 2002 if anything happens
[13:02:38] <moritzm>	 ema, bblack: I've built 1.0.2i, make some smoke tests and uploaded to carbon
[13:03:33] <ema>	 moritzm: cool, thanks
[13:03:44] <moritzm>	 needed some tweaks to the Debian patches and the cloudfare patch, merged as https://gerrit.wikimedia.org/r/#/c/312234/
[13:03:56] <moritzm>	 I'd install on cp1008 next?
[13:04:21] <ema>	 sounds good to me
[13:14:41] <mark>	 ema: please link to phab ticket in gerrit changes, thanks :)
[13:15:13] <ema>	 mark: sure!
[13:15:21] <moritzm>	 ema, bblack: actually, cp1008 is currently running the experimental openssl 1.1 build
[13:26:58] <ema>	 mailed-received is now ~200k on cp1099
[13:27:07] <ema>	 varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received | awk '/mail/ { m=$2 } /rece/ { r=$2 } END { print m-r;}'
[13:27:10] <ema>	 216995
[13:36:31] <mark>	 it was < 20 a few hours ago
[13:37:03] <ema>	 no 503s yet
[13:37:11] <ema>	 difference ~300k
[13:43:20] <ema>	 and now it caught up
[13:45:00] <mark>	 weird eh
[13:45:09] <mark>	 it would be easier to understand it would not progress at all for a while
[13:45:12] <mark>	 if*
[13:45:52] <ema>	 I'm logging the differences in ~ema/mrdiff.log, we're now at ~40k
[13:50:11] <mark>	 i heard we'll have varnish stats in prometheus soon :)
[13:50:20] <ema>	 can't wait :)
[13:55:12] <ema>	 all other hosts have a diff of 0 except for cp1049 which is at 4 now
[14:00:42] <ema>	 went back to 0 after reaching 152297
[14:04:19] <ema>	 and it's now able to catch up apparently, the diff is not rising anymore
[14:07:14] <mark>	 i wonder if we let a single cache store one bin only
[14:07:19] <mark>	 (and miss-pass everything else)
[14:07:21] <mark>	 would it still happen :)
[14:08:34] <ema>	 so the fact that "after a while" it manages to catch up might be related to what we've seen with the 503s, in the sense that they stopped after a certain time
[14:08:54] <bblack>	 I think it tracks load
[14:09:15] <bblack>	 during peak load times the rate of mailed items goes up, and above certain rates it's more likely to fall behind, etc
[14:09:44] <bblack>	 and yeah, the 503 recoverys are probably from catching up (at least somewhat) too
[14:10:13] <bblack>	 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=mailbox&vl=&x=500&n=&hreg[]=cp1099&mreg[]=varnish.MAIN.exp_.%2A&gtype=line&glegend=show&aggregate=1&embed=1&_=1474476862829
[14:10:32] <bblack>	 ^ if you look at cp1099 there, it didn't start falling behind until the rate was getting pretty high for the daily cycle
[14:11:27] <bblack>	 so long as they catch up in some reasonable timeframe, it's tolerable
[14:11:44] <bblack>	 the problem is when they persistently fall behind into the millions of items and never really catch back up
[14:16:12] <mark>	 there's debug logging in that code, right
[14:16:18] <mark>	 did you enable and capture that while this is happening?
[14:16:27] <mark>	 i.e. the objects it's looking at
[14:16:31] <mark>	 probably extremely verbose
[14:16:44] <mark>	 and therefore possibly even a heisenbug
[14:18:22] <bblack>	 huh?
[14:19:27] <bblack>	 moritzm: when you say tweaks, you mean offset updates to build, or? the way things are structured, we don't easily get a diff of the diffs I guess
[14:20:46] <bblack>	 oh I guess it is a diff of the diffs, there's just a lot of line noise from quilt
[14:21:36] <mark>	 hmm I see I recalled it wrong. there's no logging in the expire thread
[14:21:47] <moritzm>	 nothing to the code itself, but the Configure script shipped by 1.0.2i differs from the one the cloudfare patch patches, I had to fix these up
[14:22:19] <moritzm>	 also the ca.patch needed to be updated since the hunk it patches now shows a different message, so the patch failed to apply
[14:22:35] <moritzm>	 the rest is just the effect of refreshing the patches with quilt
[14:23:36] <mark>	 oh there is some
[14:23:37] <mark>	 	VSLb(&ep->vsl, SLT_ExpKill, "EXP_Inbox p=%p e=%.9f f=0x%x", oc,
[14:23:37] <mark>	 oc->timer_when, oc->flags);
[14:23:48] <mark>	 etc
[14:24:00] <mark>	 just wondering if we could derive some patterns while this is happening, but it might be too verbose
[14:25:08] <bblack>	 moritzm: ok
[14:26:32] <bblack>	 moritzm: resetting cp1008 back to normal
[14:27:37] <bblack>	 moritzm: relatedly: I dumped the chapoly draft ciphers from our list this week too, so we get some stats on 1.0.2 without them before the 1.1.0 switch (which may have its own stats patterns to watch)
[14:29:35] <moritzm>	 ack, saw that a few days ago
[14:30:08] <bblack>	 cp1008/pinkunicorn reset back to normal prod openssl/nginx packages and upgraded/restarted
[14:30:13] <bblack>	 looking at sslllabs.com results now
[14:31:17] <bblack>	 so, offsite agenda, we really want a whole breakout on TLS issues I think
[14:31:50] <bblack>	 discuss the 1.1.0 upgrade, and also discuss strategy on when/how we dump 3DES (and the other non-FS cipher eventually too)
[14:33:56] <bblack>	 elukey: where you at on the vk package upgradees? I'm asking from the perspective of: can we just do an "apt-get upgrade" on all the caches and pick up the outstanding misc updates with the openssl update, or is vk upgrade outstanding but not ready for it on some?
[14:34:01] <moritzm>	 agreed
[14:34:44] <ema>	 bblack: we cannot just apt-get upgrade because of vmod-netmapper
[14:34:53] <bblack>	 if there's leftover time after that, we can talk about (H)PKP too :)
[14:34:59] <bblack>	 ema: ugh :)
[14:35:04] <bblack>	 ok
[14:35:34] <mark>	 is any upload box backlogged on lru right now?
[14:36:17] <ema>	 mark: no
[14:36:50] <mark>	 a large object pushing out a huge amount of lru objects due to fragmentation should be quite visible with "varnishlog -i ExpKill"
[14:37:14] <bblack>	 the pattern with the load has happened before, though
[14:37:25] <mark>	 on cp1099 I typically see less than 10 consecutive candidates
[14:37:38] <mark>	 *   << BeReq    >> 243185335 
[14:37:38] <mark>	 -   ExpKill        LRU_Cand p=0x7f43fa07bb80 f=0x0 r=1
[14:37:38] <mark>	 -   ExpKill        LRU x=219435462
[14:37:38] <mark>	 -   ExpKill        LRU_Cand p=0x7f432ecf18c0 f=0x0 r=1
[14:37:38] <mark>	 -   ExpKill        LRU x=220052281
[14:37:41] <mark>	 etc
[14:37:48] <bblack>	 I think the high-load times of the day (for the mailbox rate, which tracks the overall traffic rates approximately) just puts more pressure.  depending on uptime it can be enough to push it over the edge of unrecoverable or not
[14:37:58] <mark>	 yeah
[14:38:20] <mark>	 i'll look at any other box, to see if there's generally more fragmentation
[14:38:33] <bblack>	 you mean the ones on old storage?
[14:38:36] <mark>	 yes
[14:38:43] <mark>	 i guess there are none left in eqiad? :)
[14:38:48] <bblack>	 well with the 24h cron restarts they tend not to get awful, but yeah
[14:38:53] <bblack>	 none left in eqiad, right
[14:39:24] <bblack>	 I can tell you the longest-running old storage box though
[14:39:36] <mark>	 wonder if we could get a feel for fragmentation by parsing varnishlog output with a script that tracks the number of candidates per lru run
[14:39:40] <mark>	 please do :)
[14:39:43] <bblack>	 cp4006
[14:39:54] <bblack>	 apparently it has missed a daily restart somehow, as it's >1d now
[14:40:10] <bblack>	 maybe from confctl fail? :)
[14:40:46] <ema>	 mark: if you want to get live debugging output in addition to what varnishd normally logs we can do that with stap
[14:41:05] <mark>	 yeah that's nice
[14:41:22] <bblack>	 ema: you still have logs running in screen on cp4006 from a week ago
[14:41:54] <mark>	 subjectively, just from glancing at the varnishlog output, cp4006 looks slightly more fragmented, more candidates evaluated per lru pushout
[14:41:58] <mark>	 but that's hardly scientific ;)
[14:42:29] <bblack>	 cp1099 is at a bit over 64h uptime now
[14:42:50] <ema>	 bblack: we should have noticed confctl fails in cron's emails, I don't see any wrt cp4006
[14:42:50] <mark>	 cp4006 quite regularly has >10 though, whereas cp1099 rarely does
[14:42:53] <bblack>	 that it's still catching back up from its mailbox backlog bursts is way better than where we'd be without storage fixes at that uptime
[14:43:49] <elukey>	 bblack: sorry just read the message - I upgraded upload today, but the pkg is on reprepro so if you are going to upgrade the rest of the segments it will be less work for me :)
[14:43:52] <bblack>	 the big ? in my head before we all start traveling is whether I turn on the new-storage hosts' restarts with a daily or weekly cron before we go
[14:44:04] <bblack>	 if it looks really good, weekly, but otherwise just go back to daily for safety for now
[14:44:12] <bblack>	 we won't get to see a whole week of history first
[14:44:14] <mark>	 i'd say daily before travel?
[14:44:19] <ema>	 +
[14:44:21] <ema>	 +1 :)
[14:44:37] <bblack>	 but on the other hand, even the "weekly" doesn't mean one week out from when we start it.  some of them will restart within hours of turning on "weekly"
[14:45:26] <bblack>	 ema: do you have outstanding puppet disables or anything for codfw storage upgrades?
[14:45:42] <bblack>	 or in the midst of anything else the openssl + misc package updates might interfere with?
[14:46:53] <ema>	 bblack: nope, I've converted cp2002+cp2005 and then got distracted by cp1099
[14:47:13] <bblack>	 ok
[14:47:23] <bblack>	 hmmm, we have some strangeness with experimental, etc...
[14:47:38] <bblack>	 I checked randomly on cp4006, and it wants to upgrade openssl to 1.1???
[14:47:45] <bblack>	 I thought I didn't upload that anywhere
[14:49:28] <bblack>	 I guess I must have uploaded it to experimental
[14:49:30] <bblack>	 but still....
[14:49:51] <ema>	 yeah 1.1.0-1+wmf2 is in experimental
[14:50:16] <bblack>	 why does cp4006 select it by default though?
[14:50:50] <bblack>	 root@cp4006:~# apt-cache policy openssl
[14:50:50] <bblack>	 openssl: Installed: 1.0.2h-1~wmf4 Candidate: 1.1.0-1+wmf2
[14:51:02] <bblack>	 doesn't happen on most other hosts I check....
[14:51:27] <bblack>	 hmmm s/most/some/
[14:51:38] <ema>	 do the other hosts have experimental in sources.list?
[14:51:45] <bblack>	 yeah
[14:51:45] <ema>	 perhaps they're running v3
[14:51:50] <ema>	 mmh interesting
[14:52:00] <bblack>	 so all the v4 hosts have experimental at least
[14:52:06] <ema>	 yes
[14:52:20] <ema>	 which one has experimental but doesn't want to upgrade to 1.1.0-1+wmf2?
[14:52:20] <bblack>	 I thought still that experimental packages had to be installed explicitly?
[14:52:37] <bblack>	 none
[14:52:45] <bblack>	 it's just the experimental-enabled v4 hosts
[14:54:02] <bblack>	 hmmmmm
[14:55:14] <bblack>	 it doesn't break the library stuff in any case since they're separate, but we really shouldn't put the new openssl CLI package in place either
[14:55:51] <bblack>	 anyways, I can hack around it for now with openssl=x
[14:57:37] <ema>	 yeah I guess our experimental is not really like debian's experimental, which has NotAutomatic: yes in the Release file
[14:58:03] <moritzm>	 yeah, just had a look at 4006, the experimental component has the standard pinning
[14:58:17] <bblack>	 ok
[14:58:20] <moritzm>	 so these are treated equally with the remaining sections
[14:58:38] <moritzm>	 something we can tweak/discuss during the offsite session
[14:58:41] <bblack>	 well, I wanted to get all the misc updates we're behind on, but both openssl and libvmod-netmapper are blocking that heh
[14:59:02] <bblack>	 apt-get --dry-run install "openssl=1.0.2i-1~wmf1" libssl1.0.0 libssl1.0.0-dbg libsystemd0 libudev1 libxml2 linux-image-3.16.0-4-amd64 linux-libc-dev linux-meta locales multiarch-support python2.7 python2.7-minimal ruby2.1 systemd systemd-sysv udev wget base-files e2fslibs e2fsprogs gnupg gnupg-agent gnupg2 gpgv libc-bin libc-dev-bin libc6 libc6-dev libcomerr2 libltdl7 libnet-ssleay-perl linux-met
[14:59:08] <bblack>	 a-4.4
[14:59:30] <moritzm>	 all the other packages are from the jessie 8.6 point update, which I'm deploying cluster-wide
[14:59:32] <bblack>	 ^ that seems to DTRT on both v4 and v3 hosts and pick up all the oustanding package upgrades and the right openssl CLI, but not libvmodnetmapper or any openssl-1.1 packages
[14:59:55] <bblack>	 ignore them then, or?
[14:59:59] <moritzm>	 hadn't had much time to work on it this week, but hopefully these will all be deployed in the week after the offsite
[15:00:09] <moritzm>	 yeah, nothing serious in there
[15:00:11] <bblack>	 ok
[15:00:21] <ema>	 linux-image-3.16.0-4-amd64 ?
[15:01:03] <ema>	 I thought we were happy 4.4 users :)
[15:01:13] <bblack>	 yeah I just don;'t like having random package updates outstanding
[15:01:17] <bblack>	 may as well keep it up to date :)
[15:01:37] <bblack>	 apt-get --dry-run install "openssl=1.0.2i-1~wmf1" libssl1.0.0 libssl1.0.0-dbg
[15:01:40] <bblack>	 works for just openssl
[15:07:01] <bblack>	 nginx upgrades going now
[15:07:28] <bblack>	 (not package upgrades, the "service nginx upgrade" command to do lossless restarts onto the new libssl)
[15:14:18] <bblack>	 moritzm: so cache nginx are all using the new openssl lib
[15:21:22] <bblack>	 I wonder if the openssl-1.0 package updates with 3DES deprioritizing to medium will have some real effect on global 3DES support
[15:21:47] <bblack>	 for sites that didn't care much, but just follow on with it not being in HIGH anymore and they weren't allowing MEDIUM before
[15:23:01] <bblack>	 ebay.com is a new 3DES-failure in my top-100 survey today, but I don't think they enforce HTTPS in general?
[15:23:38] <bblack>	 yeah they don't
[15:23:50] <bblack>	 most of the top-100 that I find failing to support 3DES also don't force HTTPS either
[15:24:02] <bblack>	 with the continuing exceptions of github.com and tumblr.com
[15:32:49] <ema>	 bblack: should I carry on with the other codfw hosts?
[15:36:25] <bblack>	 ema: yeah
[15:36:37] <ema>	 alright!
[15:36:44] <bblack>	 I guess if we assume we're putting daily restarts back in place by friday
[15:36:54] <bblack>	 and we know the new storage can go at least 3 days
[15:37:04] <bblack>	 there's no reason not to accelerate this to all them we can before we leave
[15:37:17] <bblack>	 maybe I can pick up whatever's left of codfw and start on esams later today
[15:41:00] <ema>	 cool
[15:45:12] <ema>	 ERROR:conftool:Error when trying to set/pooled=no on name=cp2008.codfw.wmnet,service=varnish-be-rand
[15:45:15] <ema>	 ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
[15:45:18] <ema>	 :(
[15:46:09] <bblack>	 did it fail the script this time? :)
[15:46:17] <bblack>	 oh right, probably not a script
[15:46:22] <ema>	 no script this time, right
[15:47:12] <moritzm>	  linux-image-3.16.0-4-amd64 can be removed, it's the kernel installed by d-i (the old meta package linux-image-amd64 depends on it and needs to be removed along)
[15:47:49] <moritzm>	 actually, since all jessie hosts use 4.4(or at least 3.19) now, we could just as well, drop this cluster-wide via puppet
[15:48:51] <moritzm>	 bblack: great, will proceed with the remaining clusters tomorrow and after the offsite
[15:52:19] <bblack>	 moritzm: seems fine to me if you want to kill the 3.16 packages.  it's not like we'd ever reboot back to it at this point.
[15:53:44] <moritzm>	 ack, I'll do that when I have some spare time
[15:56:27] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Operations, 06Performance-Team, and 4 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#2659323 (10mark)
[16:01:29] <paravoid>	 http://blog.cerowrt.org/post/bbrs_basic_beauty/
[16:05:15] <bblack>	 yeah BBR sounds exciting :)
[16:23:54] <Snorri>	 ema, bblack: You got mail! *cheers*
[16:24:28] <bblack>	 Snorri: awesome :)
[16:24:33] <ema>	 Snorri: thanks!
[16:25:25] <ema>	 Snorri: and congrats ;)
[16:29:40] <elukey>	 I know that it has been a long day when I spend minutes trying to figure out things like "why cache:text has vk 1.0.7? It is so old.. did I forget to upgrade it during the past months?"
[16:30:15] <ema>	 elukey: heh :)
[16:31:39] <elukey>	 anyhow, brain faults aside, I'll ping you guys back tomorrow morning to see if vk needs to be upgraded in misc/maps or not 
[16:31:48] <elukey>	 :)
[16:52:02] <ema>	 bblack: I've left https://gerrit.wikimedia.org/r/#/c/312211/ and https://gerrit.wikimedia.org/r/#/c/312212/ to do in codfw. The other hosts look good and I had no real issues except for the etcd hiccup.
[16:52:43] <ema>	 see you all tomorrow :)
[16:54:02] <bblack>	 ok
[17:31:46] <wikibugs>	 10Traffic, 10netops, 10DNS, 06Operations, 10ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659577 (10BBlack)
[17:45:03] <wikibugs>	 10Traffic, 10netops, 10DNS, 06Operations, 10ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659624 (10faidon) Nothing appears abnormal in the logs of either csw2, asw nor cr2. Which other hosts on the same network did you try from? I'm interested to find out if they connecte...
[17:48:12] <mark>	 i wonder what fs would be optimal for a varnish fs backend with one-object-per-file
[18:06:31] <bblack>	 probably btrfs
[18:07:05] <bblack>	 probably btrfs is the anwer to most filesystem questions now.  it's gotten stable, and it's awesomer than all the alternatives in so many ways
[18:07:25] <bblack>	 it's just taking a while for us all to catch up to it being the new norm that that's the right decision with modern kernels
[18:08:13] <bblack>	 from a blog post about it earlier this year:
[18:08:15] <bblack>	 They have run a test against the same storage, formatted at different stages with XFS, EXT4 and BTRFS, and they wrote around 24 million files of different size and layout. XFS takes 430 seconds to complete the operations and it was performance bound by its log system; EXT4 took 200 seconds to complete the test, and its limit comes from the fixed inode locations. Both limits are the results of the
[18:08:21] <bblack>	 ir design, and overcoming of those limits was one of the original goal of BTRFS. Did they succeed? The same test took 62 seconds to be completed on BTRFS, and the limit was the CPU and Memory of the test system, while both XFS and EXT4 were able to use only around 25% of the available CPU because they were quickly IO bound.
[18:08:38] <bblack>	 and it's SSD-aware (not just TRIM, but changes how it does seek optimization, etc)
[18:10:57] <MaxSem>	 does that mean that btrfs === "poop it all into RAM cache and it will be written (eventually)"
[18:11:33] <MaxSem>	 I remember DOS worked so much faster with smartdrive :D
[18:13:02] <bblack>	 https://btrfs.wiki.kernel.org/index.php/Status
[18:16:07] <bblack>	 there's still some reports of bugs and warnings of "not ready for production", but it seems that's mostly about uses-cases like "data you really care about" and "using all its complex features like comrpession+raid6+..."
[18:16:33] <bblack>	 I think for use as a cache backend filesystem with no mirroring/raid/compression, it's probably already pretty stable and the perf benefits are probably nice
[18:36:58] <bblack>	 ema: codfw done a bit ago, starting up on esams now.
[18:40:02] <bblack>	 (I'm also going to go restart cp4006 now before it gets in trouble for being up >1d before I get to it for conversion)
[18:54:55] <mark>	 I figured btrfs as well
[18:55:50] <mark>	 Just dont have it run out of space ;)
[18:56:56] <mark>	 I used btrfs for media storage across 48 drives over 2010, did you know that? ;)
[18:57:15] <bblack>	 no, I didn't :)
[18:57:34] <mark>	 Didnt fail... Somehow :)
[18:57:53] <bblack>	 yeah it's probably just a matter of not pushing on the wrong corner-case buttons
[18:57:56] <bblack>	 (even today)
[18:58:01] <mark>	 Yes
[18:58:12] <mark>	 48 drives though ;)
[18:58:27] <mark>	 Esams thumbs traffic :)
[18:58:28] <bblack>	 it'll be nice when it gets to the point we can use it for rootfs
[18:58:41] <bblack>	 putting the mirroring down in the FS makes life easier for e.g. partman-like crap
[18:59:22] <bblack>	 and then the compression thing could be a game-changer in the long run too
[18:59:37] <bblack>	 I could see 10 years form now nobody really invokes gzip manually much or does gzipped logrotate, etc...
[18:59:50] <bblack>	 just "chattr +c" on files you want compressed by the FS for you
[19:03:08] <bblack>	 it would be interesting for the general/rootfs case, to have some kind of filesystem option for semi-smart auto-compression
[19:04:21] <bblack>	 where the FS makes the call to try compressing or not, perhaps by filtering on size first (don't compress below X bytes), and then doing a trial compression on the data to see if it seems worth it (if a few blocks of this file compress by more than X%, turn on compression for it)
[19:17:53] <gehel>	 bblack: any chance I could have your review on https://gerrit.wikimedia.org/r/#/c/312223/ and related? 
[19:18:48] <gehel>	 bblack: joe had a look at the varnish part, and I'm probably completely wrong in the way I'm trying to move wdqs to LVS...
[19:27:37] <gehel>	 bblack: if I reference wdqs.svc.eqiad.wmnet directly, it means that we need to push a puppet change to fallback to codfw, correct?
[19:27:59] <bblack>	 gehel: yes.  in general, we haven't been doing DC failover testing for cache_misc
[19:28:31] <bblack>	 no matter what you have to push a puppet change, it's just a question of whether it's on hieradata/ or modules/role/ :)
[19:28:54] <gehel>	 yeah, I'm still a bit lost in all of that...
[19:29:53] <bblack>	 basically don't worry about switching datacenters at this point
[19:30:12] <bblack>	 just getting the LVS abstraction inserted at all, so that it works like all the rest of the existing cache_misc services, is an improvement.
[19:30:25] <gehel>	 and what is the status of active / active clusters? T134404 seems to indicate that it is not possible yet, but joe lead me to think it should be...
[19:30:26] <stashbot>	 T134404: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404
[19:33:17] <bblack>	 gehel: that's a whole other question, but text/upload are close to achieving it, and misc is completely different right now
[19:33:44] <bblack>	 those questions are something we're resolving for the infrastructure as a whole.  once they're resolved, then we can look at turning it on for various services on various clusters
[19:33:51] <bblack>	 but it's irrelevant to moving a service to LVS today :)
[19:33:58] <gehel>	 ok, thanks for the clarification! 
[19:34:16] <gehel>	 yeah, not directly related, but I'm curious...
[19:34:40] <bblack>	 there's a set of 5 patches starting at https://gerrit.wikimedia.org/r/#/c/300574/
[19:35:02] <bblack>	 which was my most-recent attempt at unifying all related things for all clusters, and laying the groundwork for how to move to active/active support
[19:35:13] <bblack>	 but they're now outdated and need non-trivial fixups just to rebase
[19:35:43] <gehel>	 I'll have a look, just for my general knowledge...
[19:36:47] <bblack>	 the goal is to get to a point where we define a service's applayer endpoint in one or more DCs and everything else is automagic, basically
[19:37:21] <bblack>	 if you define it only as "eqiad: wdqs.svc.eqiad.wmnet", all the caches will do inter-cache routing until they reach eqiad, and then it reaches that service endpoint there
[19:38:02] <bblack>	 but if you define both "eqiad: wdqs.svc.eqiad.wmnet" and "codfw: wdqs.svc.codfw.wmnet", the cache routing will be split: ulsfo+codfw will route into the codfw applayer, and esams+eqiad will route into the eqiad applayer.
[19:39:30] <bblack>	 and there won't be any changes to other data to switch those things up or disable one or the other, the inter-cache routing config stays the same regardless, and the final routing to the applayer is per-service (e.g. one cache cluster can have a service that only has an app endpoint in codfw, another that only has eqiad, and a third that splits to both), and never crosses DCs to reach the applaye
[19:39:36] <bblack>	 r either
[19:39:47] <bblack>	 (as in, it will never have to go cache.eqiad->app.codfw just because of which backends are configured)
[19:41:34] <bblack>	 sorry that's probably confusing to anyone not living in my head
[19:41:45] <bblack>	 it will make sense once it's working and documented, though :)
[19:42:54] <bblack>	 the bottom line is it divorces the complexity/config of inter-cache/inter-DC routing of requests from the declarative config of which parallel application endpoints are available in which DCs to route requests to.
[19:44:06] <bblack>	 (the first being mostly a traffic-level abstraction others shouldn't care about which exists at a per-DC/per-cache-cluster level, the other being what the application cares about and defined at a per-application level)
[19:45:20] <bblack>	 thus we can have active/active RestBase, MediaWiki only in eqiad, and cxserver only in codfw, or whatever.  and making changes to those things doesn't involve knowing about or changing anything about inter-cache/inter-DC stuff.
[19:48:57] <bblack>	 gehel: also, on the final (varnish) patch, can you remove the VCL wdqs probe too? it's in modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb (search for wdqs)
[19:49:28] <gehel>	 bblack: sure! I did not think to look there...
[19:49:46] <bblack>	 yeah it shouldn't even be there, because we shouldn't have non-LVS services defined in varnish :)
[19:49:52] <bblack>	 but wdqs and logstash are the last remaining exceptions
[19:50:26] <bblack>	 well and rcs is also non-LVS, but doesn't have a custom probe
[19:50:30] <bd808>	 what does it take to setup an LVS for logstash?
[19:51:15] <bd808>	 it shouldn't need session pinning or anything weird like that as far as I know
[19:51:16] <bblack>	 not much really. a set of patches like the one gehel is working on now for wdqs
[19:51:18] <gehel>	 bd808: probably not all that much...
[19:51:27] <bblack>	 just nobody's taken the time (me included!)
[19:51:31] <gehel>	 I can have a look at logstash once wdqs is working
[19:51:42] <bd808>	 sweet
[19:51:56] <gehel>	 there's probably a few of the logstash services that we can't put behind LVS, GELF at least
[19:52:20] <bblack>	 I don't that's relevant to logstash.wikimedia.org endpoint on cache_misc though, right?
[19:52:23] <bd808>	 ... I think gelf does packet reassembly on the logstash side
[19:52:47] <bd808>	 yeah the varnish is just for the kibana interface
[19:53:23] <bblack>	 yeah in the conftool data we could call the cluster "logstash" and the service "kibana" or something, and maybe make it kibana.svc.eqiad.wmnet and such
[19:53:30] <bblack>	 to be clear that it's the HTTP kibana thing
[19:53:31] <bd808>	 having a proper service ip with balancing for log ingestion is a whole other beast
[19:54:22] <gehel>	 yeah, let's start with the easy ones...
[19:54:24] <bd808>	 and tricky because we use a couple of protocols that are UDP based and require collecting multiple datgrams to form a complete log entry
[19:59:45] <bblack>	 T132458 <- there's already a ticket for lvs-isizing kibana, like the wdqs
[19:59:45] <stashbot>	 T132458: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458
[20:00:12] <bblack>	 probably poorly named, since I made that ticket to be about the kibana http thing, but just said "logstash to an LVS service" :)
[20:11:12] <gehel>	 so the wdqs stuff should be mostly good, but it is getting a bit late here to merge it tonight...
[20:11:22] <gehel>	 I'll start preparing the logstash LVS
[20:12:56] <wikibugs>	 10Traffic, 06Operations, 10Wikimedia-Logstash: Move logstash.wikimedia.org (kibana) to an LVS service - https://phabricator.wikimedia.org/T132458#2660234 (10bd808)
[20:22:15] <bblack>	 gehel: ok cool, I'll take another look/review later.  thanks for working on it! :)
[20:22:24] <gehel>	 no problem!
[21:27:24] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2660533 (10Krinkle) The ones that start with `/skins` and `/static` are most likely from on-wiki gadgets and site scripts and stylesheets (e.g. Common.css) which will have been broken by...
[21:27:35] <wikibugs>	 10Wikimedia-Apache-configuration: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2660534 (10Alphos)
[21:31:20] <wikibugs>	 10Wikimedia-Apache-configuration: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2660578 (10Dereckson) Apache configuration explicitely sets `AddDefaultCharset`:  ```lang=apache,name=modules/noc/templates/noc.wikimedia.org.erb     <Di...
[23:38:34] <bblack>	 another raft failure, first one I've seen today:
[23:38:36] <bblack>	 ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure