[03:38:22] Wikimedia-Apache-configuration: Titles containing a dot give HTTP 403 when accessed using action=raw and the short URL form - https://phabricator.wikimedia.org/T126183#2013992 (matmarex) Previously rejected as T85751 by @brion.
[10:14:19] HTTPS, Huggle: Huggle 2 fails on HTTP used when HTTPS expected - https://phabricator.wikimedia.org/T126357#2014331 (DVdm) Got it! Just commenting out the aborts and exit subs does the trick. In **Login.vb** ``` Class LoginRequest : Inherits Request ... Protected Overrides Sub Process() ......
[12:15:45] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#2014556 (akosiaris)
[12:29:00] HTTPS, OTRS, operations: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#2014594 (akosiaris) Open>Invalid A side effect of the upgrade is that this has been turned obsolete. Now OTRS is behind misc-web, treated the same as all oth...
[12:29:19] bblack: I guess we can probably update varnish.systemd.erb already? https://gerrit.wikimedia.org/r/#/c/269664/
[12:39:23] cp1065 and cp1067 complain about RAID errors
[12:40:48] thanks paravoid
[12:41:13] also tons of
[12:41:13] Subject: Cron /usr/local/sbin/update-ocsp-all
[12:41:14] run-parts: /etc/update-ocsp.d/hooks/nginx-reload exited with return code 99
[12:41:20] but not sure if these are fixed by now
[12:41:36] md0 : active raid1 sdb1[1] 9756672 blocks super 1.2 [2/1] [_U]
[12:41:37] (the hook just does "service nginx reload")
[12:42:10] Feb 10 11:23:02 cp1059 systemd[1]: Unit nginx.service cannot be reloaded because it is inactive.
[12:42:13] Feb 10 12:23:02 cp1059 systemd[1]: Unit nginx.service cannot be reloaded because it is inactive.
[12:42:48] morning :)
[12:43:08] good morning!
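The update-ocsp cronspam above boils down to the nginx-reload hook unconditionally running `service nginx reload` on hosts where nginx is inactive (the stale crons on the de-roled cache_mobile hosts turn out to be the real culprit, as noted below). A minimal guard for the hook might look like this; it is only a sketch, since the log only quotes the one command the hook actually contains:

```
#!/bin/sh
# Sketch only: skip the reload when nginx isn't running, so run-parts doesn't
# email a non-zero exit status from hosts where the service is inactive.
if service nginx status >/dev/null 2>&1; then
    exec service nginx reload
fi
exit 0
```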
[12:44:41] hi :)
[12:46:37] the last ones that did that were trivial and no real hardware fault
[12:47:03] (the last two that were complaining of raid, just needed mdadm stuff to start syncing again, some fallout from reboot and systemd and god knows what)
[12:47:32] yeah, I've seen this happen before for some reason :/
[12:47:52] lvs4002 also had troubles booting yesterday, dropped into the initramfs shell
[12:47:56] well it's the same reason a few of the caches stopped on bootup at an (initramfs)
[12:48:00] I mdadm assembled everything and rebooted and then it was okay
[12:48:02] heh I couldn't even finish
[12:48:06] haha
[12:48:22] basically something with kernel+systemd doesn't get /dev/sdX running "in time"
[12:49:14] paravoid: the ocsp cronspam recently is from the decommed cache_mobile machines
[12:49:25] their configuration's broken and they're de-roled, but that didn't remove the cronjob
[12:50:02] nod
[12:56:34] so cp1067 happened to be a 4.4 canary too, and before I even had a chance to touch the raid speed_limit_(min|max) to speed it up, it already showed:
[12:56:37] recovery = 5.8% (568832/9756672) finish=0.8min speed=189610K/sec
[12:56:51] whereas the 3.19 ones always started out ~1024K/sec and ~120minute estimate
[12:57:59] the values in procfs are the same though
[12:58:05] bblack: we can start messing up maps machines soon :) https://gerrit.wikimedia.org/r/#/c/269466/
[12:58:49] ema: awesome :)
[12:59:32] I'm guessing something changed from 3.19 to 4.4 that allows it to rebuild closer to _max rather than _min speed when the system is bogged down on other i/o
[12:59:43] s/is/is not/
[13:04:04] Traffic, operations, Patch-For-Review: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#2014667 (ema) With https://gerrit.wikimedia.org/r/#/c/269664/ and https://gerrit.wikimedia.org/r/#/c/269466/ applied the following procedure allows to upgrade a maps box to Varnish 4: ech...
[13:09:19] bblack: oh, would you prefer using two different systemd template files for v3 and v4?
[13:10:51] ema: the change as proposed now, if merged to ops/puppet, would modify the existing v3 systemd template in bad ways, right?
[13:10:57] or was it too early in the morning for me? :)
[13:13:02] mmh, except for the order in which parameters are passed I do not see much of a difference for v3 (unless it's too late in the morning for *me*) :)
[13:13:54] well you're replacing shm_workspace with a new name that doesn't exist in v3, and you're removing shm_workspace, which we really need in v3 to avoid crashes but doesn't exist in v4
[13:14:29] oh and there's a conditional around it
[13:14:35] see, it was too early in the morning for me :)
[13:16:09] haha :)
[13:16:26] is there a way to test the change on a single node before merging?
[13:16:40] not easily really
[13:16:49] but you can do puppet compiler against the nodes
[13:16:57] ah sure!
[13:17:03] if you mark some node as varnish_version4 in the change I guess
[13:17:12] or don't and just confirm existing ones aren't affected
[13:17:22] I wanted to try the latter first
[13:17:44] I haven't really had a chance to look at the VCL diff, I need to download the change and see, since the "diff" isn't obvious in the diff what with the copy/rename
[13:18:07] we'll need to be careful going forward once that gets merged btw, to make sure others' changes to the VCL templates get mirrored to both copies
[13:18:42] yeah, probably the best way to look at the VCL changes is to diff the _v4 files against the v3 ones
[13:18:50] e.g. https://gerrit.wikimedia.org/r/#/c/269661/ that I'm about to merge :)
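Because the change copies the VCL templates to _v4 names rather than editing them in place, gerrit's own diff view hides the real v3 -> v4 delta; diffing each forked file against its counterpart, as suggested above, is more readable. A rough sketch of that, where the directory and file-naming pattern are assumptions rather than the actual ops/puppet layout:

```
# Compare each forked _v4 template against its v3 counterpart.
# Path and naming are illustrative; adjust to wherever the templates live.
cd modules/varnish/templates
for v4 in *_v4*.erb; do
    diff -u "${v4/_v4/}" "$v4"
done
```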
[13:19:34] alright that one is not too bad
[13:20:19] error message of the day: "Exprected ct to be boolean, was: true"
[13:20:31] so how are the 4.4 canaries?
[13:20:38] no increased CPU there?
[13:20:53] I'm looking at http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1067.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1455109116&g=cpu_report&z=large&c=Text%20caches%20eqiad
[13:20:55] mark: you're not alone in your pain
[13:21:02] paravoid: I haven't properly looked this morning. yesterday shortly after, there was a little bit of an iowait bump, but expected post-reboot anyways
[13:21:38] doesn't look like anything crazy happened
[13:21:48] nod
[13:22:00] whoops, wrong channel ;)
[13:22:09] post-4.4, I'd really like to test removing the vm hack on the upload caches and see what happens too
[13:22:14] it's about time we should be past that mess :P
[13:22:30] the per-minute cronjob?
[13:22:46] [ 5.368853] systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:3] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[13:22:49] [ 5.384216] systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:4] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[13:22:58] fwiw, I was just looking at dmesg :)
[13:22:59] yeah
[13:23:09] per-minute cronjob?
[13:23:13] ema:
[13:23:14] cron { 'varnish_vm_compact_cron':
[13:23:14] command => 'echo 1 >/proc/sys/vm/compact_memory',
[13:23:14] user => 'root',
[13:23:15] minute => '*',
[13:23:27] it's in modules/role/manifests/cache/perf.pp
[13:23:52] we used it to avoid some spiky/ugly VM behavior that happened on upload caches with some earlier kernels
[13:24:04] there's a suspicion we don't need it anymore even on 3.19, but haven't tested
[13:25:08] plus just below that we have some real vm tuning that's more-legitimate, which might mitigate it regardless
[13:26:01] it's just it was such an ugly issue and takes so long to validate that it's really gone, and doesn't seem to hurt much leaving the cron in as a safety heh
[13:26:11] so no priority on testing the removal of it :)
[13:26:34] https://puppet-compiler.wmflabs.org/1705/cp1043.eqiad.wmnet/
[13:26:42] I *think* we should be fine
[13:27:48] ema: looks legit!
[13:32:08] bblack: merged, thanks
[13:32:51] now I'll Use url instead of embedded data for logo on error page
[13:32:59] in _v4 as well :)
[13:54:04] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2014767 (BBlack) Status updates: 1. Remaining code refs: 1. https://github.com/wikimedia/mediawiki-services-c...
[14:28:54] ema: starting to review the current _v4 changes, will take a while though :)
[14:32:54] bblack: thanks! I basically went brute force mode trying to make it compile, surely there are lots of mistakes
[14:32:57] ema: one thing I just noticed on the existing systemd work: we should switch thread_pool_add_delay - in v3 this is in ms and v4 it's floating-point seconds
[14:33:19] oh yes, I've noticed that today as well
[14:36:09] interestingly, thread_pool_add_delay is flagged as "experimental" in the v4 man page
[14:36:46] but not on v3
[14:38:06] :)
[14:43:30] https://github.com/varnish/Varnish-Cache/commit/7e25234d6f25cf1dd622e4d17e70902c99e63b8b
[14:48:14] honestly, we may want to review that setting anyways. maybe start with leaving it at zero under v4, and make a note to check for threads_failed as we deploy experimental v4 hosts with real traffic...
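For reference, the unit mismatch and the counter to watch can both be inspected on a running host; these are standard varnishadm/varnishstat invocations, but treating them as the monitoring plan is an assumption on my part:

```
# thread_pool_add_delay is milliseconds on Varnish 3 but seconds on Varnish 4,
# so a "2" carried over unconverted would mean a 2-second stall per new thread.
varnishadm param.show thread_pool_add_delay

# With the delay left at the v4 default of 0, this counter staying at 0 on the
# canaries would suggest the old 2ms safety delay is no longer needed.
varnishstat -1 -f MAIN.threads_failed
```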
[14:48:54] with various changes in v4 code, and running much newer libc/kernel on nicer hardware than the original varnish work we did ages ago whenever that was set, it may no longer be necessary/beneficial to have the delay
[14:49:03] sounds like material for the "things-to-remember" task
[14:49:52] and yes, agreed, we should probably omit it on v4 at first. The new default is 0 instead of 2 ms on v3
[14:51:54] Traffic, operations: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2014886 (ema)
[15:03:17] speaking of newer libc/kernel, Debian still has a pretty old libjemalloc
[15:03:49] I filed https://bugs.debian.org/809239 about a month ago
[15:04:13] it matters for other pieces of software that we use, like hhvm
[15:04:39] the 4.0 release notes say "many speed and space optimizations" and "their cumulative effect is substantial"
[15:08:30] another package to backport \o/
[15:08:55] well, the new version doesn't even exist in sid yet
[15:08:58] hopefully the cumulative effect does not include 234 new crasher bugs :)
[15:09:04] so we're far from a backport :)
[15:09:32] bblack: see the 4.0.4 release notes, which I pasted in that bug above
[15:09:38] right, s/backport/maintain ourselves/ :P
[15:09:49] This bugfix release fixes another xallocx() regression. No other regressions have come to light in over a month, so this is likely a good starting point for people who prefer to wait for "dot one" releases with all the major issues shaken out.
[15:43:31] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015134 (mark) >>! In T125510#1989639, @BBlack wrote: > Should also note: while the above list of steps 1-5 sounds roughly cor...
[16:14:06] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015200 (BBlack) >>! In T125510#2015134, @mark wrote: > Do you think it's reasonable to not use eqiad caches at all for a whil...
[16:15:55] elukey: https://gerrit.wikimedia.org/r/#/c/269708
[16:22:16] \o/
[16:22:57] I didn't manage to do the test, sorry, I was busy with Jessie and memcached/redis :(
[16:30:32] np!
[16:52:08] ema: argh I missed the =, good catch :)
[16:52:37] does it change much? reading docs
[16:52:59] ah yes
[16:53:04] well you just don't get anything without the = :)
[16:53:42] yep yep I was reading the docs, I tend to forget erb very quickly
[16:54:30] I only found it by running puppet on a test box BTW, those syntax issues are impossible to see simply by skimming through the code IMHO
[17:53:26] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015546 (BBlack) Copying in notes from meeting etherpad, which capture some assumptions/thinking beyond what's currently in th...
[17:57:31] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015578 (BBlack) Note this went live circa 13:05 -> 13:15 UTC Feb 10. So far preliminary data in our graphs looks (to me!) like in the aggregate of client reque...
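As a quick client-side spot check of the SPDY change above, the negotiated protocol can be probed with openssl; the hostname and NPN list here are illustrative and not part of any verification procedure described on the task:

```
# If SPDY is really disabled, the server should fall back to http/1.1 here
# (needs an openssl s_client new enough to support -nextprotoneg).
openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org \
    -nextprotoneg 'spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i protocol
```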
[18:49:43] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015756 (Gilles) Are you sure this wasn't 15:10 UTC? Isn't that when the patch was merged?
[18:53:06] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015771 (BBlack) Yup, sorry, thinko while translating timezones. Updated above too!
[18:59:50] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015813 (Krinkle) Last 12 hours compared to same time last week. Seems starting in the hour after 15:00 (red mark) there is a noticeable regression. {F3330702 s...
[19:08:32] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015878 (BBlack) I'd say some of those graphs, they fluctuate so much we need more data to confirm the pattern. But the TTFB, DOM-Complete, and onLoad ones cert...
[22:52:22] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2017086 (Krinkle)