[03:38:22] Wikimedia-Apache-configuration: Titles containing a dot give HTTP 403 when accessed using action=raw and the short URL form - https://phabricator.wikimedia.org/T126183#2013992 (matmarex) Previously rejected as T85751 by @brion.
[10:14:19] HTTPS, Huggle: Huggle 2 fails on HTTP used when HTTPS expected - https://phabricator.wikimedia.org/T126357#2014331 (DVdm) Got it! Just commenting out the aborts and exit subs does the trick. In **Login.vb** ``` Class LoginRequest : Inherits Request ... Protected Overrides Sub Process() ......
[12:15:45] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#2014556 (akosiaris)
[12:29:00] HTTPS, OTRS, operations: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#2014594 (akosiaris) Open>Invalid A side effect of the upgrade is that this has been turned obsolete. Now OTRS is behind misc-web, treated the same as all oth...
[12:29:19] bblack: I guess we can probably update varnish.systemd.erb already? https://gerrit.wikimedia.org/r/#/c/269664/
[12:39:23] cp1065 and cp1067 complain about RAID errors
[12:40:48] thanks paravoid
[12:41:13] also tons of
[12:41:13] Subject: Cron /usr/local/sbin/update-ocsp-all
[12:41:14] run-parts: /etc/update-ocsp.d/hooks/nginx-reload exited with return code 99
[12:41:20] but not sure if these are fixed by now
[12:41:36] md0 : active raid1 sdb1[1] 9756672 blocks super 1.2 [2/1] [_U]
[12:41:37] (the hook just does "service nginx reload")
[12:42:10] Feb 10 11:23:02 cp1059 systemd[1]: Unit nginx.service cannot be reloaded because it is inactive.
[12:42:13] Feb 10 12:23:02 cp1059 systemd[1]: Unit nginx.service cannot be reloaded because it is inactive.
[12:42:48] morning :)
[12:43:08] good morning!
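The update-ocsp cronspam above boils down to the nginx-reload hook unconditionally running `service nginx reload` on hosts where nginx is inactive (the stale crons on the de-roled cache_mobile hosts turn out to be the real culprit, as noted below). A minimal guard for the hook might look like this; it is only a sketch, since the log only quotes the one command the hook actually contains:

```
#!/bin/sh
# Sketch only: skip the reload when nginx isn't running, so run-parts doesn't
# email a non-zero exit status from hosts where the service is inactive.
if service nginx status >/dev/null 2>&1; then
    exec service nginx reload
fi
exit 0
```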
[12:44:41] hi :)
[12:46:37] the last ones that did that were trivial and no real hardware fault
[12:47:03] (the last two that were complaining of raid, just needed mdadm stuff to start syncing again, some fallout from reboot and systemd and god knows what)
[12:47:32] yeah, I've seen this happen before for some reason :/
[12:47:52] lvs4002 also had troubles booting yesterday, dropped into the initramfs shell
[12:47:56] well it's the same reason a few of the caches stopped on bootup at an (initramfs)
[12:48:00] I mdadm assembled everything and rebooted and then it was okay
[12:48:02] heh I couldn't even finish
[12:48:06] haha
[12:48:22] basically something with kernel+systemd doesn't get /dev/sdX running "in time"
[12:49:14] paravoid: the ocsp cronspam recently is from the decommed cache_mobile machines
[12:49:25] their configuration's broken and they're de-roled, but that didn't remove the cronjob
[12:50:02] nod
[12:56:34] so cp1067 happened to be a 4.4 canary too, and before I even had a chance to touch the raid speed_limit_(min|max) to speed it up, it already showed:
[12:56:37] recovery = 5.8% (568832/9756672) finish=0.8min speed=189610K/sec
[12:56:51] whereas the 3.19 ones always started out ~1024K/sec and ~120minute estimate
[12:57:59] the values in procfs are the same though
[12:58:05] bblack: we can start messing up maps machines soon :) https://gerrit.wikimedia.org/r/#/c/269466/
[12:58:49] ema: awesome :)
[12:59:32] I'm guessing something changed from 3.19 to 4.4 that allows it to rebuild closer to _max rather than _min speed when the system is bogged down on other i/o
[12:59:43] s/is/is not/
[13:04:04] Traffic, operations, Patch-For-Review: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#2014667 (ema) With https://gerrit.wikimedia.org/r/#/c/269664/ and https://gerrit.wikimedia.org/r/#/c/269466/ applied the following procedure allows to upgrade a maps box to Varnish 4: ech...
[13:09:19] bblack: oh, would you prefer using two different systemd template files for v3 and v4?
[13:10:51] ema: the change as proposed now, if merged to ops/puppet, would modify the existing v3 systemd template in bad ways, right?
[13:10:57] or was it too early in the morning for me? :)
[13:13:02] mmh, except for the order in which parameters are passed I do not see much of a difference for v3 (unless it's too late in the morning for *me*) :)
[13:13:54] well you're replacing shm_workspace with a new name that doesn't exist in v3, and you're removing shm_workspace, which we really need in v3 to avoid crashes but doesn't exist in v4
[13:14:29] oh and there's a conditional around it
[13:14:35] see, it was too early in the morning for me :)
[13:16:09] haha :)
[13:16:26] is there a way to test the change on a single node before merging?
[13:16:40] not easily really
[13:16:49] but you can do puppet compiler against the nodes
[13:16:57] ah sure!
[13:17:03] if you mark some node as varnish_version4 in the change I guess
[13:17:12] or don't and just confirm existing ones aren't affected
[13:17:22] I wanted to try the latter first
[13:17:44] I haven't really had a chance to look at the VCL diff, I need to download the change and see, since the "diff" isn't obvious in the diff what with the copy/rename
[13:18:07] we'll need to be careful going forward once that gets merged btw, to make sure others' changes to the VCL templates get mirrored to both copies
[13:18:42] yeah, probably the best way to look at the VCL changes is to diff the _v4 files against the v3 ones
[13:18:50] e.g. https://gerrit.wikimedia.org/r/#/c/269661/ that I'm about to merge :)
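Because the change copies the VCL templates to _v4 names rather than editing them in place, gerrit's own diff view hides the real v3 -> v4 delta; diffing each forked file against its counterpart, as suggested above, is more readable. A rough sketch of that, where the directory and file-naming pattern are assumptions rather than the actual ops/puppet layout:

```
# Compare each forked _v4 template against its v3 counterpart.
# Path and naming are illustrative; adjust to wherever the templates live.
cd modules/varnish/templates
for v4 in *_v4*.erb; do
    diff -u "${v4/_v4/}" "$v4"
done
```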
[13:19:34] alright that one is not too bad
[13:20:19] error message of the day: "Exprected ct to be boolean, was: true"
[13:20:31] so how are the 4.4 canaries?
[13:20:38] no increased CPU there?
[13:20:53] I'm looking at http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1067.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1455109116&g=cpu_report&z=large&c=Text%20caches%20eqiad
[13:20:55] mark: you're not alone in your pain
[13:21:02] paravoid: I haven't properly looked this morning. yesterday shortly after, there was a little bit of an iowait bump, but expected post-reboot anyways
[13:21:38] doesn't look like anything crazy happened
[13:21:48] nod
[13:22:00] whoops, wrong channel ;)
[13:22:09] post-4.4, I'd really like to test removing the vm hack on the upload caches and see what happens too
[13:22:14] it's about time we should be past that mess :P
[13:22:30] the per-minute cronjob?
[13:22:46] [ 5.368853] systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:3] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[13:22:49] [ 5.384216] systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:4] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[13:22:58] fwiw, I was just looking at dmesg :)
[13:22:59] yeah
[13:23:09] per-minute cronjob?
[13:23:13] ema:
[13:23:14] cron { 'varnish_vm_compact_cron':
[13:23:14] command => 'echo 1 >/proc/sys/vm/compact_memory',
[13:23:14] user => 'root',
[13:23:15] minute => '*',
[13:23:27] it's in modules/role/manifests/cache/perf.pp
[13:23:52] we used it to avoid some spiky/ugly VM behavior that happened on upload caches with some earlier kernels
[13:24:04] there's a suspicion we don't need it anymore even on 3.19, but haven't tested
[13:25:08] plus just below that we have some real vm tuning that's more-legitimate, which might mitigate it regardless
[13:26:01] it's just it was such an ugly issue and takes so long to validate that it's really gone, and doesn't seem to hurt much leaving the cron in as a safety heh
[13:26:11] so no priority on testing the removal of it :)
[13:26:34] https://puppet-compiler.wmflabs.org/1705/cp1043.eqiad.wmnet/
[13:26:42] I *think* we should be fine
[13:27:48] ema: looks legit!
[13:32:08] bblack: merged, thanks
[13:32:51] now I'll Use url instead of embedded data for logo on error page
[13:32:59] in _v4 as well :)
[13:54:04] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2014767 (BBlack) Status updates: 1. Remaining code refs: 1. https://github.com/wikimedia/mediawiki-services-c...
[14:28:54] ema: starting to review the current _v4 changes, will take a while though :)
[14:32:54] bblack: thanks! I basically went brute force mode trying to make it compile, surely there are lots of mistakes
[14:32:57] ema: one thing I just noticed on the existing systemd work: we should switch thread_pool_add_delay - in v3 this is in ms and v4 it's floating-point seconds
[14:33:19] oh yes, I've noticed that today as well
[14:36:09] interestingly, thread_pool_add_delay is flagged as "experimental" in the v4 man page
[14:36:46] but not on v3
[14:38:06] :)
[14:43:30] https://github.com/varnish/Varnish-Cache/commit/7e25234d6f25cf1dd622e4d17e70902c99e63b8b
[14:48:14] honestly, we may want to review that setting anyways. maybe start with leaving it at zero under v4, and make a note to check for threads_failed as we deploy experimental v4 hosts with real traffic...
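For reference, the unit mismatch and the counter to watch can both be inspected on a running host; these are standard varnishadm/varnishstat invocations, but treating them as the monitoring plan is an assumption on my part:

```
# thread_pool_add_delay is milliseconds on Varnish 3 but seconds on Varnish 4,
# so a "2" carried over unconverted would mean a 2-second stall per new thread.
varnishadm param.show thread_pool_add_delay

# With the delay left at the v4 default of 0, this counter staying at 0 on the
# canaries would suggest the old 2ms safety delay is no longer needed.
varnishstat -1 -f MAIN.threads_failed
```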
[14:48:54] with various changes in v4 code, and running much newer libc/kernel on nicer hardware than the original varnish work we did ages ago whenever that was set, it may no longer be necessary/beneficial to have the delay
[14:49:03] sounds like material for the "things-to-remember" task
[14:49:52] and yes, agreed, we should probably omit it on v4 at first. The new default is 0 instead of 2 ms on v3
[14:51:54] Traffic, operations: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2014886 (ema)
[15:03:17] speaking of newer libc/kernel, Debian still has a pretty old libjemalloc
[15:03:49] I filed https://bugs.debian.org/809239 about a month ago
[15:04:13] it matters for other pieces of software that we use, like hhvm
[15:04:39] the 4.0 release notes say "many speed and space optimizations" and "their cumulative effect is substantial"
[15:08:30] another package to backport \o/
[15:08:55] well, the new version doesn't even exist in sid yet
[15:08:58] hopefully the cumulative effect does not include 234 new crasher bugs :)
[15:09:04] so we're far from a backport :)
[15:09:32] bblack: see the 4.0.4 release notes, which I pasted in that bug above
[15:09:38] right, s/backport/maintain ourselves/ :P
[15:09:49] This bugfix release fixes another xallocx() regression. No other regressions have come to light in over a month, so this is likely a good starting point for people who prefer to wait for "dot one" releases with all the major issues shaken out.
[15:43:31] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015134 (mark) >>! In T125510#1989639, @BBlack wrote: > Should also note: while the above list of steps 1-5 sounds roughly cor...
[16:14:06] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015200 (BBlack) >>! In T125510#2015134, @mark wrote: > Do you think it's reasonable to not use eqiad caches at all for a whil...
[16:15:55] elukey: https://gerrit.wikimedia.org/r/#/c/269708
[16:22:16] \o/
[16:22:57] I didn't manage to do the test, sorry, I was busy with Jessie and memcached/redis :(
[16:30:32] np!
[16:52:08] ema: argh I missed the =, good catch :)
[16:52:37] does it change much? reading docs
[16:52:59] ah yes
[16:53:04] well you just don't get anything without the = :)
[16:53:42] yep yep I was reading the docs, I tend to forget erb very quickly
[16:54:30] I only found it by running puppet on a test box BTW, those syntax issues are impossible to see simply by skimming through the code IMHO
[17:53:26] Traffic, operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#2015546 (BBlack) Copying in notes from meeting etherpad, which capture some assumptions/thinking beyond what's currently in th...
[17:57:31] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015578 (BBlack) Note this went live circa 13:05 -> 13:15 UTC Feb 10. So far preliminary data in our graphs looks (to me!) like in the aggregate of client reque...
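As a quick client-side spot check of the SPDY change above, the negotiated protocol can be probed with openssl; the hostname and NPN list here are illustrative and not part of any verification procedure described on the task:

```
# If SPDY is really disabled, the server should fall back to http/1.1 here
# (needs an openssl s_client new enough to support -nextprotoneg).
openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org \
    -nextprotoneg 'spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i protocol
```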
[18:49:43] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015756 (Gilles) Are you sure this wasn't 15:10 UTC? Isn't that when the patch was merged?
[18:53:06] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015771 (BBlack) Yup, sorry, thinko while translating timezones. Updated above too!
[18:59:50] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015813 (Krinkle) Last 12 hours compared to same time last week. Seems starting in the hour after 15:00 (red mark) there is a noticeable regression. {F3330702 s...
[19:08:32] Traffic, Performance-Team, operations, Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2015878 (BBlack) I'd say some of those graphs, they fluctuate so much we need more data to confirm the pattern. But the TTFB, DOM-Complete, and onLoad ones cert...
[22:52:22] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2017086 (Krinkle)