[08:04:46] morning!
[08:04:52] ERROR:conftool:Error when trying to set/pooled=yes on name=cp1073.eqiad.wmnet,service=varnish-be-rand
[08:05:32] the only "no" that I can see is {"cp1073.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=cache_upload,service=varnish-be-rand"}
[08:08:08] 10netops, 06Operations, 10Ops-Access-Requests: elukey - Access to network devices - https://phabricator.wikimedia.org/T147061#2691796 (10elukey) @ema also would like to get access to the network equipment, maybe we could couple both requests in one task. I'd also really like to get added to the noc email lis...
[09:10:27] TIL: you can include closed tasks in the workboard https://phabricator.wikimedia.org/project/board/1201/query/all/
[09:12:41] elukey: thanks, repooled
[09:36:56] ema, bblack: I'd like to refresh the kernels running on lvs
[09:37:18] they're still at +wmf1
[09:38:43] I already installed the new kernel packages, could you deal with the rolling reboot? there's a wide range of other hosts I need to look into ATM
[09:40:29] wmf5 is based on 4.4.19 and already running on a wide range of hosts
[10:26:28] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2692141 (10elukey) Idea to discuss: ```...
moritzm: sure, I'll coordinate with Brandon when he gets online
[10:39:40] thanks
[12:24:43] godog: re: https://gerrit.wikimedia.org/r/#/c/310819/, could you add some more details to the commit log? To me it's not immediately clear what the CR is about
[12:27:27] ema: running to lunch, will do!
[12:34:19] 10netops, 06Operations, 10Ops-Access-Requests: Give elukey/ema access to network devices - https://phabricator.wikimedia.org/T147061#2692344 (10faidon)
[12:44:01] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2692380 (10BBlack) Seems like it's worth testing :)
[12:44:36] ema: I don't think we've yet rebooted caches onto the latest 4.4 either
[12:44:41] or the authdns boxes
[12:47:39] yeah, all the caches say: Linux cp1071 4.4.0-1-amd64 #1 SMP Debian 4.4.2-3+wmf2 (2016-05-11) x86_64
[12:48:03] installed version on at least some of them is 4.4.2-3+wmf3
[12:48:18] and policy says available is 4.4.2-3+wmf5
[12:49:05] oh, I've checked just a couple of them and they had wmf5 installed
[12:49:35] ok, let me salt that
[12:50:21] I've salted apt-cache policy with G@cluster:lvs and all seem to have wmf5 installed
[12:50:42] yeah, I installed the wmf5 packages on lvs* earlier in the day
[12:51:05] yeah, just most of the caches don't have it yet
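[ed: the exact salt invocation isn't in the log, so the following is a guess at what "I've salted apt-cache policy with G@cluster:lvs" looked like; the package names and target expressions are assumptions, but the grain-targeting syntax is standard salt:]

```
# Compare the running kernel against what apt would install, per cluster:
salt -C 'G@cluster:lvs' cmd.run 'uname -r; apt-cache policy linux-image-4.4.0-1-amd64'

# Hosts still on a 3.19 kernel (relevant further down, where radon and eeden
# turn out to have been missed by a too-narrow search) can be found via the
# kernelrelease grain:
salt -C 'G@kernelrelease:3.19*' test.ping
```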
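[ed: and going back to the pooled-state dance at the top of the log, a minimal sketch of how the cp1073 state might have been inspected and repooled with confctl. The select/get/set syntax matches conftool of roughly this era (T147480 below is about upgrading to 0.3.1), but treat the exact form as an assumption:]

```
# Show the current object for this host/service, including pooled and weight:
confctl select 'name=cp1073.eqiad.wmnet,service=varnish-be-rand' get

# Flip it back to pooled; this is what the failing set/pooled=yes above
# was attempting:
confctl select 'name=cp1073.eqiad.wmnet,service=varnish-be-rand' set/pooled=yes
```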
[12:51:11] any objections to me draining ulsfo right now?
[12:51:22] moritzm: we're still not yet applying the last debian point release stuff too, right?
[12:51:38] paravoid: none here. temporary?
[12:51:40] yes
[12:51:43] just junos upgrades
[12:51:45] then esams
[12:51:47] ok
[12:51:58] esams I could probably do without draining it at all
[12:52:00] maybe we'll avoid doing kernel reboots during that :)
[12:52:12] ulsfo as well I guess
[12:52:23] but let's not risk it
[12:52:27] I started to distribute the latest point release cluster-wide, but it's not complete yet; if you want you can dist-upgrade the cp hosts, but over the next week it'll reach the entire cluster
[12:52:33] paravoid: yeah it's fine
[12:52:44] moritzm: ok
[12:52:58] it was released the week before the offsite, but I didn't have much time for these upgrades since the trusty kernels needed reboots
[12:53:34] moritzm: modulo waiting on faidon's depools, I'd say we (or I, as the day wears on into US-only territory) should be able to focus on this today
[12:53:34] * ema goes afk for lunch
[12:53:48] and get dist-upgrade + latest kernel on the cp* + lvs* + authdns anyways
[12:54:13] may not finish all the cache reboots, but can finish rebooting lvs/authdns first, and have all the packages ready on cp and some progress on reboots, anyways
[12:55:08] great, the wmf1 kernels are the only ones affected by the recvmmsg bug which was featured in the latest android release (CVE-2016-7117): https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34b88a68f26a75e4fded796f1a49c40f82234b7d
[12:55:48] this ended up in 4.4.8 via the stable kernels, but wasn't flagged as a security issue by security@kernel.org until the android bulletin was published
[12:56:12] I'm not really sure if this allows DoS, but better safe than sorry
[12:56:20] 4.4.8 ended up in wmf2
[12:57:21] hmmm, might be relevant to authdns...
[12:58:05] baham is the only one with wmf1
[12:58:18] that also needs to be updated
[12:58:22] right
[12:58:44] the CVE doesn't apply to 3.19?
[13:00:12] ah, my salt search was too narrow; radon and eeden are still on 3.19 and also need an upgrade (there are only approx. two dozen 3.19 hosts left)
[13:00:20] ok
[13:00:23] (as far as authdns is concerned)
[13:01:52] it doesn't look exploitable, but better safe than sorry
[13:02:03] since the authdns do pick up most of their requests via recvmmsg()
[13:03:40] ok!
[13:06:49] 10netops, 06Operations, 10Ops-Access-Requests: Give elukey/ema access to network devices - https://phabricator.wikimedia.org/T147061#2692470 (10faidon) 05Open>03Resolved Done for both of you, on all 30 devices (11 core routers, 12 access switches, 4 management routers, 2 management switches, 1 peering sw...
[14:08:14] bblack: so cp* hosts upgraded and eeden rebooted, right?
[14:08:25] there are a few cp puppetfails, I can look into those
[14:09:38] the cp puppetfails are ulsfo, due to faidon's network stuff presumably
[14:09:48] I'm still looking at eeden, I did the full dist-upgrade there
[14:09:56] cp* I've only updated the linux-meta and kernel packages on so far
[14:10:37] eeden had some strange fallout from dist-upgrade, I think we have to do a second "apt-get update" afterwards, but not sure what was going on
[14:11:19] seems like it anyways
[14:12:15] so on cp4013 and cp4018 puppet failed because some packages couldn't be authenticated
[14:12:28] apt-get update ; puppet agent -t fixes it
[14:12:36] basically the flow on eeden was: install linux-meta linux-image-4.4.0-2-amd64, then apt-get dist-upgrade, then rebooted afterwards... then when it did its first puppet run post-reboot it tried/failed to install a bunch of package updates that are set to "latest", and "apt-get upgrade" showed a bunch of package updates (which is wrong)
[14:12:56] but after "apt-get update", everything's back to zero upgrades available...
[14:13:14] it's kind of strange...
[14:13:35] seems related to tmux
[14:14:09] well, tmux was one of several it wanted to "upgrade" (I think actually downgrade), or package metadata was out of whack temporarily or something
[14:14:21] meh, there's a similar problem to what I had been seeing with tshark
[14:14:34] yeah
[14:14:39] anyways, trying baham now
[14:14:57] it seems to try to install the version from jessie-backports (since it pulls in libutempter, which is only a dependency in the bpo version)
[14:15:23] it did fix itself after some combination of apt-get update and puppet runs
[14:16:25] I'll prepare a patch to drop the package->latest from standard_packages, it's only useful for a few corner cases like tzdata
[14:17:00] right now it's causing more problems than adding value
[14:20:19] of the authdns, I'll leave radon for later in the day Just In Case
[14:20:32] 10Traffic, 06Operations, 05Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2692702 (10ema)
[14:21:40] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port varnish metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147424#2692709 (10fgiunchedi)
[14:21:51] ok
[14:23:06] no strangeness on baham, but I did do an extra "apt-get update" after the dist-upgrade in case that was a factor
[14:23:54] 10Traffic, 06Operations, 05Prometheus-metrics-monitoring: Port vhtcpd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147429#2692715 (10fgiunchedi)
[14:23:56] starting on the LVS package stuff
[14:24:06] cp hosts will be tricky, because dist-upgrade will pick up libvmod-netmapper....
[14:24:37] we should really just get over that hurdle at the same time, I guess
[14:24:37] bblack: planning on dist-upgrading the lvs hosts?
[14:24:43] ema: yes
[14:24:45] cool
[14:24:56] let me know if I can help
[14:25:13] ema: I guess for cp, we can break it up into some smaller chunks, and dist-upgrade->reboot letting it take libvmod-netmapper as well
[14:25:28] the only fallout should be no ability to reload VCL between the package install and the reboot, right?
[14:25:53] or well, it's worse than that, right? trying to reload VCL will kill the child?
[14:25:55] well, the fallout is that any reload would fail, leaving the service down
[14:26:18] and of course, we have confd triggering a reload attempt on any depool, and the hosts auto-depool their services on reboots :)
[14:26:30] 10netops, 06Operations: Upgrade cr1-ulsfo & cr2-ulsfo to JunOS 13.3 - https://phabricator.wikimedia.org/T143914#2692745 (10faidon) 05Open>03Resolved a:03faidon Done!
[14:26:32] of course :)
[14:27:08] this seems like an appropriate time for:
[14:27:11] I HAVE NO TOOLS BECAUSE I DESTROYED MY TOOLS WITH MY TOOLS
[14:27:15] hahahahaha
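[ed: one way to keep a mass dist-upgrade from picking up libvmod-netmapper ahead of time (the "mask/prevent its installation" idea floated just below) would be an apt hold; a sketch, assuming the binary package shares the name used in the discussion and that the hosts carry a cluster grain as above:]

```
# Pin the currently-installed libvmod-netmapper so dist-upgrade skips it
# fleet-wide, keeping VCL reloads working in the meantime:
salt -C 'G@cluster:cache_upload' cmd.run 'apt-mark hold libvmod-netmapper'

# Then, per host, just before its reboot:
apt-mark unhold libvmod-netmapper && apt-get -y install libvmod-netmapper
```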
[14:28:19] are all DNS up again?
[14:30:09] paravoid: yes
[14:30:54] ema: so either way, we need to do libvmod-netmapper one host at a time (install it just before rebooting the host)
[14:31:15] we can either do something hacky to mask/prevent its installation, then mass dist-upgrade stuff, and then do that package install alone just before reboot
[14:31:32] or we can just do it such that we wait to dist-upgrade each until just before their reboots
[14:31:40] bblack: so a safe way to upgrade cp hosts could be: depool, puppet disable, upgrade, reboot, repool?
[14:31:57] that would leave puppet disabled :)
[14:32:14] right, and re-enable puppet :)
[14:32:33] but yeah, anyways, dist-upgrade just before reboot I guess
[14:32:37] also, unless something's changed recently, they will by default auto-depool themselves on shutdown before stopping varnishd/nginx
[14:32:47] and not auto-repool on startup, unless we touch some file to allow it
[14:33:57] there's a "traffic-pool" service
[14:34:02] which has:
[14:34:03] After=varnish.service varnish-frontend.service nginx.service
[14:34:03] but the auto-depool is going to fail because of the reload issue, isn't it?
[14:34:10] ExecStart=/bin/sh -c 'if test -f /var/lib/traffic-pool/pool-once; then rm -f /var/lib/traffic-pool/pool-once; sleep 45; /usr/local/bin/pool; fi'
[14:34:13] ExecStop=/usr/local/bin/depool ; /bin/sleep 45
[14:34:37] the auto-depool will succeed in depooling to etcd, which will trigger a failing VCL reload asynchronously everywhere related
[14:34:58] (which is why we don't want multiple outstanding libvmod-netmapper upgrades)
[14:36:59] so basically the command line is something like....
[14:37:54] apt-get -y dist-upgrade; apt-get -y autoremove; apt-get update; touch /var/lib/traffic-pool/pool-once; reboot;
[14:38:18] but probably best wrapped up in run-no-puppet, to avoid apt commands failing or something?
[14:38:28] or something
[14:38:57] also to avoid VCL reloads in between
[14:39:00] yeah
[14:39:11] that won't stop confd reloads though
[14:39:15] so....
[14:39:42] run-depooled-no-puppet-no-cron-just-dont-do-anything
[14:40:00] but then there's the reboot in play too
[14:40:04] we want puppet back on post-reboot
[14:40:21] puppet agent --enable after "reboot;" would be racy, it might not always work at all
[14:40:30] but we could do reboot with a time offset shortly in the future
[14:40:55] at +1 reboot or whatever it is
[14:45:10] echo reboot|at now + 1 minute
[14:45:17] ^ that's the syntax I was looking for
[14:46:06] nice
[14:46:09] puppet shouldn't be reloading VCL anyways if we're not pushing VCL changes
[14:46:15] just confd
[14:46:20] so....
[14:47:00] bblack: when you have some free time, I'd appreciate your review of https://gerrit.wikimedia.org/r/#/c/312225/
[14:47:22] bblack: I did implement the changes you suggested, but varnish is still not the place where I'm comfortable :P
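[ed: assembled from the fragments quoted above, the traffic-pool unit presumably looks roughly like this. Only the After/ExecStart/ExecStop directives are verbatim from the log; the rest of the scaffolding (Type, RemainAfterExit, Install) is an assumption about how such a start/stop hook unit would be wired up:]

```
# /lib/systemd/system/traffic-pool.service (reconstruction, see caveats above)
[Unit]
Description=Pool/depool this cache host around service start/stop
After=varnish.service varnish-frontend.service nginx.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Repool at boot only if an operator opted in by touching pool-once pre-reboot:
ExecStart=/bin/sh -c 'if test -f /var/lib/traffic-pool/pool-once; then rm -f /var/lib/traffic-pool/pool-once; sleep 45; /usr/local/bin/pool; fi'
# Always depool (to etcd) on shutdown, then let connections drain:
ExecStop=/usr/local/bin/depool ; /bin/sleep 45

[Install]
WantedBy=multi-user.target
```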
[14:48:36] bblack, ema: once ulsfo is fully repooled, I'd like to proceed with draining esams; any disagreements?
[14:48:58] for approximately 2 hours again
[14:49:18] if it interferes with anything that you're doing, I can postpone for tomorrow or another day, it's not urgent
[14:50:02] paravoid: it's fine, we can work around it
[14:54:08] really we don't have to worry that much about puppet and VCL reloads, it's mostly about puppet potentially messing with apt commands
[14:54:17] so the "at" thing isn't really necessary
[14:54:21] maybe something like:
[14:54:23] albeit cool
[14:55:09] ; service confd stop; apt-get -y dist-upgrade; apt-get -y autoremove; apt-get update; touch /var/lib/traffic-pool/pool-once; puppet agent --enable; reboot;
[14:55:35] it's the "disable puppet reliably" part we don't have a tool for yet, at least not one that doesn't try to contain commands within itself
[14:55:56] disable puppet reliably == disable and then wait on any already-running one
[14:58:28] perhaps: run-no-puppet whatever-command && service confd stop ...
[14:58:54] or even run-no-puppet service confd stop && ...
[14:59:30] yeah I'm retarded
[14:59:49] run-no-puppet puppet agent disable && service confd stop ...
[15:00:27] lol, that's awesome
[15:00:39] "run-no-puppet puppet agent --disable" should actually work, right?
[15:00:44] I think :)
[15:00:55] other than it breaking the whole "don't step on other disables" thing, which we implicitly have to break in this case anyways
[15:03:16] oh no, it can't work right
[15:03:18] has to be:
[15:03:36] puppet agent --disable blah; run-no-puppet echo; ....
[15:03:48] or else run-no-puppet will undo the inner disable, since it comes later and doesn't overwrite the reason
[15:04:14] "run-no-puppet echo" just gets us the "wait on existing puppet run to finish" part
[15:05:18] you're right
[15:05:46] so:
[15:06:01] puppet agent --disable blah; run-no-puppet echo; service confd stop; apt-get -y dist-upgrade; apt-get -y autoremove; apt-get update; touch /var/lib/traffic-pool/pool-once; puppet agent --enable; reboot;
[15:06:20] is what we need to step through on cache nodes, one host at a time
[15:07:17] oh, so I think the reason I remembered that "at" hack at all is that I had to use it to salt reboots before
[15:07:33] because if a salt command contains "reboot", it tends to timeout/hang the salt execution at the end
[15:07:43] ok
[15:07:58] so if we're executing this via salt (which makes sense):
[15:08:14] puppet agent --disable blah; run-no-puppet echo; service confd stop; apt-get -y dist-upgrade; apt-get -y autoremove; apt-get update; touch /var/lib/traffic-pool/pool-once; puppet agent --enable; echo reboot|at now + 1 minute
[15:09:08] even "-b 1" would give lots of overlap though, so we really want serial execution with sleeps
[15:09:57] as in: make a hostlist file with all the cache nodes in some swizzled order, and then: for h in `cat hostfile`; do salt -v -b 1 -t 300 $h '.....'; sleep NNNNN; done
[15:10:12] running in a screen, with a sufficient inter-node sleep to make it not too impactful on hitrates and such
[15:10:16] with icinga-downtime for bonus points
[15:10:32] heh, that too, which we also have to use single-host salt to execute on neon
[15:10:55] mostly this is about cache_upload, though
[15:11:02] cache_text doesn't have the libvmod-netmapper problem
[15:11:13] and misc/maps are easier to reboot quicker, they're lower-traffic and have less cache contents, etc
[15:11:49] maybe disable the weekly restart cron ahead of starting that process on cache_upload, too; can start it back up after completion, so there aren't pointless backend restarts happening near the reboots
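[ed: putting the per-host command from 15:08 and the serial loop from 15:09 together, the rollout driver might look like this. The '.....' in the log is elided, so wrapping the command in cmd.run, the disable-reason string, and the 900s inter-node sleep are all assumptions/placeholders:]

```
#!/bin/bash
# Serial rolling dist-upgrade + reboot over cache hosts, one at a time.
# hostfile: one FQDN per line, in some swizzled (cross-DC) order.
PER_HOST='puppet agent --disable "rolling reboot"; run-no-puppet echo; '
PER_HOST+='service confd stop; apt-get -y dist-upgrade; apt-get -y autoremove; '
PER_HOST+='apt-get update; touch /var/lib/traffic-pool/pool-once; '
PER_HOST+='puppet agent --enable; echo reboot | at now + 1 minute'

for h in $(cat hostfile); do
    salt -v -b 1 -t 300 "$h" cmd.run "$PER_HOST"
    sleep 900   # inter-node gap so frontend hitrates can recover; tune as needed
done
```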
[15:12:56] orchestration is easy, right?
[15:13:28] indeed!
[15:13:40] or we could just deal with libvmod-netmapper separately and simplify this greatly
[15:13:45] that's probably better in the net
[15:14:01] I should've done that with the frontend restarts in cache_upload for jemalloc
[15:15:32] it takes about 3 hours total for frontends to recover their hitrates from a batched depooled restart yesterday (with no artificial sleeps, just the ones built into varnish-frontend-restart)
[15:15:40] using -b 1 salt execution over them all
[15:16:49] so yeah, maybe I'll just do that again (and even quicker on maps/misc)
[15:17:00] and get us past libvmod-netmapper separately, ahead of the dist-upgrade + kernel stuff
[15:18:30] anyways, I'll look at that later today
[15:18:37] lvs dist-upgrades...
[15:19:57] (after esams is back online, which will be a while)
[15:21:31] thinking out loud for other things to do today: copy down notes on Traffic Contract session into some kind of palatable task/ticket (of course the contract will evolve and grow more-perfect forever, so the task is mostly about the initial stuff, not making it perfect)
[15:21:53] I already made tasks/updates for the crypto stuff
[15:22:09] one thing we can do there is start working on the VCL for the 3DES page
[15:22:35] https://phabricator.wikimedia.org/T147199
[15:23:16] basically a "200 OK" error synth page, some work on HTML layout and content about upgrading, and VCL to conditionally use it to replace /wiki/ pageviews at X% rate (no need for cookie blocking I think)
[15:24:00] we can probably steal the basic layout from our existing error synth page
[15:35:28] not even ie8 users on xp deserve seeing my html, but if the goal is to punish them I can help
[15:40:04] :)
[15:41:30] <marquee>WE MISS YOU HERE IN 2016</marquee>
[15:42:03] you forgot to use <blink>
[15:42:25] users on windows xp don't read pages that lack <blink>
[15:43:07] they don't find them trustworthy
[15:43:57] only when combined with <marquee>
[15:53:22] I found this cute easter egg a couple of weeks ago -- https://www.google.com/search?q=marquee+tag
[15:53:22] TIL: the blink element was implemented under the influence http://www.montulli.org/theoriginofthe%3Cblink%3Etag
[15:55:58] splendid, both
[16:00:55] input welcome on dstat_varnishstat https://gerrit.wikimedia.org/r/#/c/314247/
[16:06:30] bd808: neat. I miss marquee
[16:08:10] ema: might be nice to calculate exp_mailed - exp_received as an expiry mailbox lag
[16:09:34] 10Traffic, 10MediaWiki-API, 10Monitoring, 06Operations, 06Services: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#2693434 (10GWicke) > I think anything we have today is going to just create more icinga spam in the IRC channel? While I agree that we n...
[16:13:26] speaking of which, 4x of the eqiad backends have significant mailbox lag that's slowly rising
[16:13:36] and only ~3 days in
[16:14:12] 10netops, 06Operations: Upgrade cr1-esams & cr2-knams to JunOS 13.3 - https://phabricator.wikimedia.org/T143913#2693493 (10faidon) 05Open>03Resolved a:03faidon …aand done as well.
[16:14:13] but also we've been spiking up backend (hit-local and hit-remote) missrate with the ulsfo/esams depools, most probably because of regional differences in hot object sets
[16:14:49] so it could be that this wouldn't be normal, but is fallout from the higher miss->fetch->nuke from that today, and may self-resolve at some point before it becomes an LRU_Fail problem
[16:16:35] (well, for causes, and/or the large general loadavg/traffic increase to the same upload eqiad nodes from taking on all the esams traffic)
[16:17:09] I'm done with the esams stuff, just waiting a little bit before I repool
[16:17:29] cr1-esams/cr2-knams are MX80s and are super slow in terms of CPU
[16:17:38] which means it takes a while for routing to converge, unfortunately
[16:18:35] looks like the problem started kicking in around 15:05
[16:18:57] which is around esams drain time
[16:19:38] so, yeah: drain esams -> excess load directly to eqiad + significant URL pattern divergence driving a higher missrate -> expiry mailbox backlog becomes more likely than usual
[16:19:59] I bet it recovers eventually, though
[16:20:03] fun.
[16:20:38] notably, the 4/11 hosts with the lag are all ones in the 2-3 days uptime category, not younger
[16:20:45] unrelatedly, I'd like to put https://phabricator.wikimedia.org/T143915 back on the radar (just triaging the netops workboard)
[16:20:46] but there are others in that uptime category that didn't backlog, too
[16:21:15] paravoid: that's only the second reminder, I rarely do anything before someone's told me 3 times :)
[16:21:42] :P
[16:22:05] 10Traffic, 10netops, 06Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2693648 (10BBlack) Bump
[16:22:36] did you just bump yourself? :P
[16:23:04] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=x&vl=&x=600&n=&hreg[]=cp1074&mreg[]=varnish.MAIN.exp_.%2A&gtype=line&glegend=show&aggregate=1&embed=1&_=1475684207500
[16:23:17] ^ that's cp1074's mailbox lag issues since ~ esams drain
[16:23:38] I think it's mostly the pattern-shift
[16:24:03] under "normal" conditions hitrate is high. when we move regional traffic around there's a notable temporary miss spike, which means lots of nuking, hence exp_mailed skyrocketing, etc
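[ed: the exp_mailed - exp_received lag being eyeballed here can be spot-checked with a one-liner; a sketch against the Varnish 4 counter names visible in the ganglia regex above (the dstat_varnishstat plugin in the CR at 16:00 tracks the same thing continuously):]

```
# Print the current expiry mailbox lag (objects mailed to the expiry thread
# but not yet received/handled by it):
varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received \
  | awk '{v[$1]=$2} END {print "mailbox lag:", v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'
```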
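[ed: stepping back to the 3DES interstitial discussed earlier (T147199), a rough VCL sketch of the "replace /wiki/ pageviews at X% rate" idea. The X-Connection-Cipher header name is a placeholder for however the TLS terminator would forward the negotiated cipher, the 5% rate stands in for "X%", and the synthetic body would be the real upgrade-notice HTML:]

```
vcl 4.0;
import std;

sub vcl_recv {
    # X-Connection-Cipher is a placeholder name, not the real header.
    # DES-CBC3 is the OpenSSL cipher-name substring for 3DES suites.
    if (req.http.X-Connection-Cipher ~ "DES-CBC3" &&
        req.url ~ "^/wiki/" &&
        std.random(0, 100) < 5) {
        # 200 rather than an error status, so it reads as a page, not an outage
        return (synth(200, "OK"));
    }
}

sub vcl_synth {
    if (resp.status == 200 && resp.reason == "OK") {
        set resp.http.Content-Type = "text/html; charset=utf-8";
        synthetic("<!DOCTYPE html><html>...browser upgrade notice here...</html>");
        return (deliver);
    }
}
```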
[16:24:34] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2693678 (10Nuria) Working on doc https://docs.google.com/document/d/1jRGjVAthJXoCovxyvXWyg07R1POb8zvD_n8IlJXrPVM/edit# Will start addressing @ellery's la...
[16:25:01] paravoid: the self-bump is the 3rd reminder :)
[16:34:47] esams is repooled again
[16:53:34] 10Traffic, 06Operations, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693856 (10Joe)
[16:53:44] 10Traffic, 06Operations, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693868 (10Joe) p:05Triage>03High
[17:34:44] bblack: added f_exl and b_exl for frontend/backend expiry mailbox lag
[18:32:03] 10Traffic, 06Operations, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2694266 (10BBlack) We saw this again today. There was a bug in TimedMediaHandler causing a `500 Internal Server Error` only for (at least...
[19:06:53] 10Traffic, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2694407 (10BBlack) ^ The above turns out to be confusingly-named but unrelated. We still haven't quite figured out h...
[19:07:58] the big mailbox lags from the esams depool did resolve themselves back to zero btw
[19:08:21] either they would've done it eventually regardless as cache hitrate recovered with esams->eqiad for a long period, or they were able to after it moved away again
[19:12:13] 10Traffic, 06Operations, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2694435 (10BBlack) 05Open>03Resolved a:03BBlack So far there doesn't seem to be any recurrence of the stats anomaly when removing the workaround. Closing for no...
[23:29:01] ema: lvs + authdns are all updated (kernel + dist-upgrade) and rebooted