[00:20:40] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10RobH) p:05Triage>03Normal
[00:21:10] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908288 (10RobH) The port mappings for onsite use on the existing mr1-eqiad (and likely to match on new device unless @ayounsi advises otherwise): ge-0/0/0 Core: msw1-eqiad:ge-0/0/32 ge-0/0/1 Core: asw-a-eqi...
[10:19:05] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3908967 (10elukey) @faidon whenever you have time do you mind to explain a bit what data is currently pushed to the netflow...
[10:19:22] ok so cp3034 (upload) upgraded to varnish 5 and repooled at 9:40ish
[10:19:59] it looks OK functionally speaking
[10:21:08] resource usage increased, possibly due to the different chashing used by v5 (shard vs vslp)
[10:21:14] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3034&var-datasource=esams%20prometheus%2Fops&from=1516267211136&to=1516269510663
[10:22:02] that also caused resource and mbox lag increase on other upload-esams hosts, namely cp3035, cp3037, cp3038 and cp3046
[10:22:35] (as expected)
[10:33:30] the hosts mentioned above have recovered nicely from the very different pattern of requests coming from cp3034, see eg cp3035
[10:33:33] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1516268262925&to=1516270494943&var-server=cp3035&var-datasource=esams%20prometheus%2Fops
[10:35:59] now, it would probably be wise to move quickly through upload-esams upgrades to reduce the impact of shard vs vslp chashing, but I'd first like to keep only cp3034 on v5 and make sure it behaves
[10:40:20] 10Traffic, 10Operations, 10Page Content Service, 10RESTBase, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3909134 (10mobrovac) 05Open>03Resolved a:03Pchelolo It seems @Pchelolo's normalisation fix did the tric...
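(Editor's note: for context on the "shard vs vslp" chashing mentioned at 10:21, below is a minimal, illustrative VCL sketch of the Varnish 5 shard director, which replaces the out-of-tree vslp vmod used with Varnish 4. Backend names, addresses and the hashing key are placeholders/assumptions; the real directors.frontend.vcl is generated by puppet and enumerates the full set of cluster backends.)

    vcl 4.0;
    import directors;

    # placeholder backends; the production file lists every cache host
    backend be_cp3035 { .host = "10.20.0.165"; .port = "3128"; }
    backend be_cp3036 { .host = "10.20.0.166"; .port = "3128"; }

    sub vcl_init {
        # Varnish 4 used the vslp vmod here: import vslp; new cluster = vslp.vslp(); ...
        new cluster = directors.shard();
        cluster.add_backend(be_cp3035);
        cluster.add_backend(be_cp3036);
        cluster.reconfigure();        # (re)build the consistent-hash ring
    }

    sub vcl_recv {
        # consistent hashing on the request URL, so a given object always
        # maps to the same backend regardless of which frontend handled it
        set req.backend_hint = cluster.backend(by=URL);
    }

(Both directors do consistent hashing, but the log above attributes the changed resource usage to differences between the two implementations; when a host flips from vslp to shard, its URL-to-backend mapping shifts and the other backends see a different request pattern for a while.)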
[10:45:18] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3909156 (10Krinkle)
[10:45:49] !log cp3034: restart varnishxcps and varnishmedia, they were both using 100% of a cpu core
[10:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:06] requests from cp3034 frontend are spread among the other esams upload backends in a relatively even way, so I'm not entirely sure why (for example) cp3046 is mbox lagged and cp3045 is not
[11:08:15] ah, probably because varnish-be has been running for ~4 days on the former and ~21 hours on the latter
[11:10:12] the fact that cached backend objects keep on increasing steadily on cp3046 (and other mbox lagged hosts) while they're not increasing in such a steep way on non-mbox lagged ones such as cp3045 is also interesting
[11:11:55] let's see what happens by restarting varnish-be on cp3046
[11:12:13] !log cp3046: restart varnish-be due to mbox lag
[11:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:10] cached objects keep on increasing because evictions can't happen on lagged hosts
[11:26:21] yeah so perhaps it's a good idea to step through the upgrades in decreasing varnish-be runtime order
[11:26:47] start with the host whose varnish-be has been running for the longest time, that is
[12:04:13] luckily, with the change we made a while back to make cache->cache fetches only-if-cached, the shard/vslp flip doesn't churn the storage of the unconverted DCs
[12:04:56] yup!
[12:19:30] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909349 (10Deskana) I don't want deployment of TemplateStyles to be a moving target—as it has been for many months—so I'm going to stick...
[13:23:34] nice, cp3038 managed to recover on its own
[13:32:15] bblack: I'd upgrade cp3037 soon, if you agree
[13:41:56] ema: yeah I'd move through as quickly as you're comfortable at this point. But might want to keep an eye on anything scary with the transport link
[13:47:55] bblack: would that be xe-0/1/3? https://librenms.wikimedia.org/device/device=145/tab=port/port=11521/
[13:48:24] cr2-esams xe-0/1/3 I believe, yes
[13:48:28] https://librenms.wikimedia.org/graphs/to=1516283100/id=11521/type=port_bits/from=1516261500/
[13:48:44] which looks fine so far, it's a 10G wave
[13:49:09] awesome, carrying on with the upgrades then
[14:00:16] mmh interesting
[14:00:19] Jan 18 13:57:02 cp3037 varnish-frontend[41987]: Symbol not found: 'vslp.vslp' at ('/etc/varnish/directors.frontend.vcl' Line 2 Pos 19)
[14:00:20] any indication yet whether varnish 5 fixes the mbox lag stuff?
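(Editor's note on the 12:04 exchange: "only-if-cached" is a standard Cache-Control request directive (RFC 7234) telling the next hop to answer only from its cache, with a 504 otherwise, instead of fetching and storing the object. The fragments below are only a sketch of that general mechanism, assuming a setter on the requesting cache and an honouring check on the receiving cache; the actual Wikimedia VCL, including how a 504 from the next tier is then handled, is more involved and not shown.)

    # requesting cache: mark fetches towards the next cache tier
    sub vcl_backend_fetch {
        set bereq.http.Cache-Control = "only-if-cached";
    }

    # receiving cache: honour the directive instead of fetching on miss
    sub vcl_miss {
        if (req.http.Cache-Control ~ "only-if-cached") {
            return (synth(504, "Not in cache"));
        }
    }

(Per the remark above, this is why the shard/vslp flip in esams does not push new objects into the still-unconverted DCs: a remapped request that misses there gets a 504 rather than being fetched into that host's storage.)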
[14:00:38] '/etc/varnish/directors.frontend.vcl' looks good though, it does use shard and not vslp
[14:04:58] mark: not yet, nope
[14:09:18] there seems to be some kind of race when re-enabling puppet; restarting varnish-frontend/backend after the first run does the trick apparently
[14:09:48] cp3037 looks good, repooling
[14:25:20] upgrading cp3035 now
[14:52:17] transport link usage went up to ~3.3G https://librenms.wikimedia.org/graphs/to=1516286700/id=11521/type=port_bits/from=1516200300/
[14:52:25] waiting a bit for caches to refill now
[14:53:28] 4/12 hosts upgraded
[15:21:05] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&var-cluster=upload&var-site=esams&from=now-6h&to=now
[15:21:30] ^ is good for judging refills in some sense as well, but I wouldn't wait for full hitrate recovery, either.
[15:27:14] ok I'm bored of flipping hiera switches for each host :)
[15:27:27] I'll go with hieradata/role/esams/cache/upload.yaml (puppet disabled on v4 hosts)
[15:49:08] !log cache_upload: upgrade cp3048 to varnish 5
[15:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:21] !log cache_upload: repool cp3048 (varnish 5)
[15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:18] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3909954 (10Jgreen)
[15:55:27] 6/12 upgraded https://grafana.wikimedia.org/dashboard/db/cache-hosts-software-versions?panelId=2&fullscreen&orgId=1&var-cache_type=upload&from=now-6h&to=now
[15:59:14] 10HTTPS, 10Parsoid, 10VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3909962 (10Deskana) p:05Triage>03Low
[16:00:57] min hitrate 88.7%, currently back up to 91%, nominal is somewhere around ~94.x%
[16:01:40] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909975 (10Anomie) >>! In T133410#3907781, @Iniquity wrote: > @Isarra we are waiting for T180817 this task. Regarding that, see T180817#...
[16:01:54] the effects per-machine should reduce as you get over the halfway hurdle though, as shard from fe->be will increasingly line up right
[16:17:21] FYI on pybal-test2001 OOM killer killed pybal and puppet failed with Cannot allocate memory - fork(2)
[16:17:43] noticed just because I'm fixing random puppet around so looked at it in case was related ;)
[16:36:26] volans: thanks, looking
[16:59:36] XioNoX: So did we want to do ulfo the post all hands week?
[16:59:49] and if so, i'll go ahead and update the task and we can email ops list for fyi
[17:00:18] I'm in the city on Wednesdays anyhow. so how about Feb 1st?
[17:00:26] ulsfo even
[17:00:37] bblack: ^ talking about offlining ulsfo for a day and swapping its switches out to the new ones
[17:00:55] figured in here was best since it affects traffic ;]
[17:00:58] robh: wednesday is the 31st
[17:01:05] oh you are right, my bad
[17:01:10] 2018-01-31
[17:01:57] I don't expect it to take more than a few hours at most.
[17:02:12] the physical stuff should be about an hour
[17:02:30] robh: 31st works for me
[17:03:01] huh, i cannot find an actual task for the replacement onsite work, so i'll make one
[17:05:51] wfm
[17:06:32] hrmm, when these systems lose network
[17:06:38] wondering if its ok to just let them sit without it (ideal)
[17:06:44] or if we need to do more
[17:06:51] (i mean we'll maint mode them all in icinga)
[17:08:31] if they're losing network for significant time, we'll have to wipe caches afterwards before (carefully!) repooling
[17:08:39] because they'll lose their invalidation traffic and have stale content
[17:11:39] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910184 (10RobH) p:05Triage>03High
[17:12:05] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910199 (10RobH) Since this involves #traffic as well as #netops, this plan should get @bblack's review/approval.
[17:12:32] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3910200 (10ayounsi)
[17:13:04] yeah im not sure how long it'll take, and then how long to reconfigure the new switches to make them work
[17:13:16] i assumed the faster the better due to invalidation issues
[17:13:35] anything over an hour is significant?
[17:14:13] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10ayounsi) Port mapping is correct. Note that the "fe-" ports will be renamed "ge-", but their physical location is unchanged.
[17:14:17] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910209 (10RobH)
[17:18:20] robh: depends who you ask I guess :)
[17:18:53] probably anything over a few minutes, we should do *something*
[17:18:59] if it's hours, probably wipe
[17:19:21] if it's <=1h, we might be able to get away with bringing them up as-is and then using some ban operation to cleanup afterwards asynchronously
[17:19:55] if it's <=10m, we might do nothing and just react to any individual report of a stale file if warranted.
[17:20:43] (text is more sensitive to this stuff than upload, and less painful to wipe, so we might call it differently on different clusters, too)
[17:20:59] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10Volans) I've run clean + deactivate for cp4018 as part of cleanup of stale puppet certs.
[17:21:39] robh: is it possible to rack the switches ahead of time? As well as connect their mgmt/console ports?
[17:27:21] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910263 (10ayounsi)
[17:51:30] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3910404 (10Imarlier) We should look at the varnish logs to see if we can find other pages that have a similar behavior. Is this part of T164248
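(Editor's note on the invalidation discussion above: the "invalidation traffic" a disconnected cache would miss is the stream of HTTP PURGE requests that removes updated objects, so a host cut off from it keeps serving stale content, hence the wipe/ban options weighed at 17:18-17:20. The sketch below is only the generic VCL shape for accepting such purges, with an illustrative ACL and placeholder backend; the production purging pipeline is more involved and not shown. The "ban operation" mentioned at 17:19 would be a lazy invalidation issued via varnishadm rather than an immediate storage wipe.)

    vcl 4.0;

    backend default { .host = "127.0.0.1"; .port = "8080"; }   # placeholder

    # illustrative ACL; the real list of allowed purgers differs
    acl purge_allowed {
        "localhost";
        "127.0.0.1";
    }

    sub vcl_recv {
        if (req.method == "PURGE") {
            if (client.ip !~ purge_allowed) {
                return (synth(405, "PURGE not allowed"));
            }
            # invalidate the cached object for this URL/host, if present
            return (purge);
        }
    }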
[18:33:19] 10Traffic, 10Analytics-Kanban, 10Operations, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910563 (10fdans)
[18:37:36] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910569 (10fdans)
[18:52:46] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3910670 (10elukey) Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the ne...
[19:07:35] 10Traffic, 10Operations, 10Puppet: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910723 (10ema)
[19:27:14] upload-esams upgraded \o/
[19:28:07] \o/
[20:21:52] uhmm.. could i ask for a purging of a URL on misc-web if i ask really nice.. cough... https://annual.wikimedia.org would like to be purged to get the new redirect to /2017 instead of /2016
[20:26:22] mutante it does 2017 for me :)
[20:26:50] paladox: :))! even better
[20:26:55] :)
[20:27:14] well, i guess nevermind guys
[20:27:33] as long as the comms people dont say anything it'll be ok