[00:20:40] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10RobH) p:05Triage>03Normal
[00:21:10] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908288 (10RobH) The port mappings for onsite use on the existing mr1-eqiad (and likely to match on new device unless @ayounsi advises otherwise): ge-0/0/0 Core: msw1-eqiad:ge-0/0/32 ge-0/0/1 Core: asw-a-eqi...
[10:19:05] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3908967 (10elukey) @faidon whenever you have time do you mind to explain a bit what data is currently pushed to the netflow...
[10:19:22] ok so cp3034 (upload) upgraded to varnish 5 and repooled at 9:40ish
[10:19:59] it looks OK functionally speaking
[10:21:08] resource usage increased, possibly due to the different chashing used by v5 (shard vs vslp)
[10:21:14] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3034&var-datasource=esams%20prometheus%2Fops&from=1516267211136&to=1516269510663
[10:22:02] that also caused resource and mbox lag increase on other upload-esams hosts, namely cp3035, cp3037, cp3038 and cp3046
[10:22:35] (as expected)
[10:33:30] the hosts mentioned above have recovered nicely from the very different pattern of requests coming from cp3034, see eg cp3035
[10:33:33] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1516268262925&to=1516270494943&var-server=cp3035&var-datasource=esams%20prometheus%2Fops
[10:35:59] now, it would probably be wise to move quickly through upload-esams upgrades to reduce the impact of shard vs vslp chashing, but I'd first like to keep only cp3034 on v5 and make sure it behaves
[10:40:20] 10Traffic, 10Operations, 10Page Content Service, 10RESTBase, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3909134 (10mobrovac) 05Open>03Resolved a:03Pchelolo It seems @Pchelolo's normalisation fix did the tric...
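(Editor's note: for context on the "shard vs vslp" chashing mentioned at 10:21, below is a minimal, illustrative VCL sketch of the Varnish 5 shard director, which replaces the out-of-tree vslp vmod used with Varnish 4. Backend names, addresses and the hashing key are placeholders/assumptions; the real directors.frontend.vcl is generated by puppet and enumerates the full set of cluster backends.)

    vcl 4.0;
    import directors;

    # placeholder backends; the production file lists every cache host
    backend be_cp3035 { .host = "10.20.0.165"; .port = "3128"; }
    backend be_cp3036 { .host = "10.20.0.166"; .port = "3128"; }

    sub vcl_init {
        # Varnish 4 used the vslp vmod here: import vslp; new cluster = vslp.vslp(); ...
        new cluster = directors.shard();
        cluster.add_backend(be_cp3035);
        cluster.add_backend(be_cp3036);
        cluster.reconfigure();        # (re)build the consistent-hash ring
    }

    sub vcl_recv {
        # consistent hashing on the request URL, so a given object always
        # maps to the same backend regardless of which frontend handled it
        set req.backend_hint = cluster.backend(by=URL);
    }

(Both directors do consistent hashing, but the log above attributes the changed resource usage to differences between the two implementations; when a host flips from vslp to shard, its URL-to-backend mapping shifts and the other backends see a different request pattern for a while.)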
[10:45:18] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3909156 (10Krinkle)
[10:45:49] !log cp3034: restart varnishxcps and varnishmedia, they were both using 100% of a cpu core
[10:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:06] requests from cp3034 frontend are spread among the other esams upload backends in a relatively even way, so I'm not entirely sure why (for example) cp3046 is mbox lagged and cp3045 is not
[11:08:15] ah, probably because varnish-be has been running for ~4 days on the former and ~21 hours on the latter
[11:10:12] the fact that cached backend objects keep on increasing steadily on cp3046 (and other mbox lagged hosts) while they're not increasing in such a steep way on non-mbox lagged ones such as cp3045 is also interesting
[11:11:55] let's see what happens by restarting varnish-be on cp3046
[11:12:13] !log cp3046: restart varnish-be due to mbox lag
[11:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:10] cached objects keep on increasing because evictions can't happen on lagged hosts
[11:26:21] yeah so perhaps it's a good idea to step through the upgrades in decreasing varnish-be runtime order
[11:26:47] start with the host whose varnish-be has been running for the longest time, that is
[12:04:13] luckily, with the change we made a while back to make cache->cache fetches only-if-cached, the shard/vslp flip doesn't churn the storage of the unconverted DCs
[12:04:56] yup!
[12:19:30] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909349 (10Deskana) I don't want deployment of TemplateStyles to be a moving target—as it has been for many months—so I'm going to stick...
[13:23:34] nice, cp3038 managed to recover on its own
[13:32:15] bblack: I'd upgrade cp3037 soon, if you agree
[13:41:56] ema: yeah I'd move through as quickly as you're comfortable at this point. But might want to keep an eye on anything scary with the transport link
[13:47:55] bblack: would that be xe-0/1/3? https://librenms.wikimedia.org/device/device=145/tab=port/port=11521/
[13:48:24] cr2-esams xe-0/1/3 I believe, yes
[13:48:28] https://librenms.wikimedia.org/graphs/to=1516283100/id=11521/type=port_bits/from=1516261500/
[13:48:44] which looks fine so far, it's a 10G wave
[13:49:09] awesome, carrying on with the upgrades then
[14:00:16] mmh interesting
[14:00:19] Jan 18 13:57:02 cp3037 varnish-frontend[41987]: Symbol not found: 'vslp.vslp' at ('/etc/varnish/directors.frontend.vcl' Line 2 Pos 19)
[14:00:20] any indication yet whether varnish 5 fixes the mbox lag stuff?
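(Editor's note on the 12:04 exchange: "only-if-cached" is a standard Cache-Control request directive (RFC 7234) telling the next hop to answer only from its cache, with a 504 otherwise, instead of fetching and storing the object. The fragments below are only a sketch of that general mechanism, assuming a setter on the requesting cache and an honouring check on the receiving cache; the actual Wikimedia VCL, including how a 504 from the next tier is then handled, is more involved and not shown.)

    # requesting cache: mark fetches towards the next cache tier
    sub vcl_backend_fetch {
        set bereq.http.Cache-Control = "only-if-cached";
    }

    # receiving cache: honour the directive instead of fetching on miss
    sub vcl_miss {
        if (req.http.Cache-Control ~ "only-if-cached") {
            return (synth(504, "Not in cache"));
        }
    }

(Per the remark above, this is why the shard/vslp flip in esams does not push new objects into the still-unconverted DCs: a remapped request that misses there gets a 504 rather than being fetched into that host's storage.)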
[14:00:38] '/etc/varnish/directors.frontend.vcl' looks good though, it does use shard and not vslp
[14:04:58] mark: not yet, nope
[14:09:18] there seems to be some kind of race when re-enabling puppet; restarting varnish-frontend/backend after the first run does the trick apparently
[14:09:48] cp3037 looks good, repooling
[14:25:20] upgrading cp3035 now
[14:52:17] transport link usage went up to ~3.3G https://librenms.wikimedia.org/graphs/to=1516286700/id=11521/type=port_bits/from=1516200300/
[14:52:25] waiting a bit for caches to refill now
[14:53:28] 4/12 hosts upgraded
[15:21:05] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&var-cluster=upload&var-site=esams&from=now-6h&to=now
[15:21:30] ^ is good for judging refills in some sense as well, but I wouldn't wait for full hitrate recovery, either.
[15:27:14] ok I'm bored of flipping hiera switches for each host :)
[15:27:27] I'll go with hieradata/role/esams/cache/upload.yaml (puppet disabled on v4 hosts)
[15:49:08] !log cache_upload: upgrade cp3048 to varnish 5
[15:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:21] !log cache_upload: repool cp3048 (varnish 5)
[15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:18] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3909954 (10Jgreen)
[15:55:27] 6/12 upgraded https://grafana.wikimedia.org/dashboard/db/cache-hosts-software-versions?panelId=2&fullscreen&orgId=1&var-cache_type=upload&from=now-6h&to=now
[15:59:14] 10HTTPS, 10Parsoid, 10VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3909962 (10Deskana) p:05Triage>03Low
[16:00:57] min hitrate 88.7%, currently back up to 91%, nominal is somewhere around ~94.x%
[16:01:40] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909975 (10Anomie) >>! In T133410#3907781, @Iniquity wrote: > @Isarra we are waiting for T180817 this task. Regarding that, see T180817#...
[16:01:54] the effects per-machine should reduce as you get over the halfway hurdle though, as shard from fe->be will increasingly line up right
[16:17:21] FYI on pybal-test2001 OOM killer killed pybal and puppet failed with Cannot allocate memory - fork(2)
[16:17:43] noticed just because I'm fixing random puppet around so looked at it in case was related ;)
[16:36:26] volans: thanks, looking
[16:59:36] XioNoX: So did we want to do ulfo the post all hands week?
[16:59:49] and if so, i'll go ahead and update the task and we can email ops list for fyi
[17:00:18] I'm in the city on Wednesdays anyhow. so how about Feb 1st?
[17:00:26] ulsfo even
[17:00:37] bblack: ^ talking about offlining ulsfo for a day and swapping its switches out to the new ones
[17:00:55] figured in here was best since it affects traffic ;]
[17:00:58] robh: wednesday is the 31st
[17:01:05] oh you are right, my bad
[17:01:10] 2018-01-31
[17:01:57] I don't expect it to take more than a few hours at most.
[17:02:12] the physical stuff should be about an hour
[17:02:30] robh: 31st works for me
[17:03:01] huh, i cannot find an actual task for the replacement onsite work, so i'll make one
[17:05:51] wfm
[17:06:32] hrmm, when these systems lose network
[17:06:38] wondering if its ok to just let them sit without it (ideal)
[17:06:44] or if we need to do more
[17:06:51] (i mean we'll maint mode them all in icinga)
[17:08:31] if they're losing network for significant time, we'll have to wipe caches afterwards before (carefully!) repooling
[17:08:39] because they'll lose their invalidation traffic and have stale content
[17:11:39] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910184 (10RobH) p:05Triage>03High
[17:12:05] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910199 (10RobH) Since this involves #traffic as well as #netops, this plan should get @bblack's review/approval.
[17:12:32] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3910200 (10ayounsi)
[17:13:04] yeah im not sure how long it'll take, and then how long to reconfigure the new switches to make them work
[17:13:16] i assumed the faster the better due to invalidation issues
[17:13:35] anything over an hour is significant?
[17:14:13] 10netops, 10Operations, 10ops-eqiad: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10ayounsi) Port mapping is correct. Note that the "fe-" ports will be renamed "ge-", but their physical location is unchanged.
[17:14:17] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910209 (10RobH)
[17:18:20] robh: depends who you ask I guess :)
[17:18:53] probably anything over a few minutes, we should do *something*
[17:18:59] if it's hours, probably wipe
[17:19:21] if it's <=1h, we might be able to get away with bringing them up as-is and then using some ban operation to cleanup afterwards asynchronously
[17:19:55] if it's <=10m, we might do nothing and just react to any individual report of a stale file if warranted.
[17:20:43] (text is more sensitive to this stuff than upload, and less painful to wipe, so we might call it differently on different clusters, too)
[17:20:59] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10Volans) I've run clean + deactivate for cp4018 as part of cleanup of stale puppet certs.
[17:21:39] robh: is it possible to rack the switches ahead of time? As well as connect their mgmt/console ports?
[17:27:21] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910263 (10ayounsi)
[17:51:30] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3910404 (10Imarlier) We should look at the varnish logs to see if we can find other pages that have a similar behavior. Is this part of T164248
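(Editor's note on the invalidation discussion above: the "invalidation traffic" a disconnected cache would miss is the stream of HTTP PURGE requests that removes updated objects, so a host cut off from it keeps serving stale content, hence the wipe/ban options weighed at 17:18-17:20. The sketch below is only the generic VCL shape for accepting such purges, with an illustrative ACL and placeholder backend; the production purging pipeline is more involved and not shown. The "ban operation" mentioned at 17:19 would be a lazy invalidation issued via varnishadm rather than an immediate storage wipe.)

    vcl 4.0;

    backend default { .host = "127.0.0.1"; .port = "8080"; }   # placeholder

    # illustrative ACL; the real list of allowed purgers differs
    acl purge_allowed {
        "localhost";
        "127.0.0.1";
    }

    sub vcl_recv {
        if (req.method == "PURGE") {
            if (client.ip !~ purge_allowed) {
                return (synth(405, "PURGE not allowed"));
            }
            # invalidate the cached object for this URL/host, if present
            return (purge);
        }
    }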
[18:33:19] 10Traffic, 10Analytics-Kanban, 10Operations, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910563 (10fdans)
[18:37:36] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910569 (10fdans)
[18:52:46] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3910670 (10elukey) Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the ne...
[19:07:35] 10Traffic, 10Operations, 10Puppet: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910723 (10ema)
[19:27:14] upload-esams upgraded \o/
[19:28:07] \o/
[20:21:52] uhmm.. could i ask for a purging of a URL on misc-web if i ask really nice.. cough... https://annual.wikimedia.org would like to be purged to get the new redirect to /2017 instead of /2016
[20:26:22] mutante it does 2017 for me :)
[20:26:50] paladox: :))! even better
[20:26:55] :)
[20:27:14] well, i guess nevermind guys
[20:27:33] as long as the comms people dont say anything it'll be ok