[07:32:43] netops, Operations: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3322265 (ayounsi)
[10:47:42] netops, Operations: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3322805 (ayounsi)
[11:15:23] moritzm: all LVSs upgraded to 8.8
[11:18:56] thanks
[11:26:49] netops, Operations, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322959 (ayounsi)
[11:42:41] netops, Operations, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3322995 (ayounsi)
[11:58:53] netops, Office-IT, Operations: Some BGP sessions to the SF Office down - https://phabricator.wikimedia.org/T167281#3323004 (ayounsi)
[12:18:33] <_joe_> ema: oh you restarted pybal everywhere?
[12:18:46] <_joe_> bummer, if I'd have known I had things to change in their config
[12:26:14] _joe_: no restarts, just package upgrades to jessie 8.8 point release
[12:26:36] <_joe_> ok
[12:58:32] netops, Operations: Rancid improvements - https://phabricator.wikimedia.org/T167288#3323402 (ayounsi)
[13:03:04] netops, Operations: Rancid improvements - https://phabricator.wikimedia.org/T167288#3323427 (faidon) Why not convert? I think there's a lot of value in doing so. Agreed on the rest. Moreover, it would be nice if we could filter the output and remove some of the known artifacts (cr2-ulsfo's TFEB -/+, LLD...
[13:08:46] <_joe_> ema: traffic alarms in -operations
[13:14:42] Traffic, netops, Operations, Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323439 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can...
[13:53:58] Traffic, netops, Operations, Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323647 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can...
[14:05:36] Traffic, netops, Operations, Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3323729 (BBlack)
[14:05:39] Traffic, netops, Operations: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323716 (BBlack)
[14:05:53] Traffic, netops, Operations, ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323731 (BBlack)
[14:11:42] Traffic, Operations: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (ema) We should evaluate [[ https://github.com/varnish/varnish-modules/blob/master/docs/vmod_vsthrottle.rst | vmod_vsthrottle ]], available in the `varnish-modules` package and see if it's...
[14:31:39] <_joe_> bblack, ema I was looking at https://phabricator.wikimedia.org/T167048 and I think the simpler thing would be to have service-checker running on each cache::text node for restbase and on any cache::upload node for maps
[14:34:01] _joe_: that only solves part of the problem though, and wouldn't have covered the example case with maps. actual external monitoring of public services (per-service) would.
[14:34:28] <_joe_> bblack: actually, service-checker would've
[14:34:49] <_joe_> it asks for the swagger spec to the application, and makes a series of requests
[14:34:58] <_joe_> it validates various endpoints
[14:35:03] oh you mean, run it against the public IPs, from icinga?
[14:35:07] <_joe_> yes
[14:35:16] I thought you meant, run it against the internal service, but running on the cache nodes
[14:35:23] <_joe_> or locally on the varnishes via nrpe
[14:35:27] <_joe_> which was my idea
[14:35:29] ok
[14:35:35] <_joe_> calling varnish itself
[14:35:48] well, nginx via https on the public IP would be best still
[14:35:50] <_joe_> well, nginx
[14:35:52] <_joe_> yeah
[14:35:53] <_joe_> :P
[14:35:56] we can screw things up in a lot of layers :)
[14:36:02] <_joe_> yup
[14:36:16] <_joe_> I was thinking about that exactly
[14:36:33] that's still "internal" monitoring though to be picky ;)
[14:37:46] <_joe_> yeah they want that
[14:37:52] <_joe_> when they say external
[14:37:58] <_joe_> they mean varnish-level
[14:38:28] <_joe_> it's ok for me to play with it on cp1008?
[14:40:53] Traffic, Analytics, Analytics-Cluster, Operations, User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3323871 (Ottomata) We should do some work to understand how ACLs work and what ACLs for what topics we should set in production.
[14:42:43] _joe_: yes
[14:50:38] Traffic, Operations: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3323972 (BBlack) re: vsthrottle, my thoughts after a quick look this morning: 1. The burst issue seems fine. It initializes fresh buckets with full capacity. 2. Memory leaks - it seems to have pr...
[14:52:20] <_joe_> servicechecker.CheckError: Invalid certificate
[14:52:27] <_joe_> damn https :P
[14:52:36] use the right hostname :P
[14:52:53] <_joe_> I did!
[14:53:00] <_joe_> I think :P
[14:53:03] netops, Operations: ospf link-protection - https://phabricator.wikimedia.org/T167306#3323984 (ayounsi)
[14:53:47] <_joe_> and I didn't!
[14:54:01] <_joe_> oh right I'm dubm
[14:54:04] <_joe_> *dumb
[14:54:58] <_joe_> ok, the error is different, it's pretty clear I am not good at mocking urls :P
[14:55:12] I think you just can't type! :P
[15:00:55] <_joe_> no, I think there is a problem with my code
[15:01:19] <_joe_> bereq.http.host is the value of the HTTP Host: header, right?
[15:18:58] _joe_: yes, in certain contexts
[15:19:07] <_joe_> sigh, urllib3 is such a bag of shit
[15:19:19] (depends which VCL sub you're using it from, in some it's invalid)
[15:21:08] <_joe_> bblack: basically I need to add 'assert_hostname' to the poolmanager instantiation
[15:21:19] <_joe_> sigh
[15:21:52] _joe_: http://book.varnish-software.com/4.0/chapters/VCL_Basics.html#variables-in-vcl-subroutines
[15:22:43] <_joe_> yeah ema I fixed the python code
[15:22:49] <_joe_> I didn't believe it was so asinine
[15:22:58] <_joe_> well turns out it is!
[15:26:54] bblack: these maps nodes have puppet disabled for a week now, what's the plan for them?
[15:27:45] paravoid: to move them to spares and/or decom, depending on the DC (decom in ulsfo+esams, spare them out for future experimental use in eqiad+codfw)
[15:27:59] then let's do that at least in puppet?
[15:28:24] systems not running puppet for extended timeframes is really non-ideal
[15:28:30] which was planned for monday, but: https://phabricator.wikimedia.org/T167046
[15:29:01] yeah I know it's not ideal. in the state they're in, if we re-enable puppet their config is broken and the run will fail. But it leaves them in an easy state to revert from.
[15:29:21] there's already a commit stage to kill all their puppet config and move them to the spare role
[15:29:25] *staged
[15:29:30] if we remove the roles from site.pp and have them run puppet, it will probably not touch the varnish stuff, no?
[15:29:41] so you'll get the same effect, just with base roles applied
[15:29:46] well I guess firewall rules would be pruned
[15:30:22] anyways, I don't think that ticket should block anymore, but it just seemed rude to proceed when the ticket was initially opened
[15:30:29] (and thus kill easy ability to revert)
[15:30:32] nod
[15:30:39] makes sense
[15:31:16] I can't reproduce that task either fwiw
[15:31:24] not much I'm sure :)
[15:32:12] Traffic, Operations: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3324322 (BBlack) Stared at hashtable implementation some more, as well as the linux iptables hashlimit one (which I consider a sort of baseline canonical efficient implementation). The linux one i...
[15:45:24] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324396 (Jdforrester-WMF)
[15:55:13] HTTPS, Traffic, MW-1.30-release-notes, Operations, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3324427 (fgiunchedi)
[16:13:30] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324466 (Anomie)
[16:13:48] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (Anomie)
[16:14:30] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (Anomie)
[16:20:17] netops, Operations: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3324498 (Papaul) @ayounsi Yes I can be available
[16:23:41] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324520 (Jdlrobson)
[16:25:25] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (Jdlrobson) duplicate>Open Sorry phabricator fail.
[16:25:34] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324533 (Jdlrobson)
[16:25:55] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (Jdlrobson)
[16:35:09] HTTPS, Traffic, MW-1.30-release-notes, Operations, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3324616 (aaron) Deploy was 00:18 May 26 UTC, and {F8400067} I can't discern an effect on api upload entry point runtime.
[16:42:02] Traffic, Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (RobH) warranty for lvs3001 ended on May 08, 2015
[17:01:56] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3324750 (Papaul)
[17:05:45] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3324778 (Papaul)
[17:12:39] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw: labtestneutron2002 sswitch port configuration - https://phabricator.wikimedia.org/T167326#3324843 (Papaul)
[17:16:31] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3324871 (Jdforrester-WMF)
[17:16:43] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (Jdforrester-WMF)
[18:15:47] Traffic, Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3325171 (faidon)
[20:06:06] netops, Labs, Labs-Infrastructure, Operations, ops-codfw: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3330189 (RobH) Open>Resolved a:RobH Done!
[20:36:32] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3330316 (Anomie)
[21:35:24] netops, Labs, Operations: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#3330525 (chasemp)
[22:24:46] Traffic, Operations, Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3004.esams.wmnet', 'cp3005.esams.wmnet', 'cp3006.e...
[23:03:48] Traffic, Operations, Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3330724 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3004.esams.wmnet', 'cp3005.esams.wmnet', 'cp4011.u...