[08:09:58] Traffic, Operations, Performance-Team, Patch-For-Review, Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894#4227676 (Gilles)
[09:40:06] Traffic, netops, Operations: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4228012 (ayounsi) I looked into `charon.plugins.kernel-netlink.mtu` but from what I read it is only applied to routes added by ipsec in tunnel mode, while we use it in transport (transparent) mode....
[10:17:59] netops, Operations: update switch port label from naos.codfw.wmnet to deploy2001.codfw.wmnet - https://phabricator.wikimedia.org/T195422#4228178 (ayounsi) Open>Resolved a:ayounsi ```lang=diff [edit interfaces ge-5/0/15] - description naos; + description deploy2001; ```
[11:13:29] Traffic, netops, Operations: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4228312 (BBlack) >>! In T195365#4228012, @ayounsi wrote: > Raising the MTU above standard everywhere is indeed another can of worms and out of scope here. > With careful testing, raising it on so...
[11:27:22] netops, Cloud-Services, Operations: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4228340 (ayounsi) https://apps.db.ripe.net/db-web-ui/#/lookup?source=ripe&key=185.15.56.0%2F24AS14907&type=route created. IPv6 is tracked in T187929 and can indeed wait....
[11:58:57] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4228437 (Krinkle)
[12:07:31] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4228453 (Krinkle) @ema ResourceLoader dashboards in Grafana have been updated to use Prometheus for all Varnish metrics. The varnishrls daemon for Grap...
[14:03:50] if you don't have any blocker I would like to merge the net_driver change ( https://gerrit.wikimedia.org/r/#/c/434032/ )
[14:04:34] from the tests done it should be safe, but I can also disable puppet across all hosts that use it before merging, as you prefer
[14:05:32] bblack ^^
[14:41:47] vgutierrez: I'm ok with it, but I'd note anytime you push new fact-code, there's additional risk because the new fact code is going to run on every node in our whole prod environment, and probably 1 or a few will fail because of some unpredicted condition.
[14:42:26] vgutierrez: what I've done in the past to test that scenario easily, is copy out the core of the ruby code in the fact, but just as a separate little ruby function/script that prints outputs, instead of the puppet modules + facts.
[14:42:42] vgutierrez: (and then push that ruby snippet to '*' with cumin and see if it breaks or does anything crazy or crashes)
[14:43:38] vgutierrez: unless you're pretty highly-confident that there's no scenario out there that the ruby code doesn't account for (e.g. some previously-unseen values in sysfs)
[14:43:43] bblack: already done, it doesn't break, just on trustys returns {}, like it is now
[14:43:49] ok!
[14:44:11] because the /device/driver/module file is not there
[14:44:30] uh, virtual +1 because I'm logged out of gerrit and it would take a few minutes to get that going again
[14:44:46] (yet another sign of manager disease, I've been productive all morning yet not needed my gerrit login :P)
[14:45:13] I'm pretty confident on the fact side across the fleet, the only worry might be on the traffic hosts that actually use it.
[14:45:15] rotfl
[14:45:50] but from the tests the other day it seems that puppet does the right thing™ and pulls the new fact code before exporting the facts, hence before compiling the catalog
[14:46:06] so it can all be done in one go
[14:46:08] oh right, the race-condition bit
[14:46:15] yeah, sometimes that fails
[14:46:19] (or it has, in the past)
[14:47:09] I think my lame attempt to work around it before was: disable puppet on the affected hosts, merge change, wait ~5 minutes just to be sure some puppetmaster disk cache isn't stale or whatever. then cumin out the re-enable + 2x puppet runs in a row, so it will still probably-fix itself on the second run if it has to.
[14:47:26] but who knows, maybe recent puppet-layer upgrades have fixed some of that
[14:48:02] if we want to be on the safe side I'll disable puppet across affected hosts
[14:48:43] it shouldn't be that awful, it will just fail on trying to dereference a hash that's really just a string and bomb the agent
[14:48:48] depends how dangerous it might be if the interface::ring is reduced to 2040 from 4078, see modules/profile/manifests/lvs/interface_tweaks.pp
[14:48:48] (vs actually changing NIC settings)
[14:49:41] no that's the tricky part, because we're moving from {'key': 'value'} to {'key': {'key2': 'value'}}, it will not fail, just be false
[14:49:49] oh
[14:49:54] should be very careful then
[14:50:18] if the interface::ring actually changes, it will blip the ethernet link state and cause a traffic spike
[14:50:20] $facts['net_driver'][$interface] == 'bnx2x' will still be valid puppet code
[14:50:37] ok then, I'll disable puppet around
[14:51:35] although
[14:51:42] you could also protect it a different way, I guess
[14:52:29] ... sec
[14:53:08] * volans standing by
[14:53:17] yeah so the only "dangerous" one is the interface::ring in https://gerrit.wikimedia.org/r/#/c/434032/5/modules/profile/manifests/lvs/interface_tweaks.pp
[14:53:42] but that's because the "} else {" is for the code that's supposed to run on bnx2, which is the only other type of interface we have on lvs* presently (just double-checked)
[14:53:46] yeah, the other one even if false should be already applied to all hosts
[14:53:55] so, replace the else with if driver == 'bnx2'
[14:54:01] and then neither clause would run if this fails
[14:54:05] (so no interruption)
[14:55:03] it's more-sane that way anyways, vs implicitly assuming that LVS which isn't bnx2x is bnx2
[14:55:17] agree, changing it
[14:58:20] done
[14:58:21] virtual re- +1 :)
[14:58:29] lol
[14:58:44] no hurry, I can wait if you want to have a closer look
[14:59:17] no I looked, looks fine to me assuming no stupid basic syntax errors
[14:59:28] (I'm not a human compiler, although sometimes I wish I was!)
[15:00:15] eheheh, you mostly are :-P
[15:01:01] so no need to disable puppet around?
[15:02:15] probably not, unless you just want to avoid potential puppetfail minor spam from the facts-vs-manifests race
[15:02:22] but I don't think it can knock out traffic or anything anymore
[15:03:20] ack
[15:05:35] proceeding then
[15:12:18] testing a few hosts so far it seems a noop as expected
[15:42:27] Traffic, Analytics, Operations, Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229008 (TheDJ) Shall we consider this closed then ?
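The guard agreed on in the exchange (replace the bare "else" with an explicit bnx2 check, so that neither clause runs when the fact has an unexpected shape) can be sketched outside Puppet. This is a minimal Python illustration, not the actual manifest: the function names and the nested `driver` key are hypothetical, and the pairing of 4078 with bnx2x and 2040 with bnx2 is inferred from the "reduced to 2040 from 4078" remark rather than read from interface_tweaks.pp.

```python
from typing import Optional

# Ring sizes quoted in the log; which driver gets which value is inferred.
RING_BNX2X = 4078
RING_BNX2 = 2040


def ring_with_implicit_else(driver_fact: object) -> Optional[int]:
    """Implicit-else logic: anything that is not 'bnx2x' is treated as bnx2."""
    if driver_fact == "bnx2x":
        return RING_BNX2X
    return RING_BNX2


def ring_with_explicit_match(driver_fact: object) -> Optional[int]:
    """Guarded logic: only act on a driver that is positively recognised."""
    if driver_fact == "bnx2x":
        return RING_BNX2X
    if driver_fact == "bnx2":
        return RING_BNX2
    return None  # unknown or mis-shaped fact: leave the NIC untouched


# Facts-vs-manifests mismatch: the comparison expects a string but is handed
# a nested hash. The key name "driver" is a hypothetical stand-in.
mismatched = {"driver": "bnx2x"}

print(ring_with_implicit_else(mismatched))   # 2040
print(ring_with_explicit_match(mismatched))  # None
```

With the implicit else, the mismatched comparison is simply false and falls through to the bnx2 value, which on a bnx2x host would change the ring size and blip the link. With the explicit match the mismatch becomes a no-op.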
[15:44:13] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229012 (TheDJ) Open>Resolved a:TheDJ I think we can call this one resolved, per T180921#4226106 and {F18492888}
[15:46:10] XioNoX: your new net_driver fact should be live everywhere now. The only 4 hosts where it doesn't work and reports {} are:
[15:46:13] sca[2003-2004].codfw.wmnet,sca[1003-1004].eqiad.wmnet
[15:47:04] on those it was not working even before, and they are Ubuntu Trusty (but other Trusty works fine)
[15:55:48] bblack: just FYI I've also checked the catalogs on puppetboard and they report the correct lines/values applied
[15:56:11] Traffic, Analytics, Operations, Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229032 (Tgr) Would be nice if someone could test it on IE or Edge, now that Safari is fixed those are the only two browsers that need a...
[16:02:43] Traffic, Analytics, Operations, Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229047 (TheDJ) P.S. I do think that fallback is indeed broken on Safari. Note how the error message says it is reverting the policy to...
[16:03:27] volans: good to know, thanks
[16:38:06] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229170 (Nuria) @TheDJ is there a way to know when the fix landed in the safari version users have (not when it was merged and...
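The earlier suggestion to pull the core of a fact out into a standalone script, run it across the fleet, and look for breakage can be sketched as follows. This is Python rather than the Ruby of the real fact, which the log does not show; the function name, the `/sys/class/net/<iface>/device/driver/module` layout and the "skip when missing" behaviour are assumptions based only on the remark that the fact returns {} where the /device/driver/module file is absent.

```python
import os
from typing import Dict


def net_drivers(sysfs_net: str = "/sys/class/net") -> Dict[str, str]:
    """Map interface name -> kernel module name, skipping anything unresolvable."""
    drivers: Dict[str, str] = {}
    if not os.path.isdir(sysfs_net):
        return drivers
    for iface in sorted(os.listdir(sysfs_net)):
        # Assumed layout: a symlink pointing at the driver's module directory.
        module = os.path.join(sysfs_net, iface, "device", "driver", "module")
        if not os.path.exists(module):
            continue  # virtual interface, or a host that does not expose the link
        drivers[iface] = os.path.basename(os.path.realpath(module))
    return drivers


if __name__ == "__main__":
    # Print instead of exporting a fact, so the output can be eyeballed per host.
    print(net_drivers())
```

Because every lookup is guarded, a host with an unexpected sysfs layout yields an empty mapping instead of an exception, which is the behaviour reported for the four sca hosts.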
[17:06:05] netops, DBA, Operations, ops-codfw: Swtich port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229394 (Papaul)
[17:07:26] netops, DBA, Operations, ops-codfw: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229282 (Papaul)
[17:09:43] netops, Operations, ops-codfw: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229419 (Marostegui)
[17:31:21] netops, Operations, ops-codfw: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229477 (RobH) Open>Resolved both ports have descriptions set, enabled, and they were both already in the private vlan
[17:51:15] bblack: can you let me know when you're around, I want to re-pool esams, but would like to make sure there is another pair of eyes in case the issue happens again
[17:51:57] issue being spike of 503, possibly network related but can't find anything to confirm it
[18:28:04] well, repooled, monitoring
[18:28:51] also when esams is depooled, Telia's uplink in eqiad gets dangerously close to saturating https://librenms.wikimedia.org/graphs/to=1527186000/id=6841/type=port_bits/from=1527164400/
[18:40:13] netops, DBA, Operations, ops-codfw, Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229634 (Marostegui) Adding #netops to see if they can help out
[18:46:39] netops, DBA, Operations, ops-codfw, Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229648 (Marostegui) I can see the requests arriving fine (this is db2094) but looks like it is not going past that? : ``` May 24 18:25...
[19:30:49] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229700 (TheDJ) @nuria Safari 11.1 was released with iOS 11.3 and macOS 10.13.4 (as well as macOS 10.12.6 and 10.11.6) Both r...
[20:41:25] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4230029 (JKatzWMF) @TheDJ Thanks for monitoring this and letting us know! It's a huge relief.
[23:00:07] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4230268 (Nuria) So nice when things add up! I can check on the preview staff, let me know if you want that done
[23:33:20] Traffic, Operations, Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230355 (Krenair) So I've got it serving files to a puppet client successfully. Client just has this: ```lang=puppet file { '/etc/centralcerts/testing.pub...