[06:31:29] netops, Operations, ops-esams: cp3036 and cp3037 production ports mislabeled - https://phabricator.wikimedia.org/T196970#4277861 (ayounsi) Open>Resolved a:ayounsi Thanks, fixed: ```lang=diff [edit interfaces xe-3/0/4] - description cp3037; + description cp3036; [edit interfaces xe-3/0...
[06:35:27] netops, Operations, ops-codfw: switch port configuration for bast2002 - https://phabricator.wikimedia.org/T196957#4277866 (ayounsi) Open>Resolved a:ayounsi Added to the public vlan: ```lang=diff [edit interfaces interface-range vlan-public1-b-codfw] member ge-8/0/12 { ... } + mem...
[07:40:06] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4278018 (ayounsi)
[07:41:25] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4273636 (ayounsi) a:Papaul Switch ports configured, table in description updated.
[08:20:57] XioNoX: hey, let's move https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/435797/ forward!
[08:21:08] it's getting rusty :P
[08:24:20] vgutierrez: wanted to check if Faidon had feedback first. He said he had comments on Monday, but I don't know for which meeting items :)
[08:24:48] oh ok :)
[08:44:02] Traffic, Operations, ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#4278185 (fgiunchedi) p:Normal>High There have been edac correctable memory errors reported for this host, raising priority to high since the cpu temp alerts also persist ``` Jun 13 04:...
[09:47:20] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278435 (fgiunchedi) I researched the "panic on uncorrectable errors" a bit and turns out not edac but the machine check framework already takes care of panicking...
[10:17:22] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4096194 (Dzahn) gerrit.wmfusercontent.org now exists in cache::misc and requests would be forwarded to cobalt as the backend. This unblocked this to a certain extent because avatar...
[10:17:28] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4278528 (Dzahn) p:Triage>Normal
[10:19:00] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278534 (akosiaris) >>! In T183177#4278435, @fgiunchedi wrote: > I researched the "panic on uncorrectable errors" a bit and turns out not edac but the machine che...
[10:24:16] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278560 (fgiunchedi) Open>Resolved a:fgiunchedi I'm resolving this task since we're alerting on uncorrectable memory errors found by EDAC now. Uncorre...
[10:25:30] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278566 (fgiunchedi) Resolved>Open
[10:45:54] XioNoX, vgutierrez: I have two main concerns: 1) is about the risks of doing this from every host, and 2) is... if the IP is in another host, chances are this check will never run, because Icinga will be unable to reach the host
[10:46:03] and trigger it
[10:47:13] have you looked into doing this kind of consistency check from the routers via SNMP, either to cross-check if the two routers in a pair have the same view, or to keep a cache between subsequent runs or something like that?
[10:49:19] that would be passive, wouldn't need to inject extra ARP who-has in the network, and it'd be from a third-party view as well
[12:13:52] 1/ should not be an issue, especially as we don't have to run it aggressively
[12:13:52] 2/ it depends, if a decom server comes back to life it should have the checks as well, also if the IP flaps between MACs there will be times the checks can run and alert properly
[12:17:08] querying the routers' ARP opens the door to special cases, like VMs, and maintaining states, so less trivial
[12:53:44] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278971 (fgiunchedi) Open>Resolved >>! In T183177#4278534, @akosiaris wrote: >>>! In T183177#4278435, @fgiunchedi wrote: >> I researched the "panic on unc...
[13:07:30] ema, bblack: let me know your thoughts regarding https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440114/
[13:07:54] I'm still waiting for some translation of top 10 affected countries (Germany, Poland && India)
[13:12:03] nice memory skills, re: the advice from hackathon about explicit dir=rtl in there :)
[13:12:13] right :)
[13:12:18] err, explicit dir=ltr I meant :)
[13:12:28] yep yep.. Amir's comments were pretty useful
[13:12:37] vgutierrez: where are the translations kept? I can do the German one if needed
[13:12:57] moritzm: https://meta.wikimedia.org/wiki/User:Johan_(WMF)/AES128-SHA
[13:13:04] thx <3
[13:13:51] gotta go now... I'll get your comments later :D
[13:14:18] ok!
[13:16:01] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279030 (Papaul) a:Papaul>ayounsi @ayounsi all fibers for lvs2010 and lvs2009 are already pulled according to the first plan |LVS2009|C2|asw-c2|asw-a2|asw-b2...
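For reference, the router-side cross-check suggested at 10:47:13 (comparing the ARP view of the two routers in a pair) could look roughly like the sketch below. It is only an illustration: it assumes the two IP-to-MAC tables have already been fetched (for example via SNMP), and the function name, the sample addresses, and the MACs are all hypothetical.

```python
def arp_disagreements(view_a, view_b):
    """Return {ip: (mac_a, mac_b)} for IPs that both routers know about
    but resolve to different MAC addresses (compared case-insensitively)."""
    return {
        ip: (view_a[ip], view_b[ip])
        for ip in view_a.keys() & view_b.keys()
        if view_a[ip].lower() != view_b[ip].lower()
    }


# Hypothetical sample data: two routers' ARP tables for the same VLAN.
router1 = {"192.0.2.10": "aa:bb:cc:00:00:01", "192.0.2.11": "aa:bb:cc:00:00:02"}
router2 = {"192.0.2.10": "AA:BB:CC:00:00:01", "192.0.2.11": "aa:bb:cc:00:00:99"}

conflicts = arp_disagreements(router1, router2)
```

This only covers the "same view" comparison. The special cases raised at 12:17:08 (VMs, keeping state between runs) are not handled here.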
[13:23:09] vgutierrez: done
[13:23:38] vgutierrez: so, yeah, I see you've tried to blend it back towards the way the initial commits looked in the 3DES case, rather than how it finally ended up. Unfortunately the commit history for the 3DES case is a blend of both the correct evolution of the matching params, and "bugfix as we learn", so some of the final form belongs in the initial commit this time.
[13:24:39] vgutierrez: I'd have to dig/remember a bit, but off the top of my head: I think your basic if-conditions make sense at this stage, but the 302-redirect-to-/sec-warning thing is probably something we want to do from the get-go this time.
[13:25:18] vgutierrez: it made a big difference, vs the 418 method. some UAs would just retry the 418 over and over and spike the stats on the bad cipher, whereas 302 to a cacheable /sec-warning mostly quelled that behavior
[13:25:42] (the downside is back-button to escape it)
[13:31:31] oh reminder: fill this thing out in the next ~2h: https://etherpad.wikimedia.org/p/Traffic-2018-06-14
[13:35:38] XioNoX: the checks are active and need to be configured on the icinga server, so the checks will run on the server that's reachable, if at all
[13:43:14] and if that happens then... I guess the "other" server will respond with an is-at to 0.0.0.0
[13:43:21] and that will override arp caches on the routers maybe?
[13:43:52] update them, I mean
[13:44:04] ema: if it would make life simpler.... we could also just refactor towards "install all VCLs on all cache hosts, regardless of role", so long as all the conflict ones have independent naming.
[13:44:18] ema: it might make things easier down the line too, who knows.
[13:44:40] oh also (3) the package "arping" conflicts with "iputils-arping" (they both provide the same binary but with different arguments/semantics), and the latter is a reverse dep of ganeti :(
[13:50:20] ok
[13:51:00] bblack: mmh essentially that's what the patch allows you to do (pass separate_vcl=['misc', 'upload'] on a text node), while still keeping the distinction between the main VCL file (the one loaded by varnishd -f) and those you might want to switch to from within VCL
[13:56:53] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279256 (ayounsi) a:ayounsi>Papaul I was not aware of T196560. Changes rolled back for all interfaces other than NIC1.
[14:07:32] ema: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/440069 okay to remove "prometheus-" from the name of the dashboard "prometheus-varnish-http-requests" ?
[14:08:17] XioNoX: sure
[14:09:35] ema: are you aware of any links to that page that would need to be updated?
[14:12:28] XioNoX: nope
[14:13:31] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279334 (Papaul) @ayounsi thanks
[14:15:01] also cf. T170150, godog ^
[14:15:03] T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150
[14:15:07] er
[14:15:15] T178690 i meant
[14:15:16] T178690: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690
[14:22:58] indeed!
[14:23:17] one of the candidates (varnish/traffic) for sure
[14:53:16] bblack: the cacheable 302 to /sec-warning makes sense at this stage when only 1% of the requests should be redirected/intercepted to the sec-warning page?
[15:04:22] from the commit history you detected the retry issues after hitting 100% and moving from 418 to 403
[15:04:31] so you went from 403 to 302+200
[15:04:37] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/384982/
[15:43:36] vgutierrez: yeah at this point I'm kinda on the fence about that. the 418 is less-intrusive, we just have to be careful that we don't necessarily trust our stats (re: measuring AES128-SHA decline) until after we move to a 302.
[15:43:52] and maybe move to it a bit earlier in the process, I don't know.
[15:45:12] well... one benefit of using 418 as ema pointed out this morning is that it is easily trackable in our stats
[15:45:21] the other things to keep in mind from that history (that you may have already observed I think): is that with <100% we have to be careful to only hit /wiki/ and not hit any /wiki/Special:Foo pages (hence the : exclusion, because "Special:" gets language-localized (ugh))
[15:45:45] but later with the 100%, there's no point filtering for special or non-wiki pages, we just want to avoid returning html for what are obviously images
[15:46:18] which is how it ended up in that later state just with a negative regex for /static/images/ or whatever
[15:46:25] yup
[16:19:46] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4280024 (Papaul)
[16:36:27] netops, Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4280093 (Papaul) p:Triage>Normal
[16:57:45] netops, Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4280217 (Papaul) @ayounsi the name proposal is just temporary so i can add the switches in racktables and do the setup in the scs-a1/c1. After you are done with the configuration and we remove the ol...
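For reference, the URL filtering described at 15:45:21 through 15:46:18 can be sketched as follows. This is a Python illustration of the matching logic only, not the actual VCL: the regexes, the function name, and the `full_rollout` flag are assumptions, the `/static/images/` prefix is taken loosely from the "or whatever" remark above, and the percentage sampling itself is left out.

```python
import re

# Partial rollout (<100%): only intercept plain article views under /wiki/.
# Any title containing ":" is skipped, which excludes Special: pages without
# having to know each language's localized namespace name.
ARTICLE_RE = re.compile(r"^/wiki/[^:]+$")

# Full rollout (100%): intercept everything except obvious image paths,
# so that HTML is not returned where an image is expected.
IMAGE_RE = re.compile(r"^/static/images/")


def should_intercept(path, full_rollout):
    """Decide whether a request path would get the security-warning response."""
    if full_rollout:
        return IMAGE_RE.match(path) is None
    return ARTICLE_RE.match(path) is not None
```

Under these assumptions, `should_intercept("/wiki/Special:Foo", full_rollout=False)` is false, while the same path is intercepted once the rollout reaches 100%.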