[01:14:47] Traffic, Operations, Performance-Team, Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Krinkle) Open→Declined
[01:17:52] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Tgr) >>! In T191183#6465408, @hashar wrote: > We can't use the third party service gravatar.com since that leaks personal information to a third party....
[04:39:21] Traffic, Beta-Cluster-Infrastructure, Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Other / Uncategorized): Investigate what caused the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197 (DannyS712)
[07:38:40] hello hello, if anybody has time https://gerrit.wikimedia.org/r/c/operations/puppet/+/627856 (it is about adding hue-next.wikimedia.org config for ATS)
[08:06:12] elukey: the change looks good, and the certificate too. Ship it!
[08:06:52] ema: <3
[11:11:41] Acme-chief, Traffic, Operations, Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (Vgutierrez) p:High→Medium acme-chief updated to version 0.29 in our production environment, the unified cert should be renewed tomorrow, we will check...
[11:18:05] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (ayounsi)
[12:15:38] hi traffic o/ I would like to remove two services from LVS (following https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers)
[12:17:49] Both are low-traffic kubernetes and we keep using the IPs (e.g. I'm removing the HTTP version of the services, keeping the HTTPS version)
[12:18:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/627266 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/627270 for reference. Let me know if you have any concerns
[12:19:11] I read that backup LVS are lvs1016 and lvs2010, primary lvs1015 and lvs2009
[12:25:11] jayme: no concerns, and yes lvs1016 / lvs2010 are the backups (they have `profile::pybal::primary: false` in hiera)
[12:26:24] that wikitech document isn't really accurate
[12:26:34] I mean, you can follow that and it will "work"
[12:27:19] but pybal will never remove the dead entries from the actual ipvs tables. We should probably update that wikitech page to talk about asking #traffic to manually remove (or document how to do it, but ipvsadm -D commands are scary)
[12:28:10] but now I wonder if we've missed a few removals that followed the doc and never got the ipvsadm-level removal
[12:28:21] or write a cookbook that does it to make it scarier
[12:28:34] but repeatable
[12:29:29] yeah, I'm wary of that
[12:29:52] ipvs is not friendly!
[12:30:03] I guess that's okay for me in the first place as my main concern is to remove health-check load from pybal. In addition I'm going to do some more removals over the next week so it would probably be smart to call out to #-traffic after I've finished all of them, right?
[12:30:32] (for running scary commands I mean)
[12:30:53] or we can show you how to do the final scary step and we can see how it goes as a trial for "maybe this is unscary enough to automate"
[12:31:25] it's a fairly simple command, so long as nothing strange is going on
[12:33:13] okay for me as well. I'm not very used to ipvsadm so I might be a good guinea pig for this :)
[12:33:35] basically:
[12:33:41] root@lvs1016:~# ipvsadm -D -t 10.2.1.14:8888
[12:34:03] just make sure you have the correct port number there at the end, and the correct (per-datacenter) IP
[12:34:15] and "-t" is for TCP, it would be different for the rare UDP services in pybal
[12:34:37] you would do this as the final step on each affected LVS, after pybal has been restarted to not know/care about this entry
[12:36:11] and if you get it wrong (e.g. you type the codfw IP while working on eqiad, or whatever, and somehow the entry you're trying to delete already doesn't exist
[12:36:19] the error message to expect is: "Memory allocation problem"
[12:36:40] which is what ipvsadm says when you get any arguments imperfect or it can't lookup whatever host:port you're referencing for some reason.
[12:36:56] uh..did not look scary until the last two lines
[12:38:06] https://phabricator.wikimedia.org/T82849
[12:38:37] apparently we fixed it upstream a few years ago, but it has yet to filter back down to debian stable :P
[12:38:52] :D
[12:39:21] there's not actually a memory problem, ipvsadm is just horrible at reporting errors
[12:40:57] I can definitely try that! So the scary part of this is just that messing with ip:port could delete another service's entry and everything goes nuts
[12:41:43] yes, if you typo the IP:Port vs what it should have been (for that service in that DC on that LVS), you'll either get a scary and confusing error message (but now you know about it!)
[12:42:04] or if you're super unlucky, you might destroy a working entry for an unrelated production service, as many of them are only a digit or two apart
[12:43:01] (the "oh I meant to kill 10.2.2.12, but I forgot that final 2 so I took out appservers.svc.eqiad.wmnet" scenario)
[12:43:46] if that happens, the fix is basically to restart pybal again so that it can re-create the deleted entry from config
[12:44:17] we could document a list command first that does dns lookup of the output
[12:44:31] so you can double check it is the same and then just change the prev options with -D
[12:45:34] or wrap that all in a script
[12:45:38] or not and cookbook it
[12:45:45] :)
[12:45:54] or none of the above and hurry up and replace all of this :)
[12:46:39] I'm hoping that in our brave new k8s world eventually, most "services" aside from our public edge stuff won't even use pybal (or its successor)
[12:47:10] and some infra things I'm sure, but most applayer things anyways should be able to move off
[12:49:16] Okay then. Will try to not break anything (as usual) and report back if I stumble across anything that has not been said. Thanks bblack
[13:14:50] bblack: I see some "PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal" alerts regarding my change
[13:15:03] am I just not fast enough with restarting pybal?
[13:15:39] did you already perform the ipvsadm -D incantations?
[13:16:00] no...sounds like that is missing, hm?
[13:16:37] yeah, I'm pretty sure that check is comparing Pybal's state with the IPVS services table (which you haven't altered yet)
[13:18:37] so it's actually very much required to do the ipvsadm -D commands together with the process...I probably got that wrong
[13:21:55] jayme: https://wikitech.wikimedia.org/wiki/PyBal#Services_in_IPVS_but_unknown_to_PyBal
[13:22:45] it's not a big deal though, there's no rush to fix that quickly really so take your time!
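Note on the "list first, double check, then -D" idea from 12:44: a minimal sketch of what such a wrapper might look like, assuming the usual `ipvsadm -L -n` layout where each virtual service appears on a line starting with TCP or UDP. The function names, the sample listing and every address in it other than 10.2.1.14:8888 are hypothetical. The sketch only builds the command for an entry that is actually present in the listing; it never runs ipvsadm, handles IPv4 only, and leaves confirming that the IP belongs to the intended service and datacenter to the operator.

```python
import ipaddress

# Illustrative listing only; real output would come from `ipvsadm -L -n`.
SAMPLE_LISTING = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.2.1.14:8888 wrr
  -> 10.64.0.10:8888              Route   1      0          0
TCP  10.2.1.1:80 wrr
"""

# -t selects a TCP virtual service, -u a UDP one.
PROTO_FLAGS = {"TCP": "-t", "UDP": "-u"}


def parse_virtual_services(listing):
    """Return the set of (protocol, "ip:port") virtual services in a listing."""
    services = set()
    for line in listing.splitlines():
        fields = line.split()
        # Real-server lines start with "->", header lines with other words.
        if len(fields) >= 2 and fields[0] in PROTO_FLAGS:
            services.add((fields[0], fields[1]))
    return services


def build_delete_command(listing, proto, ip, port):
    """Build the delete command only if the exact entry is currently listed."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    if proto not in PROTO_FLAGS:
        raise ValueError(f"unsupported protocol: {proto}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    address = f"{ip}:{port}"
    if (proto, address) not in parse_virtual_services(listing):
        raise LookupError(f"{proto} {address} is not in the IPVS table")
    return ["ipvsadm", "-D", PROTO_FLAGS[proto], address]


if __name__ == "__main__":
    print(" ".join(build_delete_command(SAMPLE_LISTING, "TCP", "10.2.1.14", 8888)))
```

Refusing to build a command for an entry that is not listed catches the wrong-datacenter kind of typo before ipvsadm gets the chance to answer with "Memory allocation problem". It does not catch a typo that happens to match another live service, so the human double check is still needed.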
[13:25:02] I think I just wasn't awary that this will trigger alerts. :)
[13:25:30] *aware
[13:26:08] it is safe to assume that touching pybal in any way will trigger some sort of alerts
[13:27:02] eheh, okay
[13:27:06] yeah I didn't think it would trigger the alert either. At least something is keeping it from slipping through the cracks! :)
[13:27:54] and so yeah in that other link the -D is already documented
[13:27:55] is it okay for you if I add the "ipvsadm -D -t" step to wikitech, then?
[13:28:01] yes, please
[13:28:06] wilco
[13:28:20] yeah so we alert either because (1) there's something in ipvs and not in pybal or because (2) pybal thinks there should be something while that thing is not defined in ipvs
[13:29:43] IIRC we added the alert due to a surprising occurrence of (2) and concluded that (1) would be useful as well
[13:31:18] jayme: maybe also !log the -D command for posterity
[13:31:40] yeah
[13:33:00] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Nemo_bis) >>! In T262869#6468339, @CDanis wrote: > There was another instance of it about 10 hours ago. Also right now it seems, at least from some TIM customer in...
[13:34:22] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) >>! In T262869#6470378, @Nemo_bis wrote: >>>! In T262869#6468339, @CDanis wrote: >> There was another instance of it about 10 hours ago. > > Also right now...
[14:09:35] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Nemo_bis) Yes, we're telling that to everybody (including to journalists who called WMIT, social media, internal mailing lists and colleagues). Did you get any info...
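Note on the two alert conditions described at 13:28: the comparison amounts to a set difference taken in both directions. The following is a hedged sketch of that idea, not the code of the actual PyBal IPVS diff check; the function name, the dictionary keys and the sample addresses are hypothetical.

```python
def ipvs_diff(ipvs_services, pybal_services):
    """Compare two collections of "ip:port" strings in both directions."""
    ipvs = set(ipvs_services)
    pybal = set(pybal_services)
    return {
        # Case (1): stale entries left in the kernel table, e.g. after a
        # service was removed from PyBal's config but never deleted with -D.
        "in_ipvs_unknown_to_pybal": sorted(ipvs - pybal),
        # Case (2): services PyBal expects that are not defined in IPVS.
        "known_to_pybal_missing_in_ipvs": sorted(pybal - ipvs),
    }


if __name__ == "__main__":
    # Hypothetical state right after a removal: PyBal no longer knows
    # 10.2.1.14:8888, but the IPVS entry has not been deleted yet.
    print(ipvs_diff({"10.2.1.14:8888", "10.2.1.1:80"}, {"10.2.1.1:80"}))
```

With both results empty the two views agree; the removal discussed above corresponds to the first list being non-empty until the `ipvsadm -D -t` step has been run.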
[16:23:02] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (greg) (Not wanting to nit-pick, really, just curious...) Is sending username/email to gravatar by default (how I assume this would work using that local...
[16:28:18] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Ladsgroup) Sorta responding to Greg and hashar too. I agree having proxy is less than optimal. But there's another option as suggested in {T256541} (grav...
[16:46:26] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Tgr) >>! In T191183#6471328, @greg wrote: > Is sending username/email to gravatar by default (how I assume this would work using that local proxy) consid...
[16:48:09] netops, Operations, ops-codfw: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul)
[16:49:08] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (hashar) Just for the context the reason I declined this again is because I have seen on IRC a notice about disabling gravatar for OTRS ( T187984#6465324...
[16:53:53] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (greg) Agreed with @hashar on the way forward for this request.
[17:29:05] Traffic, Gerrit, Operations, Release-Engineering-Team-TODO, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Tgr) Filed {T263161} about having a Gravatar proxy.
[22:20:54] netops, Operations: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (CDanis)
[22:52:58] Traffic, netops, Operations: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) Open→Resolved a:CDanis After extensive investigation by one of our network connectivity providers, we believe that the cause has been discovered...
[22:54:38] netops, Operations: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (CDanis) (an update: duh, we have ~3Gbit/s of codfw-->esams traffic that is traversing eqiad)
[23:00:30] Domains, Traffic, Analytics-Radar, Operations, Wikimedia-General-or-Unknown: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (Krinkle) Those urls don't need to change. We just need to stop accidentally setting cookies on them. I'm 99% sure this is...
[23:00:36] netops, Operations: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (faidon) SGTM!
[23:01:00] Domains, Traffic, Analytics-Radar, Operations, and 2 others: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (Krinkle)
[23:15:19] netops, Operations: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (faidon) p:Triage→Medium