[07:54:14] inflatador: yep yep but both tools use the Swift API IIUC, and I prefer way more the S3 protocol. In your case you could probably use something like boto or a similar s3 library within an Airflow DAG, IIRC it won't handle the deletion of a whole bucket by itself (every object needs to be deleted separately), but it is easy to write something that lists all the objects and batch-deletes them.
[07:54:56] s3cmd is still working but it deleted ~500k objects from yesterday, and I have 90M to go, so it will take a while :D
[09:40:48] I'm upgrading Java on the puppetservers, which requires an immediate restart due to the jruby JIT, there'll be a few failing puppet runs, but I'll splay these out so rate remains low
[14:21:08] elukey: FWIW last time I had to delete 10s of millions of files from s3, I found rclone to be much faster at it than s3cmd
[14:21:21] as in at least 5-10x faster
[14:21:25] e-lukey yup, I'm well aware of the swift container deletion requirements from my Rackspace days, wrote automation back in the day that did exactly that. Will have to try the boto approach in some future
[14:21:29] day when I have time
[14:21:53] brouberol you also wanna be careful not to hit the swift API too hard, at least I got yelled at for that in the past ;)
[14:22:00] (at Rax that is)
[14:22:02] ah right :D
[14:23:19] exactly yes :D Thanos swift is multi-tenant so I am happy to go slow, as long as I don't have to reauth every now and then like with the swift CLI
[14:23:27] but TIL Rclone, good to know!
[14:23:41] I need to delete ~180M files, it will take some weeks
[14:59:18] rclone is pretty good software
[15:07:02] +1 yeah I've had an rclone timer running for years now backing up all my stuff and it "just works"
[15:21:22] fyi: I am going to do Cassandra restarts of the sessionstore cluster shortly. no impact expected, just raising awareness!
[15:22:31] ack, thanks for the headsup
[15:25:32] elukey / moritzm are ya'll still working on puppetserver? I got a connection failure to puppetmaster2001 when committing secrets on puppetmaster1001
[15:25:46] uhm. PUPPETSERVER not PUPPETMASTER
[15:26:27] I've removed the puppetmaster role from puppetmaster2001, as it's being shut down
[15:26:58] inflatador: so you got an error for puppetserver2001?
[15:27:08] oops, no, it was in fact `gitpuppet@puppetmaster2001.codfw.wmnet` . Sorry
[15:27:21] I think I'll simply run the decom cookbook against puppetmaster2001 now
[15:27:41] there's probably some chicken and egg where it attempts to run puppet against itself while also removing itself or so
[15:27:47] sounds like we're all good, sorry about the master/server confusion
[15:29:51] ok!
[15:30:08] the decom cookbook is running now, so that should also fully remove puppetmaster2001 from puppetdb
[15:31:01] the final removal of puppetmaster1001 is going to be extra tricky, we'll not be able to use the decom cookbook for it and instead make up some plan of manual steps
[15:32:28] moritzm: you might (to be checked) get away with a test-cookbook checkout of the cookbook where you comment a bunch of lines though
[15:32:57] oh, that's a great idea! didn't think about that option
[15:58:04] moritzm looks like the secrets don't get pushed out when the git commit hook fails on the puppetserver1001
[15:58:21] I can render the file manually but if there is a better workaround LMK
[15:59:23] maybe do a NOP change (like updating a comment or so), then should catch up with the resulting sync
[16:03:52] ACK, that worked! Thanks m-moritz
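For reference, a minimal sketch of the "list all the objects and batch-delete them" approach described above (07:54 / 14:21), e.g. as the body of an Airflow task. The bucket name and endpoint URL are placeholders, not values from this log, and credentials handling is left to the environment:

```python
# Illustrative sketch only: empty a bucket by paging through object listings
# and issuing batched deletes. Bucket name and endpoint are placeholders.
import boto3

def empty_bucket(bucket: str, endpoint_url: str | None = None) -> None:
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # list_objects_v2 returns at most 1000 keys per page, which matches
        # the 1000-key limit of a single delete_objects call.
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys, "Quiet": True})

if __name__ == "__main__":
    empty_bucket("some-old-bucket", endpoint_url="https://s3.example.wmnet")
```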
[16:04:59] good :-9
[16:18:26] anybody working on tcpproxy boxes?
[16:20:29] akosiaris: are you around?
[16:21:26] vgutierrez: yes
[16:21:29] what's up?
[16:21:54] Last Puppet commit: (0c56e8c9b5) Alexandros Kosiaris - base::sysctl: Switch priority of the ubuntu-defaults stanza
[16:22:01] * jelto is not doing anything on the tcp-proxys
[16:22:08] that misconfigured rp_filter on tcp-proxy instances
[16:22:11] at least in drmrs
[16:22:32] ok, let me revert
[16:22:46] what misconfiguration though ?
[16:23:02] liberica is unable to reach the VIP via IPv4 now
[16:23:35] not doing anything on tcpproxy, in staff meeting
[16:23:35] thanks for spotting the Liberica alert btw
[16:23:45] merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237262 right now
[16:23:46] it's impacting eqsin as well now
[16:24:09] https://www.irccloud.com/pastebin/CsZTA0Ah/
[16:24:36] ipip probably has a smaller number?
[16:25:28] forcing puppet run on tcp-proxy6001
[16:25:42] https://www.irccloud.com/pastebin/GlaLAI76/
[16:25:46] the number theory is wrong btw, there is nothing before 51
[16:26:09] all rp_filter set to 1 is pretty bad
[16:26:47] akosiaris: Feb 05 16:26:35 lvs6001 libericad[585144]: time=2026-02-05T16:26:35.293Z level=INFO msg="detected healthcheck state change" service=gerrit-sshlb_29418 hostname=tcp-proxy6001.drmrs.wmnet address=10.136.0.19 healthcheck_name=IdleTCPConnectionCheck healthcheck_id=937958871 healthcheck_result_old=false healthcheck_result=true
[16:26:53] akosiaris: that's your puppet run fixing it
[16:27:18] is it? I still see all.rp_filter=1 on tcp-proxy6001
[16:27:30] despite Notice: /Stage[main]/Profile::Lvs::Realserver::Ipip/Exec[disable-rp-filter-ipip0]/returns: executed successfully (corrective)
[16:27:31] Notice: /Stage[main]/Profile::Lvs::Realserver::Ipip/Exec[disable-rp-filter-ipip60]/returns: executed successfully (corrective)
[16:28:07] akosiaris: right... default went to 1 to 2 though
[16:28:13] urgh.. from 1 to 2 sorry
[16:31:06] uh... it started failing again
[16:31:06] wtf
[16:31:38] Feb 05 16:30:04 lvs6001 libericad[585144]: time=2026-02-05T16:30:04.002Z level=INFO msg="detected healthcheck state change" service=gerrit-sshlb_29418 hostname=tcp-proxy6001.drmrs.wmnet address=10.136.0.19 healthcheck_name=IdleTCPConnectionCheck healthcheck_id=937958871 healthcheck_result_old=true healthcheck_result=false
[16:32:40] https://grafana.wikimedia.org/goto/_pvBm_NvR?orgId=1
[16:32:49] !log manually sudo sysctl net.ipv4.conf.all.rp_filter=0 on tcp-proxy6001
[16:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:35] I think that ^ repooled that 1 host in drmrs
[16:33:39] yes
[16:33:44] but ... why is it not happening via puppet
[16:35:59] sysctl isn't applying the changes on real time?
[16:36:03] or is it?
[16:37:09] vgutierrez@tcp-proxy6001:/etc/sysctl.d$ fgrep rp_filter *
[16:37:09] 10-ubuntu-defaults.conf:net.ipv4.conf.all.rp_filter = 1
[16:37:09] 10-ubuntu-defaults.conf:net.ipv4.conf.default.rp_filter = 1
[16:37:28] there is an exec update_sysctl that should refresh in every change in /etc/sysctl.d
[16:37:43] ah wait, there is no relationship defined
[16:37:45] what?
[16:38:08] ah the refresh is in sysctl::conffile
[16:38:20] and it just calls systemctl restart systemd-sysctl.service
[16:38:26] so it should re-apply everything
[16:39:09] so that file is telling the system to enforce rp_filter everywhere
[16:40:32] ok
[16:40:36] so, the 2's are from the trixie shipped file
[16:40:42] the puppetization of tcpproxy is mixing a hiera key
[16:40:46] fixing it
[16:40:50] dammit
[16:40:54] cripes
[16:40:56] sorry vg
[16:40:58] so this only bit tcp-proxy ?
[16:41:06] akosiaris: probably
[16:41:42] good catch
[16:42:09] seems like it akosiaris https://grafana.wikimedia.org/goto/VUxnZ_HDg?orgId=1
[16:42:32] cdanis: nice, thanks
[16:42:35] but when I pinged you I was worried it could impact other services
[16:42:44] vgutierrez: understandable. My worry too
[16:43:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237266 should fix it
[16:43:10] waiting for PCC there
[16:43:22] +1ed
[16:43:31] I'll revert my change and hang it under yours
[16:44:31] https://puppet-compiler.wmflabs.org/output/1237266/5724/tcp-proxy6001.drmrs.wmnet/ looks good
[16:44:32] merging
[16:44:38] ack
[16:45:59] triggering a puppet run on A:tcpproxy
[16:46:00] vgutierrez: should that be allowed to be set the way it was for anything with, idk, profile::lvs::realserver ?
[16:46:10] like, should that fail compilation?
[16:46:25] cdanis: probably.. is it kosher to have dependencies like that across profiles?
[16:46:34] idk it seems better than what happened
[16:47:03] it was a lovely edge case
[16:47:04] and like, a sub-profile checking hiera in profile::base seems very reasonable also
[16:47:12] there is a TODO to turn the base::sysctl into a profile
[16:47:30] triggered by akosiaris fixing the enforcement of the ubuntu sysctl settings
[16:47:40] yeah nice job spotting it :)
[16:47:43] lol
[16:47:48] my parting gift to all of you :P
[16:48:03] creating an outage during all staff meeting like it's 2014
[16:48:09] ahahahah
[16:48:20] recoveries should appear soon
[16:48:35] grafana dashboard looking good: https://grafana.wikimedia.org/goto/wFd2W_HDg?orgId=1
[16:48:42] vgutierrez@lvs6001:~$ liberica cp services
[16:48:42] gerrit-sshlb_29418:
[16:48:42] 10.136.1.18 1 healthy: true | pooled: yes
[16:48:42] 10.136.0.19 1 healthy: true | pooled: yes
[16:48:44] on the plus side, it was probably the best possible time to do it? The least amount of devs trying to git pull
[16:48:57] akosiaris: lol, it's only opt-in atm, at least
[16:49:04] dammit
[16:49:12] I was hoping to create some impact
[16:49:32] akosiaris: you can enforce rp_filter on cp hosts
[16:49:42] * vgutierrez runs away
[16:49:42] - Line 5: Line exceeds max length (166>100)
[16:49:47] thank you jerkins-bot
[16:49:50] I will not miss you
[16:50:24] hot take, we should bring back "jerkins-bot"
[16:50:49] yes, it IS a disrespectful and unconstructive way to talk about jenkins, and we SHOULD be bullying robots more
[16:51:36] hmm that didn't fix it on every DC though
[16:51:38] it would be completely inappropriate for our human coworkers and it's not rude ENOUGH for a silly little computer program that counts your newlines and tells you what to do
[16:51:47] this has been my lightning talk, thank you
[16:52:37] https://www.irccloud.com/pastebin/Hw4kgTxX/
[16:53:40] vgutierrez: ready to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237268 once more. Let's trigger it again! :P
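For reference, a small sketch of the comparison being done by hand above (fgrep over /etc/sysctl.d vs sysctl -a): print every rp_filter value declared in /etc/sysctl.d next to the value the kernel is currently using under /proc/sys. The script is illustrative only, it is not something that exists in puppet:

```python
# Illustrative only: compare rp_filter values declared in /etc/sysctl.d
# against the live values exposed under /proc/sys.
import glob
import pathlib

def declared_rp_filter():
    for conf in sorted(glob.glob("/etc/sysctl.d/*.conf")):
        for raw in pathlib.Path(conf).read_text().splitlines():
            line = raw.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, _, value = (part.strip() for part in line.partition("="))
            if key.endswith(".rp_filter"):
                yield conf, key, value

def live_value(key):
    # net.ipv4.conf.all.rp_filter -> /proc/sys/net/ipv4/conf/all/rp_filter
    # (interfaces with dots in their name would need special handling)
    return pathlib.Path("/proc/sys", *key.split(".")).read_text().strip()

for conf, key, value in declared_rp_filter():
    try:
        live = live_value(key)
    except FileNotFoundError:
        live = "missing"
    print(f"{conf}: {key} = {value} (live: {live})")
```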
[16:53:48] akosiaris: it's still broken
[16:54:01] party pooper
[16:54:05] of course
[16:54:10] should we run the manual "sudo sysctl net.ipv4.conf.all.rp_filter=0" via cumin first?
[16:54:12] cause now it needs your fix lol
[16:54:21] please go ahead akosiaris
[16:54:22] "fix" ?
[16:54:26] yeah
[16:54:27] ack, merging
[16:54:31] without your fix for trixie
[16:54:37] the ubuntu values aren't being enforced
[16:54:44] so rp_filter is not being disabled as expected
[16:54:48] how did this even work in the first place ?
[16:55:03] cause the ubuntu file disabling rp_filter wasn't enforced on trixie hosts
[16:55:06] till you fixed that
[16:55:15] breaking tcp-proxy in the process
[16:55:42] a lovely chicken and egg issue
[16:56:52] forcing puppet run on tcp-proxy6001
[16:57:32] drmrs is happy
[16:57:38] cause you ran manually some sysctl magic there
[16:58:02] I'm running puppet on A:tcppproxy-magru to test it there
[16:58:21] alias and role being tcppproxy and hostnames tcp-proxy is making my brain hurt btw
[16:58:47] net.ipv4.conf.all.rp_filter = 0
[16:58:47] net.ipv4.conf.default.rp_filter = 0
[16:58:48] is correct now on tcp-proxy6001
[16:58:51] in the file that is
[16:59:09] as well as sysctl -a output
[16:59:41] I'm running puppet on 2001 to see if that fixes gerrit ssh for me :)
[17:00:02] vgutierrez: same feeling about that dash btw
[17:00:13] uhm, I guess that might be on me with the naming mismatch. can upload a fix for that later
[17:00:21] magru looks good though
[17:00:23] *too
[17:01:27] forcing a puppet run on A:tcpproxy
[17:02:38] alerts already clearing on alerts.wm.o
[17:03:47] thanks vg :)
[17:04:33] ah, you know why that happened. role names usually only have underscores, not hyphens. but host names always have had - at WMF
[17:04:35] hmmm
[17:04:39] https://www.irccloud.com/pastebin/Qbo46Fac/
[17:04:49] wtf? :)
[17:05:37] service is now healthy across the 7 PoPs FWIW
[17:06:28] so the only difference is now the ens13.rp_filter ?
[17:06:42] yep
[17:09:37] 2 is not such a bad value for this, but what is going on ...
[17:10:22] probably a reboot missing in some hosts?
[17:10:28] an extra puppet run fixed it on tcp-proxy1001
[17:11:32] vgutierrez: I am thinking more like race condition?
[17:11:49] yep
[17:11:56] which class gets applied first ? profile::lvs::realserver::ipip or base::sysctl ?
[17:12:29] but realserver::ipip doesn't mess with rp_filter for the main NIC
[17:12:37] just on ipip0 and ipip60
[17:12:47] good point, scratch that theory
[17:13:09] yeah, you are right, reboot is the reason
[17:13:34] the interface was brought up before the change of the setting
[17:13:48] and inherited default.rp_filter
[17:24:20] those hosts are still insetup but can be moved to the new per-rack vlans (with a reimage using the --move-vlan option). It's a good opportunity, no impact for the services, and better to do it sooner than later. I need to get in touch with the service owners. (cc topranks)
[17:24:20] * aqs[1025,1027].eqiad.wmnet,wikikube-worker[1328-1329,1335-1342,1362-1364].eqiad.wmnet
[17:24:20] * aqs1023.eqiad.wmnet,aux-k8s-worker[1006-1007].eqiad.wmnet,ganeti-jumbo1001.eqiad.wmnet,wikikube-worker[1330-1334,1343-1349,1365-1370].eqiad.wmnet
[17:31:52] XioNoX: works for me!
[17:32:21] bear in mind we need to add the BGP policies for some of them, the aux-k8s anyway, maybe ganeti-jumbo (is that routed?)
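On the root cause settled above (17:13, "the interface was brought up before the change of the setting and inherited default.rp_filter"): a new interface copies net.ipv4.conf.default.rp_filter once, at creation, and does not follow later changes to the default, which is why only a puppet run after boot or a manual sysctl fixes the NIC. A root-only demonstration sketch with a throwaway dummy interface; the interface name is arbitrary and the original default is restored at the end:

```python
# Root-only demo: a freshly created interface inherits
# net.ipv4.conf.default.rp_filter at creation time and keeps that value
# even if the default changes afterwards.
import pathlib
import subprocess

DEFAULT = pathlib.Path("/proc/sys/net/ipv4/conf/default/rp_filter")

def rp_filter(iface: str) -> str:
    return pathlib.Path(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter").read_text().strip()

original = DEFAULT.read_text().strip()
try:
    DEFAULT.write_text("2\n")
    subprocess.run(["ip", "link", "add", "rpftest0", "type", "dummy"], check=True)
    print("rp_filter at creation:", rp_filter("rpftest0"))        # 2, copied from default
    DEFAULT.write_text("0\n")
    print("rp_filter after changing default:", rp_filter("rpftest0"))  # still 2
finally:
    subprocess.run(["ip", "link", "del", "rpftest0"], check=False)
    DEFAULT.write_text(original + "\n")
```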
[17:32:43] wikikube-worker should be ok the patch is merged but as yet untested (we need to verify routes received and propagated, should be fine)
[17:33:13] ganeti-jumbo will be routed, but there is much more work to do on the ganeti side first (like upgrade to trixie)
[17:33:59] yeah, the advantage is that they're in insetup so we won't break anything if there is some tune-up needed
[17:34:09] also who is the point of contact for aqs ? marostegui ?
[17:41:25] the puppet role_contact is Data Persistence while the Wikitech page says it's maintained by Data_Platform_Engineering/Experiment_Platform
[17:42:01] mutante: thx yeah, I was looking for a specific person for a quick +1
[17:43:28] XioNoX: probably urandom
[17:44:03] and btw -- for the k8s workers that are still `insetup` -- I think you can wait on adding the bgp policy until the service owner is ready to apply roles
[17:58:47] XioNoX: I’m afk atm, how quick of a +1 do you need?
[18:03:23] urandom: oh, not that quick! next week is fine, anytime before starting to use those servers
[19:07:16] re: sysctls I've found that https://tuned-project.org/ (available in Debian) is a nice way to tie sysctls, sysfs settings etc into together profiles that you can layer. (Full disclosure: I'm a contributor ;P )
[19:08:03] er...tie **into** profiles that is
[19:11:54] I would think that puppet already gives us those "opportunities", yeah?
[20:48:09] Maybe with puppet it's easier, but when I had to turn on rps with ansible it was pretty nasty, fishing out cpu bitmasks, converting back and forth to hex etc. There are built-in variables to tuned that make it a bit easier
[20:48:09] https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/customizing-tuned-profiles_monitoring-and-managing-system-status-and-performance#variables-in-tuned-profiles_customizing-tuned-profiles
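On the "fishing out cpu bitmasks, converting back and forth to hex" point above (20:48): a small sketch of the conversion from a list of CPU ids to the comma-separated 32-bit-word hex format that sysfs cpumask files such as /sys/class/net/&lt;dev&gt;/queues/rx-&lt;n&gt;/rps_cpus expect. The interface, queue and CPU numbers in the example are made up for illustration:

```python
# Sketch: build the sysfs cpumask string (comma-separated 32-bit hex words,
# most significant word first) for a list of CPU ids, e.g. for rps_cpus.
def cpus_to_mask(cpus, total_cpus):
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    nwords = (total_cpus + 31) // 32
    return ",".join(f"{(mask >> (32 * i)) & 0xFFFFFFFF:08x}" for i in reversed(range(nwords)))

# Steer RPS for one rx queue to CPUs 2-5 on a 64-CPU host (illustrative path):
#   echo 00000000,0000003c > /sys/class/net/eno1/queues/rx-0/rps_cpus
print(cpus_to_mask([2, 3, 4, 5], 64))   # -> 00000000,0000003c
```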