[02:11:11] blancadesal: please when you can:
[02:11:11] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/630
[02:11:11] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/39
[02:11:11] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1099776/1
[02:11:11] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1099777
[02:12:12] these are for the deadlocked pipeline
[07:36:08] Raymond_Ndibe: on it
[10:23:22] FYI I'm about to run the provision cookbook for cloudvirt1061 (francesco gave me the scapegoat host) to test the patch for T379351. The host might get rebooted (or not) based on the value of the setting in the BIOS.
[10:23:23] T379351: kernel message: SGX disabled by BIOS - https://phabricator.wikimedia.org/T379351
[10:24:59] volans: +1
[10:25:42] thx
[10:26:10] I casually noticed that cloudvirt1047 has had a full `/` since 30 minutes ago
[10:27:20] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cloudvirt1047&var-datasource=thanos&var-cluster=wmcs&from=now-3h&to=now&viewPanel=12
[10:27:27] then I guess that is alerting somewhere
[10:28:25] hashar: I'm on it
[10:30:12] there's a huge "/tmp/node4_virt.pcap"
[10:37:20] topranks: did you create that pcap file by any chance?
[10:37:43] I did
[10:37:48] that's like about 60 seconds of data
[10:37:51] sorry!!
[10:37:52] LOL
[10:37:59] but I'm not sorry
[10:37:59] can I delete it or do you want to download a copy?
[10:38:08] give me two mins
[10:38:14] take your time
[10:38:19] I cleaned up some smaller files
[10:38:48] ok it's gone
[10:38:51] thanks :)
[10:38:56] sorry, didn't realise quite how huge it was
[10:39:17] btw I was thinking I should be able to migrate a PAWS node to another cloudvirt
[10:39:22] sorry I removed the one I had on cloudnet
[10:39:36] for what reason?
[10:39:40] like a cloudvirt in E4 where you have sflow
[10:39:51] would that help?
[10:40:45] good thinking but the pattern here seems fairly clear already based on tcpdumps
[10:40:59] With my "ISP for cloud services" hat on I am disconnecting your internet service :P
[10:42:13] nah but in all seriousness this is a serious issue for us, flagrant abuse of our network
[10:42:20] can't really take a softly-softly approach
[10:42:53] I've been completely blocking IPs on the cloudsw side of the cloudgw (between cloudgw and CR routers, the first place WMF network devices can see the destination IP)
[10:43:25] thus far I have a few in there but it is whack-a-mole
[10:43:27] topranks: ok, just seen your latest comment in the task
[10:43:53] as a first step we need to block all UDP from the PAWS hosts that is not port 53 (dns) or 123 (ntp)
[10:44:11] if that hurts some esoteric legitimate paws use case so be it, the users will need to talk to us
[10:44:12] sounds sensible, where's the best place to do it?
[10:44:31] I will open a sub-task about "limiting PAWS outbound network"
[10:44:43] it has to be done pre-NAT, hard to say
[10:44:58] an iptables rule on the cloudvirts themselves would be the most effective place
[10:45:04] "if you have to cryptomine, at least do it without spamming the network"
[10:45:13] this isn't cryptominers
[10:45:20] or it doesn't look like it
[10:45:20] yeah just joking
[10:45:26] seems like just udp floods / ddos
[10:45:32] not sure what it is though, maybe ddos yeah
[10:45:47] the packets are so uniform in size and it's just floods, it very much looks like it, 99%
[10:46:09] more dos than ddos though, all from one IP isn't it?
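(For context: a minimal sketch of the kind of pre-NAT filter being proposed here, in nftables syntax. The table and chain names and the empty set are illustrative assumptions rather than the actual cloudgw configuration; only the @paws_workers set name appears later in the log.)

    table inet paws_filter {
        set paws_workers {
            type ipv4_addr
            flags interval
            # PAWS worker source addresses would be populated here (placeholder)
        }
        chain forward {
            type filter hook forward priority 0; policy accept;
            # allow DNS and NTP over UDP from the PAWS workers
            ip saddr @paws_workers udp dport { 53, 123 } accept
            # drop (and count) any other UDP they originate
            ip saddr @paws_workers ip protocol udp counter drop
        }
    }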
[10:46:10] can openstack create rules like this? if not we can do it on the cloudgw with an nftables rule
[10:46:43] dhinus: well we could just be one small node of their wider distributed denial-of-service system
[10:46:44] paws are notebooks aren't they? Can the code generating those packets be found in one of the notebooks?
[10:46:46] but yep
[10:47:00] hashar: that seems sensible but outside of what I know how to do
[10:47:29] and that sounds worth promoting to a security task :)
[10:47:31] topranks: yeah, but that one IP has a bit more bandwidth than the average home user :)
[10:47:32] I created some graphs yesterday which show the interfaces on the PAWS VMs the traffic comes from, in theory I should also be able to identify the K8s POD responsible from that, but again I don't know how to join the dots
[10:48:43] https://grafana.wmcloud.org/goto/OxZh_NVHk
[10:49:05] Is arturo around?
[10:49:49] topranks: he's out this week but he might pop online at some point
[10:50:16] hashar: it should be possible to find the python code generating those packets in some PAWS notebook yes, I don't have admin rights though
[10:50:46] it might be pointless though, because they could just recreate it from a different account
[10:51:07] so I would proceed with restricting outbound UDP packets, and possibly more outbound network connections
[10:54:07] dhinus: ok - so either the openstack side or cloudgw, what do you think is the best way forward?
[10:54:23] I can have a look at cloudgw / nftables, openstack is not my forte if we go that way
[10:56:03] a.ndrew will be online soon and he might have more ideas on the openstack side
[10:56:16] I'm having a look in the meantime
[10:57:31] ok I'll prep the cloudgw patch and if you work out openstack prior to that we can go that way instead
[10:58:14] I think a k8s network policy might be the easiest way, and should cover 99% of the abuse; it would require PAWS users to break out of the k8s policies
[10:59:24] ok that's even better, it'll be applied on the VMs themselves I assume?
[10:59:52] it can be added to the k8s deployment definitions
[11:00:53] let's do it
[11:03:05] not something I've done before so give me 10 mins
[11:13:38] so we're deploying paws through an upstream Helm chart, and that chart has some config options we can set
[11:19:12] https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/main/jupyterhub/templates/_helpers-netpol.tpl
[11:21:31] ok, I'm not familiar with that syntax but it looks fairly straightforward
[11:22:17] yep, so what I'm trying to understand is if/why we have the "egressAllowRules.nonPrivateIPs" policy enabled
[11:22:49] in theory, this is our config https://github.com/toolforge/paws/blob/main/paws/values.yaml#L300
[11:23:13] which only enables "egressAllowRules.privateIPs"
[11:23:41] but "kubectl describe networkpolicy" shows we also have a policy for nonprivateips
[11:24:28] yeah so we want to do something similar to what's in the patch I posted
[11:24:32] oh... I didn't post the patch
[11:25:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100077
[11:25:19] if that looks ok what I might do is
[11:25:33] - disable puppet on cloudgw1002
[11:25:38] - merge patch
[11:25:48] - run puppet on cloudgw1001 (non-active one)
[11:25:53] - check it looks ok
[11:26:10] - re-enable puppet on cloudgw1002 to make it live
[11:26:53] I'm fine with merging your patch while I continue my research on how to modify the k8s policies
[11:28:33] thanks topranks btw, your help on this is very appreciated. as always :)
[11:29:00] np!
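(A rough sketch of what the restriction discussed above could look like in zero-to-jupyterhub-style Helm values, assuming the chart's singleuser.networkPolicy schema: turn off the blanket allow-all egress to public IPs and explicitly re-allow only the expected traffic. This is not the actual PAWS values.yaml, and the port list is illustrative.)

    singleuser:
      networkPolicy:
        enabled: true
        egressAllowRules:
          privateIPs: true
          nonPrivateIPs: false   # drop the chart's "allow all egress to public IPs" rule
        egress:
          # explicitly re-allow DNS/NTP over UDP and web traffic over TCP (illustrative)
          - ports:
              - protocol: UDP
                port: 53
              - protocol: UDP
                port: 123
              - protocol: TCP
                port: 80
              - protocol: TCP
                port: 443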
[11:39:34] I raised T381373 to track the k8s filtering
[11:39:34] T381373: Restrict outbound connectivity from PAWS hosts - https://phabricator.wikimedia.org/T381373
[11:39:57] dhinus: can I get a +1 on this one?
[11:39:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100087
[11:40:11] there were a few small syntax errors in the last one; I fixed them manually on cloudgw1001 so this should be right
[11:40:46] +1d
[11:40:52] thanks :)
[11:45:17] ok it looks ok on cloudgw1001, I will re-enable puppet on cloudgw1002 and trigger a run
[11:48:28] ack
[11:48:47] looks ok at first glance
[11:53:31] ip saddr @paws_workers ip protocol udp counter packets 24501532 bytes 3018694396 drop
[11:53:51] 3GB of packets dropped already :)
[11:59:42] Ok I'm a lot happier now... seems to be blocking the junk as desired
[12:00:21] we are also NATing to a different IP so I will try to keep an eye out for that in netflow/sflow logs and see if they try to change the pattern to work around the block
[12:08:06] nice!
[13:11:12] Rook: ack, I didn't think of git repositories but as you say they're covered by http/s anyway. I will add udp 53/123 as suggested by topranks, and try to make the yaml linter happy
[13:13:23] 'making linters happy' should be explicitly written in our job descriptions xd
[13:14:08] :D
[13:14:40] alias happy="composer fix; npm fix"
[13:38:36] blancadesal: dhinus: anyone know of a quicker way to reach Joanna?
[13:38:44] already emailed and messaged on slack
[13:39:45] she just wrote to me on IRC, cc jobo
[13:40:48] I'm reviewing your request Raymond, I'll be back to you with approval as soon as I'm done.
[13:42:04] ok thanks Joanna! was afraid you didn't notice the message
[13:57:22] dhinus, topranks I replied on T381373 FYI. I'm sorry I can't follow up at the moment
[13:57:22] T381373: Restrict outbound connectivity from PAWS hosts - https://phabricator.wikimedia.org/T381373
[14:05:45] dhinus: did you respond to the cloudvirt1047 disk warning or did it clear up on its own?
[14:06:34] andrewbogott: topranks created a big tcpdump there, it's now been deleted
[14:06:44] ok! will ignore
[14:34:45] andrewbogott: is this one something we should worry about? T380479
[14:34:46] T380479: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479
[14:34:57] the alert fired for a few hours on Nov 21st, then it stopped
[14:36:40] feels like that's not the first one of those, searching...
[14:37:45] T368211 among others
[14:37:45] T368211: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T368211
[14:38:25] hmm yes it looks unstable
[14:38:31] not sure if it's worth raising with dcops again
[14:38:38] or maybe wait until the next one :P
[14:39:18] I tagged it with ops-codfw
[14:39:21] thanks!
[14:39:28] thanks for flagging
[14:42:56] topranks: your paws dashboard is showing a new ongoing spike, though not a huge one
[14:43:34] https://grafana.wmcloud.org/d/pqVlZS4Nz/cmooney-paws-bw-temp
[14:43:45] is it getting blocked by the cloudgw filter?
[14:47:58] there's no big spike inbound or outbound on the cloudgw right now
[14:52:09] did you add the K8s patch? the graph shows 400kpps but I don't see it anywhere else on the network
[14:52:41] yes the k8s patch is merged and deployed
[14:52:56] maybe it's measured before it's dropped?
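(The 11:53 line above is nftables ruleset output from the cloudgw; one way to check whether that drop rule is still matching traffic at this point in the conversation is sketched below. The commands are assumed, not taken from the log.)

    # show anything referencing the paws_workers set, including the drop rule and its counters
    sudo nft list ruleset | grep 'paws_workers'
    # watch the dropped-packet counter grow (or not) in near real time
    watch -n 5 "sudo nft list ruleset | grep 'ip protocol udp counter'"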
[14:53:08] that would be simultaneously very cool and very confusing
[14:53:28] well that graph is showing inbound from a K8s POD (presumably a veth interface - I never got shell access on any of those nodes set up to check properly)
[14:53:57] so it all depends on where the rules are getting applied; I'd expect to see the floods inbound on that graph, and then the K8s policy using iptables in the VM kernel to implement the rules you added
[14:54:16] so all that makes sense, and indeed why it doesn't appear later in the path, i.e. at the cloudgw
[14:54:25] I never dug into it but I expect k8s policies to use something far more esoteric than iptables :P
[14:54:41] but your reasoning still stands
[14:56:58] nope
[14:57:05] k8s/docker/openstack all of them
[14:57:22] they usually are just a couple of hundred thousand to a million iptables rules if you check :P
[14:57:29] LOL :D
[14:57:36] stuff of nightmares for us network engineers
[14:57:42] but yes your rule is working great!
[14:57:48] check the dashboard, I have adjusted it
[14:58:00] top of the graph is inbound from the K8s PODs
[14:58:03] nice!
[14:58:04] bottom is outbound from the VM
[14:58:17] so you can see that when your rules got applied it stopped the traffic leaving the VM
[14:58:19] I declare your dashboard ready to drop the "temp" in its name :P
[14:58:59] haha :)
[15:02:31] cpu usage has shot down on the cloudnets after your change too
[15:02:41] (as they are no longer having to forward all those packets)
[15:03:01] I'm re-running my dns failure test just in case this fixed everything :)
[15:08:20] * dhinus crosses fingers
[16:01:37] nope, still failing
[16:02:18] topranks, dhinus, since you're working miracles today do you have thoughts about better ways to diagnose T374830? (that's the one with the 800ms pings)
[16:02:18] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[16:02:45] do we already have a dashboard showing per-cloudvirt network activity?
[16:03:12] I've just stumbled on this other alert that might also be related: T380692
[16:03:12] T380692: ProbeDown Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T380692
[16:06:29] you're thinking that the cloudgw task and the dns failure task are the same issue? Seems likely
[16:22:38] andrewbogott: yep exactly, I see some small blips in the graph, they could be another case where the network is misbehaving
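(A minimal sketch of one way to chase the transient DNS failures in T374830 from an affected VM: log each lookup's latency and any outright failures over time, so blips can be correlated with the cloudgw probe failures. The target host, interval, and log file are illustrative assumptions, not taken from the log.)

    # crude long-running probe for transient DNS failures / latency spikes
    while sleep 5; do
      ts=$(date -u +%FT%TZ)
      if out=$(dig +tries=1 +time=2 gerrit.wikimedia.org 2>&1); then
        # record the resolver's reported query time for successful lookups
        echo "$ts $(echo "$out" | grep 'Query time:')"
      else
        echo "$ts lookup FAILED"
      fi
    done | tee -a dns-probe.log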