[07:07:39] * arturo online
[07:07:58] morning
[07:08:21] o/
[07:23:00] * dcaro reading backlog
[07:24:26] morning
[07:24:28] Roo.k: about lima-kilo ssh keys, it uses ssh to log in to the VM it creates, so it's using your default ssh keys to do so; I'm guessing it's picking up a passphrase-protected one (whatever your ssh config says).
[07:36:55] dcaro: would you mind taking a quick look at this to check if it's what you had in mind? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93/diffs?commit_id=1bddeedbf1242e2c764d1035dcbdcc9e0533d973#9f206f6803bf339f97faf108ad9d54b5bbf46633
[07:37:37] sure
[07:38:12] what's up with the tools alerts that have filled up my inbox overnight?
[07:40:26] taavi: I don't have them. Where did they come from? A list or something?
[07:41:08] yes, cloud-admin-feed@lists.wm.o, also in #wikimedia-cloud-feed on the IRC side
[07:41:23] ok
[07:41:26] I see them now
[07:46:06] I was trying to catch up first (in case it was expected, as the site is working for me); I'm guessing they are not xd
[07:46:35] taavi: running `curl https://admin.toolforge.org/healthz` by hand gives me OK and 503 alternately
[07:47:07] however the pods are reported to be healthy
[07:49:02] the 503 seems to be reported by ingress-nginx, given it shows a fourohfour-produced message, as if no webservice was running
[07:49:17] so maybe this is an indication of ingress-nginx having trouble
[07:54:31] I'm checking ingress-nginx, but I don't see anything weird
[07:55:05] I see the 503 on the tools-proxy access.log side, but nothing in the error logs
[07:58:07] maybe not related, but I see the ingress-nginx deploy has
[07:58:12] Limits:
[07:58:12]   cpu:    2
[07:58:12]   memory: 3G
[07:58:13] Requests:
[07:58:13]   cpu:    1
[07:58:13]   memory: 2G
[07:58:21] which feels a bit too small?
[07:58:45] even if we have 3 pod replicas
[08:02:47] how much is it using in actual usage?
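An aside on the alternating OK/503 seen from `curl` above: that pattern is exactly what you would expect when one of several replicas behind a round-robin load balancer is broken while the others are fine (which is what is later found in P62847). A toy simulation of that behaviour, illustrative only and not Toolforge code:

```python
from itertools import cycle, islice

def round_robin_responses(backends, n):
    """Simulate n requests spread round-robin across backends.

    Each backend is True (healthy, answers 'OK') or False (broken,
    the proxy answers '503' on its behalf).
    """
    return ["OK" if healthy else "503" for healthy in islice(cycle(backends), n)]

# Two replicas, one of them broken: every other request fails,
# matching the OK/503 flapping seen with curl against /healthz.
print(round_robin_responses([True, False], 6))
# ['OK', '503', 'OK', '503', 'OK', '503']
```

This also explains why the pods can be "reported to be healthy" while the endpoint flaps: without a liveness/readiness check on the actual `/healthz` endpoint, Kubernetes only knows the process is running, not that it answers correctly.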
[08:03:22] https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-tools&var-namespace=ingress-nginx-gen2
[08:03:53] about half the request
[08:03:59] on memory, and 1/10th of the cpu
[08:04:13] (with peaks that don't reach the request)
[08:13:06] kind of a sidebar: conntrack is almost full on cloudvirt1042? https://phabricator.wikimedia.org/T365540 is it related to the OVS swap?
[08:13:33] shouldn't be, but I'll have a look
[08:20:40] thanks!
[08:28:38] hello, cloudvirt1041 is alerting in Netbox with "Device is Active in Netbox but is missing from PuppetDB (should be ('decommissioning', 'inventory', 'offline', 'planned', 'staged', 'failed'))"
[08:29:56] XioNoX: hey, that is T364984, I will set it to failed
[08:29:57] T364984: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984
[08:30:17] thx!
[08:30:46] we need some automation there, but we can't figure out what's best to do
[08:34:02] I just created T365562
[08:34:02] T365562: toolforge: admin tool /healthz returns 503 from time to time - https://phabricator.wikimedia.org/T365562
[08:34:05] arturo: dcaro: seems like one of the tool-admin pods is not happy: https://phabricator.wikimedia.org/P62847
[08:35:01] taavi: good find. Seems like an excellent candidate for a liveness probe
[08:35:35] given that the tool already has a /healthz endpoint, I will just configure that in webservice
[08:35:54] yes
[08:37:57] was anyone paged by https://phabricator.wikimedia.org/T365462 ?
[08:38:12] T365462
[08:38:13] T365462: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T365462
[08:38:33] that was due to the OVS maintenance yesterday
[08:38:35] dcaro: that was yesterday, no?
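For reference on what the liveness probe discussed above buys: Kubernetes polls the configured endpoint periodically and restarts the container once the probe has failed `failureThreshold` times in a row (3 by default), which would have automatically recycled the stuck tool-admin pod. A minimal sketch of that decision logic — not the actual Kubernetes or webservice code:

```python
def should_restart(probe_results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive probes have failed,
    mirroring how a Kubernetes livenessProbe decides to restart a container."""
    consecutive_failures = 0
    for probe_ok in probe_results:
        consecutive_failures = 0 if probe_ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True
    return False

# A single blip does not trigger a restart...
print(should_restart([True, False, True, True]))    # False
# ...but a pod stuck returning 503 from /healthz does.
print(should_restart([True, False, False, False]))  # True
```

Note that consecutive failures are what matter: intermittent flapping resets the counter, so only a persistently unhealthy pod gets restarted.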
I received a bunch of pages
[08:38:41] not sure if that one in particular
[08:38:53] * taavi looks at victorops
[08:39:10] ack, it's a kind of critical alert xd
[08:39:44] yeah, I see a matching MetricsinfraAlertmanagerDown page in victorops
[08:39:48] https://portal.victorops.com/ui/wikimedia/incident/4688/details yep :)
[10:19:06] does maintain-harbor today have support for individual tool quota increases (T365536 for example)?
[10:19:07] T365536: Request increased quota for video-answer-tool Toolforge tool - https://phabricator.wikimedia.org/T365536
[10:28:13] taavi: no, it has to be done manually
[10:29:22] still waiting on upstream: T352417
[10:29:23] T352417: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417
[10:30:00] blancadesal: ok, how do I do that manually? are there any instructions?
[10:31:59] taavi: I don't think there are. You need to do it through the Harbor UI -> Administration -> Project Quotas -> select and edit the project in question
[13:28:21] there are two certs in the private repo under /srv/private/modules/secret/secrets/ssl: labtestservices2001.wikimedia.org.key and labtestwikitech.wikimedia.org.key
[13:29:17] they were both committed by root in 2016, and it's my understanding that these were used before these test hosts were moved behind the main CDN
[13:29:46] I'll go remove these tomorrow unless anyone can think of a reason to keep them; if so, please let me know
[13:36:56] moritzm: those hosts no longer exist, so it should be safe to remove
[15:14:06] dcaro: FYI I think I found the right combo of fixture options to make the pytest go fast when not in recording mode
[15:15:01] moritzm: correction, labtestwikitech is still a thing, but I imagine it uses acme-chief certs nowadays
[15:18:38] yeah, it's behind the CDN so it uses the global cert bought from Globalsign
[15:23:41] Is there a service name for the 185.15.56.11 IP that maps to the active tools-proxy host?
I'm looking for something like the "proxy-eqiad1.wmcloud.org" service name that resolves to the public IP for project-proxy.
[15:26:59] bd808: I think `toolforge.org` resolves to 185.15.56.11, not sure if that would help you
[15:29:50] * arturo offline
[15:33:38] bd808: I would use any toolforge.org subdomain, like www.toolforge.org
[15:33:41] arturo: \o/
[15:39:17] arturo, taavi: ah, that sounds reasonable. `toolforge.org` does resolve to the desired IP
[15:46:40] I created a new wikireplicas diagram; reviews are welcome: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Overview_diagram
[16:50:26] dhinus: nice! I'll review tomorrow. The size of the text when embedded is a bit small (not sure if that can be changed; it's OK also as clicking on it zooms in)
[16:50:31] * dcaro off
[18:42:29] dcaro: I'm going to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 since I'm seeing May 22 17:30:34 cloudweb1003 puppet-agent[3512164]: (/Stage[main]/Openstack::Clientpackages::Bobcat::Bullseye/Openstack::Patch[/usr/lib/python3/dist-packages/openstack/config/loader.py]/Exec[apply /usr/lib/python3/dist-packages/openstack/config/loader.py.patch]/returns) 1 out of 1 hunk FAILED -- saving rejects to file /usr/lib/python3/dist-packages/openstack/config/loader.py.rej on many hosts
[18:45:34] OK, it worked on cloud controls :/, I'll check tomorrow
[21:03:39] FYI: https://phabricator.wikimedia.org/T365644
[21:23:49] meh
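On the "1 out of 1 hunk FAILED" puppet error at 18:42: `patch` rejects a hunk (writing a `.rej` file) when the context lines recorded in the diff no longer match the target file, which is typical when the file differs across hosts, e.g. because of a different package version of loader.py — consistent with the patch applying on the cloudcontrols but not on cloudweb1003. A toy illustration of that context check, not how `patch` or puppet's Openstack::Patch is actually implemented:

```python
def hunk_applies(target_lines, context_lines, offset):
    """A patch hunk only applies cleanly if the file still contains the
    exact context lines the diff was generated against, at the expected
    position; otherwise the hunk is rejected."""
    return target_lines[offset : offset + len(context_lines)] == context_lines

# Hypothetical file contents for illustration.
original = ["import os", "def load():", "    return CONFIG"]
drifted  = ["import os", "def load(path):", "    return read(path)"]
context  = ["def load():", "    return CONFIG"]

print(hunk_applies(original, context, 1))  # True: the hunk applies
print(hunk_applies(drifted, context, 1))   # False: 1 out of 1 hunk FAILED
```

(Real `patch` also tries nearby offsets and fuzz before giving up, but the principle is the same: no matching context, no apply.)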