[07:07:39] * arturo online
[07:07:58] <dcaro>	 morning
[07:08:21] <arturo>	 o/
[07:23:00] * dcaro reading backlog
[07:24:26] <blancadesal>	 morning
[07:24:28] <dcaro>	 Roo.k: about lima-kilo ssh keys, it uses ssh to log-in into the VM it creates, so it's using your default ssh keys to do so, I'm guessing it's picking up a pass-protected ones (whatever your ssh config says).
[07:36:55] <blancadesal>	 dcaro: would you mind taking a quick look at this to check if it's what you had in mind? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93/diffs?commit_id=1bddeedbf1242e2c764d1035dcbdcc9e0533d973#9f206f6803bf339f97faf108ad9d54b5bbf46633
[07:37:37] <dcaro>	 sure
[07:38:12] <taavi>	 what's up with the tools alerts that have filled up my inbox overnight?
[07:40:26] <arturo>	 taavi: I don't have them. Where did they come from? a list or something?
[07:41:08] <taavi>	 yes, cloud-admin-feed@lists.wm.o, also in #wikimedia-cloud-feed on the irc side
[07:41:23] <arturo>	 ok
[07:41:26] <arturo>	 I see them now
[07:46:06] <dcaro>	 I was trying to catch up first (in case it was expected, as the site is working for me), I'm guessing they are not xd
[07:46:35] <arturo>	 taavi: running `curl https://admin.toolforge.org/healthz` by hand gives me OK and 503 alternatively
[07:47:07] <arturo>	 however pods are reported to be healthy
[07:49:02] <arturo>	 the 503 seems like it is reported by the ingress-nginx, given it shows a fourohfour-produced message, as if no webservice was running
[07:49:17] <arturo>	 so maybe this is an indication of ingress-nginx having troubles
[07:54:31] <arturo>	 I'm checking ingress-nginx, but I don't see anything weird
[07:55:05] <dcaro>	 I see the 503 on the tools-proxy access.log side, but not seeing anything in the error logs
[07:58:07] <arturo>	 maybe not related, but I see the ingress-nginx deploy has
[07:58:12] <arturo>	     Limits:
[07:58:12] <arturo>	       cpu:     2
[07:58:12] <arturo>	       memory:  3G
[07:58:13] <arturo>	     Requests:
[07:58:13] <arturo>	       cpu:      1
[07:58:13] <arturo>	       memory:   2G
[07:58:21] <arturo>	 which feels a bit too small?
[07:58:45] <arturo>	 even if we have 3 pod replicas
[08:02:47] <taavi>	 how much is it using in actual usage?
[08:03:22] <dcaro>	 https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-tools&var-namespace=ingress-nginx-gen2
[08:03:53] <dcaro>	 kind of half the request
[08:03:59] <dcaro>	 on memory, and 1/10th of cpu
[08:04:13] <dcaro>	 (with peaks that don't reach the request)
[08:13:06] <dcaro>	 kind of sidebar, conntrack is almost full on cloudvirt1042? https://phabricator.wikimedia.org/T365540 is it related to the OVS swap?
[08:13:33] <taavi>	 shouldn't be, but i'll have a look
[08:20:40] <dcaro>	 thanks!
[08:28:38] <XioNoX>	 hello, cloudvirt1041 is alerting in Netbox with "Device is Active in Netbox but is missing from PuppetDB (should be ('decommissioning', 'inventory', 'offline', 'planned', 'staged', 'failed'))"
[08:29:56] <taavi>	 XioNoX: hey, that is T364984, I will set it to failed
[08:29:57] <stashbot>	 T364984: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984
[08:30:17] <XioNoX>	 thx!
[08:30:46] <XioNoX>	 we need some automation there, but we can't figure out what's best to do
[08:34:02] <arturo>	 I just created T365562
[08:34:02] <stashbot>	 T365562: toolforge: admin tool /healthz returns 503 from time to time - https://phabricator.wikimedia.org/T365562
[08:34:05] <taavi>	 arturo: dcaro: seems like one of the tool-admin pods is not happy: https://phabricator.wikimedia.org/P62847
[08:35:01] <arturo>	 taavi: good finding. Seems like an excellent candidate for a liveness prove
[08:35:35] <taavi>	 given that the tool already has a /healthz endpoint, I will just configure that in webservice
[08:35:54] <arturo>	 yes
[08:37:57] <dcaro>	 was anyone paged by https://phabricator.wikimedia.org/T365462 ?
[08:38:12] <dcaro>	 T365462
[08:38:13] <stashbot>	 T365462: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T365462
[08:38:33] <taavi>	 that was due to the OVS maintenance yesterday
[08:38:35] <arturo>	 dcaro: that was yesterday, no? I received a bunch of pages
[08:38:41] <arturo>	 not sure if that one in particular
[08:38:53] * taavi looks at victorops
[08:39:10] <dcaro>	 ack, it's a kind of critical alert xd
[08:39:44] <taavi>	 yeah, I see a matching MetricsinfraAlertmanagerDown page in victorops
[08:39:48] <dcaro>	 https://portal.victorops.com/ui/wikimedia/incident/4688/details yep :)
[10:19:06] <taavi>	 does maintain-harbor today have support for individual tool quota increases (T365536 for example)?
[10:19:07] <stashbot>	 T365536: Request increased quota for video-answer-tool Toolforge tool - https://phabricator.wikimedia.org/T365536
[10:28:13] <blancadesal>	 taavi: no, it has to be done manually
[10:29:22] <blancadesal>	 still waiting on upstream: T352417
[10:29:23] <stashbot>	 T352417: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417
[10:30:00] <taavi>	 blancadesal: ok, how do I do that manually? are there any instructions?
[10:31:59] <blancadesal>	 taavi: I don't think there are. you need to do it through the harbor UI -> administration -> project quotas -> select and edit the project in question
[13:28:21] <moritzm>	 there are two certs in the private repo under /srv/private/modules/secret/secrets/ssl: labtestservices2001.wikimedia.org.key and labtestwikitech.wikimedia.org.key 
[13:29:17] <moritzm>	 they were both commited by root in 2016 and it's my understanding that these were used before these test hosts were moved behind the main CDN
[13:29:46] <moritzm>	 I'll go remove these tomorrow unless anyone can think of a reason to keep them, if so please let me know
[13:36:56] <arturo>	 moritzm: those hosts no longer exists, so it should be safe to remove
[15:14:06] <arturo>	 dcaro: FYI I think I found the right combo of fixture options to make the pytest go fast when not in recording mode
[15:15:01] <arturo>	 moritzm: correction, labtestwikitech is still a thing, but I image that it uses acme-chief certs nowadays
[15:18:38] <moritzm>	 yeah, it's behind the CDN so uses the global cert bought from Globalsign
[15:23:41] <bd808>	 Is there a service name for the 185.15.56.11 IP that maps to the active tools-proxy host? I'm looking for something like the "proxy-eqiad1.wmcloud.org" service name that resolves to the public IP for project-proxy.
[15:26:59] <arturo>	 bd808: I think `toolforge.org` resolves to 185.15.56.11, not sure if that would help yoyu
[15:27:01] <arturo>	 you*
[15:29:50] * arturo offline
[15:33:38] <taavi>	 bd808: I would use any toolforge.org subdomain, like www.toolforge.org
[15:33:41] <dcaro>	 arturo: \o/
[15:39:17] <bd808>	 arturo, taavi: ah, that sounds reasonable. `toolforge.org` does resolve to the desired IP
[15:46:40] <dhinus>	 I created a new wikireplicas diagram, reviews are welcome: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Overview_diagram
[16:50:26] <dcaro>	 dhinus: nice! I'll review tomorrow, the size of the text when embedded is a bit small (not sure if that can be changed, it's ok also as clicking on it zooms in)
[16:50:31] * dcaro off
[18:42:29] <taavi>	 dcaro: I'm going to rvert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 since I'm seeing May 22 17:30:34 cloudweb1003 puppet-agent[3512164]: (/Stage[main]/Openstack::Clientpackages::Bobcat::Bullseye/Openstack::Patch[/usr/lib/python3/dist-packages/openstack/config/loader.py]/Exec[apply
[18:42:29] <taavi>	 /usr/lib/python3/dist-packages/openstack/config/loader.py.patch]/returns) 1 out of 1 hunk FAILED -- saving rejects to file /usr/lib/python3/dist-packages/openstack/config/loader.py.rej on many hosts
[18:45:34] <dcaro>	 Ok, it worked on cloud controls :/, I'll check tomorrow 
[21:03:39] <RhinosF1>	 FYI: https://phabricator.wikimedia.org/T365644
[21:23:49] <bd808>	 meh