[07:16:15] <dcaro>	 morning
[08:18:00] <taavi>	 good morning
[08:28:55] <arturo>	 morning
[09:01:41] <taavi>	 dcaro: (re yesterday) by "web application" do you mean something that exposes a HTTP(S) API, or something that exposes an interface that is meant to be used in a browser?
[09:06:41] * dcaro reading backlog
[09:07:49] <dcaro>	 aahhh yep, that's what I meant, as in an example on how the authentication flow would work with keystone for a user using a web ui
[09:08:07] <dcaro>	 (or equivalent)
[09:08:49] <taavi>	 hmmmm
[09:08:57] <dcaro>	 the equivalent of https://apereo.github.io/cas/7.0.x/protocol/CAS-Protocol.html#webflow-diagram would be awesome
[09:09:48] <dcaro>	 though that's for sso with CAS
[09:11:24] <dcaro>	 there's an unrelated javascript project called keystonejs that messes up the search a bit xd
[09:11:46] <taavi>	 so right now our keystone install is using ldap directly, and so for example horizon.wikimedia.org just has an username/password prompt instead of using idp.wikimedia.org
[09:12:22] <dcaro>	 yep
[09:13:12] <taavi>	 keystone's equivalent of using an external SSO thing seems to be https://docs.openstack.org/keystone/latest/admin/federation/introduction.html
[09:14:10] <taavi>	 but i have no idea how easy that is to configure in parallel to the existing LDAP integration
[09:15:02] <dcaro>	 so your idea there is to move keystone to use idp, then use idp as the auth entry point right?
[09:15:45] <taavi>	 i think yes, eventually
[09:16:40] <taavi>	 i don't think that should happen entirely before idp has (self-service) 2fa support
[09:17:03] <taavi>	 but eventually it'd be great if horizon / keystone did not users process ldap passwords directly
[09:18:22] <dcaro>	 hmmm, how does that work with keystone application credentials and such?
[09:18:53] <dcaro>	 I guess that those will only be allowed on keystone side? (as in falling back to the keystone-based auth and skipping the idp completely)
[09:20:51] <taavi>	 app credentials would keep working as-is for things that need to authenticate against keystone in a non-interactive way. the bigger question to me are the various scripts that currently authenticate with ldap credentials, so at least novaobserver based stuff and acme-chief things (plus probably some others that I'm forgetting)
[09:24:03] <dcaro>	 those would probably need to do the idp dance, unless we keep the ldap auth backend somehow as a fallback
[09:25:09] <dcaro>	 about the app credentials and such, I was thinking of T358496
[09:25:11] <stashbot>	 T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496
[09:27:47] <dcaro>	 there's a bunch of alerts coming up from ceph nodes, loooking
[09:31:22] <dcaro>	 https://www.irccloud.com/pastebin/BQxcsliR/
[09:31:32] <dcaro>	 ceph (the cluster) seems happy though
[09:32:47] <arturo>	 weird error, no?
[09:33:05] <dcaro>	 yep
[09:34:47] <dcaro>	 only on some cloudcephosd so far
[09:36:13] <arturo>	 was there any package update recently?
[09:37:37] <dcaro>	 yesterday, only less
[09:37:43] <dcaro>	 (the pager)
[09:38:37] <arturo>	 now the error seems to be expanding across the fleet
[09:39:16] <dcaro>	 I've seen it only on 6 hosts so far in the alerts, do you see it somewhere else?
[09:39:42] <dcaro>	 (all cloudcephosd1*)
[09:39:59] <arturo>	 no, my panel just went from 1 host to these 6
[09:41:03] <dcaro>	 ah, okok
[09:41:15] <dcaro>	 so far it's stable on 6 for me
[09:42:32] <dcaro>	 restarting the service brings it up again
[09:43:19] <arturo>	 maybe I will open a phab ticket to keep a record of this weird event?
[09:43:51] <dcaro>	 sure, they all seemed to fail at the same time
[09:45:27] <arturo>	 T364376
[09:45:28] <stashbot>	 T364376: cloudcephosd: user@0.service is in failed status - https://phabricator.wikimedia.org/T364376
[09:45:43] <taavi>	 whoever has added the /srv/git/ repos to root's git safe.directory and then keeps breaking the file permissions on metricsinfra-puppetserver-1, please do not do that
[09:45:43] <taavi>	 https://wikitech.wikimedia.org/wiki/Help:Project_puppetserver#Interacting_with_the_local_Git_clones
[09:52:58] <dcaro>	 ^you might want to announce that in the team meeting
[10:12:39] * arturo brb
[10:44:01] * dcaro food
[12:29:15] <arturo>	 dcaro: emergecy review: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/83
[12:29:20] <arturo>	 emergency*
[12:29:37] <arturo>	 cc taavi
[12:31:19] <dcaro>	 arturo: done
[12:32:43] <arturo>	 thanks
[12:32:46] <arturo>	 deploying
[12:46:01] <dhinus>	 I noticed I pushed a wrong git tag to jobs-cli because of GitLab "squash & rebase" behaviour. I pushed my local git tag, which was different from the merged commit in origin/main https://phabricator.wikimedia.org/P61982
[12:46:18] <dhinus>	 I've added a note about this in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Packaging#Creating_a_release_in_GitLab
[12:48:00] <arturo>	 :-( gitlab
[12:49:32] <dhinus>	 I wonder if removing the "squash" option will result in a fast-forward merge, and avoid this problem
[12:49:45] <dhinus>	 but it's still way to easy to shoot yourself in the foot :/
[12:50:27] <arturo>	 there is at least one thing I dislike about the squash option: if you use it on a single-commit MR (so, nothing to squash really) it will replace the commit tittle and message :-(
[12:50:53] <dhinus>	 and the commit hash too, because it creates a new one
[12:51:22] <dhinus>	 I guess I'll stick to single-commit MR, and not check the squash option
[12:51:35] <arturo>	 ideally no squash button would show up on single-commits MRs
[12:52:06] <arturo>	 taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/36 I believe you did a cleanup related to this one, when dropping support for the grid
[12:55:00] <taavi>	 arturo: looking
[12:55:10] <dhinus>	 arturo: this was implemented, not sure if it was reverted later on, or if it's en enterprise-only feature :facepalm: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/1213 
[12:55:56] <arturo>	 https://usercontent.irccloud-cdn.com/file/D0ym1O8K/image.png
[12:56:29] <dhinus>	 yep, I'm confused if that means "only in enterprise", but probably yes
[12:56:41] <dhinus>	 it's also from 7 years ago so they probably changed the design 5 times since :D
[12:57:03] <dhinus>	 the "modify commit message" checkbox was nice but I don't think it's there anymore
[12:58:07] <taavi>	 arturo: approved, but please see my comment for the future
[12:58:49] <arturo>	 taavi: what prefix would you use for the commit title?
[12:59:43] <arturo>	 thanks for the review :-)
[13:00:13] * arturo food time
[13:00:42] <taavi>	 arturo: in that case, you're touching a specific module, so that (`backends: kubernetes: `?), but if it's touching the entire repository just do not use a component in the subject at all (instead of adding the repository name as component which does not give any extra detail that's not already present in the other commit/MR metadata)
[13:17:34] <dcaro>	 dhinus: you can use multiple commits as long as you don't squash, the hashes will be kept
[13:21:41] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/84
[13:25:44] <dhinus>	 dcaro: good, I had a single commit and I checked the "squash" checkbox out of habit, expecting it to be a no-op
[14:35:04] <arturo>	 dcaro: LGTM
[15:03:23] <arturo>	 there is an alert
[15:03:24] <arturo>	 Failed to update Puppet repository /srv/git/operations/puppet on instance toolsbeta-puppetserver-1 in project toolsbeta
[15:04:00] <taavi>	 this is me
[15:04:04] <arturo>	 ok
[15:04:07] <taavi>	 please ignore for now
[15:04:23] <arturo>	 I acked in alertmanager
[15:04:27] <taavi>	 arturo: I think you need to manually 'take' the shift on victorops to get the alerts for the drill, https://help.victorops.com/knowledge-base/manual-take-call/
[15:04:36] <arturo>	 ok
[15:05:23] <arturo>	 I just did
[15:05:44] <arturo>	 cc Rook I took your victorops shift for the purpose of the alert drill
[15:06:12] <Rook>	 Ok
[15:07:03] * andrewbogott can't stand the suspense
[15:07:20] <arturo>	 we are having a calm afternoon so far
[15:10:43] <dcaro>	 🎵 lalalaralara... 🎵
[15:16:58] <taavi>	 so the thing I'm trying to trigger has a `for: 5m` rule... so give it a few more minutes :/
[15:17:28] <arturo>	 ok :-)
[15:18:45] <arturo>	 I see
[15:18:46] <arturo>	 [2x] InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-worker-10 is down 
[15:19:37] <arturo>	 now ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_redirects_ip4)
[15:19:52] <arturo>	 wow
[15:19:53] <arturo>	 Kubernetes cluster k8s.toolsbeta.eqiad1.wikimedia.cloud:6443 almost out of cpu
[15:20:00] <arturo>	 definitely concerning, first time I see this alert
[15:20:43] <arturo>	 and this one did page my phone
[15:20:53] <arturo>	 so I declare there is an incident!
[15:21:08] <dhinus>	 https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_Response_Process#If_you%E2%80%99ve_been_paged
[15:21:09] <dcaro>	 we should add the incident process page to the sidebar here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity
[15:21:20] <dhinus>	 dcaro: nice idea!
[15:21:33] <andrewbogott>	 Would someone other than me like to be coordinator?  (Since I was last time)
[15:23:08] <arturo>	 andrewbogott: I'll do it
[15:23:21] <dcaro>	 added there :)
[15:23:31] <andrewbogott>	 'k.  Are people in video chat?
[15:23:42] <arturo>	 I don't think we are
[15:23:47] <dcaro>	 I'm not
[15:23:59] <arturo>	 I'm starting the incident status doc
[15:24:19] <dhinus>	 we can use https://meet.google.com/mod-iwwv-pcb
[15:24:44] <arturo>	 https://docs.google.com/document/d/1ghYr0hpSkhwulumvKhSdYe5aiIuIt3PncaZPo2lRECI/edit
[15:24:50] <dcaro>	 arturo: I think you don't have the proper rights to set the status, maybe the cloak?
[15:25:05] <taavi>	 i'll sort that out
[15:25:12] <dcaro>	 thanks :)
[15:43:20] <arturo>	 incident resolved!
[15:43:30] <andrewbogott>	 good job everyone except taavi
[15:43:48] <taavi>	 :(
[15:44:45] <dcaro>	 I think he did a great job xd
[15:45:10] * taavi removes the manual hack from toolsbeta-prometheus-1 to disable any further pages from there
[15:45:22] <taavi>	 arturo: remember to remove the manual override from victorops
[15:45:30] <arturo>	 ok
[15:47:34] <dhinus>	 these drills are so interesting that I'm tempted to do more :D
[15:48:21] * dhinus has to log off for today
[15:48:35] <dhinus>	 victorops just went to green
[15:48:37] <taavi>	 one thing I was wondering beforehand is whether you'd treat that as a DoS/security incident or just a regular one
[15:48:56] <dhinus>	 I can see "Resolved by: SYSTEM" in the incident log in victorops https://portal.victorops.com/ui/wikimedia/incident/4656/details
[15:49:02] <dhinus>	 so I think it just took a while?
[15:51:06] <dcaro>	 taavi: good question, probably start as "regular misbehavior" and move to "DoS" if it keeps happening (or if the user is not an admin of the tools or such). Though I'd be prone to move to DoS as preventive measure, and de-escalate after. Probably the latter makes more sense
[15:53:02] <dcaro>	 logstash has a bit of a delay, I don't see yet the alerts that fired
[15:55:09] <taavi>	 hmmmmmm i don't think things from metricsinfra make it there
[16:12:34] <dcaro>	 good point :)
[16:14:16] * arturo offline