[10:31:43] !log tools.phpunit-results-cache add tools.wmde-wd-team (T378797)
[10:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.phpunit-results-cache/SAL
[10:31:46] T378797: [SPIKE] Use PHPUnit test results cache timing data to distribute tests in parallel runs - https://phabricator.wikimedia.org/T378797
[11:33:08] Hi, wikibugs seems to be down
[11:43:10] paladox: all of its logs look happy but I'm restarting...
[11:43:21] thanks!
[11:44:24] ...did that do anything?
[11:46:48] It hasn't joined #wikimedia-dev yet
[11:47:33] The last quit I can see from it was bouncer quit
[11:47:41] andrewbogott: can you check its bouncer too?
[11:48:09] I don't think I know what/where that is
[11:48:41] andrewbogott: what did you restart?
[11:48:46] https://sal.toolforge.org/tools.wikibugs suggests there’s irc and znc
[11:49:17] znc should be the bouncer (compare https://wikitech.wikimedia.org/wiki/Tool:Containers#BNC_container)
[11:49:29] lucaswerkmeister: just 'become wikibugs' and then 'webservice restart'
[11:50:21] toolforge jobs restart znc would probably be my guess from SAL
[11:51:30] there we go
[11:52:50] yay, thanks
[11:52:54] can you log the restarts you did to the SAL?
[11:53:13] sure
[11:53:28] !log integration rebooting integration-puppetserver-01, it's showing DNS failures
[11:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL
[11:54:01] !log tools.wikibugs 'webservice restart' and 'toolforge jobs restart znc' and 'toolforge jobs restart irc' because it vanished from IRC
[11:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[13:42:47] !log admin rebooting proxy-04.project-proxy for the ceph OSD mishap
[13:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:16:22] I was using the Gitlab pipelines to run some tests, but they seem to have been downgraded to less memory lately? The OOM killer gets them every time.. Was there a change I missed?
[15:16:23] https://gitlab.wikimedia.org/kristbaum/como/-/jobs/438412
[16:00:14] kristbaum: There haven't been any notable changes to the gitlab runner configurations. You could try adding the `memory-optimized` tag to your jobs to ensure they run on runners that provide more memory per job.
[17:28:25] !log lucaswerkmeister@tools-bastion-13 tools.quickcategories toolforge envvars create TOOL_EXPECTED_DATABASE_ERROR 'The tool is temporarily non-functional due to necessary database maintenance.' && webservice restart
[17:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[19:44:13] oh oops, I forgot to quote the link in that quickcategories command ^^
[19:44:17] * lucaswerkmeister checks
[19:44:45] ok, looks like it should still work. yay for HTML5 \o/
[19:45:50] ah, no, I did quote it, the quotes just got lost when I ran `dologmsg "toolforge … 'The tool is… .' && webservice restart`
[19:45:55] should’ve escaped ther
[19:45:57] *them
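A minimal sketch of the quoting pitfall described above (the variable name and URL below are hypothetical, not the actual quickcategories value): single quotes inside a double-quoted dologmsg argument are passed through literally, but embedded double quotes (for example around an HTML href) are consumed by the shell unless they are escaped, so they vanish from the logged message.

    # Hypothetical illustration -- in this broken form the inner double quotes
    # around the href close and reopen the outer string, so they are eaten:
    #   dologmsg "toolforge envvars create TOOL_MSG '<a href="https://example.org">maintenance</a>' && webservice restart"
    # Escaping them keeps the quotes intact in the message that reaches the SAL:
    dologmsg "toolforge envvars create TOOL_MSG '<a href=\"https://example.org\">maintenance</a>' && webservice restart"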
[20:48:02] !log wikistats hard rebooting instance wikistats-bookworm from horizon
[20:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats/SAL
[21:15:28] so I just tried to log in to https://horizon.wikimedia.org/, logged in to IDP, then it redirected to https://openstack.eqiad1.wikimediacloud.org:25000/protected which results in 400 Bad Request
[21:17:53] now that happens every time I navigate to https://horizon.wikimedia.org/
[21:17:57] hm, that looks like roughly the right redirect to me as far as I remember it from the past few months
[21:18:04] (there should be a couple more URL parameters though)
[21:19:27] yeah I just tried it and after ca. 15 seconds of redirects and “submitting…” it eventually sent me to a working horizon
[21:20:28] (also apparently it’s not normal query URL parameters, it’s a hash fragment – /protected#access_token=blah&token_type=etc.)
[21:22:31] yeah it does that, the URL with the hash fragment gets a 200 but according to Firefox it has no headers, request, or response payload, then loads just the bare /protected and 400s
[21:22:48] ah, it's a POST that's 400ing
[21:23:06] the POST is a 302 for me
[21:24:36] noScript enabled, by any chance? (or something like uBlock Origin, but I don’t think I changed anything in there, whereas if I had to enable anything in noScript then I’ve probably forgotten it by now)
[21:25:20] AntiComposite, lucaswerkmeister: as luck would have it andrewbogott has been trying to figure out some of that Horizon->Keystone->IDP->Keystone->Horizon behavior today and has what seems to be an important finding and possible fix in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1116868
[21:25:36] uBO was on, but doesn't seem to change anything
[21:26:41] I have seen things get really weird myself like AntiComposite is reporting. Deleting all of my cookies for Horizon and IDP felt like it got back to a functional state.
[21:27:05] yeah I'll give that a shot
[21:27:10] like there is some way to get the bad state "stuck" in a session or other cookie
[21:27:48] the haproxy injecting chaos into the system is the interesting finding
[21:29:25] give me 5 minutes and I'll roll out this fix
[21:31:05] AntiComposite: try now?
[21:33:06] still 400s
[21:33:45] (that's after nuking cookies and logging out from IDP)
[21:34:05] ok, so my guess is that your ISP or router is blocking port 25000
[21:34:18] since you're describing the same behavior I get here with my wifi
[21:35:13] and when I switch to using my phone hotspot it works
[21:35:34] So... I will open another bug about this. If you're able to tell me where your port access cuts off that will be a useful data point for me
[21:37:10] My router seems to be a Nokia G-2425G-B, just in case that's the same as yours :(
[21:38:34] I guess it's just as likely my ISP as my router...
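For the question above about where port access cuts off, one way to check from the affected network whether the Keystone port is reachable at all (a rough sketch; exact netcat flags vary between implementations):

    # Does a TCP connection to the Keystone endpoint on port 25000 succeed?
    nc -vz openstack.eqiad1.wikimediacloud.org 25000
    # Or let curl attempt the TLS handshake; a "Connected to ..." line in the
    # verbose output means the port itself is not being blocked:
    curl -sv -o /dev/null https://openstack.eqiad1.wikimediacloud.org:25000/ 2>&1 | head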
[21:39:57] tried mobile hotspot, didn't work, nuked cookies and logged out again, now I've got a different looking error
[21:40:03] You've hit an OpenID Connect Redirect URI with no parameters, this is an invalid request; you should not open this URL in your browser directly, or have the server administrator use a different OIDCRedirectURI setting.
[21:40:25] that's sort of promising, try again in a private browser tab?
[21:40:35] (for science, obviously that doesn't help you a lot in the long run)
[21:40:37] that's what I get when visiting openstack.eqiad1.wikimediacloud.org directly without Horizon sending me there
[21:40:50] yeah, by hitting 'reload' in the browser after a failed attempt
[21:41:32] private browser tab does work
[21:41:58] ok, with mobile hotspot but not with wifi?
[21:42:25] tried it again on normal wifi, also works in the private tab
[21:43:06] It probably would be nice to have openstack.eqiad1.wikimediacloud.org's Keystone on port 443 instead of 25000 in the long run. We know that lots of folks are behind firewalls that like to restrict outbound port ranges. (A classic IRC & Gerrit from school networks problem)
[21:44:27] AntiComposite: oooh well now I don't know what's happening anymore
[21:44:28] bd808: https://phabricator.wikimedia.org/T385527
[21:44:47] 443 is already taken (object storage, which also needs to be served to browsers)
[21:45:02] let me try restarting firefox, see if it's cached a redirect wrong somewhere
[21:45:32] AntiComposite: when you switched to normal wifi and tried again did you re-log in?
[21:45:46] yes
[21:45:54] ok then I remain confused
[21:46:17] but you have successfully demonstrated that the issue is on your end, somehow :p
[21:46:54] alright, restarting didn't fix anything
[21:47:33] hm
[21:47:39] andrewbogott: hmmm... could we easily put a different service name on either the S3 gateway or Keystone?
[21:47:45] it pretty much has to be a cookie someplace, I can't think what else it would be
[21:47:53] bd808: yes, new service name for S3 is probably the way to go
[21:48:44] But I will probably wait until there are at least 2 users with the issue (considering that right now the 1 user is me and I have a workaround)
[21:49:28] Also moving s3 to a different url seems pretty serious, since the world may be full of hrefs with that endpoint already in it
[21:50:14] then move keystone? It should be mostly internal I would think
[21:50:46] the same IP + port are fine, it should just be a name-based vhost thing
[21:50:54] moving keystone is less 'correct' but also much less disruptive.
[21:51:22] and we can just add another front end to the one we've got rather than actually move it
[21:52:14] ah, it appears to have been caused by my user-agent switcher extension
[21:52:38] ACN is a chaos monkey! :)
[21:54:11] Hey team, we (the Catalyst team) are trying to permanently set our fs.inotify.max_user_instances on a CloudVPS VM. We tried adding "fs.inotify.max_user_instances = 512" in /etc/sysctl.conf, but that didn't seem to work. We have a hunch that we may need to set this in puppet? But we're (a) not sure if that's strictly true and (b) not sure how to do it
[21:55:52] new problem: now I can't log *out* of Horizon, even if I log out from IDP it just 302s from /auth/logout/ to /auth/login/ and logs me back in again
[21:56:34] kindrobot: The puppet way is with ::sysctl::parameters. And it looks like Puppet has rules to recursively manage /etc/sysctl.d which would mean that yes each Puppet run wipes out manual changes there.
[21:56:46] AntiComposite: yeah, that's known.
[21:57:15] The post-logout redirect puts you back into the login redirect flow.
[21:57:39] The only way to break the cycle is to first log out from IDP
[21:58:14] !log cvn Hard reboot cvn-app10 from Horizon per https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/message/ZOCPVXX6BKLC76OHMIQW26YLBCKEBTGQ/
[21:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn/SAL
[21:58:30] bd808, I did log out from IDP, it redirects anyway
[21:59:16] It would go horizon->keystone->idp but should stop there if you are not already authenticated to idp, correct?
[22:02:05] doesn't seem to go back to IDP, goes horizon.wikimedia.org/auth/login to https://openstack.eqiad1.wikimediacloud.org:25000/v3/auth/OS-FEDERATION/websso/openid?origin=https://horizon.wikimedia.org/auth/websso/ to /auth/websso/ (POST) back to horizon.wikimedia.org/ logged-in
[22:02:34] interesting. that's keystone remembering your session too then
[22:02:44] kindrobot: are you currently applying any other puppet config to those hosts? I don't see a great way to modify that setting without writing a new puppet role or similar.
[22:03:03] (although on the other hand I don't see where puppet would be restoring it either)
[22:04:13] bd808: grepping for max_user_instances doesn't turn up any general defaults in the puppet repo unless there's some magic 'restore everything to unspecified system defaults' thing someplace
[22:05:10] andrewbogott: the `recurse => true,` in the file{'/etc/sysctl.d'} resource in ::sysctl is what takes full ownership of all config files there.
[22:05:27] !log cvn Hard reboot cvn-apache10 from Horizon per https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/message/ZOCPVXX6BKLC76OHMIQW26YLBCKEBTGQ/
[22:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn/SAL
[22:05:45] ah, so it's not setting it, it's deleting it
[22:05:50] so any manual changes to files in that directory including adding new files will be wiped out by the next puppet run
[22:05:53] yeah
[22:06:17] would puppet leave files in /usr/local/lib/sysctl.d/ alone? (systemd-sysctl should read those…)
[22:06:21] so it might make sense to add some hiera-configurable bits to that, or at least a hiera list of 'ignore these settings'
[22:06:31] lucaswerkmeister: worth a try!
[22:06:31] kinda goes against the “system config goes in /etc” thing but might be an easier workaround than puppet or hiera 🤷
[22:06:55] (kindrobot: ^^)
[22:06:57] probably easier just to make some thin profile that folks can apply
[22:07:36] kindrobot: you could also jump into the deep end and run your VMs without puppet at all. That would mean having to pass around ssh keys out of band, though, among other lost features.
[22:07:56] I don't think we want to go that deep yet ;)
[22:08:05] or move to a Magnum managed cluster
[22:08:30] That is something we are considering
[22:08:49] or use modules/profile/manifests/wmcs/kubeadm/worker.pp which sets it to 1024 :D
[22:09:14] I need to start cooking dinner, but kindrobot if you make me a task with the specific thing(s) you need set I can write a bit of puppet code.
[22:09:30] OK, I can do that.
[22:10:09] I'm also curious what in the world could be resetting it from /etc/sysctl.conf
[22:41:36] andrewbogott: done https://phabricator.wikimedia.org/T385530
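A minimal sketch of the stop-gap floated above, assuming Puppet really does leave /usr/local/lib/sysctl.d/ unmanaged (untested; the drop-in file name is arbitrary) and pending the proper Puppet profile tracked in T385530:

    # Persist the setting somewhere systemd-sysctl reads but Puppet's
    # recursively managed /etc/sysctl.d does not cover:
    echo 'fs.inotify.max_user_instances = 512' | sudo tee /usr/local/lib/sysctl.d/90-inotify.conf
    # Apply it now without waiting for a reboot:
    sudo systemctl restart systemd-sysctl
    # Confirm the running value:
    sysctl fs.inotify.max_user_instances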