[04:10:14] i'm getting random "Wikimedia Toolforge Error" on requests. no logs on my backend. bottom says "tools-k8s-haproxy-8.tools.eqiad1.wikimedia.cloud"
[08:45:24] derenrich: do you have a url you are getting them for? (to search logs, retry, etc)
[09:38:03] Hi, I'm unable to log into gitlab.wikimedia.org. Error message is simply: Could not authenticate you from OpenIDConnect because "".
[09:38:51] Clicking the button again just reloads the page with the same error message. Have tried both on my current IP (a Thai mobile ISP) and a NordVPN proxy thru the US
[09:40:47] no JS errors or anything in console
[09:43:33] Tamzin: you should probably ask in #wikimedia-gitlab, the admins are usually around there
[09:43:44] gotcha, thanks
[09:46:00] Tamzin: for what it's worth, I have the same problem :/ (after logging out)
[13:06:28] !log lucaswerkmeister@tools-bastion-15 tools.lexeme-forms deployed 34ee7bf7ac (l10n updates: cy)
[13:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[14:24:23] dcaro: it's only happening sporadically but this url fails https://best-of.toolforge.org/api/category/random but looking at my logs again i'm seeing a "Redis Client Error SocketClosedUnexpectedlyError: Socket closed unexpectedly" so maybe this is on my side
[14:24:50] I think we have a task somewhere trying to improve redis reliability
[14:26:00] but if you can recreate the connections when they time out/get closed that'd probably be a good improvement anyhow, and might get rid completely of that issue before we get to find out what's the issue there
[14:40:09] semi-relatedly is there a way for me to see redis memory pressure? i'm curious how often my keys will get evicted prematurely
[15:01:07] actually no i think the redis errors were unrelated. i'm seeing "Wikimedia Toolforge Error - Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again later."
even when no such logs are being produced. this also never happens when run locally
[15:49:13] !log admin set ceph cluster noout/norebalance and move cloudcephosd1048 to single nic - T399180
[15:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:49:21] T399180: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180
[15:58:11] !log admin set ceph cluster back to out and rebalance - T399180
[15:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:58:18] T399180: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180
[18:56:37] dcaro: looking at https://grafana.wmcloud.org/d/TJuKfnt4z/tool-dashboard?orgId=1&var-cluster=P8433460076D33992&var-namespace=tool-best-of&from=now-6h&to=now&timezone=utc i'm not seeing a 500s even though i'm manually triggering them by hammering refresh. so i think this is not on my end?
[20:50:50] !help
[20:50:50] If you don't get a response in 15-30 minutes, please email the cloud@ mailing list -- https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
[20:52:20] derenrich: I'm reading the backscroll...
[20:52:26] thanks!
[20:53:02] tl;dr if you hammer https://best-of.toolforge.org/api/category/random 10 times it'll error out. but that error doesn't appear in my logs
[20:59:07] derenrich: I suspect that the proxy in front of your tool is overwhelmed and the errors are from that. I'll look and see what I can find but there's probably not a lot you can do about it on your end.
[20:59:19] yeah that sounds right
[21:01:00] how long have you seen this behavior?
[21:02:56] so far I don't see any reason why the proxy would complain
[21:03:50] andrewbogott: the tool is only 5 days old or so. so since always
[21:04:27] at first i assumed it was my issue and delayed investigating because it was infrequent
[21:07:01] If I want to 'become' your tool and look around, what is the tool name?
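(Editor's note on the 20:53:02 and 20:59:07 messages: one way to back up the "errors come from the proxy, not the tool" theory is to tally responses by who appears to have produced them. The sketch below is illustrative only. The marker string is the error-page title quoted in the log; the names `classify`, `tally`, and `Sample` are hypothetical, and it assumes the tool's own error responses never contain that title.)

```typescript
// Marker text taken from the error page quoted earlier in the log.
const PROXY_MARKER = "Wikimedia Toolforge Error";

type Outcome = "ok" | "proxy_error" | "app_error";

// One observed HTTP response (hypothetical shape for this sketch).
interface Sample {
  status: number;
  body: string;
}

// Classify one response. Assumption: the tool's own error responses never
// contain PROXY_MARKER, so a match means the page came from the front proxy.
function classify(sample: Sample): Outcome {
  if (sample.body.indexOf(PROXY_MARKER) !== -1) return "proxy_error";
  if (sample.status >= 500) return "app_error";
  return "ok";
}

// Count outcomes across a batch of samples.
function tally(samples: Sample[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { ok: 0, proxy_error: 0, app_error: 0 };
  for (const s of samples) {
    counts[classify(s)] += 1;
  }
  return counts;
}
```

(Samples could be collected with any HTTP client, for example the `fetch` built into Node 18+, by requesting the URL in a loop and recording each status and body. A tally with `proxy_error` above zero and `app_error` at zero would match what the tool's logs and the Grafana dashboard showed.)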
[21:08:40] nm, found
[21:10:03] sorry was afk (pinging me is useful, i'm not used to checking irc frequently)
[21:10:52] hm zero logfiles
[21:12:24] there should be?
[21:12:36] > toolforge webservice logs -l 30
[21:12:36] returns things
[21:16:10] `toolforge webservice logs ` throws an error but that's a longstanding toolforge bug. i opened a ticket on that years ago
[21:18:38] https://phabricator.wikimedia.org/T362521
[21:19:33] (i'm going afk for a bit)
[21:49:37] derenrich: I'm not getting anywhere; let's wait and see if d.caro has better ideas when he's back
[21:50:03] Ok sounds good. No huge rush
[21:50:15] We're planning user testing but that's gonna take a few days
[22:51:56] derenrich: there is nothing at https://toolsadmin.wikimedia.org/tools/id/best-of or even in the tool's $HOME that helps find what code is running and how. That makes it more difficult to provide support.
[23:22:48] bd808: i added a link to the source in the admin ui. not sure what people typically people in HOME. this is a very new thing so there's not much context to provide
[23:23:47] it's a Cloud Native Buildpack app running node
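(Editor's note on the 14:26:00 suggestion to recreate Redis connections when they are closed: the log does not say which Redis client the tool uses. The error name `SocketClosedUnexpectedlyError` suggests node-redis v4, but that is an assumption. Under that assumption, a reconnect policy could be sketched as a capped exponential backoff. The function name `reconnectDelay` and the default values are hypothetical.)

```typescript
// Capped exponential backoff: 50 ms, 100 ms, 200 ms, ... up to maxMs.
// `retries` is the number of reconnect attempts made so far (0 on the first).
function reconnectDelay(retries: number, baseMs: number = 50, maxMs: number = 2000): number {
  return Math.min(baseMs * Math.pow(2, retries), maxMs);
}

// Assumed wiring for node-redis v4 (not confirmed by the log):
//
//   const client = createClient({
//     url: redisUrl, // hypothetical variable
//     socket: { reconnectStrategy: (retries) => reconnectDelay(retries) },
//   });
//   client.on("error", (err) => console.error("Redis Client Error", err));
//
// The strategy is wrapped in an arrow function because node-redis passes the
// failure cause as a second argument, which must not land in `baseMs`.
```

(node-redis v4 ships with its own default reconnect strategy, so an explicit one mainly lets the tool choose how long to wait between attempts. As the 15:01:07 message notes, the Redis errors turned out to be unrelated to the proxy error page.)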