[03:58:45] Grr. login-buster.toolforge.org seems broken, and I still don't have a good way to restart my bot without it.
[04:08:29] Seems other stuff is broken too. "ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes."
[04:14:24] Seems like networking is borked. "ERROR: TjfCliError: Unknown error (HTTPSConnectionPool(host='k8s.tools.eqiad1.wikimedia.cloud', port=6443): Max retries exceeded with url: /apis/batch/v1/namespaces/tool-anomiebot/jobs?labelSelector=toolforge%3Dtool%2Capp.kubernetes.io%2Fmanaged-by%3Dtoolforge-jobs-framework%2Capp.kubernetes.io%2Fcreated-by%3Danomiebot%2Capp.kubernetes.io%2Fcomponent%3Djobs%2Capp.kubernetes.io%2Fname%3Danomiebot-4 (Caused by
[04:14:24] NameResolutionError(": Failed to resolve 'k8s.tools.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")))" and other errors.
[04:53:03] anomie: I'm looking, not sure about the dns thing because I was thinking this was an nfs issue
[04:56:25] !log admin 'systemctl restart nfs-server' on tools-nfs-2.tools.eqiad1.wikimedia.cloud
[04:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[05:04:42] anomie: the login issue should be resolved, I'm curious as to whether you're still seeing dns issues.
[05:04:44] Login issue was T380827
[05:05:05] T380827: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827
[05:40:45] !log admin rebooting tools-sgebastion-10.tools.eqiad1.wikimedia.cloud to get NFS things remounted
[05:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[05:58:53] hello! reporting the following error on a simple command https://www.irccloud.com/pastebin/hNsZhJN8/
[06:03:01] !help is this a service outage? ^ I'm noticing other ToolForge-hosted bots haven't edited recently.
[06:03:01] Sorry, you are not authorized to perform this
[06:03:09] Very helpful :P
[06:03:53] Tamzin: there was a brief outage, I think it's resolved
[06:03:57] but I'll double-check your dns issue
[06:04:04] ah it's working now on my end
[06:04:54] now the question becomes, was the issue i was there to fix because of that too, or something else? off to the bugfixing mines I guess :P
[06:08:05] ah and back to timing out
[06:29:30] the dns issue seems intermittent, I haven't been able to track it down yet
[06:30:14] Tamzin: is it specifically 'jobs list' that's timing out, or something else?
[06:30:56] most recently `toolforge jobs load jobs.yaml --job "zinbot-one"` timed out
[06:31:26] and is timing out again (probably) right now
[06:32:16] yep, timed out, `list` too. all roughly the same error message
[06:33:35] is it 'Temporary failure in name resolution' or something else?
[06:34:00] still that, yes
[06:34:54] can you tell me more about why you describe it as a timeout? Is there more context around the error message?
[06:35:47] Well I'm calling it a timeout because it only happens after trying for a while, and the error message says "Max retries exceeded with url". If "timeout" isn't the correct term technically there, I apologize.
[06:36:19] I'm not sure what the right term is :) Just wanted to make sure there wasn't more info that I'm missing
[06:36:20] The only context here is that I'm trying to execute two basic `toolforge jobs` commands to see why my bot failed earlier (although it's probably the same outage)
[06:36:49] I can reproduce the name resolution error although I can't make it happen outside of the jobs cli
[06:40:55] FWIW the error my bot reported was very similar to this
[06:41:03] requests.exceptions.ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php?meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg&continue=&action=query&format=json (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -3] Temporary failure in
[06:41:03] name resolution'))
[06:41:49] and I'm still getting onfailure notifs as of 3 minutes ago
[06:42:07] yeah, me too
[06:46:08] Tamzin: I'm rebooting worker nodes in hopes of refreshing caches, it'll take a while
[06:47:00] cool thanks! my bedtime's in a few, but fortunately my bot doesn't do anything where the wiki will break from a few hours of downtime
[06:47:40] yeah, I'm worried that the problem is widespread but it's intermittent so hopefully most tools are coping
[06:53:57] !stashbot, you ok?
[06:54:08] stashbot, still there?
[06:54:08] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[07:38:21] !status Toolforge NFS outage - T380827
[07:38:44] Oh, that was why I was having problems restarting StewardBot?
[07:39:10] that or T380832
[07:41:11] :o
[07:41:28] Yeah, was getting a connection error thingy
[07:45:58] since 2024-11-26 02:28 UTC all of my tools at toolforge can't reach de.wikipedia.org anymore. is that a known problem? where do i find the current status of toolforge?
[07:46:38] see T380827 and its subtasks - various things are broken
[07:46:49] the current issue with jobs-api seems to have been dns related. restarting some of the nodes seems to have worked, we are currently monitoring and restarting some of the failed jobs
[07:51:06] Yeah, restarting errored out connecting using api_client.py, etc
[07:51:28] it would be great, if at the starting page of toolforge.org, i.e. https://wikitech.wikimedia.org/wiki/Portal:Toolforge, there would be a link to some page "current state" or so.
[09:08:12] !tools restarting tools-k8s-worker-nfs-17
[09:11:55] !log tools restarting tools-k8s-worker-nfs-50
[09:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:25] !log tools restarting tools-k8s-worker-nfs-70
[09:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:14:28] !log tools restarting tools-k8s-worker-nfs-72
[09:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:17:19] !log tools rebooting k8s-control-8
[10:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:22:22] !log tools rebooting k8s-control-9
[10:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:44:21] it seems that the dns issues are resolved, is anyone still seeing any anywhere? (new ones)
[11:12:43] !status DNS incident resolved, report new DNS issues on T380844
[11:12:43] T380844: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844
[13:31:16] !log admin added cloudcephmon1004 to the ceph mon pool
[13:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:44:58] dick
[15:45:19] asshole face
[15:48:03] you horny bitches
[15:48:29] moron
[15:48:34] you idiot
[15:48:48] you bastard
[15:50:08] you bastard
[15:50:15] you wretch
[15:50:21] !kb Guest88
[18:26:58] !log multichill@tools-bastion-12 tools.multichill Tired of the Unable to start, out of quota for memory, memory, created T380902 for more memory
[18:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.multichill/SAL
[18:29:07] bridge test
[18:30:38] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge jobs restart bridgebot
[18:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:32:13] ok, looks like bridgebot is back alive and catching up on messages now
[18:34:04] I guess it must be affected by https://phabricator.wikimedia.org/T380844 ?
[18:41:45] bridge test 2
[18:41:49] ok it’s working again
[18:42:03] idk why it seemingly skipped bridging a few messages in between
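
Note on the `[Errno -3] Temporary failure in name resolution` errors quoted in this log: on Linux with glibc, -3 is the resolver code `EAI_AGAIN`, which signals a transient DNS failure rather than a connection timeout. The sketch below is one way a bot could check resolution directly and retry only on that code. It is not what the `toolforge jobs` CLI or any bot in this log does. `resolve_with_retry` and its parameters are hypothetical names introduced for illustration.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying only on transient DNS failures (EAI_AGAIN).

    Hypothetical helper: `resolver` and `sleep` are injectable so the
    retry logic can be exercised without real network access.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, port, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address string.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            # Any other resolver error (e.g. EAI_NONAME) is not treated
            # as transient, so it is re-raised immediately.
            if exc.errno != socket.EAI_AGAIN:
                raise
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # simple linear backoff
    raise last_error
```

A call such as `resolve_with_retry("k8s.tools.eqiad1.wikimedia.cloud", 6443)` (host and port taken from the error above) returns the resolved addresses, or re-raises the last `socket.gaierror` if every attempt fails with `EAI_AGAIN`.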