[15:09:49] Lucas_WMDE: do you mind if i ask for another favour from you :p (in all seriousness, though, feel free to decline :) )
[15:09:59] i'm curious if logstash has anything interesting for the request ID at https://phabricator.wikimedia.org/T416112#11812816
[15:10:22] * Lucas_WMDE looks
[15:12:27] A_smart_kitten: nothing in logstash AFAICT
[15:12:40] based on the error message it might not have logged anything in MW
[15:13:34] Lucas_WMDE: thanks for looking :) just to double-check, (as far as you're aware) would the logs have been garbage-collected in the last 7-8 days?
[15:13:39] I don’t know if anything ends up in logstash when a request times out
[15:13:45] I’m not sure how long we keep them
[15:14:05] * Lucas_WMDE looks at the last 30 days without reqId filter
[15:14:27] ok we at least still have stuff from 31 March
[15:14:33] so I don’t think it would’ve been GCed yet
[15:14:38] It depends on the exact type of timeout; generally yes, though for php timeouts it's sometimes not associated with the request id
[15:14:46] hm, could be that
[15:15:06] However, most of the time, excimer should raise an exception which should get logged associated with the request id
[15:15:42] But last time when i was debugging job queue things, i had to look at the logs for that specific host to find a timeout error, as it wasn't associated with the request
[15:17:46] hm, I see a few skwiki OOMs on 13 April, but they’re all on /w/rest.php URLs AFAICT
[15:17:58] seems like the error page i get for that request might be the `503.html` within mediawiki-config
[15:18:04] (fwiw)
[15:18:51] I see a bunch of "Parsing Zoznam SD objektov 2 was slow, took 8.04 seconds" but that's actually not very long in the grand scheme of things and shouldn't really cause problems
[15:20:37] random thought, i wonder if it'd help if i used WikimediaDebug to enable verbose logging / excimer ui for that URL?
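Searching logstash the way Lucas_WMDE does above starts from the request ID, which the response itself exposes in a header. A minimal sketch, assuming the header is named `x-request-id` and using the UUID that comes up later in this conversation; the saved headers are a trimmed sample, not a real capture:

```shell
# Save the response headers of the failing request, then pull out the
# request ID to search for as reqId in logstash.
# (Trimmed sample headers; normally you'd capture them with something
#  like `curl -sI <url> > /tmp/headers.txt`.)
cat > /tmp/headers.txt <<'EOF'
HTTP/2 503
date: Mon, 20 Apr 2026 15:24:00 GMT
x-cache-status: miss
x-request-id: 95a8d6d6-884e-49bd-bb49-3f5219f2fc78
EOF

# Header names are case-insensitive, so normalise before matching.
awk -F': ' 'tolower($1) == "x-request-id" { print $2 }' /tmp/headers.txt
# → 95a8d6d6-884e-49bd-bb49-3f5219f2fc78
```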
[15:23:36] If you have the exact time of your request it can help narrow it down
[15:25:58] But all i see on april 13 is a bunch of warnings about parsing taking 5 seconds, which is slow but not slow enough to cause a 503
[15:26:09] ack. just did a request right now, got a 503. the Date header from the server is today 15:24:00 UTC. according to my browser it took 2.1 mins for the server to respond
[15:26:26] i enabled a few settings in WikimediaDebug for that request, so there might be some more verbose logs
[15:27:10] I think i see it, this is request id 38353eaa-67b0-4d1f-8f05-e22a81973c6c
[15:27:23] Did that request give you a 503?
[15:27:48] hmm, i have 95a8d6d6-884e-49bd-bb49-3f5219f2fc78 for the most recent one, checking yours in case it matches a previous request
[15:28:18] oh, that might have been the reqid for favicon.ico
[15:28:55] yep that seems to match
[15:29:03] (my browser's request for favicon.ico)
[15:29:08] anyways, i do see a: [20-Apr-2026 15:24:05] WARNING: [pool www] child 2556, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (5.308122 sec), logging
[15:30:12] but nothing so far that looks like a timeout
[15:31:20] hmm
[15:34:16] Ok, when i try it, i get a 503, but it has an "age: 10" header, which is definitely odd (although x-cache-status is miss)
[15:35:41] although the timeline is that it waited 15 seconds before giving me that response. Not sure what to make of that
[15:37:55] And the server header changes with each request, so presumably it's making it to the backend
[15:38:20] So i wonder if varnish is just set to a 10 second timeout or something
[15:39:19] hmm. purely anecdotally, i feel like i've had successful requests to wikimedia wikis that last longer than 10 seconds, though
[15:39:37] yeah, normally the limit is way higher than that
[15:39:42] (also, would[n't] a timeout be a 504? and/or am i misunderstanding something?)
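The php-fpm slow-log warning quoted above has a fixed shape, so the child PID and the duration can be pulled out mechanically when scanning a whole log. A sketch (the sample line is the one from the conversation; the sed pattern is my own, not a WMF tool):

```shell
# Sample php-fpm slow-log line, copied from the conversation above.
cat > /tmp/fpm-slow.log <<'EOF'
[20-Apr-2026 15:24:05] WARNING: [pool www] child 2556, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (5.308122 sec), logging
EOF

# Extract the child PID and the duration from each "executing too slow" line.
sed -n 's/.*child \([0-9]*\),.*executing too slow (\([0-9.]*\) sec).*/child \1 slow: \2s/p' \
  /tmp/fpm-slow.log
# → child 2556 slow: 5.308122s
```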
[15:41:44] honestly i always get confused about the difference between 504 and 503
[15:42:00] fair :)
[15:42:16] from finding https://www.mediawiki.org/wiki/Topic:Wrw2me16w73cwljw i'm wondering if the 503 might be indicative of something happening/going wrong at the level below mediawiki
[15:42:29] *potentially indicative
[15:44:04] Yes, it's probably not a mediawiki-generated error. It's probably varnish having an issue, but the issue might potentially be caused by something mediawiki did, although i think it's more likely on some non-mediawiki layer
[15:45:37] huh
[15:48:25] Actually, I see on your request from 15:36:27 there is a varnish warning that the request took 98.953719 seconds to fetch
[15:50:08] but nothing in the php slow log seems to indicate it took anywhere near that amount of time
[15:56:29] Possibly a coincidence, but maybe related
[15:57:11] At 2026-04-20T15:50:40.660244275Z wikicube-worker-1362 gives the warning: [20-Apr-2026 15:50:40] WARNING: [pool www] child 9, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (6.604736 sec), logging
[15:57:39] The next log message from wikicube-worker-1362, at 2026-04-20T15:50:40.75651698Z,
[15:58:34] is a message about receiving SIGABRT
[15:58:42] [20-Apr-2026 15:50:40] WARNING: [pool www] child 9 exited on signal 6 (SIGABRT) after 2454.014771 seconds from start
[16:00:06] hmm, i'm probably showing my lack of knowledge here with this question, but i guess -- might that process exit have then led to the 503?
[16:01:22] plausibly.
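Correlating the two warnings by child PID is the step that points at the crash. A small awk sketch over the two wikicube-worker-1362 lines quoted above; the field positions are assumptions read off the log format shown here, not a documented interface:

```shell
# The two php-fpm log lines quoted above, as a sample input.
cat > /tmp/fpm-abrt.log <<'EOF'
[20-Apr-2026 15:50:40] WARNING: [pool www] child 9, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (6.604736 sec), logging
[20-Apr-2026 15:50:40] WARNING: [pool www] child 9 exited on signal 6 (SIGABRT) after 2454.014771 seconds from start
EOF

# Flag any child that was first reported slow and then exited on SIGABRT.
# In both line shapes the child PID is the 7th whitespace-separated field
# (with a trailing comma on the "executing too slow" lines).
awk '
  /executing too slow/ { pid = $7; sub(/,$/, "", pid); slow[pid] = 1 }
  /exited on signal 6 \(SIGABRT\)/ && ($7 in slow) {
    print "child " $7 ": slow, then SIGABRT"
  }
' /tmp/fpm-abrt.log
# → child 9: slow, then SIGABRT
```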
503 usually means that varnish had issues talking to mediawiki; if the php process aborted, that would qualify
[16:01:44] Although it is weird, as i would expect the "executing too slow" message to be logged only after the request completed
[16:02:19] SIGABRT is definitely a very weird error, something had to go pretty wrong for that to happen, probably at a really low level
[16:29:26] A_smart_kitten: So after testing, i can confirm: every time i request ?oldid=6558488&title=Zoznam_SD_objektov_2 there is a SIGABRT in the log. That's probably as far as I can take debugging from here. The next obvious step i think would be to generate a core dump on the abrt and look at that, but that needs someone with higher access, so I'll just file a bug
[16:31:15] anyone else getting "upstream connect error or disconnect/reset before headers. reset reason: connection timeout" on phab?
[16:31:33] seemed to be unhappy for a few moments
[16:34:57] bawolff: fair enough! thank you for the investigations you've done :)
[16:36:04] i was actually about to try and summarise your investigations into a comment on T416112, but if you're filing a bug then i guess we could just add that as a subtask(?)
[16:36:05] T416112: HTTP 500/503 error on [[skwiki:Zoznam_SD_objektov_2]] due to embedding a template 20000 times - https://phabricator.wikimedia.org/T416112
[16:36:20] and/or, do you think we should copy anything here over to the comments on that task?
[16:41:47] I edited the bug
[16:44:38] thanks again bawolff, much appreciated :) i can only speak personally, but i like that (at least) we are now able to say what is likely to be causing the 503, rather than it just being an error in Phab with a completely unknown cause
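The "generate a core dump on the abrt and look at that" next step mentioned above could look roughly like the sketch below. The php-fpm binary path and the core file path are placeholders, and the exact core-collection setup on WMF production hosts is not something this conversation establishes:

```shell
# Hypothetical sketch of capturing and inspecting a core dump from an
# aborting php-fpm child; paths below are placeholders, not incident details.
ulimit -c unlimited                   # allow core files in this shell
cat /proc/sys/kernel/core_pattern     # where the kernel will write cores
# ...reproduce the failing request so a child SIGABRTs, then inspect, e.g.:
#   gdb -batch -ex 'bt full' /usr/sbin/php-fpm /path/to/core
```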