[15:09:49] Lucas_WMDE: do you mind if i ask for another favour from you :p (in all seriousness, though, feel free to decline :) )
[15:09:59] i'm curious if logstash has anything interesting for the request ID at https://phabricator.wikimedia.org/T416112#11812816
[15:10:22] * Lucas_WMDE looks
[15:12:27] A_smart_kitten: nothing in logstash AFAICT
[15:12:40] based on the error message it might not have logged anything in MW
[15:13:34] Lucas_WMDE: thanks for looking :) just to double-check, (as far as you're aware) would the logs have been garbage-collected in the last 7-8 days?
[15:13:39] I don’t know if anything ends up in logstash when a request times out
[15:13:45] I’m not sure how long we keep them
[15:14:05] * Lucas_WMDE looks at the last 30 days without reqId filter
[15:14:27] ok we at least still have stuff from 31 March
[15:14:33] so I don’t think it would’ve been GCed yet
[15:14:38] It depends on the exact type of timeout; generally yes, though for php timeouts it's sometimes not associated with the request id
[15:14:46] hm, could be that
[15:15:06] However, most of the time, excimer should raise an exception which should get logged associated with the request id
[15:15:42] But last time when i was debugging job queue things, i had to look at the logs for that specific host to find a timeout error, as it wasn't associated with the request
[15:17:46] hm, I see a few skwiki OOMs on 13 April, but they’re all on /w/rest.php URLs AFAICT
[15:17:58] seems like the error page i get for that request might be the `503.html` within mediawiki-config
[15:18:04] (fwiw)
[15:18:51] I see a bunch of "Parsing Zoznam SD objektov 2 was slow, took 8.04 seconds" but that's actually not very long in the grand scheme of things and shouldn't really cause problems
[15:20:37] random thought, i wonder if it'd help if i used WikimediaDebug to enable verbose logging / excimer ui for that URL?
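Searching logstash the way Lucas_WMDE does above starts from the request ID, which the response itself exposes in a header. A minimal sketch, assuming the header is named `x-request-id` and using the UUID that comes up later in this conversation; the saved headers are a trimmed sample, not a real capture:

```shell
# Save the response headers of the failing request, then pull out the
# request ID to search for as reqId in logstash.
# (Trimmed sample headers; normally you'd capture them with something
#  like `curl -sI <url> > /tmp/headers.txt`.)
cat > /tmp/headers.txt <<'EOF'
HTTP/2 503
date: Mon, 20 Apr 2026 15:24:00 GMT
x-cache-status: miss
x-request-id: 95a8d6d6-884e-49bd-bb49-3f5219f2fc78
EOF

# Header names are case-insensitive, so normalise before matching.
awk -F': ' 'tolower($1) == "x-request-id" { print $2 }' /tmp/headers.txt
# → 95a8d6d6-884e-49bd-bb49-3f5219f2fc78
```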
[15:23:36] If you have the exact time of your request it can help narrow it down
[15:25:58] But all i see on april 13 is a bunch of warnings about parsing taking 5 seconds, which is slow but not slow enough to cause a 503
[15:26:09] ack. just did a request right now, got a 503. the Date header from the server is today 15:24:00 UTC. according to my browser it took 2.1 mins for the server to respond
[15:26:26] i enabled a few settings in WikimediaDebug for that request, so there might be some more verbose logs
[15:27:10] I think i see it, this is request id 38353eaa-67b0-4d1f-8f05-e22a81973c6c
[15:27:23] Did that request give you a 503?
[15:27:48] hmm, i have 95a8d6d6-884e-49bd-bb49-3f5219f2fc78 for the most recent one, checking yours in case it matches a previous request
[15:28:18] oh, that might have been the reqid for favicon.ico
[15:28:55] yep that seems to match
[15:29:03] (my browser's request for favicon.ico)
[15:29:08] anyways, i do see a: [20-Apr-2026 15:24:05] WARNING: [pool www] child 2556, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (5.308122 sec), logging
[15:30:12] but nothing so far that looks like a timeout
[15:31:20] hmm
[15:34:16] Ok, when i try it, i get a 503, but it has an "age: 10" header, which is definitely odd (although x-cache-status is miss)
[15:35:41] although the timeline is that it waited 15 seconds before giving me that response. Not sure what to make of that
[15:37:55] And the server header changes with each request, so presumably it's making it to the backend
[15:38:20] So i wonder if varnish is just set to a 10 second timeout or something
[15:39:19] hmm. purely anecdotally, i feel like i've had successful requests to wikimedia wikis that last longer than 10 seconds, though
[15:39:37] yeah, normally the limit is way higher than that
[15:39:42] (also, would[n't] a timeout be a 504? and/or am i misunderstanding something?)
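The php-fpm slow-log warning quoted above has a fixed shape, so the child PID and the duration can be pulled out mechanically when scanning a whole log. A sketch (the sample line is the one from the conversation; the sed pattern is my own, not a WMF tool):

```shell
# Sample php-fpm slow-log line, copied from the conversation above.
cat > /tmp/fpm-slow.log <<'EOF'
[20-Apr-2026 15:24:05] WARNING: [pool www] child 2556, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (5.308122 sec), logging
EOF

# Extract the child PID and the duration from each "executing too slow" line.
sed -n 's/.*child \([0-9]*\),.*executing too slow (\([0-9.]*\) sec).*/child \1 slow: \2s/p' \
  /tmp/fpm-slow.log
# → child 2556 slow: 5.308122s
```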
[15:41:44] honestly i always get confused about the difference between 504 and 503
[15:42:00] fair :)
[15:42:16] from finding https://www.mediawiki.org/wiki/Topic:Wrw2me16w73cwljw i'm wondering if the 503 might be indicative of something happening/going wrong at the level below mediawiki
[15:42:29] *potentially indicative
[15:44:04] Yes, it's probably not a mediawiki-generated error. It's probably varnish having an issue, but the issue might potentially be caused by something mediawiki did, although i think it's more likely on some non-mediawiki layer
[15:45:37] huh
[15:48:25] Actually, I see on your request from 15:36:27 there is a varnish warning that the request took 98.953719 seconds to fetch
[15:50:08] but nothing in the php slow log seems to indicate it took anywhere near that amount of time
[15:56:29] Possibly a coincidence, but maybe related
[15:57:11] At 2026-04-20T15:50:40.660244275Z wikicube-worker-1362 gives the warning: [20-Apr-2026 15:50:40] WARNING: [pool www] child 9, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (6.604736 sec), logging
[15:57:39] The next log message from wikicube-worker-1362, at 2026-04-20T15:50:40.75651698Z,
[15:58:34] is a message about receiving SIGABRT
[15:58:42] [20-Apr-2026 15:50:40] WARNING: [pool www] child 9 exited on signal 6 (SIGABRT) after 2454.014771 seconds from start
[16:00:06] hmm, i'm probably showing my lack of knowledge here with this question, but i guess -- might that process exit have then led to the 503?
[16:01:22] plausibly.
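Correlating the two warnings by child PID is the step that points at the crash. A small awk sketch over the two wikicube-worker-1362 lines quoted above; the field positions are assumptions read off the log format shown here, not a documented interface:

```shell
# The two php-fpm log lines quoted above, as a sample input.
cat > /tmp/fpm-abrt.log <<'EOF'
[20-Apr-2026 15:50:40] WARNING: [pool www] child 9, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /w/index.php?oldid=6558488&title=Zoznam_SD_objektov_2") executing too slow (6.604736 sec), logging
[20-Apr-2026 15:50:40] WARNING: [pool www] child 9 exited on signal 6 (SIGABRT) after 2454.014771 seconds from start
EOF

# Flag any child that was first reported slow and then exited on SIGABRT.
# In both line shapes the child PID is the 7th whitespace-separated field
# (with a trailing comma on the "executing too slow" lines).
awk '
  /executing too slow/ { pid = $7; sub(/,$/, "", pid); slow[pid] = 1 }
  /exited on signal 6 \(SIGABRT\)/ && ($7 in slow) {
    print "child " $7 ": slow, then SIGABRT"
  }
' /tmp/fpm-abrt.log
# → child 9: slow, then SIGABRT
```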
503 usually means that varnish had issues talking to mediawiki; if the php process aborted, that would qualify
[16:01:44] Although it is weird, as i would expect the "executing too slow" message to be logged only after the request completed
[16:02:19] SIGABRT is definitely a very weird error, something had to go pretty wrong for that to happen, probably at a really low level
[16:29:26] A_smart_kitten: So after testing, i can confirm: every time i request ?oldid=6558488&title=Zoznam_SD_objektov_2 there is a SIGABRT in the log. That's probably as far as I can take debugging from here. The next obvious step i think would be to generate a core dump on the abrt and look at that, but that needs someone with higher access, so I'll just file a bug
[16:31:15] anyone else getting "upstream connect error or disconnect/reset before headers. reset reason: connection timeout" on phab?
[16:31:33] seemed to be unhappy for a few moments
[16:34:57] bawolff: fair enough! thank you for the investigations you've done :)
[16:36:04] i was actually about to try and summarise your investigations into a comment on T416112, but if you're filing a bug then i guess we could just add that as a subtask(?)
[16:36:05] T416112: HTTP 500/503 error on [[skwiki:Zoznam_SD_objektov_2]] due to embedding a template 20000 times - https://phabricator.wikimedia.org/T416112
[16:36:20] and/or, do you think we should copy anything here over to the comments on that task?
[16:41:47] I edited the bug
[16:44:38] thanks again bawolff, much appreciated :) i can only speak personally, but i like that (at least) we are now able to say what is likely to be causing the 503, rather than it just being an error in Phab with a completely unknown cause
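The "generate a core dump on the abrt and look at that" next step mentioned above could look roughly like the sketch below. The php-fpm binary path and the core file path are placeholders, and the exact core-collection setup on WMF production hosts is not something this conversation establishes:

```shell
# Hypothetical sketch of capturing and inspecting a core dump from an
# aborting php-fpm child; paths below are placeholders, not incident details.
ulimit -c unlimited                   # allow core files in this shell
cat /proc/sys/kernel/core_pattern     # where the kernel will write cores
# ...reproduce the failing request so a child SIGABRTs, then inspect, e.g.:
#   gdb -batch -ex 'bt full' /usr/sbin/php-fpm /path/to/core
```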