[09:16:49] zhuyifei1999_: ^ [09:16:54] trace result pls [09:16:57] ok [09:18:54] there are so much logs [09:19:01] 784 lines of logs since 09:11:41 [09:20:09] zhuyifei1999_: that's 1s before it lasted heard. anything [09:20:17] 10:13:49 [2020-05-14 09:13:49,094] sopel.irc.backends ERROR - Server timeout detected after 127s; closing. [09:21:03] I'll make a private paste on phab so I don't have to figure out what needs to be redacted [09:22:37] https://phabricator.wikimedia.org/P11192 [09:22:38] [ Login ] - phabricator.wikimedia.org [09:26:30] 10Tool-Zppixbot: Debug random ZppixBot(-test) restarts - https://phabricator.wikimedia.org/T248960 (10zhuyifei1999) Caught a trace of zppixbot-test failing: P11192. Keeping it private because there are way too many lines for me to look through to see what need redacting [09:29:46] what is the current timeout? [09:29:51] zhuyifei1999_: 120s [09:30:01] let me look at ZppixBot's logs [09:31:02] so disconnecting at 09:13:49, the error happened sometime around 09:11:49. strace has logs at 09:11:41 [09:32:05] [pid 17478] 09:11:41 sendto(3, "PING chat.freenode.net\r\n", 24, 0, NULL, 0) = 24 [09:32:05] [pid 22] 09:11:41 recvfrom(3, ":beckett.freenode.net PONG beckett.freenode.net :chat.freenode.net\r\n", 65536, 0, NULL, NULL) = 68 [09:32:23] you have a PONG at 09:11:41 at the very second the ping was sent [09:32:37] i saw [09:33:04] [pid 17478] 09:13:41 sendto(3, "PING chat.freenode.net\r\n", 24, 0, NULL, 0) = 24 [09:33:04] [pid 17478] 09:13:41 write(4, "2020-05-14 09:13:41,863 >>\tPING chat.freenode.net\n", 50 [09:33:04] [pid 22] 09:13:41 recvfrom(3, ":beckett.freenode.net PONG beckett.freenode.net :chat.freenode.net\r\n", 65536, 0, NULL, NULL) = 68 [09:33:04] [pid 17478] 09:13:49 <... write resumed> ) = 50 [09:33:04] * RhinosF1 waits for ZppixBot's logs of raw messages to load [09:33:38] the ping was sent -> starts logging -> pong received -> finished logging [09:34:03] [pid 17478] 09:13:49 sendto(3, "PRIVMSG #ZppixBot-logs :[2020-05-14 09:13:49,094] sopel.irc.backends ERROR - Server timeout detected after 127s; closing.\r\n", 128, 0, NULL, 0) = 128 [09:34:51] what do you mean it timed out right after you finished logging ¯\_(ツ)_/¯ [09:34:56] I see on the sopel side [09:34:59] 2020-05-14 09:11:41,720 >> PING chat.freenode.net [09:34:59] 2020-05-14 09:11:41,808 << :beckett.freenode.net PONG beckett.freenode.net :chat.freenode.net [09:34:59] 2020-05-14 09:13:41,863 >> PING chat.freenode.net [09:34:59] 2020-05-14 09:13:41,951 << :beckett.freenode.net PONG beckett.freenode.net :chat.freenode.net [09:34:59] 2020-05-14 09:13:49,129 >> PRIVMSG #ZppixBot-logs :[2020-05-14 09:13:49,094] sopel.irc.backends ERROR - Server timeout detected after 127s; closing. [09:35:14] so you got a pong [09:35:21] zhuyifei1999_: I meant I'm going to check what ZppixBoy sent [09:35:46] well that shouldn't have timed out ? nor does it look anything like the old bug [09:35:58] where it actually didn't get a reply [09:38:49] well, so issue in sopel code [09:38:57] at least in this instance [09:39:44] are both ping_timeout and server_timeout 120s? [09:39:51] yeah, at least this instance [09:39:59] ping timeout is server timeoout [09:40:15] the server timesout when nothing ping oor otherwise comes in 120s [09:40:16] then that could be racey [09:40:56] so the race condition: at the same second, _send_ping and _check_timeout are fired [09:41:10] it sounds like a race condition that [09:41:53] send ping sends the ping continues to _check_timeout, pong received right after _check_timeout finishes [09:42:38] or in this case, pong received before _check_timeout, but spent so much time trying to log it that _check_timeout finished [09:42:42] yeah [09:48:18] oh wait, I was wrong about 'this case'. the logging of the ping took 7 seconds, preventing the pong from being logged during that 7 seconds [09:49:15] well, yeah, just heck lot of race with ping + pong + check timeout happening in the same second [09:49:35] that's a fluke with sopel then [09:49:39] try reduce ping timeout compared to server timeout [09:49:57] zhuyifei1999_: it's only one config as far as i know [09:50:00] timeout = x [09:50:02] o.O [09:50:29] then I can't save you [09:50:32] and upping that just made it hang when it had issues [09:50:52] you might want to report the issue [09:51:04] zhuyifei1999_: let's trace again to see if there's no issues when it doesn't hit a race condition and I've raised it in #sopel [09:51:05] to sopel [09:51:18] it's still tracing [09:51:24] cool [09:55:54] as for not receiving, I think the explanation could be simple: if recv is half a second late then _check_timeout is done and it's doomed [09:56:21] probably [09:56:53] or even logging takes half a second considering updating last_event_at is the last thing it does, after logging https://github.com/sopel-irc/sopel/blob/7cdefb91597cd7f995bd59eac3c7bbcc52da1964/sopel/irc/backends.py#L173 [09:56:53] [ sopel/backends.py at 7cdefb91597cd7f995bd59eac3c7bbcc52da1964 · sopel-irc/sopel · GitHub ] - github.com [09:58:47] hmm [18:28:05] .t [18:28:05] 2020-05-14 - 18:28:05UTC [18:28:28] .settz America/Chicago [18:28:29] sariomobile: I now have you in the America/Chicago timezone. [18:29:21] .help [18:29:21] Hang on, I'm creating a list. [18:29:23] I've posted a list of my commands at https://clbin.com/EJc5K - You can see more info about any of these commands by doing .help (e.g. .help time) [18:31:47] For a feature request, should I just post here or should I make a GitHub issue? [18:32:56] .getchanneltz [18:32:56] #ZppixBot's timezone: UTC [19:31:14] zhuyifei1999_: 20:29:39 [2020-05-14 19:29:38,601] sopel.irc.backends  ERROR    - Server timeout detected after 123s; closing. [19:33:33] ok [19:34:47] zhuyifei1999_: I’m mobile but see what trace says and sopel’s own logs on the tool in .sopel/logs and stick them in a private paste [19:34:58] It’s the raw log we want [19:35:01] From sopel [19:35:05] https://phabricator.wikimedia.org/P11198 [19:35:06] [ Login ] - phabricator.wikimedia.org [19:36:12] https://www.irccloud.com/pastebin/Y5Xm6zOg/ [19:36:12] [ Snippet | IRCCloud ] - www.irccloud.com [19:36:28] still race on sopel's side [19:37:55] ok, notify me if you think it's a new issue. this cause is found [19:45:27] zhuyifei1999_: weird why they’re now races. That looks to be the same issue. [19:45:41] yes [19:45:47] * RhinosF1 wonders why his grandad ruined his oc [19:45:55] Pc for the umpteenth time [19:46:03] lol [19:47:01] zhuyifei1999_: luckily I don’t think it’ll cost him this time. It’s only cost him previosly ~£160 in repairs and new licenses [19:48:58] Actually, It will as I’ve ran out of office licenses [19:54:09] can't you transfer license? [20:01:48] zhuyifei1999_: no [20:01:53] Not for this one [20:01:55] ok [20:03:49] zhuyifei1999_: it’s another bill he’s racked up [20:04:12] He shouldn’t have done it [20:04:35] * RhinosF1 thinks he would have ignored many warnings from windows to get to that stage [20:08:07] zhuyifei1999_: I have a thought, we updated sopel on the 12th [20:08:42] I think it's just related to what host it is running on [20:08:57] certain hosts are more NFS-loaded and writing to logs take longer [20:12:52] zhuyifei1999_: maybe [20:13:08] I wonder if there’s a way to improve the race [20:15:05] It shouldnt be the update reading the changelog [20:15:15] And github compare [20:15:20] The backend wasnt touched [20:21:59] improve the race? [20:22:27] well, minecraft has a good algorithm for timeout detection [20:24:34] https://www.irccloud.com/pastebin/qGn2qqoO/ [20:24:34] [ Snippet | IRCCloud ] - www.irccloud.com [20:31:35] Sario: You can make a task on phabricator.wikimedia.org with the tag tool-zppixbot [20:39:01] MacFan4000: what was sario’s issue [20:45:31] RhinosF1: I had a thought about a feature request [20:49:49] MacFan4000: if you tell me his request, i’ll look [20:51:57] RhinosF!: He never said what his request was [20:52:54] MacFan4000: then I can’t do it! [21:00:42] Sario: what was your feature request? [21:01:29] I had an idea for expanding .status to allow others to check your current status. [21:04:40] * RhinosF1 wonders why you can’t just read the wiki