[00:03:13] Tool-Zppixbot, Upstream, User-MacFan4000, User-RhinosF1: Configure ZppixBot weather module properly - https://phabricator.wikimedia.org/T253758 (MacFan4000) There’s been a response from upstream: It’s on my list of things to do. Should have it done in the next day or so.
[06:25:51] Reception123: ^
[06:26:34] I'll need to capture the logs
[06:29:11] Tool-Zppixbot, User-RhinosF1: ZppixBot-test starts improperly on python 3.7 - https://phabricator.wikimedia.org/T254348 (RhinosF1) Resolved→Open p:Medium→Unbreak! > 05:45:52 [2020-06-09 04:45:51,347] sopel.irc.backends  ERROR    - Server timeout detected after 121s; closi...
[06:29:14] Tool-Zppixbot, Documentation, User-RhinosF1: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246 (RhinosF1)
[06:33:33] Tool-Zppixbot, User-RhinosF1: ZppixBot-test starts improperly on python 3.7 - https://phabricator.wikimedia.org/T254348 (RhinosF1) My IRC logs also show a crash lasting 32 mins over night.
[06:36:39] Tool-Zppixbot, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1)
[08:16:27] Tool-Zppixbot, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) ` [2020-06-09 04:45:51,347] sopel.irc.backends ERROR - Server timeout detected after 121s; closing. [2020-06-09 04:45:51,388] sopel.irc.backends INF...
[08:23:11] Tool-Zppixbot, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) ` Traceback (most recent call last): File "/data/project/zppixbot-test/zppixbottest37/lib/python3.7/site-packages/sopel/bot.py", line 606, in call ex...
[08:31:35] Tool-Zppixbot, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) I'm going to have to try and debug before I restart it so it could be another half an hour before -test is back. I'll check logs again after but we may ne...
[08:40:09] Tool-Zppixbot, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) >>! In T254348#6204927, @RhinosF1 wrote: > I'm going to have to try and debug before I restart it so it could be another half an hour before -test is back....
[08:40:19] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1)
[08:42:01] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) https://github.com/sopel-irc/sopel/issues/1868 would have been the original cause, https://github.com/sopel-irc/sopel/issues/1865 is the fi...
[08:51:12] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) https://git.io/Jfyne for this actual issue
[09:40:39] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) I plan to roll this back to python 3.5
[09:48:37] ok :(
[09:48:51] RhinosF1: why did we have to rollback?
[09:48:53] Reception123: Rolled back, Issue filed upstream
[09:49:07] It's not recovering correctly after a crash
[09:49:58] Reception123: it's spent 5.5 hours of the 12 hours since we rolled forward down
[09:50:07] ah
[09:50:34] Tool-Zppixbot, Documentation, User-RhinosF1: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246 (RhinosF1) Open→Stalled Blocking
[09:51:18] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) Open→Stalled p:Unbreak!→Medium Blocked on upstream.
[09:51:20] Tool-Zppixbot, Documentation, User-RhinosF1: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246 (RhinosF1)
[09:51:49] Reception123: now we wait for a fix
[09:57:38] Tool-Zppixbot, Documentation, User-RhinosF1: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246 (RhinosF1) Rollback completed in ~8 mins for the record.
[16:00:08] .op
[16:00:08] Please wait...
[16:00:45] .deop
[16:00:52] .deop ZppixBot
[16:59:32] Zppix: we rolled python back again
[16:59:43] Why?
[17:00:25] -test died due to find-lines errors
[17:00:33] multiple times
[17:00:40] and it didn't recover
[17:00:55] Zppix: the bot wasn't restarting properly after crashing
[17:01:16] (I'm not too sure it should even have been crashing and certainly not that much)
[17:04:24] * RhinosF1 has updated upstream and the appropriate Phabricator tasks have information
[17:05:58] I don't think we'll be looking at a quick fix either
[17:06:21] As it looks like that whole restart sequence is broken
[17:06:45] And possibly timeout detection but we can use a ZNC for that
[17:07:09] I don't really think we need a znc
[17:07:25] Zppix: that's the more minor issue
[17:07:33] Timeouts aren't a disaster
[17:07:52] The fact it's not recovering is my issue
[17:08:14] As it looks to be the reboot system that fails
[17:08:42] Does it give an error?
[17:08:53] Zppix: nothing helpful
[17:09:13] It doesn't even attempt to connect before crashing again
[17:09:14] What does it give
[17:09:22] Zppix: see phab
[17:10:00] * RhinosF1 thinks that python is panicking due to a known issue with the irc exceptions handler
[17:10:50] https://phabricator.wikimedia.org/T254348#6204886
[17:10:51] [ ⚓ T254348 ZppixBot-test fails restarts correctly on python 3.7 after crash ] - phabricator.wikimedia.org
[17:12:09] * RhinosF1 has a good idea what the cause is but isn't sure exactly how to fix
[17:20:22] .help
[17:20:24] I've posted a list of my commands at https://clbin.com/tD5Ho - You can see more info about any of these commands by doing .help (e.g. .help time)
[17:25:15] Zppix: I will keep an eye on upstream prs & releases that will fix it
[17:29:52] MacFan4000: FYI, if I read grafana right there was also a spike in resource consumption when you ran "kubectl delete pods --all"
[17:30:13] oh
[17:30:27] MacFan4000: ideally only delete the sopel pod
[17:30:29] Not both
[17:31:03] It looks like one of the pods started before the others stopped
[17:31:34] Also the canonical flag on the webservice got reset
[17:33:05] If you can't tell, I had a great day
[17:33:57] RhinosF1: Is the coco outcome that bad for you?
[17:40:55] Texas: I'm very annoyed with the process
[17:41:16] ah
[17:44:15] Texas: Today has been crap overall tbh
[17:44:34] sorry to hear that
[18:00:19] [ZppixBot-Source] MacFan4000 deleted branch revert-185-patch
[21:20:30] ^ me
[21:21:09] ^ also me
[21:32:47] Texas: me again
[21:35:11] Texas: IT WORKED!!!!!
[21:39:45] yayy
[21:39:51] .version
[21:39:51] Texas: Sopel v7.0.4 (Python 3.5.3)
[21:41:12] Texas: I won't be upgrading yet, we got to fix the bug actually first
[21:41:58] [ZppixBot-Source] MacFan4000 pushed 1 commit to master [+0/-0/±1] https://git.io/Jfy7Y
[21:42:00] [ZppixBot-Source] MacFan4000 7c2615a - Update join.py
[21:42:05] We just know the cause for sure
[21:42:19] MacFan4000: don't forget to remove from default.cfg
[21:42:30] you don't have to restart though
[21:42:40] as it will apply on next change
[21:42:46] just pull the patch up
[21:42:52] RhinosF1: I added it using .join
[21:43:09] MacFan4000: ah
[21:45:11] Texas, MacFan4000, Zppix: 22:44:56 !log tools.zppixbot-test keeping logging_channel off with debug logs and hoping the bot times out to get some decent non log spammed logs to try and fix bug 2/3 in T254348
[21:45:27] cool
[21:47:54] there's also the logging_channel_level setting, so that we could keep channel logging on with INFO level and have the log file with DEBUG
[21:50:04] Tool-Zppixbot, Upstream, User-RhinosF1: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348 (RhinosF1) I've confirmed and updated upstream with the cause of https://github.com/sopel-irc/sopel/issues/1865 with high confidence and set some config...
[21:50:53] MacFan4000: I don't want logging_channel on at all on -test
[21:51:02] It's off to debug two issues
[21:51:08] with less spam
[21:51:31] and DEBUG is on to get detailed logs that only mention the actual issue (hopefully)
[21:52:02] RhinosF1: I'm saying we can set logging_channel_level different from logging_level
[21:52:33] logging_channel_level applies only to channels
[21:52:44] MacFan4000: yeah I know
[21:53:08] but it being on at all is not wanted for now on test until I capture some logs
[21:53:24] so unless you want to crash WMCS's network
[21:53:42] we have to leave it off and wait for a bit
[21:53:49] as having it on is causing a bug
[21:53:58] ah
[21:55:05] MacFan4000: if you read https://github.com/sopel-irc/sopel/issues/1865 and https://phabricator.wikimedia.org/T254348, you'll see
[21:55:06] [ Channel logging still attempts to output after connection loss · Issue #1865 · sopel-irc/sopel · GitHub ] - github.com
[21:55:07] [ ⚓ T254348 ZppixBot-test fails restarts correctly on python 3.7 after crash ] - phabricator.wikimedia.org
[21:55:19] Hopefully I'll get all the logs within a few days
[21:57:26] though aren't the issues more with py3.7 and not py3.5
[21:57:28] ?:
[21:57:31] .version
[21:57:31] MacFan4000: Sopel v7.0.4 (Python 3.5.3)
[21:58:22] MacFan4000: no they occur on python 3.5
[21:58:35] It just recovers automatically on 3.5
[21:58:46] on 3.7 it needs a hard stop/start to fix
[21:59:02] ah
[22:00:09] From what I've seen, the main instance hasn't been having as many issues as -test
[22:00:23] it's been a bit more stable
[22:00:41] MacFan4000: it's more stable
[22:00:54] but -test is fairly stable
[22:01:09] 3.7 was a disaster of a rollout though
[22:01:18] so it's not worth the risk
[22:01:45] i wonder how we manage to have the main instance more stable and not have find_lines issues
[22:02:10] MacFan4000: we do have find_lines issues
[22:02:23] but it's more stable so you see them less
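
Editor's note on the 21:47 to 21:53 exchange: the idea of running the log file at DEBUG while a channel only receives INFO and above can be illustrated with Python's standard `logging` module. This is only a rough sketch of the concept and not Sopel's implementation. `ListHandler`, `build_logger` and the logger name `sketch.bot` are hypothetical names made up for the illustration; the parameter names simply mirror the `logging_level` and `logging_channel_level` settings mentioned in the chat.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a log file or an IRC channel: keeps lines in a list."""

    def __init__(self, level):
        super().__init__(level)
        self.lines = []

    def emit(self, record):
        # With no formatter set, format() returns just the log message.
        self.lines.append(self.format(record))


def build_logger(logging_level="DEBUG", logging_channel_level="INFO"):
    """Attach two sinks with independent thresholds to one logger."""
    logger = logging.getLogger("sketch.bot")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    # The logger itself lets everything through; each handler filters on its own level.
    logger.setLevel(logging.DEBUG)
    file_like = ListHandler(getattr(logging, logging_level))
    channel_like = ListHandler(getattr(logging, logging_channel_level))
    logger.addHandler(file_like)
    logger.addHandler(channel_like)
    return logger, file_like, channel_like


logger, file_like, channel_like = build_logger()
logger.debug("raw line received")
logger.info("connected")
logger.error("Server timeout detected after 121s; closing.")

print(len(file_like.lines), "lines in the file-like sink")
print(len(channel_like.lines), "lines in the channel-like sink")
```

In this sketch the DEBUG message reaches only the file-like sink. It does not address the point made at 21:53: if having channel logging enabled at all is what triggers the bug, lowering its level is not a substitute for turning it off.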