[01:57:34] Krinkle: what is broken about it?
[02:05:06] TimStarling: Looks like something caused the redis TCP connection between mwlog1001 and webperf1002 to get dropped on Dec 13 2018, and the redis-py client and/or the way we use it means that if the connection is dropped without a proper close message, xenon-log just hangs indefinitely on listen() / socket.recv() and never ends.
[02:05:16] See _security backscroll as well.
[02:05:39] https://github.com/wikimedia/puppet/blob/be64ab2766685543d381cd1b23d28998fa460f04/modules/arclamp/files/xenon-log#L94
[02:05:45] https://gist.github.com/Krinkle/e7d22ed09986b0930797e2df1c18f70d
[02:06:15] https://github.com/andymccurdy/redis-py/blob/2.10.6/redis/client.py#L2501
[02:06:37] Looks like nothing is ensuring that, while waiting for the next message, the connection is still valid, e.g. no regular ping
[02:07:22] TimStarling: btw, how would I find out whether wiki@wikimedia.org is blackholed or not?
[02:07:29] context: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BounceHandler/+/447933/
[02:12:12] right, that was somewhat confusing
[02:12:36] going to that link I see the full list of log files
[02:12:52] but I gather from the backscroll in security that there are actually two servers with separate copies
[02:13:04] the codfw one works, the eqiad one doesn't
[02:13:17] TimStarling: that's right. I later saw that the codfw one was still working
[02:13:24] we serve active-active :)
[02:13:56] active-active without reliable replication
[02:13:58] for me (London now) https://performance.wikimedia.org/xenon/svgs/daily/ shows today and then, before that, Dec 13.
[02:14:29] TimStarling: Yeah, app servers (any cluster) send to the mwlog primary (eqiad currently), and from there it's subscribed to by webperf[12]002, which then generate the SVGs locally.
[02:14:38] yeah, I see that at webperf1002:/srv/xenon/logs/daily
[02:15:10] Long-term I want to move storage of that off local disk and into Swift or some such, so that we don't have to be limited to just a few days/weeks.
[02:15:51] https://phabricator.wikimedia.org/T200108
[02:15:54] Anyway, another time :)
[02:16:12] for now it'd be great to find a way to make sure it doesn't get stuck again. Afaik, this is the first time that has happened since 2015.
[02:16:32] which means patching the client?
[02:17:03] Yeah, either fixing redis-py, or using it in a way that isn't subject to this bug.
[02:19:20] e.g. some kind of busy loop that polls get_message and occasionally tries a ping, or something relating to run_in_thread, which seems relevant/interesting. Not sure to be honest.
[02:19:21] on the second issue, I'm trying to work out what you (or legoktm) mean by blackholed
[02:19:56] TimStarling: /dev/null, not persisted or usable by anyone, short of opsen manually inspecting the socket as things come in.
[02:21:58] presumably we can inspect the exim4 configuration
[02:22:26] wiki: :blackhole: This mail address sends out automated messages, please do not reply.
[02:22:50] that is in mx1001:/etc/exim4/aliases , so yes, it is blackholed
[02:23:10] cool, that's what I was looking for. Thanks :)
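
A minimal sketch of the get_message()-plus-periodic-ping idea floated at 02:19:20, not the actual xenon-log fix: the host, channel name, and handle() function below are placeholders, and PubSub.ping() is assumed to be available (it is in current redis-py; whether 2.10.6 ships it would need checking).

    # Sketch: poll with get_message() and periodically PING the pubsub
    # connection so a silently dropped TCP connection surfaces as an
    # exception instead of blocking forever in listen()/recv().
    import time
    import redis

    CHANNEL = 'xenon'        # placeholder channel name
    PING_INTERVAL = 60       # seconds between liveness checks

    def handle(data):
        # Placeholder for whatever per-message processing xenon-log does.
        print(data)

    client = redis.StrictRedis(
        host='mwlog1001',        # placeholder host
        socket_keepalive=True,   # let the kernel notice dead peers as well
    )

    while True:
        pubsub = client.pubsub(ignore_subscribe_messages=True)
        pubsub.subscribe(CHANNEL)
        last_ping = time.time()
        try:
            while True:
                # Poll with a timeout instead of the blocking listen() loop.
                message = pubsub.get_message(timeout=1.0)
                if message is not None and message['type'] == 'message':
                    handle(message['data'])
                # A write to a dead socket eventually raises ConnectionError,
                # unlike an idle recv(), so ping every PING_INTERVAL seconds.
                if time.time() - last_ping > PING_INTERVAL:
                    pubsub.ping()
                    last_ping = time.time()
        except (redis.ConnectionError, redis.TimeoutError):
            # Connection dropped: clean up and re-subscribe from scratch.
            pubsub.close()
            time.sleep(5)

run_in_thread (also mentioned above) spawns a worker that calls get_message in a loop, so it has the same liveness question; the socket_keepalive option is only a backstop, since the kernel can take a long time to declare a peer dead.
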
[08:44:49] someone needs to look into https://phabricator.wikimedia.org/T215464
[08:45:03] "Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter"
[08:52:08] <_joe_> the fact we have no process to treat UBN! tickets is just crazy
[08:52:27] <_joe_> we should stop relying on people's goodwill given how much the org has grown
[08:54:17] <_joe_> and I'm not blaming anyone, quite the contrary: I think our collective goodwill has taken us this far
[16:08:15] bpirkle: turns out we already had a ticket going https://phabricator.wikimedia.org/T214507; we need to think about the priority of this vs. other tasks though
[16:09:20] KateChapman: yep. This sounds like a relatively low-effort task, so I'm hopeful we can accomplish it.
[17:03:14] _joe_: it's rough, I have them listed on my phab dashboard, but the combination of A) not everyone uses it the same way (e.g. I ignore FR-tech's use of UBN! as I trust them, and then you have user groups who use UBN! for things like interviews, which I also have to mentally ignore) and B) sometimes they turn out to take a long time to resolve, unfortunately, so you have long-running UBN! tickets (which maybe should be reprioritized).
[17:03:35] last time I tried to codify a standard for UBN! I received a lot of pushback, so I just tried to manage it all in my head, with expected results
[17:04:47] <_joe_> greg-g: (sorry, I'm in bed sick, so forgive me if I'm not fully rational) I would chalk what you've been doing up to "collective goodwill" though
[17:05:04] sure
[17:05:21] <_joe_> and thank you for that, btw, but it's true we don't have a standard policy about how to notify teams of UBN! tickets
[17:05:32] <_joe_> and also, in some cases ownership is dubious
[17:05:54] * greg-g nods to both
[17:06:29] Phabricator workflows are difficult to standardize in part because Phabricator is a movement-wide tool rather than being exclusive to an org or department
[17:06:40] but we can probably do better
[17:07:02] andre__ would probably have ideas
[17:07:40] <_joe_> I'm waving away again, sorry
[17:12:21] bd808: yeah, hence the "doing it in my head" issue, which is fine-ish.
[19:11:09] anomie: would you be able to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/JsonConfig/+/488525 ?
[19:12:55] legoktm: Which versions do we currently support? I forget.
[19:13:35] * anomie finds https://www.mediawiki.org/wiki/Version_lifecycle
[19:13:36] https://noc.wikimedia.org/conf/ says wmf.14 and wmf.16 are deployed right now
[19:13:41] We still support 1.27?
[19:13:43] oh
[19:13:47] I meant just to production
[19:15:07] legoktm: I'm tempted to skip wmf.14, since group2 should be going to wmf.16 in about an hour.
[19:15:44] fine by me
[19:18:31] legoktm: I just added it to the SWAT window that's going on now.
[19:18:38] thanks :)
[19:18:47] * legoktm -> afk
[20:00:24] legoktm: Stupid flaky npm caused a bogus test failure, so it didn't get SWATted. And now the train covers the rest of my day. If you're back later, feel free to deploy it.