[01:57:34] Krinkle: what is broken about it?
[02:05:06] TimStarling: Looks like something caused the redis TCP connection between mwlog1001 and webperf1002 to get dropped on Dec 13 2018, and the redis-py client and/or the way we use it means that if the connection is dropped without a proper close message, xenon-log just hangs indefinitely on listen() / socket.recv() and never ends.
[02:05:16] See _security backscroll as well.
[02:05:39] https://github.com/wikimedia/puppet/blob/be64ab2766685543d381cd1b23d28998fa460f04/modules/arclamp/files/xenon-log#L94
[02:05:45] https://gist.github.com/Krinkle/e7d22ed09986b0930797e2df1c18f70d
[02:06:15] https://github.com/andymccurdy/redis-py/blob/2.10.6/redis/client.py#L2501
[02:06:37] Looks like nothing is ensuring that, while waiting for the next message, the connection is still valid, e.g. no regular ping
[02:07:22] TimStarling: btw, how would I find out whether wiki@wikimedia.org is blackholed or not?
[02:07:29] context: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BounceHandler/+/447933/
[02:12:12] right, that was somewhat confusing
[02:12:36] going to that link I see the full list of log files
[02:12:52] but I gather from the backscroll in security that there are actually two servers with separate copies
[02:13:04] the codfw one works, the eqiad one doesn't
[02:13:17] TimStarling: that's right. I later saw that the codfw one was still working
[02:13:24] we serve active-active :)
[02:13:56] active-active without reliable replication
[02:13:58] for me (London now) https://performance.wikimedia.org/xenon/svgs/daily/ shows today and then, before that, Dec 13.
[02:14:29] TimStarling: Yeah, app servers (any cluster) send to the mwlog primary (eqiad currently), and from there it's subscribed to by webperf[12]002, which then generate the SVGs locally.
[02:14:38] yeah, I see that at webperf1002:/srv/xenon/logs/daily
[02:15:10] Long-term I want to move storage of that off local disk and into Swift or some such, so that we don't have to be limited to just a few days/weeks.
[02:15:51] https://phabricator.wikimedia.org/T200108
[02:15:54] Anyway, another time :)
[02:16:12] for now it'd be great to find a way to make sure it doesn't get stuck again. Afaik, this is the first time that has happened since 2015.
[02:16:32] which means patching the client?
[02:17:03] Yeah, either fixing redis-py, or using it in a way that isn't subject to this bug.
[02:19:20] e.g. some kind of busy loop that polls get_message and occasionally tries a ping, or something relating to run_in_thread, which seems relevant/interesting. Not sure to be honest.
[02:19:21] on the second issue, I'm trying to work out what you (or legoktm) mean by blackholed
[02:19:56] TimStarling: /dev/null, not persisted or usable by anyone, short of opsen manually inspecting the socket as things come in.
[02:21:58] presumably we can inspect the exim4 configuration
[02:22:26] wiki: :blackhole: This mail address sends out automated messages, please do not reply.
[02:22:50] that is in mx1001:/etc/exim4/aliases , so yes, it is blackholed
[02:23:10] cool, that's what I was looking for. Thanks :)
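
A minimal sketch of the get_message()-plus-periodic-ping idea floated at 02:19:20, not the actual xenon-log fix: the host, channel name, and handle() function below are placeholders, and PubSub.ping() is assumed to be available (it is in current redis-py; whether 2.10.6 ships it would need checking).

    # Sketch: poll with get_message() and periodically PING the pubsub
    # connection so a silently dropped TCP connection surfaces as an
    # exception instead of blocking forever in listen()/recv().
    import time
    import redis

    CHANNEL = 'xenon'        # placeholder channel name
    PING_INTERVAL = 60       # seconds between liveness checks

    def handle(data):
        # Placeholder for whatever per-message processing xenon-log does.
        print(data)

    client = redis.StrictRedis(
        host='mwlog1001',        # placeholder host
        socket_keepalive=True,   # let the kernel notice dead peers as well
    )

    while True:
        pubsub = client.pubsub(ignore_subscribe_messages=True)
        pubsub.subscribe(CHANNEL)
        last_ping = time.time()
        try:
            while True:
                # Poll with a timeout instead of the blocking listen() loop.
                message = pubsub.get_message(timeout=1.0)
                if message is not None and message['type'] == 'message':
                    handle(message['data'])
                # A write to a dead socket eventually raises ConnectionError,
                # unlike an idle recv(), so ping every PING_INTERVAL seconds.
                if time.time() - last_ping > PING_INTERVAL:
                    pubsub.ping()
                    last_ping = time.time()
        except (redis.ConnectionError, redis.TimeoutError):
            # Connection dropped: clean up and re-subscribe from scratch.
            pubsub.close()
            time.sleep(5)

run_in_thread (also mentioned above) spawns a worker that calls get_message in a loop, so it has the same liveness question; the socket_keepalive option is only a backstop, since the kernel can take a long time to declare a peer dead.
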
[08:44:49] someone needs to look into https://phabricator.wikimedia.org/T215464
[08:45:03] "Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter"
[08:52:08] <_joe_> the fact we have no process to treat UBN! tickets is just crazy
[08:52:27] <_joe_> we should stop relying on people's goodwill given how much the org has grown
[08:54:17] <_joe_> and I'm not blaming anyone, quite the contrary: I think our collective goodwill has taken us this far
[16:08:15] bpirkle: turns out we already had a ticket going https://phabricator.wikimedia.org/T214507; we need to think about the priority of this vs. other tasks though
[16:09:20] KateChapman: yep. This sounds like a relatively low-effort task, so I'm hopeful we can accomplish it.
[17:03:14] _joe_: it's rough, I have them listed on my phab dashboard, but the combination of A) not everyone uses it the same way (e.g. I ignore FR-tech's use of UBN! as I trust them, and then you have user groups who use UBN! for things like interviews, which I also have to mentally ignore) and B) sometimes they turn out to take a long time to resolve, unfortunately, so you have long-running UBN! tickets (which maybe should be reprioritized).
[17:03:35] last time I tried to codify a standard for UBN! I received a lot of pushback, so I just tried to manage it all in my head, with expected results
[17:04:47] <_joe_> greg-g: (sorry, I'm in bed sick, so forgive me if I'm not fully rational) I would chalk what you've been doing up to "collective goodwill" though
[17:05:04] sure
[17:05:21] <_joe_> and thank you for that, btw, but it's true we don't have a standard policy about how to notify teams of UBN! tickets
[17:05:32] <_joe_> and also, in some cases ownership is dubious
[17:05:54] * greg-g nods to both
[17:06:29] Phabricator workflows are difficult to standardize in part because Phabricator is a movement-wide tool rather than being exclusive to an org or department
[17:06:40] but we can probably do better
[17:07:02] andre__ would probably have ideas
[17:07:40] <_joe_> I'm waving away again, sorry
[17:12:21] bd808: yeah, hence the "doing it in my head" issue, which is fine-ish.
[19:11:09] anomie: would you be able to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/JsonConfig/+/488525 ?
[19:12:55] legoktm: Which versions do we currently support? I forget.
[19:13:35] * anomie finds https://www.mediawiki.org/wiki/Version_lifecycle
[19:13:36] https://noc.wikimedia.org/conf/ says wmf.14 and wmf.16 are deployed right now
[19:13:41] We still support 1.27?
[19:13:43] oh
[19:13:47] I meant just to production
[19:15:07] legoktm: I'm tempted to skip wmf.14, since group2 should be going to wmf.16 in about an hour.
[19:15:44] fine by me
[19:18:31] legoktm: I just added it to the SWAT window that's going on now.
[19:18:38] thanks :)
[19:18:47] * legoktm -> afk
[20:00:24] legoktm: Stupid flaky npm caused a bogus test failure, so it didn't get SWATted. And now the train covers the rest of my day. If you're back later, feel free to deploy it.