[00:49:18] the remaining NEED_TOKEN load seems to be coming from a single bot [00:49:29] which tries to log in about ten times per second [00:50:46] btw shouldn't api.log soomehow redact passwords? [00:52:19] it shouldn't log them [00:53:24] api.log has lots of creepy things in it [00:53:27] I'll file a bug about that [00:53:35] but that seems like a really bad one [00:53:58] not trivial as it just logs the POST body but can surely be regexed away [00:54:56] the iOS app sends a per-device durable unique tracking token with each request (that actually causes an API warning as an unrecognized field) [00:58:41] bd808: now that I am in the security group, I have no idea how to file security bugs anymore :/ [00:58:51] the interface got a lot more confusing [00:59:14] do I need to set both Visible to: and Security: ? [00:59:25] heh. I don't know that I've ever know how to but maybe because I've always been in the security group [01:00:13] or maybe I should just read the text [01:00:16] "Security settings will override permissions and projects as needed." [01:03:25] tgr: https://www.mediawiki.org/wiki/Reporting_security_bugs [01:07:06] greg-g: should tgr and I wait until tomorrow morning to patch sessionmanager stuff? [01:07:24] we have 3-4 patches waiting now I think [01:17:25] 17:05 < greg-g> alright, I'm going afk for the evening, thanks all [01:17:34] we are on our own! [01:17:53] bd808: https://www.wikidata.org/wiki/User:Reza1615/BOT/request.py [01:18:12] this is the one bot still sending the login requests [01:18:59] I'm not fluent in Python, but looks like there is no cookie handling here at all? [01:19:19] cookielib.CookieJar() [01:19:52] oh, it's via the _opener? [01:20:43] looks like it "content = self._opener.open(a).read()" [01:20:48] that's the post action [01:21:12] My brain is getting tired but I can look more [01:21:33] I'll try to grab a full request log [01:21:46] probably less effort than trying to reverse-engineer [01:22:37] it uses http so probably related to that somehow [01:22:58] bd808: so what's the deployment plan? [01:23:36] tgr: well... now'ish or wait till tomorrow morning [01:23:49] what might we break with the patches tonight? [01:23:53] (everything) [01:24:31] If there is nothing terribly on fire I think waiting until tomorrow would be smartest [01:24:50] I haven't heard loud yelling for a while [01:24:54] https://gerrit.wikimedia.org/r/#/q/topic:sessionmanager-backports,n,z [01:25:01] is that all the pending stuff? [01:25:20] there is https://gerrit.wikimedia.org/r/#/c/267134 too [01:25:36] and my patch for loginwiki fixing that hasn't been merged [01:25:57] oh, yeah [01:26:42] so I think they all apply to loginwiki ? [01:26:46] these cover the two issues of no autologin for new accounts and no local login on loginwiki, right? [01:27:13] yeah I think that's the scope [01:27:57] doesn't sound terrible [01:28:11] hm and also no account creation on loginwiki [01:28:20] which is still not too bad [01:29:08] I wonder if should prepare a SessionManager-less wmf11 just in case there are more problems? [01:29:24] I think the downsides of a late evening deploy with few roots and no brad are worse than a few loginwiki bugs [01:29:34] ack [01:30:07] * anomie happens to be here again [01:30:09] if we had a .11 without sessionmanager ready that would probably make lots of people happy. [01:30:18] What's broken now? [01:30:28] nothing new anomie [01:30:42] just contingency planning [01:30:43] nothing new, but then, there was no rollout to group2 yet [01:30:56] wasn't there? [01:31:11] enwiki Special:Version reports wmf.11 [01:31:14] yeah [01:31:22] we're live to the world again [01:31:32] oh [01:31:41] I just wasn't paying attention then [01:31:49] thanks for scrambling to chase down all the weird bugs, folks [01:31:52] but last time we didn't really start getting reports of "oh shit" until 22 hours after deploy [01:31:53] must have been a stressful week [01:32:04] it's been "fun" [01:32:23] anomie has taken the brunt of it I think [01:32:48] with tgr and I just poking about to test his fixes and find new problems [01:33:03] thank you, anomie [01:33:04] And backport huge piles of stuff ;) [01:35:18] On the plus side I remembered that I can actually debug and write code [01:35:32] I'd lost track of that in the past few months [01:37:26] I realized I can't actually debug [01:37:35] xdebug spoiled me I guess [01:38:18] log statements ftw [01:38:49] interactive debuggers are a nice toy but never scale in my experience [01:38:52] well yes, but you actually need to understand the code to make sense of those [01:39:08] not just press next and watch where it takes you [01:39:36] I got broken of the use of ide debugging back in my java days with bugs that only showed up at full load in prod [01:40:33] ori: thanks for the words of support. they actually do help :) [01:45:14] well, if we don't plan to do anything else today, I'll go to forage food [01:46:08] sounds like a good plan [02:23:20] I left https://www.wikidata.org/wiki/User_talk:Reza1615/BOT/config_b.py#Should_be_using_HTTPS_rather_than_HTTP about the http config for that bot. [02:28:07] bd808: Doing some live logging hacks, it looks like Rezabot somehow has bad cookies that it keeps sending back, and those bad cookies makes CentralAuthSessionProvider keep rejecting its sessions. I don't know if it's ignoring the deletion of those cookies or if it somehow has them set on some other domain or path so the deletion doesn't apply. [07:09:33] Nice when bot operators post all their files on wiki :) [14:50:43] bd808, tgr: So, the need_token graph is still pretty flat. I pulled the entries from api.log for the 09 hour to look at the bump from about 08:00–12:00, and it looks like Rezabot just upped its rate then ~94000/hr for some unknown reason (no other bot showed up unusually in the logs). [15:46:28] tgr, bd808: So, what's the plan for https://gerrit.wikimedia.org/r/#/q/status:open+topic:sessionmanager-backports,n,z and https://gerrit.wikimedia.org/r/#/c/267134/ ? [16:00:49] anomie: we should deploy them soon I think [16:39:43] anomie: what are your thoughts on https://gerrit.wikimedia.org/r/#/c/267156/ ? tgr seemed slightly concerned that fixing it might break something else but I'm not sure I understood what. [16:41:45] bd808: I think it's a good fix for the bug. It's not perfect since it dumps you on Special:CentralLogin/status instead of sending you to the usual return URL, but since there's nothing *on* loginwiki to return to that's probably a low priority.. [16:47:47] Unless that case can somehow be triggered for logins on other wikis, then we'd want to fix it more urgently. To really fix it, it might be enough to replace lines 93–99 with a flag that later skips lines 111–134 (and set $newSessionId = $session['sessionId']; instead). [16:49:05] wctaiwan filed this not super helpful yet bug report overnight -- https://phabricator.wikimedia.org/T125194 [16:49:27] I was talking to him about it in #wikimedia-operations around 06:39 [16:50:10] it sounds vaguely like the bot complaints at https://phabricator.wikimedia.org/T124252#1979688 [16:50:56] Not sure if the logout script is still running or not. legoktm is on the other side of the world at the moment so it may be hard to find out [17:04:24] anomie: my reading of that code (which I don't trust much) was that it will ditch you on /status any time you attempt a login from a wiki while centrally logged in [17:04:52] ie. when you log in from something that does not have edge login (like outreachwiki) or when edge login is broken for some reason [17:05:34] Yeah, that's the "unless" case. :( [17:58:06] anomie: I wonder if T125194 and similar reports are due to you making the dirty session handling much better [17:58:19] like we used to save much more often [17:58:34] It's not impossible... [18:15:12] anomie: https://logstash.wikimedia.org/#/dashboard/elasticsearch/api-feature-usage-http [18:15:50] Beau.bot is the current leader [18:16:49] https://pl.wikipedia.org/wiki/Wikipedysta:Beau.bot [18:28:03] bd808, tgr: FYI, https://www.wikidata.org/wiki/User_talk:Yamaha5#Please_attend_to_your_bot [18:28:55] anomie: thanks! [18:28:57] harsh :) [18:31:52] * anomie wasn't trying to be harsh [18:33:20] I don't think it was harsh. You did you a direct tone but I don't see anything accusatory [18:39:21] probably I'm just paranoid about people taking offense when told to do something by "the WMF" [19:15:15] anomie, bd808: https://logstash.wikimedia.org/#dashboard/temp/AVKOztPiptxhN1Xa6jfY [19:15:24] we seem to have much more of these since the SWAT [19:17:38] bd808: Is there a way to somehow download the data (ideally just the exception.trace field) from logstash for those? [19:17:51] well... [19:18:23] technically yes, but I don't have an easy tool to do it at hand [19:18:50] Actually... maybe my ggml tool could do that [19:19:22] tgr: So one culprit is auto-create → AbortAutoAccount → AbuseFilter → trying to see if $wgUser is logged in → needs to load from session → boom [19:20:08] Can AbuseFilter just assume "no" if autocreating? [19:21:32] A lot of the ones on commons look to involve NewUserMessage [19:22:18] why is AbuseFilter even implementing that hook? [19:22:42] when it is a good idea to prevent autocreation of a local account for an existing central account? [19:37:21] anomie: https://phabricator.wikimedia.org/P2544 [19:37:57] bd808: Thanks! [20:22:59] tgr: Iirc, that was specifically for the case where vandals would create usernames that were blocked in one language on another wiki, then autocreate on the wiki where it was blocked. Those types of rules should all be global now. [20:25:19] bd808, tgr: Looking at T125194, CentralAuth's central session is supposed to stay in redis for 86400, but it's disappearing in less than an hour. Is redis evicting stuff, or is something else failing in there? [20:36:29] anomie: I get evicted_keys:0 on all redis instances [20:41:25] tgr: Well, let's test things. Do you see key "centralauth:session:23bbfd2a6103c1e8af327e7b7170e409" in redis now? It shouldn't expire until tomorrow. [20:46:44] nope [20:47:00] but it's entirely possible I'm just looking at the wrong place [20:47:10] should that be a string value? [20:48:01] tgr: It's in there if I do var_dump( CentralAuthUtils::getSessionCache()->get( 'centralauth:session:23bbfd2a6103c1e8af327e7b7170e409' ) ) on terbium. [20:55:37] the redis data is sharded to a pool of servers right? [20:55:51] using a chash or something like that [20:56:51] if there is redis strangeness that you are debugging I's suggest looping in ori and AaronSchulz since they are the redis dudes afaik [20:59:12] bd808: The background is that I got wctaiwan to reproduce the bug again and give me the centralauth_Session cookie value, and the entry wasn't in the cache. Then I tried one I had laying around from this morning, and that one was missing too. So the question is why they're disappearing in much less than the 86400-second expiry they were set with, with the followup being how do we fix that. [21:01:18] I thought it might be evictions, but tgr said he sees evicted_keys:0 on all redis instances. [21:01:37] I just took the brute-force approach of querying all redis servers [21:01:49] (assuming rdb1001-1008 is all of them) [21:03:10] but in theory if someone has a valid token cookie, shouldn't the session just transparently regenerate? [21:03:17] like local sessions do [21:04:29] If they don't check "remember me", though, they don't get the token cookie. And this early expiring of the CA session is probably why people are complaining about more finding themselves logged out after inactivity, since the CA session is sufficient to stay logged in and that stays for a whole day. [21:05:02] (It's probably not the bot issue, though, since API action=login always checks "remember me") [21:20:12] tgr, bd808 (also AaronSchulz, ori): The key I posted 38 minutes ago is no longer found. [22:07:03] tgr, bd808: So the #1 item on P2544 is from checking if the IP is blocked from account creation, specifically the $this->equals( $wgUser ) in User.php line 1529. I'm not sure whether the right answer there though is "just use the IP before Setup.php", "don't use the IP before Setup.php", "check if $this->getName() === $wgRequest->getSession()->getUser()->getName(), because that's what $wgUser will be", or "check if $this->getName() === [22:07:03] IP::sanitizeIP( $wgRequest->getIP() ), since that's what's actually intended in this use case I'm looking at". It might even be that T15611 is because the pre-SessionManager code was picking option 3 instead of option 4 (I think that was the case, anyway). [22:15:36] anomie: I can capture a few seconds / minutes of all redis activity on one of the session redis servers using MONITOR. That has been helpful for me in the past. [22:15:49] actually, you can do it yourself, too [23:13:28] anomie: jenkins shat all over that patch -- https://gerrit.wikimedia.org/r/#/c/267387/ [23:13:49] bah, unit tests. Will fix. [23:16:00] could just edit includes/session/SessionBackend.php on mira and sync that [23:16:08] we should be able to tell pretty quickly if it did the trick or not [23:16:56] anomie: should I do that live hack? ^ [23:17:08] bd808: Sure, it'll give us answers quicker [23:17:13] +1 [23:18:49] 3 channels, eh? :) [23:19:49] * bd808 changes to another channel out of spite [23:19:52] one possible counterpoint [23:20:16] if the issue is that mediawiki is saving sessions too frequently, you'd expect the traffic pattern to be flipped [23:20:33] but redis is sending more, mediawiki is receiving more [23:21:01] which would be consistent with loading sessions more frequently rather than saving them [23:29:21] (moved to -operations) [23:38:48] who runs mailman? [23:38:55] I just screwed up on accident [23:39:44] cursive: try pinging mutante on #wikimedia-operations if it's urgent [23:39:56] okay [23:44:41] Crap. I found a thing that requires ->save() to actually save a non-persisted session. I have a plan though. [23:46:58] anomie: you are entering the "one more patch! i'm so close!" rabbit-hole... i've been down it several times and regretted it [23:48:40] ori: This is in some sense a bit more like the "crap another sharp edge" that came with the hhvm rollout [23:48:58] but sadly we don't have a way to run 1/10th of the site on the new code all the time [23:49:26] yeah, I know -- I'm not pointing fingers. The auth stack is terrifying; I'm glad you guys tackled it and I'm sure that things will settle soon and the work will pay off. [23:49:58] *nod* didn't mean to just to a defensive posture [23:50:04] *jump to [23:53:24] I can't find the real replacement for checkSessionCookie(). the comment on that deprecated function names code that doesn't exist