[00:42:51] bd808: any thoughts about when (if at all) we should set $wgAuthenticationTokenVersion?
[00:44:16] according to the edit stats dashboard the low point of editor activity is around UTC 04-08, so that's when it would be least disruptive
[00:44:33] (less accidental IP recordings in page histories etc)
[00:46:02] which means if we want it to happen on a different day than the group0 rollout (to be able to better differentiate intentional logout from errors), it should probably happen tonight
[04:01:21] tgr: is there any way to avoid doing that? It's pretty disruptive, and we've done it several times recently.
[04:02:31] yes, by not setting it
[04:04:08] context: 1. in some cases cookies were leaked to the wrong person, or weren't deleted on logout, we don't know of any way to identify the affected users, so the decision was that everyone needs to be logged out
[04:04:30] a script is running currently to do that, and it will finish approximately never
[04:05:24] 2. we want to be better prepared if something like that happens again, $wgAuthenticationTokenVersion is intended to do that (all cookies become invalid when it is changed)
[04:06:10] when $wgAuthenticationTokenVersion is initially set (ie. changed from null to 1), there is a way to recover the new valid cookie from the old invalid one
[04:07:04] so we need to figure out what level of secureness to apply
[04:08:13] the maximally secure/disruptive approach is to set $wgAuthenticationTokenVersion, run the script, everyone will be logged out twice, there is no way for an arbitrarily clever attacker to abuse a leaked cookie
[04:08:55] a less secure approach is to let the script run, and only set $wgAuthenticationTokenVersion if there is a new leak
[04:10:02] everyone will be / has been logged out once, for users with large central IDs a leaked cookie might remain valid for several days, if there is a new leak a clever attacker can make use of it despite $wgAuthenticationTokenVersion
[04:10:11] but in practice it is rather unlikely
[04:12:47] another option is to stop the script, set $wgAuthenticationTokenVersion; everyone will be logged out at the same time, even those who have already been logged out by the script (maybe 35% of users by now?); changing $wgAuthenticationTokenVersion again in case of a new leak will be secure, but a clever attacker who has a cookie from an old leak will be able to use it indefinitely (assuming the script did not reach that user)
[04:12:58] ori: ^ that about sums it up
[04:13:04] how long has the script been running?
[04:13:22] a week
[04:13:43] does it iterate over all users?
[04:13:56] yes
[04:14:16] https://gerrit.wikimedia.org/r/#/c/268850/ might speed it up
[04:14:39] but that does not change the fundamental security issue
[04:15:04] you could have iterated on keys in redis, that would have been quicker. too late for that suggestion, though.
[04:15:55] you can create a valid session by setting user_token and the username in a cookie even if the redis session has expired
[04:16:01] that's what "remember me" does
[04:16:22] right
[04:16:30] that's why you need $wgAuthenticationTokenVersion
[04:17:22] and we need the script because changing $wgAuthenticationTokenVersion from null to something is not foolproof
[04:17:42] why not?
[04:17:55] if we think that's an acceptable risk I'm OK with that, I'm just not comfortable being the one to make that decision
[04:18:33] null -> user_token goes into the cookie, non-null -> user_token is salted by $wgAuthenticationTokenVersion
[04:18:44] if $wgAuthenticationTokenVersion is random and private, that works
[04:18:58] let's make it random and private
[04:18:59] otherwise, the attacker can just calculate the result
[04:19:12] fair enough
[04:19:34] in that case, we should stop the script I suppose
[04:20:11] Killing the script seems uncontroversial. The reason for the script is to protect account security, but a three-week run time is not an acceptable response to that, so either we don't care about security (and should kill the script), or we care about security enough to do something adequate, like deploy / change $wgAuthenticationTokenVersion (and kill the script)
[04:21:25] we can also speed up the script, but of course that would also involve temporarily killing it
[04:21:45] but rolling out $wgAuthenticationTokenVersion is going to log people out anyway, right?
[04:22:40] yes, and if we make it private, there is no added security from the script
[04:23:16] I'm not a fan of private settings, they increase the cost of understanding the system and make an emergency action slower
[04:23:26] but I guess that beats a double logout
[04:23:41] how do they make an emergency action slower?
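The salting argument above (null puts user_token into the cookie unchanged; non-null derives the cookie value from user_token and $wgAuthenticationTokenVersion) can be sketched as follows. This is a minimal Python illustration, not MediaWiki's PHP code: the HMAC-SHA256 construction and the names `cookie_token`, `user_token`, and `token_version` are assumptions made for the example. It shows why a guessable version would let someone holding a leaked old cookie compute the new one, while a random, private version would not.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def cookie_token(user_token: str, token_version: Optional[str]) -> str:
    """Return the value that would be placed in the login cookie.

    Hypothetical model: with no version the raw user_token is used;
    otherwise the token is combined with the version via HMAC-SHA256.
    """
    if token_version is None:
        return user_token
    return hmac.new(
        token_version.encode(), user_token.encode(), hashlib.sha256
    ).hexdigest()


# A cookie leaked before the rollout exposes the raw user_token.
leaked_cookie = cookie_token("0123456789abcdef", None)

# Public, guessable version: the attacker can recompute the new cookie.
server_public = cookie_token("0123456789abcdef", "1")
attacker_public = cookie_token(leaked_cookie, "1")

# Random, private version: the attacker's guess no longer matches.
private_version = secrets.token_hex(16)
server_private = cookie_token("0123456789abcdef", private_version)
attacker_guess = cookie_token(leaked_cookie, "1")

print("public version, attacker matches:", attacker_public == server_public)
print("private version, attacker matches:", attacker_guess == server_private)
```

In this model the only secret an attacker with an old cookie lacks is the version string, which is why the discussion settles on keeping it random and private.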
[04:24:08] takes more time to figure out what to do when it needs to be changed the next time
[04:24:18] edit file on deploy server, sync, {{done}}
[04:24:34] it's a moot point anyway, since we need to do it
[04:24:46] whether or not it is optimal is irrelevant
[04:24:57] ah, you mean it leaves less of an audit trail to find in the future
[04:25:03] yes
[04:25:44] if you feel strongly about it, you can commit a change that adds a comment to *Settings.php where the variable would normally go
[04:25:49] someone with both some MediaWiki knowledge and some ops knowledge will have to be around to figure out what to do
[04:25:55] but ori is right
[04:26:16] include '$wgAuthenticationTokenVersion' in the comment so it's greppable and point the reader to PrivateSettings.php
[04:26:21] private MW settings don't need ops help. They are just in a git repo that only lives on the deploy servers
[04:27:08] knowing about the deploy servers is ops knowledge :)
[04:27:34] tgr: we can even make the setting part private and part public I suppose. That's not too hard to do.
[04:27:49] anyway, no disagreement about the course of action I suppose?
[04:27:53] that way you need to change two repos
[04:28:18] is the code for $wgAuthenticationTokenVersion deployed? In other words, is it just a matter of setting a value?
[04:28:23] It really could be all public for every change after the first
[04:28:28] yes
[04:28:40] is it set on beta?
[04:28:50] no, not yet
[04:28:58] bd808: well, assuming no one recorded any of the tokens leaked in the last two weeks
[04:29:11] but if that's the case we could just set it publicly now
[04:29:25] does the script output any indication of progress?
[04:29:35] yes
[04:29:37] Chris wanted the CA token values changed still didn't he?
[04:29:55] * bd808 thought this was talked out on phab on Friday
[04:30:12] I think he was okay with not changing those if the token salt is private
[04:30:26] I don't see any gains from changing the tokens in that case
[04:30:40] unless you assume an attacker who can obtain the salt
[04:30:57] ...but not the tokens from the DB, which seems unrealistic
[04:31:16] is there another milestone part-way between beta and full production deployment of $wgAuthenticationTokenVersion?
[04:32:21] we could deploy it on group0, but that would mean a different salt on loginwiki, I'm not sure if that would work or break group0 logins completely
[04:32:39] since the salt is also applied to the CA token
[04:32:54] it would break it, I think
[04:32:55] nope. We asked to set it last Thursday but were told to wait until rolling back to a version that didn't support it was unlikely (eg Tuesday)
[04:33:14] by whom?
[04:33:23] isn't that already unlikely?
[04:33:28] yes
[04:33:39] wmf.12 has been on group2 for four days with no problems reported
[04:34:16] I mean, that was a reasonable thing to say on Thursday
[04:34:25] but setting it now should not be problematic
[04:34:31] I agree
[04:34:41] So how's this for a plan: 1) Kill the script, after noting its current position. 2) Deploy $wgAuthenticationTokenVersion to beta. 3) Verify. 4) Deploy $wgAuthenticationTokenVersion to production.
[04:34:52] question is, how much time we want to leave to communicate it
[04:35:16] keeping in mind that setting it while SessionManager is being deployed is not ideal
[04:35:18] But getting it set tomorrow would really be more idea if the logout script is also killed so that subsequent logout issues are clearly due to SessionManager issues
[04:35:18] *more ideal
[04:35:18] * bd808 knocks on lots of wooden things
[04:35:41] or: how much time do we want to leave for people to exploit the fact that they are logged in to another user's account
[04:36:53] well, the first leak was 11 days ago, the second 4 days ago
[04:37:34] the second seems to have affected a tiny amount of people, not sure about the first
[04:37:45] I really think that this is either not a problem or it is an urgent problem, and it seems more like that latter
[04:37:50] *the
[04:38:34] so how about setting it UTC 05:00 tonight?
[04:38:45] which is an hour from now, I think
[04:38:46] ori: I've claimed a deploy window tomorrow for several logging config changes. I'd like you to look at this one -- https://gerrit.wikimedia.org/r/#/c/269063/
[04:38:46] that's the thing you were a little concerned about perf impact from
[04:38:46] (duplicating message value before PSR3 expansion)
[04:38:50] in prod? isn't that in 20 minutes?
[04:39:07] ok, timezone fail
[04:39:52] * bd808 was apparently very lagged
[04:39:55] having it live in beta for at least a couple of hours seems best, and that would also mean that opsen in CET would be on hand to support a deployment.
[04:40:07] if we want to minimize disruption, there is a window till UTC 6 or maybe 8 (seems to depend on day of week) after which edit frequency rises sharply
[04:40:28] more edits -> more chance of IP addresses recorded in page history involuntarily
[04:40:58] I think that is a relevant consideration, but ultimately it takes a backseat to making sure production doesn't break accidentally because a delicate deploy was rushed
[04:41:26] OK, so how would you time your step 4?
[04:42:10] beta works as expected for at least an hour and _joe_ is up and had coffee
[04:42:28] which should happen in two-three hours
[04:42:44] that also gives him a chance to tell us it's a terrible idea and we should absolutely not do it
[04:43:04] in case he sees something we haven't
[04:43:10] in any case, the first three steps are uncontroversial I think?
[04:43:45] I think so; let's see what bd808 thinks
[04:43:49] he might be still lagged
[04:43:56] *still be
[04:43:57] I would lean towards leaving the prod change a day later so that we can give the communities some heads up
[04:44:34] setting beta right now seems fine.
[04:44:44] did IP edits go up the last couple of times sessions were invalidated?
[04:45:07] I'm undecided on prod +2h vs +26h
[04:45:23] most wikis have loud edit notice templates for ips
[04:46:00] what happens in the Android app if the session disappears out from under you?
[04:46:11] legoktm: when you are around, can you stop resetGlobalUserTokens.php? see backscroll but short version is we will use $wgAuthenticationTokenVersion instead and want to avoid logging out any more people unnecessarily
[04:46:30] sure, should I do that now?
[04:46:36] please do
[04:46:56] legoktm: and record the position just in case
[04:47:40] ori: well, the last time was "continuously in the last two weeks"
[04:47:55] I left the screen running which has the last username that was run
[04:48:27] and the last id was 16602476
[04:48:33] Sunday night in US / early morning in EU is a nice time to do this because it marks the start of a new work-week. A new work-week means people switch to a different computer too (home vs. office) so it would be less jarring.
[04:49:36] legoktm: what do you think about this? specifically, is advance notice required (b/c "more edits -> more chance of IP addresses recorded in page history involuntarily", per tgr) or not?
[04:49:54] * legoktm reads up
[04:51:42] ori: I have seen requests to oversight IPs, but only one or two
[04:51:50] didn't go looking for them though
[04:51:54] [20:46:00] what happens in the Android app if the session disappears out from under you? <-- I thought we fixed that months ago by them setting assert=user ?
[04:52:59] ori: I think timing is going to suck regardless of when, we should just do it at the soonest time when we have proper ops/etc. coverage
[04:53:05] most bots don't use assert=user either
[04:53:10] I think PWB does
[04:53:23] pwb should be
[04:53:54] yeah, that's my opinion too
[04:53:58] And I sure hope other bots are
[04:54:02] (re: timing)
[04:54:27] if that's the plan then we should write to tech ambassadors and wherever else right now so people know what's happening and there is less FUD
[04:57:26] that sounds like a good idea
[04:58:57] OK, I'll do that
[05:01:17] I suppose there is no point in making the beta setting private?
[05:01:48] no, make it private. People expect beta to mirror production, and having it be public in one and private in the other is misleading.
[05:01:58] typically things that are private in prod are also private in beta
[05:04:08] so I just edit /srv/mediawiki-staging/private/PrivateSettings.php, commit and sync? or is there some private review process?
[05:07:52] bd808: ^
[05:08:18] tgr: no review process. you've got the steps
[05:08:33] running the jenkins scap job will sync it for you
[05:08:54] no command-line sync on beta?
[05:11:23] it's possible but not typically done
[05:11:46] the jenkins user is the one that has access to the ssh key via the long-running agent
[05:12:20] so to sync manually you need to sudo to the jenkins user (I never remember which one and have to look around to remind myself)
[05:17:22] bd808: that's beta-scap-equiad > build now?
[05:17:31] yes
[05:28:54] no effect
[05:29:17] but it was already running when I logged in to jenkins so probably I need to start it again?
[05:30:02] possibly, yes
[05:30:42] it should certainly sync everything in /srv/mediawiki-staging just like scap does in prod
[05:40:20] bd808: ever seen "Archives directory /vagrant/cache/apt/partial" on mw-vagrant? http://fpaste.org/319746/54909980/raw/
[05:40:41] I updated both VirtualBox and mw-vagrant recently...the latter for the first time in a month or so
[05:40:50] yeah. I think we have an open bug about it
[05:41:42] it happens when the virtualbox updater plugin tries to use apt before the directories are mounted from the host computer
[05:42:20] it doesn't halt your VM booting does it?
[05:42:51] no
[05:42:57] I can ssh in, but /vagrant is empty except for logs
[05:43:18] https://phabricator.wikimedia.org/T69976
[05:43:24] * bd808 reads to see why we closed it
[05:43:47] bd808: sync-common has finished and I am still not logged out, so I probably made some mistake
[05:44:12] tgr: hmm.. have you checked for the setting with eval.php?
[05:44:30] legoktm: just ssh in, create it and reload vagrant
[05:44:40] happens to me every once in a while
[05:44:46] ok
[05:45:36] yep, working now
[05:45:44] or at least it's progressing further
[05:46:07] bd808: it's set
[05:46:17] did I make a typo in the name or something?
[05:46:24] * bd808 looks
[05:48:22] spelling is right according to git gerp
[05:48:24] *preg
[05:48:30] lol
[05:48:34] *grep
[05:50:08] eh, stupid
[05:50:23] I ran mwscript on the bastion
[05:53:27] bd808: I get permission denied when trying to SSH to the appservers
[05:53:55] right, probably should set up agent forwarding
[05:54:02] sorry, it's late :)
[05:54:51] ugh, no mwscript
[05:55:02] tgr: I see $wgAuthenticationTokenVersion in deployment-mediawiki01.deployment-prep:/srv/mediawiki/private/PrivateSettings.php
[05:55:21] we don't provision mwscript on MW cluster hosts in beta or prod
[05:55:27] not really sure why
[05:55:47] probably just hysterical rasins
[05:56:06] yeah, finally managed to check
[05:56:50] vagrant working now, thanks bd808 and tgr
[05:57:20] all three appservers have it and I'm still not logged out
[06:02:03] the code seems to be doing its job, getToken() is different from the raw token for both local and central users
[06:02:56] and my cookie still has the old token so WTF is going on?
[06:05:44] I logged out, cleared my cookies and logged back in. I don't have any cookie that matches User::getToken()
[06:08:31] that's normal, CA suppresses that
[06:09:05] it should match CentralAuthUser::getInstance( User::newFromName( ... ) )->getAuthToken()
[06:23:17] setting $wgAuthenticationTokenVersion works fine on my vagrant CA testwiki
[06:46:29] anomie: when you are up, can you look at this?
[06:46:39] I'm unable to figure out what's wrong
[06:47:12] in eval.php the session gets rejected as it should, whether I try to load it from cookies or access it by session ID
[06:47:33] but over the web my user remains logged in with the old cookies just fine
[14:35:42] " But getting it set tomorrow would really be more idea if the logout script is also killed so that subsequent logout issues are clearly due to SessionManager issues" - well, except for the part that Wikipedia will still be on wmf.12 until Thursday.
[15:08:24] tgr: Huh. When executed through the web on deployment-mediawiki01, $wgAuthenticationTokenVersion is coming through as null.
[15:12:31] tgr: Touching PrivateSettings.php itself didn't make a difference, but touching the symlink (`touch -h /srv/mediawiki/wmf-config/PrivateSettings.php`) made it work, so I'm guessing some sort of caching issue.
[15:13:46] For whatever reason, deployment-mediawiki02 doesn't seem to have had a problem with it in the first place.
[15:51:01] tgr: there are a few of your patches that are going to fail due to MediaWiki core ApiDocumentationTest :-(((
[15:52:38] anomie: good morning! I will ride the train tomorrow. It will start one hour later than usual (8pm-10pm UTC or noon-2pm PST). I will cut the branch in my afternoon or roughly around 1pm UTC
[15:53:53] Good morning, hashar
[15:54:27] wmf.13 will have the session manager right
[15:54:42] That's the plan.
[16:04:59] going to be a lot of fun I guess
[16:08:35] Hopefully this time we don't have mysterious unreproducible issues reported by just two people on a Saturday.
[17:30:11] anomie: it wasn't the host, my browser was served via deployment-mediawiki02
[17:30:24] and I did check all three hosts with eval.php
[17:30:47] looks like eval.php and a web request are somehow loaded from different files
[17:31:04] another beta cluster weirdness
[17:33:17] tgr: All I know is that when I tried it, deployment-mediawiki02 was serving me logged-out pages while deployment-mediawiki01 had me logged in until I touched the PrivateSettings.php symlink. Maybe deployment-mediawiki02 did have the same bug when you tried it but someone/something fixed just that one before I tried?
[17:42:53] anomie: did you do the touch on the appserver or the deployment bastion?
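The workaround mentioned above (touching the symlink itself with `touch -h` rather than the file it points to) depends on a symlink having its own modification time, separate from its target. Here is a minimal Python sketch of that distinction, using a throwaway temporary directory and stand-in file names rather than the real /srv/mediawiki paths. Whether a cache keyed on the symlink's mtime is what actually caused the beta behaviour is only the guess made in the log; this example does not confirm it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the private settings file and the symlink pointing at it.
    target = os.path.join(tmp, "PrivateSettings.php")
    link = os.path.join(tmp, "PrivateSettings-link.php")
    with open(target, "w") as fh:
        fh.write("<?php\n")
    os.symlink(target, link)

    # Start both from the same arbitrary old (atime, mtime).
    old = (1_000_000_000, 1_000_000_000)
    os.utime(target, old)
    os.utime(link, old, follow_symlinks=False)

    # Like plain `touch`: the link is followed, so only the target changes.
    os.utime(link, (1_100_000_000, 1_100_000_000))
    target_mtime_after_plain = os.stat(target).st_mtime
    link_mtime_after_plain = os.lstat(link).st_mtime

    # Like `touch -h`: the symlink's own timestamp changes, the target's does not.
    os.utime(link, (1_200_000_000, 1_200_000_000), follow_symlinks=False)
    link_mtime_after_h = os.lstat(link).st_mtime
    target_mtime_after_h = os.stat(target).st_mtime

print("after plain touch:", target_mtime_after_plain, link_mtime_after_plain)
print("after touch -h:   ", target_mtime_after_h, link_mtime_after_h)
```

Anything that checks freshness with `lstat` on the symlink path would see no change after the first step and a change only after the second.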
[17:44:56] tgr: On deployment-mediawiki01
[19:02:59] anomie: #pywikibot question
[19:03:12] < tgr> the current OAuth implementation requires storing a cookie to preserve the session, I believe, so if your bot does the OAuth handshake correctly but does not return the cookie, you get a new session every time and the tokens don't work
[19:03:27] am I getting that right?
[19:03:36] that seems problematic
[19:04:09] tgr: I don't think it does require a cookie, actually. In SessionManager it's optional but I don't think we turned it on.
[19:04:25] The non-SessionManager code doesn't use a cookie at all, I believe.
[19:04:37] yeah, the non-SM one does not
[19:05:22] someone is having CSRF token problems, but SM is not deployed so that can't be reason then
[19:07:41] how does the current OAuth implementation preserve the session, then?
[19:08:16] (also, does it make sense at all to require tokens on an OAuth request?)
[19:09:54] tgr: They might have wound up with a bogus session cookie in their bot somehow, which blows up on non-SessionManager due to https://gerrit.wikimedia.org/r/#/c/268702/
[19:10:11] If that's the case, I believe just restarting the bot fixes it.
[20:00:02] legoktm: around?
[20:00:49] he's probably still mostly on Australian time
[20:00:54] :/
[20:01:06] need help with extension registration again (maybe someone else can help?)
[20:01:22] i want a few settings to be different for wikidata (https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/master/extension.json#L1991-L1996)
[20:01:39] these settings are arrays, and having looked at the code, it appears that array merge is done
[20:03:00] combining the wikidata setting (which is set after the extension is loaded, but before ConfigFactory::getDefaultInstance() which is where i think combining them happens)
[20:03:21] bd808: we could switch to your evil hack, which maybe would work
[20:05:07] aude: yeah, using the hook should work if nothing else does. Kunal had some ideas last week but I don't remember what they were
[20:05:08] (i'm also not sure attributes would work well for this case)
[20:05:24] oh yeah attributes was his fix right?
[20:05:48] i could try attributes but appears (at a glance) that array merge is also done?
[20:06:52] there's also something about merge strategies, but they are all some form of merge or combining arrays
[20:08:11] $this->attributes = array_merge_recursive( $this->attributes, $info['attributes'] );
[20:22:51] What's blocking the 5.5.9 bump now? https://gerrit.wikimedia.org/r/#/c/266931/
[21:24:20] aude: is there a bug about what you're trying to do? I'm still on messed up timezones >.>
[21:30:36] Reedy: more CI stuff, it keeps getting more and more complicated. https://phabricator.wikimedia.org/T126211?workflow=create came up last night
[21:31:26] legoktm: i can create one
[21:32:00] basically, i don't see a way in LocalSettings to override and replace an array-based default setting in an extension
[21:32:11] or unset parts of the array
[21:35:42] aude: https://phabricator.wikimedia.org/T121378 maybe?
[21:36:03] Also, I just got the most ridiculous email:
[21:36:04] >
[21:36:06] Recently the Wikimedia Foundation released version 1.26.2 of the MediaWiki platform. Support for older versions <= 1.22.x and 1.24.x is discontinued. Please update to a newer version today!
[21:36:06] We took a look at your https://www.wikidata.org/wiki/Wikidata:Main_Page wiki website, with 21069492 pages and 16281175 articles. Your 2650023 users have made 301161634 edits so far and uploaded 0 images/files.
[21:36:06] It's an important web property with an Alex rank of 19267 and with an Quant Cast rank of 35453
[21:36:08] Based on the current version of your site (MediaWiki 1.27.0-wmf.12), we'd like to offer help with upgrading to keep your 2650023 users happy!
[21:36:13] >
[21:36:46] from " eQuality Technology"
[21:40:44] legoktm: https://phabricator.wikimedia.org/T126273
[21:41:12] hah
[21:41:25] my task might be a duplicate
[21:42:52] https://phabricator.wikimedia.org/T120197 is what i am ultimately trying to fix
[21:44:44] * legoktm will take a look in a bit
[21:47:05] ok
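The array-merge problem raised earlier in the log (a site-specific value being combined with an extension's array default instead of replacing it) can be illustrated with a small model. This is a simplified Python sketch of a recursive merge, not PHP's `array_merge_recursive` and not MediaWiki's extension registration code; `merge_recursive` and the setting name `HypotheticalMenu` are made up for the example.

```python
from typing import Any


def merge_recursive(base: Any, extra: Any) -> Any:
    """Simplified recursive merge: dicts are merged key by key, and
    everything else is concatenated into a list instead of being replaced."""
    if isinstance(base, dict) and isinstance(extra, dict):
        merged = dict(base)
        for key, value in extra.items():
            merged[key] = merge_recursive(merged[key], value) if key in merged else value
        return merged
    base_list = base if isinstance(base, list) else [base]
    extra_list = extra if isinstance(extra, list) else [extra]
    return base_list + extra_list


# Hypothetical extension default and a per-wiki override that wants fewer entries.
extension_default = {"HypotheticalMenu": {"tabs": ["discovery", "personal"]}}
site_override = {"HypotheticalMenu": {"tabs": ["personal"]}}

result = merge_recursive(extension_default, site_override)
print(result)
```

Under any merge of this kind the override can only add entries, so "discovery" survives and "personal" is duplicated; removing or replacing part of an array default needs a different mechanism, such as the hook-based approach suggested in the log.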