[09:01:20] ok RoanKattouw [09:01:32] Good morning [09:01:37] Good morning [09:01:42] I'll ask mlitn to come in here [09:01:43] how is the jet lag? [09:02:01] Surprisingly good! [09:02:14] I've woken up around 9am every morning so far [09:02:59] that's good, I didn't move, but I didn't have much sleep myself :-/ [09:04:44] hello, mlitn [09:04:54] hi [09:05:55] so, can we proceed? [09:06:46] I think so [09:06:47] I believe so [09:06:56] Matt updated the SQL files, lemme link to them [09:07:01] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FFlow/bc96b21b8c699fde0062a4fa50da8c8a9b80f234/db_patches%2Fpatch-reference_wiki-phase2.sql [09:07:01] I've just rewrote the alters into online alters [09:07:06] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FFlow/bc96b21b8c699fde0062a4fa50da8c8a9b80f234/db_patches%2Fpatch-reference_wiki-phase3.sql [09:07:09] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FFlow/bc96b21b8c699fde0062a4fa50da8c8a9b80f234/db_patches%2Fpatch-reference_wiki-phase4.sql [09:07:09] yes, used that [09:07:29] Awesome [09:07:36] so, all phases toghether [09:08:03] basicly, the only thing I would ask is to check that flow continues working during and after the process [09:08:18] I will notify on the main channel [09:08:56] ok [09:10:58] and this is why I tend to insist on primary keys: The new table `flowdb`.`_flow_wiki_ref_new` does not have a PRIMARY KEY or a unique index which is required for the DELETE trigger. [09:11:23] not a blocker now, I can do an online alter without pt-online-schema-change [09:11:31] but something to fix for the future [09:11:55] (it may block future schema changes if the table grows large) [09:12:29] We can do that [09:12:29] I checked just now and it looks like ~1500 rows already have a nonempty ref_src_wiki filed [09:12:29] *field [09:12:29] We have a feature flag in Flow that deliberately ignores that field for now [09:12:30] I'll also tail the error log [09:12:58] RoanKattouw, that was my previous run [09:13:03] that I aborted [09:13:24] knowing that by the code [09:13:56] do not worry, no change of plans, just let me rewrite my scripts in a sec [09:13:59] OK [09:14:07] Wait, we have DELETE triggers? [09:14:31] Oh, I guess the online-schema-change requires a primary key [09:14:32] no, RoanKattouw, those are created by the online change tool [09:14:35] +script [09:14:36] exactly [09:14:47] Yeah I vaguely remember hearing about that back when the script was written [09:14:52] they are deleted afterworse [09:15:07] but in general, PK as very useful also to check intgrity [09:15:23] Yeah, I agree [09:15:25] by many dba tools [09:15:55] and maybe this week I will be able to aprove the enforcement of that rule on the RFC meeting [09:16:04] but again, not something for now [09:16:19] let me try again in other way [09:16:38] I would support taht [09:16:47] For these tables we agreed we should add an artificial PK [09:16:52] We should actually get around to doing that, too [09:18:02] so now it went though [09:18:08] *through [09:18:49] performing now the updates, let's see if the indexes work [09:19:23] they do, 0.15 sec execution time [09:20:08] Awesome [09:20:52] no lag detected by my monitoring [09:21:35] I think flow_wiki_ref ended [09:22:31] mysql:research@x1-analytics-slave [flowdb]> select count(*) from flow_wiki_ref where ref_src_wiki is null or ref_src_wiki = ''; returns 3 [09:22:39] Let me see if that's just empty string rows [09:23:03] NULLs, weird [09:23:34] Maybe those rows point to things that don't exist any more [09:23:38] is the current code deployment compatible with that change, or will it fill it the old way [09:23:46] functionality still works, new data makes it in just fine [09:24:02] I can reexecute the updates [09:24:06] no issue with that [09:24:33] No, that won't help [09:24:45] Hmm, good question [09:24:47] true, 0 changes [09:24:55] mlitn: Do you know if new things will be populated with ref_src_wiki set or not? [09:25:02] Or do we have to disable the back-compat $wg thing now? [09:25:08] yes [09:25:14] they should be, and it appears they are [09:25:40] I can custom-change specific rows if told, but let's not rush [09:25:49] should I continue with the rest of the changes? [09:25:54] jynus: The 3 rows I found point to a workflow_id that doesn't exist, for some reason [09:25:58] So we should probably delete them [09:26:05] But I wonder how this is possible [09:26:10] mlitn: Any idea how that could happen? [09:26:27] RoanKattouw, can you copy on a paste or on the ticket the specific issue/rows [09:26:30] ? [09:26:39] Sure [09:27:04] looks like that workflow just doesn’t exist [09:27:27] (although I don’t think we ever delete workflows) [09:27:31] It's from Topic:Qaakkkf3x55yrrcz on ... some wiki [09:27:35] yeah [09:27:40] Difficult to tell which wiki it is [09:28:11] A wiki on which Project talk:Requests/Permissions/Hazard-SJ_(bureaucrat_2) exists I suppose [09:29:02] yeah [09:29:35] I can backup those rows and delete them on the actual table [09:29:46] see if we spot any errors [09:29:49] Sure [09:29:55] Let me paste them in the task [09:30:00] yes, please [09:30:08] Hmm I guess I should HEX() the binary fields [09:31:11] Done [09:31:26] that’s 04ce8e3d46482f5a00e943 [09:31:28] in hex [09:31:42] can’t find a workflow with that id [09:32:44] (which means the update queries were fine) [09:33:25] so I've backuped 3 records "from flow_wiki_ref where ref_src_wiki is null or ref_src_wiki = ''" on db1029:/home/jynus/flow_strange_records [09:33:57] and now I am going to run "DELETE from flow_wiki_ref where ref_src_wiki is null or ref_src_wiki = '';" [09:34:23] which would be much safer if we had PKs to select them individually [09:34:28] :-) [09:34:45] Yeah :) [09:34:45] ok with that, or do you want some time to discuss it? [09:34:53] Please add a condition [09:35:09] please tell me, or better, documment it on the ticket [09:35:13] ref_src_namespace=2600 and ref_src_title = 'Qaakkkf3x55yrrcz' [09:35:18] so other people can see it [09:35:26] Yeah, sorry, I just wanted to buy myself time to type out the condition :) [09:35:43] do not worry, we have all the time to do it well [09:35:51] and little time to do it wrong :-) [09:35:58] The 3 rows have those values --^^ and things pointing to that title are almost certainly useless, because it doesn't exist [09:36:09] I think it's from testwiki, as well [09:36:25] It looks like a half-saved duplicate of https://test.wikipedia.org/wiki/Topic:Qaakkkd93wke51om [09:36:36] that topic has the exact same links [09:37:14] So I think you can safely delete those 3 rows, they appear to be unsalvageable and also duplicated [09:38:27] jynus: You can also use WHERE HEX(ref_src_workflow_id) = '04CE8E3D46482F5A00E943'; if you like, because that workflow_id is dead [09:38:32] and all 3 rows match taht [09:39:32] In happier news, the other table (flow_ext_ref) has 0 blank rows [09:39:39] https://phabricator.wikimedia.org/T111084#1636556 [09:39:54] is that a go? [09:40:17] jynus: stop [09:40:30] You need parentheses when you mix AND and OR [09:40:32] just updated it [09:40:34] yes [09:40:47] OK yes that looks good [09:41:33] Query OK, 3 rows affected (0.00 sec) [09:41:42] so, shall we continue? [09:42:24] Yes, sounds good to me [09:43:25] running the final alters [09:43:30] I see no strange errors in the logs apart from exceptions that we know to be caused by the absence of the field we just populated (but are still ignoring because we haven't turned off the b/c code yet) [09:45:15] ok done [09:45:49] I would like to stress the PK again, otherwise, the schema change may be impossible in the future [09:46:10] let me paste the final schema [09:46:26] on the ticket and I will un-own it after that [09:46:32] Yeah, I'll add the PK task to our "do this now or soon" list [09:47:27] FYI: On https://en.wikipedia.org/wiki/Wikipedia_talk:Flow/Developer_test_page, I’m getting “(Cannot access the database: Too many connections (10.64.16.18))” [09:47:33] ( https://phabricator.wikimedia.org/T109676 ) [09:48:28] indeed an issue [09:48:39] Flow\Data\Storage\BasicDbStorage::find is blocked [09:49:22] What kind of query? [09:50:00] SELECT /* Flow\Data\Storage\BasicDbStorage::find (flow_wiki_ref) */ * FROM `flow_wiki_ref` WHERE ref_src_namespace = '0' AND ref_src_title = 'title' ORDER BY ref_src_object_id ASC LIMIT 500 [09:50:33] Is it blocked waiting for the ALTER to finish? [09:50:58] no [09:51:27] it is the indexes [09:51:35] the query is slow [09:51:47] I will revert the index drop [09:51:47] it should also have a WHERE ref_src_wiki … condition, probably [09:52:09] Ooh wait [09:52:18] Let me look at the phase4 index changes [09:52:44] we probably can’t drop that old indices until we remove the feature flag [09:53:14] Yeah, exactly [09:53:24] The v2 indexes are what we will use once we turn off the feature flag [09:53:37] mlitn: Which, BTW, we should be able to turn off the feature flag now, right? [09:53:55] yeah, should be safe [09:54:00] OK let's do that then [09:54:04] well, just turning the flag of should be safe [09:54:11] jynus: I'm going to deploy a config change so it'll use the v2 indexes instead [09:54:26] I have some patch that will remove the old paths, but that one needs to be rebased first [09:54:33] It looks like the instructions were in the wrong order, and we should have changed the config before doing the index drop in phase4 [09:54:51] mlitn: Oh you mean a patch that removes the code paths that run when the feature flag is on? [09:55:24] please give me the old indexes [09:55:29] that will be quicker [09:55:31] yeah, but that one needs to be updated first - don’t merge that one yet [09:55:42] only change the config for now :p [09:56:03] jynus: Let me grab them [09:56:37] I am not happy with this, this is not only making flow fail, but all services on that server [09:56:50] those must’ve been: [09:56:51] CREATE INDEX /*i*/flow_wiki_ref_idx ON /*_*/flow_wiki_ref (ref_src_namespace, ref_src_title, ref_type, ref_target_namespace, ref_target_title, ref_src_object_type, ref_src_object_id); [09:57:06] CREATE INDEX /*i*/flow_wiki_ref_revision ON /*_*/flow_wiki_ref (ref_src_namespace, ref_src_title, ref_src_object_type, ref_src_object_id, ref_type, ref_target_namespace, ref_target_title); [09:57:30] CREATE INDEX /*i*/flow_ext_ref_idx ON /*_*/flow_ext_ref (ref_src_namespace, ref_src_title, ref_type, ref_target, ref_src_object_type, ref_src_object_id); [09:57:45] CREATE INDEX /*i*/flow_ext_ref_revision ON /*_*/flow_ext_ref (ref_src_namespace, ref_src_title, ref_src_object_type, ref_src_object_id, ref_type, ref_target); [09:57:50] [09:59:12] ok ,fixed [09:59:29] https://test.wikipedia.org/wiki/Topic:Qaakkkd93wke51om works [09:59:36] Thanks [10:00:09] this would have been a full outage if flow wasn't on a separate server [10:00:15] works again [10:00:20] also, most querys went to the master? [10:00:34] Yeah Aaron has been complaining about that already :S [10:01:00] Flow (and Echo) sending queries to master is one of the things holding up the multi-DC thing he's working on [10:01:27] So we'll be looking at that; while I was sleeping off my jetlag Matt scheduled a meeting for this week linking to those tasks, so I assume someone's been prodding people about that [10:01:30] look, I will put the current state of things [10:01:33] on the ticket [10:01:40] (current schema and indexes) [10:01:56] but I want you guys a better plan [10:02:08] ^to create [10:02:12] ok? [10:02:22] To fix the PK issue and the queries-to-master issue you mean? [10:02:38] no, to avoid the outage [10:02:46] Right [10:02:55] but I will block any further change without a PK [10:03:23] Well we should not have asked you to remove any indexes this early, that was asking for trouble [10:03:37] I have a config change that should cause Flow to use the new (v2) indexes [10:03:57] Before deploying it I'll check SHOW CREATE TABLE to verify those v2 indexes actually exist [10:04:16] Then we can check that it does in fact use the v2 indexes and doesn't need the old ones any more [10:04:21] and then we can remove the old indexes [10:04:27] That's how we should have done it all along [10:05:20] no time now, give me a plan on the ticket, and convince me that this will not happen again [10:05:27] please [10:05:38] let me update it [10:08:36] 528293 requests failed, this was a major f* up [10:08:53] I'll write an outage report [10:10:28] Thanks for updating the task [10:16:08] jynus: am I right that all changes have already been applied to officewiki (including dropping the indices)? [10:16:25] that also means the index is not being used there [10:16:40] Let me check what SHOW CREATE TABLE says on officewiki [10:16:41] yes, but let me copy the structure from there, too, to have the exact definition [10:16:44] we need to disable feature flag there soon (or also put those indices back in place) [10:16:58] just ran show create table - old indices are indeed gone [10:17:33] Yup, v2 indexes only on officewiki [10:17:34] That's bad [10:17:35] and that is worse because it could create an outage on 800 wikis [10:17:45] But I guess we've survived it because officewiki traffic is really really low [10:17:48] yes [10:17:54] That's still very scary though :| [10:18:16] I will create the above indexes on officewiki [10:18:18] Thanks [10:18:29] I've updated the task with how I think we should proceed with flowdb [10:18:47] mlitn: What other wikis have local tables again, aside from officewiki? [10:18:49] does testwiki have to be altered, too? [10:19:47] 'wmgFlowCluster' => array( [10:19:47] 'default' => 'extension1', [10:19:48] 'private' => false, [10:19:49] ), [10:19:49] No, testwiki doesn't have local tables [10:19:53] Looks like it's only private wikis [10:20:16] Let me see which private wikis have flow [10:20:29] RoanKattouw: new plan seems fine [10:20:39] Only officewiki it seems [10:23:17] I've updated the plan to add that we'll disable the feature flag in beta first [10:23:23] Which Matt already wrote a patch for [10:24:20] let’s update & merge existing patch to the state prod is in & create (a) new patch(es) for next deployment [10:25:22] Yeah I'm working on that [10:25:58] See https://gerrit.wikimedia.org/r/238111 and dependency [10:27:44] LGTM [10:27:50] Let me do some quick tests, though [10:30:58] you can discuss in private or in a devel channel, if you want (as everything here will be logged, and maybe later more difficult to search) [10:31:17] Yeah we can go back to -collaboration [10:31:38] Let's check in later today about migrating from the old indexes to the new ones?