[07:26:12] DBA, Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (Marostegui)
[07:26:22] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (Marostegui)
[07:26:29] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (Marostegui)
[07:46:16] DBA, monitoring, Wikimedia-Incident: Monitor swap/memory usage on databases - https://phabricator.wikimedia.org/T172490 (Marostegui) p:Triage>Normal
[07:51:56] <_joe_> jynus: I would really need your feedback on those patches of mine today
[07:52:13] <_joe_> they need to be live by tomorrow if we want to use them for the switchover
[07:52:24] <_joe_> else, I'll just let it go and postpone this
[08:01:22] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (Marostegui) @Andrew @bd808 any input?
[08:12:20] ok _joe_ I will review them- I thought this was not part of this switch
[08:21:32] DBA, Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (Marostegui)
[08:21:42] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (Marostegui)
[08:21:52] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (Marostegui)
[08:24:26] _joe_: I have questions for you regarding the switch
[08:24:40] and etcd
[08:25:21] <_joe_> please ask
[08:27:04] if I were literally now to drop the per-section read only status of codfw configuration, we would be in a good state? Things would still be in read only?
[08:27:35] <_joe_> $wgReadOnly is set in codfw
[08:27:39] aka is etcd global dc read only working?
[08:27:42] <_joe_> to a non-false value
[08:27:50] <_joe_> it should; if it's not, that's an issue
[08:28:10] basically I want to do the above, and it would be nice if you helped me check
[08:28:16] <_joe_> we should at least try on one section
[08:28:18] as we still had the section
[08:28:37] <_joe_> we can also add ReadOnlyBySection to etcd, of course
[08:28:38] so far, but we need to remove that for the switch dc way of doing it
[08:28:45] no
[08:28:48] not yet
[08:28:59] I want to make sure the global one works first :-)
[08:29:05] <_joe_> ok
[08:29:08] <_joe_> so let's try
[08:29:13] <_joe_> do you have a test in mind?
[08:29:15] let me make s5
[08:29:21] <_joe_> I try an edit on testwiki?
[08:29:24] and we try to maybe do an edit
[08:29:26] <_joe_> on codfw
[08:29:30] that would be dewiki
[08:29:32] DBA: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548 (Marostegui) p:Triage>Normal
[08:29:35] <_joe_> via x-wikimedia-debug
[08:29:38] which in the worst case scenario
[08:29:43] is the easiest to rebuild
[08:29:47] <_joe_> ok!
[08:29:56] and anyway, we also have the dbs in read only mode
[08:30:01] <_joe_> yes
[08:30:01] so it physically cannot happen
[08:30:08] <_joe_> I was about to say that
[08:30:10] <_joe_> :P
[08:30:11] but it should fail with a mediawiki error
[08:30:14] <_joe_> let's try
[08:30:21] not a mediawiki db is in --read-only
[08:30:31] _joe_: give me 1 second
[08:30:35] <_joe_> yeah, got it :)
[08:33:42] BTW we will want the etcd-controlled section read only
[08:34:12] but not relevant for this switch, just for master switches, which also require the other thing you were working on- so no rush
[08:36:05] jynus: I am going to start this on db2040 unless you are going to do something with s7 or that host: https://phabricator.wikimedia.org/T203548
[08:36:43] yes, please
[08:37:15] DBA: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548 (Marostegui) a:Marostegui
[08:44:52] _joe_: we can test s5 now (dewiki)
[08:46:22] mw2017, mw2099 are the right debug hosts on codfw?
[08:46:27] <_joe_> jynus: yes
[08:46:34] <_joe_> you can use x-wikimedia-debug
[08:46:47] I, yes, I see those on the config
[08:47:20] The Wikipedia database is temporarily in read-only mode
[08:47:23] that is for s1
[08:47:51] now testing on s5
[08:48:01] * marostegui crosses his fingers
[08:48:05] <_joe_> it seems ok
[08:48:14] Warning: The database has been locked for maintenance
[08:48:33] <_joe_> I just opened a talk page on dewiki to edit it
[08:48:35] ok, so removing all those configs
[08:48:39] \o/
[08:48:40] <_joe_> and I see a big red thing
[08:48:41] from all sections
[08:48:52] so that it can be purely etcd-based
[08:48:53] ok?
[08:48:57] <_joe_> saying "MediaWiki is in read-only mode for maintenance."
[08:49:06] <_joe_> which is what is in etcd
[08:49:22] <_joe_> jynus: I'd like to try to change that message
[08:49:33] <_joe_> and see if it gets to the wiki page or not
[08:49:43] I just wanted an extra pair of eyes, as this can be a delicate issue on switch
[08:49:48] sure
[08:49:55] on s5 for now
[08:50:13] as I guess the others for now will have the section preference
[08:50:25] _joe_: doesn't say: "MediaWiki is in read-only mode for maintenance. Please try again in a few minutes." ?
[08:50:30] that's the etcd value
[08:50:58] The system administrator who locked it offered this explanation: MediaWiki is in read-only mode for maintenance. Please try again in a few minutes.
[08:51:06] so it has its own internal explanation
[08:51:11] ok
[08:51:13] and then it prints the etcd value
[08:51:29] although I think the visual editor ignores completely the custom value
[08:51:33] <_joe_> confirmed the message changed in 5 seconds :)))
[08:52:30] _joe_: what is an easy way to see the etcd state?
[08:52:42] confctl --quiet --object-type mwconfig select 'name=.*' get
[08:52:42] does it have a web page like the other pooling states
[08:52:48] or from siteinfo
[08:52:50] I said easy (casual)
[08:52:51] from mediawiki
[08:53:15] curl -x mw2217.codfw.wmnet:80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general["readonly"]'
[08:53:15] <_joe_> jynus: it will once we've merged my changes :D
[08:53:19] and readonlymessage
[08:54:11] <_joe_> he said "easy"
[08:54:21] casual, it can be cached
[08:54:44] if I want a canonical and ui-bad, I would do the conftool call
[08:54:52] I think there is a patch from Alex to expose it in noc.w.o
[08:55:07] cool, if it is not yet there, no biggie
[08:55:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/455578
[08:56:45] <_joe_> that patch will depend on my changes now :)
[08:57:47] jynus: marostegui hey, do you think this can be done during the time that codfw is active? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/456027
[08:58:27] Amir1: not now, but arriving so late, no guarantee
[08:58:35] send a bug on phab
[08:58:56] sure, waiting for it to get merged first, (almost done)
[08:59:36] being I think a non-core table change, it will probably be independent, but we will be busy with other core maintenance
[08:59:59] Amir1: We already have a bunch of tasks that we need to do (T189107), once you create the task for it, we can add it there but no guarantees we'll get to it, it will depend on how all the other maintenance goes
[08:59:59] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107
[09:00:15] when it's done I can flip the switch to read from the new column, that would make RC, Watchlist, history and some other queries way faster
[09:00:31] Sure thing
[09:01:11] Amir1: Also, at a first glance, it doesn't look like a task that would strictly need the DC failover
[09:01:20] that is what I meant
[09:02:25] Amir1: If we have time, we might deploy the change in eqiad whilst it is passive, and then codfw once we have flipped back. But as we said, no guarantees we'll get to it. Sorry
[09:02:28] Yeah, I mean it's not the type that needs DC failover but since you said no big schema change when codfw is active
[09:02:42] I was wondering if that's okay
[09:02:53] Amir1: To merge and create the task, that is ok ;-)
[09:02:59] Amir1: as I said, create a task when ready
[09:03:09] we will ping back depending on our availability
[09:03:18] merge is ok, if it is not active
[09:03:19] Amir1: Enabling a new feature that relies on a new schema change done during the failover, that is not :)
[09:03:23] that ^
[09:04:44] marostegui, _joe_ volans: about to merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/458128/1/wmf-config/db-codfw.php
[09:04:51] I see, I won't turn it on any time soon as this needs to be switched on for lots of test wikis first
[09:05:05] that makes sense
[09:05:20] and test wikis can probably be done quicker and enabled during that
[09:05:29] ok to reboot db1116 or would that disrupt anything?
[09:05:31] jynus: fine by me with that merge
[09:05:48] moritzm: remind me what 11116 is?
[09:05:59] moritzm: that is a spare host, so fine
[09:06:27] jynus: spare: https://phabricator.wikimedia.org/T196376
[09:06:30] former sanitarium
[09:06:35] jynus: ack, do you need anything from me?
[09:06:54] I just wanted to doublecheck that none of you is doing anything with it currently (like setting it up or so :-)
[09:07:05] rebooting in a bit, then
[09:07:05] moritzm: we are mostly blocked on analytics, platform and cloud to do the pending eqiad reboots
[09:07:19] we pinged some of them, got no answer yet
[09:07:44] help us make some pressure there :-)
[09:08:21] on our side, only 6 masters pending that we will do in a week or so
[09:11:46] volans: just one question
[09:11:54] sure
[09:12:15] should we test switch with master-master replication or before we set it up?
[09:12:33] after = more realistic state, before = more secure testing
[09:12:41] I know you plan to set the replica codfw->eqiad tomorrow
[09:12:52] (operationally, it should be the same)
[09:13:06] we'll probably start doing some tests today, and I know that the check of eqiad dbs being in sync with codfw ones will fail
[09:13:14] and should succeed tomorrow
[09:13:21] after you enable it
[09:13:34] ok, so give me feedback when you need it
[09:13:39] so I guess we're good in the sense that the tests today will be safer
[09:13:46] and tomorrow will be more realistic
[09:13:47] :)
[09:14:04] _joe_: I am rechecking your patches now
[09:14:53] marostegui: BTW db1114 still behaving wrongly in query performance
[09:15:04] after restart+upgrade+analyze
[09:15:19] maybe we can clone it from elsewhere or something
[09:15:30] DBA, Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (Marostegui) @Bstorm s3 is finished. You can re-run the views. Only s1 pending which most likely be done in a couple of weeks. Thanks for the patience!
[09:15:35] all other replicas, including codfw ones behave nicely
[09:16:10] jynus: I pinged Luca on https://phabricator.wikimedia.org/T184267#4527078, he had overlooked your comment until now
[09:16:13] jynus: Yeah, let's reclone it from another replica once eqiad is passive so we don't waste more time with it
[09:22:36] I will pool it with low api load to prevent high amount of errors
[09:22:48] it produces probably 50-80% of long running queries
[09:55:05] jynus, marostegui: are you ok once we complete the dry-run tests to make a live test in the opposite direction for the switchover? that means setting core DB masters in CODFW to RO, ofc should be a noop
[09:55:52] volans: that's fine with me
[09:57:10] we can do a live test of some other things too, like tendril
[09:57:30] (in case you want, not that important)
[09:57:42] we'll do a live test of everything, I was mentioning the risky step :D
[09:57:51] ok
[09:57:53] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/457944
[09:58:03] sorry forgot to add you too to that one I thought I had
[09:58:16] no memcache wipe /restart in the end?
[09:58:38] yes but before the RO period
[09:58:54] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_0_-_preparation point 4
[09:58:55] ok
[09:59:36] let me focus on the mw_primary stuff so we don't need to do a puppet deploy
[10:00:21] ack, that would be great! thanks
[10:00:38] (we still need a puppet deploy for traffic, but yeah the less the better)
[10:32:27] volans: I don't understand https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/457944/3/cookbooks/sre/switchdc/mediawiki/00-wipe-and-warmup-caches.py
[10:32:39] you want to wipe and restart eqiad?
[10:33:09] or live_test is confusing to me, one of the 2
[10:33:12] jynus: ofc not :)
[10:33:12] It requires that DC_FROM is already the passive datacenter
[10:33:13] and DC_TO is already the active datacenter
[10:33:41] that's live-test, we'll test codfw->eqiad with --live-test
[10:34:06] that in an ideal world should be a noop and working as is
[10:34:22] but given some things are special the --live-test takes care of those
[10:34:48] do you check what is the active dc before allowing a live test?
[10:34:59] because calling it live test is confusing to me
[10:35:34] any better name to propose?
[10:35:44] adding a check to the active one is tricky
[10:36:05] because each piece has its own concept of active (MW, discovery records, traffic)
[10:36:11] but I'm open for suggestions
[10:36:23] "test"
[10:36:37] test gives the idea it doesn't do harmful things
[10:36:40] while it technically does
[10:37:08] there is also --dry-run, but that doesn't run at all the things that modify stuff
[10:37:26] or just make a new recipe
[10:37:34] that calls the old one with inverted parameters
[10:38:02] ?
[10:38:36] ignore me, but keep me away from those functions because I will bring down production
[10:39:04] I don't want to ignore you, just understand ;)
[10:40:13] that function is confusing to me
[10:40:43] so, for a bit of context
[10:41:10] last year jo.e and me did this "inverse" test manually, reasoning over each task if it was safe to run it inverted (codfw->eqiad) or not
[10:41:28] I don't have a problem with doing what you want to do
[10:41:29] except for the warmup task, that is special, and must be run eqiad->codfw for the test
[10:41:59] so instead of having to think about this for each one, I thought that was better to provide a flag that allows to make this
[10:42:33] but you say "It requires that DC_FROM is already the passive datacenter and DC_TO is already the active datacenter"
[10:42:34] now, we can also call it --live-test (or any other name) and if you pass eqiad->codfw it actually does the inverse
[10:42:51] I guess is confusing in both cases, but I'm open to invert that meaning if it's more confusing
[10:43:37] I am guessing that someone will get confused with from, to, inverse switch and live test
[10:43:41] yes, I wanted the user to make an informed choice and tell me, go codfw->eqiad but bear in mind this is a live-test, so avoid dangerous things in eqiad
[10:44:20] I agree, I'm not sure --live-test eqiad->codfw that does the opposite of what you asked is less confusing though
[10:44:33] *not you, the user running it
[10:44:43] I am not proposing to invert it
[10:45:20] I am proposing to check that what we want to do is what we really want to do because the ui is confusing
[10:45:28] or changing the ui so it is less confusing
[10:45:38] let's make the ui better, do you have a proposal?
[10:45:43] it's a confusing concept in itself
[10:49:48] live test for wipes should do the same
[10:49:57] as a non-live test
[10:50:51] if you want to do eqiad -> codfw testing, say that, and the script will invert things only as needed
[10:51:15] but the script will invert everything BUT the cache wipe+warmup
[10:51:30] and will log moved Foo from codfw to eqiad
[10:51:39] wouldn't that be confusing?
[10:52:36] not on code- it will be present on those places where changes will be done later
[10:52:45] e.g. on an active-active scenario
[10:53:12] or call the test --invert-test
[10:53:18] something more explicit
[10:57:05] moritzm: db2070 was restarted 6 days ago only
[10:57:15] was a kernel released 6 days ago?
[10:57:49] how fresh is https://phabricator.wikimedia.org/P7510
[10:58:07] ?
[10:58:39] jynus: ok, I'll think of a proposal and let you know
[10:58:55] I was not complaining saying to ignore me
[10:59:08] but meaning that I would be confused
[10:59:54] jynus: let me check
[10:59:56] yep and I don't want people to get confused :)
[11:00:02] moritzm: Linux db2070 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) x86_64 GNU/Linux
[11:00:06] I think that host is ok
[11:00:12] for example
[11:00:39] either the list is old or it didn't check for newer kernels or something?
[11:01:21] up 6 days
[11:01:32] yeah, db2070 is in fact up-to-date, the paste should be fresh, though. probably a one-off error on my side, but I'll doublecheck the remaining servers in a bit
[11:01:48] db1090
[11:01:53] up 6 days
[11:01:59] 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4
[11:02:08] ^not a one off
[11:02:14] I can generate my list
[11:02:19] don't want to bother you
[11:02:37] but maybe you are skipping the latest kernel or something
[11:02:44] or considering 110 < 88?
[11:02:58] (just a ping in case you do it for other hosts, not a problem for me)
[11:03:04] it's fine, I'll look into it when I'm done rebooting deploy1001
[11:03:22] I'll write to the channel if there are others missing
[11:03:50] thanks, sorry for bothering you
[11:05:20] np at all
[11:08:18] doublechecked the list and removed db1090/db2070, everything else is < 4.9.88
[12:20:35] DBA, Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (Bstorm) Done on my end.
[12:48:35] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Ladsgroup) We should note that hive is behind NDA and production access which only most staff and handful...
[13:07:44] DBA: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Marostegui) p:Triage>Normal
[13:08:08] Maybe banyek can take ^ during the DC failover?
[13:09:14] Sure I can, but the question is 'how to reclone a host'
[13:09:25] We'll get to it! :)
[13:09:48] a.w.e.s.o.m.e.
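Note on the "considering 110 < 88?" question from 11:02:44: one way a host list could misclassify 4.9.110 is by comparing kernel versions as strings rather than as numbers. The sketch below only illustrates that pitfall. It assumes version strings shaped like `4.9.110-3+deb9u4`; `version_key` and `needs_reboot` are hypothetical helper names, and nothing here is known to be how the list in P7510 was actually generated.

```python
import re


def version_key(version: str) -> tuple:
    """Parse the leading dotted part of a version string into a tuple of ints.

    '4.9.110-3+deb9u4' -> (4, 9, 110). Hypothetical helper for illustration.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no leading numeric version in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_reboot(running: str, minimum: str) -> bool:
    """True if the running kernel is older than the minimum, compared numerically."""
    return version_key(running) < version_key(minimum)


# Numeric comparison: 110 is newer than 88, so this host is fine.
print(needs_reboot("4.9.110-3+deb9u4", "4.9.88"))

# Plain string comparison goes character by character and gets it wrong,
# because '1' sorts before '8'.
print("4.9.110" < "4.9.88")
```

The threshold `4.9.88` is taken from the 11:08:18 message; any other minimum can be passed the same way.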
[13:11:28] FYI, I'm installing PHP updates, there might be a few seconds of non-avail for dbmonitor shortly
[13:11:59] Thanks for the heads up moritzm
[13:14:42] all done
[13:56:33] banyek: the reason we are waiting to tell you this is because the fast way is really running 1 command
[13:57:14] but we want to tell you everything that happens below that command because hw and sw are usually not in agreement with each other :-)
[13:57:36] I see
[13:58:24] also, recloning a db is a bit more dangerous than recloning an app server, so better you have all the information
[14:00:33] that makes sense
[14:40:02] * volans will be available for more detailed explanations/info on the 1 command, if needed
[14:40:08] ;)
[14:42:59] https://jynus.com/better-call-volans.jpg
[14:44:00] jynus: I would like to know how many hits that URL had in the past month XD
[14:46:33] ROTFL jynus that image xD
[14:49:45] * volans wants the royalties on the hits
[15:49:51] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui)
[17:15:56] DBA: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548 (Marostegui) db2040 finished: ``` root@db2040.codfw.wmnet[metawiki]> alter table pagelinks remove partitioning; Query OK, 968681449 rows affected (8 hours 33 min 57.15 sec) Records:...
[17:16:06] DBA: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548 (Marostegui)
[17:16:48] DBA: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548 (Marostegui)
[17:34:14] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (Andrew) Hi all! I'm a bit lost because I think this task no longer has anything to do with its original post (which is about moving the databases off...
[17:36:37] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (Marostegui) In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirected to eqiad and thus to m5. So nothing t...
[17:38:27] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (Andrew) >>! In T167973#4560316, @Marostegui wrote: > In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirec...
[19:51:33] DBA, Operations, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) T202588 exists for the quarry migration. That will unblock a lot of this.
[19:53:07] DBA, Cloud-Services, Community-Wikimetrics, Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625 (Dzahn) T202588 exists for the quarry migration. that will unblock a lot of this. Also T162070 is a duplicate of this ticket in a way.
[21:06:38] DBA, Operations, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (colewhite) Mysql module is also used in puppet/modules/profile/manifests/icinga.pp. This should be removed once the transition to st...
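Note on the siteinfo check pasted at 08:53:15: the same read-only lookup can be done on the JSON body in a few lines of Python. This is a rough sketch under assumptions, not a tool used in the channel. It assumes the body came from `action=query&meta=siteinfo` with `formatversion=2` as in that curl command, and it assumes the reason is exposed under the key `readonlyreason` (the log refers to it as "readonlymessage"). `read_only_state` is a hypothetical helper name, and the sample payload is made up for illustration.

```python
import json


def read_only_state(siteinfo_body: str) -> tuple:
    """Return (is_read_only, reason) from a siteinfo API response body.

    Missing keys are treated as "not read-only" with no reason.
    """
    general = json.loads(siteinfo_body).get("query", {}).get("general", {})
    return bool(general.get("readonly", False)), general.get("readonlyreason")


# Illustrative payload only, not captured from a real host.
sample = json.dumps(
    {
        "query": {
            "general": {
                "readonly": True,
                "readonlyreason": "MediaWiki is in read-only mode for maintenance.",
            }
        }
    }
)

is_read_only, reason = read_only_state(sample)
print(is_read_only, reason)
```

As noted in the log, this path can be cached, so the confctl call remains the canonical source for the etcd value.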