[05:15:56] DBA, Operations, ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (Marostegui) p:Triage>Normal a:Papaul Can we get a new disk here @papaul? Thanks!
[05:44:36] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui) I have been syncing up with @ayounsi about the scheduled network maintenance and the switches issues. So far they are still doing some tests (T201145) and they should know a bit more about how to proceed further in a few day...
[06:05:10] check https://gerrit.wikimedia.org/r/#/c/457847/ one last time and I will deploy it now
[06:52:21] hello :)
[06:52:37] jynus: I am reviewing https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/454291/ but I don't have much context on it
[06:53:09] this is only to remove non-system mysql users?
[06:53:22] yes
[06:53:39] I know analytics has some non-core mysql hosts
[06:53:54] I want to make sure this doesn't break that installation
[06:54:04] aka all your mysql users are system users
[06:54:04] thanks a lot for the heads up, going to check
[06:54:29] it should not happen if you are using jessie or higher
[06:54:35] but better safe than sorry
[06:55:18] so the check is basically to id mysql on the hosts in which we run non-core dbs and make sure that nothing is > 999
[06:55:50] the actual id may vary depending on the os
[06:55:59] for system users, I mean
[06:56:03] yep yep
[06:56:26] check /etc/adduser.conf
[06:56:45] but yes, in the latest debian versions I think it starts at 1000
[06:56:55] grep mysql /etc/passwd
[06:57:37] sadly, the mysql user used to be created as a non-system user
[06:57:43] years ago
[06:57:50] so some host may have survived
[06:58:19] I checked all db* hosts, but there could be other usages
[06:58:51] +1
[06:59:17] we have dbs on bohrium (piwik) and analytics1003 (for hadoop services)
[06:59:20] all good afaics
[06:59:51] thanks, I hoped so
[06:59:55] I will deploy this soon
[07:03:46] about https://phabricator.wikimedia.org/P7516, are those things in wikitech or presentations that you guys already gave to new opsens?
[07:14:40] [09:07] some are on wikitech
[07:14:42] [09:07] some overview is in presentations
[07:14:43] [09:07] but most of it is not
[07:14:45] [09:07] most of it is on puppet
[07:14:46] [09:08] or things a DBA should know independently of WMF knowledge
[07:15:50] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (Marostegui) I am happy to merge this task with T161859 if @jcrespo is (as he is the original task creator)
[07:26:16] jynus: in yesterday's test of the ReadOnly message, did you by any chance check it also in other wikis (not english) and notice if it gets translated or not?
[07:27:06] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (jcrespo) why not just make this a child of the other and agree to talk only in a single place? This is mostly for DBA work of data migration, which is...
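The system-user check discussed above (around 06:55) boils down to: on Debian, system accounts are created below UID 1000 (FIRST_UID in /etc/adduser.conf), so a mysql account with a UID at or above that boundary is not a system user and would be affected by a cleanup patch that only keeps system users. A minimal illustrative sketch, assuming it runs locally on the host being audited and that the Debian default boundary applies; the script and helper name are hypothetical, not the actual check that was run:

```python
# Illustrative local equivalent of "grep mysql /etc/passwd" plus a UID comparison.
import pwd
import sys

SYSTEM_UID_BOUNDARY = 1000  # assumption: Debian default FIRST_UID for regular users


def mysql_is_system_user() -> bool:
    """Return True if the local mysql account is absent or is a system user."""
    try:
        entry = pwd.getpwnam('mysql')
    except KeyError:
        return True  # no mysql account on this host, nothing to clean up
    return entry.pw_uid < SYSTEM_UID_BOUNDARY


if __name__ == '__main__':
    ok = mysql_is_system_user()
    print('mysql account is a system user (or absent):', ok)
    sys.exit(0 if ok else 1)
```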
[07:27:21] DBA, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (jcrespo)
[07:27:52] DBA, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973 (jcrespo)
[07:29:06] volans: I only checked dewiki and enwiki
[07:29:21] however, I didn't check that because I have all wikis in the english interface
[07:29:56] and I couldn't change it because you cannot change the preferences in read only
[07:30:10] you can use uselang= to test it
[07:30:29] if I use the mwdebug extension and go to it.wikipedia and try to edit, it should do the same right?
[07:30:56] if you choose a codfw debug server, yes
[07:31:02] yeah ofc :D
[07:31:17] ok so visualeditor shows me a big red box with a long italian message
[07:31:18] I am guessing it has italian as the default language
[07:31:22] or you do on your browser
[07:31:38] as a popup notification on the top right
[07:31:44] visualeditor I think ignores the admin message
[07:31:58] same for text edit, top red banner
[07:32:01] I would consider that a bug
[07:32:14] with a long italian message
[07:32:20] I think the wikitext editor does the right thing
[07:32:29] it shows a message from the interface
[07:32:39] and then says the admin gave the following reason:
[07:32:40] none of those show me the ReadOnly message in etcd, but another longer message
[07:32:45] and shows the read only message
[07:32:46] ah right
[07:32:48] in the middle
[07:33:03] yeah sorry pebcak
[07:33:04] sometimes it can be overridden if the parameter is not set
[07:33:08] no no
[07:33:08] the etcd message is in both
[07:33:14] in english after the localized message
[07:33:17] so I think it's ok
[07:33:24] if the translation or the wiki has been localized without the message
[07:33:26] it can happen
[07:33:31] but I guess that is not our problem
[07:33:36] ... L'amministratore che impostato il blocco ha fornito questa spiegazione: MediaWiki is in read-only mode for maintenance. Please try again in a few minutes..
[07:33:53] volans: should we link to meta?
[07:33:58] for more info
[07:34:29] not sure if links will actually link, I remember some discussion from last year
[07:34:39] and I'm waiting for the official message from comms
[07:34:43] that's the one I'll be using
[07:35:01] luckily I put it as a CLI param, so we can change it last minute :D
[07:37:46] we used to have https://meta.wikimedia.org/wiki/Tech/Server_switch_2017
[07:37:47] ah, yes
[07:37:47] on wikitext it should work
[07:37:47] volans: it is easy to test now, just an etcd deploy away
[07:37:48] however, the main issue is multi-language support
[07:37:50] the 2017 message was translated, we don't have that now
[07:37:51] volans: what happens if you try to log in in read only?
[07:37:51] does it fail?
[07:40:31] mmmh nope
[07:40:33] I can login
[07:41:08] * volans brb in ~10
[07:54:20] I am going to start enabling replication codfw -> eqiad in s1-s8, any objections?
[07:54:58] wait
[07:55:05] I was about to deploy the codfw change
[07:55:11] ok
[07:55:48] also do the change masters, but let me do a double check before the start slave - while the change is trivial
[07:55:54] a simple mistake can be very bad
[07:56:07] better 2 people checking
[07:56:10] yeah, I was planning to do so anyways :)
[07:56:37] also let me double check the backups
[08:00:54] did the partitioning for the s1 codfw hosts finish?
[08:03:59] marostegui: yes
[08:04:00] I didn't touch the others, something to fix for another time
[08:04:00] backups are ok, I can see it on the metadata gathered
[08:04:40] you can proceed with the replication changes, let's start with x1 or s5
[08:05:09] yeah, I was planning to do: s5, s6, s2 to start with
[08:05:31] I asked about the partitioning cause I am updating our etherpad
[08:05:33] with the pre-things
[08:07:08] sorry, I forgot about that
[08:08:27] jynus: check db1070 show slave status
[08:08:31] that is s5
[08:08:37] I kind of did line 65
[08:08:42] but I need to do more
[08:09:34] downtime the replication checks, it may page
[08:09:42] if we take some time setting it up
[08:10:14] done
[08:10:24] also log the actions at least once
[08:10:31] I logged it
[08:10:39] sorry, didn't see it
[08:10:45] too much going on right now :-)
[08:10:48] ;)
[08:11:22] s5 is ok to start
[08:11:29] ok, starting it
[08:11:31] should I check others?
[08:11:38] no, I am doing one at a time
[08:11:47] I will let you know when you can check the others
[08:11:59] I think it cannot page
[08:12:02] We should also disable GTID on codfw
[08:12:06] (on codfw masters)
[08:12:10] as stopped is only a warning
[08:12:14] but let's do that once all the replication is set up
[08:12:23] gtid on the master?
[08:12:33] codfw masters replicate from eqiad, and gtid is enabled
[08:12:34] it should be disabled by default?
[08:12:42] maybe we enabled it
[08:12:45] not sure, we should double check
[08:12:48] and if so, disable it
[08:12:50] sure
[08:12:53] +1
[08:13:00] that is on line 56 so we don't forget
[08:13:43] check db1061 (s6 master)
[08:14:43] looks good
[08:14:48] enabling it
[08:15:02] maybe there is no reason to do the checks?
[08:15:11] no, I prefer to get them checked
[08:15:14] worst case scenario it has to fail immediately
[08:15:22] check db1066 (s2 master)
[08:15:26] unless it is x1->s* or s* -> x1?
[08:15:48] We do this once a year, so I prefer to get another pair of eyes, even if we waste 15 minutes of our lives XD
[08:16:02] we can automate it!
[08:16:22] s2 looks good
[08:16:37] enabling
[08:16:47] worst time was bad because it brought down tendril with an infinite loop
[08:17:29] good times
[08:17:29] (first time)
[08:18:03] check s1 master db1067
[08:20:01] db2048 ok
[08:20:13] enabling
[08:20:24] I am checking logs at the same time too
[08:20:56] check s3 master db1075
[08:22:46] it is ok
[08:22:49] enabling
[08:23:40] check s4 master db1068
[08:24:33] all good
[08:24:40] enabling
[08:24:45] and only s7, s8 and x1 pending
[08:24:52] es2 and 3
[08:24:59] yeah, I mean from core
[08:25:07] es is core :-)
[08:25:13] :-P
[08:25:17] XD
[08:25:19] es is es!
[08:26:01] so I call them "core metadata"
[08:26:07] check s7 master db1062
[08:26:08] and es "core content"
[08:26:29] good naming
[08:26:46] as all literally implement mariadb::core on puppet
[08:27:19] s7 good
[08:27:23] enabling
[08:28:10] check s8 master db1071
[08:28:39] looks good
[08:28:43] enabling
[08:29:32] check x1 master db1069
[08:31:10] marostegui: jaime raised an exception while checking :-P
[08:31:24] checking overflow
[08:31:45] db2034 looks good
[08:31:52] enabling
[08:32:03] going for es2,3 now!
[08:32:13] did we do x1 or later?
[08:33:10] x1 was db1069 :)
[08:33:26] the one you just checked XD
[08:33:37] ok
[08:33:39] check es2 master es1015
[08:34:24] everything ok
[08:34:27] ok enabling
[08:35:20] check es3 master, last one! es1017
[08:35:46] we've got a page
[08:35:48] checking that
[08:46:15] let's continue
[08:46:22] yeah
[08:46:28] check es3 master es1017
[08:46:32] that is the last one
[08:46:57] we are good
[08:47:20] done
[08:47:21] we are good then
[08:47:26] I will check the GTID
[08:47:32] to see if it is enabled
[08:47:54] so you only deleted the heartbeat row to start x1?
[08:47:58] yeah
[08:48:04] so we can repool it, see patch
[08:48:10] yeah
[08:48:41] should we disable gtid now?
[08:48:43] DBA, Operations, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui) Replication codfw -> eqiad has been enabled on s1-s8, x1, es2, es3
[08:48:44] I can run the check the switchdc does
[08:48:52] if you want
[08:49:03] should we wait until next week?
[08:49:11] to disable GTID?
[08:49:19] is there a chance of a master host crashing?
[08:49:23] sure, that is a good idea
[08:49:25] I am mostly asking, not sure
[08:49:29] No, it is a good idea
[08:49:35] No benefit of doing it now
[08:49:40] I will create a calendar event for tuesday
[08:49:43] set it on the todo
[08:49:45] that
[08:49:49] yep
[08:49:54] I think volans' check will still work
[08:50:10] and we always test+modify at will
[08:50:21] and/or we can check its state in any case
[08:51:01] so the check right now worked (was failing yesterday ofc)
[08:51:04] yeah, the inverse check won't work
[08:51:06] for all core DBs
[08:51:19] what do you mean jynus?
[08:51:31] if you check the codfw -> eqiad
[08:51:37] it will fail
[08:51:40] it's what I just did
[08:51:46] it was failing yesterday, works now
[08:51:53] \o/
[08:51:55] that should not work
[08:52:01] why?
[08:52:06] * volans missing something
[08:52:19] because the current masters are not using gtid
[08:52:29] so you should not be able to check their gtid position
[08:52:49] I guess wait will return immediately
[08:53:00] which is why it works now
[08:53:03] it returns immediately with 0
[08:53:16] but it is not checking what we want to check
[08:53:26] another reason to avoid gtid - misleading info
[08:53:36] and doesn't fail?
[08:53:38] :(
[08:53:58] but it was failing yesterday
[08:54:08] yeah, because it wasn't replicating
[08:54:16] now it replicates but without gtid
[08:54:36] when we do a check, please print the debug and send it to me
[08:54:47] I need to attend to joe
[08:54:51] so, what query should I use to check?
[08:55:28] send me a detailed debug
[08:55:29] I am going to deploy your change to repool db1120, jynus
[08:55:35] please do
[08:55:39] :)
[08:58:03] jynus: https://phabricator.wikimedia.org/P7519
[08:58:17] that's x1, the others have the same logic
[08:58:42] 171966572-171966572-297840043,171974681-17197468, 1-198565537,180355159-180355159-17196909,180363268-180363268-40608909
[08:58:47] yeah, that is bad
[09:01:09] the replacement, however, is not simple
[09:01:20] SHOW MASTER STATUS\G
[09:01:39] and then comparing each column individually
[09:01:49] or
[09:01:56] we can check heartbeat
[09:02:18] jynus: but this has changed since last year? (we had GTID everywhere last year?)
[09:02:38] yes and no
[09:02:47] we know since last year that gtid is unreliable
[09:02:51] does that count?
[09:02:59] lol
[09:05:06] we can do "select ts FROM heartbeat.heartbeat WHERE datacenter = 'eqiad' and shard = 'es3' ORDER BY ts DESC LIMIT 1";
[09:05:40] on master and replica, and the to-master value should be strictly larger than the from-master one, or retry
[09:06:01] 'eqiad' is the from datacenter
[09:07:00] do you want me to write the code?
[09:07:25] oh, if it's that, it's easy
[09:07:29] we can keep the same logic
[09:07:36] I should be able to send a patch shortly
[09:07:39] but there is no wait
[09:07:45] we retry
[09:07:47] we have to implement the wait in code
[09:07:52] @retry
[09:07:52] <_joe_> ok so it's easier than it seemed?
[09:08:00] we've a nice decorator for that
[09:08:37] <_joe_> nice to hear, I was starting to worry
[09:08:37] that is likely the solution we are going to do for mediawiki as well
[09:08:56] this step anyway is non-critical
[09:09:10] we could have added a 10 second wait and we would be mostly ok
[09:09:36] the reason it worked now is because gtid was "contaminated" on both sides
[09:33:26] jynus: strictly > or >= ?
[09:34:12] >
[09:34:24] because it only hits every second
[09:34:37] we need to make sure a whole second is after
[09:34:51] ok so the master in the new DC must have a newer value compared to the one just read from the old DC
[09:34:54] you may want to wait one second on the client to avoid retries
[09:34:54] ack
[09:35:10] and do the checks in parallel
[09:35:19] to avoid waiting 20 seconds in series
[09:35:37] so to.heartbeat > from.heartbeat
[09:35:40] ok
[09:35:54] that will work without read only, but will not really test anything
[09:36:24] cannot do the checks in parallel as of now
[09:36:25] so it is easy to test right now
[09:36:43] ok so wait 1 second and do the checks when you can
[09:36:53] in other words
[09:36:59] yeah yeah got it
[09:37:14] can you gather the from-master info quickly
[09:37:23] and then check quickly
[09:37:38] from, e.g., an array?
[09:37:57] I can help
[09:38:09] the other thing is whether you trust my code
[09:39:21] marostegui: we can now use the new $mw_primary local variable with multi-instance
[09:39:30] \o/
[09:39:36] and I can deploy the read checks (but not this week)
[09:39:41] yeah
[09:39:42] *read only
[09:39:51] will you also deploy the AUTO change?
[09:40:17] no, there is no auto
[09:40:26] the solution is to check on the "client"
[09:40:30] So this can also move forward: https://phabricator.wikimedia.org/T199124
[09:40:39] but now it is better
[09:40:47] as it is etcd controlled
[09:40:55] not hiera controlled
[09:40:57] So we don't need this anymore? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449711/
[09:41:12] https://phabricator.wikimedia.org/T199124 has just been resolved
[09:41:19] \o/
[09:41:40] DBA, Operations, Puppet: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (jcrespo) a:jcrespo>Joe
[09:42:30] marostegui: we still need to change the current behaviour
[09:42:39] everything is critical
[09:42:47] to the same way the core is done now
[09:43:19] Ah, so basically we only need to change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449711/8/hieradata/role/common/mariadb/core_multiinstance.yaml
[09:43:25] the rest should stay as it is
[09:43:43] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/457491/4/modules/role/manifests/mariadb/core.pp
[09:43:50] ^like this
[09:43:58] it still needs some changes
[09:44:09] as I think the latest version is critical=1 for all
[09:44:19] yeah
[09:44:42] but no icinga or script changes
[09:44:48] that is great
[09:45:26] So what's the plan with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449742/ ?
[09:45:42] I will abandon both options
[09:45:57] joe implemented a sort of better version of it
[09:46:26] Cool, I have lost track of all the patches, so I will probably need help with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449711/
[09:47:09] https://gerrit.wikimedia.org/r/345346
[09:47:33] <_joe_> marostegui: I can help I guess, PM me when you need it
[09:47:54] Thanks _joe_ will do :)
[09:48:30] I can do that
[09:48:35] I kept track
[09:48:45] :-) thanks!
[10:11:59] I'm sending a preview of the code while I write the tests
[10:13:50] we are in a meeting :)
[10:20:10] jynus, marostegui: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/458470
[14:04:20] DBA, Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423 (Marostegui) Just for the record. The only hosts we have running with multi-source at the moment are: - labsdb1009-1011 - dbstore1002 (which will be gone once the new HW is bought an...
[15:48:56] DBA, Operations, ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (Marostegui) Talked to Papaul - this disk will be replaced on Monday, as he is on a different DC! Thanks!
[16:48:29] marostegui, jynus: FYI updated paste with the switchdc check: https://phabricator.wikimedia.org/P7519
[17:18:37] DBA, Operations, monitoring, Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (jcrespo) Resolved>Open p:Normal>High There is now too much logging, or it is not rotated fast enough: logs are consuming 70% of available disk:...
[17:30:19] DBA, Operations, monitoring, Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (Marostegui) I will play with the different log levels tomorrow to see which is the minimum we can do to still get the requests logged, or at least the failures
[17:39:38] DBA, Operations, monitoring, Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (Marostegui) I have purged logs from other dbproxies too just to make sure they are ok.
[19:45:27] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Ladsgroup)
[19:45:59] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Marostegui) p:Triage>Normal
[19:48:24] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Ladsgroup) a:Ladsgroup>None
[23:24:21] DBA, Operations, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Pine)
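For reference, the heartbeat-based catch-up check sketched in the 09:05-09:37 discussion (and implemented properly in the spicerack change linked at 10:20) works roughly like this: read the newest 'eqiad' heartbeat timestamp on the old master, then retry on the new master until it has replicated a strictly newer row, waiting about a second between attempts because heartbeat only writes once per second. The sketch below is a hedged illustration assuming pymysql connections and placeholder host names; it is not the real switchdc code:

```python
# Illustrative sketch only; the production check lives in spicerack/switchdc.
import time

import pymysql

HEARTBEAT_QUERY = (
    "SELECT ts FROM heartbeat.heartbeat "
    "WHERE datacenter = %s AND shard = %s ORDER BY ts DESC LIMIT 1"
)


def latest_heartbeat(conn, datacenter, shard):
    """Return the newest heartbeat timestamp this host has for the given DC/shard."""
    with conn.cursor() as cursor:
        cursor.execute(HEARTBEAT_QUERY, (datacenter, shard))
        row = cursor.fetchone()
        return row[0] if row else None


def wait_for_catch_up(from_master, to_master, shard, dc_from='eqiad', retries=10):
    """Retry until to.heartbeat > from.heartbeat, per the discussion above."""
    reference = latest_heartbeat(from_master, dc_from, shard)
    if reference is None:
        raise RuntimeError('no %s heartbeat found on the old master' % dc_from)
    for _ in range(retries):
        time.sleep(1)  # heartbeat ticks once per second, so always give it a moment
        current = latest_heartbeat(to_master, dc_from, shard)
        if current is not None and current > reference:
            return
    raise RuntimeError('%s: new master did not catch up with %s heartbeat' % (shard, dc_from))


# Hypothetical usage; host names and credentials are placeholders, not real ones.
if __name__ == '__main__':
    old_master = pymysql.connect(host='old-master.example', user='check', password='...')
    new_master = pymysql.connect(host='new-master.example', user='check', password='...')
    wait_for_catch_up(old_master, new_master, shard='s5')
```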