[04:28:24] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) I have repooled labsdb1011 back \o/ In the end, I have gone for innodb_purge_threads = 4, given that we have a super recent backup of this host. Let's see if that help...
[04:52:25] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) @Bstorm I have depooled labsdb1011 again as I was getting permission denied from quarry, however, the grants for my user are on labsdb1011, maybe some grants are missi...
[05:06:27] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui)
[05:06:37] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) 05Open→03Resolved This was done. RO starts: 05:00:44 RO stops: 05:03:47
[05:06:39] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[05:07:09] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[05:07:31] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui)
[05:09:13] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) 05Open→03Resolved This has been dropped everywhere! \o/
[05:09:15] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui)
[05:17:07] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) Works for me. We could also do {T146055} at the same time.
[05:21:53] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) ` root@cumin1001:/home/marostegui# mysql.py -hlabsdb1010 mysql -e "select count(*) from user" +----------+ | count(*) | +----------+ | 4048 | +----------+ root@cum...
[05:23:47] I am going to guess that you ran mysql upgrade and it gave no errors, after the import
[05:27:00] yes
[05:27:46] I can check the mysql tables are in the correct format anyway
[05:27:54] maybe it didn't like the import
[05:28:11] yeah, I was going to do that after going for a run
[05:28:20] my guess was that maybe we need to re-create the users or something
[05:28:22] let me do a preliminary check
[05:28:29] ok, I am going to go for a run meanwhile
[05:28:30] and you can go for the run
[05:28:33] heads up
[05:28:39] I will depool some es hosts
[05:28:41] fyi
[05:28:42] cool
[05:28:48] will be back later
[05:28:51] thanks o/
[05:28:52] bye!
[05:29:13] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10jcrespo) @mmodell could we schedule a specific date for this, so it is not forgotten? How much time do you need to prepare T146055? Work on our side is not too time consum...
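(For reference: a quick way to sanity-check the innodb_purge_threads change mentioned at the top of this excerpt and to watch the purge backlog would be something like the sketch below. The exact client invocation and hostname form are illustrative, not what was actually run.)

    # Sketch only: confirm the configured purge threads and the purge backlog
    # ("History list length" in the InnoDB status output).
    mysql -h labsdb1011 -e "SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads'"
    mysql -h labsdb1011 -e "SHOW ENGINE INNODB STATUS\G" | grep 'History list length'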
[06:25:22] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10jcrespo) One thing I can see is that labsdb1011 uses the new mysql authentication format, meaning: 10.1 host: ` Host: % User: sXXXXX...
[06:35:45] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10jcrespo) Grants for labsdbuser, which is the default role on both servers for cloud users are also (almost) the same: ` $ diff <(mysql.py -h labsdb1010 -e "show grants for labsdbu...
[07:12:03] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) More food for thought, this also happens on the CLI: ` root@labsdb1011:~# mysql --skip-ssl -uu15343 -p Enter password: Welcome to the MariaDB monitor. Commands end wi...
[07:16:05] not sure if you are fully back, but it seems that authentication works, but authorization (grants to read data don't)?
[07:16:17] yeah
[07:16:23] that's what I am seeing on the CLI
[07:16:30] maybe a view issue?
[07:16:38] I checked also with the underlying db
[07:16:39] not the view
[07:16:54] I checked both
[07:18:21] can you select without use?
[07:18:49] yes
[07:18:52] I can do a select now
[07:19:05] but somehow I can not see anything else
[07:19:24] maybe the grant meaning has changed
[07:19:33] select doesn't imply list
[07:19:37] what do you mean?
[07:19:37] ah
[07:19:55] or something something roles are broken in the new version
[07:20:12] I am going to drop the role and create it again
[07:20:24] because it looks like the role isn't correctly applied or something
[07:20:53] yeah, we had this ticket that had issues with role race conditions
[07:21:03] which is supposed to be fixed in 10.4
[07:21:05] I was reading this
[07:21:08] https://jira.mariadb.org/browse/MDEV-19650
[07:21:19] but I don't think it is related
[07:23:12] the views themselves look good, definer, viewmaster exists, etc
[07:23:55] flush privileges?
[07:23:58] :-D
[07:24:01] done already :)
[07:24:09] classic thing to do when you are out of ideas XD
[07:24:12] it wouldn't make sense after a restart
[07:24:18] but hey
[07:24:23] same thinking
[07:24:40] my vote is on what you mention
[07:26:11] did you see: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1022&var-port=9104&from=1589869562303&to=1589873162303 ?
[07:26:38] there is definitely more load, and the purging is ramping up
[07:26:58] 2 threads?
[07:27:09] but looks good overall, no significant jump in threads_connected due to excessive locking
[07:27:14] yeah, 2 threads
[07:27:32] and I am ok on leaving it like that, assuming it doesn't take too long
[07:27:54] until it takes too long and we have to think of a different strategy
[07:28:15] note also xmldumps are running at the same time
[07:42:00] db2136 is alerting
[07:42:47] kormat: ^
[07:43:06] on it
[07:44:11] the fuck
[07:44:21] i disabled its notifications completely yesterday
[07:44:31] they are disabled
[07:44:34] notifications are disabled
[07:44:38] it is not really alerting on irc or anything, jaime meant on icinga
[07:44:49] ahh, i see
[07:45:06] so I was literally checking the icinga manual
[07:45:21] by coincidence, trying to improve it
[07:45:32] ok :)
[07:45:45] but it kinda addresses this
[07:45:56] Merely disabling notifications does not move alerts out of the unhandled section and has the potential to be forgotten.
Consider using downtimes and ACKs instead. ACKs automatically are removed on the next state change.
[07:46:34] what I do is acking when a server is catching up (without notifications)
[07:47:33] again, I insist I only share how I do things, in case that is helpful, not telling you how you should do them :-D
[07:47:58] icinga seriously confuses the hell out of me
[07:48:03] yeah
[07:48:30] don't get me wrong, you are not the problem here :-D
[07:48:33] haha
[07:48:44] but precisely because of how special it is
[07:48:56] we create those "standards" to make it manageable
[07:49:00] also - i've no idea how you guys notice when something _is_ alerting
[07:49:10] mm
[07:49:16] i never see it until someone points it out to me on #-operations
[07:49:22] what am i missing there?
[07:49:32] ok, so again, time to tell you how *I do things*
[07:49:36] :)
[07:49:56] I have 20 or so always open sticky tabs on firefox
[07:50:13] :O
[07:50:53] mail, calendar, eqiad.json, icinga ongoing alerts, prometheus mysql aggregated, gerrit, phabricator, etherpad, wikitech sal, tendril, kibana, and replag tools db
[07:51:03] it saves me time
[07:51:45] the other thing is I am being a bit more explicit with you than I would be with manuel- feel free to tell me off
[07:51:50] :-D
[07:51:58] for learning purposes
[07:52:42] I have a bad tendency of stressing people, Manuel can attest, so feel free to tell me to cool off :-D
[07:52:48] related question: is there somewhere in the icinga ui that takes you to icinga.wikimedia.org/alerts? i've been unable to spot anything
[07:53:09] kormat: I normally have one screen dedicated only to icinga.wikimedia.org/alerts
[07:53:12] yeah, I think that is just unhandled criticals
[07:53:14] so i keep it open all the time
[07:53:29] ^see other option, similar to mine, but different
[07:53:43] jesus. ok.
[07:53:59] I think that is actually something you can practice
[07:54:05] "env awareness"
[07:54:38] please ask questions if you saw something late
[07:54:44] we are happy to help
[07:55:18] kormat: to be honest, I didn't tell you this yet, I think you still have more things to worry about and this will add an extra overhead and stress
[07:55:24] yeah
[07:55:27] ok :)
[07:55:43] kormat: Same for the logstash dashboards I sent you yesterday
[07:55:53] No need to keep them open at all times I think
[07:56:01] You are not at that point yet, and it will just add more stress
[07:56:05] so please don't worry about that
[07:56:12] it is too soon
[07:56:20] ok :)
[07:56:58] just to be clear, I was just mentioning the possibility of him asking questions, not that he should be doing the same as I
[07:59:52] and manuel will be right to answer "for later" :-D
[08:08:03] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) More tests: I have dropped the labsdbuser role, and recreated it. Assigned it to my user but still no luck. ` mysql:root@localhost [(none)]> show grants for labsdbuse...
[08:23:48] jynus: I think I found the issue with labsdb1011
[08:23:50] let me double check
[08:24:19] default role?
[08:24:30] yeah, looks like it is not getting assigned for some reason
[08:24:38] if I force the user to get it itself, it works
[08:24:59] maybe default roles disappeared on upgrade?
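(For reference: a minimal sketch of how the "did the default role survive the upgrade?" question above can be checked and repaired on MariaDB 10.4. The user, host pattern and role names are taken from the conversation; the exact statements are assumptions, not a record of what was run.)

    # Did the default role survive the 10.1 -> 10.4 upgrade for this user?
    mysql -h labsdb1011 -e "SELECT user, host, default_role FROM mysql.user WHERE user = 'u15343'"
    # If default_role is empty, grant the role and make it the default again.
    mysql -h labsdb1011 -e "GRANT labsdbuser TO 'u15343'@'%'"
    mysql -h labsdb1011 -e "SET DEFAULT ROLE labsdbuser FOR 'u15343'@'%'"
    # A fresh session as that user should now pick the role up automatically.
    mysql -h labsdb1011 -u u15343 -p -e "SELECT CURRENT_ROLE"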
[08:25:11] maybe, but the column is there on the user table
[08:25:13] I will dig a bit
[08:25:20] and labsdbuser is set as default
[08:25:26] but maybe needs manual assignment first
[08:25:39] https://phabricator.wikimedia.org/P11229
[08:25:47] It wouldn't surprise me if it is a bug
[08:27:44] i cannot reproduce in my env
[08:27:50] however, something weird happens
[08:28:28] the grants show up on the show grants
[08:28:31] the inherited ones
[08:29:35] https://phabricator.wikimedia.org/P11229#64545
[08:29:52] looks like that is it
[08:29:52] ah, no that is normal
[08:30:12] so if it is "those were dropped on upgrade"
[08:30:18] that is easy to fix
[08:30:27] yeah, I am going to fix that now
[08:30:35] but maybe worth reporting upstream
[08:30:39] yeah
[08:30:43] as I wouldn't expect that to happen to be honest
[08:30:47] maybe I can reproduce
[08:30:57] on a 10.1 -> 10.4 upgrade
[08:31:15] "default roles dropped after upgrade"
[08:32:02] ah, and double check you reboke that enwiki grants to labsdb :-D
[08:32:07] *revoke
[08:32:08] yeah ;)
[08:32:43] honestly, if this is the bug, it is good news
[08:32:55] I was already thinking roles were completely broken
[08:33:25] if they are only a little broken, we can live with that :-D
[08:34:00] there is a chance that the roles were dropped and what fixed it was the role recreation
[08:34:04] ?
[08:34:38] no, because the role recreation didn't fix it until I did it manually: https://phabricator.wikimedia.org/T249188#6147580
[08:35:20] I am not sure if I explained myself
[08:35:21] I am checking all the other users and: default_role:
[08:35:31] while my user: default_role: labsdbuser
[08:35:39] so it definitely dropped the default one there
[08:35:41] the issue already existed
[08:36:03] but on recreation it didn't fix it because default role may get dropped too?
[08:36:07] I can test
[08:36:08] maybe
[08:36:10] on my db
[08:36:43] yeah, if you drop the role, it drops the default/assignment
[08:36:54] so still an upgrade issue
[08:36:56] ok, that makes sense
[08:36:58] yeah
[08:37:11] but probably the same issue we were suffering on 10.1?
[08:37:20] "role not applying"
[08:37:42] the race condition?
[08:37:45] yeah
[08:37:51] specially with the upgrade + logical import
[08:37:56] I can see that happening again
[08:38:12] not sure, I think the race condition also implied creating a db or something, I cannot recall exactly, I can look for the ticket
[08:39:08] what surprised me is that roles work differently than normal grants
[08:39:24] I am going to manually change the proxy first, and if you can test select @@hostname in quarry with your user, that'd be cool
[08:39:48] ok
[08:39:49] I have given the role to all the users now
[08:39:54] oh, that was fast
[08:39:55] let me know when ready and I can reload the proxy
[08:40:17] ready
[08:40:22] ok, one sec
[08:40:52] try
[08:41:04] labsdb1011
[08:41:10] https://quarry.wmflabs.org/query/44983
[08:41:11] sweeeet
[08:41:32] Access denied for user 's52788'@'%' to database 'enwiki'
[08:41:37] excellent
[08:41:38] (that was expected, just checking)
[08:41:47] I can see queries flowing too
[08:41:51] from other users
[08:41:52] I can query revision
[08:42:00] yeah, there are queries now
[08:42:04] let me change it in puppet
[08:42:07] wait, isn't this server lagging?
[08:42:15] no
[08:42:27] oh yes
[08:42:28] yeah, repl stopped for 4 h
[08:42:28] it is
[08:42:31] why?
[08:42:32] let me check
[08:42:32] let's depool
[08:42:37] catch up
[08:42:39] then repool
[08:42:41] yep
[08:42:59] fuck me
[08:43:00] it crashed
[08:43:03] what?
[08:43:16] check the error log :(
[08:43:31] so it is user traffic that makes it crash
[08:43:41] ooor the innodb purge threads
[08:43:45] which I changed it to 4 in the end
[08:43:58] it crashed before it had user traffic
[08:44:22] I changed it to 4, because as we have the "recent" backup, I thought it was worth testing
[08:44:27] so, let me wipe it and transfer the data back
[08:44:57] May 19 08:42:31 labsdb1011 mysqld[17921]: 2020-05-19 8:42:31 0 [ERROR] InnoDB: Unable to find a record to delete-mark
[08:45:02] I think that is from the purge
[08:45:38] the good thing is that we can check the default_role once we copy everything back :)
[08:46:58] o I see, it crashed at 4:"3
[08:47:05] yeah, right when I enabled it
[08:47:06] was it after reboot?
[08:47:11] nop
[08:47:22] https://phabricator.wikimedia.org/T249188#6147299
[08:47:26] I enabled it a bit before that
[08:47:32] And it has been replicating fine for 5 days
[08:47:36] very suspicious
[08:48:10] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The issue with the roles is found: ` mysql:u15343@localhost [(none)]> SELECT CURRENT_ROLE; +--------------+ | CURRENT_ROLE | +--------------+ | NULL | +------...
[08:48:50] why not enable it with 1 purge thread
[08:48:57] for some time
[08:49:11] [10:43:44] which I changed it to 4 in the end
[08:49:12] [10:43:58] it crashed before it had user traffic
[08:49:12] [10:44:22] I changed it to 4, because as we have the "recent" backup, I thought it was worth testing
[08:49:20] ok
[08:49:21] you mean now?
[08:49:28] to see what happens
[08:49:33] although it is very strange
[08:49:35] we can always recover
[08:49:37] because if the default is 4
[08:49:47] it has been running with 4 since it started to replicate
[08:50:53] we could also install the debug version of the binary to get a proper stack trace
[08:51:06] mariadb won't accept bugs without it
[08:51:50] they have accepted multiples without it
[08:52:05] I don't know, up to you
[08:53:10] so the first crash happened at 04:29
[08:53:36] that is roughly when I enabled it to 4
[08:53:48] but on the other hand, it was running with 4 when we first started a few days ago
[08:53:51] then it was set to 1
[08:53:53] and then back to 4
[08:54:08] it could be an optimization, it just crashes 4 times faster :D
[08:54:23] XDD
[08:54:28] Buuut
[08:54:34] When I pooled it the first time: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597183/
[08:54:38] it was a bit before that crash
[08:54:46] but of course all the users would be getting permission denied anyways
[08:54:50] so this is weird
[08:55:03] https://jira.mariadb.org/browse/MDEV-12463
[08:55:30] yes, I referenced that in our phabricator ticket I believe
[08:55:49] https://jira.mariadb.org/browse/MDEV-22373?filter=-2
[08:56:19] should we try
[08:56:19] SET GLOBAL innodb_change_buffering=inserts;
[08:56:19] SET GLOBAL innodb_change_buffering=none;
[08:56:19] ?
[08:56:30] like copy the data back, and try that?
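(The two SET GLOBAL statements quoted just above come from the upstream MDEV discussion. They are alternatives rather than a sequence; as a sketch, trying one and confirming it took effect could look like the following — wrapping them in a client call this way is an assumption, not what was actually run.)

    # innodb_change_buffering is a dynamic variable, so it can be tried at
    # runtime; pick ONE of the values suggested in MDEV-12463 / MDEV-22373.
    mysql -h labsdb1011 -e "SET GLOBAL innodb_change_buffering = 'none'"
    # or: mysql -h labsdb1011 -e "SET GLOBAL innodb_change_buffering = 'inserts'"
    mysql -h labsdb1011 -e "SELECT @@GLOBAL.innodb_change_buffering"
    # To survive the restarts discussed here, the value would also need to go
    # into the server's my.cnf.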
[08:57:36] Just bringing up options
[08:58:05] let's copy the data back, try the purging set to 1 (like it was on the 10.1) and those two options
[08:59:36] they are set as duplicates of https://jira.mariadb.org/browse/MDEV-9663
[09:00:22] interesting: For older versions, a work-around to prevent future corruption (not to heal any existing corruption) would be
[09:00:23] SET GLOBAL innodb_adaptive_hash_index=OFF;
[09:01:04] let me try those 3 options and start mysql
[09:01:31] heh, it crashes
[09:01:36] maybe it is already too corrupted
[09:01:37] but we reimported logically, there shouldn't be any "existing corruption"
[09:01:40] yeah
[09:03:05] nah, the host is totally broken now, it won't start
[09:03:19] i have set all those 3 options + purge lag = 1
[09:03:41] ls -a
[09:03:51] * kormat coughs
[09:06:18] this is frustrating :(
[09:10:03] wait, what time did it crash?
[09:11:05] I restarted prometheus there at "May 19 06:38:51 labsdb1011 systemd[1]: Started Prometheus exporter for MySQL server."
[09:11:54] nah, issues started May 19 04:23:04, much earlier
[09:12:25] that's when I changed the purge thing
[09:13:13] but the weird thing is that as the default is 4, when we first started it a few days ago, it was already set to 4
[09:13:21] and it has been catching up all these days with that flag set to 4
[09:13:32] I changed it to 1 yesterday I think
[09:13:35] And then back to 4 this morning
[09:13:51] yeah, I am sure that may be unrelated
[09:13:57] it is the restart
[09:14:10] but it was restarted a couple of times already
[09:14:23] keep in mind that we sent a backup to backup1002 yesterday
[09:14:28] InnoDB: Record in index `cl_sortkey` of table `enwiktionary`.`categorylinks` was not found on
[09:14:31] and I did a restart once it caught up to check that
[09:14:38] that is the assert that kills it
[09:14:49] yeah, not saying that the restart kills it
[09:15:01] but it may trigger something else
[09:15:07] as it said on the ticket
[09:15:39] This is what I reported to mariadb: Apr 28 05:43:21 labsdb1011 mysqld[31450]: 2020-04-28 5:43:21 29 [ERROR] Master 's2': InnoDB: Record in index `cl_sortkey` of table `zhwiki`.`categorylinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[27] (0xE6A183E59C92E5B882E9AB98E7B49AE4B8ADE7AD89E5ADB8E6A0A1),[1] (0x01),[39]
[09:15:39] (0xE6A183E59C92E5B882E68890E58A9FE9A
[09:15:39] B98E7B49AE5B7A5E59586E881B7E6A5ADE5ADB8E6A0A1),[4] ~k(0x00097E6B)} at: COMPACT RECORD(info_bits=0, 4 fields): {[27] (0xE6A183E59C92E5B882E9AB98E7B49AE4B8ADE7AD89E5ADB8E6A0A1),[1] (0x01),[39] (0xE6A183E59C92E5B882E68890E58A9FE5B7A5E59586E9AB98E7B49AE4B8ADE7AD89E5ADB8E6A0A1),[4] ~k(0x00097E6B)}
[09:15:42] Same table
[09:15:43] same index
[09:15:46] just different wiki
[09:17:26] we can drop the indexes there and recreate them
[09:17:53] it is a secondary index issue after all
[09:19:20] I say to transfer things back, regenerate that index on all tables, and leave it with default options
[09:19:29] if it crashes again, go back to 10.1
[09:20:22] or percona...
[09:20:24] you mean 10.4 defaults?
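(On the "drop the indexes and recreate them" idea above: rather than guessing the exact cl_sortkey definition, a full table rebuild regenerates every secondary index on the table. A hedged sketch, per affected wiki database; the table and index names come from the error messages quoted above, but these commands were not part of the conversation.)

    # Rebuild categorylinks (and with it the cl_sortkey index) on one wiki,
    # then ask the server to verify the table; repeat per affected wiki.
    mysql -h labsdb1011 enwiktionary -e "ALTER TABLE categorylinks FORCE"
    mysql -h labsdb1011 enwiktionary -e "CHECK TABLE categorylinks"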
[09:20:48] well, not change anything just because of the bug, as it was while replicating
[09:21:17] that is my suggestion, but again, up to you
[09:21:22] yeah, let's do that
[09:21:27] And also the index drop
[09:21:30] my reasoning is
[09:21:45] we cannot just be "doing workarounds" for 10.4 upgrade
[09:21:53] specially after a logical dump
[09:22:05] if it doesn't work, as in, basic usage
[09:22:18] we should use something else that does
[09:22:40] yes, but on the other hand these hosts are very special
[09:22:43] I can accept corruption on upgrade
[09:23:07] yes, but 10.1 on labsdb1009/10 seems fine
[09:23:22] to some extent
[09:23:34] maybe it is something with labsdb1011 memory or something?
[09:23:39] could be too
[09:24:03] do we have another host large enough to test the restore?
[09:24:39] mmm, maybe one of the new new ones
[09:24:42] they have 7.6TB
[09:24:56] I think we lose nothing by testing that
[09:25:11] root@labsdb1010:/srv/sqldata# du -sh .
[09:25:11] 7.0T .
[09:25:17] so we do have space
[09:25:22] we can try that too yeah
[09:25:43] maybe kormat can help us with that?
[09:25:53] we do a parallel load
[09:25:59] to discard host issues
[09:26:26] maybe using the older snapshot even
[09:26:32] would that work still?
[09:26:34] the logical one?
[09:26:44] I changed sanitariums to keep 30 days of binlogs
[09:26:47] no, the one on backup1001
[09:26:54] at least not for now
[09:27:14] the one in 1001 is the logical, no?
[09:27:14] hey. what can i do?
[09:27:17] or we took another one?
[09:27:38] marostegui: that should be the physical just after the logical, isn't it?
[09:27:53] you did it, but I think you did 2
[09:28:29] yeah may 12
[09:28:47] so there is a confusion here
[09:28:48] kormat: let's wait to see if marostegui wants to do it or you do it
[09:28:58] ?
[09:28:59] backup1002: binary
[09:29:04] backup1001: logical
[09:29:26] no, the other way around
[09:29:28] I don't think that is right, based on my ls
[09:29:30] :-D
[09:29:41] both should be binary
[09:29:45] ok, let's stop here
[09:29:46] please
[09:29:50] I am getting very confused
[09:29:55] we have too many open fronts
[09:30:07] ok, can I tell kormat to try that and you handle labsdb1011?
[09:30:21] handle what?
[09:30:28] a test copy to a new host
[09:30:45] ok
[09:30:50] ok, then forget about it
[09:30:55] focus on labsdb1011
[09:31:01] I will tell kormat to help us
[09:31:06] take a break too :-D
[09:31:12] yes
[09:31:15] I need some fresh air
[09:31:20] please :-D
[09:31:22] this can wait
[09:31:42] I will pm you, kormat, to reduce the noise
[09:37:00] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The action plan for now is: - Rebuild labsdb1011 from the backup we made yesterday. -- Let it start with mariadb 10.4 default options (replication stopped) -- Drop a...
[09:59:02] I will upgrade to 10.4.13 on labsdb1011 btw
[09:59:24] that version is what will be installed on db1140 anyways, so I will upgrade labsdb1011
[10:01:52] +1
[11:00:51] marostegui: fyi - i selected db1141 as the temp testing host for labsdb1011, and transfer.py is busy copying over data to it.
[11:01:42] sounds good, thank you :)
[11:03:12] kormat jynus remember: https://wikitech.wikimedia.org/wiki/MariaDB#Recloning_a_Wiki_replica
[11:03:55] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) 05Open→03Resolved done!
the module has been removed
[11:04:08] are you pointing to the procedure in general or something in particular?
[11:04:32] or that we should update it?
[11:04:37] ah
[11:04:39] wikireplica
[11:04:40] indeed
[11:04:42] thanks
[11:04:54] you were pointing to the special things of relay stuff
[11:05:24] I read wiki replica as a replica for the wikis (core), that was why I was confused
[11:06:20] yeah, the fact that you have to delete all those files and all that
[11:10:05] kormat: you might want to downtime db1141
[11:13:40] es4/5 backups took 2h30 with limited concurrency
[11:13:56] so I think we can leave them like this until they start to run slower
[11:14:27] +1
[11:14:30] I will now depool 1 host of the other sections
[11:14:42] and start a fast backup until tomorrow
[12:01:23] db1141 downtime'd
[12:01:47] marostegui: i assume that not removing those files is why i had stale replication status info on db2136 earlier
[12:02:19] yeah, kinda, although we found that a few more need to be removed when dealing with multisource
[12:02:28] hence the wiki page, cause we always forgot which ones they were
[12:02:30] ?
[12:03:48] gotcha :)
[12:04:25] oh, again, thinking about xtrabackup
[12:04:38] which doesn't take the binlogs
[12:04:50] *doesn't copy them
[12:05:32] has anyone considered failing over to, say, esams so that we could do maintenance on both codfw and eqiad at the same time? i mean, sure, we'd lose all the data. but apart from that, it doesn't seem so bad.
[12:05:45] kormat: we don't have DBs in esams
[12:05:54] exactly :)
[12:05:58] maybe we can put some rpis
[12:06:02] haha
[12:06:05] with sd cards
[12:06:13] i have a 256M one lying around here i could put in the post
[12:06:17] maybe external usb disks
[12:06:26] you laugh
[12:06:38] i have it: https://www.wired.com/2009/05/five-disk-floppy-raid-4mb-of-blistering-fast-storage/
[12:27:33] i might as well ask the basic question re: db1101 - have we run a memtest on it?
[12:28:22] *labsdb1011
[12:30:52] no, we haven't
[12:32:31] ok. could be worth it (though i assume it would take quite a long time)
[12:33:00] to be fair, DIMMs have ECC, if it has a non-recoverable error, it crashes and says it
[12:33:16] if it had a recoverable error, there is monitoring that shows it
[12:33:51] hmm, right.
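(On the ECC point above: corrected and uncorrected memory-error counters are exposed by the kernel's EDAC driver and usually also land in the BMC's event log, so a quick look before committing to a lengthy memtest could be something like this sketch. The paths and tool availability on the host are assumptions.)

    # Corrected/uncorrected memory error counters, if the EDAC driver is loaded.
    grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
    # ECC events are typically also recorded in the BMC's system event log.
    ipmitool sel list | grep -iE 'ecc|memory'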
[12:34:59] compare to behavior of: https://phabricator.wikimedia.org/T252492
[12:35:44] or https://phabricator.wikimedia.org/T213664
[12:36:16] * kormat nods
[13:06:39] 10DBA, 10Patch-For-Review: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Kormat)
[13:07:11] 10DBA: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui)
[13:07:13] 10DBA, 10Patch-For-Review: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Kormat)
[13:07:22] 10DBA: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui) p:05Triage→03Medium
[13:07:49] 10DBA: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui)
[13:07:51] 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10Marostegui)
[13:08:44] 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10Marostegui)
[13:08:48] 10DBA, 10Operations, 10observability, 10Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473 (10Marostegui) 05Stalled→03Declined Closing this in favour of T253120 which has more concrete points of action
[13:09:39] Don't Be Late! Percona Live ONLINE Starts In One Hour!
[13:09:41] \o/
[13:17:35] i just added a rate-of-change line to the replication graph: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-1h&to=now&refresh=1m&var-dc=codfw%20prometheus%2Fops&var-server=db2073&var-port=9104
[13:17:40] let me know what you think
[13:18:05] it's currently reading -15, which means it will catch up at 15x real-time
[13:19:11] rate of change in seconds? like it catches up 15s for each real 1 second?
[13:19:15] yep
[13:19:30] (assuming i didn't screw up the logic :)
[13:19:32] I like it
[13:19:41] it will reveal interesting patterns
[13:22:29] cool :)
[13:38:17] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) I have done the first alte...
[13:40:40] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[13:41:19] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[13:54:13] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) s6 progress [] labsdb1012...
[14:07:17] kormat: hehe the graph now is funny: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&from=now-1h&to=now&refresh=1m&var-dc=codfw%20prometheus%2Fops&var-server=db2073&var-port=9104
[14:08:37] hah :D
[14:09:23] why not make it positive?
"the speed of change"
[14:09:47] jynus: well, if a machine is not keeping up, it will go positive as it is
[14:10:19] but sure, we can swap the polairty
[14:10:24] *polarity
[14:10:25] I don't know
[14:10:37] maybe if I see it positive I will complain too
[14:10:40] :-D
[14:11:09] i would expect nothing less :)
[14:12:51] updated, with this note:
[14:12:52] > Make jcrespo complain in the other direction.
[14:13:03] let me see how others do it
[14:13:31] best inspiration is to copy the ones that know
[14:15:44] percona one is uglier than ours: https://pmmdemo.percona.com/graph/d/mysql-replicaset-summary/mysql-replication-summary?orgId=1&refresh=1m&var-interval=$__auto_interval_interval&var-replication_set=MySQLReplSet1&var-service_name=ps8-slave-mysql&var-node_name=ps8-slave&var-version=8.0.19-10&var-crop_service=ps8-slave-mys&var-crop_host=ps8-slave&var-environment=All&var-cluster=All&var-database=All&var-node_type=All&var-service_type=All&var-
[14:15:45] schema=All
[14:16:26] log replication is nice in some cases
[14:16:35] but it loses the detail
[14:17:01] so I don't have an answer to make it better
[14:18:20] we should add, however, a compare one, side by side
[14:19:38] I actually like it now, see: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&from=1589894359363&to=1589897959363&refresh=1m&var-dc=codfw%20prometheus%2Fops&var-server=db2073&var-port=9104
[14:19:54] it is less noisy when things are good (everything 0 and no noise)
[14:21:24] thanks for the addition, kormat
[14:21:53] * kormat smiles
[14:22:10] i've found myself a few times in the last few weeks trying to figure out how quickly replication is catching up,
[14:22:11] now the big question is
[14:22:16] and calculating how long it will take
[14:22:19] yeah, that was useful
[14:22:21] this should hel
[14:22:23] *help
[14:22:31] the big question is light mode or dark mode?
[14:23:45] what's your preference?
[14:24:17] the versions of grafana i used before here only had dark mode,
[14:24:26] so i'm enjoying the novelty of light mode here :)
[14:24:37] oh, I thought dark was the default?
[14:25:11] however, I am light mode gang, all the way
[14:25:16] light is currently the default on grafana.wm.o
[14:25:20] hah :)
[14:25:34] I thought it was dark and it was my configuration making it light
[14:26:06] jynus: that could have been in a previous version
[14:26:34] I think I once changed the global default from light to dark or the other way
[14:26:58] and people were very vocal about it
[14:30:52] can I ask you to change 1 more thing?
[14:32:28] the derivative over 2 minutes rather than 5 so the 0s mostly match?
[15:19:16] jynus: i can check, i know it didn't work at all at 1m
[15:19:37] yeah, 2m works
[15:19:59] jynus: can you check the dbctl diff on cumin1001 please?
[15:20:07] i'm repooling db2073 after its holiday yesterday
[15:21:20] let me see
[15:22:57] sure
[15:23:16] go ahead
[15:23:46] thanks :)
[15:24:13] if you are going away, have a nice day!
[15:25:14] yeah i'm done at this point. ttyt o7
[15:54:59] 10DBA, 10MediaWiki-Logging: Evaluate the need for FORCE INDEX (ls_field_val) [now IGNORE INDEX (ls_log_id)], delete the index hint if not needed anymore - https://phabricator.wikimedia.org/T164382 (10Aklapper) 05Stalled→03Open The previous comments don't explain what/who exactly this task is stalled on (["...
[15:59:54] 10DBA, 10MediaWiki-Logging: Evaluate the need for FORCE INDEX (ls_field_val) [now IGNORE INDEX (ls_log_id)], delete the index hint if not needed anymore - https://phabricator.wikimedia.org/T164382 (10jcrespo) > jcrespo moved this task from Triage to Backlog on the DBA board. It was previously stalled waiting...
[16:13:45] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Addshore) {meme, src="full-steam-ahead"}
[19:25:11] jynus: I just realised that the discussion from earlier about the backup location is slightly different
[19:25:49] We have two binary backups: 1 in backup1001, which is the binary backup taken right after finishing importing the logical one, and the one in backup1002, which is the binary of the host after it fully caught up with the master
[19:26:17] I initially copied the one from backup1001, but I have deleted it, and started to copy the one from 1002, so we don't have to wait for it to catch up again (it will be like 10 days of replication lag)
[19:26:34] we'll see what happens with that one
[19:27:20] if you guys are copying the one from 1001, that means you need to start replication from: https://phabricator.wikimedia.org/T249188#6115163
[19:31:47] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Marostegui) Downtime expired - I have acked the alerts in Icinga
[19:42:03] kormat: omg, I even get pathological output from icdiff for this: https://phabricator.wikimedia.org/F31833204
[19:42:12] how does it possibly make sense to do that
[19:42:26] the only change here is removing the `dump { db2073 }` section
[19:42:40] (and `vslow` as well)
[19:45:45] since there's a choice of referencing two equally-unclear things here, https://phabricator.wikimedia.org/P11238
[19:45:51] no colorization here but there's the text
[19:45:54] it's aligning lines like:
[19:46:00] "logpager": { "logpager": {
[19:46:02] "db2084:3314": 1,
[19:46:04] "db2091:3314": 1 "db2084:3314": 1,
[19:54:40] cdanis: worth trying either deepdiff or jsondiff
[19:54:46] both available in debian IIRC
[19:54:57] I will try something else first
[19:55:01] this is really dumb though
[19:55:22] i think python difflib is doing something silly, icdiff uses it internally
[19:55:27] (actually calls difflib._mdiff)
[21:22:16] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Jclark-ctr) Main board replaced today. Entered password & management address into server
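(As a possible alternative to the icdiff/difflib behaviour shown above when comparing dbctl JSON state: normalising both documents first keeps a plain diff aligned on keys rather than on raw text lines. A small sketch; the file names are placeholders and jq being available is an assumption.)

    # Sort keys and pretty-print both sides, then take a plain unified diff;
    # this sidesteps difflib-style intra-line alignment of unrelated lines.
    diff -u <(jq -S . before.json) <(jq -S . after.json)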