[11:02:23] hey DBA team, we are stopping toolsdb VMs in the affected cloudvirts due to the PDU operations [11:16:18] that is up to what cloud thinks is better for tools, we will help you with any decision you take [11:17:05] mariadb was stopped, the VM shutdown (clouddb1001) [11:17:11] that's toolsdb_primary [11:17:30] the hypervisor is also shutdown (cloudvirt1019) [11:17:39] trying to prevent any form of disk corruption [11:18:13] same for clouddb1004 (osmdb_secondary) which is postgresql [11:18:14] ok to me [11:18:31] makes sense with the large amount of MyISAM tables [11:18:50] sorry, I thought you were asking first [11:19:29] this is what brooke suggested the other day in our WMCS team meeting [11:19:40] no, I mean I am ok [11:19:46] I was agreeing with you [11:19:50] cool [11:20:20] you don't have to ask [11:20:29] just ask help if you need it [11:21:47] ask for* if you need it [11:21:51] *help [11:22:34] ok, thanks! [11:22:49] hopefully we will start everything after the PDU switch and everything will be working [11:28:06] arturo: yeah, Brooke asked me the other day about it and we agreed that stopping mysql was a good decision, as we have been doing the same with production hosts [11:28:55] arturo: also told her we can help if needed indeed [11:29:02] cool thanks [12:07:07] 10DBA, 10Core Platform Team, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Performance Issue, 10mariadb-optimizer-bug: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (10Marostegui) I have analyzed all the queries that went overnight till ar... [12:07:41] jynus: db2092 finished the analyze, I am going to repool it, you can scratch it from the list of things [12:08:20] ok [12:24:33] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Marostegui) [12:24:53] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Marostegui) p:05Triageβ†’03Normal [12:25:31] 10DBA, 10Core Platform Team, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Performance Issue, 10mariadb-optimizer-bug: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (10Marostegui) The `analyze table` on db2092 for `revision` didn't help wi... [12:36:04] hey [12:36:14] I'm trying to start mariadb in clouddb1001 and I get this [12:36:16] https://www.irccloud.com/pastebin/InFyHFkS/ [12:37:09] not sure if I'm supposed to created that dir by hand [12:37:58] marostegui: ? [12:44:18] anyway, I created the directory and chmod'd it by hand then mariadb was happy to start [12:46:43] that's strange [12:46:51] it was working fine before? [12:47:01] ah the socket directory [12:47:17] puppet creates it for us, but we've had those issues in the past [12:48:40] ok that makes sense [12:48:48] puppet was stopped in the server during the operation window [12:49:02] and after the reboot, /var/run is wiped [12:50:51] yep [13:17:15] marostegui: I could use some help to check if toolsdb is working fine [13:17:30] arturo: sure, what do you need? 
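(The missing-directory error pasted above is the usual post-reboot symptom: /var/run is a tmpfs, so the mysqld socket directory disappears on reboot and only comes back once Puppet runs. A minimal sketch of the manual workaround described here, assuming the usual mysql:mysql ownership and the default socket path; not necessarily the exact commands that were run.)

```bash
# /var/run is wiped at boot; recreate the socket directory by hand if Puppet
# has not run yet (ownership and mode are assumptions)
sudo mkdir -p /var/run/mysqld
sudo chown mysql:mysql /var/run/mysqld
sudo chmod 0755 /var/run/mysqld
sudo systemctl start mariadb
journalctl -u mariadb -n 50 --no-pager   # confirm a clean start
```

(Re-enabling and running Puppet afterwards puts the directory back under its normal management.)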
[13:18:21] we just got a notification that some RW ops are failing [13:18:46] arturo: clouddb1001 is running with read_only OFF [13:18:48] and has a slave hanging [13:18:57] ok [13:19:00] so it is writable and has its normal slave [13:19:41] so everything looks normal there? [13:19:43] weird [13:19:58] from mysql point of view, it does [13:20:11] what are you exactly seeing? [13:20:32] there are lots of stuff running too [13:20:35] from what I can see [13:20:49] 15:14 <+β€―icinga-wm> PROBLEM - toolschecker: toolsdb read/write on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/db/toolsdb - 340 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:21:24] there are lots of things there running, but I am not sure if that is the normal state or not [13:21:41] which stuff are you refering to? [13:21:51] like lots of selects, and inserts [13:21:58] and things taking time to run [13:22:16] so it can be a load issue? [13:22:53] I don't know its normal status, I can see writes taking long time to be done, and others being done quite fast, so that might its normal status? [13:23:10] lots of selects as well [13:23:21] do you have some graphs? [13:23:35] looking for them [13:25:04] too bad we don't have journal logs before the reboot :( [13:25:49] mysql just died? [13:26:01] Oct 24 13:25:23 clouddb1001 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT [13:26:13] mmm [13:26:18] can you check hw logs? [13:26:45] Oct 24 13:25:21 clouddb1001 mysqld[1622]: *** buffer overflow detected ***: /opt/wmf-mariadb101/bin/mysqld terminated [13:26:45] in the hypervisor? I don't see anything weird [13:27:28] mysql back up [13:27:29] oh that doesn't sound good [13:28:20] I see you upgraded to 10.1.41? [13:28:27] I just ran mysql_upgrade for you [13:28:27] ?? [13:28:45] did you run an upgrade? [13:28:50] I didn't upgrade anything. Not sure if the reboot did something automatically [13:28:52] because i was surprised to see 10.1.41 [13:28:58] but the mysql_upgrade didn't run [13:29:35] could the upgrade happen automatically and the after the reboot the new version started? [13:29:39] I don't see any more hangs [13:29:59] arturo: Not sure how it works in your infra, in production it definitely doesn't happen [13:30:10] graphs BTW https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1 [13:30:51] https://www.irccloud.com/pastebin/5UQ2Pb4w/ [13:31:02] Not sure if that crash had anything to do with the upgrade+mysql_upgrade...but from what I can see it works fine now [13:31:13] Like no transactions are hanging [13:31:23] Start-Date: 2019-09-12 06:07:47 [13:31:23] Commandline: /usr/bin/unattended-upgrade [13:31:23] Upgrade: wmf-mariadb101:amd64 (10.1.39-1, 10.1.41-1) [13:31:23] End-Date: 2019-09-12 06:08:11 [13:31:28] :-/ [13:31:35] so unattended-upgrades updated the package the other day [13:31:43] well, last month [13:31:47] yeah, I see [13:32:11] might be related to the crash [13:32:16] (I don't see how) [13:32:27] But too much of a coincidence? [13:32:32] so my next question is: could the package be upgraded and the service not restarted? [13:32:40] yes [13:32:55] not ideal, but it can done [13:32:56] then that's what happened. 
Today with the reboot, the new version started [13:33:13] we have been sitting in a pending upgrade for a month, and today it triggered [13:33:22] normally we try to run the mysql_upgrade script before letting things start to run [13:33:26] (replication, reads etc) [13:33:30] to avoid things like this [13:33:32] like weird states [13:33:45] arturo: you might want to revisit that policy of unattended upgrades for mariadb [13:33:53] that's true [13:34:05] will open a phab task [13:34:17] so far everything looks good after the crash+upgrade [13:37:56] 10DBA, 10Tools, 10cloud-services-team (Kanban): Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 (10aborrero) [13:38:09] I just created T236384 [13:38:11] T236384: Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 [13:38:13] 10DBA, 10Tools, 10cloud-services-team (Kanban): Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 (10aborrero) p:05Triageβ†’03High [13:39:15] thanks marostegui [13:41:12] yw [13:47:38] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Anomie) Looks like the statistics are probably ok-ish, forcing the index each way shows similar row es... [13:47:56] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Anomie) [14:27:30] it seems to have dropped again? [14:27:32] briefly [14:28:34] mariadb crashed [14:35:12] marostegui ^ It recovered with one table marked in need of repair. I don't see any reason why per se so far. I'll try to take some notes. [14:38:05] I'll run a repair on it [14:38:31] Hrm...the level of churn on this maybe I shouldn't [14:50:53] Damn it, it's flapping. Looking into options [15:16:45] jynus: are you around? toolsdb is crashing and we could use a rescue. [15:17:09] crashing? [15:17:15] I thought it was put down? [15:17:34] send me the hostname, please [15:18:24] is it clouddb1001? [15:18:34] jynus: we shut it down during the pdu update. When it came back up it had an unexpected version upgrade waiting... [15:18:36] clouddb1001.clouddb-services.eqiad.wmflabs [15:18:38] and it's never really worked since. [15:18:41] It's the VM version [15:18:53] bstorm_: is running repairs in r/o mode right now [15:19:11] Yeah, but it's probably not going to help much...it's not myisam tables πŸ˜› [15:19:14] did you run apt upgrade on that host? [15:19:18] I just got up and am trying to think of things [15:19:35] We didnt'...our unattended upgrades did and apparently we didn't imagine that would happen [15:19:41] ugh [15:19:43] jynus: it was unattended upgrades that did it. 
https://phabricator.wikimedia.org/T236384 [15:19:44] yeah, ugh [15:19:46] 10.1.41 is unstable [15:19:48] Now we know that it was configured to upgrade mariadb :( [15:19:55] please disable that [15:20:01] not important now [15:20:09] but probably the root cause [15:20:19] Good to know [15:20:19] (linked task above is about disabling upgrades for maria/mysql) [15:20:46] [ERROR] Do you already have another mysqld server running on socket: /var/run/mysqld/mysqld.sock [15:20:52] someone tried to start it twice [15:21:01] I'm running mysqlcheck right now [15:21:06] don't [15:21:06] huh [15:21:08] we shouldn't [15:21:11] ok should I kill it? [15:21:16] let's only have 1 [15:21:24] hand at the same time or things get worse [15:21:26] I have it [15:21:59] I mean, should I stop the mysqlcheck command (is that safe?) [15:22:08] I'm aiming to be careful here [15:22:24] so there is a process ongoing [15:22:36] and mysql is up [15:22:40] mysqlcheck has been running for a long time [15:22:56] I just want to make sure it is safe to kill that before I do [15:23:29] don't worry, bstorm_ I have it [15:23:33] Ok [15:23:37] :-) [15:24:14] I will kill it [15:24:48] There it goes :) [15:24:54] will explain later [15:24:58] πŸ‘πŸ» [15:25:14] I am not ignoring you, just trying to fix first, then discuss later [15:25:30] No worries at all. I'll drink tea and work on waking up more [15:26:06] uf, bad signs on log [15:26:50] I will focus on the more important stuff, can I kill users connecting? [15:27:00] aka user connections? [15:27:03] Sure. I believe we are telling people it is unstable [15:27:07] ok [15:27:20] I will need that to properly finish the upgrade [15:27:25] Makes sense [15:28:29] I will restart it so it gets bound only to localhost [15:28:39] then upgrade, then see what is the state [15:28:57] Ok [15:29:08] disabling puppet [15:29:15] not sure where to log [15:29:20] doing it here [15:29:40] jynus, if you want to you can "!log admin " in #wikimedia-cloud [15:29:43] but logging here is also fine [15:29:48] and we can cut-and-paste later [15:29:56] will restart mariadb in skip-networking mode [15:30:00] !log clouddb-services in #wikimedia-cloud is ideal, but here is fine [15:30:00] bstorm_: Not expecting to hear !log here [15:30:00] no outside access [15:30:05] :) [15:30:26] so people know it won't work for a while [15:30:54] now that there is no conenction, I can upgrade/repair [15:31:02] πŸ‘πŸ» [15:31:23] I don't think it should take much [15:31:41] Repairing tables [15:31:43] s51290__dpl_p.t_s_all_dabs [15:31:45] Error : Table 's51290__dpl_p.t_s_all_dabs' doesn't exist in engine [15:31:45] We've announced that it is suffering general instability, so we can adjust that to "hard down" if it takes a while. Otherwise, that seems to cover it [15:31:46] status : Operation failed [15:31:57] That doesn't sound good [15:32:35] My first thought on this one was to fail over, but then I realized the "upgrade" would be on the secondary as well. [15:33:38] mv /srv/labsdb/data/s51290__dpl_p/t_s_all_dabs.frm /srv/labsdb/ [15:33:44] fyi [15:33:49] that is an issue for another time [15:33:56] πŸ‘πŸ» [15:34:08] stoppping again [15:35:11] according to the log innodb is clean [15:35:30] That sounds very good [15:35:59] well the "--Thread 140163871127296 has waited at dict0stats_bg.cc" [15:36:07] on previous log are not that good [15:37:16] And we are now upgraded to a version we'd rather not be on in general, unfortunately, right? 
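(The recovery jynus walks through in this block, restart with no outside access, finish the pending upgrade, then reopen, looks roughly like the sketch below. These are not the exact commands used; the config include path and the wmf-mariadb101 binary locations are assumptions based on the paths quoted in the log.)

```bash
sudo puppet agent --disable "toolsdb recovery"
sudo systemctl stop mariadb
# refuse TCP connections while mysql_upgrade runs
# (assumes /etc/mysql/conf.d/ is included by the server config)
printf '[mysqld]\nskip_networking = 1\n' | sudo tee /etc/mysql/conf.d/zz-recovery.cnf
sudo systemctl start mariadb
sudo /opt/wmf-mariadb101/bin/mysql_upgrade
sudo systemctl stop mariadb
sudo rm /etc/mysql/conf.d/zz-recovery.cnf
sudo systemctl start mariadb
sudo puppet agent --enable
journalctl -fu mariadb   # watch for InnoDB monitor output / semaphore waits
```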
[15:37:32] yeah, but to a version we don't really like [15:37:43] we were waiting for the next one that releases tomorrow [15:37:48] I see [15:38:26] We'll prioritize that ticket to stop this from "auto upgrading" [15:38:38] I'm looking at that now [15:38:41] so the rest can autoupgrade [15:38:53] it is only wmf-mariadb we don't want [15:39:02] fair [15:39:12] I am sweeping the logs now [15:40:25] no errors since the last restarts [15:40:34] we will see when we allow external conections [15:40:50] enabling puppet [15:41:38] so we support auto-upgrading, but on the non-wmf package [15:41:41] the debian one [15:41:56] I see [15:42:22] "support", more like we don't support it on the wmf-mariadb one [15:42:30] :) [15:42:59] I see the conenctions coming in now [15:43:12] Hopefully it doesn't randomly explode again [15:43:26] no ongoing errors [15:43:31] as before [15:43:36] we'll see [15:43:36] good [15:43:39] Yeah. [15:43:43] but status should be no back up [15:44:01] so we saw some issue on 10.1.41 [15:44:11] but we didn't remove it from the repo [15:44:16] because we wanted to test it [15:44:22] we skipped that from produciton [15:44:35] That makes sense. I was surprised when this upgraded itself. [15:44:36] and will upgrade from 39 to 42 directly on production [15:44:47] you probably want to do the same [15:44:54] Fair [15:44:55] remember the labsdb issues? [15:45:01] Oh yes [15:45:02] we belive it could be connected [15:45:10] Interesting [15:45:19] but it is difficult to proof [15:45:25] so we just are careful [15:45:50] should we do the same on the replica? [15:45:57] I think so [15:46:09] as immediate actionables, upgrade to 10.1.42 as soon as it is available [15:46:14] we don't want to be on .41 [15:46:20] Ok, I'll make a ticket [15:46:32] the other thing is the table or view that got corrupted [15:46:48] I saved the .frm, but there was no data to save [15:47:06] huh. [15:47:23] we should contact the user and tell him that table is gone [15:47:27] I can reach out to the user...yeah [15:47:38] I am not to worried, it most likely failed on creation [15:47:44] that is why it was empty [15:47:55] (that would be my assumption) [15:48:28] still no flagrant errors like before [15:48:46] le me check clouddb1002 [15:49:06] so, sorry I cut you before [15:49:25] repair will not work when there is ongoing traffic because it will get stuck on metadata on a non-read-only host [15:49:37] specially on these community dbs [15:49:51] killing them "it's complicated" [15:50:29] I thought it was set to read-only before trying the repair? [15:50:35] Got it [15:50:40] andrewbogott: nope [15:50:44] 'k [15:50:52] some scripts may do that [15:51:08] but wmf-mariadb* package on purpose don't touch the database [15:51:19] because one is supposed to do everthing with time [15:51:25] depooled, etc. [15:51:32] so it is more involved [15:51:55] that is why for caual databases (for end users VPS) we recommend the debian package [15:52:00] is less attended [15:52:11] wmf- ones require more attention [15:52:30] so clouddb1002 is in 10.1.38 [15:52:38] Oh? [15:52:38] can you maybe disable unatended upgrades? [15:52:53] Well, we can try to sort that out [15:52:55] and I would suggest to upgrade it to 10.1.39 [15:53:04] and later jump to 10.142 [15:53:12] when it is tested elsewhare [15:53:23] we plan to upgrade labsdb1011 nest week [15:53:43] the replication seems clean [15:53:46] I seem to recall that the general notion is to get things out of rotation and run mysql-upgrade or something like that, right? 
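(T236384 boils down to letting everything else auto-upgrade while keeping unattended-upgrades away from the wmf-mariadb package, since upgrading it needs a coordinated mysql_upgrade and restart. One way to express that, as a sketch; the file name is arbitrary and the eventual puppetized fix may differ.)

```bash
cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-mariadb
// Never auto-upgrade the locally built MariaDB package; it needs a
// coordinated mysql_upgrade + restart done by a human.
Unattended-Upgrade::Package-Blacklist {
    "wmf-mariadb101";
};
EOF
sudo unattended-upgrade --dry-run --debug | grep -i blacklist   # sanity check
```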
[15:53:51] so that is good [15:53:57] Definitely [15:54:29] yeah, "depool", whatever that means in each env [15:54:29] Once the packages are installed that is and mariadb has been stopped and started again [15:54:37] install wmf- ones [15:54:39] In this env, close off to the outside world entirely [15:54:42] lol [15:54:44] apparently [15:54:45] they may not be on instal1002 [15:54:52] ok [15:54:58] I can grab them for you [15:55:04] then run mysql_upgrade [15:55:07] then restart [15:55:10] Ok [15:55:13] the last restart is normally not needed [15:55:23] but just to be careful, in short, just do it [15:55:36] then monitor the journalctl log [15:55:44] to see everthing is ok [15:55:57] in this case it was spittiong the monitor output [15:56:04] which happens when there is a bad internal error [15:56:13] like internal blockage or something [15:56:20] could also be hw [15:56:38] give it a good ol' check to dmesg kernel msg, etc. [15:57:12] but my first guess would be an upgrade package without mysql_upgrade, messing up with the internals [15:58:00] not sure how useful I was, I just pressed keys until thins stopped giving errors [15:58:27] lol [15:58:35] Thank you very much for your help [15:58:41] thank you jynus! [15:58:54] let me know if things happen again [15:58:57] Does it make sense to just wait on the replica upgrade until 10.1.42? Like since it isn't broken now? [15:58:58] * andrewbogott orders jaime a cape [15:59:10] bstorm_: yeah [15:59:14] just block autoupgrade [15:59:16] Ok, I'll aim for that then [15:59:29] I'll check that issue for now [15:59:32] Thank you! [15:59:33] even wait for it to be on labsdb1011 for some time [15:59:41] sorry for the trouble [16:00:17] we didn't delete 10.1.41 from the repo, becaus techinically it is not bad, it is just a statistically bad version [16:00:47] Well, it exposed a problem in our setup, so that's good...just kind of painfully lol [16:14:43] jynus: should we schedule a couple of hours next week for the bacula failover/migration ? [16:14:56] yes please [16:15:04] Monday is a national holiday over here, but starting from Tuesday on I am all yours [16:15:13] ok, morning or later? [16:15:50] ^ akosiaris [16:16:23] hmm next week is the DST removal, which means waking up an hour earlier biologically speaking, so everything should be adjusted for that if I am to have a working brain [16:16:31] up to you [16:16:34] I am assuming you too [16:16:52] I predict a couple of false tries [16:16:52] so, 10:00 UTC on Tuesday? [16:16:55] ok to me [16:17:02] let me setup a calendar [16:17:10] invite [16:17:11] yes, please do, I was about to suggest that [16:17:15] 2 hours for starters? 
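(Circling back to the upgrade checklist jynus gave just above, depool, install the wmf- packages, run mysql_upgrade, restart, then monitor: the monitoring step amounts to something like this sketch, run on the freshly restarted host.)

```bash
journalctl -fu mariadb        # repeated InnoDB monitor dumps = bad internal state
dmesg -T | tail -n 100        # kernel or hardware complaints
sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Uptime'"
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Error|Seconds_Behind'
```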
[16:17:18] yeah [16:17:27] but plan for larger [16:17:32] I prefer not to [16:17:38] 10DBA, 10Data-Services, 10Operations: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) [16:17:39] and if we have to abort, we reschedule [16:17:53] I don't want to be without new backups for long [16:18:19] BTW, last thing, the buster incompatibility may change slightly the stratggy [16:18:24] will talk next week [16:18:39] ok [16:18:48] there is pdu maintenance at that time [16:18:57] let me check our hosts are not affected [16:24:11] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:24:20] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) p:05Triageβ†’03High [16:25:18] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:26:45] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris) [16:27:02] thanks, I was doing exactly that [16:27:16] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:29:07] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) So because of buster clients and jessie storage daemons cannot talk to each other, we will have to alter slightly the upgrade strategy. Several opt... [16:48:01] jynus: it crashed again [16:48:48] (I don't know any details yet, just relaying to you while Brooke gets some breakfast) [16:50:28] I can see [16:51:06] jynus: it looks like the pending version on the replica is 10.1.41, so if we want to fail over we'll have to do some work there as well [16:51:32] jynus: I'm keeping my hands off for now but lmk how I can help [16:55:44] it starts again at Oct 24 16:43:24 [16:56:03] "Oct 24 16:43:24 clouddb1001 mysqld[11861]: --Thread 140240572368640 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:" [16:56:22] yep, that's when I got the alert [16:56:46] (and just now got a recovery alert) [17:00:56] my suggestion is to dump all data and reload it again [17:01:54] is there room on the local filesystem to do that? [17:02:04] probably yes, there is 2 TB [17:02:10] (And, that means you've concluded that this is a corruption issue and not a bad package issue?) [17:02:29] no, I am just suggesting something [17:02:34] ok [17:02:55] if we do the dump and reload, is there any chance that the replica will copy the temporary-empty tables and get wiped? [17:03:33] no, the danger would be to overwrite the existing tables [17:03:51] the plan would be to stop the replica, reload with binlog and the restart the replica [17:04:17] *reload without binlog [17:04:24] so there is always a plan B [17:04:57] the issue is that you could do that and this continues to happen because it is a server bug [17:05:12] no info at this time [17:05:22] would a total dump/reload take… hours? Or a few minutes? [17:05:34] 1.3 tb, depending on the storage [17:05:36] Also... is reverting to the last known good server version an option? 
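(The dump-and-reload option floated above would, with the mydumper/myloader pair that comes up a little later in the conversation, look something like this sketch. Paths, thread counts and credentials are assumptions, not what was actually run.)

```bash
DUMPDIR=/srv/backups/toolsdb-$(date +%F)    # assumes enough room on /srv
# logical, compressed dump of the whole instance
sudo mydumper --host=localhost --user=root \
     --outputdir="$DUMPDIR" \
     --compress --triggers --events --routines --threads=8
# reload on the rebuilt instance; myloader only writes the restore to the
# binlog if --enable-binlog is passed, which is how "reload without binlog"
# keeps the replica from replaying it
sudo myloader --host=localhost --user=root \
     --directory="$DUMPDIR" --overwrite-tables --threads=8
```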
[17:05:46] between 4 and 12 hours [17:06:01] yes, but it could also be a problem [17:07:28] I don't have a good solution for you, on production, if a server crashes we just nuke it and use another [17:07:49] sure [17:07:59] because there is no reason to treat is as a pet [17:08:11] I cannot tell if this is hw, software or data [17:08:11] I'm hoping bstorm_ will appear and have an opinion. [17:08:16] sorry [17:08:37] I'm not clear on how the replica plays into this β€” we could upgrade 2002, fail over, and see if it crashes in the same way [17:08:46] but then I suppose we run the risk of having corruption in both places [17:08:53] s/2002/1002/ [17:08:55] you could also copy it from the replica [17:09:07] data corruption no [17:09:13] replica is a logical copy [17:09:24] but if it is hw or sw, it won't solve it [17:10:08] cannot you create a new vm, copy it in a hot way and try? [17:10:22] I guess not because there is not resources [17:10:29] I'm not sure β€” checking [17:11:18] there is not a huge set of options with limited resources [17:11:47] it looks like there is barely room for another VM the size of toolsdb1001 on the hypervisor [17:12:00] so when you say 'copy in a hot way' you mean copy the existing VM? [17:12:15] I was thinking of xtrabackup [17:12:25] moving the data [17:12:34] or mydumper, moving it logically [17:12:42] ah, ok [17:12:50] so, yes, that can probably be done [17:13:04] you can also do that from the replica [17:15:50] hm, looks like it just crashed again [17:16:46] I'm checking available space for backup purposes [17:18:36] it didn't crash [17:19:14] it's been up for 1412 seconds [17:19:40] it is complaining about blockage, however [17:19:57] normally because low performance [17:22:12] we had the same issue (though on mariadb 10.2.27 (upgraded from 10.2.24)) [17:22:15] bstorm_: so, there is technically room for another pair of toolsdb servers on those virthosts (I say 'technically' because if we do that it'll be overprovisioned for disk space, and pretty close to breaking) [17:23:05] So the way these are set up, we put them on specific servers to make sure we could manage them carefully. So you understand, jynus. We'd likely be spinning up on the same server or the other server in the pair. One thing that changed here also is the server was rebooted for some security kernel thing, so that is a possible factor. [17:23:37] overprovisioned seems dangerous on databases, but I know that the postgres databases have lots of free space [17:24:31] clouddb2001 is occuping 2.2Tb of physical space. the host has 2.5Tb available [17:24:36] so I see a lot of import ongoing [17:24:39] by users [17:24:54] those may explain the slowdown [17:24:56] import? like someone is doing a big load [17:25:01] but clouddb2001 is provisioned for…3.4Tb [17:25:14] several users doing batch inserts in parallel [17:25:28] that's annoying. [17:25:36] that could explain the slowdown [17:25:41] it could [17:25:46] but there is not much that can be done [17:26:05] andrewbogott: 370G is free on the two postgres servers...not as much as I thought [17:26:49] If we want to spin up another replica or something, I'd do it with another cloudvirt where there's lots of free space with. [17:27:04] * andrewbogott looks for that [17:27:04] These have SSDs right? 
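(For the "lots of things running" / parallel batch-insert pattern described above, a quick way to see who is hammering the server and whether InnoDB is piling up semaphore waits is something like the following sketch, run from an admin account on clouddb1001.)

```bash
sudo mysql -e "
  SELECT user, db, time, state, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE command <> 'Sleep'
  ORDER BY time DESC LIMIT 20;"
sudo mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 20 '^SEMAPHORES'
```

(Long-running inserts from a handful of tools plus growing semaphore waits is consistent with the "We intentionally crash the server, because it appears to be hung" assertion seen in the journal.)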
[17:27:07] yes [17:27:15] s51187 s52561 s52897 s52532 [17:27:17] So dumping to disk would be very fast at least [17:27:27] even our production hosts wouldn't be able to handle that [17:27:48] so my guess right now is that it gets slow until mysql things it is stalled and crashes itself [17:28:16] or runs out of memory, are there memory graphs? [17:28:41] let's pause the moving hw stuff [17:28:56] now it crashed [17:29:46] InnoDB: We intentionally crash the server, because it appears to be hung. [17:29:57] bstorm_: we have 'lifeboat' hypervisors with ssds and 10Gb networking. Want me to create a VM or two there in case you decide to go that way? [17:29:58] There are...finding the link [17:30:04] InnoDB: Assertion failure in thread 140384919348992 in file srv0srv.cc line 2410 [17:30:08] Sure, we can always kill them :) [17:31:11] jynus the same error happened to us after upgrading from mariadb 10.2.24 to 10.2.27 which makes me think there's a bug in mariadb? [17:31:18] jynus: https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1 [17:33:19] I'll go dig up those usernames [17:33:52] there is really no fix, ask for those users to skip the imports to see if that helps [17:35:11] xtools is the first [17:35:38] wsexport [17:36:08] anything I can help with? my team maintains both xtools and wsexport, heh [17:36:18] deadlinks and listeria [17:36:43] musikanimal: if you can make them not do batch imports right now...or maybe have them do it in a staggered fashion? [17:37:00] I think that the database becoming unavailable might be triggering some tools to load stuff [17:37:16] Do you also work with listeria and deadlinks, perhaps? [17:37:19] sure, but it is a cyclical thing [17:37:24] Hmm [17:37:50] batch imports of what? sorry, a lot to read through above [17:38:04] I'm not sure. jynus: do you have any insight there? [17:38:12] I can disable the writes to tools-db, if that's what you mean? [17:38:34] they are not important for xtools. I'm not certain about wsexport [17:38:56] I didn't get full logs, just connections doing inserts of the same type on several connections [17:39:18] Ah ok [17:39:26] musikanimal: that might help for now [17:39:57] deadlinks is community tech and listeria is magnus [17:40:22] I am not saying that is the cause, but trying to think ways to make things more stable [17:40:31] I see. [17:40:38] other option would be to move heavy hitters to a separate db [17:40:57] so bisect if there is a single query or load that causes it [17:41:08] here db == instance [17:41:33] musikanimal: please disable that, if you could. It might help at least until we can get things looking better. [17:42:05] done! I think [17:42:11] Thanks :) [17:42:49] there is also someone inserting web access long on the db [17:42:55] I created clouddb1001bak and clouddb1002bak. Now I need to run to the doctor but will try to keep my phone and laptop handy. [17:42:58] which is probably a bad idea, but not related [17:43:08] thanks andrewbogott [17:43:32] s52481__stats_global [17:44:32] s51187 usage timeline is the one importing now into s51187__xtools_prod [17:46:32] musikanimal: does that make sense? [17:47:06] not down, but toasted again [17:48:06] :( [17:48:19] hmm maybe! I don't know what you mean when you say "importing" [17:48:31] large inserts, I believe [17:48:35] but yes, the xtools writes are just simple usage tracking. not important at all [17:48:53] they should be very small, usually updates, not inserts [17:48:55] apparently usage timeline is still writing? [17:49:13] is it? 
let me take another look [17:50:25] I am trying to do some kills to see if I can make it not crash [17:50:35] Thank you! [17:50:57] but it may be too late [17:51:21] I am restarting it [17:53:54] it doesn't really answer to the sigint [17:54:05] What an exciting surprise mess [17:54:15] ugh [17:57:34] I'm working on tracking down other users of the DB accounts here [17:58:25] i've setup some pt-kills on a couple of screens [17:59:34] That can't hurt [17:59:53] they are called pt-kill and pt-kill2 as root [17:59:59] in case you want to kill those [18:00:44] it won't fix anything, but maybe it can kill stuck connections before a crash [18:01:04] ok [18:02:31] the other thing you can try is to downgrade mariadb [18:02:39] herby dragons [18:02:52] I was worried about the dragons on that... [18:03:05] It seems like a logical solution, but... [18:03:16] or even 10.3 [18:03:33] That would be using buster? [18:03:42] yeah, probably [18:03:51] there are older versions at root@install1002:~/stretch [18:03:54] up to you [18:04:38] hmmmmmm [18:07:45] I'm leaning toward 10.3, but talking to Bryan a bit. [18:08:29] Do you know how bad such an upgrade would be? [18:08:44] Like does that require dump and restore to accomplish or is it similar to other upgrades? Have we tried it? [18:09:15] no, on upgrade we don't do that, we reinstall only the / partition, and manually skipp the formatting of /srv [18:09:22] then upgrade in place [18:09:47] ahhh ok [18:09:53] you can even upgrade without reimaging [18:10:07] note we have only tested 10.3 very little [18:10:23] we have a couple of hosts on production, but nothing else [18:11:23] We may have just stopped deadlinks from writing [18:11:41] which should be s52897 [18:11:52] we are aggressive killing queries too [18:14:33] not sure that is doing anything [18:15:26] no, it is not, it went into a bad state again [18:15:34] Alerts, yup [18:15:39] Just got paged [18:16:06] let's downgrade, then [18:16:19] Ok. We still have a functioning replica, I guess [18:16:29] at least until we restart there :-p [18:16:46] May as well try downgrade with how bad it is behaving [18:17:08] I can uninstall unattended upgrades on there for now [18:17:15] please do [18:17:32] I did on the replica, fwiw [18:18:27] I am stopping mariadb, if I can [18:19:27] done. puppet is disabled and unattended-upgrades is purged [18:19:49] Unpacking wmf-mariadb101 (10.1.39-1) over (10.1.41-1) ... [18:22:00] Server version: 10.1.39-MariaDB MariaDB Server [18:22:56] yay! [18:23:03] Hopefully all goes well with this [18:23:05] I would suggest to do a local dump, if there is data corruption, a dump will likely catch it [18:23:12] ok [18:23:18] there is probably enough space on the local filesystem [18:23:40] use mydumper for faster generation and load, it shouldn't take more than 1h30 or so [18:24:04] you can even send it remotelly if there is not enogh space [18:24:50] monitor "journalctl -fu mariadb" [18:25:11] if it starts splitting innodb monitor outputs, it is already too late [18:25:46] there is also aside from data, hw and sofware a thrd things that could be contriubuting to it [18:25:57] and that is host resource constraints [18:25:58] I'll try locally maybe. The DB partition is using 2 TB with 1.3 TB free. If the dump comes out small enough that could work [18:26:17] make sure you use compression and you should get ~5 less space usage [18:26:26] check out backup scripts if unsure about what options to use [18:26:37] Hrm. 
yeah I have no idea [18:26:42] either it will be completed, and you will be able to switch or something [18:26:55] or it will crash and you will be able to know on thiwch data part [18:27:03] hah [18:27:56] good luck! [18:28:53] So after backup, just restart the server and...it should be downgraded, right? [18:29:38] no, it is downgraded already [18:30:13] what there is a big chance is that that would not fix the issue [18:30:21] Ah ok, so just make a backup in case it's really bad :) [18:31:04] I might try to find another spot for backup, then. I need the space buffer on this disk. Thanks! [18:32:46] you can even backup from the replica, except the filterd databases [18:32:57] and then backup only the heavy writers from the master [18:33:02] that would be safer [18:33:17] Good thinking [18:33:24] Thanks so much for your help! [18:34:04] note we warned the heavy writers that they may lose data, so also as an option, copy from the replica in file format for faster recovery [18:34:37] if you have a copy, I would just failover ther [18:44:11] bstorm_: could you check hw errors and memory on host, it used to crash when db is using more than 15GB of memory. maybe just a coincidence: https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1&panelId=1&fullscreen&from=now-12h&to=now [18:44:26] phyical host I mean [18:45:33] Sure. Will do. [18:46:00] just to discard that [18:49:11] Just got back (had run to the restroom) [18:55:53] dmesg is pretty clean. There's some slow NMI handlers here and there. Nothing obviously hardware [19:01:00] jynus: it seems suspiciously stable at the moment... [19:02:02] https://www.irccloud.com/pastebin/6N7E8C6h/ [19:02:14] I'm not seeing much else in the logs [23:12:26] 10DBA, 10Data-Services, 10Operations: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10colewhite) p:05Triageβ†’03Normal
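(The host-level check asked for near the end, hardware errors plus the ~15 GB memory correlation, is roughly the sketch below. The IPMI step applies to the hypervisor rather than the clouddb1001 guest, and assumes the usual tooling is installed.)

```bash
# on the hypervisor: hardware complaints in kernel and IPMI logs
dmesg -T | grep -iE 'mce|hardware error|ecc|i/o error' | tail -n 50
sudo ipmitool sel elist | tail -n 20      # assumes ipmitool is present
# in the guest: how close is mysqld to the ~15 GB mark noted above?
free -h
sudo mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;"
```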