[11:02:23] hey DBA team, we are stopping toolsdb VMs in the affected cloudvirts due to the PDU operations [11:16:18] that is up to what cloud thinks is better for tools, we will help you with any decision you take [11:17:05] mariadb was stopped, the VM shutdown (clouddb1001) [11:17:11] that's toolsdb_primary [11:17:30] the hypervisor is also shutdown (cloudvirt1019) [11:17:39] trying to prevent any form of disk corruption [11:18:13] same for clouddb1004 (osmdb_secondary) which is postgresql [11:18:14] ok to me [11:18:31] makes sense with the large amount of MyISAM tables [11:18:50] sorry, I thought you were asking first [11:19:29] this is what brooke suggested the other day in our WMCS team meeting [11:19:40] no, I mean I am ok [11:19:46] I was agreeing with you [11:19:50] cool [11:20:20] you don't have to ask [11:20:29] just ask help if you need it [11:21:47] ask for* if you need it [11:21:51] *help [11:22:34] ok, thanks! [11:22:49] hopefully we will start everything after the PDU switch and everything will be working [11:28:06] arturo: yeah, Brooke asked me the other day about it and we agreed that stopping mysql was a good decision, as we have been doing the same with production hosts [11:28:55] arturo: also told her we can help if needed indeed [11:29:02] cool thanks [12:07:07] 10DBA, 10Core Platform Team, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Performance Issue, 10mariadb-optimizer-bug: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (10Marostegui) I have analyzed all the queries that went overnight till ar... [12:07:41] jynus: db2092 finished the analyze, I am going to repool it, you can scratch it from the list of things [12:08:20] ok [12:24:33] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Marostegui) [12:24:53] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Marostegui) p:05Triageβ†’03Normal [12:25:31] 10DBA, 10Core Platform Team, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Performance Issue, 10mariadb-optimizer-bug: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (10Marostegui) The `analyze table` on db2092 for `revision` didn't help wi... [12:36:04] hey [12:36:14] I'm trying to start mariadb in clouddb1001 and I get this [12:36:16] https://www.irccloud.com/pastebin/InFyHFkS/ [12:37:09] not sure if I'm supposed to created that dir by hand [12:37:58] marostegui: ? [12:44:18] anyway, I created the directory and chmod'd it by hand then mariadb was happy to start [12:46:43] that's strange [12:46:51] it was working fine before? [12:47:01] ah the socket directory [12:47:17] puppet creates it for us, but we've had those issues in the past [12:48:40] ok that makes sense [12:48:48] puppet was stopped in the server during the operation window [12:49:02] and after the reboot, /var/run is wiped [12:50:51] yep [13:17:15] marostegui: I could use some help to check if toolsdb is working fine [13:17:30] arturo: sure, what do you need? 
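(The missing-directory error pasted above is the usual post-reboot symptom: /var/run is a tmpfs, so the mysqld socket directory disappears on reboot and only comes back once Puppet runs. A minimal sketch of the manual workaround described here, assuming the usual mysql:mysql ownership and the default socket path; not necessarily the exact commands that were run.)

```bash
# /var/run is wiped at boot; recreate the socket directory by hand if Puppet
# has not run yet (ownership and mode are assumptions)
sudo mkdir -p /var/run/mysqld
sudo chown mysql:mysql /var/run/mysqld
sudo chmod 0755 /var/run/mysqld
sudo systemctl start mariadb
journalctl -u mariadb -n 50 --no-pager   # confirm a clean start
```

(Re-enabling and running Puppet afterwards puts the directory back under its normal management.)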
[13:18:21] we just got a notification that some RW ops are failing [13:18:46] arturo: clouddb1001 is running with read_only OFF [13:18:48] and has a slave hanging [13:18:57] ok [13:19:00] so it is writable and has its normal slave [13:19:41] so everything looks normal there? [13:19:43] weird [13:19:58] from mysql point of view, it does [13:20:11] what are you exactly seeing? [13:20:32] there are lots of stuff running too [13:20:35] from what I can see [13:20:49] 15:14 <+β€―icinga-wm> PROBLEM - toolschecker: toolsdb read/write on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/db/toolsdb - 340 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:21:24] there are lots of things there running, but I am not sure if that is the normal state or not [13:21:41] which stuff are you refering to? [13:21:51] like lots of selects, and inserts [13:21:58] and things taking time to run [13:22:16] so it can be a load issue? [13:22:53] I don't know its normal status, I can see writes taking long time to be done, and others being done quite fast, so that might its normal status? [13:23:10] lots of selects as well [13:23:21] do you have some graphs? [13:23:35] looking for them [13:25:04] too bad we don't have journal logs before the reboot :( [13:25:49] mysql just died? [13:26:01] Oct 24 13:25:23 clouddb1001 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT [13:26:13] mmm [13:26:18] can you check hw logs? [13:26:45] Oct 24 13:25:21 clouddb1001 mysqld[1622]: *** buffer overflow detected ***: /opt/wmf-mariadb101/bin/mysqld terminated [13:26:45] in the hypervisor? I don't see anything weird [13:27:28] mysql back up [13:27:29] oh that doesn't sound good [13:28:20] I see you upgraded to 10.1.41? [13:28:27] I just ran mysql_upgrade for you [13:28:27] ?? [13:28:45] did you run an upgrade? [13:28:50] I didn't upgrade anything. Not sure if the reboot did something automatically [13:28:52] because i was surprised to see 10.1.41 [13:28:58] but the mysql_upgrade didn't run [13:29:35] could the upgrade happen automatically and the after the reboot the new version started? [13:29:39] I don't see any more hangs [13:29:59] arturo: Not sure how it works in your infra, in production it definitely doesn't happen [13:30:10] graphs BTW https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1 [13:30:51] https://www.irccloud.com/pastebin/5UQ2Pb4w/ [13:31:02] Not sure if that crash had anything to do with the upgrade+mysql_upgrade...but from what I can see it works fine now [13:31:13] Like no transactions are hanging [13:31:23] Start-Date: 2019-09-12 06:07:47 [13:31:23] Commandline: /usr/bin/unattended-upgrade [13:31:23] Upgrade: wmf-mariadb101:amd64 (10.1.39-1, 10.1.41-1) [13:31:23] End-Date: 2019-09-12 06:08:11 [13:31:28] :-/ [13:31:35] so unattended-upgrades updated the package the other day [13:31:43] well, last month [13:31:47] yeah, I see [13:32:11] might be related to the crash [13:32:16] (I don't see how) [13:32:27] But too much of a coincidence? [13:32:32] so my next question is: could the package be upgraded and the service not restarted? [13:32:40] yes [13:32:55] not ideal, but it can done [13:32:56] then that's what happened. 
Today with the reboot, the new version started [13:33:13] we have been sitting in a pending upgrade for a month, and today it triggered [13:33:22] normally we try to run the mysql_upgrade script before letting things start to run [13:33:26] (replication, reads etc) [13:33:30] to avoid things like this [13:33:32] like weird states [13:33:45] arturo: you might want to revisit that policy of unattended upgrades for mariadb [13:33:53] that's true [13:34:05] will open a phab task [13:34:17] so far everything looks good after the crash+upgrade [13:37:56] 10DBA, 10Tools, 10cloud-services-team (Kanban): Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 (10aborrero) [13:38:09] I just created T236384 [13:38:11] T236384: Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 [13:38:13] 10DBA, 10Tools, 10cloud-services-team (Kanban): Toolsdb: prevent unattended-upgrades from upgrading mariadb - https://phabricator.wikimedia.org/T236384 (10aborrero) p:05Triageβ†’03High [13:39:15] thanks marostegui [13:41:12] yw [13:47:38] 10DBA, 10Core Platform Team, 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Anomie) Looks like the statistics are probably ok-ish, forcing the index each way shows similar row es... [13:47:56] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Anomie) [14:27:30] it seems to have dropped again? [14:27:32] briefly [14:28:34] mariadb crashed [14:35:12] marostegui ^ It recovered with one table marked in need of repair. I don't see any reason why per se so far. I'll try to take some notes. [14:38:05] I'll run a repair on it [14:38:31] Hrm...the level of churn on this maybe I shouldn't [14:50:53] Damn it, it's flapping. Looking into options [15:16:45] jynus: are you around? toolsdb is crashing and we could use a rescue. [15:17:09] crashing? [15:17:15] I thought it was put down? [15:17:34] send me the hostname, please [15:18:24] is it clouddb1001? [15:18:34] jynus: we shut it down during the pdu update. When it came back up it had an unexpected version upgrade waiting... [15:18:36] clouddb1001.clouddb-services.eqiad.wmflabs [15:18:38] and it's never really worked since. [15:18:41] It's the VM version [15:18:53] bstorm_: is running repairs in r/o mode right now [15:19:11] Yeah, but it's probably not going to help much...it's not myisam tables πŸ˜› [15:19:14] did you run apt upgrade on that host? [15:19:18] I just got up and am trying to think of things [15:19:35] We didnt'...our unattended upgrades did and apparently we didn't imagine that would happen [15:19:41] ugh [15:19:43] jynus: it was unattended upgrades that did it. 
https://phabricator.wikimedia.org/T236384 [15:19:44] yeah, ugh [15:19:46] 10.1.41 is unstable [15:19:48] Now we know that it was configured to upgrade mariadb :( [15:19:55] please disable that [15:20:01] not important now [15:20:09] but probably the root cause [15:20:19] Good to know [15:20:19] (linked task above is about disabling upgrades for maria/mysql) [15:20:46] [ERROR] Do you already have another mysqld server running on socket: /var/run/mysqld/mysqld.sock [15:20:52] someone tried to start it twice [15:21:01] I'm running mysqlcheck right now [15:21:06] don't [15:21:06] huh [15:21:08] we shouldn't [15:21:11] ok should I kill it? [15:21:16] let's only have 1 [15:21:24] hand at the same time or things get worse [15:21:26] I have it [15:21:59] I mean, should I stop the mysqlcheck command (is that safe?) [15:22:08] I'm aiming to be careful here [15:22:24] so there is a process ongoing [15:22:36] and mysql is up [15:22:40] mysqlcheck has been running for a long time [15:22:56] I just want to make sure it is safe to kill that before I do [15:23:29] don't worry, bstorm_ I have it [15:23:33] Ok [15:23:37] :-) [15:24:14] I will kill it [15:24:48] There it goes :) [15:24:54] will explain later [15:24:58] πŸ‘πŸ» [15:25:14] I am not ignoring you, just trying to fix first, then discuss later [15:25:30] No worries at all. I'll drink tea and work on waking up more [15:26:06] uf, bad signs on log [15:26:50] I will focus on the more important stuff, can I kill users connecting? [15:27:00] aka user connections? [15:27:03] Sure. I believe we are telling people it is unstable [15:27:07] ok [15:27:20] I will need that to properly finish the upgrade [15:27:25] Makes sense [15:28:29] I will restart it so it gets bound only to localhost [15:28:39] then upgrade, then see what is the state [15:28:57] Ok [15:29:08] disabling puppet [15:29:15] not sure where to log [15:29:20] doing it here [15:29:40] jynus, if you want to you can "!log admin " in #wikimedia-cloud [15:29:43] but logging here is also fine [15:29:48] and we can cut-and-paste later [15:29:56] will restart mariadb in skip-networking mode [15:30:00] !log clouddb-services in #wikimedia-cloud is ideal, but here is fine [15:30:00] bstorm_: Not expecting to hear !log here [15:30:00] no outside access [15:30:05] :) [15:30:26] so people know it won't work for a while [15:30:54] now that there is no conenction, I can upgrade/repair [15:31:02] πŸ‘πŸ» [15:31:23] I don't think it should take much [15:31:41] Repairing tables [15:31:43] s51290__dpl_p.t_s_all_dabs [15:31:45] Error : Table 's51290__dpl_p.t_s_all_dabs' doesn't exist in engine [15:31:45] We've announced that it is suffering general instability, so we can adjust that to "hard down" if it takes a while. Otherwise, that seems to cover it [15:31:46] status : Operation failed [15:31:57] That doesn't sound good [15:32:35] My first thought on this one was to fail over, but then I realized the "upgrade" would be on the secondary as well. [15:33:38] mv /srv/labsdb/data/s51290__dpl_p/t_s_all_dabs.frm /srv/labsdb/ [15:33:44] fyi [15:33:49] that is an issue for another time [15:33:56] πŸ‘πŸ» [15:34:08] stoppping again [15:35:11] according to the log innodb is clean [15:35:30] That sounds very good [15:35:59] well the "--Thread 140163871127296 has waited at dict0stats_bg.cc" [15:36:07] on previous log are not that good [15:37:16] And we are now upgraded to a version we'd rather not be on in general, unfortunately, right? 
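(The recovery jynus walks through in this block, restart with no outside access, finish the pending upgrade, then reopen, looks roughly like the sketch below. These are not the exact commands used; the config include path and the wmf-mariadb101 binary locations are assumptions based on the paths quoted in the log.)

```bash
sudo puppet agent --disable "toolsdb recovery"
sudo systemctl stop mariadb
# refuse TCP connections while mysql_upgrade runs
# (assumes /etc/mysql/conf.d/ is included by the server config)
printf '[mysqld]\nskip_networking = 1\n' | sudo tee /etc/mysql/conf.d/zz-recovery.cnf
sudo systemctl start mariadb
sudo /opt/wmf-mariadb101/bin/mysql_upgrade
sudo systemctl stop mariadb
sudo rm /etc/mysql/conf.d/zz-recovery.cnf
sudo systemctl start mariadb
sudo puppet agent --enable
journalctl -fu mariadb   # watch for InnoDB monitor output / semaphore waits
```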
[15:37:32] yeah, but to a version we don't really like [15:37:43] we were waiting for the next one that releases tomorrow [15:37:48] I see [15:38:26] We'll prioritize that ticket to stop this from "auto upgrading" [15:38:38] I'm looking at that now [15:38:41] so the rest can autoupgrade [15:38:53] it is only wmf-mariadb we don't want [15:39:02] fair [15:39:12] I am sweeping the logs now [15:40:25] no errors since the last restarts [15:40:34] we will see when we allow external conections [15:40:50] enabling puppet [15:41:38] so we support auto-upgrading, but on the non-wmf package [15:41:41] the debian one [15:41:56] I see [15:42:22] "support", more like we don't support it on the wmf-mariadb one [15:42:30] :) [15:42:59] I see the conenctions coming in now [15:43:12] Hopefully it doesn't randomly explode again [15:43:26] no ongoing errors [15:43:31] as before [15:43:36] we'll see [15:43:36] good [15:43:39] Yeah. [15:43:43] but status should be no back up [15:44:01] so we saw some issue on 10.1.41 [15:44:11] but we didn't remove it from the repo [15:44:16] because we wanted to test it [15:44:22] we skipped that from produciton [15:44:35] That makes sense. I was surprised when this upgraded itself. [15:44:36] and will upgrade from 39 to 42 directly on production [15:44:47] you probably want to do the same [15:44:54] Fair [15:44:55] remember the labsdb issues? [15:45:01] Oh yes [15:45:02] we belive it could be connected [15:45:10] Interesting [15:45:19] but it is difficult to proof [15:45:25] so we just are careful [15:45:50] should we do the same on the replica? [15:45:57] I think so [15:46:09] as immediate actionables, upgrade to 10.1.42 as soon as it is available [15:46:14] we don't want to be on .41 [15:46:20] Ok, I'll make a ticket [15:46:32] the other thing is the table or view that got corrupted [15:46:48] I saved the .frm, but there was no data to save [15:47:06] huh. [15:47:23] we should contact the user and tell him that table is gone [15:47:27] I can reach out to the user...yeah [15:47:38] I am not to worried, it most likely failed on creation [15:47:44] that is why it was empty [15:47:55] (that would be my assumption) [15:48:28] still no flagrant errors like before [15:48:46] le me check clouddb1002 [15:49:06] so, sorry I cut you before [15:49:25] repair will not work when there is ongoing traffic because it will get stuck on metadata on a non-read-only host [15:49:37] specially on these community dbs [15:49:51] killing them "it's complicated" [15:50:29] I thought it was set to read-only before trying the repair? [15:50:35] Got it [15:50:40] andrewbogott: nope [15:50:44] 'k [15:50:52] some scripts may do that [15:51:08] but wmf-mariadb* package on purpose don't touch the database [15:51:19] because one is supposed to do everthing with time [15:51:25] depooled, etc. [15:51:32] so it is more involved [15:51:55] that is why for caual databases (for end users VPS) we recommend the debian package [15:52:00] is less attended [15:52:11] wmf- ones require more attention [15:52:30] so clouddb1002 is in 10.1.38 [15:52:38] Oh? [15:52:38] can you maybe disable unatended upgrades? [15:52:53] Well, we can try to sort that out [15:52:55] and I would suggest to upgrade it to 10.1.39 [15:53:04] and later jump to 10.142 [15:53:12] when it is tested elsewhare [15:53:23] we plan to upgrade labsdb1011 nest week [15:53:43] the replication seems clean [15:53:46] I seem to recall that the general notion is to get things out of rotation and run mysql-upgrade or something like that, right? 
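(T236384 boils down to letting everything else auto-upgrade while keeping unattended-upgrades away from the wmf-mariadb package, since upgrading it needs a coordinated mysql_upgrade and restart. One way to express that, as a sketch; the file name is arbitrary and the eventual puppetized fix may differ.)

```bash
cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-mariadb
// Never auto-upgrade the locally built MariaDB package; it needs a
// coordinated mysql_upgrade + restart done by a human.
Unattended-Upgrade::Package-Blacklist {
    "wmf-mariadb101";
};
EOF
sudo unattended-upgrade --dry-run --debug | grep -i blacklist   # sanity check
```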
[15:53:51] so that is good [15:53:57] Definitely [15:54:29] yeah, "depool", whatever that means in each env [15:54:29] Once the packages are installed that is and mariadb has been stopped and started again [15:54:37] install wmf- ones [15:54:39] In this env, close off to the outside world entirely [15:54:42] lol [15:54:44] apparently [15:54:45] they may not be on instal1002 [15:54:52] ok [15:54:58] I can grab them for you [15:55:04] then run mysql_upgrade [15:55:07] then restart [15:55:10] Ok [15:55:13] the last restart is normally not needed [15:55:23] but just to be careful, in short, just do it [15:55:36] then monitor the journalctl log [15:55:44] to see everthing is ok [15:55:57] in this case it was spittiong the monitor output [15:56:04] which happens when there is a bad internal error [15:56:13] like internal blockage or something [15:56:20] could also be hw [15:56:38] give it a good ol' check to dmesg kernel msg, etc. [15:57:12] but my first guess would be an upgrade package without mysql_upgrade, messing up with the internals [15:58:00] not sure how useful I was, I just pressed keys until thins stopped giving errors [15:58:27] lol [15:58:35] Thank you very much for your help [15:58:41] thank you jynus! [15:58:54] let me know if things happen again [15:58:57] Does it make sense to just wait on the replica upgrade until 10.1.42? Like since it isn't broken now? [15:58:58] * andrewbogott orders jaime a cape [15:59:10] bstorm_: yeah [15:59:14] just block autoupgrade [15:59:16] Ok, I'll aim for that then [15:59:29] I'll check that issue for now [15:59:32] Thank you! [15:59:33] even wait for it to be on labsdb1011 for some time [15:59:41] sorry for the trouble [16:00:17] we didn't delete 10.1.41 from the repo, becaus techinically it is not bad, it is just a statistically bad version [16:00:47] Well, it exposed a problem in our setup, so that's good...just kind of painfully lol [16:14:43] jynus: should we schedule a couple of hours next week for the bacula failover/migration ? [16:14:56] yes please [16:15:04] Monday is a national holiday over here, but starting from Tuesday on I am all yours [16:15:13] ok, morning or later? [16:15:50] ^ akosiaris [16:16:23] hmm next week is the DST removal, which means waking up an hour earlier biologically speaking, so everything should be adjusted for that if I am to have a working brain [16:16:31] up to you [16:16:34] I am assuming you too [16:16:52] I predict a couple of false tries [16:16:52] so, 10:00 UTC on Tuesday? [16:16:55] ok to me [16:17:02] let me setup a calendar [16:17:10] invite [16:17:11] yes, please do, I was about to suggest that [16:17:15] 2 hours for starters? 
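(Circling back to the upgrade checklist jynus gave just above, depool, install the wmf- packages, run mysql_upgrade, restart, then monitor: the monitoring step amounts to something like this sketch, run on the freshly restarted host.)

```bash
journalctl -fu mariadb        # repeated InnoDB monitor dumps = bad internal state
dmesg -T | tail -n 100        # kernel or hardware complaints
sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Uptime'"
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Error|Seconds_Behind'
```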
[16:17:18] yeah [16:17:27] but plan for larger [16:17:32] I prefer not to [16:17:38] 10DBA, 10Data-Services, 10Operations: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) [16:17:39] and if we have to abort, we reschedule [16:17:53] I don't want to be without new backups for long [16:18:19] BTW, last thing, the buster incompatibility may change slightly the stratggy [16:18:24] will talk next week [16:18:39] ok [16:18:48] there is pdu maintenance at that time [16:18:57] let me check our hosts are not affected [16:24:11] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:24:20] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) p:05Triageβ†’03High [16:25:18] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:26:45] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris) [16:27:02] thanks, I was doing exactly that [16:27:16] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:29:07] 10DBA, 10Operations, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) So because of buster clients and jessie storage daemons cannot talk to each other, we will have to alter slightly the upgrade strategy. Several opt... [16:48:01] jynus: it crashed again [16:48:48] (I don't know any details yet, just relaying to you while Brooke gets some breakfast) [16:50:28] I can see [16:51:06] jynus: it looks like the pending version on the replica is 10.1.41, so if we want to fail over we'll have to do some work there as well [16:51:32] jynus: I'm keeping my hands off for now but lmk how I can help [16:55:44] it starts again at Oct 24 16:43:24 [16:56:03] "Oct 24 16:43:24 clouddb1001 mysqld[11861]: --Thread 140240572368640 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:" [16:56:22] yep, that's when I got the alert [16:56:46] (and just now got a recovery alert) [17:00:56] my suggestion is to dump all data and reload it again [17:01:54] is there room on the local filesystem to do that? [17:02:04] probably yes, there is 2 TB [17:02:10] (And, that means you've concluded that this is a corruption issue and not a bad package issue?) [17:02:29] no, I am just suggesting something [17:02:34] ok [17:02:55] if we do the dump and reload, is there any chance that the replica will copy the temporary-empty tables and get wiped? [17:03:33] no, the danger would be to overwrite the existing tables [17:03:51] the plan would be to stop the replica, reload with binlog and the restart the replica [17:04:17] *reload without binlog [17:04:24] so there is always a plan B [17:04:57] the issue is that you could do that and this continues to happen because it is a server bug [17:05:12] no info at this time [17:05:22] would a total dump/reload take… hours? Or a few minutes? [17:05:34] 1.3 tb, depending on the storage [17:05:36] Also... is reverting to the last known good server version an option? 
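(The dump-and-reload option floated above would, with the mydumper/myloader pair that comes up a little later in the conversation, look something like this sketch. Paths, thread counts and credentials are assumptions, not what was actually run.)

```bash
DUMPDIR=/srv/backups/toolsdb-$(date +%F)    # assumes enough room on /srv
# logical, compressed dump of the whole instance
sudo mydumper --host=localhost --user=root \
     --outputdir="$DUMPDIR" \
     --compress --triggers --events --routines --threads=8
# reload on the rebuilt instance; myloader only writes the restore to the
# binlog if --enable-binlog is passed, which is how "reload without binlog"
# keeps the replica from replaying it
sudo myloader --host=localhost --user=root \
     --directory="$DUMPDIR" --overwrite-tables --threads=8
```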
[17:05:46] between 4 and 12 hours [17:06:01] yes, but it could also be a problem [17:07:28] I don't have a good solution for you, on production, if a server crashes we just nuke it and use another [17:07:49] sure [17:07:59] because there is no reason to treat is as a pet [17:08:11] I cannot tell if this is hw, software or data [17:08:11] I'm hoping bstorm_ will appear and have an opinion. [17:08:16] sorry [17:08:37] I'm not clear on how the replica plays into this β€” we could upgrade 2002, fail over, and see if it crashes in the same way [17:08:46] but then I suppose we run the risk of having corruption in both places [17:08:53] s/2002/1002/ [17:08:55] you could also copy it from the replica [17:09:07] data corruption no [17:09:13] replica is a logical copy [17:09:24] but if it is hw or sw, it won't solve it [17:10:08] cannot you create a new vm, copy it in a hot way and try? [17:10:22] I guess not because there is not resources [17:10:29] I'm not sure β€” checking [17:11:18] there is not a huge set of options with limited resources [17:11:47] it looks like there is barely room for another VM the size of toolsdb1001 on the hypervisor [17:12:00] so when you say 'copy in a hot way' you mean copy the existing VM? [17:12:15] I was thinking of xtrabackup [17:12:25] moving the data [17:12:34] or mydumper, moving it logically [17:12:42] ah, ok [17:12:50] so, yes, that can probably be done [17:13:04] you can also do that from the replica [17:15:50] hm, looks like it just crashed again [17:16:46] I'm checking available space for backup purposes [17:18:36] it didn't crash [17:19:14] it's been up for 1412 seconds [17:19:40] it is complaining about blockage, however [17:19:57] normally because low performance [17:22:12] we had the same issue (though on mariadb 10.2.27 (upgraded from 10.2.24)) [17:22:15] bstorm_: so, there is technically room for another pair of toolsdb servers on those virthosts (I say 'technically' because if we do that it'll be overprovisioned for disk space, and pretty close to breaking) [17:23:05] So the way these are set up, we put them on specific servers to make sure we could manage them carefully. So you understand, jynus. We'd likely be spinning up on the same server or the other server in the pair. One thing that changed here also is the server was rebooted for some security kernel thing, so that is a possible factor. [17:23:37] overprovisioned seems dangerous on databases, but I know that the postgres databases have lots of free space [17:24:31] clouddb2001 is occuping 2.2Tb of physical space. the host has 2.5Tb available [17:24:36] so I see a lot of import ongoing [17:24:39] by users [17:24:54] those may explain the slowdown [17:24:56] import? like someone is doing a big load [17:25:01] but clouddb2001 is provisioned for…3.4Tb [17:25:14] several users doing batch inserts in parallel [17:25:28] that's annoying. [17:25:36] that could explain the slowdown [17:25:41] it could [17:25:46] but there is not much that can be done [17:26:05] andrewbogott: 370G is free on the two postgres servers...not as much as I thought [17:26:49] If we want to spin up another replica or something, I'd do it with another cloudvirt where there's lots of free space with. [17:27:04] * andrewbogott looks for that [17:27:04] These have SSDs right? 
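(For the "lots of things running" / parallel batch-insert pattern described above, a quick way to see who is hammering the server and whether InnoDB is piling up semaphore waits is something like the following sketch, run from an admin account on clouddb1001.)

```bash
sudo mysql -e "
  SELECT user, db, time, state, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE command <> 'Sleep'
  ORDER BY time DESC LIMIT 20;"
sudo mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 20 '^SEMAPHORES'
```

(Long-running inserts from a handful of tools plus growing semaphore waits is consistent with the "We intentionally crash the server, because it appears to be hung" assertion seen in the journal.)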
[17:27:07] yes [17:27:15] s51187 s52561 s52897 s52532 [17:27:17] So dumping to disk would be very fast at least [17:27:27] even our production hosts wouldn't be able to handle that [17:27:48] so my guess right now is that it gets slow until mysql things it is stalled and crashes itself [17:28:16] or runs out of memory, are there memory graphs? [17:28:41] let's pause the moving hw stuff [17:28:56] now it crashed [17:29:46] InnoDB: We intentionally crash the server, because it appears to be hung. [17:29:57] bstorm_: we have 'lifeboat' hypervisors with ssds and 10Gb networking. Want me to create a VM or two there in case you decide to go that way? [17:29:58] There are...finding the link [17:30:04] InnoDB: Assertion failure in thread 140384919348992 in file srv0srv.cc line 2410 [17:30:08] Sure, we can always kill them :) [17:31:11] jynus the same error happened to us after upgrading from mariadb 10.2.24 to 10.2.27 which makes me think there's a bug in mariadb? [17:31:18] jynus: https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1 [17:33:19] I'll go dig up those usernames [17:33:52] there is really no fix, ask for those users to skip the imports to see if that helps [17:35:11] xtools is the first [17:35:38] wsexport [17:36:08] anything I can help with? my team maintains both xtools and wsexport, heh [17:36:18] deadlinks and listeria [17:36:43] musikanimal: if you can make them not do batch imports right now...or maybe have them do it in a staggered fashion? [17:37:00] I think that the database becoming unavailable might be triggering some tools to load stuff [17:37:16] Do you also work with listeria and deadlinks, perhaps? [17:37:19] sure, but it is a cyclical thing [17:37:24] Hmm [17:37:50] batch imports of what? sorry, a lot to read through above [17:38:04] I'm not sure. jynus: do you have any insight there? [17:38:12] I can disable the writes to tools-db, if that's what you mean? [17:38:34] they are not important for xtools. I'm not certain about wsexport [17:38:56] I didn't get full logs, just connections doing inserts of the same type on several connections [17:39:18] Ah ok [17:39:26] musikanimal: that might help for now [17:39:57] deadlinks is community tech and listeria is magnus [17:40:22] I am not saying that is the cause, but trying to think ways to make things more stable [17:40:31] I see. [17:40:38] other option would be to move heavy hitters to a separate db [17:40:57] so bisect if there is a single query or load that causes it [17:41:08] here db == instance [17:41:33] musikanimal: please disable that, if you could. It might help at least until we can get things looking better. [17:42:05] done! I think [17:42:11] Thanks :) [17:42:49] there is also someone inserting web access long on the db [17:42:55] I created clouddb1001bak and clouddb1002bak. Now I need to run to the doctor but will try to keep my phone and laptop handy. [17:42:58] which is probably a bad idea, but not related [17:43:08] thanks andrewbogott [17:43:32] s52481__stats_global [17:44:32] s51187 usage timeline is the one importing now into s51187__xtools_prod [17:46:32] musikanimal: does that make sense? [17:47:06] not down, but toasted again [17:48:06] :( [17:48:19] hmm maybe! I don't know what you mean when you say "importing" [17:48:31] large inserts, I believe [17:48:35] but yes, the xtools writes are just simple usage tracking. not important at all [17:48:53] they should be very small, usually updates, not inserts [17:48:55] apparently usage timeline is still writing? [17:49:13] is it? 
let me take another look [17:50:25] I am trying to do some kills to see if I can make it not crash [17:50:35] Thank you! [17:50:57] but it may be too late [17:51:21] I am restarting it [17:53:54] it doesn't really answer to the sigint [17:54:05] What an exciting surprise mess [17:54:15] ugh [17:57:34] I'm working on tracking down other users of the DB accounts here [17:58:25] i've setup some pt-kills on a couple of screens [17:59:34] That can't hurt [17:59:53] they are called pt-kill and pt-kill2 as root [17:59:59] in case you want to kill those [18:00:44] it won't fix anything, but maybe it can kill stuck connections before a crash [18:01:04] ok [18:02:31] the other thing you can try is to downgrade mariadb [18:02:39] herby dragons [18:02:52] I was worried about the dragons on that... [18:03:05] It seems like a logical solution, but... [18:03:16] or even 10.3 [18:03:33] That would be using buster? [18:03:42] yeah, probably [18:03:51] there are older versions at root@install1002:~/stretch [18:03:54] up to you [18:04:38] hmmmmmm [18:07:45] I'm leaning toward 10.3, but talking to Bryan a bit. [18:08:29] Do you know how bad such an upgrade would be? [18:08:44] Like does that require dump and restore to accomplish or is it similar to other upgrades? Have we tried it? [18:09:15] no, on upgrade we don't do that, we reinstall only the / partition, and manually skipp the formatting of /srv [18:09:22] then upgrade in place [18:09:47] ahhh ok [18:09:53] you can even upgrade without reimaging [18:10:07] note we have only tested 10.3 very little [18:10:23] we have a couple of hosts on production, but nothing else [18:11:23] We may have just stopped deadlinks from writing [18:11:41] which should be s52897 [18:11:52] we are aggressive killing queries too [18:14:33] not sure that is doing anything [18:15:26] no, it is not, it went into a bad state again [18:15:34] Alerts, yup [18:15:39] Just got paged [18:16:06] let's downgrade, then [18:16:19] Ok. We still have a functioning replica, I guess [18:16:29] at least until we restart there :-p [18:16:46] May as well try downgrade with how bad it is behaving [18:17:08] I can uninstall unattended upgrades on there for now [18:17:15] please do [18:17:32] I did on the replica, fwiw [18:18:27] I am stopping mariadb, if I can [18:19:27] done. puppet is disabled and unattended-upgrades is purged [18:19:49] Unpacking wmf-mariadb101 (10.1.39-1) over (10.1.41-1) ... [18:22:00] Server version: 10.1.39-MariaDB MariaDB Server [18:22:56] yay! [18:23:03] Hopefully all goes well with this [18:23:05] I would suggest to do a local dump, if there is data corruption, a dump will likely catch it [18:23:12] ok [18:23:18] there is probably enough space on the local filesystem [18:23:40] use mydumper for faster generation and load, it shouldn't take more than 1h30 or so [18:24:04] you can even send it remotelly if there is not enogh space [18:24:50] monitor "journalctl -fu mariadb" [18:25:11] if it starts splitting innodb monitor outputs, it is already too late [18:25:46] there is also aside from data, hw and sofware a thrd things that could be contriubuting to it [18:25:57] and that is host resource constraints [18:25:58] I'll try locally maybe. The DB partition is using 2 TB with 1.3 TB free. If the dump comes out small enough that could work [18:26:17] make sure you use compression and you should get ~5 less space usage [18:26:26] check out backup scripts if unsure about what options to use [18:26:37] Hrm. 
yeah I have no idea [18:26:42] either it will be completed, and you will be able to switch or something [18:26:55] or it will crash and you will be able to know on thiwch data part [18:27:03] hah [18:27:56] good luck! [18:28:53] So after backup, just restart the server and...it should be downgraded, right? [18:29:38] no, it is downgraded already [18:30:13] what there is a big chance is that that would not fix the issue [18:30:21] Ah ok, so just make a backup in case it's really bad :) [18:31:04] I might try to find another spot for backup, then. I need the space buffer on this disk. Thanks! [18:32:46] you can even backup from the replica, except the filterd databases [18:32:57] and then backup only the heavy writers from the master [18:33:02] that would be safer [18:33:17] Good thinking [18:33:24] Thanks so much for your help! [18:34:04] note we warned the heavy writers that they may lose data, so also as an option, copy from the replica in file format for faster recovery [18:34:37] if you have a copy, I would just failover ther [18:44:11] bstorm_: could you check hw errors and memory on host, it used to crash when db is using more than 15GB of memory. maybe just a coincidence: https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1&panelId=1&fullscreen&from=now-12h&to=now [18:44:26] phyical host I mean [18:45:33] Sure. Will do. [18:46:00] just to discard that [18:49:11] Just got back (had run to the restroom) [18:55:53] dmesg is pretty clean. There's some slow NMI handlers here and there. Nothing obviously hardware [19:01:00] jynus: it seems suspiciously stable at the moment... [19:02:02] https://www.irccloud.com/pastebin/6N7E8C6h/ [19:02:14] I'm not seeing much else in the logs [23:12:26] 10DBA, 10Data-Services, 10Operations: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10colewhite) p:05Triageβ†’03Normal
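(The host-level check asked for near the end, hardware errors plus the ~15 GB memory correlation, is roughly the sketch below. The IPMI step applies to the hypervisor rather than the clouddb1001 guest, and assumes the usual tooling is installed.)

```bash
# on the hypervisor: hardware complaints in kernel and IPMI logs
dmesg -T | grep -iE 'mce|hardware error|ecc|i/o error' | tail -n 50
sudo ipmitool sel elist | tail -n 20      # assumes ipmitool is present
# in the guest: how close is mysqld to the ~15 GB mark noted above?
free -h
sudo mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;"
```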