[05:37:13] 10Blocked-on-schema-change, 10DBA: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) p:05Triage→03Medium [05:38:48] 10Blocked-on-schema-change, 10DBA: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) This should wait for: T250071#6083365 so at least we are consistent on that front and don't mix too many alters on the same table which can lead... [05:44:29] 10DBA: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) So it looks like `innodb_purge_threads = 10` is the one causing issues here. The host had all the threads up-to-date so I stopped MySQL and restarted it with that flag and...: ` Apr 28 05:41:34 labsdb1011 mysqld... [05:46:02] 10DBA: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) I am going to update the MariaDB bug and reclone this host. Positions in labsdb1012: https://phabricator.wikimedia.org/P11055 [05:56:47] 10DBA: FlaggedRevs has lots of database drifts but only in s1 and s5 - https://phabricator.wikimedia.org/T251191 (10Marostegui) p:05Triage→03Medium @Ladsgroup is this how the table is supposed to look? (taken from `grwikimedia` wiki which was created a couple of weeks ago): ` CREATE TABLE `ipblocks` ( `ipb... [06:05:19] 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui) [06:12:28] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) [06:36:41] 10DBA, 10MediaWiki-User-management, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (10Marostegui) >>! In T250071#6051598, @Marostegui wro... [07:01:09] 10DBA: FlaggedRevs has lots of database drifts but only in s1 and s5 - https://phabricator.wikimedia.org/T251191 (10jcrespo) Be careful with these changes, I remember at least 1 table that was maintained in production 1 way, and in code another, and if it was changed in production, it would create bad queries. I... [07:01:43] "Last dump for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-04-28 00:00:02 is 158 GB, but previous one was 183 GB, a change of 13.6%" [07:02:03] The wb_terms deletion most likely [07:07:09] yep [07:07:22] that's good, it means it works :-D [07:07:50] yeah, eqiad alert will also happen "soon" as I am deleting it from the backups host later today once it's finished the already running backup :) [07:08:12] on a further note, we should compare table existence and sizes [07:08:22] but that needs live inventory [07:08:28] so for a later time [07:11:59] I will ack that alert for 1 week [07:28:37] 10DBA: FlaggedRevs has lots of database drifts but only in s1 and s5 - https://phabricator.wikimedia.org/T251191 (10DannyS712) >>! In T251191#6087679, @Marostegui wrote: > @Ladsgroup is this how the table is supposed to look? (taken from `grwikimedia` wiki which was created a couple of weeks ago): > ` > CREATE T...
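For context on the `innodb_purge_threads = 10` experiment above: a minimal sketch of how one might verify and change the setting on a MariaDB host. The value shown and the assumption that the variable is read-only at runtime on 10.4 are illustrative and worth checking against the exact server version; these are not the commands from the task.
`
# Check what the running server is using (sketch only):
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads'"
# Assuming the variable cannot be changed at runtime on 10.4, adjusting it
# means editing my.cnf and restarting mysqld:
#   [mysqld]
#   innodb_purge_threads = 4   # hypothetical value, not the one from the task
systemctl restart mariadb
`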
[07:32:57] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1105.eqiad.wmnet'] ` The log can be found in `/var/log/wmf... [07:36:04] 10DBA: FlaggedRevs has lots of database drifts but only in s1 and s5 - https://phabricator.wikimedia.org/T251191 (10Marostegui) Indeed! Going to move it, thank you! [07:36:22] 10Blocked-on-schema-change, 10DBA: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) @Ladsgroup is this how the table is supposed to look? (taken from `grwikimedia` wiki which was created a couple of weeks ago): ` CREATE TABLE `i... [07:47:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/592872 [07:49:33] done [07:53:33] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1105.eqiad.wmnet'] ` and were **ALL** successful. [07:58:12] going for db2102 reimage, as I need a testing host to do stuff [08:02:40] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['db2102.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto... [08:30:54] 10.4 upgrade is very smooth on core hosts [08:31:03] had not a single issue [08:31:08] very cool, thanks marostegui [08:32:49] jynus: you'd still need to restart the exporter :( [08:32:57] also the machine learning algorithm I implemented on puppet seems to have worked fine [08:32:57] due to that bug [08:34:04] jynus: another thing I have observed is that you'd need to drop+add the host in tendril, otherwise it looks like the tendril events get a bit stuck [08:34:18] I see [08:34:31] did you add that to the issues on the page? [08:34:33] I haven't had the time to look into that yet [08:34:44] nope, if you have the time, could you? [08:34:48] yes [08:35:15] thanks <3 [08:35:22] https://tendril.wikimedia.org/host/view/db2102.codfw.wmnet/3306 [08:35:28] ^however, I see it working fine [08:35:38] no, check the "Act" column [08:35:42] It won't ever recover [08:35:47] that's the only issue I have seen so far [08:35:47] ah, I see now [08:44:38] there is one issue, which is ssacli [08:44:46] didn't that get disabled? [08:45:19] that got the name changed in buster [08:45:20] I think [08:45:46] yes, but I think filippo said the best way was to disable that functionality, and that got deployed [08:46:22] ah, you mean with partman? [08:46:29] sorry, smard [08:46:32] smartd? [08:46:42] no [08:46:48] the hp raid stuff [08:46:57] or maybe it was that [08:47:03] cannot remember [08:47:05] there were two RAID issues, the tool and the smartd related thing [08:48:02] https://phabricator.wikimedia.org/T220787 [08:48:19] This was one https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/581617/ [08:48:39] ah, yes, that [08:48:48] so it was deployed [08:50:09] but then there is the second issue, why is ssacli not installed?
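The exporter restart mentioned above could look roughly like this; the unit name and the metrics port are assumptions based on the stock prometheus-mysqld-exporter packaging, not confirmed from the log, and multi-instance hosts may use per-port unit names.
`
# Restart the metrics exporter after the MariaDB upgrade (unit name assumed):
systemctl restart prometheus-mysqld-exporter
# Confirm it is serving metrics again (9104 is the exporter's stock port):
curl -s http://localhost:9104/metrics | grep -c '^mysql_'
`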
[08:50:28] I think that rename changed many things, and moritzm and I worked on that on buster a few months ago, but I cannot even remember what the name is now [08:50:35] E: Unable to locate package ssacli [08:51:00] I will dig [08:51:07] you helped with the smart stuff [08:51:16] yeah, it had a crazy rename of sorts [08:51:24] yeah, they did another rename for buster [08:51:28] but I cannot even remember now what it was [08:51:35] but I remember we worked on it recently [08:51:37] (and fixed) [08:51:43] I will find it, don't worry [09:01:27] for some reason it is importing raid::ssacli instead of raid::hpsa [09:02:45] I believe that was the new name [09:03:01] it /may/ have changed again [09:03:03] :-D [09:03:05] :-( [09:03:07] marostegui: you actually can drop the labs part tomorrow, the announcement was done a million years ago [09:03:24] Amir1: cool, I might do it even next monday, just in case [09:03:29] I have so many open fronts now... [09:05:38] I can imagine [09:06:30] jynus: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/577292/ I think that was the last name [09:06:47] jynus: however I cannot see any of those being installed on db2102 [09:06:51] I see [09:06:54] I think it is the facts [09:06:55] is it a gen10? [09:08:19] HP ProLiant DL360 Gen10 [09:08:57] I think facter is trying to detect it based on device id [09:09:07] and it may not detect it properly [09:09:12] I remember we had the issue on a gen9 [09:09:16] if it is only that, it should be easy [09:09:17] (db1078 if I recall correctly even) [09:09:33] if it changed everything again, it will be harder [09:09:47] what was the fix for db1078? [09:10:30] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/577292/ that I believe [09:13:42] I see [09:14:32] https://phabricator.wikimedia.org/P11062 [09:15:01] it must be a new controller [09:15:12] so probably an easy fix [09:15:39] yeah, maybe it is a new controller indeed [09:15:43] for the gen10 [09:16:44] 9005028f is detected on db2102 but not on db1078, triggering the difference [09:32:56] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2124.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202004280932... [09:48:15] I got it [09:48:31] the detection is working correctly, but the package is missing in buster [09:48:36] the other host worked [09:48:42] because it used an older controller [09:48:45] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2102.codfw.wmnet'] ` and were **ALL** successful. [09:48:54] kormat: ^ \o/ [09:49:00] ah no, that's jaime's [09:49:01] sorry [09:49:01] XD [09:49:04] he he [09:49:04] haha [09:49:10] I was like: that was fast [09:49:17] I fought quite a lot eh! [09:49:36] jynus: so the issue is the new controller doesn't match in facter or what? [09:49:47] no, the detection is ok [09:49:53] let me double check it [09:50:02] but the package is missing from buster [09:50:49] yeah, it works [09:52:16] Weird, I thought moritzm added it to the repo [09:53:06] moritzm: I need the latest ssacli from hp into buster, can I just upload the one at http://downloads.linux.hpe.com/SDR/repo/mcp/Debian/pool/non-free/ into our third-party repo?
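Since the debugging above hinges on whether facter recognises the Gen10 controller by its PCI id (the 9005028f mentioned), one way to inspect both sides might be the following sketch; the `raid` fact name is an assumption based on the puppet raid module referenced in the patches, and the grep pattern is illustrative.
`
# What the hardware actually reports (PCI vendor:device ids such as 9005:028f):
lspci -nn | grep -Ei 'raid|smart array'
# What facter resolved it to (assumes a custom 'raid' fact from the puppet module):
facter -p raid
`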
[09:53:34] I think I updated the repo sync definitions, I'll check in a bit [09:53:50] I could not find it, check db2102 [09:54:02] as an example [09:54:19] but no rush ofc [09:57:07] I can see modules/aptrepo/files/updates right [09:58:09] but I only see after apt update "hpssacli/buster-wikimedia 2.40-13.0 amd64", which is not what I want [09:58:49] vs ssacli on stretch [10:00:00] marostegui: it is only the new controller that requires the new cli, which was not on buster yet [10:00:26] I got confused because the old one was installable, but didn't find the controller [10:00:41] so I thought it was a driver issue [10:00:48] back in March I added ssaducli to the package list and the update sync definitions are also correct, "reprepro --component thirdparty/hwraid checkupdate buster-wikimedia" shows that it will add ssacli and ssaducli [10:01:03] moritzm: I believe you [10:01:11] but it doesn't show on apt :-D [10:01:20] it does show on apt on stretch, though [10:01:31] but the actual sync is currently failing as something broke with the jenkins repo (and for some reason it tries to reach that as well) [10:01:41] having a look [10:01:41] it is ok, no worries [10:01:59] I mention this because I got very confused [10:02:06] but knowing it is that, no issue [10:03:16] the super-clear-and-compatible HP naming scheme doesn't help [10:03:23] ssacli now imported, seems the jenkins repo had a short hiccup, worked on a second attempt [10:03:29] thanks, moritzm [10:03:32] will try [10:03:32] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2124.codfw.wmnet'] ` and were **ALL** successful. [10:03:48] I see, another case of https://www.youtube.com/watch?v=8fnfeuoh4s8 [10:04:39] I can see it now [10:06:12] I had https://www.youtube.com/watch?v=8fnfeuoh4s8 yesterday when I thought "adding DHCP to automated service restarts should be quick" and after several rabbit holes I ended up with https://phabricator.wikimedia.org/T251112 :-) [10:06:26] if I can add a nitpick [10:07:01] recommended version on buster for hp is 4.15 but 3.30 is available [10:07:19] I don't think it is a huge issue though :-D [10:08:24] I am sure they'll change the name again soon anyways yeah [10:08:31] they've changed it like 3 times already [10:08:43] let me check when I set up the initial hwraid components, there was no buster repo at HPE (so buster-wikimedia currently imports the HPE stretch debs), maybe that changed in the meantime [10:08:50] I was commenting more for the "importing buster hp packages into our buster" case etc [10:09:12] aka our chaintool [10:09:18] oh, in fact they did, I'll update the sync definitions [10:10:40] if you teach me how (CC me on the patch), next time I will be able to not bother you :-D [10:11:22] and just send you a CR :-D [10:18:30] already done :-) https://gerrit.wikimedia.org/r/592900 [10:19:48] cool, that way I know what to do for bullseye! [10:30:48] reverting this, FYI, I know you are doing reimages: https://gerrit.wikimedia.org/r/c/operations/puppet/+/592903 [10:33:25] we are done with the reimages [10:33:44] ok [10:49:37] 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview, 10Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10jcrespo) ` root@cumin1001:~/codereview_exports$ mysql.py -BN -h db1123 -e "select table_name from information_schema.tables WHERE table_schema...
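The checkupdate invocation is quoted verbatim above; a sketch of the full import-and-verify loop around it might look like this. The `update` form mirroring checkupdate and the client-side check are assumptions, not commands taken from the log.
`
# On the repo host: dry-run the sync, then actually import the packages:
reprepro --component thirdparty/hwraid checkupdate buster-wikimedia
reprepro --component thirdparty/hwraid update buster-wikimedia
# On a buster client such as db2102, verify the package is now visible:
apt update && apt-cache policy ssacli
`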
[10:49:50] 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview, 10Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10jcrespo) p:05Triage→03Medium [10:58:47] 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview, 10Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10jcrespo) @ArielGlenn @Bstorm ^ give me a server and a path and I will put it here. I can also create a tarball if preferred. [11:41:58] 10DBA, 10MediaWiki-User-management, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (10Marostegui) >>! In T250071#6087705, @Marostegui wro... [11:46:08] 10DBA, 10MediaWiki-User-management, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (10Marostegui) 05Open→03Resolved s8 done: ` CREATE... [11:47:12] 10Blocked-on-schema-change, 10DBA: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) s3 and s8 are now done from T250071#6083365 so this can go ahead [11:50:29] 10Blocked-on-schema-change, 10DBA: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) [12:10:37] -rw-rw---- 1 mysql mysql 800M Apr 28 11:56 ./ruwikiquote/ipblocks.ibd [12:10:48] I am surprised by such a big table on that wiki hah [12:30:04] There is a mail on cloud-l "[Cloud] Replica DB timeouts" that is probably relevant to you [12:31:26] I just replied [12:34:19] "Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-04-28 00:00:01 is 189 GB, but previous one was 163 GB, a change of 15.9%" [12:34:36] this is because they are so small still [12:35:00] I think I will ack them for a week, as it will get less than 15% by then [12:35:23] no, that is because I dropped a column on the img_table and probably compacted the table too [12:35:27] they grow linearly, not exponentially [12:35:36] this is es2, no metadata :-D [12:35:41] *es4/5 [12:35:54] hopefully you didn't drop anything there :-D [12:35:57] ah sorry, missed the es [12:36:25] it is a 15-20% growth now, but will be 10% next week probably [12:36:47] 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui) [12:36:48] +25G/week/cluster [12:37:41] 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui) s4 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1004 [] db1138 [] db1125 [] db1121 [] db1103 [] db1102 [] db1097 [] db1091 [] db1084 [x] db1081 [13:09:29] marostegui: thanks for the quick reply to the cloud user [13:10:03] how's labsdb1011 going BTW? [13:10:21] almost done [13:10:47] jynus: did you see the update? https://phabricator.wikimedia.org/T249188#6087671 [13:10:50] transferring or something else? [13:10:59] The transfer I started in the morning [13:11:11] I hadn't seen it [13:11:28] was it that recent or from yesterday? [13:11:35] what? [13:11:40] the log [13:11:49] that's today's log [13:11:54] as in, after a new transfer?
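The dump-size alerts quoted above ("a change of 13.6%", "15.9%") are just the relative change against the previous run. A one-liner reproducing the es4 figure from the rounded GB values; the small difference from the alert's 15.9% presumably comes from it computing on byte-level sizes rather than whole GB.
`
# (189 - 163) / 163 ~= 15.95%, matching the es4 alert:
awk 'BEGIN { prev = 163; cur = 189; printf "%.2f%%\n", (cur - prev) / prev * 100 }'
`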
[13:11:59] Yes, with the new data [13:12:01] With 0 lag [13:12:04] uf [13:12:11] I have updated mariadb's bug with that info [13:12:19] as it looks like innodb_purge_threads is doing something nasty there [13:12:24] maybe it is because of multisource? [13:12:32] arturo: yw! [13:12:33] shouldn't matter [13:12:44] so this is the 3rd transfer already? [13:12:50] no, the 2nd [13:12:54] ah, ok [13:13:02] and hopefully the last :p [13:13:21] because I was about to say, hey, we agreed to revert! [13:13:25] :-D [13:13:40] but that is a bad bug [13:13:41] yeah, I tested it manually, better to see if that was the issue or what [13:13:43] yep [13:13:50] we'll see what they say [13:14:02] elena assigned it to a dev already, so we'll see [13:14:04] was that after _upgrade? [13:14:13] so they don't have an excuse? [13:14:18] yep [13:14:41] lots of 10.4 issues [13:14:42] and the host was also doing absolutely nothing, nothing to catch up on, nothing to purge... [13:14:53] this bug apparently was from 10.1 [13:14:56] or a similar one at least [13:15:17] lots of bugs after 10.1 [13:15:34] I mean, bugs are normal, but them impacting us not so much [13:16:00] so far I have filed 3 for 10.4 I think [13:16:09] 2 of them are supposed to be fixed in the next release [13:16:14] so that's good [13:16:22] maybe we should upgrade our backup source logically [13:16:37] when we are ready for that [13:16:37] maybe, yeah [13:16:54] we also need more production testing [13:17:02] yeah, I am getting more hosts on 10.4 [13:17:15] the good part is that they definitely perform better CPU-wise with compression [13:17:23] it always hits us on non-core hosts [13:17:29] toolsdb, wikireplicas, etc. [13:17:41] wikireplicas are quite out of the norm hosts XD [13:17:58] yeah, but they are just more "dense" [13:18:07] nothing out of the ordinary [13:18:17] mediawiki eventually will start using new features too [13:18:26] well, they have very intense multi-source replication, lots of small databases, lots of views... [13:19:08] In theory we could enable gtid already...but I am not going to take that risk [13:19:13] nope [13:19:28] labsdbs are 1G, right? [13:19:45] maybe if we upgrade to 10G hosts, recovery would be much faster [13:19:49] I did test their fix and it worked, but I am not too confident about issuing that delete gtid_domain_id=0 command on each master [13:19:56] I think they are 1G yeah [13:20:04] plus separating them into smaller parts [13:20:10] if possible [13:20:15] we definitely need to go multi-instance there [13:20:50] remember when they used to be tokudb, they got corrupted every 3 months and had to be reloaded and sanitized logically? [13:21:10] I remember when we had to put labsdb1001 and 1003 to idempotent [13:21:11] :( [13:21:17] yeah, that too [13:21:29] rows drifting and we had new tickets every week [13:21:30] that was sad [13:21:33] yeah [13:41:49] 10DBA: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) labsdb1011 is back up and catching up. I will repool it tomorrow after making sure it is stable. [14:12:20] o/ DBAs :D I'm just about to have another quick look at https://phabricator.wikimedia.org/T246415 regarding splitting the "groups" that we use for wikibase queries a bit. This might be client vs repo, and perhaps also term storage related queries. [14:12:45] Am I correct in thinking right now it is not possible to tell how many read queries hit s8 from the various different mediawiki sites?
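The "delete gtid_domain_id=0 command" weighed above maps to MariaDB's DELETE_DOMAIN_ID clause on FLUSH BINARY LOGS (available since 10.1.24). A hedged sketch, to be run on each master only once nothing still references domain 0, which is exactly the risk being discussed:
`
# Purge GTID domain 0 from the binlog state (only safe when truly unused):
mysql -e "FLUSH BINARY LOGS DELETE_DOMAIN_ID = (0)"
# Verify domain 0 no longer appears:
mysql -e "SELECT @@gtid_binlog_state"
`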
[14:13:16] I believe I asked that before but would want to write the answer in the ticket this time! :D [14:14:43] there are no per-domain logs or stats of db queries - those would be on the application side, dbs don't know of request URLs [14:15:00] ack! typ! [14:15:09] *ty [14:15:12] if they are there, they would be on mw logging/stats [14:15:16] we can know per-db stats [14:15:19] or per user stats [14:15:23] or per query pattern [14:15:40] domains are not sent down that low [14:16:03] per user stats, interesting, so if each wiki used a different user, then that would be visible at the db level [14:16:11] sure [14:16:20] addshore: yeah, but we only use "wikiuser" :( [14:16:34] or if there was some kind of comment debugging that, maybe [14:16:41] just for legacy reasons, or as an active choice? [14:17:00] it is not a good idea to separate users if all have the same stats [14:17:04] *grants [14:17:11] comment debugging is an interesting point, as for wikibase we could definitely include this detail alongside the method name being called [14:17:21] a wiki request to enwiki may call s8 dbs, s7, s4 and even more [14:18:09] also, you cannot log read-only queries - that would be a lot of overhead [14:18:17] but you can create in-memory counters [14:18:32] and report those every X time/requests, etc. [14:18:48] which I believe is what mw does on graphite/prometheus [14:19:10] you could log a sample, too [14:19:18] e.g. 1/1000 read queries [14:19:24] yup! I'll investigate doing some of this in the app then :) [14:20:03] for example, I know domain/url is done for errors [14:20:17] maybe on mwdebug there is that too [14:20:54] just don't try sending all queries to elastic, or you will end up with 500K logs per second [14:20:57] 0:-D [14:21:03] yup, I did that once by accident ;) [14:21:20] I think you should reach out to the performance team [14:21:25] they do this a lot [14:21:26] good idea! [14:21:46] and they may be able to - not help you directly, but point you to the same method they use for performance monitoring [14:25:49] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10Wikidata-Trailblazing-Exploration, and 2 others: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) I just checked with the DBAs to confirm that there is currently "no per-domain logs or stats of... [14:25:54] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10Wikidata-Trailblazing-Exploration, and 2 others: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) p:05Triage→03Medium [14:25:59] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10Wikidata-Trailblazing-Exploration, and 2 others: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) a:03Addshore [14:41:40] jynus: re: T215183 -- all sounds good, and certainly no rush on that. I don't expect to check in on the status for another 3 months or so [14:41:42] T215183: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 [14:43:50] cdanis: I know, but assuming it was a relatively safe process, better to make it good now [14:43:55] 👍 [14:44:03] especially for the backup hosts, which I hope to not touch in the next X years [14:44:12] touch as in "reimage fully" [14:44:45] the proxies are much safer as they are mostly stateless [14:44:53] but I will not touch them without manuel's ok [15:14:58] did labsdb1011 crash?
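On the "we can know per user stats" point above: MariaDB's userstat plugin exposes per-account counters, though as the conversation notes it would only separate wikis if they stopped sharing the single "wikiuser" account. This is a sketch of the mechanism, not a recommendation; the overhead of enabling it on production replicas is an assumption to verify first.
`
# Enable per-account accounting (adds some overhead; sketch only):
mysql -e "SET GLOBAL userstat = 1"
# Read activity broken down by account:
mysql -e "SELECT user, total_connections, rows_read, rows_sent
          FROM information_schema.user_statistics ORDER BY rows_read DESC"
`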
[15:18:15] the log is full of warnings, it is impossible to read [15:18:22] but yeah, it crashed [15:23:36] 10DBA: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10jcrespo) labsdb1011 crashed again at around 15:11 UTC. Lots of innodb asserts ongoing. [15:26:36] I think the mydumper way is doable - if that doesn't work, either mariadb or the server is beyond repair [15:28:53] either that, or a reimage back to stretch [15:32:59] log is complaining about mysql_upgrade, could it also be the major version skip? [15:35:18] but db2102 seems quite happy about it, so that shouldn't be it [16:13:27] I've left a screen ready to logically back up labsdb1012, but it won't work until the network hole is open (dbprov1001 doesn't have direct access to labsdb1012) [16:13:58] ^in case you want to use it [16:17:03] There is a screen on: dbprov1001 120518.backup_labsdb1011 [16:47:19] yes, it has crashed, jesus christ [16:47:22] maybe it is a hardware issue? [16:47:39] it must be [16:50:22] 10DBA: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) At this point I think this must be faulty hardware (storage) or something similar. The host had no reads or anything that would make it overloaded, just replicating to catch up. Of course the HW logs ar... [18:22:11] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) [20:36:14] was db1114 meant to be depooled in https://phabricator.wikimedia.org/P11039 ? [20:44:27] ping: marostegui, jynus ^^ [22:23:04] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Bstorm) Well, at the very least I can say that the firmware for the RAID controller all the disks are on is from when the server was purchased (2015 or so). There's loads of upgrades since then...
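A sketch of what the logical backup prepared in the dbprov1001 screen above could look like with mydumper; the user, thread count and output path are placeholders for illustration, not the actual job's parameters.
`
# Logical dump of labsdb1012 (placeholders for credentials and paths):
mydumper --host labsdb1012.eqiad.wmnet --user dump --password "$DUMP_PASS" \
         --threads 8 --compress --triggers --events --routines \
         --outputdir /srv/backups/labsdb1011-rebuild
`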