[06:07:47] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Majavah)
[07:34:50] 10DBA, 10Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (10jcrespo) > I think this ticket is done now. Were database backups requested/needed?
[07:47:12] big fail: https://gerrit.wikimedia.org/r/c/operations/puppet/+/609101
[07:50:02] jynus: ahh
[07:50:08] kormat: can you check what's the issue with db1077?
[07:50:34] T256939
[07:50:35] T256939: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939
[07:50:41] sure thing
[07:50:55] it is a test host, so not a big deal, but we should understand what is going on
[08:16:10] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Kormat) I'm at a loss here. There's nothing recent in the iLO log. This is the last entry: ` /system1/log1/record36 Targets Properties number=36 severity=Caution date=06/18/2020 time=...
[08:17:33] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10jcrespo) It says: Battery count: 0. Did the battery die, or did it come back after it failed (or was detected as failed)?
[08:20:28] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Kormat) The battery has been dead for a long time {T225391}, it's a [[ https://github.com/wikimedia/puppet/blob/4d037cb8debde241870cee57a7ecb39e2b718f25/hieradata/hosts/db1077.yaml#L2 | known issue ]].
[08:21:10] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10jcrespo) Apparently this is a known issue: T226519. Maybe then auto-icinga-ack should be disabled for this host?
[08:24:16] jynus: i wonder. icinga hasn't opened a task for db1077's raid since last june. maybe that service was downtimed for ~a year?
[08:25:49] maybe, manuel would know
[08:26:03] so maybe downtime for another year?
[08:38:39] hmm, no, i don't think that's it. we've reimaged db1077 a few times, and that nukes any old downtimes
[08:39:06] unless m.arostegui has been manually re-downtiming db1077:RAID every time
[08:39:18] could be
[08:39:44] feel free to solve it in any way you think is reasonable
[08:39:51] then report to manuel next week
[08:40:10] +1
[08:42:43] I think there is no cache in use for that host
[08:42:57] if that were mw production, we would have lots of errors due to performance
[08:43:06] but being a test host, we will live with it until it is replaced
[08:43:27] as the alert is acked, i think i can leave it as-is, and poke mar.ostegui about it next week
[08:43:28] manuel will know when it is scheduled, or maybe you can find in the budget when it is scheduled to be decommissioned, kormat
[08:43:29] yeah
[08:44:18] 10DBA, 10Operations, 10ops-eqiad, 10User-Kormat: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Kormat) a:03Marostegui
[09:05:15] 10DBA, 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) 05Open→03Resolved Thanks @jcrespo, thanks for helping. This is all set up and ready for the ticket to close, however this database will...
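A minimal shell sketch of how a controller's RAID/battery state is typically inspected on a host like db1077, assuming an HP Smart Array controller driven by hpssacli in slot 0 (the tool name and slot number are assumptions, not taken from the log above):

    # Overall controller, cache and battery status
    sudo hpssacli controller all show status
    # Detailed view, including the battery/capacitor count quoted above
    sudo hpssacli controller slot=0 show detail
    # Per-logical-drive RAID health
    sudo hpssacli controller slot=0 logicaldrive all show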
[09:09:51] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat)
[09:10:53] jynus: you asked on ^ if dbstore and general misc roles have been covered. if you can give me an example of a machine for each, i can triple-check
[09:11:32] just to make sure we're talking about the same things
[09:14:52] https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L679
[09:15:58] https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L574
[09:16:44] https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L652
[09:17:08] ok, cool. yes, i can confirm they are already covered by the previous CR(s). `sudo cumin "A:db-section-s2"` and `sudo cumin "A:db-section-m1"` contain those hosts as appropriate
[09:17:20] ok
[09:17:29] it wasn't part of the patch
[09:17:34] and it wasn't core
[09:17:40] so I asked about them
[09:17:56] yep, no problem, thanks for checking :)
[09:18:16] one thing that may be confusing
[09:18:41] you will see mentions of shard and slave in older code
[09:18:59] with very few exceptions those terms are not used anymore
[09:19:14] we use section (which may be shards or not) for replica groups
[09:19:16] shard => section. what's the replacement for 'slave'?
[09:19:18] and replica
[09:19:21] for slave
[09:19:24] ahh, ok.
[09:19:25] but because of dependencies
[09:19:33] it is difficult to change it all at the same time
[09:19:38] I mention it in case of new code
[09:19:39] yeah understood
[09:19:47] so it uses the latest terminology
[09:19:50] i'll file a task for the renaming of slave to replica
[09:19:53] to keep track of it
[09:19:56] I think mediawiki uses master
[09:20:01] but mysql now uses source
[09:20:07] we can discuss which one we prefer
[09:23:10] so pc1 and es1 are real shards, but s1 is not
[09:25:24] oh? how so?
[09:34:35] sX are not horizontal partitions https://en.wikipedia.org/wiki/Shard_(database_architecture) they are just tenants of a multi-tenant setup
[09:34:59] pcs and es* are real partitions
[09:35:09] ahh. right, yes
[09:35:11] by hash or id
[09:35:31] they are all replica sets, "groups of servers that hold the same data"
[09:35:52] but I think sections, which mw uses, are the simplest way to classify them
[09:36:36] sections for either sharding or multi-tenancy
[09:37:32] * kormat nods
[09:37:52] so I don't care too much about the actual name used
[09:38:06] but I would like to slowly unify towards a single name for better communication
[09:38:22] so it is not called one thing in one part of the code and another thing in another
[09:57:38] 10DBA: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y" - https://phabricator.wikimedia.org/T256951 (10jcrespo)
[09:58:13] 10DBA: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y" - https://phabricator.wikimedia.org/T256951 (10jcrespo)
[09:58:17] 10DBA, 10Google-Summer-of-Code (2020), 10Patch-For-Review: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (10jcrespo)
[09:59:57] 10DBA: transferpy package does not depend on python3-yaml - https://phabricator.wikimedia.org/T256604 (10jcrespo) 05Open→03Invalid transfer.py doesn't depend on python3-yaml, the wmf database backup system does. My fault.
[10:00:56] can I give you a suggestion regarding task creation, kormat?
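A minimal sketch of the coverage check described above, using the section aliases from the conversation. Running cumin with only a host selector resolves and prints the matching hosts, which is enough to confirm that the dbstore and misc hosts from site.pp are included (the trivial `hostname` command at the end is just an illustration):

    # List the hosts each section alias resolves to (no command is executed)
    sudo cumin "A:db-section-s2"
    sudo cumin "A:db-section-m1"
    # Optionally run a trivial command to confirm the selection behaves as expected
    sudo cumin "A:db-section-m1" "hostname"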
[10:03:07] please :)
[10:04:24] by experience, things like T256879 will age badly
[10:04:25] T256879: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879
[10:04:51] please describe properly what the planned work is, so future kormat knows what you meant months ago :-D
[10:05:07] reference existing code with urls, etc.
[10:05:26] this is important for you, but also for managers that will want to know what you are doing
[10:06:31] ack. the reason i didn't in that case was that i need to do some investigation into how it is used, and what _are_ the unused parameters :) but yeah, i should do that
[10:06:33] "There used to be this (link to code); after this (link to patch) it is no longer needed because (link to code) is being used"
[10:06:45] not worried about the particular ticket
[10:07:15] but I have been working on this with the GSoC student and wanted to mention it to you too
[10:07:49] let me tell you that I have done the same thing in the past
[10:08:10] but after having 500 tickets and not remembering what I meant, I regretted not doing it :-D
[10:08:56] understood
[10:09:36] I literally said to myself, wtf did I mean here with "fix"?
[10:09:44] :-D
[10:10:36] 10DBA: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y" - https://phabricator.wikimedia.org/T256951 (10Privacybatm) Parsing cumin output seems to be a better idea, let me check the output of cumin in this kind of case.
[10:11:06] did phab mysql just go down?
[11:45:27] kormat: not sure if you understood my comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/608874
[11:45:32] the patch itself is ok
[11:45:54] but there is no general explanation in the commit msg
[11:45:59] and no text on T256866
[11:45:59] T256866: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866
[11:46:13] I am asking you to commit some context (a couple of lines max)
[11:56:23] s8 on dbstore1005 got stopped, but I don't know why
[11:57:10] we need to know if maintenance is happening there
[11:59:19] uh, no, it crashed
[11:59:26] creating a ticket
[12:03:49] 10DBA, 10Analytics: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo)
[12:04:53] 10DBA, 10Analytics: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo) On replication start, the instance crashed again - probably there is data/fs corruption.
[12:07:39] 10DBA, 10Analytics: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo) Same issues as T249188?
[12:08:50] jynus: ahh, sorry. will do :)
[12:09:14] more worried about T256966 now :-(
[12:09:15] T256966: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966
[12:10:08] oh ouch
[12:10:57] I am going to lunch; poke me if you have time, and start remembering how to set up an instance
[12:27:34] 10DBA, 10Analytics: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) This host was reimaged to buster recently (2020-06-22) as part of T254870, and the symptoms do sound very like https://jira.mariadb.org/browse/MDEV-22373, with the significant difference that this...
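A hedged sketch of a first-pass check for a crashed per-section instance such as s8 on dbstore1005. The socket path appears elsewhere in this log; the systemd unit name mariadb@s8 and the grep patterns are assumptions:

    # Look for crash/corruption messages from the s8 instance (unit name is an assumption)
    sudo journalctl -u mariadb@s8 --since "2 hours ago" | grep -iE 'signal|assert|corrupt|crash'
    # Once the instance is back up, check its replication state on the per-section socket
    sudo mysql -S /run/mysqld/mysqld.s8.sock -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Error'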
[12:37:37] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat)
[13:01:38] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[13:03:31] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) p:05Triage→03Medium
[13:06:27] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[13:08:16] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[13:09:09] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[13:27:21] 10DBA, 10Operations, 10User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (10Kormat) p:05Triage→03Medium
[13:27:51] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[13:27:56] 10DBA, 10Operations, 10User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (10Kormat)
[13:41:28] s8 snapshots just finished
[13:41:42] should we try to recover it to dbstore1005?
[13:42:41] it's worth a shot
[13:43:41] jynus: i'm happy to take care of it
[13:43:47] as a learning experience
[13:48:37] sure, ping luca or otto
[13:48:44] transfer while mysql is running
[13:48:46] jynus: https://phabricator.wikimedia.org/P11726 is the proposed procedure
[13:49:04] and make sure you have their ok before putting the old instance down
[13:49:44] no, transfer first so there is no extensive downtime
[13:50:01] or get analytics' ok
[13:50:05] one of the 2
[13:50:33] huh, ok. as replication is stopped, i didn't think it was being used. will do!
[13:50:55] they are normally not sensitive to replication being down for a few minutes
[13:51:01] but better ask
[13:52:17] +1
[13:52:45] if you get their ok/awareness you can do it in any way you want :-D
[13:53:28] got a go-ahead from elukey
[13:53:32] does the rest of the procedure look ok?
[13:53:35] yep
[13:53:44] great, thanks :)
[13:53:49] dump the grants if you can
[13:53:56] before putting it down
[13:53:58] to have a copy
[13:54:05] they're not in the backup?
[13:54:15] ohh. it's analytics,
[13:54:16] the backup has core-like grants
[13:54:19] they could have different grants
[13:54:21] ok
[13:54:22] exactly
[13:54:27] we can load them from another instance
[13:54:28] how does one dump grants? :)
[13:54:30] but this way you can dump
[13:54:37] and just execute blindly
[13:55:01] run pt-show-grants with the right socket
[13:55:05] let me see
[13:55:47] pt-show-grants S=/run/mysqld/mysqld.s8.sock
[13:55:47] the output of `pt-show-grants -S /run/mysqld/mysqld.s8.sock` looks legit
[13:55:59] dump that to a file
[13:56:17] then load it with mysql -S /run/mysqld/mysqld.s8.sock < grants.file
[13:56:29] that I think is the easiest/most portable
[13:56:37] note we only have 10.1 backups so far
[13:56:45] so you will eat the upgrade
[13:56:54] updated procedure
[13:57:00] that's another reason to keep the original grants
[13:58:00] I think there is only 1 important user there, the one for researchers
[13:58:09] but it doesn't hurt to back up all of them :-D
[13:58:39] do you see why we need centralized grant management :-D
[13:59:12] i do, indeed :)
[14:00:06] so downtime is not important, otherwise it would be in a high-availability configuration
[14:00:23] but if the owner is aware they can answer users that get an error
[14:00:51] so we should try to communicate if there is going to be a downtime
[14:02:41] log when you put it down, too
[14:03:19] +1
[14:04:34] hah. i just ran into the transfer.py bug where it reports a dir doesn't exist when the hostname doesn't resolve
[14:04:53] heh
[14:04:57] there is a ticket for that
[14:05:19] T256951
[14:05:19] T256951: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y" - https://phabricator.wikimedia.org/T256951
[14:05:41] I will speed up review if the patch looks good
[14:07:49] ^ that's me - Ha ha, not following your own guide? XD
[14:08:01] i thought i had already done it. i was wrong. :)
[14:09:12] I don't want to be a bother, but I hope this example shows some of the decisions regarding things like "not automatically starting replication after a restart"
[14:09:35] those decisions can be challenged as we get better with monitoring/automation
[14:09:44] they are not set in stone
[14:09:50] * kormat nods
[14:10:34] 10DBA, 10Analytics: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) p:05Triage→03High a:03Kormat
[14:10:42] 10DBA, 10Analytics, 10User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat)
[14:30:38] 10DBA, 10Patch-For-Review: Use logging package instead of print statements in transferpy package - https://phabricator.wikimedia.org/T255999 (10jcrespo) Most of this is done, but let's keep this open, even with lower priority, to see if we can add some extra logging at a later time for new features.
[14:30:50] 10DBA, 10Patch-For-Review: Use logging package instead of print statements in transferpy package - https://phabricator.wikimedia.org/T255999 (10jcrespo) p:05Triage→03Low
[14:52:23] 10DBA, 10Operations, 10Performance-Team (Radar), 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672 (10Krinkle)
[14:52:48] I am super ignorant about setting up master/replica replication, so I started reading https://mariadb.com/kb/en/setting-up-replication/
[14:53:09] and added the status of meta/matomo dbs (the masters to replicate) in https://phabricator.wikimedia.org/T234826#6274823
[14:55:22] the first n00b questions are - do I need server_id set in both masters? (default seems 1, so possibly only on replicas?) and also, is binlog ROW ok or should I use something different?
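On the server_id and binlog questions above: every server in a replication topology, masters included, needs an explicitly set and unique server_id, and ROW binlog format is generally a safe choice. A minimal sketch of how the effective values can be checked; the socket path is reused from the s8 discussion purely as an illustration and does not refer to the matomo/meta hosts:

    # Check the replication-relevant settings on a running instance
    # (server_id must differ on every master and replica in the topology)
    sudo mysql -S /run/mysqld/mysqld.s8.sock -e "SELECT @@server_id, @@log_bin, @@binlog_format;"
    # The persistent values belong in the instance's my.cnf under [mysqld],
    # e.g. server_id=<unique id>, log_bin, binlog_format=ROW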
[15:39:04] sorry, was at a meeting
[15:39:35] we can take care of the details, but please be patient as with manuel away we may be busier than normal
[15:39:46] kormat: did the transfer work?
[15:40:43] 10DBA, 10Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (10Dzahn) @dpifke ^ Do you want backups?
[15:48:25] ah, it is transferring still
[15:48:42] not much pending
[15:58:29] transfer has finished
[16:02:07] let me know how it goes, if you are going to do the recovery now
[16:02:30] ERROR 1959 (OP000) at line 43: Invalid role specification `research_role`
[16:02:34] when i tried to load the grants
[16:03:00] ah, that was the role issue with labs, remember?
[16:03:16] that some roles disappeared on upgrade or something?
[16:03:29] not a clue
[16:03:51] do the rest of the upgrade process so as to start replication
[16:04:07] and we can check the grants when we confirm things work ok
[16:04:12] ok
[16:04:41] can you tell me where you had kept the grant file, so I can check?
[16:04:56] /root/s8.grant on dbstore1005
[16:06:27] note the difference between sudo -s and sudo -i (effective uid)
[16:06:46] (it was on /home/kormat)
[16:07:30] nothing that cannot be fixed later
[16:07:39] let's hope the recovery works first
[16:07:40] ah, oops, right :)
[16:07:45] replication started
[16:08:08] Got fatal error 1236 from master when reading data from binary log
[16:08:54] are you sure you did the right change master?
[16:08:54] oh, crud. i used the wrong master
[16:09:08] db1109
[16:09:40] reset slave all; to be sure after stop
[16:09:48] ah
[16:09:52] i did stop, change master, start
[16:09:59] if it worked, ok
[16:10:06] yeah it looks ok
[16:10:06] I think it didn't replicate anything
[16:10:10] phew
[16:10:11] so it should work
[16:10:23] if it had, we would have needed to delete the binlogs and relay logs with that command
[16:10:35] it's ok, don't worry
[16:10:44] gotcha
[16:10:46] we configured the gtids precisely to avoid those issues
[16:11:09] replication looks good
[16:11:28] let's see that grant issue
[16:11:45] did you try to apply the grants again and see if it works?
[16:12:29] i can try again
[16:12:40] same error
[16:13:02] interesting, I see the same grants for the user
[16:15:18] it could be a limitation of pt-show-grants and mariadb roles
[16:15:53] I am going to use create role on s8
[16:16:06] do not touch it now, ok?
[16:16:25] +1
[16:16:45] yeah, I think it fails to handle roles
[16:18:34] I think it worked now that I did CREATE ROLE
[16:18:39] run the file again?
[16:18:46] although it may stop at the first error
[16:19:04] so we may want to run everything after line 43
[16:19:04] no error this time
[16:19:32] now the ultimate test: diff file <(pt-show-grants ...) :-D
[16:19:42] so this is new info to me
[16:19:48] pt-show-grants doesn't work for roles
[16:20:05] and we really don't have an alternative
[16:20:15] especially in an independent format
[16:20:23] there's a bunch of differences
[16:20:36] send me the oneliner
[16:20:39] so I can see them
[16:20:57] if it is extra stuff from the backups it is not a big deal
[16:21:03] or maybe it is an ordering issue
[16:21:05] or both
[16:21:06] `diff -u ~kormat/s8.grants ~kormat/s8.grants.post-recover`
[16:21:47] yeah, it is all extra stuff from the backup
[16:22:01] we can put a note to clean up for tomorrow
[16:22:11] but everything that should be there is there
[16:22:27] so I think the immediate issue is solved, thanks!
[16:22:32] unless it breaks tomorrow again...
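A shell recap of the grants workaround described above. The role name, socket, and file paths come from the log; the sequence is a reconstruction of what was discussed rather than a canonical procedure, and the CREATE ROLE step is needed because the pt-show-grants dump referenced a role that did not exist on the freshly restored instance:

    # Create the missing role first (name taken from the ERROR 1959 message above)
    sudo mysql -S /run/mysqld/mysqld.s8.sock -e "CREATE ROLE research_role;"
    # Re-apply the dumped grants, then compare the live grants against the dump
    sudo mysql -S /run/mysqld/mysqld.s8.sock < ~kormat/s8.grants
    sudo pt-show-grants -S /run/mysqld/mysqld.s8.sock > ~kormat/s8.grants.post-recover
    diff -u ~kormat/s8.grants ~kormat/s8.grants.post-recover   # extra entries from the backup are expected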
[16:22:46] let's keep the ticket open and monitor the state tomorrow
[16:22:57] good work, kormat
[16:23:35] thanks for the help :)
[16:23:45] is this your first large outage solved on your own?
[16:24:05] I mean at wmf, of course
[16:24:13] yep
[16:24:21] good work, kormat
[16:24:26] thanks :)
[16:25:43] update the ticket, and take a well-deserved rest/beer
[16:26:16] more will come :-D
[16:31:06] 10DBA, 10Analytics, 10User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) Data restored from backup, machine is now catching up on s8 replication. There are some extra grants from the backup that should be cleaned up, but otherwise things are in a good...
[21:16:35] 10DBA, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Patch-For-Review, 10User-Marostegui: DBA review for push notifications tables - https://phabricator.wikimedia.org/T246716 (10Mholloway) This is ready for DBA review. To elaborate on what we're doing: We plan to create two new...
[22:06:22] 10DBA, 10Gerrit, 10Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (10Dzahn) db_pass was removed from private hieradata, from private passwords module, from labs/private....
[22:14:36] 10DBA, 10Gerrit, 10Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (10QChris) I went over the possible scenarios with @dzahn. How long do we keep DB backups? If we can...
[22:40:03] 10DBA, 10Gerrit, 10Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (10Dzahn) >>! In T255715#6275884, @QChris wrote: > Does removing them need sign-off from releng? cc: @...
[23:32:36] 10DBA, 10Core Platform Team, 10MediaWiki-Page-derived-data, 10TechCom-RFC, and 2 others: RFC: Normalize MediaWiki link tables - https://phabricator.wikimedia.org/T222224 (10Krinkle) p:05Triage→03Medium
[23:33:18] 10DBA, 10Core Platform Team, 10MediaWiki-Page-derived-data, 10TechCom-RFC, and 2 others: RFC: Normalize MediaWiki link tables - https://phabricator.wikimedia.org/T222224 (10Krinkle)