[04:53:05] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10jcrespo) > whether that needs changing on the desired thresholds is a different discussion. The director of SRE was the person who decided that at the time becaus...
[05:02:46] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Joe) >>! In T233534#5518243, @jcrespo wrote: >> whether that needs changing on the desired thresholds is a different discussion. > > The director of SRE was the p...
[05:39:53] 10DBA, 10Operations, 10Patch-For-Review: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) This was done successfully. read only start: 05:10:14 UTC AM read only stop: 05:13:08 UTC AM total read only time: 2 minutes 54 s...
[05:39:59] 10DBA, 10Operations, 10Patch-For-Review: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) 05Open→03Resolved
[05:50:08] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) >>! In T233534#5517306, @Krenair wrote: > I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed th...
[05:54:46] 10DBA, 10Icinga, 10observability: Make primary DB masters page on HOST DOWN alert - https://phabricator.wikimedia.org/T233684 (10Marostegui)
[05:55:04] 10DBA, 10Icinga, 10observability, 10Wikimedia-Incident: Make primary DB masters page on HOST DOWN alert - https://phabricator.wikimedia.org/T233684 (10Marostegui) p:05Triage→03Normal
[06:02:26] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Joe) >>! In T233534#5518359, @Marostegui wrote: >>>! In T233534#5517306, @Krenair wrote: >> I'm wondering if an entry should be added under "Where did we get lucky...
[06:06:45] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) >>! In T233534#5518388, @Joe wrote: >>>! In T233534#5518359, @Marostegui wrote: >>>>! In T233534#5517306, @Krenair wrote: >>> I'm wondering if an entry...
[06:23:49] 10DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (10Marostegui)
[06:24:30] 10DBA, 10Operations: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) a:03Marostegui
[08:08:39] 10DBA, 10Icinga, 10observability, 10Wikimedia-Incident: Make primary DB masters page on HOST DOWN alert - https://phabricator.wikimedia.org/T233684 (10Krenair)
[08:08:41] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Krenair)
[08:28:23] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[08:31:49] I was going to bring down db1114
[08:32:11] go for it!
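The read-only window reported in the s3 switchover update above can be cross-checked from the two timestamps; a minimal shell sketch, using GNU date and the times copied from the task comment:

    # Quick cross-check of the reported read-only window for the s3 switchover.
    start=$(date -u -d '05:10:14' +%s)
    stop=$(date -u -d '05:13:08' +%s)
    echo "read-only for $((stop - start)) seconds"   # 174 s = 2 minutes 54 seconds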
[08:32:21] but saw this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/538837
[08:33:15] but db1114 has notifications disabled
[08:33:32] I see
[08:33:45] I still don't think "MariaDB Slave IO: test-s1 #page" is ok
[08:34:21] but I guess it is not really necessary
[08:35:13] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[08:35:24] I am going to stop mariadb there anyway
[08:37:33] jynus: o/ - do you have time this afternoon to chat about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538045/ ?
[08:37:43] sure
[08:40:06] thanks :)
[08:42:20] what do you want to do with es1019, is it worth to power drain again?
[08:43:04] I guess so
[08:43:05] I guess one more last time...
[08:43:09] Yeah
[08:43:23] Could you create a ticket for it? Or reopen the existing one?
[08:43:26] So I can take care of it?
[08:43:30] I am more worried about the others
[08:43:42] yeah, I can copy and paste one ticket
[08:43:50] :-)
[08:44:00] check the last email from daniel though
[08:44:05] many hosts were already done last night
[08:44:14] oh, I didn't see the latest updates
[08:44:38] it is also difficult to track what were missing
[08:44:40] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[08:52:36] 10DBA, 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10jcrespo)
[08:52:42] filed
[08:52:52] thank you!
[08:53:23] 10DBA, 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui)
[08:53:46] 10DBA, 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) p:05Triage→03Normal
[08:54:23] 10DBA, 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) @Cmjohnson or @Jclark-ctr let me know when it is a moment to power drain this host and I will have it ready (aka I will depool it)
[09:15:36] 10DBA: Drop frwiki.archive_save table - https://phabricator.wikimedia.org/T233187 (10Marostegui) 05Open→03Resolved This table has been dropped.
[09:15:41] 10DBA, 10Epic, 10Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui)
[09:16:34] marostegui: T233701
[09:16:35] T233701: No grafana dashboard with working disk writes and reads in bytes - https://phabricator.wikimedia.org/T233701
[09:17:42] interesting!
[09:17:43] subscribed!
[09:18:14] Did you check other DBs?
[09:18:19] The ones on that example are decommissioned
[09:18:37] I checked on several disks outsid of dbs
[09:18:42] ah cool
[09:20:04] actually, now that I see, I just discovered one working, which is the MySQL dashboard
[09:20:21] the per host one, right?
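For IPMI problems like the es1019 one filed above, the usual first step is to probe the management interface remotely; a minimal sketch, assuming the standard ipmitool client and that the management hostname follows the usual <host>.mgmt.eqiad.wmnet pattern (hostname and user are illustrative):

    # Probe the BMC over the network; a timeout or auth error here matches the "unresponsive" symptom.
    # -E takes the password from the IPMI_PASSWORD environment variable instead of the command line.
    ipmitool -I lanplus -H es1019.mgmt.eqiad.wmnet -U root -E chassis power status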
[09:20:52] on that one we have disk latency, iops and throughput as far as I remember
[09:23:30] yes
[09:55:51] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[09:58:52] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[10:05:19] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[10:24:52] 10Blocked-on-schema-change, 10GlobalBlocking: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 (10Marostegui)
[10:29:49] <_joe_> is it normal that a sizeable number of deadlocks happen on wikidata?
[10:29:53] <_joe_> Function: Wikibase\Lib\Store\Sql\Terms\{closure}
[10:29:55] <_joe_> Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.48.172)
[10:30:02] I assume it is the migration script from Amir1
[10:30:03] <_joe_> I see quite a few of them
[10:30:21] <_joe_> the error comes from a random server though
[10:30:32] <_joe_> but is probably caused by amir's script you mean?
[10:30:35] yep
[10:30:41] <_joe_> ok, makes sense
[10:30:41] it has been running for a couple of weeks now
[10:30:46] and it is expected to run for a few more
[10:30:59] <_joe_> sorry my senses are heightened now as we've fully migrated api to php7
[10:31:11] totally understandable! :)
[12:15:19] 10DBA, 10Operations, 10ops-eqiad: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) The ticket was created with Dell. I am waiting on their approval and then for the Dell tech to coordinate a day/time to swap the board out
[12:16:20] 10DBA, 10Operations, 10ops-eqiad: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) Excellent! Thank you!
[12:31:10] 10DBA, 10Operations: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Trizek-WMF)
[12:31:13] 10DBA, 10Operations: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Trizek-WMF)
[12:31:16] 10DBA, 10Operations: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (10Trizek-WMF)
[12:31:19] 10DBA, 10Operations, 10Patch-For-Review: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Trizek-WMF)
[12:49:26] are you doing any recovering or provisioning at the moment?
[12:56:16] not at the moment no
[12:56:31] why?
[12:56:56] I am seing some dbprov traffic that I don't know where it comes from
[12:57:19] and I don't think backups run at this time, so maybe you were doing some restores or something
[12:57:24] I will keep checking
[12:57:31] nope, not doing anything regarding backups
[12:57:41] there is also nothing in-progress on the backups table from what I can see
[12:58:04] I will research it, I just wanted to discard it wasn't you
[13:23:42] jynus: o/ - do you have some time for backups or better later on?
[13:23:53] now better
[13:24:02] ack :)
[13:24:41] so?
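For deadlock bursts like the wikidata ones _joe_ reports above, InnoDB keeps the most recent deadlock in its status output; a minimal sketch for pulling just that section on the affected host (the sed boundaries assume the usual section headers and may need adjusting per server version):

    # Show only the latest detected deadlock from the InnoDB monitor output.
    sudo mysql -e "SHOW ENGINE INNODB STATUS\G" \
      | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS$/p'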
[13:25:05] well I am not sure what are the next steps for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538045/
[13:25:50] I asked some questions
[13:26:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/538045#message-2ba73f555861dbde131e31a4c4c1cbf38ade5fb9
[13:26:47] maybe I missed the answer somewere else?
[13:27:53] the link that you provided doesn't lead me to the questions, but I am reading the correspondence now
[13:28:20] we don't need codfw redundancy and if possible we'd like not to have it self hosted (so to use what SRE provides basically)
[13:28:27] if it is still ok for you of course
[13:28:49] sure, it is just that I didn't know if that was what you wanted
[13:28:56] I thought that the "Let's talk tomorrow on IRC, and once agreed and deployed we can do a manual test run." was the next step
[13:29:09] nono it is super good
[13:29:21] yeah, let's talk about you answering those doubts I have
[13:29:23] :-D
[13:29:45] :)
[13:29:48] the other implied question was size
[13:29:59] if it need archival
[13:30:13] and add the sections to monitoring
[13:30:51] I am a bit ignorant about the "archival" part, not sure exactly what you mean
[13:30:57] but if it is on the wikipage I'll read it
[13:31:04] sure, that is why I am here
[13:31:15] and why I suggested IRC for interactive chat
[13:31:52] can you show me the database you want to backup?
[13:32:12] so the "piwik" database on matomo1001
[13:32:35] there is a mysql instance on port 3306 (should be accessible via unix path)
[13:33:10] and all the databases on an-coord1001 (instance on port 3306, unix path as well in theory)
[13:33:18] sorry not mysql but mariadb
[13:33:57] so archival means it will consolidate different databases into one file
[13:34:13] if you have small number of tables, that will probably not be needed
[13:35:29] so the reason I suggested as a possibility
[13:35:37] the setting up a separate infrastructure
[13:35:48] is that by adding it to production one, you will lose a bit of control
[13:36:08] I am all for standardizing and following the best practices of SRE
[13:36:12] as it will be the production backups, and will need further agreement for changes
[13:36:23] makes sense yes
[13:36:26] that is why in that case I ask for sizes and that
[13:36:39] piwik is 1.7GB
[13:36:48] what about the other
[13:37:51] the others are more than one, we use mariadb on an-coord1001 for various tools that need a db
[13:38:10] yeah, but still small number of files/tables
[13:38:21] yep yep, low volume
[13:39:01] so now what I will do is setup that file temporarelly and run a manual run of the backups
[13:39:25] I need to check because not sure if it should be dbprov1001
[13:40:02] the one thing that the patch needs is to be added to monitoring
[13:40:12] there is a separate class for that
[13:41:22] that is hiera key profile::mariadb::backup::check::dump::sections
[13:41:36] send a patch with the name of the new sections
[13:41:45] (an ammend)
[13:41:52] or as a separate one, both work
[13:42:14] no need to refactor, this will break our eqiad/codf simetry but I will take care of that
[13:42:48] while I do a manual backup run
[13:43:32] we will put the backups on dbprov1002 better
[13:43:37] based on disk availability
[13:44:22] ack, should I break my change into two?
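The sizing questions above (1.7GB for piwik, "small number of files/tables" on an-coord1001) are usually answered straight from information_schema; a minimal sketch, run locally on the candidate host (figures are approximate, since InnoDB table statistics are estimates):

    # Approximate on-disk size per schema, to decide whether a backup section needs archival.
    sudo mysql -e "
      SELECT table_schema,
             ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb,
             COUNT(*) AS table_count
      FROM information_schema.tables
      GROUP BY table_schema
      ORDER BY size_mb DESC;"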
[13:44:33] up to you
[13:44:35] one for adding ferm rules (so you can run a manual test)
[13:44:40] and the latter for config
[13:44:52] we will definitely need the whole now
[13:44:56] even if it is manual
[13:45:00] *hole
[13:45:04] for testing
[13:45:28] I'll break it down then so we'll merge only the ferm changes
[13:47:00] I am ready to start the dump when you are
[13:47:26] please check because if there are long running queries and they are masters, it could block writes
[13:47:50] ack
[13:47:52] once done I will show you the format in which they are stored and how to recover
[13:48:00] prepping the ferm changes
[13:48:29] note we will do it from dbprov1002, but probably both should be allowed
[13:48:35] if later things get reorganized
[13:49:12] comment somewhere dbprov also so it can easly nbe grepped when hosts get replaced or added
[13:52:02] reduced in scope the https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538045/
[13:52:07] now it is all about ferm rules
[13:52:25] and mentions the dbprov dns records, so easy to find
[13:54:05] oh, I thought you wanted to deploy the rules only
[13:54:34] ah, so that is exactly what you sent
[13:54:35] nono I want to open ferm to allow you to make the test run
[13:54:55] go for it then
[13:55:06] do you need my +1, comments here on the ticket?
[13:55:23] nono I'll proceed
[13:55:54] if you need to wikilawyer me you can refer to this log :-D
[13:56:24] let me know once deployed and checked services are ok when to start
[13:56:42] and we will need some additional changes to the next patch
[13:57:21] ack!
[14:04:23] how is it going?
[14:05:08] just discovered that there are no AAAA records so ferm is not happy :)
[14:05:11] fixing
[14:07:14] that is fixed in the ferm package in buster-wikimedia, BTW
[14:07:52] moritzm: I am guessing he doesn't want to reimage his databases just for that :-D
[14:08:45] sure, sure. Just saying that there's light at the end of the tunnel :-)
[14:08:53] :)
[14:09:04] I still want to backport that to stretch-wikimedia, but haven't found the time yet
[14:11:26] ready?
[14:12:26] yep!
[14:12:29] should be good now
[14:13:35] ok, please open the services and the mysql dashboards
[14:13:44] to check for possible bad patterns
[14:13:55] I will start the manual run in a second
[14:14:05] sure
[14:17:34] we don't have a fancy dashboard yet
[14:18:01] but: https://phabricator.wikimedia.org/P9165
[14:18:27] \o/
[14:19:53] I don't think it is working
[14:20:42] no errors on the log so far, so it may be timing out tis connection
[14:21:48] elukey@matomo1001:~$ sudo mysql
[14:21:49] ERROR 1040 (HY000): Too many connections
[14:22:01] not a good sign
[14:22:05] yeah, as I said, you should monitor for bad patterns
[14:22:33] yep I spotted it while checking the process list https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fanalytics&var-server=matomo1001&var-port=13306
[14:22:37] an-coord1001 seems fine
[14:22:51] I don't think we can support backups of that
[14:24:06] well the current mysqldump weekly backup works fine
[14:24:16] not really
[14:24:25] you are probably generating garbage data
[14:24:41] why?
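The ERROR 1040 above usually means the application plus the dump threads exhausted max_connections; a minimal sketch for comparing the live thread count against the configured ceiling (MariaDB reserves one extra slot for a SUPER account, which is why an admin client can often still get in):

    # Compare current client threads against the configured ceiling.
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
                   SHOW GLOBAL VARIABLES LIKE 'max_connections';"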
[14:25:12] what this process does is generating consistent backups
[14:25:25] if you are backing up without consistency, it will work
[14:25:44] but you will not be able to recover it properly
[14:26:16] to do consistent backups, transactions have to be small
[14:26:37] so it starts not in the middle of one transaction
[14:27:41] it was blocked for "_wait_for_tstate_lock" on both databases
[14:28:20] I am guessing you do batch inserts or large selects there?
[14:28:56] could be, I am not super sure the exact traffic that goes to those databases, but the traffic is relatively smally
[14:28:59] *small
[14:29:40] yeah, it is not a question of how large traffic is
[14:29:54] but how large transactions are- for it be it could have a single transaction
[14:30:30] could be
[14:30:39] for an-coord1001 we currently use mariadb::mylvmbackup
[14:30:46] and copy the result to another host
[14:31:01] but I wanted something a bit more robust
[14:31:19] so it sounds that we can't use bacula for our use cases?
[14:31:29] no, the problem is not bacula
[14:31:58] I meant bacula as infrastructure/worflow
[14:33:05] the only think I would suggest is to setup a replica and perform the backups there
[14:33:23] can we try to relax the locking requirements for backups, and see if we can make it work anyway? I know that there are chances that we might end up with incomplete data etc.. but the current workflow that we use work well (we tried to restore backups and dump data)
[14:33:49] sure, we coud do that, but I don't have support for that at the moment
[14:34:11] and that is for a reason- it is required to get consistent backups
[14:34:25] if the risk is loosing some transactions I think that we can live with that, if it is worse probably nor
[14:34:28] not
[14:35:07] I am not saying I'd be happy to loose some, but it is not a strict requirement for our tools
[14:35:16] (not like loosing edits etc..)
[14:35:38] there is a -k option which skips the locking
[14:35:57] but I don't have support for that in our backups, it would need coding and I don't have time at the moment
[14:36:03] it is possible
[14:36:08] but requires way more work
[14:36:32] e.g. adding a "skip-locks" option on those 2 sections
[14:36:39] that sets the -k
[14:39:12] and if I use something like the mylvmbackup or mysqldump, would it be possible to use bacula to store the backups? Like we do now with Matomo for example
[14:39:24] not great solution but probably a starting point for me
[14:40:14] I don't gatekeep bacula, as long as you know the risks
[14:40:49] (potentially not being able to recover, no db support from me for that)
[14:41:25] as in, you can do as you want, as long as later you dont blame me! :-P
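The -k mentioned above corresponds to a lock-free logical dump; a minimal sketch of what a manual run with locking relaxed could look like, assuming mydumper is the underlying dump tool (host, user and output path are illustrative, not the actual production configuration):

    # Lock-free (inconsistent) logical dump of the piwik schema.
    # --no-locks (-k) skips the global read lock, so tables may not be captured
    # at a single consistent point in time.
    mydumper --host=matomo1001.eqiad.wmnet --user=dump \
             --database=piwik --no-locks --threads=4 --compress \
             --outputdir=/srv/backups/dumps/ongoing/piwik.test
    # Credentials are typically supplied via a defaults file (--defaults-file)
    # rather than on the command line.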
[14:41:48] nono of course I will not, just knowing if bacula can support this kind of use case
[14:41:58] I tell you the right way, the rest is compromises
[14:42:21] I know, but I have to find one
[14:42:28] this is why I am asking
[14:42:54] bacula is a file backup solution, if the question is, can I store this 20GB of files, the answer is yes
[14:43:17] just understand I am worried not about that, but about the db dumping solution
[14:43:46] and I can help, but it needs more work than expected
[14:43:49] sure, but I can take that risk with my team and of course not blaming anybody else if things go south
[14:44:39] the main question is right now if bacula can be used as generic file backup solution and who maintains it, or if the supported use case is backups done via the new infrastructure
[14:44:59] so there is 2 things
[14:45:07] I mean, if storing those 20G is fine and if there is support to make sure that I can retrieve them if needed :)
[14:45:12] bacula (general backups) and database backups
[14:46:13] the second things is "new" and suggested, but you can use only the first
[14:46:44] and setup your own solution
[14:47:20] all right, seems good enough for the moment
[14:47:41] note we use the same thing for all misc hosts, not only mediawiki
[14:48:02] but backing up a master can be problematic
[14:48:28] one long term solution could be to use a db as replica of both, and then backup only that (as you suggested)
[14:48:46] indeed
[14:49:08] but there is probably no budget for it.. Maybe we could explore if a VM on ganeti is enough
[14:49:41] if I can do a more general comment, I think you cut too much corners
[14:49:47] not you personally
[14:49:51] your team
[14:50:05] what do you mean?
[14:50:05] e.g. I suggested to keep 2 servers for eventlogging
[14:50:40] I know you are the first ones to wanting decom
[14:50:58] but replicas can be helpful for redundancy and cases like this
[14:51:18] and it is not a question of lacking resources because you already have them
[14:51:38] on that side, there are multiple reasons.. one of them is that db1107 hosts a log database that is different a lot from the one on db1108.. not super great, but I believe that moving users to db1107 in case 08 fails wouldn't bring any consistency, just some form of availability
[14:51:59] sure, I am not entering into specifics
[14:51:59] plus we'd have a backup and data flowing to hadoop (last 1.5y on it)
[14:52:02] etc..
[14:52:19] I am based on your latest comments of setting a replica but not having resources
[14:53:09] you try to opimize too much :-D
[14:53:52] the mysqldump+bacula solution will be good for the moment, long term (as said) we'll think about a proper replica
[14:54:03] I meant that we don't have resources budgeted for this use case this year
[14:54:08] next probably yes
[14:54:11] it happens :)
[14:54:25] so for what I see
[14:54:36] you have sleeping transactions on the dbs
[14:54:43] open for 12400 seconds
[14:54:55] that creates issues
[14:55:29] with backups you mean or in general?
[14:55:43] in general, but with backups in particular
[14:56:06] I am probably sure the -k option would avoid those
[14:56:16] or creating a lower connection timeout
[14:56:37] this is a good suggestion, will investigate our current settings
[14:57:18] I don't see any query ongoing on matomo
[14:57:23] I can retry there
[14:58:21] a backup you mena?
[14:58:24] *mean?
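The 12400-second sleepers mentioned above can be listed from the processlist, and the "lower connection timeout" suggested is the wait_timeout setting; a minimal sketch (the one-hour threshold is arbitrary, and a GLOBAL change only applies to connections opened afterwards):

    # List idle connections that have been sitting for more than an hour.
    sudo mysql -e "SELECT id, user, host, db, time
                   FROM information_schema.processlist
                   WHERE command = 'Sleep' AND time > 3600
                   ORDER BY time DESC;"
    # Lower the idle timeout so the server closes such connections itself.
    sudo mysql -e "SET GLOBAL wait_timeout = 3600;"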
[14:58:26] yes
[14:59:09] sure
[14:59:37] matomo probly works, but I cannot connect
[14:59:55] grants may be missing
[15:01:05] but dumps worked a while ago no?
[15:01:20] I may be doing something wrong
[15:01:55] interesting, I don't see dbprov1002 among the users
[15:02:01] dump@ I mean
[15:03:01] lol are those missing?
[15:03:38] apparently yes, checkend on an-coord1001 and they are there
[15:03:45] probably my pebcak
[15:04:27] ok, so only 1 failed
[15:05:13] now it works?
[15:06:12] I am trying to add the user now
[15:07:10] nope sorry it was already there, wrong sql select
[15:07:28] dbprov1002 is whitelisted
[15:07:31] in any case, I am going to retry onlt that
[15:08:25] and the grants are in sync with an-coord1001
[15:08:27] sure
[15:09:18] _joe_: yes, it's migrating the gigantic wb_terms to a set of normalized tables, it's going to take some time and have lots of writes (sorry for late answer, I'm sick)
[15:11:22] <_joe_> Amir1: heh take care
[15:11:52] thanks.
[15:13:33] nah, same issue
[15:13:38] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fanalytics&var-server=matomo1001&var-port=13306&panelId=37&fullscreen - process list again spiking..
[15:13:49] yeah
[15:14:22] the application must be doing some global lock of something weird
[15:15:44] yep probably
[15:15:48] it is piwik :)
[15:22:01] thanks for the work jynus, will try to follow up on long running transactions
[22:55:04] wmf-pt-kill.service just died on labsdb1011...which seems odd
[22:58:01] Looks like mariadb crashed
[22:58:07] https://www.irccloud.com/pastebin/ISPLve3h/
[23:30:19] Created a ticket for the morning and depooling that replica in case that helps. Feel free to repool it if that seems pointless.
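For a crash like the labsdb1011 one above, the first data points are usually the systemd journal around the restart and the server's own uptime afterwards; a minimal sketch, assuming a single-instance mariadb.service unit (unit name and log locations differ on multi-instance hosts):

    # What systemd saw around the crash/restart.
    sudo journalctl -u mariadb --since '2 hours ago' --no-pager | tail -n 100
    # If the server came back on its own, a small Uptime value confirms a recent restart.
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Uptime';"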