[04:43:06] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) db1141 has been working fine during the weekend - no trace of possible crashes or the error that precedes the crashes on the logs....
[04:55:54] DBA, Parsoid, Parsoid-Tests: testreduce_vd database in m5 still in use? - https://phabricator.wikimedia.org/T245408 (Marostegui) Open→Resolved `testreduce_0715` has been dropped: ` root@db1133.eqiad.wmnet[(none)]> drop database if exists testreduce_0715; Query OK, 5 rows affected (1 min 0.27...
[05:07:25] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) I am altering db2071 (enwi...
[05:09:54] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[07:02:18] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[07:41:28] DBA: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 (Marostegui) >>! In T231769#6176564, @Marostegui wrote: >>>! In T231769#6120789, @Marostegui wrote: >> This seems very similar to what we see: https://jira.percona.com/browse/PS-6961 (also affects 5.6, 5.7). I will tr...
[07:41:40] jynus: you think it is worth reporting ^
[07:43:09] reading
[07:45:46] do you mean our issue to mariadb, mysql 8 to mariadb, our issue to percona/mysql?
[07:46:06] basically that issue reported on percona/mysql, report to mariadb
[07:46:10] (I haven't found it)
[07:46:22] but not sure if the numbers I got are really worth reporting
[07:46:31] what do you think?
[07:46:44] idk
[07:46:56] I can tell you what I would do
[07:47:04] are those numbers enough to say there's a leak?
[07:47:06] what would you do?
[07:47:15] which is try to create an event executing multiple times a second to reproduce it easily
[07:47:24] that's what I did
[07:47:27] oh
[07:47:29] there are 100 events
[07:47:33] running every second
[07:47:36] I didn't see that let me reread it
[07:47:43] I only read the ticket
[07:47:46] the jira one
[07:47:50] aaaah
[07:47:55] No, read these two comments:
[07:48:09] https://phabricator.wikimedia.org/T231769#6176564 https://phabricator.wikimedia.org/T231769#6180922
[07:48:11] that's all you need
[07:48:12] yeah, there is a lot of comments and I cannot follow well
[07:48:25] yeah, sorry, those two are the ones
[07:48:57] which host is that?
[07:49:00] db1077
[07:49:06] the one that has 0 activity
[07:49:10] apart from those events
[07:50:22] I cannot see the events on db or comments, where are they?
[07:50:52] serverStatus database
[07:51:00] (I did exactly the same steps the original reporter did)
[07:51:03] thanks
[07:51:05] same numbers, same names etc
[07:51:12] same events etc
[07:51:25] ok, thanks, now I get the whole context, let me reread the ticket again
[07:51:30] sorry yeah
[07:51:34] I should've explained better
[07:52:23] it doesn't have a mysql bugs equivalent, right?
[07:52:49] oh, it has
[07:52:52] yeah
[07:52:53] it does
[07:52:56] but not mariadb
[07:53:00] (or I couldn't find it)
[07:53:03] so definitely would report it
[07:53:07] pointing there
[07:53:27] Is it worth providing my numbers too?
[07:53:29] I am not sure
[07:54:23] maybe wait for more than a few hours
[07:54:30] but report it anyway
[07:54:33] those are the numbers from Friday
[07:54:58] oh, sorry, looked at the wrong month
[07:55:07] it indeed started on the 29th
[07:55:17] yeah
[07:57:19] did it not happen on 10.1?
[07:57:35] I didn't test
[07:57:35] tendril is in 10.1, right?
[07:57:40] but db1115 has shown issues
[07:57:40] yeah
[07:57:41] exactly
[07:57:44] so I am sure it is connected
[07:57:49] as tendril has like 2k events or more
[07:58:17] I would do another final test: remove the events
[07:58:23] and check it stops
[07:58:25] oh, good idea
[07:58:41] I would report it for sure
[07:58:42] I will remove the events or stop the event scheduler and wait another 3 days
[07:59:13] maybe test on 10.1 not to make sure it runs, but to make sure it is the same issue than tendril
[07:59:26] (event scheduler stopped on db1077)
[07:59:48] I can try on a codfw host with 10.1 indeed
[08:00:03] do the query confirm high sql memory usage too?
[08:00:15] we don't have those tables on sys on Mariadb :(
[08:00:19] oh
[08:00:31] I see, I guess it is 8 only
[08:00:38] yeah
[08:00:47] I couldn't find them anywhere
[08:00:53] I thought maybe someone ported them or something
[08:00:56] but I didn't find them
[08:00:58] on mariadb
[08:02:29] no, it is P_S
[08:02:33] not sys what it is missing
[08:02:42] doesn't have those metrics
[08:03:03] they were added on 5.7
[08:04:37] ah yeah, p_s, got confused
[08:04:45] but they are not there anyways, neither on mariadb doc
[08:05:01] I would mention that detailed profiling is not possible due to that
[08:05:07] good idea!
[08:05:34] having said that, even if I think that should be fixed
[08:05:44] you know what I think about sinking time into tendril...
[08:06:00] yeah :)
[08:06:02] even if this get fixed
[08:06:07] Nah, tendril needs to go
[08:06:11] I don't like logic on events
[08:06:17] But at least we know that this tendril OOMs are because of a this bug
[08:06:38] yeah, we need it fixed anyway due to the events on production
[08:06:49] yeah
[08:07:03] we don't massively use them but on tendril we clearly see the OOM every 2-3 months
[08:07:04] but this is only 1 of the many issues of that model
[08:07:11] I hate events
[08:07:35] it is as usual how you use a tool, not to tool itself :-D
[08:08:00] we need observability and debugging
[08:08:13] yeah
[08:08:36] having 2.5k events on a host running every second cannot be good anyways :)
[08:10:08] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) @Jclark-ctr One last thing- this was not an issue for me because as I had remote login so I could fix it myself, but may be interesting for you: remote IPMI was disabled, so I...
[08:10:46] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200...
[08:11:04] maybe also worth upgrading BIOS and firmware on db1140 (if it wasn't done already - I haven't followed the ticket)
[08:12:08] I would guess that if vendro changed it in person, it will be done on a new board
[08:12:12] but I can check
[08:12:32] in any case, it wouldn't interfere with me reinstalling it
[08:14:01] yep
[08:14:20] should I deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/599596 everywhere?
[08:14:36] it will be needed by db1140 after install
[08:15:15] db1140 will be running 10.4?
[08:15:20] yes
[08:15:25] I want to test it early
[08:15:28] ah, you are testing it
[08:15:29] yeah
[08:15:41] I don't want to have 90% hosts upgraded
[08:15:42] but those same sections are being backuped in db1095 no?
[08:15:50] correct
[08:15:56] then say "there is a huge blocker"
[08:16:26] after testing , I can reimage back to stretch or keep it, unsure
[08:16:43] yeah, as long as we still have those sections with 10.1 being backuped, that's cool!
[08:16:59] the plan is to have them in parallel for some time
[08:17:07] then recover and do some checks, etc.
[08:17:14] that sounds good yeah
[08:17:15] +1ed
[08:17:28] I will disable puppet just in case
[08:18:00] but worse case scenrio, backups on tuesday fail
[08:19:18] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[08:23:35] I have disabled puppet on all databases and dbprov hosts
[08:30:30] Notice: /Stage[main]/Mariadb::Packages_wmf/File[/usr/local/bin/mbstream]/ensure: created
[08:31:08] I think the link being owner by root is an ok thing
[08:31:14] *owned
[08:32:00] and it is now on path
[08:32:13] what should I test before mass deploy, a package upgrade?
[08:34:29] I am going to upgrade a codfw host to 10.1.44
[08:53:52] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) a:jcrespo→Jclark-ctr @Jclark-ctr Serial port redirection doesn't work. This is a blocker because I cannot read the console output on restart, and understand why it is n...
[08:58:37] I take db2075 and upgrade it to 10.1.44
[08:59:01] ok
[09:04:55] deploy works as intended, will reenable puppet everywhere soon
[09:07:28] and host stuck loading initial ramdisk....
[09:08:18] it was in the list of problematic ones T216240
[09:08:19] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240
[09:08:54] oh those...
[09:09:06] yeah sometimes they've bitten me again
[09:09:58] will those go away next year?
[09:10:15] no :(
[09:10:32] it went throught on 2nd try
[09:11:00] can you create a subtask of that task to get that one upgraded?
[09:11:54] nope, it got stuck now on cpu smp configuration
[09:12:06] :(
[09:18:37] DBA, Operations, ops-codfw: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (jcrespo)
[09:26:41] puppet should be enabled everywhere
[09:28:02] check if that affected your deploy of ae6dfa7d75
[09:28:53] checking
[09:29:04] ah yes
[09:29:08] I will re-run it
[09:29:52] we had a calendar reminder to check old backup files for labsdb1011
[09:29:58] what is the status of that?
[09:30:01] I replied to it
[09:30:03] via email
[09:30:05] that it is all odne
[09:30:06] done
[09:30:36] I see a may 28 dump
[09:30:40] DBA, Operations, ops-codfw: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (Marostegui) p:Triage→Medium a:Papaul
[09:30:44] could we move that to backup1002?
[09:30:46] jynus: where?
[09:30:58] backup1001:/srv/production/db1141_logical_once_replication_caught_up/ongoing
[09:31:04] yeah, but that's db1141, not labsdb1011
[09:31:13] yeah
[09:31:19] but yes, we can move it to backup1002
[09:31:19] that is why I am asking
[09:31:23] I can take care
[09:31:28] Ah cool thanks
[09:31:30] I don't mind using backup1002
[09:31:44] I would need to take another binary of db1141 today (once it has caught up again)
[09:31:46] but backup1001 is on production volume, which I would prefer not touch it
[09:31:48] where could I place it?
[09:31:50] ah sure
[09:31:59] let me see where there is space
[09:32:45] how large?
[09:32:58] 6.5T
[09:33:24] yeah, backup1002 should have 55T free right now
[09:33:28] plenty of space
[09:33:40] and can be organized better within /srv
[09:33:46] that is why I prefer it
[09:34:03] sure, let me know where I can copy it to (no rush, it is still catching up)
[09:34:46] backups
[09:35:08] create a snapshot dir or whatever under /srv/backups
[09:35:23] that should have enough space for 10 of those if necessary
[09:35:23] cool
[09:35:34] you'll copy db1141's already existing backup there?
[09:35:34] backup1001 is way more utilized
[09:35:37] yes
[09:35:40] excellent
[09:35:54] I don't want backup1001 with limited space if you understand me
[09:36:13] sure sure
[09:36:14] backup1002 should only have low activity on tuesdays
[09:36:23] so it will run faster and with more disk
[09:36:28] let me know when the data is transferred there
[09:36:34] yes, doing it now
[09:36:40] I think I will need to start the other transfer tomorrow morning
[09:36:41] will run a checksum even
[09:36:47] I doubt it will be already sync'ed with the master by today
[09:36:55] and check with you before deleting it
[09:37:00] cool
[09:38:11] also backup1002 should have db backup tools installed if needed
[09:38:31] vs backup1001 not having a client or mysql utilities
[09:38:32] backuptools?
[09:38:35] ah
[09:38:36] I see
[09:38:40] yeah, dumping, xtrabackup
[09:38:45] so it will be a better fit
[09:38:48] yep
[09:38:50] backup1002 is like a dbprov
[09:39:15] so it has mysql and other stuff installed there
[09:39:29] (with no daemon running)
[09:43:53] backup1002:/srv/backups/T249188/ongoing is ongoing
[09:44:02] copying to port 4444
[09:44:12] thanks
[09:45:01] that way we will leave backup1001 with all the original resources
[09:45:04] 0:-D
[10:01:14] jynus: meeting?
[12:01:29] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[12:58:23] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) @Jclark-ctr could you confirm if you want to do this maintenance today Monday 1st June or tomorrow Tuesday 2nd June?
[13:10:54] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) John confirmed via IRC that the maintenance will be done on Tuesday - thank you!
[13:30:46] I am running an md5sum -c on backup1002
[13:30:54] nice
[13:31:17] if it ends with no error, we can delete it from backup1001
[13:31:38] I named the dir with the ticket number under /srv/backups
[13:31:56] yeah, I saw that
[13:33:05] but the transference looked good: 920955506013 bytes correctly transferred from backup1001.eqiad.wmnet to backup1002.eqiad.wmnet
[13:33:34] sounds good, I will probably do the binary transfer tomorrow morning
[13:33:43] As s1, s4 and s8 are still catching up on db1141
[13:33:55] I will place it on that same path that you created
[14:06:31] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (daniel) I constructed an example for a...
[14:12:22] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) >>! In T238966#6181871, @d...
[14:36:23] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (daniel) >>! In T238966#6181873, @Maros...
[14:54:25] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) I have sync'ed up with @da...
[15:29:06] checksum is correct, dump at backup1002 has the same data than the one on backup1001
[15:30:11] I will remove db1141_logical_once_replication_caught_up on backup1001 only
[15:30:18] ok?
[15:54:00] sounds good
[15:54:11] can you update the task with the new location and path?
[15:54:20] it will be useful to have it there :)
[15:54:21] ok
[15:54:34] just comment on there and that's enough, so we can have a record of all the steps we've done
[15:56:11] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (jcrespo) The logical backup mentioned at T249188#6171874, which used to be at `backup1001:/srv/production/db1141_logical_once_replication_caugh...
[18:45:56] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Tgr) NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side. (Alternatively, you can just use numeric I...
[20:40:13] DBA: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (-jem-)
[20:41:35] DBA: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (-jem-) p:Triage→High
[22:19:55] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (bd808) Tagging with #DBA to get some ideas of what might be happening here. It may be somehow related to the ongoing work on {T249188}, but I do not know how to verify or disprove that.
[23:55:28] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (AntiCompositeNumber) The issue seems to be roughly constant per day over the past month (https://quarry.wmflabs.org/query/45496). Doesn't really help figure out when it started, other than knowing it was at...