[06:03:26] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) db1087 has all its tables compressed but wb_terms. I am going to wait for labs to catch up a bit before moving them under codfw's sanitarium before starting the compression on that table, which I guess it will take aro...
[06:14:18] 10DBA, 10Parsing-Team: testreduce_vd database in m5 still in use? - https://phabricator.wikimedia.org/T245408 (10Marostegui) >>! In T245408#5895216, @ssastry wrote: >>>! In T245408#5891847, @Marostegui wrote: >>` >> | GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, CREATE TEMPORARY TABLES ON `testre...
[06:30:25] 10DBA: Compress table watchlist_expiry - https://phabricator.wikimedia.org/T245358 (10Marostegui)
[06:49:06] 10DBA: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Marostegui) >>! In T245489#5893628, @Anomie wrote: > Yeah, in this case it makes little observable difference whether it fetches 501 rows via `el_index_60`, or fetches 553 rows via `el_index` and filesorts...
[07:04:18] 10DBA: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 (10Marostegui)
[08:23:23] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1020.eqiad.wmnet']...
[08:33:07] 10DBA: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 (10Marostegui)
[08:42:37] 10DBA: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 (10Marostegui)
[08:42:49] 10DBA: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 (10Marostegui) 05Open→03Resolved This is all done
[08:42:51] 10DBA: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 (10Marostegui)
[08:53:09] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[08:53:21] 10DBA: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Marostegui)
[08:54:15] 10DBA: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `dbproxy1007.eqiad.wmnet` - dbproxy1007.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Downtimed management interfa...
[09:03:43] Feb 19 09:03:25 db1140 systemd[1]: [/lib/systemd/system/mariadb@.service:77] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
[09:03:44] Feb 19 09:03:25 db1140 systemd[1]: [/lib/systemd/system/mariadb@.service:77] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
[09:03:54] woot?
[09:04:12] I warned about this some time ago
[09:04:14] so that flag is gone in buster or what?
[09:04:30] no, that option is supposed to be on the systemd on buster only
[09:04:55] Ah, yes, db1140 is stretch
[09:04:56] as that option is not supported by systemd on stretch, the dir is wiped out on stop
[09:05:06] but it is the first time we've seen that, no?
[09:05:12] no
[09:05:29] I have been stopping multiinstance hosts a lot lately, and didn't see that
[09:05:42] if you stop all at the same time, it wouldn't affect you
[09:05:55] or they have a package not affected
[09:06:08] which package is db1140 running?
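For context on the error above: `RuntimeDirectoryPreserve=` only exists from systemd v235 onwards, so stretch's systemd (v232) rejects it and falls back to the default behavior of removing the runtime directory under /run whenever a unit using it stops, which on a multi-instance host also removes the sockets of still-running sibling instances. A minimal sketch of the relevant unit section; the `RuntimeDirectory=mysqld` value is an assumption, only the unit path and line number appear in the log:

```
# Hypothetical excerpt of /lib/systemd/system/mariadb@.service
[Service]
# systemd creates this directory under /run on start and, without the
# directive below, deletes it again when a unit using it stops; on a
# multi-instance host that removes the other instances' sockets too.
RuntimeDirectory=mysqld
# Only understood by systemd >= v235 (buster ships v241); stretch's
# v232 logs "Unknown lvalue 'RuntimeDirectoryPreserve'" and ignores it.
RuntimeDirectoryPreserve=yes
```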
[09:06:16] 10.1.43?
[09:06:22] the latest we have, yes
[09:06:37] yeah, the ones I stopped didn't have that one I reckon
[09:06:42] we need to upgrade to a -1
[09:06:59] and we can do a hot upgrade, just substituting the systemd unit
[09:07:07] yeah, looks like it is indeed needed
[09:07:15] not a nice feature hehe
[09:09:41] also, for some reason, the s3 socket doesn't come up
[09:09:56] but s2 and x1 do
[09:10:18] the server does, but the socket doesn't
[09:10:27] no errors on logs?
[09:10:40] the server comes up normally
[09:10:56] although I started it last
[09:12:04] I am taking a look too
[09:13:23] how is that possible? :-/
[09:13:54] systemd
[09:14:22] that option is not supported, so I thought it would only fail, but it is doing weird stuff
[09:14:49] yeah, I wonder how mysql was able to start without complaining about the socket
[09:14:58] maybe it created it during start and then once it was up, it got deleted
[09:15:11] because it was created unix-like then deleted something something
[09:19:24] I am going to rebuild a -1
[09:19:31] +1
[09:23:09] I will also build, but not upload, 10.1.44
[09:23:19] sounds good yeah
[09:23:20] could you check the state of mariadb-client on buster?
[09:23:34] what do you need?
[09:23:36] I think it was a blocker for the cumin buster upgrade
[09:23:47] just if there is a buster mariadb client we like
[09:23:50] on the repo
[09:23:52] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1020.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1020.eqiad.wmnet'...
[09:23:58] I built it and installed it on db1107
[09:24:07] the client package?
[09:24:15] yeah
[09:24:37] ok, so not a blocker for when buster is on cumin, right?
[09:24:46] nope, I haven't seen anything weird
[09:25:06] (I have been actually using it from db1107 precisely to test it) rather than using cumin in most cases
[09:25:20] if you could find the ticket discussing that and just say that, it would be great
[09:25:30] sure, let me check it
[09:26:29] done
[09:27:06] thanks
[09:27:35] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) >>! In T241359#5896178, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1020.eqiad.wmnet'] > `...
[09:42:25] while I finish, could I also ask you how many hosts are on the latest package version?
[09:42:56] and/or if we could depool one multiinstance host for testing the fix?
[09:44:34] sure, you can depool db1096:3315, db1096:3316
[09:44:37] do you want me to depool it?
[09:44:50] not yet
[09:44:53] (most of the hosts are running 10.1.43)
[09:45:21] but if there is any blocker to do it (e.g. ongoing maintenance) try to make it possible
[09:45:39] nope, that host can be depooled anytime
[09:46:03] also let's test it first on non-critical hosts
[09:46:10] then let's go for codfw
[09:46:25] db2089
[09:50:22] it is uploaded now, could you help me with that one ^
[09:50:31] sure, depooling db2089
[09:50:34] and downtiming
[09:51:45] you can now proceed
[09:51:46] what order should we do? first test the issue is solved, then test updating the systemd works without a restart?
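As a side note, one hedged way to answer the "how many hosts are on the latest package version" question fleet-wide with Cumin; the host selector and the `wmf-mariadb101` package name are assumptions, not taken from the log:

```
# Show the installed wmf-mariadb101 build on every eqiad db host;
# the selector is illustrative, adjust to the Cumin aliases in use.
sudo cumin 'db1*' 'dpkg -l wmf-mariadb101 | tail -n1'
```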
[09:51:55] yeah, that sounds good
[09:52:10] then I prefer to do it on db1140 first
[09:52:18] then on db2089
[09:52:19] up to you, db2089 is fully ready
[09:53:03] see -ops
[09:53:12] sure
[09:54:00] it will be a package upgrade only (no server) so no mysql_upgrade necessary FYI
[09:54:17] oki
[09:54:52] dpkg: error: dpkg status database is locked by another process
[09:55:10] puppet running I guess?
[09:55:21] you think a fluke?
[09:55:27] it worked on the next run
[09:55:49] Feb 19 09:45:58 db1140 puppet-agent[21372]: Applied catalog in 8.72 seconds
[09:55:51] so not that
[09:56:04] could be debmonitor
[09:56:26] we will see next time
[09:56:38] which host?
[09:56:41] will stop daemons
[09:56:47] db1140
[09:57:17] ok
[09:58:08] pgrep-ing mysqld, as we cannot trust systemd
[09:58:20] and starting them again, one by one
[09:58:24] ok
[09:58:49] I think there may be a bit of a race condition
[09:59:02] depending how much time between starts, etc.
[09:59:15] maybe do it all at once with &&?
[09:59:28] well, I am mostly worried about the stop :-D
[09:59:38] which is what deletes the working directory
[09:59:51] because directory preserve is not available
[10:00:08] hehe yeah
[10:00:25] things should be now normal
[10:00:27] debmonitor is triggered only by apt hooks and by a daily crontab that on db1140 runs at 17:51
[10:00:30] so it should be unrelated
[10:00:42] will wait for icinga to confirm
[10:02:01] yeah, it is ok, there is lag but that is because I am running with skip-slave-start
[10:02:03] now the stop
[10:02:06] cool
[10:03:20] stop is looking ok, as expected
[10:05:18] I think the issue could present only if the process is killed
[10:05:28] which I technically did for s3
[10:05:37] but still, not something we want
[10:06:22] let me test the upgrade on db2089 to see if we will need a -3
[10:07:34] right
[10:07:47] 12 packages can be upgraded
[10:07:54] should I do full or just wmf?
[10:08:02] nah, do everything I would say
[10:08:07] it is mostly perl
[10:09:01] no errors this time
[10:09:16] so maybe we were doing a few apt runs manually?
[10:09:45] post-inst only calls update-alternatives, does nothing with apt
[10:10:10] what I don't remember is if I do a systemd reload
[10:10:16] let me check
[10:11:21] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1020.eqiad.wmnet']...
[10:11:40] yeah, it does /bin/systemctl daemon-reload
[10:13:04] on paper this is fixed
[10:13:46] want to try db2089?
[10:13:52] but if I stop the instances I am not sure we would be able to replicate it
[10:13:52] sorry
[10:13:54] you did already
[10:14:01] no, didn't restart yet
[10:14:06] ah ok!
[10:14:36] as in, would a systemd reload be enough to forget the RunTimeDir?
[10:14:48] let's see!
[10:14:50] or is it tied to the process
[10:15:22] but if I do a normal shutdown, I am sure it would work in both cases, only a kill/systemd failure may have triggered it
[10:15:49] I will at least stop one instance
[10:15:55] sure
[10:17:10] see -ops
[10:17:18] ok
[10:18:12] yeah, no issue as expected
[10:18:48] I believe the only thing left is to install -2 over all .43
[10:19:03] cool
[10:19:10] we can do that slowly indeed
[10:19:22] although it is mostly needed on multiinstance, no?
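A sketch of the "hot upgrade" path mentioned earlier, assuming the fix is purely in the unit file: swap in the unit shipped by the fixed package and reload systemd, without restarting any running instance. The file location is an assumption based on the error paste:

```
# Install the fixed unit file and make systemd re-read it.
sudo cp mariadb@.service /lib/systemd/system/mariadb@.service
sudo systemctl daemon-reload
# The running instances must be untouched: same PIDs as before.
pgrep -a mysqld
```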
[10:19:33] yeah, the others don't matter
[10:19:41] sweet, that can be done easily
[10:19:44] because even if the dir is removed
[10:19:53] no harm done
[10:19:56] yep
[10:20:08] I guess it could fail on start?
[10:20:56] but I think it is created on puppet run and on start
[10:21:06] yeah, I was going to say that puppet would take care of that
[10:21:12] it also wouldn't have affected mediawiki
[10:21:23] because the servers would be up
[10:21:28] but all monitoring would fail
[10:21:35] as it is done based on the socket
[10:22:06] I will create the .44 package
[10:22:25] thanks
[10:23:01] will upload it to install1002, but only to my home
[10:23:35] ok
[10:33:34] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1020.eqiad.wmnet'] ` and were **ALL** successful.
[10:35:32] I also checked no more errors on the new package: d/system/mariadb@.service:77] Unknown lvalue 'RuntimeDirectoryPreserve'
[10:35:44] \o/
[10:35:47] good job
[10:35:57] can I repool db2089?
[10:37:18] if caught up, yes
[10:37:44] yep it is
[10:39:09] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui)
[10:40:10] I might upload the 10.1.44 client, though
[10:40:36] yeah, that's cool
[10:41:01] I left the server package on my home
[10:41:05] for manual install
[10:41:19] but will not install it anywhere (10.1.44)
[10:43:12] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1020 installed correctly: RAID10, 256k strip size, BBU and Cache policy right / disk space, memory and cpus
[10:43:55] this is me right now: https://i.imgur.com/T80xXuA.gifv
[10:45:09] I need to fix the package, to fix the s3 source, to take a backup, to improve the backup system
[10:45:55] hhahahahahahaha
[10:45:57] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1021.eqiad.wmnet',...
[10:46:00] I love that gif
[10:46:46] it is from the Malcolm-in-the-Middle / Breaking Bad cinematic universe
[10:46:52] yeah, I love Malcolm
[10:54:29] yay for the es102* hosts :-D
[10:59:39] that's some fast logging
[10:59:54] looks like they might have failed
[10:59:58] I am seeing some cumin errors
[11:00:06] 3 reinstalls or just 1?
[11:03:05] looks only related to downtime
[11:03:41] did you end up going for buster?
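The "if caught up, yes" check before repooling is the standard replication status query; nothing WMF-specific is assumed here:

```
-- On the depooled replica (db2089 in this case) before repooling:
SHOW SLAVE STATUS\G
-- Expect: Slave_IO_Running: Yes, Slave_SQL_Running: Yes,
--         Seconds_Behind_Master: 0
```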
[11:03:45] nop
[11:04:03] don't want to do experiments with ES hosts and 10.4 :)
[11:04:04] as I saw some installer issues being fixed, so I thought it was a new process
[11:04:26] maybe it was different hosts
[11:04:47] those fixes were for stretch and all the 1G/10G stuff
[11:04:52] I see
[11:05:35] to be fair (although I don't want to overload you) it would be nice to test the buster installer before installing stretch over it
[11:05:54] I will see what I can do
[11:06:06] yeah, we can test it with newer hardware next Q
[11:06:10] no rush
[11:07:08] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1022.eqiad.wmnet', 'es1023.eqiad.wmnet', 'es1021.eqiad.wmnet'] ` and we...
[11:07:13] the buster installer was tested with db1107 already though
[11:07:16] also once we have faster recovery systems
[11:07:28] we would have more margin for error
[11:12:59] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) >>! In T241359#5896474, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1022.eqiad.wmnet', 'es1...
[11:14:20] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui)
[11:17:43] the codfw prepare is now also failing
[11:18:10] I wonder if I am targeting this wrong, the issue is ulimit on xtrabackup
[11:18:32] and not related to mysql at all, and it just went over the 65K limit this month
[11:20:17] nah, it just has 9 right now
[11:26:14] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui)
[11:28:36] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1024.eqiad.wmnet', 'es1025.eqiad.wmnet']...
[12:28:54] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1024.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1024.eqiad.wmnet'] `
[12:45:54] db1140:s3 logical backup finishing, running snapshot now, will see how it goes and run a compare later
[13:28:45] lots of commonswiki errors
[13:28:57] since 13:21
[13:29:32] db1084
[13:31:41] ^depooled it
[13:35:18] 10DBA, 10Operations: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10jcrespo)
[13:52:24] :(
[13:52:55] I was about to leave for lunch; either you research it or I will later, things are stable
[13:53:00] I will take over
[13:53:06] Thank you for triaging it
[13:53:29] assigning it to you was mostly for noticing purposes
[13:53:30] Looks like the BBU died
[13:53:50] when I want your attention, I think that is a good way :-D
[13:53:56] definitely
[13:54:40] if it has to be recovered, testing an s4 snapshot would be nice
[13:54:49] afk
[13:55:05] 10DBA, 10Operations: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10Marostegui) Looks like BBU died: ` Battery/Capacitor Count: 0 ` ` /system1/log1/record15 Targets Properties number=15 severity=Caution date=02/19/2020 time=1...
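The BBU diagnosis above comes from controller output; a hedged sketch of reproducing it on an HP host, with the caveat that the utility name depends on the controller generation (ssacli on newer hosts, hpssacli on older ones):

```
# Show controller battery status; "Battery/Capacitor Count: 0"
# matches the failure pasted above.
sudo ssacli controller all show detail | grep -i -A2 battery
```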
[13:56:42] 10DBA, 10Operations: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10Marostegui) @wiki_willy do we have spare HP BBUs in eqiad?
[13:57:17] 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui)
[13:57:33] 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) p:05Triage→03Medium
[13:59:57] 10DBA, 10Operations, 10Patch-For-Review: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui)
[14:02:21] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1024.eqiad.wmnet'] ` The log can be foun...
[14:04:05] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1025: RAID10, 256k strip size, BBU and Cache policy right / disk space, memory and cpus looking good.
[14:04:41] marostegui: thanks for finishing the es servers
[14:04:45] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui)
[14:05:03] cmjohnson1: es1024 doesn't have link I think, it cannot PXE :(
[14:05:07] The rest are good
[14:05:57] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) @Cmjohnson can you double check es1024's link? It cannot PXE boot: ` Booting from PXE Device 1: Integrated NIC 1 Port 1 Partition 1 PXE: No m...
[14:06:29] can you get out of the console for es1024 please
[14:06:35] sure
[14:06:48] cmjohnson1: done, all yours
[14:07:00] thx
[14:20:48] marostegui: it's installing now, I will ping you once it's finished. The boot setting was wrong in the BIOS.
[14:20:59] ah cool!
[14:21:00] thank you
[14:21:21] cmjohnson1: do you happen to know if we have spare HP BBUs in eqiad?
[14:22:48] I do not have any that I know of; there was talk that we were going to be ordering spares, but I'm not sure of the result of that talk.
[14:23:05] Cool, will talk to Willy then! Thank you!
[14:34:55] marostegui: the install is done
[14:35:05] yay!
[14:35:18] cmjohnson1: thanks - I will do some checks and close the task
[14:38:47] 10DBA, 10Operations, 10ops-eqiad: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1024.eqiad.wmnet'] ` and were **ALL** successful.
[14:39:26] 10DBA, 10Operations: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson)
[14:41:32] 10DBA, 10Operations: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1024: RAID10, 256k strip size, BBU and Cache policy right / disk space, memory and cpus looking good.
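A hedged sketch of the post-install checks summarized in those comments ("RAID10, 256k strip size, BBU and Cache policy right"); the es hosts appear to be Dell, so a PERC tool is assumed here, and the exact binary (megacli vs. perccli) is a guess:

```
# Logical drive layout: RAID level, strip size and cache policy.
sudo megacli -LDInfo -Lall -aALL | grep -Ei 'raid level|strip size|cache policy'
# Controller battery state.
sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -i 'battery state'
```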
[14:41:45] 10DBA, 10Operations: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui)
[14:41:56] 10DBA: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 (10Marostegui)
[14:41:59] 10DBA, 10Operations: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) 05Open→03Resolved All hosts have been installed successfully. Thanks!
[14:42:12] 10DBA: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 (10Marostegui) 05Stalled→03Open
[14:42:14] 10DBA, 10Epic, 10Goal: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (10Marostegui)
[14:52:21] 10DBA: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 (10Marostegui)
[15:27:57] 10DBA: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Anomie) The execution itself doesn't seem to be reading too many rows, ` wikiadmin@10.64.0.214(enwiki)> SELECT /* ApiQueryExtLinksUsage::run */ el_index_60, el_id, page_id, page_namespace, page_title, el_t...
[15:30:40] 10DBA: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Marostegui) >>! In T245489#5897481, @Anomie wrote: > > That suggests whatever is being slow is in the logic for doing the filtering rather than in the row fetching itself. Yeah, that's why I believe that...
[15:32:06] 2020-02-19 13:04:47 139698789949888 [ERROR] InnoDB: Table csbwiktionary/watchlist_expiry in the InnoDB data dictionary has tablespace id 348750, but tablespace with that id or name does not exist
[15:32:55] I guess it could be DDL + backup?
[15:33:10] what do you mean DDL?
[15:33:31] that table maybe being created recently
[15:33:46] interesting
[15:33:51] that could be the ongoing compression I am doing
[15:33:54] for that table in s3
[15:34:05] but not for source mysqls
[15:34:30] I am doing it on the master with replication (it is an empty table)
[15:34:35] ah
[15:34:43] well, is it still ongoing?
[15:35:04] yeah
[15:35:09] ETA?
[15:35:25] end of day
[15:35:35] I think if I stop replication, it shouldn't affect me
[15:35:47] I will try like that
[15:36:00] in any case, that should be a non-fatal error
[15:36:01] yeah, it won't :)
[15:36:17] I am guessing each alerter takes <1 second or so?
[15:36:32] so I can just stop the slave and it will stop being applied
[15:36:48] (for that host)
[15:38:35] it is complaining about that, but on the next prepare it completes ok
[15:38:45] vs. being stuck for 3+ hours
[15:38:56] so that I think is "expected"
[15:39:34] each alerter?
[15:39:47] ah, alter
[15:39:48] yeah
[15:39:49] yeah, it sees movements on the transaction logs
[15:39:49] they are fast
[15:39:53] but I have a big sleep
[15:40:03] so it gives an error that will be ignored
[15:40:06] because otherwise s3 suffers (because of all the many files...)
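For reference, a sketch of the kind of statement behind that compression run, executed on the s3 master and replicated wiki by wiki with a sleep in between; the KEY_BLOCK_SIZE value is an assumption:

```
-- Cheap here because watchlist_expiry is still empty on every wiki.
ALTER TABLE watchlist_expiry
  ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```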
[15:40:11] because it is a table it doesn't know about at the time
[15:40:26] yeah, your operations are not an issue, just to be clear
[15:40:36] I am just trying to understand what xtrabackup is doing
[15:40:43] and I think it is ok that it errors out
[15:40:49] and expected
[15:41:00] not the root cause of the prepare stalling
[15:41:18] as on the next run, it finished in 10 seconds, with the same "errors"
[15:41:29] but with an ok output
[15:41:42] Maybe it is comparing the original frm with the new one? (compressed)?
[15:42:00] yeah, it is innodb, but same idea
[15:42:09] on alter, the tablespace id will be reconstructed
[15:42:25] so it cannot apply changes to it, so it warns about it
[15:42:39] but that is ok, those will be applied on replication
[15:43:03] but let me stop replication to remove that variable
[15:43:08] sure thing
[15:43:16] and potentially understand the real reason for the stalls
[15:43:42] it is very weird both s3 are failing at around the same time
[15:44:02] yeah, it is definitely the compression
[15:44:03] so I am still thinking something like "number of objects grew over the ulimit"
[15:44:15] because of the new table, or something like that
[15:44:29] something that makes s3 special
[15:44:35] vs. the other sections
[15:44:58] open files or descriptors?
[15:45:00] maybe?
[15:45:31] I thought that, but while stalled, I only saw 9 files open
[15:45:50] some temporary ones and the logs, it wasn't scanning .ibds
[15:46:00] but something along those lines is the theory
[15:46:03] maybe the OS
[15:46:09] or maybe mysql config
[15:46:14] maybe we should try to strace the process
[15:46:20] and see if we see something
[15:46:32] e.g. open_files_limit for xtrabackup
[15:46:53] I guess it will just use the OS default?
[15:47:11] mysql's default, as we only tune it for a regular mysql
[15:47:18] but again, those are guesses
[15:47:39] whenever I run it manually, I cannot reproduce it and backups work ok :-/
[15:48:45] root@db1140:~# cat /proc/27041/limits | grep files
[15:48:46] Max open files 200001 200001 files
[15:48:51] I really hope it is not opening more than 200k XD
[15:49:47] yeah, but that is the mysql process
[15:49:54] which is what I meant before
[15:50:04] xtrabackup is actually ok
[15:50:15] it is the prepare on dbprov* that might have issues
[15:50:19] (just a guess)
[15:50:30] the instance that is started up to run --prepare
[15:51:27] since they re-arched xtrabackup I stopped looking at its internals
[15:51:54] I think strace might help
[15:52:03] another option is that the upgrade to 10.1.43 was relatively recent
[15:52:10] it could have changed something
[15:52:48] marostegui: yes, if I can get to strace one where it stalls :-D
[15:52:51] https://mariadb.com/kb/en/mariadb-10143-changelog/
[15:53:36] the main issue is not that it fails, it is that it sometimes works, sometimes stalls
[15:54:21] sorry, don't want to distract you
[15:54:28] please ignore me on this ranting
[15:55:19] maybe we can just strace by default until we get it to fail again
[15:55:27] and then see if there's something interesting there
[15:55:52] also enable verbose logging
[17:00:07] 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10wiki_willy) @Marostegui - we have a few spare BBUs in the process of being shipped onsite, one of them for T244958, which should be arriving early next week. You can just shoot open a dc-ops task with us, and...
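A sketch of the strace idea floated above, attached to a stalled --prepare; the pgrep pattern is an assumption, since the log does not name the exact binary (xtrabackup vs. mariabackup), and it assumes a single matching process:

```
# Follow forks (-f), timestamp each syscall (-tt), record its duration
# (-T) and write everything to a file for later inspection.
sudo strace -f -tt -T -o /tmp/prepare.strace \
  -p "$(pgrep -f -- '--prepare')"
```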
[17:47:21] checking stats, I think my serial data check is taking significant resources: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All&from=1582123613684&to=1582145213685&fullscreen&panelId=8
[17:47:31] *is NOT taking
[18:19:51] 10DBA, 10Expiring-Watchlist-Items, 10Community-Tech (Kanban-Q3-2019-20), 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.19; 2020-02-11): Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 (10Mooeypoo) 05Open→03Resolved T...
[19:03:18] marostegui: Hey, I have some questions about s8. How much read should be considered unhealthy? I know it's too general. Let me give you some context: we are gradually increasing the read load on the new term store. When it was suddenly moved, it brought down everything, but there are so many moving parts (like several performance improvements we recently deployed) that we can't say whether we need to do more or not. I want to gradually increase it
[19:03:45] but stop and do more once we see some red flags in s8
[19:12:28] there's one rather low-hanging fruit left to improve but we'll get to it if needed
[20:31:50] 10DBA, 10Data-Services, 10MediaWiki-General, 10Security, 10Security Related: Make (redacted) log_search table available on Wiki Replicas - https://phabricator.wikimedia.org/T85756 (10bd808)
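On the closing s8 question, a hedged note rather than an answer from the log: absolute read counts mean little on their own, so the usual approach is to compare a replica's counters and graphs against their own baseline while ramping up. Standard status variables, not specific thresholds from the conversation:

```
-- Sample on an s8 replica before and during the ramp-up; sustained
-- growth in Threads_running is usually an earlier red flag than raw
-- read volume.
SHOW GLOBAL STATUS LIKE 'Com_select';
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Handler_read%';
```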