[00:54:44] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) p:Triage→High
[00:56:27] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (bd808)
[06:19:42] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Marostegui) That can be a consequence and not really the cause. If the server is too overloaded, it might not...
[08:20:26] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Surlycyborg) Hi, thanks for filing this. Here's some background I have: - This is for my tool Citation Hunt...
[09:29:11] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Marostegui) I don't know what was the situation yesterday night, as I wasn't present during the troubleshooti...
[10:41:12] 7TB/11TB are used on the rw external storage servers
[10:44:49] that was more or less within the original plans, 4-5 years of usage
[10:45:00] 2019-2020 expansion
[10:45:06] nice
[10:45:27] I think we should get es4 servers by then
[10:45:42] although maybe we should create a new cluster now
[11:18:14] https://phabricator.wikimedia.org/T2950
[11:19:01] recovered from dumps in ~1h
[12:02:58] ?
[12:06:33] DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (jcrespo)
[12:06:59] DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (jcrespo) a:jcrespo
[12:43:39] marostegui: how long did you wait to declare a kernel load failed?
[12:45:12] 20-30 seconds
[12:45:18] it would reboot itself anyways
[12:45:40] yeah, I was wondering if you waited for the reboot or did it before
[12:45:52] it failed the second time on db2089
[12:45:54] no, I waited for it to reboot itself
[12:46:09] I would suggest you enable debug mode on the kernel
[12:46:45] what is the difference, I mean in terms of outcome when doing that?
[12:46:46] We are not sure whether it triggered something or not on the HW logs, but it doesn't harm to have it
[12:47:08] On db2085 it would sometimes get stuck on the kernel boot at CPU #2
[12:47:14] And it could be seen on the kernel trace
[12:47:23] as well as on the HW logs
[12:48:46] maybe the firmware does nothing, maybe it is a hw issue and the drain resets it?
[12:48:56] could it be a possibility?
[12:49:22] But db1106 and db2085 booted up 10 times with no issues
[12:49:22] DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (jcrespo) Rebooting db2089: ` PowerEdge R630 BIOS Version: 2.4.3 ` ` 1st reboot: OK 2nd reboot: FAIL 3rd reboot: FAIL 4th reboot: 5th reboot: 6th reboot: 7th reboot: 8th reboot: 9th reboot...
[12:49:36] You can also try to ask for a drain before the firmware update
[12:49:38] to rule that out
[12:49:40] marostegui: after being fully powered off
[12:50:43] So maybe ask for the drain before ^
[12:51:00] But on both db1106 and db2085 there were HW errors logged
[12:56:47] I booted once with debug, but got nothing strange on dmesg or ipmi
[12:57:05] In my case it showed up after a bunch of reboots
[12:57:06] should I search somewhere else?
[12:57:11] ok
[12:57:37] I guess after debug is enabled + it fails to boot
[13:02:06] in my case it logged the errors on the HW logs right after the automatic reboot
[13:08:19] and now I am not getting any failure to boot :-D
[13:08:36] of course XD
[13:38:44] db1106 and db2085 are seeing some cron spam from the auto restart that I don't see on any of the other hosts, it's odd. does anyone have an idea what might have caused those?
[13:38:47] e.g.:
[13:38:56] Showing one /org/freedesktop/systemd1/unit/cron_2eservice
[13:38:58] Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1/unit/cron_2eservice interface=org.freedesktop.DBus.Properties member=GetAll cookie=1 reply_cookie=0 error=n/a
[13:38:59] Got message type=method_return sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=n/a
[13:39:23] I can fix the auto restart script to silence that, but I'm puzzled why those hosts are affected
[13:39:40] I got those after restarting in debug mode
[13:39:42] And those are the ones we have been testing the reboots on (although not today)
[13:39:55] but I didn't use any script, I did it with ssh
[13:40:03] Yeah, same here
[13:40:19] I guess db1106 and db2085 are still running with debug mode
[13:40:23] As I don't recall removing it
[13:40:59] aahh, that might explain it, let me check
[13:41:05] I am only restarting db2089, though
[13:41:18] not touched 85 or 1106
[13:41:35] yep, they in fact still have debug enabled in cmdline
[13:41:43] ah
[13:41:55] so those should be restarted
[13:41:56] but
[13:42:04] Cron arrived today
[13:42:16] I can take care of db2089 and db1106, I need to work with them next week
[13:42:18] and that means it is mostly the script's fault?
[13:42:52] the script calls out to systemctl and that one uses dbus and when in debug mode dbus seems to log additional stuff
[13:42:54] as in, why should a kernel debug mode change what nagios outputs?
[13:43:01] ah, ok
[13:43:03] still
[13:43:14] should that be redirected to stdout?
[13:43:53] maybe, I need to verify to which fd dbus is logging
[13:44:01] I'll fix the grub.cfg on db1106 and db2085
[13:47:48] done (and when the new 4.9.144 kernel is installed, grub.cfg gets updated anyway)
[14:09:57] moritzm: after doing https://phabricator.wikimedia.org/T216240#4957077 I am a bit lost on what to do
[14:15:03] oh well, what a neverending story :-)
[14:15:21] are the firmware versions on 2089 identical to 2085 and 1106?
[14:15:39] (just to rule out that it's some odd variation in updates offered or so)
[14:15:39] 89 has not been upgraded
[14:16:05] ah, I misunderstood
[14:16:34] that's consistent with what Manuel and myself were seeing, only after the upgrade to the latest firmware, reboots were reliable again
[14:17:03] yeah, my point is after a while, every reboot succeeded
[14:19:03] and I am not certain the upgrade actually did something
[14:21:46] ok, on the 12th reboot it fails again
[14:26:01] IIRC Manuel did 8 successful reboots with 1106 and 2085 each
[14:26:31] I think it's totally unrelated to the kernel at this point and caused by whatever was buggy in their firmware
[14:27:03] when I had debug mode enabled it could sometimes crash very early, after maybe 1 second of output or so
[14:35:23] that is exactly the opposite of what is happening here
[14:35:38] not a single crash with it enabled, around 50% chance without it
[14:36:14] we need a statistical analysis to declare if it is significant!
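(Editor's sketch, not part of the channel log.) The question above, whether "no crash with debug enabled, about 50% without it" is a real effect or just luck, could be checked with a one-sided Fisher exact test on the reboot counts. The function name and the counts below are hypothetical, since the log does not give the exact number of boots in each mode; the real tallies from T216240 would have to be plugged in.

```python
from math import comb


def fisher_exact_one_sided(fail_a, ok_a, fail_b, ok_b):
    """One-sided Fisher exact test on a 2x2 table of boot outcomes.

    Group A: boots with kernel debug enabled; group B: boots without it.
    Returns the probability of seeing fail_a or fewer failures in group A
    if both groups actually share the same failure rate (row and column
    totals held fixed, i.e. a hypergeometric tail).
    """
    n_a = fail_a + ok_a
    n_b = fail_b + ok_b
    total_fail = fail_a + fail_b
    denom = comb(n_a + n_b, total_fail)
    # Smallest number of failures group A can have given the fixed totals.
    lowest = max(0, total_fail - n_b)
    p_value = 0.0
    for k in range(lowest, fail_a + 1):
        p_value += comb(n_a, k) * comb(n_b, total_fail - k) / denom
    return p_value


if __name__ == "__main__":
    # HYPOTHETICAL counts, for illustration only:
    # 6 boots with debug (0 failures), 12 boots without (6 failures).
    p = fisher_exact_one_sided(fail_a=0, ok_a=6, fail_b=6, ok_b=6)
    print(f"one-sided p-value: {p:.4f}")
```

With so few reboots per mode, a result like this would sit right around the usual 0.05 threshold, so more reboots in each mode would likely be needed before blaming or clearing the debug flag.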
[14:40:40] DBA, Operations, ops-codfw: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (jcrespo) a:jcrespo→Papaul @Papaul from next week, please also upgrade firmware/BIOS of db2089 (**only that one for now**). I will put it back to product...
[16:17:00] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) @Surlycyborg The database account for the tool was locked last night pending contact with you (and in...
[16:30:48] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) I believe I correctly enabled the DB account @Surlycyborg. Please change the code so that it does no...
[18:38:48] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Surlycyborg) OK, so this morning I'd already disabled the batch jobs that make the heaviest use of the databa...
[18:53:33] marostegui - ping - global rename? :)
[18:55:21] DBA, Serbian-Sites: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 (MarcoAurelio) Could I have my botflag at sr.wikinews removed, please? Until T122705 is resolved I won't be able to handle this, and at this point I think that the maintenance script path will be a...
[18:55:47] DBA, Wikimedia-Site-requests, Serbian-Sites, Wikimedia-maintenance-script-run: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 (MarcoAurelio)
[19:37:57] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) Ah if the password doesn't work I can reset the password again properly.
[19:43:58] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) Try now.
[19:46:00] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Bstorm) Additionally, I will say that the database might not work for you anyway today. It's in a terrible s...
[20:01:16] DBA, Data-Services, Toolforge, Tracking: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues - https://phabricator.wikimedia.org/T216213 (Surlycyborg) Thanks! Yes, `sql tools` does work now. Good luck with the rest of the recovery!