[08:37:30] Changing es4 master in eqiad, should be a noop really
[08:38:06] hi, marostegui
[08:38:16] Hi jynus welcome back!
[08:38:22] happy new year
[08:38:27] Likewise
[08:38:31] I thought I wouldn't see you until feb!
[08:38:36] Why so?
[08:38:52] I think you mentioned it, so I thought you were on vacation :-D
[08:39:08] Ah, I said in person :)
[08:39:16] gotcha
[09:20:50] would someone be open for a meeting with me to get important updates of the last month? Doesn't have to be today, ofc
[09:21:20] jynus: I can probably do it tomorrow
[09:22:02] ok, let's talk tomorrow
[09:22:35] Not that I know that much either, as I was on the MW offsite and then on vacation and then the break, but I have some recaps
[09:23:02] sure, just the high level thingies
[09:23:21] or if there is something you want me to prioritize (broken stuff, etc)
[09:23:23] jynus: By the way, dumps for s1 are disabled
[09:23:28] XML dumps that is
[09:23:46] 👍
[09:24:26] jynus: And if you could plan for db2239 https://phabricator.wikimedia.org/T373579
[09:24:28] That'd be helpful
[09:24:41] I don't really need updates about db infra, just blockers or organizational changes, etc
[09:24:53] Sounds good
[09:25:28] obviously, dbs that I oversee would ofc go to the top, as those are blockers for you :-D
[09:25:38] thanks, we can talk tomorrow more in detail
[09:25:42] Sounds good
[09:31:38] jynus: next OOH-stuff update will be on 21 Jan (you have a mail about this somewhere in the backlog, but that's the TLDR)
[09:31:51] OOH-stuff ?
[09:32:07] out-of-hours cover
[09:32:10] ah, I see
[09:32:11] saw it
[09:32:14] thanks, Emperor
[09:32:19] sorry, me and my endless abbreviations
[09:32:20] and for your work
[09:32:44] hope you had a nice end of year and a good start of this one!
[09:33:02] If you ignore Swift, yes :lolsob:
[09:33:13] ha ha, that applies whenever you read this
[09:33:22] this == your last line
[09:33:26] XD
[12:43:53] Welcome back jynus 👋
[13:01:40] 👋
[13:32:49] marostegui: do you know why pc1017 is replica of pc2017 instead of pc1014
[13:32:55] https://orchestrator.wikimedia.org/web/cluster/alias/pc5
[13:33:35] surprisingly, and despite some things I have to take care of (not necessarily on my side), backups held up relatively solidly
[13:33:36] Amir1: That was the status quo when I got back from sabbatical, I simply fixed them and didn't touch dbctl config
[13:33:50] Amir1: I have plans to put everything back to normal whenever I have time
[13:34:20] we can just make them pc6 now-ish. Let me check how we can depool a section
[13:44:42] Amir1: You mean setting up the new section?
[13:45:05] yeah, moving spares to the ring
[13:45:08] Can you create a task for it?
[13:45:11] I can take care of that
[13:45:26] But what happens if we lose a host while that transition happens?
[13:45:33] I think I can do that in a day, but just making sure
[13:45:41] yeah, I'll make it a subticket of the ring ticket
[13:45:55] we can depool that whole section
[13:46:23] I need to figure out the depooling of full sections before we can create pc6 and pc7
[13:46:30] Yeah
[13:46:43] I won't touch the current spares then, if we are just going to create new sections
[13:46:44] I'm hopeful I can get it done today. I'll do a test run of depooling pc5
[13:46:55] yeah
[13:46:56] It doesn't make sense to move them around twice
[13:48:22] actually, we need to reboot pc hosts anyway. I'll depool pc5, reboot the hosts, and pool them back in. Does that sound good to you?
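For context, taking an individual parsercache host in and out of rotation is normally driven through dbctl. A minimal sketch, assuming pc1017 (seen above in pc5) as the target and the standard dbctl instance/config subcommands; this is not the exact procedure used here, and the commit messages are illustrative:

    # take one pc5 host out of rotation (repeat per host in the section)
    dbctl instance pc1017 depool
    dbctl config commit -m "Depool pc1017 for reboot and MariaDB upgrade"
    # ... reboot and upgrade the host ...
    dbctl instance pc1017 pool
    dbctl config commit -m "Repool pc1017 after reboot"
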
[13:48:38] once we confirm this works, then we switch to setting up pc6 and pc7
[13:48:43] it does, please upgrade mariadb
[13:48:57] for a bribe, sure
[13:48:58] Once you create the ticket for the new sections, assign it to me
[13:49:06] Thanks <3
[13:50:04] but right now, I need to figure out how to delete 300M rows in pc3
[13:50:26] oh I know, I cry first
[13:51:07] Just slowly :)
[13:55:59] check pt-archiver, should be installed almost everywhere
[13:56:07] part of pt-toolkit
[15:22:43] sqlite question: given 'CREATE INDEX ix_object_deleted_name ON object (deleted, name);' why would the integrity check complain 'row 423322 missing from index ix_object_deleted_name' when the object table has no row 423322?
[15:23:51] structure?
[15:24:12] sorry, I don't understand the question
[15:24:38] what's the table create statement/restrictions?
[15:24:49] before the index
[15:24:53] OIC
[15:25:50] jynus: https://github.com/openstack/swift/blob/95b9e6e335878c8bc97088543a630cb9f67e7262/swift/container/backend.py#L560 and following matches what I see in the file
[15:27:10] [my corrupted container table gets two complaints from the integrity checker about rows missing from that index, but the other row does at least correspond to a row in the object table]
[15:32:03] I am rusty with sqlite, but I would suggest corruption. Try dropping the index and re-adding it, or repairing the table.
[15:34:37] jynus: this is apropos T383053; I tossed the container and recreated it (thumbor didn't melt), but I'm trying to understand why all 3 copies got corrupted the same way (hypothesis: by the same duff update) and thought that maybe seeing what the bad rows were might give me a hint. It may just be "we have no way of knowing", which is a bit sad, but increasingly looks like the case; I can indeed either reindex or use the .repair tool, but that's not why I'm interested
[15:34:38] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[15:35:28] if it happened the same way, I would bet on a bug
[15:36:23] because you don't copy those in binary form, right?
[15:37:12] sometimes what looks like corruption is just a bad error message for a constraint violation, but I am not fluent enough with sqlite to say for sure; it's been some time since I handled those
[15:39:30] the other guess is it could be related to the trigger
[15:40:23] maybe it has to be disabled if you did some operations manually (?)
[15:43:03] sometimes that requires deep investigation and more context, sorry I may not be too useful in a short amount of time
[15:47:14] No worries, I am resigned to this remaining a mystery, but it does smell funny
[15:47:26] some people say it happened to them when they ran out of space
[15:47:31] which I guess was not the case here?
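For reference, the integrity check quoted above and the drop-and-recreate approach jynus suggests look roughly like this in the sqlite3 shell; a minimal sketch against the Swift container schema, with the database filename left generic:

    sqlite3 container.db
    sqlite> PRAGMA integrity_check;   -- prints lines like 'row 423322 missing from index ix_object_deleted_name'
    sqlite> REINDEX ix_object_deleted_name;   -- rebuild just the affected index
    sqlite> -- or drop and recreate it explicitly:
    sqlite> DROP INDEX ix_object_deleted_name;
    sqlite> CREATE INDEX ix_object_deleted_name ON object (deleted, name);
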
[15:47:54] I don't think so, no
[15:48:18] some of our storage nodes are squeezed for container-db-space, but not these
[15:57:46] this is relatively recent, check if it could be related: https://github.com/rails/solid_queue/issues/324
[16:03:09] That's one of the "how to corrupt your sqlite database" things, yes :) I'd be quite surprised if swift had that sort of error (or that, if it did, we weren't tripping over it all the time)
[16:42:16] FIRING: [2x] SystemdUnitFailed: ferm.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:01] DCops working on that host
[16:48:14] alas, doesn't seem fixed.
[16:49:20] RESOLVED: [2x] SystemdUnitFailed: ferm.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:20] FIRING: [2x] SystemdUnitFailed: ferm.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:47:16] RESOLVED: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:16] FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:04] going to downtime that server since it's bust, cf T382707
[18:11:06] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
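Downtiming a host like this is typically done with the sre.hosts.downtime cookbook; a rough sketch, with the duration, reason text, and exact flags assumed for illustration rather than taken from this incident:

    # silence alerts for the broken backend while hardware is investigated (assumed invocation)
    sudo cookbook sre.hosts.downtime --hours 72 -r "Frequent disk resets, T382707" 'ms-be2075.codfw.wmnet'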