[08:35:16] s8 snapshots shrank today FYI: The previous backup had a size of 1.0 TB, a change larger than 5.0%.
[08:35:39] -7.7 % / -81 GB
[08:45:09] marostegui: schema change on db1159 in s5 errored out trying to drop the il_to index in pplwiki
[08:45:26] :(
[08:45:28] fixing
[08:45:39] I checked all the hosts, that's strange
[08:46:43] it errored out with https://phabricator.wikimedia.org/P91549 on the only run today on s5-eqiad
[08:46:43] fixed
[08:46:47] thanks
[08:46:55] I will check the hosts again
[08:47:23] rerunning
[08:47:42] oh it just errored out again
[08:47:53] exact same error
[08:48:03] db1159?
[08:48:11] yes
[08:48:16] and the same wiki
[08:48:30] it makes sense, as I fixed it, which means I applied the schema change too
[08:48:51] https://phabricator.wikimedia.org/P91549
[08:49:28] see above
[08:49:29] the check() is not detecting the change as fully done, since it's trying to apply the change again
[08:50:11] ERROR 1091 (42000) at line 1: Can't DROP INDEX `il_to`; check that it exists
[08:50:22] which of course doesn't exist because I dropped it
[08:51:11] the check function looks for the il_to column in the imagelinks table. If the column is present it will try to do the change
[08:51:48] if the index is dropped but the column is still there it will error out
[08:52:19] can you share the queries you ran for the fix? Just dropping the index?
[08:53:19] federico3: set session sql_log_bin=0;stop slave; DROP INDEX `primary` ON /*_*/imagelinks; ALTER TABLE /*_*/imagelinks CHANGE il_target_id il_target_id BIGINT UNSIGNED NOT NULL; alter table imagelinks drop column il_to; ALTER TABLE /*_*/imagelinks ADD PRIMARY KEY (il_from, il_target_id); start slave;
[08:53:40] https://phabricator.wikimedia.org/P91549#371408
[08:53:45] is this how the table needs to look?
[08:56:25] that's puzzling, clearly the column is dropped, yet the script does:
[08:56:25] def check(db):
[08:56:25] return "il_to" not in db.get_columns("imagelinks")
[08:56:51] (and the fix has been working ok on other hosts)
[08:57:10] is that how the table needs to look?
[09:00:01] I don't know how we want the other indexes to be, I can only tell you that we want to drop the il_to and il_backlinks_namespace indexes in the schema change and drop the il_to column.
[09:00:55] (ideally it would be really good to compare the CREATE TABLE before and after the change as part of auto_schema as a safety check, but that's a discussion for the future)
[09:04:02] FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:04] it feels like this is a different bug than what we saw before, maybe db.get_columns("imagelinks") is getting confused by something else
[09:04:05] maybe you can check hosts that never failed and check if the tables are the same?
[09:04:12] or check the table catalog
[09:04:46] I was also thinking of adding a print on that get_columns call and in --check mode
[09:06:09] the CREATE TABLE is identical to other dbs
[09:07:03] Does the script still detect it as not done?
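A minimal sketch of the debug print discussed at [09:04:46], assuming auto_schema's check hook receives a db object exposing get_columns() exactly as in the pasted snippet; everything beyond that one call is illustrative, not the script's real code:

# Hypothetical debug variant of the pasted check(); only
# db.get_columns("imagelinks") is taken from the paste above.
def check(db):
    columns = db.get_columns("imagelinks")
    done = "il_to" not in columns
    # Print exactly what the script sees, to rule out stale or cached reads.
    print(f"imagelinks columns: {sorted(columns)} -> done={done}")
    return done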
[09:07:51] well, it tries to do the change, so either check() is returning False or there's a bug elsewhere, I'm adding some prints now
[09:12:01] and now "Already applied on pplwiki in db1159, skipping"
[09:12:34] I'd suggest you keep running it and let's see if we run into this again
[09:14:02] RESOLVED: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:03] now the check run returns Result: {"already done in all dbs": ["db1159"], "needed in some dbs": ["db1161", "db1185", "db1200", "db1207", "db1210", "db1216:3315", "db1245:3315", "dbstore1008:3315"]}
[09:15:03] ...and I did not make any change to the db or to the script's logic, other than adding a print
[09:16:03] Can you run the --check and see if it is really true?
[09:16:32] I git-reset the change and ran the script with --check and it shows the same: "already done in all dbs": ["db1159"],
[09:17:40] Well it makes sense, db1159 is probably the first host it ran on and the script never ran on the rest of the hosts?
[09:19:16] A quick for loop across s5 shows that all hosts have il_to except db1161 and db1159, so all the other hosts need it?
[09:19:24] I'm talking about db1159 and pplwiki specifically: I only ran --check and now it's showing as done, but before it was not showing as done
[09:19:55] [11:12:34] I'd suggest you keep running it and let's see if we run into this again
[09:19:55] aka it feels like something in get_columns was "cached somewhere" or not committed somehow
[09:20:34] yes, I can keep it running but I'm worried about this unexpected behavior
[09:21:04] The only thing I can think of is that you checked it while I was still fixing it
[09:21:12] This whole thing with pplwiki is weird
[09:23:01] (the timing of my checks is in https://phabricator.wikimedia.org/P91549 )
[09:23:21] Yeah, I don't know mine exactly
[09:23:23] and now it errored out on a different db, updating the paste
[09:23:57] pasted https://phabricator.wikimedia.org/P91549#371425
[09:24:16] federico3: But that's a different host
[09:24:21] yes
[09:24:38] did you check the status of the table? maybe it needs my fix above
[09:26:09] https://phabricator.wikimedia.org/P91549#371426 indeed
[09:26:20] :)
[09:26:47] I have no idea what happened with that wiki upon its creation, but it is strange, it looks like it was created with an old schema
[09:26:57] anyway, I'd suggest you check it across all s5 hosts
[09:27:34] marostegui> Yeah, I don't know mine exactly <--- is your fix running in batch in the background? Is it meant to run on all hosts?
[09:27:44] no, I only fixed db1159
[09:33:34] @marostegui : here's another https://phabricator.wikimedia.org/P91549#371431
[09:34:00] yep, you may need to fix them all
[09:36:45] do you want me to run the query you pasted above?
[09:36:55] yes, that is the fix
[09:37:20] https://phabricator.wikimedia.org/P91549#371432 <-- pasted here
[09:37:27] ok
[09:38:25] I added use pplwiki, good to go?
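For reference, a rough sketch of the kind of "quick for" over s5 hosts mentioned at [09:19:16], assuming pymysql and credentials in ~/.my.cnf; the host list is abridged from the --check output above and is not the full s5 topology:

import pymysql

# Abridged from the "needed in some dbs" list above; :3315 instances omitted.
S5_HOSTS = ["db1161", "db1185", "db1200", "db1207", "db1210"]

for host in S5_HOSTS:
    conn = pymysql.connect(host=host, database="pplwiki", read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            # A returned row means the il_to column (i.e. the old schema) is still there.
            cur.execute("SHOW COLUMNS FROM imagelinks LIKE 'il_to'")
            print(f"{host}: il_to still present = {cur.fetchone() is not None}")
    finally:
        conn.close()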
[09:38:32] yes
[09:47:16] federico3: there are 3 depooled hosts in s5 in eqiad, which means we only have 3 slaves available, please repool the ones that can be repooled asap
[09:48:47] I'm pooling in db1159, and I'll pool in db1161 (the other one that's done) in parallel; it succeeded minutes ago
[09:49:01] thanks
[09:49:59] the fix worked on it, but db1185 errored out, so I'll keep fixing them as the script runs
[09:51:08] You probably want to fix all of them before running the script, there is no need to depool for fixing them
[09:51:26] for the sanitarium masters, do not use set session sql_log_bin=0
[10:38:48] federico3: ^ack?
[10:40:16] db1161 is feeding sanitarium and was already fixed manually before (but with sql_log_bin=0); in the meantime I repooled the hosts, ran the command manually on the other hosts, and restarted the script a moment ago
[10:40:44] can we do anything on 1154 maybe?
[10:40:45] federico3: then you'd need to check the sanitarium host and fix it manually too, with replication enabled
[10:40:56] so the fix reaches the wikireplicas
[10:42:12] yes, hence my question on 1154 - can it be done there, or does it need some other tweaks?
[10:42:38] no, just do it there for the affected wiki but without set session sql_log_bin=0
[10:50:26] ok, it's done on sanitarium and its replicas
[10:50:37] meanwhile something is odd with db1210
[10:52:11] did it just crash?
[10:52:27] no, it is being reimaged
[10:53:18] federico3: you may want to check https://wikitech.wikimedia.org/wiki/Server_Admin_Log for those things
[10:54:57] I saw the ssh key changing and I looked at https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance ... well, having T423841 would make it a bit easier perhaps
[10:54:58] T423841: Add section-level locking to scripts and cookbooks - https://phabricator.wikimedia.org/T423841
[10:55:52] Yeah, I think reimages do not make it to the database maintenance map
[10:55:58] (automatically, that is)
[12:39:02] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:02] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:34] Hi folks, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277553 please? Remove 3 drained eqiad backends from the rings so they can be re-imaged to per-rack VLANs
[13:46:32] done
[13:47:34] TY :)
[14:34:47] Emperor: ms-be1070 is at 85% https://grafana.wikimedia.org/d/000000378/ladsgroup-test?forceLogin=true&from=now-7d&orgId=1&to=now&timezone=utc&viewPanel=panel-29 is it okay if I vacuum it?
[14:35:02] marostegui: do you mind my rename of https://phabricator.wikimedia.org/T422365 for clarity, to separate it from other tasks that could be confusing?
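Tying together the fix pasted at [08:53:19] and the sanitarium caveat at [09:51:26], a hedged sketch of how the per-host statement list could be assembled; build_fix() and the sanitarium_master flag are illustrative names, not part of auto_schema:

# SQL statements taken from the fix pasted earlier in this log, in the same order.
FIX_STATEMENTS = [
    "STOP SLAVE",
    "DROP INDEX `primary` ON /*_*/imagelinks",
    "ALTER TABLE /*_*/imagelinks CHANGE il_target_id il_target_id BIGINT UNSIGNED NOT NULL",
    "ALTER TABLE /*_*/imagelinks DROP COLUMN il_to",
    "ALTER TABLE /*_*/imagelinks ADD PRIMARY KEY (il_from, il_target_id)",
    "START SLAVE",
]

def build_fix(sanitarium_master: bool) -> list[str]:
    # On a sanitarium master the fix must replicate down to the wikireplicas,
    # so the binlog stays enabled; elsewhere the change is kept local to the host.
    prefix = [] if sanitarium_master else ["SET SESSION sql_log_bin=0"]
    return ["USE pplwiki"] + prefix + FIX_STATEMENTS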
[14:35:37] Yep, will do
[14:35:38] I was going to create a task for backup sources
[14:35:58] I did, just asking if you are ok with it or want to polish it further
[14:36:31] Ah, I think it's better if you create one too, I'll rename that one right now to say: production
[14:36:52] sorry, I'm not sure you are understanding me
[14:37:01] I will be creating one for backup sources
[14:37:17] but I just renamed it from "Migration to Debian Trixie"
[14:37:26] https://phabricator.wikimedia.org/T422365#11861425
[14:37:33] to be more specific
[14:37:39] and I am going to create a subtask
[14:37:56] That task's description says it is production related
[14:38:03] But yeah, it can be a general one
[14:38:07] ah
[14:38:10] Like a meta task
[14:38:14] well, then whatever you want
[14:38:23] what I wanted was the database parts
[14:38:32] to differentiate it from other migrations of backups or other stuff
[14:38:52] so when I get an email it is clear what it is about
[14:39:01] "Migration to Debian Trixie" can be confusing: it is a task for database-related things, but it could be read as covering the whole infra (even other teams')
[14:39:06] yes
[14:39:17] that's why I wanted to update it
[14:39:24] I included "production"
[14:39:26] but feel free to polish it further
[14:39:28] thanks
[14:39:39] So it's clearer for us, I think
[14:40:07] all good for me
[14:40:21] so, do you want me to subtask the backup sources there, or outside of that
[14:40:23] ?
[14:40:39] Up to you, whatever you find easier
[14:40:43] I don't mind either way
[14:40:59] ok, I will add it there, as that way you can resolve the other ones ignoring those
[14:41:05] but I will handle those
[14:41:14] Thanks!
[14:47:55] FYI https://phabricator.wikimedia.org/T424541
[14:48:12] Great!
[14:57:16] Amir1: go ahead
[14:58:10] https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExYzUyemppaGRpM215M2d4M3AyMXQwb3RseTdzem80ZGRuMW1lbTV4dCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/0eVM7GVxTDDKxn7OyX/giphy.gif
[14:58:45] XD
[15:50:50] Emperor: sretest2010 is still parked, I am waiting for SM's new firmware (didn't forget it)
[15:51:24] federico3: the schema change on db1210 and db2171 hasn't finished yet?
[15:52:08] They've been depooled for a bit and I need them both back
[15:52:10] Especially db1210
[15:52:12] db2171 is pooling in at the moment, but I'll rerun the check
[15:52:43] elukey: ack, thanks
[15:52:55] federico3: cool, I am more interested in db1210 too
[15:53:04] ok, pooling it in now
[15:53:09] thanks
[15:53:19] but I'll have to rerun the check
[15:53:26] you can run the check now
[15:53:29] I can wait a few mins
[15:53:30] do you need db1210 specifically, or just all hosts?
[15:53:44] I need db1210, as it is the candidate
[15:53:52] and I need to create a switchover task, so I need it to be pooled
[16:14:28] federico3: so, ETA to repool db1210?
[16:15:00] it's pooled in
[16:15:13] federico3: ah ok, I'd have appreciated a heads up :)
[16:15:14] thanks
[16:16:31] ah, looks like my previous message got lost
[16:16:35] "it's pooling in fast, should be ok to run the switchover generator"
[16:16:44] yeah, that never arrived, thanks