[09:45:04] regarding auto schema ( https://phabricator.wikimedia.org/T410508 ) I tested past commits from the beginning of the new repo and I'm seeing the same failure, see https://www.irccloud.com/pastebin/LrmYgp4t/ - I suspect we might be conflating 2 different bugs, one fixed in https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/15 and the other one not
[10:20:21] federico3: so how do you want to proceed? trying to fix both on the same commit? or separate ones?
[10:22:19] The one in the MR should be fixed, but the other one appears to be present before Tue Sep 5 '23 and I don't know when it was introduced. Do we have a commitish where the bug is not triggered?
[10:22:42] I don't know
[10:23:02] Chasing it with the current HEAD may be easier than chasing that, I guess?
[10:23:18] You know way more about auto schema than I do :)
[10:26:19] I'm not familiar enough with the code changes before Tue Sep 5 '23 to check out even older commits and run them on production - it could be harmful. Amir1 do you have a past commit where the bug was not happening? I can share the exact steps to reproduce what I'm seeing
[10:30:58] federico3: Can you trace it with the current version you are using? (the one that has my optimize table, which is harmless)
[10:34:12] knowing when it first appeared would help, and I'm quite confused about why I haven't run into the bug myself with all the other schema changes, but I can try stepping through with pdb
[10:34:51] Yeah, but I am not sure it will be faster to look through all the commits to find the bug than to check the current code and look for it there
[11:07:56] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-ethtool-exporter.service on backup2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:37] federico3: any idea what can be happening here? https://phabricator.wikimedia.org/P85844
[11:20:20] odd, looking...
[11:20:30] thanks!
[11:23:53] ah, did we perhaps pass the FQDN into instance.get?
[11:24:02] don't know :)
[11:24:10] most likely; try running the cookbook with the hostname only instead of the FQDN
[11:25:14] federico3: yep, that worked!
[11:25:15] thanks!
[11:25:38] federico3: Which means https://phabricator.wikimedia.org/T391581 isn't resolved?
[11:25:52] pushing the fix in a second
[11:25:56] thanks!
[11:32:36] d'you mind testing it if you need to do pooling anyways? test-cookbook -c 1212114 ...
[11:32:58] sure
[11:33:00] give me a sec
[11:34:39] yep, it is working
[11:53:32] marostegui: I'm stepping through auto schema and still not seeing an obvious error in the logic. I can see "user_iiid" in db.run_sql("desc user;") returning False on ruwiki on db1187 after OPTIMIZE TABLE user ran with the error "Table does not support optimize, doing recreate + analyze instead", and from the point of view of auto schema this is a failed check
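A minimal sketch of the check quoted above, for context - the names come from the messages, but the assumption that run_sql() returns the statement output as searchable text is mine, not the real auto_schema API. As the discussion below establishes, the same check runs both before and after the change:

    # Hedged sketch, not actual auto_schema code: column presence is
    # tested as a plain substring match on the output of DESC.
    def check(db):
        # "user_iiid" is not a real column, so this returns False both
        # before and after the schema change runs.
        return "user_iiid" in db.run_sql("desc user;")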
[11:54:09] mmmm it shouldn't, but let me change the command
[11:54:15] you are using the one in my /home right?
[11:55:41] I copied it from there and added logging: https://phabricator.wikimedia.org/P85854
[11:55:55] ok, I will edit that one, give me a second
[11:57:03] federico3: done (also changed the table, to pick a smaller one so you don't have to wait much for each test)
[11:57:15] ah wait, I need to do it again, forgot "sudo" XD
[11:57:43] federico3: done
[11:57:57] wdym?
[11:58:22] federico3: I changed the command so it doesn't give you: "Table does not support optimize, doing recreate + analyze instead"
[11:59:36] which file can I grab? /home/marostegui/auto_schema_current/auto_schema/run_schema_change.py, /home/marostegui/git/software/auto_schema/run_schema_change.py, or /home/marostegui/git/auto_schema/run_schema_change.py?
[11:59:38] federico3: The reason I added user_iiid is so the script always sees that the schema change needs to be done, otherwise it wouldn't go
[11:59:55] federico3: Didn't you tell me you were using https://phabricator.wikimedia.org/P85854?
[11:59:58] I edited that one
[12:00:09] ah you edited the paste? ok
[12:00:40] federico3: However, the "user_iiid" in db.run_sql("desc user;") returning False... is that performed AFTER the schema change has been done?
[12:00:49] Because if that's the case, then it will always fail
[12:01:49] aha, that's the problem. check() is done before making the change to determine if the change is needed, then the change is done, then it is run again to ensure that the change was successful.
[12:02:03] then that's the problem, because that check will never work
[12:02:09] yup
[12:02:31] ok, no worries, I have 2 schema changes ready, and I will check out the latest HEAD and test with them
[12:02:39] unfortunately there's no way to test the change in advance
[12:06:20] mmmm, but then how can we do https://phabricator.wikimedia.org/T299441 ?
[12:06:37] I am going to test the schema changes I have ready for live, with the latest HEAD, and see if they work
[12:07:09] federico3: can you review https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/52 and https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/53
[12:29:55] marostegui: in both MRs the comment above the command looks unrelated.
[12:30:14] federico3: what?
[12:30:24] ah
[12:30:53] Easy fix!
[13:02:18] federico3: on "I'm not familiar enough with the code changes before Tue Sep 5 '23 to check out even older commits and run them on production - it could be harmful. Amir1 do you have a past commit where the bug was not happening? I can share the exact steps to reproduce what I'm seeing": the last commit before your first commit should be fine.
[13:02:45] federico3: https://phabricator.wikimedia.org/T410508#11412753
[13:13:52] marostegui: that alter on db1165 was too fast
[13:14:13] did I mess up something?
[13:15:43] Amir1: not really, in 10.11 it should be an online DDL operation if possible
[13:17:52] ah, that means we probably need to optimize revision on them later
[13:17:58] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1165&var-datasource=000000026&var-cluster=mysql&viewPanel=panel-28&from=now-6h&to=now&timezone=utc
[13:18:01] no change
[13:18:38] semi-related, on s2 the optimize of all tables has cleaned up 400GB https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1197&var-datasource=000000026&var-cluster=mysql&viewPanel=panel-28&from=now-24h&to=now&timezone=utc
[13:18:59] let me optimize on db1165
[13:19:13] <3
[13:19:21] maybe add it after the alter? :D
[13:19:33] we can remove it when doing the dc masters
[13:19:46] yeah, let's see how it goes on db1165
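A hedged sketch of the "add it after the alter" idea - the column name and change-file shape are assumptions, not the actual change from the MRs above. The point: an online/instant DROP COLUMN frees no disk space by itself, so running OPTIMIZE TABLE afterwards rebuilds the table and reclaims it.

    # Hypothetical sketch only; rev_example_col is a made-up column.
    command = (
        "ALTER TABLE revision DROP COLUMN rev_example_col; "  # online/instant DDL, no rebuild
        "OPTIMIZE TABLE revision;"  # recreate + analyze: reclaims the freed space
    )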
[13:36:51] Amir1: revision on ruwiki went from 43GB to 29GB, but of course there may be more to the optimization than just the column drop
[13:36:57] I will add the optimize to the schema change anyway
[13:37:00] it won't hurt
[13:37:43] it's very likely mostly the column drop. revision, unlike the *links tables, is basically append-only
[13:38:28] cool, just added it and restarted the change
[13:39:43] btw I am doing https://phabricator.wikimedia.org/T410531 on s4 eqiad and codfw in parallel, otherwise it will take ages given how many hosts s4 has at the moment
[13:39:58] Amir1: for your optimizes, are you close to finishing with s2? what would be next?
[13:39:59] marostegui: want to try https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/42 ?
[13:40:08] I added the docs to the wiki
[13:40:25] it's not merged, no?
[13:41:31] federico3: did you see my message earlier with the new failure paste?
[13:42:10] yes, it looks like a different issue and I'm looking into it
[13:42:26] federico3: ok thanks, please ack next time, because I wasn't sure if you did or not
[13:42:28] marostegui: s2 optimize is eqiad-only, so you can go ahead with codfw. It's going to take a week to finish eqiad I think
[13:42:50] Amir1: No, I am not ready yet for s2, just asking in case I am during the weekend
[13:43:40] Amir1: I will do s1 once finished with s4
[13:43:52] hopefully tomorrow, so I can leave s1 running during the weekend
[13:43:55] cool. Go for it. Drop all the things
[13:48:02] marostegui: you are referring to https://phabricator.wikimedia.org/T410508#11412753 ?
[13:48:12] yep!
[14:08:52] marostegui: BTW is there a specific reason why you use run_schema_change.py? I updated the usage of PYTHONPATH to fix it in https://wikitech.wikimedia.org/wiki/Auto_schema#Running
[14:10:15] federico3: because I thought that's the way we normally run it, it's even in the doc
[14:10:20] python3 run_schema_change.py --run --dc-masters --dc $DC
[14:11:21] you can just run the scripts directly as in the example above if you want, I find it simpler without overriding the file contents
[14:11:41] Thanks!
[14:13:28] I can't see anything wrong with check() in dry-run mode, I'll have to run 2025/drop_ar_sha1_T411163.py in pdb and it will depool a host, let me know if/when I can run it
[14:15:09] mmmm can you specify just a replica? it would be: db1187
[14:15:30] so it is not run everywhere, but just there
[14:17:17] yep, can I run it now?
[14:17:44] yep, with that host
[14:18:21] ooh that's interesting
[14:19:24] https://phabricator.wikimedia.org/P85890 with and without setting db1187
[14:19:53] it is not s7, it is s6
[14:20:01] sudo PYTHONPATH=../auto_schema python3 2025/drop_ar_sha1_T411163.py --check --section s8 --dc eqiad
[14:20:01] ^
[14:20:22] that's a new glitch I wasn't aware of
[14:42:02] marostegui: did you always set replicas = ["db1187"] ? I suspect the bug happens only with replicas = None
[14:42:56] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-ethtool-exporter.service on backup2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:10] the issue is with get_columns() existing on both Db and Host classes and expecting different params
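A hedged reconstruction of the clash just described - the class and method names come from the message above, but the signatures are assumptions, not the real auto_schema code:

    # Sketch of the problem: the same method name on two classes with
    # incompatible signatures. Code that holds a mix of Db and Host
    # objects and calls obj.get_columns(...) with a single calling
    # convention will raise TypeError for one of the two.
    class Db:
        def get_columns(self, table):
            ...

    class Host:
        def get_columns(self, database, table):
            ...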
[14:51:17] federico3: no, I never do it. It had to be done now so you don't deploy the change everywhere
[14:51:54] It's always None
[14:52:04] the bug is triggered only when we run with replicas=None and --run, so I cannot reproduce it when setting replicas = [...]
[14:52:16] host and db differ depending on whether all_dbs is set to True or False, the replica set won't make a difference
[14:52:21] and IIRC we had a workaround in check()
[14:52:25] at least it shouldn't
[14:54:47] we have a bunch of past changes with all_dbs set to True, replicas=None, and get_columns() worked
[14:56:44] 2025/drop_ar_sha1_T411163.py ran successfully for me with all_dbs=True and replicas set to db1187
[14:57:11] marostegui: was all_dbs set to False during https://phabricator.wikimedia.org/T410508#11412753 ?
[14:57:58] I'll add a safety check across all_dbs and replicas anyways
[15:43:59] federico3: always set to false yes
[15:45:11] ah that's probably it
[15:47:05] marostegui: I used the file from https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/54/diffs#4802acd12a57178f9289173da9550cca5c92f031_0_14 where it's True - can you run the schema change with all_dbs set to True or is it harmful? A first improvement would be to throw an error early with a clear message; the next step would be fixing the root cause
[16:19:29] Yes sorry, it is set to true
[16:19:38] Whatever is in the file is how it was run
[16:20:41] marostegui: can I grab the file you are currently using from your home to reproduce the issue?
[16:21:14] Yes, I pasted it there so you can use it
[16:21:20] maybe there are some minor things that got overlooked
[16:37:36] marostegui: this is what I'm getting: https://phabricator.wikimedia.org/P85890 - I suspect I have to use replicas=None to reproduce it
[16:42:05] I'll try to run it on the testbed
[16:44:30] federico3: Yeah, because if not, you'd deploy the change to production and we don't want that :)
[16:44:42] You can also copy the table definition to run it fully on the test hosts
[17:18:32] I can test it with a "noop" change and simulate running an alter table
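A hedged sketch of that "noop" idea - the change-file shape and names are assumptions. Changing only the table comment is a metadata-only ALTER, and a check() that looks for the comment is False before the change and True after, so the full check -> alter -> check flow can be exercised without touching a real schema:

    # Hypothetical noop-style test change, not a real change file; it
    # assumes run_sql() returns the statement output as searchable text.
    command = "ALTER TABLE user COMMENT = 'auto_schema noop test';"  # metadata-only

    def check(db):
        # False before the ALTER, True after, since SHOW CREATE TABLE
        # includes the table comment.
        return "auto_schema noop test" in db.run_sql("show create table user;")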