[05:19:56] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:12] ^ to be ignored, host will be decommissioned today
[05:36:02] @marostegui want to try the decom cookbook for it?
[05:36:21] yeah
[05:39:37] federico3: is it ready?
[05:39:44] I will run it later today, not now
[05:45:36] should be ready, pending another review
[05:45:58] @marostegui added few steps at the top and bottom of https://phabricator.wikimedia.org/T425622
[05:46:29] federico3: Not sure if in 20 seconds we are going to be able to check if it is all RO
[05:46:33] Give it 60 seconds so we can check
[05:46:39] ok
[05:46:50] federico3: How do you plan to check?
[05:47:10] by reviewing logstash and grafana
[05:47:29] federico3: cool, I'll try an edit on any of the s2 wikis
[05:48:11] we should see the write stop drop pretty clearly
[05:49:47] yep
[05:51:57] albeit I'm looking at past examples and not finding much
[06:00:22] started
[06:00:30] let's coordinate in -operations
[06:46:34] federico3: once you've done your schema change on old s2 eqiad master let me know, so I can reimage it
[06:46:58] ok
[07:07:53] marostegui: so we want the standby DC MariaDB to be left untouched essentially?
[07:08:00] yes
[07:08:06] ok
[07:35:53] @marostegui the schema change is done
[07:36:03] federico3: thanks, taking db1222 then
[10:23:27] is gitlab glitching again? "An error occurred while getting merge request counts"
[10:38:38] I see the "please refresh" messages, anything to do with the 404 on graphql requests?
[10:51:08] We're getting hammered with traffic
[10:51:23] See also #wikimedia-gitlab
[10:51:54] sobanski: ack, ty
[11:08:40] hello! preparing for https://phabricator.wikimedia.org/T426199 I was wondering if db2197 and es2041 were good to go?
I can take care of the others by running the cookbook
[11:09:13] federico3: you taking care of es2041? ^ (per our chat yesterday during the meeting)
[11:09:18] db2197 is jaime's
[11:09:56] looking
[11:12:56] I can confirm es2041 is the "fake-master" in the read-only es4 so it cannot be depooled with the cookbook
[11:13:08] you have to switch it yeah
[11:13:12] I can update it in dbctl
[11:13:26] it's read-only so it's just a reordering of the fields in dbctl
[11:13:49] yep
[11:14:25] XioNoX: at what time do you want me to switch it?
[11:14:52] I can do it right now in preparation anyways
[11:14:53] federico3: maintenance starts in 45min so unless it's time sensitive, now is fine
[11:14:57] cool, thx
[11:19:40] federico3: You can commit my dbctl change too
[11:20:05] just got it, pc1024 yes?
[11:20:19] @marostegui ?
[11:20:26] yep
[11:20:31] go for it
[11:20:31] done
[11:20:35] thanks
[11:40:02] federico3: want me to test the decomm cookbook? is it ready for testing?
[11:40:49] I see there are some comments ongoing in the review so just double checking
[11:57:46] I have stopped and downtimed stuff for T426199 (1 backup*, 1 db*), let me know f. if I missed something from the db*
[11:57:47] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199
[11:57:56] <3
[11:58:05] marostegui: nothing major, you can test it
[11:58:34] federico3: Excellent going to a couple of meetings in a row and will check later
[14:05:49] federico3: got this on the last part of the decomm: https://phabricator.wikimedia.org/P93015
[14:06:19] oopsie, yup it's missing a header
[14:06:29] should be fixed easily
[14:06:55] but that means that most of it ran ok?
[14:07:09] federico3, jynus, hosts can be repooled
[14:07:33] yeah, it failed at that step
[14:07:40] the previous things were ok
[14:11:14] thank you, XioNoX, doing
[14:13:47] @marostegui I can finish the last steps of the decommissioning
[14:14:20] federico3: I removed it from zarcillo and orch
[14:15:07] ok
[14:24:17] XioNoX: ok, repooling, thanks
[14:25:36] [16:24:50] <+logmsgbot> !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[14:25:36] [16:24:51] <+logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1014: Rack maintenance completed
[14:25:46] federico3: There are several issues there
[14:26:24] the host has mariadb down cause it will be decommissioned
[14:26:34] so the script shouldn't have repooled it
[14:26:40] and I am going to depool it manually now
[14:35:20] fyi, I've started repooling db2223, it's taking longer than I was expecting. There will be db2196, db2221, db2222 that need a repool after
[14:40:31] XioNoX: I wonder if by default, it does a slow pooling to avoid errors, maybe they can be done in parallel, but fede will know more about that
[14:40:49] yeah I prefer to defer to the DB team
[14:42:23] XioNoX: I'd guess it is doing it step by step in gaps of 10-15mins
[14:42:24] federico3: ^
[14:42:46] if it is using the normal repool cookbook
[14:42:50] yeah, for a network-only outage one could be more aggressive than after a server maintenance
[14:42:55] yeah
[14:43:09] it is ok as a default, though
[14:43:11] we have the --fast option
[14:43:19] ha ha, that sounds like a joke
[14:43:19] so maybe that cookbook could use it
[14:43:40] jynus: hahaha
[14:43:43] the script stopped, it did not depool
[14:43:48] The mariadb one is --run-faster
[14:43:49] I mean it did not repool
[14:44:08] oh, then something wen't wrong
[14:44:12] *went
[14:44:15] federico3: related to my comment above pc1014?
[14:44:54] yes, pc1014
[14:45:13] yes, that should not have been attempted to repool as mariadb was down
[14:45:33] I think you'd probably have to manually repool the hosts XioNoX mentioned above
[14:46:58] can I leave those to your team?
[14:47:05] Pooling instance db2223 at 56%
[14:47:09] fyi
[14:47:15] I have to step away
[14:47:20] XioNoX: yeah
[14:47:25] I'm checking the status of the others: I see 2222 and 2221 still depooled
[14:47:32] federico3: I am not really following this maintenance, can you take care of all this?
[14:47:58] yes, I'm going to check and pool them one by one
[14:48:09] thanks
[14:51:43] both db2221 and db2222 look ok, with replication lag quickly dropping to normal after the maintenance; pooling both of them in
[16:40:27] that's for next week: a few DB hosts and a lot of k8s: https://phabricator.wikimedia.org/T427301