[08:16:28] Amir1: can I start the schema change in s3 codfw? Also the setup of es2049?
[08:16:53] federico3: go for it
[08:17:23] both things?
[08:47:35] Amir1: ah, the clone cookbook is for s* sections, it might need tweaking for es*. Is there a documented process to set up es* nodes?
[08:51:54] cookb
[09:21:55] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:44] probably a side-effect of the envoy upgrade, but I'll have a quick look
[09:25:42] yeah, fixed now
[09:27:10] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:55] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:55] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:32] fixed (again, consequence of the envoy upgrade, apologies for the noise)
[10:22:10] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:27] grafana is lagging, it's fine on the host.
[10:23:31] Can I get a +1 on two CRs, please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1183627/ to remove 3 drained hosts for disk controller swap; and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1183628/ to re-load those and drain the next batch
[10:25:42] going for a coffee outside
[10:26:55] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:58] Emperor: d'you mind putting the hostnames in the description just for extra safety?
[10:28:50] federico3: like that?
[10:29:10] yes thx
[10:29:49] ah I wanted to +1 but Jaime did it already
[10:36:10] well, thanks to both of you :)
[12:03:31] https://www.irccloud.com/pastebin/j1AObd21/
[12:03:37] how
[12:07:39] this is the same range of commons extlinks and commons has +100M pages
[12:22:48] aaaaaa
[12:22:51] https://www.irccloud.com/pastebin/grQ4PBMR/
[12:23:07] 27M links to wikimedia.org
[12:23:23] Amir1: I think the copy of the es host needs a bit of a bespoke process for cloning so I have a copy of the clone cookbook with the replication checks disabled https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1183646 - can I give it a shot?
[12:26:16] federico3: 1- it still tries to set replication. 2- I suggest adding a flag to the existing cookbook
[12:26:32] for 1, line 639
[12:29:22] I could integrate it in the existing cookbook but I would do it only after getting the second one tested a few times
[12:31:10] also the CHANGE MASTER TO a few lines above
[12:44:15] Are we having a team meeting today?
AFAICT Kwaku, Kavitha, Giuseppe, and Eric are all OoO
[12:47:13] good point, we can skip it and sync up here
[12:47:50] I always appreciate less meetings
[13:01:24] oh
[13:03:02] Good to know, but I'm by no means against ;)
[13:03:43] * Emperor content either way, it just seemed worth asking with so many people away
[15:06:36] Amir1: we spoke about automating some parts of the master flips - but there's also T196366 (See https://phabricator.wikimedia.org/T196366#10479471) and I'm taking a look at https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/3/diffs - maybe I can help with the new script without touching the existing one?
[15:06:37] T196366: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366
[15:38:48] I'd pause more long term improvements until Manuel is back
[15:44:01] ok
[17:04:45] Amir1: the schema change on s3 is done, can I start s7?
[21:56:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:04] That was supposed to be the source for cloning the new host, downtime expired?