[05:55:10] <marostegui>	 btullis: you aware of those reboots needed in dbstore* https://phabricator.wikimedia.org/T395241 ? thanks
[07:09:51] <marostegui>	 es7 under trouble again because of semi sync
[07:10:41] <marostegui>	 We really have to disable semi sync on external store
[07:10:49] <marostegui>	 It is a source of issues with such big blob inserts
[07:32:40] <marostegui>	 I am going tohave to disable writes on es6
[07:32:45] <marostegui>	 The master is misbehaving all the time
[07:35:20] <marostegui>	 ok, holding off as I've disabled everything related to semi sync and it seems to be back
[09:11:58] <federico3>	 marostegui: did es1035 crash?
[09:23:35] <marostegui>	 federico3: Semi sync issues when I stopped a slave on its section
[09:23:48] <federico3>	 I can see https://phabricator.wikimedia.org/P76797 but I'm wondering why there was no alarm here on this channel
[09:24:04] <marostegui>	 federico3: It didn't crash
[09:24:19] <federico3>	 I see, thanks
[09:24:22] <marostegui>	 I had to kill it, but i downtimed it to avoid p4ges
[09:24:59] <federico3>	 the semi sync itself did not alert?
[09:25:04] <marostegui>	 federico3: I am going to propose disable semi sync on es in the team meeting, and if you guys want, I will create a task to allow a flag in the .yaml of the files in puppet to disable it on master and slave, and if you have time, I'd like to see if you can work on it
[09:25:44] <jynus>	 hello, this is not necesarilly a suggestion, but more of a question. I wonder if an upgrade to 10.11 could help?
[09:25:53] <marostegui>	 jynus: I have no idea :(
[09:26:18] <marostegui>	 jynus: But those semi sync issues only happen on es (where they get stuck as soon as one slave stops) my guess is maybe the size of the transactions
[09:26:21] <marostegui>	 BUt it is a pure guess
[09:26:41] <jynus>	 that wouldn't make sense, as usually the size is is very small
[09:26:50] <jynus>	 I wonder if it is related to the size of the cluster
[09:26:57] <marostegui>	 jynus: I mean the size of the blobs
[09:27:04] <marostegui>	 They are larger than the metadata ones
[09:27:06] <federico3>	 marostegui: yes, makes sense, perhaps we can also think of ways to monitor it to issue warnings here?
[09:27:08] <jynus>	 ah, now I understand what you mean
[09:27:24] <marostegui>	 federico3: Unfortunately there is not much to monitor, when it happens the slaves start to lag
[09:27:31] <marostegui>	 So monitoring semi sync wouldn't make much sense
[09:27:42] <marostegui>	 jynus: But it is a wild guess
[09:29:08] <jynus>	 btw, federico3 did anyone help you on friday with your icinga access?
[09:30:35] <federico3>	 jynus: yes but it's not fixed, anyhow I'm using alertmanager in the meantime
[09:34:22] <jynus>	 who did you talk to, I would like to help with that, I think it is important that you have access
[09:34:37] <jynus>	 but I want to sync to see what is not working still
[09:35:45] <federico3>	 to joe but I'm going to chase the issue, also the access to the IRC bot
[09:37:18] <jynus>	 thanks, I will ask him
[09:45:19] <federico3>	 aha I found something
[10:15:47] <federico3>	 Amir1: the script detecting the kernel upgrade progress introduced small changes in the host list in https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-ornal7wfj7ekyrq/
[10:25:08] <jynus>	 Let me know if I can do anything to aliviate the load with the es issue
[10:25:47] <marostegui>	 jynus: Thanks, for now I am going to hurry up to migrate to 10.11
[10:26:06] <marostegui>	 But I think we should work towards making es puppetizable with regards to the semi_sync keys
[10:26:12] <jynus>	 one *temporary* measure would be to strictly depool for writes the section while doing maintenance
[10:26:12] <marostegui>	 (hiera keys, that is)
[10:26:30] <jynus>	 while the issue may be ongoing
[10:26:31] <marostegui>	 yeah, it looks like it can be the best approach
[10:40:52] <jynus>	 Now that I see it, the high memory usage could be a synthom of something going badly
[10:51:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:54:26] <Amir1>	 federico3: what change? the addition of es2047?
[10:56:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:48] <federico3>	 Amir1: yes, that. In the last change it set the checkmarks on a bunch of hosts but also added es2047 which was not listed before. In previous changes that we discussed before in made similar changes with hosts appearing/disappearing
[10:59:16] <Amir1>	 that's because es2047 just got racked, that's expected
[10:59:30] <Amir1>	 T395771
[10:59:31] <stashbot>	 T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[11:14:00] <federico3>	 Amir1: can i start reboots on s2 in eqiad?
[11:14:11] <Amir1>	 yes go ahead please
[11:14:44] <federico3>	 tnx, starting now
[12:21:00] <Amir1>	 Happy Monday https://usercontent.irccloud-cdn.com/file/h3DbyRPi/IMG_20250601_142022_779.jpg
[12:22:22] <taavi>	 where are we with x3-on-wikireplicas? still doing its thing?
[12:22:56] <Amir1>	 taavi: we just talked about it with Manuel, it's happening 
[12:23:23] <Amir1>	 for now, I need to drop a couple of tables 
[12:41:24] <Amir1>	 > sudo db-mysql db2243 -e "use wikidatawiki; show tables" | grep -v wbt | xargs -I{} bash -c "sudo db-mysql db2243 -e 'use wikidatawiki; drop table if exists {};'"
[12:48:47] <Amir1>	 marostegui: 900GB with binlogs: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db2243&var-datasource=000000026&var-cluster=misc&viewPanel=panel-28&from=now-3h&to=now&timezone=utc
[12:49:26] <Amir1>	 I think it'll go down more when more binlogs get rotated out, probably to 800GB-ish
[12:49:26] <marostegui>	 Amir1: Good, thanks!
[13:16:34] <Emperor>	 urandom: FYI, there are 4 restbase servers (restbase2030 restbase2034 restbase2035 restbase2027) alerting on disk-space
[13:20:09] <urandom>	 Emperor: thanks; I'll take care of them
[13:20:36] <federico3>	 hi moritzm , what kind of review can I do for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152680 other than checking for the correct URL? Any additional testing or checks I can do?
[13:26:07] <moritzm>	 from your point mostly just that the group is right, cn=ops is the LDAP group everyone with global root/SRE is a member pf
[13:27:08] <federico3>	 ok, thanks
[13:44:42] <federico3>	 marostegui, Amir1: I see some hosts having a much higher rate of tmp tables creation, is this expected? https://grafana.wikimedia.org/goto/irS5VXBNg?orgId=1  
[13:59:02] <btullis>	 marostegui: Thanks. Yes, I had those reboots in https://phabricator.wikimedia.org/T392980 but was waiting until dumps are a bit less busy. Didn't get a chance to do them last week, which would probably have been more convenient, but never mind.
[13:59:22] <marostegui>	 btullis: No worries, i was just making sure you were aware
[13:59:24] <marostegui>	 thanks
[14:44:36] <marostegui>	 Amir1: is x3 on all.dblist?
[14:44:51] <Amir1>	 x3 only has wikidatawiki
[14:45:12] <Amir1>	 (maybe I'm confused about your question?)
[14:45:34] <marostegui>	 Amir1: I am taking a look at./modules/role/files/mariadb/check_private_data.py and wondering if x3 will be scanned by default or not
[14:45:43] <marostegui>	 based on all_dblist_path = os.path.join(mediawiki_config_path, 'dblists', 'all.dblist')
[14:46:09] <Amir1>	 I think it'll be scanned since the db itself is part of all.dblist 
[14:46:14] <marostegui>	 Ah never, mind, we call that from the report also
[14:46:20] <marostegui>	 So no worries
[14:46:20] <marostegui>	 Should be all fine
[14:46:24] <Amir1>	 ah awesome
[15:31:54] <Amir1>	 marostegui: on phone so can't do gerrit but 1152760 has my +1
[15:32:03] <marostegui>	 thanks!
[17:47:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on thanos-be2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:47:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on thanos-be2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure