[05:12:55] <marostegui>	 swfrench-wmf: I think that's fine. We can manually triage them with the appropriate tags 
[06:50:47] <federico3>	 as a side note, perhaps it could be useful to also have a tag for MariaDB hosts so we can setup IRC highlights or personal notifications based on that?
[07:05:51] <jynus>	 "Last dump for x3 at codfw (db2200) taken on 2025-06-10 00:45:37 is 35 GiB, but the previous one was 266 GiB, a change of -86.9 %"
[07:05:55] <jynus>	 ^expected?
[07:06:31] <jynus>	 I am guessing yes, it is just the delayed size change for dumps
[07:08:43] <jynus>	 meanwhile backup2013 ran out of space
[07:34:53] <jynus>	 I ordered more disk from Amazon: The filesystem on /dev/mapper/hwraid-backups is now 13421772800 (4k) blocks long.
[07:55:25] <jynus>	 review pre-db2197/dbprov2003-upgrade: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153641
[07:56:03] <federico3>	 looking
[07:56:30] <jynus>	 that was more for manuel, but happy to get more eyes
[07:57:36] <jynus>	 I will deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155082 unless someone has some reservation against
[08:01:07] <federico3>	 I see "bookworm" as a comment being removed on most sections but not line 906, MariaDB on db2197 and dbprov2003 being set to 10.11
[08:01:54] <jynus>	 yep, I figured the comment was no longer useful if it was on all hosts :-D
[08:04:34] <marostegui>	 jynus: expected yes, as the s8 tables were dropped from there 
[08:05:39] <jynus>	 So I was about to do the upgrade that federico3 was talking
[08:05:42] <jynus>	 ok to go?
[08:06:20] <jynus>	 I just merged the zuul grants and will redo m1 backups (but there are no tables on db yet)
[08:07:21] <federico3>	 marostegui: can I start restarts in s7 in eqiad?
[08:09:59] <jynus>	 I think today I will do the upgrade, but won't do any further maintenance, as I worry that I won't be around tomorrow if something goes badly
[08:15:38] <marostegui>	 federico3: yeah good from my side 
[08:35:54] <federico3>	 thanks, started
[09:21:26] <jynus>	 proceeding with the db2197 upgrade
[09:35:19] <marostegui>	 btullis: Do you have any ETA on this? https://phabricator.wikimedia.org/T394373#10824966
[09:42:30] <jynus>	 the upgrade is complete, testing a run and a recovery now
[09:42:36] <marostegui>	 thanks
[09:44:06] <jynus>	 tracking: T394487
[09:44:06] <stashbot>	 T394487: Migrate backup sources to MariaDB 10.11 - https://phabricator.wikimedia.org/T394487
[09:45:35] <btullis>	 marostegui: I think that we can do them this week.
[09:46:32] <marostegui>	 btullis: can I do dbstore1009 today?
[09:47:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on es2032:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:09] <btullis>	 marostegui: Yes. Feel free to go ahead. There is a wikidata dump that will probably crash out on s8, but that's almost always going to be the case, so we'll just deal with it.
[09:52:11] <marostegui>	 btullis: ok, doing it then thanks!
[09:52:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ferm.service on es2032:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:40] <btullis>	 Now is better than around the 1st or 20th of the month, so we can probably fit them all in, this week, if that works for you. 
[09:52:49] <marostegui>	 yeah, that can be doable
[09:52:50] <marostegui>	 thanks
[09:54:33] <federico3>	 maybe putting this stuff in an .env
[09:54:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: ferm.service on es2032:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: ferm.service on es2032:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:43] <jynus>	 I think I made a mistake and upgrade the right db hosts, but not the right dbprov
[09:58:16] <jynus>	 dbprov2004 should have been upgraded, not dbprov2003
[09:58:29] <jynus>	 I am going to change that and downgrade dbprov2003
[09:59:34] <marostegui>	 btullis: I will upgrade its kernel too as part of https://phabricator.wikimedia.org/T395240
[10:00:55] <btullis>	 Ack, many thanks. 
[10:09:57] <jynus>	 now backups running: m1 for zuul adition, es* from disk issue, and s2, s6, x1 for 10.11 testing
[11:20:43] <jynus>	 10.11 snapshots looking good
[11:34:47] <marostegui>	 nice!
[11:52:46] <jynus>	 taking a long break while the rest of backup progress, will return later to evaluate results
[13:04:51] <jynus>	 back
[13:09:45] <elukey>	 federico3: o/ I expanded https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Add_private_data/secrets_(optional) as promised
[13:13:24] <federico3>	 ok, I'll ask in k8s-sig if they want to provide guidance
[13:30:35] <marostegui>	 Going to depool pc1 for https://phabricator.wikimedia.org/T378715
[13:59:26] <federico3>	 marostegui: looks good? https://phabricator.wikimedia.org/T384212#10899856
[14:01:35] <marostegui>	 federico3: I think the database column should reflect the database name, or it will just be zarcillo and zarcillotest?
[14:02:51] <marostegui>	 federico3: Also I'd suggest to avoid "-" on the usernames, just to avoid having to escape things, quoting etc
[14:03:11] <marostegui>	 I'd replace with "_"
[14:03:21] <federico3>	 ok, renaming them
[14:03:56] <federico3>	 the database column in the table in phabricator you mean? The first row is "all" and the other 2 rows are "db1215"
[14:04:09] <federico3>	 maybe there's a formatting glitch in phab?
[14:04:31] <marostegui>	 federico3: With "database" I understand the name of the database within zarcillo
[14:04:38] <federico3>	 ah wait let me rename it :D
[14:04:55] <marostegui>	 With "all" you mean we have to create this user everywhere in production?
[14:05:07] <federico3>	 also zarc_repl_scraper or zarcillo_ ? :)
[14:05:21] <federico3>	 yes, unless we want to reuse an existing user of course
[14:05:42] <federico3>	 ok, table updated
[14:06:56] <marostegui>	 I think the prometheus user has those grants for the "all" databases
[14:07:02] <marostegui>	 But I'll have to check
[14:10:18] <federico3>	 I see 4 prometheus-mysqld-exporter users with 1 ipaddr each (no cidr)
[14:14:28] <jynus>	 if it is for replication status, not for binlog status, shouldn't the permissions be: REPLICA MONITOR instead or replication client?
[14:15:16] <jynus>	 assuming MariaDB >=10.5
[14:16:45] <jynus>	 I know this because I recently had to redo the grants for backups
[14:16:53] <jynus>	 after the 10.6 upgrade
[14:17:50] <jynus>	 the other thing should be to have a low connection limit, like it was done for prometheus
[14:23:53] <federico3>	 jynus: ah https://mariadb.com/kb/en/grant/#replication-client 
[14:25:09] <federico3>	 indeed it switched from 10.5 and I think we would need a dedicated user for this to limit the ipaddr range and privileges to the miinimum
[14:25:37] <jynus>	 I don't know what's the goal, but I guessed the experience would be useful to you
[14:25:40] <federico3>	 we could first create such user only on db2230 and/or db1215 if prefered
[14:25:52] <jynus>	 check this file:
[14:27:51] <jynus>	 https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/mariadb/grants/production.sql.erb$51
[14:31:38] <federico3>	 I saw the grant by looking around in db1215, I guess it's this guy /usr/bin/prometheus-mysqld-exporter --collect.global_status --collect.global_variables --collect.info_schema.processlist --collect.slave_status --no-collect.info_schema.tables --collect.heartbeat --collect.heartbeat.utc
[14:32:18] <federico3>	 ...is it also extracting the `SHOW REPLICA STATUS` somehow?
[14:32:34] <federico3>	 as in not just some statistics but the whole data
[14:32:47] <jynus>	 yes, although being unix_socket / localhost it is less worrying
[14:33:13] <federico3>	 we might get away without having to scrape directly 
[14:33:29] <jynus>	 slave monitor was the old replica monitor
[14:36:10] <marostegui>	 federico3: Why do we need that grant actually? to find if it is a master or a slave?
[14:37:21] <federico3>	 that, and who is replicating from whom, but maybe we can get away (at least for now) by scraping Prometheus and/or Orchestrator more
[14:41:57] <marostegui>	 federico3: Yeah, I was thinking that having yet another grant on all databases that does the same that orchestrator isn't ideal. We already have that on orchestrator (we can call orchestrator-client for that) or even: db-replication-tree db1163 for instance
[14:42:28] <marostegui>	 https://phabricator.wikimedia.org/P77540
[14:42:33] <marostegui>	 (we can also call clusters)
[14:42:52] <federico3>	 orchestrator has a massive grant with ability to make changes etc
[14:43:14] <federico3>	 right now I cannot scrape it as it needs SSO but hopefully there's a way to access
[14:43:17] <marostegui>	 https://phabricator.wikimedia.org/P77540#311455
[14:43:28] <marostegui>	 federico3: You can just use the client
[14:43:33] <marostegui>	 (or api)
[14:44:09] <federico3>	 yes I'm using the spicerack client but it's probably missing some credentials
[14:44:21] <marostegui>	 Double check also db-replication-tree
[14:44:22] <federico3>	 https://grafana-rw.wikimedia.org/d/celrzpf6av8qob/zarcillo?orgId=1&from=now-15m&to=now&timezone=utc&viewPanel=panel-6
[14:44:24] <marostegui>	 As it may be simplier
[14:44:33] <federico3>	 https://grafana-rw.wikimedia.org/d/celrzpf6av8qob/zarcillo?orgId=1&from=now-15m&to=now&timezone=utc&viewPanel=panel-5
[14:45:40] <federico3>	 these look promising, maybe I can reliably extract the replication status from here. Of course the safest bet would be asking the databases directly but if we get timestamped datapoints from the prometheus exporter it might be enough
[14:51:40] <marostegui>	 Repooled pc1
[14:54:50] <jynus>	 my suggestion, federico3, is to deploy the access to to zarcillo, as that is trivial; the rest may need a rethink
[14:55:40] <federico3>	 ok
[14:55:41] <jynus>	 there used to be a tool we had called tendril that extracted data from all dbs, but it had a lot of problems scaling to so many dbs
[14:56:09] <marostegui>	 Yeah, zarcillo_ids_rw_preprod and zarcillo_ids_rw_prod can go ahead
[14:56:12] <federico3>	 do we know if the data in the 2 panels I shared is reliable?
[14:56:16] <marostegui>	 The first one, we need to check some alternatives
[14:56:48] <federico3>	 for now we can start with the zarcillo_preprod database :)
[14:56:57] <jynus>	 I don't think we should have 20 monitoring tools- but 1 that can be reused by different monitoring tools
[14:57:02] <marostegui>	 Agreed
[14:57:19] <marostegui>	 federico3: can you create the databases?
[14:57:23] <jynus>	 e.g. if there is a service that extracts data and then can be queried by prometheus, orchestrator, other dashboards, etc.
[14:57:36] <federico3>	 that was part of the idea of the central source of truth 
[14:58:04] <jynus>	 reminds me of https://xkcd.com/927/
[14:58:23] <federico3>	 but for now I can fetch data from prom/orch for the time being 
[14:58:29] <jynus>	 again, not saying no, but needs some thinking
[14:59:10] <jynus>	 specially what I said about removing icinga as a priority
[14:59:17] <federico3>	 marostegui: `zarcillo` is already existing, want me to create `zarcillo_preprod` and/or the users as well?
[14:59:20] <jynus>	 and the blocker of a lack of a private data store
[14:59:58] <marostegui>	 federico3: Yes, the one you need, if you can also generate the grants, we can double check them
[15:00:26] <federico3>	 ok I'm pasting here the commands for confirmation
[15:00:33] <marostegui>	 paste them on the task, much better
[15:01:11] <federico3>	 ok
[15:05:22] <jynus>	 e.g. maybe there should be a micro service for private status queryin on each db, and then query that microservice? I am not sure
[15:05:53] <jynus>	 and that could be used for zarcillo and for whatever is the future query monitoring setup
[15:06:27] <jynus>	 I think we should think more long term for that
[15:07:26] <jynus>	 and tell clearly obs. team that their setup is not working for us
[15:07:26] <federico3>	 jynus: currently zarcillo is more or less doing that: spawning a dedicated process that can only fetch data sources and write into its datastore, while another can read the datastore and serve data via an API. 
[15:08:19] <federico3>	 (right now it's all read-only access to pretty much public[ish] data but later on we can split the parts into different containers entirely)
[15:08:25] <jynus>	 I don't like direct polling, bad things can happen from that- both security and scaling wise
[15:09:03] <jynus>	 think that processlist may need to poll at least once every 10s or so
[15:09:30] <federico3>	 (more in general, if we could do push rather than poll it would be even better, yes)
[15:09:32] <jynus>	 again, I am not saying how it should be done
[15:09:53] <jynus>	 I am saying it is a large problem with yet unclear scope and will need more work to do it well
[15:10:20] <jynus>	 so step by step, let's setup a dashboard
[15:10:38] <jynus>	 we can do some custom population by a custom script for now
[15:10:54] <jynus>	 and later we see how to pivot it to something else
[15:11:37] <jynus>	 I think it is important to have a snall POC asap
[15:12:38] <jynus>	 and let's talk later on the week about db backup automation, as there could be some good synergies there too
[15:13:02] <jynus>	 dbs querying table metadata and backups querying db metadata
[15:14:50] <jynus>	 BTW, I will be out tomorrow again
[15:25:40] <federico3>	 marostegui: https://phabricator.wikimedia.org/T384212#10899856 updated
[15:26:50] <federico3>	 marostegui: can I go ahead with those?
[15:29:45] <marostegui>	 federico3: From a first glance I don't think those would work
[15:30:10] <marostegui>	 You need to include the ipv6 ones there
[15:30:15] <marostegui>	 I'd suggest you test all this on db2230
[15:30:44] <federico3>	 are we using ipv6 by default on k8s and the dbs?
[15:31:19] <marostegui>	 federico3: Your comment says you need to allow an ipv4 and ipv6 network, if the look up is for ipv6, you need to specify that
[15:31:28] <marostegui>	 Again, test with them on db2230, I just took a quick look
[15:32:57] <federico3>	 I mean: in the codebase I can use ipv4 or 6 as we prefer so we might want to enable only ipv4 as it's just a /21?
[15:33:48] <marostegui>	 +1 for only ipv4 https://phabricator.wikimedia.org/T270101
[15:34:34] <volans>	 FYI orchestrator's API internally don't need SSO if in the list of authorized hosts
[15:36:09] <federico3>	 marostegui: I created the user on db2230... after all we could just keep the test database on db2230 instead of db1215 and be done with it :) 
[15:36:49] <marostegui>	 federico3: yeah, but db2230 isn't very reliable as in, it can be corrupted, deleted, etc
[15:37:21] <federico3>	 ok, the commands worked ,should I do it on db1215 now or do we want to test connecting from k8s to db2230 first?
[15:37:43] <marostegui>	 Yeah, I think the connection testing is what you need
[15:37:44] <federico3>	 volans: maybe the aux cluster needs authorisation?
[15:37:53] <volans>	 modules/profile/templates/idp/client/httpd-orchestrator.erb
[15:38:18] <volans>	 I'm not sure if  authorizing the whole k8s cluster is a good idea, up to you
[15:38:57] <federico3>	 marostegui: are you worried about the netmask?
[15:43:11] <marostegui>	 federico3: just make sure the grants work
[15:43:41] <federico3>	 that's going to take a while :D
[15:44:59] <marostegui>	 federico3: you don't have to check _all_ hosts, just check a couple. Also, I am with volans here, doing a /21 is strictly needed?
[15:46:27] <volans>	 I meant for the orchestrator APIs, but that works for grants too :D 
[15:46:34] <federico3>	 you don't have to check _all_ hosts, just check a couple -> you mean the source hosts in k8s? I don't seem to have rights to get a shell on the pod
[15:47:12] <marostegui>	 yeah, definitely not for orchestrator api
[15:47:33] <federico3>	 marostegui: I doublechecked with claime and afaict it's the k8s aux address range
[15:47:48] <claime>	 It's the pod ip range
[15:49:06] <claime>	 and it's the aux-k8s one, not the wikikube one, so the surface is low, there's only internal tools on that cluster
[15:50:47] <federico3>	 if we can have a read-only api access range on orchestrator with a dedicated user/pass ? But first I can experiment more with the data from Prometheus
[16:50:52] <marostegui>	 btullis: all dbstore migrated to 10.11 and rebooted with the latest kernel
[16:51:14] <btullis>	 marostegui: Fantastic! Thanks very much.
[17:34:24] <federico3>	 marostegui: and indeed the user @ ipaddr/netmask does not seem to work
[17:38:18] <federico3>	 yet I was able to log in from cumin to db2230 using an user (with no grants) from a netmask containing cumin itself CREATE USER 'zarcillo_ids_rw_preprod'@'10.64.48.0/255.255.252.0' 
[19:13:41] <marostegui>	 federico3: because that is covered by the normal user we use from cumin which has 10.64.% 
[20:10:31] <federico3>	 marostegui: err? I logged in using 'zarcillo_ids_rw_preprod' specifically