[08:27:47] I don't know who that question was aimed at, sorry
[10:01:46] Sorry, I wasn't specific enough: there are 4 dbproxies with backends failed over, expected?
[10:02:39] dbproxy1022,dbproxy1023,dbproxy1028,dbproxy1029
[11:47:53] federico3: ^
[11:47:58] He rebooted them
[11:48:51] jynus: yes, see https://grafana.wikimedia.org/d/fc48lf4/dbproxy?orgId=1&from=now-24h&to=now&timezone=utc
[11:49:28] for https://phabricator.wikimedia.org/T419961
[11:50:13] (speaking of which, I could not find useful metrics from the dbproxy fleet, any hint?)
[11:50:43] if haproxy is expected to be in failover mode, can it be downtimed then?
[11:51:18] is it supposed to be in failover mode?
[11:51:38] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=haproxy+failover
[11:51:48] idk, that's why I was asking
[11:53:22] it shouldn't be
[11:53:30] these are passive haproxies
[14:30:09] federico3: when are you planning to turn on the circular replication?
[14:30:22] I need to go to meetings, but after that, let's do it when I'm around
[14:31:09] Amir1: that was planned for monday but do you think it should be moved earlier?
[14:31:31] if the plan is for monday, that's fine. It also means Manuel will be around
[14:31:43] (but I won't, even better :P)
[14:31:53] XD
[14:51:21] PROBLEM - MariaDB sustained replica lag on s1 on db2145 is CRITICAL: 88.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[14:52:21] RECOVERY - MariaDB sustained replica lag on s1 on db2145 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[16:01:10] I will be deploying new grants to backup1-* sections soon
[16:04:45] hum strange, the lag was not quite captured by the charts
[16:50:57] FYI clouddb1013 crashed again after I repooled it https://phabricator.wikimedia.org/T420177#11728804
[16:52:49] not sure if it's really something with that mariadb version, or something bad on that host
[16:52:55] but it never happened before the upgrade
[17:28:41] FIRING: DiskSpace: Disk space backup1010:9100:/srv/objectstorage 2.069% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=backup1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:36:43] ^I've extended the downtime
[17:38:11] dhinus: could be many things, what I would check first is that pt-kill is running normally, it used to crash a lot without it
[17:38:57] jynus: good point, I did check it was running after the mariadb upgrade, but not after that
[17:39:28] it doesn't have to not be running, could be that something changed in syntax or whatever
[17:39:45] check if you have graphs before and after
[17:40:03] to see if there is a difference in resources used
[17:40:20] for production, we didn't see any big change on that upgrade
[17:40:30] but cloud is a different beast
[17:41:02] previous upgrades were way more involved and breaking IMHO
[17:42:24] did you get anything on the mysql log?
[17:42:37] even if it crashes, usually it leaves a trace or debug
[17:53:48] hi folks! I'm the approver for deployment access requests. I have an access request with the reason as "backport fixes" (fine) and "access to live db for query optimization". And I'm aware you _could_ do that. I have a general feeling you all have opinions about that. Is there a doc/policy I should point to here?
[18:00:47] access to live db for query optimization?
[18:01:44] is this a legitimate request? as in, someone that you would otherwise grant permission for deployment?
[18:02:12] maybe it is just a request to access for performance metrics
[18:05:29] There is some stuff at: https://www.mediawiki.org/wiki/MediaWiki_database_policy
[18:05:47] this person does have a need to deploy code, likely, yes. The other part of the request I felt like I should say something about :)
[18:06:34] and this is for schema changes, but kind of debates the risks of changes on production: https://wikitech.wikimedia.org/wiki/Schema_changes
[18:07:11] I also dug up https://wikitech.wikimedia.org/wiki/MariaDB#Testing_servers
[18:07:17] jynus: thanks, will have a look tomorrow and report back
[18:07:21] in general, tuning should not be done in production
[18:07:32] there are dedicated servers for that, like those
[18:07:57] please ask if they just mean metrics and cleanup
[18:08:21] and it is just a question of surprising phrasing
[18:09:04] we can provide access on non production hosts, just ask for clarification
[18:09:22] probably they just mean "running maintenance scripts" not raw access
[18:09:45] jynus: thanks, I will clarify and say that there are specific servers if needed.
[18:10:17] feel free to send them to me if they insist they need direct db access :-D
[18:10:39] so we can put them on pagerduty ;-D
[18:11:00] :D