[00:41:51] bd808: ha!
[08:15:25] infamy infamy, they've all got it...
[08:25:53] can someone check dbproxy1026? it seems to be an m3 db proxy but has been failed over for 6 days
[08:28:17] will do
[08:30:35] thanks
[08:31:53] I use this dashboard to check for ongoing data persistence alerts: https://alerts.wikimedia.org/?q=team%3D~%28sre%7Cdata-persistence%29&q=%40state%3Dactive&q=instance%21~an-
[11:56:40] elukey: hi. How are you getting on with the SM config-Js, please? I've got a meeting a little later today where I'm going to be asked about where we are with the ms and thanos backend nodes...
[12:14:00] Emperor: o/ I am online in the EU afternoon, but TL;DR is that Jesse thinks we may have a BMC firmware bug for the double d-i issue. In theory IIUC it shouldn't happen after we reimage all the nodes successfully, but I need to confirm it with Jesse later on. Worst case we could reimage the two thanos-be and put them in production, following up with Supermicro in the meantime. Otherwise
[12:14:06] we can use Legacy/BIOS and have to flip UEFI on/off when a disk needs to be replaced; not ideal, but hopefully it won't happen frequently.
[12:14:25] ms-be2* are almost all ready, ms-be1* still have to be configured by DCops but it should be a matter of a day or two
[12:14:45] so depending on what we choose, I'd say that they could be ready by end of week or early next one
[12:14:53] lemme know your thoughts
[12:15:07] (I can jump on a meeting later on if you prefer to chat about those)
[12:17:55] Thanks! I'm relaxed about the ms-be* nodes (I want them in service this quarter, but we still have time for that).
[12:19:13] [I'd like to avoid flipping between BIOS and UEFI if possible]
[14:09:34] Amir1: dumb question, how do I invoke sql.php to create the tables on x1?
[14:14:01] there used to be a cluster option or something? I cannot remember
[14:14:06] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 22 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[14:14:23] some kind of option I cannot remember now
[14:15:23] "--cluster" shows up as an option: https://www.mediawiki.org/wiki/Manual:Sql.php
[14:16:48] --cluster extension1 cdanis
[14:16:55] https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debugging_databases
[14:17:03] thanks
[14:17:07] is there a dry run mode?
[14:17:32] I'd say run it in interactive mode to double check you are on the right host?
[14:17:39] cdanis: the doc should have something
[14:17:40] with a non-query?
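A minimal sketch of the sanity check being suggested here: open sql.php interactively against the target cluster and run a harmless query before any DDL. The invocation mirrors the one pasted further down in this log; the wiki and cluster names are simply the ones from that example.

```bash
# On a maintenance host, open sql.php against the extension1 (x1) cluster.
mwscript sql.php --wiki=commonswiki --cluster=extension1

# At the interactive prompt, confirm the host and its read-only state
# before running any CREATE TABLE:
#   select @@hostname, @@read_only;
# Only the active primary of the section should report read_only = 0.
```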
[14:17:44] ok
[14:17:52] https://wikitech.wikimedia.org/wiki/Creating_new_tables
[14:18:03] as in a SELECT @@hostname; or something like that
[14:18:05] worst case, just directly create it on the master of the section
[14:18:06] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[14:18:22] cumin1002$ sudo db-mysql db1234
[14:22:29] checking
[14:24:31] cdanis@mwmaint2002.codfw.wmnet ~ mwscript sql.php --wiki=commonswiki --cluster=extension1
[14:24:33] > select @@hostname;
[14:24:35] stdClass Object
[14:24:37] (
[14:24:39]     [@@hostname] => db2196
[14:24:41] )
[14:25:16] cdanis: compared to what orchestrator says, that looks good to me: https://orchestrator.wikimedia.org/web/cluster/alias/x1
[14:25:31] as codfw is active
[14:26:05] I think it is better to do it through the script; if for some reason we made a mistake, read-only would prevent a write to the wrong host
[14:26:19] as sudo you can override that restriction
[14:26:49] only the active primary db on the cluster is in read-write mode for regular users
[14:28:54] looks like it was a blip in replication for db1206 https://grafana.wikimedia.org/goto/n4-uySnHg?orgId=1 - this panel is the only one raising concern
[14:29:23] it's s1, could it be dumps?
[14:30:12] also, what is prometheus-mysqld-exporter?
[14:36:44] it's the native dashboard that goes with the exporter: https://github.com/prometheus/mysqld_exporter/blob/bb4b4ba9c376d4c8381cc0c92153b8a06248c6bd/mysqld-mixin/dashboards/mysql-overview.json
[14:37:34] I see, there is a lot of overlap with the MySQL one. Maybe it would be nice to implement the improvements in a single dashboard?
[14:38:37] (not that there can't be multiple views, but better to have one good one rather than 2 that could each be improved)
[14:40:10] I must admit this is my default one :D I'd be happy to have a list of panels that are better in each direction to merge them, that's quite easy from that perspective
[14:41:14] Did you know that the other existed? What did you miss from it?
[14:44:15] We wanted at some point to do one to compare 2 hosts side by side
[14:44:44] yes, I knew about both; the generic one is more in my habits as I've been using it since mysqld-exporter has existed :D I don't think either lacks or has anything the other doesn't, I just find the readability quite different in each context, and merging them from my perception would mostly be aggregating all panels under a single umbrella dashboard! And wanting to be able to
[14:44:45] compare hosts, in my opinion, calls for yet another dashboard
[14:45:16] yes, the comparison one should be a different one
[14:45:25] but can't agree more with you, this would be handy indeed
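As an aside on the db1206 blip discussed above (and the lag alerts that follow below), a command-line cross-check of what the Grafana panel shows, using the db-mysql wrapper mentioned earlier in the log. Whether db-mysql reads SQL from stdin exactly like the plain mysql client is an assumption here.

```bash
# Eyeball replication health on the lagging replica directly, as a
# cross-check of the dashboard panel. (Assumes db-mysql forwards stdin
# to the mysql client.)
echo "SHOW SLAVE STATUS\G" | sudo db-mysql db1206 \
  | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running|Last_.*Error'
```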
[15:13:15] https://grafana.wikimedia.org/d/549c2bf8936f7767ea6ac47c47b00f2a/prometheus-mysqld-exporter? jynus please refresh and compare 2 random hosts inline (mileage may vary depending on the number of hosts you want to compare, but it is not limited to 2)
[15:14:16] Sorry, I just wanted to ask as I wasn't aware of it
[15:14:52] I need to focus now on resharding + oncall, but you should show it to manuel when he is back
[15:16:47] I've added the feature, it was quick: just enabling repeat on each row :)
[15:27:06] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 140.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:27:52] dumps ↑
[15:30:48] _again_? :-(
[15:31:00] yep :-( Creating sort index | SELECT /* WikiExporter::dumpPages */
[15:31:29] 96s in the making
[15:32:43] I'm drifting back to team "stop dumps until they stop damaging prod"
[15:33:24] while I am with you technically, I guess it is only creating noise
[15:33:48] FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (5m 56s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[15:33:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (5m 56s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[15:34:21] I've downtimed the host and notified the ticket: T368098
[15:34:22] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[15:35:06] 👍
[15:56:24] elukey: I did all the tests I could with the available time at https://phabricator.wikimedia.org/T378584#10337813
[15:56:49] I say "with the available time" because I am going away on vacation soon, until next year
[15:57:26] the summary of the summary is that I can use it, but for databases manuel may need to do further tests (I cannot say it will work for them without SSDs)
[15:58:31] I will leave backup1011 and backup2011 unset up until next year, as I have moved mediabackups to backup1010/backup2010 for now
[15:58:54] and will expand bacula there next year
[16:09:34] jynus: saw it, really nice, thanks for all the details! At this point we can proceed with buying the controller for the backup hosts, and I'll follow up with Manuel while you are away on the next steps for dbs
[16:10:55] the thing is, it looks ok, but it could behave very differently with SSDs
[16:11:18] especially those stalls I saw; I don't care for backups, but they could be a no-go for dbs
[16:11:40] for backups we just care about throughput, not latency
[16:34:06] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[16:45:38] jynus: makes sense yes, we'll need to review in depth the db use case for config j
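For the dumps incident above, a rough sketch of how one might confirm which query is behind the lag: look for the WikiExporter::dumpPages comment (quoted at 15:31:00) in the processlist of the affected replica. The host name here is illustrative, and the stdin usage of db-mysql is the same assumption as before.

```bash
# Look for dump queries by their SQL comment tag; the
# WikiExporter::dumpPages marker is the one quoted in the log above.
echo "SHOW FULL PROCESSLIST;" | sudo db-mysql db1206 \
  | grep 'WikiExporter::dumpPages'
```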