[07:47:37] Platonides: it could indeed but I'm not comfortable creating a snowflake (i.e. swapping kernels) on debian just to unstick this, it might be coming from a fix in fedora's kernel that doesn't exist yet on upstream debian? Let's see what dc-ops says, I bet this is not the first time they see this issue (I've found some traces of stuff like this in
[07:47:38] previous tasks but was unable to trace it back to a proper fix)
[08:29:19] good morning, should we worry about thanos disk space?
[08:51:43] I think godog has been adjusting thanos retention; we're still a bit stuck on hardware delays :(
[08:54:20] 16th Sep SRE meeting notes said "early next week", so hopefully before I go on leave on Friday...
[09:01:00] yeah I'll take a look at the retention later today
[10:18:15] arnaudb: ok to run compare on masters, cross dc?
[10:18:34] I see no objection to it
[10:18:53] will log when I start it and will leave it on a screen running on cumin1002
[10:19:01] ack, thanks
[10:20:01] giving a heads up as it may show unusual read traffic on the masters
[12:02:26] jynus: https://orchestrator.wikimedia.org/web/cluster/alias/pc3 there seems to be an issue on pc3
[12:02:35] (fyi, looking into it)
[12:02:58] Slave_IO_Running: No
[12:02:59] Slave_SQL_Running: No
[12:03:27] sal is dead?
[12:04:41] seems stopped indeed
[12:04:45] checking
[12:05:23] there is no replication error
[12:05:27] let me check logs
[12:05:30] also: why is it not alarming anywhere else than orchestrator 🤔
[12:05:44] well, it doesn't contain canonical data
[12:05:49] ah
[12:05:49] so it is not an immediate issue
[12:05:56] data can be different between hosts
[12:06:10] it is just that it should be normally running to warm up the cache
[12:06:32] arnaudb: it just crashed
[12:06:50] let's create a ticket, you do it or I do?
[12:07:10] we can start replication, as I said, data missing is not important there - it is just a cache
[12:07:24] I let you start replication, I'll create the ticket
[12:07:30] thanks
[12:07:54] 240923 10:58:42 [ERROR] mysqld got signal 7 ;
[12:08:02] it was a software error/assertion
[12:08:17] let me put it on a paste so you can add it to the ticket
[12:08:29] <3
[12:09:54] it happened at 10:58:42
[12:10:36] T375382
[12:10:37] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[12:10:40] strangely, not a lot of errors
[12:10:48] https://phabricator.wikimedia.org/P69388
[12:11:15] it is on 10.6.14, is there a newer version?
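
(A minimal sketch of the log check and replication restart discussed above, assuming a systemd-managed MariaDB unit and a local root client on pc1013; the date comes from the error-log line in the log, everything else is an assumption:)

# Sketch only: unit name and client invocation are assumptions, not taken from the log.
# Look for the crash around 10:58:42 in the MariaDB journal:
sudo journalctl -u mariadb --since "2024-09-23 10:50" --until "2024-09-23 11:05"
# Confirm the replica threads are stopped, then restart them; losing data is
# acceptable here, pc3 is only a cache that needs to warm up again:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'
sudo mysql -e "START SLAVE"
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
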
[12:11:39] oh indeed, .19
[12:11:47] it's quite late
[12:11:52] so something maybe to note for the future
[12:11:56] after switchover
[12:13:10] {{ done }} in task description
[12:15:28] you are right about monitoring being wrong
[12:15:43] because there is monitoring in pc2013, but not in pc1013
[12:16:11] weird
[12:16:49] that is a misconfiguration as pcs should have the same config as x2, replicating both ways
[12:17:11] you can copy that in the ticket, as a TODO
[12:17:17] /actionable
[12:17:31] ack
[12:19:01] nah, I was wrong, it wasn't a software error
[12:20:32] https://phabricator.wikimedia.org/P69389
[12:20:43] oof
[12:20:47] I'll @dc-ops
[12:22:43] CC swfrench-wmf let's talk during the meeting about whether to hold off on it until switchover or not
[12:23:37] I am going to lunch, I will check the management hw info if not done by then
[12:23:55] (when I come back), but feel free to do it on your own
[12:24:21] sure :) on it
[12:24:44] ah we cross-posted x)
[12:25:19] let me add it to the agenda too
[12:27:02] thanks, hw position noted in the ticket
[12:34:23] I think these kinds of issues are why amir liked a clean orchestrator, it makes it easier to spot those gaps in monitoring
[12:34:36] going for lunch finally
[12:38:42] it was an easy catch on orchestrator, but I wish we had caught it an hour earlier, when it happened x)
[12:40:59] we had plans to monitor the uptime, but it is non-trivial to distinguish a reboot from a crash
[12:41:40] this should probably be the operator's job, with monitoring just there to raise the concern, imho
[14:06:19] sorry for talking too much @ meeting
[14:15:48] I will start by handling the replication for warmup
[14:16:07] but I may need help with dbctl from someone more familiar with it CC volans
[14:16:11] for later
[14:16:33] I may also be able to assist with the dbctl bits
[14:16:46] thanks
[14:17:07] it is currently not pooled in 2 sections
[14:17:07] sure, I'm in a meeting right now but I should be able to follow along
[14:17:17] don't worry, that may not even happen today
[14:17:23] it was more of a heads up
[14:17:57] ack
[14:18:16] so not feeling super confident about the exact commands (amir, manuel, arnaud would do it every day)
[15:20:26] https://github.com/percona/orchestrator/commits/master/ has this version of orchestrator been tested in the past? It seems that percona is maintaining this version
[15:21:48] arnaudb: this is the second time I've looked at wmf orchestrator, so no idea
[15:22:03] haha ack maybe Amir1 has an idea
[15:49:04] we should review dbctl for pcs, check if we need to set the last spare as candidate for all pcs at eqiad
[15:53:30] I am going to depool pc1013, as technically it is still pooled (which should have no impact as it is not the master)
[15:54:31] or maybe it doesn't handle it?
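
(A minimal sketch of the depool mentioned at 15:53:30, following the usual dbctl flow; the exact flags and the commit message are assumptions, not what was actually run:)

# Sketch only; run from a cluster management host.
sudo dbctl instance pc1013 depool
sudo dbctl config diff      # review the pending change before committing
sudo dbctl config commit -m "Depool pc1013 after crash - T375382"

(As it turns out just below, no commit was needed in this case because only dbctl's internal config changed for this parsercache host.)
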
[15:55:13] yeah, it says it is pooled, but it is a noop
[15:55:35] your guess is at least better than mine here
[15:55:39] actually, no, it worked but needed no commit as only internal config changed
[15:56:04] pc is a weird one
[15:56:24] I spent some time digging into what should be done on orchestrator, I saw no relevant tagging, also I saw very little tag usage (10% of our instances have a tag)
[15:56:25] I will add the same hw issue note to pc1013
[15:56:26] arnaudb@dborch1001:~ $ for instance in $(sudo orchestrator-client -c all-instances); do sudo orchestrator-client -c tags -i ${instance} ; done | wc -l
[15:56:26] 26
[15:56:26] arnaudb@dborch1001:~ $ sudo orchestrator-client -c all-instances | wc -l
[15:56:26] 292
[15:57:10] yeah, I don't think there was anything to do, except maybe removing pc1013
[15:57:18] yep
[15:57:25] I used a downtime there, but no big deal
[15:57:34] {{done}}
[15:57:36] I just changed the topology and removed the heartbeat stuff
[15:57:49] zarcillo too, did prometheus generation run ok?
[15:58:01] aka the master is set as such in the config
[15:58:01] will check
[15:58:04] please do
[15:58:08] :-)
[15:58:25] best if automation checks the change, I don't trust myself to do it
[15:59:09] generation forced on 1005 to see if it matches our topology
[15:59:28] if you don't want to go over grafana
[15:59:40] the files generated are on srv/prometheus ops or something
[15:59:52] you read my mind
[15:59:52] it generates some yaml files over there
[16:00:10] yeah I patched the script a while ago, I just didn't remember where the files were output :P
[16:01:29] T375382#10168177
[16:01:29] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[16:02:08] hm I must have missed a table on zarcillo
[16:02:09] will check
[16:03:26] the script has to run, eh
[16:03:30] it is not automatic
[16:03:37] I forced it
[16:03:44] MariaDB [zarcillo]> UPDATE section_instances SET section = 'pc3' WHERE instance = 'pc2015';
[16:03:44] Query OK, 1 row affected (0.001 sec)
[16:03:44] Rows matched: 1 Changed: 1 Warnings: 0
[16:03:47] this was missing
[16:04:02] ah, true
[16:04:12] it was set as master, but it wasn't set as part of the section
[16:04:23] as it was probably on pc4
[16:04:55] exactly
[16:05:30] swfrench-wmf: the host forced us to do the failover earlier than expected, but other than a small extra amount of misses (in theory, within normal range: https://grafana.wikimedia.org/d/a97c66ff-0e10-4d2a-b9e1-37b96b7b4d35/parser-cache-misses?orgId=1 ) things should be back to normal
[16:05:43] I'll have to sign off for the day, I'll check if my updates covered everything
[16:06:05] thanks
[16:07:53] interesting, what complained the most wasn't pc, but es, because of overload due to lack of caching
[16:08:51] "Wikimedia\Rdbms\DBUnexpectedError: Database servers in cluster26 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated" but the circuit breaking worked CC _joe_ that's actually quite good
[16:09:21] <_joe_> indeed
[16:09:32] relatively speaking, of course
[16:12:10] FWIW, T373037 will handle most of these issues (or makes it easier)
[16:12:10] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[16:16:10] a lot of this stuff will change in the next quarter, so I think let's not spend too much time on it
[16:16:52] jynus: thank you!
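
(Two of the checks mentioned above, roughly sketched: the orchestrator downtime on pc1013 and verifying the regenerated Prometheus targets. The duration, reason, instance key, and the targets directory are all assumptions; the log itself only says the files live somewhere under "srv/prometheus ops":)

# Sketch only; flag names per upstream orchestrator-client, values are examples.
sudo orchestrator-client -c begin-downtime -i pc1013:3306 -duration 24h -reason "crashed, hw check pending - T375382"
# On the Prometheus host where generation was forced, check that pc2015 now
# appears under pc3 and pc1013 no longer does (directory is a guess):
grep -rn 'pc2015\|pc1013' /srv/prometheus/ops/targets/
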
[16:19:12] ah, then I will merge that ticket into that
[16:19:25] I didn't know that existed
[16:23:34] Actually, let me put it as a subtask of it, just for awareness, and it can be dropped when necessary, as that one is more of an architecture change, and mine is more of an operational monitoring change
[16:50:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1013:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1013&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[16:50:33] maybe downtime it?
[20:50:20] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1013:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1013&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
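
(One way to silence the re-firing alert above is a host downtime; a sketch using the standard downtime cookbook, with duration, reason, and host query as assumptions rather than what was actually run:)

# Sketch only; values are examples.
sudo cookbook sre.hosts.downtime --hours 24 -r "pc1013 crashed, pending hw check - T375382" 'pc1013.eqiad.wmnet'
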