[00:57:04] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:04] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:13] checking
[07:36:43] also, db2202, which is not pooled (and is an s1 host), has had a corruption issue; will deal with it after
[07:42:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db2202:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2202&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[07:48:26] db2202 has pagelinks repairing
[09:37:46] Data engineering: whatever is happening to an-redactdb1001, I doubt it will be fixable by the end of the month, as it will take around as long to catch up as its current lag (now 8 days)
[13:13:22] Emperor: do you have a host with the original new RAID controller?
[13:15:35] I'll restart mariadb on zarcillo, FYI
[13:18:47] jynus: plenty, yes, none yet in service.
[13:19:02] ...why?
[13:19:12] Emperor: could I borrow one for a few minutes?
[13:19:53] e.g. ms-be2082
[13:20:14] jynus: sure, pp elukey and jhathaway for awareness
[13:21:10] Can I write to a disk, if I later drop everything I created?
[13:22:16] sure
[13:22:55] ok, taking ms-be2082; will ping you when done and cleaned up
[13:24:14] ack
[14:29:04] ack!
[14:41:19] Emperor: FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091597 is merged, so docker-registry-wise we should be good with hammering Swift
[14:43:21] Emperor: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092345 and the lack of... dryness... :) Are you referring to the duplication between the seeds (which I think has come up before) and the Cassandra instance configs, or the repetitiveness of the Cassandra instance configs?
[14:45:18] kind of both, really. Almost that entire CR looks like it should be doable from roughly "hostname of the node & that it has -a -b -c"
[14:47:34] so... the seeds aren't actually a Cassandra cluster thing, they're a restbase thing (and everything there is called "restbase", but there is a distinction). I'm not sure what you would/could do there without creating unwanted coupling, BUT all of that should be going away soon (because restbase-the-service is being removed from that cluster).
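As a rough illustration of the repetition being discussed (the hiera keys, hostnames, addresses and rack values below are invented for the example, not the actual production hieradata), a per-host stanza in a change like this tends to restate the same seeds list and near-identical instance blocks on every node:

```yaml
# Hypothetical sketch of the repeated hieradata pattern under discussion;
# key names, hostnames, IPs and rack labels are illustrative only.
profile::cassandra::seeds:
  - restbase2001-a.codfw.wmnet
  - restbase2002-a.codfw.wmnet
  - restbase2003-a.codfw.wmnet
profile::cassandra::instances:
  a: { listen_address: 10.192.0.10, rack: a }
  b: { listen_address: 10.192.0.11, rack: a }
  c: { listen_address: 10.192.0.12, rack: a }
cassandra::jbod_devices: ['sda4', 'sdb4', 'sdc4']
```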
[14:48:48] Emperor: I actually didn't need the host, I didn't even touch it
[14:49:04] for the rest of it, you're configuring the instance(s), and while in a perfect world that would always be the same, they aren't necessarily the same
[14:49:19] (I just ran lspci on it)
[14:49:23] jynus: FE
[14:49:41] as in, we would endeavor to keep them the same because consistency is nice, but they can be (and have been) different
[14:50:01] and it would be an error for any of that to change (sans a reimage)
[14:50:36] which is not to say it couldn't be better, only that I'm not sure what that would look like
[14:50:50] Let me spitball a couple of things, and you can tell me I'm wrong :)
[14:53:19] in the hieradata where we assign jbod_devices - presumably we already know what those should be, since the reimaging process has to know that (so there's already a risk of hiera being out of sync with the reimage); likewise the rack is knowable from either netbox or the IP address...
[14:53:56] are the "seeds" you mention the lengthy stanza in hieradata/role/common/restbase/production.yaml? It'd be good if that were going away :)
[14:54:43] no, that's hieradata/role/codfw/restbase/production.yaml
[14:55:22] Ok, so jbod_devices does not necessarily map to devices
[14:55:49] it does here, but isn't constrained to, and it doesn't at all in other clusters.
[14:56:54] and for restbase, the rack (so far) matches one-to-one with the concept of a _row_ in the data-center, but as we run out of room to provision nodes in a row, we've been using others and creating... "meta rows", I guess
[14:57:14] basically saying: Ok, rows A & D will be "rack A"
[14:58:43] jbod_devices> `git grep cassandra::jbod_devices` only produces lines of exactly the form `cassandra::jbod_devices: ['sda4', 'sdb4', 'sdc4']` ?
[14:59:07] urandom: row> so is the question then how I should have checked the rows were correct?
[14:59:46] yes, that's a good question!
[15:00:43] those row equivalencies haven't happened (yet) for restbase; for the other clusters where they have, I've left comments in all of those hosts' hiera files that document the mappings
[15:00:46] [ for the very lazy> git grep 'cassandra::jbod_devices' | cut -d ':' -f 2- | sort -u ]
[15:00:58] that is terrible, I know, a comment
[15:01:10] it's also on wikitech, but I'm not sure what else to do there
[15:01:33] sorry, I'm not trying to be difficult, nor saying this should all get fixed now! Just reviewing this has made me ask how we could make it easier/better
[15:01:44] Emperor: yes, for the restbase cluster, right now, we are fortunate in that all of the hosts have the same disk config (they didn't always)
[15:02:12] no no, I also hope there is a way to make it better!
[15:02:31] It's awful no matter the reason
[15:03:07] urandom: with swift we encode the disk mapping in one place, and then just set hiera for "it's a host of this sort" in regex.yaml
[15:03:31] like a profile or alias?
[15:03:36] likewise, for racks I would be tempted to have a mapping of physical rack -> cassandra rack in one place [like we do with swift ring zones]
[15:04:30] urandom: yeah, or set a hiera variable (e.g. like we did when we toggled servers_per_port for swift)
[15:05:20] hah, though I see a bug in how that's done for the new thanos h/w, which I will now fix 🤦
[15:07:20] So... you'd take the machine's actual location in netbox, evaluate that against a mapping of real to virtual row/rack in puppet, and use that to automatically assign the rack in Cassandra?
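A minimal sketch of the idea being summarised here, assuming a single shared hash rather than per-host values (the key name, row labels and groupings are hypothetical, not the real puppet interface): the physical-to-virtual mapping lives in one place, and per-host config derives its Cassandra rack from the host's netbox location (or network) via that hash.

```yaml
# Hypothetical hiera sketch: define the physical row/rack -> Cassandra rack
# mapping once, instead of hard-coding a rack in every host's hiera file.
# Row names and groupings are illustrative only.
profile::cassandra::physical_to_virtual_rack:
  row_a: rack_a
  row_d: rack_a   # the "meta row" case: rows A & D are both treated as rack A
  row_b: rack_b
  row_c: rack_c
```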
[15:07:42] Yep (or do the lookup by network if that's easier)
[15:08:02] In that case, anyone who updated the location in netbox would break a cluster
[15:08:39] even if they actually moved the machine and had updated netbox accurately, that would create breakage
[15:09:05] bad breakage actually, topology breakage
[15:09:51] unless you could restrict that to a one-time deployment action
[15:10:20] Is "relocate cassandra storage nodes" a thing we expect to do?
[15:11:10] [without going through the cassandra puppetry in detail I don't know how hard/routine/impossible it would be to not change a rack once deployed]
[15:11:18] no, but I suspect that accidentally changing a location in netbox is an easier mistake to make than committing a puppet changeset
[15:11:37] that would be an argument for doing it by network instead, certainly
[15:11:42] and having more than one vector to destroy a cluster is probably not ideal either :)
[15:12:00] urandom: while I'm wasting your time, would you mind looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092847 please? to fix the bug I spotted
[15:12:17] sure
[15:12:28] (and this is not a waste of time)
[15:13:50] Thanks :)
[15:15:29] Anyway, so I think that if we were to inadvertently change a host's network, we probably have bigger/more problems, but insofar as topology goes this would create a problem where none existed before. Meaning: you could change a node's network without breaking the cluster, but if you change the effective rack you will most definitely break it.
[15:16:18] Again, I don't know if that is a valid argument against, because if we're accidentally re-networking cluster hosts, we're probably already toast.
[15:17:07] Mmm
[15:17:38] again, if we could make it deployment-only, and immutable thereafter, that's a non-issue
[15:18:31] maybe if that properties file weren't a template, if it were generated wholesale, and you used file existence as a condition
[15:20:38] I think the approach depends on what you expect to happen if you renumber a cassandra host currently
[15:23:48] (I think the answer is 🔥 since renumbering it will change the set of seeds in the cluster, which is presumably Bad(TM)?)
[15:24:05] It's not
[15:24:52] Everything here is pretty resilient when compared to changing the rack, which is pretty much guaranteed data loss
[15:25:39] I mean, if the actual location were wrong, then you might have data unavailable in a data-center failure mode that you didn't expect, but no actual loss
[15:25:59] I think I've reached the part of this discussion where my brain is ready to go into "stop trying to put a square peg in a round hole, and make the peg round" mode
[15:26:11] Fair enough
[15:26:31] Or, "redefine the problem"
[15:26:51] Like when you can't quite remember how to spell a word, so you choose an alternative :)
[15:27:57] Networking wants us to stop using secondary IPs, and that's now possible (by binding to different ports on the same host IP).
[15:28:39] And the unit of failure in the data-center is no longer the row; everything is now (or will soon be) interconnected, making by-rack (1:1 with netbox) feasible
[15:30:54] And I would love to simplify the disk configs of all the clusters, and we're being asked to (where possible) increase density and do more/the same with fewer hosts, so actually using jbod might be The Answer™ there too.
[15:31:08] :)
[15:31:26] None of which could easily be done with the existing clusters. :(
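A hedged sketch of the "no secondary IPs" future state described above, assuming per-instance ports on a single host IP (the key names, ports and layout are invented for illustration, not the real puppet interface):

```yaml
# Hypothetical sketch: all instances bind the host's primary IP and are
# distinguished only by port, so no secondary IPs are needed.
# Key names and port numbers are illustrative only.
profile::cassandra::instances:
  a: { storage_port: 7000, native_transport_port: 9042 }
  b: { storage_port: 7001, native_transport_port: 9043 }
  c: { storage_port: 7002, native_transport_port: 9044 }
```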
[15:32:24] But we could probably come up with migration plans that convert them in place, decommissioning and rebootstrapping everything
[15:33:15] Combined with attrition to slowly change hardware profiles...
[15:33:57] * urandom reaches for a brown paper bag as he begins to hyperventilate
[15:34:16] la la la, T123918
[15:34:16] T123918: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918
[15:37:23] (never mind how long the disk_by_path migration is going to take...)
[15:39:29] it's like a river carving out a canyon sometimes
[16:10:45] let's start calling this the oxbow approach to change management, then?
[23:39:43] urandom, Emperor: I quipped y'all at https://bash.toolforge.org/quip/zBrKRpMBFk7ipym_NyjP, because wow, https://commons.wikimedia.org/wiki/File:Meander_Oxbow_development.svg seems very much to describe how we do some of the bigger migrations.