[06:48:42] pii cookbook running for T378469 T378462
[06:48:42] T378469: Prepare and check storage layer for tcywikisource - https://phabricator.wikimedia.org/T378469
[06:48:43] T378462: Prepare and check storage layer for tcywiktionary - https://phabricator.wikimedia.org/T378462
[08:26:59] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:18] it's me ↑
[08:27:24] (will downtime it)
[08:39:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084710 <-- could I get a +1, please? Yesterday I carefully checked the upstream default scrape interval, and forgot that I'd set it to 60s locally to match our default...
[08:48:54] https://phabricator.wikimedia.org/P70686 → inputs welcomed on the question "Best way to discover locally if a server has mariadb", which can apply for all systemd services
[08:49:16] cc volans ↑ :)
[08:51:31] Do we not know from puppet whether a server is single-instance or multi-instance?
[08:51:41] [which may just be a way of saying I don't understand the problem you're trying to solve]
[08:52:18] ah I see what I failed to describe, let me edit :D (this is in the context of a cookbook, so we "discover" at runtime some context information)
[08:53:23] Added phrasing for context, thanks Emperor
[09:00:57] do we install the mariadb@.service template unit on all servers, or only on multi-instance ones?
[09:01:23] all
[09:01:26] unfortunately
[09:01:35] shame, otherwise you could use systemctl list-unit-files
[09:01:48] I know but that's also not really a possibility
[09:01:55] list-unit-files shows just the template
[09:01:59] doesn't tell you which sections
[09:02:20] because shows the files on disk
[09:02:25] not the logical units
[09:02:45] and, IIRC, we don't set up databases to auto-start, so it's not like we could look in multi-user.wants or anything like that?
[09:03:01] we try to avoid autostarting yep
[09:03:51] We should setup a source of truth for that. We could call it zarcillo2
[09:03:52] So the only way a human knows what to do is asking puppet / orchestrator ?
[09:03:59] jynus: LOL
[09:04:00] jynus: xD
[09:04:20] ...which you don't want to do from a cookbook, is that correct?
[09:04:36] I think we can query hieradata facts from puppet
[09:04:43] from cookbooks**
[09:04:43] Emperor: orchestrator doesn't know yet
[09:04:47] at setup time
[09:05:11] that's the point- puppet should not know about the data, it is a config management, not orchestration
[09:05:15] we can query orchestrator, puppetdb, hiera lookups from cookbooks, no problem in that :)
[09:05:37] jynus: but it does right now... sets up the config files
[09:05:55] then check the config files
[09:05:59] I think that querying puppet is the right thing to do now
[09:06:06] so, my initial idea was: checking the filesystem for config files, we could move above this and check difference between puppet and existing config
[09:06:13] and stop if a discrepancy exists?
[09:06:54] if puppet and filesystem disagree that is probably a sign to stop (or maybe handle it specifically later)
[09:07:44] if we later change how setup works, the cookbook could change then :)
[09:07:44] and we would be able to 1. discover instances, 2.
validate config safety
[10:05:47] hey folks, not sure if already discussed but there is a proposal for the hw-raid controller to use for Supermicro Config J: https://phabricator.wikimedia.org/T371416#10273774
[10:08:56] yeah, I didn't discuss that, but the truth is "I don't know"
[10:09:37] I can imagine yes, should we create a task with a list of tests/etc.. to do once the first controller arrives?
[10:09:47] rather than having dcops buying a lot of them in one go
[10:11:52] It is also a different setup (not in brand, but in technology) so I cannot say. From what I see, it is a first in the cluster.
[10:12:05] yep yep
[10:12:24] * performance on writes and reads
[10:12:36] * reliability on disk loss
[10:12:56] * monitoring from linux
[10:13:33] * Data persistence if the controller dies
[10:17:28] * Emperor laughs at the final point
[10:18:33] I can see point 1) and 3) to be the most critical to test, the other two may be more difficult and we'd need to trust the specs on paper (or do you think there is a way to do a proper test?)
[10:19:43] Why on paper?
[10:19:47] we can also ask Willy to check if supermicro can provide more options, and we can decide the best one that suits our need (namely, the best one to buy and do the real tests)
[10:19:50] we should be able to eject a disk
[10:20:13] And see that, e.g. mysql doesn't crash due to io stopping responding
[10:20:15] okok perfect, I was thinking more on the "if the controller dies" part
[10:22:35] If we have 2 it is easy to test- will its configuration survive or can it?
[10:22:41] by replacing it
[10:23:03] sorry didn't get this part
[10:23:21] my past experience has been that data very rarely survives a RAID card failure
[10:23:25] (please read what I am writing as "I am very ignorant in the subject, and I am trying to understand it better")
[10:56:17] I am going to open a task for the new raid controller summarizing this chat
[11:05:59] aaand https://phabricator.wikimedia.org/T378584
[11:11:57] elukey: while we're at it, DYK where we are on the JBOD 24 disks question?
[11:12:13] there could be other things for the dcops team, like how easy it is to set up/automate or service
[11:12:24] but leaving those to them
[11:12:48] exactly yes
[11:13:06] Emperor: last that I know is https://phabricator.wikimedia.org/T371400#10269819
[11:13:45] so IIUC there is a ticket opened with supermicro
[11:14:47] OK, I will try and be patient, but it's a bit 😱
[11:17:47] yep yep I can imagine, but I suspect that we'd need to change the raid controller for the 24 disks
[11:18:23] if so, it would be really a pity since supermicro could have warned us about the fact that the controller didn't work
[11:18:48] I'm pretty sure their web-page about this system says both "24 disks" and "JBOD"
[11:20:44] have they tested that config though? :D
[11:20:53] not assuming bad faith but..
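
Editorial note on the 09:06-09:07 exchange (not part of the original log): the idea floated there, i.e. discover instances from config files on disk, compare them with what puppet expects, and stop on any discrepancy, could look roughly like the sketch below. Everything in it is an assumption for illustration: the function and exception names are hypothetical, the "one `<section>.cnf` file per instance" layout is a guess, and the expected set is passed in by the caller rather than fetched from puppetdb or hiera, since the chat does not say how that lookup would be done.

```python
from __future__ import annotations

from pathlib import Path


class InstanceMismatch(Exception):
    """Raised when the expected instances and the on-disk config files disagree."""


def discover_instances(config_dir: Path, pattern: str = "*.cnf") -> set[str]:
    """Infer instance names from config files on disk.

    Assumption: one config file per instance, named <section>.cnf.
    """
    return {path.stem for path in config_dir.glob(pattern) if path.is_file()}


def validate_instances(expected: set[str], config_dir: Path) -> set[str]:
    """Return the discovered instances, or stop if they differ from `expected`.

    `expected` stands in for whatever puppet says should be on the host;
    how it is obtained is left to the caller (hypothetical).
    """
    on_disk = discover_instances(config_dir)
    if on_disk != expected:
        missing = sorted(expected - on_disk)
        unexpected = sorted(on_disk - expected)
        raise InstanceMismatch(
            f"missing on disk: {missing}; not expected by puppet: {unexpected}"
        )
    return on_disk
```

A cookbook-style caller would run `validate_instances({"s1", "s2"}, Path(some_config_dir))` first and abort on `InstanceMismatch`, which matches the "if puppet and filesystem disagree that is probably a sign to stop" remark.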