[13:54:16] hi all in relation to ilo how do you get a console. running iLO Advanced License required for this functionality. [13:54:36] sorry i'll try again. [13:54:43] yeah :-D [13:54:49] running TEXTCONS on ms-be1022 gives me a message [13:54:53] jbond42: run: vsp [13:54:54] iLO Advanced License required for this functionality. [13:55:06] marostegui: cheers [13:56:55] the HP text console is very Advanced indeed, they probably charge $1K per ssh session just to support all the amazing software engineering work that goes into supporting it. [13:58:35] it was the vsp i wanted, been a while since i logged on to ilo and missed it in the help output [13:59:12] IIRC last time I tried to use VSP on an HP host, it was super slow. It felt like being on an old 9600 baud serial connection or something. [13:59:32] oh i think both drac and ilo suffer from that tbh [13:59:40] but maybe that was just specific misconfiguration of the one I remember, it may not be a general problem [14:00:49] usually my drac ones are reasonably fast. I mean, there's some interaction latency, but screen paint comes through reasonably-fast when opening some bios setting window or whatever. [14:01:16] but the VSP one really seemed like it was stuck emulating some low-rate serial interface, even just painting an updated bios settings screen took forever. [14:01:56] ahh in that case not sure, the one i just used seemed to be similar to drac, which as you say is not as bad as 9600 :) [14:01:59] but the hosts I'm thinking of were quite some time ago, and had other problems. who knows, hopefully fixed! :) [14:02:49] the whole of how bios and remote consoles work on x86 servers these days is a major annoyance to me. [14:03:20] I mean I get there's legacy concerns and there's people running windows servers still as a major market, and that like to cart around real VGA+keyboard stations or whatever....
[14:03:46] but the linux server market is big enough, you'd think they'd provide alternate firmware that does things in the way mass linux server deployers would expect and enjoy the most.... [14:28:02] the latest iDRAC versions support VNC for the VGA [14:28:10] which we don't enable, so I haven't tested [14:28:26] in general, I think we can do better, we're doing some weird stuff [14:28:39] like setting the console to ttyS1 unconditionally, although there isn't a good reason for that anymore I think [14:29:09] also, I think we buy the iLO advanced license? may be a misconfiguration [14:30:14] re: ttyS1 you mean as opposed to using the first port, or? [14:30:59] yes [14:31:02] VNC VGA consoles might be interesting/better, if we document an easy way to ssh-tunnel through to them (I'm sure it's easy to figure out) [14:31:43] well better than the serial experience, given the constraints of the machines we have today. [14:31:56] I think serial can work well, not sure what your experiences are [14:32:04] we also do inconsistent stuff there I think wrt redirection after boot [14:32:10] yeah for sure [14:33:23] with the right settings, I think things are capable of working well. [14:33:38] I don't know where we are on validating that initial (manual) bios settings end up correct though. [14:34:41] it'd be nice to automate the setting of them, or at least automate the validation of all the details at the end, because it's too easy to miss a setting that still allows imaging to work, but e.g.
we have the wrong powersaving profile or the wrong CPU settings wrt virtualization support or hyperthreading, etc [14:35:24] (or stuff about usb and physical security, etc) [14:36:45] yeah I was looking at that [14:36:52] the new Dells can fetch a json over HTTP(S) [14:37:01] with all iDRAC settings, which nowadays include BIOS settings [14:37:28] there is also a cross-vendor RESTful API that replaces IPMI, it's called Redfish [14:37:31] the problem I've seen in the past with such efforts is that there's so much variance for your automation to account for between models and firmware levels and server generations, etc [14:37:32] there are python libraries for that etc. [14:37:49] that's definitely a problem yes [14:38:00] this is something that I've wanted to bring up at the summit [14:38:01] (because even two nearly-similar machines end up having slightly-different names for the same setting or whatever) [14:38:07] but the way I see it [14:38:10] what we need at this point [14:38:18] is a service of some sort, backed by some configuration/storage [14:38:43] that given a hostname (or really: SMBIOS/DMI serial number for bare metal, or UUID for VMs, which we can map back to hostname through Netbox) [14:38:52] it can return: [14:39:41] 1) a) iDRAC configuration profile (little variance) b) the BIOS settings part of that c) the HW RAID parts of that [14:40:22] 2) a partman recipe (to avoid the huge case/esac with patterns in shell in our d-i config) [14:41:19] 3) a Puppet role (i.e. ENC) [14:41:59] plus networking configuration (IPs, VLANs, etc.) that we should put in Netbox [14:42:21] so I definitely get (1). It would be a huge win to have some service managing that and giving us an easy API to pull it for any server hostname/ID/whatever, to plug into other efforts and auditing, etc [14:42:27] oh yeah and 4) the selected distribution for provisioning (stretch etc.) [14:42:41] and there is a 4b/5 which is an alternative profile, i.e.
"wipe" or "rescue" [14:42:59] 4+5 are in the works, we're not far off from that, probably done in Q1 [14:43:31] networking configuration is also relatively close, and includes the DNS bits of it are in progress [14:43:34] I would've expected 3/4/5 to be off in netbox or other such layers, not in this bios-fetching/storing service or whatever. [14:43:45] so we could do all of it in Netbox [14:43:46] paravoid: you're spoilering it all before the summit [14:43:50] or none of it, or some combination [14:43:51] 2 is in a weird middle-ground [14:44:03] in theory we could just add custom fields for all of that in netbox and have it all there [14:44:18] the problem is that we'd need another level of tooling to be able to apply regexps like we do now [14:44:28] (are we just moving the huge case/esac to somewhere less-visible, or auto-generating dynamic partman based on some higher-level inputs from the role and the hw raid config, or?) [14:44:50] that's my idea yes [14:44:58] I think I'd start with moving the case/esac that chooses a profile, while leaving the profiles in ops/puppet [14:45:01] based on puppet role and hw specs [14:45:11] but then there is the orthogonal problem that the profiles are not very DRY at all right now, so we should also fix that [14:45:17] there is a lot of repetition for no good reasn [14:45:46] I have some ideas on how to fix some of that, but not all [14:46:13] well there is a good reason, and it's that partman sucks and nobody should have to dig into making those profiles DRYer. Usually you call it a win if you can get a working partman config at all and don't care if it's a 90% duplicate of another existing one. 
[14:46:28] well, I'd like to fix that too :) [14:46:45] but yeah, different level of problem [14:47:02] so the "partman chooser" I'd like to tackle first [14:47:42] basically, I'd like to put new hosts in prod without having to do 6 different commits in 3 different parts of our tree(s), and hand-crafting regexps and shell cases and dhcpd configs and DNS entries [14:48:42] dc ops adds cp2031 to netbox with its serial number, you say "I want this to be exactly like cp2021, but I want to deploy buster instead of stretch" and just run wmf-reimage [14:48:42] what scares me (perhaps irrationally!) is that some of this may never really get functionally better than we have today, and in effect we just shuffle things around such that something (e.g. role-mapping of server hostname regexes, or partman case blob, etc) is just now netbox-ified, which means it's not in git history and we have less tooling to work on it with and we're doing in some netbox UI to do [14:48:48] the same thing we could've done easier in a git commit.
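The "partman chooser" idea above (pick a profile from puppet role + hw specs, while the profiles themselves stay in ops/puppet) could be sketched roughly like this; all role names, hw facts, and recipe filenames here are hypothetical placeholders, not the actual repo contents.

```python
# Sketch of a partman chooser: map (puppet role, hw facts) to a recipe
# name, replacing the big case/esac on hostnames in the d-i config.
# Entries are (role, predicate on hw facts, recipe); first match wins.
RECIPES = [
    ("cache::text",   lambda hw: hw.get("raid") == "hw", "raid1-hwraid.cfg"),
    ("cache::text",   lambda hw: True,                   "raid1-lvm.cfg"),
    ("mariadb::core", lambda hw: True,                   "mariadb-raid10.cfg"),
]

def choose_recipe(role, hw):
    for r_role, pred, recipe in RECIPES:
        if role == r_role and pred(hw):
            return recipe
    return "standard.cfg"  # safe default for unlisted roles

print(choose_recipe("cache::text", {"raid": "sw"}))  # raid1-lvm.cfg
```

The point of the structure is that only the chooser table lives in the service/Netbox layer; the recipes it names stay reviewable in git, matching the "move the case/esac, leave the profiles in ops/puppet" split.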
[14:49:01] 17:43 < paravoid> so we could do all of it in Netbox [14:49:03] 17:43 < paravoid> or none of it, or some combination [14:49:11] that's part of the discussion I want to have in Dublin :) [14:49:29] there is a range of options [14:49:43] it could be another thing we have that is a YAML, it could be a web tool that we write that keeps an audit log [14:50:02] it could be Netbox if we think that we don't care [14:50:02] yeah there's a lot of depth to this [14:50:04] I don't know [14:50:16] but at the end of the day, I think all of this config is tied together and should be in one place [14:50:27] HW RAID + SW RAID + partitioning are really faces of the same config [14:50:51] "which tool" is less the issue than "how do we standardize and/or automatically-generate the toil part and only describe at a high level what really matters to the human decision-maker and has practical value" [14:50:59] yes :) [14:51:19] but there is the "in one place" part, and there is also the "who" part [14:51:25] also whether the tool has appropriate review, control and logging mechanisms [14:51:39] we need to also reduce coordination points where we hand-off a server 4 times between service owner and DC Ops [14:51:49] another important angle I'd like to resolve in Dublin is to agree on the steps to get to the final state, I think we mostly agree on the final state and the hard part is to get there without breaking stuff or adding more toil [14:51:51] I love netbox, much better than other things, but git + review chain is a nice workflow [14:52:10] dc ops asks "public or private" from the service owner to know whether the DNS is eqiad.wmnet or wikimedia.org [14:52:13] it would be nice to have the strengths of both [14:52:13] then adds it [14:52:25] then needs to install the server, but doesn't know the partitioning [14:52:33] or HW RAID settings [14:52:49] asks HW RAID settings first to do the changes using the crash cart, then does them [14:53:00] (and even within DC Ops,
between rob and chris/papaul) [14:53:20] these are things that should be defined in one place, by the service owner [14:53:59] from the "consumer" pov, I imagine my ideal state of affairs is something like this: [14:55:35] I can define one or more abstract "profiles" or whatever you want to call them, like 2019-stretch-r440-cache-node, which specifies in high-level terms something like: "uses puppet role Foo, expects standard mirrored root disks, HT on, Virtualization off, debian stretch". [14:56:06] and then easily map hostname regexes to it, saying that the above profile applies to cp30[56][0-9], but a different setting may apply to cp10[78][0-9]. [14:56:27] heh, yes, that's a great way to think of this [14:56:58] (and that's one of the problems with netbox, no regexps or anything, we'd have to write some tooling that uses the API to make mass changes based on a regexp like that) [14:58:08] and the "standard mirrored root disks" part: I'd expect the HW raid pull + auto-generated partman to "just work" and cover odd cases where some hosts may use actual hw raid, and some mdadm/lvm to do the mirroring or whatever. [14:58:25] and I'd assume there's a lot of other high-level inputs especially on the disk front, it's hard to think of all the cases. [14:58:46] (I'm sure db servers and other storage-heavy things have specific abstract-ish needs) [14:59:41] of course, we can always define a split between the basic rootdisk setup (which may have few variants) at this level, and leave non-root storage to something else that's managed later in the pipeline by actual runtime puppetization or whatever. [14:59:58] oh you mean just saying "I want RAID1 for /, if the box has a HW RAID controller use that, otherwise set up an md"?
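The hostname-regexp-to-profile mapping described above could be as small as an ordered list of patterns, first match wins. The cp30[56][0-9] pattern and the 2019-stretch-r440-cache-node profile name are taken straight from the example; the second entry's profile name is a made-up variant for illustration.

```python
# Sketch: map hostname regexps to abstract provisioning profiles,
# as in the "profile applies to cp30[56][0-9]" example above.
import re

PROFILES = [
    (r"cp30[56][0-9]", "2019-stretch-r440-cache-node"),
    (r"cp10[78][0-9]", "2019-stretch-r440-cache-node-eqiad"),  # hypothetical
]

def profile_for(hostname):
    """Return the first matching profile name, or None."""
    for pattern, profile in PROFILES:
        if re.fullmatch(pattern, hostname):
            return profile
    return None

print(profile_for("cp3051"))  # 2019-stretch-r440-cache-node
```

This is also the piece Netbox can't express natively (no regexps), so it would live in whatever tooling sits in front of the Netbox API.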
[15:00:03] yeah [15:00:14] yeah that'd be ideal, but I'd leave that for a v2 tbh [15:00:25] yeah this was my ideal case [15:00:56] ideal is "I care about the resulting properties observable as a runtime user of this host, but I don't actually care about many of the low-level hardware details of how you get there" [15:01:47] for the root disks case though, I imagine with a little pressure on standardizing, we have very few true needs for non-standard setups. [15:02:51] most of the real need for variance is probably about whether and how much of the "rest of the root disks" is allocated to some other purpose and how it's partitioned. [15:05:42] you could argue for pushing some of the standardizing all the way down to the hardware ordering, e.g. just telling everyone "look from now on, we buy two minimal-size SSDs that are used exclusively for the base OS / rootfs needs, and nothing else goes there. You have storage needs, you buy separate disks and puppetize the configuration of them at runtime", and make an exception path where someone really [15:05:48] has to argue for the cost/benefit of variance for a specific large cluster or whatever. [15:06:16] in my limited experience, that seems to work alright for swift backend hosts [15:07:14] the previous-gen (current at all sites but the newest) varnish hosts currently vary on that. We buy only a pair of fast/expensive/oversized SSDs, and we put the mirrored root at the front of them and use the rest in a custom way for disk cache storage. [15:08:12] but we probably could've been forced into the other model (and had two cheap SSDs for root, and two extra SSDs exclusively for cache) and it would've worked fine and the cost diff would've been minimal, and the standardization gains are nice. Plus it takes heavy spikes on the rootfs from some monitoring thing or whatever out of the applayer i/o picture.
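The "RAID1 for /, use the HW RAID controller if present, otherwise set up an md" idea (the v2 mentioned above) boils down to one small decision function; the recipe filenames here are invented placeholders and the has_hw_raid flag would really come from fetched controller inventory.

```python
# Sketch: same declared outcome (mirrored root), backend chosen from
# what the box actually has. Recipe names are hypothetical.
def root_mirror_plan(has_hw_raid):
    if has_hw_raid:
        return {"backend": "hwraid", "level": "RAID1",
                "partman": "raid1-hw.cfg"}
    return {"backend": "md", "level": "RAID1",
            "partman": "raid1-md.cfg"}

print(root_mirror_plan(False)["backend"])  # md
```

The caller only ever states the observable property ("mirrored root"); the hardware detail stays an implementation choice of the pipeline.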
[15:08:32] for dbs, where we have lots of variance in needs we started doing that (buy the largest form factor we need), and for smaller dbs we just integrate many instances on the same host. It makes our life much easier and saves a lot on hw in the end [15:08:57] just mentioning that there is also the option of having the OS live on a RAID1 of SD cards (IIRC dell offers that option) and use the disks for the real dynamic stuff (modulo logs that are always a mess) [15:09:08] yeah that too perhaps [15:09:46] there's also an internal usb port you could put a usb disk on ;) [15:10:14] or we just move almost-everything into true VMs and/or k8s. managing the standard underlying hardware for all of that gets simple, and maybe there's like 5 other non-standard bare metal use-cases or whatever. [15:10:43] that's a tempting option tbh [15:11:50] forget SD [15:11:56] the new Dells have two M2 slots [15:11:56] M2: Confirm MediaWiki Account Link - https://phabricator.wikimedia.org/M2 [15:12:03] M.2 stashbot :P [15:12:13] this is the format that laptops use [15:12:44] two of them specifically exist for root filesystem use [15:13:22] so we can standardize on 2x240GB NVMe M.2 for /, knowing they wouldn't be easily replaceable/hot-pluggable [15:13:30] and then all the hotplug bays to be used for storage [15:13:51] yeah [15:15:05] "Boot Optimized Storage Subsystem (BOSS): HWRAID 2 x M.2 SSDs 120GB, 240 GB with 6Gbps" [15:15:31] https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Boot-Optimized-Storage-Solution.pdf [15:31:30] oooh neat [17:05:36] notebook1003 is alerting for low disk space, who can I ping about it?
[17:08:37] elukey ^ [17:10:43] I think this is labs [17:10:58] hmm I don't remember really [17:17:13] elukey indeed (analytics etc) [17:40:33] hello :) [17:40:49] yeah it is alerting due to people using too much space for their homes, I need to follow up with them [17:40:56] nothing horrible for the moment [17:40:59] lemme ack it [17:41:04] elukey: https://phabricator.wikimedia.org/T224682 [17:41:12] I acked it and opened a task [17:41:18] <3 [17:41:25] 2% nothing horrible? :) [17:46:36] in this case yes, not because we don't care but because people already know the issue but sometimes they exceed their home dir size. We'd probably need to enforce quotas [17:46:42] but still wip :) [17:51:51] ok :) [17:52:02] elukey living on the edge [17:53:05] mutante: is that one for you? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab2001&service=puppet+last+run [17:53:47] XioNoX: yes, it is. i will take it. [17:53:55] reinstalled that yesterday [17:54:21] cool, thx! [17:54:36] though it's also known that aphlict does not start on the non-active server.. but should not look like this [17:56:47] mutante: can you ack the alert? trying to see how much I can reduce the current Icinga criticals [17:57:21] done! i like very much that you are doing this [18:36:46] yay, no more criticals: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=detail&sortobject=services&servicestatustypes=16&hoststatustypes=3&serviceprops=42&sorttype=2&sortoption=6 [18:39:12] very nice!
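A per-home usage report of the kind a quota check for notebook1003 might run could look like the sketch below; the home root and limit are assumptions for illustration, not the actual alert's config.

```python
# Sketch: report home directories over a size limit, the kind of
# pre-quota check mentioned above. Default path/limit are made up.
import os

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def over_quota(home_root="/home", limit_bytes=20 * 1024**3):
    """Return {username: bytes_used} for homes exceeding the limit."""
    offenders = {}
    for entry in os.scandir(home_root):
        if entry.is_dir():
            size = dir_size(entry.path)
            if size > limit_bytes:
                offenders[entry.name] = size
    return offenders
```

Running this from cron and attaching the output to the alert would turn "low disk space" into "these users need a follow-up", which is what the ack/task workflow above did by hand.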
[18:42:52] now on to the warnings [18:43:07] UNKNOWNs [18:45:13] feel free to look at them :) [18:45:18] keeping the worst for the end [18:46:43] herron: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mx1001&service=exim+queue#comments [18:52:48] 21 unhandled UNKNOWNS: 13 x HP RAID (https://phabricator.wikimedia.org/T210723) ACKED | 6 x ps1-b5 (https://phabricator.wikimedia.org/T223126) ACKED | 2 x labmon/prometheus - creating new ticket for it [18:55:26] awesome, thanks! [18:59:36] mutante: is there a doc on how to add a runbook to an icinga alert? [19:01:30] for most declarations in puppet you simply need to add a notes_url argument. the one exception IIRC is check_prometheus where we have some self-inflicted technical difficulties [19:02:12] cdanis: thanks, I'm looking for a doc I can link people to [19:02:32] ahh, I would also love to know of such a thing [19:02:33] eg. to https://phabricator.wikimedia.org/T224692 [19:03:00] there is T197873 [19:03:01] T197873: link Icinga checks to runbook / notes URLs - https://phabricator.wikimedia.org/T197873 [19:03:10] but that's more of a worklog than a document [19:05:40] I added a couple lines to https://phabricator.wikimedia.org/T224692 [19:12:14] XioNoX: i made the meta runbook [19:12:15] https://wikitech.wikimedia.org/wiki/Icinga/Runbooks [19:13:14] perfect! [19:16:25] XioNoX: looks like we have 813 messages queued for a single gmail user causing that alert [19:16:28] SMTP error from remote mail server after RCPT TO:: 452-4.2.2 The email account that you tried to reach is over quota. [19:17:02] I’ll ack [19:17:10] thx! [19:17:20] I already ACKed with the task :) [19:17:21] oh nvm has already been acked!
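A queue check that does some digging itself, distinguishing one backed-up recipient domain (like the 813-messages-to-one-gmail-user case above) from many, might look like this sketch; the thresholds are made up, and in practice the recipient list would come from parsing `exim -bp` output rather than being passed in directly.

```python
# Sketch: classify an exim queue by recipient domain. One backed-up
# domain is probably the remote side's problem (warn); several at once
# probably means our mail path is broken (critical). Thresholds invented.
from collections import Counter

def queue_verdict(recipients, per_domain_warn=500, crit_domains=3):
    """recipients: queued addresses, e.g. parsed from `exim -bp`."""
    by_domain = Counter(addr.rsplit("@", 1)[-1] for addr in recipients)
    backed_up = [d for d, n in by_domain.items() if n >= per_domain_warn]
    if len(backed_up) >= crit_domains:
        return ("CRITICAL", backed_up)
    if backed_up:
        return ("WARNING", backed_up)
    return ("OK", [])

print(queue_verdict(["user@gmail.com"] * 813)[0])  # WARNING
```

This matches the "high queue on multiple providers → critical, high queue for 1 user → ignore for a while" idea discussed later in the log.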
[19:32:34] mutante: I tried to investigate this alert but can't figure it out https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sodium&service=Debian+mirror+in+sync+with+upstream [19:32:48] mutante: the logs don't show issues [19:33:30] XioNoX: maybe it's just set to do it every 12 hours or something and that it works as designed but the WARN time needs to be adjusted (> 10 h) ? [19:34:30] no idea, I couldn't find a cron or similar [19:34:56] the WARN is at 8, CRIT at 14 [19:46:12] there is a cron, it is set to every 6 hours [19:46:47] ah, didn't find it [19:47:10] so if it fails once it will alert as warning? [19:47:19] maybe it should run every 4h [19:47:22] [sodium:~] $ sudo crontab -u mirror -l [19:47:30] or warn at 12 [19:47:31] let's run the command manually [19:48:00] i started it.. waiting [19:48:30] yes, one fail would be enough [19:48:33] mutante: cool, might be useful info for the runbook too [19:51:47] herron: that's helpful, thanks https://wikitech.wikimedia.org/wiki/Exim#Troubleshooting_%22exim_queue_warning%22_alerts [19:52:10] herron: what would be the next step after seeing who is the source of the issue? [19:52:18] for sure! can add some more examples as things break in new and exciting ways in the future [19:52:30] in this case if possible we can let them know [19:52:43] of course that’s hard to do if you only have their email [19:53:24] eh [19:54:52] total size is 1,255,005,685,753 speedup is 24,182.90 [19:55:01] well.. it finished an rsync without issues [19:55:10] reschedules service check [19:55:44] Unrelated, this should be an easy +1 (host not in prod and happy compiler) https://gerrit.wikimedia.org/r/c/operations/puppet/+/513364 [19:56:02] yeah not great, basically our mail systems are doing exactly what they should by queueing tempfailed messages.
it’s just that the queue filling can be an early sign of other serious issues [19:56:24] kind of surprised you did not get -1 from jenkins for that [19:56:29] somebody fixed it i guess [19:57:02] mutante: ah? [19:57:09] XioNoX: +1 .. assuming you are first doing this and then add them to DNS [19:57:21] there, now you have a +2 [19:57:32] yeah will do dns right now [19:57:47] XioNoX: in the past i got style downvotes for adding those in site.pp and i added lint-ignore because that was more like a bug [19:58:03] but it looks fine now [19:58:06] herron: but if we can't contact the user, what do we do? blackhole their email? find the service sending the email and disable the account? [19:58:19] ah, cool :) [19:58:53] the mirrors rsync alert is NOT fixed even though it finished rsync ... now feels like there is another bug in how it checks that [19:59:03] have to get food though [19:59:12] we can give it some time, hopefully they realize their email is not working soon enough and the queue will drain [19:59:24] oh.. my bad. i ran the ubuntu sync and the alert is the debian sync [19:59:32] if it’s still happening after a couple days we can do something more drastic [19:59:43] ok [20:00:09] and open to ideas for how to better monitor that [20:00:23] the high queue alert is kind of a crap shoot [20:01:11] maybe the check could do some of the digging [20:01:24] eg. high queue on multiple providers [20:01:31] and critical on that [20:01:43] but high queue for 1 user we ignore for a while [20:02:56] XioNoX: actually.. /usr/local/sbin/update-ubuntu-mirror is a puppetized cron and all.. but the alert is about the Debian mirror and not Ubuntu. this might need a ticket [20:03:16] need to get some lunch [20:03:30] yeah, I couldn't find the debian cron [20:03:32] i was totally expecting update-debian-mirror in the same cron [20:03:40] but not there [20:20:53] Writing DNS PTR is a pain. [20:21:04] er, v6 DNS PTR :) [20:23:34] XioNoX: i recommend only copy/pasting.
forward entry directly from output of ip a s ..and reverse record from something else in the same subnet and then changing only the first (last) octet [20:23:54] used the wrong row a couple times etc.. [20:25:00] I used that http://rdns6.com/hostRecord [20:25:17] and then trimmed it so it matches the length of the other lines [20:25:54] ah! [20:26:56] btw, the Debian mirror sync uses "ftpsync" unlike the ubuntu sync [20:27:07] so it's /etc/ftpsync [20:27:10] be back in a while [20:29:37] but that conf file doesn't have any schedule info [20:32:03] If someone wants to review a v6 DNS CR :) https://gerrit.wikimedia.org/r/c/operations/dns/+/513405 [20:40:41] XioNoX: done! lgtm. but now going afk for real [20:50:49] thx! [22:55:17] re: ftpsync / debian : " A minimum of 4 updates per 24 hours will ensure that your mirror is a true reflection of the archive. " [22:55:28] but it just got over 14 hours now .. so yea... [22:59:23] "Note that if your site is being triggered with a push mechanism, then you don't need to worry about any of this." [23:00:05] " ftpsync[14207]: We got pushed from 2001:4f8:1:c::16" [23:00:21] XioNoX: ^ so this is why you don't see a cron for that.. it's pushing to us [23:00:53] ahhh [23:00:55] ok [23:01:03] but doesn't answer why it stopped [23:01:34] the last log was still ending with : Back from rsync with returncode 0 [23:03:32] i'll send a mail to mirrors@debian.org [23:25:55] mutante: opened https://phabricator.wikimedia.org/T224706 [23:34:19] XioNoX: ok, updating! thanks
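For push-triggered mirrors like the Debian one (no local cron to find), one monitoring option is to alert on the age of the file ftpsync touches when a sync completes, using the same 8h/14h thresholds the existing check uses. A minimal sketch, assuming the path to that marker/trace file is passed in (it varies per mirror setup):

```python
# Sketch: freshness check for a push-triggered mirror. ftpsync writes a
# trace/marker file on successful completion; alert on its mtime age.
# The 8h/14h thresholds match the WARN/CRIT values mentioned above.
import os
import time

def mirror_age_hours(trace_file):
    return (time.time() - os.path.getmtime(trace_file)) / 3600

def mirror_status(trace_file, warn_h=8, crit_h=14):
    age = mirror_age_hours(trace_file)
    if age >= crit_h:
        return "CRITICAL"
    if age >= warn_h:
        return "WARNING"
    return "OK"
```

Checking a completion marker instead of log contents also sidesteps the "ran the ubuntu sync but the alert is the debian sync" confusion: each mirror gets its own marker.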