[00:35:50] something's off with cr1-eqsin
[00:36:01] more than usual I take it
[00:36:20] yeah it's super slow
[00:36:28] librenms is not even polling CPU graphs
[00:36:32] snmp doesn't complet
[00:36:32] e
[00:36:34] yeah
[00:36:39] I'll re-downtime it and reboot it
[01:04:06] cdanis: somewhat :)
[01:05:16] bblack: no worries, it was re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/571596 but all is fine there
[01:06:00] ok
[06:42:16] what's up with all the ores alerts?
[06:45:05] is that the oom on log rotation issue, or has that been resolved?
[06:48:03] from a quick look yes it seems OOMs that are killing nagios-nrpe-server.service
[06:48:23] that's why all random alerts go awry
[06:48:59] why the OOM picked NRPE I have no idea
[06:50:59] https://phabricator.wikimedia.org/T242705 there's a task
[06:54:51] yes yes still the logrotate
[06:55:16] volans: did it kill celery too?
[06:55:19] or only nrpe?
[06:55:38] I dropped a message into -ai to let them know it's still an issue but I guess they are already aware
[06:55:47] I'm renaming the task
[07:01:35] ok OOM was killing celery (as it should), but nrpe failed to allocate memory and bailed out
[07:01:38] updated task
[07:02:02] yep seems the same problem
[07:02:12] [observability] we should consider having a check on NRPE and have all the nrpe checks depend on it so that if it fails we just get one alert per host
[08:35:42] stashbot looks down for the last couple of hours at least, I will give it a restart
[08:35:42] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[08:35:56] it is not showing phab comments on irc
[08:36:16] wikibugs I mean
[08:36:34] I guess another https://phabricator.wikimedia.org/T241109 occurrence
[09:17:07] I am going to failover dbproxy for m2, which contains recommendationapi, debmonitor, gerrit, otrs... it will gracefully fail, I won't be killing any connections to force reconnects, if you see something strange though, please let me know
[09:17:43] marostegui: ack, I'll tail debmonitor logs to check all is good
[09:18:02] thanks :)
[09:19:08] already done?
[09:19:16] yeah, TTL for the DNS is 5M
[09:19:20] I have merged and deployed
[09:19:22] right
[09:20:32] I just saw a new connection from otrs going thru the new proxy
[09:21:17] marostegui: is the old proxy still working? because as long as it works it might keep using the old conn
[09:21:21] yep
[09:21:25] DNS switched on debmonitor1001 fwiw
[09:21:33] are you seeing any conn from 10.64.32.62
[09:21:34] ?
[09:21:51] I don't see any connection from debmonitor at the moment
[09:22:00] now I do, from the old proxy
[09:22:13] and now the new one :)
[09:22:49] nice
[09:23:05] logs are totally clear
[09:23:10] nice
[09:23:13] otrs looks good too
[09:23:21] I have no rush, so I will let the connections slowly shift
[09:24:40] ok, gerrit connections too going from the new proxy
[10:15:54] moritzm: now that we have ganeti clusters on the 5 DCs, could we get netboot mirrors in every DC?
[10:16:24] rebooting a server into the installer is up to 3x slower in esams than codfw for instance
[10:16:35] not to mention the poor eqsin
[10:17:57] vgutierrez: there's a plan at https://phabricator.wikimedia.org/T242602
[10:18:19] and the first steps are being currently done as part of the reimages of the install* servers in eqiad/codfw
[10:18:28] oh <3
[10:18:29] like splitting off the repo
[10:18:31] thanks :D
[10:19:20] which part do you mean with "netboot mirrors"? fetching the d-i images during the first step of the installation?
[10:19:33] yep
[10:19:33] those are currently retrieved from apt.wikimedia.org (so eqiad) via HTTP
[10:19:41] oh ok
[10:20:03] but yeah, it's one of the things to address with T242602
[10:20:04] T242602: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602
[11:03:53] I am trying to use install-console on irc2001.wikimedia.org (new vm created) and the first puppet run to create the cert ends up in
[11:04:12] Error: Could not request certificate: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)
[11:04:28] in resolv.conf I don't see any search wikimedia.org
[11:04:31] but only
[11:04:38] # Generated by NetworkManager
[11:04:38] nameserver 10.3.0.1
[11:04:46] (this is buster 10.3)
[11:04:58] am I missing something obvious?
[11:07:13] yep adding search wikimedia.org works
[11:09:59] I proceeded with the cert sign request etc.., puppet runs now, I'll note it down in the task
[11:50:34] hmm haven't met that before. Maybe some recent change in our DHCP configs?
[11:51:20] fyi, I'll kill sessionstore for mw1331 for an hour or so. Planning to collect data about what happens when sessionstore is not reachable by mediawiki
[14:35:42] 3-yo physical disk says "Estimated Life Remaining based on workload to date: 60093 days" 🤔
[14:36:38] now the question is, which Junos version to upgrade esams to
[14:37:00] so many to choose from!
[14:37:02] as I don't want to lose CPU/memory visibility on all the routers
[14:37:13] btw did you open JTAC ticket re that?
[14:38:05] about SNMP? not yet
[14:38:18] I'm so tired of opening JTAC tickets :)
[14:40:36] fair :)
[15:18:03] any issue on puppet right now?
[15:18:20] hmm nah, it's jsut me
[15:18:21] *just
[15:18:47] jbond42: what's the proper way of overriding a hiera value at host level to let it be "undefined"?
[15:19:08] i.e: we have profile::trafficserver::tls::parent_rules defined at profile level (tls.yaml) and I need it to be undef on cp4031
[15:23:17] vgutierrez: do you already have a sample cr
[15:23:55] jbond42: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571736
[15:23:58] from memory this is not an easy thing to do, due to the way hiera merges keys as it essentially just ignores anything which is undef (unless they fixed/changed that, which is possible)
[15:24:07] can't you set lookup_options?
[15:24:08] ill have a play and get back to you
[15:24:26] i think you have to set a knockout_prefix in lookup options
[15:24:40] but its been a while so will need to dig a bit
[15:28:04] oh.. it's ~ not - /o\
[15:28:13] gotta love my L8-ism
[15:28:51] vgutierrez: i think hiera has about 4 ways to say undef, '-' may well be one of them but i know '~' definitely is so just wanted to rule that out (although gut feeling it won't work)
[15:29:11] hmm nope
[15:29:18] at least pcc doesn't show the expected change
[15:29:20] https://puppet-compiler.wmflabs.org/compiler1001/20770/cp4031.ulsfo.wmnet/
[15:29:53] parent_rules still contains the data from tls.yaml
[15:30:26] yes thats what i figured
[15:37:25] so redefining it as an empty array did the trick
[15:38:27] oh cool i thought '', [] and {} all ended up as undef from hieras point of view so thats good to know
[15:40:03] vgutierrez: did you get a parse error because `parent_rules expects array got hash` when you tried `{}`
[15:40:07] or something else
[15:40:10] yes
[15:40:24] the class expects an Array
[15:40:26] so Hash is no good
[15:41:37] yes thats what i would expect but wanted to double check, i have seen multiple behaviours with this specific use case before. it wouldn't surprise me at all if you get different behaviour in wmcs i.e. puppet 4.8
[15:52:41] vgutierrez: you ok for me to merge that change
[15:52:49] go ahead
[15:52:50] :)
[15:52:55] its going
[15:54:16] done? :)
[15:55:04] Info: Applying configuration version '(b9dee7d22b) Vgutierrez - ATS: Test KA on cp4031 whilst parent proxies are disabled'
[15:55:04] yey
[15:55:57] o/
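
Editor's note on the hiera override thread above: the outcome reported in the log (a `~` override at host level left parent_rules unchanged, while an empty array took effect) is consistent with a first-found lookup that skips null values. The following is a minimal Python sketch of that idea only. It is a toy model, not Hiera's actual implementation; the function `first_found`, the layer dictionaries, and the placeholder rule value are all hypothetical and were not taken from the log.

```python
from typing import Any, Optional

KEY = "profile::trafficserver::tls::parent_rules"


def first_found(key: str, layers: list[dict[str, Any]]) -> Optional[Any]:
    """Toy lookup: scan layers from most to least specific and
    return the first value that is not None (assumed behaviour)."""
    for layer in layers:
        value = layer.get(key)
        if value is not None:
            return value
    return None


# Stand-in for the profile-level data (tls.yaml); the real contents
# are not shown in the log, so this value is a placeholder.
profile_layer = {KEY: ["placeholder rule"]}

# Host-level override using YAML '~', which parses to null/None.
host_null = {KEY: None}

# Host-level override using an empty array.
host_empty: dict[str, Any] = {KEY: []}

# Under the assumption above, the null override is skipped and the
# profile-level data wins, while the empty list is a real value.
with_null = first_found(KEY, [host_null, profile_layer])
with_empty = first_found(KEY, [host_empty, profile_layer])

print("null override  ->", with_null)
print("empty override ->", with_empty)
```

In this model the empty list counts as a real value and also still satisfies a class parameter typed as Array, which matches why `{}` was rejected in the conversation while `[]` was accepted.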