[09:06:14] Traffic, Operations, ops-ulsfo: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (Volans) @RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning.
[13:42:28] Traffic, Operations, ops-ulsfo: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (ema) >>! In T183176#3847321, @Volans wrote: > @RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning. I've just ack'ed all related strongswan alerts...
[13:43:30] ema: so what is the best practice for those alerts, ack them too? because AFAIK this will mask possible additional failures
[13:44:48] I was wondering if we could improve this by having a dynamic list of hosts in the icinga alert so that when a host is depooled Icinga skip it for the strongswan alerts
[13:45:14] unless we'll be able to do TLS and ditch strongswan pretty soon ofc ;)
[13:45:33] volans: https://phabricator.wikimedia.org/T148976
[13:46:48] eheheh
[13:46:57] ditching strongswan would be great, I'm afraid it's not gonna happen very soon though :)
[13:48:23] oh, and re:lvs1007, I see in SAL that chris swapped NIC cards a few days ago
[13:48:39] the puppetfails might be related to that
[13:50:10] yes hiera is looking for eth2/3 while udev named them 10/11
[13:50:26] but given I didn't know the situation of the host I didn't send a patch/merge it
[13:51:43] also because maybe you might just want to manually force udev to get those as eth2/3 clearing the old IDs
[13:52:06] I think this should be the last chapter in the lvs1007 saga https://phabricator.wikimedia.org/T181419
[13:52:42] yeah hopefully!
[13:54:18] for the time being I'd just ack those icinga errors too, but let's wait for brandon to confirm maybe
[13:55:21] don't see why keeping puppet broken there where is easily fixable, but sure as you want
[14:13:51] I'd say just reinstall it if the nics are fixed
[14:29:00] bblack: I'm installing pdns-recursor security updates, esams and codfw were upgraded by depooling the hosts during the update. is that also sufficient for the more busy eqiad recursors, since there were a few problems with hosts hitting servers directly? IIRC we've seen those problems with reboots, but in this case the upgrade will only have 1-2 seconds of unavailability. or do we need a bigger hammer like dropping them temporarily from our
[14:29:02] resolv.conf?
[14:32:41] moritzm: IIRC, I think our current practice to avoid problems is:
[14:33:12] 1) Remove the recursor-to-be-rebooted's IP from /etc/resolv.conf on LVS servers in the same DC.
[14:33:28] 2) Depool from LVS recdns using confctl
[14:33:32] (and then reverse when you're done)
[14:33:56] I don't think we've had to actually restart pybals though, just edit the resolv.conf
[14:34:55] and obviously, puppet has to stay disabled on those LVSes for the duration if you're manually editing out the resolv.conf entry
[14:35:16] (or you can puppetize it I guess, either way, in the site.pp nameserver_overrides)
[14:35:55] they're also in each other's resolv.conf, so technically that should be fixed as well, but I don't think it has a huge impact in this case
[14:37:15] ok
[14:40:42] bblack: could you sanity check https://gerrit.wikimedia.org/r/398819 when you've got the time?
[14:45:07] looks sane to me
[14:45:16] thanks!
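(editor's note) Step 1 of the practice described at 14:33:12 above, editing one recursor's IP out of resolv.conf, could be sketched roughly as below. This is only an illustration under assumptions: the function name drop_nameserver and the 192.0.2.x addresses are hypothetical and do not come from the channel, the sketch works on the file's text rather than touching /etc/resolv.conf, and it does not cover disabling puppet or the confctl depool in step 2.

```python
def drop_nameserver(resolv_conf: str, ip: str) -> str:
    """Return resolv.conf text with any `nameserver <ip>` line removed.

    Hypothetical helper: it only transforms text and leaves reading and
    writing the real file (and disabling puppet first) to the operator.
    """
    kept = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        # Compare whole fields so 192.0.2.10 does not also match 192.0.2.100.
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == ip:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up documentation-range addresses, not real recursor IPs.
EXAMPLE = (
    "search example.org\n"
    "nameserver 192.0.2.10\n"
    "nameserver 192.0.2.11\n"
    "options timeout:1\n"
)

if __name__ == "__main__":
    print(drop_nameserver(EXAMPLE, "192.0.2.10"), end="")
```

Keeping a copy of the original text makes the "reverse when you're done" step from 14:33:32 a matter of writing that copy back.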
[15:55:35] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3848555 (ema)
[16:21:44] Traffic, Operations, ops-ulsfo: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848653 (RobH) Error codes from ePSA test: Service Tag : 3ND3KH2 Error Code : 2000-0125 Validation : 107826
[16:24:26] Traffic, Operations, ops-ulsfo: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848685 (BBlack) Dell info says that code means: `The IPMI system event log is full for various reasons or logging has stopped because too many ECC errors have occurred.`
[16:30:04] Traffic, Operations, ops-ulsfo: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848695 (RobH) Yeah, it turns up nothing but the error codes for the actual failed dimm. It doesn't matter much, just helps for the part replacement. SR958387090 is the self dispatch part # for the repl...
[18:39:43] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3849122 (ayounsi) Open>Resolved