[06:02:34] Traffic, Operations: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455383 (elukey)
[06:02:40] ema: ---^
[07:22:06] Traffic, Operations: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455383 (MoritzMuehlenhoff) I had the same symptom with oxygen a few days ago and a "racadm racreset" fixed the mgmt for me.
[07:32:18] Traffic, Operations: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455451 (elukey) From ipmitool sel I got a lot of these: ``` 7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted 7c | 07/20/2017 | 01:06:49 | Unknown #0x28 |...
[08:15:37] elukey: hey :)
[08:16:15] elukey: so cp3048 did come back up after power-cycling I see
[08:17:14] _joe_: thanks for taking care of the depool
[08:17:59] <_joe_> np
[08:20:09] ema: yep! Weird but this morning me/Daniel weren't able to connect to mgmt
[08:21:21] elukey: so mgmt wasn't reachable at first, then later on it was?
[08:23:34] this is my understanding but it was early in the morning so I might have done some PEBCAK
[08:23:55] the sshd on the mgmt is old, so it runs into the "slow DH group" problem from https://phabricator.wikimedia.org/T171041
[08:24:23] I think it was working fine all the time, but simply timed out with openssh > 7 as the client
[08:25:27] oh interesting
[08:58:36] Traffic, Operations: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455570 (ema) Open>Resolved a:ema So as @MoritzMuehlenhoff mentioned on IRC the mgmt issues might have been due to T171041. The host is back online and looks fine at the moment so I've re...
[09:24:11] Traffic, Operations, Reading-Admin, Reading-Community-Engagement: TEST: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826#3455702 (fgiunchedi)
[11:22:20] ema: we're getting a lot of alerts from UnitedLayers' Icinga about a PDU failing
[11:22:33] it may be that their network is flaky, or that the PDU is indeed flaky and rebooting or something
[11:23:00] if it's the latter, which I doubt, we may be seeing equipment of ours losing half its power
[11:23:03] jfyi :)
[11:28:57] paravoid: thanks for the heads up!
[12:22:31] paravoid: last time I asked rob about this he told me that it seems their check is flaky and the PDUs were fine, but worth checking again
[12:22:46] it was a couple of months ago I think
[13:15:57] ema: cp3039 has a weird OCSP warning
[13:21:07] paravoid: mmh, globalsign-2016-rsa-unified.ocsp is in fact one day old
[13:24:28] Jul 20 10:52:42 cp3039 update-ocsp-all: Error querying OCSP responder
[13:24:29] Jul 20 10:52:42 cp3039 update-ocsp-all: 140171424560784:error:27076072:OCSP routines:PARSE_HTTP_LINE1:server response error:ocsp_ht.c:314:Code=524,Reason=Unassigned
[13:24:31] Jul 20 10:52:42 cp3039 update-ocsp-all:
[13:24:33] Jul 20 10:52:42 cp3039 update-ocsp-all: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-rsa-unified.conf
[13:24:42] I've tried running update-ocsp-all manually and it did work fine
[13:28:14] ema: I remember chatting with brandon about this, basically my understanding is that we run it once per day and if it fails icinga will warn for 1 day, and the next day if it runs fine the alarm goes away
[13:28:49] I've proposed to maybe do a single retry on the script that fetches it on failure after sleeping a bit
[13:29:04] to avoid those false positives for a single failure
[13:30:14] +1
[13:38:53] mmh cp1050 is stuck at 'Initializing firmware interfaces...'
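(Editor's note on the 13:28 proposal: a minimal sketch of "a single retry after sleeping a bit" could look like the following. This is not the actual update-ocsp-all code; the function name `run_with_single_retry`, the `runner`/`sleep` parameters and the 60-second default delay are all hypothetical, chosen only for illustration.)

```python
import subprocess
import time
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_single_retry(
    cmd: Sequence[str],
    delay_seconds: float = 60.0,  # assumed value; the log only says "a bit"
    runner: Callable[[Sequence[str]], int] = _run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd once; on a non-zero exit, sleep and try exactly one more time.

    Returns the exit code of the last attempt, so a transient responder
    error followed by a success is reported as success.
    """
    rc = runner(cmd)
    if rc == 0:
        return 0
    sleep(delay_seconds)
    return runner(cmd)
```

(A daily job could then call something like `run_with_single_retry(["update-ocsp-all"])` and exit with the returned code, so only two consecutive failures would leave the stale file that triggers the Icinga warning.)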
[13:40:58] netops, Operations, monitoring: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456248 (faidon)
[13:42:45] yes, I am trying to turn it off and on again :)
[13:51:21] netops, Operations, monitoring, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456276 (fgiunchedi)
[13:54:29] Traffic, Commons, Operations, Thumbor, media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3456281 (Jeff_G) Another new symptom using the same browsers as in the original description: https://upload.wikimedia.org/wikipedia/commons/thumb...
[13:55:55] Traffic, Operations, ops-eqiad: cp1050 apparently stuck while "Initializing firmware interfaces..." - https://phabricator.wikimedia.org/T171168#3456283 (ema)
[13:56:11] Traffic, Operations, ops-eqiad: cp1050 apparently stuck while "Initializing firmware interfaces..." - https://phabricator.wikimedia.org/T171168#3456296 (ema) p:Triage>Normal
[13:56:36] Traffic, Operations, ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3456297 (ema) @Cmjohnson please replace the disk (sda) whenever you've got the chance!
[14:05:18] Traffic, Operations: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#3456328 (ema) Forwarding-only caching resolvers would help with issues such as T171048 and T151643.
[14:29:12] Traffic, Commons, Operations, Thumbor, media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3456387 (Aklapper) No such problems in Firefox 54 or Chromium 59 on a Linux desktop. Issue seems to be browser / platform specific?
[15:40:13] netops, Operations, ops-eqiad: Replace cr1/2-eqiad air filters - https://phabricator.wikimedia.org/T170138#3456762 (Cmjohnson) Open>Resolved a:Cmjohnson done
[22:59:54] Traffic, Operations, Performance-Team, TemplateStyles, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3458687 (Etonkovidova)