[00:18:03] Traffic, Operations, ops-codfw: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274234 (Papaul)
[00:45:12] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274283 (Papaul)
[07:30:25] Wikimedia-Apache-configuration, Operations: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968#4274566 (Joe)
[08:09:21] mark: morning
[08:09:30] morning
[08:09:48] we're having some issues with cp3037
[08:09:56] ok?
[08:09:59] the main interface looks unreachable
[08:10:15] the mgmt interface answers to icmp traffic and the 3-way handshake on 22/tcp
[08:10:20] but I'm unable to get an ssh session there
[08:10:40] weird
[08:10:45] so I was trying to bring the mgmt interface down on the switch side
[08:11:37] but I'm failing to identify where the mgmt is connected
[08:11:41] i can imagine
[08:11:43] esams is a mess
[08:11:47] let me have a look
[08:12:16] also I identified that cp3037's port is mislabeled, it's xe-3/0/5 (the switch says it's cp3036, but the mac address matches cp3037)
[08:12:18] it's very likely to be connected to an unmanaged switch
[08:12:26] are you talking about the production port now?
[08:12:32] yes
[08:12:59] ok, best make a ticket about that
[08:13:19] right
[08:13:19] with netops and ops-esams
[08:13:37] so I don't think you can bring the management port down
[08:13:41] as it's an unmanaged switch
[08:13:55] lovely, can we drain the power from the PDU?
[08:14:08] for cp3037 only, of course
[08:15:40] no, we don't use switched PDUs either
[08:15:52] just depool it and make a ticket for on-site work?
[08:15:58] ack
[08:16:05] the DRAC is supposed to always work
[08:16:09] and once upon a time it could be relied upon
[08:16:14] but lately we seem to have a lot of issues with that
[08:16:26] it used to be very rare that we couldn't log in or powercycle a server with that
[08:16:27] fwiw at the end of april it was working (I did an audit of the fleet)
[08:16:57] i expect to return to esams work early july
[08:17:04] i am now able to travel again
[08:17:12] just offsite and HR work in the way ;p
[08:17:36] vgutierrez: which cluster is cp3037 in?
[08:18:06] upload
[08:18:20] mark: I guess there is also the option of smart hands in case it becomes urgent, right?
[08:18:36] yes
[08:18:52] but hopefully that's never the case with a single server out of a cluster dying
[08:19:02] but we can ask them to do a power cycle or something like that, yes
[08:21:10] mark: also FYI, in the last puppet run the mgmt icinga check was removed, because the ipmi_lan fact disappeared, and I've narrowed it down to 'bmc-config -o -S Lan_Conf' not returning valid data (at least not valid enough for our script)
[08:21:34] so probably our icinga check crashed our drac
[08:21:38] wonderful ;)
[08:21:46] netops, Operations, ops-esams: cp3036 and cp3037 production ports mislabeled - https://phabricator.wikimedia.org/T196970#4274656 (Vgutierrez)
[08:21:48] how?
[08:21:54] because $bugs
[08:21:57] it seems like it's "only if the DRAC already works, then check if it works"
[08:22:02] in a way
[08:22:11] but 'bmc-config -o -S Lan_Conf' is run by our facter, not icinga
[08:22:41] what does the management icinga check do?
[08:22:53] unless you're implying that our checks broke the idrac too, but then how did they break the server as well? :D
[08:23:05] hmm, icmp every X hours?
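For context on the 'bmc-config -o -S Lan_Conf' narrowing-down above: the ipmi_lan fact comes from parsing the checked-out Lan_Conf section, so when the BMC returns garbage the fact (and with it the mgmt icinga check) silently drops out on the next puppet run. A minimal sketch of that kind of parser, assuming the usual freeipmi checkout format with an IP_Address key; this is not the actual facter code:

```python
import re
import subprocess

def ipmi_lan_address():
    """Return the BMC's LAN IP as reported by `bmc-config -o -S Lan_Conf`,
    or None if the tool fails or the output is not parseable (which is
    roughly how the fact can vanish when the DRAC is wedged)."""
    try:
        out = subprocess.run(
            ["bmc-config", "-o", "-S", "Lan_Conf"],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    # freeipmi checkout output has lines like "IP_Address   10.x.y.z"
    # inside a "Section Lan_Conf" / "EndSection" block
    m = re.search(r"^\s*IP_Address\s+(\S+)\s*$", out, re.MULTILINE)
    return m.group(1) if m else None

if __name__ == "__main__":
    print(ipmi_lan_address())
```

Returning None on any failure is exactly the behaviour that makes a wedged DRAC look like a vanished fact (and a vanished check) rather than an alert.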
[08:23:05] i think those two are probably unrelated
[08:23:14] we've seen our icinga checks cause issues on our dracs before
[08:23:21] probably they have bugs and can't handle the concurrency etc
[08:23:23] yeah.. I remember talking about this with brandon
[08:23:29] they are simple embedded devices after all
[08:23:36] the icinga checks on the mgmt interface are:
[08:23:48] and we ping the mgmt every X hours and not more often, to avoid crashing them
[08:23:49] 1) ping 2) DNS 3) ssh handshake without login
[08:23:59] yes, so plenty of opportunity for that to cause issues :)
[08:24:14] ping and ssh handshake, that is
[08:24:21] funnily enough.. that's the current status of cp3037.mgmt
[08:24:32] i know, it's sad
[08:24:38] ping and handshake work.. but nothing else :(
[08:24:46] sure, but how does that correlate with the server going down? the two things happened this morning
[08:25:04] volans: well.. we detected cp3037.mgmt being down this morning
[08:25:17] it probably was down already
[08:25:19] right
[08:25:24] ping and ssh handshake worked and still do
[08:25:26] no, I saw in the icinga puppet logs that the check disappeared this morning
[08:25:34] but the check disappeared due to the fact disappearing?
[08:25:41] yes
[08:25:55] Jun 12 07:01:38 (puppet run on einsteinium)
[08:26:42] so the main prod interface is unreachable
[08:26:44] but it does arp?
[08:26:51] or what is the state of that?
[08:31:21] so.. it shows as physically Up on the switch
[08:31:31] but no traffic at all apparently
[08:31:31] https://librenms.wikimedia.org/graphs/to=1528792200/id=10875/type=port_upkts/from=1528781400/?
[08:32:03] but it does arp?
[08:32:08] arp is at least some traffic
[08:35:31] I've depooled cp3037
[08:49:08] nice
[09:33:16] Traffic, Operations, ops-esams: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (Vgutierrez)
[09:35:23] Traffic, Operations, ops-esams: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (Vgutierrez) p: Triage→Normal
[09:45:39] so now we have 3 upload@esams hosts down for hardware issues :(
[09:46:03] T190607 T189305 T196974
[09:46:03] T190607: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607
[09:46:04] T196974: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974
[09:46:04] T189305: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305
[09:46:19] s/down/depooled/
[09:50:31] ouch
[09:50:50] maybe we need mark there soon(TM)
[09:53:17] or a traffic summit in amsterdam O:)
[09:55:11] let's move the team building to the DC
[09:55:24] at least we would be cool
[09:55:38] lol
[09:58:15] mailbox lag is growing in upload@esams, on cp3035 and cp3046
[10:01:57] mark: I think we need smart hands soon
[10:43:17] ok
[10:43:28] can you prepare an email with English instructions for what you want done?
[10:43:32] send it to me
[10:43:35] and I will have it forwarded
[10:44:17] i'll be out getting a haircut in 10min though
[10:45:10] mark: sure
[10:45:51] going afk in a bit for lunch too, I'll prepare the email right afterwards
[10:50:51] ok
[12:37:25] so cp3035 is not doing well; it's mbox lagged and transitioning from healthy to sick and back
[12:37:53] varnish-backend is scheduled to be restarted tomorrow, I'm gonna restart it manually now
[12:51:39] on the bright side, update-ocsp seems to behave as expected
[12:57:35] ema: seems that cp3039 is misbehaving as well
[12:59:58] vgutierrez: yes, 3039 is lagged too. 503s have stopped since the restart of 3035 though, let's wait for a bit now
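A note on the 'ssh handshake without login' check versus what cp3037.mgmt is doing above (ping and TCP handshake fine, no usable session): the distinguishing signal is usually whether the DRAC ever sends its SSH identification string after the TCP connect. A rough probe sketch, with a hypothetical mgmt FQDN; the real icinga check is a different implementation:

```python
import socket

def ssh_banner(host, port=22, timeout=5.0):
    """Open a TCP connection and wait for the server's SSH identification
    string. A wedged DRAC often completes the TCP handshake (so ping and a
    bare connect look fine) but never sends "SSH-2.0-...", which is where
    an actual login attempt hangs."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            banner = sock.recv(256)
        except socket.timeout:
            return None  # handshake ok, but the SSH service isn't talking
    return banner.decode("ascii", errors="replace").strip() or None

if __name__ == "__main__":
    # hypothetical mgmt FQDN, just for illustration
    print(ssh_banner("cp3037.mgmt.esams.wmnet"))
```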
[13:00:06] ack
[13:01:05] see hospital logs here https://logstash.wikimedia.org/goto/ff5e258640c51ef6203f4c1e4fb25c94
[13:01:25] and of course:
[13:01:26] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&panelId=3&fullscreen&var-site=esams&var-cache_type=upload&var-status_type=5&from=now-3h&to=now
[13:04:34] I'm not sure what we should suggest smart hands to do when it comes to cp3048 - https://phabricator.wikimedia.org/T190607
[13:04:58] the last SEL entry doesn't seem encouraging
[13:05:01] 106 | May-16-2018 | 10:55:57 | Status | Processor | Critical | Assertion Event | IERR
[13:06:03] what dc ops usually do is swap the CPUs between sockets to see whether the error continues to be logged (but then on a different socket)
[13:06:18] if that's the case, that's apparently sufficient justification for Dell/HP to replace it
[13:06:54] although warranty ended three months ago
[13:10:31] ok thanks moritzm
[13:40:44] err ema
[13:40:52] you're asking a data center to replace memory, how would they do that? :)
[13:40:54] they're not a server vendor
[13:40:58] we can ask them to pull cables and stuff
[13:41:40] or are you explaining that for me?
[13:41:54] i mainly need an email to forward to them directly as I have no time to spare today at all, sorry :(
[13:41:55] oh, I optimistically assumed we had some spare memory available in esams
[13:42:00] no
[13:42:08] we need to get dell to ship stuff and it's all complicated
[13:43:15] so send me an email you would write to someone who knows nothing about our systems but works there and can press buttons and pull cables ;)
[13:43:28] * mark disappears into meetings
[13:57:45] ema: if we end up having significant delays on getting some of those 3x upload@esams fixed, another option is that we can do some re-roling to cover.
[13:58:01] e.g. steal a text node over to upload to help, but it's not ideal in esams at present to do that.
[13:58:11] (and a PITA to do and undo)
[14:01:42] yeah
[14:02:08] so, the host with memory errors seems like the most complicated one to fix, as that requires shipping memory and such
[14:02:32] cp3037 might just be a case of turning it off and on again
[14:03:20] or maybe we have an already decommed server of the same model? dc ops often take spare parts out of those if a server is OOW
[14:03:39] we don't, these 20x that are in upload+text at present are the only ones of this model at esams
[14:03:48] ok
[14:04:01] currently it's 8xtext + 12xupload, and upload is missing 3/12
[14:05:25] so if we're just stuck this way for a long while, rebalancing the two clusters from an 8/9 split to a 7/10 split would at least help.
[14:05:50] or even 6/11 maybe. text probably has less of an issue with storage scaling, which is our primary scaling pain point right now
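The SEL entry quoted above for cp3048 is in the pipe-separated format freeipmi's ipmi-sel prints, so a quick way to summarise what to tell Dell/smart hands is to filter the event log for critical Processor events (IERR assertions and the like). A sketch under the assumption that the columns match the quoted line; the exact ipmi-sel invocation and column layout can vary:

```python
import subprocess

def critical_processor_events(sel_text):
    """Yield SEL records flagged as critical Processor events, e.g.
    '106 | May-16-2018 | 10:55:57 | Status | Processor | Critical | Assertion Event | IERR'."""
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 8 and fields[4] == "Processor" and fields[5] == "Critical":
            yield fields

if __name__ == "__main__":
    # run locally on the host; typically needs root to talk to the BMC
    sel = subprocess.run(["ipmi-sel"], capture_output=True, text=True, check=True).stdout
    for event in critical_processor_events(sel):
        print(" | ".join(event))
```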
[14:08:01] my understanding from earlier chat was that m.ark might be able to go there in early july
[14:58:25] yes
[14:58:32] we should probably make sure we have spare parts available then though
[14:58:46] and we can even have dell engineers escorted by remote hands
[15:03:55] esams-upload 503 spike, this time it's 3039
[15:05:34] so we can just ask remote hands to powercycle cp3037
[15:05:38] that's easy
[15:05:57] someone suggested power-draining it for 10 minutes
[15:06:09] apparently that helps with the DRAC
[15:06:51] restarting 3039's backend in the meantime
[15:06:55] but that's easy and it should at least help us diagnose what the hell is wrong with cp3037
[15:08:49] in the meantime, I can work on re-roling one of the text nodes at least, if that helps ema?
[15:09:09] bblack: +1
[15:09:11] it's just a puppetization patch + depool + repuppet + repool cycle essentially
[15:10:28] Hi, I am wondering, is it possible to have gerrit.wmfusercontent.org just behind the cache proxy without having to do it for gerrit.wikimedia.org, please?
[15:10:38] we are going to use gerrit.wmfusercontent.org for avatars.
[15:13:16] mark: yes, power draining cp3037 seems like the low hanging fruit
[15:14:32] paladox: probably, open a ticket about it or tag us in an existing one
[15:15:20] bblack ok
[15:15:31] * paladox already working on it :)
[15:15:39] by adding it to cache::misc
[15:16:43] bblack: shouldn't we change 3043's role in manifests/site.pp too?
[15:16:56] Traffic, Operations, User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4275890 (Johan) Translations are being collected at https://meta.wikimedia.org/wiki/User:Johan_(WMF)/AES128-SHA
[15:17:47] ^^ awesome :)
[15:18:11] ema: yes. I think the process will roughly go like: (0) puppet-disable esams text+upload (1) depool 3043 from all layers in text and make sure it's really depooled (no traffic from LB or other nodes) (2) stop daemons on cp3043, wipe out the storage files (3) merge the patch, and run puppet on cp3043 (could take a couple of runs to fix itself) (4) run puppet on all the rest of the text+upload esams nodes too, for the changes in backend sets and ipsec nodelists
[15:19:13] (-1) downtime on icinga ;0
[15:19:15] ;)
[15:20:47] and (5) repool it in the upload cluster once everything looks sane (it will initially enter a depooled state)
[15:21:04] right
[15:21:11] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4275931 (Paladox)
[15:21:26] bblack done ^^
[15:21:39] paladox: thanks :)
[15:21:48] for the time being we're in a decent situation: no upload@esams host is lagged anymore, no 503s
[15:21:50] you're welcome :)
[15:23:37] ema: double-check the patch pls? it's easy for me to get confused on 3043 (intended) vs 3034 (oops) I imagine
[15:23:44] sure
[15:24:21] oh I didn't push the new one, pushed now
[15:27:45] ema bblack https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439939/ :)
[15:30:16] bblack: mmh, with the new cache_text regex we seem to fail matching cp3030
[15:30:35] ema: how?
[15:31:04] (we'll get diffs on every node I think, but it's nodelist diffs)
[15:31:48] bblack: nah, just a simple PEBKAC
[15:32:06] mine or yours? :)
[15:32:14] mine :)
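On the cache_text regex hiccup just above (cp3030 silently not matching): the cheap safeguard is to run the candidate pattern against the full node list before merging. A toy illustration with made-up esams hostnames and a hypothetical off-by-one character class, not the actual hieradata regex:

```python
import re

# made-up esams text node names, just to illustrate the failure mode
text_nodes = [f"cp30{n:02d}.esams.wmnet" for n in range(30, 38)]

# hypothetical pattern: the character class starts at 1, so cp3030 is dropped
broken = re.compile(r"^cp30(3[1-9]|4[0-5])\.esams\.wmnet$")
# widening the class to include 0 fixes it
fixed = re.compile(r"^cp30(3[0-9]|4[0-5])\.esams\.wmnet$")

for pattern in (broken, fixed):
    matched = [n for n in text_nodes if pattern.match(n)]
    missing = sorted(set(text_nodes) - set(matched))
    print(pattern.pattern, "missing:", missing)
```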
[15:34:11] yeah, obviously in the diff lots of things will need stopping (all the stats daemons too)
[15:34:45] we don't cleanly unconfigure/reconfigure them on removal in puppet either, I imagine it won't be truly clean until a reinstall (in the sense that a reboot might bring a text-only stats daemon back to life)
[15:35:29] would it be better/easier to reimage instead?
[15:36:02] yeah, maybe
[15:36:13] the trick is the ipsec alerts while waiting on the reinstall, etc
[15:36:33] I'm very quick at clicking "acknowledge" on icinga though
[15:36:40] hahaha
[15:36:43] unless we actually move it to role::spare first and remove it from the hieradata/confctl, then in a second commit move it back once the install is ready to first puppetize
[15:36:54] that would be the sanest/cleanest way
[15:36:59] safest for sure
[15:37:17] since we're not actively in 503, I'll re-do it that way
[15:37:26] ok
[15:37:34] role::spare has the firewall IIRC
[15:37:39] I have a meeting coming up in ~23 mins, so I'll defer going through the process until later in the day at this point.
[15:37:41] you might need to clean it up after
[15:38:12] volans: I won't actually puppet it in that role, just a placeholder while reinstalling (puppet disable before merging role::spare)
[15:38:26] and then put it in the right role before the first puppet run (manual)
[15:38:35] ah ok
[15:41:37] we should put someone on the task of making it easy to clear downtimed things by-regex :)
[15:42:37] a Software Engineer working in Infra Foundations perhaps!
[15:42:37] it all comes down to writing to the command file ;)
[15:42:40] * volans hides
[15:42:55] or at least a button for "clear all downtimed services on this host"
[15:43:42] we need a proper icinga CLI tool
[15:43:45] I guess I don't have to muck with role::spare at all really, just save the site.pp change for the end
[15:56:26] I once set up https://github.com/zorkian/nagios-api at another job vgutierrez, the code is a mess but it did work pretty well
[15:57:34] can confirm, used nagios-api too and it did what it said on the tin
[15:57:53] well... I get cancer every time I open the icinga web interface...
[15:58:49] oh god, that was in 2012
[16:03:32] my math says 2012 as well godog :) funny world
[16:04:34] heheh indeed chasemp
[16:06:14] bblack: BTW, I love kafka's TLS handshake implementation.... https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/network/SslTransportLayer.java
[16:07:59] (said no one ever)
[16:20:43] anything weird in particular?
[16:21:26] elukey: so far... nope
[16:21:48] ah ok ok, was just curious :)
[16:22:15] java? :)
[16:22:23] well.. :)
[16:58:44] Traffic, DNS, Operations, ops-codfw, Patch-For-Review: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4276412 (Papaul)
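On the 'clear downtimed things by-regex' wish earlier: as noted, mechanically it is just writes to the icinga external command file. A rough sketch, assuming the command FIFO lives at the usual rw/icinga.cmd path and that this icinga build supports DEL_DOWNTIME_BY_HOST_NAME; a real tool would also pull the host list from icinga itself rather than take it as an argument:

```python
import re
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path, varies per install

def clear_downtimes(hosts, host_regex, cmd_file=CMD_FILE):
    """Write one DEL_DOWNTIME_BY_HOST_NAME external command per host whose
    name matches host_regex; icinga picks the commands up from the FIFO."""
    pattern = re.compile(host_regex)
    now = int(time.time())
    with open(cmd_file, "w") as fifo:
        for host in hosts:
            if pattern.search(host):
                fifo.write(f"[{now}] DEL_DOWNTIME_BY_HOST_NAME;{host}\n")

# example: drop downtimes for every esams cache host in a given list
# clear_downtimes(known_hosts, r"^cp30\d\d$")
```

Writing straight to the FIFO is the mechanism a proper CLI tool would wrap, which is roughly the niche nagios-api fills over HTTP.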