[08:21:44] Traffic, Operations: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (ema) p:Triage>Normal
[09:21:00] Domains, Traffic, Operations, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (Prtksxna) >>! In T192129#4443301, @bd808 wrote: >>>! In T192129#4440027, @Prtksxna wrote: >> Would it make sense to add rules t...
[10:07:25] Traffic, Operations, Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (ema) CI tests [[https://integration.wikimedia.org/ci/job/debian-glue/1232/console | were failing ]] due to CI slaves being jessie and thus running with an old pristi...
[13:45:12] Traffic, Operations, Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (ema)
[13:47:37] ema, elukey: perhaps we could temporarily return 404 from varnish to the client when path ~ /socket.io/ and see if that helps alleviate the problem?
[13:56:09] or in general mobrovac, it seems that /socket.io/ is not supported anymore?
[13:56:25] nope
[13:58:49] mobrovac: oh sure, if /socket.io/ is not supported anyways we can give that a try
[14:01:22] mobrovac: prepping a patch
[14:02:34] grazie
[14:03:11] mobrovac: is there a task I can reference?
[14:03:22] sure
[14:03:29] ema: T199813
[14:03:29] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813
[14:06:01] mobrovac, elukey: https://gerrit.wikimedia.org/r/447423
[14:07:33] looking
[14:09:30] ema: lgtm, elukey: does it match the path you see in the webreq logs?
[14:09:46] (i double-checked, /socket.io/ is not served any more from ES)
[14:11:09] the path I see from the offending IP is /socket.io/1/, so to be precise we could return 404 for that only
[14:11:52] I am going to triple check in hive for a /socket.io/ but it looks good to me! mobrovac maybe let's wait a sec before deploying to see if the rdkafka upgrade makes a difference?
[14:12:12] was thinking the same elukey
[14:12:37] ema: i think /socket.io/ is good/enough as neither exist and both are erroneous :)
[14:12:38] (ema: we just deployed a new node-rdkafka version, that in turn uses librdkafka, to see if a memleak fixed could be the cause of the mess)
[14:12:54] k, let me know when/if you guys want to merge the vcl change
[14:12:57] <3
[14:13:00] ema: but, as elukey says, let's hold off with this patch for 30 mins or so to assess the update impact
[14:13:05] thnx!
[14:13:22] yw!
[14:13:26] there is no way to award tokens in here sadly
[14:16:18] elukey: you can thank ema in namely :D
[14:22:06] yeah socket.io is from the thing eventstreams replaced
[14:22:22] mobrovac: I usually send private wikilove messages
[14:22:26] I think we used to split that path to a different backend during some past transition period, but it's gone now?
[14:22:34] hi bblack :)
[14:22:40] afaik, yes bblack
[14:23:20] so I checked for the 22/06/2018 whole day in hive, and all the /socket.io/* stuff is either a 301 (http -> https) or a 404
[14:23:39] but the predominance of reqs goes to cp2* hosts for some reason
[14:23:44] (python and java UAs)
[14:23:58] including our friend
[14:24:00] probably just geography of the short list of clients still using it
[14:26:28] at this point I guess stale/old clients still running somewhere, since their life is getting a stream of 404s
[14:28:28] yeah
[14:29:04] elukey: it seems updating to 2.3.4 did help, but i still see some workers (on scb2004, e.g.)
increasing their mem footprint
[14:31:43] I was about to say the same
[14:33:06] ok so I would still merge https://gerrit.wikimedia.org/r/447423 later on to avoid all these conns to the scb nodes if possible, just to remove noise
[14:38:00] yup, elukey, ema, i'd be +1 on merging it now
[14:38:14] ok
[14:43:34] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) They provided me with a list of 7 names, I asked them to specify which one (or two) are going onsite. It takes 24 hours for them to get back to me from any reply.
[14:44:16] ha, syntax error
[14:44:18] fixing
[14:51:36] ema: lmk when the change is applied on the cp2* hosts please
[14:53:27] mobrovac: forcing puppet run now
[14:53:34] kk thnx
[14:55:00] mobrovac: applied, I see 404s being returned
[14:55:05] \o/
[14:57:02] ah, the Java client gets the 301 TLS redirects and then does not follow them apparently
[14:58:22] so far I've only seen python clients getting 404s for /socket.io/
[14:59:18] yeah we tried redirects for the websocket clients way back, before eventstreams, during the switch of it to HTTPS, almost none of them followed
[15:01:06] heh, then it's basically 1 req/s getting 404s, I doubt that will change much on the scb side?
[15:01:21] nope, that's fine
[15:01:27] much better than 40rps :)
[15:16:31] Traffic, Operations: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (ema)
[15:16:41] more fun!
^
[15:16:44] Traffic, Operations: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (ema) p:Triage>Normal
[15:21:20] Traffic, Operations: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (ema)
[15:46:21] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) Got it down to two names and submitting a smarthands ticket for escort on Wednesday, July 25th.
[15:47:39] so if Dell is going there on Wednesday, shouldn't they fix cp5001 too?
[15:47:47] robh: ^
[15:48:04] ive had enough trouble getting them to fix just one in the timeframe
[15:48:12] trying to get two things fixed i can try but itill delay
[15:48:59] i assuemd was best to get at least one fixed and worry about the followup for the other
[15:49:47] what was the trouble about?
[16:05:50] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (ayounsi)
[16:10:28] I thikn 5001 is just a failed DIMM IIRC
[16:10:36] *think
[16:11:32] yeah, uncorrectable, dimm "B4" indicated by both SEL and Linux
[16:12:04] indeed
[16:12:15] it should be a piece of cake for them
[16:31:03] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) Site Visit Ticket #: 1-162553077672 SmartHands Escort Ticket #: 1-162554266089 Emailed info over to the dell tech and its scheduled for 9am this Wednesday. (They may show up later, 9am is the ea...
[19:36:22] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) email sent to team list so all other sre team members are aware of this work next Wednesday (2018-07-25).
[19:56:44] vgutierrez, volans: this just caught my eye: https://usn.ubuntu.com/3720-1/
[19:57:56] https://security-tracker.debian.org/tracker/CVE-2018-10903
[20:00:26] not sure it'd have any impact on what we're doing with certcentral
[23:09:04] netops, Operations: Unexpected network packets in codfw mgmt - https://phabricator.wikimedia.org/T199832 (ayounsi) Open>Resolved From support: > I have confirmed that these addresses 128.0.0.16 , 191.255.255.255 are used in the system for internal purposes only. > This type of traffic can be safe...
[23:09:17] netops, Operations: Unexpected network packets in codfw mgmt - https://phabricator.wikimedia.org/T199832 (ayounsi)