[01:28:41] 10Traffic, 06Operations, 10ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3261314 (10BBlack)
[08:47:06] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262009 (10fgiunchedi)
[08:51:46] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262036 (10fgiunchedi) I tried a traceroute from our side and it takes a different path ``` filippo@cr1-esams> traceroute 2.235.74.121 traceroute to 2.235.74.121 (2.235.74.121), 30 hops max, 40 byte p...
[09:23:23] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262009 (10ayounsi) Traffic is indeed not smooth as usual on the interface toward Init7. I called Init7 and disabled the v4 and v6 BGP sessions. The person I had on the phone mentioned that the engineer...
[09:26:51] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262155 (10ayounsi) Got confirmation on IRC that the issue can't be reproduced.
[09:31:48] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262169 (10ayounsi) a:03ayounsi
[09:43:05] 10netops, 06Operations: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262251 (10ayounsi) From Init7: >We are experencing some BGP issues in our backbone. Troubleshooting is under way and I'll contact you once we fixed the issue.
[10:03:39] 10Traffic, 06Operations, 13Patch-For-Review: varnish frontend transient memory usage keeps growing - https://phabricator.wikimedia.org/T165063#3262301 (10ema) 05Open>03Resolved a:03ema Crazy transient memory usage [[https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&from...
[10:06:24] 10netops, 06Operations: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262317 (10Nemo_bis) p:05Triage>03High
[10:28:51] 10netops, 06Operations: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262418 (10Pyb) My connection is chaotic since this morning. Other customers from the french ISP Bouygues report the same problem. This is my traceroute results: |--------------------------------...
[12:53:20] 10netops, 06Operations, 13Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3262843 (10ayounsi)
[12:57:51] ema: now that it's not the weekend, try 0s grace on the n-hit-wonder hfp? there are probably other cases to look at too, but that might be fairly low-hanging fruit
[12:58:12] (or if 0s still seems like it needs more testing/validation first, maybe set them to 1m grace instead of 60?)
[12:58:58] 10netops, 06Operations: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262862 (10Nemo_bis)
[12:59:00] bblack: it sounds like a good plan in general
[12:59:07] any possible impact on BBR evaluation though?
[13:00:16] I don't think so, not on the broad stats
[13:00:22] in any case, it's still kinda early in the week :)
[13:00:30] true that :)
[13:00:41] please start thinking about next quarter goals
[13:00:50] faidon will be discussing them in mgmt meeting on monday
[13:01:01] yup
[13:01:33] ema: also, I won't make ops meeting today, are you?
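As context for the 09:23 update above, where the v4 and v6 BGP sessions toward Init7 were disabled by hand: a minimal Junos CLI sketch of that kind of operation. The router name, BGP group names and neighbor addresses below are placeholders, not the real esams configuration.

```
user@cr-esams> configure
[edit]
user@cr-esams# deactivate protocols bgp group Transit-Init7 neighbor 192.0.2.1
user@cr-esams# deactivate protocols bgp group Transit-Init7-v6 neighbor 2001:db8::1
user@cr-esams# commit

user@cr-esams# activate protocols bgp group Transit-Init7 neighbor 192.0.2.1
user@cr-esams# activate protocols bgp group Transit-Init7-v6 neighbor 2001:db8::1
user@cr-esams# commit
```

`deactivate` leaves the neighbor statements in the configuration but marks them inactive, so re-enabling the sessions once Init7 confirms their backbone issue is fixed is just `activate` plus another commit.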
[13:01:43] I'll be there, yes
[13:02:08] ok
[13:02:19] we should also discuss new firewalls
[13:02:24] when would you have time for that brandon?
[13:03:27] the pfw stuff?
[13:04:19] bblack: I'll make a VCL patch with 60s grace to begin with
[13:04:46] mark: if it's on EU overlap time (US morning), probably not until wednesday
[13:04:52] ema: ok
[13:04:58] yes pfw
[13:05:30] we're running out of time though, are you ok if arzhel/faidon/me decide?
[13:06:24] and jeff I guess
[13:07:13] yes that's fine. I can recap my last input when faidon and I discussed it:
[13:07:37] 16:10 1) On the face of the tradeoffs about cost, Juniper bugs + support, etc... I lean fairly strongly in the direction of the Linux-based solution.
[13:07:40] 16:11 2) But, one of the caveats of going down that road is that it will require significant time investment from some ops/net-skilled person(s) to engineer the initial solution and to maintain/debug it down the line.
[13:07:44] 16:12 3) "Not it" for doing that myself, and given our limited resources in general-ops, network, and fr-tech, where do we expect to find that new time/effort input?
[13:07:48] 16:12 (whereas if we go juniper, it saves us that time and effort, at the cost of bugs that just sit idle as unresolved/unresolvable for long periods because juniper sucks)
[13:07:52] 16:13 so I tend to think that Juniper may be the pragmatic compromise, and it's just going to suck and we have to live with it
[13:09:13] ok, thanks
[13:09:20] (I fully agree with that, btw)
[13:17:25] bblack: should we also cap transient storage size now? https://gerrit.wikimedia.org/r/#/c/353274
[13:17:31] need to rethink the values though
[13:18:11] maximum usage currently is ~1.7G in ulsfo-text frontends
[13:18:16] ema: let's wait a little, it's good to get longer-term data about spikes anyways. I suspect when we do hit transient cap it could be minorly ugly (but better than ooming the host), so it would be good to pick conservative values that cover most reasonable spikes.
[13:18:44] there have already been spikes over the weekend (post-fix) for upload-frontend that have hit the 10-15GB range very briefly that I can't explain
[13:19:07] yeah, I was just looking at upload stats now and they're not great
[13:19:31] https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&from=1494836114891&to=1494837414561&panelId=2&fullscreen
[13:19:34] ^ like that one this morning
[13:19:51] I don't even have a good theory about those brief huge upload-frontend spikes
[13:19:58] on the backend?!
[13:20:04] oh no sorry
[13:20:10] frontend :)
[13:20:46] that zoomed one there, it's +14GB and lasts almost exactly 1 minute
[13:21:00] possibly related to huge objects passing through, but I donno
[13:23:13] ema: one thing I wanted to specifically bring up in ops meeting (or do some research on where we're at on the issue in general beforehand) is T165252 from this weekend
[13:23:14] T165252: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252
[13:23:26] the thing about "why are we not icinga alerting on mcelog / temp still?"
[13:23:47] that one was going on as far back as syslog goes (for a week) before it started causing spikes of 503s
[13:23:58] ok, I'll bring that up
[13:24:01] thanks!
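On the transient storage cap discussed at 13:17: a minimal sketch of the generic mechanism varnishd offers for this, not the contents of the gerrit change linked above. By default the Transient stevedore is an unbounded malloc store; explicitly defining a storage backend named "Transient" makes varnish use it, size limit included, for transient objects. The instance name, VCL path and sizes here are illustrative placeholders.

```
# sketch only: bounding transient storage on a frontend instance
varnishd -n frontend -f /etc/varnish/frontend.vcl \
         -s malloc,100G -s Transient=malloc,2G
```

Once such a cap is hit, transient allocations start failing rather than growing without bound, which is the "minorly ugly (but better than ooming the host)" behaviour mentioned above and why conservative values that cover most reasonable spikes matter.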
[13:26:29] so staring at the ulsfo upload frontend hfp rate relative to the spike you pasted above
[13:26:33] https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&from=1494836114891&to=1494837414561
[13:27:24] it seems like the rate went down roughly when the storage usage spike started, and had yet another small dip at the end of the spike
[13:27:50] yeah but those dips seem to be part of the surrounding shape, mostly?
[13:28:07] I tend to think it's something other than hfp, but I really don't know
[13:29:20] we're hfp-ing all objects >=256KB for the fes, so it's not storage allocation failure. And the spike is very crisp with the 1-minute thing, and very huge. there's some past ones over the weekend that are similar, and have hit eqiad+ulsfo upload-fe around the same time, plus a smaller attendant spike on eqiad be
[13:29:39] also, hfp rate is the number of hits on hfp objects I think, not the number of hfp objects created
[13:29:46] my best wild guess is it's using transient as a buffer for a very large uncacheable object
[13:30:19] e.g. there's a 14GB file out there somewhere in upload, and when it's requested neither of the FE or BE cache it (because of our rules, intentionally), but chunks of transient storage end up getting used like a network buffer
[13:31:19] or something related to some scenario like that
[13:32:41] could even be that the space never really gets "used" - maybe a 14GB file is coming through like that, and varnish initially allocates 14GB of transient when it sees the CL header, then our VCL later changes the behavior and causes it not to use the space, but it takes 60s for the initial malloc to expire anyways
[13:32:53] because exactly 60s doesn't seem right for "spool 14GB out to a client" either
[13:34:08] heh, coincidentally there's also a bunch of timeouts set to 60s :)
[13:35:01] in any case, I'm offline soon and won't be back for a few hours. this should be the last of my recent unreasonable schedule disruptions :)
[13:35:32] :) see you
[13:50:08] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3263083 (10ayounsi) @fgiunchedi indeed, it's happening again. During those jobs, ports are completely saturated. Because of the nature of t...
[13:52:04] 10Traffic, 06Operations, 06Release-Engineering-Team: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3263087 (10Gilles)
[14:43:26] 10Traffic, 06Operations: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3263258 (10greg) (not really a RelEng task, we care about the debug servers and use them, but Ops manages them and the nginx config)
[17:26:44] 10netops, 06Operations, 10ops-eqiad: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3264152 (10Cmjohnson)
[18:57:04] 10Traffic, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#1396785 (10AndyRussG) @Krinkle Thanks so much for the explanation!!! Just c...
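Back on the 13:29–13:32 theory that the brief ~14GB upload-frontend spikes are a very large uncacheable object being buffered (or at least allocated up front from its Content-Length) in transient: a sketch of how stock varnish tooling could help confirm it. The frontend instance name and the 1GB threshold are assumptions, not values taken from production.

```
# watch transient usage on the frontend instance
varnishstat -n frontend -1 -f SMA.Transient.g_bytes

# log request groups whose backend response advertises a multi-GB Content-Length,
# to see whether a huge pass lines up with a spike on the grafana dashboard
varnishlog -n frontend -g request -q 'BerespHeader:Content-Length > 1073741824'
```

If such a fetch shows up at the start of each spike, that would also fit the 13:32 idea that varnish reserves transient space when it sees the CL header even though the object is ultimately passed rather than cached.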