[04:12:32] 10netops, 10Analytics, 10Operations: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041596 (10ayounsi) a:03ayounsi 1st change applied. Waiting for confirmation for the 2nd.
[05:37:16] 10HTTPS, 10Traffic, 10Analytics, 10Operations: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041656 (10Tbayer)
[05:40:34] 10HTTPS, 10Traffic, 10Analytics, 10Operations: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041659 (10Tbayer) @BBlack Thanks again! Back to the task at hand: I have tentatively updated the documentation based on my understanding of your remarks: https:/...
[09:03:31] moritzm, _joe_: regarding messages' rate limiting, it looks like jessie doesn't allow configuration of Journal per unit: Mar 12 09:03:11 pybal-test2001 systemd[1]: [/lib/systemd/system/pybal.service:10] Unknown section 'Journal'. Ignoring
[09:05:51] 10netops, 10Analytics, 10Operations: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041924 (10elukey)
[09:08:37] vgutierrez: actually, the rate limit seems only configurable in journald, not per unit
[09:09:18] yup... I've seen that :(
[09:09:35] but at least on stretch packages can deploy their own config without messing with /etc/journald.conf
[09:10:25] at least, the manpage mentions several paths for that
[09:10:26] /run/systemd/journald.conf.d/*.conf
[09:10:27] /usr/lib/systemd/journald.conf.d/*.conf
[09:10:33] those aren't present in jessie apparently
[09:10:44] https://manpages.debian.org/stretch/systemd/journald.conf.5.en.html VS https://manpages.debian.org/jessie/systemd/journald.conf.5.en.html
[09:10:48] yep, but on jessie only /etc/systemd/journald.conf :-/
[09:10:56] indeed :(
[09:12:38] 10netops, 10Analytics, 10Operations: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041928 (10elukey) Since we are doing some cleanups, I'd also like to review the following: ``` term mysql { from { destination-address { 10...
[09:13:09] 10netops, 10Analytics, 10Operations: Review some IPs in the analytics-in4 filter - https://phabricator.wikimedia.org/T189408#4041932 (10elukey)
[09:13:53] yet another reason to do T177961
[09:13:54] T177961: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961
[09:24:07] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4041969 (10Vgutierrez) It looks like journald messages' rate limiting is not configurable per unit. So it needs to be done system-wide. Even worse, in Debian jessi...
[10:09:10] 10netops, 10Operations, 10ops-codfw: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4042196 (10ayounsi)
[10:43:43] 10netops, 10Cloud-Services, 10Operations: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#4042381 (10ayounsi) 05Open>03Resolved
[11:46:00] 10Traffic, 10Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4042645 (10faidon)
[13:00:57] ema: so, these fetchfailed/503 connection pileup things. I tried to read through backlog a bit...
[13:01:38] it seems there was some differential from setting the transaction_timeout, but maybe we still need to look into exactly what?
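For reference on the journald discussion above (09:03-09:24): since rate limiting is only configurable globally, any tuning would go into a system-wide journald drop-in. A minimal sketch follows; the filename and values are illustrative assumptions, not the actual puppetized config.

```
# /etc/systemd/journald.conf.d/90-ratelimit.conf (hypothetical name)
# Drop-in directories like this one exist on stretch but not jessie, where only
# /etc/systemd/journald.conf itself is read. Rate limiting is global, not per unit.
# Older systemd (e.g. jessie's) spells the first option RateLimitInterval=.
[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=20000
```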
[13:02:24] I don't *think* it takes effect on existing transactions that are already in-progress when it's set.
[13:02:49] but it could still have a relatively-short-term positive effect if it prevents further hanging transactions from consuming threads.
[13:03:33] but, I would suspect it would only work well if there was actually a backend cause (e.g. MW, RB, etc actually having slow responses is what's initially causing the pileup)
[13:04:15] if the pileup's root cause is varnish-internal (e.g. storage woes -> lock contention -> massive internal delays in handling requests), I think it's hard to say whether or how transaction_timeout might kick in and/or help.
[13:04:59] (I suppose it still might, if during the delays transactions are still occasionally making minor forward progress and invoke the timeout-checking code, which checks vs the absolute start time of the backend fetch)
[13:07:37] aside from theorizing at this level, though:
[13:08:43] 1) Let's get your slowlog improvements merged, so we can possibly get some better insights. I'll try to review them shortly, at least visually.
[13:09:28] 2) I'll spend some time this morning looking at the quicker-restarts thing (e.g. 3d or 4d instead of 7d).
[13:32:06] 10netops, 10Operations, 10ops-codfw: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4043057 (10Papaul) p:05Triage>03High
[14:29:52] <_joe_> all: say I were to add some timing-at-apache-layer to the appservers, so that we have system-level telemetry on how mediawiki is behaving
[14:30:10] <_joe_> what do you think would be interesting to record as metrics from the traffic perspective?
[14:30:23] _joe_: we already have that in Backend-Timing, or do you mean something else?
[14:30:56] <_joe_> backend-timing is measured by varnish, correct?
[14:31:06] <_joe_> or you mean the mediawiki-measured data?
[14:31:29] I mean the Backend-Timing header set by MW's Apache
[14:31:58] <_joe_> do you plot/collect those data?
[14:32:13] <_joe_> that's basically what I wanted to collect
[14:32:15] we're working on integrating it as part of these patches
[14:32:45] (when we log a slow req up in varnish, we also extract the delay from that header if it's present and log that too, so we can show whether or not there was a matching delay in Backend-Timing)
[14:33:34] there's a universal problem, of course, that we can't trust any given daemon in the stack to monitor its own delays reliably, only those of
[14:33:41] <_joe_> ok, that's for the slowlog which is already useful, I wanted to export via prometheus the latencies of various endpoints
[14:33:52] <_joe_> bblack: yes
[14:34:54] So in this case, since Apache is recording/emitting Backend-Timing, it's a reliable indicator of timing constraints on hhvm and everything beneath hhvm, but it's not necessarily reliable that a short D= there isn't paired with an unrecorded delay in Apache itself.
[14:35:52] 10netops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#4043288 (10ayounsi) 05Open>03Resolved This is done, all peers are up with proper new ASN. AS43821 is not in use anywhere in esams.
[14:36:05] <_joe_> we should daisy-chain nginx to it so that we know the latencies in apache!
[14:36:16] similarly, varnish-be might claim all of its timers were short for a given request, but the consuming varnish-fe might say "hey but fetch took forever", indicating varnish-be's unreliability.
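As a rough illustration of the cross-check being discussed here (this is not the actual varnishslowlog implementation), a varnishlog invocation along these lines would surface backend fetches slower than 10s on a cache host together with any Backend-Timing header the applayer sent; the threshold and tag selection are assumptions.

```
varnishlog -b -g request \
  -q 'Timestamp:BerespBody[2] > 10.0' \
  -i Timestamp -i BereqMethod -i BereqURL \
  -I 'BerespHeader:Backend-Timing'
```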
[14:36:19] <_joe_> oh wait, we already do it, but not for varnish :P
[14:36:36] (or, something between the two)
[14:36:45] <_joe_> bblack: yes, that's why I want to at least have a timer that's outside MediaWiki
[14:36:50] <_joe_> and before varnish-be
[14:37:11] <_joe_> if you're extracting the be-timing data already for slowlogs, that's great
[14:37:30] <_joe_> or if you're going to do that soon
[14:37:46] _joe_: if you want the data in the general case, even when varnish doesn't record a slow-req
[14:38:31] probably the best avenue for that is to hook up with analytics on getting it added as a field at that level (via patching Varnish X-Analytics output that gets sent to kafka, and doing whatever they do on that end to support a new field)
[14:39:06] <_joe_> bblack: yes, that's clearly cleaner than my original solution
[14:39:20] <_joe_> although that means having the data in hadoop or something :P
[14:39:20] we could also do it the other way, where we send data to prometheus from mtail or whatever, but I increasingly tend to think the analytics-level solution is better (or logstash, depending on the case)
[14:39:43] <_joe_> bblack: yeah I was thinking of mtail+prometheus to get a decent dashboard
[14:39:46] prometheus should be about our systems, analytics should be about per-request metrics
[14:40:08] (IMHO)
[14:40:22] <_joe_> well, I'm thinking of aggregating the result for each endpoint/wiki
[14:40:29] and the interface is pretty decent for investigation, but it's non-realtime and not for alerting, etc
[14:40:33] <_joe_> that would help a lot spotting slowness
[14:40:56] the varnishslowlog stuff will get you the slowness spotting I think
[14:40:59] <_joe_> I want to have a dashboard that clearly tells me "the api cluster is overloaded/slow"
[14:41:10] if MW responds slow, by definition varnish also responds slow, and then it's there in logstash.
[14:41:17] <_joe_> well, only over a certain threshold, right?
[14:41:25] yes
[14:41:48] <_joe_> so my point is: for load.php, 400 ms is already several std devs out of the normal response times
[14:41:54] <_joe_> I'd like to know that
[14:42:16] <_joe_> or to know requests for wikis in s3 on the apis are currently slower than normal
[14:42:20] yeah I'm more worried about whether there are occasional values in the multi-second range stalling things
[14:42:24] <_joe_> but not enough to cause a drama
[14:42:33] we've observed them before with RB<->parsoid-related fetches
[14:42:44] and it doesn't take many super-slow requests to pile up connections
[14:42:49] <_joe_> I think both things are needed
[14:43:47] yeah, we need to put some thought into the other case (re: having general graphing/alerting on the Backend-Timing delay field that's ops-useful)
[14:43:54] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4043303 (10faidon) p:05High>03Unbreak! ``` faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17 78:2b:cb:2d:fa:e6 10.64.0.17 ae1.1017 none faidon@re0...
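For context on the header being discussed: a Backend-Timing header of this shape is typically emitted with a mod_headers directive along the following lines (a sketch; the exact directive in the production apache config may differ), using the %D/%t format specifiers whose documentation is quoted further down in the log.

```
# mod_headers format specifiers: %D = duration from request receipt until the
# response headers go out, prefixed "D="; %t = request arrival time in
# microseconds since the epoch, prefixed "t=".
Header set Backend-Timing "%D %t"
# yields e.g.:  Backend-Timing: D=123456 t=1520865280000000
```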
[14:44:21] I think Backend-Timing is generically templated into place anywhere we have apache too, not just the MediaWiki case
[14:44:29] but also, some services won't be apache-based and won't have it
[14:45:01] so, I think maybe Varnish is the wrong place to be gathering it from
[14:46:06] maybe something (perhaps optional) should be deployed alongside all apache installs that records Backend-Timing data from apache log outputs directly to prometheus, along with some relevant non-PII metadata like the request URI, the server hostname, and the service's name?
[14:46:51] (well, I don't know if we can always assume the request URI is non-PII. Maybe a service owner has to turn this on and it's documented to check for that first, I donno)
[14:51:19] hmmmm
[14:51:39] so I never really looked deeply at Backend-Timing before, now I'm staring at it and the relevant apache docs
[14:51:45] "The time from when the request was received to the time the headers are sent on the wire. This is a measure of the duration of the request. The value is preceded by D=. The value is measured in microseconds."
[14:52:10] so D=123 means it took 123 microseconds to get to the point of sending response headers.
[14:52:25] it's saying nothing about the possible 6-hour delay sending the contents afterwards :P
[14:53:58] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043362 (10faidon) a:05faidon>03ayounsi Just heard from RIPE: ``` I just finished the provisioning of sg-sin-as14907.anchors.atlas.ripe.net and noticed that port 5666 is filtered....
[14:55:14] I was about to update pybal in eqsin, so before doing it I want to understand the whole picture, and I got this strange output from cr1-eqsin
[14:55:22] vgutierrez@re0.cr1-eqsin> show route 103.102.166.224
[14:55:22] inet.0: 680288 destinations, 2432666 routes (680082 active, 1 holddown, 208 hidden)
[14:55:25] Restart Complete
[14:55:28] + = Active Route, - = Last Active, * = Both
[14:55:30] 103.102.166.224/28 *[Static/5] 6d 09:50:25
[14:55:33] > to 10.132.0.11 via ae1.520
[14:55:43] I was expecting a /32 BGP route from lvs5001
[14:56:00] what am I missing here?
[14:56:02] what you're seeing is the static fallback route
[14:56:09] there's always a static fallback route to the intended primary LVS
[14:56:12] right
[14:56:17] so yeah, no BGP is working for that route presently
[14:56:42] vgutierrez@re0.cr1-eqsin> show bgp neighbor 10.132.0.11
[14:56:42] Peer: 10.132.0.11+179 AS 64600 Local: 103.102.166.129+63245 AS 14907
[14:56:43] Type: External State: Established Flags:
[14:56:49] that could be because (a) all of this was recently configured and something's wrong/missing in the router-side config
[14:56:56] thing is, pybal is up on 10.132.0.11
[14:57:10] Table inet.0 Bit: 10001
[14:57:10] RIB State: BGP restart is complete
[14:57:10] Send state: in sync
[14:57:10] Active prefixes: 0
[14:57:11] Received prefixes: 1
[14:57:14] or (b) pybal's bgp to the router is borked because pybal
[14:57:22] or (c) some other existing mis-config
[14:58:03] perhaps the router-side is filtering out the received prefix, and it needs to be whitelisted appropriately
[14:58:38] I'd sync up with ayounsi on it
[14:58:43] yup
[14:58:57] for some reason the router is not accepting the prefixes it's getting from pybal
[14:59:08] Accepted prefixes: 0
[14:59:12] happens with lvs5001-3
[14:59:29] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043392 (10ayounsi) Should be good now for eqsin.
[14:59:49] XioNoX: ^
[15:00:11] XioNoX: cr1-eqsin ignoring pybal BGP routes is a feature or a bug? :)
[15:00:45] looking
[15:05:38] "Hidden reason: rejected by import policy"
[15:07:02] so it looks like a config issue on the router side
[15:07:04] Need to add the eqsin LVS prefix to LVS-service-ips/LVS-service-ips6
[15:07:06] yup
[15:08:34] XioNoX: that's silent regarding BGP, right?
[15:08:57] I mean, junos doesn't notify pybal via BGP notification messages that the prefixes aren't being accepted
[15:09:05] indeed
[15:09:12] prefix is accepted now
[15:09:46] <3
[15:09:48] 103.102.166.224/32 *[BGP/170] 6d 10:04:39, localpref 100
[15:09:48] AS path: 64600 I, validation-state: unverified
[15:09:48] > to 10.132.0.11 via ae1.520
[15:09:48] [BGP/170] 5d 21:35:01, MED 100, localpref 100
[15:09:51] AS path: 64600 I, validation-state: unverified
[15:09:53] > to 10.132.0.13 via ae1.520
[15:09:56] nice :D
[15:10:44] let's upgrade eqsin to pybal 1.15.2 then :D
[15:10:52] * vgutierrez breaking stuff
[15:10:56] O:)
[15:16:34] breaking eqsin... you hero
[15:23:07] * _joe_ hands vgutierrez the apache config for mediawiki
[15:23:44] * vgutierrez runs away
[15:23:56] <_joe_> uhm, the new one is too smart
[15:24:06] yeah so, now I'm even more leery of trusting Backend-Timing than I was before, given the "headers" definition of it.
[15:24:54] <_joe_> bblack: why?
[15:24:57] strict interpretations in the general case aside though, in general do we expect hhvm to not respond at all with headers until it can stream the whole response? or do we expect hhvm to sometimes send headers and then possibly pause before/during content generation?
[15:25:34] <_joe_> in the general case, it can do both
[15:25:54] _joe_: I previously assumed incorrectly that our Backend-Timing D= value was the latency of the whole transaction request->response cycle.
[15:25:56] <_joe_> basically in php headers are sent back to apache as soon as something tries to write some output
[15:26:08] but the definition from the apache docs claims for that field: The time from when the request was received to the time the headers are sent on the wire.
[15:26:17] nothing to do with content output timing
[15:26:22] <_joe_> bblack: headers sent on the wire by apache
[15:26:52] <_joe_> I think, given it's acting as a reverse proxy, that it doesn't do request streaming
[15:26:57] <_joe_> response streaming
[15:27:08] <_joe_> but we need to check that
[15:27:08] well that's not a general restriction of reverse proxies
[15:27:20] <_joe_> no, just of how crappy mod_proxy_fcgi is
[15:27:25] <_joe_> from what I remember of the code
[15:28:27] either way, even if the hhvm<->fcgi<->apache pipeline fully buffers (i.e. apache cannot send response headers until hhvm has sent the full response content and finished already)
[15:29:00] <_joe_> it's not measuring what matters to varnish, yes
[15:29:02] that says nothing about whether the content was slow to send from apache's buffer, because of e.g. network issues and/or a small tcp window, etc
[15:29:06] <_joe_> just the backend processing time
[15:29:12] <_joe_> yes
[15:29:44] it could be the case that varnish logs FetchTime=72s, and Apache logs D=0.1s, but the network transfer of the response is the problem (and then we have to look at why... tcp perf issues on either end, etc)
[15:31:03] the thing we really want to extract at the moment (or try to), is when Varnish says FetchTime=72s, was it because of varnish-internal bullshit, or something beyond varnish (network conditions, lvs, apache, etc)
[15:31:33] <_joe_> yeah so if that hypothesis is true (apache does full buffering), that number is useful for my goals, not yours
[15:31:37] no definition of the apache D= time tells us about network/lvs (and possibly some apache) cases. but the headers-only definition is even weaker than I was expecting.
[15:33:08] if we assume full buffering is definite and absolute, it does still eliminate hhvm and beneath as a cause, though.
[15:33:25] <_joe_> yes, that's what I was hoping for
[15:33:30] (and leave us looking at Varnish, network, LVS, and maybe-apache)
[15:33:38] <_joe_> or well, identify problems in the applayer
[15:33:46] <_joe_> in general, not just their absence
[15:33:50] right
[15:34:38] <_joe_> but yeah ofc there is a ton of things we don't consider at first glance and that are potentially relevant
[15:34:50] <_joe_> leaky abstractions are leaky :P
[15:35:14] in this case, I still think all of the recent stuff is related to all the ongoing investigations that are slowly starting to tie together in the 160s ticket.
[15:35:33] I think it's just that we keep changing other semi-related things, and the behaviors of the bug shift around and look different or better or worse
[15:35:33] <_joe_> what's the 160s ticket? the one about slowlog?
[15:35:37] yes
[15:36:01] <_joe_> ok
[15:36:02] https://phabricator.wikimedia.org/T181315
[15:36:16] <_joe_> sorry, bbiab, I got to go away before the meeting
[15:36:19] timo started to pull together some of the past bugs in the top too, that are likely other manifestations of the same basic underlying problem
[15:36:56] so, the meta-point here being this is a situation we've repeatedly debugging and/or mitigated, and has still repeatedly eluded us and re-surfaced.
[15:37:20] so at this point it warrants us being very careful about tracking down the exact nature of the root cause in extreme detail.
[15:37:43] s/debugging/debugged/
[15:37:47] https://httpd.apache.org/docs/trunk/mod/mod_proxy_fcgi.html -- our configuration doesn't use flushpackets, which is how it's documented to disable output buffering.
[15:38:10] So it would be expected that apache would buffer the entire response, as far as I can tell.
[15:39:52] if that's the case, unrelatedly I wish it would bother tacking on a content-length header since it already knows :)
[15:40:15] (lack of CL output on MW responses is a thorn in the side of cache perf tuning in general)
[15:41:23] if it were emitting CL consistently, we could apply the file-storage binning mitigation we use on cache_upload (since swift always emits CL)
[15:42:48] <_joe_> marlier: I don't think flushpackets was available in mod_proxy_fcgi 4 years ago, when we wrote our config
[15:43:10] <_joe_> I'm not even sure if it's available in the version we have installed today (but it might well be)
[15:45:12] also, I'm not entirely convinced, by just a quick glance at the apache docs, that flushpackets isn't only about the contents.
[15:45:14] <_joe_> apparently it is
[15:45:23] <_joe_> bblack: right
[15:45:44] i.e. it may be a valid interpretation of the docs that the revproxy will receive, process, and forward headers, and then after that deal with buffering (or not) the output content.
[15:47:14] if the implementation is simplistic, it probably receives a full set of headers, makes decisions based on them and/or modifies them (e.g. adds Backend-Timing and other such configured headers), and sends them onwards immediately before even thinking about consuming->proxying the content part.
[15:47:27] which would make it impossible for it to tack on a CL header after buffering the response, too
[15:48:37] now that I think about it, that raises other questions about the definition of that Backend-Timing D= value heh
[15:48:58] <_joe_> bblack: not sure about what we do in the logs, but I think we record the same thing
[15:49:07] how can it possibly record the delay as documented ("The time from when the request was received to the time the headers are sent on the wire.
[15:49:16] ")
[15:49:25] and then send that value within the headers being sent on the wire? :P
[15:49:27] <_joe_> logs don't have any such limitation, btw, given they're written after the response is completed
[15:49:32] right
[15:50:02] anyways
[15:50:18] this is all a deep side-path that's maybe not necessary at present for our immediate problems
[15:50:35] we can log the D= and all the varnish fields whenever varnish has a long timer, and dig from there
[15:53:49] Sorry, have been reading through the mod_proxdy code.
[15:53:53] proxy*
[15:54:56] it does appear to have some native flush logic, regardless of how flushpackets is set
[15:55:50] https://svn.apache.org/viewvc/httpd/httpd/trunk/modules/proxy/mod_proxy_fcgi.c?revision=1823886&view=markup#l599 -- writebuflen is defined as being equal to the iobuf_len setting, which I think I saw documented as 8192 bytes by default.
[15:56:37] marlier: you can tune the iobuf_len via a mod_proxy param
[15:56:45] Right
[15:57:06] We appear not to, AFAICS
[15:57:45] the flushpackets logic is going to be in 2.4.32 (upcoming release).. it basically inserts a FLUSH bucket after each write to the output brigade
[15:58:11] it helps some use cases but generally it is a bit invasive
[15:58:35] and could end up having worse perf if not used wisely
[15:58:49] (more overhead for small chunks of bytes that could be grouped, etc..)
[15:59:46] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043614 (10ayounsi) a:05ayounsi>03faidon
[16:00:54] For sure, just trying to clarify what the behavior is...
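For reference, the two knobs discussed above are exposed as ProxyPass worker parameters. A hypothetical example follows; the backend address, path and values are made up, and flushpackets= only takes effect for mod_proxy_fcgi from 2.4.32 onwards, as noted above.

```
# iobuffersize adjusts the proxy's internal IO buffer (default 8192 bytes);
# flushpackets=on would flush after each chunk written to the output brigade.
ProxyPass "/w/" "fcgi://127.0.0.1:9000/srv/mediawiki/docroot/w/" iobuffersize=8192 flushpackets=on
```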
[16:05:05] bblack: headers are read, sent to the client, then the body starts sending.
[16:05:53] right
[16:06:19] so, on the present problem we're observing, which is essentially that for at least some requests, varnish logs a very long time fetching them from the backend
[16:06:33] (and yet, in at least some cases Backend-Timing D= is small)
[16:07:06] we've got a few possibilities in current decreasing order of likelihood (but subject to change as we investigate of course)
[16:07:40] 1. Root cause is varnish-internal issues with storage woes (fragmentation, lock contention, etc), which cause varnish to excessively delay itself while processing the fetch from apache.
[16:08:40] 2. Varnish-internal issues are ok (ish, but perhaps exacerbate things, explaining the correlation to uptime in days), but there's a real delay outside of varnish in getting the response back, one not counted in D=
[16:08:54] possible causes of that:
[16:09:54] 2a. lvs1003 handles the varnish->apache side (but not the apache->varnish side) of the tcp connections, and is known to be a shitty old host with an overloaded network card routinely sending pause frames to the switch because it's not operating well at this traffic scale. It could be harming the TCP connections, which shrinks windows and makes the transfer slow.
[16:10:35] 2b. apache could be, for whatever reason more-local to the apache host, delaying sending the complete output for its own reasons.
[16:11:03] 2c. apache could be forwarding the headers quickly and not introducing its own delay at the apache or host level, but hhvm is trickling out the response through it
[16:12:08] 2d. some other network-layer issue
[16:12:30] (but we have decent metrics on most of the 2d-like causes, so it seems unlikely)
[16:13:11] ---
[16:14:55] what makes this difficult to reason about is that once at least some requests are taking a long time to complete, even if it were, say, a (2c) sort of problem, that will tie up varnish backend connections and threads, which then will definitely help trigger a cascade of (1) above with the storage-related woes and spread the failure far and wide to all requests flowing through that varnish.
[16:15:44] so, when we see the big fallout and we look at the 503s, they look pretty random and probably weren't problematic on the apache/lvs/etc level.
[16:16:19] but what we really want to know is what caused the pileup in the first place. If it's not (1), it could be a 2-like case which affects a relatively small percentage of requests, but ties up threads for very long time windows.
[16:16:49] we've seen this before, where some restbase<->parsoid mayhem was tying up a varnish thread/connection for ~3 minutes before eventually failing a single request.
[16:16:59] and if those piled up in a burst, it took down everything as per the cascading above
[16:18:12] (also, 2b/2c above are written in terms of apache/hhvm for MediaWiki, but yeah, it could be another service causing the root issue, such as RB)
[16:19:22] I hope with an expanded slowlog, when the incident happens again (it almost certainly will), we can focus on the first batch of slowlog entries bursting out of that cache host and see if there's a pattern to what service/url/etc is involved in those.
[16:19:58] if they look random, then we're almost certainly stuck with the varnish-internal explanation.
[16:28:39] --- :)
[16:29:13] but all of that being said, some of that's from the broader perspective of the long-term inter-related problems we've seen on this front. There could be more than one cause, etc.
[16:29:33] the short-term problems of the past several days are almost certainly going to end up in the varnish-internal bin.
[16:29:49] (but I wish we could prove it with more than uptime correlation)
[18:21:08] 10netops, 10Operations, 10ops-codfw: audit codfw switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044272 (10RobH) p:05Triage>03Normal
[18:35:53] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044380 (10faidon)
[18:37:07] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044272 (10faidon) I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence t...
[18:40:53] 10netops, 10Operations: Detect IP address collisions - https://phabricator.wikimedia.org/T189522#4044398 (10faidon) p:05Triage>03High
[18:47:39] 10Traffic, 10Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4044439 (10faidon)
[22:31:46] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044914 (10Dzahn) I heard that repos now exist. Could you update the ticket with the repo names please?
[22:35:37] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044925 (10Volker_E) @Dzahn The repos are: https://gerrit.wikimedia.org/r/#/projects/design/landing-page for the root https://desi...
[22:56:32] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044983 (10Volker_E)
[22:57:10] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Volker_E)
[23:08:59] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045049 (10RobH) Ok, I misparsed all of that. So next steps: 1) Chris traces out and sees what is connected to ge-4/0/18. It somehow has the same IP address as wdqs1004 and is just...
[23:40:30] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994382 (10Platonides) Well, if the server itself is needed, it will be doing its work with a different IP address than the one of wdqs1004, since it would have been suffering the same...
[23:59:02] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4045191 (10tstarling) So we need a Greenhouse admin to go to Configure > Email Settings, then enter "careers.wikimedia.org" for the domain and click "Register"....