[01:17:34] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428785 (10Krinkle) @GWicke Regarding T138848, note that there are two separate problems imho. I don't mind them being solved at the same time... [01:31:27] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428800 (10GWicke) @krinkle: Agreed that there are some subtle differences between the tasks. I mainly merged them since the discussion here h... [01:38:04] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428811 (10Krinkle) >>! In T133178#3345748, @GWicke wrote: > To summarize the options using a single domain only: > > ## Use www.wikimedia.or... [05:31:27] 10Traffic, 10netops, 10Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3429009 (10ayounsi) The BFD timer change didn't improve anything. Got emails from Zayo saying they fixed an issue on the the eqiad-codfw link (finished at around 02:14Z today). Will mon... [06:07:51] 10netops, 10Operations: Remove unsecure SSH algorithms on network devices - https://phabricator.wikimedia.org/T170369#3429050 (10ayounsi) [08:41:27] bblack: I think you're right wrt https://gerrit.wikimedia.org/r/#/c/364605/ [08:42:52] what do you mean with "304s on shorter objects" though? [08:47:26] and wow https://gerrit.wikimedia.org/r/#/c/364606/, what a commit to find in the morning! [09:10:23] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3429337 (10ayounsi) 05Open>03Resolved Switch went down for about 10min and came back up properly. Some notes: - The upgrade was more smooth than using NSSU - If a ganeti* h... [09:11:23] 10netops, 10Operations, 10Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3429339 (10ayounsi) 05Open>03Resolved a:03ayounsi Diffscan has been running smoothly. [09:13:03] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3429342 (10ema) >>! In T169345#3429337, @ayounsi wrote: > - The only page was "search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL" Preceded by "search.svc.codfw.wmnet/ElasticSearch... [09:15:19] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3395298 (10dcausse) The number of shards never reached the critical threshold, in irc I've seen: `10:24 PROBLEM - ElasticSearch health check for shards on search.svc.cod... [09:41:00] 10Traffic, 10netops, 10Operations: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429417 (10ayounsi) [10:00:06] 10Traffic, 10netops, 10Operations: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429506 (10Marostegui) From the db side: * db1031 needs to be downtimed as it is db2033's x1 master and will page with replication broken once db2033 becomes unreachable * We could downtime all the aff... [10:03:28] 10Traffic, 10netops, 10Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3429513 (10ema) The fetch errors last precisely 240 seconds on each machine. 
A quick look at our varnish-be settings seems to open two routes for investigation: - `thread_pool_timeout... [11:26:09] 10Traffic, 10netops, 10Operations: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429417 (10fgiunchedi) re: graphite machines, we'll take the 10 min hit, ditto mwlog [11:29:33] 10Wikimedia-Apache-configuration, 10Wikidata, 10Wikimedia-Site-requests, 10Patch-For-Review, and 3 others: wikidata.org/entity/Q12345 should do content negotiation immediately, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first - https://phabricator.wikimedia.org/T119536#3430010 (10... [12:56:55] ema: re the fixed-7d-keep patch: in T124954 Krinkle was talking about how, even for the main /wiki/Foo page outputs, MW doesn't obey conditional request semantics properly (it can return 304 Not Modified in response to a conditional, when in fact an unconditional request would've given different content than what Varnish currently has stored) [12:56:55] T124954: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954 [12:57:29] oh [12:57:43] fascinating :) [12:58:13] ema: but I think we can (have to) live with that for now, and it's ok for the article output case at 7d so long as we're aware of all the details. But this re-raises the original question-mark (the reason for the "cap keep to TTL" behavior we're trying) about other outputs: [12:59:53] ema: because MW has other shorter-term / special cacheable outputs, e.g. MW API requests that are cacheable for mere minutes, or resourceloader outputs, Special:Foo outputs, etc. If some of these are short-TTL (max minutes not days or weeks), are they also exhibiting some bad 304 behavior such that a fixed keep-time of 7d is really going to wreck their effective TTL? [13:00:37] right now the "cap keep at TTL" code we have is at least preventing the worst of that fallout, just at a cost of being dumber than it should be in a lot of normal cases [13:01:15] the obvious question is: how hard is it to fix MW so that it properly respects IMS semantics, but I imagine the answer is gonna be "very" [13:01:28] right, very [13:01:45] and probably the amount of very varies by which MW outputs we're talking about [13:02:28] the key quote bits (and I think we're mostly talking about the simpler cases like article pageviews + the bits they pull in) from the ticket are: [13:02:58] "MediaWiki quite often responds with a 304 Not Modified when the response is in fact different because we only track the internal wiki page content as means for validating If-Modified-Since. Changes to MediaWiki core output format, WMF configuration changes, and changes to the Skin, are not tracked in a way that the application is aware of" ... + "For the most part, the architecture design for la [13:03:04] rge-scale MediaWiki deployments is that all state outside the actual revision history of wiki pages is observed as static. And we rely on cache expiry to base compatibility decisions, such as: [13:03:08] How long to keep CSS or JS code around [...]" [13:04:51] you can kind of see that working, so long as you have control over cache policy (CC header and requests to us to mess with VCL), and so long as the cache is simplistic and doesn't use conditional requests, but simply refreshes the whole content when a TTL expires. 
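A toy sketch of the failure mode described above (illustrative Python, not MediaWiki code; the function and parameter names are made up): the origin validates If-Modified-Since against the page content's own timestamp only, so skin/config/output-format changes never turn a revalidation into a fresh 200.
```python
# Toy model of the 304 problem: only the page's last-touched timestamp is
# consulted, so anything else that went into the cached response is invisible here.
def revalidate(page_touched_ts, if_modified_since_ts, skin_or_config_changed):
    # A fully correct origin would also have to do something like:
    #     if skin_or_config_changed:
    #         return 200
    # but that state isn't tracked, which is exactly the problem being described.
    if page_touched_ts <= if_modified_since_ts:
        return 304   # cache keeps serving its stored copy, stale skin/config and all
    return 200       # page content changed -> a full re-render is sent
```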
[13:04:58] but the latter part, we really don't live in that world :) [13:05:12] (and don't want to either, we save a ton of transfer every day with 304s) [13:06:13] so on one hand changes like skin updates do not trigger PURGEs, but on the other we would want the page to be considered as changed in case of conditional requests [13:06:25] the good news is, in practice MW's $wgSquidMaxAge parameter does effectively limit the fallout, preventing potential infinite 304-refresh of stale content. [13:07:00] because once an article output (in MW, I guess in the parser cache?) is older than wgSquidMaxAge, it will fail the conditional request and send us a new copy. [13:07:27] but that value's currently 14d, so in spite of our 1d-per-layer TTLs, the effective content TTL for wikimedia in terms of invalidating old outputs is 14d at present. [13:08:08] (also, the reset at wgSquidMaxAge is unconditional - it's going to fail conditional and re-send the content even if it didn't change, too) [13:11:40] so, I recommended he drop that value to 7d (which will also drop the max CC:s-maxage to 7d as well), for now, to reduce things a bit. [13:12:10] if they drop lower than that, it would affect our ability to use the 7d-keep mechanism to help survive repooling after long depools anyways, as half-broken as that is now. [13:13:20] the Age-related issue I mentioned at the bottom of the ticket is interesting too. It's potentially easy for them to fix that in their outputs. [13:13:28] right, at least returning sane CC and Age can't be that hard though, can it? :) [13:13:50] at which point we could switch to "Cap the 7d keep value at the CC:s-maxage" which isn't as broken as capping to the current object TTL [13:17:11] ok so the idea for now would be to stop capping keep to the object ttl, wait for MW to return non-decreasing maxages and then perhaps re-introduce the cap? [13:17:48] well, I'd like to stop capping keep and just leave it at that, if we're sure that, aside from the discussed issue with 304s on page outputs, there aren't any glaring issues with 304 on other shorter-lived/special outputs. [13:18:07] but if we can't get some verification of that, I think we have to continue capping in some form or other throughout. [13:19:21] also on a re-read, I don't know for sure that I fully understand CC:s-maxage without an Age output, either. Is it possible the standards and/or Varnish have another interpretation there where they infer the missing Age from the Last-Modified date? (in which case we're actually counting down the TTL twice as fast as we should?) [13:19:29] surely not, but I guess I'm in a question-everything mood now. [13:23:42] bblack: it doesn't look like [13:23:55] age is only used, if present, to decrease t_origin [13:24:14] yeah I'm looking at that code now too [13:25:03] it does look at expires-date as a possible TTL value (both from server headers), but not -LM as an Age replacement [13:25:35] but all of that logic is only in the absence of a CC field, ok [13:25:57] so long as CC:s-maxage or CC:max-age exists, none of that logic runs, and only the CC+age matter [13:26:29] and missing age is effectively age=0 [13:27:38] perhaps it's me not having enough coffee after lunch, but why is all that under case 414? [13:27:48] it's fall-through for all the codes above too [13:28:04] ah! [13:35:34] so I agree that MW shouldn't decrease s-maxage, but should it send Age? Shouldn't Age be sent by caches only?
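A rough Python paraphrase of the origin-TTL derivation as read in the 13:23-13:28 exchange above (a sketch of the logic being described, not Varnish's actual code; header values are assumed to already be parsed into seconds / epoch timestamps): when Cache-Control carries s-maxage or max-age only CC and Age matter, a missing Age counts as zero, Expires/Date are only consulted in the absence of CC, and Last-Modified is never used as an implied Age.
```python
# Sketch of the TTL derivation discussed above; returns None when the
# configured default TTL should apply instead.
def origin_ttl(s_maxage=None, max_age=None, age=None, expires=None, date=None):
    age = age or 0                      # a missing Age is effectively age=0
    if s_maxage is not None:
        return s_maxage - age           # CC:s-maxage wins
    if max_age is not None:
        return max_age - age            # then CC:max-age
    if expires is not None and date is not None:
        return expires - date           # no CC at all: fall back to Expires - Date
    return None
    # note: Last-Modified is never consulted here as an Age substitute
```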
[13:36:14] or do we consider the applayer as a cache somehow in the discussion of all the above here? [13:36:40] MW has its own parser cache as well [13:36:51] right, in that sense we consider it a cache then [13:36:51] so the age is how long since it last re-rendered it [13:36:53] ok [14:09:11] IIRC the exptime in parsercache is set to now + 30d IIRC [14:09:16] fwiw :) [14:26:35] volans: oh I was getting the impression from the ticket discussion it was $wgSquidMaxAge (=14d presently)? [14:27:01] maybe it's that, if MW sees a parsercache entry older than wgSquidMaxAge it forces a re-render, and the exptime is something separate at the parsercache level? [14:29:07] could be [14:29:18] select max(exptime) from pc100; 2017-08-11 14:29:11 [14:29:33] and the other tables seems the same [14:29:49] this from pc1004 that is pc1 shard [14:31:47] bblack: the min(exptime) is 3 days ago, but doesn't mean much, there could be a long queue of items not requested anymore [14:45:50] and anyway IIRC we should have a cronjob that purges expired stuff that might be running every day now (it was every week before) [14:55:19] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3430769 (10ayounsi) Thank you! For the naming, please use: pfw3a-codfw and pfw3b-codfw for the firewalls and fasw-c8a-codfw and fasw-c8b-codfw for the switch members... [15:04:09] <_joe_> bblack: https://gerrit.wikimedia.org/r/364748 [15:04:18] <_joe_> puppet is failing across the fleet [15:09:57] lol [15:15:03] pybal upgraded to 1.13.7 on lvs2006, running tcpdump -n dst port 53 on lvs200[36] gives interestingly different results on the two hosts :) [15:15:27] so yeah, confirmed that till 1.13.6 we were resolving hostnames at each and every check [15:16:02] what are we doing now? I haven't really reviewed any of the new stuff [15:16:23] bblack: this is the gist of it: https://github.com/wikimedia/PyBal/commit/8a261c98b73abad3854ac19016c0a2223bafe704 [15:16:42] use self.server.ip instead of self.host (hostname) [15:17:12] we *thought* we were using the IP even before that commit, but that wasn't the case https://phabricator.wikimedia.org/T154759#3404022 [15:18:51] ok [15:19:37] my only immediate thought is, I know some services are configured with multiple ipv4 + multiple ipv6 in the config data in the puppet repo, I forget if that's broken out to separate services before it reaches this level in pybal, or those actually are lists of multiple still? [15:20:15] well, presently we have no examples of that anyways heh [15:20:35] but we've done it in the recent past at least to handle transitions (e.g. move maps IP to upload pybal service's definition) [15:20:49] text used to have several addresses, doesn't anymore [15:20:58] (per-dc+family) [15:22:17] the logic that drives the pybal.conf template digs into current-dc + family-split, but I don't know if it then splits IPs with distinct names within [15:22:23] or sets ip = X, Y, Z? 
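A minimal sketch of the idea behind the 8a261c9 change discussed above (illustrative only; the class and attribute names are simplified stand-ins, not PyBal's real API): monitors target the IP that was actually pooled into IPVS for the server, instead of re-resolving the hostname on every check.
```python
import socket

class Server(object):
    def __init__(self, host, ip):
        self.host = host   # e.g. "mw1189.eqiad.wmnet"
        self.ip = ip       # the IP that was pooled into IPVS for this server

class Monitor(object):
    def __init__(self, server):
        self.server = server

    def target_pre_1_13_7(self):
        # old behaviour (roughly): resolve the hostname on every single check,
        # possibly getting a different address than the one pooled in IPVS
        # (and gethostbyname is v4-only, mirroring the v4-only-checks problem)
        return socket.gethostbyname(self.server.host)

    def target_1_13_7(self):
        # new behaviour: check the address IPVS is actually forwarding to,
        # in the service's own address family
        return self.server.ip
```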
[15:23:59] ok I checked, my concern is invalid, it does split them [15:24:01] so we currently have things like textlb_80 and textlb6_80, using the v4 and v6 address respectively [15:24:33] pybal 1.13.7 then monitors the primary IP according to the service address family [15:25:12] I don't think we do monitor the non-primary addresses at all presently [15:25:14] right, and if we had multiple IPs there, they would've been listed in hieradata/common/lvs/configuration.yaml under separate labels, which breaks them into separate pybal services [15:25:25] so there is only one IP per service in that sense [15:25:37] e.g. under text we see per-dc data like: [15:25:38] codfw: [15:25:38] textlb: 208.80.153.224 [15:25:39] textlb6: 2620:0:860:ed1a::1 [15:26:04] if we had multiple IPs per family, they'd be like "text2lb: 192.0.2.1; text2lb6: fe80::1" or whatever [15:26:13] and get split to separate stanzas in pybal.conf templating [15:26:36] * ema nods [15:27:40] right, so for all the normal check types, this should be a vast improvement (monitor the configured IP, and don't pointlessly rely on and possibly even be fooled by DNS) [15:28:21] then there's the dns checker itself which is special, and apparently this patch makes sure it actually monitors ipv6 correctly? [15:29:01] 10netops, 10Operations, 10ops-eqiad: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3430940 (10ayounsi) For the naming, please use: pfw3a-eqiad and pfw3b-eqiad for the firewalls, and fasw-c1a-eqiad and fasw-c1b-eqiad for the switches Then for the cabling, please follow: {F87... [15:29:04] 10netops, 10Operations, 10ops-eqiad: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3430941 (10ayounsi) For the naming, please use: pfw3a-eqiad and pfw3b-eqiad for the firewalls, and fasw-c1a-eqiad and fasw-c1b-eqiad for the switches Then for the cabling, please follow: {F87... [15:29:12] I hope maybe some of this madness was the source of the sporadic failures/timeouts for some of the pools [15:29:39] (e.g. when pybal.log would sometimes monitor-down a large chunk of the appserver fleet temporarily and then bring it back shortly after) [15:31:27] so the dnsquery monitor was already doing A/AAAA queries [15:31:31] query = dns.Query(hostname, type=random.choice([dns.A, dns.AAAA])) [15:32:25] with this patch we simply use the main DNS server IP as the resolver instead of using one of its v4 addresses (randomly) [15:33:33] and I think this will have an impact on sporadic madness, I've confirmed that 1.13.6 on lvs2003 was failing a bunch of IdleConnect checks while 1.13.7 on pybal-test2001 (same config as 2003) wasn't [15:33:36] "main DNS server IP" being the one configured in the service? [15:34:26] I mean, I think unless I'm completely lost, the dnsquery monitor is for pybal pooling of actual DNS services, as in: [15:34:29] dns_rec: &ip_block005 [15:34:31] eqiad: [15:34:34] dns_rec: 208.80.154.254 [15:34:36] dns_rec6: 2620:0:861:ed1a::3:fe [15:34:51] so before, it was having two separate service stanzas for these, but not monitoring the v6 one correctly I guess since it only had a v6 IP? [15:35:35] oh, I'm thinking wrong anyways. it's not the service IPs that actually matter in this context, it's the per-backend-server IPs... [15:36:22] conftool-data/node/eqiad.yaml: [15:36:28] api_appserver: [15:36:28] mw1189.eqiad.wmnet: [apache2,nginx] [15:36:42] ..
which are specified in DNS terms [15:36:58] hmmm [15:37:11] dns: [15:37:11] chromium.wikimedia.org: [pdns_recursor] [15:37:42] nevermind the DNS case, since it's so much more confusing. So rewinding to e.g. IdleConnection monitoring of the appservers... [15:38:01] we don't have hardcoded IPs for all the backend servers, we only have hostnames [15:38:32] trying to re-read the patch and understand better now [15:39:33] I guess the key link I was not noticing is "that was selected to be pooled by IPVS" [15:40:16] so at monitoring time, we're using whatever IP IPVS was configured with for that backend server instance [15:40:39] and I'm guessing currently pybal isn't smart enough to notice a DNS change and delete->add at that level anyways [15:40:52] not at the IPVS level AFAIK [15:41:15] as in, detect the change, update IPVS state [15:41:42] so probably the DNS lookup of "mw1189.eqiad.wmnet" must happen when the node is first observed (via etcd pool data), and the A+AAAA are fetched and then cached, and used to set IPVS backends and are the IPs that are monitored [15:42:05] yes, that's my understanding [15:42:20] but if the DNS for that hostname actually changed at runtime, it probably wouldn't pick up the change for either purpose without a restart, or at least without deleting then re-adding the node via etcd [15:42:31] correcty [15:42:35] ok [15:42:37] s/y// :) [15:43:29] and the logic up to 1.13.6 was broken (even if it would have worked) in case of multiple IPs in the sense that we'd have kept on checking random IPs while still keeping only the first one pooled in IPVS [15:43:36] right, so that's all well and good, and before that patch the ipv4 and ipv6 versions of the dns_rec service were both basically monitoring over IPv4 [15:44:31] ok [15:45:28] I wonder what happens if the DNS lookup fails (e.g. due to some transient issue) when a backend server is first added to a pool via etcd (at startup, or at dynamic addition to the pool)? [15:45:51] I'd suspect the server doesn't make it into the pool, but does the operation get retried later, or crash the daemon, or? [15:46:20] no idea [15:47:25] it seems it'd call _initFailed and log an error [15:47:30] (pybal/coordinator.py) [15:48:17] and then just never add the host? [15:48:33] it's an interesting case for the FSM modeling IPVS and such too, right? [15:48:34] trying three times for each address family before giving up with 1/2/5 seconds timeouts [15:48:40] ok [15:48:54] so hopefully that papers over most transients in practice [15:49:07] right, but it's definitely an interesting case [15:49:17] but let's say a new backend server appears via etcd, and the 3x retries on the DNS lookup fail [15:49:39] is it now in the server list in pybal's memory, but not in the server list in IPVS because we have no IP to put there? etc... [15:50:12] probably the ideal way to manage that is to have some state bit about whether that entry has successfully initialized or whatever [15:50:39] so long as it's not initialized, it might still be periodically retrying the DNS lookup or any other failable operation until it gets it into IPVS and monitored via the IP [15:50:51] more generally, don't we want to potentially update a server IP in case it changes in DNS? [15:51:02] then the same logic could be used for failed init vs.
check for new address [15:51:32] but if it stays uninitialized because DNS never works, and then later the server entry is removed from etcd, it would need to delete the internal state it had on the backend server, but not do an IPVS delete or monitor-stop since it never reached that point [15:52:17] the "check for new address" is tricky, we'll have to get to the point that we're getting TTL information when we do DNS lookups [15:52:36] and refresh them on TTL expiry I guess, and notice that it changed and do a delete->add cycle and switch the monitoring data, etc [15:54:00] anyways, that's just random thinking about potential problems [15:54:11] someone actually has to map all of these states out in terms of the formal model [15:55:22] this has all kinds of database-like problems when looking at the model, too, since the etcd data and thus internal state probably effectively does its key-lookup based on the hostnames, but the IPVS state's natural keys are the IPs [15:56:01] re: sporadic madness, lvs2003 keeps on finding elastic2007.codfw.wmnet down and then up again, while lvs2006 (upgraded to 1.13.7) is happy [15:56:08] so this definitely seems like a step forward [15:56:10] so the formal model has to account for the idea that we may see backend servers mw1189 + mw1190 for a service as distinct in etcd and internal state, but both accidentally or temporarily resolving to the same IP 192.0.2.1 for appservers4 service, and not step on itself [15:56:57] I guess by detecting duplicate IPs among a set of servers, and refusing to sync state (initialize failure state like above) for the second one that arrives with the same IP, until the situation resolves [15:59:48] you could basically do something like: ip = lookup(new_server_host); if duplicates_existing_ips(ip) { consider this like a dns lookup failure state-wise } [16:01:12] which means if it was an IP change from a previous good IP, we'd IPVS delete but not re-add. if it's new, we just don't add. [16:01:39] which is actually probably different than handling a dns lookup failure on a previously-ok entry (we'd probably maintain old state if the lookup is failing, as opposed to successfully returning a bad/dupe IP) [16:02:17] the modeling for this crap is ridiculously complicated :) [16:02:30] I'm so glad it's not me doing it! :) [16:03:52] :) [16:44:12] oh, man. Twisted's dns client doesn't work if the nameserver ip is v6 [16:44:51] ip = "2620:0:861:1:208:80:154:50" [16:44:51] resolver = client.createResolver([(ip, 53)]) [16:44:51] deferred = resolver.lookupAddress(hostname, timeout=[5]) [16:44:56] this one fails ^ [16:45:09] with `ip = "208.80.154.50"` it works [16:45:45] https://matrix.org/jira/browse/SYN-254 [16:50:58] <_joe_> bblack: I want to merge https://gerrit.wikimedia.org/r/364458, I already ran puppet on all auth dns hosts [16:51:07] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#904929 (10ema) Almost there! I had to downgrade pybal to 1.13.6 on lvs1010 because of the following exception: ``` File "/usr/lib/python2.7/dist-packages/twist... [16:52:22] <_joe_> bblack: anything I should be careful about? [16:52:34] <_joe_> maybe run authdns-merge on cp1008 first? 
[16:52:40] <_joe_> the local version I mean [16:54:57] <_joe_> or ema as well, ofc [16:55:05] <_joe_> but it seems pretty late for ema :) [17:01:14] <_joe_> well, it's beginning to be late for me too, I'll merge/upgrade tomorrow morning [17:01:29] <_joe_> err, dns-update [17:02:45] _joe_: I think that trying an authdns-update on pu as you mentioned would be a good idea, yeah, but it definitely is late and I'm not even sure :) [17:03:07] <_joe_> yeah, I can do it tomorrow, I'm not in a hurry [17:03:16] _joe_: https://phabricator.wikimedia.org/T82747#3431360 if you want to lol a bit before calling it a day [17:03:24] see you all tomorrow o/ [17:03:40] <_joe_> ema: I saw that already [17:03:42] <_joe_> grrr [17:19:35] 10Traffic, 10MediaWiki-General-or-Unknown, 10Operations, 10Services (watching): Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#3431519 (10GWicke) [17:27:36] 10Traffic, 10Operations, 10monitoring, 10Patch-For-Review: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3121488 (10faidon) @Ema, it seems like the task as described has been completed (awesome work and great presentatio... [17:30:02] _joe_: yeah for tomorrow... the puppet-vs-dns repo dependencies there are tricky. I think probably the right sequence, if you're using separate commits that each work sanely: commit just the +mock to config-geo-test, then commit the puppet side of things which generates the real geoip resources (and run puppet on the authdnses), then do the commit->update with the actual new record in the zonefile? [17:31:01] <_joe_> I did the mock and the actual record at the same time [17:31:06] right [17:31:08] <_joe_> I mean in the same commit [17:31:12] <_joe_> that I still have to merge [17:31:13] that will pass jenkins tests, but fail to deploy [17:31:25] <_joe_> even if puppet is already done? [17:31:27] unless the puppet side is already done too? [17:31:35] <_joe_> that's merged, and applied [17:31:38] I don't know, I haven't thought about all of this in a month or two [17:31:40] ok [17:31:49] <_joe_> and I looked at the state files and they're ok [17:31:53] if the authdns hosts are showing the live geoip config for the new thing that came from puppet [17:32:03] then yeah perhaps that's all you need, and then your dns commit just works [17:32:42] <_joe_> yeah I am working on the lvs hosts, restarting pybal on all low-traffic ones for now [17:32:59] I think the problem I ran into before was merging the dns commit first, which passes jenkins and then fails to reload on authdns-update [17:47:21] <_joe_> yeah i'll still do it tomorrow [19:27:29] 10Traffic, 10ArchCom-RfC, 10Operations, 10Performance-Team, and 5 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3432296 (10GWicke) [19:48:13] 10Traffic, 10Operations, 10Performance, 10Services (later): Look into solutions for replaying traffic to testing environment(s) - https://phabricator.wikimedia.org/T129682#3432378 (10GWicke) [19:49:05] 10netops, 10Operations: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3432382 (10ayounsi) 05Open>03Resolved Did some more troubleshooting on that interface and some others showing regular l3 incomplete. I managed to capture packets coming from various providers, toward...
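A self-contained version of the Twisted resolver snippet pasted at 16:44 above, for reference (assumes Twisted is installed and the resolver address is reachable; the hostname is just an example value, any resolvable name will do): with the v6 nameserver address the lookup errors out, with the v4 one it succeeds.
```python
from __future__ import print_function   # the hosts in question ran Python 2.7
from twisted.internet import reactor
from twisted.names import client

hostname = 'www.wikimedia.org'                 # example name; anything resolvable works
nameserver = '2620:0:861:1:208:80:154:50'      # fails; '208.80.154.50' works

# createResolver() takes a list of (address, port) tuples for the upstream servers
resolver = client.createResolver([(nameserver, 53)])
d = resolver.lookupAddress(hostname, timeout=[5])
d.addCallbacks(lambda res: print('OK:', res), lambda err: print('FAILED:', err))
d.addBoth(lambda _: reactor.stop())
reactor.run()
```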
[19:49:48] 10Traffic, 10netops, 10Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3432388 (10ayounsi) p:05Normal>03Triage [19:50:07] 10Traffic, 10netops, 10Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3205379 (10ayounsi) p:05Triage>03Normal [20:13:52] 10netops, 10Operations: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#3432604 (10Cmjohnson) [20:13:57] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2762011 (10Cmjohnson) 05Open>03Resolved Resolving this task, I created a subtask for the decom portion. [20:19:39] 10netops, 10Operations, 10ops-eqiad: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3432650 (10Cmjohnson) @ayounsi I will start racking the gear this week according to the google doc. Much of what I can remove are cable managers. [22:37:42] 10Traffic, 10ArchCom-RfC, 10Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3433502 (10GWicke)