[00:57:39] 10Traffic, 10Operations, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230484 (10Krenair) > Separately, some sort of letsencrypt::server class would collect the list of hosts which have applied each of the defined certs, in order...
[01:38:42] 10Traffic, 10Operations, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230570 (10Krenair) Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890
[03:02:01] 10Traffic, 10Operations, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230636 (10BBlack) >>! In T194962#4230355, @Krenair wrote: > Anyway, as part of my initial code I made the "oh, it's not issued yet, let's use a self-signed cer...
[09:17:09] https://labs.ripe.net/Members/romain_fontugne/as-hegemony-measuring-as-interdependence
[09:17:28] and us: http://ihr.iijlab.net/ihr/14907/asn/?date=2018-05-24&last=7
[09:28:33] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10RazeSoldier)
[09:29:55] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10jcrespo) Can you retry, it should be solved now or shortly (or may need a refresh of your browser)?
[09:30:05] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10Marostegui) We are on it
[09:31:44] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230953 (10RazeSoldier) >>! In T195563#4230935, @jcrespo wrote: > Can you retry, it should be solved now or shortly (or may need a refresh of your browser)? Yes, I retry but this problem still exists.
[09:35:24] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230955 (10RazeSoldier) It seems that I can visit wikitech but other WMF's websites cannot access.
[09:36:42] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230962 (10RazeSoldier) I can visit now, but I don't know what happened.
[09:38:05] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230963 (10jcrespo) Not your fault, issues with connectivity on certain regions on the world- a datacenter was disabled to workaround it. Thank you for the quick report, it helped!
[10:06:57] 10Domains, 10Traffic: HTTP 500 on invalid domain - https://phabricator.wikimedia.org/T195568#4231062 (10Tgr)
[10:07:21] 10Domains, 10Traffic, 10Operations: HTTP 500 on invalid domain - https://phabricator.wikimedia.org/T195568#4231073 (10Tgr)
[11:03:02] XioNoX: nice tool. I think all things considered we're doing pretty ok on that hegemony metric given how small our network is :)
[11:03:17] yeah I agree
[11:05:59] the result on google's AS is interesting too, as it's showing no dependencies. I guess in real terms, this means every ripe node they test from happens to be in a network with a direct peering to google. There might be some networks in the world that need transit to reach google *and* lack a ripe test node though.
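A note on reading these hegemony pages, following the interpretation the channel applies below to the AS14907 and AS32934 totals (this is the channel's own reading, not a definition from the RIPE Labs article): if H_i is the dependency score listed for each transit AS i, then roughly

    1 - Σ_i H_i  ≈  fraction of RIPE probe vantage points that reach the AS without crossing any transit

so 1 - 0.54 ≈ 0.46 for Wikimedia's AS14907 and 1 - 0.09 ≈ 0.91 for Facebook's AS32934, matching the ~46% and ~91% figures quoted below. A single path can traverse more than one of the listed transits, so the sum double-counts such paths; treat these as rough figures.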
[11:06:46] but still, it's kind of crazy that a company that isn't an ISP can reach that point
[11:10:43] or it's low enough so they trim that data, but yeah
[11:12:40] I don't think they trim, see amazon's main AS (for US-based stuff anyways):
[11:12:43] http://ihr.iijlab.net/ihr/16509/asn/?date=2018-05-24&last=7
[11:12:58] there's some "micro" level entries at the bottom of amazon's list
[11:13:34] even amazon has its top dependency at ~0.08 out of all of those
[11:13:49] our top dependency (Telia) is ~0.22 for us
[11:14:42] haha, the µ are fun
[11:16:50] FB is another interesting case, where you know they have $$ and try to connect everywhere: http://ihr.iijlab.net/ihr/32934/asn/?date=2018-05-24&last=7
[11:17:21] shorter list of dependencies than amazon, top entry is at ~0.025, and the numbers don't add up anywhere near 1.0, meaning they have tons of direct peering too.
[11:19:18] (so they're maybe approaching google-like status, but still have a ways to go)
[11:21:30] I wonder if having this metric available will change pressures on various parties re: peering.
[11:21:51] e.g. some manager somewhere will be like "we need to decrease this dependency number by doing more open/settlement-free peering!"
[11:21:54] I doubt it though :)
[11:25:45] it would be nice if that graph also provided the sum total of the dependency numbers
[11:26:25] for our AS it's ~0.54 last week, which I guess means that at least from the ripe probe network's POV, ~46% of their nodes can reach us without transit.
[11:26:59] that's decent
[11:27:02] (seems higher than I'd expect for the whole globe, but then "networks with ripe nodes" probably skews that view in non-random ways)
[11:27:10] indeed
[11:29:35] facebook's total is around 0.09, so 91% of probes find them without transit
[11:30:25] (but also, perhaps fb and google donate hosting of a bunch of ripe probes themselves, which would really skew these metrics!)
[11:31:59] https://blog.cloudflare.com/path-mtu-discovery-in-practice/ to work around their routers not doing ECMP properly for ICMP packet too big, they wrote a tool that broadcasts those ICMPs as soon as any of their servers gets one
[11:36:27] yeah I've read it before
[11:36:34] but we're not doing ECMP yet, so we don't have to worry :)
[11:37:06] (our basic LVS/ipvs stuff does handle icmp routing appropriately. at least it's supposed to, modulo bugs)
[11:38:22] but really, the "right" answer is that their network hardware should be doing like LVS does: hashing ICMP based on the connection info on the inside, not the outer bits like it does for TCP. Then it would be a non-issue.
[11:38:39] it's kind of sad that router vendors haven't caught on to offering that as part of their solution, so these workarounds have to happen.
[11:41:37] yep
[11:42:03] speaking of filtering on the inside, I can't figure out the tcpdump filter to do that
[11:42:36] and wireshark is "ip.addr == " so it filters on everything IP related, not only the inner packet
[11:44:37] I'm also trying to find docs on how linux implemented PLPMTUD (tcp_mtu_probing), but the only option seems to be the RFC
[11:49:34] XioNoX: or read the kernel source code :)
[11:49:48] pick your poison
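A sketch for the tcpdump question above ("speaking of filtering on the inside"): libpcap filters can index arbitrary bytes of the ICMP message with icmp[offset:size], so the quoted original packet inside an ICMP error can be matched by hand. The interface name and address below are made up for illustration, and the offsets assume the quoted IPv4 header carries no options:

  # ICMP error layout: 8-byte ICMP header, then the original IP header,
  # so the quoted source address sits at icmp[20:4] and the quoted destination at icmp[24:4]
  tcpdump -ni eth0 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4 and icmp[24:4] = 0xc633644a'
  # 0xc633644a is 198.51.100.74 (a documentation address) written as a 32-bit hex constant;
  # with a 20-byte quoted IP header, the quoted TCP ports are at icmp[28:2] (src) and icmp[30:2] (dst)

(Wireshark's "ip.addr ==" matches both the outer and the quoted header because the embedded packet is dissected with the same ip.* fields, which is exactly the behaviour complained about above; the byte-offset approach is the workaround for tcpdump.)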
[11:51:57] I can kinda stream-of-consciousness braindump as I read it
[11:52:09] in tcp_timer.c, there's:
[11:52:10] static int tcp_write_timeout(struct sock *sk)
[11:52:15] bleh bad paste above
[11:52:24] /* A write timeout has occurred. Process the after effects.
[11:52:24]  */
[11:52:24] static int tcp_write_timeout(struct sock *sk)
[11:52:52] within that, if the state is not SYN_SENT or SYN_RECV (we're not in-handshake), and a write timeout happens:
[11:53:02] if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1, 0)) { /* Black hole detection */ tcp_mtu_probing(icsk, sk);
[11:53:28] then in tcp_mtu_probing(), if the sysctl isn't set nothing happens (rely on ICMP)
[11:54:07] first time through, if the plpmtud stuff hasn't been initialized yet, it just sets a timestamp so it can do something smarter on the next timeout. if we already have such a timestamp recorded:
[11:54:26] } else {
[11:54:26]   mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
[11:54:29]   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
[11:54:32]   mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
[11:54:34]   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
[11:54:37] }
[11:54:39] tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
[11:54:56] so that's it finding the low/high range to probe with, based on the base_mss sysctl vs up to the actual data size that we wanted to send.
[11:55:47] then separately over in tcp_output.c there's other related bits:
[11:56:00] critical commentary there:
[11:56:01] inet_csk(sk)->icsk_pmtu_cookie is last pmtu, seen by this function.
[11:56:33] which is just above tcp_sync_mss(), whose core code does:
[11:56:35] if (icsk->icsk_mtup.search_high > pmtu)
[11:56:35]   icsk->icsk_mtup.search_high = pmtu;
[11:56:35] mss_now = tcp_mtu_to_mss(sk, pmtu);
[11:56:35] mss_now = tcp_bound_to_half_wnd(tp, mss_now);
[11:56:38] /* And store cached results */
[11:56:40] icsk->icsk_pmtu_cookie = pmtu;
[11:56:42] if (icsk->icsk_mtup.enabled)
[11:56:45]   mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
[11:56:48] tp->mss_cache = mss_now;
[11:56:50] return mss_now;
[11:57:14] then there's tcp_current_mss(), which relatedly does:
[11:57:16] if (dst) {
[11:57:16]   u32 mtu = dst_mtu(dst);
[11:57:16]   if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
[11:57:17]     mss_now = tcp_sync_mss(sk, mtu);
[11:57:17] }
[11:57:52] also later in the same file, TCP Fast Open uses the same stuff to decide how much data can go with the SYN for SYN+data:
[11:57:55] space = __tcp_mtu_to_mss(sk, inet_csk(sk)->icsk_pmtu_cookie) -
[11:57:58]         MAX_TCP_OPTION_SPACE;
[11:58:05] and that's about it
[11:58:25] "and that's about it"
[11:58:58] I can kind of guess the overall process, but I'll have to read that several times
[11:59:45] yeah without trawling all of the related callers and callees and data structures, it's still hard to grok it all just from the pastes above.
[11:59:55] but you can glimpse the overall structure and methodology loosely
[12:00:07] which sounds a lot like the RFC :)
[12:00:46] application has data buffered to send which could generate a packet up to X size with best-known current MTU guess
[12:01:12] it sends the X-sized packet, waits for ack, then hits a timeout, and suspects MTU issues
[12:01:52] if the RFC4821 stuff is enabled, it will readjust the effective mtu down to tcp_base_mss and use that as the lower bound vs X as the upper bound.
[12:03:10] and try sending the smaller packet, and if that works, on the next send it will close the size gap upwards by half (e.g. if base_mss is 1024 and the interface mtu->mss is 1460 (for 1500 mtu), 1242 would be the next attempt).
[12:03:37] I see
[12:03:38] basically it will keep moving around by halves until it homes in on the exact effective MTU that lets packets through.
[12:04:23] so if you had a pattern where 1300 was the effective mss limit, and base_mss is 1024 (our value), and interface mtu was 1500 implying mss = 1460....
[12:05:13] thanks
[12:05:15] assuming there's always tons of outbound data buffered to fill up write packets: it would try sending mss=1460, timeout waiting on the ack, then send a 1024 which succeeds, then a 1242 which also succeeds, then a 1351 which will fail again
[12:05:29] then drop back again, etc... as it narrows in on the mss=1300 value.
[12:06:31] of course if plain old ICMP PMTUD actually works (we get the PTB message) things work much more quickly and effectively.
[12:06:53] this is just the fallback if ICMP is blackholed (or not possible because the MTU mismatch is on a shared L2 network heh)
[12:07:20] yep, for icmp blackhole
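As a toy illustration of the search just walked through: the sketch below bisects between the known-good tcp_base_mss floor and the interface-derived ceiling, using "did the probe get acked before the timer fired?" as the oracle. It is a simplified model with made-up helper names, not the kernel's actual tcp_mtu_probe()/timer machinery; the 1024/1460/1300 numbers are the ones from the example above.

  #include <stdio.h>

  /* hypothetical path that silently drops anything above mss 1300 (the example above) */
  static const int path_mss_limit = 1300;

  /* stand-in for "send a probe of this size and see whether it gets acked" */
  static int probe_acked(int mss)
  {
      return mss <= path_mss_limit;
  }

  int main(void)
  {
      int low = 1024;   /* tcp_base_mss: assumed deliverable */
      int high = 1460;  /* interface MTU 1500 minus 40 bytes of IP+TCP headers */

      while (high - low > 1) {
          int probe = (low + high) / 2;   /* 1242, then 1351, ... */
          if (probe_acked(probe))
              low = probe;                /* delivered: raise the floor */
          else
              high = probe;               /* timed out: lower the ceiling */
          printf("probe mss=%d -> %s, range now %d..%d\n",
                 probe, low == probe ? "acked" : "lost", low, high);
      }
      printf("converged on effective mss ~%d\n", low);
      return 0;
  }

The real machinery is lazier and more conservative (probes ride along the normal send/retransmit path and the kernel stops once the search window is narrow enough rather than exact), but the shape of the search is the same.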
[12:08:47] bblack: unrelated, ulsfo caches pull from codfw, right? aka. if I had to add a GRE tunnel from ulsfo to somewhere else (to prevent the outage we had earlier), codfw is the best option?
[12:08:57] XioNoX: yes
[12:09:09] cool
[12:20:39] draw.io/google drive integration hadn't been working for me for a very long time. Changing my cookies setting to "Always accept 3rd party cookies" solved the issue
[12:20:52] well, worked around more than solved
[12:21:04] I guess I will start diagramming soon
[12:35:54] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231327 (10ayounsi) 05Open>03Resolved a:03ayounsi Our San Francisco datacenter is linked to our infrastructure by 2 links. 1 link had a planned maintenance, the other had an outage at the wrong time. as...
[13:59:36] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231523 (10RazeSoldier) >>! In T195563#4231327, @ayounsi wrote: > as soon as we noticed the issue, we disabled that datacenter, redirecting the users to a functional datacenter. In this case, is the disabling...
[14:01:10] 10Traffic, 10Operations: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231538 (10ayounsi) Manual. It would be automatic in an ideal world, but not enough resources to work on that.
[17:35:15] 10Traffic, 10Operations, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4232230 (10Krenair) >>! In T194962#4230570, @Krenair wrote: > Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890...
[18:30:03] 10Traffic, 10Operations, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4232346 (10Krenair) I'm going to find out what's going on with puppet DB in T187736, in the mean time my patch for it looked like this (completely untested and...
[19:02:28] from tlsproxy::localssl docs
[19:02:35] # [*certs*]
[19:02:40] # Array of certs, normally just one. If more than one, special patched nginx
[19:02:40] # support is required. This is intended to support duplicate keys with
[19:02:40] # differing crypto (e.g. ECDSA + RSA).
[19:02:46] what are the details of the special patched nginx?
[19:05:39] https://scotthelme.co.uk/hybrid-rsa-and-ecdsa-certificates-with-nginx/ says support for this went into nginx 1.11.0 so maybe the comment is outdated?
[19:06:14] hm: https://gerrit.wikimedia.org/r/#/c/320704/
[19:08:23] looks like we have 1.13.6-2+wmf1~jessie1 on deployment-cache-text04
[19:47:24] they still don't support ssl_stapling_file w/ multiple independent staples for separate ECDSA+RSA in upstream nginx
[19:47:31] that's what we're still patching at this point
[19:48:24] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/nginx/+/wmf-1.13/debian/patches/0600-stapling-multi-file.patch
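For reference on the certs question above, a minimal sketch of what stock nginx (1.11.0 and later, so the 1.13.6 build mentioned above qualifies) accepts for serving RSA and ECDSA certificates on the same name; the paths and server_name are hypothetical:

  server {
      listen 443 ssl;
      server_name example.wikimedia.org;

      # nginx picks whichever certificate matches the key type negotiated with the client
      ssl_certificate     /etc/ssl/localcerts/example.rsa.chained.crt;
      ssl_certificate_key /etc/ssl/private/example.rsa.key;
      ssl_certificate     /etc/ssl/localcerts/example.ecdsa.chained.crt;
      ssl_certificate_key /etc/ssl/private/example.ecdsa.key;

      # upstream ssl_stapling_file is a single directive taking a single file, so a
      # pre-fetched OCSP staple can only cover one of the two certs; per-certificate
      # staple files are what the 0600-stapling-multi-file.patch linked above adds
      ssl_stapling on;
  }

So the dual-cert part of the tlsproxy::localssl comment is handled by stock nginx these days; per the exchange above, the "special patched nginx" is only still needed for the stapling-file side.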
[20:17:00] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#4232707 (10JMinor)
[22:39:48] 10Domains, 10Traffic, 10Operations: HTTP 500 on invalid domain - https://phabricator.wikimedia.org/T195568#4231062 (10Dzahn) > a domain that doesn't even exist. It exists. The issue is that "stats" exists in DNS in the wikipedia.org zone as an alias for stats.wikimedia.org and stats.wikimedia.org is point...
[22:40:36] 10Domains, 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233112 (10Dzahn)
[22:45:11] 10Domains, 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233129 (10Dzahn) option a) delete stats record from the wikipedia.org zone option b) add stats.wikipedia.org to hieradata/role/common/cach...