[05:14:16] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) What IMO we need to know to understand how the service will deal with spikes: * Are requests cached?...
[05:39:05] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Pchelolo) > Does that also go for errors? (Especially when rendering takes too long and gets aborted by th...
[06:02:56] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will dea...
[06:19:02] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) >>! In T213371#4901269, @Pchelolo wrote: > Given the req rate, my gut feeling is that PDFs will take...
[07:24:04] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar)
[07:25:17] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar)
[08:54:24] Traffic, Operations, Wikidata, serviceops, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Addshore) >>! In T99531#4878798, @CRoslof wrote: > Transferring the domain name from WMDE to the Foundation requires that WMDE complete an own...
[09:46:05] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Aklapper) Thanks for the good report! See https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue for the full list. :)
[10:13:12] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar) > See [wikitech] for the full list" Do you mean "Additional troubleshooting" sectional and curl to http? Is this a completely full list or maybe you need something else? The problem...
[10:21:24] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar)
[11:51:34] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (akosiaris) Adding that a, from our `esams` DC, traceroute to this IP seems to stop before `beelive.ru` ` $ traceroute 83.220.238.125 traceroute to 83.220.238.125 (83.220.238.125), 30 hops m...
[12:40:32] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (Aklapper)
[13:12:48] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (mark) I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able...
[13:22:31] netops, Operations, ops-esams: reconfigure esams switch port for new bastion - https://phabricator.wikimedia.org/T186021 (mark) Stalled→Declined This was solved by fixing the original bastion, a while ago.
[13:38:54] netops, Operations: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (mark) Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command. It does unfortunately seem to need a m...
[13:44:16] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (BBlack) It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave them there and just manage our ACLs better with more automation, but it pays som...
[13:55:25] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (mark) >>! In T211254#4902223, @BBlack wrote: > It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave them there and just manage our ACLs bette...
[14:23:16] netops, Operations: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (mark) Yes, we should probably move over to `prefix-limit` to prevent (improving) filters from making `accepted-prefix-limit` ineffective. 1) Is worth checking indeed, I suppose we can do that...
[14:40:51] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (BBlack) >>! In T211254#4902250, @mark wrote: >>>! In T211254#4902223, @BBlack wrote: >> It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave...
[15:17:01] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (mark) >>! In T211254#4902340, @BBlack wrote: >>>! In T211254#4902250, @mark wrote: >>>>! In T211254#4902223, @BBlack wrote: >>> It's the same basic rationale as moving WMCS out of `1...
[15:57:21] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will...
[15:57:25] netops, Operations: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (ayounsi) a:faidon→ayounsi
[16:22:12] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (BBlack) >>! In T211254#4902524, @mark wrote: > It may be //possible// to get more space in various shady ways, but it's not possible by following RIR rules. Well, we can obviously s...
[17:05:29] bblack: hello! https://phabricator.wikimedia.org/T184063 lists a few esams DNS servers as decom (see under OE12) I guess things have changed since?
[17:07:28] Traffic, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Jhernandez) This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today. I think there...
[17:13:05] XioNoX: no, that list doesn't look right at the present time. nescio, maerlant, multatuli, and eeden are still in active use, at least until replacements are purchased and installed
[17:13:23] bblack: I just added a comment and removed them from the decom list
[17:13:26] which I think is probably due, but I'm not sure as to the present status
[17:14:24] right now nescio and maerlant are esams recdns servers, multatuli is esams authdns, and eeden is the "spare" used to rotate with the other 3 if we need to re-provision / test OS upgrades / etc.
[17:14:51] esams in Netbox should now reflect the real life esams
[17:14:59] ok :)
[17:15:57] I can't find a procurement task for those
[17:16:00] I think somewhere is recorded the notion that those 4 should be replaced with fresh hardware as dns300[12] and authdns300[12], but I don't think we've even ordered the hw yet
[17:16:03] so I guess it's very old :)
[17:16:38] I think this was the order we tacked the 3-node esams ganeti order onto, and then stalled for various procedural reasons instead of ordering it last Q
[17:16:51] ok! task? :)
[17:18:16] yeah I'm trying to dig one up, I think there is one but I haven't found it yet. It could be we delayed based on our hardware refresh sheets and then never got to the part of making a purchasing ticket
[17:20:28] ok, probably not realistic for the next site visit, but maybe if one happens around the SRE offsite?
[17:21:14] maybe! I still can't find a ticket, there may just not be one yet (and maybe we should make one this quarter to start purchasing). But we can sync up on that next week while we're all together to remember whatever's going on with it.
[17:24:00] I guess it would be 1x authdns host (authdns3001) based on how we did eqiad/codfw last. I think the thinking is we'll eventually merge up authdns onto the redundant dnsN00x at all sites, once all other related plans eventually fall together.
[17:24:53] (at which point, if we wanted to we could even do a 3-host cluster using the previous singular authdns box in the mix too, at the core sites and esams)
[17:55:07] bblack: re that Beeline email/task, are we sure we're not blocking anything at the caching layer?
[17:56:08] they seem to be able to ping us, so I'd guess it's not routing related (and path looks correct) but https times out
[17:59:45] paranoia guess would be a middle box somewhere trying to intercept tls?
[18:04:14] I thought the traces/pings they were showing were *not* reaching our IP?
[18:05:24] all the blockage/ratelimiting we do at the cache layer returns HTTP-level error codes like 403, 429, etc
[18:05:53] they do, the reverse from us to them doesn't but it's probably their NAT gateway not replying to pings
[18:07:10] in any case, we don't do any kind of network-layer blocking in caches/LVS (no ferm/iptables). We do block some users based on seen headers, and we do ratelimit on IPs, but those all return HTTP error codes and shouldn't time out.
[18:07:43] ok, thx!
[18:07:57] I'm curious to know if http works
[18:08:15] yeah you could try plain unencrypted http with curl or something. It should at least get you the 301.
[18:08:32] bblack: do we have anything http or https in esams that is not the LVS VIPs?
[18:08:59] not really, at least not for public testing/docs
[18:09:20] we could make something
[18:10:17] also thinking of reaching out to the community liaison team to know if they have any volunteers impacted or in that ISP
[18:12:37] yeah it's a good idea
[18:13:07] I forget, I think you already said on the earlier traffic about this subject, but do we have a ripe probe on beeline?
[18:13:37] commrel-support@wikimedia.org ;-) (I tried to understand scrollback, but it's too technical or contextless. I'll wait for the email version ;)
[18:14:43] bblack: there used to be one 3 years ago on that ASN :)
[18:16:59] quiddity: TL;DR is a bunch of users in Russia using just one ISP are reporting issues reaching us now for a few days, and nobody has an easy technical answer, and it could possibly be intentional (but seems unlikely with just 1 ISP)
[18:17:51] beeline is the ISP, and the ISP themselves reached out to us asking if the problem was on our end, too
[18:27:55] Hmm. I haven't heard anything. I'll ask the team-chat if they have. -- However, it might be most efficient to put the concise public details in a phab task, and ask in the #wikimedia-stewards channel if anyone has further info to add (they tend to know about that sort of thing).
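Editor's sketch of the triage reasoning from 18:05-18:08 above: cache-layer blocks and rate limits answer with an HTTP status such as 403 or 429, plain http should at least return the 301, and a timeout therefore points somewhere other than the cache layer. The function name, the labels, and the input shape are hypothetical and chosen only for illustration; nothing here is taken from actual Wikimedia tooling.

```python
from typing import Optional

# Status codes named in the chat as cache-layer blocking / rate-limiting replies.
CACHE_LAYER_CODES = {403, 429}


def classify_probe(status: Optional[int], timed_out: bool) -> str:
    """Label one probe result (hypothetical helper, illustration only).

    status    -- HTTP status code received, or None if no response arrived
    timed_out -- True if the request timed out before any HTTP response
    """
    if timed_out or status is None:
        # No HTTP-level answer at all: per the chat, not what a cache-layer
        # block or rate limit looks like.
        return "no-http-response"
    if status in CACHE_LAYER_CODES:
        return "cache-layer-block"
    if status == 301:
        # The redirect expected when probing over plain unencrypted http.
        return "redirect"
    return "other-http-response"
```

For example, a client that times out on https but gets a 301 over plain http would yield `no-http-response` for the first probe and `redirect` for the second, which is the combination the chat suggests testing for.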
[18:28:15] (they, and the other people who lurk in there :)
[18:31:23] quiddity: https://phabricator.wikimedia.org/T214459
[18:33:30] ok, I'll pass the link along to that channel, and ask them to give any input they have
[18:33:46] thanks!
[18:41:27] XioNoX, please do send an email to the team if you want any further help/suggestions from us :)
[18:41:36] yep, will do, thx!
[18:58:58] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (ayounsi) I think the "poliplastic" track is a red herring, even though those few IPs are assigned to them in whois, it's used by [[ https://bgp.he.net/ip/62.105.150.251...
[19:08:27] quiddity: sent!
[20:03:05] Ok, any traffic SREs about? i need to offline cp4026 for a memory error
[20:03:11] and i'd rather clear it with the team before just doing it
[20:06:55] Traffic, Operations, ops-ulsfo: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (RobH) p:Triage→Normal
[20:17:24] Traffic, Operations, ops-ulsfo: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (RobH) a:RobH→BBlack So, everything I see on wikitech supports that I can offline this single host at any time to do the work on reseating the dimm. However, it is the week before all ha...
[20:29:46] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (Elitre) Ciao @ayounsi! It is possible that CommRel can help you finding someone to diagnose this tomorrow if @Iluvatar doesn't happen to find others to test in the mea...
[21:02:08] Traffic, Operations, ops-ulsfo: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (BBlack) a:BBlack→RobH https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct, it just needs to be depooled (it will auto-depool on shutdown, but a manual depool...
[21:05:35] Traffic, Operations, ops-ulsfo: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (BBlack) See also T178011 for last time. Why didn't the icinga EDAC check catch this?
[21:07:56] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 (Dzahn) T214516 was a case of a memory error but Icinga did not detect it? T214516#4903917
[21:18:32] Traffic, DNS, Operations, fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (Jgreen)
[21:18:50] Traffic, DNS, Operations, fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (Jgreen)
[21:20:17] Traffic, DNS, Operations, fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (Jgreen)
[21:21:00] Traffic, DNS, Operations, fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (Dzahn) This is the kind of ticket where we need that Phabricator calendar feature.. add a date to a task and have it notify or raise priority once the data gets cl...
[22:39:09] Traffic, Cloud-VPS, Operations, Toolforge: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (bd808)
[22:59:42] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 (CDanis) @Dzahn investigation on the mystery of `cp4026` ongoing in T214529
[23:17:01] Traffic, netops, Operations: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (Iluvatar) >>! In T214459#4903299, @ayounsi wrote: > Please try: > - http with the same endpoint: `curl -v http://en.wikipedia.org/wiki/Main_Page` > - https on a differ...