[09:55:18] Traffic, Analytics, Operations, RESTBase, and 2 others: Verify that hit/miss stats in WebRequest are correct - https://phabricator.wikimedia.org/T215987 (ema) Is there anything to do here? :-)
[10:03:33] phew, just went through the untriaged issues on the traffic workboard. Most of them were "caching" things, we're now at ~100 open caching tickets
[10:04:37] it looks like you need a fork to help you with those tasks O:)
[10:09:07] it does, doesn't it :)
[10:11:15] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ema) p:Triage→Normal
[10:11:29] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ema) Anything else to be done here?
[10:12:04] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ema) p:Triage→Normal
[10:12:27] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ema) Can this be closed?
[10:16:32] Traffic, Operations, Parsoid, RESTBase, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (ema) p:Triage→Normal >>! In T215956#4977137, @mobrovac wrote: > it seems like we will need to add a rule to Varnish to pass on these requests Would it...
[10:19:41] Traffic, Operations, Wikidata, Wikidata-Campsite, User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (ema)
[10:21:00] Traffic, Operations, Wikidata, Wikidata-Campsite, User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (ema) I haven't done any investigation yet, but it sounds similar to T215389.
[11:08:36] Krenair: I'd love to hear your feedback on https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/494506
[11:08:46] I'm aware pylint is still crying about code complexity
[11:09:09] any feedback to clean that up is welcome
[11:09:52] vgutierrez, that is interesting, I'll try to find some time
[11:10:15] thanks, I've asked the same to volans, being our in-house python guru
[11:10:26] Krenair: BTW, is there any real need for the /certs API?
[11:10:28] * volans hides
[11:10:42] Krenair: or is it just there for debugging in the early development phases?
[11:11:19] vgutierrez, I have a feeling bblack may have suggested we have a puppet way and a more plain way?
[11:11:26] vgutierrez, but I'm not sure it's actually used
[11:11:32] not in production
[11:12:14] BTW, that also fixes a hideous race condition that we have right now
[11:13:15] if for some reason puppet tries to fetch a certificate while the certificate files are being copied on the acme-chief server from the news_cert directory to the live one, it could end up with the new key and the old cert
[11:14:21] yeah I think in one of the early versions I tried to be careful to minimise that period
[11:14:24] with the whole directory approach this is fixed, and the clients will just point to /etc/acmecerts/live/rsa-2048.{crt,key} and the API will switch the symlinks when needed
[11:15:26] migration between live and new is still the responsibility of acme-chief-backend
[11:15:36] this just makes the deployment easier
[11:15:42] and the client config as well
[13:02:22] ema: re the VCL cleanup ticket, we're kinda behind on that.
There's some subtleties there, and we can mess up the existing redirects, and we still want our XFF/Carrier-tagging stuff that was zero-involved to live on.
[13:02:42] and the manual decom/cleanup of the zerowiki database fetcher and its cronjob and icinga checks, etc
[13:02:56] but might be a good one to jump on early-ish
[13:03:11] I missed a keyword in there somewhere
[13:03:18] "the VCL + Zero cleanup ticket"
[13:13:21] bblack: yup! I wanted to merge the ats-be thing first though https://gerrit.wikimedia.org/r/#/q/topic:T213263+(status:open+OR+status:merged)
[13:13:22] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[13:13:38] bblack: sanity check review welcome! :)
[13:19:56] ema: basic sanity confirmed. I've lost track, but I guess in cache_upload now all the services have TLS available for ATS to connect to?
[13:22:38] bblack: swift and maps do, which I think should cover all of cache_upload unless I'm forgetting something
[13:26:23] yeah
[13:29:09] looking back over that stuff: in the varnish world we split on thumb-vs-original paths for two different swift backends, with originals going to only one DC and thumbs going a/a to both.
[13:30:03] which would be like swift-rw / swift-ro split
[13:34:03] anyways, we can get into all that and double-checking backend VCL behavior vs current lua later before turning on some live ATS traffic
[13:34:13] IIRC the thumb/original split is there to switch traffic more gradually from one DC to the other
[13:36:04] godog: maybe knows for sure? but current config has thumbnails as active/active (like swift-ro) and originals as active/passive (like swift-rw)
[13:39:58] ah yes. If it's important to keep thumb a/a we can do that with some lua (not with a standard remap rule because uri-based remap rules are terrible stuff)
[13:41:26] well in the long run our push is for everything to be a/a and treat a/p like the exception, but in practice most traffic is still a/p :)
[13:42:03] it's not functionally-important for traffic working at all though, just less-optimal.
[13:59:47] bblack: +1 to merge the ats-be changes? :)
[14:01:16] ema: I think so. At least it looks right to human review :)
[14:01:46] I suspect you've already looked at other layers like puppet-compiler
[14:02:01] yes pcc says functional noop on cache nodes
[14:02:13] :)
[14:02:45] something always feels a little odd about how happy we all are (me included) when we get no-op results on things we worked hard on.
[14:02:49] "Yay, we did nothing!"
[14:03:04] uhuh! nothing changed!
[14:03:47] well there's some distinctions to be drawn there. We changed things in the meta, but the results are proof we changed nothing for the production system in practice, in operational terms.
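A minimal sketch of the symlink switch described in the acme-chief discussion above: clients only ever read /etc/acmecerts/live/rsa-2048.{crt,key}, and promoting a new certificate version is a single atomic rename of the symlink, which is what closes the old copy-time race between key and cert. The directory layout beyond the paths quoted above and the function name are assumptions for illustration, not acme-chief's actual code.

```python
import os

def promote_to_live(cert_dir: str, new_version: str) -> None:
    """Atomically repoint <cert_dir>/live at <cert_dir>/<new_version>.

    Because the final step is os.replace() (an atomic rename) of a
    symlink, a reader opening live/rsa-2048.crt and live/rsa-2048.key
    sees either the old pair or the new pair, never a mix.
    """
    live = os.path.join(cert_dir, "live")
    tmp = os.path.join(cert_dir, ".live.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_version, tmp)   # build the new link under a temp name
    os.replace(tmp, live)          # then swap it in atomically

# Hypothetical usage: promote a freshly issued "new" directory to live.
# promote_to_live("/etc/acmecerts", "new")
```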
[14:14:28] ok conftool-sync output looks good https://phabricator.wikimedia.org/P8166
[14:23:36] and the data is indeed there (`confctl select service=ats-be get`)
[14:27:02] it feels so tangible now :)
[14:27:40] * ema is excited
[14:35:17] ah, we need a pool/depool script :)
[14:37:00] well no we have it already, I thought the service name would have been hardcoded but luckily it's not
[14:37:17] yeah we did thumbs/originals split in two different dns names when we failed over to codfw the first time, and moving thumbs only first then originals
[14:37:54] you can pass the service name to pool/depool
[14:38:58] volans: right, otherwise it just pools/depools all services for the given hostname which is enough for ats hosts (they've got one service only anyways)
[14:39:28] volans: long time no see btw, hi :)
[14:40:45] lol, thx, I said welcome back the other day ;)
[14:52:31] nice.. it smells like ATS here :D
[15:11:33] Traffic, MobileFrontend, Operations, TechCom, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle)
[15:14:12] Traffic, MobileFrontend, Operations, TechCom, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle) >>! In T214998#4929700, @Jdlrobson wrote: > [..] This feels like an RFC to me. [..]...
[15:19:04] got a good URL to link to in case one of the ATS Icinga checks triggers?:)
[15:19:13] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494745
[15:19:30] (it can always be amended, i just need any URL to be able to make it required for new code)
[15:19:57] this would be re: the discussion at offsite about Icinga and runbooks
[15:21:10] mutante: https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server has some general debugging info, not specific to the icinga checks though
[15:23:04] ema: ACK! that's what i used so far
[15:23:20] could be changed later to some #anchor within that
[15:23:27] yup!
[15:23:27] or we can make an empty one
[15:23:30] ok:)
[15:23:47] i am just trying to do this for ALL existing checks, not just traffic
[15:23:59] so i am happy with best effort and any URL
[15:24:08] too many of them :)
[15:24:23] nice
[15:25:05] you don't necessarily want to be searching wikitech while icinga screams
[15:25:53] there was a great talk at srecon duesseldorf about ER surgeons and their runbooks
[15:27:06] tl;dr in "normal circumstances", whatever that means if your job is ER surgeon, they do not think about anything at all and just follow the procedure
[15:29:13] whereas in the SRE metaphor we often first have to understand whether the body in question is indeed a human, could be a dog or another animal you had never heard of before
[15:30:39] ema: lol, that's a nice way to put it
[15:30:58] yea, so the original plan was "make it a required parameter, so whenever people add new Icinga checks they have to add one"
[15:31:32] and it turned out not feasible to have jenkins check for that
[15:31:48] so now it's about actually adding one to each existing check
[15:31:59] well we can make puppet itself require the parameter (so puppetfail / compiler-fail if not specified), right?
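For the conftool check quoted above (`confctl select service=ats-be get`), something along these lines could report the pooled state of every ats-be object before any traffic is flipped. The per-line JSON output format and field names are assumptions about confctl's output, and the script itself is illustrative rather than anything that exists in the repo.

```python
import json
import subprocess

def ats_be_state() -> dict:
    """Map each ats-be conftool object to its pooled value.

    Runs the same query as above; assumes confctl prints one JSON
    object per node, with a "tags" key alongside the node entry.
    """
    out = subprocess.run(
        ["confctl", "select", "service=ats-be", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    state = {}
    for line in out.splitlines():
        try:
            obj = json.loads(line)
        except ValueError:
            continue  # ignore any non-JSON output
        for name, attrs in obj.items():
            if name != "tags" and isinstance(attrs, dict):
                state[name] = attrs.get("pooled")
    return state

if __name__ == "__main__":
    for node, pooled in sorted(ats_be_state().items()):
        print(f"{node}: pooled={pooled}")
```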
[15:32:09] eventually once they're all added
[15:32:13] and i am using either generic Wikitech pages or also Phab workboard links in some cases
[15:32:24] that at least helps to find the right people via the "members" link
[15:32:31] and tells you where to make a ticket
[15:32:51] bblack: yes, i will do that ..once they are added to all existing ones
[15:33:02] and at that point i will write a list mail
[15:47:04] pooling the ats-be services in etcd. They won't get any traffic till we flip cache::ats_backends on an eqiad/codfw varnish host
[15:55:17] which is what https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494761 is for :)
[15:59:15] to be merged (maybe tomorrow?) with cp2002 depooled and the fire brigade ready
[16:04:53] yeah maybe
[16:05:27] we should get some secondary review to run through the ATS lua stuff vs VCL stuff and validate that we expect a sane transition on that level.
[16:05:54] (for the upload-backend VCL and the wikimedia-generic backend/common VCL)
[16:06:19] obviously lots of it is varnish-specific, but then the cases I really worry about are the ones where it's kinda varnish-specific, but the FEs will expect it for analytics or something to work right.
[16:07:33] I assume you think it's all good because you worked on it :)
[16:07:44] but maybe I should stare at it and ask stupid questions a bit
[16:08:29] yeah I know how stupid your questions usually are :P
[16:09:04] lol
[16:09:44] definitely, it would be great if you could do that. At a very basic functional level upload should be fine (I've been using pinkunicorn through ats for upload.w.o and maps myself for a while)
[16:10:15] ok
[16:15:07] heh
[16:15:20] I've just noticed that on varnish-be we don't return x-cache-status?
[16:16:11] frontend only, nice
[16:16:46] there you go, found a problem already :)
[16:26:30] what a great meta-tactic
[16:26:37] I don't actually have to read anything, just threaten to :)
[16:27:40] haha yes
[16:50:58] netops, Cognate, Growth-Team, Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (Tgr) >>! In T187960#4997875, @Marostegui wrote: > #reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790...
[16:53:07] there's so much less line noise in the Lua version of everything heh
[16:53:14] and so much less meta-varnish-workaround crap
[16:53:25] netops, Cognate, Growth-Team, Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (Marostegui)
[16:54:07] refreshing my brain a bit on transition plan for cache_upload, trying to recap what I think I thought:
[16:54:29] 1) Try percentages of traffic in eqiad/codfw against these separate ats-be boxes, up to some reasonable level?
[16:56:33] 2) Step through real cache_upload boxes in eqiad/codfw, reinstalling them to a new profile with nginx+varnish-fe+ats-be. Depool from nginx+varnish-fe+varnish-be, reinstall, repool as nginx+varnish-fe+ats-be ?
[16:57:40] I'm not sure if we even try to get to 100% in (1) or just get to some large-ish number like ~25-50%, then ramp it up later as more of the (2) reinstalls move along.
[16:58:37] 3) For the edge sites: skip (1), start with (2) and move frontends to using ats-be's as we go percentage-wise to keep load balanced reasonably between the be clusters.
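One way to picture step (1) above: a deterministic per-request split, where hashing the request (rather than rolling a random number per request) sends a stable share of objects to the ats-be pool and the rest to varnish-be. This is only a conceptual model of "percentages of traffic"; the real switch lives in VCL/puppet, and the hashing key and percentage below are invented for the example.

```python
import hashlib

def backend_tier(request_key: str, ats_share_pct: int) -> str:
    """Send roughly ats_share_pct% of requests to ats-be, the rest to
    varnish-be. Hashing keeps the choice stable per object, so each
    object builds up cache state in only one of the two tiers."""
    bucket = int.from_bytes(
        hashlib.sha1(request_key.encode()).digest()[:2], "big"
    ) % 100
    return "ats-be" if bucket < ats_share_pct else "varnish-be"

# e.g. a ~25% trial split for one (made-up) upload object:
print(backend_tier("upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg", 25))
```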
[16:59:57] and then I assume for text it's a lot easier (once we get to where we're happy with TLS + VCL concerns), since the be pool is shared with upload, we have lots of options about moving traffic% to them first before we even start reinstalls.
[17:01:34] anyways, now I think that the above can't possibly be right. We can't finish killing varnish-be in core DCs before we transition edge DCs, or they have nowhere to connect to :)
[17:02:01] so maybe it's (1), then convert edge DCs to completion, then (2)?
[17:04:11] agreed, yes
[17:04:32] (didn't think of the dependency on core DCs actually)
[17:05:17] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH)
[17:05:23] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) Open→Stalled p:Normal→Low I'm keeping them open for a month after the memory swap for followup.
[17:05:35] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH) Open→Resolved a:RobH
[17:06:40] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH)
[17:06:43] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) Open→Stalled p:Normal→Low I'm keeping this open for a month after the swap. If no further errors are logged (need to manually check the SEL) by March 25th, this can be res...
[19:03:45] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) Here is the diff for codfw: `lang=diff [edit interfaces ae1 unit 2017 family inet] + filter { + output private-out4; + } [edit interfaces ae...
[19:33:45] Traffic, Operations, decommission, ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (RobH) a:RobH
[19:34:16] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) After applying it only to cr1-codfw, I noticed an increase of ICMP errors to eqiad's LVS, see https://grafana.wikimedia.org/d/000000513/ping-offload?orgId=1&from=...
[19:35:44] XioNoX: did I read that right? cr1-codfw change == *eqiad* icmp error increase at lvs?
[19:36:40] bblack: eh yeah, that's confusing, I don't understand it yet
[19:37:09] Traffic, Operations, decommission, ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (RobH) lvs101[012] all exist on asw-c-eqiad (but have ports also reserved on asw2-c-eqiad): ` robh@asw-c-eqiad> show interfaces descriptions | grep lvs1010 xe-8/0/23 up...
[19:38:07] I double-checked all the IPs, etc. and nothing touched eqiad. So maybe an issue in the monitoring. But also the codfw ping offload machine didn't see an increase of ICMP
[19:39:14] I do see an increase to codfw ping offload, it just doesn't match the dropoff on the LVSes? maybe also a monitoring issue
[19:39:30] https://grafana.wikimedia.org/d/000000513/ping-offload?orgId=1&from=now-1h&refresh=30s&to=now
[19:39:39] bottom row, leftmost two graphs?
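As a conceptual model of the offload-ping filter being debugged here (not Juniper syntax): echo requests from outside our own address space, destined to an LVS service address, get a new next hop pointing at the dedicated ping responder, and everything else is left to normal forwarding. All prefixes and the offload address below are placeholders, and the classification is a simplified reading of the filter's intent.

```python
from ipaddress import ip_address, ip_network

# Placeholders -- the real values live in the router's prefix lists.
LVS_SERVICE_NETS = [ip_network("198.51.100.0/28")]
TRUSTED_NETS = [ip_network("203.0.113.0/24")]
PING_OFFLOAD = ip_address("198.51.100.200")

def next_hop_for_echo_request(src: str, dst: str):
    """Return the ping-offload host if this echo request should be
    redirected, or None to leave normal forwarding alone."""
    s, d = ip_address(src), ip_address(dst)
    if any(s in net for net in TRUSTED_NETS):
        return None  # internal/trusted pings still reach the LVS itself
    if any(d in net for net in LVS_SERVICE_NETS):
        return PING_OFFLOAD
    return None

print(next_hop_for_echo_request("192.0.2.10", "198.51.100.1"))
```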
[19:41:19] the increase is from right after I rolled back the change (~19:26)
[19:42:04] so that's also unexpected
[19:48:11] lvs1002 is the one that had the spike of icmp_errors, we could do a packet capture there to see what those errors are
[19:48:53] well
[19:49:06] we should maybe first figure out why there was no shift of echo traffic
[19:49:15] (in codfw)
[19:49:36] Traffic, Operations, decommission, ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (RobH) Since this is lvs, they are on every switch stack =P Row A: lvs101[012] don't show on either asw-a-eqiad or asw2-a-eqiad. Row B: doesnt show on asw2-b-eqiad, asw-b-eqiad i...
[19:49:41] I guess either is a good hint what's going wrong, but, I donno
[19:55:26] XioNoX: so the diff in the ticket, you applied the main firewall chunk from the bottom, plus only the unit 2017 bit at the top for single vlan?
[19:55:35] correct
[19:55:57] on cr1-codfw only
[19:56:36] the `filter output private-out4` is what applies the rule to the interface
[19:57:25] I can also test the rule with an unused IP, to check why the redirect didn't happen without risking the main VIP
[19:57:46] "then accept" skips all further terms?
[19:59:00] correct
[20:01:39] I'm still staring at juniper docs trying to make sense of their horrible syntax, sorry!
[20:01:52] it seems like it could at least be simplified maybe
[20:03:03] e.g. put: source-prefix-list { wikimedia4 except; trusted-space4 except; 0/0; } in offload-ping4 and skip the whole no-offload-ping4?
[20:03:28] but the negation rules are tricky, I don't know if I even really get it yet
[20:04:44] XioNoX: I like the idea of testing with some other unused IP, but we don't have any unused IPs that are routed with LVS and all that jazz I think, unless the default blocks make them work?
[20:05:01] we could still at least test the offloading part, but not necessarily the impact on the LVS boxes themselves
[20:05:34] yeah, I did it like that to make it more explicit, the except makes it more confusing to me
[20:05:53] ok
[20:06:05] so maybe try with .225 and we can at least look at where those pings are going?
[20:07:12] yeah the last resort static routes are for the whole ranges, (eg. 208.80.153.240/28)
[20:07:28] yeah even though .225 isn't configured in pybal for bgp advert, the default lvs routing blocks route it
[20:07:34] so it still points at lvs2001
[20:07:44] correct
[20:07:49] (and it points there over the 2017 vlan too, so bonus points)
[20:07:59] so it should still be caught by the redirect
[20:08:15] will do that and report back
[20:08:27] and we can also observe whether it's being caught at all, by tcpdumping for .225 traffic on lvs2001 itself too
[20:10:18] indeed
[20:10:36] and if that traffic shows up in eqiad :)
[20:11:54] the bump in eqiad errors was small, I wonder if it was somehow related to rob and lvs10xx decom and just coincidental timing?
[20:12:39] (maybe some icmp errors from those old machines disconnecting and some kind of internal monitor connections getting RSTed, etc)
[20:13:00] I donno, random theory
[20:15:51] It matches the network change times (and the other graphs) too well
[20:16:17] but on the other hand it's common, just not such tall spikes: https://grafana.wikimedia.org/d/000000513/ping-offload?orgId=1&from=now-6h&to=now&panelId=7&fullscreen
[20:36:25] XioNoX: I had a thought, although I don't think it's the problem with this first test...
are we even sure that an out filter on ae1.1017 can send a packet via "next-ip" to an IP on a different vlan? (but it so happens that in the initial 1017 test case, ping2001 is on the same vlan, so there goes that theory for this particular case)
[20:40:51] doesn't matter there, but it's a good question
[21:16:32] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) Redirect test with unused .225 IP `lang=diff [edit interfaces ae1 unit 2017 family inet] + filter { + output private-out4; + } [edit firewa...
[21:46:17] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) Everything has been rolled back for now. I also added a logging term: ` then { count ping-redirected; next-ip 10.192.0.22/32; } ` Wh...
[21:47:46] bblack: updated the task, 1/ I'm surprised by the high amount of hits the redirect rule gets, and 2/ I'll open a ticket with Juniper, as I'm out of ideas
[21:49:09] XioNoX: how long was that turned on (as in, what kind of rate does that counter represent)?
[21:49:48] bblack: about 10min max
[21:50:11] crazy, and that count was for .225 that shouldn't get traffic anyways right?
[21:50:17] correct
[21:51:09] there is a lot of noise on the internet, like people randomly scanning port ranges, but it looks like a lot
[21:51:47] I can also log those packets to see exactly what they are, but I worry about overwhelming the router
[22:53:16] bblack: well, we still don't have Juniper support... so opening a ticket will have to wait
[23:13:17] netops, Operations: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (ayounsi) The suggestion from the [[ https://bird.network.cz/pipermail/bird-users/2019-March/013155.html | Bird mailing list ]] (and doc) is to change the dynamic port range on the server side. From the current: `cat /proc/s...
[23:14:07] Traffic, Core Platform Team, Operations, Performance-Team, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[23:14:19] Traffic, Core Platform Team, Operations, Performance-Team, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[23:15:06] Traffic, Core Platform Team, Operations, Performance-Team, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[23:17:01] Traffic, Core Platform Team, Operations, Performance-Team, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[23:29:14] Traffic, Operations, Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (ayounsi) cp1099 is the last standing host between me and powering off asw-c-eqiad. From this task and the prompt `cp1099 is a Unpuppetised system for testing (test)` it should be fine...
[23:55:15] Traffic, Operations, Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (ayounsi)
[23:55:18] netops, Operations, ops-eqiad: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi)
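Going back to the packet-capture idea floated above (capturing on the LVS or the offload host rather than logging on the router): a short sniff that tallies ICMP types and top sources would answer "what are those packets" without loading the router. Using scapy is an assumption here (tcpdump/tshark would do just as well), the interface name is a placeholder, and the capture needs root.

```python
from collections import Counter
from scapy.all import ICMP, IP, sniff  # requires root / CAP_NET_RAW

def summarize_icmp(iface: str, count: int = 500) -> Counter:
    """Tally (ICMP type, code, source) over a short ICMP-only capture."""
    tally = Counter()

    def note(pkt):
        if IP in pkt and ICMP in pkt:
            tally[(pkt[ICMP].type, pkt[ICMP].code, pkt[IP].src)] += 1

    sniff(iface=iface, filter="icmp", prn=note, count=count, store=False)
    return tally

if __name__ == "__main__":
    for key, n in summarize_icmp("eth0").most_common(10):
        print(key, n)
```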