[07:40:09] klausman: o/
[07:40:30] hey folks, please welcome klausman to WMF :) today is his first day on... idk, some team.
[07:40:51] They told me it has to do with computers.
[07:41:06] Also, 'lo everyone.
[07:41:06] klausman: welcome!
[07:41:08] klausman: welcome aboard!
[07:45:20] klausman: welcome! (Luca, Analytics)
[07:45:34] klausman: moin and welcome!
[07:54:13] welcome klausman !
[08:08:56] klausman: welcome o/
[08:26:22] welcome, klausman!
[09:37:04] welcome klausman
[09:45:31] 👋 klausman , welcome!
[12:00:50] 👋
[12:08:11] sobanski: o/ welcome!!
[12:08:14] hey everyone, please welcome sobanski :) He's the new manager of the Data Persistence team
[12:08:16] :)
[12:08:58] give him all your pity, he's going to need it. ;)
[12:09:11] Hi everyone (and Stephen)
[12:09:15] :D
[12:10:10] hello klausman as well :)
[12:11:46] 👋
[12:13:32] <_joe_> hi sobanski :)
[12:16:22] welcome sobanski :)
[12:16:30] hello sobanski, welcome!
[12:22:59] hey sobanski o/
[12:30:04] hi all, I merged a change to remove the old hiera3 backends today, which broke the spec tests on anything that uses the shared spec_helper.rb (which includes, among others, the standard module). I have now fixed this, but if you see any strange CI issues where it's unable to find hiera values it may be related and a rebase will be required. for me the CI error looked like ...
[12:30:10] ... https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/9974/console
[12:50:23] klausman, sobanski: welcome both! 👋
[13:06:28] I am going to fail over the m3 (phabricator) dbproxy, it should be transparent, but if you notice issues, please let me know
[13:36:07] FYI: services switchover will start in about 25 minutes, followed by depooling eqiad -- please plan to hold off on any other production changes for a bit :)
[13:40:38] rzl: at the risk of asking the dumb question - isn't the dc switchover tomorrow?
[13:40:50] we're switching mediawiki tomorrow, yep
[13:40:50] the Mediawiki switchover is tomorrow
[13:41:13] today we're doing the lower-risk stuff, depooling from eqiad most of the microservices that normally run active-active
[13:41:36] and then also depooling its frontend traffic
[13:41:39] ah i see
[13:41:43] and depooling some of the macroservices, too ;)
[13:41:50] thank you for asking though
[13:42:11] I would 100% not have put it beyond me to spend months planning this out and then press the button on the wrong day
[13:42:17] haha
[13:43:36] cdanis: next to mw, everything is a microservice
[13:57:42] (that's a talk title)
[14:00:14] <_joe_> we only have one macroservice (restbase), and a couple of microservices that graduated to actual services
[14:00:25] <_joe_> I think the plan is to graduate them down
[14:30:06] I believe the purged alerts are a false alarm
[14:30:13] moving here from ops to dodge icinga noise
[14:30:19] _joe_ cdanis ema volans mark
[14:30:26] we are no longer producing events to the eqiad kafka topic, but the codfw topic volume has increased
[14:30:26] <_joe_> cdanis: based on what?
[14:30:29] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&from=now-3h&to=now
[14:30:41] <_joe_> cdanis: oh right we switched eventgate-main
[14:30:42] the 'event lag' is merely how long it has been since a purged instance has seen a message for that topic
[14:30:43] <_joe_> yes
[14:30:47] <_joe_> yes sorry
[14:30:48] indeed
[14:30:48] that is expected y?
[14:30:51] <_joe_> yes
[14:30:56] oh, good find
[14:30:56] coo
[14:30:57] <_joe_> but maps isn't
[14:30:58] it's an end-to-end alert that's reasonable to have in the usual case, but it's inaccurate here
[14:31:00] yes, maps is real
[14:31:13] <_joe_> ok so, let's focus on maps
[14:31:13] any volunteer to downtime the purged alert for now, please?
[14:31:20] and then yes we look at maps
[14:31:22] <_joe_> I'll do it rzl
[14:31:25] do we repool karto in eqiad?
[14:31:27] ack, thanks _joe_
[14:31:48] <_joe_> rzl: let's repool momentarily, yes
[14:32:16] the issue on maps seems to be due to CPU saturation: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=now-30m&to=now
[14:33:00] confctl args check please? on the shared terminal
[14:33:14] rzl: lgtm
[14:33:17] +1
[14:33:26] thanks both
[14:34:00] I'm watching https://grafana.wikimedia.org/d/000000030/service-kartotherian?viewPanel=10&orgId=1&refresh=30s&from=now-1h&to=now for recovery
[14:34:02] https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&from=now-6h&to=now&var-site=codfw&var-cluster=maps&var-instance=All&var-datasource=thanos
[14:34:14] something happened around 12:35 as well, some increase in traffic maybe
[14:34:36] wow it sure did
[14:34:54] but yes, with maps running only in codfw it's very cpu-starved
[14:35:29] huh, at 12:35 the NICs actually saturated some
[14:36:10] <_joe_> ok I have a proposal
[14:36:27] <_joe_> we've verified we can work with eventgate-main failed over to codfw
[14:36:39] <_joe_> but there is no good reason not to leave it active in both DCs
[14:37:03] <_joe_> it's also true that when we switch mw over, we'll get this same alert
[14:37:15] <_joe_> as we will be producing events in codfw only
[14:38:39] kartotherian latency looks like it's recovering
[14:39:52] _joe_: yeah, let's probably disable the alert before switching tomorrow, if nothing else
[14:40:14] in the meantime though, repooling eventgate-main in eqiad sounds fine to me, I don't have strong feelings either way
[14:40:55] <_joe_> meh, let's not
[14:41:06] let's not what?
[14:43:00] hello sobanski, missed your arrival earlier!
[14:43:43] cache_upload looking good again: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=All&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=now-1h&to=now
[14:44:26] <_joe_> mark: repool eventgate-main
[14:46:08] i'm also not sure why we'd repool it
[14:47:01] WRT the purged alert, I think we should change the alert so only one of the topics needs to be 'working', not both of them
[14:47:15] in that case we wouldn't have to downtime it, and it wouldn't matter if eventgate was depooled somewhere or not
[14:47:33] that sounds like the correct fix, yeah
[14:47:58] yes
[14:48:25] let's talk in a bit about whether we'll have that done in time for tomorrow
[14:48:27] I think it should be enough to wrap the existing prom query in min()
[14:48:31] for now, though:
[14:48:46] kartotherian is repooled in eqiad and we'll keep it there, we've learned all we need to know about that :)
[14:48:57] it sounds like we're deciding not to repool eventgate-main, is that correct?
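The CPU-saturation diagnosis on the maps hosts above is the kind of thing that can be cross-checked straight from node-exporter data rather than only from the maps dashboard. A minimal PromQL sketch, assuming the standard node_cpu_seconds_total metric and a hypothetical "cluster" label (the exact label scheme on these Prometheus instances isn't shown in the log):

    # Approximate per-host CPU utilisation for the maps cluster in codfw.
    # node_cpu_seconds_total is standard node-exporter; "cluster" is an
    # assumed label for grouping the maps hosts.
    1 - avg by (instance) (
          rate(node_cpu_seconds_total{cluster="maps", mode="idle"}[5m])
        )

Values approaching 1 across all hosts would match the "very cpu-starved" observation above.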
[14:50:50] correct, at least from the point of view of Purged, purges are coming in nicely so there's no functional reason to repool IMHO
[14:52:18] okay, let's roll with that
[14:52:20] <_joe_> cdanis: the correct thing to do is to check if eventgate is pooled in a dc, then alert on the corresponding metric
[14:52:35] that's too hard
[14:52:37] I'm not aware of any other issues we're tracking from the services switch -- anybody?
[14:52:39] <_joe_> because otherwise we can miss one dc not sending the alerts
[14:53:01] <_joe_> cdanis: I can cook up something tomorrow morning :)
[14:53:05] I don't believe we have pooled-ness of services expressed as Prometheus metrics
[14:53:06] <_joe_> it's not that hard
[14:53:32] <_joe_> cdanis: I don't think so either, but we can just run a wrapper around check_prometheus
[14:54:41] https://gerrit.wikimedia.org/r/623398 addresses it for now, tested by hand on icinga1001
[14:54:42] okay, next thing is cache-depooling eqiad then -- ema I'll send you a patch
[14:54:49] and it lets us un-downtime the alert
[14:54:52] rzl: https://gerrit.wikimedia.org/r/c/operations/dns/+/623360
[14:54:58] which is a more comfortable situation for today
[14:55:04] or that :D
[14:55:25] it's already there for your reviewing pleasure :)
[14:55:53] ema: can you hang it off T243316 instead?
[14:55:53] T243316: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316
[14:56:08] sure
[14:56:20] otherwise LGTM, fire when ready
[14:58:57] running authdns-update
[14:59:36] OK - authdns-update successful on all nodes!
[15:06:50] all good so far: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[15:11:04] swift looks like it is still swifting
[15:13:23] godog: haha I started from https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=10&orgId=1&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops
[15:13:26] and switched DC to "All"
[15:13:36] which apparently does not work, because it just showed me the eqiad data, so I panicked a little
[15:13:42] but the codfw data is going up to match :)
[15:14:15] I see an icinga unknown: "Elevated latency for eventgate-logging-external eqiad"
[15:14:51] perhaps something to look into, similarly to the purged kafka alert
[15:14:57] <_joe_> ema: yes
[15:16:37] noted in the doc
[15:18:38] ema: when you get a chance, https://gerrit.wikimedia.org/r/623398 ?
[15:21:45] cdanis: clever!
[15:21:53] it is a hack but a fine one
[15:22:23] maybe add a comment
[15:23:31] done
[15:25:11] we're coming up on the scheduled time for restoring dnsdisc TTLs to 5m -- any objections?
[15:26:36] Reading through https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Dashboards_4, I noticed the Load Balancers dashboard link is dead. Should it be pointing at https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 instead?
[15:27:21] sobanski: I thought I fixed it, perhaps you're on an old version of the page?
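The min() idea discussed above changes the purged alert's semantics from "either DC's topic is stale" to "both topics are stale", which is exactly what lets it tolerate eventgate-main being depooled on one side. A rough sketch of the shape of that change, with a hypothetical metric name, topic names, and threshold (the actual query lives in the check touched by https://gerrit.wikimedia.org/r/623398 and isn't quoted in the log):

    # Before (assumed): fires if the lag for either topic exceeds the threshold.
    # purged_event_lag_seconds, the topic names, and 600s are all assumptions.
    purged_event_lag_seconds{topic=~"(eqiad|codfw)\\.resource-purge"} > 600

    # After: fires only when even the freshest topic is stale, i.e. both are.
    min(purged_event_lag_seconds{topic=~"(eqiad|codfw)\\.resource-purge"}) > 600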
[15:27:43] ema: indeed :)
[15:27:56] I've had it open since this morning
[15:28:19] logstash alert, looks like the usual but checking
[15:28:24] ack
[15:30:49] yeah, looks unrelated
[15:31:12] I'm going to go ahead and run 02-restore-ttl for services, last call :)
[15:31:29] _joe_: ^
[15:31:38] ack
[15:32:51] going -- for anyone still watching my terminal, note the retries are still expected, and we may have to rerun the whole step
[15:33:09] <_joe_> rzl: +1 :)
[15:34:20] success \o/
[15:34:24] nice
[15:35:12] ema: that's quite a long tail on eqiad traffic, it's almost enough to make you think some people aren't respecting DNS TTLs correctly
[15:35:24] rzl: that always happens, yeah
[15:35:29] but surely THAT can't be right, people would never be so irresponsible
[15:35:30] <_joe_> rzl: that would be ludicrous!
[15:35:35] <_joe_> rzl: yes!
[15:35:42] rzl: yeah there's plenty of those unfortunately!
[15:35:55] rzl: at some other shops it's standard operating procedure to eventually force traffic elsewhere via BGP advertisements
[15:40:09] here we have traffic_shutdown in hiera to flip the VCL finger when *really* needed
[15:42:20] looks like there's still a fair number of requests to swift eqiad, I was expecting to have swift eqiad fully drained for reads though, expected?
[15:44:20] godog: in esams swift resolves to eqiad
[15:44:20] swift.discovery.wmnet has address 10.2.2.27
[15:44:43] did we change the dns discovery entry for swift? :)
[15:44:50] _joe_: ^
[15:45:00] it looks like we decided not to but I missed that
[15:45:11] mhhh that was my expectation too, but clearly not
[15:45:33] I think we expected it to move with either the cache or MW, not sure
[15:45:58] reads with cache, and writes with MW
[15:46:22] <_joe_> rzl: yes, we should move swift.discovery.wmnet too, probably
[15:46:34] <_joe_> I didn't think of esams/eqsin :)
[15:47:04] I'm around to do it now if folks are ok with it ?
[15:47:13] I think we should do it
[15:47:21] <_joe_> godog: gimme 10 mins, we can just do it with confctl
[15:47:30] _joe_: ack
[15:47:33] <_joe_> or, anyone who wants to, just do it :)
[15:48:23] <_joe_> ema: what record do you use at the traffic layer?
[15:48:26] will do right after this meeting
[15:48:28] <_joe_> swift.discovery.wmnet?
[15:48:33] _joe_: correct
[15:49:14] <_joe_> confctl --object-type discovery select 'dnsdisc=swift' get
[15:50:03] yep we're active/active right now -- I'll just set/pooled=no in name=eqiad, yeah?
[15:50:06] <_joe_> confctl --object-type discovery select 'dnsdisc=swift,name=eqiad' set/pooled=false
[15:50:15] <_joe_> pooled=false
[15:50:19] ack
[15:50:26] <_joe_> this is true/false, not yes/no/inactive :P
[15:50:46] https://wikitech.wikimedia.org/wiki/Conftool to be updated then :D
[15:51:14] swift still has the five-minute ttl so this won't be immediate, but that's fine
[15:51:56] <_joe_> +1
[15:52:04] SGTM
[15:52:06] <_joe_> although 10 is better for rollbacks
[15:53:10] yeah it's true, just didn't think of it in time
[15:53:41] automation Good, manual process Bad
[15:54:00] I guess I could have just run the whole cookbook with --services=swift
[15:55:06] <_joe_> rzl: you'd get a surprise
[15:55:19] aw man does EXCLUDED_SERVICES override it?
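A per-site comparison is the quickest way to confirm that a depool like the swift one above has actually taken effect once the discovery TTL expires. A sketch of that kind of query, with a hypothetical metric name (the real Swift dashboards may be built on different metrics and labels):

    # Client request rate reaching each Swift frontend; eqiad should drain
    # towards a small residual while codfw picks up the load.
    # swift_proxy_server_requests_total and the "site" label are assumptions.
    sum by (site) (
      rate(swift_proxy_server_requests_total{site=~"eqiad|codfw"}[5m])
    )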
[15:55:20] <_joe_> it's one of the services we put in the blacklist at the top
[15:55:25] :(
[15:55:37] <_joe_> just remove it from that list :)
[15:55:41] without even an --i-promise-i-know-what-i'm-doing flag
[15:55:52] <_joe_> that's the --bblack flag
[15:55:56] <_joe_> and nope :P
[15:56:01] --please-just-do-what-i-told-you-instead-of-what-you-think-i-want
[15:56:12] <_joe_> users don't want it.
[15:56:45] 5m elapsed, checking swift graphs
[15:56:50] <_joe_> like the headphone jack on a phone.
[15:57:54] definitely more traffic movement, doesn't look like it's leveled off yet
[15:58:42] rzl: the swift graphs use rate() and not irate() so that's extra-expected
[15:58:52] yeah figures
[15:59:01] indeed, close to being drained
[15:59:02] something something joke about swift
[15:59:15] what's the joke about swift? :)
[15:59:30] something like, you'd expect the swift graphs to be faster
[15:59:30] it doesn't drain swiftly
[15:59:45] this was pretty low-quality to begin with and then I phoned it in, that's on me
[16:00:04] haha fair
[16:00:41] heterological swift
[16:00:44] the mediawiki_prod bytes graph, without the bottom axis being fixed at zero, gives me a small heart attack every time
[16:02:32] we're down to double-digit RPS in eqiad, looks like that did the trick
[16:02:45] +1
[16:03:13] and codfw still looks healthy afaict
[16:03:56] <_joe_> the requests in eqiad are from mediawiki writing originals
[16:03:57] confirmed https://commons.wikimedia.org/wiki/Special:NewFiles works as expected
[16:04:14] also no 5xx, which is nice
[16:05:07] ema: what a perfectionist
[16:05:19] we need to be serving results *and* they need to not be 5xx? wow
[16:05:22] <_joe_> I don't see thumbnails in special:newfiles though
[16:05:45] I have some broken links there
[16:06:04] <_joe_> well just for the latest stuff, I guess that's """normal""" having switched to the other dc
[16:06:06] thumbnails are loading slowly for me
[16:06:10] hm, a refresh mostly fixed it
[16:06:13] <_joe_> that we have some latencies
[16:06:20] cdanis: do you mean broken links or thumbnails?
[16:06:26] broken thumbs, sorry
[16:06:29] cool
[16:06:48] in theory, mediawiki filebackend doesn't return until it has written to both Swift clusters
[16:07:06] <_joe_> lemme look at the state of thumbor in codfw
[16:07:10] yeah Special:NewFiles works "eventually" even in normal conditions due to race conditions and rate/concurrency limits iirc
[16:07:30] interestingly it seems like my failures are HTTP 429
[16:07:35] so that sounds like thumbor
[16:07:38] that might be thumbor indeed
[16:07:58] <_joe_> we have some increased latency in the observed latencies
[16:08:07] <_joe_> err in codfw
[16:08:34] <_joe_> but nothing super-worrisome
[16:09:15] and yeah, retrying the request seems to fix it -- e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Blake_in_Theater.jpg/90px-Blake_in_Theater.jpg was a 429 but is fine now
[16:09:42] hm
[16:09:48] thumbor hosts saturate their network RX quite a bit
[16:10:01] and don't use much CPU
[16:10:06] must be 1G hosts
[16:10:08] always or just now?
[16:10:12] always
[16:10:14] in both DCs
[16:14:08] I was going to say that 429s from Thumbor@codfw didn't look especially elevated, but then I realized it is a log_10 graph
[16:14:26] so instead, they've increased by about an order of magnitude
[16:21:33] do we think this is a transient effect of sloshing the traffic over? or is there something that needs fixing here
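Whether the Thumbor 429 spike is transient is easiest to judge from the per-DC rate of responses by status code over time, the same view behind the log-scale graph mentioned above. A sketch with hypothetical metric and label names (Thumbor's actual exporter metrics aren't quoted in the log):

    # Thumbor responses broken down by status code and site; on a log_10
    # axis the roughly 10x jump in the codfw status="429" series is easy to miss.
    # thumbor_response_status_total and its labels are assumptions.
    sum by (site, status) (
      rate(thumbor_response_status_total[5m])
    )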
[16:26:34] mmhh I'm tempted to say it is due to traffic sloshing, 429s are converging to eqiad levels pre-shift
[16:26:42] I keep typing swift instead of shift
[16:27:53] :)
[16:28:23] it seems that things are looking good, I'll tentatively begin my evening
[16:28:30] page if needed
[16:28:47] yeah, we can keep an eye on swift over the next little while and either investigate or repool if need be
[16:28:51] otherwise I think we're in good shape
[16:29:05] thanks everyone <3 tomorrow's the fun part
[16:29:10] <3
[16:30:02] 💜
[16:30:05] 🤞
[16:44:45] I'll go as well, LMK if sth comes up
[16:45:35] same here :)
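One reading note on the drain graphs discussed above: godog's point at 15:58 that the Swift dashboard uses rate() rather than irate() is why the eqiad traffic appeared to taper off slowly even after the depool had fully taken effect. A small illustration, reusing the hypothetical metric name from the earlier sketches:

    # rate() averages over the whole 5m window, so a sharp drop in traffic
    # decays gradually on the graph rather than falling off a cliff.
    sum(rate(swift_proxy_server_requests_total{site="eqiad"}[5m]))

    # irate() uses only the last two samples, so it tracks the drop almost
    # immediately, at the cost of a much noisier line.
    sum(irate(swift_proxy_server_requests_total{site="eqiad"}[5m]))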