[07:40:09] klausman: o/
[07:40:30] hey folks, please welcome klausman to WMF :) today is his first day on... idk, some team.
[07:40:51] They told me it has to do with computers.
[07:41:06] Also, 'lo everyone.
[07:41:06] klausman: welcome!
[07:41:08] klausman: welcome aboard!
[07:45:20] klausman: welcome! (Luca, Analytics)
[07:45:34] klausman: moin and welcome!
[07:54:13] welcome klausman !
[08:08:56] klausman: welcome o/
[08:26:22] welcome, klausman!
[09:37:04] welcome klausman
[09:45:31] 👋 klausman , welcome!
[12:00:50] 👋
[12:08:11] sobanski: o/ welcome!!
[12:08:14] hey everyone, please welcome sobanski :) He's the new manager of the Data Persistence team
[12:08:16] :)
[12:08:58] give him all your pity, he's going to need it. ;)
[12:09:11] Hi everyone (and Stephen)
[12:09:15] :D
[12:10:10] hello klausman as well :)
[12:11:46] 👋
[12:13:32] <_joe_> hi sobanski :)
[12:16:22] welcome sobanski :)
[12:16:30] hello sobanski, welcome!
[12:22:59] hey sobanski o/
[12:30:04] hi all, I merged a change to remove the old hiera3 backends today, which broke the spec tests on anything that uses the shared spec_helper.rb (which includes, among others, the standard module). I have now fixed this, but if you see any strange CI issues where it's unable to find hiera values it may be related and a rebase will be required. for me the CI error looked like ...
[12:30:10] ... https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/9974/console
[12:50:23] klausman, sobanski: welcome both! 👋
[13:06:28] I am going to fail over the m3 (phabricator) dbproxy, it should be transparent, but if you notice issues, please let me know
[13:36:07] FYI: services switchover will start in about 25 minutes, followed by depooling eqiad -- please plan to hold off on any other production changes for a bit :)
[13:40:38] rzl: at the risk of asking the dumb question - isn't the dc switchover tomorrow?
[13:40:50] we're switching mediawiki tomorrow, yep
[13:40:50] the Mediawiki switchover is tomorrow
[13:41:13] today we're doing the lower-risk stuff, depooling from eqiad most of the microservices that normally run active-active
[13:41:36] and then also depooling its frontend traffic
[13:41:39] ah i see
[13:41:43] and depooling some of the macroservices, too ;)
[13:41:50] thank you for asking though
[13:42:11] I would 100% not have put it beyond me to spend months planning this out and then press the button on the wrong day
[13:42:17] haha
[13:43:36] cdanis: next to mw, everything is a microservice
[13:57:42] (that's a talk title)
[14:00:14] <_joe_> we only have one macroservice (restbase), and a couple of microservices that graduated to actual services
[14:00:25] <_joe_> I think the plan is to graduate them down
[14:30:06] I believe the purged alerts are a false alarm
[14:30:13] moving here from ops to dodge icinga noise
[14:30:19] _joe_ cdanis ema volans mark
[14:30:26] we are no longer producing events to the eqiad kafka topic, but the codfw topic volume has increased
[14:30:26] <_joe_> cdanis: based on what?
[14:30:29] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&from=now-3h&to=now
[14:30:41] <_joe_> cdanis: oh right we switched eventgate-main
[14:30:42] the 'event lag' is merely how long it has been since a purged instance has seen a message for that topic
[14:30:43] <_joe_> yes
[14:30:47] <_joe_> yes sorry
[14:30:48] indeed
[14:30:48] that is expected y?
[14:30:51] <_joe_> yes
[14:30:56] oh, good find
[14:30:56] coo
[14:30:57] <_joe_> but maps isn't
[14:30:58] it's an end-to-end alert that's reasonable to have in the usual case, but it's inaccurate here
[14:31:00] yes, maps is real
[14:31:13] <_joe_> ok so, let's focus on maps
[14:31:13] any volunteer to downtime the purged alert for now, please?
[14:31:20] and then yes we look at maps
[14:31:22] <_joe_> I'll do it rzl
[14:31:25] do we repool karto in eqiad?
[14:31:27] ack, thanks _joe_
[14:31:48] <_joe_> rzl: let's repool momentarily, yes
[14:32:16] the issue on maps seems to be due to CPU saturation: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=now-30m&to=now
[14:33:00] confctl args check please? on the shared terminal
[14:33:14] rzl: lgtm
[14:33:17] +1
[14:33:26] thanks both
[14:34:00] I'm watching https://grafana.wikimedia.org/d/000000030/service-kartotherian?viewPanel=10&orgId=1&refresh=30s&from=now-1h&to=now for recovery
[14:34:02] https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&from=now-6h&to=now&var-site=codfw&var-cluster=maps&var-instance=All&var-datasource=thanos
[14:34:14] something happened around 12:35 as well, some increase in traffic maybe
[14:34:36] wow it sure did
[14:34:54] but yes, with maps running only in codfw it's very cpu-starved
[14:35:29] huh, at 12:35 the NICs actually saturated some
[14:36:10] <_joe_> ok I have a proposal
[14:36:27] <_joe_> we've verified we can work with eventgate-main failed over to codfw
[14:36:39] <_joe_> but there is no good reason not to leave it active in both DCs
[14:37:03] <_joe_> it's also true that when we switch mw over, we'll get this same alert
[14:37:15] <_joe_> as we will be producing events in codfw only
[14:38:39] kartotherian latency looks like it's recovering
[14:39:52] _joe_: yeah, let's probably disable the alert before switching tomorrow, if nothing else
[14:40:14] in the meantime though, repooling eventgate-main in eqiad sounds fine to me, I don't have strong feelings either way
[14:40:55] <_joe_> meh, let's not
[14:41:06] let's not what?
[14:43:00] hello sobanski, missed your arrival earlier!
[14:43:43] cache_upload looking good again: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=All&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=now-1h&to=now
[14:44:26] <_joe_> mark: repool eventgate-main
[14:46:08] i'm also not sure why we'd repool it
[14:47:01] WRT the purged alert, I think we should change the alert so only one of the topics needs to be 'working', not both of them
[14:47:15] in that case we wouldn't have to downtime it, and it wouldn't matter if eventgate was depooled somewhere or not
[14:47:33] that sounds like the correct fix, yeah
[14:47:58] yes
[14:48:25] let's talk in a bit about whether we'll have that done in time for tomorrow
[14:48:27] I think it should be enough to wrap the existing prom query in min()
[14:48:31] for now, though:
[14:48:46] kartotherian is repooled in eqiad and we'll keep it there, we've learned all we need to know about that :)
[14:48:57] it sounds like we're deciding not to repool eventgate-main, is that correct?
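The CPU-saturation diagnosis on the maps hosts above is the kind of thing that can be cross-checked straight from node-exporter data rather than only from the maps dashboard. A minimal PromQL sketch, assuming the standard node_cpu_seconds_total metric and a hypothetical "cluster" label (the exact label scheme on these Prometheus instances isn't shown in the log):

    # Approximate per-host CPU utilisation for the maps cluster in codfw.
    # node_cpu_seconds_total is standard node-exporter; "cluster" is an
    # assumed label for grouping the maps hosts.
    1 - avg by (instance) (
          rate(node_cpu_seconds_total{cluster="maps", mode="idle"}[5m])
        )

Values approaching 1 across all hosts would match the "very cpu-starved" observation above.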
[14:50:50] correct, at least from the point of view of Purged, purges are coming in nicely so there's no functional reason to repool IMHO
[14:52:18] okay, let's roll with that
[14:52:20] <_joe_> cdanis: the correct thing to do is to check if eventgate is pooled in a dc, then alert on the corresponding metric
[14:52:35] that's too hard
[14:52:37] I'm not aware of any other issues we're tracking from the services switch -- anybody?
[14:52:39] <_joe_> because otherwise we can miss one dc not sending the alerts
[14:53:01] <_joe_> cdanis: I can cook up something tomorrow morning :)
[14:53:05] I don't believe we have pooled-ness of services expressed as Prometheus metrics
[14:53:06] <_joe_> it's not that hard
[14:53:32] <_joe_> cdanis: I don't think so either, but we can just run a wrapper around check_prometheus
[14:54:41] https://gerrit.wikimedia.org/r/623398 addresses it for now, tested by hand on icinga1001
[14:54:42] okay, next thing is cache-depooling eqiad then -- ema I'll send you a patch
[14:54:49] and it lets us un-downtime the alert
[14:54:52] rzl: https://gerrit.wikimedia.org/r/c/operations/dns/+/623360
[14:54:58] which is a more comfortable situation for today
[14:55:04] or that :D
[14:55:25] it's already there for your reviewing pleasure :)
[14:55:53] ema: can you hang it off T243316 instead?
[14:55:53] T243316: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316
[14:56:08] sure
[14:56:20] otherwise LGTM, fire when ready
[14:58:57] running authdns-update
[14:59:36] OK - authdns-update successful on all nodes!
[15:06:50] all good so far: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[15:11:04] swift looks like it is still swifting
[15:13:23] godog: haha I started from https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=10&orgId=1&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops
[15:13:26] and switched DC to "All"
[15:13:36] which apparently does not work, because it just showed me the eqiad data, so I panicked a little
[15:13:42] but the codfw data is going up to match :)
[15:14:15] I see an icinga unknown: "Elevated latency for eventgate-logging-external eqiad"
[15:14:51] perhaps something to look into, similarly to the purged kafka alert
[15:14:57] <_joe_> ema: yes
[15:16:37] noted in the doc
[15:18:38] ema: when you get a chance, https://gerrit.wikimedia.org/r/623398 ?
[15:21:45] cdanis: clever!
[15:21:53] it is a hack but a fine one
[15:22:23] maybe add a comment
[15:23:31] done
[15:25:11] we're coming up on the scheduled time for restoring dnsdisc TTLs to 5m -- any objections?
[15:26:36] Reading through https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Dashboards_4, I noticed the Load Balancers dashboard link is dead. Should it be pointing at https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 instead?
[15:27:21] sobanski: I thought I fixed it, perhaps you're on an old version of the page?
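The min() idea discussed above changes the purged alert's semantics from "either DC's topic is stale" to "both topics are stale", which is exactly what lets it tolerate eventgate-main being depooled on one side. A rough sketch of the shape of that change, with a hypothetical metric name, topic names, and threshold (the actual query lives in the check touched by https://gerrit.wikimedia.org/r/623398 and isn't quoted in the log):

    # Before (assumed): fires if the lag for either topic exceeds the threshold.
    # purged_event_lag_seconds, the topic names, and 600s are all assumptions.
    purged_event_lag_seconds{topic=~"(eqiad|codfw)\\.resource-purge"} > 600

    # After: fires only when even the freshest topic is stale, i.e. both are.
    min(purged_event_lag_seconds{topic=~"(eqiad|codfw)\\.resource-purge"}) > 600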
[15:27:43] ema: indeed :)
[15:27:56] I've had it open since this morning
[15:28:19] logstash alert, looks like the usual but checking
[15:28:24] ack
[15:30:49] yeah, looks unrelated
[15:31:12] I'm going to go ahead and run 02-restore-ttl for services, last call :)
[15:31:29] _joe_: ^
[15:31:38] ack
[15:32:51] going -- for anyone still watching my terminal, note the retries are still expected, and we may have to rerun the whole step
[15:33:09] <_joe_> rzl: +1 :)
[15:34:20] success \o/
[15:34:24] nice
[15:35:12] ema: that's quite a long tail on eqiad traffic, it's almost enough to make you think some people aren't respecting DNS TTLs correctly
[15:35:24] rzl: that always happens, yeah
[15:35:29] but surely THAT can't be right, people would never be so irresponsible
[15:35:30] <_joe_> rzl: that would be ludicrous!
[15:35:35] <_joe_> rzl: yes!
[15:35:42] rzl: yeah there's plenty of those unfortunately!
[15:35:55] rzl: at some other shops it's standard operating procedure to eventually force traffic elsewhere via BGP advertisements
[15:40:09] here we have traffic_shutdown in hiera to flip the VCL finger when *really* needed
[15:42:20] looks like there's still a fair number of requests to swift eqiad, I was expecting to have swift eqiad fully drained for reads though, expected?
[15:44:20] godog: in esams swift resolves to eqiad
[15:44:20] swift.discovery.wmnet has address 10.2.2.27
[15:44:43] did we change the dns discovery entry for swift? :)
[15:44:50] _joe_: ^
[15:45:00] it looks like we decided not to but I missed that
[15:45:11] mhhh that was my expectation too, but clearly not
[15:45:33] I think we expected it to move with either the cache or MW, not sure
[15:45:58] reads with cache, and writes with MW
[15:46:22] <_joe_> rzl: yes, we should move swift.discovery.wmnet too, probably
[15:46:34] <_joe_> I didn't think of esams/eqsin :)
[15:47:04] I'm around to do it now if folks are ok with it ?
[15:47:13] I think we should do it
[15:47:21] <_joe_> godog: gimme 10 mins, we can just do it with confctl
[15:47:30] _joe_: ack
[15:47:33] <_joe_> or, anyone who wants to, just do it :)
[15:48:23] <_joe_> ema: what record do you use at the traffic layer?
[15:48:26] will do right after this meeting
[15:48:28] <_joe_> swift.discovery.wmnet?
[15:48:33] _joe_: correct
[15:49:14] <_joe_> confctl --object-type discovery select 'dnsdisc=swift' get
[15:50:03] yep we're active/active right now -- I'll just set/pooled=no in name=eqiad, yeah?
[15:50:06] <_joe_> confctl --object-type discovery select 'dnsdisc=swift,name=eqiad' set/pooled=false
[15:50:15] <_joe_> pooled=false
[15:50:19] ack
[15:50:26] <_joe_> this is true/false, not yes/no/inactive :P
[15:50:46] https://wikitech.wikimedia.org/wiki/Conftool to be updated then :D
[15:51:14] swift still has the five-minute ttl so this won't be immediate, but that's fine
[15:51:56] <_joe_> +1
[15:52:04] SGTM
[15:52:06] <_joe_> although 10 is better for rollbacks
[15:53:10] yeah it's true, just didn't think of it in time
[15:53:41] automation Good, manual process Bad
[15:54:00] I guess I could have just run the whole cookbook with --services=swift
[15:55:06] <_joe_> rzl: you'd get a surprise
[15:55:19] aw man does EXCLUDED_SERVICES override it?
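A per-site comparison is the quickest way to confirm that a depool like the swift one above has actually taken effect once the discovery TTL expires. A sketch of that kind of query, with a hypothetical metric name (the real Swift dashboards may be built on different metrics and labels):

    # Client request rate reaching each Swift frontend; eqiad should drain
    # towards a small residual while codfw picks up the load.
    # swift_proxy_server_requests_total and the "site" label are assumptions.
    sum by (site) (
      rate(swift_proxy_server_requests_total{site=~"eqiad|codfw"}[5m])
    )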
[15:55:20] <_joe_> it's one of the services we put in the blacklist at the top
[15:55:25] :(
[15:55:37] <_joe_> just remove it from that list :)
[15:55:41] without even an --i-promise-i-know-what-i'm-doing flag
[15:55:52] <_joe_> that's the --bblack flag
[15:55:56] <_joe_> and nope :P
[15:56:01] --please-just-do-what-i-told-you-instead-of-what-you-think-i-want
[15:56:12] <_joe_> users don't want it.
[15:56:45] 5m elapsed, checking swift graphs
[15:56:50] <_joe_> like the headphone jack on a phone.
[15:57:54] definitely more traffic movement, doesn't look like it's leveled off yet
[15:58:42] rzl: the swift graphs use rate() and not irate() so that's extra-expected
[15:58:52] yeah figures
[15:59:01] indeed, close to being drained
[15:59:02] something something joke about swift
[15:59:15] what's the joke about swift? :)
[15:59:30] something like, you'd expect the swift graphs to be faster
[15:59:30] it doesn't drain swiftly
[15:59:45] this was pretty low-quality to begin with and then I phoned it in, that's on me
[16:00:04] haha fair
[16:00:41] heterological swift
[16:00:44] the mediawiki_prod bytes graph, without the bottom axis being fixed at zero, gives me a small heart attack every time
[16:02:32] we're down to double-digit RPS in eqiad, looks like that did the trick
[16:02:45] +1
[16:03:13] and codfw still looks healthy afaict
[16:03:56] <_joe_> the requests in eqiad are from mediawiki writing originals
[16:03:57] confirmed https://commons.wikimedia.org/wiki/Special:NewFiles works as expected
[16:04:14] also no 5xx, which is nice
[16:05:07] ema: what a perfectionist
[16:05:19] we need to be serving results *and* they need to not be 5xx? wow
[16:05:22] <_joe_> I don't see thumbnails in special:newfiles though
[16:05:45] I have some broken links there
[16:06:04] <_joe_> well just for the latest stuff, I guess that's """normal""" having switched to the other dc
[16:06:06] thumbnails are loading slowly for me
[16:06:10] hm, a refresh mostly fixed it
[16:06:13] <_joe_> that we have some latencies
[16:06:20] cdanis: do you mean broken links or thumbnails?
[16:06:26] broken thumbs, sorry
[16:06:29] cool
[16:06:48] in theory, mediawiki filebackend doesn't return until it has written to both Swift clusters
[16:07:06] <_joe_> lemme look at the state of thumbor in codfw
[16:07:10] yeah Special:NewFiles works "eventually" even in normal conditions due to race conditions and rate/concurrency limits iirc
[16:07:30] interestingly it seems like my failures are HTTP 429
[16:07:35] so that sounds like thumbor
[16:07:38] that might be thumbor indeed
[16:07:58] <_joe_> we have some increased latency in the observed latencies
[16:08:07] <_joe_> err in codfw
[16:08:34] <_joe_> but nothing super-worrisome
[16:09:15] and yeah, retrying the request seems to fix it -- e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Blake_in_Theater.jpg/90px-Blake_in_Theater.jpg was a 429 but is fine now
[16:09:42] hm
[16:09:48] thumbor hosts saturate their network RX quite a bit
[16:10:01] and don't use much CPU
[16:10:06] must be 1G hosts
[16:10:08] always or just now?
[16:10:12] always
[16:10:14] in both DCs
[16:14:08] I was going to say that 429s from Thumbor@codfw didn't look especially elevated, but then I realized it is a log_10 graph
[16:14:26] so instead, they've increased by about an order of magnitude
[16:21:33] do we think this is a transient effect of sloshing the traffic over? or is there something that needs fixing here
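Whether the Thumbor 429 spike is transient is easiest to judge from the per-DC rate of responses by status code over time, the same view behind the log-scale graph mentioned above. A sketch with hypothetical metric and label names (Thumbor's actual exporter metrics aren't quoted in the log):

    # Thumbor responses broken down by status code and site; on a log_10
    # axis the roughly 10x jump in the codfw status="429" series is easy to miss.
    # thumbor_response_status_total and its labels are assumptions.
    sum by (site, status) (
      rate(thumbor_response_status_total[5m])
    )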
[16:26:34] mmhh I'm tempted to say it is due to traffic sloshing, 429s are converging to eqiad levels pre-shift
[16:26:42] I keep typing swift instead of shift
[16:27:53] :)
[16:28:23] it seems that things are looking good, I'll tentatively begin my evening
[16:28:30] page if needed
[16:28:47] yeah, we can keep an eye on swift over the next little while and either investigate or repool if need be
[16:28:51] otherwise I think we're in good shape
[16:29:05] thanks everyone <3 tomorrow's the fun part
[16:29:10] <3
[16:30:02] 💜
[16:30:05] 🤞
[16:44:45] I'll go as well, LMK if sth comes up
[16:45:35] same here :)
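One reading note on the drain graphs discussed above: godog's point at 15:58 that the Swift dashboard uses rate() rather than irate() is why the eqiad traffic appeared to taper off slowly even after the depool had fully taken effect. A small illustration, reusing the hypothetical metric name from the earlier sketches:

    # rate() averages over the whole 5m window, so a sharp drop in traffic
    # decays gradually on the graph rather than falling off a cliff.
    sum(rate(swift_proxy_server_requests_total{site="eqiad"}[5m]))

    # irate() uses only the last two samples, so it tracks the drop almost
    # immediately, at the cost of a much noisier line.
    sum(irate(swift_proxy_server_requests_total{site="eqiad"}[5m]))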