[07:56:46] I noticed a brief spike in thumbor errors during the switchover (presumably a spike in requests because of new thumbs needed that previously were only in eqiad) - https://grafana.wikimedia.org/goto/QHB7XPgHR?orgId=1
[08:50:57] it was actually due to a bit of confusion between the "swift" and "swift-ro" service, we switched back swift-ro, but requests were going to thumbor via swift
[09:07:51] ah, OK, good to know
[09:08:41] I suspect some of the legacy swift services could usefully be gotten rid of, but am a bit afraid of breaking something
[09:23:17] is the general advice to use swift.discovery.wmnet rather than swift-ro/swift-rw?
[09:24:17] sw.french has pointed out that there's a note saying "# TODO: remove this from DNS!" for -rw and -ro :D
[09:44:54] this was all a bit legacy when I started here...
[09:45:54] <_joe_> interesting choice of terms :D
[09:47:09] <_joe_> hnowlan: anyways, no, I think we can decidedly remove the -ro and -rw services. Those were supposed to be used when we had active replication between the swift clusters
[09:47:24] <_joe_> now we write to both, directly from mw, without using discovery for that
[09:47:24] * Emperor weeps softly
[09:47:53] <_joe_> it /might/ be useful for other applications, like say the docker registry, that use replicated buckets
[09:48:06] <_joe_> but then we'd need to have one discovery record per bucket :)
[09:49:07] AIUI (but might be wrong, go.dog might know) swift is the one used for read operations by mw (and it writes to both regardless of discovery state). I'm not aware of uses of swift-ro and swift-rw, but that doesn't mean there aren't any lurking
[09:49:58] <_joe_> swift is definitely used by ATS
[09:50:12] <_joe_> but the whole concept of active/passive only makes sense if there's replication
[09:50:36] I have a hazy recollection that docker just uses one of the ms clusters
[09:50:56] <_joe_> and the bucket is replicated
[09:51:00] <_joe_> or at least it was
[09:51:18] I don't think there is any between-dc replication being done by swift
[09:51:38] [the exception being the thanos cluster, which is one swift cluster that spans both DCs]
[09:54:09] inspection shows that the docker account does have the same number and size of objects in both clusters (but I think that's being done by the client, not the server)
[13:43:33] hi folks, just a friendly reminder that today is switchover day 2 [0]. preparation will begin in the 14:00 UTC hour, targeting 15:00 for the (brief) mediawiki read-only period. as yesterday, thanks in advance for your assistance / patience :)
[13:43:33] [0] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki
[14:06:03] arnaudb && jynus, there is an outstanding patch pre-switchover https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075052
[14:06:46] it can be merged, but it is too late to take effect
[14:06:47] related to https://gerrit.wikimedia.org/r/c/operations/dns/+/1073897 (wmnet: update CNAME records for DB masters to codfw)
[14:06:57] we would need to restart mysql
[14:07:07] jynus: will it be a problem when we merge 1073897 ?
[14:07:08] not a blocker if you ask me, just something that should be fixed
[14:07:12] no, no
[14:07:21] alright, we will let you sort it
[14:07:23] the cname is mostly a no-op, no one uses the cnames
[14:07:31] other than the proxies
[14:07:47] so not a blocker, just something we caught to fix when we can
[14:21:33] FYI, starting preparation work now.
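Side note on checking the DNS side of the above: once a records change like 1073897 is merged and deployed, a plain dig from a production host shows where a name currently points. A minimal sketch; the master CNAME name below is hypothetical (the real names live in the dns patch), only swift.discovery.wmnet is taken from the earlier discussion:

    # record name is illustrative; see the actual dns change for the real CNAMEs
    dig +short s1-master.wmnet CNAME
    # discovery records can be checked the same way; they generally resolve to the
    # service address in whichever DC is currently pooled for that service
    dig +short swift.discovery.wmnet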
updates will primarily be in -operations
[14:31:23] <_joe_> I will have my traditional monitoring tool http://listen.hatnote.com/#uk,fr,sv,he,as,pa,ml,or,pl,sr,fi,eo,pt,no,bg,mk,sa,mr,te,hi,id,ar,nl,ja,de,ru,es,it,fa,zh,bn,ta,kn,gu,be,el,et,hu,en in background
[14:35:42] icinga downtimes looking good
[14:39:02] I got the site notices now
[14:44:32] hi all, we're at the first disruptive operation of the setup. is anyone aware of a reason we need to hold before I proceed (go / no-go #1)
[14:45:50] continuing
[14:46:06] is everyone listening to hatnote? because I think we broke it :p
[14:46:17] ha ha
[14:46:21] it's working for me
[14:46:26] yes hatnote is open
[14:47:09] <_joe_> I think we need sirenbot to also cheer the event
[14:47:13] <_joe_> !sing
[14:47:14] Never gonna give you up
[14:47:14] Never gonna let you down
[14:47:15] Never gonna run around and desert you
[14:47:16] Never gonna make you cry
[14:47:17] Never gonna say goodbye
[14:47:18] Never gonna tell a lie and hurt you
[14:47:33] btw, remember wikitech fails to log, as is tradition, for the newbies
[14:47:39] <_joe_> ah yes
[14:47:41] (during read only)
[14:48:40] is there a shared tmux we can join to watch?
[14:49:48] <_joe_> effie: ^
[14:50:06] <_joe_> taavi: hi! long time no see :)
[14:50:18] taavi: tmux -S /home/swfrench/switchover attach-session -r on cumin1002
[14:50:37] please avoid using a tiny terminal :p
[14:50:37] taavi: <3
[14:51:13] hi :-P
[14:51:14] <_joe_> effie: there's an option not to resize the terminal
[14:51:38] https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions#Create_sessions_with_fixed_size
[14:51:38] should we prep hatnote recording?
[14:51:55] akosiaris: i've been recording it since 20 minutes ago
[14:51:55] or are we done enough for it to not make sense
[14:51:59] ahahahaha
[14:52:00] thanks
[14:52:13] jynus: are you using chatstrmr? ;)
[14:52:16] dear tmux experts, it is ok really :)
[14:52:33] cdanis: just OBS
[14:52:52] or is that like chatops?
[14:53:30] swfrench-wmf: http://listen.hatnote.com/ btw. Hearing the chimes come back right after the read-only period is some of the best music there is to the ears of the person running it
[14:53:44] the readonly period is so short now though, it's not as dramatic
[14:54:04] let's agree that it is subjective
[14:54:35] I remember a 40m long switchover, I had even forgotten it was supposed to make noises
[14:55:39] it can be shorter once we move away from mysql_legacy
[14:55:47] let us hope this time around is as short as last time :)
[14:56:24] alright, final go / no-go before we start read-only: anyone aware of any reason why we should not proceed with the switchover?
[14:56:54] ship it!
[14:57:03] Back when I was on the front lines of a production site, we used to keep a twitter feed nailed up. We would generally see a spike in customers tweeting complaints before many of our own alarms went off :-)
[14:58:19] jynus: and to nonosql?
[14:58:37] if one could dream...
[14:58:44] no objections from me
[14:59:03] 👍
[14:59:11] silence....
[14:59:12] hatnote is quiet :)
[15:01:39] sound again
[15:01:40] hatnote is back
[15:01:44] my edit to s7 went through
[15:01:52] woohoo
[15:01:59] I can hear chimes !
[15:02:08] \o/
[15:02:18] so far the latency right afterward is much better than it used to be, still checking around
[15:02:56] i will check mysql health
[15:03:24] POST 5xxs are fully recovered
[15:03:28] <_joe_> rzl: partly merit of mw on k8s
[15:03:31] yeah!
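Back to the shared tmux session from 14:50-14:51: for anyone setting one up themselves, a minimal sketch of a fixed-size, read-only-shareable session along the lines of the wikitech page linked above. The socket path, session name, and geometry are illustrative, and this assumes tmux >= 2.9 for the window-size option:

    # create a detached session with a fixed size, so attaching clients can't shrink it
    tmux -S /home/$USER/switchover new-session -d -s switchover -x 220 -y 50
    tmux -S /home/$USER/switchover set-window-option -g window-size manual
    # the operator attaches normally; observers attach read-only (as in the command above)
    tmux -S /home/$USER/switchover attach-session -t switchover
    tmux -S /home/$USER/switchover attach-session -r -t switchover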
[15:03:41] if my maths are correct the read-only time was 2m46s
[15:03:46] <_joe_> we can support being much much hotter than before
[15:03:52] <_joe_> taavi: yes
[15:04:05] that and it turns out active-passive is still a great idea
[15:04:14] lots of qps in s4, but not worrying for now
[15:04:17] best cache warmup there is
[15:04:22] ^
[15:05:44] <_joe_> oh yes
[15:05:57] <_joe_> but it was still harsher last time around
[15:06:35] yeah I don't want to jinx anything yet but I've never seen a switchover this smooth
[15:06:47] lots of writes on s5
[15:06:48] <_joe_> because well, we partition traffic geographically so the warmup isn't perfect
[15:07:10] last time around we even had the operator losing internet midway
[15:07:20] and that happened 1.5 years ago too
[15:07:25] so this one went awesome
[15:07:36] RO time: 2 minutes and 45.64 seconds for those interested
[15:07:49] volans: thanks! I needed that number
[15:07:55] not sure why there was a spike of activity on dewiki
[15:08:02] enabled enabled enabled enabled enabled enabled enabled
[15:08:15] akosiaris: this time around I asked chatgpt, out of laziness :D
[15:08:18] my number from above was not good enough?
[15:08:20] but nothing breaking, just seeing some unusual patterns
[15:08:54] taavi: you were off by .36 seconds!
[15:08:56] :P
[15:09:05] <_joe_> taavi: volans is always precise to the millisecond
[15:09:45] better call volans dot png
[15:09:57] ahahaha
[15:10:09] well done, nice switchover
[15:12:40] wow
[15:13:01] well done!
[15:14:22] arnaudb: you were part of it, your db checks are what allowed that
[15:14:39] we all were but I did not pull the trigger :D
[15:15:15] can I force a rerun of the icinga config refresh? I would like to see all masters green sooner
[15:15:30] go ahead
[15:16:17] actually, I may have to run puppet locally first on the db hosts
[15:16:21] mind there is prometheus with a comparable (now) set of alerts jynus so we should be safe (please let me know if we're not, I'd be curious to see which checks have weirdnesses)
[15:16:37] ah, no, what changed was a puppet config
[15:16:37] jynus: I'm running puppet on all of the masters as we speak
[15:16:38] jynus: the cookbook is doing that right now, hang on
[15:16:48] cool
[15:16:53] 23 of 47 done
[15:17:00] standing by
[15:17:45] arnaudb: I just want a nice grafana dashboard
[15:17:53] to see them and I am sold!
[15:18:02] * arnaudb notes
[15:18:46] ah, I see a mistake already, test-s4 (which is on a production network, but not a user-facing db) is complaining
[15:20:04] arnaudb: should we change the config, ack them, or switch them too?
[15:20:31] I'd say let's ack them for now
[15:20:41] doing
[15:20:51] I'll need to wipe and productionize db2230 anyway
[15:20:52] thanks jynus
[15:21:26] don't worry, I took over that task because I was supposed to do it after testing finishes
[15:22:13] the other thing is pc1017 and pc2017 had some weirdness (the patch and the lack of ops db) to tell Amir1 when he comes back
[15:22:25] jynus: puppet runs are done and read-only downtimes are removed
[15:22:32] nice
[15:28:07] other than T375638, I see phab and other technical communities clean
[15:28:08] T375638: When entering editing with VE while the read-only mode is set, provide a message related to the read-only, not about Parsoid not loading - https://phabricator.wikimedia.org/T375638
[15:28:23] swfrench-wmf: want me to resolve the maintenance on statuspage?
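For the curious, the run-puppet-across-the-masters step the cookbook was driving around 15:16 can also be done by hand from a cumin host; a minimal sketch, assuming cumin and the standard run-puppet-agent wrapper, with a made-up host selection (the real targeting lives in the cookbook):

    # the alias here is illustrative; normally the switchover cookbook handles this
    sudo cumin 'A:db-section-masters' 'run-puppet-agent'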
and that is not really for us SREs
[15:29:19] <_joe_> jynus: nope, but it's a good find
[15:29:24] <_joe_> it's bad UX
[15:29:25] rzl: thanks for the reminder! I have it open, so can do right now
[15:29:31] 👍
[15:29:41] yeah, I wanted to highlight it as semi-related
[15:30:05] I remember fixing the original editor to make it more friendly
[15:30:21] I thought we had done the same on the new one too
[15:30:38] <_joe_> tbh I'm not even 100% sure whether the issue is in VE or parsoid's interface
[15:30:52] maybe it got lost after the php migration or something
[15:31:15] yeah, could be too
[16:05:40] alright, we are now officially done with actions planned for day 2 :)
[16:05:40] denisse: brett: nothing of note to highlight at this time specific to the switchover. however, see discussion in -traffic regarding the state of cr3-ulsfo and the potential need to depool ulsfo again.
[16:06:27] ack
[16:06:30] swfrench-wmf: Thanks! I'll take a look.
[16:06:59] netops: gonna kick the jcare renewal to y'all today
[16:07:04] bleh, wrong channel
[16:07:47] for visibility here, per discussion on Monday, if we need to depool ulsfo again due to cr issues, I would propose that we do so without proactively re-pooling eqiad - i.e., codfw should be able to handle it, but if not we can always respond by repooling.
[16:12:07] It's been depooled
[16:13:08] reminder that we are in single DC for 1 week
[16:13:35] start whatever you wanted to do in eqiad and didn't do because traffic was flowing there.
[16:14:21] (but: don't start things that would degrade/harm the currently-depooled traffic edge clusters in eqiad!)
[16:14:33] (we do need those as an emergency backup option JIC)
[16:17:18] longer-term, we maybe need to explicitly document the related policy/thinking. we talked about this a bit in traffic, but it's possible there are some disconnects in shared understanding of the risks and allowances afforded across our core sites, esp during these switchover windows, at least for the traffic edge clusters, and thus things they depend on like network infra.
[16:18:06] totally agree, I also imagine that going along with / being informed by the traffic SLO work
[16:19:35] there's even some bikeshedding to do about exactly how many concurrent failures of what kinds we intend to be able to cope with.
[16:20:07] but edge capacity for traffic across all the sites (both globally and per-region) is a tricky case
[16:20:31] I think we've maybe fallen behind a bit on edge egress capacity, just because it has grown nontrivially this year
[16:25:09] yeah I think so
[16:25:37] a lot of that, sadly, is just image ingestion into various random AI models
[16:26:47] while I agree that "training our eventual AI overlords" is implicitly something that our free-knowledge aims end up supporting
[16:27:04] that doesn't mean it has to be for free and at great burden to our other users :P
[16:28:02] indeed
[16:42:12] 12:39:17 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:42:14] 12:39:25 <+icinga-wm> PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:42:49] ah, there it went
[16:43:31] ...
or at least I assume that's what happened
[16:44:29] a lot of BGP session flaps showing in the logs in librenms
[16:45:25] FPC Uptime 7 minutes, 38 seconds
[16:45:27] that would be consistent with last time
[16:45:27] yup
[16:45:30] ah, there we are
[16:45:58] I'm not sure if anything badly user-affecting happened
[16:46:05] NEL for ulsfo text-lb looks fine
[16:46:19] cdanis: it was already depooled
[16:46:25] right
[16:46:48] ah, you just mean for clients that might still be using it despite that
[16:47:00] no, I had forgotten that lol
[16:49:09] thanks for catching the precursor symptoms, X.ioNoX
[16:50:22] I'm happy it worked out well!
[16:53:57] <_joe_> bblack, cdanis do we have hard evidence of increasing traffic and its causes?
[16:55:00] _joe_: well for one thing, we've had two different outages this year caused by *transport* (internal between-DC) links overflowing because of additional uncacheable upload-lb egress from what look like AI scrapers
[16:55:19] <_joe_> oh yes, I am aware of that
[16:55:31] <_joe_> I thought we had more actually
[16:55:54] <_joe_> but I was more talking about having any data on how much more traffic we're doing now compared to say 2022
[16:55:58] yeah, it's more
[16:56:03] <_joe_> and then looking at the fraction that's scrapers
[16:56:54] there's some tracking of the big scrapers now, because WME cares
[16:57:34] we care too I guess, but we care less when it doesn't cause a big incident, and there are quite a few heavy scrapers that don't regularly cause incidents, but still add a lot of load/traffic
[17:01:10] wow, I'm querying thanos-downsample-1h and even using a recording rule, and it's still really slow
[19:29:44] swfrench-wmf: the motd at deploy2002 still has the scary DO NOT USE THIS SERVER message, is that expected?
[19:31:31] rzl: thanks for flagging! yes, that's expected: we've not switched the deployment server yet (scheduled for tomorrow)
[19:31:52] ah thanks
[19:32:25] hm, I was waiting until post-switchover to send out the mwscript-k8s announcement just to avoid wrong-host confusion, maybe tomorrow's the day then
[19:33:19] well, that and it seemed like a funny prank to get everyone to try running maintenance scripts the new way and then immediately kill them for the switch. but that part's resolved at least
[19:33:39] ah, got it - yeah, we're in a weird state where the maintenance host has switched (it has to, since the jobs are gated on etcd state), but the deployment host has not
[19:33:48] nod
[19:33:58] okay cool, tomorrow it is
[19:34:22] sounds good, thanks for your patience :)
[19:34:51] actually neat, that's a bug in the script, presently it assumes the WMFMasterDatacenter is always the one with an updated scap report.json
[19:34:57] so thanks for catching it :D
[19:35:42] if you have it handy, what's the source of truth on the active deployment host?
[19:35:57] I see there's a hiera key, is that canonical?
[19:36:29] oh, good catch - yeah, we're in the brief window where that does not hold
[19:36:44] yeah, it's a hiera key: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073894
[19:36:59] perfect thanks
[20:08:32] hmmm, except *right* after a switchover, the new deployment host won't have the freshest report.json either, until an image build's actually done there
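On the report.json wrinkle above, one conceivable workaround for assuming WMFMasterDatacenter would be to just ask both deployment hosts which copy is newer. A rough sketch only: the hostnames, the report path, and the whole approach are assumptions, not how mwscript-k8s actually works.

    # names are hypothetical; set REPORT_PATH to wherever scap writes report.json
    REPORT_PATH=/srv/deployment/report.json
    newest_host="" newest_ts=0
    for h in deploy1003.eqiad.wmnet deploy2002.codfw.wmnet; do
        ts=$(ssh "$h" stat -c %Y "$REPORT_PATH" 2>/dev/null || echo 0)
        if [ "$ts" -gt "$newest_ts" ]; then newest_ts=$ts; newest_host=$h; fi
    done
    echo "freshest report.json: ${newest_host:-none found}"

Even that would not fully close the gap, since right after a switchover neither host may have a fresh build yet, as noted at 20:08.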