[01:14:29] switchover updates: this afternoon, we received some outbound port utilization alerts for cr2-codfw, due to modest spikes in egress on top of the large increase with ulsfo and eqiad depooled.
[01:14:29] given that this was likely to repeat throughout the next few hours, we made the decision to repool eqiad (done at 00:41).
[01:14:29] one potential concern was stress on eqiad <> codfw transport links while A/A services are depooled in eqiad. however, given that swift was repooled there yesterday, we considered this unlikely (i.e., upload.w.o stays in eqiad).
[01:14:29] in any case, the main point of note is: *do not* start disruptive maintenance in eqiad that might impact the CDN, as it's pooled.
[01:45:17] ^ cross-posted to https://phabricator.wikimedia.org/T370962#10178173 as well
[07:03:47] hello oncallers
[07:04:06] I am going to depool and start the reimage of the remaining docker registry nodes on buster
[07:16:39] <_joe_> please do
[07:34:35] deployment hosts are back to using stunnel for rsyncing between them. I had to restart stunnel manually to pick up the new puppet 7 cert, but otherwise we are going. Tests worked fine
[07:44:05] <_joe_> nice
[08:30:30] XioNoX: topranks: double check me on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075852 please?
[08:30:49] and I'll also do a full cross-check that we don't miss other stuff later
[08:33:00] akosiaris: lgtm!
[08:33:49] thanks!
[08:36:46] akosiaris: not the best but at least it's somewhere https://wikitech.wikimedia.org/wiki/Network_equipment_lifecycle#Provisioning
[08:46:06] thanks!
[08:48:10] registry1004 up and running, proceeding with 2004 as well
[08:48:23] I repooled registry2003 in the meantime
[08:48:35] (so we have 2003/5 while 2004 is reimaging)
[09:31:59] all right, docker registry VMs are all on bookworm, and pooled
[09:32:10] going to decom the old VMs (1003/2003)
[09:41:47] <_joe_> \o/
[09:53:16] all decommed :)
[10:21:34] I created a tmux session on registry1004 to start the dry run of the GC process, all documented in https://phabricator.wikimedia.org/T375645#10176397
[10:21:49] to stop it, it's enough to attach to the tmux session and kill the process
[10:24:29] so far everything seems fine, even resource-consumption-wise (the job is going through the mark step of mark/sweep; we'll see how long it takes)
[10:36:03] something I just discovered - https://docker-registry.wikimedia.org/ is returning 403s half the time, since on registry2004 (just reimaged) the /srv/homepage dir is still being built
[10:36:17] I didn't know about it, otherwise I'd have waited to kill the old VM
[10:36:29] it should hopefully be fixed within the next couple of hours
[10:36:37] need to step away but I'll check after lunch
[14:41:06] FYI, in about 20 minutes, we'll start the switch of the deployment server from eqiad to codfw (deploy1003 to deploy2002). I'll mainly be posting updates in -operations.
[14:46:04] should we long-term silence cloudsw* port saturation pages until we know all of the Ceph maintenance is done?
[14:47:15] why would the mgmt side complain?
[14:47:32] is it because of shared hw?
[14:49:54] jynus: it is the DNS name of the address for the production router's endpoint on the management network
[14:50:37] ah, gotcha
[14:51:24] cdanis: +1
[14:52:16] XioNoX: is it better to do that in librenms than in alertmanager, do you think?
[14:52:39] cdanis: I can exclude cloudsw from that alert altogether
[14:52:53] +1 from me, although I guess we should double-check with them
[14:53:06] and/or create another one just for cloudsw that is non-paging
[14:53:17] cdanis: yeah, we had a quick chat already (cc dcaro)
[14:53:21] ah cool
[14:53:23] ty
[14:57:35] done
[14:58:18] dcaro, arturo, I can also make the alert go to WMCS if there is a dedicated alertmanager team I can add to https://librenms.wikimedia.org/alert-transports
[14:58:39] wmcs
[14:59:00] that'd be nice, yes
[15:01:02] dcaro: I sent a test notification, did it work?
[15:01:21] yes
[15:01:24] https://usercontent.irccloud-cdn.com/file/x3NlBvgW/image.png
[15:01:31] nice!
[15:01:41] those can be silenced in the same way as the rest, right?
[15:01:56] FYI, deployment server switch is starting now
[15:02:02] dcaro: yeah
[15:02:16] awesome :)
[15:03:49] all done, so now those alerts will go to Netops + WMCS
[15:09:55] hello folks, not sure if you are reading #private, but please don't commit anything to puppet private
[15:16:39] ^ done now, can commit
[16:30:42] FYI, deployment server switch is done.
[18:27:04] we're active in CODFW only now, right? Just wondering, as WDQS hosts' load is suspiciously low in CODFW https://grafana.wikimedia.org/goto/8QofJ_gNR?orgId=1
[18:29:19] no, eqiad was repooled last night
[18:30:23] at the same time you see the divergence, so that matches :]
[18:30:46] ah ok, so we're active/active now?
[18:33:42] ah, just saw Scott's email...will subscribe/follow phab task
[18:34:28] (and yes, to your previous question)
[18:36:50] Thanks, I've been out and it appears I missed some important emails ;P
[19:46:39] inflatador: as usual, things are a bit complicated :) we repooled eqiad for edge traffic yesterday, after observing some issues with high port utilization on cr* in codfw.
[19:46:39] for most services, that should not really affect much, as we did not repool eqiad for the purposes of discovery.
[19:46:39] however, I believe there may be some wdqs services that have historically been opted out of the services switchover. let me dig up that list.
[19:48:43] swfrench-wmf no worries. If we assume that eqiad absorbs most of wdqs traffic when it's pooled (which it does), the graph looks normal
[19:48:51] inflatador: here's the exclusion list, which includes wdqs-ssl (i.e., wdqs.discovery.wmnet): https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/datacenter.py#27
[19:49:42] thanks! we should probably get rid of those exceptions...I think we are better provisioned now, but will check w/team
[19:51:02] awesome, that would be great if these can have their opt-outs removed :)
[19:51:40] yeah, there was def a cascading failure then, pretty sure we are over that now
[19:56:22] swfrench-wmf created T375793 and tagged you...we'll let you know what we find
[19:56:22] T375793: Review whether or not WDQS needs to be exempted from DC failover - https://phabricator.wikimedia.org/T375793
[19:57:18] awesome, thank you!
[20:01:05] np, thanks for the ping back
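For readers unfamiliar with the "exclusion list" discussed at the end: the datacenter switchover cookbook skips a handful of discovery services that are opted out of the normal repool/depool flow. Below is a minimal Python sketch of how such an opt-out check might look, assuming a simple set-based filter; this is not the actual contents of cookbooks/sre/discovery/datacenter.py, and apart from "wdqs-ssl" (named in the chat above) every name is hypothetical.

```python
# Illustrative sketch only -- not the real contents of
# cookbooks/sre/discovery/datacenter.py. Apart from "wdqs-ssl",
# mentioned in the discussion above, all names here are hypothetical.

# Discovery services opted out of the datacenter switchover: the cookbook
# would skip these when repooling/depooling discovery records for the rest.
EXCLUDED_SERVICES = frozenset({
    "wdqs-ssl",  # wdqs.discovery.wmnet, per the exclusion list linked above
})


def services_to_switch(all_services):
    """Return the services the switchover should act on, dropping opt-outs."""
    return [svc for svc in all_services if svc not in EXCLUDED_SERVICES]


if __name__ == "__main__":
    # wdqs-ssl stays pinned to its current DC while the others follow the switch.
    print(services_to_switch(["swift", "restbase", "wdqs-ssl"]))
```

Removing an opt-out, as proposed in T375793, would then amount to dropping the service from that set so it follows the normal active/active switchover.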