[02:05:48] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:05:48] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:20:20] that's disk sadness, I think DC-ops pulled the wrong drive (per T395990), but I'll silence for 24h to give them a chance to have another look
[07:20:21] T395990: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990
[07:22:21] While looking at alerts, there's a whole pile of 'MediaWiki periodic job purge-parsercache-pc2 failed' for pc2-6
[07:23:56] marostegui, Amir: can I start rolling reboots in s4 in codfw?
[07:32:20] Yep!
[08:02:25] ok, started
[09:27:16] yo, purge-parsercache failed for pc{2..6}, do you want me to relaunch the jobs, or just clear the failed jobs to reset alerting and let them run on the normal schedule?
[09:28:47] claime: I've done some operations today on the databases, which implied restarting them
[09:28:55] Can you relaunch them?
[09:29:00] marostegui: ofc
[09:29:10] claime: Thank you
[09:31:18] job.batch/purge-parsercache-pc2-202506050931 created
[09:31:20] job.batch/purge-parsercache-pc3-202506050931 created
[09:31:22] job.batch/purge-parsercache-pc4-202506050931 created
[09:31:24] job.batch/purge-parsercache-pc5-202506050931 created
[09:31:26] job.batch/purge-parsercache-pc6-202506050931 created
[09:31:55] thank you!
[09:32:03] I am surprised pc7 and pc8 didn't fail
[09:32:06] As I also worked on those
[09:32:43] pc8 is a short run
[09:32:46] about 54m
[09:32:54] yeah, makes sense, as it was set up like a month ago
[09:33:18] pc7 is about 5h, so it's weird
[09:33:58] Anyways, they're running now :)
[09:34:20] thank you
[09:35:57] no worries
[10:22:18] switching es4 and es5 masters in codfw
[11:48:01] Amir1: we planned to do this after the master flips so we do both the kernel updates and the column drop together https://phabricator.wikimedia.org/T391056 - anything I can do to speed this up?
[11:49:53] federico3: have you done the section reboots for those sections?
[11:50:02] both eqiad and codfw (replicas I mean)
[11:53:46] for some sections: 56328 in this order, so maybe I can create tasks for master flips in the same sequence while the reboots of the other sections are still going
[11:55:52] federico3: Another option is to reboot the candidate masters for the sections you want to do, manually, so you can proceed with the schema changes
[11:55:53] Up to you
[11:58:24] uhm I'm confused, afaik we want to flip the master<-->candidate in order to be able to do the kernel upgrade on the ex-master and then trigger the schema change.
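(Editor's note: a minimal sketch of how the purge-parsercache jobs relaunched at 09:31 above could be re-created from their Kubernetes CronJobs with kubectl. The "mw-cron" namespace and the exact CronJob names are assumptions for illustration, not taken from this log.)

    # Re-create one-off jobs from the (assumed) purge-parsercache CronJobs;
    # kubectl prints "job.batch/<name> created" for each, as seen at 09:31.
    for pc in pc2 pc3 pc4 pc5 pc6; do
        kubectl -n mw-cron create job \
            "purge-parsercache-${pc}-$(date +%Y%m%d%H%M)" \
            --from="cronjob/purge-parsercache-${pc}"
    done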
[11:58:54] Yeah, what I mean is, if you don't want to wait for all the sections to be completed, you can just upgrade the kernel on the candidate and do the switchover
[12:00:03] ah: so we change the sequence, doing the candidates first, then the flips, then everything else
[12:00:38] that's food for thought for future rollouts I think, as right now we are almost done
[12:18:21] federico3: Yeah, that is an approach if you want to unblock the schema change
[12:18:35] But if you've done some sections already, you can do those switches
[12:30:27] Amir1: I have moved ms1 hosts to parsercache and am double-checking everything before going for ms2
[12:30:37] <3 thank you
[12:51:39] w
[12:51:44] Great
[16:51:48] marostegui: Amir1: can I start s4 in codfw?
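(Editor's note: a rough sketch of the "upgrade the kernel on the candidate, then do the switchover" option discussed at 11:55-11:58 above, using only generic Debian/MariaDB commands. The host name is hypothetical and the real workflow presumably goes through the usual downtime/cookbook tooling not shown in this log.)

    ssh db2199.codfw.wmnet                                # hypothetical candidate master
    sudo apt-get update && sudo apt-get -y dist-upgrade   # pull in the new kernel packages
    sudo systemctl stop mariadb                           # stop MariaDB cleanly before rebooting
    sudo reboot
    # once the host is back up:
    uname -r                                              # confirm the new kernel booted
    sudo systemctl start mariadb
    sudo mysql -e 'START SLAVE; SHOW SLAVE STATUS\G' | grep -E 'Running:|Seconds_Behind_Master'
    # only after replication has caught up would the master switchover be scheduled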