[07:48:56] I'll start running s7 codfw switchover T373175
[07:48:57] T373175: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T373175
[07:58:41] 10 min to move all s7 codfw replicas, Amir1: kudos x)
[08:13:47] T373330 WIP
[08:13:47] T373330: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T373330
[08:37:32] going for T374086
[08:37:33] T374086: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T374086
[08:45:01] good morning
[08:46:02] good morning jynus and welcome back :-)
[10:09:41] no db backups failing since August! I should go away more often!
[10:17:29] haha monitoring them was smooth sailing x)
[10:19:18] you seemed to do a great job
[10:20:31] idk if having nothing to do because it's fully green qualifies as a "job" haha
[10:20:34] also it was probably a good idea not to do any last-minute upgrade/maintenance/decommission that would have caused issues, even if that added some blocking for outdated hosts
[10:20:57] (os upgrade, package upgrade, etc.)
[10:20:59] yeah there was little to no event before, so no ripple effect
[10:21:22] and I saw that backup1011 still has 14 TB left
[11:01:13] arnaudb: once you're done, let me know, I have some schema changes to do there :D
[11:53:22] sure!
[15:57:15] what does it mean when bacula says "data is waiting on max Storage jobs"? is bacula just busy?
[15:57:39] sounds like all it means is "wait a bit please"
[15:57:44] there is a limit in concurrency
[15:58:10] we noticed the "freshness is stale" alert for host gerrit1003
[15:58:18] I checked bconsole and that is what I see
[15:58:26] sounds like no reason to be concerned
[15:58:31] so it doesn't overload and, hopefully, gets the optimal amount of parallel jobs
[15:58:48] since it still has a full backup and will just do it when it gets to it
[15:58:57] yeah, first week of the month usually gets a bit clogged
[15:59:09] ack, makes sense.
[15:59:12] I would like to optimize it a bit, but it is never a priority
[15:59:30] yea, it seemed like it, thanks for confirming. no reason for concern then
[15:59:32] if it stays like that tomorrow, I will give it a look
[15:59:36] ok, thanks
[15:59:43] also, welcome back.
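
A rough illustration of what "data is waiting on max Storage jobs" amounts to: Bacula caps how many jobs run at once (the Maximum Concurrent Jobs directive on the Director/Storage resources), and anything over the cap simply queues until a slot frees up. The Python sketch below models that queueing with a fixed-size worker pool; the limit of 2 and the job names are made up for the example, not the production values.

    # Not Bacula code: a rough analogy for the "waiting on max Storage jobs" state.
    # Bacula limits concurrency via its "Maximum Concurrent Jobs" directive;
    # jobs over the cap just wait in a queue, nothing has failed.
    # The cap of 2 and the job names below are hypothetical.
    import time
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_JOBS = 2  # hypothetical cap, like "Maximum Concurrent Jobs = 2"

    def run_backup(job_name: str) -> str:
        # Stand-in for a real backup job writing to the storage daemon.
        time.sleep(1)
        return f"{job_name}: OK"

    jobs = ["gerrit-hourly", "db-dump-s7", "gitlab-full", "host-a", "host-b"]

    # Only MAX_CONCURRENT_JOBS jobs run at a time; the rest sit queued
    # (the equivalent of "waiting on max Storage jobs") until a slot frees.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
        for result in pool.map(run_backup, jobs):
            print(result)

Raising the cap trades queueing delay for load on the backup host, which is the balance described above: not overloading while still getting a useful amount of parallel jobs.
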
[15:59:50] sorry for bringing this up on your first day, if it is
[15:59:53] in general, and unless I am away, you can ignore those
[15:59:58] ok :)
[16:00:03] they are alerts but should be just dashboards
[16:00:31] yep, and the only reason we notice is because sometimes we have "alert review" in our team meeting where we actively click through all the boards
[16:00:34] sadly, because gerrit has an hourly frequency, it is the most noisy one
[16:00:52] I see
[16:02:27] I think the real fix would be to try to make other backups limit their total size, i.e. do them in smaller chunks
[16:03:38] *nod*
[16:04:01] I may need to do a deeper analysis of resource allocation, it is always an uphill climb (full backups get larger, never smaller)
[16:04:28] there is hw ready to add a second general backup host, which should double resources and concurrency
[16:04:43] sounds good :)
[16:04:50] as we added, among others, gitlab, but capacity increments happen in whole hosts
[16:05:23] so it is like a stair: we are a bit behind until we add more resources, and then we will probably be overprovisioned for some years
[16:06:04] backup1010 is the one that should soon be set up for gitlab and others
[19:12:25] FIRING: SystemdUnitFailed: ferm.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:25] RESOLVED: SystemdUnitFailed: ferm.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
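
For context on the "freshness is stale" alert discussed earlier: it boils down to comparing the age of the newest successful backup for a host against the job's expected frequency, which is why an hourly job like gerrit's goes "stale" (and so gets noisy) much faster than a daily or weekly one. A minimal sketch of that kind of check, with made-up hosts, timestamps and thresholds rather than the real alerting pipeline, which would read the newest good job from the backup metadata:

    # Illustrative only: the idea behind a "freshness is stale" check.
    # Host names, timestamps and thresholds are hypothetical sample data.
    from datetime import datetime, timedelta, timezone

    now = datetime.now(timezone.utc)

    # Expected job frequency per host (hypothetical).
    expected_frequency = {
        "gerrit-host": timedelta(hours=1),   # hourly jobs go stale fast, hence the noise
        "gitlab-host": timedelta(days=1),
    }
    # Newest successful backup per host (hypothetical; a real check would
    # query the backup catalog for this).
    last_success = {
        "gerrit-host": now - timedelta(hours=5),
        "gitlab-host": now - timedelta(hours=20),
    }

    GRACE = 2  # tolerate a couple of missed runs before alerting

    for host, freq in expected_frequency.items():
        age = now - last_success[host]
        if age > GRACE * freq:
            print(f"{host}: freshness is stale (last good backup {age} ago)")
        else:
            print(f"{host}: fresh enough")
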