[07:48:56] I'll start running s7 codfw switchover T373175
[07:48:57] T373175: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T373175
[07:58:41] 10 min to move all s7 codfw replicas, Amir1: kudos x)
[08:13:47] T373330 WIP
[08:13:47] T373330: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T373330
[08:37:32] going for T374086
[08:37:33] T374086: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T374086
[08:45:01] good morning
[08:46:02] good morning jynus and welcome back :-)
[10:09:41] no db backups failing since August! I should go away more often!
[10:17:29] haha monitoring them was smooth sailing x)
[10:19:18] you seemed to do a great job
[10:20:31] idk if having nothing to do because it's fully green qualifies as a "job" haha
[10:20:34] also it was probably a good idea not to do any last-minute upgrade/maintenance/decommission that would have caused issues, even if that added some blocking for outdated hosts
[10:20:57] (os upgrade, package upgrade, etc.)
[10:20:59] yeah there was little to no event before, so no ripple effect
[10:21:22] and I saw that backup1011 still has 14 TB left
[11:01:13] arnaudb: once you're done, let me know, I have some schema changes to do there :D
[11:53:22] sure!
[15:57:15] what does it mean when bacula says "data is waiting on max Storage jobs"? is bacula just busy?
[15:57:39] sounds like all it means is "wait a bit please"
[15:57:44] there is a limit in concurrency
[15:58:10] we noticed the "freshness is stale" alert for host gerrit1003
[15:58:18] I checked bconsole and that is what I see
[15:58:26] sounds like no reason to be concerned
[15:58:31] so it doesn't overload and, hopefully, gets the optimal amount of parallel jobs
[15:58:48] since it still has a full backup and will just do it when it gets to it
[15:58:57] yeah, first week of the month usually gets a bit clogged
[15:59:09] ack, makes sense.
[15:59:12] I would like to optimize it a bit, but it is never a priority
[15:59:30] yea, it seemed like it, thanks for confirming. no reason for concern then
[15:59:32] if it stays like that tomorrow, I will give it a look
[15:59:36] ok, thanks
[15:59:43] also, welcome back.
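
A rough illustration of what "data is waiting on max Storage jobs" amounts to: Bacula caps how many jobs run at once (the Maximum Concurrent Jobs directive on the Director/Storage resources), and anything over the cap simply queues until a slot frees up. The Python sketch below models that queueing with a fixed-size worker pool; the limit of 2 and the job names are made up for the example, not the production values.

    # Not Bacula code: a rough analogy for the "waiting on max Storage jobs" state.
    # Bacula limits concurrency via its "Maximum Concurrent Jobs" directive;
    # jobs over the cap just wait in a queue, nothing has failed.
    # The cap of 2 and the job names below are hypothetical.
    import time
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_JOBS = 2  # hypothetical cap, like "Maximum Concurrent Jobs = 2"

    def run_backup(job_name: str) -> str:
        # Stand-in for a real backup job writing to the storage daemon.
        time.sleep(1)
        return f"{job_name}: OK"

    jobs = ["gerrit-hourly", "db-dump-s7", "gitlab-full", "host-a", "host-b"]

    # Only MAX_CONCURRENT_JOBS jobs run at a time; the rest sit queued
    # (the equivalent of "waiting on max Storage jobs") until a slot frees.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
        for result in pool.map(run_backup, jobs):
            print(result)

Raising the cap trades queueing delay for load on the backup host, which is the balance described above: not overloading while still getting a useful amount of parallel jobs.
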
[15:59:50] sorry for bringing this up on your first day, if it is
[15:59:53] in general, and unless I am away, you can ignore those
[15:59:58] ok :)
[16:00:03] they are alerts but should be just dashboards
[16:00:31] yep, and the only reason we notice is because sometimes we have "alert review" in our team meeting where we actively click through all the boards
[16:00:34] sadly, because gerrit has an hourly frequency, it is the most noisy one
[16:00:52] I see
[16:02:27] I think the real fix would be to try to make other backups limit their total size, i.e. do them in smaller chunks
[16:03:38] *nod*
[16:04:01] I may need to do a deeper analysis of resource allocation, it is always an uphill climb (full backups get larger, never smaller)
[16:04:28] there is hw ready to add a second general backup host, which should double resources and concurrency
[16:04:43] sounds good :)
[16:04:50] as we added, among others, gitlab, but capacity increments happen in whole hosts
[16:05:23] so it is like a stair: we are a bit behind until we add more resources, and then we will probably be overprovisioned for some years
[16:06:04] backup1010 is the one that should soon be set up for gitlab and others
[19:12:25] FIRING: SystemdUnitFailed: ferm.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:25] RESOLVED: SystemdUnitFailed: ferm.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
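
For context on the "freshness is stale" alert discussed earlier: it boils down to comparing the age of the newest successful backup for a host against the job's expected frequency, which is why an hourly job like gerrit's goes "stale" (and so gets noisy) much faster than a daily or weekly one. A minimal sketch of that kind of check, with made-up hosts, timestamps and thresholds rather than the real alerting pipeline, which would read the newest good job from the backup metadata:

    # Illustrative only: the idea behind a "freshness is stale" check.
    # Host names, timestamps and thresholds are hypothetical sample data.
    from datetime import datetime, timedelta, timezone

    now = datetime.now(timezone.utc)

    # Expected job frequency per host (hypothetical).
    expected_frequency = {
        "gerrit-host": timedelta(hours=1),   # hourly jobs go stale fast, hence the noise
        "gitlab-host": timedelta(days=1),
    }
    # Newest successful backup per host (hypothetical; a real check would
    # query the backup catalog for this).
    last_success = {
        "gerrit-host": now - timedelta(hours=5),
        "gitlab-host": now - timedelta(hours=20),
    }

    GRACE = 2  # tolerate a couple of missed runs before alerting

    for host, freq in expected_frequency.items():
        age = now - last_success[host]
        if age > GRACE * freq:
            print(f"{host}: freshness is stale (last good backup {age} ago)")
        else:
            print(f"{host}: fresh enough")
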