[02:33:04] FIRING: PuppetFailure: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:36:50] FIRING: DiskSpace: Disk space restbase1035:9100:/ 0.2299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=restbase1035 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:33:04] FIRING: PuppetFailure: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:36:50] FIRING: DiskSpace: Disk space restbase1035:9100:/ 0.04339% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=restbase1035 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:00:37] Hopefully u.random can look at that a bit later today
[10:06:42] Do we have a good answer for issues like T413733 (where the object is missing from swift, but the user has the original but uploading won't work because the file already exists)? Do they always have to upload something else as an intermediate version and then try again with the original-original? ISTR Amir1 using a test jpg before...
[10:06:42] T413733: PDF does not exist - https://phabricator.wikimedia.org/T413733
[10:33:04] FIRING: PuppetFailure: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:36:50] FIRING: DiskSpace: Disk space restbase1035:9100:/ 0.5734% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=restbase1035 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:48:06] Yes, MW doesn't allow it. One thing that could happen is that the upload code could simply retry before erroring out and leaving the mw in a broken state in the first place which would automatically prevent people from needing to upload the file again
[11:50:26] that would be nice; or maybe an admin-only "I know what I'm doing, let me re-upload this" box
[11:50:32] (the latter might be less work)
[11:51:26] Amir1: while I'm bothering you, do you have a couple of ticks to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1223168 please? I should have done that before Christmas :-/
[11:52:50] {{done}}
[11:53:29] TY :)
[14:33:04] FIRING: PuppetFailure: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:33:45] urandom: are you around today to look at restbase1035?
[15:09:30] Emperor: I'll have a look; Thanks
[15:09:40] 👍
[15:29:09] RESOLVED: DiskSpace: Disk space restbase1035:9100:/ 0.2927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=restbase1035 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:29:25] FIRING: SystemdUnitFailed: cassandra-c.service on restbase1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:25] RESOLVED: SystemdUnitFailed: cassandra-c.service on restbase1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:52] this is/was...unfortunate. /dev/sdc failed, it is mounted as /srv/sdc4, and when the disk disappeared, it started writing to (and filling up) the root device
[15:36:10] can't believe this is the first time in all of these years.
[15:39:51] I have vague memories of swift doing that back in the day, but it isn't quite so dumb now
[17:29:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:09] urandom: there's still an alert for puppet failing on restbase1035 - maybe silence it a bit if it's going to be sad until the disk is swapped?
[17:32:21] (I spotted that while going to silence the rclone alert, which I'll look at tomorrow)
[17:32:38] Emperor: will do
[17:33:17] TY :)
[19:59:28] Amir/Emperor: MediaWiki could do "oh, while I thought I really had this file it isn't really in Swift, but this is the very same I wanted to have" and automatically repair it