[00:17:20] FIRING: [2x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:51:56] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:20] FIRING: [2x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:53:16] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:26:56] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:20] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:26:56] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:20] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:25:14] hello, I've been mugged coming back from lunch, handling the several issues that arise after this
[12:25:29] maybe mostly attacked than mugged, I just lost a sweater during the battle :D
[12:25:31] but still
[12:44:57] WTH?!?!?
[12:51:56] I'll go to a police station tomorrow or friday morning to sign my plaint + to the doc to take note of the small injuries I have
[13:11:40] what?
[13:11:50] ouch, take care of yourself!
[13:12:04] thanks akosiaris sorry for the worry!
[13:12:55] arnaudb: that sucks :( did they want money or what?
[13:13:59] I hope the injuries are not too bad
[13:14:41] nah I'm ok, I just accepted a few punches to let the conversation start but I ended it up quickly when I saw that it was just an angry young dude that had his ego bruised
[13:14:56] so minor injuries but thats about it :)
[13:15:38] I'm always baffled by how quick people are to hit eachother :x
[13:17:00] anyway, he mistook my sweatshirt with his :-( so I'm down one sweatshirt
[13:18:25] haha! the same happened with my umbrella yesterday, but at least it didn't involve punches :D
[13:18:55] thankfully this is not often it happens to that level of aggression :DD
[14:26:56] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:47:25] arnaudb: FYI, I'm doing recloning in s3 and I have wikidata thing running on s8, other sections are good to go
[14:48:00] ack, noted, I'll get to the schema tomorrow morning, everything is registered on the maintenance map?
[14:49:46] it is there now but since s8 alter tables take two days, it might not show up there
[14:57:43] ack, this is noted anyway :) thanks for the heads up
[15:42:20] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:28:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:42:09] from -operations: `PROBLEM - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running` - just want to confirm this is expected / ok, as db1154 is not pooled on any section AFAICT.
[18:42:09] and curiously, it's complaining about replication from db1212, which I see is under some kind of maintenance for T375652.
[18:42:11] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652
[18:58:39] hey swfrench-wmf, just saw your message!
[18:58:57] so its ok for db1154 to lag behind
[18:59:10] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[19:00:43] there is maintenance on s3
[19:00:57] and heartbeat showing a lag is normal also, behind the server under maintenance
[19:01:04] its weird that you were alerted here though
[19:04:00] yeap, confirmed: https://phabricator.wikimedia.org/P69527 https://phabricator.wikimedia.org/P69520
[19:04:28] you probably can downtime them until tomorrow
[19:05:54] thanks, arnaudb! yeah, I figured it's non-urgent since db1154 isn't pooled, but thought it was odd since db1212 isn't the primary for s3 ...
[19:06:07] in any case, many thanks for taking a look
[19:08:05] aaah, sanitarium ... duh :)
[19:08:44] I should have snooped dbctl earlier
[19:10:31] oh, and db1154 is not even in etcd, so presumably is in some odd state of turn up or down?
[19:11:16] and never mind again, it's just not managed in etcd at all :)
[19:11:46] anyway, nothing to see here I guess - thanks again
[19:42:33] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:32:49] PROBLEM - MariaDB sustained replica lag on s3 on db1154 is CRITICAL: 1786 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13313
[20:38:49] RECOVERY - MariaDB sustained replica lag on s3 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13313
[21:00:48] currently two s3 replicas in eqiad have been depooled so we can clone one from the other, I think that's why the existing ones were not happy
[22:31:54] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:42:33] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure