[02:27:48] FIRING: PuppetFailure: Puppet has failed on ms-be2069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:27:48] FIRING: PuppetFailure: Puppet has failed on ms-be2069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:27:52] that's a dead disk
[08:37:51] I have depooled pc1
[09:38:23] T388373 opened, alert silenced
[09:38:24] T388373: Disk (sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373
[10:05:17] when gerrit fails to save something it also somehow blocks the ability to select and copy text...
[11:52:11] Emperor: Given that we are going to regenerate a lot of thumbnails, I'll be starting the clean up on eqiad starting tomorrow unless you object. Goodbye
[11:56:42] Goodbye?
[11:56:53] until I annoy you again :D
[11:57:24] heh. I was somewhat wondering (I think I said on phab somewhere) if we should hold off until after the switchover to check there aren't any surprises from the eqiad deletions?
[11:58:39] the thing is that by the time the eqiad switchover happens, only below 1% of thumbnails will be deleted. That's a rounding error
[11:58:53] OK...
[12:20:00] PROBLEM - MariaDB sustained replica lag on s4 on db1243 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[12:21:00] RECOVERY - MariaDB sustained replica lag on s4 on db1243 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[13:03:21] marostegui: for later I wonder if puppet needs a similar patch for db2230, db2185
[13:03:35] or if those are just under special cases
[13:04:39] I will send a patch at least for db_inventory
[14:06:13] Our ISP is having routing issues today, so dunno if I'll be able to join the meeting or not
[14:06:39] ( https://aastatus.net/42747 )
[15:16:51] _joe_: are you referring to using CAS? (the context is external atomicity in etcd v2)
[15:17:23] <_joe_> federico3: no, you can do quorum writes/reads, but I think we didn't understand each other
[15:35:47] _joe_: in my understanding quorum=true is implemented internally across the etcd nodes but the client only connects to one node. If it times out during a write (while the nodes are reaching quorum) the client is left not knowing if the write was successful or not. AFAIK this requires the client to implement retries with CAS
[15:36:14] <_joe_> what do you mean with "times out"?
[15:36:18] _joe_: if you want to chime in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124797
[15:36:40] <_joe_> that's vague
[15:37:43] <_joe_> federico3: what's the goal of that change? to replace dbctl?
[15:38:09] _joe_: times out as in the client reaches an internal timeout while waiting for confirmation from the node it connected to
[15:38:27] no, create a dbctl wrapper with locking
[15:38:46] <_joe_> federico3: and why not do it directly in dbctl?
[15:38:58] <_joe_> I just looked at the task and I'm even more confused
[15:39:12] <_joe_> in any case, I don't have time right now sorry :)
[15:39:20] <_joe_> I will take a better look later
[15:41:12] _joe_: I meant to point out the discussion on atomicity, not asking you to review the whole CR (but if you want you are very welcome to do so)
[15:41:37] <_joe_> federico3: to be clear, my confusion comes from the word "atomicity"
[15:41:46] <_joe_> which I wouldn't have used in that context
[18:50:07] PROBLEM - MariaDB sustained replica lag on s3 on db1198 is CRITICAL: 35 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[18:50:09] PROBLEM - MariaDB sustained replica lag on s3 on db1212 is CRITICAL: 34 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[18:54:07] RECOVERY - MariaDB sustained replica lag on s3 on db1198 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[18:54:09] RECOVERY - MariaDB sustained replica lag on s3 on db1212 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
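[editor's note on the etcd v2 exchange above (15:16-15:41): the "retries with CAS" pattern federico3 describes - a quorum write that times out is ambiguous until the key is re-read and the compare-and-swap is retried - can be sketched roughly as below. This is a minimal illustration only, assuming the classic python-etcd client for etcd v2; the key name, values, and exception handling are hypothetical and are not taken from dbctl or the linked cookbook change.]

```python
# Sketch (not dbctl code): retry an etcd v2 write with compare-and-swap after
# an ambiguous timeout, using the classic python-etcd client. Key and values
# are illustrative.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379, read_timeout=5)

def cas_write(key, new_value, expected_value, attempts=3):
    """Write new_value only if key still holds expected_value.

    A timed-out write is ambiguous: the client does not know whether it was
    applied, so it re-reads with quorum=True and retries the CAS.
    """
    for _ in range(attempts):
        try:
            client.write(key, new_value, prevValue=expected_value)
            return True
        except etcd.EtcdCompareFailed:
            # Key changed under us - possibly by our own timed-out attempt.
            return client.read(key, quorum=True).value == new_value
        except etcd.EtcdConnectionFailed:
            # Ambiguous outcome: check whether the write actually landed.
            current = client.read(key, quorum=True).value
            if current == new_value:
                return True
            if current != expected_value:
                return False  # a concurrent writer won; give up
            # Key still holds the old value: the write was lost, retry.
    return False
```

[the specific client does not matter; the point of the discussion is that a timed-out quorum write stays ambiguous until it is verified by a quorum read and, if needed, retried as a CAS.]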