[02:27:48] FIRING: PuppetFailure: Puppet has failed on ms-be2069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:27:48] FIRING: PuppetFailure: Puppet has failed on ms-be2069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:27:52] that's a dead disk
[08:37:51] I have depooled pc1
[09:38:23] T388373 opened, alert silenced
[09:38:24] T388373: Disk (sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373
[10:05:17] when gerrit fails to save something it also somehow blocks the ability to select and copy text...
[11:52:11] Emperor: Given that we are going to regenerate a lot of thumbnails, I'll be starting the clean up on eqiad starting tomorrow unless you object. Goodbye
[11:56:42] Goodbye?
[11:56:53] until I annoy you again :D
[11:57:24] heh. I was somewhat wondering (I think I said on phab somewhere) if we should hold off until after the switchover to check there aren't any surprises from the eqiad deletions?
[11:58:39] the thing is that by the time the eqiad switchover happens, only below 1% of thumbnails will be deleted. That's a rounding error
[11:58:53] OK...
[12:20:00] PROBLEM - MariaDB sustained replica lag on s4 on db1243 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[12:21:00] RECOVERY - MariaDB sustained replica lag on s4 on db1243 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[13:03:21] marostegui: for later I wonder if puppet needs a similar patch for db2230, db2185
[13:03:35] or if those are just under special cases
[13:04:39] I will send a patch at least for db_inventory
[14:06:13] Our ISP is having routing issues today, so dunno if I'll be able to join the meeting or not
[14:06:39] ( https://aastatus.net/42747 )
[15:16:51] _joe_: are you referring to using CAS? (the context is external atomicity in etcd v2)
[15:17:23] <_joe_> federico3: no, you can do quorum writes/reads, but I think we didn't understand each other
[15:35:47] _joe_: in my understanding quorum=true is implemented internally across the etcd nodes but the client only connects to one node. If it times out during a write (while the nodes are reaching quorum) the client is left not knowing if the write was successful or not. AFAIK this requires the client to implement retries with CAS
[15:36:14] <_joe_> what do you mean with "times out"?
[15:36:18] _joe_: if you want to chime in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124797
[15:36:40] <_joe_> that's vague
[15:37:43] <_joe_> federico3: what's the goal of that change? to replace dbctl?
[15:38:09] _joe_: times out as in the client reaches an internal timeout while waiting for confirmation from the node it connected to
[15:38:27] no, create a dbctl wrapper with locking
[15:38:46] <_joe_> federico3: and why not do it directly in dbctl?
[15:38:58] <_joe_> I just looked at the task and I'm even more confused
[15:39:12] <_joe_> in any case, I don't have time right now sorry :)
[15:39:20] <_joe_> I will take a better look later
[15:41:12] _joe_: I meant to point out the discussion on atomicity, not asking you to review the whole CR (but if you want you are very welcome to do so)
[15:41:37] <_joe_> federico3: to be clear, my confusion comes from the word "atomicity"
[15:41:46] <_joe_> which I wouldn't have used in that context
[18:50:07] PROBLEM - MariaDB sustained replica lag on s3 on db1198 is CRITICAL: 35 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[18:50:09] PROBLEM - MariaDB sustained replica lag on s3 on db1212 is CRITICAL: 34 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[18:54:07] RECOVERY - MariaDB sustained replica lag on s3 on db1198 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[18:54:09] RECOVERY - MariaDB sustained replica lag on s3 on db1212 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
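[editor's note on the etcd v2 exchange above (15:16-15:41): the "retries with CAS" pattern federico3 describes - a quorum write that times out is ambiguous until the key is re-read and the compare-and-swap is retried - can be sketched roughly as below. This is a minimal illustration only, assuming the classic python-etcd client for etcd v2; the key name, values, and exception handling are hypothetical and are not taken from dbctl or the linked cookbook change.]

```python
# Sketch (not dbctl code): retry an etcd v2 write with compare-and-swap after
# an ambiguous timeout, using the classic python-etcd client. Key and values
# are illustrative.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379, read_timeout=5)

def cas_write(key, new_value, expected_value, attempts=3):
    """Write new_value only if key still holds expected_value.

    A timed-out write is ambiguous: the client does not know whether it was
    applied, so it re-reads with quorum=True and retries the CAS.
    """
    for _ in range(attempts):
        try:
            client.write(key, new_value, prevValue=expected_value)
            return True
        except etcd.EtcdCompareFailed:
            # Key changed under us - possibly by our own timed-out attempt.
            return client.read(key, quorum=True).value == new_value
        except etcd.EtcdConnectionFailed:
            # Ambiguous outcome: check whether the write actually landed.
            current = client.read(key, quorum=True).value
            if current == new_value:
                return True
            if current != expected_value:
                return False  # a concurrent writer won; give up
            # Key still holds the old value: the write was lost, retry.
    return False
```

[the specific client does not matter; the point of the discussion is that a timed-out quorum write stays ambiguous until it is verified by a quorum read and, if needed, retried as a CAS.]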