[00:59:10] !incidents
[00:59:10] 5062 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[00:59:10] 5061 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule)
[00:59:11] 5063 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[00:59:11] 5058 (RESOLVED) db1241 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5057 (RESOLVED) db1243 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5053 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5056 (RESOLVED) db1249 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:12] 5055 (RESOLVED) db1242 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:12] 5050 (RESOLVED) db1238 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:13] 5054 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:13] 5052 (RESOLVED) db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:14] 5051 (RESOLVED) db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:14] 5060 (RESOLVED) db1190 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:15] 5059 (RESOLVED) db1199 (paged)/MariaDB Replica Lag: s4 (paged)
[06:02:48] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet this doesn't look healthy
[06:08:28] So this seems to be recovering https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops
[06:11:49] Anyone from data engineering around to help with the eventstreams issue?
[06:16:43] We got the resolve, but I don't know why, as I am still getting 503s from some requests to eventstreams
[13:52:58] on-callers: Traffic is going to start deploying T369366. We plan to finish the transition from Git to confd fully. There should be no issues, but please take note.
[13:52:58] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[14:17:54] sukhe: \o/
[15:31:15] marostegui: just saw your comment, any idea what happened?
[15:31:49] ottomata: No, it recovered on its own before I could find anything
[15:32:37] huh
[15:33:08] looks like only in eqiad
[15:39:55] a guess: looks like a single client opened a lot of connections all at once. There should be some per-client-IP throttling in each eventstreams instance. I do see 429: too_many_requests in logstash, but I am not sure why this would cause 500s. Did it take down the service? Hm.
[15:39:55] https://grafana.wikimedia.org/goto/v8VcWhCIg?orgId=1
[15:39:56] https://logstash.wikimedia.org/goto/7a89b4dbe020d374ff9ec258a624a541
[15:40:10] I guess it's fine, but if it happens again let's look more. Thanks marostegui
[17:06:33] depooling ulsfo for the live test of the sre.dns.admin cookbook. will pool back after a while.
[17:22:50] pooled back. we are done for today.
[17:26:13] {◕ ◡ ◕}
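
Note on the 00:59 MariaDB Replica Lag pages: lag on a MariaDB replica can be read from SHOW SLAVE STATUS (the Seconds_Behind_Master column). A minimal sketch using pymysql; the host and credentials below are placeholders, not production values, and the real alerting goes through the usual monitoring stack rather than an ad-hoc script.

    import pymysql  # assumed available; any MariaDB/MySQL client library works the same way

    # Placeholder connection details -- not the actual production credentials.
    conn = pymysql.connect(host="db1241.eqiad.wmnet", user="check", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            # Seconds_Behind_Master is NULL if replication is stopped or broken.
            lag = status["Seconds_Behind_Master"] if status else None
            print(f"replica lag: {lag}s")
    finally:
        conn.close()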
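
Note on the 06:02-06:16 eventstreams 503s: a quick client-side way to reproduce what was being reported is to hit the public EventStreams SSE endpoint and check the status code. A minimal sketch with requests; the recentchange stream is just one example stream name.

    import requests  # assumed available

    # Public EventStreams endpoint; recentchange is one of the published streams.
    url = "https://stream.wikimedia.org/v2/stream/recentchange"
    resp = requests.get(url, stream=True, timeout=10)
    print("status:", resp.status_code)  # a 503 here would match what was reported
    if resp.ok:
        # Read a handful of Server-Sent Events lines to confirm data is flowing, then stop.
        for i, line in enumerate(resp.iter_lines()):
            print(line[:120])
            if i > 10:
                break
    resp.close()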
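
Note on the 15:39 theory (a single client opening many connections at once, 429 too_many_requests in logstash): EventStreams itself is a Node.js service and its actual limiter isn't shown in the log; the Python sketch below only illustrates the per-client-IP concurrent-connection cap being described, with an arbitrary limit.

    from collections import defaultdict

    MAX_CONNS_PER_IP = 2        # illustrative limit only, not the production value
    active = defaultdict(int)   # client IP -> currently open connections

    def on_connect(client_ip: str) -> int:
        """Return an HTTP status for a new connection attempt."""
        if active[client_ip] >= MAX_CONNS_PER_IP:
            return 429  # too_many_requests: reject rather than let one client exhaust workers
        active[client_ip] += 1
        return 200

    def on_disconnect(client_ip: str) -> None:
        active[client_ip] = max(0, active[client_ip] - 1)

    # Example: a single client hammering the service gets throttled once over the cap.
    for _ in range(5):
        print(on_connect("198.51.100.7"))  # 200, 200, 429, 429, 429

Throttling explains the 429s in logstash; whether and how it relates to the 503s seen at the edge is exactly the open question left in the log.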
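
Note on T369366 and the 17:06 sre.dns.admin live test: the change moves the per-site pooled/depooled state out of a committed file in operations/dns and into conftool/etcd, so an operator flips a runtime key instead of merging a patch. The exact confctl object type and cookbook arguments aren't in the log; the sketch below is purely conceptual, and all names in it (site_state, depool_site, pooled_sites) are made up for illustration.

    # Conceptual only: models site admin state as a runtime key-value store
    # instead of a file in git. Names are hypothetical.
    site_state = {"eqiad": True, "codfw": True, "esams": True,
                  "ulsfo": True, "eqsin": True, "drmrs": True}

    def depool_site(site: str) -> None:
        site_state[site] = False  # in the real system this would be a confctl/etcd write

    def pool_site(site: str) -> None:
        site_state[site] = True

    def pooled_sites() -> list[str]:
        # The DNS layer would only hand out records for sites marked as pooled.
        return [s for s, pooled in site_state.items() if pooled]

    depool_site("ulsfo")    # analogous to the 17:06 live test
    print(pooled_sites())   # ulsfo no longer receives user traffic
    pool_site("ulsfo")      # 17:22: pooled back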