[00:59:10] !incidents
[00:59:10] 5062 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[00:59:10] 5061 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule)
[00:59:11] 5063 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[00:59:11] 5058 (RESOLVED) db1241 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5057 (RESOLVED) db1243 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5053 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:11] 5056 (RESOLVED) db1249 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:12] 5055 (RESOLVED) db1242 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:12] 5050 (RESOLVED) db1238 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:13] 5054 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:13] 5052 (RESOLVED) db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:14] 5051 (RESOLVED) db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:14] 5060 (RESOLVED) db1190 (paged)/MariaDB Replica Lag: s4 (paged)
[00:59:15] 5059 (RESOLVED) db1199 (paged)/MariaDB Replica Lag: s4 (paged)
[06:02:48] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet this doesn't look healthy
[06:08:28] So this seems to be recovering https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops
[06:11:49] Anyone from data engineering around to help with the eventstreams issue?
[06:16:43] We got the resolve, but I don't know why, as I am still getting 503s from some requests to eventstreams
[13:52:58] on-callers: Traffic is going to start deploying T369366. We plan to finish the transition from Git to confd fully. There should be no issues, but please take note.
[13:52:58] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[14:17:54] sukhe: \o/
[15:31:15] marostegui: just saw your comment, any idea what happened?
[15:31:49] ottomata: No, it recovered on its own before I could find anything
[15:32:37] huh
[15:33:08] looks like only in eqiad
[15:39:55] a guess: looks like a single client opened a lot of connections all at once. There should be some per-client-IP throttling in each eventstreams instance. I do see 429: too_many_requests in logstash, but I am not sure why this would cause 500s. Did it take down the service? Hm.
[15:39:55] https://grafana.wikimedia.org/goto/v8VcWhCIg?orgId=1
[15:39:56] https://logstash.wikimedia.org/goto/7a89b4dbe020d374ff9ec258a624a541
[15:40:10] I guess it's fine, but if it happens again let's look more. Thanks marostegui
[17:06:33] depooling ulsfo for the live test of the sre.dns.admin cookbook. will pool back after a while.
[17:22:50] pooled back. we are done for today.
[17:26:13] {◕ ◡ ◕}
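
Note on the 00:59 MariaDB Replica Lag pages: lag on a MariaDB replica can be read from SHOW SLAVE STATUS (the Seconds_Behind_Master column). A minimal sketch using pymysql; the host and credentials below are placeholders, not production values, and the real alerting goes through the usual monitoring stack rather than an ad-hoc script.

    import pymysql  # assumed available; any MariaDB/MySQL client library works the same way

    # Placeholder connection details -- not the actual production credentials.
    conn = pymysql.connect(host="db1241.eqiad.wmnet", user="check", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            # Seconds_Behind_Master is NULL if replication is stopped or broken.
            lag = status["Seconds_Behind_Master"] if status else None
            print(f"replica lag: {lag}s")
    finally:
        conn.close()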
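
Note on the 06:02-06:16 eventstreams 503s: a quick client-side way to reproduce what was being reported is to hit the public EventStreams SSE endpoint and check the status code. A minimal sketch with requests; the recentchange stream is just one example stream name.

    import requests  # assumed available

    # Public EventStreams endpoint; recentchange is one of the published streams.
    url = "https://stream.wikimedia.org/v2/stream/recentchange"
    resp = requests.get(url, stream=True, timeout=10)
    print("status:", resp.status_code)  # a 503 here would match what was reported
    if resp.ok:
        # Read a handful of Server-Sent Events lines to confirm data is flowing, then stop.
        for i, line in enumerate(resp.iter_lines()):
            print(line[:120])
            if i > 10:
                break
    resp.close()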
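
Note on the 15:39 theory (a single client opening many connections at once, 429 too_many_requests in logstash): EventStreams itself is a Node.js service and its actual limiter isn't shown in the log; the Python sketch below only illustrates the per-client-IP concurrent-connection cap being described, with an arbitrary limit.

    from collections import defaultdict

    MAX_CONNS_PER_IP = 2        # illustrative limit only, not the production value
    active = defaultdict(int)   # client IP -> currently open connections

    def on_connect(client_ip: str) -> int:
        """Return an HTTP status for a new connection attempt."""
        if active[client_ip] >= MAX_CONNS_PER_IP:
            return 429  # too_many_requests: reject rather than let one client exhaust workers
        active[client_ip] += 1
        return 200

    def on_disconnect(client_ip: str) -> None:
        active[client_ip] = max(0, active[client_ip] - 1)

    # Example: a single client hammering the service gets throttled once over the cap.
    for _ in range(5):
        print(on_connect("198.51.100.7"))  # 200, 200, 429, 429, 429

Throttling explains the 429s in logstash; whether and how it relates to the 503s seen at the edge is exactly the open question left in the log.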
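
Note on T369366 and the 17:06 sre.dns.admin live test: the change moves the per-site pooled/depooled state out of a committed file in operations/dns and into conftool/etcd, so an operator flips a runtime key instead of merging a patch. The exact confctl object type and cookbook arguments aren't in the log; the sketch below is purely conceptual, and all names in it (site_state, depool_site, pooled_sites) are made up for illustration.

    # Conceptual only: models site admin state as a runtime key-value store
    # instead of a file in git. Names are hypothetical.
    site_state = {"eqiad": True, "codfw": True, "esams": True,
                  "ulsfo": True, "eqsin": True, "drmrs": True}

    def depool_site(site: str) -> None:
        site_state[site] = False  # in the real system this would be a confctl/etcd write

    def pool_site(site: str) -> None:
        site_state[site] = True

    def pooled_sites() -> list[str]:
        # The DNS layer would only hand out records for sites marked as pooled.
        return [s for s, pooled in site_state.items() if pooled]

    depool_site("ulsfo")    # analogous to the 17:06 live test
    print(pooled_sites())   # ulsfo no longer receives user traffic
    pool_site("ulsfo")      # 17:22: pooled back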