[06:52:27] Traffic, SRE: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (fgiunchedi)
[06:57:29] Traffic, SRE: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (ayounsi) p:Triage→Low Agreed something needs to be fixed. The upside is that it works as a safeguard, preventing the service from receiving live traffic before the fi...
[10:54:18] XioNoX: we're bringing some bgp sessions (k8s <-> core routers) down in codfw due to shutdown of k8s nodes. Should we just ack the corresponding alerts?
[10:54:35] that potentially hides real issues I guess :|
[10:54:37] cc jelto
[10:55:47] I'd recommend downtiming it for the duration of the maintenance, so if something doesn't come back up it will alert when everything should be good
[11:10:50] netops, Infrastructure-Foundations: Return AS43821 to RIPE - https://phabricator.wikimedia.org/T314471 (cmooney) Open→In progress p:Triage→Low
[11:11:00] XioNoX, jayme: so I'll create an icinga downtime for BGP status alerts until 5:00pm UTC (this is when maintenance should be done for all racks containing kubernetes nodes)?
[11:14:16] jelto: maybe add an additional 10min
[11:14:21] :)
[11:15:06] or 10min less so we can fix stuff that should be up before the end of the window :)
[11:15:07] my only concern was that downtiming that check will also hide bgp session errors from nodes that are in other racks
[13:26:56] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:21:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:04:02] (HAProxyEdgeTrafficDrop) firing: 44% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:04:55] XioNoX: I don't want you to feel unheard with the merging of the Africa DC CR - If you still have concerns with the work I'd love to hear them so I can follow up and perhaps improve what's already there.
[16:08:53] brett: btw I have some suggestions on how to use NEL to do some simple latency probing
[16:09:43] (from end user perspectives)
[16:10:00] brett: I let it sit as I was interested in hearing from more people about the topic
[16:10:48] cdanis: do you think we could see nel data in turnilo? :)
[16:11:00] or even just the latency data
[16:11:04] XioNoX: https://phabricator.wikimedia.org/T304373 :)
[16:11:52] nice!
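To make the NEL latency-probing idea above concrete: a minimal Python sketch of what it involves, assuming the standard W3C NEL mechanism. The report-body fields (`type`, `elapsed_time`, `server_ip`) and the header fields (`success_fraction`, `failure_fraction`, `report_to`) come from the NEL spec; the group name, intake URL, and everything else here are invented for illustration, not the actual Wikimedia setup.

```python
import json

# Response headers a probe hostname might send to opt the browser in to NEL.
# success_fraction=1.0 asks the browser to report successful fetches too;
# each success report's "elapsed_time" (milliseconds) is an end-user latency
# sample. Group name and intake URL are hypothetical.
PROBE_HEADERS = {
    "Report-To": json.dumps({
        "group": "nel-timing",
        "max_age": 86400,
        "endpoints": [{"url": "https://nel-intake.example.org/report"}],
    }),
    "NEL": json.dumps({
        "report_to": "nel-timing",
        "max_age": 86400,
        "success_fraction": 1.0,
        "failure_fraction": 1.0,
    }),
}

def latency_samples(reports):
    """Yield (server_ip, elapsed_ms) from a batch of decoded NEL reports."""
    for report in reports:
        body = report.get("body", {})
        if body.get("type") == "ok":  # a successful fetch, per the NEL spec
            yield body.get("server_ip"), body.get("elapsed_time")
```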
[16:12:35] in theory, with that done (or with some manual dumping from the data in logstash), it would not be too hard to also import some RIPE Atlas data into Hive
[16:12:54] and then write a bunch of pyspark that computes mappings aggregating all of that as we like
[16:13:11] not even sure we would need RIPE data if we have real user data
[16:13:25] yeah, agreed, although it would still be interesting to cross-check
[16:14:10] we can even vary the sampling fraction based on user country tbh
[16:16:39] yeah, this would be a great basis for building our own edge maps
[16:16:57] we've talked about doing it with client-side JS for years
[16:17:49] I'm guessing we'll still need at least some client-side JS to do a sampling-limited fetch across all edges.
[16:18:09] but if that fetch can report via NEL, that removes some of the other complexities
[16:19:31] if we made up some custom hostnames for the probing, perhaps we can configure NEL sampling differently that way too?
[16:19:53] nel-timing-ulsfo.wikimedia.org and such?
[16:21:24] step one is coming up with mechanisms like this to even have the data in analytics. step two is coming up with a processing pipeline that cares about how many samples we get from a network, and decaying weight of older samples, etc, to generate something like our geo-maps file (but for networks, not countries)
[16:28:01] bblack: yeah indeed -- we would still need to do some JS work, but it could just be probabilistically doing a fetch against a bunch of site-specific domains
[16:28:17] and then those site-specific domains could have both `failure_fraction` and `success_fraction` set to 1.0, with reports going into the normal NEL pipeline
[16:28:32] which could then hit analytics, and then from there, the processing pipeline like you say
[16:33:56] (HAProxyEdgeTrafficDrop) resolved: 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:14:52] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) >>! In T279664#8123041, @MatthewVernon wrote: > Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs...
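One possible shape for the "step two" processing pipeline described at [16:21:24]: weight per-network latency samples, decay older samples, require a minimum effective sample count per network, and pick the lowest-latency edge site per ASN, analogous to the geo-maps file but keyed by network. A hedged pyspark sketch only; the Hive table name, every column name, and the half-life and threshold values are all invented.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("edge-map-from-nel").getOrCreate()

# Hypothetical Hive table of per-request NEL latency samples, with columns
# asn, edge_site, elapsed_ms, sample_date.
samples = spark.table("event.nel_latency")

# Exponential decay: a sample loses half its weight every HALF_LIFE_DAYS,
# so older data fades out of the mapping gradually.
HALF_LIFE_DAYS = 14.0
weighted = samples.withColumn(
    "weight",
    F.pow(F.lit(0.5),
          F.datediff(F.current_date(), F.col("sample_date")) / HALF_LIFE_DAYS),
)

per_site = (
    weighted.groupBy("asn", "edge_site")
    .agg(
        # decay-weighted mean latency per (network, edge site)
        (F.sum(F.col("elapsed_ms") * F.col("weight")) / F.sum("weight"))
            .alias("latency_ms"),
        F.sum("weight").alias("effective_samples"),
    )
    # ignore networks we have too few recent samples from
    .where(F.col("effective_samples") >= 10)
)

# Lowest weighted latency wins: one row per ASN.
best = Window.partitionBy("asn").orderBy(F.col("latency_ms"))
edge_map = (
    per_site.withColumn("rank", F.row_number().over(best))
    .where(F.col("rank") == 1)
    .select("asn", "edge_site", "latency_ms", "effective_samples")
)
```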