[06:52:27] Traffic, SRE: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (fgiunchedi)
[06:57:29] Traffic, SRE: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (ayounsi) p:Triage→Low Agreed something needs to be fixed. The upside is that it works as a safeguard, preventing the service from receiving live traffic before the fi...
[10:54:18] XioNoX: we're bringing some bgp sessions (k8s <-> core routers) down in codfw due to shutdown of k8s nodes. Should we just ack the corresponding alerts?
[10:54:35] that potentially hides real issues I guess :|
[10:54:37] cc jelto
[10:55:47] I'd recommend downtiming it for the duration of the maintenance, so if something doesn't come back up it will alert when everything should be good
[11:10:50] netops, Infrastructure-Foundations: Return AS43821 to RIPE - https://phabricator.wikimedia.org/T314471 (cmooney) Open→In progress p:Triage→Low
[11:11:00] XioNoX, jayme: so I'll create an icinga downtime for BGP status alerts until 5:00pm UTC (this is when maintenance should be done for all racks containing kubernetes nodes)?
[11:14:16] jelto: maybe add an additional 10min
[11:14:21] :)
[11:15:06] or 10min less so we can fix stuff that should be up before the end of the window :)
[11:15:07] my only concern was that downtiming that check will also hide bgp session errors from nodes that are in other racks
[13:26:56] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:21:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:04:02] (HAProxyEdgeTrafficDrop) firing: 44% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:04:55] XioNoX: I don't want you to feel unheard with the merging of the Africa DC CR - If you still have concerns with the work I'd love to hear them so I can follow up and perhaps improve what's already there.
[16:08:53] brett: btw I have some suggestions on how to use NEL to do some simple latency probing
[16:09:43] (from end user perspectives)
[16:10:00] brett: I let it sit as I was interested in hearing from more people about the topic
[16:10:48] cdanis: do you think we could see nel data in turnilo? :)
[16:11:00] or even just the latency data
[16:11:04] XioNoX: https://phabricator.wikimedia.org/T304373 :)
[16:11:52] nice!
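To make the NEL latency-probing idea above concrete: a minimal Python sketch of what it involves, assuming the standard W3C NEL mechanism. The report-body fields (`type`, `elapsed_time`, `server_ip`) and the header fields (`success_fraction`, `failure_fraction`, `report_to`) come from the NEL spec; the group name, intake URL, and everything else here are invented for illustration, not the actual Wikimedia setup.

```python
import json

# Response headers a probe hostname might send to opt the browser in to NEL.
# success_fraction=1.0 asks the browser to report successful fetches too;
# each success report's "elapsed_time" (milliseconds) is an end-user latency
# sample. Group name and intake URL are hypothetical.
PROBE_HEADERS = {
    "Report-To": json.dumps({
        "group": "nel-timing",
        "max_age": 86400,
        "endpoints": [{"url": "https://nel-intake.example.org/report"}],
    }),
    "NEL": json.dumps({
        "report_to": "nel-timing",
        "max_age": 86400,
        "success_fraction": 1.0,
        "failure_fraction": 1.0,
    }),
}

def latency_samples(reports):
    """Yield (server_ip, elapsed_ms) from a batch of decoded NEL reports."""
    for report in reports:
        body = report.get("body", {})
        if body.get("type") == "ok":  # a successful fetch, per the NEL spec
            yield body.get("server_ip"), body.get("elapsed_time")
```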
[16:12:35] in theory, with that done (or with some manual dumping from the data in logstash), it would not be too hard to also import some RIPE Atlas data into Hive
[16:12:54] and then write a bunch of pyspark that computes mappings aggregating all of that as we like
[16:13:11] not even sure we would need RIPE data if we have real user data
[16:13:25] yeah, agreed, although it would still be interesting to cross-check
[16:14:10] we can even vary the sampling fraction based on user country tbh
[16:16:39] yeah, this would be a great basis for building our own edge maps
[16:16:57] we've talked about doing it with client-side JS for years
[16:17:49] I'm guessing we'll still need at least some client-side JS to do a sampling-limited fetch across all edges.
[16:18:09] but if that fetch can report via NEL, that removes some of the other complexities
[16:19:31] if we made up some custom hostnames for the probing, perhaps we can configure NEL sampling differently that way too?
[16:19:53] nel-timing-ulsfo.wikimedia.org and such?
[16:21:24] step one is coming up with mechanisms like this to even have the data in analytics. step two is coming up with a processing pipeline that cares about how many samples we get from a network, and decaying weight of older samples, etc, to generate something like our geo-maps file (but for networks, not countries)
[16:28:01] bblack: yeah indeed -- we would still need to do some JS work, but it could just be probabilistically doing a fetch against a bunch of site-specific domains
[16:28:17] and then those site-specific domains could have both `failure_fraction` and `success_fraction` set to 1.0, with reports going into the normal NEL pipeline
[16:28:32] which could then hit analytics, and then from there, the processing pipeline like you say
[16:33:56] (HAProxyEdgeTrafficDrop) resolved: 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:14:52] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) >>! In T279664#8123041, @MatthewVernon wrote: > Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs...
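One possible shape for the "step two" processing pipeline described at [16:21:24]: weight per-network latency samples, decay older samples, require a minimum effective sample count per network, and pick the lowest-latency edge site per ASN, analogous to the geo-maps file but keyed by network. A hedged pyspark sketch only; the Hive table name, every column name, and the half-life and threshold values are all invented.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("edge-map-from-nel").getOrCreate()

# Hypothetical Hive table of per-request NEL latency samples, with columns
# asn, edge_site, elapsed_ms, sample_date.
samples = spark.table("event.nel_latency")

# Exponential decay: a sample loses half its weight every HALF_LIFE_DAYS,
# so older data fades out of the mapping gradually.
HALF_LIFE_DAYS = 14.0
weighted = samples.withColumn(
    "weight",
    F.pow(F.lit(0.5),
          F.datediff(F.current_date(), F.col("sample_date")) / HALF_LIFE_DAYS),
)

per_site = (
    weighted.groupBy("asn", "edge_site")
    .agg(
        # decay-weighted mean latency per (network, edge site)
        (F.sum(F.col("elapsed_ms") * F.col("weight")) / F.sum("weight"))
            .alias("latency_ms"),
        F.sum("weight").alias("effective_samples"),
    )
    # ignore networks we have too few recent samples from
    .where(F.col("effective_samples") >= 10)
)

# Lowest weighted latency wins: one row per ASN.
best = Window.partitionBy("asn").orderBy(F.col("latency_ms"))
edge_map = (
    per_site.withColumn("rank", F.row_number().over(best))
    .where(F.col("rank") == 1)
    .select("asn", "edge_site", "latency_ms", "effective_samples")
)
```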