[09:30:58] <vgutierrez>	 I'm seeing this on cp5006
[09:31:01] <vgutierrez>	 [18679034.347029] INFO: task md0_raid1:603 blocked for more than 120 seconds.
[09:31:28] <vgutierrez>	 CPU usage seems fine now.. but it obviously struggled for a while
[09:31:29] <vgutierrez>	  09:31:13 up 217 days,  1:45,  1 user,  load average: 2456.82, 2456.17, 2454.40
[09:32:24] <vgutierrez>	 the RAID seems healthy though
[09:37:27] <vgutierrez>	 godog: for some reason cp5006 doesn't show up on the host overview dashboard.. do we have a short retention time for those metrics?
[09:43:00] <godog>	 vgutierrez: if the host isn't in the list it usually means prometheus doesn't know about the host, prometheus5001 in this case
[09:43:56] <godog>	 which in turn might mean the host isn't in puppetdb, or prometheus somehow isn't scraping it
[09:44:49] <godog>	 to answer your question, "host overview" uses thanos by default, so it has long-term metrics
[09:47:30] <vgutierrez>	 godog: the host has been struggling for the last 21 hours
[09:47:39] <vgutierrez>	 so puppet hasn't been able to run there for almost a day
[09:47:57] <vgutierrez>	 https://www.irccloud.com/pastebin/6t2lOjEy/
[09:48:07] <vgutierrez>	 poor bastard :)
[09:50:03] <godog>	 heheh yeah that might be the reason, the host is known to prometheus5001 but indeed looks like prometheus hasn't been able to scrape it for some time
[09:50:39] <godog>	 https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cp5006&var-datasource=thanos&var-cluster=cache_upload&from=now-24h&to=now
[09:51:21] <godog>	 but yes if a target can't be scraped for some time it goes stale, punching in the hostname though works
[09:57:40] <vgutierrez>	 gotcha, thanks!
[16:48:35] <cdanis>	 I just saw this but apparently it's ~22h old?  cr1-eqiad reporting interface xe-3/0/3: down (connects to access switches, asw2-d-eqiad:xe-2/0/40)
[16:48:47] <cdanis>	 anyone know anything?
[16:55:42] <cdanis>	 (filed T273301)
[16:55:42] <stashbot>	 T273301: cr1-eqiad<>asw2-d-eqiad link down - https://phabricator.wikimedia.org/T273301
[20:50:43] <cdanis>	 so, the GRE tunnel between ulsfo and eqsin (a backup link) is flapping
[20:50:51] <cdanis>	 has been since 20:21 UTC
[20:50:59] <cdanis>	 Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
[20:51:12] <cdanis>	 the ulsfo side constantly thinks the BFD timer has expired
[20:51:32] <cdanis>	 ( https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection -- basically just a simple heartbeat protocol)
[20:52:09] <cdanis>	 this is a bit uncomfortable because it means eqsin is at N+0 for connectivity to the rest of production
[20:52:22] <cdanis>	 https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png
[20:54:50] <cdanis>	 I will open a task in a few minutes, finishing up something else atm
[21:12:51] <dpifke>	 FWIW, I was seeing some latency/packet loss spikes talking to the ulsfo bastion yesterday.
[21:13:13] <cdanis>	 did you grab a mtr? :)
[21:13:44] <dpifke>	 I assumed it was my ISP (which it still very well may have been).
[21:14:17] <dpifke>	 Sadly, no.  Was in the middle of something so just changed my ssh config and resumed what I was doing. :)
[21:15:12] <dpifke>	 My Starlink antenna is supposed to arrive today, so then I'll have redundant connectivity here.
[21:25:47] <mutante>	 dpifke: how long ago had you signed up for beta?  when they ask for service address I am kind of thinking that wasn't the point, i want to put that dish on a van or something :)
[21:45:30] <dpifke>	 Signed up pretty much right away, got the invite last week.
[21:46:17] <dpifke>	 Right now the ToS say you have to keep it at the registered address, I assume because they're trying to distribute the beta geographically.
[21:48:18] <dpifke>	 Benefit of northern latitude: Starlink availability.  Downside: snow.  The real trick is going to be not falling off an icy roof when I put the dish up. :)
[21:55:22] <mutante>	 dpifke: gotcha! thanks
[21:55:37] <mutante>	 good luck:)
[22:01:49] <dpifke>	 Safety third!