[09:30:58] I'm seeing this on cp5006
[09:31:01] [18679034.347029] INFO: task md0_raid1:603 blocked for more than 120 seconds.
[09:31:28] CPU usage seems fine now.. but it obviously struggled for a while
[09:31:29] 09:31:13 up 217 days, 1:45, 1 user, load average: 2456.82, 2456.17, 2454.40
[09:32:24] the RAID seems healthy though
[09:37:27] godog: for some reason cp5006 doesn't show up on the host overview dashboard.. do we have a short retention time for those metrics?
[09:43:00] vgutierrez: if the host isn't in the list it usually means prometheus doesn't know about the host, prometheus5001 in this case
[09:43:56] which in turn might mean the host isn't in puppetdb, or prometheus somehow isn't scraping it
[09:44:49] to answer your question, "host overview" uses thanos by default so long term metrics
[09:47:30] godog: the host has been struggling for the last 21 hours
[09:47:39] so puppet hasn't been able to run there for almost a day
[09:47:57] https://www.irccloud.com/pastebin/6t2lOjEy/
[09:48:07] poor bastard :)
[09:50:03] heheh yeah that might be the reason, the host is known to prometheus5001 but indeed looks like prometheus hasn't been able to scrape it for some time
[09:50:39] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cp5006&var-datasource=thanos&var-cluster=cache_upload&from=now-24h&to=now
[09:51:21] but yes if a target can't be scraped for some time it goes stale, punching in the hostname though works
[09:57:40] gotcha, thanks!
[16:48:35] I just saw this but apparently it's ~22h old? cr1-eqiad reporting interface xe-3/0/3: down (connects to access switches, asw2-d-eqiad:xe-2/0/40)
[16:48:47] anyone know anything?
[16:55:42] (filed T273301)
[16:55:42] T273301: cr1-eqiad<>asw2-d-eqiad link down - https://phabricator.wikimedia.org/T273301
[20:50:43] so, the GRE tunnel between ulsfo and eqsin (a backup link) is flapping
[20:50:51] has been since 20:21 UTC
[20:50:59] Jan 29 20:21:46 cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
[20:51:12] the ulsfo side constantly thinks the BFD timer has expired
[20:51:32] ( https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection -- basically just a simple heartbeat protocol)
[20:52:09] this is a bit uncomfortable because it means eqsin is at N+0 for connectivity to the rest of production
[20:52:22] https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png
[20:54:50] I will open a task in a few minutes, finishing up something else atm
[21:12:51] FWIW, I was seeing some latency/packet loss spikes talking to the ulsfo bastion yesterday.
[21:13:13] did you grab a mtr? :)
[21:13:44] I assumed it was my ISP (which it still very well may have been).
[21:14:17] Sadly, no. Was in the middle of something so just changed my ssh config and resumed what I was doing. :)
[21:15:12] My Starlink antenna is supposed to arrive today, so then I'll have redundant connectivity here.
[21:25:47] dpifke: how long ago had you signed up for beta? when they ask for service address I am kind of thinking that wasn't the point, i want to put that dish on a van or something :)
[21:45:30] Signed up pretty much right away, got the invite last week.
[21:46:17] Right now the ToS say you have to keep it at the registered address, I assume because they're trying to distribute the beta geographically.
[21:48:18] Benefit of northern latitude: Starlink availability. Downside: snow. The real trick is going to be not falling off an icy roof when I put the dish up. :)
[21:55:22] dpifke: gotcha! thanks
[21:55:37] good luck:)
[22:01:49] Safety third!