[00:19:11] RESOLVED: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[03:10:25] FIRING: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[03:32:55] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:34:59] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:35:14] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:35:34] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:40:25] RESOLVED: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:34] FIRING: DiskSpace: Disk space serpens:9100:/ 6.576% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:11:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:29:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.503% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:23:46] netops, Infrastructure-Foundations, SRE: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800 (cmooney) NEW p:Triage→High
[05:53:02] netops, Infrastructure-Foundations, SRE: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11361879 (cmooney)
[05:58:32] netops, Infrastructure-Foundations, SRE: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11361886 (cmooney)
[06:01:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:01:34] we had a failed logrotation on serpens, I fixed it manually, should recover soon
[07:04:34] RESOLVED: [2x] DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:16:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:08] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11363473 (cmooney) Open→Resolved a:cmooney >>! In T408378#11351612, @colewhite wrote: > In today's case, the alert criteria wasn't met because...
[15:37:21] topranks: any concerns if I pick 10.3.0.2/32 in 10.3.0.0/24?
[15:37:30] full context https://phabricator.wikimedia.org/T409780
[15:37:41] or I can pick something after 10.3.0.9 as well, so .10
[15:38:06] this is for hcaptcha-proxy.anycast.wmnet
[15:39:46] I mean no objection really, maybe .10 is better.....
[15:39:59] but my main question would be why are we using anycast for something behind the LVS ?
[15:40:27] topranks: ok on .10!
[15:40:36] yeah that's a good question, I try to clarify that in the task
[15:40:50] let me know if it doesn't make sense and happy to discuss of course
[15:41:03] do we still have backup static /31 or /30 routes for the recdns .1 endpoint?
[15:41:55] no, we don't do static routes anymore and we even removed them for the LVSes (in the process) https://phabricator.wikimedia.org/T300877
[15:42:20] we have static routes in place for some ranges
[15:42:39] https://phabricator.wikimedia.org/P85190
[15:42:47] but I am very much in favour of removing them
[15:43:10] yeah, we decided to keep the eqiad ones for now but we will remove them eventually
[15:43:27] my fear for something like an HTTP proxy is that anycast is sometimes problematic with stateful things like TCP-based connections
[15:43:30] fine for DNS, NTP etc
[15:43:47] at least 10.68.16.0/21 from that paste is not in use and has not been in years, so we could drop that right away
[15:43:59] using a load-balancer is often better, but at the cost of that extra layer in the stack
[15:44:26] yeah, in this case though we haven't really done a service at the edge and behind the CDN. so that problem is there in that respect
[15:44:36] taavi: ha indeed, even the next-hop is invalid, it's not in the routing table at all
[15:44:39] I'll nuke it
[15:44:47] for the anycast / TCP thing, do note that we already do it for durum and Wikidough though in that respect, but valid point
[15:45:41] even so I expect the DoH/DoT services are gonna deal better with broken state than something like a hcaptcha transaction
[15:45:51] topranks: feel free to comment on the task though; we want to get more input before we get started
[15:46:25] yeah I'll feed back, basically I think putting anycast behind the LB is an added complication we don't need
[15:46:26] the hcaptcha sessions should/are in theory short-lived as well
[15:46:39] but on the question (you're now sorry you asked), .10 seems like a good choice :)
[15:46:46] ha!
[15:46:58] na it's all good, please comment and we can discuss there
[15:47:33] a typical hcaptcha session should be under a second
[15:47:48] (well the HTTP session to that service that is)
[15:47:56] but let's discuss there, I will hold off on the work
[16:14:26] sukhe: maybe it's better to ask this here cos we are possibly talking past each other on task
[16:14:45] are there two services at play? A 'hcaptcha' service and a 'http proxy' service?
[16:14:54] are both behind the CDN?
[16:15:13] topranks: yeah sorry, it's a bit confusing
[16:15:19] let's start here with some idea I guess https://wikitech.wikimedia.org/wiki/HCaptcha#/media/File:Hcaptcha_wmf_design.png
[16:16:13] ok take a step back. the internet is huge these days it'll never fit in that little box :P
[16:16:40] nah but I get the setup, I think maybe I was confused though about what was proposed for each
[16:16:56] yeah so the domain itself, what the user looks up, is behind the CDN
[16:17:13] the actual proxy is also a low-traffic service but the problem is we don't have low-traffic in edges
[16:17:27] ok....
[16:17:47] so the proposal is to use anycast _instead_ of LVS for the proxy service at the POPs?
[16:17:55] rather I guess, hcaptcha.wm.org -> text-lb and then text-lb backend to the proxies
[16:18:35] topranks: yep, in theory because we can't make the VMs high-traffic[12] because then high-traffic1 (text-lb) looks up itself and that doesn't work
[16:18:49] and we need something to distribute traffic between these two VMs (per site, total 14)
[16:18:53] so thoughts on how to do that I guess?
[16:23:30] yeah anycast is fine
[16:23:36] I got the wrong end of the stick
[16:24:12] anycast would not bring anything if the proxy was going behind the CDN, but that's not the case
[16:34:05] topranks: thanks, responding to your comments on the task as well