[00:01:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[01:42:29] Traffic, SRE, WikimediaDebug, Developer Productivity, Patch-For-Review: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794#10599124 (bd808) Open→In progress a:bd808
[03:45:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[03:55:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:31:30] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599433 (ayounsi) Not a strong feeling, but I usually try to steer towards the leaner option. So in that case it's to remove BFD between cr1/2-codfw. Looking at https://github...
[07:51:12] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10599496 (Fabfur)
[07:51:59] Traffic, Infrastructure-Foundations, Patch-For-Review: Create systemd path unit - https://phabricator.wikimedia.org/T387799#10599505 (Fabfur)
[07:52:04] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10599506 (Fabfur)
[07:52:56] Traffic: Create a path unit for haproxy - https://phabricator.wikimedia.org/T387825 (Fabfur) NEW
[07:58:54] Traffic: Create systemd-tmpfiles configuration for TLS material - https://phabricator.wikimedia.org/T387826 (Fabfur) NEW
[07:59:32] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10599550 (Fabfur)
[08:17:34] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10599583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs5006.eqsin.wmnet with OS bookworm
[09:32:49] Traffic: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10599923 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs5006.eqsin.wmnet with OS bookworm completed: - lvs5006 (**PASS**) - Downtimed on Icinga/Alertmanager...
[09:52:48] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599988 (cmooney) >>! In T387773#10599433, @ayounsi wrote: > Automation wise, we could probably automate "no `metric` = no BFD". Not sure I get the logic here, the suggestion...
[10:15:09] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839 (JAllemandou) NEW
[10:15:21] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600037 (JAllemandou) p:Triage→High
[10:17:04] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600038 (Vgutierrez)
[10:20:55] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600044 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5044c35e-9a31-4185-a901-87ad39756198) set by vgutierrez@cumin1002 for 0:30:00 on 1 host(s) and their services with reas...
[10:27:19] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600069 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs5005.eqsin.wmnet with OS bookworm
[10:58:29] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600147 (ayounsi) Let's see what other people think, but I think it would be fine to : * Keep only 1 month...
[11:12:49] Traffic: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs5005.eqsin.wmnet with OS bookworm completed: - lvs5005 (**PASS**) - Downtimed on Icinga/Alertmanager...
[11:44:35] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10600282 (Fabfur)
[11:46:37] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10600286 (Fabfur)
[11:54:44] o/ me again! I'd like to roll restbaseless citoid out to a wider group of wikis if it suits: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122973
[11:55:06] checking
[11:55:13] this is more or less safe as citoid looks like it works on the existing wikis, and is very low traffic
[11:56:00] lgtm
[11:57:59] Traffic: Create a path unit for haproxy - https://phabricator.wikimedia.org/T387825#10600327 (Fabfur)
[11:58:02] Traffic, Infrastructure-Foundations, Patch-For-Review: Create systemd path unit - https://phabricator.wikimedia.org/T387799#10600328 (Fabfur)
[12:01:10] thanks!
[12:01:19] I'll roll out now
[12:03:26] <_joe_> hi! looking at the web logs we save on centrallog:/srv/weblog/webrequests/sampled-1000.json, I see for all requests the "backend" is set to "ATS/9.2.6"
[12:03:31] no prob, I'm oncall
[12:03:37] (referred to hnowlan :D )
[12:03:49] <_joe_> maybe it would be more useful to add the value of the "server" header in the response instead, or in addition
[12:09:09] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600375 (Vgutierrez)
[12:13:21] _joe_: you mean by having something like `backend: ATS/9.2.6; mw-web.eqiad.main-698dc6dd7b-wvhdm` ?
[12:13:29] joining the two?
[12:20:04] _joe_: backend is not always ATS/9.2.6 btw
[12:22:12] https://www.irccloud.com/pastebin/HTvGOm88/
[12:29:32] Traffic, Math, SRE: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329#10600461 (MSantos)
[12:45:00] <_joe_> vgutierrez: oh so uncached requests
[12:45:10] <_joe_> right, so it's done correctly already
[12:45:24] <_joe_> if you think of where the current request originated
[12:48:34] I think so
[12:48:46] unless we have some kind of regression
[13:53:10] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600761 (cmooney) +1 I've no objection to any of these. 30 days for the full data is probably enough. In...
[14:09:25] FIRING: SystemdUnitCrashLoop: varnish-frontend-fetcherr.service crashloop on cp3066:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:10:21] looking
[14:12:26] > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 518: invalid continuation byte
[14:14:25] RESOLVED: SystemdUnitCrashLoop: varnish-frontend-fetcherr.service crashloop on cp3066:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:14:47] just on cp3066 though
[14:17:35] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600869 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3c84753d-9da7-4512-8291-9b672fc8b298) set by vgutierrez@cumin1002 for 0:30:00 on 1 host(s) and their services with reas...
[14:29:24] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10600898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs5004.eqsin.wmnet with OS bookworm
[14:29:33] Traffic: varnish-frontend-fetcherr.service crashloop on cp3066 - https://phabricator.wikimedia.org/T387864 (ssingh) NEW
[14:31:13] Traffic: varnish-frontend-fetcherr.service crashloop on cp3066 - https://phabricator.wikimedia.org/T387864#10600933 (ssingh) ` Mar 04 14:17:56 cp3066 varnish-frontend-fetcherr[1551480]: @cee: {"time": "2025-03-04T14:17:56.775748", "message": "req.body read error: 11 (Resource temporarily unavailable) - backe...
[15:20:11] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10601148 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs5004.eqsin.wmnet with OS bookworm completed: - lvs5004 (**PASS**) - Downtimed on...
[15:23:33] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10601187 (Fabfur)
[15:23:50] Traffic, Infrastructure-Foundations, Patch-For-Review: Create systemd path unit - https://phabricator.wikimedia.org/T387799#10601191 (Fabfur) Open→Resolved
[15:27:36] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601220 (ayounsi) Mostly to be able to see long term trends, for example per destination AS.
[15:45:48] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10601312 (Vgutierrez)
[16:11:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:11:35] Traffic: liberica fails to refresh liberica_cp_unhealthy_pooled_realservers_total metric - https://phabricator.wikimedia.org/T387880 (Vgutierrez) NEW
[16:11:56] Traffic: liberica fails to refresh liberica_cp_unhealthy_pooled_realservers_total metric - https://phabricator.wikimedia.org/T387880#10601493 (Vgutierrez) p:Triage→Medium
[16:15:25] Traffic: liberica fails to refresh liberica_cp_unhealthy_pooled_realservers_total metric - https://phabricator.wikimedia.org/T387880#10601532 (Vgutierrez) As expected, restarting liberica-cp without depooling the load balancer restored the metrics to the expected values: ` $ sudo -i systemctl kill liberica-c...
[16:21:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:24:26] Hey traffic team, does anyone have time to check this DNS patch? https://gerrit.wikimedia.org/r/c/operations/dns/+/1124197 cc: ryankemper
[16:26:30] looking
[16:27:57] so wdqs-legacy-full is being deprecated
[16:27:58] ?
[16:29:16] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601646 (Ottomata) FWIW, netflow is ingested into the Data Lake, so is queryable using SQL and/or [[ https...
[16:29:55] sukhe confusingly, it's both new and deprecated ;( . Basically we're getting rid of the current WDQS hosts, but we need to keep one for one of our largest users. More details in https://phabricator.wikimedia.org/T384422
[16:32:57] inflatador: looks good then; happy to discuss how dyna.wikimedia.org works under the hood but quite simply, it will return one of the IPs defined in geo-resources file in the DNS repo for text-addrs (dyna.wikimedia.org is essentially geoip!text-addrs)
[16:54:34] Hi, i have a question regarding sending resource_change events to purge CDN URLs. Does the URI require a scheme to be defined? If yes should we emit resource_change events with uri
[16:54:44] with schema http or https?
[16:54:44] sukhe: I do have a slight confusion on geoip, does it require there to be a backing lvs pool or is that tangential
[16:55:59] Because in this case we've got a single host that will be serving the query-legacy-full endpoint, and it's really just an arbitrarily chosen host from our current set of query.wikidata.org hosts
[16:56:11] ryankemper: not really; in most cases there is a backing LVS pool but you could put any /32 in there
[16:56:25] (I am specifically referring to the DNS repo here)
[16:57:20] got it, as long as that dns patch is sufficient for the request to reach the trafficserver appropriately then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1121726 (already merged) should take care of the rest
[16:58:13] ryankemper: yeah, I think so if I got it correctly what you are trying to do here
[16:58:13] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601848 (JAllemandou) >>! In T387839#10601646, @Ottomata wrote: > FWIW, netflow is ingested into the Data...
[17:02:24] nemo-yiannis: I don't think so.
[17:02:50] thanks sukhe
[17:03:29] vgutierrez: ^ can you confirm that no scheme is required?
[17:03:46] where?
[17:03:53] 11:54:34 < nemo-yiannis> Hi, i have a question regarding sending resource_change events to purge CDN URLs. Does the URI require a scheme to be defined? If yes should we emit resource_change events with uri
[17:03:57] 11:54:44 < nemo-yiannis> with schema http or https?
[17:06:03] so I purged a few URLs on Friday using `https://`
[17:06:21] ryankemper: BTW, wdqs2009.codfw.wmnet TLS material needs to be updated
[17:06:35] https://www.irccloud.com/pastebin/VDUH4uMN/
[17:06:45] query-legacy-full.wikidata.org isn't there
[17:07:07] so that ATS change shouldn't have been merged :(
[17:08:12] ah yeah, they have query-legacy-full.eqiad there which I just suspect is a typo from the previous CRs
[17:08:30] yes indeed
[17:11:03] sukhe, nemo-yiannis: back to the schema question, purged only sends the host and the URI to varnish/ATS, so the schema is irrelevant
[17:11:32] ok got it, thanks
[17:12:21] relevant code here: https://gitlab.wikimedia.org/repos/sre/purged/-/blob/main/purged.go?ref_type=heads#L257
[17:12:58] go playground PoC: https://go.dev/play/p/gcYLLrBHMaa
[17:13:12] yeah, makes sense and I think that is what nemo-yiannis was asking
[17:13:41] so purged basically builds a request that looks like `PURGE /wiki/Main_Page\r\nHost: www.wikipedia.org`
[17:14:51] the full request that it's sending is here: https://gitlab.wikimedia.org/repos/sre/purged/-/blob/main/purged.go?ref_type=heads#L60
[17:16:46] sounds good, thanks for the information
[17:17:02] no problem :D
[17:24:58] (wdqs) Okay I think the cert is working now. I can `curl https://query-legacy-full.wikidata.org/sparql` and get a response
[17:25:42] yeah it's there
[17:27:44] The UI is supposed to be there at `https://query-legacy-full.wikidata.org/` but that page is just returning 502 so there's probably something missing from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122678 I gather
[17:27:54] (The above is a concern for sre-collab but just mentioning for completeness' sake)
[17:30:09] Did you deploy the admin part of kubernetes as well? If you change the ingress (like adding a new domain) the admin service has to be deployed.
[17:30:35] I'm already afk I can take care of that tomorrow or you reach out in service ops channel
[17:31:32] Did not deploy the admin service, so that is likely the issue. thanks jelto
[17:31:37] Will look into that when I'm back in an hour
[17:33:35] https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service#Deploy_changes_to_helmfile.d/admin_ng
[18:58:34] Traffic, Data-Engineering: GeoDNS: Pipeline from event.development_network_probe to operations/dns.git - https://phabricator.wikimedia.org/T380626#10602407 (Ottomata)
[20:58:23] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10602968 (Fabfur)
[21:14:45] Traffic: acme_chief and sslcert modules should allow destination parameter - https://phabricator.wikimedia.org/T387929 (Fabfur) NEW
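
Note on the 17:11 purge discussion: the point that only the host and the URI reach varnish/ATS, so the scheme of the emitted URL does not matter, can be illustrated with a minimal Go sketch. This is not the code from purged.go. `buildPurge` is a hypothetical helper, and the `HTTP/1.1` request line and trailing blank line are assumptions added to make the request text well-formed. It only shows that an `http://` and an `https://` URL for the same page reduce to the same PURGE request.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPurge is a hypothetical helper, not taken from purged. It keeps only
// the host and the request URI of rawURL, so the scheme is discarded.
func buildPurge(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", rawURL)
	}
	// RequestURI returns the path plus any query string.
	// The "HTTP/1.1" version token is an assumption for illustration.
	return fmt.Sprintf("PURGE %s HTTP/1.1\r\nHost: %s\r\n\r\n", u.RequestURI(), u.Host), nil
}

func main() {
	// Same page, two schemes: both lines printed below are identical.
	for _, raw := range []string{
		"http://www.wikipedia.org/wiki/Main_Page",
		"https://www.wikipedia.org/wiki/Main_Page",
	} {
		req, err := buildPurge(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q\n", req)
	}
}
```

For the authoritative behaviour, see the purged.go links posted at 17:12:21 and 17:14:51.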