[09:57:47] Network Next Hop AS_PATH Age Attrs
[09:57:47] *> 10.0.0.0/24 192.168.88.5 64512 64512 64512 64512 00:00:28 [{Origin: i}]
[09:58:03] XioNoX, topranks ^^ bye bye MED, AS prepend working as expected
[11:16:03] elukey: if I'm reading https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075528/1/hieradata/common/profile/trafficserver/backend.yaml correctly, docker-registry has a TTFB over 5 minutes on some endpoints?
[11:18:12] vgutierrez: o/ yes, fetching parts of the catalog sometimes takes more than those 180s.. pagination is used, so when fetching /v2/_catalog the response carries a Link header with the next URI to use (that forces pagination). I tried lower values today; it improved things, but the service is very slow when asked for the catalog
[11:18:30] there are a million sad reasons for this
[11:18:52] IMHO that should be switched to an async model rather than increasing timeouts
[11:19:14] you make a slow request, get a uuid back and check when the response is ready
[11:19:35] rather than letting the CDN hang waiting for a response for 5 minutes
[11:19:42] if the docker registry supported this, it would be nice
[11:20:30] 300s is an upper bound; 180s is still a bit tight, so I added a little room for the future
[11:20:46] not all requests take that long, only fetching the catalog
[11:21:09] hopefully, if we're able to clean up its state and make it faster, we won't need this forever
[11:21:44] I understand it's not pretty, but I don't see many other options
[11:22:09] is there any chance of bypassing the CDN for this kind of operation?
[11:22:49] are we running docker-report within the production network?
[11:23:57] that's a good point, it is internal; we use the CDN endpoint for some obscure reason that I'm not aware of. I can try to test/dig into it, maybe we can solve it quickly
[11:24:50] targeting https://docker-registry.discovery.wmnet would make a lot of sense for this
[11:25:33] you're totally right, I'm going to test it and report back.. I have a feeling it was tested in the past and possibly something didn't work, but we'll see. I'll report back if it fails miserably :)
[11:25:44] thx
[11:26:05] thanks for the brainbounce :)
[11:29:35] the only big difference I can see when targeting the applayer directly is that you lose HTTP/2
[11:30:30] don't really care, I assumed there was a problem related to how nginx was/is configured, but so far it seems to be working fine
[11:30:51] so bypassing the CDN, if it works, is surely the best option
[11:31:00] and we don't need to add horrors to the ATS config
[11:31:14] and it's more efficient in this case IMHO
[11:36:50] definitely
[11:54:51] 06Traffic, 06Data-Platform, 10Data Products (Data Products Sprint 19): NEW BUG REPORT - Issues in calculation logic for unique devices tables - https://phabricator.wikimedia.org/T375527#10175204 (10WDoranWMF)
[11:55:47] 06Traffic, 06Data-Platform, 10Data Products (Data Products Sprint 19): NEW BUG REPORT - Issues in calculation logic for unique devices tables - https://phabricator.wikimedia.org/T375527#10175206 (10WDoranWMF) @KOfori this task is considered high priority and we'll need support from Traffic, could you triage...
[13:00:04] 06Traffic, 06Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10175385 (10Vgutierrez) > 1. Is there an explanation for why there are users that apparently WMF-Last-Access-Global is set but not WMF-Last-Access and vice versa? It seems that Cookies should b...
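A minimal sketch of the paginated /v2/_catalog walk described above (11:18), targeting the internal docker-registry.discovery.wmnet endpoint rather than the CDN. The page size, timeout, and absence of auth handling are illustrative assumptions, not how docker-report actually does it:

```python
# Sketch only: walk the Docker Registry v2 catalog, following the Link header
# the registry uses to paginate. Page size, timeout and the lack of auth
# handling are assumptions for illustration.
import requests

BASE = "https://docker-registry.discovery.wmnet"  # internal endpoint, bypassing the CDN

def list_repositories(page_size=100, timeout=300):
    repos = []
    url = f"{BASE}/v2/_catalog?n={page_size}"
    while url:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        repos.extend(resp.json().get("repositories", []))
        # Further pages are signalled via: Link: </v2/_catalog?last=...&n=...>; rel="next"
        nxt = resp.links.get("next", {}).get("url")
        url = requests.compat.urljoin(BASE, nxt) if nxt else None
    return repos

if __name__ == "__main__":
    print(f"{len(list_repositories())} repositories in the catalog")
```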
[13:25:43] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10175439 (10Papaul)
[13:28:50] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10175445 (10Papaul)
[13:31:10] 06Traffic, 06Data-Engineering: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute - https://phabricator.wikimedia.org/T375256#10175471 (10Tgr) This only affects cookies on cross-domain requests. Not sure if that's a problem. The logspam can be prevented by setting th...
[14:42:23] 06Traffic, 06Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10175920 (10Milimetric) >>! In T375562#10175385, @Vgutierrez wrote: >> 1. Is there an explanation for why there are users that apparently WMF-Last-Access-Global is set but not WMF-Last-Access a...
[14:53:58] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10176000 (10Jhancock.wm)
[14:56:36] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10176005 (10Jhancock.wm) @papaul there's two more not marked in the comments that do not have 10G cards, but they are being decommed. civi20...
[15:03:25] 06Traffic, 06Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10176053 (10Hghani) >>! In T375562#10175385, @Vgutierrez wrote: > >>3. Looking at the isp_data, or any other field, is there any way to determine the origin of the request? Even if it's a gen...
[15:15:29] 06Traffic, 06Movement-Insights: Investigating unique devices traffic data - https://phabricator.wikimedia.org/T375562#10176102 (10Vgutierrez) >>! In T375562#10175920, @Milimetric wrote: > What we saw in the data is that for `*.wikipedia.org` we have webrequests with WMF-Last-Access-Global set but no WMF-Last-A...
[15:43:42] swfrench-wmf, vgutierrez, cr3-ulsfo shows early signs of failure
[15:44:00] ack
[15:44:06] ack
[15:45:06] eqiad is still depooled at the CDN level; ulsfo is currently handling ~10k rps that would add to the current codfw load
[15:45:31] https://phabricator.wikimedia.org/T375345#10176212
[15:46:03] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10176212 (10ayounsi) 05Resolved→03Open ` cr3-ulsfo> show system alarms 1 alarms currently active Alarm time Class Description 2024-09-25 13:11:42 UTC Minor FPC 0 Min...
[15:50:19] so at least be ready to depool it; I'll leave it up to you whether to actually depool or not
[15:50:37] I would propose, per the discussion with s.ukhe and b.black on Monday, that if we end up needing to depool ulsfo, we do not proactively repool eqiad, and instead do so only if we start seeing issues with load / network saturation in codfw
[15:50:47] other than that it seems to be behaving correctly so far
[15:52:57] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10176239 (10Papaul) @Jhancock.wm thank you, no worry on civi2001 and frpig2001. So we have a total of 8 servers that are running on 1G and we...
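Regarding the "Partitioned" attribute warning in T375256 above, a minimal sketch of what a cross-site cookie carrying the CHIPS attribute looks like. The cookie name and lifetime are borrowed from the discussion purely for illustration; this is not the actual header the edge emits:

```python
# Sketch only: a cross-site cookie with the CHIPS "Partitioned" attribute.
# Per the spec, a Partitioned cookie must also be Secure, and SameSite=None
# is what allows it on cross-site requests in the first place.
def set_cookie_header(name: str, value: str, max_age: int) -> str:
    return (
        f"Set-Cookie: {name}={value}; Max-Age={max_age}; Path=/; "
        "Secure; SameSite=None; Partitioned"
    )

# Illustrative values only; not the real attributes used for this cookie.
print(set_cookie_header("WMF-Last-Access-Global", "25-Sep-2024", 30 * 24 * 3600))
```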
[16:03:59] swfrench-wmf: end of my day here, I'd depool it instead of risking issues overnight
[16:04:59] vgutierrez: no objections on my end, but would defer to you folks or the oncallers for that
[16:07:07] !oncall-now
[16:08:37] bblack: WDYT? cr3-ulsfo seems close to crashing, I'm inclined to depool it before that happens :)
[16:09:06] sounds good to me
[16:09:37] another chance to break out that fancy new cookbook anyways :)
[16:09:57] it's so good I already used it on a Sunday
[16:11:05] and yes, so long as eqiad remains available-to-pool in case of a codfw emergency, we can be fine on codfw-only in the US I think.
[16:42:08] 12:39:17 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:42:10] 12:39:25 <+icinga-wm> PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:28] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10176493 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dcd9deb3-f5d9-41d3-ade0-567f7154bb5b) set by ayounsi@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their serv...
[16:47:49] FIRING: PyBalBGPUnstable: PyBal BGP sessions on instance lvs4010 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=ulsfo%20prometheus/ops&var-server=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[16:52:49] FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[17:12:17] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10176609 (10RobH) Adding in #ops-ulsfo project tag as I've been CC'd in at this point for the actual processing of the on-site steps for this failed hardware.
[17:12:43] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10176610 (10RobH)
[17:22:49] RESOLVED: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
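The PyBalBGPUnstable notifications above link to alerts.wikimedia.org. A minimal sketch of listing active alerts by name through the standard Alertmanager v2 API; whether that raw API is reachable at this host without authentication (it fronts the Karma dashboard) is an assumption:

```python
# Sketch only: query active alerts by alertname via the Alertmanager v2 API.
# Assumes the API is reachable at this host without extra authentication,
# which may not hold in practice (alerts.wikimedia.org fronts Karma).
import requests

ALERTMANAGER = "https://alerts.wikimedia.org"

def firing_instances(alertname: str, timeout: int = 10) -> list[str]:
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={"filter": f'alertname="{alertname}"', "active": "true"},
        timeout=timeout,
    )
    resp.raise_for_status()
    return [alert["labels"].get("instance", "?") for alert in resp.json()]

if __name__ == "__main__":
    print(firing_instances("PyBalBGPUnstable"))
```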