[10:55:12] gitlab needs a short maintenance in one hour (12:00 UTC)
[12:01:15] gitlab maintenance finished
[15:22:41] folks, do we know of any issues related to MW or the network starting around 20:20 UTC Feb 5?
[15:23:34] sukhe: like https://grafana.wikimedia.org/goto/uDhweuHvR?orgId=1 ?
[15:23:38] context: we are trying to debug some hcaptcha connection issues (MW -> hCaptcha proxy VMs)
[15:23:48] see milimetric's post from earlier in #-traffic :)
[15:24:07] cdanis: yeah that was pointed out, but it couldn't have been that since the issue persisted well beyond that period
[15:24:11] ah okay
[15:24:19] https://logstash.wikimedia.org/goto/3638725468dcb59bc3e917335776afa1
[15:25:09] hcaptcha proxy -> hcaptcha requests appear to go fine, it's MW -> hcaptcha proxy that seems to be failing, and then some other issues as well, like IPReputation above, even though it's a fairly small number of failing requests
[15:26:14] this also doesn't seem likely https://sal.toolforge.org/log/cKR1L5wBvg159pQrh9_Y
[15:29:06] sukhe: are you sure about that first part? https://grafana-rw.wikimedia.org/d/441b2def-52e9-49d6-acad-91f5bb748989/hcaptcha-reverse-proxy-proxoid?orgId=1&from=now-2d&to=now&timezone=utc&var-instance=$__all&var-site=codfw&var-wiki=ptwiki&var-wiki=jawiki&var-wiki=idwiki&var-wiki=zhwiki&var-wiki=trwiki&var-wiki=fawiki&var-wiki=frwiki&var-wiki=enwiki&viewPanel=panel-36
[15:29:59] cdanis: I can't seem to find the 5xx on the proxy VMs, and any requests for secure-api.js (that does the healthchecking) are all 200s
[15:30:28] but yeah, that's exactly when it started, 20:22 UTC
[15:30:40] and it has resolved now as you can see, except we're still lost on the cause
[15:33:22] cdanis: so to answer your question, I am sure about no issues on the proxy, but of course I can be wrong. the logs there look good though (see also https://grafana.wikimedia.org/goto/UTFB19HDR?orgId=1)
[16:03:50] sukhe: hcaptcha seems to behave pretty well in general lately... not like in the past: https://grafana.wikimedia.org/goto/pRyKa9NvR?orgId=1
[16:11:19] interesting... https://grafana.wikimedia.org/goto/6J8wa9Nvg?orgId=1
[16:11:43] although perhaps that's all symptom
[16:35:29] vgutierrez: yeah
[16:36:16] cdanis: if we see the recovery around 13:00 UTC today for example (it has not been flapping since then), that matches up with the symptoms, yes
[16:36:56] yeah, but thinking about it more, I think all that graph tells us is that the requests were timing out when the requests were timing out
[16:45:12] yeah, I will admit that I am somewhat at a loss on how to debug this other than on the proxy side itself. but since it is stable for now, I guess we will wait to see if it comes up again and if there is anything else that coincides with that time period
[16:46:15] I took a peek at machine metrics for urldownloader* and didn't see anything obvious
[16:46:48] cdanis: we moved these away from urldownloader*. they are now dedicated hcaptcha-proxy* anycast VMs
[16:46:59] oh!
[16:47:32] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/9c68b14872f530c38ab8ed952d4dd56d0da04328/wmf-config/CommonSettings.php#2149
[16:47:44] that's what the code path in question uses, though ^
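Editor's note: the CommonSettings.php line referenced just above is not quoted in the log, but the later discussion (the wgHCaptchaProxy codesearch hit and the MW -> urldownloader -> CDN path) suggests it configures an egress proxy for the ConfirmEdit hCaptcha code. A purely illustrative sketch of that kind of setting, assuming the value points at the url-downloader Squid proxy; the actual host, port, and contents of line 2149 may differ:

```php
<?php
// wmf-config/CommonSettings.php (hypothetical excerpt, not the real line 2149).
// Route ConfirmEdit's hCaptcha HTTP traffic through an egress proxy. The log
// implies this currently points at the url-downloader Squid boxes, so MW
// reaches the CDN-fronted hcaptcha.wikimedia.org domains via
// MW -> urldownloader -> CDN -> hcaptcha-proxy VMs.
$wgHCaptchaProxy = 'http://url-downloader.wikimedia.org:8080'; // assumed value
```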
[16:50:06] ha, no idea what this is, I will ping Kosta. we certainly see the requests coming in to the proxy VMs (the hcaptcha domains are behind the CDN and then the VMs are the backend)
[16:50:22] (asked Kosta) https://wikimedia.slack.com/archives/C07BN6C5118/p1770396587572189?thread_ts=1770330802.695109&cid=C07BN6C5118
[16:50:49] it's the healthchecking of whether we can reach hcaptcha that happens within mediawiki
[16:50:55] I think effie knows something also
[16:51:13] https://gerrit.wikimedia.org/g/mediawiki/extensions/ConfirmEdit/+/c0916b79c5826a036bd68b53f221db823dec3258/includes/hCaptcha/Services/HCaptchaEnterpriseHealthChecker.php
[16:52:11] https://codesearch.wmcloud.org/search/?q=wgHCaptchaProxy&files=&excludeFiles=&repos=
[16:52:15] I think this is just an artifact
[16:52:42] yeah.. traffic reaches hcaptcha via urldownloader boxes
[16:52:49] could the issue be there?
[16:53:21] vgutierrez: traffic does not reach via urldownloader boxes though
[16:53:26] target: 'http://(?:|imgs-|report-|assets-|sentry-)hcaptcha\.wikimedia\.org'
[16:53:29] sukhe: turnilo says it does
[16:53:29] replacement: https://hcaptcha-proxy.anycast.wmnet:4260
[16:53:35] sukhe: that's on the CDN
[16:53:43] MW -> urldownloader -> CDN
[16:53:59] sukhe: no, it's wired in here: https://gerrit.wikimedia.org/g/mediawiki/extensions/ConfirmEdit/+/c0916b79c5826a036bd68b53f221db823dec3258/includes/ServiceWiring.php#32
[16:54:35] vgutierrez: can you share the Turnilo link?
[16:54:56] https://w.wiki/HknC
[16:55:55] I am about to call it a day, but please ping me if there is something I could help you with
[16:56:44] do we have squid metrics on grafana?
[16:57:22] we have some
[16:59:15] lol, the things that seem like they might be latency percentiles we only have for install*, not for urldownloader*
[16:59:25] sorry but I need to get back to APP stuff
[16:59:38] access log doesn't show any errors
[16:59:42] just 200 responses
[17:00:47] https://w.wiki/HknS
[17:00:59] I am not convinced, the volume of requests is far too low
[17:01:10] meeting, will come back to this again
[17:01:15] well.. that's 1/128
[17:01:23] ^
[17:02:15] yeah, but still, with the extent of the hcaptcha rollout
[17:02:46] the mediawiki extension first checks memcache to see if a healthcheck recently succeeded elsewhere, before deciding to do one
[17:02:58] it's not like every appserver is constantly pinging the url (only memcached lol)
[17:05:52] sukhe: ^^ https://www.irccloud.com/pastebin/7RKlSCnT/
[17:05:59] that's the last hour
[17:06:16] vgutierrez: one thing I notice, the timeout on the mw http client is really tight
[17:06:26] so twice per minute
[17:06:28] https://gerrit.wikimedia.org/g/mediawiki/extensions/ConfirmEdit/+/c0916b79c5826a036bd68b53f221db823dec3258/includes/hCaptcha/Services/HCaptchaEnterpriseHealthChecker.php#165
[17:06:32] I believe that's seconds?
[17:08:48] so that's plausible
[17:10:20] how much you wanna bet squid counts it as a 200 even if the client hangs up
[17:11:04] varnish would say it's a 503
[17:11:22] so 200 doesn't sound that crazy TBH
[17:13:30] so squid logs the response of the CONNECT proxy request
[17:13:49] as in, it doesn't have any issue creating the tunnel against the cp server
[17:15:46] but it doesn't know what's going on inside the tunnel
[17:16:07] so yeah.. the client could give up waiting for a response and that would be perfectly fine for squid
[17:32:18] yeah
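Editor's note: two points from the exchange above are worth pinning down. First, the health check described at 17:02 is gated on memcached, so only the occasional appserver actually performs the probe (the roughly twice-per-minute volume in Turnilo is consistent with that). Second, the timeout at HCaptchaEnterpriseHealthChecker.php#165 is believed to be seconds and quite tight, which fits the theory that MW gives up while Squid still logs the CONNECT as fine. Below is a hedged, simplified PHP sketch of that pattern; the class, method, cache-key names, and constant values are invented for illustration and are not ConfirmEdit's actual API:

```php
<?php
// Illustrative only: a memcached-gated health check with a tight client
// timeout, probing through a configured egress proxy. Not ConfirmEdit's code.

use MediaWiki\Http\HttpRequestFactory;

class HCaptchaHealthCheckSketch {
	private const CACHE_KEY = 'hcaptcha-healthcheck-ok'; // invented key name
	private const CACHE_TTL = 30; // seconds to trust a recent success (assumed)
	private const TIMEOUT = 1;    // the "really tight" timeout, in seconds (assumed)

	public function __construct(
		private BagOStuff $cache,          // memcached-backed in production
		private HttpRequestFactory $http,
		private string $healthCheckUrl,    // e.g. the secure-api.js URL (assumed)
		private ?string $proxy             // e.g. the value of $wgHCaptchaProxy
	) {
	}

	public function isHealthy(): bool {
		// If another appserver recently recorded a success, skip the probe.
		if ( $this->cache->get( self::CACHE_KEY ) ) {
			return true;
		}
		// Otherwise do one probe through the configured proxy. If the proxy
		// (or the tunnel behind it) is slow, this times out client-side even
		// though Squid happily logs the CONNECT itself as a success.
		$req = $this->http->create( $this->healthCheckUrl, [
			'timeout' => self::TIMEOUT,
			'proxy' => $this->proxy,
		], __METHOD__ );
		$ok = $req->execute()->isOK();
		if ( $ok ) {
			$this->cache->set( self::CACHE_KEY, 1, self::CACHE_TTL );
		}
		return $ok;
	}
}
```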
[17:44:05] do we need to go MW -> urldownloader -> CDN? I am quite surprised by this but I need to dig more. (I was only involved in the hcaptcha-proxy side so I really don't know the request flow there other than the user -> CDN -> proxy backend)
[17:44:25] as in, why can't we just go MW -> CDN on the private network? urldownloader is only for outbound connections
[17:44:35] anyway this is a question for Effi.e and Rain.e later
[17:54:50] do all requests on MW go through the URL downloader?
[17:55:41] that would be very surprising. and if not, then in this case it means that we have this hardcoded from when this was behind urldownloader and it hasn't been updated since then?
[18:56:26] <_joe_> sukhe: that's a security measure
[18:56:41] <_joe_> any outgoing connection goes through a proxy so it can be logged and inspected
[18:56:53] <_joe_> and the IPs open to outside connections are limited
[18:57:06] <_joe_> it's one of the most basic tenets of applayer security
[19:13:02] _joe_: yeah, but it's not an outbound connection to an external source, it's to an internal network
[19:13:30] that is a bit different for me than when we talk to the hcaptcha CDN, which is not in our network
[19:26:03] <_joe_> well if we're talking to our CDN it's an external IP anyway, and we should never do that, we should use the mesh to go to the right service :)
[19:26:33] <_joe_> it's a misconfiguration, I thought we had added a CI check to mediawiki-config to prevent it from happening
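Editor's note: _joe_'s closing point is that MW calling a public, CDN-fronted hostname (whether or not it goes via urldownloader) is a misconfiguration, and that mediawiki-config CI was supposed to catch it. The log does not show that check, so the following is only a guess at the shape such a guard could take: a test that fails when config text points MediaWiki at the public hcaptcha.wikimedia.org edge instead of an internal mesh/anycast address. File paths, class names, and the regex are invented:

```php
<?php
// Hypothetical mediawiki-config CI guard, not the actual check _joe_ mentions.

use PHPUnit\Framework\TestCase;

class NoPublicEdgeBackendTest extends TestCase {
	/** Public CDN hostnames that MediaWiki servers should not call directly. */
	private const FORBIDDEN = '/https?:\/\/[a-z0-9-]*\.?hcaptcha\.wikimedia\.org/i';

	public function testCommonSettingsDoesNotTargetPublicEdge(): void {
		// Invented path; adjust to wherever the repo keeps CommonSettings.php.
		$config = file_get_contents( __DIR__ . '/../wmf-config/CommonSettings.php' );
		$this->assertIsString( $config );
		$this->assertDoesNotMatchRegularExpression(
			self::FORBIDDEN,
			$config,
			'Reach hCaptcha via the internal hcaptcha-proxy.anycast.wmnet ' .
			'service (or the mesh), not via the public CDN hostname.'
		);
	}
}
```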