[06:30:06] Traffic, SRE, serviceops, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) >>! In T273003#6778171, @CDanis wrote: > It seems the User-Agent being used is `Peachy MediaWiki Bot API Versio...
[07:05:09] Traffic, SRE, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) p:Medium→Low I also checked the logs from yesterday, and there was no error reported by the backend servers (in envoy o...
[07:09:27] Traffic, SRE, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) I'm not even sure this qualifies for the "production error" tags. We're talking about 50 events over the last week, that's way...
[07:19:12] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Krinkle)
[07:36:49] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) Joe, not all requests are 502s. They are insignificant compared to the amount of requests returning null responses. This is a very problematic occu...
[07:53:48] vgutierrez: ema Hey, one simple one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/658503
[09:33:54] Amir1: thx <3 it's being merged right now
[09:35:57] Thanks!
[09:49:24] elukey: let me know when it's a good moment to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/629735/ && https://gerrit.wikimedia.org/r/c/operations/puppet/+/658567/
[09:54:52] vgutierrez: hola!
Anytime is fine, I can check on kafka if things looks ok
[09:57:25] cool, let's merge the debug one first
[10:01:42] not sure how to test that one, I can surely filter in kafka for debug=1 and see if it pops up, but not how to trigger a webrequest carrying that
[10:02:47] cp4032 looks healthy :)
[10:03:18] I'm always a coward merging varnish changes :)
[10:05:34] ahahhaha
[10:05:41] I can understand the feeling :D
[10:25:17] vgutierrez: all deployed? If so we could ping effi*e and try to trigger the flag in webrequest
[10:31:37] indeed
[10:32:07] shall we go ahead with the port one as well? we're capable of testing that one without her :)
[10:42:34] vgutierrez: +1
[10:42:54] I'll merge it with puppet disabled as usual and test it on cp4032
[10:46:33] vgutierrez: after that I can check on kafka the records for 4032
[10:46:48] cool
[10:48:27] ready
[10:50:00] elukey: you can check now
[10:50:26] vgutierrez: all good, it works
[10:50:41] awesome, deploying it cluster wide :)
[10:50:43] thx elukey <3
[10:50:54] wait a sec sorry
[10:50:57] I don't see it anymore
[10:51:22] vgutierrez: --^
[10:51:47] uh?
[10:52:23] elukey: are we talking about https requests?
[10:53:06] vgutierrez: I am tailing webrequest_text, yes
[10:53:32] for a moment I saw a stream with client_port=etc..
[10:53:39] but now nothing
[10:55:40] basically I am doing
[10:55:41] kafkacat -t webrequest_text -b kafka-jumbo1001.eqiad.wmnet:9092 -C | grep 'hostname":"cp4032' | jq '.'| grep x_analytics
[11:01:21] I'm seeing the client_port= being added
[11:01:24] on varnishlog
[11:05:20] I am wondering if varnishkafka doesn't like it for some reason
[11:06:24] hmmm this is weird
[11:06:27] vgutierrez: is it deployed everywhere now? (I am ok with it, just wondering what filter to use)
[11:07:04] I'm seeing some requests on cp4032 that aren't getting the client_port= data on the X-Analytics header but are flagged as https=1
[11:07:21] hmm forget it..
idiotic grep skills
[11:07:22] :)
[11:08:03] elukey: I didn't trigger the puppet run everywhere so some nodes could miss it yet
[11:08:37] okok perfect
[11:08:57] I am going to try varnishkafka with stdout on cp4032 if you are ok to see how things are displayed
[11:09:54] sure
[11:10:40] ok I see the client_port field
[11:10:53] so something is weird in my kafkacat then
[11:11:07] try with kafkadog
[11:11:10] * vgutierrez hides
[11:11:36] * elukey cries in a corner
[11:14:11] vgutierrez: ok my bad, the new kafkacat seems not tailing straight from the end of the topic, so I was seeing old data
[11:14:15] all good, confirmed
[11:14:22] sorry for the lag :)
[11:15:40] np :)
[11:57:00] <_joe_> hello traffic team, can we agree on a time for me to deploy https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/658964 to the codfw/eqiad pybals?
[11:58:01] <_joe_> I need to get back with an ETA to other teams that are waiting on this (among other stuff) to be able to do rolling restarts of appservers when deploying with scap
[12:05:23] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) p:Low→Medium
[12:05:56] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) Got another surge of bad responses from production.
[13:05:22] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Dreamy_Jazz) Just to note that cyberbot I has been blocked because of the blanking issues on enwiki. The bot has also been blocked on two other wikis for the blanki...
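Note on the webrequest check at 10:55:41 to 11:14:11: the confusion came from the consumer replaying old records instead of tailing new ones. A minimal sketch of the same filter in Python follows, assuming each record is one JSON object per line with `hostname` and `x_analytics` fields (as the grep patterns in the log suggest) and that the `x_analytics` value is a `;`-separated list of `key=value` pairs. The function names and the sample hostnames in the usage note are hypothetical, not taken from any Wikimedia tooling.

```python
import json


def parse_x_analytics(value):
    """Split a string like 'https=1;client_port=1234' into a dict.

    Assumption: fields are ';'-separated key=value pairs.
    """
    fields = {}
    for pair in value.split(";"):
        key, sep, val = pair.strip().partition("=")
        if sep:  # skip empty or malformed fragments such as '-'
            fields[key] = val
    return fields


def client_ports(lines, host_prefix="cp4032"):
    """Yield (hostname, client_port) for records whose hostname starts
    with host_prefix; client_port is None when the field is absent."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore partial or non-JSON lines
        if not isinstance(record, dict):
            continue
        hostname = str(record.get("hostname") or "")
        if not hostname.startswith(host_prefix):
            continue
        fields = parse_x_analytics(str(record.get("x_analytics") or ""))
        yield hostname, fields.get("client_port")
```

To use it on live data, one could pass `sys.stdin` to `client_ports` and pipe the consumer output into the script. Adding kafkacat's `-o end` option to the command from the log makes the consumer start at the end of the topic, which avoids reading old records.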
[14:24:17] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Aklapper) p:Medium→Low Please don't change the priority value if you don't plan to work on fixing this - thanks a lot! :)
[15:12:58] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (CDanis) Can you please provide a complete dump of a "null response", with both the complete response headers and the raw response body? What is the HTTP status cod...
[19:10:40] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) I've added additional logging data to the framework. ` Date/Time: Wed, 27 Jan 2021 19:08:52 +0000 Method: GET URL: https://en.wikipedia.org/w/api.p...
[19:45:41] What could I do to debug more why I see errors in ATS log like "CONNECT: could not connect to 10.64.48.40 for 'https://testreduce.discovery.wmnet/index.html'" when the exact IP/URL works from the same host when using curl
[19:47:12] curl -vvv https://testreduce.discovery.wmnet/favicon.ico is connecting all the way to the backend nginx behind envoy.. same cp4030 host as source
[19:47:40] also checked with -6
[21:05:30] netops, SRE, ops-codfw: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (RobH)
[21:28:49] Traffic, SRE: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (Legoktm) p:Triage→Low
[21:38:54] Traffic, DNS, Mail, SRE: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (pkang) @drochford hey david, based on Andre's last comment, would the team be open to have the emails be sent from a subdomain like @zendesk.wikimedia.org?
[23:06:43] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Urbanecm) @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the response has at least one line (the one starting with `...
[23:14:13] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Legoktm) >>! In T273003#6778468, @Cyberpower678 wrote: > I believe it only does maxlag on write requests, like when it edits. These are all read requests. You sho...
[23:17:24] Traffic, SRE: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) >>! In T273003#6782382, @Urbanecm wrote: > @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the respons...