[06:15:39] 10Traffic, 10Operations, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki) 05Open→03Resolved a:03jijiki @Jdforrester-WMF I am marking this as resolved :D
[06:16:54] 10Traffic, 10Operations, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki)
[06:17:51] 10Traffic, 10Operations, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki)
[06:35:29] hello traffic team
[06:38:27] I'd like to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511690/ and disable puppet on all the hosts running profile::cache::kafka::webrequest (that are cp nodes basically)
[06:49:48] Tested varnishkafka on cp1077 up to the webrequest topic in kafka, all good, I can see the backend field now. Re-enabling puppet to let it deploy the new config to all the varnishkafkas
[06:50:11] now I see that we should probably move varnishkafka to a proper systemd unit that doesn't restart on config change
[07:18:22] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) Does logstash need to be changed to read the new field? Cc: @fgiunchedi
[07:49:52] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) Nevermind, self answered. The field is now showing up in the 50x logstash dashboard, but it seems to require a refresh of the index list (I hovere...
[07:58:48] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10Ankit-Maity) 05Resolved→03Open ` ... cp1075 cp1075, Varnish XID 1035109169 Error: 503, Backend fetch failed at Mon, 03...
[08:01:57] it looks like cp1075 and cp1077 are struggling with some mbox lag ema
[08:12:40] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10DannyS712) Just got it again `via cp1075 cp1075, Varnish XID 1031998560 Error: 503, Backend fetch failed at Mon, 03 Jun 2...
[08:16:33] !log cp1075: restart varnish-be
[08:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:33] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) Also refreshed the index list on Kibana, no more warnings for the backend field.
[08:23:35] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10elukey) The traffic team restarted two Varnish backends, the issue should be fixed now. Thanks a lot for the reports, plea...
[11:07:12] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Matrix, and 2 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Volans) Both records are actually up now: `lang=bash $ dig +trace wikimedia.modular.im [...SNIP...] wikimedia.modular....
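
[editor's note] A side note on the DNS checks Volans quotes around here for T223835: the same SRV lookup can be done programmatically, assuming the dnspython package (>= 2.0; older versions use dns.resolver.query). The record name below follows the standard Matrix _matrix._tcp.<domain> convention for the wikimedia.org domain discussed in the task and is an assumption, not copied from the truncated output above; it also uses the system resolver rather than querying ns0.wikimedia.org directly.

    # Minimal SRV lookup sketch, assuming dnspython; record name is an assumption (see note above).
    import dns.resolver

    answers = dns.resolver.resolve("_matrix._tcp.wikimedia.org", "SRV")
    for rdata in answers:
        # Each SRV answer carries priority, weight, port and target.
        print(rdata.priority, rdata.weight, rdata.port, rdata.target)
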
[11:22:14] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Matrix, 10Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Volans) 05Open→03Resolved a:03Volans Change is live: ` L| 0 ~$ dig @ns0.wikimedia.org SRV _matrix._tcp.wikimed...
[13:23:57] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Matrix, 10Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Tgr) 05Resolved→03Open It seems the documentation is outdated and only the `.well-known` method works with Modul...
[14:20:44] 10Traffic, 10Operations: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884 (10ema)
[14:20:52] 10Traffic, 10Operations: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884 (10ema) p:05Triage→03Normal
[14:44:05] 10Traffic, 10netops, 10Operations: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ema)
[14:44:12] 10Traffic, 10netops, 10Operations: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ema) p:05Triage→03Normal
[14:55:57] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10jcrespo) >>! In T222418#5229410, @Ankit-Maity wrote: > Just a question: is this intermittent behaviour expected or is the...
[15:00:52] 10Traffic, 10netops, 10Operations: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ayounsi) I agree! That's all the "transports" LibreNMS alerting can use: https://docs.librenms.org/Alerting/Transports I'm not familiar with our paging system. If any of the ab...
[15:04:34] 10Traffic, 10Operations: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10ema)
[15:04:43] 10Traffic, 10Operations: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10ema) p:05Triage→03Normal
[15:08:42] bblack: hi! Your input is welcome on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/514017/ and in general on all the actionables here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190603-eqiad-port-saturation#Actionables
[15:11:59] ema: re: T224884 - maybe link up the patch jbond was working on last week?
[15:12:08] (the ratelimit task)
[15:12:11] T224884: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884
[15:12:58] UA policy has always been tricky, but I guess it's not unreasonable to special-case the known generic ones from script libraries as we run into them.
[15:13:54] tricky in that it's not really an enforcement mechanism. whoever it is will at least have to be operating in somewhat-decent faith when they see the 403 and identify themselves...
[15:14:22] whereas you can imagine a lot of immature people who can write code but aren't true attackers might just see a 403 about a UA and go patch their code to set it to "asdf" or whatever.
[15:14:42] and we can't just chase all possible strange UA values down one by one in giant regexen or else clauses or whatever
[15:15:11] but I think for "python-requests" and similar cases, it makes sense to at least try to help those that are willing to be helpful but didn't know any better.
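
[editor's note] On the "python-requests" case mentioned at 15:15: the client-side fix being hinted at is simply replacing the library's default User-Agent with something that identifies the tool and a contact. A minimal sketch assuming the requests library; the tool name, URL and email address are placeholders, not a mandated format.

    import requests

    # Identify the tool and give a contact instead of the default "python-requests/x.y".
    # "ExampleBot", the URL and the email address are placeholders.
    headers = {
        "User-Agent": "ExampleBot/0.1 (https://example.org/examplebot; bot-owner@example.org)"
    }
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        headers=headers,
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json()["query"]["general"]["sitename"])
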
[15:16:05] I suspect we have so many violators on cache_text that enforcing the same policy there would cause a big outcry
[15:16:30] (as a thousand unmaintained tools go dark and a scramble ensues to find maintainers to update their UA strings or whatever)
[15:17:04] I assume there will be at least some level of annoyance on cache_upload when enforcing any such thing for the first time, too.
[15:19:21] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) If you need any examples, that's what I do in: https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/507...
[15:22:22] bblack: we could get stats at least on scripts that run from our infra/cloud with those UAs, and fix them
[15:24:12] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey)
[15:30:26] from our prod infra hopefully none, but I guess we don't know for sure
[15:31:21] for cloud, the problem tends to be that there are many tools editors consider "useful" and thus breaking them causes an outcry, but the code is effectively unmaintained and nobody can easily make simple changes to it.
[15:31:34] (we ran into a lot of this during past HTTPS enforcement/standards stuff)
[15:32:29] that's not to say we can't try, do some surveying, etc
[15:32:55] but I expect we'll cause some problems, and that there will be pushback on whether "please set a random UA string" is important enough to cause all the drama
[15:33:38] (talking about cache_text here mostly. I doubt as many tools on cloud use upload. Even the ones that deal with images probably go for accessing commons via cache_text for the metadata more than anything.
[15:33:42] )
[15:37:07] To reduce the risk of random UAs we could ask people to set the UA in a specific way too, and verify it with a regex. But being more strict means more possible pushback
[15:47:20] I think it might be interesting to allow the generic UAs in a low quantity of requests, but once you see too many from a given IP address, start serving 429 with a note to set a specific UA
[15:48:33] yeah that's a decent idea too
[15:49:12] I don't think we can ever really get compliance with a strict standard (e.g. a specific regex everyone's UA has to match), and it's not reasonable in the general case.
[15:50:24] but we can build up a list over time of "this is the default UA string for some random HTTP library when the author didn't bother to set a specific one", and block (or ratelimit) those and grow the set over time.
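
[editor's note] A sketch of the "grow a list of library-default UA strings" idea from 15:50. This is only the matching logic, in Python for illustration; the real thing would live in VCL on the cache layer. Only python-requests is actually named in this discussion; the other prefixes are common library defaults added here as examples, not a list taken from the task.

    # Library-default User-Agent prefixes: python-requests is the one named above,
    # the rest are common defaults listed as examples (assumption, not from the task).
    DEFAULT_UA_PREFIXES = (
        "python-requests/",   # requests default
        "Python-urllib/",     # urllib default
        "Go-http-client/",    # Go net/http default
        "Java/",              # HttpURLConnection default
        "okhttp/",            # OkHttp default
        "curl/",              # curl default
    )

    def is_generic_ua(user_agent: str) -> bool:
        """True if the UA looks like an unmodified HTTP-library default (or is empty)."""
        return not user_agent.strip() or user_agent.startswith(DEFAULT_UA_PREFIXES)
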
[15:59:26] yes I think ratelimit vs block is much better even if the rate limit is 1 request/second
[16:00:10] we could rate limit known script UAs to 1/sec and then have an official UA scripts can use which has a different limit, 10 or 100 requests per second
[16:00:34] the 429 from the first limit could direct users to use the UA in the second limit
[16:03:11] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10fdans) 05Open→03Resolved
[16:04:18] I tried to dig around everywhere else (logstash, grafana, etc) re: the text 5xx's from this morning circa 06:00 -> 08:20
[16:04:36] I can't find much that conclusively correlates well
[16:05:22] the only remotely interesting find really (but I think the correlation is in doubt)
[16:06:01] is a spike of "AH01079: failed to make connection to backend: 127.0.0.1" on mw1222 in the api cluster, which spikes up hard and very temporarily, shortly after the start of the problem period (but it may take a while for timeouts to happen and then log emission, etc)
[16:06:19] (in logstash)
[16:07:04] in general, there's a background rate of those messages and other FCGI failure/timeout things, which might indicate some parameter/limit/whatever needs some tuning somewhere?
[16:08:12] it's like 10 minutes late to be causative though, the specific spike on mw1222, that's why I think it's doubtful
[16:08:23] but still, the overall rate of those in general in logstash is worrying
[16:12:24] in SAL terms, there were db master moves happening during that window too, but it's hard to imagine that correlating out to varnish without seeing some other evidence in the middle (e.g. a spike of latent responses, etc)
[16:26:05] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) Before upgrade `Junos: 14.1X53-D47.6 JUNOS EX Software Suite [14.1X53-D47.6] JUNOS FIPS mode utilities [14.1X53-D47.6] JUNOS Online Documentation [14.1X53-D47.6] JUNOS EX 4300 Software Suite [1...
[16:26:31] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul)
[16:32:08] hmmm a more-interesting correlation: there was a huge spike of PURGE just before the problem started
[16:32:22] ~6-7x the normal average purge rate, for a short burst
[16:32:47] and then as the problem period comes in, we see spikes of cache object expiry, and also available cache storage dropping to zero
[16:34:10] but stepping out to the 1-week view, we do have spikes of that rough magnitude fairly routinely, too
[17:32:33] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10Ottomata) Thanks Luca! Any reason we shouldn't add the field to the webrequest Hive table too?
[17:35:23] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) @Ottomata the quorum of the A-team in Mallorca said that we shouldn't add it since it seems more a debugging info rather th...
[17:37:28] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10Ottomata) K! Let's leave it for now, I don't mind either way.
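
[editor's note] Going back to the rate-limiting thread (ema at 15:47, jbond at 15:59-16:00): a sketch of the two-tier idea, with a very low per-IP budget for library-default UAs and a higher one for clients that set a descriptive UA, where the 429 body points people at the second tier. The limits, window and wording are illustrative, and the real implementation would be a Varnish-level ratelimit, not an in-memory Python counter.

    import time
    from collections import defaultdict, deque

    WINDOW_S = 1.0           # sliding window, in seconds
    GENERIC_LIMIT = 1        # jbond's "1 request/second" for library-default UAs
    DESCRIPTIVE_LIMIT = 100  # higher budget for clients identifying themselves (illustrative)

    _hits = defaultdict(deque)  # (client_ip, generic_ua) -> timestamps of recent requests

    def check(client_ip: str, generic_ua: bool):
        """Return (status, body); generic_ua would come from a check like is_generic_ua() above."""
        limit = GENERIC_LIMIT if generic_ua else DESCRIPTIVE_LIMIT
        now = time.monotonic()
        recent = _hits[(client_ip, generic_ua)]
        while recent and now - recent[0] > WINDOW_S:  # drop hits outside the window
            recent.popleft()
        if len(recent) >= limit:
            if generic_ua:
                return 429, "Please set a descriptive User-Agent (tool name + contact) for a higher limit."
            return 429, "Too many requests, slow down."
        recent.append(now)
        return 200, "OK"
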
[18:30:07] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10Ankit-Maity) 05Open→03Resolved That explanation certainly helps >>! In T222418#5230418, @jcrespo wrote: >>>! In T2224...
[18:30:14] 10Traffic, 10netops, 10Operations: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) There is a "Nagios Compatible" transport, but it is underdocumented and seems to also only write to a local filesystem path (which is presumed to be a Nagios external com...
[23:14:26] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) I agree that from the perspective of more closely modelling the devices between the various tools that the domain name...
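
[editor's note] For context on CDanis's 18:30 comment about the LibreNMS "Nagios Compatible" transport: a Nagios/Icinga external command file is a FIFO that accepts one command per line, so submitting a passive check result looks roughly like the sketch below. The command file path and the host/service names are placeholders, not our actual configuration, and this is a general illustration rather than a description of how the LibreNMS transport is wired up here.

    import time

    CMD_FILE = "/var/lib/nagios/rw/nagios.cmd"  # placeholder path to the external command FIFO

    def submit_passive_result(host: str, service: str, status: int, output: str) -> None:
        """Submit a passive service check result; status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
        line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
            int(time.time()), host, service, status, output)
        with open(CMD_FILE, "w") as f:
            f.write(line)

    # e.g. submit_passive_result("cr1-example", "port utilization", 2, "xe-0/0/0 above 80%")
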