[02:13:35] ema: hmmm same symptoms as cp3057 - T237348
[02:13:35] T237348: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348
[02:29:20] Traffic, Operations, ops-esams: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (Vgutierrez)
[02:30:56] ema: I took the liberty of opening a task for cp3065 as well, to track these issues... 2 new hosts showing the same behaviour 1 week apart :/
[03:26:10] HTTPS, Traffic, Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Morgankevinj)
[04:02:13] netops, Operations: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (Vgutierrez)
[04:27:18] HTTPS, Traffic, Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Vgutierrez) We should consider QUIC and HTTP/3 adoption carefully as it implies a switch from TCP to UDP, and that could open new (D)DoS vectors and render unusable some mitigation techni...
[05:19:19] Traffic, Operations: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Vgutierrez)
[05:32:52] Traffic, Operations, Patch-For-Review: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Vgutierrez) p:Triage→Normal
[08:24:54] HTTPS, Traffic, Operations, Wikimedia-Site-requests: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Masumrezarock100)
[08:30:23] HTTPS, Traffic, Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Peachey88) @Masumrezarock100 This is something that needs to be done on the operations side of things, so i've removed Site-Requests which is for local wiki config changes.
[08:51:56] HTTPS, Traffic, Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Masumrezarock100) Oh I see.
[09:50:12] Traffic, Operations: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez)
[09:50:26] Traffic, Operations: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez) p:Triage→Normal
[10:01:37] vgutierrez: ty!
[10:05:52] Traffic, Operations, ops-esams: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (ema) p:Triage→Normal
[10:06:14] vgutierrez: ok to repool cp3065?
[10:06:22] yep
[11:10:29] Traffic, Operations: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez) Open→Resolved a:Vgutierrez Fixed with 2428105: ` (eqsin) $ curl -v https://en.wikipedia.org/api/rest_v1/page/summary/Tremont_Street_Subway 2>&1 |grep server: < server: restbase1017 `
[14:32:06] so far the new turnilo TLS stuff is still showing strong indicators that most of the majority of it, by a decent margin, seems to all be requests for "https://www.wikipedia.org/" from a blank user-agent, from a particular ISP in Indonesia
[14:33:22] and many of the next most common there is some legit traffic from ancient android, etc, but it's tiny by comparison
[14:34:47] still not quite a full week of data yet, though. We should put together some real numbers at that point, and maybe get analytics to do some unsampled queries of the same things to confirm there's nothing weird going on with sampling.
[14:36:34] actually is interesting that almost half of the traffic comes from Indonesia
[14:36:47] when you drill into that, it's all the same stuff
[14:37:02] a variety of client IPs, but all the same ISP, same blank UA, same URI, etc
[14:37:13] yep
[14:37:22] and showing a huge ratio of requests per IP
[14:37:28] someone has probably codified a healthcheck somewhere (e.g. the ISP techs configured something in some DSL modems to poll https://www.wikipedia.org/ or something)
[14:37:36] compared to other ips in the same ISP using tls1.2
[14:37:39] or who knows
[14:38:03] also that ISP is flagged in ooni.. so it makes me paranoid
[14:38:35] but yes, best case scenario: not human requests
[14:39:13] worst case: middleboxes messing with users
[14:48:55] bblack: by any chance do you think you might be able to have a look at https://phabricator.wikimedia.org/T233183#5637645 sometime this week. No hurry but would be nice to have an agreement on the plan forward
[14:49:10] s/./?/
[14:52:00] hiya, am trying to set up discovery for schema.svc, but am a bit confused about something.
[14:52:13] for the lvs stuff
[14:52:15] lvs::realserver::realserver_ips is not defined
[14:52:18] currently.
[14:52:26] but, e.g. schema.svc.eqiad.wmnet works
[14:52:38] should https://gerrit.wikimedia.org/r/c/operations/puppet/+/549106/1/hieradata/role/eqiad/eventschemas/service.yaml be added to puppet?
[14:52:54] ( i don't really know how this works, just blindly trying to follow instructions)
[14:54:47] ema: ^ ?
[14:58:05] Hm, I guess it is not needed somehow? /etc/default/wikimedia-lvs-realserver on the host has the proper LVS_SERVICE_IPS. Dunno how they get there though...
[14:58:07] ottomata: I'm not sure, and meeting soon, but in general I think how various "realserver ips" are configured for various cases is somewhat inconsistent at present
[14:58:23] it's not necessarily true that examples are portable between differently-styled cases easily
[14:58:52] (various hieradata and/or manifests do it differently)
[14:59:00] volans: yes
[14:59:43] hm, ok... thanks.
[15:00:05] i will not try and add them then. LVS stuff seems like it is as it should be
[15:00:14] will just do discovery and external routing stuff. ty
[15:01:54] ottomata: sorry for entering bike-shedding territory here, but the name "schema" seems very generic?
[15:02:46] given the intricacies of updating lvs services, are we sure the name is there to stay? :)
[15:07:53] :) i waffled between eventschema event-schema and schema, but am not 100% set on schema.
[15:08:02] that was months ago, and it's already in LVS tho...
[15:08:58] ema: we could consider changing it, it isn't too late yet (is it?)
[15:09:44] should we?
[15:10:34] ack, thanks a lot b.black :)
[15:13:06] ottomata: oh, I didn't get that the name was in LVS already. We're good then! https://it.wikipedia.org/wiki/Cosa_fatta_capo_ha
[15:15:07] ha ok!
[15:15:27] ema: you ok if i proceed with the discovery stuff then? am following https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production
[15:15:51] ottomata: looking
[15:16:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/549106 and https://gerrit.wikimedia.org/r/c/operations/dns/+/549105
[15:17:57] ottomata: lgtm!
[15:18:16] danke, proceeding!
[15:26:19] hm, ema not sure what I did wrong with the dns part
[15:26:23] authdns-update says
[15:26:26] error: plugin_geoip: Invalid resource name 'disc-schema' detected from zonefile lookup
[15:26:26] error: Name 'schema.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-schema'
[15:30:03] ottomata: uh, I've just noticed that the indentation here is off
[15:30:07] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/549106/2/hieradata/common/discovery.yaml
[15:30:09] OH!
[15:30:24] ARGH
[15:30:29] ok
[15:30:36] totally missed that during review, sorry!
[15:30:50] me too sorry!
[15:30:52] fixing
[15:30:54] ack
[15:34:36] better! ty
[15:36:05] ok ema now this one should be ready
[15:36:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/549177
[15:40:46] ottomata: we need to add TLS!
[15:42:43] see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545813/ https://gerrit.wikimedia.org/r/#/c/labs/private/+/546095/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/546097/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/546098/
[15:43:30] AH yes i meant to ask you about that
[15:49:05] ema: so i'm using nginx to host static files for now.
[15:49:23] j o e was wondering if i should just do tls in nginx
[15:49:41] i'd kinda rather not, it'd be nice not to think about it there
[15:49:50] esp if this envoy proxy thing works easily
[15:52:47] ema: do I need an extra LVS service?
[15:52:58] i'm ok if we make the main schema.svc lvs https
[15:54:08] https-only is fine as long as you can wait till the end of T227432 before opening the gates
[15:54:09] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[15:54:53] hm
[15:54:56] ema: timeline?
[15:55:16] EOQ if hardware stops making me sad
[15:55:19] hm
[15:56:01] re: envoy vs nginx for tls termination, I think you should clarify that with joe and the rest of service ops
[15:56:20] as long as we can speak tls, I'm fine :)
[15:56:52] the amount of "fine" and "as long as" in my sentences is increasing
[15:58:20] heheh
[15:59:11] i'll see if we can wait til next Q for this then, i think we can
[16:00:35] it would be great to avoid the extra effort of having both http and https services
[16:00:46] just to get rid of the http one in a bit
[16:02:12] yeah
[16:52:30] ema: are you on 3063 as well?
[16:53:06] hmmm nevermind, not sure why it failed agent run from cumin earlier, but running now
[16:53:44] bblack: nope, I'm on 3052
[17:03:23] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (fgiunchedi) (2) to me seems the way to go as it would integrate best with our existing workflows. With an eye pointed at low hanging fruits though I'm w...
[20:07:48] Traffic, Operations, Performance-Team: Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (Gilles) Impact of that test in Europe: https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?panelId=54&fulls...
[20:56:27] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles)
[20:57:10] Traffic, Operations, Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (Gilles)
[21:12:10] netops, Operations, Patch-For-Review: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (ayounsi) Open→Resolved a:ayounsi All good!
[21:12:52] Traffic, Varnish, Wikimedia-Apache-configuration, Operations, HHVM: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629 (matmarex)
[21:15:41] Wikimedia-Apache-configuration, Operations: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275 (matmarex)
[21:15:43] Wikimedia-Apache-configuration, Operations: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276 (matmarex)
[21:16:34] Wikimedia-Apache-configuration, Operations: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276 (matmarex) I can't reproduce this problem any more. I assume this was fixed by migrating to PHP 7 (T176370).
[21:16:54] Wikimedia-Apache-configuration, Operations: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276 (matmarex) Open→Resolved
[21:18:00] Wikimedia-Apache-configuration: Redirect with a question mark '?' in the title treats everything following it as URL query part when updating the URL - https://phabricator.wikimedia.org/T128380 (matmarex) Open→Resolved I can't reproduce this problem any more. I assume this was fixed by migrating to P...
[21:23:16] 1/win 38