[03:45:57] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:01:57] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:25:26] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:00:42] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:22:53] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:32:41] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:48:19] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[06:40:34] Traffic, Operations: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez) Open→Resolved
[06:40:38] Traffic, Operations, Goal, Patch-For-Review, Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez)
[06:41:31] Traffic, Operations: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (Vgutierrez) Open→Resolved Marking this as resolved as we don't use nginx anymore to terminate TLS in the caching cluster no...
[06:47:40] Traffic, Operations: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 (Vgutierrez)
[06:47:51] Traffic, Operations: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 (Vgutierrez) p:Triage→Normal
[08:51:21] Traffic, Operations, Phabricator, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (ema)
[09:49:23] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Vgutierrez) so, I've been doing some tests, and ATS doesn't drop the url-encoded version of the semicolon, so `...
[10:00:07] Traffic, Operations, Phabricator, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (ema) By going through SAL and the irc logs on #wikimedia-operations I've reconstructed the events as follo...
[10:01:30] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Vgutierrez) So: `vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v...
[10:15:15] nice: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1574126378218&to=1574154478218&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&panelId=5&fullscreen
[10:20:42] * ema waves nginx goodbye
[10:22:45] \o/
[10:23:30] godog: we can get rid of the nginx status, availability and the alert
[10:26:28] totally!
good to see it gone, very cool
[10:26:50] ema: \o/ any grafana dashboard that needs updating for this change?
[10:27:17] volans: I'm just waving, valentin is the one doing work
[10:27:24] me?
[10:27:27] lol
[10:27:34] I'm slacking in a cafe in Taipei please
[10:27:46] the cp servers work on their own
[10:28:11] but don't tell anybody ;P
[10:28:59] I've found a cafe that closes at midnight, has amazing coffee and two cats <3
[10:29:30] plot twist: the cats own the coffee shop
[10:29:41] they certainly run it
[10:30:17] hehehe no doubt
[10:31:27] own the coffee shop AND the clients
[10:34:19] godog: BTW, what's the proper way of rendering a gauge in grafana?
[10:34:48] we just included an ATS metric that gets the current active http/http2/websocket connections
[10:35:35] the usual rate doesn't make any sense here
[10:37:29] perhaps just plot the value without applying any functions?
[10:38:10] yep, that's what I'm doing right now
[10:38:34] see https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?panelId=56&fullscreen&orgId=1&var-site=eqiad%20prometheus%2Fops&var-instance=cp1075&var-layer=tls&from=now-15m&to=now
[10:39:32] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Enable mwdebug routes for noc.wikimedia.org - https://phabricator.wikimedia.org/T233768 (ema) Open→Resolved a:ema This is now done: ` $ curl -v https://noc.wikimedia.org/Potato -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wm...
[10:40:48] vgutierrez: yeah just the gauge LGTM
[10:42:15] very nice
[10:43:12] now I'm wondering if ATS actually increments that metric for websockets at some point
[10:43:12] :/
[10:48:15] Traffic, Operations: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (ema)
[12:48:01] Traffic, Operations, Phabricator, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Joe) The problem is most apache workers ended up being stuck talking to aphlict via `proxy_wstunnel` which...
[13:18:30] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2001.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[13:52:53] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2001.codfw.wmnet'] ` and were **ALL** successful.
[14:21:03] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2004.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[14:44:38] Traffic, Operations, Phabricator, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (akosiaris) >>! In T238593#5674571, @ema wrote: > By going through SAL and the irc logs on #wikimedia-opera...
[14:45:38] vgutierrez: godog: +1 to just plotting the gauge; the other reasonable thing (esp at longer timescales) is to plot the max() or similar, if it's a gauge with some limit inherent in the system
[14:46:21] ack, thankd!
[14:46:23] thanks
[14:52:23] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2004.codfw.wmnet'] ` and were **ALL** successful.
[14:55:38] Traffic, Operations, Wikidata, observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (fgiunchedi)
[14:56:32] Traffic, Operations, Wikidata, observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (ema) p:Triage→Normal
[15:07:20] Traffic, Operations, Wikidata, observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (ema) Interesting, I've observed the request failing as described in this task by using the Chromium d...
[15:19:33] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2006.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[15:28:24] Traffic, Operations, Wikidata, observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (ema) >>! In T238540#5675342, @ema wrote: > I've observed the request failing as described in this tas...
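(Annotation on the gauge discussion from 10:34 to 14:46.) The advice was to plot a gauge as-is rather than as a rate, and at longer timescales to plot something like the max. A minimal Python sketch of that downsampling idea follows. The function name, the sample values and the 60-second bucket are all made up for illustration; this is not how Grafana or Prometheus implement it.

```python
from collections import defaultdict


def downsample_gauge_max(samples, bucket_seconds):
    """Collapse (timestamp, value) gauge samples into per-bucket maxima.

    Hypothetical helper: a gauge (e.g. currently active connections) is
    plotted as-is at short timescales; over longer ranges, keeping the max
    per bucket preserves peaks that an average would smooth away.
    """
    buckets = defaultdict(lambda: float("-inf"))
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start] = max(buckets[bucket_start], value)
    return sorted(buckets.items())


# Illustrative samples only: (unix-style timestamp, active connections)
samples = [(0, 120), (15, 180), (30, 90), (60, 200), (75, 150)]
print(downsample_gauge_max(samples, 60))  # [(0, 180), (60, 200)]
```

Applying a rate to the same samples would describe how fast the connection count changes, which is why it was called meaningless for this metric.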
[16:06:08] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2006.codfw.wmnet'] ` and were **ALL** successful.
[16:12:54] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2007.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[16:14:13] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) >>! In T238593#5674571, @ema wrote: > - 2019-11-15 17:30 SAL: `mutante: phabricator - -started phd service`. @Dzahn It's...
[16:20:30] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (ema) >>! In T238593#5675662, @Dzahn wrote: > > You can entirely disregard that, i was on phab1001 and not phab1003 by accident....
[16:25:56] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) >>! In T238593#5675686, @ema wrote: > Was there any other admin action between the page and when @joe disabled proxy_wstu...
[16:42:35] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) Regarding the puppetization: There is `hiera('phabricator_aphlict_enabled'.` which is now set to false. What this does:...
[16:43:13] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (ema) As per irc conversation with @gilles, we do have frontend servers tagged in navtiming hadoop data. It would be very useful if we could have the info...
[16:45:36] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2007.codfw.wmnet'] ` and were **ALL** successful.
[16:51:08] Traffic, Operations, Performance-Team: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) @Gilles To see if and to which extent ats-tls is also responsible for some of the performance degradation, you can query hadoop and check the ssl t...
[16:56:59] Traffic, Operations, Inuka-Team (Kanban), Patch-For-Review, Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (SBisson) My summary of the gerrit discussion: //The patch was written in a way that completely ignored the reality of the...
[17:32:35] netops, Operations: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (fgiunchedi)
[17:33:01] netops, Operations: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (ayounsi) a:ayounsi
[18:08:07] netops, Operations: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (ayounsi) Open→Resolved Clearing the BFD session on the router and restarting bird solved the issue. If it happen again please reopen and I'll investigate it more.
[18:24:37] https://datasets.wikimedia.org/ seems returning 502 from ATS, known?
(still haven't checked, somebody reported in the analytics chan :)
[18:24:49] cc bblack
[18:25:49] self-redirected, it seems thorium having difficulties
[18:26:26] (in the sense that datasets.w.o has thorium as backend)
[18:30:58] so datasets.w.o is not among the SANs of the TLS cert that thorium holds afaics
[18:32:41] I am super ignorant about ATS, how can I check if it fails to connect to thorium due to a TLS issue?
[18:36:54] ok so I'll proceed in adding the SAN to the TLS cert, it seems the most obvious issue
[18:36:58] you can try manually with curl -S -vvv or so
[18:37:22] or check directly the file in the repo .. like so:
[18:38:11] mutante: what I'd need to know is if ATS (backend) is returning the 502 due to a TLS failure when contacting thorium for datasets.w.o
[18:38:43] elukey: what is the name of the service / certificate ?
[18:40:11] it is thorium, datasets.wikimedia.org
[18:40:25] it is a redirect, but I guess that we forgot to add it as SAN
[18:40:43] and now that all the text backends are ATS it fails
[18:41:12] that's the name of a backend but also the name of the directory.. i see
[18:41:21] i was trying to identify the cert in ./files/ssl/
[18:41:34] "of the director"
[18:41:36] we have one, yarn.wikimedia.org
[18:41:47] that we deploy with several sans in multiple hosts
[18:42:18] ah, ok.
so:
[18:42:19] DNS:yarn.wikimedia.org, DNS:hue.wikimedia.org, DNS:superset.wikimedia.org, DNS:pivot.wikimedia.org, DNS:turnilo.wikimedia.org, DNS:stats.wikimedia.org, DNS:analytics.wikimedia.org, DNS:piwik.wikimedia.org
[18:42:45] it needs these steps to add an additional name:
[18:42:46] https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate
[18:43:11] yep I am doing it
[18:43:44] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) I'll be sending the following into ESAMS remote hands: I'll be sending the following to Iron Mountain remote hands request via the portal: Iron Mountain, We are experiencing transi...
[18:45:15] sorry about nitpicks: maybe "thorium" (the director) should be called something like "analytics-websites" and the yarn cert analytics-websites.pem
[18:47:56] yes definitely
[18:49:02] and https://analytics-web.discovery.wmnet to point to actual thorium
[18:49:18] elukey: confirmed it is the SAN for sure:
[18:49:19] [cp1075:~] $ curl -v -S https://thorium.eqiad.wmnet
[18:49:26] cp1075 is the ATS server you can test from
[18:49:44] and then curl tells you the "no alternative subject name ..."
[18:50:10] thanks! I am about to deploy the puppet change
[18:50:43] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551899/
[18:52:40] yep, that cert has the datasets.wikimedia.org name
[18:52:46] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) @bblack: Do we care when the work is done, other than in EU or US business hours?
[18:54:05] mutante: ok now it works :)
[18:54:15] thanks for the support!
[18:54:27] I'll open a task to see if we can rename + add monitors for datasets.w.o
[18:57:27] elukey: cool :)
[18:59:52] ok all good, cert deployed everywhere and nginx reloaded
[18:59:55] logging off!
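(Annotation on the datasets.w.o 502 thread above.) The failure came down to the backend certificate not listing the requested hostname among its SANs. The Python sketch below shows only that matching step, using the SAN list pasted at 18:42:19. The function name is hypothetical and the wildcard handling is a simplified assumption; it does not reproduce how ATS, curl or OpenSSL validate certificates.

```python
def hostname_matches_san(hostname, san_entries):
    """Return True if hostname is covered by one of the DNS SAN entries.

    Hypothetical helper for illustration: exact match, plus a simplified
    wildcard rule where "*.example.org" covers exactly one extra label.
    """
    host = hostname.lower().rstrip(".")
    for entry in san_entries:
        pattern = entry[4:] if entry.startswith("DNS:") else entry
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.org"
            label = host[: -len(suffix)] if host.endswith(suffix) else ""
            if label and "." not in label:
                return True
        elif host == pattern:
            return True
    return False


# SAN list as pasted in the log at 18:42:19
SANS_FROM_LOG = [
    "DNS:yarn.wikimedia.org", "DNS:hue.wikimedia.org",
    "DNS:superset.wikimedia.org", "DNS:pivot.wikimedia.org",
    "DNS:turnilo.wikimedia.org", "DNS:stats.wikimedia.org",
    "DNS:analytics.wikimedia.org", "DNS:piwik.wikimedia.org",
]

print(hostname_matches_san("datasets.wikimedia.org", SANS_FROM_LOG))  # False
print(hostname_matches_san(
    "datasets.wikimedia.org",
    SANS_FROM_LOG + ["DNS:datasets.wikimedia.org"],
))  # True
```

The first call mirrors the state described before the puppet change, the second the state after the extra name was added to the certificate.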
[19:59:33] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Multichill) @Gehel @DCausse what's the plan here? Currently every file on Commons that uses structured data (about 2M I think) h...
[20:22:15] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) Ok, synced up with @bblack via irc and he doesn't have a preference for time. My above directions have been submitted for remote hands via the portal, case RITM0115394.
[20:30:43] Traffic, Operations, Patch-For-Review: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (BBlack) So the patch above adds it to the queue distribution logic in interface-rps, but there's another piece of the puzzle here, which is setting the hardware's queu...
[20:33:55] Traffic, Operations, Patch-For-Review: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (BBlack) Adding @RLazarus in hopes of nerd-sniping him further on this topic...
[20:44:11] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) SCTASK0128980 is new case number, confirmed and opened. (I suppose one is the request, and now we have a confirmed remote hands case?)
[21:12:28] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (herron) Friendly ping to @Volans about @fgiunchedi question above
[21:23:01] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) The Hiera key now does all the things, also stopped the service and unloaded the httpd module wstunnel.
After that Phabr...
[21:49:57] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (Volans) @herron @fgiunchedi I don't think that much, I guess you have to do the triggering part, I'm not super clear what you have in mind, a script to...
[22:00:53] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (dcausse) Problem is that Special:EntityData does not support displaying the content of non main slot, it seems to just redirect...
[22:14:37] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Multichill) It doesn't ? Looks to me it does: * https://commons.wikimedia.org/wiki/Special:EntityData/M1916.json type="applicat...
[22:51:54] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (dcausse) Nothing is broken if we are OK having `Special:EntityData/M123` linking back to the File page, the only thing that I do...
[23:23:00] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Multichill) Yeah, it the content negotiation is for html, showing the File page is imho the only correct location to link to. T...