[00:26:16] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (DannyS712) I just got 502 again on commons during a normal edit (`Request from [snip] via cp4030.ulsfo.wmnet, ATS/8.0.5...
[02:41:04] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Vgutierrez) I don't think so, at 00:25 GMT we had 75 requests (including yours) failing against appservers-rw.discovery....
[04:57:23] Traffic, Operations: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[05:09:17] Traffic, Operations, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Bugreporter) Note very old browsers may not support HSTS preload list or even HSTS itself; probably we want to configure a specific 403 message (...
[05:21:01] Traffic, Operations, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) we are targeting here the one-off sites, some of them are already configured to support TLSv1.2 only, that's usually a stricter requi...
[05:22:00] Traffic, Operations, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) Just to be clear, wikipedia and the rest of the canonical sites are out of scope for this task :)
[05:22:22] Traffic, Operations, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) p:Triage→Normal
[06:52:00] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Joe) I don't think the solution is removing aphlict, but instead proxying to it directly from envoy or ATS, our choice. @ema @Dz...
[07:36:05] Traffic, Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez)
[07:36:49] Traffic, Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez)
[07:36:55] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (DannyS712) Again when trying to edit on species wiki: `Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502,...
[09:03:28] Traffic, DNS, Operations, SRE-tools: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (fgiunchedi)
[09:11:35] Traffic, Operations, RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (mobrovac) Resolved→Open This doesn't seem to be working as expected. On the client, I always get `Server: envoy`: ` $ curl https://test.wikipedia.org/api/rest_v1/page/html/Testparso...
[09:18:50] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Vgutierrez) From ATS source code, it looks like ATS logs connect errors on the first attempt but it's configured to perf...
[09:41:09] Traffic, Operations, RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Joe) Ok, a bit of digging: `lang=bash # From the public internet $ for dc in eqiad codfw esams eqsin ulsfo; do echo -n "$dc: "; curl --resolve test.wikipedia.org:443:$(dig +short text-lb.$...
[09:52:08] Traffic, Operations, RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Joe) Not very mysteriously, the edges use ATS-BE so they call envoy, while the main dcs are still contacting restbase directly. Meh.
[09:56:54] Traffic, Operations, RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Joe) And indeed it seems things are not working as expected: ` restbase2015:~$ curl restbase2015:7231/de.wikipedia.org/v1/page/references/Der_Junge_mit_dem_gro%C3%9Fen_schwarzen_Hund -Is |...
[09:57:22] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Vgutierrez) I find this pretty worrisome for the following reasons: # right now we have one remap rule that catches all...
[09:58:40] Traffic, Operations, RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Joe)
[10:08:19] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Joe) >>! In T237319#5677665, @Vgutierrez wrote: > I find this pretty worrisome for the following reasons: > # right now...
[10:11:22] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Vgutierrez) nope, a 5xx doesn't translate to `BAD_INCOMING_RESPONSE`, actually is specifically whitelisted: `lang=C++ ca...
[10:12:52] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (DannyS712)
[10:13:00] Traffic, MediaWiki-extensions-Translate, Operations, Wikidata, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (DannyS712)
[10:13:15] Traffic, Operations, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (DannyS712)
[10:18:44] Traffic, Operations, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Joe) >>! In T237319#5677739, @Vgutierrez wrote: > nope, a 5xx doesn't translate to `BAD_INCOMING_RESPONSE`, actually is specifically whitelisted: > `lang=C++ > case STATUS_CODE_SERVER_ERROR: >...
[10:36:55] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2010.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[10:48:01] Traffic, Operations, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Vgutierrez) > One real problem is: some of our requests are incredibly long and might overflow the timeouts in ATS-BE - in that case, pybal won't depool a backend but we might still get an error...
[10:56:57] Traffic, Operations, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Vgutierrez) @Joe on the other hand ATS doesn't reap connections for hosts marked as down, and because ats-be uses KA it should have plenty of available connections against appserver-rw.discovery....
[11:04:10] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2010.codfw.wmnet'] ` and were **ALL** successful.
[11:24:39] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2012.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[11:51:14] Traffic, Operations, Wikidata, observability, User-Addshore: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (Addshore) Open→Resolved a:Addshore
[11:53:01] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2012.codfw.wmnet'] ` and were **ALL** successful.
[12:10:08] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2013.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[12:38:15] Traffic, Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2013.codfw.wmnet'] ` and were **ALL** successful.
[12:45:55] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Lucas_Werkmeister_WMDE) I agree that nothing is broken with Special:EntityData behavior. It would be nice if it was possible to...
[13:24:22] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) Either way when switching the Hiera key it should either enable or disable all the things and not just some of them. A...
[13:52:43] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by...
[14:09:31] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2016.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[14:14:38] ema: o/
[14:14:40] q for you in https://gerrit.wikimedia.org/r/c/operations/puppet/+/551247
[14:14:55] in trafficserver/backend.yaml
[14:15:11] we are mapping from frontend url to backend discovery url
[14:15:19] the discovery url supports https via envoyproxy
[14:15:25] what should the ats map target rule be?
[14:15:29] with or without https?
[14:15:34] for the frontend url
[14:16:38] hey ottomata!
[14:17:00] in backend.yaml the target should be https
[14:17:14] sorry
[14:17:18] the replacement should be https
[14:17:39] the target has got to be plain http given that the client is varnish-fe, and varnish does not speak TLS
[14:20:14] ah! interesting
[14:20:26] ok cool, so TLS is already terminated by the time it gets here (whatever here is)
[14:20:32] nginx?
[14:20:49] nginx -> varnish-fe -> ats-be -> app?
[14:20:57] ats-tls -> varnish-fe -> ats-be :)
[14:21:02] ahhh cool!
[14:21:09] the ats sandwich is ready
[14:21:12] yum
[14:21:23] and yes, ats-be -> app
[14:22:57] ty
[14:37:48] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2016.codfw.wmnet'] ` and were **ALL** successful.
[15:01:10] Traffic, Operations, Performance-Team: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) This is the data for cp3064 over that period. {F31111963, size=full} The 2 dots represent the 2 events you've mentioned. Ignore the fact that...
[15:02:49] Traffic, Operations, Performance-Team: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) Do you expect that there's a delay for the SSL cert change to affect people? If that's the case then we can certainly see a regression ramping...
[15:05:52] Traffic, Cloud-Services, Operations, cloud-services-team, and 4 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (Dzahn) modules still using base::service_unit: - confd - varnishkafka - mediawiki (cgroups) - udp2log - uwsgi - redis - service - (base)...
[15:23:19] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2019.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[15:30:53] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (fgiunchedi) Thanks! I think we should go with (2) (i.e. investigate integration between icinga (or grafana alerts, and from there icinga checks) for fas...
[15:54:39] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2019.codfw.wmnet'] ` and were **ALL** successful.
[16:07:12] Traffic, Operations, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (ema) On cp1075: ` $ sudo stap -ve 'probe process("/usr/bin/traffic_server").statement("retry_server_connection_not_open@./proxy/http/HttpTransact.cc:3612") { printf("%d retry %d max_retries resp...
[16:25:29] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[16:49:59] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) This just happened on cp2023 too.
[16:50:25] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var...
[16:57:23] Traffic, DNS, Operations, SRE-tools: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (crusnov) p:Triage→Normal
[16:57:59] Traffic, Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (crusnov) p:Triage→Normal
[17:16:01] Traffic, Operations, Performance-Team, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (BBlack) There were two TLS-level changes to the certificate output for esams specifically, each of which bumped the output size (t...
[17:25:52] Traffic, Operations: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` and were **ALL** successful.
[17:41:05] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[17:51:08] Traffic, DNS, Operations, SRE-tools: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (Volans) @fgiunchedi I think is fair request, but given we're in process of auto-generating all mgmt and then server's DNS records this might have less benefit that in th...
[17:59:33] there are 2 separate people reporting 503s right now, if you want to check
[19:18:10] Traffic, Operations, Phabricator, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn)
[20:16:33] Traffic, Analytics, Operations, User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927 (Ottomata) Open→Declined Old task, I think we aren't likely to do this. Declining, feel free to reopen i...
[20:27:03] Traffic, Operations, Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (Krinkle) Resolved→Open Re-opening but not 100% sure it was this change that caused the issue. The issue - When `X-Wikimedia-Debug` is enabled (e.g. via the Wikime...
[20:27:05] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (Krinkle)
[20:55:24] Traffic, Operations, Phabricator, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) >>! In T238593#5678148, @Dzahn wrote: > As Mukunda pointed out the aphlict service does not even...
[21:22:59] Traffic, Operations, Inuka-Team (Kanban), Patch-For-Review, Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (Neil_P._Quinn_WMF) @SBisson I looked over the patch and [the schema](https://meta.wikimedia.org/wiki/Schema:InukaPageView)...
[22:29:55] HTTPS, Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (Krenair) done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/482142 ?
[23:03:25] Traffic, Operations, fixcopyright.wikimedia.org, Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)
[23:06:11] Traffic, Operations, fixcopyright.wikimedia.org, Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Koavf) I object to deletion: as long as we still own the domain names (that is, "we" being the WMF, not us personally), URIs should stay...
[23:18:19] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)
[23:19:01] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Aklapper)
[23:23:04] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (MarcoAurelio)
[23:23:20] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Aklapper) If this gets done, potential steps afterwards could be * declining the Phab tasks https://phabr...
[23:26:39] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) The Phab tasks contain some lessons learned. I agree they should be declined, but those le...
[23:26:54] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (MarcoAurelio) @Aklapper re. point 3: @CCicalese_WMF above mentions that she does not want those extension...
[23:28:41] Traffic, Operations, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)