[04:49:31] Traffic, Operations: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (Vgutierrez)
[04:53:18] Traffic, Operations: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (Vgutierrez) p:Triage→Medium `name=TSRemapDeleteInstance stacktrace Mar 30 12:07:56 cp2013 traffic_manager[32876]: traffic_server: received signal 11 (Segmentation fault) Mar 30 12:07:56 cp2013...
[04:58:13] Traffic, Operations: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (Vgutierrez) This issue seems to be identified by upstream at https://github.com/apache/trafficserver/pull/6403 but the fix hasn't been backported to ATS 8.x
[05:25:41] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2034.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[05:39:03] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (Vgutierrez)
[05:51:09] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2034.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2034.codfw.wmnet'] `
[05:55:20] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2007.codfw.wmnet` - cp2007.codfw.wmnet (**PASS**) - Downtime...
[06:59:07] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (Vgutierrez) a:Vgutierrez→Papaul
[07:17:55] Traffic, Operations: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[07:33:01] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2035.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[07:48:45] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Vgutierrez)
[07:54:34] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2035.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2035.codfw.wmnet'] `
[08:01:44] Traffic, Operations, Patch-For-Review: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (ema) Another ongoing issue which causes traffic_server to crash upon configuration reloads and related to tslua is T242952.
[08:11:09] Traffic, Operations: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (ema) The issue occurred yesterday on cp2023 and cp1081: ` Mar 30 13:55:02 cp2023 traffic_manager[17786]: PANIC: unprotected error in call to Lua API (attempt to co...
[08:56:36] Traffic, Operations: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[09:10:25] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2011.codfw.wmnet` - cp2011.codfw.wmnet (**PASS**) - Downtime...
[09:14:29] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Vgutierrez) a:Vgutierrez→Papaul
[09:14:53] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[10:01:05] netops, Analytics, Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (elukey) Got it, so basically do all the process of creating the schemas etc.. but push data from pmacct directly to the kafka topic bypassing Eventgate. The fact that we'll have to hav...
[10:40:21] <_joe_> o traffic masters, can I ask for a +1 on https://gerrit.wikimedia.org/r/584904 ?
[10:43:52] _joe_: I forgot, what's the maximum total number of requests served over the same connection on envoy? Is it infinite, or is it limited like on nginx (keepalive_requests)?
[10:44:49] <_joe_> so for upstream clusters it's determined by a parameter, but for listeners I guess it depends from the client and the idle_timeout setting
[10:45:23] <_joe_> ema: my proposal is to try on one server first, see what happens, then do the transition on the others
[10:45:34] <_joe_> I need to disable puppet before running this anyways
[10:45:51] <_joe_> because I need to remove nginx by hand first
[10:48:26] _joe_: +1 but I'll have to go cook now so I can't follow the launch :)
[10:48:59] <_joe_> ema: uhm let's postpone to the afternoon then, I'd like us to watch what happens on both sides
[10:49:41] _joe_: ack!
[12:58:00] Traffic, Operations: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Ottomata) Would love to get this fixed ASAP! Let me know if I can help!
[13:01:48] <_joe_> vgutierrez: on mw1261 envoy_listener_ssl_versions_TLSv1_3{envoy_listener_address="0.0.0.0_443"} 6333
[13:02:12] lovely
[13:02:13] <_joe_> ATS is indeed using tls 1.3
[13:03:17] <_joe_> so let's do it for the api canaries too
[13:03:33] nice :)
[13:37:02] netops, Analytics, Operations: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey)
[13:52:38] Traffic, Operations: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (Vgutierrez) https://github.com/apache/trafficserver/pull/6571 could be handy to tune some TS lua aspects
[13:54:39] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2036.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[14:22:07] Traffic, Operations: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2036.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2036.codfw.wmnet'] `
[14:26:22] netops, Analytics, Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (Ottomata) > We'll need to move current hive data to the new location in the event database You might end up just producing to a new topic name anyway. The topic name will eventually m...
[15:05:23] Traffic, Operations: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[15:18:22] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2037.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[15:21:59] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Vgutierrez)
[15:28:36] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2010.codfw.wmnet` - cp2010.codfw.wmnet (**PASS**) - Downtime...
[15:33:11] netops, Operations, Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis)
[15:35:08] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Vgutierrez) a:Vgutierrez→Papaul
[15:35:31] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[15:37:01] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (Vgutierrez)
[15:44:07] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2037.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2037.codfw.wmnet'] `
[15:45:43] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2014.codfw.wmnet` - cp2014.codfw.wmnet (**PASS**) - Downtime...
[15:49:43] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (Vgutierrez) a:Vgutierrez→Papaul
[15:50:20] Traffic, Operations, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[16:15:45] Traffic, Operations: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[18:43:21] Traffic, Cloud-Services, Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (MusikAnimal)
[18:48:34] Traffic, Cloud-Services, Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (MusikAnimal)
[19:24:42] Traffic, Cloud-Services, Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (CDanis) I did some quick looking at analytics webrequest data and I don't see a marked increase in >29 second TTFB responses nor a marked increase...
[20:11:17] Traffic, Cloud-Services, Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (MusikAnimal) >>! In T249035#6016229, @CDanis wrote: > Would it be possible to get some packet captures of requests that failed? `tcpdump` has sup...
[20:47:27] Traffic, Cloud-Services, Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (bd808) I would expect traffic from a VPS instance to be routed something like: instance local interface -> {neutron overlay network} -> cloudnet e...