[01:08:01] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis)
[01:09:22] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis)
[01:49:06] 10Traffic, 10Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) Even though moving (mv) the file manually works as expected without a reload, the rotation triggered by logrotate isn't forcing ATS to open a new file: ` -rw-r--r-- 1 trafficserver trafficserver...
[02:18:30] 10Traffic, 10Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) upon manual removal of the empty error.log file, trafficserver creates a new one (without issuing a reload): ` -rw-r--r-- 1 trafficserver trafficserver 2.8K Nov 25 02:17 error.log -rw-r--r-- 1 tr...
[02:47:42] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez)
[03:03:24] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) p:05Triage→03Normal
[03:09:09] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[03:09:30] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[03:09:32] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez)
[03:14:30] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Nothing on the logs or on SEL
[03:14:32] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[05:20:32] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) > jcrespo changed the task status from Open to Stalled. What exactly is this task [stalled](https://w...
[06:40:31] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) db2125 crashed too and it is a new R440: {T239042}
[06:40:50] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[06:41:09] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[08:17:50] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10jcrespo) @Aklapper an answer to T99216#2057570
[08:39:39] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[08:54:11] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10fgiunchedi)
[08:56:59] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[09:19:46] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles)
[10:22:04] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) Given T238901#5687813 it seems it's fixed.
[11:41:37] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Noteworthy graph found by @ema: https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?panelId=4&...
[11:49:59] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris)
[12:28:18] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10AMuigai) Makes sense to me on both fronts @Neil_P._Quinn_WMF
[12:49:51] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Vgutierrez) @volans @ayounsi IMHO it doesn't make any sense to include smokeping.wm.o SNI on the librenms certificate, that would set a dependency between otherw...
[13:34:43] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Volans) No problem for me for 1 cert, it seems a reasonable approach.
[13:52:15] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Volans) If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)
[14:29:07] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) Ah, thanks. But who exactly is supposed to answer that question?
[14:35:22] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10ayounsi)
[14:39:06] hi everybody, I am wondering if there are any thoughts about https://phabricator.wikimedia.org/T237993 (replacement for varnishkafka)
[14:39:10] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10fgiunchedi) Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert
[14:48:41] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The...
[14:57:57] ema: we're due to repool cp3056 today in https://phabricator.wikimedia.org/T236497 . It's a text node in the ats-be set (was varnish-be before hw was dead, but reimaged last week as initial ats-be)
[14:58:34] ema: does this interfere with the perf regression experimentation and rollback to varnish-be, etc? or can I just go ahead and pool it?
[15:00:49] bblack: nope, it shouldn't interfere. What I'm trying to do is to find patterns of miss/pass traffic on esams ats-be that are hits on cp3064
[15:00:59] ok, pooling...
[15:01:06] thanks for the heads up!
[15:18:05] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet...
[15:31:41] bblack: oh, on cp3064 mount /dev/nvme0n1p1 fails now after reimage to varnish-be
[15:31:49] I guess I need to run the mke2fs command manually
[15:32:14] the one removed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552547/
[15:33:21] trying
[15:33:30] yup that was it
[15:35:15] yes
[15:35:39] we can revert if we anticipate more reversions! :)
[15:37:04] I'd rather superstitiously not :)
[15:39:07] I'm hoping for a "simple" explanation in that some class of traffic is getting passed that wasn't before or whatever
[15:40:22] yeah
[15:40:48] I did go through the backend VCL and ATS Lua once again and there's nothing obviously wrong
[15:41:21] there could be other layers to the confusion too, but we can go down that road once this one seems exhausted
[15:46:53] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) Among the things I've checked to rule out obvious mistakes porting VCL to Lua: - Cookie responses without "session" or "toke...
[16:02:48] finishing up scary puppet runs, be there soon
[16:11:05] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon)
[16:13:38] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The Anchor is now installed, connected to the SCS, and we see a getty on serial with the right hostname. It's also now responsive to IPv4 pings but not IPv6 (which matches our previous experie...
[16:52:15] 10Traffic, 10netops, 10Operations, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) I've a proposal for doing this: - Add some special tag like `#NRPE` or `#page` to the names of any [[ https://librenms.wikimedia.org/alert-rules | L...
[17:43:08] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[17:43:13] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10BBlack) It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in...
[17:51:06] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10Jdforrester-WMF)
[17:51:26] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) 05Open→03Stalled
[18:13:35] vgutierrez: PROBLEM - Disk space on cp4028 is CRITICAL - logrotate issue?
[18:14:11] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[18:16:09] I hope for his sake that he's sleeping by now :D
[18:17:21] hopefully, but you never know
[18:17:35] 4.2G var
[18:17:41] on a 9.1 partition
[18:18:00] yes
[18:19:30] ema: anything unusual on cp4028? I don't see the issue on others
[18:19:52] hmmm
[18:19:55] checking
[18:20:22] vgutierrez: it's not logrotate
[18:20:31] and it is on others
[18:20:59] something like 60% of the cp fleet is exhibiting the log spam, just not all of them have small root partitions
[18:23:37] wow
[18:23:55] so it's the lua reload patch
[18:23:57] yes
[18:24:04] which for some apparently intentional reason sets HOSTNAME to nil
[18:24:21] ema was running a cumin task restarting trafficserver...
[18:24:22] and then we get this existing line:
[18:24:24] ts.client_response.header['X-Cache-Int'] = get_hostname() .. " " .. cache_status
[18:24:34] and lua doesn't like get_hostname()'s nil result in a string concat
[18:25:12] I don't see a cumin running now though, maybe he paused
[18:25:23] or finished already :/
[18:25:47] I think we should roll that back and restart TS
[18:26:01] let's see if I can get ema online
[18:26:49] I'm doing a hacky fix to buy more time on all the affected ones
[18:27:05] (wiping out one of the oversized logfiles and restarting rsyslog to free it)
[18:27:26] cp1 and cp3 don't have it as bad, because they have much larger root partitions
[18:28:18] so that should buy us some time, it took a while to get bad
[18:30:07] in the meantime, I'm trying to debug what actually went wrong with the earlier patch in case it's salvageable
[18:30:49] hmmm I think it's the context of dofile()
[18:31:05] it's being evaluated inside the function instead of globally
[18:31:24] so it doesn't work as expected I'd say
[18:32:17] also the assert seems backwards
[18:32:21] it's asserting that it's broken :)
[18:32:36] and the conf file being loaded has no newline, so that might affect it loading correctly too?
[18:32:43] .... cp3056 hasn't crashed.
[18:32:47] i am pleasantly surprised.
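(Editor's note: the failing X-Cache-Int line above is easy to reproduce outside ATS. A minimal plain-Lua sketch of the failure mode; `get_hostname` and `cache_status` here are illustrative stand-ins for the real definitions in default.lua, which this log doesn't show.)

```lua
-- Plain-Lua reproduction of the error flooding the logs.
-- HOSTNAME ends up nil after a reload (the bug under discussion), and
-- Lua raises an error rather than coercing nil in a string concatenation.
HOSTNAME = nil

local function get_hostname()
  return HOSTNAME  -- stand-in for the helper in default.lua
end

-- Mirrors the failing line:
--   ts.client_response.header['X-Cache-Int'] = get_hostname() .. " " .. cache_status
local cache_status = "hit-front"  -- illustrative value
local ok, err = pcall(function()
  return get_hostname() .. " " .. cache_status
end)
print(ok, err)
--> false   ...: attempt to concatenate a nil value

-- A defensive variant degrades gracefully instead of erroring per request:
print((HOSTNAME or "unknown") .. " " .. cache_status)
--> unknown hit-front
```

Unlike many languages, Lua refuses to coerce nil in `..` concatenation, which is why every request handled after the bad reload logged an error instead of just emitting a blank hostname.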
[18:33:26] what's up
[18:33:38] Nov 25 18:32:16 cp4028 traffic_manager[32739]: [Nov 25 18:32:16.344] {0x2b3ed892f700} ERROR: [ts_lua] lua_pcall failed: /etc/trafficserver/lua/default.lua:48: attempt to concatenate a nil value
[18:33:44] ema: the global HOSTNAME is remaining nil
[18:33:46] that's flooding the cp hosts' logs
[18:33:50] and the assertion is backwards so it doesn't catch it
[18:33:56] and then yeah those log entries filling disks
[18:34:02] uh
[18:34:03] we have the same issue with the TLS instance
[18:34:08] I'd assume
[18:34:16] upon reload I guess, at startup I think it worked?
[18:34:23] I donno
[18:34:33] as in, after restart I was seeing proper values for X-Cache-Int
[18:34:36] let's rollback
[18:34:58] oh are you saying we've had some reloads on unrestarted ones or something?
[18:35:02] and more restarts == fix?
[18:35:23] either way the assert seems backwards
[18:35:28] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/552869
[18:35:40] I've temporarily freed up the disk space, we have a few minutes to think
[18:36:46] oh wait
[18:36:54] are lua's assert()s backwards?
[18:37:13] assert(HOSTNAME ~= nil, "Cannot read HOSTNAME from " .. configfile)
[18:37:39] the usual way that assert works is the condition should be true in the non-error case
[18:37:39] this didn't assert when HOSTNAME was set
[18:37:52] maybe lua gets the whole thing backwards
[18:38:20] or I did :)
[18:38:21] e.g. assert(x != 0, "X can never be zero")
[18:39:03] oh yeah, it's you not lua, I finally found some readable docs
[18:39:24] > banana = nil
[18:39:24] > assert(banana ~= nil)
[18:39:24] stdin:1: assertion failed!
[18:39:24] stack traceback: [C]: in function 'assert' stdin:1: in main chunk [C]: in ?
[18:39:48] ~= is "not equal" BTW
[18:40:17] oh, so that's super confusing, but I guess once you get used to lua again heh
[18:40:30] assert raises an error if the first argument is false/nil
[18:40:33] so why didn't the assert fail?
[18:40:56] I'm wondering if global HOSTNAME != HOSTNAME within read_config() function
[18:41:09] but he said he saw some valid hostname values in x-cache-int at some point
[18:41:16] yeah, I saw that as well
[18:41:20] right, immediately after ats-backend-restart
[18:41:30] have they all been restarted already?
[18:41:30] we validated this on cp1075 and it looked good
[18:41:32] so at least the __init__(argtb) call worked fine
[18:41:42] (for the lua reload change)
[18:41:45] but maybe __reload__() unsets things?
[18:41:57] still unclear why the assert wouldn't fire
[18:42:23] it should fire if ~= evaluates to false or nil
[18:42:32] all restarts have happened
[18:42:44] (but confusingly, 0 is a numeric value which is true in the lua world)
[18:42:54] I'd say let's revert for now and figure out tomorrow?
[18:43:06] +1 :_)
[18:43:26] I'd love to debug this at human times
[18:43:55] reverting
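(Editor's note: the assert() semantics puzzled over above can be confirmed in any standalone Lua 5.1+ interpreter; this snippet uses only standard-library behavior.)

```lua
-- assert(v, msg) raises msg when v is false or nil; otherwise it returns v.
-- So the line from default.lua is written correctly in isolation:
--   assert(HOSTNAME ~= nil, "Cannot read HOSTNAME from " .. configfile)

HOSTNAME = "cp4028"
assert(HOSTNAME ~= nil, "unset")          -- passes: "cp4028" ~= nil is true

HOSTNAME = nil
local ok, err = pcall(assert, HOSTNAME ~= nil, "unset")
print(ok, err)                            --> false   unset

-- Shorter equivalent, as noted above, since assert() already treats
-- nil and false as failure:
ok, err = pcall(assert, HOSTNAME, "unset")
print(ok, err)                            --> false   unset

-- The caveat from the discussion: only nil and false are falsy in Lua,
-- so assert(0) and assert("") both pass.
assert(0, "never fires")
```

This supports the conclusion the channel eventually reached: the assert line itself was fine, and the real question was why it never executed again after a reload.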
[18:43:56] sudo cumin 'A:cp' 'tail -1000 /var/log/daemon.log|grep -q lua_pcall; echo $?'
[18:44:09] the zeros are: (47) cp[2001-2002,2004-2007,2010-2011,2013-2014,2016-2017,2022,2024-2025].codfw.wmnet,cp[1075-1077,1079-1082,1084,1086].eqiad.wmnet,cp[5001-5002,5005-5008,5010].eqsin.wmnet,cp[3050-3051,3053-3055,3057,3059-3061,3063,3065].esams.wmnet,cp[4022,4028-4029,4031-4032].ulsfo.wmnet
[18:44:18] ones: cp[2008,2012,2018-2020,2023,2026].codfw.wmnet,cp[1078,1083,1085,1087-1090].eqiad.wmnet,cp[5003-5004,5009,5011-5012].eqsin.wmnet,cp[3052,3056,3058,3062,3064].esams.wmnet,cp[4021,4023-4027,4030].ulsfo.wmnet
[18:44:24] not sure what the difference is in those sets
[18:44:50] the failing ones, have they been reloaded?
[18:44:59] no idea
[18:45:05] I mean, I haven't intentionally reloaded anything
[18:45:19] does etcd pooling trigger reloads, or something?
[18:45:32] it shouldn't be
[18:45:52] hmmm
[18:45:58] I merged the logrotate change
[18:46:02] that triggered a reload
[18:46:04] ah
[18:46:13] I'd say that happened on some hosts before your restart
[18:46:20] yes, reloads
[18:46:23] so the failing ones might have been reloaded without being restarted
[18:46:25] Nov 25 17:00:19 cp4028 puppet-agent[34901]: Computing checksum on file /etc/logrotate.d/ats-backend
[18:46:28] Nov 25 17:00:19 cp4028 puppet-agent[34901]: (/Stage[main]/Profile::Trafficserver::Backend/Trafficserver::Instance[backend]/File[/etc/logrotate.d/ats-backend]) Filebucketed /etc/logrotate.d/ats-backend to puppet with sum 76e9d84d04f27d3331f22134a0172d23
[18:46:34] Nov 25 17:00:19 cp4028 puppet-agent[34901]: (/Stage[main]/Profile::Trafficserver::Backend/Trafficserver::Instance[backend]/File[/etc/logrotate.d/ats-backend]/content) content changed '{md5}76e9d84d04f27d3331f22134a0172d23' to '{md5}2754d718c6df520f8024f6e8f3a709fe'
[18:46:40] Nov 25 17:00:26 cp4028 systemd[1]: Reloading Apache Traffic Server is a fast, scalable and extensible caching proxy server..
[18:46:43] Nov 25 17:00:26 cp4028 traffic_manager[32739]: [Nov 25 17:00:26.406] {0x7f26769f9700} NOTE: User has changed config file records.config
[18:46:46] Nov 25 17:00:26 cp4028 traffic_manager[32739]: [Nov 25 17:00:26.437] {0x2b3ed8548700} ERROR: [ts_lua] lua_pcall failed: /etc/trafficserver/lua/default.lua:48: attempt to concatenate a nil value
[18:46:50] that's the pattern
[18:46:52] logrotate change -> reload -> fail
[18:47:48] but on that same host
[18:47:51] Nov 25 16:55:44 cp4028 ats-restart: Repooling name=cp4028.ulsfo.wmnet,service=nginx
[18:48:01] it was restarted just before...
[18:48:57] not sure why exactly, I assume cumin, no idea
[18:49:02] bblack: ok to begin restarts to undo the change?
[18:49:09] yes
[18:49:27] I'm gonna wipe logfiles again to patch up disk space on the affected ones
[18:49:31] thanks
[18:49:34] shouldn't interfere, it's all rsyslog-level
[18:52:14] ema: worry about ulsfo/eqsin (and codfw I guess) first
[18:52:33] they all have small root partitions. esams and eqiad have much larger, will take them longer to exhaust by a lot
[18:52:40] ack
[18:52:47] bblack: /var/log/syslog is also fat on those hosts
[18:52:51] started eqiad and codfw, proceeding with ulsfo/eqsin
[18:52:52] ~4.3Gb
[18:53:15] yeah I've been killing just daemon.log so we still had some log to look at
[18:53:20] but I guess we don't need it now
[18:53:49] and I guess my fix lasts a shorter amount of time each time if I keep letting syslog grow
[18:55:21] restarts began in all DCs
[18:55:36] did we turn off syslog repeat-suppression everywhere for some reason? might've helped heh
[18:55:39] I could have done text and upload in parallel really, noticed only now
[18:56:01] ~60 secs + puppet run per host
[18:57:31] apparently I really need to brush up on my lua
[18:57:42] hmm if this affected ats-tls in the same way, we've lost some XCPS stats
[18:57:43] I had a lot of silly confusion staring at the assert line
[18:58:01] well the syntax doesn't help does it :)
[18:58:40] hmmm or not, because the assert isn't being triggered...
[18:58:41] at least it has variables
[18:58:57] so we just lost websockets for text
[18:59:03] it isn't (that) bad
[18:59:24] so:
[18:59:25] banana = "ciao"
[18:59:29] assert(banana ~= nil)
[18:59:33] think of all the etherpad D&D sessions you may have inconvenienced
[18:59:34] this isn't triggered ^
[18:59:49] because nil ~= nil is true
[19:00:04] it's the mathematical nil, it can't be equal to anything?
[19:00:42] since assert() will raise on anything that evaluates to false or nil, you can just do assert(HOSTNAME, "blah") too
[19:01:06] yep
[19:01:40] yep, but that assert() is working as expected on a lua CLI..
[19:01:48] this is pretty weird
[19:02:54] it's 3 AM here, so I'll look at this "tomorrow" with fresh eyes
[19:03:28] good night vgutierrez
[19:04:16] so, on hosts where the puppet run + ats restart happened before the logrotate patch got merged, things worked
[19:04:21] cp1075 is an example
[19:05:19] ok
[19:05:56] well I do see errors on 1075 actually
[19:06:26] my hunch is:
[19:06:36] "read_config()" is called once at start, and not on reload
[19:06:55] so it works fine on initial start, but then the hostname file is not loaded (thus the global goes back to nil) and the assert is not executed, on a reload
[19:07:07] (it re-evaluates the lua on a reload, but does not call the read_config() hook, in other words)
[19:07:19] that would make sense, but __reload__ should be called at reload
[19:09:49] what if something imports the script though
[19:10:23] would that overwrite the global value of HOSTNAME?
[19:10:26] nothing in the docs mentions __reload__
[19:10:38] (that I can find)
[19:10:41] they do show __init__
[19:10:44] 10netops, 10Operations, 10ops-esams: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon)
[19:10:48] yeah, it's documented by the source code :)
[19:10:54] awesome!
[19:12:17] so, reading the 8.0.x lua plugin source code (keep in mind I'm reading it for the first time, so I could be wrong!)
[19:12:31] it seems like __reload__ is actually a cleanup function that happens just before the reloading code
[19:12:36] and maybe used to be called __cleanup__
[19:12:41] err __clean__
[19:12:56] https://github.com/apache/trafficserver/blob/21c82cf370a5c4c53ebdde23f161af3485b95aa8/plugins/lua/ts_lua_util.c#L328
[19:13:29] in any case, it calls __reload__ and then creates a new global lua context
[19:13:43] I think __reload__ is meant to do any pre-reload cleanup, like freeing up other resources?
[19:14:27] is there any issue with me merging text-lb traffic changes right now? https://gerrit.wikimedia.org/r/c/operations/puppet/+/552879
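(Editor's note: if the hunch above is right, that the script body is re-evaluated on reload but the __init__/read_config() hook is not re-run in the new context, one candidate fix is to do the read at file scope so it runs on every evaluation. This is a hypothetical sketch, not the actual default.lua: the configfile path and the read_config() body are illustrative assumptions.)

```lua
-- Hypothetical sketch only: the path and read_config() internals are
-- assumptions, not the real contents of default.lua.
local configfile = "/etc/trafficserver/lua/hostname.conf"

local function read_config()
  local f = io.open(configfile, "r")
  assert(f, "Cannot open " .. configfile)
  -- Trim trailing whitespace so a missing (or present) final newline,
  -- mentioned above as a suspect, cannot matter.
  local value = f:read("*a"):gsub("%s+$", "")
  f:close()
  assert(value ~= "", "Cannot read HOSTNAME from " .. configfile)
  return value
end

-- Executed at file scope: this runs every time the plugin evaluates the
-- script, including when a reload creates a fresh lua context, so
-- HOSTNAME cannot silently revert to nil even if __init__ is never
-- re-invoked.
HOSTNAME = read_config()
```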
[19:15:27] cdanis: wait please, lots going on at this layer at present
[19:15:36] ack
[19:18:15] ema: I still have a hard time following that lua plugin code, but it really does still seem like __init__ is called on fresh load of module, then __reload__ is called just before the new lua context for a reload is created (but there is no further __init__ nor __reload__ call inside the new context)
[19:18:17] bblack: on some ulsfo hosts puppet failed running due to lack of disk space
[19:18:35] what's the command you've been using to free things up?
[19:18:46] I just re-ran, probably since that happened, try again
[19:18:54] sudo cumin 'cp*.eqsin.wmnet or cp*.ulsfo.wmnet or cp*.codfw.wmnet' 'rm -f /var/log/daemon.log /var/log/syslog.log; systemctl restart rsyslog.service'
[19:19:00] can I help?
[19:20:35] what's the command to see if the last puppet run failed?
[19:21:29] ema: you could just do a cumin like so https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[19:21:35] run-puppet-agent --failed-only
[19:22:03] ugh fixing 4028 and 5007 manually
[19:22:11] bblack: I've done 4028
[19:22:38] bad command
[19:22:44] "/var/log/syslog.log" :)
[19:22:51] just fixed it and ran it again everywhere heh
[19:24:30] this is the order in which things failed on hosts with full /:
[19:24:31] https://phabricator.wikimedia.org/P9738
[19:24:53] so the backend restarts work even with full /
[19:25:22] rsyslog has both duplicate suppression and ratelimiting apparently
[19:25:28] trafficserver-tls instead fails to start
[19:25:32] maybe as a meta-followup we should look into those :)
[19:25:45] so wherever it failed we also need to make sure we re-pool "nginx"
[19:26:18] is everything caught up now?
[19:26:26] (to the cumin progress)
[19:26:57] esams has two hosts to go
[19:27:07] eqsin 1
[19:27:19] ulsfo is done but some have failed
[19:27:29] codfw 1 to go
[19:28:19] btw ema you can check for depooled nginxes like so: confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no")'
[19:28:29] https://phabricator.wikimedia.org/P9739
[19:28:36] cdanis: excellent, thanks
[19:28:58] oh, that includes some things like cluster=api_appserver haha
[19:29:13] add a cluster argument I guess! :)
[19:29:42] confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no") | select(.tags | contains("cluster=cache_"))'
[19:29:51] --> https://phabricator.wikimedia.org/P9740
[19:29:52] I don't even remember how disruptive it is to rename a confd service, but maybe we should do that at some point in this case :)
[19:32:17] I think we're out of the woods
[19:32:32] just forced an icinga re-check of services that were still red, I see recoveries coming
[19:32:57] and the apparently-bad lua was reverted?
[19:33:11] yes
[19:33:22] cool, i am going to continue with my deploy
[19:33:56] ack I'm going to continue with my evening :)
[19:34:11] but text me if needed
[19:34:44] enjoy :)
[20:05:31] rabbitholes :P
[20:06:06] the icinga "check_dns" plugin (as used by our icinga to monitor rec and auth dns services, and also re-used by our anycast healthchecker for recdns)...
[20:06:52] it doesn't have an argument to use an alternate dns server port, and I really don't want that to be the reason I go through a bunch of bullshit to define separate in-subnet public IPs for all our dns boxes, esp as IP space is limited at edge sites
[20:07:11] so I thought maybe we could just patch it easily, but of course it's a compiled binary
[20:07:46] and when you look at the source code, it's literally written in C and what it does is wrap an execution of the external (and ancient and truly horrible) nslookup binary.
[20:08:19] I mean, if you're going to shell out to nslookup (ewwww) and all you're doing is parsing arguments and parsing output, you could at least not do that in C :P
[20:08:40] so I guess I'm going to replace check_dns with something simpler, in a scripting language :P
[20:10:09] who woke up one day and said "I need to write a simple healthcheck, so I'm going to write a horrible C program that just wraps the execution of another even more-horrible C program to do some parsing on the inputs and outputs?"
[20:13:14] (and then not bother with a port argument, to boot)
[20:14:49] bblack: I mean, once you've already decided to write nagios/icinga in straight C...
[20:15:06] but yeah that is pretty absurd
[20:15:35] check_http has its own HTTP client written in C, which, while better than shelling out to curl, is decidedly not better than linking libcurl
[20:17:29] maybe I should continue the insanity and write a new wrapper, also in C, called check_dns_with_port, which executes check_dns, but then uses ptrace to intercept critical socket calls and change the port number.
[20:20:12] some other time I have a fun story for you about wrapping socket calls
[23:31:05] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Dzahn) >>! In T99216#5689785, @Aklapper wrote: > Ah, thanks. But who exactly is supposed to answer that question...
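(Editor's note: for reference on the check_dns thread above, a sketch of the "something simpler, in a scripting language" that bblack mentions, kept in Lua to match the rest of this log. It shells out to dig, which, unlike the nslookup wrapper, does accept a server port via -p. The script name, defaults, and output format are illustrative assumptions, not the replacement that was actually written.)

```lua
#!/usr/bin/env lua
-- check_dns_port.lua (hypothetical name): a minimal nagios-style DNS
-- check that, unlike check_dns, accepts a server port argument.
-- Usage: check_dns_port.lua <name> <server> [port]
local name, server, port = arg[1], arg[2], arg[3] or "53"
if not (name and server) then
  print("UNKNOWN: usage: check_dns_port.lua <name> <server> [port]")
  os.exit(3)
end

-- dig takes @server and -p <port>, which is the whole point here.
local cmd = string.format(
  "dig +short +time=2 +tries=1 -p %s @%s %s 2>/dev/null",
  port, server, name)
local pipe = io.popen(cmd)
local output = pipe:read("*a")
local ok = pipe:close()  -- exit status is only reported on Lua 5.2+

-- +short prints answers one per line; on failure it prints nothing,
-- or a ";; ..." diagnostic on timeouts.
if ok and output:match("%S") and not output:match("^;;") then
  print(string.format("OK: %s @ %s:%s -> %s",
                      name, server, port, (output:gsub("%s+", " "))))
  os.exit(0)
end
print(string.format("CRITICAL: %s did not resolve via %s:%s",
                    name, server, port))
os.exit(2)
```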