[01:08:01] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis)
[01:09:22] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis)
[01:49:06] 10Traffic, 10Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) Even though moving (mv) the file manually works as expected without a reload, the rotation triggered by logrotate isn't forcing ATS to open a new file: ` -rw-r--r-- 1 trafficserver trafficserver...
[02:18:30] 10Traffic, 10Operations: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) upon manual removal of the empty error.log file, trafficserver creates a new one (without issuing a reload): ` -rw-r--r-- 1 trafficserver trafficserver 2.8K Nov 25 02:17 error.log -rw-r--r-- 1 tr...
[02:47:42] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez)
[03:03:24] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) p:05Triage→03Normal
[03:09:09] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[03:09:30] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[03:09:32] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez)
[03:14:30] 10Traffic, 10Operations, 10ops-esams: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Nothing on the logs or on SEL
[03:14:32] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[05:20:32] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) > jcrespo changed the task status from Open to Stalled. What exactly is this task [stalled](https://w...
[06:40:31] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) db2125 crashed too and it is a new R440: {T239042}
[06:40:50] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[06:41:09] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[08:17:50] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10jcrespo) @Aklapper an answer to T99216#2057570
[08:39:39] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[08:54:11] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10fgiunchedi)
[08:56:59] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[09:19:46] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles)
[10:22:04] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) Given T238901#5687813 it seems it's fixed.
[11:41:37] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Noteworthy graph found by @ema: https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?panelId=4&...
[11:49:59] 10Traffic, 10Operations, 10Prod-Kubernetes, 10Pybal, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris)
[12:28:18] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10AMuigai) Makes sense to me on both fronts @Neil_P._Quinn_WMF
[12:49:51] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Vgutierrez) @volans @ayounsi IMHO it doesn't make any sense to include smokeping.wm.o SNI on the librenms certificate, that would set a dependency between otherw...
[13:34:43] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Volans) No problem for me for 1 cert, it seems a reasonable approach.
[13:52:15] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Volans) If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)
[14:29:07] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) Ah, thanks. But who exactly is supposed to answer that question?
[14:35:22] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10ayounsi)
[14:39:06] hi everybody, I am wondering if there are any thoughts about https://phabricator.wikimedia.org/T237993 (replacement for varnishkafka)
[14:39:10] 10Traffic, 10Operations, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10fgiunchedi) Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert
[14:48:41] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The...
[14:57:57] ema: we're due to repool cp3056 today in https://phabricator.wikimedia.org/T236497 . It's a text node in the ats-be set (was varnish-be before hw was dead, but reimaged last week as initial ats-be)
[14:58:34] ema: does this interfere with the perf regression experimentation and rollback to varnish-be, etc? or can I just go ahead and pool it?
[15:00:49] bblack: nope, it shouldn't interfere. What I'm trying to do is to find patterns of miss/pass traffic on esams ats-be that are hits on cp3064
[15:00:59] ok, pooling...
[15:01:06] thanks for the heads up!
[15:18:05] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet...
[15:31:41] bblack: oh, on cp3064 mount /dev/nvme0n1p1 fails now after reimage to varnish-be
[15:31:49] I guess I need to run the mke2fs command manually
[15:32:14] the one removed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552547/
[15:33:21] trying
[15:33:30] yup that was it
[15:35:15] yes
[15:35:39] we can revert if we anticipate more reversions! :)
[15:37:04] I'd rather superstitiously not :)
[15:39:07] I'm hoping for a "simple" explanation in that some class of traffic is getting passed that wasn't before or whatever
[15:40:22] yeah
[15:40:48] I did go through the backend VCL and ATS Lua once again and there's nothing obviously wrong
[15:41:21] there could be other layers to the confusion too, but we can go down that road once this one seems exhausted
[15:46:53] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) Among the things I've checked to rule out obvious mistakes porting VCL to Lua: - Cookie responses without "session" or "toke...
[16:02:48] finishing up scary puppet runs, be there soon
[16:11:05] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon)
[16:13:38] 10netops, 10Operations, 10ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The Anchor is now installed, connected to the SCS, and we see a getty on serial with the right hostname. It's also now responsive to IPv4 pings but not IPv6 (which matches our previous experie...
[16:52:15] 10Traffic, 10netops, 10Operations, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) I've a proposal for doing this: - Add some special tag like `#NRPE` or `#page` to the names of any [[ https://librenms.wikimedia.org/alert-rules | L...
[17:43:08] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[17:43:13] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10BBlack) It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in...
[17:51:06] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10Jdforrester-WMF)
[17:51:26] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) 05Open→03Stalled
[18:13:35] vgutierrez: PROBLEM - Disk space on cp4028 is CRITICAL - logrotate issue?
[18:14:11] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[18:16:09] I hope for his sake that he's sleeping by now :D
[18:17:21] hopefully, but you never know
[18:17:35] 4.2G var
[18:17:41] on a 9.1 partition
[18:18:00] yes
[18:19:30] ema: anything unusual on cp4028? I don't see the issue on others
[18:19:52] hmmm
[18:19:55] checking
[18:20:22] vgutierrez: it's not logrotate
[18:20:31] and it is on others
[18:20:59] something like 60% of the cp fleet is exhibiting the log spam, just not all of them have small root partitions
[18:23:37] wow
[18:23:55] so it's the lua reload patch
[18:23:57] yes
[18:24:04] which for some apparently intentional reason sets HOSTNAME to nil
[18:24:21] ema was running a cumin task restarting trafficserver...
[18:24:22] and then we get this existing line:
[18:24:24] ts.client_response.header['X-Cache-Int'] = get_hostname() .. " " .. cache_status
[18:24:34] and lua doesn't like get_hostname()'s nil result in a string concat
[18:25:12] I don't see a cumin running now though, maybe he paused
[18:25:23] or finished already :/
[18:25:47] I think we should roll that back and restart TS
[18:26:01] let's see if I can get ema online
[18:26:49] I'm doing a hacky fix to buy more time on all the affected ones
[18:27:05] (wiping out one of the oversized logfiles and restarting rsyslog to free it)
[18:27:26] cp1 and cp3 don't have it as bad, because they have much larger root partitions
[18:28:18] so that should buy us some time, it took a while to get bad
[18:30:07] in the meantime, I'm trying to debug what actually went wrong with the earlier patch in case it's salvageable
[18:30:49] hmmm I think it's the context of dofile()
[18:31:05] it's being evaluated inside the function instead of globally
[18:31:24] so it doesn't work as expected I'd say
[18:32:17] also the assert seems backwards
[18:32:21] it's asserting that it's broken :)
[18:32:36] and the conf file being loaded has no newline, so that might affect it loading correctly too?
[18:32:43] .... cp3056 hasn't crashed.
[18:32:47] i am pleasantly surprised.
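(Editor's note: the failing X-Cache-Int line above is easy to reproduce outside ATS. A minimal plain-Lua sketch of the failure mode; `get_hostname` and `cache_status` here are illustrative stand-ins for the real definitions in default.lua, which this log doesn't show.)

```lua
-- Plain-Lua reproduction of the error flooding the logs.
-- HOSTNAME ends up nil after a reload (the bug under discussion), and
-- Lua raises an error rather than coercing nil in a string concatenation.
HOSTNAME = nil

local function get_hostname()
  return HOSTNAME  -- stand-in for the helper in default.lua
end

-- Mirrors the failing line:
--   ts.client_response.header['X-Cache-Int'] = get_hostname() .. " " .. cache_status
local cache_status = "hit-front"  -- illustrative value
local ok, err = pcall(function()
  return get_hostname() .. " " .. cache_status
end)
print(ok, err)
--> false   ...: attempt to concatenate a nil value

-- A defensive variant degrades gracefully instead of erroring per request:
print((HOSTNAME or "unknown") .. " " .. cache_status)
--> unknown hit-front
```

Unlike many languages, Lua refuses to coerce nil in `..` concatenation, which is why every request handled after the bad reload logged an error instead of just emitting a blank hostname.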
[18:33:26] what's up
[18:33:38] Nov 25 18:32:16 cp4028 traffic_manager[32739]: [Nov 25 18:32:16.344] {0x2b3ed892f700} ERROR: [ts_lua] lua_pcall failed: /etc/trafficserver/lua/default.lua:48: attempt to concatenate a nil value
[18:33:44] ema: the global HOSTNAME is remaining nil
[18:33:46] that's flooding the cp hosts' logs
[18:33:50] and the assertion is backwards so it doesn't catch it
[18:33:56] and then yeah those log entries filling disks
[18:34:02] uh
[18:34:03] we have the same issue with the TLS instance
[18:34:08] I'd assume
[18:34:16] upon reload I guess, at startup I think it worked?
[18:34:23] I donno
[18:34:33] as in, after restart I was seeing proper values for X-Cache-Int
[18:34:36] let's rollback
[18:34:58] oh are you saying we've had some reloads on unrestarted ones or something?
[18:35:02] and more restarts == fix?
[18:35:23] either way the assert seems backwards
[18:35:28] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/552869
[18:35:40] I've temporarily freed up the disk space, we have a few minutes to think
[18:36:46] oh wait
[18:36:54] are lua's assert()s backwards?
[18:37:13] assert(HOSTNAME ~= nil, "Cannot read HOSTNAME from " .. configfile)
[18:37:39] the usual way that assert works is the condition should be true in the non-error case
[18:37:39] this didn't assert when HOSTNAME was set
[18:37:52] maybe lua gets the whole thing backwards
[18:38:20] or I did :)
[18:38:21] e.g. assert(x != 0, "X can never be zero")
[18:39:03] oh yeah, it's you not lua, I finally found some readable docs
[18:39:24] > banana = nil
[18:39:24] > assert(banana ~= nil)
[18:39:24] stdin:1: assertion failed!
[18:39:24] stack traceback: [C]: in function 'assert' stdin:1: in main chunk [C]: in ?
[18:39:48] ~= is "not equal" BTW
[18:40:17] oh, so that's super confusing, but I guess once you get used to lua again heh
[18:40:30] assert raises an error if the first argument is false/nil
[18:40:33] so why didn't the assert fail?
[18:40:56] I'm wondering if global HOSTNAME != HOSTNAME within read_config() function
[18:41:09] but he said he saw some valid hostname values in x-cache-int at some point
[18:41:16] yeah, I saw that as well
[18:41:20] right, immediately after ats-backend-restart
[18:41:30] have they all been restarted already?
[18:41:30] we validated this on cp1075 and it looked good
[18:41:32] so at least the __init__(argtb) call worked fine
[18:41:42] (for the lua reload change)
[18:41:45] but maybe __reload__() unsets things?
[18:41:57] still unclear why the assert wouldn't fire
[18:42:23] it should fire if ~= evaluates to false or nil
[18:42:32] all restarts have happened
[18:42:44] (but confusingly, 0 is a numeric value which is true in the lua world)
[18:42:54] I'd say let's revert for now and figure out tomorrow?
[18:43:06] +1 :_)
[18:43:26] I'd love to debug this at human times
[18:43:55] reverting
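(Editor's note: the assert() semantics puzzled over above can be confirmed in any standalone Lua 5.1+ interpreter; this snippet uses only standard-library behavior.)

```lua
-- assert(v, msg) raises msg when v is false or nil; otherwise it returns v.
-- So the line from default.lua is written correctly in isolation:
--   assert(HOSTNAME ~= nil, "Cannot read HOSTNAME from " .. configfile)

HOSTNAME = "cp4028"
assert(HOSTNAME ~= nil, "unset")          -- passes: "cp4028" ~= nil is true

HOSTNAME = nil
local ok, err = pcall(assert, HOSTNAME ~= nil, "unset")
print(ok, err)                            --> false   unset

-- Shorter equivalent, as noted above, since assert() already treats
-- nil and false as failure:
ok, err = pcall(assert, HOSTNAME, "unset")
print(ok, err)                            --> false   unset

-- The caveat from the discussion: only nil and false are falsy in Lua,
-- so assert(0) and assert("") both pass.
assert(0, "never fires")
```

This supports the conclusion the channel eventually reached: the assert line itself was fine, and the real question was why it never executed again after a reload.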
[18:43:56] sudo cumin 'A:cp' 'tail -1000 /var/log/daemon.log|grep -q lua_pcall; echo $?'
[18:44:09] the zeros are: (47) cp[2001-2002,2004-2007,2010-2011,2013-2014,2016-2017,2022,2024-2025].codfw.wmnet,cp[1075-1077,1079-1082,1084,1086].eqiad.wmnet,cp[5001-5002,5005-5008,5010].eqsin.wmnet,cp[3050-3051,3053-3055,3057,3059-3061,3063,3065].esams.wmnet,cp[4022,4028-4029,4031-4032].ulsfo.wmnet
[18:44:18] ones: cp[2008,2012,2018-2020,2023,2026].codfw.wmnet,cp[1078,1083,1085,1087-1090].eqiad.wmnet,cp[5003-5004,5009,5011-5012].eqsin.wmnet,cp[3052,3056,3058,3062,3064].esams.wmnet,cp[4021,4023-4027,4030].ulsfo.wmnet
[18:44:24] not sure what the difference is in those sets
[18:44:50] the failing ones, have they been reloaded?
[18:44:59] no idea
[18:45:05] I mean, I haven't intentionally reloaded anything
[18:45:19] does etcd pooling trigger reloads, or something?
[18:45:32] it shouldn't be
[18:45:52] hmmm
[18:45:58] I merged the logrotate change
[18:46:02] that triggered a reload
[18:46:04] ah
[18:46:13] I'd say that happened on some hosts before your restart
[18:46:20] yes, reloads
[18:46:23] so the failing ones might have been reloaded without being restarted
[18:46:25] Nov 25 17:00:19 cp4028 puppet-agent[34901]: Computing checksum on file /etc/logrotate.d/ats-backend
[18:46:28] Nov 25 17:00:19 cp4028 puppet-agent[34901]: (/Stage[main]/Profile::Trafficserver::Backend/Trafficserver::Instance[backend]/File[/etc/logrotate.d/ats-backend]) Filebucketed /etc/logrotate.d/ats-backend to puppet with sum 76e9d84d04f27d3331f22134a0172d23
[18:46:34] Nov 25 17:00:19 cp4028 puppet-agent[34901]: (/Stage[main]/Profile::Trafficserver::Backend/Trafficserver::Instance[backend]/File[/etc/logrotate.d/ats-backend]/content) content changed '{md5}76e9d84d04f27d3331f22134a0172d23' to '{md5}2754d718c6df520f8024f6e8f3a709fe'
[18:46:40] Nov 25 17:00:26 cp4028 systemd[1]: Reloading Apache Traffic Server is a fast, scalable and extensible caching proxy server..
[18:46:43] Nov 25 17:00:26 cp4028 traffic_manager[32739]: [Nov 25 17:00:26.406] {0x7f26769f9700} NOTE: User has changed config file records.config
[18:46:46] Nov 25 17:00:26 cp4028 traffic_manager[32739]: [Nov 25 17:00:26.437] {0x2b3ed8548700} ERROR: [ts_lua] lua_pcall failed: /etc/trafficserver/lua/default.lua:48: attempt to concatenate a nil value
[18:46:50] that's the pattern
[18:46:52] logrotate change -> reload -> fail
[18:47:48] but on that same host
[18:47:51] Nov 25 16:55:44 cp4028 ats-restart: Repooling name=cp4028.ulsfo.wmnet,service=nginx
[18:48:01] it was restarted just before...
[18:48:57] not sure why exactly, I assume cumin, no idea
[18:49:02] bblack: ok to begin restarts to undo the change?
[18:49:09] yes
[18:49:27] I'm gonna wipe logfiles again to patch up disk space on the affected ones
[18:49:31] thanks
[18:49:34] shouldn't interfere, it's all rsyslog-level
[18:52:14] ema: worry about ulsfo/eqsin (and codfw I guess) first
[18:52:33] they all have small root partitions. esams and eqiad have much larger, will take them longer to exhaust by a lot
[18:52:40] ack
[18:52:47] bblack: /var/log/syslog is also fat on those hosts
[18:52:51] started eqiad and codfw, proceeding with ulsfo/eqsin
[18:52:52] ~4.3Gb
[18:53:15] yeah I've been killing just daemon.log so we still had some log to look at
[18:53:20] but I guess we don't need it now
[18:53:49] and I guess my fix lasts a shorter amount of time each time if I keep letting syslog grow
[18:55:21] restarts began in all DCs
[18:55:36] did we turn off syslog repeat-suppression everywhere for some reason? might've helped heh
[18:55:39] I could have done text and upload in parallel really, noticed only now
[18:56:01] ~60 secs + puppet run per host
[18:57:31] apparently I really need to brush up on my lua
[18:57:42] hmm if this affected ats-tls in the same way, we've lost some XCPS stats
[18:57:43] I had a lot of silly confusion staring at the assert line
[18:58:01] well the syntax doesn't help does it :)
[18:58:40] hmmm or not, because the assert isn't being triggered...
[18:58:41] at least it has variables
[18:58:57] so we just lost websockets for text
[18:59:03] it isn't (that) bad
[18:59:24] so:
[18:59:25] banana = "ciao"
[18:59:29] assert(banana ~= nil)
[18:59:33] think of all the etherpad D&D sessions you may have inconvenienced
[18:59:34] this isn't triggered ^
[18:59:49] because nil ~= nil is true
[19:00:04] it's the mathematical nil, it can't be equal to anything?
[19:00:42] since assert() will raise on anything that evaluates to false or nil, you can just do assert(HOSTNAME, "blah") too
[19:01:06] yep
[19:01:40] yep, but that assert() is working as expected on a lua CLI..
[19:01:48] this is pretty weird
[19:02:54] it's 3 AM here, so I'll look at this "tomorrow" with fresh eyes
[19:03:28] good night vgutierrez
[19:04:16] so, on hosts where the puppet run + ats restart happened before the logrotate patch got merged, things worked
[19:04:21] cp1075 is an example
[19:05:19] ok
[19:05:56] well I do see errors on 1075 actually
[19:06:26] my hunch is:
[19:06:36] "read_config()" is called once at start, and not on reload
[19:06:55] so it works fine on initial start, but then the hostname file is not loaded (thus the global goes back to nil) and the assert is not executed, on a reload
[19:07:07] (it re-evaluates the lua on a reload, but does not call the read_config() hook, in other words)
[19:07:19] that would make sense, but __reload__ should be called at reload
[19:09:49] what if something imports the script though
[19:10:23] would that overwrite the global value of HOSTNAME?
[19:10:26] nothing in the docs mentions __reload__
[19:10:38] (that I can find)
[19:10:41] they do show __init__
[19:10:44] 10netops, 10Operations, 10ops-esams: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon)
[19:10:48] yeah, it's documented by the source code :)
[19:10:54] awesome!
[19:12:17] so, reading the 8.0.x lua plugin source code (keep in mind I'm reading it for the first time, so I could be wrong!)
[19:12:31] it seems like __reload__ is actually a cleanup function that happens just before the reloading code
[19:12:36] and maybe used to be called __cleanup__
[19:12:41] err __clean__
[19:12:56] https://github.com/apache/trafficserver/blob/21c82cf370a5c4c53ebdde23f161af3485b95aa8/plugins/lua/ts_lua_util.c#L328
[19:13:29] in any case, it calls __reload__ and then creates a new global lua context
[19:13:43] I think __reload__ is meant to do any pre-reload cleanup, like freeing up other resources?
[19:14:27] is there any issue with me merging text-lb traffic changes right now? https://gerrit.wikimedia.org/r/c/operations/puppet/+/552879
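(Editor's note: if the hunch above is right, that the script body is re-evaluated on reload but the __init__/read_config() hook is not re-run in the new context, one candidate fix is to do the read at file scope so it runs on every evaluation. This is a hypothetical sketch, not the actual default.lua: the configfile path and the read_config() body are illustrative assumptions.)

```lua
-- Hypothetical sketch only: the path and read_config() internals are
-- assumptions, not the real contents of default.lua.
local configfile = "/etc/trafficserver/lua/hostname.conf"

local function read_config()
  local f = io.open(configfile, "r")
  assert(f, "Cannot open " .. configfile)
  -- Trim trailing whitespace so a missing (or present) final newline,
  -- mentioned above as a suspect, cannot matter.
  local value = f:read("*a"):gsub("%s+$", "")
  f:close()
  assert(value ~= "", "Cannot read HOSTNAME from " .. configfile)
  return value
end

-- Executed at file scope: this runs every time the plugin evaluates the
-- script, including when a reload creates a fresh lua context, so
-- HOSTNAME cannot silently revert to nil even if __init__ is never
-- re-invoked.
HOSTNAME = read_config()
```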
[19:15:27] cdanis: wait please, lots going on at this layer at present
[19:15:36] ack
[19:18:15] ema: I still have a hard time following that lua plugin code, but it really does still seem like __init__ is called on fresh load of module, then __reload__ is called just before the new lua context for a reload is created (but there is no further __init__ nor __reload__ call inside the new context)
[19:18:17] bblack: on some ulsfo hosts puppet failed running due to lack of disk space
[19:18:35] what's the command you've been using to free things up?
[19:18:46] I just re-ran, probably since that happened, try again
[19:18:54] sudo cumin 'cp*.eqsin.wmnet or cp*.ulsfo.wmnet or cp*.codfw.wmnet' 'rm -f /var/log/daemon.log /var/log/syslog.log; systemctl restart rsyslog.service'
[19:19:00] can I help?
[19:20:35] what's the command to see if the last puppet run failed?
[19:21:29] ema: you could just do a cumin like so https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[19:21:35] run-puppet-agent --failed-only
[19:22:03] ugh fixing 4028 and 5007 manually
[19:22:11] bblack: I've done 4028
[19:22:38] bad command
[19:22:44] "/var/log/syslog.log" :)
[19:22:51] just fixed it and ran it again everywhere heh
[19:24:30] this is the order in which things failed on hosts with full /:
[19:24:31] https://phabricator.wikimedia.org/P9738
[19:24:53] so the backend restarts work even with full /
[19:25:22] rsyslog has both duplicate suppression and ratelimiting apparently
[19:25:28] trafficserver-tls instead fails to start
[19:25:32] maybe as a meta-followup we should look into those :)
[19:25:45] so wherever it failed we also need to make sure we re-pool "nginx"
[19:26:18] is everything caught up now?
[19:26:26] (to the cumin progress)
[19:26:57] esams has two hosts to go
[19:27:07] eqsin 1
[19:27:19] ulsfo is done but some have failed
[19:27:29] codfw 1 to go
[19:28:19] btw ema you can check for depooled nginxes like so: confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no")'
[19:28:29] https://phabricator.wikimedia.org/P9739
[19:28:36] cdanis: excellent, thanks
[19:28:58] oh, that includes some things like cluster=api_appserver haha
[19:29:13] add a cluster argument I guess! :)
[19:29:42] confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no") | select(.tags | contains("cluster=cache_"))'
[19:29:51] --> https://phabricator.wikimedia.org/P9740
[19:29:52] I don't even remember how disruptive it is to rename a confd service, but maybe we should do that at some point in this case :)
[19:32:17] I think we're out of the woods
[19:32:32] just forced an icinga re-check of services that were still red, I see recoveries coming
[19:32:57] and the apparently-bad lua was reverted?
[19:33:11] yes
[19:33:22] cool, i am going to continue with my deploy
[19:33:56] ack I'm going to continue with my evening :)
[19:34:11] but text me if needed
[19:34:44] enjoy :)
[20:05:31] rabbitholes :P
[20:06:06] the icinga "check_dns" plugin (as used by our icinga to monitor rec and auth dns services, and also re-used by our anycast healthchecker for recdns)...
[20:06:52] it doesn't have an argument to use an alternate dns server port, and I really don't want that to be the reason I go through a bunch of bullshit to define separate in-subnet public IPs for all our dns boxes, esp as IP space is limited at edge sites
[20:07:11] so I thought maybe we could just patch it easily, but of course it's a compiled binary
[20:07:46] and when you look at the source code, it's literally written in C and what it does is wrap an execution of the external (and ancient and truly horrible) nslookup binary.
[20:08:19] I mean, if you're going to shell out to nslookup (ewwww) and all you're doing is parsing arguments and parsing output, you could at least not do that in C :P
[20:08:40] so I guess I'm going to replace check_dns with something simpler, in a scripting language :P
[20:10:09] who woke up one day and said "I need to write a simple healthcheck, so I'm going to write a horrible C program that just wraps the execution of another even more-horrible C program to do some parsing on the inputs and outputs?"
[20:13:14] (and then not bother with a port argument, to boot)
[20:14:49] bblack: I mean, once you've already decided to write nagios/icinga in straight C...
[20:15:06] but yeah that is pretty absurd
[20:15:35] check_http has its own HTTP client written in C, which, while better than shelling out to curl, is decidedly not better than linking libcurl
[20:17:29] maybe I should continue the insanity and write a new wrapper, also in C, called check_dns_with_port, which executes check_dns, but then uses ptrace to intercept critical socket calls and change the port number.
[20:20:12] some other time I have a fun story for you about wrapping socket calls
[23:31:05] 10Traffic, 10DNS, 10Internet-Archive, 10Operations, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Dzahn) >>! In T99216#5689785, @Aklapper wrote: > Ah, thanks. But who exactly is supposed to answer that question...
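(Editor's note: for reference on the check_dns thread above, a sketch of the "something simpler, in a scripting language" that bblack mentions, kept in Lua to match the rest of this log. It shells out to dig, which, unlike the nslookup wrapper, does accept a server port via -p. The script name, defaults, and output format are illustrative assumptions, not the replacement that was actually written.)

```lua
#!/usr/bin/env lua
-- check_dns_port.lua (hypothetical name): a minimal nagios-style DNS
-- check that, unlike check_dns, accepts a server port argument.
-- Usage: check_dns_port.lua <name> <server> [port]
local name, server, port = arg[1], arg[2], arg[3] or "53"
if not (name and server) then
  print("UNKNOWN: usage: check_dns_port.lua <name> <server> [port]")
  os.exit(3)
end

-- dig takes @server and -p <port>, which is the whole point here.
local cmd = string.format(
  "dig +short +time=2 +tries=1 -p %s @%s %s 2>/dev/null",
  port, server, name)
local pipe = io.popen(cmd)
local output = pipe:read("*a")
local ok = pipe:close()  -- exit status is only reported on Lua 5.2+

-- +short prints answers one per line; on failure it prints nothing,
-- or a ";; ..." diagnostic on timeouts.
if ok and output:match("%S") and not output:match("^;;") then
  print(string.format("OK: %s @ %s:%s -> %s",
                      name, server, port, (output:gsub("%s+", " "))))
  os.exit(0)
end
print(string.format("CRITICAL: %s did not resolve via %s:%s",
                    name, server, port))
os.exit(2)
```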