[09:26:25] exp admission policy patch merged, staring at logs on pinkunicorn:
[09:26:34] varnishncsa -b -n frontend -q 'BerespStatus eq 200 and BereqMethod eq "GET"' -F 'p=%{VCL_Log:Admission Probability}x r=%{VCL_Log:Admission Urand}x s=%b %{X-Cache-Int}o %s %r'
[09:26:46] p=0.024 r=0.464 s=2034 cp1008 hit/44 200 GET http://pinkunicorn.wikimedia.org/ HTTP/1.1
[09:28:02] X-Cache-Int is the status on varnish-be; all these end up not being frontend hits (fe_mem_gb=1 on pu)
[09:36:43] oh right, the memory cache sizing code does:
[09:36:44] if ($mem_gb < 90.0) {
[09:36:44] # virtuals, test hosts, etc...
[09:36:44] $fe_mem_gb = 1
[09:37:04] and pu has 32G of memory
[09:59:40] --
[09:59:48] cp4021 doesn't seem to be doing great:
[10:00:01] $ sudo journalctl -u varnish-frontend --priority '0..3' | wc -l
[10:00:01] 56
[10:00:13] Oct 16 06:22:32 cp4021 varnishd[1412]: Child (29168) not responding to CLI, killed it.
[10:00:16] Oct 16 06:22:32 cp4021 varnishd[1412]: Unexpected reply from ping: 400 CLI communication error (hdr)
[10:00:19] [...]
[10:00:37] [Thu Oct 26 18:22:54 2017] Out of memory: Kill process 30187 (varnishd) score 495 or sacrifice child
[10:01:10] and other interesting ones:
[10:01:13] [Mon Oct 16 06:22:34 2017] Out of memory: Kill process 45106 (nginx) score 4 or sacrifice child
[10:01:47] ah!
[10:01:47] hieradata/hosts/cp4021.yaml:numa_networking: on
[10:47:09] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715061 (10Ladsgroup)
[11:11:45] so, trying to figure out what might be behind upload-ulsfo mbox lags (started on Mon Oct 23)
[11:11:53] https://goo.gl/koqhNu
[11:12:37] "User-Agent: -" suddenly started showing up a lot in webrequest
[11:13:17] that's not the case in esams-upload
[11:13:46] also, 301s increased: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-7d&to=now&var-site=ulsfo&var-cache_type=upload&var-status_type=3
[11:14:03] (probably entirely unrelated, just some data points)
[11:25:13] majority of ulsfo 301s are 'UA: -' or facebookexternalhit/1.1
[11:30:27] has mbox lag returned?
[11:31:00] mark: it has moved to upload-ulsfo, in a different form
[11:31:06] moved from?
[11:31:15] (I haven't followed at all :)
[11:31:27] "moved" in the sense that we'd seen it in the past on different clusters/DCs
[11:31:35] ok
[11:32:36] in the recent upload-ulsfo case (started Oct 23 as I said) it does not immediately have user-facing impact; lag keeps on growing without obvious issues
[11:33:29] then there's a few hours of sporadic 503s, and finally it fixes itself
[11:33:59] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715115 (10Ladsgroup) @hoo and I sampled x-cache from webrequest dataset for the hour of 19 yesterday for two varnish boxes...
[11:34:45] on one hand you could say that perhaps 5 hosts aren't enough for a cluster like upload-ulsfo
[11:35:09] on the other hand we hadn't seen issues there in the recent past
[11:43:04] do we have any idea yet if v5 will fix it?
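As a side note on the admission-policy output shown at 09:26 above, here is a minimal sketch (not from the log) of tallying those varnishncsa lines. It assumes an object is admitted to the frontend cache when the logged uniform draw r falls below the probability p; that assumption should be verified against the actual VCL before trusting the numbers.

```python
#!/usr/bin/env python3
"""Summarize exp admission-policy decisions from varnishncsa output in the
format used above, e.g.:
  p=0.024 r=0.464 s=2034 cp1008 hit/44 200 GET http://... HTTP/1.1
Assumption: an object is admitted when the uniform draw r is below the
computed probability p; check this against the actual VCL."""
import re
import sys

LINE_RE = re.compile(r'^p=(?P<p>[0-9.]+) r=(?P<r>[0-9.]+) ')

def summarize(lines):
    total = admitted = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines without admission-policy VCL_Log entries
        total += 1
        if float(m.group('r')) < float(m.group('p')):  # assumed admission rule
            admitted += 1
    return total, admitted

if __name__ == '__main__':
    total, admitted = summarize(sys.stdin)
    if total:
        print('admitted %d/%d (%.1f%%)' % (admitted, total, 100.0 * admitted / total))
```

Piping the varnishncsa command from 09:26:34 into this gives a rough admitted/total summary for the sampled responses.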
[11:51:52] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715170 (10hoo) This has some potentially interesting patterns: `watchlist, recentchanges, contributions, logpager` replica...
[12:00:24] 10Traffic, 10netops, 10Operations, 10Pybal, 10Patch-For-Review: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3715204 (10elukey) Found a chrome bug that seems interesting (although not really recent): https://bugs.chromium.org/p/chromium/issues/detail?id=11...
[12:06:32] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (10Marostegui) From those two masters' (s4 and s5) graphs, we can see that whatever happened, happened exactly at t...
[12:08:49] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715242 (10thiemowmde) p:05Triage>03High The tasks description talks about ongoing investigation. Is there already an ou...
[12:12:32] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715253 (10hoo) p:05High>03Triage >>! In T179156#3715242, @thiemowmde wrote: > The tasks description talks about ongoing...
[12:26:00] well 4024 is dead too
[12:26:24] what's the status on 4021 now?
[12:28:51] mark: maybe?
[12:29:12] mark: but we have some further VCL work to do before we can switch cache_upload to v5, so it's probably not an immediate fix
[12:29:17] ok
[12:29:34] it's maybe a next week-or-two kind of thing, though
[12:34:36] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715319 (10BBlack) Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verb...
[12:37:32] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715335 (10BBlack) My gut instinct remains what it was at the end of the log above. I think something in the revert of wiki...
[12:47:52] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715381 (10BBlack) Unless anyone objects, I'd like to start with reverting our emergency `varnish max_connections` changes f...
[12:56:33] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715432 (10hoo) I think I found the root cause now, seems it's actually related to the WikibaseQualityConstraints extension...
[13:00:12] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715446 (10BBlack) >>! In T179156#3715432, @hoo wrote: > I think I found the root cause now, seems it's actually related t...
[13:01:24] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715447 (10hoo) >>! In T179156#3715446, @BBlack wrote: > >>>! In T179156#3715432, @hoo wrote: >> I think I found the root c...
[13:23:23] ema: so, cp4021... I think the last reboot, ~28 days ago, is when we reverted it from numa_networking:isolate to numa_networking:on
[13:23:38] isolate was definitely problematic in this sort of way, with all the hard segregation it was doing
[13:23:56] but "on" is pretty weak. the functional diffs from the normal node config for "on" are basically:
[13:25:14] 1) The bnx2x IRQs are normally module-configured at the same count as physical cpu cores, and are normally spread over all physical cores by the RPS script. Under "on", the IRQ count is cut in half and the RPS script only attaches them to NUMA node 0 CPUs (which is physically closer to the ethernet adapter)
[13:26:26] 2) nginx is normally also configured across all cpu cores, with 2x nginx workers per HyperThread pair of CPU cores. Under "on", it limits to NUMA node 0 (where the bnx2x IRQs are) at half the cpus+workers counts, and locks all of its memory allocations to the same NUMA node 0.
[13:29:32] but varnishes (both) aren't affected at all
[13:29:54] except in the indirect sense that when varnish workers are running on a numa node 1 CPU, they're remote from all network IRQs
[13:30:18] but I don't think that's all that abnormal. under normal conditions there's a 50/50 chance any given network IRQ for traffic to a varnish is "remote" in that sense.
[13:36:29] assuming the problems on 4021 are unique (I'm gonna go trawl logs on the other new upload@ulsfo hosts and see if there are any similar OOMs, etc), my best guess would be this is about the nginx memory lock to node0
[13:37:30] it may just be the case that having any major daemon locked down to 1 node while the regular memory allocation policies are running for other heavy memory users (network traffic in general and of course varnishes for transient+malloc, and heavy buffercache for backend disk) is a no-go, due to the general lack of numa-awareness everywhere else.
[13:38:01] so maybe we weaken that by not having "on" use memory locking for nginx, just the irq + cpu-pinning bits.
[13:46:30] (trawling confirms: only cp4021 is having the OOMs, so trying that in https://gerrit.wikimedia.org/r/#/c/386836/1/modules/tlsproxy/templates/nginx-numa.conf.erb)
[13:50:28] bblack: ok so we're trying without memory locking for nginx now?
[13:50:49] ema: yes
[13:51:06] although given the history of OOMs already since boot time, might be better to just reboot for clean testing at this point?
[13:51:12] +1
[13:53:03] prometheus-varnish-exporter has info about the child process that we might want to plot/add to grafana somehow
[13:53:19] eg:
[13:53:20] varnish_mgt_child_died 1
[13:53:21] varnish_mgt_child_dump 0
[13:53:21] varnish_mgt_child_exit 0
[13:53:21] varnish_mgt_child_panic 0
[13:53:23] varnish_mgt_child_start 2
[13:53:25] varnish_mgt_child_stop 0
[13:53:32] mice
[13:53:41] nice mice?
[13:53:44] :)
[13:53:55] nice mice bite thrice?
[13:55:36] more in general I guess we want varnish_mgt_child_start to be 1? If it's higher than that it means that the child died and has been restarted by the management process, which shouldn't happen under normal circumstances
[13:56:05] is that counter global across service restarts though?
[13:56:20] or is it pulling a counter from the varnish mgmt process, so it resets on service restart?
[13:56:30] I guess the latter makes more sense
[13:56:33] a service restart restarts the management process, so the counter resets on restart
[13:56:38] ok
[13:56:56] so yeah we should probably alert on start>1 and also panic>0
[13:57:27] well
[13:57:39] all of them I guess, as in this case there was no panic at the varnish level
[13:58:20] but we might have some race cond for spurious alerts if we trigger on "exit" or "stop", as those probably increment briefly on service restart
[13:58:27] (one or both?)
[13:58:48] so maybe the condition for "all ok" is died==0,dump==0,panic==0,start==1
[13:58:53] well
[13:58:58] so maybe the condition for "all ok" is died==0,dump==0,panic==0,start<2
[14:02:07] relatedly, it would be nice to have a general-purpose alert on all hosts for oomkills and other kernel panics/warnings
[14:02:22] I wonder if there's a better way to know about those than checking dmesg output. some counter in procfs?
[14:06:48] we currently don't have any varnish with varnish_mgt_child_start > 1, both fe and be
[14:06:51] https://www.spinics.net/lists/linux-mm/msg95658.html <- 2015 patch, I think not merged
[14:11:27] there's a count_vm_event(OOM_KILL); though, does that increment a counter somewhere?
[14:11:32] https://elixir.free-electrons.com/linux/latest/source/mm/oom_kill.c#L904
[14:13:00] yeah see `grep oom_kill /proc/vmstat`
[14:14:38] ah, that works on my workstation with linux 4.13, not on the cache nodes
[14:25:38] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716073 (10BBlack) Since diagnostics and SEL don't turn up anything, I'll recap what we've observed: * Initially - system was reporting kernel errors such as ` kernel:[30568.739024] NMI watchdog: BUG: soft...
[14:26:32] yeah heh
[14:29:13] it's not in stretch either
[14:29:23] well we're running stretch's kernel anyways
[14:29:28] right
[14:52:28] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` cp4024.ulsfo.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201710271452_b...
[15:03:33] 10Traffic, 10ORES, 10Operations, 10Release-Engineering-Team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3716144 (10Lucas_Werkmeister_WMDE) > (Permalink: https://grafana.wikimedia.org/dashboard/db/wikidata-quality?panelId=10&full...
[15:18:36] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716222 (10BBlack) So, the reinstall attempt failed with another crash during the installer. I think we have to be looking at bad hardware here: ``` ┌───────────────────────────┤ Configuring apt ├─────...
[15:38:00] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716283 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4024.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4024.ulsfo.wmnet'] ```
[16:46:54] I've tried to document my understanding of hfp vs hfm https://wikitech.wikimedia.org/wiki/Varnish#Requests_coalescing
[16:47:01] edit at will!
[16:49:43] thanks!
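A small sketch of the "all ok" condition discussed above at 13:53-13:58 for the varnish_mgt_child_* counters (died==0, dump==0, panic==0, start<2); the metric names and the sample values come from the log, while how the values are scraped from prometheus-varnish-exporter is left out and would depend on the actual Prometheus setup.

```python
#!/usr/bin/env python3
"""Evaluate the proposed "all ok" condition for the varnish_mgt_child_*
counters: died == 0, dump == 0, panic == 0 and start < 2. Scraping the
counters (exporter URL, instance labels) is intentionally left out."""

def child_process_ok(metrics):
    """metrics: mapping of counter name -> value for one varnish instance."""
    return (
        metrics.get('varnish_mgt_child_died', 0) == 0
        and metrics.get('varnish_mgt_child_dump', 0) == 0
        and metrics.get('varnish_mgt_child_panic', 0) == 0
        and metrics.get('varnish_mgt_child_start', 0) < 2
    )

if __name__ == '__main__':
    # Values from the cp4021 example above: the child died once and was
    # restarted by the management process, so start == 2 and the check fails.
    sample = {
        'varnish_mgt_child_died': 1,
        'varnish_mgt_child_dump': 0,
        'varnish_mgt_child_exit': 0,
        'varnish_mgt_child_panic': 0,
        'varnish_mgt_child_start': 2,
        'varnish_mgt_child_stop': 0,
    }
    print('OK' if child_process_ok(sample) else 'child restarted or crashed')
```

"exit" and "stop" are deliberately ignored here, matching the concern above that they probably increment briefly on a normal service restart.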
[16:54:29] next step I guess is to look at beresp.uncacheable with ttl > 0 in our current VCL and verify if it makes sense to turn those into hfm or if they should remain hfp
[16:54:49] yeah
[16:55:02] the text-cluster might be tricky, but I think the upload+common VCL will be simpler
[16:55:54] my weekend starts now! see you :)
[17:00:54] cya
[17:01:15] I took a quick 5-minute stab at documenting HFM/HFP guesses in all cases of current uncacheable=foo
[17:01:41] https://gerrit.wikimedia.org/r/#/c/386895/
[18:23:58] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3716644 (10greg) 05Open>03Resolved a:03greg Well, let's close this and make a follow-up task.
[18:26:29] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team (Kanban), and 2 others: Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716649 (10greg)
[18:26:59] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team (Backlog), 10User-greg: Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716668 (10greg)
[18:27:31] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team (Backlog): Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716670 (10greg) a:05greg>03None
[18:28:08] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (10greg) a:05greg>03hashar
[19:10:21] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH)
[19:11:27] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) I've assigned this to @bblack for now, as my understanding is he is in charge of the re-structuring of our caching center services across the three misc systems. (This task has two sys...
[21:43:07] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717006 (10greg)
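For the beresp.uncacheable audit mentioned at 16:54 (finding the spots set with ttl > 0 and deciding hfm vs hfp), a rough helper sketch; this is not how the review was actually done, and the file globbing, matched patterns, and context window are all assumptions about the VCL templates.

```python
#!/usr/bin/env python3
"""List places in a VCL tree that set beresp.uncacheable = true, together
with any beresp.ttl assignment on nearby lines, as a starting point for the
hfm-vs-hfp review. Paths and patterns are assumptions, adjust as needed."""
import re
import sys
from pathlib import Path

UNCACHEABLE_RE = re.compile(r'beresp\.uncacheable\s*=\s*true')
TTL_RE = re.compile(r'beresp\.ttl\s*=\s*(\S+)')

def audit(root, context=3):
    # '*.vcl*' also picks up templated VCL such as *.vcl.erb files.
    for path in sorted(Path(root).rglob('*.vcl*')):
        lines = path.read_text(errors='replace').splitlines()
        for i, line in enumerate(lines):
            if not UNCACHEABLE_RE.search(line):
                continue
            # Look a few lines around the hit for an explicit TTL assignment.
            window = lines[max(0, i - context):i + context + 1]
            ttls = [m.group(1) for m in (TTL_RE.search(l) for l in window) if m]
            print('%s:%d: uncacheable=true, nearby ttl: %s'
                  % (path, i + 1, ', '.join(ttls) if ttls else 'none found'))

if __name__ == '__main__':
    audit(sys.argv[1] if len(sys.argv) > 1 else '.')
```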