[09:39:17] ema: very nice! (re: tc metrics) if it doesn't require root we can add it as a collector to node_exporter, otherwise as a cron that dumps metrics to a text file
[11:48:32] 10netops, 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 switch port configuration - https://phabricator.wikimedia.org/T164594#3243407 (10akosiaris) 05Open>03Resolved a:03akosiaris Done. Ports added to interface-range ganeti, which sets trunk and vlan. Added descriptions as well, resolving.
[13:17:33] 10Traffic, 06Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239748 (10faidon) Undeploying cache_misc sounds unfortunate… Why not keep the existing 4-year-old ulsfo hardware for cache_misc still, perhaps keeping some of the other old servers in there for parts?
[13:54:15] vhtcpd segfaults are worrying. That code's been stable for a long time; has some unexpected change to its inputs triggered a long-dormant bug?
[14:01:13] 10Traffic, 06Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3243905 (10BBlack) We could do so as a goal at the end of the process, depending on how we arrange things. @RobH says we're short on power there to plug in all the new systems while the old ones are running. So t...
[14:12:02] bblack: cp4016 is still depooled, I didn't take any action today
[14:12:29] not sure if it was ok or not, but overall the Varnish metrics for text seemed fine
[14:15:01] I'm just now looking at 4018
[14:15:18] I think it's that the machine OOM'd, although I don't see an oom-kill happening either
[14:15:34] vhtcpd likely assumes memory allocations succeed, and segfaulted when one didn't
[14:15:52] less than a minute later, the varnishd child died in a panic reporting an errno about being unable to allocate memory
[14:16:46] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4018&var-datasource=ulsfo%20prometheus%2Fops&from=now-2d&to=now
[14:16:56] ^ you can see it winding down to the oom event there, where vhtcpd + varnish crash
[14:17:10] buffers+cache (which is what consumes our "free" memory) drop and drop and then boom
[14:18:46] 4016 and 4018 belong to the same memory-size class (the 192GB machines)
[14:18:59] and I recently patched in a change to the percentages of memory we allocate to frontends
[14:19:07] so most likely that's all inter-related
[14:19:33] the 192GB-class machines bumped up from ~92GB to 99GB of malloc space for varnish-frontend from that change, I believe
[14:19:55] maybe they were just borderline enough as they were, and that pushed them over the edge where oom conditions are significantly more probable
[14:22:06] on cp4016 you can see oom-like behavior just before the event too, from other unrelated things, e.g. puppet:
[14:22:09] May 7 17:00:32 cp4016 puppet-agent[19330]: (/Package[python3-apport]) Could not evaluate: Cannot allocate memory - fork(2)
[14:22:46] May 7 17:05:06 cp4016 kernel: [974666.789999] vhtcpd[986]: segfault at 90 ip 00007eff33e204a0 sp 00007ffcf6588a28 error 4 in libc-2.19.so[7eff33ce9000+1a1000]
[14:23:06] May 7 17:16:40 cp4016 varnishd[17265]: Child (17269) Last panic at: Sun, 07 May 2017 17:16:40 GMT#012"Assert error in VGZ_NewGzip(), cache/cache_gzip.c line 135:#012 Condition(Z_OK == i) not true.#012errno = 12 (Cannot allocate memory)
[14:23:25] same grafana memory pattern too, same basic scenario on both hosts
[14:26:13] oh and it's not 92->99 on the malloc sizing transition for that class, it's much worse.
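(Aside: a minimal Python sketch of the sizing math under discussion here, contrasting a pure percentage-of-RAM split with a formula that subtracts a larger fixed reserve first, which is the direction suggested at 14:28 below. All fractions, reserve sizes and host classes are illustrative guesses, not the actual puppet/hieradata values.)

    # Illustrative only: two ways to size the frontend malloc cache from total RAM.
    # The constants below are made-up stand-ins, not the real configuration numbers.

    def fe_size_pct(total_gb, frac=0.52):
        """Pure percentage split: grows linearly with total RAM."""
        return int(total_gb * frac)

    def fe_size_reserved(total_gb, frac=0.60, reserve_gb=48):
        """Subtract a larger fixed reserve first, then take a fraction of the rest.
        This yields smaller values on the bigger hosts, leaving more headroom for
        buffers/cache, the backend instance, and transient allocations."""
        return int(max(total_gb - reserve_gb, 0) * frac)

    for total_gb in (96, 128, 192, 256):
        print(total_gb, fe_size_pct(total_gb), fe_size_reserved(total_gb))

(With guesses like these, the 192GB class comes out around 99G under the percentage split but closer to 86G with the larger reserve; in other words, the values further down the list stop growing as aggressively.)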
found the puppet lines:
[14:26:16] May 4 11:56:14 cp4016 puppet-agent[1820]: (/Stage[main]/Role::Cache::Text/Role::Cache::Instances[text]/Varnish::Instance[text-frontend]/Base::Service_unit[varnish-frontend]/File[/lib/systemd
[14:26:20] /system/varnish-frontend.service]/content) --s malloc,76G \
[14:26:23] May 4 11:56:14 cp4016 puppet-agent[1820]: (/Stage[main]/Role::Cache::Text/Role::Cache::Instances[text]/Varnish::Instance[text-frontend]/Base::Service_unit[varnish-frontend]/File[/lib/systemd/system/varnish-frontend.service]/content) +-s malloc,99G \
[14:26:29] so it was upped from 76 to 99 by the new math, which is a bit more significant
[14:26:47] (that's also the amount documented in the table of the commit message at https://gerrit.wikimedia.org/r/#/c/324230/ )
[14:27:07] (the numbers there are off by a few from rounding-down issues)
[14:28:04] so we need a better formula, one that results in smaller values further down the list (something based on a larger constant, in other words, I guess)
[14:29:16] I had thought it was only the 96GB-total-mem hosts that gave us problems before.
[15:25:12] bblack: if you have some time this week for a quick C review, I was poking at kafkatee and noticed I couldn't SIGTERM its children, and SIGPIPE was broken too; that led to https://gerrit.wikimedia.org/r/#/c/352591
[16:33:20] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#3244604 (10GWicke) I added some hints, and linked to the upstream service repository. Functionally, the electron render service is all that is needed...
[16:46:13] 10Traffic, 10netops, 06Operations, 13Patch-For-Review: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3244656 (10BBlack) Updates from IRC-only work - a significant majority of our ICMP echo volume is coming from a large number of IPs owned by Google. TODO...
[17:02:25] 10Traffic, 06Operations: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (10BBlack)
[18:21:55] 10netops, 06Operations: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10ayounsi)
[18:31:01] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245083 (10Jgreen)
[19:16:35] hmmm, something is odd
[19:16:40] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?orgId=1&from=now-2d&to=now
[19:16:48] ^ req rate stats look reasonable/normal there
[19:17:01] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-2d&to=now
[19:17:12] ^ but starting ~11h ago, they look crazy there
[19:17:33] well, more like 9h ago
[19:18:38] I don't think that's "real" traffic; something else is going crazy with how we're processing requests or something...
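(Aside: a rough sketch of the kind of X-Cache bucketing that produces the "unknown" category discussed below; the bucket names and rules here are illustrative only, not varnishxcache's actual logic.)

    # Illustrative classifier for X-Cache strings such as
    # "cp1065 miss, cp3033 hit/8, cp3031 hit/1"; not the real varnishxcache code.

    def classify(xcache_line):
        entries = [e.strip() for e in xcache_line.strip().split(",") if e.strip()]
        if not entries:
            return "unknown"      # empty input falls through to the catch-all bucket
        if " hit" in entries[-1]:
            return "hit-front"    # hit at the last (frontend) layer
        if any(" hit" in e for e in entries[:-1]):
            return "hit-backend"  # hit somewhere behind the frontend
        if all(" miss" in e or " pass" in e for e in entries):
            return "miss-or-pass"
        return "unknown"          # anything unrecognized ends up here

    assert classify("cp1065 miss, cp3033 hit/8, cp3031 hit/1") == "hit-front"
    assert classify("") == "unknown"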
[19:18:48] (processing request statistics, that is)
[19:20:26] it seems that we fail to recognize stuff
[19:20:40] and that goes into the unknown bucket
[19:22:37] from SAL I can see some work on kafka earlier than that, but I'm not sure if it could be related
[19:25:45] hmmm, it's only coming from esams text nodes
[19:25:55] (maybe just one is misbehaving, for all I know yet)
[19:32:36] yeah, cp3031, hmmm
[19:34:29] restarting varnishxcache on cp3031 was it, I think
[19:34:45] I don't know why it was in a funky state, but whatever, I'll look harder if it happens again
[19:36:16] when I ran strace on the varnishxcache processes, most of them were spamming lines like:
[19:36:17] read(4, "cp1065 miss, cp3033 hit/8, cp3031 hit/1\ncp1055 hit/6, cp3031 hit/6, cp3031 hit/11\ncp1067 miss, cp3042 hit/6, cp3031 hit/58\ncp105"..., 4096) = 2494
[19:36:37] with fd#4 being the pipe input
[19:37:08] but cp3031, while broken, was spamming: read(4, "", 0)
[19:37:53] there may be a bug to fix there about not generating "unknown" stats when it reads empty input or something like that, but that still doesn't explain why it was getting spammed with empty reads in the first place
[19:38:23] err, the lines looked like: read(4, "", 4096) = 0
[19:39:47] right around the start time of the anomaly there was also this in syslog:
[19:39:48] May 8 10:30:09 cp3031 varnishxcache[21637]: Assert error in vslc_vtx_next(), vsl_dispatch.c line 285:
[19:39:51] May 8 10:30:09 cp3031 varnishxcache[21637]: Condition(c->offset <= c->vtx->len) not true.
[19:41:00] why the assert didn't result in immediate process death, I don't know
[19:41:15] seems to be a python pattern :P
[19:47:10] * volans reading backlog
[19:48:05] lol
[19:48:14] gotta go now, but I'll read it later
[20:30:44] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#3245555 (10TheDJ) @GWicke Thank You!
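(Aside: on the empty-read spam seen at 19:37-19:38, a minimal sketch of the kind of guard a pipe-reading stats process could use, so that EOF from read() ends the loop instead of spinning and feeding empty lines into the "unknown" bucket. This is a generic illustration, not a patch against varnishxcache itself.)

    # Hypothetical pipe-reader loop; not varnishxcache's actual main loop.
    import os
    import sys

    def read_lines(fd, bufsize=4096):
        buf = b""
        while True:
            chunk = os.read(fd, bufsize)
            if chunk == b"":        # EOF: the writer went away; stop instead of
                break               # spinning on read(fd, "", 4096) = 0 forever
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                if line:            # skip empty lines rather than counting them
                    yield line.decode("utf-8", "replace")

    if __name__ == "__main__":
        for line in read_lines(sys.stdin.fileno()):
            print("got:", line)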