[10:01:52] ema, vgutierrez around? There is some backlog on purged for some cp nodes
[10:03:44] this is the kafka topic
[10:03:46] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.resource-purge
[10:03:56] but I don't see a horrible increase in messages
[10:04:53] "Number of messages locally queued by purged for processing" - what does that mean? Are the purges themselves taking more and more time, or is it something else?
[10:11:02] sorry, you caught me brewing some coffee
[10:11:05] * vgutierrez checking
[10:12:09] if I check cp3050, I see some weird latencies on ats matching the purged backlog increase
[10:12:12] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&from=now-6h&to=now
[10:12:15] hello vgutierrez :)
[10:13:30] and latencies to appservers are horrible
[10:13:34] (from cp3050)
[10:13:44] CPU increased for some reason at 09:19 as well on cp3050
[10:14:18] the remaining alerts are mostly for esams afaics
[10:14:19] hmm not only cp3050
[10:15:39] there were a few hosts from other DCs alerting, but they auto-resolved
[10:18:55] from https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&viewPanel=80
[10:19:02] ats-backend hasn't been happy since 09:19
[10:19:26] yeah, it matches with purged + latencies
[10:19:53] nothing in the logs
[10:21:11] let me restart ats-backend on cp3050, considering that's affecting a bunch of instances
[10:22:19] I am wondering if some of the purges are heavy, putting pressure on ats
[10:22:54] and from https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 I don't see problems on the appserver side
[10:23:34] funny... they're auto-recovering
[10:23:55] they were scared, I told them I'd call you
[10:24:32] and ats-be CPU usage goes back to normal after recovery: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&viewPanel=80&var-site=esams%20prometheus%2Fops&var-instance=cp3054
[10:25:04] did you restart cp3050's backend or not yet?
[10:25:49] nope
[10:28:49] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?viewPanel=61&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&from=now-6h&to=now was really not great
[10:30:23] (going to a meeting with Joseph, ping me if needed)
[10:46:45] it's still far from normal
[10:46:53] even if the instances are recovering in terms of purged alerts
[11:03:56] XioNoX: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&viewPanel=91 --> any network change that could have triggered this?
[11:04:42] I'm not seeing anything interesting in the SAL around that time
[11:08:01] vgutierrez: nope
[11:08:09] let me know if I should drill into it
[11:22:14] the situation got worse again
[11:36:05] recovered again, weird
[11:42:20] hey, what's up?
[11:45:15] so the tl;dr is: increased load on some ats-be?
[11:47:12] Yup
[11:47:21] And increased cache write time
[11:47:28] And purged struggles
[11:48:32] only cache_text, right?
[11:55:22] Yep
[12:10:30] is the increased latency for real requests as well, or are the stats being skewed by counting the latency of slow PURGE writes?
[12:12:10] bblack: there was increased latency for real requests too, but nothing major: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?viewPanel=11&orgId=1&from=1602712640864&to=1602763890080
[12:20:58] vgutierrez: re: digicert expiry and the ocsp alerts, I think we'll have to manually rm /var/cache/ocsp/digicert-2019a-ecdsa-unified.ocsp on them all to clear the alert, after removing the cert itself
[12:22:14] (and -rsa- of course)
[12:48:02] Indeed
[12:48:16] ema: o/ did you find any reason for the ats-be behavior?
[13:04:25] elukey: not really, no
[14:10:34] let's see if the varnishkafka issue is fixed
[14:10:36] !log cp3050: upgrade varnish to 6.0.6-1wm2 T26407
[14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:42] T26407: Enable Some User Groups and granting admins(syopp) to give this priveledge to user - https://phabricator.wikimedia.org/T26407
[14:11:09] almost!
[14:11:15] !log cp3050: upgrade varnish to 6.0.6-1wm2 T264074
[14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:21] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[14:12:49] !log cp3050: restart varnishkafka-webrequest w/ libvarnishapi2 6.0.6-1wm2 T264074
[14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:40] so now we expect CPU usage to go down visibly on https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPanel=42&orgId=1&var-datasource=esams%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=All&var-instance=cp3050&from=now-24h&to=now
[14:14:18] 🤞
[14:14:50] after that, CPU usage should not spike even if we dare to send a HUP to varnishkafka-webrequest
[14:15:27] I'll wait a couple of minutes before issuing a systemctl reload
[14:15:35] what a crazy thing to do ;P
[14:15:45] poor varnishkafka-webrequest
[14:18:44] that's quite the drop
[14:19:45] sukhe: yeah, "normal" values should be in the 7-25 ms range
[14:21:18] !log cp3050: systemctl reload varnishkafka-webrequest.service T264074
[14:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:23] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[14:24:17] nice, it worked
[14:25:12] CPU usage is staying low despite my reload
[14:37:06] it's amazing what 100K extra stat syscalls per second can do! :)
[14:37:56] :)
[15:16:15] I've filed https://phabricator.wikimedia.org/T265625 to document the little we know so far about the puzzling issue of the day
[15:17:31] thanks!
[23:02:05] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10EBernhardson) I think this would typically go in https://wikitech.wikimedia.org/wik...
[23:29:38] 10Traffic, 10CheckUser, 10Operations: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10jrbs)
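Editor's note: the OCSP cleanup mentioned at 12:20:58 boils down to deleting the cached staple files for the removed DigiCert 2019a certs so the freshness alert can clear. Below is a minimal, hedged sketch of that step only; the -rsa- filename is inferred from the "(and -rsa- of course)" follow-up, and running it across the fleet (e.g. via the usual orchestration tooling) is assumed but not shown. This is an illustration, not the exact procedure used.

```python
#!/usr/bin/env python3
"""Illustrative sketch: clear stale OCSP staple files left behind after
removing the expired DigiCert 2019a certificates, so the ocsp alert can
recover. Paths come from the conversation above; the -rsa- variant is an
inference from "(and -rsa- of course)"."""
from pathlib import Path

STALE_STAPLES = [
    Path("/var/cache/ocsp/digicert-2019a-ecdsa-unified.ocsp"),
    # Inferred name for the RSA counterpart; verify before running for real.
    Path("/var/cache/ocsp/digicert-2019a-rsa-unified.ocsp"),
]


def clear_stale_staples(dry_run: bool = True) -> None:
    """Remove the stale staple files if present; dry-run by default."""
    for staple in STALE_STAPLES:
        if staple.exists():
            print(f"{'would remove' if dry_run else 'removing'} {staple}")
            if not dry_run:
                staple.unlink()
        else:
            print(f"{staple} not present, nothing to do")


if __name__ == "__main__":
    clear_stale_staples(dry_run=True)
```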