[10:01:52] ema, vgutierrez around? There is some backlog on purged for some cp nodes
[10:03:44] this is the kafka topic
[10:03:46] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.resource-purge
[10:03:56] but I don't see a horrible increase in messages
[10:04:53] "Number of messages locally queued by purged for processing" - what does that mean? Are the purges themselves taking more and more time, or is it something else?
[10:11:02] sorry, you caught me brewing some coffee
[10:11:05] * vgutierrez checking
[10:12:09] if I check cp3050, I see some weird latencies on ats matching the purged backlog increase
[10:12:12] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&from=now-6h&to=now
[10:12:15] hello vgutierrez :)
[10:13:30] and latencies to appservers are horrible
[10:13:34] (from cp3050)
[10:13:44] CPU increased for some reason at 09:19 as well on cp3050
[10:14:18] the remaining alerts are mostly for esams afaics
[10:14:19] hmm not only cp3050
[10:15:39] there were a few hosts from other DCs alerting, but they auto-resolved
[10:18:55] from https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&viewPanel=80
[10:19:02] ats-backend hasn't been happy since 09:19
[10:19:26] yeah, it matches with purged + latencies
[10:19:53] nothing in the logs
[10:21:11] let me restart ats-backend on cp3050, considering that's affecting a bunch of instances
[10:22:19] I am wondering if some of the purges are heavy, putting pressure on ats
[10:22:54] and from https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 I don't see problems on the appserver side
[10:23:34] funny... they're auto-recovering
[10:23:55] they were scared, I told them I'd call you
[10:24:32] and ats-be CPU usage goes back to normal after recovery: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&viewPanel=80&var-site=esams%20prometheus%2Fops&var-instance=cp3054
[10:25:04] did you restart cp3050's backend or not yet?
[10:25:49] nope
[10:28:49] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?viewPanel=61&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&from=now-6h&to=now was really not great
[10:30:23] (going to a meeting with Joseph, ping me if needed)
[10:46:45] it's still far from normal
[10:46:53] even if the instances are recovering in terms of purged alerts
[11:03:56] XioNoX: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&viewPanel=91 --> any network change that could have triggered this?
[11:04:42] I'm not seeing anything interesting in the SAL around that time
[11:08:01] vgutierrez: nope
[11:08:09] let me know if I should drill into it
[11:22:14] the situation got worse again
[11:36:05] recovered again, weird
[11:42:20] hey, what's up?
[11:45:15] so the tl;dr is: increased load on some ats-be?
[11:47:12] Yup
[11:47:21] And increased cache write time
[11:47:28] And purged struggles
[11:48:32] only cache_text, right?
[11:55:22] Yep
[12:10:30] is the increased latency for real requests as well, or are the stats being skewed by counting the latency of slow PURGE writes?
[12:12:10] bblack: there was increased latency for real requests too, but nothing major: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?viewPanel=11&orgId=1&from=1602712640864&to=1602763890080
[12:20:58] vgutierrez: re: digicert expiry and the ocsp alerts, I think we'll have to manually rm /var/cache/ocsp/digicert-2019a-ecdsa-unified.ocsp on them all to clear the alert, after removing the cert itself
[12:22:14] (and -rsa- of course)
[12:48:02] Indeed
[12:48:16] ema: o/ did you find any reason for the ats-be behavior?
[13:04:25] elukey: not really, no
[14:10:34] let's see if the varnishkafka issue is fixed
[14:10:36] !log cp3050: upgrade varnish to 6.0.6-1wm2 T26407
[14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:42] T26407: Enable Some User Groups and granting admins(syopp) to give this priveledge to user - https://phabricator.wikimedia.org/T26407
[14:11:09] almost!
[14:11:15] !log cp3050: upgrade varnish to 6.0.6-1wm2 T264074
[14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:21] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[14:12:49] !log cp3050: restart varnishkafka-webrequest w/ libvarnishapi2 6.0.6-1wm2 T264074
[14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:40] so now we expect CPU usage to go down visibly on https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPanel=42&orgId=1&var-datasource=esams%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=All&var-instance=cp3050&from=now-24h&to=now
[14:14:18] 🤞
[14:14:50] after that, CPU usage should not spike even if we dare to send a HUP to varnishkafka-webrequest
[14:15:27] I'll wait a couple of minutes before issuing a systemctl reload
[14:15:35] what a crazy thing to do ;P
[14:15:45] poor varnishkafka-webrequest
[14:18:44] that's quite the drop
[14:19:45] sukhe: yeah, "normal" values should be in the 7-25 ms range
[14:21:18] !log cp3050: systemctl reload varnishkafka-webrequest.service T264074
[14:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:23] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[14:24:17] nice, it worked
[14:25:12] CPU usage is staying low despite my reload
[14:37:06] it's amazing what 100K extra stat syscalls per second can do! :)
[14:37:56] :)
[15:16:15] I've filed https://phabricator.wikimedia.org/T265625 to document the little we know so far about the puzzling issue of the day
[15:17:31] thanks!
[23:02:05] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10EBernhardson) I think this would typically go in https://wikitech.wikimedia.org/wik...
[23:29:38] 10Traffic, 10CheckUser, 10Operations: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10jrbs)
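Editor's note: the OCSP cleanup mentioned at 12:20:58 boils down to deleting the cached staple files for the removed DigiCert 2019a certs so the freshness alert can clear. Below is a minimal, hedged sketch of that step only; the -rsa- filename is inferred from the "(and -rsa- of course)" follow-up, and running it across the fleet (e.g. via the usual orchestration tooling) is assumed but not shown. This is an illustration, not the exact procedure used.

```python
#!/usr/bin/env python3
"""Illustrative sketch: clear stale OCSP staple files left behind after
removing the expired DigiCert 2019a certificates, so the ocsp alert can
recover. Paths come from the conversation above; the -rsa- variant is an
inference from "(and -rsa- of course)"."""
from pathlib import Path

STALE_STAPLES = [
    Path("/var/cache/ocsp/digicert-2019a-ecdsa-unified.ocsp"),
    # Inferred name for the RSA counterpart; verify before running for real.
    Path("/var/cache/ocsp/digicert-2019a-rsa-unified.ocsp"),
]


def clear_stale_staples(dry_run: bool = True) -> None:
    """Remove the stale staple files if present; dry-run by default."""
    for staple in STALE_STAPLES:
        if staple.exists():
            print(f"{'would remove' if dry_run else 'removing'} {staple}")
            if not dry_run:
                staple.unlink()
        else:
            print(f"{staple} not present, nothing to do")


if __name__ == "__main__":
    clear_stale_staples(dry_run=True)
```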