[00:05:20] 10Analytics, 10Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (10leila) @Nuria Ok. I'll decline then and feel free to re-open if things change. [00:05:30] 10Analytics, 10Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (10leila) 05Open→03Declined [00:08:19] RECOVERY - Throughput of EventLogging NavigationTiming events on icinga1001 is OK: (C)0 le (W)1 le 2.918 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [00:09:01] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is CRITICAL: 2.989e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [00:17:15] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is OK: (C)1000 gt (W)100 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [00:18:02] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) [00:18:31] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) Interval without data: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=1554146229141&to=1554164229141&var-schema=NavigationTiming [00:20:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 1711 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [00:20:39] RECOVERY - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All [00:23:34] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) Cirrus incoming requests: https://grafana.wikimedia.org/d/000000027/kafka?refresh=5m&panelId=46&fullscreen&orgId=1&from=1554078175677&to=1554164575677&var-datasource=eqiad%20prometheus%2Fops&var-kafka... [00:24:05] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [00:30:49] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) Outage starts at 22:12 UTC April 1st, it starts recovering at 00:07 April 2nd [00:37:33] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1004 is CRITICAL: CRITICAL: 33.33% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1 [00:58:06] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10EBernhardson) `eqiad.cirrussearch.page-index-update` should be unrelated, this is a job that runs every monday and generally goes to a consumer lag of ~5M and then back down. 
Producing messages is throttled... [01:00:01] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:01:13] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10EBernhardson) Some varnishkafka failure, suggests some webrequest data may be incomplete? https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host... [01:05:09] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:20:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:21:55] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:24:46] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) Right, we need to quantify the amount of time varnishkafka just received errors from posts cause those will be lost records. [01:32:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:34:47] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:39:55] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [01:46:21] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:08:09] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:09:27] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:13:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery 
Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:14:39] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:21:05] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:24:59] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:36:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:37:55] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:46:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:48:17] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [02:58:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:01:13] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:10:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:14:05] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:17:57] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:19:03] 10Analytics, 10Analytics-Wikistats: wikistats: Insecure content warnings (images from upload.wikimedia.org) - https://phabricator.wikimedia.org/T57443 (10Krinkle) 05Open→03Declined [03:23:07] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:30:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:33:29] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:37:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:38:39] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:48:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [03:52:53] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:01:55] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:05:45] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:16:05] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:26:25] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:41:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on 
icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:47:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:50:57] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [04:52:13] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:06:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:08:57] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:16:45] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:18:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:20:50] 10Analytics: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (10Marostegui) Removing the #dba tag as we cannot do that. Not closing the task in case this task wants to be used for further discussion about what @Milimetric said. I will remain subsc... 
[05:23:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:24:31] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:30:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:37:29] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:51:11] oh my [05:54:00] :S [05:55:50] elukey: I'm gonna add to those a small bit - We need to rerun the sqoop-mediawiki-production job (I'll do it manually) as it finished earlier than the labs sqoop (it is mandatory that labs finishes first, as it gets data to be linked to the labs one) [05:56:13] elukey: I'll be off most of the day, peeking in now and then to monitor that [05:56:25] joal: bonjour! checking what happened, I think kafka didn't feel well [05:56:51] o/ [05:56:55] from reading the backlog, kafka died (or similar) yesterday [05:56:59] hi ottomata [05:57:29] yeah, poking [05:57:30] not sure why [05:57:34] something very strange happened [05:57:43] brokers all of a sudden couldn't talk to each other [05:57:49] trying to find a pattern [05:57:55] brain power waning tho [05:57:57] very sleepy [05:58:32] kafka graphs look horrible [05:59:52] [2019-04-01 23:52:53,144] ERROR Error while accepting connection (kafka.network.Acceptor) [05:59:55] java.lang.ArithmeticException: / by zero [05:59:55] ahahaha [06:00:37] not sure if noise or not but really interesting, there are a ton of them [06:01:21] I can see [06:01:21] [2019-04-01 23:52:52,756] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor) [06:01:24] [2019-04-01 23:52:52,788] INFO Awaiting socket connections on 0.0.0.0:9093. (kafka.network.Acceptor) [06:01:27] for kafka-jumbo1002 [06:01:28] like it was restarted [06:01:35] strange [06:01:55] https://tools.wmflabs.org/sal/production [06:01:57] oh boy [06:02:00] ottomata: --^ [06:03:18] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Ottomata) The first error I see in broker logs on kafka-jumbo1001 is: ` [2019-04-01 19:29:25,322] WARN [ReplicaFetcher replicaId=1001, leaderId=1006, fetcherId=6] Error in response for fetch request (type=F...
[06:03:25] https://phabricator.wikimedia.org/T219842 [06:03:30] elukey: is the ticket [06:03:39] nuria and marko responded while I was on the plane [06:03:44] i think they rebooted and things got fixed [06:03:49] dunno what happened though [06:04:13] elukey: i think things are ok now and I am very very sleepy and have to get up in the morning for wiki lead shuttle [06:04:31] i'm going to sign off, but will check back in morningish over here in cali, so closer to end of your day [06:04:43] don't spend too much time on it, but if you find anything put it on ticket ya? [06:04:55] maybe nuria and marko have more context [06:05:00] i don't have backscroll in mw-sec [06:05:41] what the hell why didn't they call me [06:05:44] it was midnight [06:06:11] ottomata: please go rest, I'll check [06:09:47] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:17:25] elukey: I can try to make time at lunch time/afternoon to help - would it be useful? [06:18:09] !log Deleted (in hdfs bin) actor and comment table data because it has been sqooped too early - manual rerun will be started once labs sqoop is done [06:18:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:19:05] joal: don't worry, thanks :) the above alarm seems a false positive, nothing stands out in the graphs [06:23:59] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:34:27] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:34:55] ok this is annoying [06:37:01] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:47:17] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:48:35] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:51:09] 10Analytics: Refine eventlogging pipeline should not refine data for domains that are not wikimedia's - https://phabricator.wikimedia.org/T219828 (10phuedx) [06:51:15] 10Analytics, 10Reading Depth, 10Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (10phuedx) [06:51:31] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) I can see two events: 1) ~19:28->~19:44 UTC 2) ~22:13->~00:10 UTC (this
one seems a bit more complex, probably multiple occurrences in the same timeframe) The first one seems to have not alarmed, I... [07:07:11] 10Analytics: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (10Bawolff) 05Open→03Declined >>! In T219827#5075597, @Marostegui wrote: > That's not possible cause it would require setting up multi-source replication on all the instances that ar... [07:09:51] elukey: while i don't see anything definitive, there is an awfully coincidental timing with me grabbing 35 kafka partitions via KafkaRDD in spark and kafka falling over. [07:10:02] (via a process that has run many times before though) [07:10:47] ebernhardson: hello! [07:10:52] oh that didn't make it into the ticket, it should have [07:10:54] hi :) [07:11:36] I am currently checking logs but all I can see is the replica fetcher kafka thread complaining [07:11:52] there were two events though [07:11:59] one happened at around 19:30 UTC [07:12:02] that recovered [07:12:05] and then the other big one [07:12:38] do you remember if you did other tries with the KafkaRDD at around that time [07:12:41] ? [07:13:21] basically my thing ran from 22:11:23 until it died at 22:17:49 with SocketTimeoutException, everything went haywire around 22:18 [07:13:41] elukey: there was an earlier run that failed, lemme find the timing [07:14:24] 19:14 - 19:22 [07:14:29] ah! [07:14:31] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=mjolnir.msearch-prod-request&from=1554145760600&to=1554146584055 [07:15:09] these contain messages of ~400kB each, although they compress extremely well (>40kB after compression) [07:15:12] even if it doesn't really match with 19:30 though [07:15:27] err, <40kB [07:15:44] one thing to take into consideration is that kafka brokers have 1g NICs, need to confirm that now but I am pretty sure this is the case [07:15:54] I am currently wondering if the interface was saturated [07:16:28] it doesn't appear of course from 5m metrics [07:16:44] actually, this runs in 3 stages, so first stage 19:14-19:22, second stage ran until 19:29, then third stage ran 19:29-19:34 [07:16:46] but from the switch's metrics I can see that some broker's nic discarded packets [07:17:03] that would make a perfect match then [07:17:32] ok so, i've been reading and apparently kafka has quotas? are they turned on? [07:17:59] https://kafka.apache.org/documentation/#design_quotas [07:18:19] they are not, we didn't have time to work on those in the past [07:20:13] that would probably explain things then. sorry to break everything :( I poked at KafkaRDD but no throttling available there, it suggests using quotas. I can invent a python-based KafkaRDD that does client-side throttling via kafka-python i suppose [07:22:35] nono please don't be sorry, from the infra point of view we should prevent people from bringing down something if possible [07:22:43] it is clearly something that we should have invested time on [07:22:54] can you add the timings to the task? [07:23:13] yup adding [07:23:42] the main unresolved point is what caused the failures.. was it bandwidth?
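The client-side throttling mentioned here could look roughly like the sketch below: a hypothetical kafka-python consumer that caps fetch sizes and sleeps between polls, instead of letting Spark's KafkaRDD pull 35 partitions as fast as the brokers can serve them. The broker address, topic, partition count and rate numbers are illustrative assumptions, not the actual mjolnir code; broker-side quotas (the Kafka design doc linked above) would complement this by enforcing limits even for clients that do not self-throttle.

```python
# Sketch: bulk-read a large topic with kafka-python while rate-limiting ourselves.
# Broker, topic and limits below are assumptions for illustration only.
import time
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = 'kafka-jumbo1001.eqiad.wmnet:9092'   # assumed broker address
TOPIC = 'mjolnir.msearch-prod-response'
PARTITIONS = 35
TARGET_BYTES_PER_SEC = 20 * 1024 * 1024          # stay well below the 1G NIC

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP,
    enable_auto_commit=False,
    auto_offset_reset='earliest',
    fetch_max_bytes=4 * 1024 * 1024,              # cap a single fetch response
    max_partition_fetch_bytes=1024 * 1024,        # cap the per-partition fetch size
)
consumer.assign([TopicPartition(TOPIC, p) for p in range(PARTITIONS)])

while True:
    start = time.monotonic()
    batch = consumer.poll(timeout_ms=1000, max_records=500)
    nbytes = sum(len(m.value) for msgs in batch.values() for m in msgs)
    if nbytes == 0:
        break                                     # reached the current end; good enough for a sketch
    # ... hand the messages to the rest of the pipeline here ...
    elapsed = time.monotonic() - start
    min_duration = nbytes / TARGET_BYTES_PER_SEC
    if elapsed < min_duration:
        time.sleep(min_duration - elapsed)        # crude client-side rate limiting
```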
[07:23:59] it doesn't seem to be a too-many-conns issue or similar [07:24:28] at first i didn't think so, since the graphs didn't show going above 20MB/s, that's only 1.6Gbit/s across the 6 kafkas [07:24:36] I recently had a long debugging session for the mw memcached shards, in which it turned out that microbursts of 1s/2s were saturating the 1g interface [07:24:36] but the graphs might not capture the spikes well [07:24:46] exactly [07:26:33] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10EBernhardson) Most likely related to those timelines is two runs of mjolnir's feature collection: 19:14-19:34 and later 21:35-22:20. These use KafkaRDD against a topic with 35 partitions, `mjolnir.msearch... [07:29:19] so probably tx bandwidth saturated [07:29:21] I'd sa [07:29:22] *say [07:29:34] it seems the most likely explanation [07:30:29] indeed. Well at least one mystery solved. Seems kafka should be more resilient though, being an append only log i would have expected it to pick back up after a brief network partition quickly [07:32:21] ebernhardson: definitely, thanks a lot for the help! [07:34:44] ebernhardson: just to add a bit of joy in this kafka mess, the vega gpu is on stat1005 and it seems to be working :) [07:35:03] today if I have time I'll try to install python 3.6 + venv and tensorflow [07:36:38] saw, yea next step is definitely the python deps, and hopefully it just works from there, can hope at least :) [07:37:11] I'll let you know when done! [07:37:28] last time I installed python 3.6 from the debian archive [07:37:33] a bit painful but worked [07:37:43] I hope that tensorflow will support 3.7 soon [07:39:14] i'm surprised they support 3.7, rumor is google still runs lots of 2.7 in production :) [07:39:21] ahahahah [07:39:22] s/3.7/3.6/ [07:39:35] mmm from https://github.com/tensorflow/tensorflow/issues/17022#issuecomment-477914905 it seems it works [07:45:46] ah interesting [07:45:54] pip install tensorflow from 3.7 starts [07:46:01] tensorflow-rocm says [07:46:09] Could not find a version that satisfies the requirement tensorflow-rocm (from versions: ) [07:49:51] ufff [08:10:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:13:46] this is a false positive [08:16:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:41:35] brb [08:45:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:50:45] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:57:12] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10MusikAnimal)
[09:02:53] so all the last occurrences of the alarm are false positives (caching involved) [09:03:10] going to finish to add data to the task (timeline etc..) [09:03:17] and then I'll start checking alarms [09:23:09] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:25:27] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:31:50] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) Ok so my current theory is that the TX bandwidth of the 1G interfaces was temporary saturated multiple times, leading to timeouts due to the high traffic consumed. Let's re-analyze the the graphs wit... [09:43:37] 10Analytics: jumbo kafka cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) Very interesting, it should confirm what Erik reported. There are various OOMs related to fetching from topic `mjolnir.msearch-prod-response` during the timeframes that the KafkaRDD fetcher was actin... [09:45:26] ok so root cause definitely identified --^ [09:47:08] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) p:05Triage→03Normal [09:53:48] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:55:08] this is really annoying [09:57:58] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:04:46] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:08:34] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions [10:09:00] yeah yeah [10:09:38] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:14:14] !log restart eventlogging's mysql consumers on eventlog1002 - T219842 [10:14:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:14:17] T219842: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 [10:15:58] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:27:43] mforns: o/ [10:27:45] are you around? [10:38:30] when you are around we should check what is the status of all the jobs etc.. [10:38:33] after the kafka mess [10:38:45] some hours need to be re-run with higher threshold [10:40:12] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:42:14] so now druid public is also falling over [10:42:15] sigh [10:42:23] (see operations channel) [10:51:16] so there seems to be something scanning fr.wikipedia.org's edit data [10:51:23] with huge time windows [10:59:18] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:59:20] elukey you are not alone! /hug [11:00:29] welcome back :) [11:01:52] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:14:36] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:15:35] these are still false positives [11:15:37] sigh [11:17:08] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:40:06] hey teammmm [11:40:11] O.o O.o [11:40:14] hi elukey [11:40:20] good morning :D [11:40:28] marcel don't freak out when checking alarms :D [11:40:35] already did! 
[11:40:40] maybe we can chat quickly in batcave so I can explain [11:40:44] sure, omw [11:40:47] (I was about to go off for 1h) [11:42:44] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:44:04] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:51:48] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:54:20] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:04:32] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:05:36] * elukey lunch! [12:05:50] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:19:54] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:21:10] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:29:42] (03PS1) 10Hoo man: Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) [12:37:05] (03CR) 10Lucas Werkmeister (WMDE): Track number of Schemas (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [12:39:43] (03CR) 10Hoo man: Track number of Schemas (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [12:40:11] (03PS2) 10Hoo man: Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) [12:47:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [12:48:18] (03Merged) 10jenkins-bot: Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 
(https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [12:50:03] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1004 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1 [13:11:00] back! [13:15:37] mforns: o/ how are the re-runs going? [13:24:19] elukey, hey, no luck with that [13:24:26] the error keeps firing [13:24:56] actually, the last time I executed it, it got the right thresholds, but still failed the check, and did not run the refine.. [13:26:50] mforns: how did you re-run? With what threshold? Really weird, it should go forward [13:33:20] elukey, first I tried re-running from hue, from the coordinator's list of jobs, checking the failed one and clicking rerun, then setting the thresholds [13:33:33] that did not work, the thresholds were unchanged [13:33:51] then I tried the same but opening the failed workflow [13:34:05] and re-running from within, changing the thresholds as well. [13:34:33] this time, the thresholds were correctly passed to the job, but the workflow failed on a prior step [13:35:28] I dunno, I'm a bit confused [13:37:53] mforns: submitter is your username, not 'hdfs' [13:38:11] IIRC if you go inside the job and re-run it, it uses your username [13:38:42] Remote Exception: Permission denied: user=mforns, access=WRITE, inode="/wmf/data/raw/webrequests_data_loss/upload/2019/4/1/14":hdfs:hadoop:drwxr-xr-x [13:38:47] mforns: --^ [13:39:03] I think that you need to use the oozie CLI [13:39:19] k [13:42:58] mforns: I think that you just need to fire another separate coordinator, only for that hour [13:43:07] with the modified .properties file [13:43:33] I think that Hue doesn't allow you to modify it if you are not hdfs [13:43:47] unless you run it with your username [13:43:51] that of course fails [13:50:49] aha [13:51:20] elukey, do I need to modify the properties file, or is it enough to override params with -D ? [13:53:46] mforns: ah sorry yes -D is way better [13:57:50] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10CDanis) While I don't want to stop investigative work from happening, is someone planning on writing this up at https://wikitech.wikimedia.org/wiki/Incident_documentation ? [13:59:45] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) >>! In T219842#5077438, @CDanis wrote: > While I don't want to stop investigative work from happening, is someone planning on writing this up at https://wikitech.w... [14:04:19] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) Eventlogging data sent by Varnishkafka was of course impacted as well: https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-in... [14:09:08] 10Analytics, 10Analytics-Wikistats, 10Research: Renovation of Wikistats production jobs - https://phabricator.wikimedia.org/T176478 (10Erik_Zachte) That's OK. Cheers [14:09:18] mforns: if you want me on bc lemme know, I am free now [14:09:33] elukey, ok, let's meet on bc, thanks! [14:33:39] elukey, it's refining!
[14:34:32] Heya team - sqoop almost finished, and might have some errors - We also need a manual run of actor/comment tables - I'll take care of that when I get back home tonight (currently heading to the airport) [14:39:21] (03PS1) 10Lucas Werkmeister (WMDE): WIP: count number of Wikidata edits by namespace [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) [14:41:10] joal: if possible let us do it, enjoy your time off :) [14:44:01] hi! anyone have any ideas about how to follow a kafka stream from Vagrant on the EventBus role on the host machine? [14:44:31] I forwarded the port 9092 from the guest to the host [14:44:42] (03CR) 10Lucas Werkmeister (WMDE): "Some questions that I hope reviewers can help with:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [14:44:58] So for example kafkacat -C -b localhost:9092 -t datacenter1.mediawiki.centralnotice.campaign-change works from within the Vagrant VM [14:45:20] but on the host machine, I get, "Failed to resolve 'vagrant.mediawiki-vagrant.dev:9092': Name or service not known (after 129671607ms in state INIT)" [14:45:27] hmmm maybe if I add that to /etc/hosts [14:47:26] AndyRussG: hi! Do you see the port used on your localhost? [14:47:39] I mean, vagrant should have exposed it for you to contact [14:47:50] if so, you can just refer to it as localhost:9092 [14:48:14] elukey: if I say vagrant port [14:48:26] it shows 9092 (guest) => 9092 (host) [14:48:29] among other things [14:48:35] is that where I should look [14:48:48] I think somehow it is getting through to vagrant, but there's some missing config [14:48:55] are you running on linux [14:48:56] ? [14:49:12] yeah always ;) [14:49:19] if so sudo netstat -nlpt | grep 9092 [14:49:52] Hmmm, that says: tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN 29931/VBoxHeadless [14:50:18] Dunno what the 29931 means [14:52:58] so have you tried to kafkacat from 127.0.0.1:9092 ? [14:53:23] AndyRussG: this is surfacing vagrant ports on your (say, mac) laptop acting as host? [14:53:47] elukey: yeah I tried that [14:54:05] and? :) [14:54:23] nuria: yes, but it's Debian GNU/linux [14:54:27] AndyRussG: you need explicit config to forward ports in vagrant on vagrantfile [14:54:30] elukey: same result [14:54:57] same result? [14:55:01] what do you mean ? :) [14:55:33] it cannot be 'failed to resolve' bla bla [14:55:46] nuria: yes I did that :) added this to Vagrantfile-extra.rb: https://tools.wmflabs.org/paste/view/3ccd3548 [14:55:57] then restarted the virtual machine [14:56:09] elukey: yes, exactly the same [14:57:19] it's as if kafkacat is somehow getting to the VM and somewhere in the process something resolves differently on the host vs. the guest [14:57:41] AndyRussG: did you enable port forwarding on virtual box? [14:57:55] ah right, AndyRussG same thing with 127.0.0.1 ?
[14:58:13] nuria: no, didn't try that, lemme see [14:58:16] elukey: yep same thing [14:58:17] mforns: refine completed, let's do the other two [14:58:26] elukey, yes omw [14:58:31] could be port forwarding then, no idea [14:58:35] never seen this issue [14:59:06] !log re-run of webrequest upload 2019-04-01-14 with higher data loss threshold [14:59:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:04] reruns launched [15:01:42] PROBLEM - Check the last execution of refinery-sqoop-mediawiki on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki [15:01:51] yeeeeaaahhh [15:02:01] mforns: let's also !log it [15:02:36] !log Rerunning webrequest-load-coord for 2019-04-01T22 [15:02:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:05:07] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) Very interesting: ` java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at java.nio.ByteBuffer.a... [15:05:13] (03CR) 10Hoo man: "> Is there any way we can test this before deploying it?" (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [15:06:33] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) >The OOMs have in their stacktrace signs of Kafka partitions being converted, Can you explain this a bit? [15:07:52] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10EBernhardson) The KafkaRDD implementation is indeed old, for whatever reason they never implemented 0.10+ for python, so pyspark uses the 0.8 api. [15:08:54] nuria: the port also shows up as forwarded in the VB gui [15:09:09] AndyRussG: then see if it is occupied [15:09:28] hmmm right [15:10:30] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) >>! In T219842#5077736, @Nuria wrote: >>The OOMs have in their stacktrace signs of Kafka partitions being converted, > Can you explain this a bit? Sure! What I wa... [15:13:16] AndyRussG: netstat -lpa [15:13:24] AndyRussG: or netstat -putona [15:13:29] lol [15:13:45] ahahhahah nuria I can't stop laughing [15:13:53] elukey: I KNOW [15:13:57] elukey: a classic [15:14:56] hmmmm [15:17:08] nuria: elukey: I don't see anything on that port other than the VirtualBox stuff [15:21:01] So I guess forwarding kafka streams to outside the VB isn't something you folks usually do? or is it? [15:22:09] AndyRussG: we don't use a lot VB, but if I had to guess I'd say that there is a reverse dns query from the kafka client [15:22:36] have you tried to add the vagrant dns name in /etc/hosts? [15:22:46] not yet lemme try it tho [15:23:02] not sure if it works but let's rule it out :) [15:24:51] elukey: nuria: ^ it worked! just added this to /etc/hosts: 127.0.0.1 vagrant.mediawiki-vagrant.dev [15:25:00] thanks so much eh [15:25:15] Is there somewhere I should add this to the documentation btw? 
[15:25:41] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) Some links that seem to validate what I've said above: https://stackoverflow.com/questions/49463623/kafka-broker-failed-to-handle-request-due-to-heap-oom http://k... [15:25:41] Mebbe here? https://wikitech.wikimedia.org/wiki/Event_Platform/EventBus#MediaWiki_Vagrant_Development_Environment [15:26:03] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (10elukey) [15:27:38] AndyRussG: sure! [15:27:38] it was the reverse DNS lookup then [15:27:38] didn't know about it sorry :( [15:27:38] yeah something like that, heh no worries [15:27:38] thx again! [15:31:56] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (10elukey) @ssingh hi! Sorry for the delay. I am fine adding the group permissions but if the directory was marked in that way it was on purpose, maybe... [15:32:05] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) Created template for incident report, will fill timeline during the day today: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo [15:34:27] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) [15:36:04] 10Analytics-Kanban, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10elukey) >>! In T178802#5076008, @Tbayer wrote: > @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a b... [15:36:26] nuria: --^ (whenever you have time) [15:36:32] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500765/ [15:56:46] (03CR) 10Ladsgroup: "> Patch Set 1:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [16:03:27] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Replace current time range selector on Wikistats to allow for arbitrary time selections - https://phabricator.wikimedia.org/T219112 (10fdans) [16:07:10] 10Analytics, 10Analytics-Kanban: Explain with annotations start of new registered users data - https://phabricator.wikimedia.org/T215887 (10Milimetric) 05Open→03Resolved someone did this for us :) [16:10:25] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Milimetric) @DStrine can you let us know what you'd like to do here? It's not technically complicated, but it's a little time sensitive in case... [16:40:11] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10EBernhardson) As a followup I've started rewriting mjolnir's usage of KafkaRDD to instead use the kafka-python library we use elsewhere (that supports the full current pro... 
[16:53:09] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) @EBernhardson while having client quotas might be something we want to look at, the priority on our end (I think) should be to see if there is a setting in kafka... [16:53:40] 10Analytics, 10Analytics-Kanban: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) [17:00:10] 10Analytics, 10Pageviews-API: metrics.wmflabs.org pageviews csv now redirecting to eventmetrics forbidden - https://phabricator.wikimedia.org/T219718 (10jmatazzoni) This is not really something that the Event Metrics team can take on. I'm removing that tag.. [17:15:07] https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-479082013 [17:15:11] \o/ [17:16:06] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) [17:16:20] 10Analytics, 10Operations, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) 05Stalled→03Resolved [17:17:55] 10Analytics, 10Discovery-Search, 10Multimedia, 10Reading-Admin, and 3 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (10elukey) [17:18:05] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) 05Stalled→03Open Vega GPU mounted on stat1005, it looks go... [17:22:09] 10Analytics, 10Discovery-Search, 10Patch-For-Review: Publish both shaded and unshaded artifacts from analytics refinery - https://phabricator.wikimedia.org/T217967 (10Gehel) [17:29:37] !log revision/pagelinks failed wikis rerun successfully, now forcing comment/actor rerun [17:29:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:30:17] * milimetric gone for lunch [17:30:37] nuria: Re the pageview token discussion from yesterday: we need this token to be shared between server and client, because one of our schemas uses server-side logging [17:30:42] That's why we're creating our own thing [17:30:50] My understanding is that the existing pageview token is generated client-side [17:30:55] elukey: doing a quick test, seems to be missing libhc_am.so [17:31:09] (stat1005/rocm/tf) [17:31:10] But if there's something that already exists that we can use, let us know [17:31:52] elukey: ignore me ... [17:31:53] ebernhardson: interesting, what kind of test? So I can try to repro [17:31:56] ahahhaha [17:32:17] it's in /opt [17:36:52] ebernhardson: are you using python atm or can I nuke it to install 3.6 from http://snapshot.debian.org/package/python3.6/3.6.8-1/ ? [17:37:03] (I can wait until tomorrow if you want) [17:37:10] elukey: you can nuke it, you don't have venv so i built a venv on stat1007 and copied it over [17:37:40] ah! with python 3.6? [17:37:45] elukey: yes [17:37:48] nice! [17:37:58] this is actually way smarter than what I was trying to do :D [17:38:14] if you are testing tensorflow rocm I'll let you do the work [17:38:24] well, i'm not entirely sure it works yet :) [17:38:44] did you already install it ?
that is a good start :) [17:40:07] elukey: well, it's just running into more errors :) next is https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-479082013 [17:40:24] bah wrong paste, ImportError: libMIOpen.so.1: cannot open shared object file: No such file or directory [17:41:31] it's in the venv directory though, and ldd doesn't claim any deps are missing. needs more investigation [17:43:10] some LD_LIBRARY_PATH magic and it started at least [17:43:18] 2019-04-02 17:43:12.939323: F tensorflow/stream_executor/rocm/rocm_dnn.cc:2659] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing forward convolution [17:43:40] that might just be the tf_hub module i grabbed not being supported by rocm, will find something simpler [17:43:58] now I am a bit lost but I trust your judgement :) [17:44:15] if you need me to follow up on anything please drop a line in here or in the task and I'll do it [17:44:44] tensorflow_hub is a place to download pre-trained models. tensorflow-rocm didn't like this pre-trained model but it's not a particularly simple thing, it's a state-of-the-art sentence embedding [17:44:57] i'll try with something simple :) [17:46:19] 10Analytics, 10Beta-Cluster-Infrastructure, 10EventBus, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10dduvall) p:05Triage→03Unbreak! Marking this UBN as is policy for all deployment block... [17:46:31] ah snap okok [17:46:53] on the upside, it started, talked to the gpu, talked to the libraries. it's progress! [17:48:28] compared to our previous card for sure! [17:50:43] 10Analytics, 10Beta-Cluster-Infrastructure, 10EventBus, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10Pchelolo) The https://gerrit.wikimedia.org/r/500363 fixes it. Don't want to self-merge my... [17:54:33] !log restarted turnilo to clear deleted datasource [17:54:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:55:40] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions [17:55:53] 10Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10elukey) p:05Triage→03Normal [17:55:57] manually re-ran the checker --^ [17:56:08] nuria, milimetric opened task for druid --^ [17:56:51] going off! o/ [17:58:48] elukey: so next problem, apparently we have rocm-clang-ocl 0.3.0-7997136 but should have 0.3.0-688fe5d (isn't their versioning fantastic!) [17:58:56] (not now, but sometime, i'll add a note to ticket) [18:00:13] ebernhardson: I did some downgrade tests a while ago with the previous card of the rocm suite, I might have missed a deb [18:01:04] ahh, so these aren't copied into our mirrors? I can probably find and install the appropriate things then, and take notes on which from where [18:01:15] nono I added the repo manually [18:01:18] for the moment [18:01:29] sure, reprepro is no one's friend for testing things :) [18:02:13] ah and now it seems gone from what I can see, puppet cleaned it up [18:02:47] if you run into much trouble let me know and I'll re-install the os and the last rocm suite [18:02:53] so we'll start from a clean state [18:03:28] going to dinner!
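A toy GPU sanity check of the sort being discussed here could look roughly like the sketch below: a minimal MNIST run, assuming the python3.6 venv with tensorflow-rocm and LD_LIBRARY_PATH pointing at the ROCm libraries under /opt/rocm mentioned above. It is not the exact script that was run, just an illustration of the kind of smoke test involved.

```python
# Sketch: confirm TensorFlow sees the GPU, then train a tiny MNIST model.
import tensorflow as tf

print('GPU available:', tf.test.is_gpu_available())   # expected True on the ROCm-backed box

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0     # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)                  # one epoch is enough for a smoke test
print(model.evaluate(x_test, y_test))
```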
[18:06:38] success (on a toy dataset). stat1007 using cpu trains mnist in 12s, stat1005 on gpu trains it in 8s. It's a very toy problem that probably doesn't benefit that much from the gpu, but it appears to have trained something [18:10:16] 10Analytics, 10Beta-Cluster-Infrastructure, 10EventBus, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10dduvall) a:03Pchelolo >>! In T219738#5078550, @Pchelolo wrote: > The https://gerrit.wik... [18:10:57] 10Analytics, 10Pageviews-API: metrics.wmflabs.org pageviews csv now redirecting to eventmetrics forbidden - https://phabricator.wikimedia.org/T219718 (10mahmoud) I've got a version of the code going off of the rest_api now. The code's not running exactly the same, but that could be on me. As for the why, I'm... [18:11:29] 10Analytics, 10Pageviews-API: metrics.wmflabs.org pageviews csv now redirecting to eventmetrics forbidden - https://phabricator.wikimedia.org/T219718 (10mahmoud) 05Open→03Resolved [18:21:13] On Scoring Platform we are interested in figuring out what search terms people use to find our docs pages on mediawiki.org. I understand this information is available in Hadoop. Is there a way to access this information outside of Hadoop (such as a specialized tool)? If not, may I begin drafting the paperwork to get access to Hadoop? :) [18:23:52] 10Analytics: Outdated project codes in pagecounts-ez - https://phabricator.wikimedia.org/T219914 (10Sascha) [18:23:59] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10greg) [18:25:18] hare: search terms are PII, so you will need hadoop access :) [18:27:25] hare: you can file an access request for hadoop: https://wikitech.wikimedia.org/wiki/Analytics/Data_access ebernhardson can qualify this, but I do not think every search done by every user is persisted to hadoop; *I think* some search matching is done client-side and only a percentage of those per wiki is persisted [18:28:08] full text search is logged completely, autocomplete is cached at varnish so not all of those are in the search logs [18:28:19] I was thinking more Google searches [18:28:24] External referrers [18:28:25] oh, we don't have those at all [18:28:41] hare: external referrers were killed years ago, google just sends Referer: https://www.google.com [18:29:18] (we set the same thing on our sites: when someone clicks out of wikipedia to some external resource, that resource doesn't get to know anything more than the domain name) [18:35:13] hare: your best bet would be to find someone with access to the google side of site analytics, they have some reports that include search terms and pages clicked through to, along with # impressions / # clickthroughs / result position. But past reviews have shown that at least for the wikipedias it's just a list of search terms that match page titles [18:35:16] 10Analytics, 10EventBus, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10greg) [18:36:18] This would be for documentation on mediawiki.org or possibly queries directly for our production service (ores.wikimedia.org), so it might be slightly more interesting? Also, do you know who coordinates that? [18:37:38] hare: probably SRE?
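For reference, a minimal sketch of the kind of toy MNIST benchmark described above (the 12s CPU vs 8s GPU comparison at 18:06). The model shape, epochs and batch size are assumptions, and tf.keras is used as in the TF 1.x line that tensorflow-rocm tracked at the time; this is not the exact script run on stat1005/stat1007.

    # Tiny MNIST training benchmark with a wall-clock timer.
    import time
    import tensorflow as tf

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    start = time.time()
    model.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0)
    print("training took %.1fs" % (time.time() - start))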
look through phab for other tasks mentioning 'google search console' [18:39:05] hare: https://wikitech.wikimedia.org/wiki/Google_Search_Console_access maybe [18:39:21] Excellent, I will look into that. Thank you. [18:42:58] upgraded stat1005 back to the latest rocm versions, seems to run standard models such as elmo (sentence embedding), and miriam's image quality model [18:51:54] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) hacks abound, but basically: * Added `deb [arch=amd64]... [19:05:18] 10Analytics, 10EventBus, 10Beta-Cluster-reproducible, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10Pchelolo) > If you could verify on Beta prior to deployment, however, that would be... [19:08:28] 10Analytics, 10EventBus, 10Beta-Cluster-reproducible, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), 10Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (10Jdforrester-WMF) 05Open→03Resolved Yay. Thank you! [19:23:40] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (10Pchelolo) [19:32:01] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10Pchelolo) [19:55:19] harej: some of the search data is in druid, i am not sure you would find it of use for mediawiki.org given how small it is compared to other sites. do you have access to http://superset.wikimedia.org? [19:55:38] I probably don’t. [19:56:40] hare: i am not sure anyone actually has data about specific search terms cc ebernhar|lunch [19:57:01] hare: you can find, say, click-through ratios [19:57:28] hare: but to my knowledge we do not have data anywhere about specific search requests in google (terms) [19:58:22] hare: to file for access: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo-Pivot#Access [20:03:12] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) ping @Miriam @Gilles so they know the status of this. [20:14:41] milimetric: many thanks for having rerun the sqoop jobs :0 [20:14:42] :) [20:15:24] joal of course [20:15:39] One last missing bit is _SUCCESS files in /revision and /pagelinks snapshot folders [20:15:44] milimetric: --^ [20:15:58] Gone to bed then :) [20:16:05] joal: yes, but we’re waiting for comment/actor to finish [20:16:33] done :) [20:16:39] oh!?! [20:16:41] Ah - I thought you knew :) [20:16:52] k, no worries, will put success [20:16:53] actor/comment is A LOT faster :) [20:16:56] Crazy Fast!! [20:17:04] it took 23 hrs the first time [20:17:04] like 2 hours IIRC
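The _SUCCESS files mentioned just above (and touched at 20:19 below) are empty flag files that tell downstream oozie jobs a snapshot folder is complete. A hedged sketch of that step; the HDFS paths are hypothetical placeholders, not the real refinery layout:

    # Drop _SUCCESS flags on sqoop'd snapshot folders so downstream jobs start.
    import subprocess

    snapshot_dirs = [
        "/wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03",   # assumed path
        "/wmf/data/raw/mediawiki/tables/pagelinks/snapshot=2019-03",  # assumed path
    ]

    for d in snapshot_dirs:
        # hdfs dfs -touchz creates an empty file, the usual way to write a flag.
        subprocess.check_call(["hdfs", "dfs", "-touchz", d + "/_SUCCESS"])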
[20:17:20] Oh... maybe we set it to start later, oh [20:17:22] milimetric: 3 hosts versus 1 I guess [20:17:40] we set processors to 10 as well [20:17:41] to test that [20:17:48] yes :) [20:17:56] milimetric: I had tested, but 2 tests are better :) [20:18:13] And sorry for having missed the change in puppet :( [20:18:32] not at all, no real harm done [20:18:35] just a few hours' delay [20:18:40] yup [20:18:41] and our docs are better now :) [20:18:43] nite! [20:18:46] Gone for real ;) [20:19:13] mforns: ok, touching _SUCCESS flags now that actor/comment are done [20:19:22] milimetric, wow, so fast [20:19:30] yeah, super fast [20:19:35] we should see this job kick off after I do: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0131840-181112144035577-oozie-oozi-C/ [20:20:17] ok [20:20:56] k, all good [20:21:00] RoanKattouw: totally missed your ping earlier [20:22:11] RoanKattouw: the pageview token is generated only on the client, that is correct [21:33:46] 10Analytics, 10Analytics-Wikistats, 10ORES, 10Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479 (10Halfak) [21:34:59] 10Analytics, 10Dumps-Generation, 10ORES, 10Scoring-platform-team, and 3 others: Decide whether we will include raw features - https://phabricator.wikimedia.org/T211069 (10Halfak) [21:36:30] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (10Nuria) [22:07:48] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) [22:14:28] mforns: thanks for updating the docs! https://wikitech.wikimedia.org/w/index.php?title=Analytics/Data_Lake/Traffic/Pageview_hourly&action=history [23:11:34] 10Analytics, 10Dumps-Generation, 10WikiCite, 10Wikidata, 10Patch-For-Review: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday - https://phabricator.wikimedia.org/T216160 (10Pintoch) I agree with @Nicolastorzec above. I suspect that the entity dumps are more popular... [23:19:03] ebernhardson: outage report, any corrections welcomed: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo [23:20:27] nuria: seems about right. Perhaps not having 35 partitions would also help? No guarantees, but if spark had used 12 JVMs to talk to kafka instead of 35, things might have been better. I think 35 was chosen somewhat arbitrarily, as the number of servers on the prod end consuming/producing [23:21:37] ebernhardson: you know, i was looking at the conversion among APIs and that *seems* to be the really expensive thing; might be that + partitions, i guess, which translates to widespread load, right?
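A sketch of the "fewer concurrent clients" idea from this exchange: capping Spark at roughly 12 executors so at most 12 JVMs fetch from Kafka at once, rather than one per topic partition. This assumes a batch read through the spark-sql-kafka connector; the broker, topic and executor numbers are placeholders, not the mjolnir configuration.

    # Cap executor count so fewer JVMs hit Kafka concurrently.
    # Requires the spark-sql-kafka-0-10 connector on the classpath (assumed).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-read-capped")
             .config("spark.executor.instances", "12")  # at most 12 concurrent Kafka clients
             .config("spark.executor.cores", "1")
             .getOrCreate())

    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-jumbo.example:9092")  # placeholder
          .option("subscribe", "example.topic")                           # placeholder
          .load())

    # Even if the topic has 35 partitions, only ~12 fetch tasks run at a time.
    print(df.count())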
[23:22:30] nuria: well, each jvm is single-threaded, so it would mean 35 concurrent clients all asking kafka to convert records in memory at the same time [23:22:34] each spark jvm at least [23:22:59] ebernhardson: ya, the load must not have helped [23:23:06] i guess on the other hand, 35 shouldn't be that big a number for a server like kafka [23:23:15] 35 clients, that is [23:25:47] ebernhardson: https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/server/KafkaApis.scala#L436 [23:27:16] * ebernhardson would have to dig more to understand these magic numbers [23:29:32] ebernhardson: ya, same here [23:32:20] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (10Nuria) Of interest: https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/server/KafkaApis.... [23:32:50] i suppose the initial gate there is on fetchRequest.versionId <= 1; the kafka-python library we use for the rest of the kafka stuff in mjolnir supports up to FetchRequest_v5 [23:39:54] 10Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10Nuria) Well yesterday (2019-04-02) between 10 and 11 am we had about 120,000 reqs for fr.wikipedia edit data; last month in the same interval there were just a couple (literally a couple). The UA of... [23:46:06] * ebernhardson has often wondered about some sort of streaming webrequest analysis that detects things like ^, but i don't exactly want alerts for all unexpected query patterns... [23:50:05] 10Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10Nuria) For the whole day there are about 300,000 requests across three IPs [23:59:51] 10Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10Nuria) All these requests have a 200 response; need to check throttling settings
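Related to the FetchRequest version discussion above: with kafka-python, the api_version setting controls which protocol versions the client negotiates, and an old fetch version is what forces the broker to down-convert messages in memory. A sketch only; the topic and broker values are placeholders, and whether down-conversion is actually avoided also depends on the topic's message format version.

    # Pin the client protocol so fetches use a recent FetchRequest version.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "example.topic",                               # placeholder topic
        bootstrap_servers="kafka-jumbo.example:9092",  # placeholder broker
        api_version=(0, 11),        # new enough for FetchRequest v5 (assumed)
        group_id="example-group",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,  # stop iterating after 10s with no messages
    )

    for msg in consumer:
        print(msg.topic, msg.partition, msg.offset, len(msg.value))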