[00:16:23] Hello a-team, I was trying to `pip install sasl` for querying hive in python in my virtualenv on stat1007, but encountered an error: `fatal error: sasl/sasl.h: No such file or directory`
[00:16:33] Looks like a library is required: https://stackoverflow.com/questions/48562383/sasl-saslwrapper-h2223-fatal-error-sasl-sasl-h-no-such-file-or-directory?rq=1
[00:16:39] Can you help? Thanks!
[00:20:08] chelsyx: yeah it looks like you need one of the sre guys to install that package. It should be easy but I don’t think they’re around. You need the data in a shape that python can deal with? Have you tried pyspark and querying with spark-sql instead?
[00:21:02] milimetric: not yet, but I'm about to try it :)
[00:22:10] ok chelsyx lemme know if you want to try together, I’m sure there’s some workaround. I’ll be done with baby’s bath in like 15-20 minutes
[00:28:37] Thank you for the help milimetric! I will try that tomorrow and will let you know :)
[00:32:58] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error because of Declaration of JobQueueEventBus::doAck() must be compatible with that of JobQueue::doAck() - https://phabricator.wikimedia.org/T220662 (Legoktm)
[05:34:58] Analytics, Research, Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (Milimetric)
[06:13:00] chelsyx: o/ - let's follow up if you don't manage to resolve the issue.. I am happy to install the package but I am wondering why you need sasl for your use case (if you can point me to example code etc.. I am curious :)
[07:13:36] Good morning elukey
[07:13:48] Are you ready for a new druid-ish morning?
[07:17:31] yeah :(
[07:22:44] I'm currently trying to test the updated deletion-scripts - how do you want us to proceed, elukey?
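Regarding the `sasl/sasl.h` failure above: that header typically comes from a system package (hence the SRE install request), while the pyspark route suggested by milimetric avoids the native dependency entirely. A minimal sketch of that route, assuming pyspark is available in the virtualenv and the stat host can reach the Hive metastore; the database/table names below are placeholders:

```python
# Sketch only: query Hive through Spark SQL instead of a sasl-based client.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-query-example")  # illustrative application name
    .enableHiveSupport()            # read tables from the cluster's Hive metastore
    .getOrCreate()
)

# Placeholder database/table, not a real dataset name.
df = spark.sql("""
    SELECT page_title, COUNT(*) AS requests
    FROM some_db.some_table
    GROUP BY page_title
    LIMIT 10
""")

result = df.toPandas()  # pull the small result set back locally
print(result)

spark.stop()
```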
[07:25:33] elukey: just changed keep-rules for old datasource in druid - this will drop a bunch of segments
[07:25:36] I tried to check if our configs for http threads are applied, and it seems so
[07:25:49] 372204:2019-04-10T17:15:20,615 INFO io.druid.cli.CliBroker: * druid.broker.http.numConnections: 20
[07:25:52] for example
[07:26:17] ok - at least it looks like we can trust that
[07:26:29] the other tons of threads for the broker are a mystery to me, I guess related to processing + merge + cache
[07:26:32] I still wonder how we can end up with 300+ threads
[07:26:36] yeah
[07:26:50] this is probably worth asking on the mailing list
[07:28:18] let's assume for a moment that all those threads are doing something and that we can leave them aside for the moment
[07:28:21] we set:
[07:28:34] * 20 http threads to handle http connections from clients in the broker
[07:28:48] * 20 http threads in the broker as connection pool to the historicals
[07:29:02] * 20 http threads in the historicals to handle queries from brokers
[07:29:08] elukey: let's batcave for a minute if you're ok, about http server/client :)
[07:29:41] or not :
[07:29:51] elukey: I agree with the above
[07:30:16] elukey: And from the docs, it looks like the one we have correct is '20 http threads in the broker as connection pool to the historicals', while the other 2 are too small
[07:30:48] the other main thing is
[07:30:49] service":"druid/broker","host":"druid1004.eqiad.wmnet:8082","version":"0.12.3","metric":"jetty/numOpenConnections","value":37}
[07:31:14] elukey: something else I'd like to try to do as an exercise is to act as if we had multiple machines for each service, in order to share the load in a more reasonable way
[07:31:37] elukey: this seems bizarre :(
[07:31:48] probably we could ask for separate nodes for historicals in next fiscal
[07:31:53] Actually maybe not: connection queuing - 37 open, but only 20 worked
[07:32:15] yes this is my understanding as well about how jetty should work
[07:32:40] elukey: given the setup, I'd say 2 nodes for brokers + coordinators, and the rest for historicals (they're the ones doing most of the job)
[07:32:53] and then
[07:32:54] "service":"druid/historical","host":"druid1004.eqiad.wmnet:8083","version":"0.12.3","metric":"jetty/numOpenConnections","value":60}
[07:33:00] this is even more interesting
[07:33:17] hm
[07:33:30] elukey: 3 brokers * 20 connections
[07:33:56] yep since they seem steady and not changing
[07:36:33] joal: we can bc if you want
[07:36:46] elukey: do we try to change values faking that we have smaller machines (as if the broker had 1/3 of the machine and historical 2/3)?
[07:37:56] I'd keep it simple if possible..
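For reference, the three settings listed above map to the standard Druid runtime property names below. This is a sketch only: the values are the current ones quoted in this conversation, not a recommendation, and writing them as Python dicts is just for illustration (the real settings live in each service's runtime.properties):

```python
# Sketch: the HTTP thread/connection settings discussed above, keyed by the
# standard Druid property names.
broker_http_settings = {
    # HTTP threads serving queries from clients on the broker
    "druid.server.http.numThreads": 20,
    # connection pool from the broker towards the historicals
    "druid.broker.http.numConnections": 20,
}

historical_http_settings = {
    # HTTP threads on the historicals serving queries coming from the brokers
    "druid.server.http.numThreads": 20,
}

for role, settings in [("broker", broker_http_settings),
                       ("historical", historical_http_settings)]:
    for prop, value in settings.items():
        print(f"{role}: {prop} = {value}")
```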
[07:38:35] Ah yes :)
[10:00:44] (PS3) Joal: Refactor refinery/python/druid.py [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111)
[10:01:00] (CR) Joal: [V: +1] "tested on cluster" [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111) (owner: Joal)
[10:11:52] Analytics: Decide: start_timestamp for mediawiki history - https://phabricator.wikimedia.org/T220507 (JAllemandou) Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation this end of week to have it tested next week :)
[10:23:17] Analytics, Analytics-Kanban, Patch-For-Review: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (elukey) After the above changes it seems that the problem does not show up anymore. This is what we changed: * More HTTP threads on the historical d...
[10:24:14] ah snap I commented in the wrong one
[10:24:35] well it was generating aqs alerts
[10:26:06] Analytics, Analytics-Kanban, Patch-For-Review: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (elukey)
[10:36:42] PROBLEM - Number of segments reported as unavailable by the Druid Coordinators of the Public cluster on icinga1001 is CRITICAL: 2.347e+04 gt 200 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_public&var-druid_datasource=All
[10:47:17] this seems expected, related to an old datasource
[10:49:32] Analytics, User-Elukey: Show hits matching a list of IP subnets - https://phabricator.wikimedia.org/T220639 (elukey)
[10:53:51] Analytics, User-Elukey: Show hits matching a list of IP subnets - https://phabricator.wikimedia.org/T220639 (elukey) I had a chat with Arzhel on IRC. The scope of the project would be to check, among webrequest data, if we have traffic coming from IPs registered in subnets like: https://nusenu.github.io...
[10:54:12] Analytics, User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (elukey)
[11:12:24] elukey: I confirm old datasources are gone from druid dashboard - Thanks for the magic :)
[11:15:00] super :)
[11:15:08] I created https://phabricator.wikimedia.org/T220687 to order the new zookeeper nodes for hadoop
[11:15:23] (and possibly druid, presto, slider, etc..)
[11:15:36] great :)
[11:41:21] afk a bit for lunch, ttl!
[13:18:42] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) Hm, if those two heartbeats are indeed the reason for the kill, then perhaps increasing the worker timeout would be ok. The...
[13:22:56] joal: prometheus reports segments not available for 2018_12 in druid, is it related to a half-done data drop or something else?
[13:26:36] elukey: I think it was related to segments taking some time to be reloaded
[13:27:24] elukey: we have seen this alarm before, usually just after a new datasource creation --> data is available on HDFS, and transferring ~200Gb takes time :)
[13:28:18] joal: sorry I didn't make my point clear, the metrics are still showing segments unavailable :D
[13:28:28] Ah!
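A minimal sketch of how the "segments unavailable" state can be cross-checked against a coordinator directly, rather than through the Prometheus metric (the same check done via the coordinator UI in the next exchange). The coordinator host and port here are assumptions; `/druid/coordinator/v1/loadstatus` is the standard coordinator endpoint returning the loaded percentage per datasource:

```python
# Sketch: ask a Druid coordinator how much of each datasource is actually loaded.
import requests

COORDINATOR = "http://druid1004.eqiad.wmnet:8081"  # assumed coordinator address

resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus", timeout=10)
resp.raise_for_status()

for datasource, pct_loaded in sorted(resp.json().items()):
    note = "" if pct_loaded >= 100.0 else "  <-- segments still loading/unavailable"
    print(f"{datasource}: {pct_loaded:.1f}% loaded{note}")
```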
[13:28:35] This is not normal :)
[13:29:08] elukey: seems to be a metric issue: druid-coordinator UI shows all good
[13:30:11] just restarted them, let's see
[13:34:38] ack it was the past coordinator
[13:34:45] really weird problem to solve
[13:34:47] Ah
[13:35:23] hm - how to make the metric-extractor drop old metrics when a component is restarted
[13:36:08] I seem to remember now - this is a problem in the way the main prometheus library exporter works, namely it doesn't contemplate metrics disappearing
[13:42:51] Analytics, Research, Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (bmansurov)
[13:44:41] hello team :]
[13:44:52] RECOVERY - Number of segments reported as unavailable by the Druid Coordinators of the Public cluster on icinga1001 is OK: (C)200 gt (W)180 gt 0 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_public&var-druid_datasource=All
[14:00:18] Analytics, Operations, ops-eqiad, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (elukey) p:Triage→Normal
[14:10:14] Analytics, Operations, hardware-requests, netops, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (elukey) p:Triage→Normal
[14:12:07] mforns: o/
[14:12:16] :]
[14:12:32] need to check with the team but probably tbayer_popups' datasource in druid can be dropped?
[14:13:35] elukey, not sure if other people are using it...
[14:13:52] the code change looks good thoug
[14:13:55] though
[14:14:03] ah I have just merged it :D
[14:14:12] wasn't it a testing data source?
[14:14:21] we are not adding data to it regularly right?
[14:17:46] Analytics, User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (elukey)
[14:26:08] elukey, mmmm not sure
[14:30:19] elukey, yes, data stops at Nov 16th, and I don't recall having any ingestion job set for that
[14:34:32] all right, let's bring this up during standup
[14:35:24] elukey, did you do another aqs deploy yesterday afternoon?
[14:46:24] mforns: we did it only just before you joined
[14:49:39] elukey, the aqs alarms then at 6pm remain to be looked into?
[14:50:03] mforns: nono today we have fine-tuned druid settings, all good
[14:50:25] ok
[15:02:28] (CR) Mforns: [C: +2] "Looks great to me! Thanks!" [analytics/refinery] - https://gerrit.wikimedia.org/r/502469 (https://phabricator.wikimedia.org/T220111) (owner: Joal)
[15:08:43] Analytics, MediaWiki-extensions-ORES, Core Platform Team Backlog (Designing), Scoring-platform-team (Current), and 2 others: ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (Ladsgroup) a:Ladsgroup
[15:10:50] PROBLEM - Hue Server on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:11:26] this is me, testing host --^
[15:15:58] (CR) Milimetric: [C: -1] Replace time range selector (3 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[15:26:52] (CR) Mforns: [V: +1 C: +1] "LGTM!" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088) (owner: Milimetric)
[15:27:33] o/ does anyone know how long files in hdfs://analytics-cluster/tmp are kept around? Is it safe to use that location for storing temporary files for potentially long-running processes (2-3 days)?
[15:27:55] (PS3) Milimetric: Show "Loading metric details..." on async Detail [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088)
[15:28:10] no idea bmansurov, maybe elukey?
[15:28:23] ok
[15:28:47] Analytics, EventBus, Operations, monitoring, User-fgiunchedi: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (Ottomata)
[15:31:15] bmansurov, but thinking of it a bit, the tmp location is used by hadoop to store tmp data of ongoing jobs, and jobs do sometimes take a couple days
[15:31:55] mforns: that makes sense, thanks!
[15:33:00] yeah I'd say so too, not aware of retention rules for tmp (I should check if there are default ones)
[15:33:48] elukey: that'd be great
[15:34:01] (CR) Mforns: [V: +1 C: +1] "LGTM! Feel free to merge, if you feel like it. Just wanted to let others review, because it involves UI. And I felt in these cases we tend" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088) (owner: Milimetric)
[15:34:33] running a quick errand, back in a bit
[15:35:14] (CR) Milimetric: "agreed. The only annoying part is even on fast connections you get a little flash of the loading text. Wondering how @fdans feels about " [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088) (owner: Milimetric)
[15:36:39] milimetric mforns_brb maybe let's debounce it? that's what I've seen in a bunch of situations like this
[15:36:50] like, only show it if more than x time has passed
[15:45:55] fdans: good point, trying now
[15:51:03] milimetric, fdans, makes sense to me!
[16:00:02] (PS4) Milimetric: Show "Loading metric details..." on async Detail [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088)
[16:03:38] ottomata: standuuuppp
[16:03:43] cOMING
[16:05:22] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error because of Declaration of JobQueueEventBus::doAck() must be compatible with that of JobQueue::doAck() - https://phabricator.wikimedia.org/T220662 (kostajh) Should be fixed by...
[16:06:08] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error because of Declaration of JobQueueEventBus::doAck() must be compatible with that of JobQueue::doAck() - https://phabricator.wikimedia.org/T220662 (kostajh) p:Unbreak!→H...
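Going back to the "debounce" suggestion for the Wikistats 2 loading text above: the idea is to show the indicator only if the fetch is still running after a short delay, so fast responses never flash it. The real change would live in the Vue component; the sketch below just illustrates the pattern, and the 250 ms threshold and fetch function are made up:

```python
# Sketch of a delayed ("debounced") loading indicator.
import asyncio

LOADING_DELAY = 0.25  # seconds to wait before showing the indicator (assumption)

async def fetch_metric_details():
    await asyncio.sleep(1.0)  # stand-in for the real async data fetch
    return {"metric": "edits", "value": 42}

async def show_loading_after_delay():
    await asyncio.sleep(LOADING_DELAY)
    print("Loading metric details...")

async def main():
    loading = asyncio.create_task(show_loading_after_delay())
    result = await fetch_metric_details()
    loading.cancel()  # fetch done: hide (or never show) the indicator
    print("got:", result)

asyncio.run(main())
```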
[16:08:10] RECOVERY - Hue Server on analytics1039 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[16:18:35] Analytics, User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (fdans) p:Triage→Normal
[16:23:28] Analytics, Research, Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (fdans) p:Triage→Normal
[16:26:19] Analytics, User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (fdans) p:Triage→Normal
[16:26:51] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (fdans) p:Triage→High
[16:27:37] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (fdans) p:High→Unbreak!
[16:27:54] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (fdans) p:Unbreak!→High
[16:28:05] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (fdans) p:High→Unbreak!
[16:28:37] Analytics, Product-Analytics: Update R from 3.3.3 to 3.5.3 on stat and notebook machines - https://phabricator.wikimedia.org/T220542 (fdans) p:Triage→Normal
[16:31:41] Analytics: Decide: start_timestamp for mediawiki history - https://phabricator.wikimedia.org/T220507 (nettrom_WMF) From reading this, it sounds to me like it'll be possible to identify these events for pages because of the difference between the creation and first edit timestamps. What I'm wondering is if th...
[16:36:33] Analytics, Analytics-Data-Quality, Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (fdans) p:Triage→Low
[16:37:13] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric)
[16:37:19] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric) p:Triage→Normal
[16:37:30] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric)
[16:37:32] Analytics, Analytics-Wikistats: Add "Top used photos" metric - https://phabricator.wikimedia.org/T220485 (Milimetric)
[16:37:39] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric)
[16:37:41] Analytics, Analytics-Wikistats: Add "Top linked article" metric - https://phabricator.wikimedia.org/T220484 (Milimetric)
[16:38:22] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric)
[16:38:24] Analytics, Analytics-Wikistats: Add "Number of stub articles" metric - https://phabricator.wikimedia.org/T220483 (Milimetric)
[16:38:26] Analytics: Epic: content-based metrics - https://phabricator.wikimedia.org/T220717 (Milimetric)
[16:38:28] Analytics, Analytics-Wikistats: Add "Top large articles" metric - https://phabricator.wikimedia.org/T220482 (Milimetric)
[16:38:39] Analytics, Analytics-Data-Quality, Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (fdans) A lot of the wikis there are either private or closed. The rest can be added via a patch to the sqoop load list https://github.com/wikimed...
[16:41:02] Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (fdans) p:Triage→High
[16:41:08] Analytics, Analytics-Data-Quality, Product-Analytics, Patch-For-Review: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history - https://phabricator.wikimedia.org/T218463 (fdans) a:mforns
[16:43:19] Analytics: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (fdans) p:Triage→Normal
[16:48:50] joal: test cluster updated to cdh 5.16.1 :)
[17:16:54] * elukey off!
[17:30:01] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error because of Declaration of JobQueueEventBus::doAck() must be compatible with that of JobQueue::doAck() - https://phabricator.wikimedia.org/T220662 (DannyS712) @kostajh It works...
[17:33:21] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error because of Declaration of JobQueueEventBus::doAck() must be compatible with that of JobQueue::doAck() - https://phabricator.wikimedia.org/T220662 (kostajh) > But, it wasn't a...
[17:34:37] Analytics, EventBus, Operations, monitoring, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (mobrovac)
[17:37:06] Hey elukey, I was trying to install this package from Neil to query Hive inside of my python virtualenv on stat1007: https://github.com/neilpquinn/wmfdata , sasl is a dependency. Now I've decided to do this work on a notebook machine since I can install this package there. But it would be great if you can still help install it on stat1007! No rush!
[18:14:16] Analytics, EventBus, Services (watching): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (Ottomata)
[18:16:17] Analytics, EventBus, Services (watching): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (Pchelolo) I've also had an idea to create a limited concurrency consumer for the change-prop use-case, eg T206186
[18:34:55] !log restarting processor blacklisting SearchSatisfactionErrors and TestSearchSatisfaction2 from eventlogging-valid-mixed for mysql
[18:38:41] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Next), Services (next): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (mobrovac)
[18:38:55] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Next), Services (next): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (mobrovac) p:Triage→Normal
[18:53:24] chelsyx: ack got it! Created https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/503084/, after review will merge it, probably done by tomorrow :)
[18:55:10] Thanks elukey!
[18:56:03] elukey: i can merge
[19:03:03] super thanks ottomata :)
[19:16:30] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Miriam) HI All, I quickly tested a simple training task on stat1005,...
[21:05:56] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Nuria) >Next, I would like to test a more complex task, and measure ho...