[00:47:58] <RoanKattouw>	 Does anyone know what happened to ee-dashboard.wmflabs.org? We used to have graphs hosted at http://ee-dashboard.wmflabs.org/dashboards/enwiki-features#notifications-graphs-tab but now that domain doesn't resolve anymore
[01:35:16] <wikibugs>	 10Analytics, 10Analytics-EventLogging: uBlock blocks EventLogging - https://phabricator.wikimedia.org/T186572#3948086 (10Tgr)
[02:55:43] <wikibugs>	 10Analytics, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351#3948123 (10Hydriz)
[05:52:36] <icinga-wm>	 PROBLEM - YARN NodeManager JVM Heap usage on analytics1052 is CRITICAL: CRITICAL: 61.02% of data above the critical threshold [3891.2] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[05:53:36] <icinga-wm>	 RECOVERY - YARN NodeManager JVM Heap usage on analytics1052 is OK: OK: Less than 60.00% above the threshold [3891.2] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[08:49:22] <wikibugs>	 10Analytics-Kanban, 10Operations, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3948374 (10elukey) @akosiaris do you think that we can re-attempt to do the copy (maybe using ionice or rsync with limit bandwitdh or other) ?
[09:09:16] <icinga-wm>	 PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 2 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[09:10:01] <elukey>	 this is me testing --^
[09:10:18] <elukey>	 not really sure why there are two procs
[09:11:16] <icinga-wm>	 RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[09:11:25] <elukey>	 manually killed one
[09:11:32] <elukey>	 joal: o/
[09:11:38] <elukey>	 I disabled the webhdfs option
[09:11:45] <elukey>	 (manually for the moment)
[09:11:54] <elukey>	 hue seems complaining in the check config 
[09:11:58] <elukey>	 but it is much faster
[09:31:32] <elukey>	 https://github.com/cloudera/hue/issues/402
[09:31:53] <elukey>	 oh cloudera is on gh! 
[09:32:06] <elukey>	 so I can open bugs for oozie and hive scripts?
[09:33:57] <elukey>	 no not really
[09:48:03] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3948454 (10elukey) This example from librdkafka seems to be inline with what we are trying to do: https://github.com/e...
[09:58:42] <elukey>	 !log applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing
[09:58:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:41:45] <wikibugs>	 10Analytics-Kanban: Unconfigure Hue's use of webhdfs/httpfs - https://phabricator.wikimedia.org/T182242#3948671 (10elukey) p:05Triage>03Normal a:03elukey
[11:45:36] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3948687 (10elukey) As far as I can see there are no more actions to do on this particular task since:  - the host is OOW so after a chat we Chris we'd be inclined not to replace any p...
[12:05:37] * elukey lunch!
[12:39:29] <wikibugs>	 10Quarry: Search or filter queries by title or summary - https://phabricator.wikimedia.org/T90509#1060909 (10Elitre) If reusing queries is the whole point of avoiding that people reinvent the wheel, then we certainly need a way to... query the queries?
[12:55:55] <wikibugs>	 (03PS3) 10Mforns: Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria)
[12:59:45] <mforns>	 heloooo
[13:02:01] <wikibugs>	 (03PS4) 10Mforns: Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria)
[13:07:18] <wikibugs>	 (03CR) 10Joal: [C: 031] "LGTM, merge whenever needed." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408329 (owner: 10Nuria)
[13:11:41] <wikibugs>	 (03PS5) 10Mforns: Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria)
[13:13:51] <wikibugs>	 (03CR) 10Joal: [C: 031] "Code extraction looks good. I don't like the RefineTarget name but didn't find a better one." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408435 (https://phabricator.wikimedia.org/T181064) (owner: 10Ottomata)
[13:17:34] <wikibugs>	 10Analytics, 10Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3948818 (10elukey) All the work to move eventlogging to systemd is going to be tracked in https://phabricator.wikimedia.org/T114199, let's use this task only for the eventlog1002's productionization.
[13:28:40] <wikibugs>	 (03PS6) 10Mforns: Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria)
[13:35:58] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3948843 (10elukey) >>! In T182993#3947432, @Ottomata wrote: > Thanks @bblack, it's at least good to know that we'll need to do the IPSec thing o...
[13:45:20] <milimetric>	 joal: have you started on the sqoop changes?
[13:45:30] <joal>	 milimetric: not yet
[13:45:39] <joal>	 milimetric: was planning on discussing with you first
[13:45:44] <milimetric>	 ok, great
[13:46:01] <milimetric>	 yeah, we should talk about it, maybe we rethink the whole thing a bit since it's grown since our original code
[13:46:27] <milimetric>	 wanna hang out in a bit?  Still have to eat breakfast, but when do you have time?
[13:47:48] <joal>	 milimetric: I'll have to leave to catch Lino in ~1h10 mins - Before that, when you wish :) (or later tonight)
[13:48:10] <milimetric>	 k, then let's say in 12 minutes, on the hour
[13:48:20] <joal>	 Great
[13:49:58] <elukey>	 joal: o/
[13:50:04] <joal>	 Hi elukey 
[13:50:16] <elukey>	 do you have a minute for a json refine question?
[13:55:10] <joal>	 sure elukey  - Not sure I'm the best to answer, but shoot !
[13:55:28] <elukey>	 I am trying to set up json refine for netflow
[13:55:40] <elukey>	 just merged the new cron to import /raw/netflow :)
[13:55:53] <elukey>	 but I am unsure about this parameter https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/JsonRefine.scala#L165
[13:56:46] <elukey>	 for example, for eventlogging_analytics we have 'table,year,month,day,hour',
[13:56:50] <elukey>	 are those the hive partitions?
[13:57:23] <joal>	 elukey: blame ottomata[m] for this :-P His code is super-generic, but a bit harsh to configure :)
[13:58:24] <elukey>	 I only blame that slacker of elukey
[13:58:38] <elukey>	 nothing else :D
[13:59:58] <joal>	 elukey: The refinement takes data-path and convert them for hive - Since data in hive is partitioned, but the original folders don't follow the hive convention (camus ....), input-regex and input-capture allow to get the needed info about table and partition-needed data from the path
[14:00:26] <joal>	 elukey: I'm totally happy to hear that my explanation is not clear enough, or doesn't help :s
[14:03:23] <milimetric>	 joal: I'm in the cave, elukey: sorry for stealing him, we're not doing anything critical so ping us if you need us
[14:03:37] <joal>	 sorry milimetric `
[14:04:46] <elukey>	 nono please go ahead, me neither :)
[14:19:50] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban: Set up (temporary) IPSec for Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T186598#3948896 (10Ottomata) p:05Triage>03Normal
[14:23:43] <ottomata>	 elukey:  for netflow stuff, i wonder if we should make 'db' level folders as 'ops' or something
[14:23:45] <ottomata>	 haha
[14:23:47] <ottomata>	 sre
[14:23:52] <ottomata>	 sre sigh what a dumb name
[14:23:54] <ottomata>	 thanks for nothing google.
[14:23:58] <ottomata>	 like 
[14:24:14] <ottomata>	  /wmf/data/raw/ops, /wmf/data/ops
[14:24:18] <ottomata>	 then with camus and jsonrefine
[14:24:29] <ottomata>	 you'd get topic and table name folders of 'netflow' in those dirs
[14:24:31] <ottomata>	 ?
[14:24:42] <ottomata>	 or
[14:24:45] <ottomata>	 hm
[14:25:00] <ottomata>	  /wmf/data/raw/ops and just use jsonrefine to put netflow in /wmf/data/wmf/netflow?
[14:25:47] <ottomata>	 that way if there are more opsy stuff like this, we can use camus to import them into the same place
[14:28:45] <elukey>	 ottomata: hello! I think it is not necessary but if you feel strongly about it we can do it
[14:29:15] <ottomata>	 well, as is you'll get raw/netflow/netflow
[14:29:23] <ottomata>	 and then a 'database' and table named netflow
[14:29:30] <ottomata>	 so /wmf/data/netflow/netflow
[14:29:35] <ottomata>	 and in hive
[14:29:43] <ottomata>	  netflow db and netflow table, e.g. netflow.netflow
[14:29:52] <elukey>	 buuuuu didn't know it
[14:30:10] <ottomata>	 ya just commented on camus, didn't notice either
[14:30:11] <ottomata>	 for camus
[14:30:17] <ottomata>	 you set base path to raw/netflow
[14:30:21] <ottomata>	 but the topic is netflow
[14:30:31] <ottomata>	 and it will write to topic named folders in the base path
[14:30:45] <ottomata>	 then for refien, you set output path to /wmf/data/netflow
[14:30:51] <ottomata>	 and are capturing the table name from the input path as netflow
[14:30:58] <ottomata>	 and also set the database name to netflow
[14:31:21] <elukey>	 so the base path should've been only 'raw' ?
[14:31:41] <ottomata>	 if you want it to write e.g. at /wmf/data/raw/netflow/hourly/... then yeah
[14:31:52] <elukey>	 ok got it, didn't know it
[14:32:07] <ottomata>	 i guess we could do that, and then just put the netflow stuff into the wmf database and at /wmf/data/wmf ?
[14:32:13] <elukey>	 I think I picked up the eventlogging example that was the wrong one
[14:32:15] <ottomata>	 then we don't need to make new top level stuff
[14:32:22] <ottomata>	 well, yours is the only one that uses one topic :)
[14:32:53] <ottomata>	 but, camus is always base.path/<topic>, and then jsonrefine will do output_base_path/<captured_table_name>
[14:33:11] * elukey nods
[14:36:29] <elukey>	 ottomata: shall I kill the camus-netflow job then?
[14:36:39] <elukey>	 it is running now
[14:37:48] <ottomata>	 ya you can kill
[14:37:57] <ottomata>	 or wait and delete data, doesn't really matter yet, right?
[14:38:02] <ottomata>	 kill might be easiest
[14:38:15] <ottomata>	 sorry i only realized this when reading your json refine patch just now
[14:38:31] <elukey>	 my bad didn't know this bit about camus :(
[14:38:42] <elukey>	 nono doesn't matter now, I can wait and delete
[14:39:18] <elukey>	 just to learn: what's happening right now on hdfs? I am seeing that /wmf/camus/netflow gets populated, but I don't see anything in /wmf/raw
[14:39:21] <elukey>	 yet
[14:43:19] <ottomata>	 looking
[14:43:52] <ottomata>	 ah
[14:43:52] <ottomata>	 yes
[14:44:05] <ottomata>	  the stuff in /wmf/camus is camus metadata stuff
[14:44:15] <ottomata>	 info about job history, storing offsets, temporary paths, etc.
[14:44:24] <ottomata>	 at the end of the job, after it finishes writing to hdfs from kafka
[14:44:49] <ottomata>	 it will atomically move the data in the incoming directory to the final output base  path, in this case /wmf/data/raw/netflow/<topics*>
[14:45:23] <ottomata>	 so you wont' see anything in the destination.path (sorry, that is the correct name) until it is done
[14:45:35] <elukey>	 (in the meantime, hue should load faster now)
[14:45:58] <elukey>	 (it now complains about the conf but we can live with it)
[14:55:07] <elukey>	 # final top-level data output directory, sub-directory will be dynamically created for each topic pulled
[14:55:15] <elukey>	 it was also written clearly
[14:55:20] * elukey cries in a corner
[14:55:20] <ottomata>	 nice elukey
[14:58:49] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3949023 (10Ottomata) a:03Ottomata
[14:59:25] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban: Monitor and alert if no new data from JsonRefine jobs - https://phabricator.wikimedia.org/T186602#3949036 (10Ottomata) p:05Triage>03High
[15:05:27] <ottomata>	 elukey:  got a quick sec for a json refine data monitoring brain bounce?
[15:08:18] <elukey>	 ottomata: of course
[15:08:33] <elukey>	 joal: https://gerrit.wikimedia.org/r/408539 whenever you have time :)
[15:08:47] <ottomata>	 in bc
[15:09:05] <ottomata>	 elukey:  i think you can merge that without jo al :)
[15:09:51] <elukey>	 all riiighhhtttt
[15:10:48] <elukey>	 ottomata: did you mean in bc?
[15:10:55] <ottomata>	 ya!
[15:10:56] <ottomata>	 come in!
[15:10:58] <elukey>	 ahh coming! 
[15:36:52] <elukey>	 !log drain + shutdown of analytics1038 to replace faulty BBU
[15:37:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:01:14] <nuria_>	 ping ottomata joal 
[16:17:57] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3949382 (10Nuria) >We have a new employee starting next week who will be working in just the right area to do this review as well, but this isn'...
[16:24:53] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3949399 (10elukey) Much better now!  ``` elukey@analytics1038:~$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll   BBU Capacity Info for Adapter: 0    Relative State of Charge: 8...
[16:25:03] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3949400 (10elukey) 05Open>03Resolved a:03elukey
[17:11:47] <wikibugs>	 10Analytics-Kanban, 10Operations, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3949567 (10akosiaris) @elukey. Feel free to try. If anything it will provide us with some more insight into T181121. FWIW I had refilled ganeti1005 with the VMs assigned to...
[17:25:02] <wikibugs>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (next): Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052#3949667 (10Pchelolo) I have rerun the script for 5 million events and the results are fairly similar to what was observed, so the plan is valid.
[17:31:29] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Purge refined JSON data after 90 days - https://phabricator.wikimedia.org/T181064#3949677 (10mforns) a:03mforns
[17:33:18] <elukey>	 joal: the current netflow files imported from camus are ~15MB in size, that is not very good right?
[17:33:25] <elukey>	 more data will likely come in soonish
[17:34:33] <joal>	 elukey: will look later, when back from diner )
[17:47:00] <elukey>	 ottomata: https://gerrit.wikimedia.org/r/c/408535/3/modules/profile/manifests/analytics/refinery/job/json_refine.pp :)
[18:06:04] <wikibugs>	 (03PS1) 10Joal: Add GetMediawikiTimestampUDF to refinery-hive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155)
[18:10:27] <elukey>	 so joal, if hdfs blocks usage is ok for netflow (more data will come so the overhead should be minimal in the near future) I'll merge https://gerrit.wikimedia.org/r/c/408535/5/modules/profile/manifests/analytics/refinery/job/json_refine.pp to enable json refine for netflow :)
[18:10:36] <elukey>	 going offline now so we can discuss it tomorrow :)
[18:10:38] <elukey>	 byyyeee
[18:26:57] <wikibugs>	 (03CR) 10Ottomata: [C: 031] "General +1 thanks!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155) (owner: 10Joal)
[18:32:05] <elukey>	 ottomata: do you have any wikitech material about the new realtime data platform plans? (sorry if the name is wrong but I don't recall the last/correct one :)
[18:34:45] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Factor out RefineTarget from JsonRefine for use with other jobs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408435 (https://phabricator.wikimedia.org/T181064) (owner: 10Ottomata)
[18:35:06] <ottomata>	 elukey:  hmmm just https://wikitech.wikimedia.org/wiki/User:Ottomata/Stream_Data_Platform
[18:35:09] <ottomata>	 which will probably move soon
[18:35:12] <ottomata>	 this week maybe
[18:36:22] <elukey>	 super
[18:36:35] <elukey>	 other thing: shall we order 3 new hosts for zookeeper?
[18:36:48] <elukey>	 and then leave the conf100* location?
[18:36:49] <ottomata>	 oh for next fy budget
[18:36:50] <ottomata>	 i'm into it.
[18:36:56] <ottomata>	 yeah, let's do that
[18:37:03] <elukey>	 I was talking with Giuseppe to separate etcd and zk
[18:37:10] <ottomata>	 for eqiad and codfw, ya?
[18:37:14] <elukey>	 yeah
[18:37:16] <ottomata>	 should we use ganeti?
[18:37:19] <elukey>	 nope
[18:37:20] <ottomata>	 ok
[18:37:33] <elukey>	 I mean, maybe, but I don't really trust it that much
[18:37:38] <elukey>	 especially with the last bug ongoing :(
[18:37:43] <ottomata>	 k :)
[18:38:16] <elukey>	 super :)
[18:42:57] <nuria_>	 RoanKattouw: there are several here: https://edit-analysis.wmflabs.org/compare/  https://edit-analysis.wmflabs.org/multimedia-health/#projects=enwiki,dewiki,commonswiki/metrics=Uploads
[18:44:01] <nuria_>	 RoanKattouw: also https://flow-reportcard.wmflabs.org/#usage-data
[18:44:58] <nuria_>	 RoanKattouw: you can see all dashboards here https://github.com/wikimedia/analytics-dashiki/blob/master/config.yaml
[18:45:38] <nuria_>	 RoanKattouw: let us know if you still do not find the one you are looking for
[18:46:28] <wikibugs>	 (03CR) 10Mforns: [C: 031] "Looks cool! LGTM as is, but left a couple suggestions in case you feel identified :]" (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155) (owner: 10Joal)
[18:47:54] <wikibugs>	 (03CR) 10Mforns: [C: 031] Add GetMediawikiTimestampUDF to refinery-hive (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155) (owner: 10Joal)
[18:48:06] <joal>	 elukey: Just looked at the data: the only way to reduce blocks overhead would be to import daily - But given data will grow, it's probably a fake issue
[18:48:11] <wikibugs>	 (03CR) 10Nuria: [C: 032] Add GetMediawikiTimestampUDF to refinery-hive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155) (owner: 10Joal)
[18:48:52] <ottomata>	 also, it isn't really an '
[18:48:57] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Provide MediaWiki timestamps in Hive-refined EventLogging tables via UDF - https://phabricator.wikimedia.org/T186155#3949971 (10JAllemandou) a:05fdans>03JAllemandou
[18:49:08] <ottomata>	 'issue', right?
[18:49:34] <ottomata>	 hadoop just has overhead, so isn't efficient on small data
[18:49:38] <joal>	 ottomata: small files - Hadoop doesn't like them too much
[18:49:46] <ottomata>	 doesn't like them?
[18:49:55] <ottomata>	 it is fine with them, no?  its just that there's latecny overhead
[18:50:11] <ottomata>	 it would be faster to work on small files in non hadoop, but we don't really mind so much
[18:50:13] <joal>	 ottomata: correct, and that they put more pressure on namenode
[18:50:16] <ottomata>	 we use more resources than needed
[18:50:29] <joal>	 it'll work for sure
[18:50:30] <ottomata>	 hm, really? not, its just that more files put pressure on namenode, right?
[18:50:35] <ottomata>	 not necessarily small files
[18:50:37] <ottomata>	 ?
[18:50:55] <joal>	 correct ottomata - same data size, smaller files = more files
[18:51:01] <ottomata>	 aye
[18:51:12] <ottomata>	 nuria_:  i'm getting my head into planning stuff if you find some time for me
[18:51:40] <nuria_>	 ottomata: i can do it later on today ? or tomorrow?
[18:53:49] <wikibugs>	 (03Merged) 10jenkins-bot: Add GetMediawikiTimestampUDF to refinery-hive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408567 (https://phabricator.wikimedia.org/T186155) (owner: 10Joal)
[18:56:41] <ottomata>	 later today is good
[18:56:54] <ottomata>	 trying to understand the difference between outcomes and output
[18:59:10] <joal>	 milimetric: Do you have a minute for me?
[18:59:19] <milimetric>	 joal: cave?
[18:59:23] <joal>	 OMW !
[19:19:24] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, and 3 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3950071 (10Pchelolo) Yesterday during the deployment of JobQueue CP instance we've had a small incident because of the disparity betwee...
[19:34:46] <RoanKattouw>	 nuria_: None of those are what I'm looking for unfortunately. They were stats about notifications and I haven't looked at them in a while. No idea where they were hosted/defined exactly because they were made before my time
[19:40:46] <joal>	 !log Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
[19:40:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:45:42] <joal>	 hm - looks like we have an issue with Druid tonight
[19:49:34] <joal>	 ottomata, elukey - Unexpected behavior from druid-public tonight
[19:52:54] <ottomata>	 oh?
[19:53:17] <joal>	 ottomata: indexation tasks have failed, but related hadoop jobs continue
[19:55:00] <joal>	 Killing the hadoop job now
[19:55:52] <ottomata>	 ok...?
[19:56:24] <joal>	 ottomata: I have actually no clue why that indexing task died :(
[19:57:31] <joal>	 ottomata: Looking at https://grafana-admin.wikimedia.org/dashboard/db/prometheus-druid
[19:58:04] <joal>	 ottomata: Doesn't seem related to daemon failures
[19:59:05] <ottomata>	 joal:  any errors in yarn app logs?
[19:59:21] <joal>	 ottomata: I have not checked - Will do
[20:02:43] <joal>	 ottomata: Only error I find is: 2018-02-06 19:55:28,571 Thread-4 ERROR Unable to register shutdown hook because JVM is shutting down. java.lang.IllegalStateException: Cannot add new shutdown hook as this is not started. Current state: STOPPED
[20:06:44] <joal>	 ottomata: The weird aspect of it is that the MR indexing job hasn't failed
[20:07:26] <joal>	 ottomata: It's as if the druid-middlemanager lost connection with the AppMaster of the MR-job
[20:09:49] <ottomata>	 joal:  do you have a druid job id or somethign i can grep for?
[20:10:49] <joal>	 grep wherE?
[20:10:54] <ottomata>	 druid logs
[20:11:18] <joal>	 ottomata: end of overlord.log log on druid1005
[20:11:38] <joal>	 ottomata: I'm about to restart a task - you want?
[20:12:15] <ottomata>	 nope go ahead
[20:12:44] <joal>	 task started
[20:13:05] <joal>	 index_hadoop_mediawiki_history_reduced_2018-02-06T20:12:29.351Z
[20:13:32] <joal>	 Related hadoop job: application_1515441536446_99237
[20:16:26] <joal>	 ottomata: so far so good
[20:21:43] <joal>	 And failed :(
[20:24:41] <joal>	 ottomata: Weirder - middlemanager is still receiving and logging hadoop MR progress
[20:26:36] * joal wonders what to do :(
[20:28:19] <ottomata>	 failed?
[20:28:39] <ottomata>	 2018-02-06T20:14:41,941 ERROR io.druid.indexing.overlord.RemoteTaskRunner: Alert not emitted, emitting. Task assignment timed out on worker [druid1006.eqiad.wmnet:8091], never ran task [index_hadoop_mediawiki_history_reduced_2018-02-06T19:34:20.683Z]! Timeout: (300000 >= PT5M)!: {class=io.druid.indexing.overlord.RemoteTaskRunner}
[20:28:45] <ottomata>	 2018-02-06T20:17:29,689 ERROR io.druid.indexing.overlord.RemoteTaskRunner: WTF?! Asked to cleanup nonexistent task: {class=io.druid.indexing.overlord.RemoteTaskRunner, taskId=index_hadoop_mediawiki_history_reduced_2018-02-06T20:12:29.351Z}
[20:28:53] <joal>	 ottomata: task is seen as failed in overlord, but MR job is still running, and middle-manager still conected to MR
[20:30:13] <ottomata>	 WTF?!
[20:30:16] <ottomata>	 haha
[20:32:33] <ottomata>	 joal:  i think let it go?
[20:32:39] <ottomata>	 the indexing is still happening, right?
[20:32:45] <ottomata>	 overlord is confused somehow
[20:33:27] <joal>	 ottomata: I'll let it finish tonight and check tomorrow morning, see 1) if it has succeeded 2) if druid knowsd about new data in that case
[20:33:39] <ottomata>	 aye
[20:37:20] <joal>	 ok - gone for tonight team
[20:37:37] <joal>	 milimetric: I have not moved forward the sqoop patch as expected - will take it again tomorrow
[20:44:36] <ottomata>	 laters joal!
[21:18:03] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, and 2 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3950472 (10mobrovac)
[21:18:35] <milimetric>	 ottomata: what do you think about that Facebook/Twitter data?  put it in /wmf/data/raw/external/facebook .../external/twitter ?
[21:18:38] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, and 2 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3614140 (10mobrovac) Mental note: we will also have to update librdkafka in Vagrant once all of the services are ported to node-rdkafka...
[21:19:24] <milimetric>	 using cron that runs in our puppet on stat1005 maybe?
[21:19:44] <ottomata>	 not sure i understand that request
[21:19:48] <milimetric>	 uh... cron that runs IN puppet, wth am I talking about
[21:20:13] <milimetric>	 so they're running some campaigns and Facebook/Twitter are giving them data that they then download and maybe process lightly
[21:20:37] <milimetric>	 they can hang on to it but they'd rather put it on our servers so it's all in one place, which sounds good to me
[21:20:52] <milimetric>	 but some of the data needs to be downloaded on an ongoing basis
[21:21:05] <milimetric>	 so they'd make some crons "somewhere"
[21:21:46] <milimetric>	 that just seems safer and better if it all happened from our machines in the first place
[21:21:58] <milimetric>	 I wanted to reply but check with you first just in case
[21:26:05] <ottomata>	 milimetric:  sounds fine i guess, but this sounds more like something they should get a stat box accound and do themselves?
[21:26:09] <ottomata>	 i this big or large data?
[21:26:17] <ottomata>	 should they just do like others and have one of their own accounts schedule the download?
[21:26:21] <ottomata>	 should they maybe just do this in labs?
[21:26:23] <ottomata>	 its all external data
[21:26:24] <milimetric>	 < 10G
[21:26:48] <milimetric>	 it's external but that doesn't mean it's not sensitive at all, it's not public 
[21:26:53] <ottomata>	 ok
[21:26:59] <ottomata>	 so stat1006? :)
[21:27:09] <milimetric>	 but wouldn't we want this on hdfs?
[21:27:14] <ottomata>	 would we?
[21:27:40] <milimetric>	 not sure, it's up to us I think
[21:27:50] <milimetric>	 hdfs is nice because it's safe and backed up
[21:27:51] <ottomata>	 ha, not sure either, but why is it up to us?!
[21:28:12] <milimetric>	 because they don't have to put this data on our servers at all, the contractors could just crunch it and give Anne the results
[21:29:06] <ottomata>	 they jsut want to give it to use to back i tup?
[21:29:37] <ottomata>	 milimetric:  i'm just asking these questions, because i kinda want them to do it!  this sounds super easy to me, should we do it for them?
[21:30:16] <milimetric>	 oh, they should definitely do it either way
[21:30:35] <milimetric>	 but it's more - do we want them to do it 100% themselves on stat1006, and then it's backed up like other stuff there?
[21:31:03] <milimetric>	 or do we want them to do it a little more officially with a cron in puppet and put it up on hdfs in a proper place, anticipating more stuff like this might come at us in the future
[21:31:20] <ottomata>	 if they put it on stat1006, it will be backed up by bacula
[21:31:23] <ottomata>	 in their home dirs
[21:31:35] <milimetric>	 ok, cool, let's tell them that then
[21:31:45] <milimetric>	 if it becomes a common thing, we can think again
[21:31:59] <ottomata>	 aye
[21:32:01] <ottomata>	 k
[21:42:42] <mforns>	 logging off team, bye!
[22:56:58] <nuria_>	 RoanKattouw: i think i found what happen to your missing dashboard, see: https://phabricator.wikimedia.org/T126358
[22:57:35] <nuria_>	 RoanKattouw: if you need it it shoudl not be hard to do at all having teh data
[23:05:58] <nuria_>	 RoanKattouw: is this the data or this is somethimng else?
[23:06:00] <nuria_>	 https://github.com/wikimedia/analytics-limn-ee-data
[23:07:20] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10Services (doing): Support reliable delayed job execution in ChangeProp - https://phabricator.wikimedia.org/T186261#3950892 (10Pchelolo) A very basic version of this implemented at https://github.com/wikimedia/change-propagation/pull/233  The PR contains some questio...
[23:27:24] <wikibugs>	 (03CR) 10Nuria: [V: 032 C: 032] "Let's deploy this change after the java migration right? so as not to get side tracked with possible errors?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405899 (https://phabricator.wikimedia.org/T167907) (owner: 10Joal)
[23:42:14] <wikibugs>	 (03CR) 10Nuria: [C: 032] Optimize WikiSelector for slow browsers (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria)