[01:14:23] So I'm new to this hadoop stuff. Is there an equivalent to indexed fields like in a mysql db, or is it just the partition fields that are the equivalent?
[02:24:53] bawolff: there's no indexed field exactly, because we're just going over tons and tons of files, but some of the tables are stored in Parquet format, which kind of works like columnar storage. So for those tables, the fewer columns you select, the faster the query. Partitions are just literally folders that split up the files so you don't iterate over more files than you have to. So partitions can be year=2018, year=2017, those
[02:24:53] translate to folders like /some-data/year=2018/ with files inside there
[06:24:16] morning!
[06:24:23] Hi elukey :)
[06:24:40] so it seems that Druid re-compiled with hadoop-client 2.6.0 (already deployed in labs) doesn't work
[06:24:48] :(
[06:25:01] elukey: what happens?
[06:25:11] same jackson issue, but if you want we can check together
[06:25:20] let's do that
[06:26:02] my main doubt is: since in pom.xml, jackson 2.4.6 was set a while ago, what changed between 0.10 and 0.11 to cause this?
[06:26:25] elukey: I don't actually know !!!!
[06:27:09] elukey: My assumption is that the classpath loading is not as it was in 0.10 (for which mapreduce.user.classpath.first was working)
[06:29:44] elukey: the indexation you're currently running is now failing because of the incorrect path, not because of jackson anymore
[06:29:56] (same issue as with the non-recompiled version)
[06:30:12] elukey: I think this (the hadoop-path error) is the closest to success we have so far
[06:30:18] elukey: I'll investigate in that direction
[06:30:40] joal: how did you get that? I find it difficult to get yarn logs, I ended up checking in HDFS because I can't make the CLI work
[06:30:45] and found only jackson issues
[06:31:20] Here's the thing elukey: I have run many of those trials, and found that, when the issue is jackson, the job fails rapidly
[06:32:05] the one you're running hasn't failed rapidly, but has one failed reduce - This usually means the other error we've experienced
[06:32:43] For logs elukey, easiest is to wait for at least one reduce to fail (so one would do), kill the yarn application manually and ask yarn for logs
[06:33:52] but via yarn logs -applicationId etc.. ?
[06:33:59] yessir
[06:34:09] sorry for the dumb qs but I haven't managed to make it work in labs
[06:34:14] you'd need the --appOwner though :)
[06:34:21] yeah I added that
[06:34:32] but it keeps telling me that no logs are there
[06:34:33] elukey: yarn logs belong to certain users, and are stored by user
[06:34:55] yes yes I know, they also correspond to paths
[06:35:07] When you use the yarn logs command, it assumes you want the logs for an application run by the user you are
[06:35:14] Indexations are run by user druid
[06:35:19] unless you add --appOwner, I know :)
[06:35:21] I tried that
[06:35:27] hm
[06:35:37] it must be me doing something stupid as always
[06:35:38] That's very bizarre, it works for me
[06:35:49] do you run it from d-1 ?
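A minimal sketch of the log-retrieval flow being discussed, assuming a hypothetical application id; the --appOwner flag matters because the indexation jobs run as the druid user, and YARN stores aggregated logs per owner:

    # list applications to find the id of an indexation (run on a Hadoop client node)
    yarn application -list -appStates ALL | grep index_hadoop
    # fetch the aggregated logs for an application owned by another user
    yarn logs -applicationId application_1524125463609_0749 --appOwner druid | less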
[06:36:00] maybe I simply didn't wait for the job to complete
[06:36:08] going to check later :)
[06:36:12] Ah no - I run it from any of the hadoop nodes
[06:36:37] But if the yarn command succeeds, it should work
[06:37:31] aahhh now it workkkssss
[06:37:35] I ammm stupidddddd
[06:37:37] * elukey cries
[06:38:05] for this one I keep seeing jackson weirdness
[06:38:24] elukey: mwarf
[06:39:32] joal: one thing that we could do is try to see what jackson version we use on CDH and rebuild druid with it
[06:39:49] elukey: feasible
[06:39:59] elukey: I've already had to trick refinery because of jackson issues
[06:40:48] ufff
[06:40:51] hadoop-0.20-mapreduce: /usr/lib/hadoop-0.20-mapreduce/lib/jackson-core-2.2.3.jar
[06:41:04] hadoop-client: /usr/lib/hadoop/client/jackson-core-2.2.3.jar
[06:41:27] elukey: in refinery currently, I explicitly exclude com.fasterxml.jackson.core:* from hadoop-hdfs, and explicitly import 2.6.5
[06:41:35] right
[06:42:03] elukey: The user.classpath.first is the thing that should make it work
[06:43:00] ah the good news is that mvn clean package has to run with -DskipTests for hadoop-client:2.6.0 because they fail :P
[06:43:07] due to Guava issues in the tests, nothing serious
[06:43:12] but annoying
[06:44:44] hm
[06:46:20] anyhow, I can try to build with 2.2.3
[06:46:59] also reading https://github.com/druid-io/druid/issues/2087
[06:48:39] and opened https://github.com/druid-io/druid/issues/5763
[06:48:47] hopefully somebody will answer :)
[06:50:11] elukey: ok - :)
[06:50:16] Thanks a lot for doing that elukey
[06:50:39] elukey: as I said, I think the closest we got was with that error about the path not being correct
[06:53:32] (03PS1) 10Joal: Update pageview regexp to accept more chapters [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/432342 (https://phabricator.wikimedia.org/T194309)
[06:54:00] 10Analytics, 10Analytics-Kanban, 10Pageviews-API, 10Patch-For-Review: Add nyc.wikimedia to pageviews whitelist - https://phabricator.wikimedia.org/T194309#4196158 (10JAllemandou) a:03JAllemandou
[06:54:30] joal: so what jackson version should I use? The task mentions 2.3.5 (checking if it was used before in pom.xml), we should have 2.3.3 right? can you confirm?
[06:54:40] I'll also restore the hadoop client to 2.7.3
[06:55:33] elukey: triple checking
[06:58:03] elukey: currently rebuilding the jars (sloooooow)
[07:01:45] I started a build with 2.3.3 as a speculative action :P
[07:03:18] elukey: Since I removed the failing one from refinery, I'm actually having issues getting back - I'll have it soon :)
[07:17:28] so joal I removed the labs override that I put in place to remove the -D option that andrew put in prod
[07:17:36] to see if anything changed
[07:17:43] buuut errors :)
[07:17:44] I don't understand elukey :(
[07:19:03] among the druid.indexer.runner.javaOpts in puppet we have -Dhadoop.mapreduce.job.user.classpath.first=true
[07:20:04] yes
[07:20:31] Ah, basically labs is back to prod-config in terms of -D
[07:20:36] yeah! :)
[07:20:40] Cool
[07:21:46] elukey: man, I'm really struggling to get the jackson version !
[07:34:45] elukey: I don't actually find jackson as a dep of hadoop in refinery
[07:37:29] it could be a dep of hadoop itself, maybe refinery uses only hadoop-client?
[07:38:26] elukey: I think you're right
[07:39:05] elukey: so we're back in normal mode in d-labs, right? (regular hadoop dep + user.classpath.first set)
[07:44:29] elukey: May I kill the currently running job in labs?
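For the record, killing a running indexation by hand looks roughly like this (the application id is hypothetical):

    # find the application id of the running indexation task
    yarn application -list | grep index_hadoop
    # then kill it
    yarn application -kill application_1524125463609_0749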
[07:45:08] yes yes
[07:45:21] be brutal and do whatever you want :)
[07:53:04] elukey: from conf in d-1: druid.storage.storageDirectory=/user/druid/deep-storage-analytics_test0
[07:53:35] elukey: Could we make that: druid.storage.storageDirectory=hdfs://analytics-hadoop-labs/user/druid/deep-storage-analytics_test0
[07:53:38] ?
[07:53:56] * elukey sees what joseph is trying to do
[07:54:03] * elukey amends the config in labs
[07:54:08] * joal feels watched
[07:55:09] :-P
[07:59:05] -druid.storage.storageDirectory=/user/druid/deep-storage-analytics_test0
[07:59:08] +druid.storage.storageDirectory=hdfs://analytics-hadoop-labs/user/druid/deep-storage-analytics_test0
[07:59:11] done!
[07:59:25] elukey: druid restarted?
[07:59:28] yep
[08:01:37] * joal does a courbette
[08:02:50] I have the new druid 0.11 version ready
[08:03:14] need to debianize it, build and copy to labs
[08:03:39] joal: did you manage to reproduce the hdfs path issue with the new version that I deployed?
[08:03:55] elukey: trying now
[08:04:07] elukey: I have found my courbette gif: https://assets.rbl.ms/10870767/980x.gif
[08:06:59] :)
[08:10:46] elukey: I'm waiting for the job to have a failed reducer - takes ~10mins
[08:11:25] Oh by the way elukey - I meant to ask
[08:11:53] elukey: I'd like to add a puppet patch for the AQS conf to match https://gerrit.wikimedia.org/r/#/c/429765/
[08:12:28] This means adding a datasources section to the druid config, with a value for mediawiki_history
[08:13:21] ah yes
[08:13:34] I've been looking at puppet code, and wondered about adding a single field in the AQS profile, or a list? or something else?
[08:19:54] also elukey, found out about http://druid.io/docs/latest/development/extensions-contrib/parquet.html
[08:20:14] elukey: In some useful cases, this would prevent us from having a precomputation step !
[08:21:50] so about the puppet patch, I'd start from the aqs yaml file erb template, and then go upwards to class/profile
[08:22:20] elukey: I did that - I was wondering about adding a list for druid datasources, or a single value as of now
[08:22:48] ah ok, we can have a map of values
[08:23:30] elukey: you tell me what's best :)
[08:23:32] that we render directly in the erb template
[08:25:37] also elukey - Do we use systemd for druid?
[08:26:57] yep
[08:27:28] how is it configured?
[08:27:45] * joal learns about systemd in the meantime
[08:27:48] we ship the units in the package
[08:28:04] do you need to modify a parameter?
[08:28:19] I wondered how to look at the conf files from systemd for druid
[08:28:34] if you want to see the units you can use sudo systemctl cat druid-broker
[08:28:42] or whatever daemon you need
[08:29:10] Thanks elukey :)
[08:29:25] I am building the 0.11.0-3 deb version
[08:29:29] with jackson 2.2.3
[08:29:35] ok
[08:29:54] I think it's kinda shaky, given the warnings from druid people, but let's try
[08:31:08] yep I agree, I don't like it either
[08:35:24] elukey: again the problem with the hdfs path
[08:36:35] joal: any specific parameter that triggers it?
[08:36:46] elukey: I'm wondering - Can we try the default HDFS storage extension instead of the cdh one?
[08:37:17] elukey: I tried 2 things: indexation as if nothing had changed, and with hadoopCoordinates set
[08:37:32] Both failed with the hdfs error
[08:37:53] And, in the case of hadoopCoordinates set, some extra jackson errors
[08:40:17] we can try
[08:40:34] joal: where do you see the hdfs path error? Can you give me the command that you used?
[08:40:39] I keep seeing only jackson issues
[08:41:52] The HDFS error is actually Exception in retry loop
[08:42:21] elukey: sudo -u hdfs yarn logs --applicationId application_1524125463609_0749 --appOwner druid | grep -A 10 "Exception in retry loop"
[08:44:29] supa weird
[08:45:14] so modifying the storage extension
[08:54:25] -druid.extensions.loadList=["druid-datasketches","druid-hdfs-storage-cdh","druid-histogram","druid-lookups-cached-global","mysql-metadata-storage"]
[08:54:28] +druid.extensions.loadList=["druid-datasketches","druid-hdfs-storage","druid-histogram","druid-lookups-cached-global","mysql-metadata-storage"]
[08:54:31] joal: --^
[08:56:37] * elukey is curious and launches an indexation job
[08:57:31] 2018-05-10T08:57:23,668 INFO io.druid.indexing.worker.WorkerTaskMonitor: Job's finished. Completed [index_hadoop_webrequest_2018-05-10T08:56:21.057Z] with status [SUCCESS]
[08:57:35] wooooooooooooooooooooooooooooooooooooooooooooooow
[09:00:06] so it seems that druid-hdfs-storage-cdh was the culprit
[09:01:25] elukey: \o/ !!!! Was away for a few minutes
[09:01:56] elukey: have you tried a query against the indexed data?
[09:02:15] nope I was waiting for you :)
[09:02:30] so the script does this
[09:02:30] # druid-hdfs-storage is shipped with Hadoop 2.3.0 jars. We need a
[09:02:33] # new extension called druid-hdfs-storage-cdh which symlinks to
[09:02:35] # the installed cdh jars, but still includes the
[09:02:38] # druid-hdfs-storage .jar.
[09:02:40] # Install a script that will do this for us.
[09:03:05] ???
[09:03:09] Supa weird indeed
[09:04:16] elukey: data is loaded
[09:04:30] and this kind of follows the line of thinking that hadoop-client 2.7.3 caused the issue that we were seeing
[09:04:33] elukey: give me a minute, trying the streaming job
[09:04:47] elukey: I actually think it's the other way around
[09:05:02] joal: one issue is that we tested druid with hadoop-client: 2.6.0
[09:05:23] because 0.11.0-2 is deployed
[09:05:31] hm
[09:05:33] ok
[09:05:52] anyhow, I'd do this: 1) test streaming etc.. and make sure this works
[09:05:57] elukey: let me try the streaming - so that we'll have full knowledge for 0.11-2
[09:06:04] then we rollback to 0.11.0-1 and test it again
[09:06:08] :)
[09:06:08] +1
[09:07:16] be back in a sec!
[09:23:01] joal: let me know when you are ok and I'll deploy the older version
[09:25:03] elukey: still looking at RT - You can redeploy while I'm looking at logs
[09:28:15] joal: I'd need to run an errand in a bit, I'll re-deploy right after I am back, ok?
[09:28:26] elukey: np
[10:04:28] all right, starting the rollback joal
[10:04:36] is it ok or are you playing with druid?
[10:04:57] elukey: go ahead !
[10:07:03] ack!
[10:15:02] joal: rolled back and tried an indexation, it seems to work!
[10:15:36] elukey: let's do a full round: I'm gonna do an oozie test, then a streaming one
[10:16:03] yep sure
[10:16:10] but it looks good
[10:46:08] elukey: oozie worked, streaming failed
[10:46:19] * joal cries
[10:50:02] joal: _hug_
[10:53:53] * joal hugs fdans as well :)
[10:54:15] joal: lovely!
[10:54:15] * fdans hugging intensifies
[10:54:18] what failed?
[10:54:41] elukey: ERROR ClusteredBeam: Failed to update cluster state: druid:overlord/webrequest_live
[10:54:44] com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!druid:overlord, Dtab.base=[], Dtab.local=[]
[11:00:14] mmm joal is druid:overlord the service name that we use?
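Tranquility locates the overlord via service discovery in ZooKeeper, so when it complains that no hosts are available for disco!druid:overlord, one hedged way to see what is actually registered is to list the discovery znodes. The zkCli path, the ZooKeeper address and the /druid base path below are all assumptions:

    # list the service names druid has registered for discovery
    /usr/share/zookeeper/bin/zkCli.sh -server localhost:2181 ls /druid/discovery
    # a service announced as "druid/overlord" would show up here under a different
    # node name than the "druid:overlord" that tranquility is asking for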
[11:01:11] elukey: I have not changed that at all
[11:01:55] "In earlier versions of Druid, / characters in service names defined by druid.service would be replaced by : characters because these service names were used in Zookeeper paths. Druid 0.11.0 no longer performs these character replacements."
[11:02:10] elukey: But the name changed IIRC in 0.11 - Now, the value I give to tranquility is: "druid/overlord"
[11:03:23] sure but the error mentions druid:overlord
[11:03:26] no?
[11:03:29] It does !!
[11:05:06] same thing reported in https://groups.google.com/forum/#!msg/druid-user/ktxTPFhSV7g/vZm3XG4oAQAJ
[11:05:34] Didn't see that one elukey
[11:08:34] elukey: https://github.com/druid-io/tranquility/blob/master/core/src/main/scala/com/metamx/tranquility/druid/DruidEnvironment.scala#L28
[11:08:38] * joal cries harder
[11:09:28] loooool
[11:10:27] elukey: This confirms your feeling - Tranquility doesn't seem maintained well enough
[11:10:30] so yeah we are hitting bugs since tranquility has not been updated
[11:10:36] exactly :)
[11:10:56] IIRC they are the ones maintaining it right?
[11:12:19] elukey: I think the philosophy for streaming indexation has changed, starting v0.10
[11:12:28] elukey: see http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
[11:14:29] ah so we might try druid-kafka-indexing-service ?
[11:14:45] elukey: I think for RT we'll need to have streaming jobs converting data and writing it back to kafka, and druid running the KIS
[11:15:14] pfff - I spent a lot of time setting up tranquility ... Not super nice
[11:16:34] joal: for this particular use case we might try to just comment the line in tranquility and rebuild?
[11:16:44] We could elukey
[11:17:27] after that if it doesn't work we'll just give up
[11:17:43] but it would be nice to have tranquility work now while we study KIS
[11:18:22] it doesn't seem bad from my pov that Druid eventually will pull only from Kafka
[11:18:31] rather than leveraging the overlord's api
[11:18:44] buut I agree that it sucks due to the amount of time you put into it :(
[11:18:48] elukey: it'll actually do it under the hood elukey
[11:19:12] sure yes but it will be a supported extension that is shipped with druid etc..
[11:19:21] so they cannot break it or leave it behind
[11:19:54] the way KIS is set up (with my 2 minutes of reading) is basically setting up a supervisor on top of the overlord --> since everybody uses Kafka, no point in having a general supervisor called tranquility - A single one for kafka is enough :)
[11:20:08] yep
[11:20:46] but it seems that they are in a development phase summarized as "we don't care if you have prod use cases, 0.x is not stable so we break as much as we can"
[11:21:17] maybe not that dramatic but along those lines :D
[11:34:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Druid clusters to 0.11 - https://phabricator.wikimedia.org/T193712#4196911 (10elukey) Finally we figured out the issue, namely our own version of the `druid-hdfs-storage` extension. It was working fine with hadoop-client:2.3.0 but...
[11:35:44] joal: if you are ok I'd remove the hdfs:// prefix that we added in labs
[11:35:50] so we match 1:1 with production
[11:36:04] elukey: +1 - I'll retry indexation after (to be sure)
[11:40:16] -druid.storage.storageDirectory=hdfs://analytics-hadoop-labs/user/druid/deep-storage-analytics_test0
[11:40:19] +druid.storage.storageDirectory=/user/druid/deep-storage
[11:40:21] done!
[11:40:26] +restart?
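A relative storageDirectory like the one just set only resolves correctly if all the clients agree on the default filesystem; a quick sanity check on any Hadoop node, as a sketch:

    # show which filesystem a relative deep-storage path will resolve against
    hdfs getconf -confKey fs.defaultFS
    # and make sure the directory actually exists there
    hdfs dfs -ls /user/druid/deep-storage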
[11:40:38] yep (in labs we automatically restart on config change)
[11:40:48] Ohhhh - Didn't know :)
[11:40:54] Andrew's magic
[11:40:57] as always
[11:41:09] :)
[11:41:39] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4196915 (10fdans) A note about user agent strings identified as `Spider`. There don't seem to be breaking changes in bot metrics from the perspective of ua strin...
[11:41:51] so how do we want to proceed? I'd wait a bit before committing all the changes to the druid debs repo, just to know if we want to migrate to 0.11 now or later on
[11:42:29] elukey: currently testing indexation and trying to compile tranquility
[11:42:47] ah nice :)
[11:43:00] soo in the meantime I am going to do some restarts for the jvm 8 upgrades https://phabricator.wikimedia.org/T194268
[11:43:05] I was so missing that
[11:43:48] Yay !
[11:43:55] New JVM restarts !
[11:44:06] '/o\
[11:44:34] elukey: successful indexations with hadoop
[11:44:50] elukey: tranquility compiled, will try and test
[11:45:49] niceeee
[11:46:05] going to reimage two worker nodes now
[12:00:59] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4196947 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1030....
[12:37:59] 10Analytics, 10EventBus, 10Services (blocked), 10User-Elukey: Investigate group.initial.rebalance.delay.ms Kafka setting - https://phabricator.wikimedia.org/T189618#4197036 (10elukey) >>! In T189618#4194933, @Ottomata wrote: > Yeah, I think that sounds fine. +1
[12:42:41] taking a break team
[13:03:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4197098 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1031.eqiad.wmnet', 'analytics1030.eqiad.wmnet'] ``` an...
[13:08:24] !log disabled all camus jobs to drain the cluster and allow hive/oozie restarts for jvm upgrades
[13:08:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:32] Reimaging an1029 to debian
[13:20:44] after this one, we'll do the journal nodes one by one
[13:20:52] and then all the workers will be on Debian Stretch
[13:42:50] jumbo1001 restarted, after ~10 mins it automatically re-ran leader election
[13:43:48] great
[13:44:04] and also reimaging an1029
[13:44:06] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4197193 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1029....
[13:44:13] after that we'll have only the journal nodes to do
[13:44:23] elukey: I want to try re-enabling job queue (and cp) topic replication from eqiad -> jumbo
[13:44:28] just to try more traffic
[13:44:36] before we start relying on mm again
[13:44:41] elukey: oh YEAH
[13:44:43] that is exciting
[13:44:54] sure makes sense, let's do it after the rolling restart?
[13:45:02] sure
[13:45:41] so we'll see how mm behaves in two different situations (rolling restarts and more bursty traffic)
[14:09:04] Restarted oozie/hive on an1003
[14:11:05] !log re-enabled camus after analytics1003's maintenance
[14:11:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:16:10] interesting, kafka-jumbo1002 is still not in the partition leaders
[14:16:47] just issued a preferred-replica-election
[14:17:16] yea elukey I dunno what would make it trigger
[14:17:25] there are probably some heuristics in there somewhere
[14:19:56] gehel: o/ - not sure if you noticed but Andrew yesterday upgraded Kafka main eqiad to 1.1.0, and also upgraded mirror maker to the new version
[14:26:33] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Restart Analytics hosts for Java 8 Security upgrades - https://phabricator.wikimedia.org/T194268#4197319 (10elukey)
[14:27:28] (03CR) 10Mforns: [C: 032] Update pageview regexp to accept more chapters [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/432342 (https://phabricator.wikimedia.org/T194309) (owner: 10Joal)
[14:29:37] (03CR) 10Mforns: [C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432154 (https://phabricator.wikimedia.org/T194304) (owner: 10Milimetric)
[14:31:47] (03CR) 10Milimetric: [V: 032] Expand support for specifying dates in partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432154 (https://phabricator.wikimedia.org/T194304) (owner: 10Milimetric)
[14:32:22] k, I should test that a bit more with --dry-run before deploying, that function is used everywhere!
[14:35:22] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4197348 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1029.eqiad.wmnet'] ``` and were **ALL** successful.
[14:35:51] (03Merged) 10jenkins-bot: Update pageview regexp to accept more chapters [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/432342 (https://phabricator.wikimedia.org/T194309) (owner: 10Joal)
[14:40:27] ottomata: strange pattern of kafka bytes out after 3 kafka restarts - https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=19&fullscreen&orgId=1&from=now-3h&to=now
[14:40:39] it seems like some cgroup starts pulling like crazy?
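For reference, a hedged sketch of the preferred-replica election issued above, using the stock Kafka tooling (the ZooKeeper connect string is an assumption; WMF brokers also carry their own wrapper scripts):

    # ask the controller to hand partition leadership back to the preferred replicas;
    # without --path-to-json-file this runs for every topic-partition
    kafka-preferred-replica-election.sh --zookeeper "$ZOOKEEPER_URL"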
[14:41:20] or maybe just traffic between nodes due to the restarts, not sure
[14:42:59] elukey: it makes sense that bytes out would be huge after a restart
[14:43:38] the restarted node needs to read from each of the partition replica leaders to catch up
[14:43:55] with replication factor 3
[14:44:25] actually no, replication factor doesn't matter, sorry
[14:44:43] the restarted node just needs to read from all of the partition leaders for which it is a replica
[14:44:51] which will be spread around on all brokers
[14:45:16] sure sure, but I thought it wouldn't have doubled the traffic like that, anyhow all good
[14:45:38] (maybe I haven't noticed the pattern before)
[14:45:42] yeah it happens
[14:45:56] that's just the restarted node reading from the other brokers as fast as it can to catch back up
[15:01:03] heads up people, I am rolling-restarting a lot of datanodes for jvm upgrades
[15:01:09] batches of 2 every 2 minutes
[15:02:00] it will take a while but it should be transparent to every job (even if I expect some failures popping up)
[15:09:56] elukey: I wonder if you could build into the rolling restart script
[15:09:59] 10Analytics, 10EventBus, 10ORES, 10Reading-Infrastructure-Team-Backlog, and 3 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#4197451 (10Ottomata) Ok its about time to resurrect this! @Pchelolo, from at https://github.com/wikimedia/chang...
[15:09:59] nodemanager stop
[15:10:05] and wait til 0 yarn jvm procs
[15:10:11] maybe with a timeout of 10 mins or something
[15:16:49] roll restart?
[15:17:05] or the reimages?
[15:17:25] roll restart
[15:17:25] either
[15:17:27] right?
[15:17:33] that way you wait for jobs to drain?
[15:17:43] we could puppetize a script on the worker nodes
[15:17:50] hadoop-nice-shutdown
[15:17:56] stop yarn
[15:18:03] wait for 0 containers or timeout
[15:18:06] stop datanode
[15:18:11] ah no, so for roll restarts all the work is done by cumin: yarn supports seamless restarts (and eventually all the containers will be on the new jvm)
[15:18:21] OH
[15:18:22] right
[15:18:26] right we don't need to kill the containers
[15:18:28] oook right
[15:18:28] cool
[15:18:40] yep yep, but it is a good point for reboots
[15:18:45] in that case yes
[15:19:42] from what I know Riccardo has some wip script (derived from the one that they use for the mw switchover automation) that should eventually become a framework for everybody
[15:20:01] basically being able to register checks, actions, etc.. as part of a workflow for each node
[15:20:43] maybe it could be a good thing next quarter to work with him to have a testing prototype to use with the next round of reboots
[15:33:54] nice
[15:40:09] ottomata: all the worker nodes have up-to-date openjdk, I'd proceed with the reimage of one journal node at a time
[15:40:13] sounds ok?
[15:40:16] elukey: tranquility patched and working
[15:40:22] * elukey hugs joal
[15:40:24] yesssss
[15:40:46] so green light for the final packaging + upgrade?
[15:41:21] yessir - we need to discuss how we manage the tranquility-wmf package, but other than that, happy to move :)
[15:41:53] joal, can you help a poor troubled soul?
[15:42:06] ideally we could submit a pull request upstream
[15:42:22] mforns: if you're both poor and troubled, I'll definitely make an effort :)
[15:42:25] elukey: yeah!
[15:42:30] yea
[15:42:40] joal, bc?
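A rough bash sketch of the hadoop-nice-shutdown idea from the exchange above: stop the NodeManager, wait (with a timeout) for the YARN container JVMs to drain, then stop the DataNode. The service names match the CDH packages; the pgrep pattern is an assumption about how container JVMs show up in the process table:

    #!/bin/bash
    # hadoop-nice-shutdown: drain a worker node before stopping HDFS on it
    TIMEOUT=${1:-600}   # seconds, i.e. the "10 mins or something" above
    service hadoop-yarn-nodemanager stop
    elapsed=0
    # container JVMs carry their container id (container_<cluster-ts>_...) on the command line
    while pgrep -f 'container_[0-9]' > /dev/null; do
        if [ "$elapsed" -ge "$TIMEOUT" ]; then
            echo "timeout reached, stopping datanode anyway" >&2
            break
        fi
        sleep 10
        elapsed=$((elapsed + 10))
    done
    service hadoop-hdfs-datanode stop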
[15:42:42] sure
[15:45:08] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Restart Analytics hosts for Java 8 Security upgrades - https://phabricator.wikimedia.org/T194268#4197608 (10elukey)
[15:54:59] elukey: about pushing upstream, I'm a bit reluctant - the druid version tranquility is currently built against is 0.9.2 - There are a lot of diffs, and I'm not willing to implement them all!
[15:56:16] okok I thought it was only removing a single line or gating it
[15:56:47] elukey: that's all we need (more or less), but I'm not sure that'd be enough for a full release
[15:59:35] joal: maybe opening an issue upstream on github? Acceptable solution?
[16:00:01] elukey: For sure
[16:01:08] ping elukey joal fdans ottomata
[16:01:12] standdduppp
[16:02:11] ping ottomata
[16:05:49] (03CR) 10Milimetric: [V: 032] "ok, tested every job here https://github.com/wikimedia/puppet/blob/71269e29191e4e09b6d8faeb0cf8aabe59d94333/modules/profile/manifests/anal" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432154 (https://phabricator.wikimedia.org/T194304) (owner: 10Milimetric)
[16:31:19] PROBLEM - Number of segments reported as unavailable by the Druid Coordinators -Analytics cluster- on einsteinium is CRITICAL: 11 gt 10 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_analytics&var-druid_datasource=All
[16:32:42] 10 is definitely too low
[16:32:47] :)
[16:33:04] and also it needs a window of time
[16:33:08] will fix it after meetings
[16:35:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4197796 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1028....
[16:37:52] joal fyi getting closer with toree. got it to work, but not with hive metastore support for some reason
[16:37:55] it looks like it should work but not yet
[16:38:04] but at least I can use spark in a notebook and in yarn!
[16:38:51] milimetric: see https://github.com/Microsoft/PowerBI-visuals-tools/issues/81
[16:41:19] lol nuria_ https://github.com/Microsoft/PowerBI-visuals-tools/issues/81#issuecomment-294108852
[16:41:30] window.window.window?
[16:43:33] ottomata: This is super news (the toree one) !
[16:45:50] ping ottomata mforns
[16:46:49] RECOVERY - Number of segments reported as unavailable by the Druid Coordinators -Analytics cluster- on einsteinium is OK: (C)10 gt (W)5 gt 4 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_analytics&var-druid_datasource=All
[16:50:25] 10Analytics, 10EventBus, 10MediaWiki-General-or-Unknown, 10Beta-Cluster-reproducible, and 5 others: RevisionStore.php Could not determine title for page ID X and revision ID Y in EventBus::createRevisionAttrs - https://phabricator.wikimedia.org/T183505#4197896 (10CCicalese_WMF)
[16:53:11] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4197916 (10fdans) Added percentages to all absolute figures to improve clarity.
[16:57:02] 10Analytics, 10EventBus, 10ORES, 10Reading-Infrastructure-Team-Backlog, and 3 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#4197940 (10Pchelolo) @Ottomata, when we send the `revision-create` event to ORES, precache endpoint we get the s...
[17:11:03] (03PS1) 10AndyRussG: [PLS. DON'T MERGE] Make banner activity Druid ingress from EventLogging [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432405 (https://phabricator.wikimedia.org/T186048)
[17:13:13] (03CR) 10AndyRussG: [C: 04-2] [PLS. DON'T MERGE] Make banner activity Druid ingress from EventLogging [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432405 (https://phabricator.wikimedia.org/T186048) (owner: 10AndyRussG)
[17:26:24] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4198163 (10Tbayer) Super interesting findings, thanks @Fdans! CCing @chelsyx regarding the implications for iOS. To get closer to the objectives of this task as...
[18:03:15] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4198190 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1028.eqiad.wmnet'] ``` and were **ALL** successful.
[18:07:29] analytics1028 back up, it is catching up with the other two nodes
[18:13:37] * elukey off! (will check an1028 laterz)
[18:27:22] hey milimetric - Do you want us to delete those druid datasources?
[18:28:06] joal: yeah! I thought it'd be too late for you
[18:28:12] k, so I did what you said in this task...
[18:28:53] hm, can't find it
[18:29:46] ah yeah, joal: https://phabricator.wikimedia.org/T190409#4093828
[18:30:17] so I did curl -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/mediawiki-geowiki-monthly
[18:30:25] and the same with mediawiki-geoeditors-monthly
[18:30:31] that's just to disable
[18:30:49] then I did the script, /srv/deployment/analytics/refinery/bin/refinery-drop-druid-deep-storage-data -d 1 -v mediawiki-geowiki-monthly --no-datasource-check
[18:30:53] (for both)
[18:30:57] yeah, but they are still enabled as of now
[18:31:09] ok, so something didn't work with the curl
[18:31:19] do you have the tunnel set up?
[18:31:25] I just ssh-ed into druid
[18:31:28] druid1001
[18:31:37] should do
[18:31:46] can you try to disable one of the two?
[18:31:54] doing now, for geowiki-monthly
[18:32:07] the curl doesn't return anything, and it's very very quick
[18:32:21] but yeah, just did it
[18:32:22] still enabled
[18:32:32] ok, so something's wrong with that curl for just my user?
[18:32:55] milimetric: I almost never use curl - I always do a tunnel and disable it through the UI
[18:33:01] let me try
[18:33:16] hm, maybe it's not port 8081?
[18:33:22] it is
[18:34:22] milimetric: it's because of redirects I think
[18:34:35] oh, interesting
[18:35:13] add a -L to your curl milimetric, and let's see
[18:35:47] did, no output, still enabled
[18:36:09] oh, interesting, it says 404 when I run with -v
[18:36:29] interesting indeed
[18:37:02] milimetric: shall I try?
[18:37:19] I can disable it from the UI, but we should remove the curl command from the docs unless we can fix it
[18:37:28] I'm trying : > GET /druid/coordinator/v1/status HTTP/1.1
[18:37:35] curl -v -L -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/mediawiki-geowiki-monthly
[18:37:38] sorry wrong paste
[18:38:05] worked for me milimetric
[18:38:18] ...
[18:38:18] AFAICT it's disabled
[18:38:24] We should add a -L to the curl
[18:38:25] wtf
[18:38:32] didn't work for me with -L...
[18:38:57] copying your command exactly and trying it on geoeditors
[18:39:05] ack milimetric
[18:39:09] wtf, that worked
[18:39:20] it did !
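The working invocation from the exchange above, for the record — the -L matters because a coordinator that is not the current leader answers with a redirect, which curl won't follow on its own, and 8081 is the coordinator port (8082 is the broker, hence the earlier 404):

    # disable (not delete) a datasource via the coordinator API
    curl -L -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/mediawiki-geowiki-monthly
    # confirm it is gone from the list of enabled datasources
    curl -L http://localhost:8081/druid/coordinator/v1/datasources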
[18:39:26] Maybe you put the -L at the end?
[18:39:30] weird
[18:39:44] nope, this was my command before: curl -L -v -X DELETE http://localhost:8082/druid/coordinator/v1/datasources/mediawiki-geowiki-monthly
[18:39:50] aah
[18:39:58] wrong port
[18:39:59] sorry, I did 8082 when I did -L by accident
[18:40:00] k
[18:40:06] so -L should work then
[18:40:06] ok makes sense :)
[18:40:13] Let's add that to the docs
[18:40:13] thx joal
[18:40:15] np
[18:40:28] Now, let's go with deleting segments :)
[18:41:17] doing joal
[18:41:23] ack :)
[18:41:26] (updated docs too)
[18:42:00] k, done deleting both
[18:42:16] I saw milimetric - I wanted to do it, and it was already done :)
[18:42:30] sweet, k, all is well in deletion land
[18:42:35] hdfs dfs -du -s -h /user/druid/deep-storage/*
[18:42:39] 0 0 /user/druid/deep-storage/mediawiki-geoeditors-monthly
[18:42:39] 0 0 /user/druid/deep-storage/mediawiki-geowiki-monthly
[18:42:54] done milimetric :)
[18:43:13] thanks much joal
[18:43:14] And they're gone from the disabled ui
[18:43:17] np milimetric :)
[18:43:35] Thanks for the deploys and all yesterday!
[18:44:54] I think that's it for me today then :)
[18:45:10] ottomata, joal - I can see now 10.64.36.128:8485 (Written txid 1920086965 (151385 txns/7789958ms behind) (will try to re-sync on next segment))
[18:45:13] for analytics1028
[18:45:36] hm - I think I don't get it :(
[18:45:46] a lot of people seem to suggest that hdfs dfsadmin -rollEdits helps in these cases
[18:45:52] forcing the edit log to roll
[18:45:58] elukey: ah
[18:46:27] but I am wondering if it will get fixed after the hour
[18:46:34] elukey: is it not catching back up?
[18:47:32] it seems to still be lagging
[18:47:50] but every hour the namenode generates a new fsimage
[18:48:01] so possibly after that it will get fixed
[18:48:07] like in ~15 mins
[18:48:12] hm
[18:48:36] if journalnodes don't manage to catch up when restarted, that's not cool!
[18:49:19] well I think it is only saying "I am too far behind, will try to catch up when the next fsimage is created"
[18:49:27] Ah ok
[18:49:31] I think!
[18:51:02] and I can also see
[18:51:03] java.io.FileNotFoundException: /var/lib/hadoop/journal/analytics-hadoop/current/last-promised-epoch.tmp (No such file or directory)
[18:51:08] on an1028
[18:51:12] but at INFO level
[18:51:43] I'm not sure elukey, maybe the node missed the last snapshot and can't catch up?
[18:55:02] possibly, I think that it will try during the next one
[18:55:08] that should happen soon
[19:09:17] elukey: synchronized?
[19:10:30] nope still working on it
[19:12:25] I think that it complains about the data not being there in the first place
[19:12:42] in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Adding_a_new_JournalNode_to_a_running_HA_Hadoop_Cluster
[19:12:51] we mention the rsync of data from another node first
[19:13:29] from the error logs it seems indeed that it doesn't find the dir structure, and possibly that it will not do it by itself
[19:18:18] this is really sad
[19:18:43] it still doesn't catch up?
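The roll suggested above can be triggered on demand rather than waiting for the hourly checkpoint; a minimal sketch, run on a namenode as the hdfs superuser:

    # close the current edit-log segment and start a new one, giving the
    # lagging journalnode a fresh segment boundary to re-sync on
    sudo -u hdfs hdfs dfsadmin -rollEdits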
[19:20:57] the main issue is that it expects a dir tree
[19:21:00] I created /var/lib/hadoop/journal/analytics-hadoop
[19:21:09] and then it says
[19:21:09] org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /var/lib/hadoop/journal/analytics-hadoop not formatted
[19:21:17] because of missing files
[19:21:27] elukey: missing VERSION file, from https://stackoverflow.com/questions/29385067/how-to-recover-hdfs-journal-node
[19:21:48] elukey: can we try that?
[19:22:29] yeah but I'd prefer to rsync everything from another node
[19:22:34] Works for me
[19:27:09] so the rsync seems to not be working, maybe the server is not configured?
[19:31:39] elukey: any luck?
[19:31:53] ottomata: webrequest_partitioned? It works both ways, except that webrequest itself is already technically partitioned by year/month... etc.
[19:33:13] (03CR) 10Milimetric: Split webrequest into smaller datasets (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal)
[19:34:42] joal: seems better now, checking the logs
[19:34:54] elukey: have you managed to get rsync to work?
[19:35:04] copied over
[19:35:12] good old scp?
[19:35:29] Writing segment beginning at txid 1920287177. 10.64.36.128:8485 (Written txid 1920288055), 10.64.53.14:8485 (Written txid 1920288055), 10.64.5.15:8485 (Written txid 1920288055)
[19:35:35] :)
[19:35:49] yeah I need to learn and/or write down quick rsync hacks
[19:35:54] they are so useful
[19:36:05] elukey: I have never managed to get that skill
[19:36:17] It is indeed super useful - But I never managed to get it
[19:36:24] at the moment I am REALLY disappointed about journal nodes
[19:36:37] (03CR) 10Milimetric: [C: 032] "remember before deploying to update puppet jobs with the new format parameter" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/423235 (owner: 10Joal)
[19:39:34] (03CR) 10Milimetric: Add a config param for druid datasources (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387) (owner: 10Joal)
[19:39:51] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4198405 (10chelsyx) Thanks @Tbayer and @fdans ! This is very interesting!
[19:39:53] (03CR) 10Milimetric: [C: 032] Add a config param for druid datasources [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387) (owner: 10Joal)
[19:41:18] joal: will keep watching metrics for a bit, thanks for the support :)
[19:41:38] elukey: I did nothing ... Thank you for the care, as always :)
[19:41:55] Gone for tonight then - Thanks milimetric for all the reviews :)
[19:42:28] nite joal
[19:46:56] When I do a query in hadoop, and I have something like WHERE user_agent = "Foo/1.0 (some stuff; more stuff)" I get an error: Error: Error while compiling statement: FAILED: ParseException line 1:75 cannot recognize input near '(' 'compatible' '' in expression specification (state=42000,code=40000)
[19:47:15] What am I doing wrong here. Do I need to escape semi-colons in strings for some reason?
[19:50:00] hm, bawolff never encountered that
[19:50:03] when you say hadoop
[19:50:05] do you mean Hive?
[19:50:08] I mean beeline
[19:50:10] ok
[19:50:16] maybe....single quotes? just guessing :p
[19:50:37] (03CR) 10Milimetric: [C: 032] "checks out, nice" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/427948 (https://phabricator.wikimedia.org/T192482) (owner: 10Joal)
[19:52:03] Still get Error: Error while compiling statement: FAILED: ParseException line 1:64 cannot recognize input near '(' 'some' 'stuff' in expression specification (state=42000,code=40000)
[19:52:18] The exact query I'm typing is: select dt, uri_path from webrequest where user_agent = 'Foo/1.0 (some stuff; more stuff)' limit 4;
[19:54:06] bawolff: yeah, it's some kind of crazy bug, it's tripping on the semicolon ;
[19:54:10] it shouldn't... but it is... one sec
[19:54:33] the semicolon INSIDE the quotes
[19:54:44] bawolff: hoping you have some partition clause on there too? :)
[19:55:09] if that is your exact query, it's going to query a LOT of data
[19:55:16] bawolff: just escape the semicolon with a \
[19:55:21] so select dt, uri_path from webrequest where user_agent = 'Foo/1.0 (some stuff\; more stuff)' limit 4;
[19:55:33] right, that wasn't my exact query, I was simplifying it to demonstrate the bug, and also removing the private data I didn't want to paste into irc :)
[19:55:39] ok phew just checking
[19:55:39] :)
[19:57:46] bawolff: did you see my silly magic fix?
[19:58:52] I was just about to say it wasn't working, and then I realized I typed it out wrong
[19:58:58] Thank you, that seems to do it
[19:59:13] I honestly should have thought of backslash escaping myself :)
[19:59:33] nah, maybe for weird characters like tabs or something, semicolons should definitely be fair game
[19:59:47] bad parser
[20:21:25] milimetric: is there any reason why we might want turnilo/pivot to be able to use both of our druid clusters?
[20:21:31] or will we always want it on just the analytics one?
[20:22:40] ottomata: eh, I don't see any reason we need it on the public one right _now_, but NEVER?! That's too long to commit to
[20:22:54] :)
[20:24:39] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3310516 (10Neil_P._Quinn_WMF) a:03Neil_P._Quinn_WMF
[20:26:27] haha ok
[20:26:29] well not soon
[20:26:32] we can change later then
[20:26:35] will keep it simpler
[20:52:18] 10Analytics, 10Analytics-Wikistats, 10ORES, 10Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479#4198576 (10Milimetric) @awight. Those look great and in my opinion they fit right in for the most part. We organize metrics into three groups: content:...
[20:55:07] 10Analytics, 10Analytics-Kanban: Issues with page deleted dates on data lake - https://phabricator.wikimedia.org/T190434#4198588 (10Milimetric) (just in case we've been so absent minded as to not say this via some other channel:) @Nettrom this is awesome work, thank you so much for finding it and surfacing the...
[20:58:03] 10Analytics, 10Analytics-Kanban: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4198604 (10Ottomata)
[20:58:09] 10Analytics, 10Analytics-Kanban: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4198614 (10Ottomata)
[20:59:17] 10Analytics, 10Analytics-Kanban, 10Pageviews-API, 10Patch-For-Review: Add nyc.wikimedia to pageviews whitelist - https://phabricator.wikimedia.org/T194309#4195274 (10Milimetric) p:05Triage>03Normal
[21:00:02] 10Analytics, 10Analytics-Kanban: Wikistats. Bug on title "wikistats 2" is not shown - https://phabricator.wikimedia.org/T194224#4198627 (10Milimetric) p:05Triage>03High
[21:02:10] 10Analytics, 10Analytics-Wikistats: Pixel ratio messed up on Windows Chrome - https://phabricator.wikimedia.org/T194428#4198635 (10Milimetric)
[21:04:36] 10Analytics, 10Analytics-Wikistats: Hover infobox should adjust date formatting to granularity displayed - https://phabricator.wikimedia.org/T194430#4198663 (10Milimetric)
[21:04:48] 10Analytics, 10Analytics-Wikistats: Pixel ratio messed up on Windows Chrome - https://phabricator.wikimedia.org/T194428#4198674 (10Milimetric) a:05Milimetric>03None
[21:06:22] 10Analytics, 10Analytics-Wikistats: Bar chart changes height when toggling splits - https://phabricator.wikimedia.org/T194431#4198679 (10Milimetric)
[21:13:16] 10Analytics, 10Analytics-Wikistats: We send system user create events to history_reduced - https://phabricator.wikimedia.org/T194432#4198698 (10Milimetric)
[21:17:04] 10Analytics, 10Analytics-Wikistats: We send system user create events to history_reduced - https://phabricator.wikimedia.org/T194432#4198712 (10Milimetric) 05Open>03Invalid never mind, we do, but it's a little more complicated: https://github.com/wikimedia/analytics-aqs/blob/master/sys/mediawiki-history-me...
[21:30:09] 10Analytics, 10Analytics-Wikistats: Pixel ratio messed up on Windows Chrome - https://phabricator.wikimedia.org/T194428#4198725 (10Milimetric) p:05Triage>03High
[21:30:11] 10Analytics, 10Analytics-Wikistats: Hover infobox should adjust date formatting to granularity displayed - https://phabricator.wikimedia.org/T194430#4198727 (10Milimetric) p:05Triage>03High
[21:30:13] 10Analytics, 10Analytics-Wikistats: Bar chart changes height when toggling splits - https://phabricator.wikimedia.org/T194431#4198729 (10Milimetric) p:05Triage>03High
[23:37:34] Anyone from analytics online and here?
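Pulling together the partition advice from the top of this log and the escaping workaround from the beeline exchange, a sketch of what a better-behaved version of that query might look like; the wmf database name and the partition values here are assumptions for illustration:

    # run a single statement through beeline; the \; keeps the statement splitter
    # from cutting the user_agent literal in half, and the partition predicates
    # keep the scan down to one hour of webrequest data
    beeline -e "select dt, uri_path from wmf.webrequest
      where user_agent = 'Foo/1.0 (some stuff\; more stuff)'
        and webrequest_source = 'text'
        and year = 2018 and month = 5 and day = 10 and hour = 8
      limit 4;"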