[08:05:11] Morning team! [08:16:29] Analytics-Kanban, EventBus, Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3896185 (Xqt) I have requests 2.7.0 for my production environment; I get no response from sseclient because `next(self.resp_iterator)` is empty there. For the developm... [08:36:29] (PS1) Joal: Update pageviews top and by-country response def [analytics/aqs] - https://gerrit.wikimedia.org/r/403890 (https://phabricator.wikimedia.org/T184541) [08:36:46] Analytics-Kanban, EventBus, Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3896200 (Xqt) >>! In T184713#3893484, @Ottomata wrote: > Ah, I did deploy EventStreams yesterday for T171011. I don't know exactly what caused this change, but I thin... [08:37:02] Analytics-Kanban, RESTBase-API, Patch-For-Review, Services (watching): Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3896201 (JAllemandou) Also submitted a PR to restbase: https://github.com/wikimedia/restbase/pull/941 [08:37:15] Analytics-Kanban, RESTBase-API, Patch-For-Review, Services (watching): Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3896202 (JAllemandou) a: JAllemandou [08:54:59] (PS1) Joal: Add script for webrequest dataloss flase-positives [analytics/refinery] - https://gerrit.wikimedia.org/r/403891 [09:00:48] hola [09:07:45] Hi :) [09:07:47] !log reboot analytics1063->65 for kernel updates [09:07:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:08:02] so joal the labs cluster should be running java 8 :) [09:08:13] elukey: it refines data :) [09:08:21] elukey: slowly, but surely [09:08:42] for some weird reason that I don't know (probably a horrible bash inclusion) hive/spark/etc..
clients are all using java 8 [09:09:05] I straced hive (client) and it indeed reads hadoop-env.sh [09:09:13] where JAVA_HOME is set [09:09:39] I'll try to figure out if this is expected or coincidence, buuut in the meantime we can do our tests [09:09:53] rollback will be simple: set JAVA_HOME in puppet, and that's it [09:10:05] Awesome elukey :) [09:10:16] https://gerrit.wikimedia.org/r/#/c/403701/ if you want to check [09:10:24] (no-op for the moment) [09:12:09] elukey: if you have a minute while following your reboots: https://gerrit.wikimedia.org/r/403891 [09:12:52] elukey: It's a copy of the one we had in the oncall page, but the page has now been updated: https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FTeam%2FOncall&type=revision&diff=1780311&oldid=1780238 [09:13:27] (CR) Elukey: [C: 1] "A great +1 :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/403891 (owner: Joal) [09:14:02] Yay ! [09:14:16] (CR) Joal: [V: 2 C: 2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/403891 (owner: Joal) [09:25:22] so joal, as far as I've understood, it seems that the labs cluster is working well with java 8, right? [09:25:59] elukey: triple checking spark now, but YES! [09:26:30] very nice [09:28:14] elukey: looks like spark is not connected to hive, but seems more a problem of the cluster than java8 [09:29:48] elukey: 18/01/12 09:27:35 WARN ObjectStore: Version information not found in metastore.
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 [09:30:11] mmmm [09:30:14] yeah seems so [09:30:21] elukey: but reading parquet works super nice [09:32:11] elukey: here are the requests sent by ottomata to our test cluster: df.groupBy("uri_path").count().collect() [09:32:14] res6: Array[org.apache.spark.sql.Row] = Array([/frog,410], [/halibut,360], [/apple,425], [/banana,396], [/donkey,369], [/,373], [/emu,384], [/cricket,368], [/giraffe,433]) [09:32:18] :D [09:32:31] hahahah [09:32:57] elukey: spark2 tested - testing spark-1 [09:33:42] elukey: spark2 runs with java8, but it looks like spark1 runs with j7 [09:33:50] Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_151) [09:36:14] Ohhh - Interesting - spark1 works well with hive but runs with j7, spark2 doesn't work well with hive and runs j8 [09:40:04] so spark-shell vs spark2-shell right? [09:40:23] correct elukey [09:42:24] joal: what do you mean that spark1 works well with hive but not spark2? [09:42:46] spark.sql (in spark2), sqlContext.sql (in spark1) [09:43:03] I can make a query against the hive metastore in s1, while it fails in s2 [09:44:34] any specific error? [09:45:05] I have no idea if they source any config file for JAVA_HOME [09:45:27] WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 [09:45:31] in s2 [09:46:20] wait I get the following for spark 1 [09:46:21] 18/01/12 09:41:21 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0 [09:46:25] but nothing for spark2 [09:46:35] on hadoop-worker-2 [09:46:35] This was for s2 [09:47:11] ?
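[Editor's note] The ObjectStore warning quoted above is Hive reporting that schema verification is off, so it silently records a schema version instead of failing. A hive-site.xml fragment that would turn this into a hard check (illustrative: this is the standard Hive property, not necessarily how the labs cluster is configured):

```xml
<!-- hive-site.xml: with verification enabled, a missing or mismatched
     metastore schema version becomes an error instead of the WARN above. -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>
```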
[09:47:15] Weird [09:47:16] yeah [09:48:11] on hadoop-worker-1, I launched spark-shell --master yarn [09:48:24] it tells me: Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_151) [09:49:25] ah no I haven't used --master yarn [09:49:25] Which means spark1 is still using j7 [09:49:38] so this might be the issue, let me retry [09:49:49] elukey: maybe, but I don't really think so [09:50:07] no I mean for the inconsistency in our results, they are flipped [09:50:19] anyhow, for spark1 it is a matter of setting JAVA_HOME properly [09:50:21] elukey: :( [09:51:30] just tested export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre before spark-shell, it uses j8 as expected [09:51:45] hive for some reason sources hadoop-env.sh, and in there j8 is set [09:52:28] this is not a huge deal though, java 7 will eventually be removed from the cluster [09:52:43] so any client will pick up j8 [09:52:58] testing spark/hive connection with j8 and s1 [09:53:19] elukey: Got the same message you had about the metastore connection issue when using j8 for spark1 [09:53:33] so the metastore connection seems related to java version [09:54:57] Actually the warning message didn't prevent a request from being successful - spark1 successfully connected to the hive metastore to run a query [09:54:58] what user do you use for spark-shell? Yours?
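[Editor's note] The per-session override tested above, written out as a sketch. The JVM path is the one quoted in the chat and will differ per host; the spark-shell launch is left as a comment so the snippet stands alone:

```shell
# Point clients at Java 8 for this session only; hadoop-env.sh (via
# puppet) remains the durable place to set this, and the rollback knob.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
echo "clients launched from this shell will use: $JAVA_HOME/bin/java"
# spark-shell --master yarn   # would now report Java 1.8 in its banner
```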
[09:55:01] using java8 [09:55:06] yessir [09:55:23] because I get some access denied exceptions, mmm [09:56:12] Permission denied: user=elukey, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x [09:56:42] elukey: you haven't created your home folder in hdfs /user/elukey with rights to yourself [09:56:50] yep doing it :D [09:57:08] elukey: Home, sweet home :) [09:58:05] elukey: At least I get coherent results: spark2 with either j7 or j8 doesn't want to connect to the hive metastore (no error at launch, but error when trying a query) [09:58:26] elukey: while it works in spark1, with either j7 or j8 [10:00:01] I was expecting some weirdness joal, it was too easy :D [10:00:08] hehe :D [10:00:24] thing to ponder: I don't know if this was an issue before j8 or not [10:04:30] Analytics-Kanban, Analytics-Wikistats: Beta Release: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#3676320 (JAllemandou) Plenty of possible different ways here. Listing the two that make the most sense to me: - Oozie style: add steps to the oozie mediawiki-reduced... [10:04:49] Analytics-Kanban, Analytics-Wikistats: Beta Release: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#3896289 (JAllemandou) a: JAllemandou [10:06:47] ok joal now I am confused :D [10:07:58] elukey: I have the feeling it's a metastore-version thing, I think [10:09:39] joal: what happens now on the prod cluster then? [10:09:49] does spark2-shell fail?
[10:09:53] elukey: it works [10:10:02] eqi stat21004 [10:10:04] oops [10:11:38] so you are saying that this was an issue on the labs cluster before java 8 [10:11:46] ok now I got it [10:12:10] elukey: I'm saying I actually don't know if it was an issue before j8 - I didn't test spark before :( [10:12:30] yep yep I didn't get the part of "context==labs" [10:12:32] :) [10:12:40] Arf sorry - should be more explicit :) [10:12:52] nono I need coffee, grabbing some :) [10:15:10] (no coffee in the co-working, nuuuuooooo) [10:15:37] :( [10:33:32] !log reboot analytics1066->69 for kernel updates [10:33:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:33:39] finally the last one :D [10:33:42] *ones [10:37:58] the last ones of the hadoop cluster :-) [10:47:01] moritzm: comee oooonnn let me enjoy these temporary moments of happiness :D [10:47:29] :D [10:52:57] ok :-) [10:57:20] ok all the nodes (except an1003) have the new kernel [10:57:30] I need to schedule maintenance for all the stat boxes and an1003 [10:59:00] thanks. stat* boxes are already upgraded, BTW [10:59:09] super [10:59:35] moritzm: how about druid nodes and kafka[12]00[123] ? [11:00:11] and also eventlog1001 (but it runs trusty so not sure if the kernel is ready) [11:01:11] druid*, kafka[12]00[123], aqs* and eventlog all have the fixed kernels installed (the one for trusty has been released now) [11:01:24] very nice! [11:01:25] they messed up their 4.4 builds, but that doesn't apply to trusty [11:02:34] I'm grinding through some other clusters, but can also help with other analytics reboots on Monday/Tuesday [11:04:58] I'll let you know if I need help, but it should be ok.. thanks! [11:11:12] ok! [11:37:28] Hey elukey - Where can I find the network.pp file (I'd like to update IP addresses in our refinery-source codebase) [11:39:26] joal: I may need a bit more info.. what IP do you need to update and where in puppet?
(sorry to ask but I don't have a lot of context) [11:40:36] elukey: we reference internal IPs here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IpUtil.java [11:40:41] elukey: I'd like to update them [11:41:15] elukey: I found puppet:/manifest/realm.pp -- But it seems to contain only ipv4 values, no labs nor v6 ones [11:46:40] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3896456 (elukey) [11:48:12] Analytics-Kanban, Patch-For-Review, User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3659718 (elukey) I opened https://phabricator.wikimedia.org/T184794 to track down and fix Oozie/Hive bugs, I am inclined to close this task since: 1)... [11:49:53] mmmm I don't really like this file, we might need to find a better solution [11:50:07] elukey: I'd love to [11:51:36] so network::constants has moved but is still in puppet, a bit different from the version that we use though [11:52:36] https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp [11:55:02] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3896474 (elukey) a: elukey [11:56:50] Bazinga elukey!!
Many thanks :) [11:57:33] joal: one possible solution to the issue would be for puppet to create a file with the ips that you need [11:57:41] and then that class would pick them up [11:57:53] elukey: that'd be super awesome [11:58:39] elukey: I'd need two files: one with our external IPs (v4 and v6), one with our labs-internal IPs [11:59:35] Analytics-Kanban, User-Elukey: Move AQS Cassandra daemons to use the Prometheus JMX agent - https://phabricator.wikimedia.org/T184795#3896483 (elukey) [12:00:12] Analytics-Kanban, User-Elukey: Add the prometheus jmx agent to AQS Cassandra - https://phabricator.wikimedia.org/T184795#3896483 (elukey) [12:34:53] (CR) Mforns: Add core class and job to import EL hive tables to Druid (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns) [12:57:50] Analytics-Kanban: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3896604 (JAllemandou) [12:58:15] Analytics-Kanban: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3349143 (JAllemandou) a: JAllemandou [12:58:33] (PS1) Joal: Refactor geo-coding function and add ISP [analytics/refinery/source] - https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) [13:02:50] !log Rerun webrequest-load-wf-upload-2018-1-12-9 [13:02:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:03:34] !log Rerun webrequest-load-wf-text-2018-1-12-9 [13:03:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:21:38] thanks! [13:43:10] joal: fine if I reboot druid1004? [13:43:22] depooling it first from pybal/lvs [14:01:05] * elukey coffee!
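[Editor's note] A minimal sketch of the puppet-generated-file idea discussed above: IpUtil.java hardcodes ranges today, but if puppet wrote them out, a consumer would only need CIDR parsing and membership checks. The file format is an assumption and the ranges are inlined (and illustrative) so the example is self-contained; Python's stdlib `ipaddress` is used for brevity, though refinery-source itself is Java:

```python
import ipaddress

# Hypothetical contents of a puppet-managed file, one CIDR per line.
RAW_RANGES = """
10.0.0.0/8
2620:0:860::/46
"""

NETWORKS = [ipaddress.ip_network(line) for line in RAW_RANGES.split() if line]

def is_internal(ip_str):
    """True if the address falls inside any configured range.
    Mixed v4/v6 comparisons are safe: membership is simply False."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in NETWORKS)
```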
[14:25:31] grrrrrrrrr [14:25:40] something's still wrong with this interlanguage job [14:25:47] it's still not picking up all the data [14:25:51] * milimetric hates oozie [14:26:05] (CR) Fdans: [V: 2 C: 2] "Looks good to me!" (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/403890 (https://phabricator.wikimedia.org/T184541) (owner: Joal) [14:39:49] Analytics-Kanban: Sqoop cu_changes table for geowiki - https://phabricator.wikimedia.org/T184759#3896816 (Milimetric) [14:40:04] anyone wanna brain-bounce on this oozie problem? [14:40:10] I don't get it [14:40:24] I am not sure I'd be of any help :( [14:42:38] milimetric, I can try, give me 5 mins to change rooms [14:45:02] joal: testing a query to get iso code and country name in hive :) [14:45:35] milimetric: I'll go to grab Lino soon, I'll have time post-standup if not yet fixed [14:45:45] thx joal [14:46:03] np [14:46:08] I'll look at it with mforns [15:05:10] so spark2 seems the only thing that we are not able to run in the labs cluster, the rest works fine with java 8 [15:05:32] (spark2 gets weird also with java 7 in labs so something is probably wrong in there) [15:14:38] mforns: nuria_ joal ohhh damn, I was under the impression that projectview_hourly stored ISO numbers (like Spain => 724) [15:14:54] but it stores alpha codes like Spain => ES [15:15:03] so it's human readable anyway [15:15:12] no need to include full country names [15:15:42] Analytics-Kanban, Patch-For-Review, User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3896880 (elukey) [15:19:48] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3896893 (elukey) In https://oozie.apache.org/docs/4.1.0/AG_Install.html -> `Advanced/Custom Environment Settings` I can't see any CATALINA_OPTS listed...
[15:25:32] nuria_: this was the trick I forgot yesterday: SET hive.mapred.mode = nonstrict; [15:25:37] (to make repair work) [15:25:40] (it works fine after that) [15:46:48] milimetric: nice [15:55:39] fdans: any disagreements on https://gerrit.wikimedia.org/r/#/c/402466/? [15:55:56] did you already do the deploy yesterday without it? [15:59:16] (CR) Faidon Liambotis: "Thank you *so* much for doing this! I don't have anything valuable to contribute, other than nitpicking: MaxMind capitalizes both Ms in th" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) (owner: Joal) [16:02:44] 👋 I'd like to upload some data to https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/. Can anyone give directions on how to do so? [16:15:01] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3896976 (elukey) Finally found the root cause. Each time that oozied.sh does start/stop from the init.d script it starts with a clean environment. Th... [16:15:26] milimetric: I hate oozie too --^ [16:15:29] :D [16:17:02] to be fair, your reasons are much more legitimate, elukey :) [16:17:32] those bash scripts are... [16:17:47] .... [16:17:47] ... [16:33:26] (CR) Nuria: Refactor geo-coding function and add ISP (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) (owner: Joal) [16:54:36] (PS1) Milimetric: Correct the column order [analytics/refinery] - https://gerrit.wikimedia.org/r/403946 [16:54:47] (CR) Milimetric: [V: 2 C: 2] Correct the column order [analytics/refinery] - https://gerrit.wikimedia.org/r/403946 (owner: Milimetric) [17:00:02] a-team I'll be a few minutes late to standup, sorry [17:01:32] ping ottomata[m] standup today?
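[Editor's note] The nonstrict trick from the top of this exchange, spelled out as a sketch. Assumptions: the "repair" being unblocked is Hive's partition repair (MSCK REPAIR TABLE), and the table name is purely illustrative:

```sql
-- Strict mode rejects certain partition-related operations and queries;
-- relaxing it for the session lets the repair run, as noted in the chat.
SET hive.mapred.mode = nonstrict;
MSCK REPAIR TABLE wmf.projectview_hourly;  -- illustrative table name
```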
[17:07:58] Analytics-Kanban: Sqoop cu_changes table for geowiki - https://phabricator.wikimedia.org/T184759#3897156 (Nuria) [17:35:44] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3897276 (elukey) The problem seems to be in the oozie debian package itself: ``` elukey@hadoop-coordinator-1:~/oozie-4.1.0+cdh5.10.0+389/debian$ grep... [17:40:36] Analytics-Kanban: Sqoop cu_changes table for geowiki - https://phabricator.wikimedia.org/T184759#3897285 (Milimetric) Some thoughts from post-standup: * snapshot partition name doesn't apply to this use case, change it to like temporary or something like that * sqoop only one month of data * after processin... [17:41:14] gonna go eat lunch, bbl [17:50:51] mforns: there? [17:58:35] Analytics-Kanban, Analytics-Wikistats: Beta Release: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#3897355 (Nuria) +1 to @mforns comment Let's talk about this on our next tasking meeting. I think the best option is the 1st one, so we test validity of data close... [18:00:35] Analytics-Kanban, User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3897369 (elukey) Tried to open https://community.cloudera.com/t5/CDH-Manual-Installation/Oozie-duplicates-CATALINA-OPTS-variables-in-oozie-env-sh/m-p/6... [18:02:24] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Explain difference in number of repositories when trying to manually exclude imported third party repositories - https://phabricator.wikimedia.org/T184420#3897372 (Aklapper) Uhm, maybe my mind played a trick: What if those repos had no... [18:02:49] * elukey off! [18:04:08] nuria_, yes! 'sup?
[18:04:56] mforns: for the EL to druid i am just going to look at it a bit more and maybe separate the indexation code from the rest so it can be used for other spark classes? [18:05:08] mforns: does that seem OK? [18:05:20] nuria_, the indexation code is already separate no? [18:05:53] there is the generic DataFrameToDruid module, [18:06:05] that can be used by any client [18:06:33] no? what do you mean otherwise? [18:06:43] mforns: ah i see, the idea was that it's a base class to be used by all spark jobs? [18:06:49] yes [18:07:16] and the EventLoggingToDruid is the specific one, that handles the EL case [18:07:54] and passes a DataFrame to DataFrameToDruid [18:08:31] when I started this task, Andrew and I discussed and decided on this architecture [18:10:42] ok i see [18:10:57] the bigger part of EventLoggingToDruid is parameter parsing [18:11:11] and also formatting the specific EL case into something generic, meaning: [18:11:48] - identifying dimensions and metrics, given a schema convention [18:12:07] - specifying which EL standard fields are to be blacklisted [18:12:20] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Explain difference in number of repositories when trying to manually exclude imported third party repositories - https://phabricator.wikimedia.org/T184420#3897398 (Aklapper) p: High>Low This. There is still something fishy, but...
[18:12:39] - and flattening the EventCapsule and other nested fields [18:17:36] nuria_, although flattening is a pretty much generic thing, that could be included in the core DataFrameToDruid, I decided to move it out into EventLoggingToDruid, because blacklisting and metric/dimension designation are highly coupled with flattening, and those need to happen in the specific EventLoggingToDruid [18:32:21] ok will look at it a bit more to see if i have any useful suggestions, i moved the template to a resource file on my last patch but that was a 2 liner [18:33:28] yes, template in another file makes sense [19:27:10] nuria_: Before starting to change, I double checked MaxMind database sizes - City is 130M, Country is 3.5M - I think this is the reason why we originally chose to provide the country out of that specific database [19:31:51] Analytics-Kanban, Analytics-Wikistats: Beta Release: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#3897645 (JAllemandou) >>! In T177965#3897355, @Nuria wrote: > I think warming up of cache should happen after in the AQS deployment step of this data. Given we p... [19:57:45] (PS2) Joal: Refactor geo-coding function and add ISP [analytics/refinery/source] - https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) [20:00:55] (CR) Joal: "@Nuria: I on purpose kept the MaxMindCountryCode class, allowing to get country with a way smaller amount of data loaded than if using Max" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) (owner: Joal) [22:45:48] Hey, folks. Quick question: https://www.mediawiki.org/wiki/Extension:EventLogging/Guide seems to suggest that arrays are not valid data types. Is that true?
[22:58:19] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3898072 (10demon) p:05Triage>03Normal