[03:39:18] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-Vagrant, 13Patch-For-Review, 15User-bd808: Replace upstart with systemd unit in eventlogging::devserver and eventlogging::service - https://phabricator.wikimedia.org/T154265#2988769 (10bd808) 05Open>03Resolved [06:50:32] 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Add autoincrement id to EventLogging MySQL tables. {oryx} - https://phabricator.wikimedia.org/T125135#2989010 (10Marostegui) >>! In T125135#2987577, @Ottomata wrote: >> This is key, is that somehow doable from the application side? It... [07:08:40] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989026 (10Marostegui) >>! In T124307#2987587, @Ottomata wrote: > @Marostegui ok! So the T125135 auto-increment thing is a very small piece of this larger issue. > > Let's see if we can... [07:47:12] morning! [07:47:17] aqs1008-a is bootstrapping :) [08:51:51] hi team! joal yt? the monthly job is still generating the data, so... I guess better to deploy without it [08:52:15] will come back in a bit to recheck [09:45:48] Thanks mforns_away for checking this :) [09:46:01] elukey: yay, mar bootstapz [10:03:56] elukey: quick question: have you run cleanup on aqs1007-b? [10:05:39] nope, it didn't need it [10:05:57] elukey: yeaaah, in theory it didn't :-P [10:06:00] the only weird thing that I am seeing is https://grafana.wikimedia.org/dashboard/db/aqs-elukey?panelId=14&fullscreen&from=now-24h&to=now [10:06:17] well we can run it anytime [10:06:32] so P99 spikes in latency [10:06:37] Do you mind going for it (as you said, it should have no effect) [10:07:19] hm [10:08:19] joal: done, finished in 20 secs [10:08:24] thanks mate :) [10:08:42] elukey: it makes me feel better ;) [10:09:04] the p99 spikes are weird indeed [10:09:11] it is not a huge deal [10:09:56] elukey: when looking at the last 30 days, it doesn't seem to be a real trend yet [10:10:05] let's keep it in mind though [10:10:26] yeah I was about to say that [10:10:36] anyhow, better to double check to be sure :) [10:10:41] for sure :) [10:11:02] elukey: deploy? [10:12:53] sure! Let me check the docs, didn't have time up to now [10:13:16] no prob elukey - the idea would be: you do it by the docs, I'm here in case :) [10:14:56] first question joal - do we need to deploy refinery source? [10:15:26] elukey: good question! tickets in "Ready to Deploy" should tell you that :) [10:16:15] this is a good point [10:16:20] elukey: T156629 has a patch on refinery source, so yes, we should :) [10:16:20] T156629: Better explanation on pageview definition for edit actions - https://phabricator.wikimedia.org/T156629 [10:22:25] ok I am already a bit confused [10:22:26] :D [10:22:47] I thought that refinery 0.40 was built by Madhu's automation in Jenkins (also checked Gerrit) [10:22:56] but then in the wiki I read [10:22:56] If you need to trigger a build (not a release) you can do that here: https://integration.wikimedia.org/ci/job/analytics-refinery-release/build?delay=0sec [10:23:25] and then "Maven project analytics-refinery-release" [10:23:26] :D [10:23:32] Ah elukey: so there is no explanation in the docs of why triggering a build could be needed?
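A rough sketch of the "trigger a build (not a release)" step being discussed here: Jenkins jobs can be started remotely with an authenticated POST to the build URL quoted above. The user/token pair below is a placeholder, and depending on the Jenkins configuration a CSRF crumb may also be required:

    # Kick off the analytics-refinery-release *build* job so Jenkins re-reads
    # the POMs and prefills the next release version; it can be killed a few
    # seconds after it starts, since only the prefill matters.
    curl -X POST --user 'jenkins-user:API_TOKEN' \
      'https://integration.wikimedia.org/ci/job/analytics-refinery-release/build?delay=0sec'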
[10:24:35] not a lot - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Deploy/Refinery-source#How_to_deploy_with_Jenkins_.28and_related_steps.29 [10:24:40] but maybe I am missing something [10:26:42] elukey: The reason to trigger a build before the release is for jenkins to automatically pick up the correct release versions [10:27:00] elukey: To correctly prefill values in some release page [10:27:21] elukey: It's needed only if, when preparing a release job, the prefilled values are incorrect [10:28:33] ahh ok got it [10:29:42] elukey: docs are not yet precise enough :) [10:30:00] I am adding info :) [10:30:12] Thanks elukey ! [10:32:34] ah yes I was going to ask about the changelog since https://github.com/wikimedia/analytics-refinery-source/commits/master shows periodic version bumps in the changelog [10:32:45] but it is buried in "If the maven release job failed (step 3)" [10:33:02] that makes sense but it might be better to add some preconditions [10:33:21] elukey: please, please, please, pleaaaaaase :) [10:34:19] * elukey takes notes [10:34:59] ah snap "Update the changelog.md file at the root of the repository with changes that are going to be deployed - commit and merge this change." [10:35:06] * elukey goes in the corner of shame [10:35:27] anyhow, I'll make it straightforward for dumb people like me :D [10:39:38] (03PS1) 10Elukey: Changelog v0.0.40 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/335423 [10:41:21] joal: --^ all good? [10:41:59] afaics the maven release plugin will take care of the pom.xml version bump [10:42:17] elukey: the commit message could be just a bit more explicit, but the content of the changelog is good :) [10:43:29] ah I followed the trend of the last releases [10:44:18] nevermind elukey, I'm too picky :) [10:44:25] elukey: merging that version :) [10:44:50] (03PS2) 10Elukey: Add v0.0.40 to the Changelog [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/335423 [10:44:53] ah I just updated it [10:44:54] :P [10:44:57] hehe :) [10:45:05] ok, merging the new one then ;) [10:45:24] (03CR) 10Joal: [C: 032] "LGTM :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/335423 (owner: 10Elukey) [10:45:37] (03CR) 10Elukey: [V: 032] Add v0.0.40 to the Changelog [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/335423 (owner: 10Elukey) [10:45:42] elukey: let's wait a few minutes for jenkins to merge it [10:46:08] ehm sure [10:46:15] I haven't merged it yet [10:46:19] :D [10:46:20] it is an illusion joal [10:46:25] don't look at wikibugs [10:47:47] * elukey writes down to wait [10:48:35] elukey: That's no problem [10:54:34] so https://integration.wikimedia.org/ci/job/analytics-refinery-release/m2release/ gives me 0.0.39 as the release version, which looks wrong [10:54:49] so I should trigger https://integration.wikimedia.org/ci/job/analytics-refinery-release/build?delay=0sec and then redo [10:55:01] That's indeed the idea elukey [10:55:25] elukey: You don't even have to wait for the build to finish, you can kill it a few seconds after it has started [10:56:48] other weird thing [10:57:20] git tag --list shows all vx.y.z but then I can see 0.0.39 [10:57:22] mmm [10:58:17] elukey: an error was made at deploy time for version 0.0.39 :) [10:58:47] elukey: see Change refinery-x.y.z to vx.y.z in the "SCM tag" input textbox and update the number -- We should put the 'v' in bold :) [10:59:49] yep yep [10:59:52] will do [11:09:56] 06Analytics-Kanban: Document the difference in aggregate data on wikistats and wikistats 2.0 -
https://phabricator.wikimedia.org/T150963#2802816 (10Elitre) I am afraid that the answers were not direct enough, so I am still unclear about what it is that you think you need from us and when. I can help review spec... [11:15:43] elukey, joal: there's a new privilege escalation vulnerability in ntfs-3g, which gets installed by Ubuntu on trusty systems. I'll simply deinstall it from the Hadoop cluster, I doubt anyone/anything uses NTFS? [11:16:04] Hi moritzm [11:16:20] moritzm: I don't think hadoop nodes are used for anything else than hadoop [11:16:34] moritzm: client nodes (stat1002,4) might though [11:18:37] but none of the users of stat100[24] has physical access to these servers, so the typical use case of USB media is moot [11:19:19] I'll doublecheck via salt cluster-wide [11:19:35] awesome (for analytics machines, I'm pretty sure it's not used) [11:19:47] for stat, don't know [11:21:31] I doublechecked, none of stat100[24] has the fuse kernel module loaded, so it can't have been used since these were last rebooted [11:21:46] I'll drop it from there as well, seems safe [11:21:52] thanks moritzm [11:24:27] +1, sorry just seen the ping [11:25:28] I have a Hive query repeatedly failing in the reduce stage with "Timed out after 600 secs" errors https://yarn.wikimedia.org/jobhistory/attempts/job_1485458133961_26593/r/FAILED [11:25:41] any ideas on why this is happening, and how to avoid it? [11:26:50] https://www.irccloud.com/pastebin/huI3EcJU/Hive%20query%20which%20is%20failing%20with%20timeout%20in%20reduce%20stage [11:28:54] HaeB: reading [11:29:22] it's working ok with a smaller LIMIT (e.g. 100) in the inner query, so it may have something to do with the windowing (OVER...) [11:30:20] joal: updated https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Deploy/Refinery-source [11:32:22] HaeB: ordering means a single reducer, so while it doesn't take a lot of parallel resources, it can possibly take a very long time :) [11:32:53] i'm happy to wait longer than 600 seconds though ;) [11:32:53] HaeB: One thing: no need to order in the inner subquery - it will be done by the windowing [11:33:54] ah, i thought so (that was a leftover from an earlier version) [11:35:35] also, just to be sure I understand: you want to get page_titles with views and cumulative views for page ranks that are multiples of 10k, and ranks smaller than 100M, right? [11:35:49] HaeB: --^ [11:35:55] yes [11:35:59] k :) [11:36:24] 100M as a hypothetical limit, IIRC it's actually less than 10M [11:36:46] probably, for a single hour [11:37:14] joal: the last step for refinery source would be to run https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/build with 0.0.40 [11:37:15] no, that's the number of pages of the entire wiki [11:37:21] (without the 'v') [11:37:43] (because of the GROUP BY page_title) [11:37:43] HaeB: counting redirects? [11:38:35] elukey: looks correct :) [11:39:03] https://www.irccloud.com/pastebin/Q8B7ozrf/%23%20of%20distinct%20pages%20viewed%20on%20eswiki%20in%20December%202016 [11:39:12] joal: ^ [11:39:55] HaeB: running a slightly modified version of your query (very slightly) - as expected, the windowing step means having a single reducer [11:40:11] HaeB: ok, thanks for the check :) [11:41:30] elukey: docs are way better than before :) I'll correct some typos, but it looks good ! [11:41:41] Thanks elukey for that ! [11:45:38] HaeB: Can you confirm the issue happens on stage 3 of your request?
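The pasted query itself is not preserved in this log; a hypothetical reconstruction of its shape, based only on the discussion (cumulative pageviews per title, sampled at ranks that are multiples of 10k; the table, filters and aliases are guesses), would be roughly:

    hive -e "
    SELECT page_title, views, cum_views, rk
    FROM (
      SELECT page_title, views,
             SUM(views)   OVER (ORDER BY views DESC) AS cum_views,
             ROW_NUMBER() OVER (ORDER BY views DESC) AS rk
      FROM (
        SELECT page_title, SUM(view_count) AS views
        FROM wmf.pageview_hourly
        WHERE year = 2016 AND month = 12 AND project = 'es.wikipedia'
        GROUP BY page_title
        -- joal's point: no ORDER BY needed here, the OVER clauses sort again
        -- anyway, and that sort funnels everything through a single reducer
      ) grouped
    ) ranked
    WHERE rk % 10000 = 0 AND rk < 100000000;"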
[11:45:52] HaeB: This seems to be what happens to me [11:46:37] yes [11:55:08] HaeB: looking at EXPLAIN tells us that the sorting actually happens in stage 3 [11:56:23] joal: ok - you mean the sorting for the windowing terms, right? [11:56:36] correct - That's why this stage takes so long [11:56:49] However there is something I don't understand in the explain statement [11:57:23] HaeB: trying something slightly different to see if results improve [11:57:39] (the inner query does not seem resource intensive, i have run it for an entire year recently for two other wikis) [11:57:50] HaeB: indeed [11:59:05] joal: I checked https://github.com/wikimedia/analytics-refinery/commits/master and it seems to me that we'll only need to restart the druid-related coordinators after the refinery deployment, right? [11:59:22] elukey: checking [11:59:55] elukey: yessir, looks correct to me :) [12:00:02] pageview jobs (both daily and monthly) [12:01:47] all right so I'll deploy from tin to stat1002, follow the instructions and then we'll restart them [12:01:51] sounds good? [12:01:58] should take 5 mins [12:02:09] elukey: deploy everything alright, then restart :) [12:07:38] joal: I should run sudo -u hdfs /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run on only one host, right? And it could be done after the scap deployment [12:08:03] not super clear from the docs [12:08:20] elukey: indeed, this command deploys the last version to HDFS - no need to do it from multiple places :) [12:08:35] And it indeed should be done after scap, to copy the last deployed version [12:17:42] !log deployed Refinery via scap and then executed the hdfs copies on stat1002 [12:17:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:18:31] (03PS2) 10Nschaaf: (in progress) Store sanitized data for WDQS [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335211 (https://phabricator.wikimedia.org/T146915) [12:20:21] joal: all good, the last step is to restart the coordinators [12:20:33] (03CR) 10Nschaaf: [C: 04-1] "I've updated the naming to better reflect what is happening to the data. Could Analytics comment on exactly what data is considered PII an" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335211 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [12:23:50] elukey: let me know when you want to do so :) [12:27:35] joal: anytime, even now [12:27:51] let's go :) [12:28:29] all right let me try to explain what I'd do [12:28:54] HaeB: I don't have better ideas than boosting the timeout :( [12:29:10] HaeB: Not very good, but I can't think of a better solution [12:29:19] THEORETICALLY pageview-druid-monthly-coord and pageview-druid-daily-coord should be changed [12:30:04] HaeB: I think this should work: SET mapreduce.task.timeout = 1800000; (1800 seconds instead of 600, could even be more) [12:30:34] ah snap https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie/Administration is super long :P [12:31:06] Oh elukey, an interesting thought came to my mind: why haven't we changed the jar version in the refine job? [12:32:46] joal: is it a tricky question?
:D [12:33:00] elukey: A bit :) [12:33:16] well you changed comments afaics [12:33:29] so theoretically it doesn't matter much to bump it [12:33:37] but the .properties file has been changed [12:33:47] so the coordinator needs a restart [12:33:47] elukey: correct :) We could even have not deployed refinery-source [12:33:53] I knowwww [12:33:55] :P [12:35:24] elukey: My point was to make sure that not updating the jar version anywhere was on purpose (we sometimes forget to do it when it's needed) [12:36:41] +1 [12:38:58] so joal, should I just kill the two coordinators and restart them? [12:54:06] excuse me elukey, got a phone call [12:54:24] elukey: That's the plan: kill, restart (with the correct start time) [13:02:52] joal: me too :) [13:03:09] so pageview-druid-monthly-coord and pageview-druid-daily-coord right? [13:04:03] elukey: correct [13:12:58] !log restarted pageview-druid-monthly-coord and pageview-druid-daily-coord oozie coordinators after deployment [13:13:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:13:03] joal: --^ [13:13:08] hope that they are fine [13:13:23] just checked in hue and they seem good [13:16:13] indeed it seems good elukey :) [13:16:19] Thanks a lot elukey for having done that :) [13:16:25] awesome docs for oozie! [13:16:42] thanks for helping me, finally I managed to do my first deployment :) [13:18:05] elukey: I did nothing ! [13:51:53] * elukey afk for a bit! [13:57:35] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989663 (10Ottomata) @mforns, can you comment about large DELETEs? Do they happen often? How large are they when they happen? Would LOAD DATA actually help replication? [14:01:14] halfak / joal: running 5 minutes late, bank problem [14:02:34] no worries milimetric. I'm a little late myself. [14:15:00] elukey: I'm upgrading ca-certificates-java on the kafka* hosts, already done on the other jessie/java hosts, so it should cause no problems [14:15:43] moritzm: sure - should we restart the brokers for openjdk later on? [14:15:43] it was required for the openjdk-8 updates, and since kafka uses java 7 it's not strictly needed, but better to have all the versions in sync across the fleet [14:15:50] ah ok [14:15:52] no, not needed for this one [14:15:54] already answered :) [14:30:31] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989770 (10Marostegui) >>! In T124307#2989663, @Ottomata wrote: > > @Marostegui , Would LOAD DATA actually help replication? If you need to do massive data imports into the DB, it will h... [14:36:35] * fdans goes out for some 4pm lunch [14:38:40] (03PS3) 10Fdans: Adds map visualizer to Dashiki [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/333922 (https://phabricator.wikimedia.org/T153921) [14:39:02] (will review after meeting, fdans ) [14:39:13] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989845 (10Ottomata) EventLogging is a stream of data. We can do batching because the data is consumed from Kafka, and then inserted into MySQL via a python MySQL client. So we could con...
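The coordinator restart logged above, sketched as Oozie CLI calls; the coordinator ID, properties path and overridden property are illustrative (the real ID comes from Hue, the real path from the refinery checkout):

    # kill the running coordinator
    oozie job -kill 0012345-170201000000000-oozie-oozi-C
    # re-submit it from the freshly deployed refinery, overriding the start
    # time so no runs are skipped
    oozie job -run \
      -config /srv/deployment/analytics/refinery/oozie/pageview/druid/daily/coordinator.properties \
      -D start_time=2017-02-01T00:00Z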
[14:39:15] milimetric: I just need to alter the tests a bit, but it's ready for review [14:39:21] thank you :) [14:56:08] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989873 (10Marostegui) LOAD DATA is a lot faster for bulk-loading lots of data into the DB, there is a lot less overhead in parsing SQL statements and all the processes around that parsing. This is... [15:07:31] fdans: gonna review now, if you want to watch me mumble through it, I can jump in the batcave [15:14:33] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:14:52] yeah, was about to say: the cluster is under heavy pressure [15:16:04] analytics 32 has weird behavior elukey: a lot of system CPU used compared to others :( [15:16:48] checking [15:16:53] thanks [15:17:33] first day of the month is always a bad day for the cluster ... [15:18:07] so 53 gives me OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING [15:18:20] and I believe that it is a bug in the script checking node manager state [15:18:29] k elukey [15:18:50] WHAT [15:18:50] * Hadoop nodemanager is dead and pid file exists [15:18:59] ok now I am confused :D [15:19:01] ahahaha [15:19:05] wow [15:19:06] the script is definitely weird [15:20:06] java.lang.OutOfMemoryError: Java heap space [15:20:14] this is the memleak [15:21:10] joal: I think that we need a rolling restart of the cluster :( [15:21:19] elukey: :( [15:22:45] some nodes are fine, others are not - https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=17&fullscreen [15:23:20] elukey: most are not [15:23:33] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:23:47] close to the heap limit, and these metrics are not fine-grained enough to show small spikes [15:24:01] so yes I need to start restarting Yarn daemons now [15:24:03] :) [15:24:23] elukey: Mwarf [15:24:38] elukey: Let's be careful, there are some jobs I'd rather not kill [15:24:50] 10Analytics, 10Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#2989939 (10Ottomata) [15:25:19] joal: sure, the last time I tried to poke the App Masters from the Resource Manager, being super careful not to kill them, but it didn't work super well [15:25:46] this mess might push us to upgrade the cluster to the new cdh [15:25:56] correct [15:27:01] elukey: let's give it another hour: huge monthly stuff is being absorbed [15:27:07] elukey: ok for you? [15:28:00] sure [15:28:12] sudo -u hdfs /usr/bin/yarn application -appStates RUNNING -list | egrep -o 'analytics10[0-9][0-9].eqiad.wmnet' | sort | uniq -c is what I used to check the appmasters [15:28:36] checking elukey [15:29:08] joal: seems like you're busy, no need to respond :) small fyi https://phabricator.wikimedia.org/T153743 is the place to watch for remaining shards on new labsdbs [15:29:42] Thanks chasemp, will look after the cluster gets quieter [15:31:19] actually elukey, no good - my faith was incorrect [15:48:38] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) @ottomata: we do not delete data from eventlogging (other than the purging that should happen after 90 days), the system just inserts batches of records.
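On the LOAD DATA point from T124307 above: the speedup comes from skipping per-statement SQL parsing when a whole batch is bulk-loaded from a file. A minimal sketch, with invented file, table and column names (EventLogging tables follow the Schema_revisionId naming pattern):

    mysql --local-infile=1 -h db1046.eqiad.wmnet log <<'SQL'
    LOAD DATA LOCAL INFILE '/tmp/eventlogging_batch.tsv'
    INTO TABLE SomeSchema_12345678
    FIELDS TERMINATED BY '\t'
    (uuid, `timestamp`, event_someField);
    SQL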
[15:53:47] 10Analytics, 10DBA, 06Operations: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2990018 (10jcrespo) > purging that should happen after 90 days How do you implement purging? That surely must run deletes or some kind of updates? [15:56:23] ottomata, elukey: spark is 1.6.0 in CDH 5.7+ [15:56:28] (03CR) 10Nuria: "Maybe it is worth talking about this in person? As far as I can see there is no sanitization despite the naming. We do not retain long term" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335211 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [15:56:35] mwahh, not even a bug-fix version :( [15:56:39] (03CR) 10Nuria: [V: 04-1 C: 04-1] (in progress) Store sanitized data for WDQS [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335211 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [16:01:46] nuria: standup wooo [16:01:53] mforns: too [16:01:53] :) [16:02:02] ottomata: indeeed!!!! [16:02:07] oooh! coming! [16:07:42] elukey: fyi, you asked me how: basically I do this: https://wikitech.wikimedia.org/wiki/Git-buildpackage#How_to_build_a_Python_deb_package_using_git-buildpackage [16:07:45] for new python debs [16:08:02] thanks! [16:09:36] 06Analytics-Kanban: Update montly 'unique computation' jobs for better resource management - https://phabricator.wikimedia.org/T156921#2990037 (10JAllemandou) [16:10:10] (03PS1) 10Joal: Update montly unique jobs for better resource mgt [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335459 (https://phabricator.wikimedia.org/T156921) [16:29:18] 10Analytics: CDH upgrade. Value proposition: new spark for edit reconstruction - https://phabricator.wikimedia.org/T152714#2857818 (10Ottomata) In Analytics Ops meeting today, we decided we should upgrade to CDH 5.10 now that it is out, even though it doesn't have Spark 2.x like we had hoped. - Mediawiki Histor... [16:31:47] elukey: have you started to restart nodemanagers? [16:32:23] joal: yep, 1028/29/30 done [16:32:37] hm, killed a monthly job :( [16:32:44] ah snap [16:33:07] so definitely the script does not tell us the whole picture [16:33:15] or I am missing something [16:33:25] there is something I don't understand either [16:34:43] another thing that is really annoying is "nodemanager did not stop gracefully after 5 seconds: killing with kill -9" [16:34:59] IIRC I tracked down a bug that was still open for this [16:35:35] mforns: right elukey [16:35:39] oops sorry mforns [16:35:53] hehe np :] [16:37:40] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:39:20] :( [16:39:40] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:39:46] elukey: I think your script doesn't give you correct info :( another spark job I had running has died because of the restarts [16:39:56] elukey: I'll have a look at the script again [16:41:56] I am also wondering if there is a graceful restart procedure [16:42:22] elukey: it makes no sense that my job died if you've not killed the master :( [16:42:29] https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html [16:43:36] looks good elukey!!
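For reference, the work-preserving NodeManager restart described in the doc linked above comes down to a couple of yarn-site.xml properties plus a fixed NM port; whether it behaves well on the CDH version in use here is what gets filed as T156932 further down. Roughly (values illustrative):

    # work-preserving NM restart (Hadoop 2.6+), set in yarn-site.xml:
    #   yarn.nodemanager.recovery.enabled = true
    #   yarn.nodemanager.recovery.dir     = /var/lib/hadoop-yarn/nm-recovery
    #   yarn.nodemanager.address          = 0.0.0.0:8041  (must be a fixed port)
    # after which a plain service restart should leave containers running:
    sudo service hadoop-yarn-nodemanager restart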
https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html is for our version [16:44:00] so it might be something to check [16:44:01] elukey: my spark job died because of 4 retries over a node that was not present - makes sense [16:45:19] joal: a node that wasn't present? :O [16:45:54] elukey: a node that had died [16:45:57] (or restarted) [16:47:37] marcel: your job dies anyway ;) [16:47:42] mforns: --^ [16:47:48] joal, :[ [16:47:50] ok [16:49:12] elukey: let me know when you're done with the NodeManagers, I'll restart the jobs at that moment [16:49:37] mforns: I'll let you restart your job (maybe with the patch I suggested on mappers) [16:49:54] joal, yes will do! [16:50:00] mforns: not now though :) [16:50:17] joal: did only spark jobs fail, by any chance? [16:50:23] ok [16:50:28] elukey: nope, others too [16:50:50] don't bother elukey, let's move on with restarting everything, I'll restart what's needed [16:50:58] because I can see application_1480065021448_201730 BannerImpressionsStream SPARK joal root.default RUNNING UNDEFINED 10% http://10.64.5.104:4040 [16:51:10] elukey: you can kill it, I'll restart it [16:51:14] and the IP is stat1004 [16:51:24] elukey: fun ! [16:51:30] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:52:30] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:54:50] weird, I can see that your spark job is running on 1035 [16:54:51] mmmm [16:56:20] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:58:20] RECOVERY - Hadoop NodeManager on analytics1042 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:02:43] (03PS2) 10Joal: Update montly unique jobs for better resource mgt [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335459 (https://phabricator.wikimedia.org/T156921) [17:02:58] mforns: updated the patch with some delay in the coordinator, please have a look --^ [17:03:13] joal, oh! thanks :] [17:03:26] mforns: so, 2 things: start mappers only, + delay [17:03:33] ok [17:03:51] mforns: I think the banners job could be delayed by a day - the data is present thanks to the daily jobs, so no real rush on that one, agreed? [17:04:21] joal, I think it could be delayed by more days, like 5? [17:04:36] mforns: if you want, no big deal [17:05:30] elukey: doing good on restarts?
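"Start mappers only" above plausibly refers to the mapreduce slowstart knob that comes up again later in the day: it keeps reducers from grabbing slots while maps are still running, which matters on a saturated cluster. As job-level settings on a Hive run (the .hql filename is invented), together with the task-timeout bump suggested to HaeB earlier:

    hive --hiveconf mapreduce.job.reduce.slowstart.completedmaps=0.99 \
         --hiveconf mapreduce.task.timeout=1800000 \
         -f monthly_uniques.hql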
[17:07:01] joal: so I am missing 1035 (that is running your banner impression job), 1040 and 1052->1055 [17:07:04] the rest is done [17:07:44] elukey: great - please go ahead with everything, don't bother about jobs anymore (let's do it, then restart - done :) [17:10:48] 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#2990284 (10Nuria) [17:11:21] 10Analytics, 10DBA, 13Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1532166 (10Nuria) [17:14:00] 06Analytics-Kanban: Investigate if Node Managers can be restarted without impacting running containers - https://phabricator.wikimedia.org/T156932#2990310 (10elukey) [17:14:19] 10Analytics-Cluster, 06Analytics-Kanban: Investigate if Node Managers can be restarted without impacting running containers - https://phabricator.wikimedia.org/T156932#2990323 (10elukey) p:05Triage>03High [17:14:22] joal: --^ [17:14:28] all restarts are done [17:14:51] 10Analytics: Create purging script for analytics-slave data - https://phabricator.wikimedia.org/T156933#2990326 (10Nuria) [17:15:01] oozie is complaining a lot [17:15:33] milimetric: which db hosts are these two? [17:15:34] https://phabricator.wikimedia.org/T156844 [17:15:39] or [17:15:41] mforns: db1046, db1047 [17:15:42] ? [17:15:58] ottomata, ??? [17:16:02] joal: can I restart some of the jobs? [17:16:05] or are you doing it? [17:16:12] analytics-store, etc. are special names [17:16:18] if you don't know i'll look [17:16:41] ottomata, I don't know... can look as well [17:17:04] db1046 is m2-master [17:17:33] that is an analytics mysql box? [17:18:07] 10Analytics, 10DBA, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990353 (10Ottomata) [17:19:15] db1047 is a slave [17:19:19] ottomata, not sure [17:19:31] k [17:19:40] 10Analytics, 10DBA, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987618 (10Ottomata) @Marostegui, @jcrespo, we talked about this today. What is your timeline for replacing these boxes? We want to try to wean people off of EventLogging My... [17:19:43] sorry got hooked in baby mode [17:20:07] elukey: I'll do it (I'd like to patch some, if mforns and nuria agree to merging) [17:21:00] super [17:21:02] joal, sure I'm reviewing it [17:23:20] joal, I see, so you make the job depend on other datasets, that's clever [17:23:43] now I'm not sure if you said you were going to modify the banner job or if I can do it?
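The "depend on other datasets" trick mforns spots above usually looks like this in a coordinator.xml: the coordinator waits for a dataset instance ahead of its own nominal time, which delays the run without changing what it processes. The dataset name and offset below are illustrative, not taken from the actual patch:

    <input-events>
      <data-in name="delay_marker" dataset="webrequest_text">
        <!-- 120 hourly instances ahead of nominal time, i.e. roughly a 5 day delay -->
        <instance>${coord:current(120)}</instance>
      </data-in>
    </input-events>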
[17:23:50] mforns: It's actually the same dataset, but a different dependency :) [17:23:58] mforns: please go for it [17:24:03] joal, ok [17:24:38] mforns: this trick was first used by qchris__, we owe him a lot :) [17:24:54] :] I see [17:27:06] !log Restarting 2 webrequest-load text jobs that failed during NM restart (2016-02-01T11:00 and T13:00) [17:27:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:27:10] T13: Plan to migrate everything to Phabricator - https://phabricator.wikimedia.org/T13 [17:27:43] hehe, managed to confuse stashbot :) [17:28:14] mforns: if you think the patch for monthly is good enough, let's merge, I'll deploy and restart the jobs [17:28:22] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 13Patch-For-Review, 06Services (watching): Prepare eventstreams (with KafkaSSE) for deployment - https://phabricator.wikimedia.org/T148779#2990398 (10Nuria) [17:28:26] 06Analytics-Kanban, 10EventBus, 06Operations, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2990397 (10Nuria) 05Open>03Resolved [17:28:39] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 07Documentation: EventStreams documentation - https://phabricator.wikimedia.org/T153117#2990399 (10Nuria) 05Open>03Resolved [17:28:46] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2990400 (10Nuria) [17:28:54] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 13Patch-For-Review, and 2 others: RecentChanges in Kafka - https://phabricator.wikimedia.org/T152030#2990427 (10Nuria) 05Open>03Resolved [17:28:57] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2611852 (10Nuria) [17:29:08] 06Analytics-Kanban: Pageview Jobs: Make workflows easier to maintain using a variable instead of repeating some complex value accross the files - https://phabricator.wikimedia.org/T156668#2990430 (10Nuria) 05Open>03Resolved [17:29:22] 06Analytics-Kanban, 13Patch-For-Review: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2990432 (10Nuria) 05Open>03Resolved [17:29:33] 06Analytics-Kanban, 13Patch-For-Review: Better explanation on pageview definition for edit actions - https://phabricator.wikimedia.org/T156629#2990433 (10Nuria) 05Open>03Resolved [17:29:45] 06Analytics-Kanban, 13Patch-For-Review: Improve AQS deployment - https://phabricator.wikimedia.org/T156049#2990437 (10Nuria) 05Open>03Resolved [17:30:23] joal: thanks, will try that out [17:30:41] no prob HaeB, sorry for not having something better [17:30:48] 06Analytics-Kanban, 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2990439 (10Nuria) [17:30:51] 10Analytics: Add hardware capacity to AQS - https://phabricator.wikimedia.org/T144833#2990438 (10Nuria) [17:32:00] joal, you mean banner monthly? or monthly uniques? [17:32:27] mforns: was thinking of monthly uniques [17:33:12] (03CR) 10Mforns: [C: 032] "LGTM!"
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/335459 (https://phabricator.wikimedia.org/T156921) (owner: 10Joal) [17:33:21] joal, ^ [17:33:28] thanks mforns [17:41:55] 10Analytics, 10DBA, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990444 (10jcrespo) > What is your timeline for replacing these boxes? The constraint, more than the decommission, is the budget for replacements. I do not know what is the d... [17:44:45] 10Analytics, 10DBA, 06Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990448 (10jcrespo) > What is your timeline for replacing these boxes? BTW, I forgot to answer literally your question, the deadline for replacement is January 2014 (not a ty... [17:55:54] all right going afk team! [17:56:00] talk with you tomorrow :) [17:56:01] byeee [18:14:59] 10Analytics, 10Pageviews-API: Pageview API: Better filtering of bot traffic on top enpoints - https://phabricator.wikimedia.org/T123442#2990549 (10MusikAnimal) I have a very buggy new version of Topviews that I'm working on that shows the percentage of mobile views each page receives. See http://tools.wmflabs.... [18:18:06] bearloga: Hi ! [18:44:49] (03CR) 10Joal: [V: 032] "Merging to deploy." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335459 (https://phabricator.wikimedia.org/T156921) (owner: 10Joal) [18:47:14] !log Deploy refinery for uniques monthly patches [18:47:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:01:00] !log Killed-Restarted Mobile apps Uniques monthly jobs to pick up new config - 0096638-161121120201437-oozie-oozi-C [19:01:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:07:16] joal: hi! Good morning! Sorry I missed your ping earlier. Are you still here? What's up? [19:07:26] hey bearloga :) [19:08:19] bearloga: I wanted to spend some time with you discussing whether productionisation could be interesting for some of your hive queries [19:08:31] bearloga: now might not be the best moment though :) [19:08:49] And you possibly have already discussed this with mforns and milimetric [19:10:36] joal: Maybe? Depends which queries. I'm currently testing our Reportupdater-based code base which has a lot of Hive queries. [19:10:46] ah :) [19:11:12] bearloga: I'm asking since I've noticed you are a 'regular' user of the cluster ;) [19:12:43] I certainly hope I'm a "regular" user of the cluster! ;D analyzing data is literally my primary job :P [19:12:53] hehe :) [19:14:18] If you're in the process of moving to report-updater, it's definitely already on the move :) [19:14:48] thanks for lighting some of my bulbs :) [19:15:57] Taking my leave now, see y'all a-team and others tomorrow :) [19:16:04] joal: Have fun! :) [19:45:29] bearloga: jajaja [19:47:10] ottomata|afk: milimetric will be able to make the data mapping meeting, feel free to skip if you have other important things to attend to [19:48:21] 10Analytics, 10Wikimedia-General-or-Unknown: Browser and platform stats for logged-in vs. anon users for security and product support decisions - https://phabricator.wikimedia.org/T58575#2991001 (10Nuria) > as a) browser support is likely to differ between anonymous & authenticated users, Browser stats for th... [19:50:12] (I'll be there nuria) [19:54:35] 10Analytics, 10Wikimedia-General-or-Unknown: Browser and platform stats for logged-in vs.
anon users for security and product support decisions - https://phabricator.wikimedia.org/T58575#2991043 (10Nuria) Correcting last post. From data on hadoop we cannot differentiate between logged in and not logged in user... [19:58:25] bearloga: is there an e-mail list for analysts at wmf? [19:58:54] nuria: not to my knowledge, no [20:01:49] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Change userAgent field to user_agent_map in EventCapsule - https://phabricator.wikimedia.org/T153207#2991061 (10Nuria) [20:29:24] phew oh man [20:29:41] i locked my keys in my car, luckily a nice dude at the storage/parking place had some serious wire hanger skills [20:44:42] ottomata: I need your help. :D [20:45:21] leila: hiii [20:45:23] in a meeting [20:45:25] but, what's up? [20:45:40] we're choosing a name for a domain that will be used to serve recommendation APIs that we offer: think readMore recommendations (for readers), recommendations about what article to create, what article to translate, which hyperlinks to add to articles, which articles to expand, etc. [20:45:46] my question is: what should be the name? [20:45:47] :D [20:45:57] you can even say it https://phabricator.wikimedia.org/T147420 ottomata [20:46:48] OH I LOVE NAMING [20:46:48] hehehhe [21:01:18] ottomata: I wouldn't ask you a task that you don't love. :D [21:02:56] 10Analytics: Remove user_agent_map from pageview_hourly long term - https://phabricator.wikimedia.org/T156965#2991318 (10Nuria) [21:07:14] joal: BTW, interesting to see https://gerrit.wikimedia.org/r/#/c/335459/ - are these performance tricks something that we could potentially also use in the other last-access queries we discussed with zareen recently? [21:07:35] 10Analytics: Remove user_agent_map from pageview_hourly long term - https://phabricator.wikimedia.org/T156965#2991322 (10Nuria) Browser data has been useful to many teams on druid. - For detailed data we can delete after 90 days - We can load (to see browser trends) our browser dataset over time [21:08:33] leila: "RECOMENDATOR"....obviously [21:08:40] nuria: :D [21:10:00] HaeB: the delay is just so two big jobs do not run at the same time, it doesn't make resources available [21:10:54] HaeB: for the other setting i am not sure [21:10:59] nuria: i was more thinking about the other part (SET mapreduce.job.reduce.slowstart.completedmaps=0.99;) [21:12:23] HaeB: we can consult with joal tomorrow; those are useful if there is sorting [21:12:57] yeah, not urgent [21:44:30] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991436 (10Jdlrobson) I'm not sure whether intentional is the right word.. but it's something I observed.... [21:50:33] 10Analytics, 10Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#2991446 (10Ottomata) [21:59:55] 10Analytics, 10Pageviews-API: Pageview API: Better filtering of bot traffic on top enpoints - https://phabricator.wikimedia.org/T123442#2991467 (10Milimetric) That's great insight, thank you @MusikAnimal [22:05:15] 10Analytics, 10Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#2991493 (10Ottomata) [22:05:17] 06Analytics-Kanban: Document the difference in aggregate data on wikistats and wikistats 2.0 - https://phabricator.wikimedia.org/T150963#2991494 (10Milimetric) Maybe a meeting would be easier then?
Maybe our request is just getting lost in too much documentation? [22:12:20] 10Analytics-Dashiki, 06Analytics-Kanban, 13Patch-For-Review: Add extension and category (ala Eventlogging) for DashikiConfigs - https://phabricator.wikimedia.org/T125403#1986718 (10Milimetric) After a positive discussion on meta's Babel page, created T156971 to track deployment to prod. [22:15:00] 10Analytics, 10Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#2991532 (10Ottomata) [22:16:13] 10Analytics, 10Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#2989939 (10Ottomata) [22:23:05] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991561 (10mobrovac) Adding the extra parameter shouldn't be a problem from the caching perspective, as we... [22:28:11] (03CR) 10Milimetric: [C: 04-1] "cool, looks good. Couple of nits and one or two ideas." (0312 comments) [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/333922 (https://phabricator.wikimedia.org/T153921) (owner: 10Fdans) [22:33:29] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991592 (10Jdlrobson) Clients wouldn't specify the value of max_age in config. They would specify a period... [22:35:48] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991599 (10mobrovac) Yup, lapsus linguae, but the concern still stands. [22:41:12] (03CR) 10Ottomata: "Hm, you have external Hive partitions on this data, right? Are you sure you don't need to drop those too?" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/335158 (owner: 10EBernhardson) [22:51:16] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991693 (10Jdlrobson) Right now I envisioned this as an integer but we could use an enumerator instead to... [23:10:11] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2991716 (10mobrovac) Fixed integers in the (0,24] range would be better. the more options we introduce, th... [23:44:19] 10Analytics, 06Reading-analysis, 06Research-and-Data, 10Research-consulting: Propose metrics along with qualifiers for the press kit - https://phabricator.wikimedia.org/T144639#2991803 (10Neil_P._Quinn_WMF)
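On Ottomata's review question above about external Hive partitions: for EXTERNAL tables, dropping a partition only removes the metastore entry, so the underlying files need a separate delete. A hedged sketch, with an invented table name and path:

    hive -e "ALTER TABLE wdqs_extract DROP IF EXISTS PARTITION (year=2016, month=12);"
    hdfs dfs -rm -r /wmf/data/wmf/wdqs_extract/year=2016/month=12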