[00:00:30] kevinator: ignore that. my bad.
[02:03:18] Analytics, Ops-Access-Requests, operations: Access to stat1003 for jdouglas - https://phabricator.wikimedia.org/T98209#1263714 (Dzahn) p:Triage>Normal
[02:36:20] Analytics-Tech-community-metrics, ECT-April-2015, ECT-May-2015, Patch-For-Review: "Volume of open changesets" graph should show reviews pending every month - https://phabricator.wikimedia.org/T72278#1263753 (Acs) >>! In T72278#1166368, @Dicortazar wrote: > After checking the code, I'd say that the c...
[05:44:11] Analytics-Tech-community-metrics: Community Bonding evaluation for "Allowing contributors to update their own details in tech metrics" - https://phabricator.wikimedia.org/T98045#1263875 (Qgil)
[08:30:18] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1264047 (phuedx)
[08:53:39] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1171644 (phuedx)
[08:55:04] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1171644 (phuedx) I've added the provisional estimate. @kaldari, @bmansurov: We'll need to estimate this durin...
[10:30:19] joal, hi
[11:40:28] Hi mforns :0
[11:40:31] :)
[11:48:29] Analytics-Tech-community-metrics: Community Bonding evaluation for "Allowing contributors to update their own details in tech metrics" - https://phabricator.wikimedia.org/T98045#1264476 (Dicortazar) @Sarvesh.onlyme, we would need to start working on the list of bullets found in the description. It would be...
[12:36:04] good moorninnng joal
[12:36:07] making coffee
[12:36:11] morning ottomata :)
[12:36:15] i think not starting those other guys was the right move :)
[12:36:23] Take your time, I am babysitting now
[12:36:57] everything back in track for refine
[12:37:02] now only post-aggregation
[12:37:11] --> after swich :)
[12:37:20] coooOOol
[12:38:01] hmm, doc swarm today, so I shouldn't ask, but I'd like to restart the RM (maybe with HA ?) to see if we
[12:38:20] can solve the parquet file issue
[12:38:26] ja we can restart it
[12:38:31] HA will take me longer due to zk issue
[12:38:55] yeah, right
[12:39:18] let me know when you do it, I can assist :)
[12:39:28] k
[12:41:02] Enjoy your coffe :)h
[12:42:32] also ottomata, I'd like your opinion on moving some doc pages
[12:42:40] Let me know when you have some time :)
[12:45:04] k gimme 7 mins
[12:45:08] :)
[12:45:30] actually we can talk now
[12:45:32] just fixing coffee
[12:45:33] batcave
[12:45:47] k
[12:57:10] Analytics-Wikistats: Traffic stats: analyze category 'Linux other' - https://phabricator.wikimedia.org/T48190#1264703 (ezachte) Comment from Jef Spaleta: (sent to me directly because he had sign-in issues) The primary reason why unknown linux is so large in the wikimedia stats is because both Mozilla Firef...
[12:58:00] Analytics-Wikistats: Traffic stats: analyze category 'Linux other' - https://phabricator.wikimedia.org/T48190#1264705 (ezachte)
[13:25:11] joal: things looking good.
[13:25:34] Cool :)
[13:27:14] Shall we restart RM before resuming jobs ?
[13:29:17] PROBLEM - Throughput of event logging events on graphite1001 is CRITICAL 37.50% of data above the critical threshold [600.0]
[13:29:43] joal: ja, we can restart everything... lemme try somethign though.
[13:29:53] i realized that the zk that we were using might have beeen *slightly* older than the zk in production
[13:29:57] since it was installed on precise hosts
[13:30:11] going to try the HA thing in labs again, using ubuntu zk from trusty
[13:30:17] or maybe jessie...hmmhmm
[13:30:18] actually yeah
[13:30:23] well, trusty first
[13:33:35] k, let me know if you need me :)
[13:34:27] oh! joal, it is working now!
[13:34:30] with the trusty zk!
[13:34:32] :)
[13:34:41] Happy ottomata I guess
[13:34:49] How come it was not working ?
[13:35:00] labs was using non-trusty zk ?
[13:35:19] i had installed the zk in labs on the kafka hosts there
[13:35:20] for conveninece
[13:35:28] but we only have a precise kafka package atm
[13:35:30] ahhhhhh !
[13:35:32] so those nodes were precise
[13:35:39] makes sense
[13:36:06] cool, and auto failover works too
[13:36:13] Nice :0
[13:36:16] :)
[13:36:31] I continuously miss my smiley ... bad day
[13:36:58] ok, going to read over my puppet work from yesterday, and try to apply this in prod
[13:37:00] then let's do a restart
[13:37:04] ok
[13:37:19] Quick question : trusty / jessie --> moving from ubuntu to debian ?
[13:44:08] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 64.4966098684
[13:44:31] weird --> icinga sends a graphite alert, but no data in graphite ...
[13:44:45] yeah, what in the world is up with graphite...
[13:46:48] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 0.852770131579
[13:51:54] I still like those alert emails with little hearts :)
[13:59:31] joal: ja, we are moving from ubuntu to debian
[14:02:52] ottomata: may I have some explanationS ?
[14:03:19] I did that move two years ago, so I am happy, but I'd ove to understand the WMF reasons :)
[14:03:41] HMMmMMM
[14:04:17] there were big discussions about it last fall
[14:04:23] i think one of the main motiviations
[14:04:36] is folks feel ubuntu is focusing much more on desktop than server stuff
[14:04:52] and debian does better in server world
[14:04:54] ok :)
[14:05:08] I actually think debian does better in desktop world as well ;)
[14:05:20] At least for linux-friendly computer scientists :)
[14:05:52] Anyway, that's still not so much of a difference :)
[14:06:27] MAN ! No job running on the cluster !
[14:06:41] Either you have restarted it, or we definitely have cought back :)
[14:06:47] ottomata: --^
[14:08:02] not me!
[14:08:10] REAL ?
[14:08:45] hmmm
[14:08:59] May the RM lost have broken a few things :)
[14:09:06] I'll double check
[14:10:36] camus job ongoing
[14:10:43] but no more oozies :)
[14:11:35] ottomata: shall we wait for your zk stuff and RM restatt ?
[14:12:11] yea, i'm getting close
[14:12:15] let's just do it once
[14:12:15] cool
[14:25:48] ok joal, cool
[14:25:53] hm ?
[14:26:07] ya ready ?
[14:26:07] we are ready to restart, i'm going to disable resubmission of camus (by commenting out cron) for amin
[14:26:11] wait for this camus run to finish
[14:26:13] then let's do it
[14:26:18] k
[14:26:19] or, should we
[14:26:23] pause all oozie jobs
[14:26:25] Shall I pause
[14:26:26] and wait for all running jobs to complete?
[14:26:27] indeed
[14:26:31] yeah, ok, yeah pause
[14:27:39] I go for the bundles
[14:27:54] ok i can doo coords
[14:28:52] 3 bundles suspended
[14:29:19] still 4 coords running
[14:29:21] k, really i only have to suspend one more coordinator
[14:29:26] cause the others all depend
[14:29:44] meh i'll suspeend them all :)
[14:30:07] I still view mobile_apps monthly and daily
[14:30:14] media_count archive
[14:30:23] and clickstream
[14:30:35] oh ja mobile apps
[14:30:39] mediacount gone
[14:30:47] oh clickstream, intresting!
[14:30:56] I don't know this job :)
[14:31:05] Can you explain ?
[14:31:32] its ellery's i don't know it either
[14:31:56] ok all suspended
[14:32:59] great
[14:33:56] one at a time i move jobs into essential queue to give them a boost :p
[14:34:01] probaly doesn't matter, but it is fun
[14:34:04] Weird : still 3 jobs running on the cluster, but oozie says no job matching when looking for wf running
[14:34:13] ?
[14:34:18] 3 oozie jobs?
[14:36:43] oh, that is probably because
[14:36:44] from oozie's perspective the job is usspended
[14:36:44] well, 3 launched jobs
[14:36:44] but, that will only keep it from launching further actions
[14:36:44] i think it will still let the current action in yarn finish
[14:36:44] ok makes sense
[14:36:44] I let you play with your queue :)
[14:36:44] about mediacounts: we not only should resume, but kill and restart from long ago, no ?
[14:36:45] yes
[14:36:49] well, we might not need to kill
[14:36:51] but we can re-run
[14:36:57] hm
[14:37:03] rerun coord_id -action X-Y
[14:37:10] I need to understand better the re-run stuff
[14:38:31] k, so basically re-running allows you to start a new instance of the coord, working previously existing aciotns
[14:38:32] hmm, mediacounts might just work if we resume
[14:38:34] correct ?
[14:38:36] beacuse i had told it to re run them once last week
[14:38:37] https://hue.wikimedia.org/oozie/list_oozie_coordinator/0079582-150220163729023-oozie-oozi-C/
[14:38:39] yes
[14:38:53] it just tells a particular instantiation of a coordinator action (workflow) to run all over again
[14:38:58] you can tell it to do so based on action number
[14:39:04] And one everything is done, you kill your re-run coord instance
[14:39:08] that @123 after the coord id if you do -info
[14:39:14] or, can do based on date too
[14:39:17] and it works with ranges
[14:39:24] yes I have seen that
[14:39:28] reading oozie doc
[14:39:32] there are no killed jobs for mediacounts-load
[14:39:40] so, if we just resume it, it might just work
[14:39:44] without having to tell it to rerun
[14:40:49] I think we miss data from a long time ago
[14:42:08] actually no
[14:42:17] only from 05-02
[14:42:38] 2015/05/02 present, but not since
[14:42:49] so it seems you are right ottomata, resuming should be enough
[14:44:42] ottomata: doc on Kraken ... --> Archive ?
[14:44:49] yes
[14:44:51] or purge :)
[14:44:55] :D
[14:46:00] I like the idea though :)
[14:47:27] tgr, yt?
[14:52:01] joal, yt?
[14:52:08] yup mforns
[14:52:12] joal, hi!
[14:52:16] Heya :)
[14:52:20] joal, I've seen you in the etherpad
[14:52:26] Yes
[14:52:31] joal, can you archive wiki pages in mediawiki.org?
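(Note on the Oozie mechanics discussed between 14:26 and 14:40, for readers following along; command shapes are per the standard Oozie CLI, and exact flags should be checked against the cluster's Oozie version. Suspending a bundle or coordinator is "oozie job -suspend <job-id>", resuming is "oozie job -resume <job-id>"; suspension only stops the coordinator from materializing further actions, so workflows already handed to YARN run to completion, which is exactly the "still 3 jobs running" behavior joal observes at 14:34. The rerun form ottomata abbreviates as "rerun coord_id -action X-Y" is "oozie job -rerun <coord-id> -action <first>-<last>", or "-date <start>::<end>" for a date range; adding "-refresh" re-resolves input dependencies before re-running, and "-nocleanup" skips deleting previous output. The "@123" suffix shown by "oozie job -info <coord-id>" is the coordinator action number that these ranges refer to.)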
[14:52:42] joal, I mean delete
[14:53:08] For the moment I take pages, and will ensure during standup that the action I plan is ok
[14:53:13] Then I'll take them :)
[14:53:24] So I don't know :)
[14:53:43] joal, ok, it seems I don't have permits to delete pages in mediawiki.org
[14:53:50] hm :S
[14:53:58] Not sure I can either
[14:54:30] I'll try on infrastructure/access (the first one of the to be deleted list)
[14:55:13] Nope, no deletion for me neither
[14:55:28] joal, if you have permits, you should see, when logged in, a delete tab next to edit and edit source...
[14:55:29] ok
[14:55:55] You have put the historical banner thouhg :)
[14:58:34] joal, yes
[15:00:58] ottomata: the, legacy tsvs were created before oozie ?
[15:01:16] hm, yes/no?
[15:01:21] hm, no
[15:01:25] not sure what you are asking though
[15:01:27] weird
[15:01:42] there are 5 scripts, identical
[15:01:46] but the name
[15:01:52] if I don't miss something
[15:01:56] are you asking of that data existed before hive/hadoop/oozie? then yes.
[15:02:00] it was created by udp2log then though
[15:02:14] since januarry 2015 is has also been created by hive/hadoop/oozie
[15:02:25] and only in the last few weeks has it no longer been generated by udp2log
[15:02:28] that is
[15:02:30] on stat1002
[15:02:32] /a/squid/archive
[15:02:33] vs.
[15:02:37] /a/log/webrequest/archive
[15:02:57] the generate_5xx-[bits/misc/etc]_tsv.hql files are the same
[15:03:07] We should ahve handled that with datasets, no ?
[15:03:09] looking
[15:03:17] oh
[15:03:26] whoa werid
[15:03:27] they are
[15:03:29] not sure why
[15:03:37] Neither do I :)
[15:04:55] hm, ahh, i think that is do to an abstraction in workflow.xml
[15:04:58]
[15:05:06] yeah
[15:05:12] Was looking into that as well
[15:05:14] so it required that they are the same?
[15:05:14] i guess
[15:05:21] qch ris wrote this, (i reviewed it :) )
[15:05:52] almost empty cluster!
[15:05:58] 3 reduces go!
[15:06:01] 1!
[15:06:26] still a launcher :)
[15:06:31] heh
[15:06:34] there we go
[15:06:36] ok!
[15:06:37] Gone
[15:06:43] hipa !
[15:06:44] restarting RMs and nodemanagers
[15:07:07] with updates for versions on parquet and others deps, zk, and a new MR ?
[15:07:11] RM soory
[15:08:21] ja installed updated deps yesterday
[15:08:27] fingers crossed
[15:09:09] ok, all restarted, HA RM looks good!
[15:10:07] Nice :)
[15:10:12] i'm going to let camus run, but lets try your parqeut thing before we start ooie stuff
[15:10:18] cool
[15:10:19] fingers corssed (im' only 50% hopeful)
[15:10:22] testing now
[15:11:16] org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
[15:12:11] stopped and restarted -> same error
[15:12:14] try now
[15:12:18] that is because of a bad config on my part
[15:12:19] fixing
[15:12:31] k
[15:12:37] What was it ?
[15:12:41] some HA stuff ?
[15:13:09] shell started
[15:14:05] naw, uhhh, tell you in a bit
[15:14:09] dumb thing
[15:14:50] webUI not accessib;le :)
[15:14:58] text files ok
[15:15:29] parquet fails
[15:15:39] Damn ..
[15:15:47] SAme error
[15:16:06] crapo
[15:16:37] crap, welp, i guess we should just start jobs and then continue figuring this one out, eh?
[15:16:42] testing loading data from hive
[15:16:49] Yup
[15:16:53] ok, starting things...
[15:17:03] want me to start some ?
[15:17:05] naw i got it
[15:17:08] you mess with spark
[15:17:14] k
[15:25:36] Wow, production queue filled up :)
[15:26:56] Hullo, I'm running 5 minutes for late for standup.
[16:05:24] hm, milimetric, joal, something to keep an aye on:
[16:05:27] apache flink
[16:05:32] vs spark streaming
[16:05:32] flink!
[16:05:52] pipeline db too
[16:07:20] hm cool
[16:07:46] so many things!
[16:27:17] hm, joal, very strange that this parquet spark thing works in local mode, but not yarn
[16:45:13] yezh, completely strange ottomata
[16:45:22] also strange
[16:45:34] so, i'm trying to write a parquet file from a paralleized data set
[16:45:59] to see if i can write and load my own file
[16:46:09] http://www.codeshare.io/aITwc
[16:46:30] getting :33: error: value saveAsParquetFile is not a member of org.apache.spark.rdd.RDD[Person]
[16:46:30] people.saveAsParquetFile(parquetFilePath)
[16:46:36] i'm doing this from exampeslhere :
[16:46:40] https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#loading-data-programmatically
[16:47:03] That's weird for sure !
[16:48:24] Will try it as wel
[16:50:59] also, btw, pyspark shell gives same error
[16:51:05] for loading parquet file
[16:51:17] hm
[16:51:23] At least it's coherent
[16:53:24] Ok ottomata
[16:53:34] Managed to write a file, but can't read it back !
[16:54:41] how'd you write it?
[16:54:56] You need to use a dataframe object
[16:55:03] paste please!
[16:55:05] :)
[16:55:06] So: people.toDF
[16:56:07] hm, guess that implicit things wasn't working
[16:56:23] http://www.codeshare.io/dVKSG
[16:57:41] From the programming guide of 1.3.0 --> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
[16:57:50] This toDF() at the end is the key :)
[16:57:53] aye
[16:58:06] ok cool
[16:58:14] So parquet written
[16:58:19] But not read !!!!
[16:58:24] Do you believe that ...
[17:00:04] ok, yeah, going to post for help...
[17:00:11] hmm ,first
[17:00:13] going to try with spark-submit
[17:01:11] https://issues.apache.org/jira/browse/SPARK-6315
[17:01:24] found it I think !
[17:01:33] ottomata: --^
[17:02:37] HOW DO YOU GOOGLE SO WELL
[17:02:43] :D
[17:11:22] PROBLEM - MySQL Slave Delay on db1045 is CRITICAL: CRIT replication delay 352 seconds
[17:11:36] so far it isn't helping joal :/
[17:12:50] ottomata: try setting the conf value as they said, but no chance
[17:12:55] not working :(
[17:13:02] yeah me too
[17:13:28] And what's event more weird is : when writing your own file, it should wsritten with parquet new version ...
[17:14:08] I think posting sounds reasonnable in that case :(
[17:17:42] ja, joal, exactly
[17:17:52] yeah, the example i ahve now, is 100% contained within new spark
[17:17:54] so it should work!
[17:17:58] it does work in local mode
[17:17:59] just not yarn
[17:18:07] crazyness
[17:21:05] ottomata, hi, is the cluster ok? Dan tells me hive queries didnt run
[17:21:55] cluster is sitll a little busy, but it should be ok now
[17:24:15] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1265755 (phuedx) @kaldari, @bmansurov: If you're happy with the estimate, then just move it through the process.
[17:27:01] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1265772 (bmansurov) 3 sounds good to me.
[17:28:17] mforns_brb: any new insight into the volume going into EL from those two schemas?
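(The working pattern from the 16:46-16:58 exchange above, written out as a minimal spark-shell sketch against the Spark 1.3 API. The Person case class and input path follow the programming-guide example quoted at 16:57:41; the output path is illustrative. The point is that saveAsParquetFile lives on DataFrame, not on RDD, so the .toDF() conversion, pulled in via the sqlContext implicits, is what makes the call compile:)

    // Minimal spark-shell sketch (Spark 1.3): sc and sqlContext are provided by the shell.
    import sqlContext.implicits._          // enables .toDF() on RDDs of case classes

    case class Person(name: String, age: Int)

    // An RDD[Person] has no saveAsParquetFile -- hence the ":33: error" at 16:46:30.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()                              // DataFrame does have it

    people.saveAsParquetFile("/tmp/people.parquet")       // write succeeds...
    sqlContext.parquetFile("/tmp/people.parquet").show()  // ...read back to verify

(As the rest of the log shows, the remaining read-back failure in yarn mode was not a code problem at all: it was the CDH package version skew fixed by the upgrade at 17:49 below.)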
[17:28:30] just curious if we know whether we're dropping events, etc.
[17:35:14] AH joal
[17:35:18] ?
[17:35:20] i am able to make this example work in vagrant
[17:35:27] so something is wrong with cluster
[17:35:33] maybe there is still some version problem we ahven't detecte dyet
[17:35:42] trying in labs real quick
[17:35:43] I can't imagine anything else
[17:35:46] k
[17:35:46] (maybe)
[17:36:44] milimetric: last email from icinga is not nice ..
[17:37:23] milimetric: Shall we drop some mobile eventx ?
[17:37:36] joal: in scrum of scrums right now, but let's talk afterwards
[17:37:42] k sorry
[17:39:21] joal: the slave replication alert, right? (just making sure I'm not missing other alerts)
[17:39:32] yup
[17:40:04] means that DB is not keeping up normally with insert rate, if I don't mistake
[17:40:39] it sounds like the replication to the slave is bad, but the master seems ok
[17:40:55] although the throughput as of this morning is over 800 / second which is much higher than we've seen in the past
[17:41:22] like x2 right ?
[17:42:13] k, if master's ok, let's wait and see what mobile team answer
[17:42:17] thx for the answer
[17:42:42] diner time, brb
[17:48:54] wow, joal. fixed.
[17:48:58] it was a version thing...but i'm not sure of what
[17:49:01] i just did this:
[17:49:12] salt -E 'analytics.*|stat1002' cmd.run 'sudo apt-get -y --force-yes install $(dpkg -l | grep "^ii" | grep -i cdh | grep -v cdh5.4.0 | awk "{print \$2}" | tr "\n" " ")'
[17:49:44] that manually upgraded any cdh packages that were not at cdh5.4.0 that were installed across the whole cluster
[17:50:03] the only ones i saw upgraded were
[17:50:04] hbase hive hive-jdbc impala-shell sentry zookeeper
[17:50:20] hm, maybe hive somehow
[17:50:37] joal / mforns_brb: I'm going to add a bug for the missing graphite data and tag it operations
[17:51:13] Analytics-Kanban, operations: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1265890 (Milimetric) NEW
[17:51:20] yup, totally works now!
[17:51:22] phew!
[17:51:25] sorry abou tthat.
[17:56:13] milimetric, YGM :)
[17:57:17] Ironholds: what? I got mail? I got you?
[17:57:28] I mean, you do get me, but you also have mail
[17:57:39] * milimetric doesn't see mail
[17:57:51] oh!
[17:58:04] you fell under my "look at only the top 3 emails" rule
[17:58:07] ahhh
[17:58:28] there's a followup question there too which is "how does that process work, because it looks like we need to surface some logs from fluorine"
[17:59:17] Ironholds: yes, stat1002 has those folders too
[17:59:21] neat
[17:59:25] you can see that /a/limn-public-data is there
[17:59:36] but there's also /a/aggergate-datasets
[17:59:44] and /a/public-datasets
[17:59:53] *nods*
[18:00:17] there's some reason why we're doing that, but I always kind of wanted just one folder like "public-data-from-stat1002"
[18:00:23] and "public-data-from-stat1003"
[18:00:30] that seems much easier to understand
[18:00:39] totally. Makes sense, though!
[18:00:51] and then the final question; how does that process work and how would I go about replicating it for a different machine?
[18:01:52] Analytics-Kanban, operations: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1265951 (chasemp) I see this in icinga for graphite1001: > Throughput of event logging events CRITICAL: 92.86% of data above the critical threshold [600.0]
[18:16:34] Ironholds: wait what process?
[18:16:48] rsync-ing, aggregating, ?
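(Unpacking the salt one-liner at 17:49:12, since it is the actual fix for the parquet issue: salt -E targets minions whose IDs match the PCRE regex, here every analytics.* host plus stat1002, and cmd.run executes the quoted shell command on each of them. Inside, the $(...) builds the package list: dpkg -l lists packages, grep "^ii" keeps only ones currently installed, grep -i cdh narrows to CDH packages, grep -v cdh5.4.0 drops those already at the 5.4.0 version, awk prints the package-name column, and tr joins the names into one space-separated line. apt-get -y --force-yes install then upgrades exactly those stragglers to the version available from the configured cdh5.4.0 repository, which is how hbase, hive, hive-jdbc, impala-shell, sentry, and zookeeper got pulled forward.)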
[18:16:49] milimetric, file is on stat1002, then magically appears in a public folder ;)
[18:16:53] the rsyncing
[18:17:01] ah, it's puppetized, lemme see if i can find tat
[18:18:24] Ironholds: for stat1003 it's here: https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifests/sites/datasets.pp#L41
[18:18:37] and stat1002 it's here: https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifests/sites/datasets.pp#L49
[18:18:49] ooh, so I was wrong, it's only aggregate-datasets that comes over from stat1002
[18:19:03] awesome; thnakee!
[18:19:28] so I could just drop that in the puppet manifest for whatever fluorine is and get it done?
[18:19:46] oh, wait
[18:20:06] or is it the other way around? i.e., modify the file you just linked to go hunting in fluorine?
[18:21:58] I think I get it. *writes patch, grabs dev*
[18:22:01] thanks milimetric :)
[18:22:38] Ironholds: yeah, that sounds good, write patch, grab devs
[18:23:01] grand!
[18:33:08] joal, is there documentation anywhere of the field names that come out of the various maps in the webrequest table?@
[18:38:49] RECOVERY - MySQL Slave Delay on db1045 is OK replication delay 108 seconds
[18:39:56] Ironholds: the x-analytics map is documented here: https://wikitech.wikimedia.org/wiki/X-Analytics
[18:40:11] kevinator, okay. Is there documentation for any other map?
[18:41:48] Ironholds: no, I can't find all of it. There is some info on the user_agent map here: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
[18:41:56] thanks :)
[18:42:21] milimetric: yt?
[18:42:32] hi kevinator
[18:42:38] batcave?
[18:42:41] omw
[18:44:03] ottomata: You rock :)
[18:44:42] Ironholds: No precise doc on user_age
[18:44:44] i mean, sorta, i shoulda done that in the first place :/
[18:45:00] Right ... But at the end, you still got it :)
[18:45:36] Ironholds: user_agent_map and geocoded_data have a defined set of fields, but not documented
[18:45:57] joal, gotcha. They probably should be, but in the meantime, I worked it out thanks to kevinator :)
[18:46:05] ok cool
[18:46:10] They should defintely be
[18:46:12] the only reason I've been able to use geocoded_data at all is that I know where the java lives and read it ;p
[18:46:28] hmm
[18:46:48] Easiest would be to request the full map limit 1, and you got key names
[18:48:02] But they should be documented :)
[18:54:44] joal, yep ;p
[18:57:40] put it in the field comment?
[18:57:57] creating pages for doc as x_analytics now
[18:59:06] put it in hive table comments!
[18:59:07] for create table :)
[19:00:46] ottomata: already there for most of it
[19:30:51] hey ottomata, you got root on fluorine? ;)
[19:31:01] ha, yes
[19:33:57] any chance you could mkdir /a/public-datasets/ ?
[19:34:29] Writing some code to surface log aggregates for search; if someone with root can make that directory, my creation of an rsync connector will actually function
[19:34:33] and functioning is good
[19:53:08] joal: fyi, i'm going to play with this minimum-mb setting, going to restart RMs, hoping that HA RM makes this restarting thing easier
[20:00:39] ok ottomata
[20:00:39] On my side playing with spark
[20:00:39] let me know if you have impala jumping :)
[20:00:55] so, yinz are about to get some thoughts on the last-accessed cookie
[20:01:06] thoughts like "It's breaking things and you should've told other people you were attaching it to requests"
[20:01:07] hey everybody!
I noticed recently I've starting to get this error from my Java tool: 12:57:20.785 [main] WARN o.a.h.c.p.ResponseProcessCookies - Invalid cookie header: "Set-Cookie: WMF-Last-Access=06-May-2015;Path=/;HttpOnly;Expires=Sun, 07 Jun 2015 12:00:00 GMT". Invalid 'expires' attribute: Sun, 07 Jun 2015 12:00:00 GMT
[20:01:18] did we add non-standard cookie headers recently?
[20:01:21] kevinator, I would thoroughly recommend sending engineering@ a note about the cookie
[20:01:38] and if so, can it be done in a way that doesn't drive Java crazy? :)
[20:01:50] Ironholds: you mean the last-access cookie?
[20:02:01] yep; see SMalyshev's comments.
[20:02:25] looks like Java doesn't like the expires part...
[20:02:33] not sure why
[20:04:43] we didn't do anything non-standard for the expiry as far as I know...
[20:05:02] let me log a ticket and alert those who would know more about it than I
[20:05:44] BTW I'm going on vacation off the grid tomorrow so I need someone else look into this.
[20:06:15] milimetric: I have three tests failing on test_reports - test_full_report_create_and_result, test_report_result_json, and test_report_result_sum_only_csv. I can't figure out why anything I changed should affect these. Any insights?
[20:08:37] madhuvishy: well, it's not unthinkable, everything pretty much relies on the wiki_user table
[20:08:50] is your latest patch in gerrit?
[20:08:52] I can check it out
[20:09:03] not till the last changes. let me push
[20:11:02] ottomata: be carefull with the HA RM : webUI behind yarn.wikimedia.org soesn't follow ;)
[20:11:41] kevinator: ok, thanks
[20:12:14] SMalyshev: that's troubling, I was involved with adding that cookie
[20:12:17] SMalyshev: what version of Java are you using... and any library in particular to parse the cookie?
[20:12:47] it's odd that browsers seem fine with the cookie and Java doesn't like it. Is your code open?
[20:12:52] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266490 (kevinator) NEW
[20:13:17] kevinator: nothing special, Apache http client
[20:13:37] milimetric: I logged a phab task ^^ to investigate this
[20:13:53] yep, saw
[20:13:54] Analytics, Mobile-Web, Mobile-Web-Sprint-46-Taken:-The-Dan-Garry-Story: Instrument “tags” and anonymous gather pages to track engagement with browse. - https://phabricator.wikimedia.org/T94744#1266515 (bmansurov) a:bmansurov
[20:14:13] kevinator: the code is here: https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/wikibase/WikibaseRepository.java
[20:14:29] yeah, noticed joal, it has to be in analytics1001 for that
[20:14:50] kevinator: this is specifically the request part: https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/wikibase/WikibaseRepository.java#L221
[20:14:55] np, just saying :)
[20:14:57] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266528 (Milimetric) code that is having trouble with the new cookie: https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/wikiba...
[20:15:18] milimetric: see https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/wikibase/WikibaseRepository.java#L221
[20:15:41] ottomata: Thank you for solving the spark issue :0
[20:15:59] thanks SMalyshev, I updated the bug
[20:16:03] I now have a working job for aggregating pageviews on projects hourly :)
[20:16:17] 10mins to agg one hour, 4 workers 2G
[20:16:26] nice!
[20:16:41] milimetric: we don't really need the cookies so I'll probably try to tell Apache client to ignore it but other Java-based clients may have this issue too
[20:17:37] SMalyshev: yeah, since this cookie is going out to everyone, this issue matters
[20:17:42] thanks for raising the issue
[20:18:04] np :)
[20:19:41] bed time !
[20:19:48] talk to you guys tomorrow :)
[20:19:54] have a good evening
[20:19:56] nite
[20:20:01] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266574 (Smalyshev) Just to be clear and save digging in the code, the client used is org.apache.http.client, via org.apache.http.client.methods.HttpGet and org.apache.http.impl.client...
[20:20:03] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266577 (BBlack) I *think* that Expires field is formatted correctly... does someone have a handy link to whatever deeper java code actually parses it and throws that error?
[20:21:33] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266596 (Milimetric) @BBlack: I think it's this code here: https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/wikibase/Wikibase...
[20:22:28] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266598 (BBlack) I mean, the org.apache.http.... code that actually parses the Set-Cookie
[20:23:34] (PS5) Madhuvishy: WIP: Fix validate again functionality on cohort display page [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/206346 (https://phabricator.wikimedia.org/T78339)
[20:23:46] milimetric: done. ^
[20:24:03] thx
[20:24:05] * milimetric looks
[20:25:29] madhuvishy: tests/test_controllers/test_reports.py all passes for me
[20:25:38] ha
[20:26:24] madhuvishy: did you make sure the wikimetrics-queue service was stopped so the tests could create their own?
[20:26:36] milimetric: aah I've to do that.
[20:26:54] oh yeah, that'll mess up a lot of tests
[20:26:55] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266609 (Smalyshev) @BBlack It looks like Wikipedia examples are different :) https://en.wikipedia.org/wiki/HTTP_cookie#Expires_and_Max-Age ``` Set-Cookie: lu=Rg3vHJZnehYLjVg7qi3bZjzg...
[20:28:20] milimetric: okay now I have 1 error and 2 failures
[20:28:51] milimetric: both failures are with the arabic/cyrilic names. I can't find what's happening there
[20:29:16] madhuvishy: wanna hangout?
[20:29:21] sure
[20:33:39] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266648 (BBlack) Ok, fair enough. The dashes seem to be more-common practice in any case. Working out a fix (too bad Varnish uses the spaces in its default formatting!).
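(The client-side workaround SMalyshev floats at 20:16:41 -- telling the Apache client to ignore cookies entirely, since his tool doesn't need them -- looks roughly like the sketch below with HttpClient 4.3+, where the cookie-spec constants live in org.apache.http.client.config. This is a hedged illustration, not the actual wikidata-query-rdf patch; the URL is only an example. The server-side fix BBlack is working out above, reformatting the Expires date, is the complementary approach:)

    import org.apache.http.client.config.{CookieSpecs, RequestConfig}
    import org.apache.http.client.methods.HttpGet
    import org.apache.http.impl.client.HttpClients

    object NoCookieClient {
      def main(args: Array[String]): Unit = {
        // Skip cookie processing entirely, so the WMF-Last-Access Set-Cookie
        // header is never parsed and the "Invalid 'expires' attribute"
        // warning from ResponseProcessCookies goes away.
        val config = RequestConfig.custom()
          .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
          .build()
        val client = HttpClients.custom()
          .setDefaultRequestConfig(config)
          .build()
        val response = client.execute(new HttpGet("https://en.wikipedia.org/w/api.php"))
        try println(response.getStatusLine) finally response.close()
      }
    }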
[20:41:24] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266726 (Smalyshev) Weird, it looks like by default Apache client should use this cookie spec class: https://svn.apache.org/repos/asf/httpcomponents/httpclient/tags/4.4.1/httpclient/sr...
[20:48:29] RECOVERY - Throughput of event logging events on graphite1001 is OK Less than 15.00% above the threshold [500.0]
[20:53:33] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266790 (BBlack) Well I don't know what the language of those patterns is, but it looks plausible that dashes alone may not fix this, if they expect two-digit years along with them. I...
[20:54:51] Analytics-Cluster, Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396#1266794 (BBlack) (also, I think we're actually using apache's 4.4 not 4.4.1 in our stuff? in case it makes any diff)
[21:20:00] milimetric mforns - Ran 374 tests in 48.242s OK. Just got this. o/
[21:20:18] madhuvishy, woohoo!
[21:20:21] yay :)
[21:20:42] most underestimated change ever :)
[21:20:48] milimetric: Yepp.
[21:28:43] (PS6) Madhuvishy: WIP: Fix validate again functionality on cohort display page [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/206346 (https://phabricator.wikimedia.org/T78339)
[21:30:13] milimetric: I'll move this to code review? We can have a conversation about all the changes/I can continue making any as we see fit.
[21:30:52] madhuvishy: yeah that sounds good
[21:31:15] you can feel free to move to other stuff, I'll do a review of it but I am really behind with everything since my trip
[21:31:27] trying to catch up, I appreciate everyone's patience until then
[21:31:46] milimetric: yes, sure no problem at all.
[21:35:29] kevinator: Do you know whether this is safe to be merged? https://gerrit.wikimedia.org/r/#/c/209371/
[21:36:42] Deskana: I don't know. Can you ask mforns tomorrow? He is on EL duty this week.
[21:36:56] Deskana: BTW I'm going on vacation in a few hours so I can't follow up on this
[21:37:10] kevinator: Okay. Thanks.
[21:38:57] Deskana, I'll have a look at this tomorrow, sure
[21:39:52] kevinator, enjoy your vacation!
[21:44:58] kevinator: milimetric - any suggestions on how I can contribute to the doc sprint?
[21:45:33] madhuvishy: Well, do you know about the etherpad?
[21:45:40] http://etherpad.wikimedia.org/p/analytics-docTD-swarm
[21:45:42] milimetric: yes. i see the backlog
[21:46:15] yeah, you can take a look at the articles still in the backlog, and if you're comfortable doing something with them, do it
[21:47:01] milimetric: ummm.. okayy
[21:47:46] :) madhuvishy i mean, it's ok if you just end up reading some of that stuff
[21:47:56] this is kind of like ancient history, cleaning out someone's basement or something
[21:48:05] milimetric: he he that is what i anticipate will happen
[21:48:14] to me, so far, the most amazing thing is how much of the same stuff we were saying 3-4 years ago
[21:48:36] ha
[21:52:21] milimetric: the heady days of Limn and Kraken? :)
[21:52:55] you laugh, but if our "plan" 4 years ago looks like our "plan" now, we've got some serious thinking to do
[21:53:41] +1
[21:54:07] true that
[21:54:50] milimetric: I must say that execution now is very different from execution 4 years ago and that’s far more important...
[21:57:04] yuvipanda: i think we've just taken a detour to do some ad-hoc work that people found valuable.
But we're still left without a real data pipeline. And I think that's what we really need. We just need to not call it Kraken and scare the crap out of everyone :)
[21:57:30] milimetric: :) all true.
[21:57:44] I mean, we spent months working on wikimetrics. And although it's a fairly useful project to a handful of people, it's not something that comes even close to a wikistats replacement, which is used by tens of thousands of people monthly
[21:59:39] YUP
[21:59:50] He says having been asked twice today, by different volunteers, where our UA data at
[22:03:20] Analytics-Cluster, Analytics-Kanban, Easy: Build component for Oozie jobs to sends e-mails - https://phabricator.wikimedia.org/T88433#1267102 (madhuvishy) a:madhuvishy
[22:05:38] milimetric: whatever happened to dashik, btw?
[22:08:34] yuvipanda: dashiki's great, was on hold while we did some VE stuff but we'll have a new release that's much closer to the original vision
[22:08:45] milimetric: ah sweet
[22:08:46] nuria and I think it's got some real potential
[22:47:11] (PS7) Madhuvishy: WIP: Fix validate again functionality on cohort display page [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/206346 (https://phabricator.wikimedia.org/T78339)
[22:55:40] (PS4) Madhuvishy: [WIP] Use wmfuuid from x_analytics_map for app install id if available else default to the query parameter Todo - test if it works [analytics/refinery] - https://gerrit.wikimedia.org/r/207689 (https://phabricator.wikimedia.org/T96926)
[22:56:25] milimetric: what happens if i click on rebase change in gerrit :/
[22:57:22] or yuvipanda ^ ?
[22:57:40] Nothing bad, but it will create a new patchset by replaying your current patch on whatever the base is
[22:57:52] If it has conflicts, it will error out
[22:58:11] milimetric: hmmm. okay.
[22:58:32] milimetric: i really need to wrap my head around gerrit stuff. still struggling
[22:58:41] If no conflicts, it will add the patchset and if you want to keep going from that point, you'll have to first fetch and checkout that patchset
[22:59:11] milimetric: hmmm, alright
[22:59:18] madhuvishy: if it helps, it took me about six months to stop hating it
[22:59:59] milimetric: aah. i don't know if I hate it yet - telling myself I'm stupid at this point
[23:02:55] You're not. Gerrit just makes things hard for a very persnickety reason. I have philophical issues with it but it does what it does well
[23:08:40] (PS5) Madhuvishy: Use wmfuuid from x_analytics_map for app install id if available else default to the query parameter [analytics/refinery] - https://gerrit.wikimedia.org/r/207689 (https://phabricator.wikimedia.org/T96926)
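(The fetch-and-checkout step milimetric describes at 22:58:41 follows Gerrit's standard refs/changes scheme. For patchset 7 of change 206346 (the PS7 shown above), it would look something like: git fetch https://gerrit.wikimedia.org/r/analytics/wikimetrics refs/changes/46/206346/7 && git checkout FETCH_HEAD -- where "46" is the last two digits of the change number and "7" is the patchset. The ref layout is standard Gerrit convention; the exact remote URL here is an assumption.)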