[09:37:50] 10Analytics: Create robots.txt policy for datasets - https://phabricator.wikimedia.org/T159189#3059607 (10Peachey88) Is there any reason we are actually concerned about bandwidth usage? [10:30:21] 10Analytics-Tech-community-metrics, 10Phabricator, 06Developer-Relations (Jan-Mar-2017): Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#3060641 (10Aklapper) p:05Low>03Normal a:03Aklapper [10:32:17] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Identify Wikimedia's most important/used info panels in korma.wmflabs.org - https://phabricator.wikimedia.org/T132421#2197987 (10Aklapper) p:05Low>03Normal [10:32:40] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Go through default Kibana widgets; decide which ones are not relevant for us and remove them - https://phabricator.wikimedia.org/T147001#3060649 (10Aklapper) p:05Low>03Normal [11:10:03] hello a-team! [11:10:27] Moritz upgraded apache on thorium and I was reviewing all the websites as consistency check [11:10:30] but https://analytics.wikimedia.org/dashboards/browsers/ looks weird [11:10:50] same thing for https://analytics.wikimedia.org/dashboards/vital-signs/#empty [11:10:59] (they seems empty or not functioning correctly) [11:11:11] probably I am missing something but can somebody please double check? [11:38:00] elukey: it indeed look weird ! [11:38:50] both browser report and vital sign seem broken :( [11:40:26] elukey: Also, do you want me to pause / stop oozie jobs in preparation for cluster upgrade? [11:41:27] 10Analytics, 10Analytics-Cluster: Apply Xms Java Heap settings to all the Hadoop daemons - https://phabricator.wikimedia.org/T159219#3060783 (10elukey) [11:45:02] joal: we can do it later if you want, it is just a suspend and wait.. so we'll let a bit more jobs going [11:45:14] sure elukey [11:45:16] but I am fine whatever you want to do :) [11:45:18] you are the master [11:45:22] huhu [11:45:35] elukey: about website, what can we try to do? [11:46:07] I checked apache logs and JS console logs in Chrome but nothing comes up [11:46:19] it might be a subtle JS problem [11:46:26] I am going to check the apache changelog [11:47:46] elukey: it's as if there was no data / nothing to display [11:53:33] joal: I restarted apache reverting a new setting (that is the big change) but nothing changes, so it must be something else.. [11:54:22] everything else seems to work fine [11:54:38] maybe we'd need to wait for Marcel [11:54:53] (03CR) 10Joal: "2 comments:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340093 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans) [11:55:37] elukey: marcel is off this week :( [11:55:45] elukey: I think we need Dan [11:56:26] argh you are righttttt [11:57:58] (03PS3) 10Fdans: Use v2 table in Cassandra, switch to padded day timestamp [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340093 (https://phabricator.wikimedia.org/T156312) [11:58:39] we can ping our new JS expert and lover fdans!! [11:58:55] what do you mean by your lover? [11:59:05] 😛 [11:59:21] nono JS lover, don't read sentences as you would like to [11:59:27] hahahah [11:59:32] :D :D :D [11:59:52] elukey looking... 
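For reference, the "pause / stop oozie jobs in preparation for cluster upgrade" step discussed above is normally a one-liner against the Oozie CLI. A minimal sketch, where the bundle id is a placeholder and the OOZIE_URL value is an assumption, not taken from the log:

    # Suspend a running bundle before the upgrade, resume it afterwards.
    # <bundle-id> is a placeholder; real ids come from the `oozie jobs` listing.
    export OOZIE_URL=http://localhost:11000/oozie   # assumption: local Oozie server URL
    oozie jobs -jobtype bundle -filter status=RUNNING   # list candidate bundles
    oozie job -suspend <bundle-id>
    # ... upgrade work happens here ...
    oozie job -resume <bundle-id>

Suspending (rather than killing) keeps the coordinator state so the backlog of hours is processed automatically once the bundle is resumed, which matches the "just a suspend and wait" approach described above.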
[11:59:58] jokes aside, I am checking why https://analytics.wikimedia.org/dashboards/browsers/index.html and https://analytics.wikimedia.org/dashboards/vital-signs/#empty looks weird [11:59:59] (03CR) 10Joal: [C: 04-1] "Comments inline" (032 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/339419 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans) [12:00:14] Hi fdans, thanks for the changes :) [12:00:18] I am not sure if they were working before the apache upgrade though [12:00:24] all the other websites are good [12:03:22] so I'm looking at https://analytics.wikimedia.org/dashboards/browsers/ [12:03:48] and the following request is getting a 400 https://piwik.wikimedia.org/piwik.php?action_name=Simple%20Request%20Breakd…p=0&wma=0&dir=0&fla=1&java=0&gears=0&ag=0&cookie=1&res=3360x2100>_ms=148 [12:06:07] Hey yall [12:06:45] Hm... piwik shouldn't stop stuff from working... [12:07:03] I double checked after deploying last and everything was ok [12:07:12] Lemme take a closer look [12:08:05] Thanks fdans and milimetric (i'm really bad at front end debugging :( [12:08:17] I got it, no worries [12:09:23] of course they did :) [12:09:35] the pages on meta vanished [12:09:38] (that stored the configs) [12:09:46] after we deployed dashiki last night probably [12:09:54] that doesn't make sense, must've been moved [12:10:54] (talking about pages like https://meta.wikimedia.org/wiki/Config:VitalSigns) [12:11:42] also, I got permissions revoked on meta.... [12:11:44] wtf is going on [12:12:59] ok so this is not related to the apache upgrade [12:13:02] no [12:13:18] choices are: migrate all configs from backups to the new Config:Dashiki namespace [12:13:45] or... I guess that's the only choice because I don't have rights to create Config: pages anymore [12:13:53] maybe that's what happened during the deploy [12:14:13] which is not how it worked on beta... [12:14:29] biggest problem right now is how do I get old text... [12:14:54] sorry talking out loud, morning brain :) [12:15:01] nono it is really useful :) [12:15:09] I am wondering if another change wiped the config [12:15:13] and not the deployment [12:15:22] did you guys check the websites after the deployment? [12:15:33] no, didn't think to, was late last night [12:15:36] (just to understand when this mess could have happened) [12:17:09] ok, so page is still in the page table, that means it's being masked probably, and then it's definitely the deploy [12:17:43] yeah, 'cause if it was moved or deleted it would either have page_is_redirect or be in the archive table instead [12:18:39] wonder if the API will get me the text [12:18:50] duh obviously not, that's what dashiki does :) [12:18:51] hahaha [12:20:08] milimetric: I am going to step away a bit to eat something before the CDH upgrade, will be back in a bit if you need me ok? [12:20:23] yeah, no problem, elukey, this is all me I think [12:20:54] just gotta query and find this text, move to the Config:Dashiki: namespace, update Dashiki and re-deploy all dashboards [12:20:55] no biggie [12:21:35] super :) [12:21:48] * elukey lunch! [12:26:41] (03PS3) 10Fdans: Format timestamps correctly in per-project aggregation [analytics/aqs] - 10https://gerrit.wikimedia.org/r/339419 (https://phabricator.wikimedia.org/T156312) [12:28:51] hm, anyone know where I can get revision text from? joal did you happen upon the db that has that? [12:29:08] milimetric: which wiki? 
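The "masked vs. moved/deleted" reasoning above boils down to two quick checks against the wiki's database. A rough sketch using the mysql client, assuming access to a metawiki replica (host is a placeholder) and using Config:VitalSigns, the example page from the log, stored as a namespace-0 title with the literal "Config:" prefix:

    # Does the page still exist under its old title, and is it a redirect?
    mysql -h <metawiki-replica> metawiki -e "
      SELECT page_id, page_namespace, page_title, page_is_redirect, page_latest
      FROM page
      WHERE page_namespace = 0 AND page_title = 'Config:VitalSigns';"

    # If it had been deleted instead, its revisions would show up in archive:
    mysql -h <metawiki-replica> metawiki -e "
      SELECT ar_namespace, ar_title, ar_rev_id, ar_timestamp
      FROM archive
      WHERE ar_title = 'Config:VitalSigns';"

A row in page with no redirect flag and nothing in archive is consistent with the "masked by a newly registered Config: namespace" conclusion reached above.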
[12:29:10] like I have page_id, rev_id, rev_text_id on meta [12:29:28] give me a minute milimetric [12:31:11] milimetric: I can find that yes, but I don't have very recent stuff [12:31:20] oh from dumps [12:31:25] ummm, how recent? [12:31:25] milimetric: correct [12:31:37] milimetric: currently checking how recent I can get [12:32:06] milimetric: 2017-02-20 [12:32:09] (8 days ago) [12:32:21] oh plenty, all updated last before that [12:32:28] ok, great, where do I go? [12:32:51] milimetric: I need to import the file, then parse it with my utilities (shouldn't be long) [12:33:05] awesome, thx joal [12:33:24] milimetric: np, those things I'm working will at some point be useful ;) [12:33:32] I should really figure out where to get this from the db as well [12:35:27] milimetric: before I go for the full thing, can you give me the list of rev_ids? [12:36:17] milimetric: cause I actually don't have history data from 2017-02-02, but only from 2017-02-01 [12:36:47] So if rev_ids are the last ones for each page, should be ok, but I prefer double check [12:36:51] https://www.irccloud.com/pastebin/CdzOmB7r/ [12:37:46] milimetric: from the rev_timestamp, things should be ok even with 2017-02-01 [12:37:47] (sorry pasted some bad ones in there joal) [12:37:57] yep, all old [12:38:07] milimetric: waiting for an updat? [12:38:16] what do you mean? [12:38:23] milimetric: which ones should I take? [12:38:31] oh I pasted them in, you don't see it? [12:38:36] it's a pastebin [12:38:45] milimetric: I see them, but you said there were too many [12:38:50] ohoh [12:39:08] https://www.irccloud.com/pastebin/NQllyN5b/ [13:07:18] joal: how's it going? [13:07:47] milimetric: I got results, trying to wrote them [13:10:31] milimetric: stat1004:/home/joal/metamili.txt [13:11:03] thx joal, looking [13:11:07] format: rev_id\ntxt\n\n [13:11:54] milimetric: (probably old) but I saw this on #operations: 14:03 @ !log ran namespaceDupes on meta to fix some Config pages [13:14:17] yep, been talking to him in -databases, he's being very nice and explaining a lot [13:14:45] so the dashboards should be ok now that he did that actually [13:14:46] https://meta.wikimedia.org/wiki/Config:SimpleRequestBreakdowns [13:15:15] so we're not against the clock anymore, I'll slow down and see the right way to move these pages [13:15:30] milimetric: indeed, vital signs back in place [13:15:47] heh, Seinfeld's so smart. He said pain is lots of knowledge rushing in to fill a gap [13:15:52] what happened?? [13:16:12] you mean overall elukey? [13:25:56] yes, just curious [13:29:23] Hi ottomata [13:29:29] I'm sorry I missed your ping yesterday [13:29:33] hi! [13:29:55] ottomata: o/ [13:30:02] joal: I think that we can start the suspend procedure [13:30:15] elukey: sure ! [13:30:26] elukey: please go ahead with camus, I'll care oozie [13:30:38] sure [13:30:51] !log stopping camus as prep step for the CDH upgrade [13:30:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:31:05] !log Suspend webrequest-load bundle for CDH upgrade [13:31:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:31:21] bearloga: Hi ! [13:31:26] thanks! [13:31:36] joal: uhhhh i think iw as gonna talk about spark 2 stuff with ya [13:31:39] let's chat later [13:31:44] sure ottomata [13:31:45] about taht [13:32:49] bearloga: We are going to upgrade hadoop soon - Can you please stop launching requests (or they'll fail soon) [13:35:23] ottomata: do we need to add the apt source.d config before starting ? 
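As an aside to the "wonder if the API will get me the text" question above: revision wikitext can usually be fetched directly by rev_id, independent of the current title, as long as the revisions themselves were not deleted. Whether that would have worked in this masking situation is not settled in the log; a minimal sketch with curl, using a made-up rev_id rather than one from the pastebin:

    # Fetch the wikitext of specific revisions by id from meta.
    # 12345 is a placeholder rev_id; several ids can be passed separated by |.
    curl -s 'https://meta.wikimedia.org/w/api.php?action=query&prop=revisions&revids=12345&rvprop=ids|content&format=json'
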
[13:35:35] yeah [13:35:37] on that in a min... [13:35:42] super [13:36:58] want to get through morning emails before starting :) [13:38:37] sure sure [13:38:43] I am going through the prep steps [13:38:48] (silencing alarms, etc..) [13:41:42] k awesome :) [13:48:02] joal: can't stop to my knowledge; they're automatically spawned by reportupdater and crontab :\ no big deal if they fail, though :) [13:48:16] bearloga: ok :) [13:54:52] all right puppet disabled and all hosts silenced (drud100[123], an1027->an1057, stat100[234]) [13:57:50] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Kibana's Mailing List data sources do not include recent activity on wikitech-l mailing list - https://phabricator.wikimedia.org/T146632#3061018 (10Aklapper) 05Open>03Resolved Thanks Lcanasdiaz! Closing as resolved as I can confirm t... [13:58:30] great [13:58:55] elukey: pcc looks good to me [13:59:05] (cept drud1001 :p) :) [13:59:21] haha, elukey actually we probably want to RUN puppet everywhere [13:59:23] to pick up this change :p [13:59:26] i guess we can do it one by one [13:59:28] as we upgrade them [13:59:29] yaaa [13:59:53] ah yeah I forgot that :D [14:00:17] yeah pcc looks good to me too! [14:00:26] drud is not good :P [14:00:27] ottomata: still a webrequest job running on cluster [14:00:56] Arf, actually, gone as I speak :) [14:01:01] yeah i see that joal [14:01:01] we'll wait for it to finish [14:01:01] oh great! [14:01:06] joal: you are suspending the jobs then? [14:01:09] elukey: hm [14:01:16] ooooook [14:01:17] ottomata: Did that a while ago [14:01:27] great :) [14:01:32] eliu icinga is silenced? [14:01:35] elukey: ? [14:02:14] !log Suspend mediawiki-load jobs as well (forgot about those) [14:02:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:02:42] ottomata: I updated the etherpad, all silenced :) [14:02:50] I think I'd need to re-enable puppet for the moment [14:02:55] great :) [14:03:03] gimme 1 min [14:03:06] k [14:03:10] joal: i see a few bundles running [14:03:11] still [14:03:16] they ok? [14:03:55] and also quite a few coords (but they might belong to the bundles, not sure) [14:03:56] they should be blocked by webrequest, joal is a fan of minimal suspention [14:04:00] oh ok [14:04:07] even the elasticsearch ones? [14:04:20] 'transfer_to_es-eqiad-coord' ? [14:04:21] etc.? [14:04:28] don't know them, probably it is safer to suspend [14:04:47] hm, i think they are blocked by camus stop too [14:04:48] pretty sure those come from mw avro search stuff [14:04:49] ottomata: I think they depend on cirrussearch being loaded, but not sure [14:04:53] which is imported by camus [14:04:54] yeah [14:04:56] ottomata: give me aminute [14:05:20] /wmf/data/discovery/popularity_score/ ? [14:05:20] hm [14:05:30] puppet re-enabled! [14:05:35] ottomata, joal: feel free to disable this one if it causes issues [14:05:54] no real issue, just trying to prevent disabling everything:) [14:06:00] ok :) [14:06:07] ok i think those come from pageview_hourly [14:06:15] i think we are good joal, afaict [14:06:45] sounds good indeed ottomata [14:06:51] pv hourly -> discorvery popularity score -> transfer to es [14:07:05] no prod job running on the cluster, ready to be stopped [14:07:11] correct ottomata [14:07:20] elukey: ok, let's run puppet everywhere [14:07:20] and then disable, ja? [14:07:51] ottomata: change already merged? [14:08:00] ohp... [14:08:42] it is now elukey :). oh we should be careful about puppet vs. 
camus crons on an27 [14:09:11] yeah :( [14:09:20] all right running puppet! [14:09:25] ooook! [14:10:05] also we'd need apt-get update [14:10:16] (not sure if puppet does it automagically) [14:10:21] puppet should do it [14:10:29] pretty sure puppet runs update every run [14:10:31] Scheduling refresh of Exec[apt-get update] [14:10:35] :) [14:15:05] 1027 and all the worker nodes done, just a min to finish the rest [14:15:35] cool [14:15:56] nice, see Version: 2.6.0+cdh5.10.0+2102-1.cdh5.10.0.p0.72~trusty-cdh5.10.0 avail [14:15:58] great :) [14:17:42] ooook [14:18:00] bearloga: joal ok if i kill that job? [14:18:11] i know you said it was above, but i just want to double check [14:18:15] ottomata: I asked earlier, bearloga said yes [14:18:19] Ah ok :P) [14:18:23] ok [14:19:29] killed [14:19:55] ok elukey ready when you are, i guess i'll follow along? if you want me to do a part let me know? [14:21:01] mmmm an1003 also reports W: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/dists/trusty-backports/Release.gpg Could not resolve 'ubuntu.wikimedia.org' [14:21:08] that is not a blocker but let's remember to check it [14:21:11] hm [14:21:23] that's weird [14:21:24] I think that the domain was killed [14:21:26] just an03? [14:21:49] maybe that's not related to the thirdparty-cloudera stuff [14:22:01] ah no all of them [14:22:08] nono I don't think it is [14:22:11] it was there before [14:22:26] maybe we were using backports from ubuntu.w.o and it has been killed? [14:22:39] i see that domain in sources.list on these hoses [14:22:41] hosts [14:22:43] hm [14:22:51] ok, dunno, let's move forward though, seems unrelated [14:23:53] elukey, ottomata: Can I take a 1 hour brake or do you need me around? [14:24:35] /etc/apt/sources.list.distUpgrade [14:24:35] joal: ya no problem, go ahead [14:24:52] joal: we always need you but you can go :D [14:24:53] k, will be back for restart (normnally:) [14:25:02] * joal blushes [14:25:35] anyhow, super weird but we can proceed [14:25:53] ja [14:27:03] one min for final checks and I'll start [14:28:58] ottomata: ready! [14:29:38] if you are ok I am going to proceed with the stop of the daemons [14:32:14] proceed! [14:32:17] elukey: ! [14:34:03] * elukey proceeds! [14:38:48] elukey@stat1002:~$ sudo lsof -n | grep "mnt/hdfs" [14:38:48] bash 16517 ebernhardson cwd DIR 0,24 4096 17590204 /mnt/hdfs/user/ebernhardson/discovery-analytics/current [14:39:01] I think I can just kill this right [14:39:01] ? [14:39:14] it is preventing me to umount /mnt/hdfs [14:39:44] ottomata: --^ [14:39:59] hm [14:40:00] oh [14:40:03] yea [14:40:04] kill that [14:40:30] gooood [14:48:10] (03PS1) 10Milimetric: Move dashboard configuration to Config:Dashiki: [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/340318 [14:49:12] ottomata: interesting thing :) [14:49:13] sudo -i salt 'analytics*' cmd.run "for service in $(ls /etc/init.d/hadoop-*); do echo $service; done" [14:49:25] (it is a modified version) [14:49:36] tries to eval the $() as local on neodymium [14:49:50] that is not what I expected :D [14:50:00] (03CR) 10Milimetric: [V: 032 C: 032] Move dashboard configuration to Config:Dashiki: [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/340318 (owner: 10Milimetric) [14:50:28] 10Analytics-Tech-community-metrics: "Email senders" widget empty though "Mailing Lists" widget states that there are email senders - https://phabricator.wikimedia.org/T159229#3061153 (10Aklapper) [14:50:28] hm, also, won't ls /etc/init.d/hadoop-* do full path [14:50:31] ? 
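The /mnt/hdfs unmount dance a bit further up (lsof showing a shell parked inside the fuse mount, blocking umount) is essentially "find whoever holds the mount, kill it, unmount". A small illustrative sketch; the fuser variant is an assumption about what is available on the host, not something used in the log:

    # List processes holding /mnt/hdfs open (e.g. a shell whose cwd is inside it).
    sudo lsof -n | grep /mnt/hdfs

    # Kill the offending pid(s) from the lsof output, then unmount the fuse mount.
    sudo kill <pid>            # <pid> is a placeholder
    sudo umount /mnt/hdfs

    # fuser can combine the lookup and kill if preferred:
    # sudo fuser -km /mnt/hdfs
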
[14:50:34] oh sorry [14:50:40] you aren't doing service $service [14:50:44] you are just lsing them? [14:51:00] yeah I wanted to know why it wasn't doing anything or trying to stop "stop" :D [14:51:08] so I modified the script [14:51:16] IIRC I never encountered this issue [14:51:21] oh ha [14:51:28] oh you have the cd in there now [14:51:29] right? [14:51:32] wait, am confused [14:51:38] 10Analytics-Tech-community-metrics: Fix incorrect mailing list activity of AKlapper (=Phabricator) in Technical Community Metrics user data - https://phabricator.wikimedia.org/T132907#3061165 (10Aklapper) 05Open>03Resolved a:03Aklapper This specific issue is not a problem anymore (maybe fixed by {T146632}... [14:51:38] are you running what is in the etherpad? [14:51:45] I tried but it fails [14:52:23] example [14:52:24] elukey@neodymium:~$ sudo -i salt 'analytics*' cmd.run 'for service in $(cd /etc/init.d; ls hadoop-*); do echo $HOSTNAME $service; done' [14:52:27] analytics1049.eqiad.wmnet: neodymium neodymium [14:52:31] horrible paste [14:52:40] but you get what happens [14:53:17] OHHHH [14:53:17] i see, hm, but its in single quotes [14:53:17] checking too [14:53:32] hm, it works for me [14:53:33] hm [14:53:47] really? [14:54:07] maybe it is tmux? [14:54:07] elukey: its the -i [14:54:07] on your sudo [14:54:07] don't know why [14:54:07] but if you remove that, it works [14:54:18] I tried even with -i [14:54:21] *without [14:54:23] weird [14:54:31] I'll redo yours without -i [14:54:33] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3061177 (10Aklapper) >>! In T157898#3024337, @Lcanasdiaz wrote: > @Aklapper I confirm this is broken right now. I'm appliying it manually today and talkin... [14:54:56] it works! [14:54:59] weeeeeird [14:55:08] 10Analytics-Tech-community-metrics, 07Regression: Git repo blacklist config not applied on wikimedia.biterg.io - https://phabricator.wikimedia.org/T146135#3061178 (10Aklapper) >>! In T146135#2815705, @Lcanasdiaz wrote: > I confirm that blacklist is not working. Working on it .. @Lcanasdiaz: Any news to share? [14:55:14] thanks, I'll fix my ignorance later on :) [14:55:21] no, remove -i [14:55:21] oh? [14:55:21] sudo salt 'analytics*' cmd.run 'for service in $(cd /etc/init.d; ls hadoop-*); do echo $HOSTNAME $service; done' [14:55:21] analytics1049.eqiad.wmnet: [14:55:21] hadoop-hdfs-datanode [14:55:21] hadoop-yarn-nodemanager [14:55:57] yes yes without the -i works for me too [14:56:00] no idea why [14:56:46] proceeding with the backups [14:57:15] k [15:00:21] ready to upgrade packages [15:00:28] proceeding with journal nodes [15:03:27] uh, what... who made this: http://datavis.wmflabs.org/where/ [15:03:45] weird elukey sorry accidentaly quit irc and couldnt' get back on [15:04:13] oliver ... [15:05:37] ottomata: just upgraded the journal nodes, all good! [15:06:15] yeehaw cool! [15:06:43] going for 1001 and 1002 [15:07:09] coo [15:08:04] oh elukey were we going to do fstab now too? [15:08:08] or wait til we upgrade to jessie? [15:08:19] oh no [15:08:20] you already did that?! [15:12:33] a-team: did you all want to meet in 20 minutes to discuss solving this webrequest-counting problem in hadoop? 
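For the record, the salt quoting puzzle above is consistent with `sudo -i` pushing the whole command line through the target user's login shell, which gives the single-quoted `$(...)` an extra round of expansion on neodymium before salt ever ships it to the minions; that reading is inferred from the observed behaviour, not stated in the log. The two variants side by side:

    # Expands $(...) locally on the salt master (the extra login shell from -i eats the quotes):
    sudo -i salt 'analytics*' cmd.run 'for service in $(cd /etc/init.d; ls hadoop-*); do echo $HOSTNAME $service; done'

    # Expands $(...) on each minion, as intended:
    sudo salt 'analytics*' cmd.run 'for service in $(cd /etc/init.d; ls hadoop-*); do echo $HOSTNAME $service; done'
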
[15:12:58] (I had put it on the team calendar but it was last-minute so no pressure) [15:13:05] ottomata: I was planning to do it after reimaging the nodes to debian [15:13:16] so avoid adding too many things [15:13:39] ok col [15:13:40] I only ran the script on an1039 to test and reboot the new fstab [15:13:42] yeah makes sense elukey [15:13:42] cool [15:13:45] super [15:13:52] milimetric: we doing upgrade today/now so prob not us [15:13:54] in the meantime, I am upgrading the workers [15:13:57] elukey: great [15:14:05] k, postponing [15:17:17] 10Analytics, 10Analytics-Dashiki: Create dashboard for upload wizard - https://phabricator.wikimedia.org/T159233#3061266 (10Milimetric) [15:34:36] elukey: how we doin? [15:34:50] I am testing HDFS! [15:35:02] I had to remove the tmp/etc.. dirs since it was complaining [15:35:05] but all seems good [15:35:42] yeppp all good, just finished the checks [15:35:54] ottomata: proceeding with stat* [15:36:33] back a-team [15:36:45] how is it going hadoop-prod-guys? [15:36:54] oh ya [15:36:57] ha from a previous test :) [15:36:59] great [15:37:03] milimetric: We can meet if you want, but might be better with ottomata and elukey [15:37:17] yeah, let's wait for everyone [15:37:43] milimetric: just to be sure: it's the rest thing, or something else? [15:37:56] joal: going well. elukey is driving 100%, i hear good things :) [15:38:13] yeah, it seems counter-productive to start setting up infrastructure that is hardening the tech debt we all want to get rid of anyway [15:38:25] awesome ottomata :) Thanks a lot elukey for driving ! [15:38:36] makes sense milimetric [15:38:50] milimetric: I had not seen the map in tools, it's cool ! [15:39:03] it's super old [15:39:18] I think [15:39:28] 2014 from what I read [15:39:43] I'd love to get a handle on all this and centralize / simplify [15:40:09] so I am proceeding with stat, druid and an1003/1027 nodes [15:40:10] milimetric: the curse and the blessing of openness: plenty, but messy :) [15:40:44] milimetric: are you talking about the services api metrics thing? [15:40:55] yes [15:40:59] ah [15:41:27] joal: restarting druid (one node at the time) [15:41:33] k elukey [15:41:47] I thought it was a hadoop upgrade ! [15:42:00] joal: druid uses hdfs! [15:42:09] ottomata: I know ;) [15:42:11] haha [15:45:14] (03CR) 10Joal: "One last thing, thanks again fdans :)" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/339419 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans) [15:47:25] joal: so, spark 2.x Qs, in general vs. flink etc. [15:47:41] so, i played with newer flink the other day, and it is v cool, but i'm wondering how much adoption we could get on it [15:47:42] for streaming [15:47:46] since it doesn't support python for streaming [15:48:01] and, maybe spark 2.x streaming would be good enough for our use cases? [15:52:57] (03PS4) 10Fdans: Format timestamps correctly in per-project aggregation [analytics/aqs] - 10https://gerrit.wikimedia.org/r/339419 (https://phabricator.wikimedia.org/T156312) [15:55:28] ottomata: I noticed that on worker nodes spark is still 5.5 [15:55:40] so I guess I'd need to add spark and spark-core in the apt-get install right? 
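A quick way to spot the kind of "spark is still 5.5" leftovers mentioned just above is to compare installed and candidate versions per package before and after the upgrade. A minimal sketch; the package names simply mirror the ones discussed in the log:

    # Installed versions of the CDH packages on this worker:
    dpkg -l | grep -E 'spark|hadoop|hive'

    # Installed vs. candidate version from the apt repo, per package:
    apt-cache policy spark-core spark-python hadoop-hdfs-datanode
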
[15:56:04] huh yes [15:56:44] elukey: good catch [15:56:45] spark-core spark-python [15:56:46] spark-core and spark-python [15:56:46] i guess [15:56:48] yeah [15:56:50] all right :) [15:56:56] adding it to command in etherpad [15:56:58] so we don't forget next time [15:58:00] ah also bigtop-tomcat, flume-ng, hbase, kite, solr and sqoop [15:58:05] oh ok [15:58:06] ottomata: --^ [15:58:08] adding [15:58:25] flume-ng? [15:58:29] hmmm actually [15:58:46] i think you can uninstall flume-ng [15:59:08] not sure what that is :D [15:59:12] but I am reporting :) [15:59:29] aye, uninstall flume-ng, the others can be upgraded [15:59:37] we don't use flume, andi really doubt anything depends on it [15:59:45] its a data injestion pipeline thing [15:59:47] for an100[12] bigtop-tomcat, bigtop-utils, hive, hive-jdbc [16:00:05] hm ok [16:00:17] a-team: any issue if I skip standup? I'd prefer to keep going with the upgrade [16:00:28] good for me elukey [16:00:54] elukey: np [16:01:01] ottomata: you coming to standup [16:04:01] ottomata: I think that flume-ng is a spark dep [16:05:15] really? [16:05:31] bwh! [16:05:31] it is [16:05:32] ok [16:05:33] welp [16:05:34] crazy [16:05:39] i guess upgrade it then [16:06:47] yeah :( [16:12:28] ottomata: I think that flume-ng is in the blacklist of the cdh repo [16:12:32] it doesn't get upgraded [16:13:07] bwah yeah [16:13:09] ok fixing [16:21:31] elukey: did spark-python get upgraded? [16:21:37] yeah [16:21:37] it isn't whitelisted explicitly [16:21:39] ok [16:21:40] hm [16:21:55] elukey@analytics1039:~$ dpkg --list | grep spark [16:21:55] ii spark-core 1.6.0+cdh5.10.0+457-1.cdh5.10.0.p0.71~trusty-cdh5.10.0 all Lightning-Fast Cluster Computing [16:21:58] ii spark-python [16:22:12] missed one version but it is the same [16:22:15] ok merged elukey try apt-get update and install [16:22:18] super [16:22:50] oh wait [16:22:56] i gott run puppet on apt how [16:22:56] host [16:25:05] ok we good elukey https://apt.wikimedia.org/wikimedia/pool/thirdparty/cloudera/f/flume-ng/ [16:25:18] ottomata: piece of cake now [16:25:57] thanks :) [16:26:20] I am going to restart the Yarn Node Manager since I can see the old spark dependencies in lsof [16:26:28] *Managers [16:29:01] also the datanode [16:29:03] sss [16:29:04] sigh [16:53:24] joal: you there? [16:53:41] I am [16:54:15] what's up elukey ? [16:55:19] joal: would you mind to do a rapid spark test? [17:12:09] cluster upgraded people, all back and running! [17:22:08] ottomata: do you have a minute to help me debug some python on stat1002? [17:22:59] joal: sure [17:23:04] what's up? [17:23:31] ottomata: I'm trying to run the sqoop script (in refinery/bin/sqoop-...) with hdfs user -- no luck :( [17:23:43] works fine with my user, but not from hdfs user [17:23:50] seems a problem path related [17:24:19] export PYTHONPATH=/srv/deployment/analytics/refinery/python [17:24:20] ? [17:24:29] hm, will try [17:24:39] not sure if it will get carried over to the sudo shell though [17:25:13] Yay ! Worked :) [17:25:18] Thanks ottomata :) [17:25:54] great! [17:25:56] ottomata: I used the /bin/bash trick (did not try without sudo -u) [17:28:46] joal can you see this query result map graph? [17:28:49] just wonderin gif linking works [17:28:50] https://hue.wikimedia.org/notebook/editor?editor=15 [17:29:19] ottomata: nope: Document2 matching query does not exist. 
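The "/bin/bash trick" mentioned above for the sqoop script presumably amounts to setting PYTHONPATH inside the shell that sudo spawns, so the refinery python package is importable for the hdfs user. A sketch of that shape; the script name is left as a placeholder because it is truncated in the log:

    # Run a refinery script as hdfs with the refinery python package on the path.
    # refinery/bin/sqoop-<script> is a placeholder for the actual script name.
    sudo -u hdfs /bin/bash -c '
      export PYTHONPATH=/srv/deployment/analytics/refinery/python
      /srv/deployment/analytics/refinery/bin/sqoop-<script> --help
    '

Exporting PYTHONPATH before the plain `sudo -u hdfs ...` call would not help on its own, since sudo resets the environment by default; wrapping it in the child shell sidesteps that.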
[17:29:23] hm [17:29:37] ok must be per user [17:30:01] removed downtimes from icinga [17:30:02] yes, history seems to be per user [17:33:42] hmmm joal [17:33:43] try now [17:33:43] https://hue.wikimedia.org/notebook/editor?editor=15 [17:50:08] hey people, are all the oozie emails expected? [17:52:18] mmm it seems that two coordinators in misc/maps failed [17:52:22] restarting them [17:52:25] joal: ---^ [17:53:28] !log restarted via Hue Feb 2017 14:00:00 webrequest-load-coord-misc/maps [17:53:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:54:28] all right team I am logging off, will check back later if things are ok!! [17:54:31] byyyyeeee o/ [17:57:21] elukey: Back from lighting the fire, will look after oozie [18:03:52] !log restart pageview oozie job for 2017-02-28T12:00 [18:03:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:14:59] ottomata: looks like we have an issue :( [18:19:02] what's up? [18:19:05] joal: [18:19:16] oh looking [18:19:19] ottomata: looking at maps/misc failed jobs [18:22:59] hm am looking at 0000107-170228165458841-oozie-oozi-W [18:23:03] webrequest-load-wf-maps-2017-2-28-14 [18:23:14] i don't see any insightful errors [18:23:18] just that it was killed? [18:23:21] ottomata: Job failed, did almost the same [18:24:01] weirdo [18:24:15] even more weirdoh is the fact that we can't open the job in hue [18:24:56] but others ran? even the one before that [18:25:01] ran succesffully [18:25:09] ottomata: the ones AFTER that actually ran ! [18:25:10] and i'm pretty sure that is after we restarted oozie, right? [18:25:16] Started : 2017-02-28 17:20 GMT [18:25:19] that's an hour ago [18:25:20] so ya [18:25:30] that one is 0000023-170228165458841-oozie-oozi-W [18:25:34] webrequest-load-wf-maps-2017-2-28-13 [18:25:35] which ran [18:25:41] (unless you manually re-ran that?) [18:25:56] joal: ? [18:26:08] Elukey re-ran it, I did as well [18:26:40] oh [18:26:42] ? [18:27:09] So weird ! Misc finally did it correctly ... [18:27:24] hm, let's retry it for maps - I think I have an idea [18:27:33] ok [18:27:33] doing it ottomata [18:28:38] ottomata: I think there are good reasons for us to restart camus a while before oozie jobs [18:31:26] joal you think they got launched prematurely? [18:37:41] ottomata: I think the camus-checker can be messed-up if multiple hours are written qt the same time (but maybe not fully) [18:38:32] ottomata: It could flag an hour as done (because next has started), while not being fully imported [18:43:55] hm [18:44:15] i guess if a single partition finishes way before another? [18:44:23] but the offset/timestamps are checked for each partitoin, right? [18:44:43] * elukey sees that everything seems fine :) [18:45:34] yes elukey, everything back to normal [18:45:47] ok phew [18:45:51] thanks yall [18:47:14] super :) [18:47:18] * elukey afk again! [18:49:00] wikimedia/mediawiki-extensions-EventLogging#637 (wmf/1.29.0-wmf.14 - 838abb7 : Translation updater bot): The build has errored. [18:49:00] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/wmf/1.29.0-wmf.14 [18:49:00] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/206268405 [19:10:13] logging off for tonight a-team [19:10:29] Thanks again prod guys for the smooth version bump ! [19:14:26] laters joal! 
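The re-runs done through Hue above can also be done from the Oozie CLI, which is handy when Hue refuses to open a job. A minimal sketch; the coordinator id and action number are placeholders, not the real ones from the log:

    # Re-run a specific coordinator action, re-checking its input dependencies first.
    oozie job -oozie $OOZIE_URL -rerun <coordinator-id> -action 14 -refresh
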
[19:55:30] 10Analytics: Prototype counting of requests with real time (streaming data) - https://phabricator.wikimedia.org/T159264#3062177 (10Nuria) [19:56:15] 10Analytics: Prototype counting of requests with real time (streaming data) - https://phabricator.wikimedia.org/T159264#3062192 (10Nuria) [19:56:18] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062193 (10Nuria) [20:01:28] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062220 (10Nuria) >If you can budget some time to help us get access to the data as a stream or in Hadoop, then I think we should be able to work something out tha... [20:03:22] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062237 (10Pchelolo) @Nuria Could you explain a bit on what's the difference between refined data and raw data in this context? All we need here is URIs that we ca... [20:08:10] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062261 (10Ottomata) Yeah, in this case raw vs refined doesn't make a difference, but as part of a stream refinement, we had talked about splitting the firehose we... [20:18:31] 10Analytics, 10Analytics-Dashiki: Clean up remaining Dashiki configs on meta - https://phabricator.wikimedia.org/T159269#3062327 (10Milimetric) [20:22:21] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062369 (10Milimetric) +1 to what Andrew said. We don't want to block you on doing that. We will start building a simple infrastructure to do things like what yo... [20:25:08] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062381 (10Nuria) @Pchelolo: For at least two reasons I can think of: urls and hosts are "normalized" as part of refine process and It is likely that you want mo... [20:26:16] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3062390 (10Nuria) @Pchelolo: this is the bulk of code that runs as part of refinement process: https://github.com/wikimedia/analytics-refinery-source/tree/master/r... [20:38:47] 10Analytics: Prototype counting of requests with real time (streaming data) - https://phabricator.wikimedia.org/T159264#3062447 (10Milimetric) [20:45:45] 10Analytics: Prototype counting of requests with real time (streaming data) - https://phabricator.wikimedia.org/T159264#3062177 (10Ottomata) >- produce to a new topic >- count and send stats to grafana by [some granularity? hour/day...] These two can probably be made generic, but it would be important to be ab... [20:47:33] 10Analytics: Create robots.txt policy for datasets - https://phabricator.wikimedia.org/T159189#3062475 (10Milimetric) @Peachey88 not particularly, this is low priority, but it just seems like a bad idea to waste it for no reason, especially on larger files like datasets. I mean it downloads the whole thing just... [21:26:16] ottomata: is cluster fit to run jobs in again? 
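For the raw-vs-refined discussion in T122245 above, the batch equivalent of "count REST API entry points" is a straightforward query against the refined webrequest table. A rough sketch, assuming the usual wmf.webrequest layout with uri_host/uri_path and time partitions (column and partition names are from memory, not from the log):

    # Hourly counts of REST API (/api/rest_v1/) requests for one example hour.
    hive -e "
      SELECT uri_host, COUNT(*) AS requests
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND year = 2017 AND month = 2 AND day = 28 AND hour = 14
        AND uri_path LIKE '/api/rest_v1/%'
      GROUP BY uri_host
      ORDER BY requests DESC
      LIMIT 50;
    "

The streaming prototype described in T159264 would compute essentially the same aggregation continuously on the Kafka topic instead of hourly partitions.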
[21:43:52] (03PS1) 10Milimetric: Clean up Config [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 [22:27:12] nuria, yup! [22:27:13] all is well [23:15:44] not sure if its related to the cluster upgrade, but i 4 different runs of a job that does an hourly hive script have failed today. looking into what it is, early review of logs suggests being killed for exceeding memory [23:20:12] hmm, yea thats the problem, but unknown how related it is. Will just bump the memory to 1.5G or some such: Diagnostics report from attempt_1488294419903_1121_m_000000_2: Container [pid=75888,containerID=container_e42_1488294419903_1121_01_000005] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 5.0 GB of 2.1 GB virtual memory used. Killing container. [23:52:42] ottomata: thanks for the upgrade :) paws internal is doing good! is there info on what's changed and what new shiny things we get? [23:53:08] no hurry just curious
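The "bump the memory to 1.5G" fix above usually means raising the map container size (and the JVM heap with it) for that one Hive script rather than cluster-wide; the failed attempt ids in the log are map tasks, hence the map-side properties. A sketch of the knobs involved, with values mirroring the 1.5G mentioned; the property names are standard YARN/MR2 ones, not taken from the job itself:

    hive -e "
      SET mapreduce.map.memory.mb=1536;        -- container size ~1.5G
      SET mapreduce.map.java.opts=-Xmx1228m;   -- heap kept below the container limit
      -- ... the hourly query goes here ...
    "

Keeping the -Xmx value comfortably below the container size leaves headroom for off-heap usage, which is what the "1.1 GB of 1 GB physical memory used" kill message is really complaining about.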