[00:12:55] Analytics-Cluster, operations: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972202 (Dzahn) NEW [00:13:46] Analytics-Cluster, operations: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972217 (Dzahn) [00:16:09] (CR) Yuvipanda: [WIP] Database selection (2 comments) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [00:16:44] Krenair: minor changes. I can test it later today and if it works ok we can merge and deploy today/tomorrow [00:20:37] YuviPanda, what about the stuff in the commit message? [00:21:00] Analytics, operations: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015#1972277 (mforns) NEW [00:21:35] Krenair: what do you mean by 'this list' [00:22:13] the db lisgt [00:22:14] list* [00:25:43] Krenair: and by encoding you mean the JSON encoding? [00:26:26] YuviPanda, well look at the options it produces... [00:26:30] some messed up stuff in there [00:26:57] (CR) Yuvipanda: "Yes we should sort this list - ideally in 'most popular to least popular' (similar to https://github.com/wikimedia/apps-android-wikipedia/" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [00:27:15] ottomata: ah, didn't see your ping because of the opening bracket ;) [00:27:28] yes, thanks a lot jynus, much appreciated [00:27:38] Krenair: are you talking about "\u043c\u043e\u043b\u0434\u043e\u0432\u0435\u043d\u044f\u0441\u043a\u044d Wiktionary" [00:27:40] that seems ok [00:27:41] all hail the DBA! 
;) [00:28:22] yes, that sort of mess [00:28:23] it's not ok [00:28:42] that's supposed to show as молдовеняскэ [00:30:24] >>> print u"\u043c\u043e\u043b\u0434\u043e\u0432\u0435\u043d\u044f\u0441\u043a\u044d Wiktionary" [00:30:27] молдовеняскэ Wiktionary [00:30:29] Krenair: ^ [00:30:31] hmm [00:30:50] Krenair: the lack of 'u' is the problem [00:30:53] caused by the json.dumps :) [00:31:19] I still think we're crazy for trying to auto-generate a python file [00:31:25] well, I say we.. :P [00:31:28] Krenair: we could just do a json.dump into a file, and load it right after setting up 'app [00:31:31] ' [00:31:32] and see what happens [00:31:36] it will probably be ok [00:31:41] and it's highly possible I'm crazy :D [00:31:55] I wrote a python script that generates an nginx+lua conf last week (not for wikimedia :P) [00:31:58] so there's a pattern ther [00:32:35] didn't you or andrew get me to write python to generate lua, all from a puppet template? [00:32:41] which uses ruby [00:32:47] that was me yes [00:32:56] or did we change that so it was no longer in a template, but got separate config [00:33:40] I think it's not a template [00:33:43] Krenair: we could just do a json.dump into a file, and load it right after setting up 'app' [00:33:47] that's what I suggested [00:33:50] that's what you had originally suggested [00:33:52] yeah [00:34:01] so let's do that and see how that behaves [00:36:39] (PS9) Alex Monk: [WIP] Database selection [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) [00:56:07] Krenair: what do you think of my sorting suggestion? [00:56:53] I think I will think about it in 10 minutes [00:57:19] or maybe think about it while sync-master is running [00:59:00] YuviPanda, most popular to least popular? how do you plan to look that up? 
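Editor's sketch of the json.dumps exchange above, in Python 3 and not taken from the Quarry code. By default `json.dumps` writes non-ASCII characters as `\uXXXX` escapes, which is the "mess" seen in the generated options. The escapes are harmless as long as the text is read back with a JSON parser instead of being pasted into a Python 2 string literal without the `u` prefix. The key `dbname_label` and the in-memory buffer standing in for a file are assumptions made for illustration.

```python
import io
import json

name = "молдовеняскэ Wiktionary"

# Default behaviour: non-ASCII characters become \uXXXX escapes in the JSON text.
escaped = json.dumps(name)

# ensure_ascii=False keeps the original characters in the JSON text.
readable = json.dumps(name, ensure_ascii=False)

# Dumping to a file-like object and loading it back restores the string.
# This mirrors the "json.dump into a file, then load it" idea from the chat;
# io.StringIO stands in for the real file here.
buffer = io.StringIO()
json.dump({"dbname_label": name}, buffer)
buffer.seek(0)
loaded = json.load(buffer)

# Both encodings decode to the same text.
assert json.loads(escaped) == json.loads(readable) == name
```

Either encoding is valid JSON; the choice only affects how the file looks to a human reading it.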
[00:59:15] Krenair: so for the android app [00:59:25] we just get list of wikipedias sorted by 'pageviews' [00:59:32] err [00:59:34] no [00:59:36] by 'number of good articles' [00:59:40] which is just a byte measure I think [01:00:46] 'just' [01:01:11] right, not a measure of GA / FA ratings since those are enwiki specific [01:01:31] hah [01:01:33] https://github.com/wikimedia/apps-android-wikipedia/blob/master/scripts/generate_wiki_languages.py [01:01:35] github is down [01:01:46] indeed [01:01:59] gerrit is not [01:02:16] yes [01:02:23] but I still have no idea how to find that file in gerrit [01:02:25] I guess [01:02:27] I should find it in diffusion [01:03:06] https://phabricator.wikimedia.org/diffusion/APAW/browse/master/scripts/generate_wiki_languages.py [01:33:42] YuviPanda, if you know the gerrit project, /r/project/ works [01:33:54] YuviPanda, that pulls from inside labs? [01:34:09] we should probably not be depending on other labs projects like that [01:34:16] Krenair: it is a build time thing [01:34:29] run once every few months [01:37:15] YuviPanda, so we have to download one for each family, and presumably one for special sites? [01:37:19] and they don't include dbname [01:38:42] milimetric: 'round? [01:45:14] hey mobrovac what's up [01:45:42] milimetric: there's someone asking about all-days params for AQS @ https://www.mediawiki.org/wiki/Topic:Swpsoqwzco0g3poj so take a look [01:45:56] i asked them to give more info, but maybe you can tell right away what the problem is [01:46:40] Oh, mobrovac yeah, we still haven't been able to compute that efficiently, thanks for the ping [01:46:52] kk cool [01:46:53] thnx [01:50:52] (PS10) Alex Monk: [WIP] Database selection [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) [02:39:16] Krenair: yeah, you'd actually have to 'join' [02:39:28] Krenair: use the stat stuff just for sorting [02:39:50] to 'join'? 
[02:39:57] no idea what you mean [02:42:40] YuviPanda [02:49:08] Krenair: I mean [02:49:18] Krenair: you get the dbnames + human readable names from the API [02:49:28] then get the sort order from stats [02:49:48] and use the sort order to sort the data you got from the API [03:59:52] Hey all, not sure if anyone is around at the moment who'd be able to help, but it seems like the pagecounts-raw uploads are stuck / missing some of today's data [03:59:54] https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/ [04:04:13] pagecounts-all-sites is also missing the same data, though my consumer only uses pagecounts-raw as we need data consistency back to 2010 (and as such also can't use the pageviews API) [08:00:54] Analytics, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1973342 (mobrovac) [09:25:13] Analytics-Cluster, operations: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973447 (ArielGlenn) NEW [09:36:57] hello a-team! I need to restart our beloved kafka brokers [09:37:25] let me know when you are free to ensure that EL doesn't get into a weird state again :) [09:52:05] elukey: statsv (which also consumes from kafka) tends to get into a weird state as well [09:52:11] i have not had the time to debug it [09:52:55] could you remember to restart it as well? it's running on hafnium (so `service statsv restart` should do the trick) [09:53:28] ori: sure! I'll add this note to https://wikitech.wikimedia.org/wiki/Service_restarts#Kafka_brokers [09:53:35] is it only related to Kafka? [09:53:56] yeah. 
it's a very primitive script; there is hardly more to it than a for-each loop [10:02:21] Analytics-Cluster, operations: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973531 (elukey) Hello Ariel, I believe that @Ottomata is still working on the host: ``` # This node was previously a Hadoop Worker, but is now waiting # to be rep... [10:19:41] oh boy, the last email from hdfs@stat1002 doesn't look good joal [10:31:00] wow elukey, that does indeed look bad [10:31:35] elukey: seems to be a camus issue [10:35:27] joal: where are you looking? Just to have an idea.. Yarn? [10:35:58] looked into hue to see which jobas are blocked --> started at the root, load_webrequest - They are blocked [10:36:32] They are failing due to a missing _IMPORTED file --> this is generated after succesful camus run - I therefore think camus is failing [10:36:59] ahh hue I forgot the name :D [10:36:59] Now looking at camus logs on analytics1027:/var/log/camus/webrequest.log [10:37:44] elukey: have restarted analytics 1027 yesterday ? [10:37:54] with, like, an update of java version ? [10:38:27] no 1027 was blacklisted by ottomata [10:38:32] there is a phab task opened [10:38:35] hm [10:39:04] but I am double checking [10:40:03] yep nothing in https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:30] uptime big, so recent restat [10:41:34] Start-Date: 2016-01-27 09:32:48 [10:41:34] Commandline: apt-get -q -y install linux-image-generic [10:41:38] but not restarted [10:41:53] hm [10:41:56] Any change on java ? [10:42:19] and also Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apt-transport-https [10:42:23] not yesterday [10:42:48] Looks like a java issue with file rights on hadoop [10:46:13] the webrequest.log looks horrible :D [10:46:26] elukey: do you have root on this machine ? [10:46:43] yep [10:46:47] If so, can you try to join screen camus and camus_loop ?h [10:46:48] is the partition mounted? 
[10:48:12] I mean, sorry, since we have failures in hdfs moves I was thinking if everything is ok from the fs side [10:48:35] np, everything is done through hdfs command [10:48:49] have you managed to join screen ? [10:50:01] sorry joal I didn't get what do you mean with join screen [10:50:20] There are screen running on analytics1027, probably from root [10:51:08] actually, they are from otto [10:51:18] and I wonder if there commands running from there [10:53:50] screen -ls doesn't show anything [10:54:02] you sdhould run that as otto :) [10:54:36] elukey: another question: have we changed anything that impact java in yesterday's install-reboot ? [10:57:10] well, the kernel :) [10:57:15] but not on that host [10:57:30] and without a reboot we are still using old packages [10:58:01] elukey: speaking of hadoop global (issue seems to come from hdfs rights to write being handled differently) [11:00:01] batcave? [11:00:28] yes [11:08:36] Analytics-Tech-community-metrics, pywikibot-core, DevRel-January-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1973640 (Lcanasdiaz) >>! In T123808#1971343, @jayvdb wrote: >>>! In T123808#1970520, @Lcana... [11:29:57] !log disabled puppet and # Puppet Name: camus-webrequest [11:30:23] mmm [11:31:43] !log Moving faulty camus imported files to a temporary place [11:31:57] !log disabled puppet and camus webrequestlog in analytics1027 [11:37:53] !log run camus manually to check if succesful [11:53:10] \o/ [12:13:57] joal: I am going away for ~30 mins to eat something, I'll be back in a bit to check. [12:14:07] sure, still having issues [12:14:18] ouch.. do you need me to do anything? :( [12:21:38] naaa, fighting with files to remove [12:42:22] mforns, are you onlinez? [12:42:28] elukey, yes :] [12:42:45] o/ [12:42:50] I was reading there are problems with the cluster, can I help? 
[12:42:51] I am going to restart kafka [12:42:56] :) [12:43:02] ok, will follow up EL [12:44:04] joal is kicking Camus very hardly in its teeth, should be fine in a bit :D [12:44:30] this camus guy, would never had thought a writer could fight that bad [12:45:36] :D [12:45:54] let me know if you need any help, in the meantime I am going to deal with Kakfa [12:45:57] *kafka [12:50:46] !log stopping kafka1012 for kernel upgrade [12:51:43] EL logs went crazy [12:53:53] O_O [12:53:59] I need to stop the node this time [12:54:05] not only restart [12:54:15] so it might take it not well [12:56:11] elukey, I think it's OK, kafka will buffer logs and after restart EL will catch up [12:56:27] elukey, do you know how long it will take to upgrade? [12:56:59] I need to restart all the hosts :D [12:58:05] so I'd say, 2 hours to be super conservative? [12:58:21] I'll bring up the node + leader election probably [12:58:29] if the leaders are not balanced [12:59:04] plus the partitions are replicated so it is only a metadata change for EL (theoretically) [13:04:34] elukey, then I think I'll stop EL [13:05:42] elukey, was this scheduled or is this an emergency measure? should we send an email to analytics? [13:07:03] elukey: successful camus run ! huray ! [13:07:13] WWWOOOOOOOW [13:07:16] elukey: can you restart cron and puppet ? [13:07:26] joal: sure [13:07:45] mforns: it is a scheduled reboot of the kafka nodes to upgrade the kernel for a security fix [13:08:04] if you could send an email it would be great, the node is not restarting -.- [13:08:56] elukey, cool, then I'll send an email to analytics mentioning the scheduled reboot and the EL temporary stop [13:09:26] elukey, is there any link to a Ohab task I can paste? [13:09:54] joal: 1027 back to work [13:10:27] mforns: I don't have one atm, but you can mention the latest kernel vulnerability bug [13:11:26] elukey, thanks [13:11:40] thanks elukey ! 
[13:11:43] I'm monitoring [13:21:07] elukey, joal, EL is kinda enduring the connection problems, I think we don't need to stop it... [13:21:32] just restart it when upgrade is done, unless the situation changes [13:21:42] mforns: I trust you :) [13:21:58] mforns: Thx for caring that one :) [13:23:32] !log restart EventLogging to overcome connection problems following kafka maintenance [13:24:19] joal, is there anything I can help with the cluster, it's my ops-week [13:24:36] mforns: I'm gently monitoring, thx for asking :) [13:24:51] mforns: any issue in the past with kafka hosts not responding to ssh? [13:25:01] O.o [13:25:09] not for me [13:25:27] actually I think I never ssh'd a kafka instance [13:25:28] mmmm now it is refusing me sigh sigh [13:26:33] elukey: That bug in the kernel, that was your way to get in ! I understand now ;) [13:27:03] yes you got me! I released a CVE only to reboot hosts :P [13:27:11] :) [13:27:18] now kafka1012.eqiad.wmnet doesn't respond [13:27:20] sigh [13:27:28] mwarf :( [13:35:51] Analytics, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1973848 (JAllemandou) Thanks for raising this issue @mobrovac. Problem comes from down the line, at hadoop data ingestion. Is it now fix, but the cluster will take time to catch up on late... [13:38:28] elukey, joal, mmm EL got worse, it isn't ingesting events at all [13:38:39] :( [13:39:21] mforns: what does the log say? Completely empty or some weird messages? 
[13:39:40] 2016-01-28 13:22:41,201 (MainThread) Unable to connect to broker kafka1012.eqiad.wmnet:9092 [13:39:41] 2016-01-28 13:22:41,202 (MainThread) [Errno 111] Connection refused [13:40:28] ah yes it is down [13:40:34] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [30.0] [13:40:37] also there is an outage in -ops [13:40:51] hello icinga [13:41:55] mforns: even if you restart EL it still get stuck with 1012? [13:42:50] joal, it seems to try to connect to kafka1020 [13:43:18] mforns: is that bad ? [13:43:34] 1012 is down [13:43:35] :( [13:44:03] joal, no, but the processor and the consumer are stuck anyway [13:44:04] Man, we were talking of OOW yesterday, and today machine dies -- That is bad luck ! [13:44:15] xDDDDD [13:44:15] I don't get it mforns [13:44:35] joal, they are not consuming [13:44:58] mforns: right. give me a minute to check the kafka state [13:46:13] mforns: We have leaders that are not 1012 assigned on every topic, so a restart should do I think [13:48:07] joal, just restarted it, but the processor and the consumer are not counsuming from kafka [13:48:19] batcave? [13:48:26] sure [13:55:16] elukey, how is it going with 1012? [13:56:02] well it caused a major outage for mediawiki, people are investigating. Still down :( [13:56:14] 1012 [13:56:20] sorry [13:58:24] elukey: you think the mediawiki errors are due to kafka1012 being down ? 
[13:58:45] the ops team is investigating atm [13:59:25] !log stop EventLogging until kafka is ok [14:00:10] elukey: If so, it might be related to something we have witnessed with mforns on pykafka [14:00:30] I think that the kafka nodes are hardcoded in php files as pool [14:00:37] --> When the first instance of a writing corum is not responding, pykafka fails [14:01:13] If the C-lib used by varnish-kafka does the same, then maybe kafka1012 down is the root cause [14:11:24] yep, I won a t-shirt [14:16:42] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1973932 (ArielGlenn) [14:17:15] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1973881 (ArielGlenn) I checked on stat1002:/mnt/hdfs/wmf/data/archive/pagecounts-raw/2016/2016-01 and the files aren't there. [14:23:22] mforns: how is EL working? [14:23:41] better, s/working/doing [14:23:54] elukey, I've stopped it [14:24:35] ah okok [14:24:41] oh! I forgot, I restarted it, because the good part is, events are making it into kafka from varnishkafka [14:24:47] so no data loss as of now [14:25:04] I restarted it so that server-side events get inserted, too... [14:27:55] all right so it is running atm? [14:28:17] theoretically one node shouldn't matter [14:28:25] but of course the client do mind :) [14:30:33] Analytics-Kanban, Wikipedia-Android-App: Beta Event Logging no longer functional {oryx} - https://phabricator.wikimedia.org/T123781#1973981 (mforns) Open>Resolved @Niedzielski EventLogging logs are no more in `/var/log/eventlogging/all-events.log`, they are in `/srv/log/eventlogging/` and `/srv/log/u... [14:34:43] elukey: t-shirt ? [14:34:53] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1974003 (Aklapper) p:Triage>High [14:34:54] I broke XYZ [14:35:00] :) [14:35:03] :D [14:35:08] Your fault really ? 
[14:37:30] yep [14:37:59] joal: https://phabricator.wikimedia.org/T125084 [14:39:55] right ... mforns_brb --> seems that event-logging client-side might be flowing trhough varnish kafka :) [14:40:08] yes :] [14:43:02] and surprisingly the server-side forwarder works [14:43:02] Analytics: pagecounts raw from 28/01 are not present - https://phabricator.wikimedia.org/T125079#1974026 (JAllemandou) Hello, Issue is know, an email has been sent to the analytics list about the problem. Hadoop data ingestion has been failing yesterday, and was restored around 13:00UTC today , but all the... [14:43:02] :) [14:43:02] * joal is happy to have diagnose the stuff correctly [14:43:02] Analytics-Kanban, Wikipedia-Android-App: Beta Event Logging no longer functional {oryx} - https://phabricator.wikimedia.org/T123781#1974027 (Niedzielski) Ah, thanks! That's very helpful! Looks like it's moved to deployment-eventlogging02 but it's working fine. [14:43:02] mforns_brb: What the thing means is that we need to change eventlogging conf to remove kafka1012 from the senders list [14:45:22] HIII [14:45:25] reading backlogs and emails! [14:45:38] ottomata: You're too late, you've missed all the fun :D [14:46:34] If i missed the fun that means things are working smoothly! :D [14:46:45] almoooooooost [14:47:22] Man, kafka is a dangerous player :) [14:48:17] good i see we have a (weird looking) webrequest partititon status email [14:48:23] i was worried we weren't getting those or something [14:48:39] ottomata: that bit is (kinda) solved [14:48:46] kafka is the still broken one [14:48:46] aye [14:48:48] so, what's up? [14:48:54] elukey: you go ? [14:49:02] or shall I ? [14:49:13] looks like I'll go :) [14:49:34] Started the day with broken jobs - camus failure yesterday after cluster reboot [14:49:57] after kafka? or hadoop [14:49:58] ? [14:49:59] reboot [14:50:00] ? 
[14:50:07] after hado [14:50:11] hadoop sorry [14:50:15] Analytics-Kanban, Editing-Analysis: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1974038 (mforns) a:Milimetric>mforns [14:50:30] hey [14:50:46] yep joal you have most of the context, I'll go for the API outage [14:51:03] explanation: a camus run failed while writing it's offsets file, leaving the system in a unstable state (data already imported, but no correct offset file) [14:51:24] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1974041 (Ottomata) @jcrespo, could you give us a slave resync status update when you get a chance? Danke! [14:51:35] Then camus tried to reimport the same data over and over again, failing because it actually already existed [14:51:36] doh [14:51:42] indeed [14:52:06] So, we agreed with elukey that, when restarting the cluster (even node after node), we should stop camus :) [14:52:13] not a bad idea [14:52:18] strange thoughj [14:52:20] it should be better than that [14:52:21] I spent a few hours cleaning the wrong files for camus to restart [14:52:24] that happened while restarting datanodes/ [14:52:25] ? [14:52:27] workers? [14:52:33] we haven't done namenode yet, ja? [14:52:33] hm, can't say realy [14:52:37] nope [14:52:54] ottomata: nope [14:52:57] so, are there also kafka broker issues? 
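Editor's sketch of the failure mode joal describes above: data written, offsets not recorded, and every later run colliding with the data that already exists. This is a toy model under stated assumptions, not Camus's code or its file layout. `run_import`, `AlreadyImported`, and the `topic/offset-N` path scheme are all hypothetical.

```python
class AlreadyImported(Exception):
    """Raised when the target file for a batch already exists."""


def run_import(files, offsets, topic, batch, crash_before_commit=False):
    """Toy import run: write the data file first, then record the new offset."""
    start = offsets.get(topic, 0)
    path = f"{topic}/offset-{start}"  # hypothetical naming scheme
    if path in files:
        raise AlreadyImported(path)
    files[path] = list(batch)
    if crash_before_commit:
        raise RuntimeError("crashed before the offsets file was written")
    offsets[topic] = start + len(batch)


files, offsets = {}, {}
batch = ["req1", "req2", "req3"]

# Run 1 writes the data but dies before committing the offset.
try:
    run_import(files, offsets, "webrequest", batch, crash_before_commit=True)
except RuntimeError:
    pass

# Run 2 starts from the stale offset and collides with the existing file.
try:
    run_import(files, offsets, "webrequest", batch)
    stuck = False
except AlreadyImported:
    stuck = True

# Clearing the half-committed file lets the next run go through.
del files["webrequest/offset-0"]
run_import(files, offsets, "webrequest", batch)
```

In this model the system stays stuck until someone removes the half-committed output, which is roughly the manual cleanup described in the chat.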
[14:53:01] https://phabricator.wikimedia.org/T125084#1973969 [14:53:06] all info in there :) [14:53:17] So while I was fighting with camus, elukey went to his task for today of patching kafkas and rebooting them [14:53:44] Starting with kafka1012 --> At reboot, it has not came up, and broke the full platform [14:53:47] kafka1012 is still down, went is single user mode after reboot [14:54:12] ahhh interesting [14:54:25] yeah, because kafka1012 is the first broker in the list of bootstrap brokers [14:54:26] ops is trying to fix it, I started a leader election a while ago but some partitions have only two replicas now [14:54:29] and it just gets blocked on it? [14:54:33] 15:00:37 < joal> --> When the first instance of a writing corum is not responding, pykafka fails [14:54:36] 15:01:12 < joal> If the C-lib used by varnish-kafka does the same, then maybe kafka1012 down is the root cause [14:54:54] ottomata: you now know [14:55:15] hmmm that is not how it shoudl work. [14:55:23] i highly doubt varnishkafka does that [14:55:26] I do agree with that :) [14:55:31] we have done reboots of these before and not had this problem (at least, before monolog) [14:55:38] pykafka shoudl be better too [14:55:51] elukey: did we watch EL logs while rebooting kafka's / doing leader elections? [14:57:16] ottomata: better explanation of what happened: https://phabricator.wikimedia.org/T125084 [14:57:23] ja read that [14:57:31] is kafka1012 ok though? [14:57:49] so kafka1012 is still down [14:58:01] I think it's dead, we were making fun of discussing OOW yesterday with mforns_brb and killing a machine today :) [14:58:05] Giuseppe is working on it, it went up in single user mode for some reason [14:58:31] EL logs were watched by Marcel that told us about the connection refused [14:58:37] also this, wip: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160128-MediaWiki-API [14:58:55] ok interesting [14:59:03] oh, so kafka1012 was rebooted but didn't come back alive? 
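Editor's sketch of the behaviour ottomata expects from a client when the first bootstrap broker is down: try the next broker in the list instead of blocking on the first. This is a generic illustration, not the pykafka or varnishkafka API. `connect_to_any` and `fake_connect` are hypothetical, and the port on kafka1020 is assumed to match the one logged for kafka1012.

```python
def connect_to_any(brokers, connect):
    """Return (broker, connection) for the first broker that accepts a connection."""
    errors = {}
    for broker in brokers:
        try:
            return broker, connect(broker)
        except ConnectionError as exc:
            # Remember the failure and move on to the next bootstrap broker.
            errors[broker] = exc
    raise ConnectionError(f"no bootstrap broker reachable: {sorted(errors)}")


def fake_connect(broker):
    # Stand-in for a real client connection; kafka1012 is treated as down.
    if broker.startswith("kafka1012"):
        raise ConnectionRefusedError(111, "Connection refused")
    return f"connection to {broker}"


brokers = ["kafka1012.eqiad.wmnet:9092", "kafka1020.eqiad.wmnet:9092"]
broker, conn = connect_to_any(brokers, fake_connect)
```

With kafka1012 refusing connections, the call falls through to kafka1020; it only raises if every broker in the list fails.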
[14:59:25] yep! [14:59:38] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1974056 (jcrespo) It is continuing resyncing- but I do not have an ETA to finish. I will try to run a script today to appr... [14:59:42] I was trying to figure out why because I don't know how to access the console [14:59:49] and then everything went on fire [14:59:52] hmmm [14:59:56] heh ok cool [15:00:03] it look slike it is up though [15:00:05] OHH [15:00:09] no it doesn't sorry [15:00:37] hm, graphite/grafana is so funky sometimes [15:00:41] why no recent data.. [15:00:43] https://grafana.wikimedia.org/dashboard/db/kafka [15:01:28] ottomata: kafka is in pain because of backfilling old data with camus :( [15:01:51] ottomata: doing a lot of reads (and therefore taking long to move forward in backfilling [15:02:03] its weird though, other recent data for jmxtrans kafka stats are present [15:02:06] just not the top 3 graphs? [15:02:21] yup, from what I see [15:03:10] Actually, data is present, but not charted (if you look at the box displaying numbers, they actually are here) [15:03:15] ottomata: --^ [15:03:19] HMMMM [15:03:34] yeah, i think maybe having 1012 in the group makes it not able to chart? [15:03:42] possibly [15:03:44] if i deslect that one I see graph [15:04:03] okayy [15:04:14] not cool, but ok... [15:04:54] ok, status on hadoop data: no data loss, catching up (slowly) [15:07:05] joal: how hard was it to fix the camus offsets? [15:07:10] did you just put the previous ones in place? [15:07:32] no, I actually deleted the data files that camus was trying to reimport [15:07:51] leading to a huge volume of duplicate data, but no data loss [15:08:32] ottomata, I firmly believe there is something inserting events out-of-order [15:10:09] ottomata: looking at kafka-bytes out chart, camus backfilling is visible :) [15:13:40] jynus: eh? 
[15:16:10] backfilling out of the maintenance gaps is adding like 1% of events [15:17:17] it is worth investigating- either the replication script has a problem or some EL/logs arrive very out of sync [15:17:20] the resync? [15:17:29] yes [15:17:40] could make sense if you are resyncing close to current time [15:17:48] but for past, it shouldn't matter, right? [15:17:55] no, it is almost all timestamps [15:18:04] before the gap, too [15:18:18] not sure I understand, ok two things. [15:18:22] there is the resync you are trying to run now [15:18:26] that should pick up everything in the past, no? [15:18:33] since it compares counts in hourly periods [15:18:43] yes, up to what I told it to [15:18:46] and data is not being inserted into master for past data, (say, last weekend) [15:18:48] right ok [15:19:06] but, sync.sh, i could see that not quite working right, since it is trying to sync recent data. [15:19:17] there are 4 mysql EL insert processes right now [15:19:20] each is consuming events from kafka [15:19:23] grouping by schema [15:19:26] and batch inserting [15:19:35] batch settings are 3000 events or 5 minutes [15:19:36] that is ok, I agree that, for recent events [15:19:39] whichever comes first [15:19:43] that is not an issue [15:19:53] so, if one process inserts a batch for table A [15:20:05] there may be queued events for that table sitting in another process [15:20:09] I am worried about older events, but outside of the maintenance window [15:20:32] let me give you an example [15:20:35] right, but outside of the maintenance window would jsut be affected by sync.sh running regularly [15:20:37] like i'm saying now [15:20:41] i think i could see how it wouldn't work [15:20:48] if it only compares the latest timestamp in the slave [15:20:52] is it ok to lose some events? [15:20:53] and selects after that [15:20:58] shouldn't be! :) [15:21:08] i mean, its analytics data, and some of it is sampled, but we should lose! 
[15:21:17] i'm saying i agree with you, i see how it wouldn't work right [15:21:17] so that is why I am concerned [15:21:19] yeah [15:21:33] not too, because nothing is lost [15:21:49] right but according to analysists it is, since they are working on slave [15:21:56] exactly [15:21:56] i think maybe it would work fine if there is only one EL mysql insert process [15:22:15] but, some months ago, we spawned a few more to paralelize [15:22:28] which my point is to test [15:22:47] and/or modify the sync to adjust to it [15:23:32] i think we need to fix the syncing [15:23:45] looking at the latest timestamp on the slave is not a good way to do it [15:24:03] if all PKs were autoincs [15:24:07] that would be better [15:24:14] hmmm [15:24:15] and actually improve the performance [15:24:33] because even if they are inserted out of timestamp order [15:24:37] cause then the script could jsut remember where it left off? [15:24:42] for each table? [15:24:50] the autoinc is "always" respected [15:24:55] ja [15:24:59] thinking [15:25:22] yes, we do not even need to remember it, it is just max(id) [15:25:23] hmm ja it wouldnt' have to, then the latest id on the slave would be good enough, ja? [15:25:26] right [15:25:45] jynus: i think that is a good idea, as long as we keep a unique key on the uuid field too [15:25:57] yes, we have to validate that [15:26:10] oooof [15:26:14] but altering these tables... [15:26:15] Ungh [15:26:21] would that be fast with toku? heheh, probably not he? [15:26:22] but even if it is not ok for all tables, it is a win [15:26:33] alter tables in toku are fully online [15:26:38] oh awesome [15:26:42] another advatage of toku [15:26:51] and, we wouldn't have to change EL at all, right? 
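Editor's sketch of the sync argument above: with several insert processes, a batch can land on the master with timestamps older than rows that were already replicated, so "select everything after the slave's latest timestamp" skips it, while "select everything above the slave's max(id)" does not. This is a toy model with plain lists, not sync.sh or the real schema. The function names and row layout are hypothetical.

```python
def sync_by_timestamp(master, slave):
    """Copy rows newer than the latest timestamp already on the slave."""
    latest = max((row["timestamp"] for row in slave), default="")
    slave.extend([row for row in master if row["timestamp"] > latest])


def sync_by_id(master, slave):
    """Copy rows whose auto-increment id is above the slave's max(id)."""
    latest = max((row["id"] for row in slave), default=0)
    slave.extend([row for row in master if row["id"] > latest])


def insert(master, timestamp):
    # The master assigns ids in insertion order, like an AUTO_INCREMENT column.
    master.append({"id": len(master) + 1, "timestamp": timestamp})


master, slave_ts, slave_id = [], [], []

# Inserter A flushes a batch, then both slaves sync.
insert(master, "20160128130000")
insert(master, "20160128130500")
sync_by_timestamp(master, slave_ts)
sync_by_id(master, slave_id)

# Inserter B now flushes an event that had been queued with an older timestamp.
insert(master, "20160128130200")
sync_by_timestamp(master, slave_ts)
sync_by_id(master, slave_id)

print(len(master), len(slave_ts), len(slave_id))  # prints: 3 2 3
```

The timestamp-based slave ends up one row short, which matches the small percentage of missing events outside the maintenance window that jynus reports.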
[15:26:51] delayed writes and alters [15:26:54] because mysql woudl generate the id [15:27:18] let's check it instead of assuming it :-) [15:27:21] :) [15:27:28] we got this all set up in beta [15:27:30] checking there [15:27:38] I will add this to the imprving sync ticket [15:27:44] oh, btw, the MobileWebSectionUsage_14321266 dump finished [15:27:50] can I do the same thign to get it back on the slave? [15:28:01] master -> slave this time? or i could do db1047 -> analytlics store [15:28:37] shouldn't sync do it automatically? or it has a lot of lost events? [15:28:50] naw, beacuse it has some recent data [15:28:57] on the slave [15:28:57] | min(timestamp) | max(timestamp) | [15:28:58] +----------------+----------------+ [15:28:58] | 20160118144945 | 20160128132158 | [15:29:07] ottomata: I need to leave soon [15:29:10] i could delete the table on the slave [15:29:16] go on, but that will slow down the resync even more [15:29:17] ok joal i should just watch jobs? [15:29:26] or maybe should I ask mforns_brb, he's on duty [15:29:33] please ottomata, that'd be great [15:29:54] things to monitor: loading of text and upload, refine of text, upload and mobile [15:29:57] ok mostly i'm looking at refine, ja? [15:30:00] oh load too right [15:30:00] ok [15:30:04] will keep my eye on it [15:30:06] thx mate [15:30:36] jynus: probably better to delete table on the slave ja? and just let the regularly scheduled sync handle it? [15:31:02] oof that will take forever though :) [15:31:45] import it, just do not expect the resync to be fast [15:32:16] jynus: i'll wait to start that tonight [15:32:20] as I did yesterday [15:55:37] hey mforns_brb, what's status of EL? [15:55:40] i see its not really working!? [15:55:42] can I restart it? 
[15:55:49] oh brb [15:55:51] doing it [15:55:54] !log restarting eventlogging [16:08:58] Analytics-Cluster, operations: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1974228 (Ottomata) Yeah, I had tried to reinstall it back during the all staff, but had some issue and never finished. [16:09:49] quick question: in order to stop camus, do somebody need to simply comment/deactivate the related crontab entry for the hdfs user on 1027? or is there a better way [16:09:53] ? [16:10:11] elukey: naw, no better way [16:10:13] that's is [16:10:26] either stop puppet and comment the cron if you are stopping very temporarily [16:10:27] or [16:13:29] make a puppet commit to disable it [16:13:29] if you are going to stop it for a while [16:13:29] all right, because I wanted to add an entry in https://wikitech.wikimedia.org/wiki/Service_restarts#Hadoop_workers [16:13:29] Analytics-Kanban: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 - https://phabricator.wikimedia.org/T123595#1974231 (Ottomata) Ok, MobileWebSectionUsage_14321266 has been restored to m4-master. I'll need to do a manual sync from m4-master to analytics-store. @jcrespo's resync of... [16:17:45] updated https://wikitech.wikimedia.org/wiki/Service_restarts, both Kafka and Hadoop [16:18:55] whoa didn't know that was a page [16:19:23] nice! [16:29:48] ottomata: do you want to do the ops sync or skip? [16:31:24] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [16:33:36] OH! [16:33:39] ya let's [16:33:42] thanks [16:33:46] dunno why i iddn't see it [16:33:53] i think its not on my cal properly [16:33:55] maybe i didn't invite me? 
[16:34:35] :D [16:34:40] bat-cave [16:40:47] Analytics-Tech-community-metrics, DevRel-February-2016, DevRel-January-2016: What is contributors.html for, in contrast to who_contributes_code.html and sc[m,r]-contributors.html and top-contributors.html? - https://phabricator.wikimedia.org/T118522#1974274 (Aklapper) Adding February project as I am bl... [16:43:46] (PS1) Milimetric: Implement a Tabular Layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/267045 (https://phabricator.wikimedia.org/T118329) [16:50:54] Analytics-Kanban, Wikipedia-Android-App: Beta Event Logging no longer functional {oryx} - https://phabricator.wikimedia.org/T123781#1974307 (Dbrant) Working for me on eventlogging03 now. [17:00:56] madhuvishy: standuppp [17:02:20] Analytics-Kanban, Patch-For-Review: Categories without metrics should not show up {crow} [3 pts] - https://phabricator.wikimedia.org/T124926#1974368 (Nuria) Open>Resolved [17:03:16] Analytics-Kanban, Research-and-Data: Research Spike: How do redirects affect pageviews [8 pts] {hawk} - https://phabricator.wikimedia.org/T108867#1974370 (Nuria) Open>Resolved [17:05:05] (CR) Madhuvishy: [C: 2 V: 2] "lgtm. Thanks for working on this! :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/264297 (https://phabricator.wikimedia.org/T117615) (owner: Mforns) [17:05:12] Analytics-Kanban: Evaluate performance of country breakdown of last access monthly unique numbers {bear} [5 pts] - https://phabricator.wikimedia.org/T123265#1974379 (Nuria) Open>Resolved [17:06:07] (CR) Madhuvishy: [C: 2 V: 2] "lgtm!" [analytics/refinery] - https://gerrit.wikimedia.org/r/264292 (https://phabricator.wikimedia.org/T117615) (owner: Mforns) [17:06:18] :] [17:24:14] Analytics-Kanban: Foundation-only Geowiki stopped updating [3 pts] - https://phabricator.wikimedia.org/T106229#1974486 (Milimetric) Open>Resolved [17:31:25] k, elukey want to do hadoop masters? [17:32:07] yep! 
[17:32:13] mforns: I'm not sure I sent you this, it's the "fix redirects in mediawiki" ticket that I mentioned while we were talking about that: https://phabricator.wikimedia.org/T104755 [17:32:20] try the failover command? :) [17:32:22] reading https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration [17:32:23] that is step one ja? [17:32:24] yeah [17:33:34] so failover to secondary, reboot of primary, failover to primary, reboot secondary [17:33:45] and 1001 is primary, 1002 secondary [17:33:56] should I run this from 1027? [17:34:09] well [17:34:14] might as well reboot secondary first i guess [17:34:15] ja? [17:34:18] milimetric, yes, it was linked from the original task https://phabricator.wikimedia.org/T108867 [17:34:23] you can run it from wherever, i usually do it from an01 or an02 [17:34:25] but doesn't matter [17:34:37] yeah, was making sure you saw it [17:36:56] milimetric, aha, do we have any action items in that task? it seems to me that it is mediawiki core no? [17:37:13] ottomata: yeah I'll do it on 1002 first, just in case :) [17:38:04] mforns: yeah, nothing for us yet, just to pay attention and stay in the loop for when they start making changes [17:38:38] milimetric, sure, thx for the heads up [17:39:00] subscribed [17:41:10] k, elukey you are rebooting 1002 first? [17:41:11] ja? [17:41:11] ottomata: only service hadoop-hdfs-namenode stop i guess right? [17:41:15] yep! [17:41:21] hm, i think resource manager is htere too [17:41:23] no? [17:41:23] hmmm [17:41:24] Analytics-Cluster, Analytics-Kanban, EventBus, Patch-For-Review: Update MirrorMaker in Kafka .deb and puppet module [13 pts] - https://phabricator.wikimedia.org/T124077#1974575 (Nuria) Open>Resolved [17:41:26] Analytics-Cluster, EventBus, Services, operations: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. 
- https://phabricator.wikimedia.org/T123954#1974576 (Nuria) [17:41:31] ja [17:41:35] hadoop-yarn-resourcemanager [17:42:13] ahhh yeah not nodemanager [17:42:17] Hmmmmm [17:42:18] you know [17:42:20] all right, stopped [17:42:22] i think we need to failover resourcemanager too [17:42:24] when we do 1001 [17:42:30] resourcemanager does do automatic failove ri think [17:42:45] hmm, trying to remember how that works... [17:42:48] haven't done this in a while [17:43:02] well I think that I can reboot 1002 now [17:43:06] ja go ahead [17:43:30] Analytics-Kanban: Make Dashiki get pageview data from pageview API {melc} - https://phabricator.wikimedia.org/T124063#1974591 (Nuria) a:Nuria [17:45:51] ottomata: autoincrement EL ticket : https://phabricator.wikimedia.org/T87661 [17:48:55] elukey: i think i need to update docs to do rm failover too [17:48:58] doing so now [17:49:00] elukey@analytics1002:~$ sudo service hadoop-hdfs-namenode status * Hadoop namenode is running [17:49:03] elukey@analytics1002:~$ sudo service hadoop-yarn-resourcemanager status * Hadoop resourcemanager is running [17:49:29] cool [17:49:41] so starting the failover for hdfs [17:49:44] Analytics-Kanban: Eventlogging replication not working with mysql parallel consumption - https://phabricator.wikimedia.org/T125113#1974631 (Nuria) NEW [17:49:56] hmmm [17:50:08] elukey: k go ahead [17:50:10] hold on reboot though [17:50:11] sudo -u hdfs /usr/bin/hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet [17:50:15] ja [17:50:15] yep yep [17:50:17] that's good [17:50:33] ok, i think that if we just stop resourcemanager on an01, the an02 one will take over [17:50:39] but, we should do failover for good measure [17:50:44] Failover from analytics1001-eqiad-wmnet to analytics1002-eqiad-wmnet successful [17:50:54] -getServiceState analytics1002-eqiad-wmnet [17:50:56] says it is active? 
[17:51:37] active [17:52:28] !log rebooted analytics1002-eqiad-wmnet for kernel upgrade [17:52:42] !log HDFS failover from analytics1001-eqiad-wmnet to analytics1002-eqiad-wmnet [17:54:17] elukey: , updated https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Manual_Failover [17:54:25] go ahead and to the same witih yarn rmadmin [17:56:03] thanks! [17:56:23] Exception in thread "main" java.lang.UnsupportedOperationException: RMHAServiceTarget doesn't have a corresponding ZKFC address [17:56:26] :D [17:56:30] hmmm [17:56:37] looking [17:57:55] https://issues.apache.org/jira/secure/attachment/12690518/YARN-3006.001.patch [17:57:59] this was a patch proposed [17:58:09] hmmm [17:58:13] logs look like its working fine [17:58:13] 2016-01-28 17:45:30,145 INFO org.apache.hadoop.ha.ActiveStandbyElector: Successfully created /yarn-leader-election/analytics-hadoop in ZK. [17:58:14] 2016-01-28 17:45:30,152 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state [17:58:14] I think it wants to say "hey automagic failover is enabled!" [17:58:26] oh! [17:58:28] interesting [17:58:31] https://issues.apache.org/jira/browse/YARN-3006 [17:58:34] so you just stop the service on 1001? 
[17:59:09] huh, that would make sense as to why I didn't update the docs to say so :) [17:59:10] possibly, not sure [17:59:18] elukey: this seems correct to me [17:59:40] so I just need to stop yarn res manager on 1001 [18:00:15] i believe so [18:00:18] everything looks good [18:00:24] its connected to zk, it knows about elections [18:00:28] it knows itis currently in standby [18:01:00] elukey: how about to be extra safe [18:01:05] i disable camus :p :) [18:01:09] and we wait til the curent run is done [18:01:16] sure [18:02:14] k will let you know, should be a few mins [18:02:39] sure [18:06:36] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Remove autoincrement id from tables [5 pts] - https://phabricator.wikimedia.org/T87661#1976428 (Ottomata) Hahah, @jcrespo, tables //used// to have an auto increment key. [18:10:42] ok elukey good to go [18:10:50] stop hadoop-yarn-resourcemanager on an01 [18:10:53] i'm watching logs on an02 [18:10:59] and will verify that an02's gets promoted to active [18:11:58] doing it [18:12:39] done! [18:13:05] !log stopped Yarn resouce manager on analytics1001 to trigger automatic failover to 1002 [18:13:21] elukey: looks good! [18:14:25] all right, rebooting 1001 [18:14:39] Analytics-Kanban: pagecounts raw from 28/01 are not present [3 pts] - https://phabricator.wikimedia.org/T125079#1973881 (Milimetric) [18:15:06] Analytics-Kanban: pagecounts raw from 28/01 are not present [3 pts] - https://phabricator.wikimedia.org/T125079#1977699 (Milimetric) a:JAllemandou [18:15:10] Analytics-Kanban, Research-and-Data: Research Spike: How do redirects affect pageviews [8 pts] {hawk} - https://phabricator.wikimedia.org/T108867#1977704 (DarTar) @nuria @mforns: just checking in on the scope of this, the ticket is about exploring and documenting issues and possible solutions, not settling... 
[18:15:47] Analytics-Kanban, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1977799 (Milimetric) a:Milimetric [18:17:15] Analytics, operations: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015#1978088 (Milimetric) p:Triage>Normal [18:17:41] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Remove autoincrement id from tables [5 pts] - https://phabricator.wikimedia.org/T87661#1978162 (jcrespo) "Have in mind that code that checks for insertion of duplicates might use this field." Then it proceeds to be ignored? Expect things to b... [18:18:16] Analytics-Cluster, hardware-requests, operations: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1978243 (Milimetric) [18:19:27] Analytics: Server side eventlogging should publish to kafka and not use udp {stag} - https://phabricator.wikimedia.org/T124813#1978476 (Milimetric) [18:21:21] Analytics, Analytics-EventLogging: EventLogging dies when fetching a schema over HTTP that does not exist. {oryx} - https://phabricator.wikimedia.org/T124799#1978806 (Milimetric) [18:21:33] Analytics, Editing-Analysis, Editing-Department: Consider scrapping Schema:PageContentSaveComplete and Schema:NewEditorEdit, given we have Schema:Edit - https://phabricator.wikimedia.org/T123958#1978816 (Milimetric) Open>declined a:Milimetric Declining this then, in favor of future work that n... [18:21:55] Analytics: Pageview API demo does not show date selector arrows in Chrome - https://phabricator.wikimedia.org/T123584#1978823 (Milimetric) Open>declined a:Milimetric We're not fixing bugs on the demo [18:22:18] !log rebooting analytics1001 [18:24:52] Analytics: wmit-* account creation campaigns totals - https://phabricator.wikimedia.org/T123059#1978846 (Milimetric) Open>declined a:Milimetric We don't keep the raw data that far back. 
If you'd like to do this sort of query, you'd have to have access to the cluster and be ready to start running it... [18:25:46] Analytics: Traffic Breakdown Report - Visiting Country per Wiki {lama} - https://phabricator.wikimedia.org/T115607#1978852 (Milimetric) a:Milimetric [18:25:58] Analytics: Traffic Breakdown Report - Visiting Country {lama} - https://phabricator.wikimedia.org/T115605#1978856 (Milimetric) a:Milimetric [18:26:09] Analytics: Traffic Breakdown Report - User Agents Trend {lama} - https://phabricator.wikimedia.org/T115601#1978863 (Milimetric) a:Milimetric [18:26:16] Analytics: Traffic Breakdown Report - User Agent Overview {lama} - https://phabricator.wikimedia.org/T115599#1978866 (Milimetric) a:Milimetric [18:26:32] ottomata: 1001 is up and running [18:26:44] moving hdfs to 1001 [18:26:45] Analytics: Traffic Breakdown Report - Client OS Major Minor Version {lama} - https://phabricator.wikimedia.org/T115591#1978875 (Milimetric) a:Milimetric [18:26:53] Analytics: Traffic Breakdown Report - Browser Major Minor Version {lama} - https://phabricator.wikimedia.org/T115590#1978885 (Milimetric) a:Milimetric [18:28:02] Analytics, Analytics-Cluster: Investigate (and remove?) spamy pageviews on pageview_hourly - https://phabricator.wikimedia.org/T115477#1978893 (Milimetric) Open>declined a:Milimetric [18:28:19] Analytics-Kanban: Improve loading Analytics Query Service with data {slug} [5 pts] - https://phabricator.wikimedia.org/T115351#1978899 (Milimetric) [18:28:21] Analytics: optimize Analytics Query Service {slug} - https://phabricator.wikimedia.org/T115361#1978896 (Milimetric) Open>declined a:Milimetric too vague [18:29:10] k cool [18:29:14] proceed! 
[18:29:33] Analytics: Oozie sends emails when any job fails {hawk} - https://phabricator.wikimedia.org/T114901#1978923 (Milimetric) a:JAllemandou [18:29:46] Analytics: Oozie sends emails when any job fails {hawk} - https://phabricator.wikimedia.org/T114901#1709169 (Milimetric) a:JAllemandou>None [18:30:04] Failover from analytics1002-eqiad-wmnet to analytics1001-eqiad-wmnet successful [18:30:17] !log HDFS failover from 1002 to 1001 [18:30:41] Analytics-EventLogging, MobileFrontend, Easy, Technical-Debt: Should be possible to override sampling in EventLogging schemas for development purpose - https://phabricator.wikimedia.org/T125122#1978941 (Jdlrobson) NEW [18:31:19] Analytics-Kanban, Learning-and-Evaluation: Add instruction text next to the input fields in the Program Global Metrics Report {kudu} - https://phabricator.wikimedia.org/T121899#1978948 (Milimetric) [18:31:24] Analytics, Analytics-Kanban, Analytics-Wikimetrics: Include all timezones in global metrics report interface {kudu} - https://phabricator.wikimedia.org/T121167#1978949 (Milimetric) [18:31:36] Analytics-Kanban: Expose the results of the global metric at a public link, that's available immediately for the API {kudu} [8 pts] - https://phabricator.wikimedia.org/T118310#1978950 (Milimetric) [18:32:09] Analytics-Kanban, Analytics-Wikimetrics: Include all timezones in global metrics report interface {kudu} - https://phabricator.wikimedia.org/T121167#1978951 (madhuvishy) [18:33:36] Yarn still standby on 1001 [18:35:26] ottomata: :) [18:35:33] hmmm [18:35:35] oh yarn [18:35:39] ja you'll have to restart the 1002 one [18:35:48] yarn don't care [18:35:52] but we should keep it on an01 [18:36:04] so I am stopping yarn on 1002 right? [18:38:21] ja [18:38:47] ah looks like you did ?:) [18:38:54] perfect [18:38:54] This is standby RM. Redirecting to the current active RM: http://analytics1001.eqiad.wmnet:8088/cluster/scheduler [18:38:56] :) [18:39:01] looks good! 
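The master reboot sequence worked through above (roughly 17:31 to 18:39) can be summarised as an ordered plan. What follows is a minimal Python sketch under stated assumptions, not the team's tooling. The `haadmin` and `service` invocations are the ones quoted in the log. The function names, the bare `sudo reboot`, the explicit `start` of the resource manager, the camus placeholders, and deriving the HA IDs from short hostnames are assumptions for illustration. The script only builds and prints the steps; it executes nothing.

```python
"""Sketch of the rolling reboot order for an HA Hadoop master pair."""


def ha_id(host):
    # The log uses HA service IDs such as "analytics1001-eqiad-wmnet";
    # deriving them from a short hostname like this is an assumption.
    return f"{host}-eqiad-wmnet"


def rolling_reboot_plan(primary, standby):
    """Return an ordered list of (host, command) steps; nothing is run."""
    failover = "sudo -u hdfs /usr/bin/hdfs haadmin -failover {} {}"
    state = "sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState {}"
    return [
        # 1. Reboot the standby first, then confirm its daemons came back.
        (standby, "sudo service hadoop-hdfs-namenode stop"),
        (standby, "sudo service hadoop-yarn-resourcemanager stop"),
        (standby, "sudo reboot"),  # assumption: exact reboot command not in log
        (standby, "sudo service hadoop-hdfs-namenode status"),
        (standby, "sudo service hadoop-yarn-resourcemanager status"),
        # 2. Move the active HDFS namenode to the standby and verify it.
        (primary, failover.format(ha_id(primary), ha_id(standby))),
        (primary, state.format(ha_id(standby))),
        # 3. Manual YARN failover was refused because automatic failover is
        #    enabled, so stopping the active resource manager triggers it.
        (primary, "# placeholder: disable camus, wait for the current run"),
        (primary, "sudo service hadoop-yarn-resourcemanager stop"),
        (primary, "sudo reboot"),  # assumption, as above
        # 4. Fail HDFS back, then bounce the standby's resource manager so
        #    the primary's one becomes active again.
        (standby, failover.format(ha_id(standby), ha_id(primary))),
        (standby, "sudo service hadoop-yarn-resourcemanager stop"),
        (standby, "sudo service hadoop-yarn-resourcemanager start"),  # assumption
        (primary, "# placeholder: re-enable camus"),
    ]


if __name__ == "__main__":
    for host, command in rolling_reboot_plan("analytics1001", "analytics1002"):
        print(f"{host}: {command}")
```

Keeping the plan as data rather than running commands directly makes it easy to review the order before anyone touches a master node, which matches how the steps were confirmed one at a time in the channel.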
[18:39:24] reenabling camus jobs [18:39:28] are you watching /var/log/hadoop-yarn ? [18:40:08] ja [18:40:20] all right [18:40:28] that log is from a http req to hte job browser on 1002, redirected me to current active rm on 1001 [18:40:30] !log Yarn resource manager transitioned to 1001 [18:40:49] ahh not in the main log [18:40:52] ja [18:41:38] ok, gotta make some lunnnnch [18:41:45] looks good elukey! [18:42:44] goooood [18:42:50] I am going offline a-team! [18:42:53] byyyyeeeeeee [18:42:57] bye elukey ! [18:43:03] cya tomorrow [18:43:25] Analytics: Daily/monthly aggregation of hourly page view files halted - https://phabricator.wikimedia.org/T123477#1978996 (ezachte) Open>Resolved Oops. This I fixed some two weeks ago, but I hadn't marked it as resolved yet. Doing that now. [18:49:39] Analytics-Wikistats: LIMN input file wikilytics_in_pageviews.csv no longer updated - https://phabricator.wikimedia.org/T124340#1979020 (ezachte) Open>Resolved Path names for this step were wrong after major update for https://phabricator.wikimedia.org/T114379 . Fixed [18:59:21] Analytics-Kanban: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1979062 (Nuria) NEW [19:17:26] heya a-team, is eventlogging processor stopped in beta on purpose? [19:17:40] ottomata, no... [19:17:43] that I know [19:17:47] ok, i'm starting it [19:17:52] seems to have been stopped since the 21st [19:18:07] ha [19:20:10] a-team: reworked mission , any feedback welcome: "The Analytics Team sees as its primary responsibility making Wikimedia related data available for querying and analysis to both WMF and the different Wiki communities and stakeholders. 
We develop infrastructure so all our users, both within the Foundation as within the different communities, can access data [19:20:10] in a self-service fashion that is consistent with the values of the movement" [19:20:10] https://www.mediawiki.org/wiki/Analytics#Mission [19:22:48] mforns: can you review real quick? [19:22:49] https://gerrit.wikimedia.org/r/#/c/267070/2 [19:22:51] i want to put this on beta [19:22:54] ottomata, sure [19:29:29] ottomata, merged it [19:30:09] danke [19:38:17] Analytics, Analytics-EventLogging: Add autoincrement id to EventLogging MySQL tables. - https://phabricator.wikimedia.org/T125135#1979211 (Ottomata) NEW [19:39:19] Analytics, Analytics-EventLogging: Add autoincrement id to EventLogging MySQL tables. - https://phabricator.wikimedia.org/T125135#1979222 (Ottomata) I just tested on the Echo_7731316 table in beta. Adding an autoincrement key is transparent to EventLogging if the table already exist. https://gerrit.wik... [19:58:17] hey ottomata, just relaunched two oozie failed load jobs [19:58:59] failed?! [19:59:02] sorry [19:59:06] was trying to get into hue earlier [19:59:10] have been watching running jobs complete [19:59:13] in cli [19:59:19] didn't notice teh failed ones [19:59:25] did you see them via cli? [20:00:05] joal: ^ [20:00:29] nope, using hue [20:01:01] haven't been able to load it :/ [20:01:20] I notice the difference in hue better than in CLI for the load ones --> when failing because of statistics, nothing in the "comment" section, while in the case of missing _IMPORTED, it tells you :) [20:01:26] mwarf: ( [20:01:53] Anyway, done, and hopefully everything will be fine tomorrow morning :) [20:02:20] ja, did they fail around 1.5h ago? [20:02:28] when we restarted hadoop master things? [20:04:12] ottomata: I don't think so [20:04:17] ok [20:04:20] ottomata: I think they failed because of timeout [20:04:32] oh because our oozie server problem? 
[20:04:54] they were the one waiting for _IMPORTED while I was fixing the camus stuff [20:05:01] milimetric: mforns i hope to have almost all the docker things working tomorrow for wikimetrics. Can one of you, or both - spend may be 15-20 minutes being the first testers for this? [20:05:12] So they have waited long enough for timeout I guess [20:06:23] madhuvishy: right now or tomorrow? [20:06:28] milimetric: tomorrow [20:06:32] sure, no prob [20:06:38] thanks :) [20:07:09] nuria too if you're interested ^ [20:20:02] (PS2) Milimetric: Implement a Tabular Layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/267045 (https://phabricator.wikimedia.org/T118329) [20:25:32] Wooow, many emails from phabricator ! [20:25:43] grooming must have been a strong one ! [20:26:18] joal: wikibugs-l got unsubscribed from a few thousand tasks, which might have been an impact. [20:26:23] joal: (I send my Phab e-mail to /dev/null ;-)) [20:26:31] :D [20:26:49] I try to at least scan them, it sometimes helps me not to forget some stuff [20:30:26] madhuvishy, yes I can also try it [20:32:03] Analytics-Cluster, operations: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1979374 (Ottomata) Open>Resolved a:Ottomata Weird! I umounted /mnt/hdfs, ran puppet twice, and now everything is fine. [20:32:46] Analytics, Analytics-Cluster: Uninstall Impala - https://phabricator.wikimedia.org/T125141#1979381 (Ottomata) NEW [20:35:34] Analytics, Analytics-Cluster: Procure hardware for future druid cluster - https://phabricator.wikimedia.org/T116293#1979417 (Ottomata) In a hardware planning meeting yesterday, we determined that analytics1015, analytics1017 and analytics1021 could be used for this. T124945 should be done first to free... 
[20:35:40] Analytics, Analytics-Cluster: Procure hardware for future druid cluster - https://phabricator.wikimedia.org/T116293#1979419 (Ottomata) [20:35:42] Analytics-Cluster, hardware-requests, operations: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (Ottomata) [20:36:35] Analytics, Analytics-Cluster: Expand people's ability to use Hive/Cluster {hawk} - https://phabricator.wikimedia.org/T94903#1979423 (Ottomata) Open>declined a:Ottomata Feel free to reopen if needed. [20:37:36] Analytics-Cluster, hardware-requests, operations: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (Ottomata) [20:37:38] Analytics, operations, Patch-For-Review: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1979429 (Ottomata) [20:38:32] Analytics-Cluster, Analytics-Kanban, EventBus: Camus job to import mediawiki.* eventbus data to Hadoop. - https://phabricator.wikimedia.org/T125144#1979431 (Ottomata) NEW a:Ottomata [20:43:40] mforns: awesome, thanks - let's do after standup tomorrow then [20:59:48] !log restarted eventlogging to overcome burrow alerts [21:02:45] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1979517 (Aklapper) **TODO:** I don't know why "Abandoned" is interesting but I'd rather go for these columns: | Total Submitted | CR=0 && V=1 |... [21:05:18] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1979524 (Aklapper) p:Low>Lowest I've sorted out above in T63563#1895837 and T63563#1979517 what I see as TODOs. The current page kind of... 
[21:05:31] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1979526 (Aklapper) a:Aklapper>None [21:10:47] milimetric: where do I send people if they want to put in a feature request for the pageview API? [21:13:51] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1979555 (Aklapper) [21:14:25] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [21:24:33] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1979600 (Aklapper) Note: I assume the timeframe of http://korma.wmflabs.org/browser/top-contributors.html is //all ti... [21:27:53] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1979612 (Aklapper) == General comments on the user account data in //korma//: == Usernames in different data sources... [21:29:24] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [21:39:57] ottomata: yt? [21:42:08] hiya yup [21:42:52] ottomata: question what would be the value proposition of a public edit stream if it cannot replace RC stream? what kind of usages do you see [21:43:20] nuria: i think we need to talk to timo and aaron about this [21:43:26] i don't have all the context. [21:43:58] ottomata: k, let's see if Krinkle is arround [21:44:02] He is [21:44:24] HIii [21:44:29] :) [21:45:04] u 2 wanna jump in hangout real quick? 
probably easier [21:45:21] three of us? [21:45:25] ja [21:45:27] o [21:45:29] k [21:45:32] batcave! [21:45:40] https://plus.google.com/hangouts/_/wikimedia.org/a-batcave [21:46:10] nuria: ^ [21:47:56] Analytics-Kanban, Editing-Analysis: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1979701 (mforns) I've looked a bit into it. The queries really are taking too long. The one mentioned in the task description is taking around 5-6 hou... [21:50:56] ottomata, just a heads up that I had to restart EL again because of the same issues that we have lately: https://grafana.wikimedia.org/dashboard/db/eventlogging [21:54:10] the burrow alarm indicated a kafka connection problem in the processor, restarted and went back to normal [21:59:51] good night folks! see you tomorrow [22:32:53] (PS1) Ottomata: JsonStringMessageDecoder can now find timestamps using dotted notation, e.g. ("a.b.ts") [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/267167 (https://phabricator.wikimedia.org/T125144) [22:34:41] mforns, very strange. [22:34:41] hm [22:34:42] thanks [22:34:48] i'm out soon too a-team, latesr! [22:35:22] ottomata: good night! :) [22:38:15] (PS2) Ottomata: JsonStringMessageDecoder can now find timestamps using dotted notation, e.g. ("a.b.ts") [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/267167 (https://phabricator.wikimedia.org/T125144) [22:39:26] (CR) Mooeypoo: "The strings this generates has a few that are in unicode characters. 
If that's okay for your purposes that's fine, but I ran into that iss" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [22:44:57] Analytics, Learning-and-Evaluation: Add instruction text next to the input fields in the Program Global Metrics Report {kudu} - https://phabricator.wikimedia.org/T121899#1979999 (Abit) [22:51:40] (PS1) Madhuvishy: Development environment for wikimetrics using docker [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/267172 [22:55:07] (PS2) Madhuvishy: Development environment for wikimetrics using docker [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/267172 [22:59:04] !log restoring MobileWebSectionUsage_14321266 from db1047 to dbstore1002 using mysqlimport [22:59:44] Analytics-Kanban: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 - https://phabricator.wikimedia.org/T123595#1980088 (Ottomata) Just started sync from db1047 to dbstore1002, running on db1047. ``` mysqldump --single-transaction --insert-ignore --no-create-info --skip-add-locks log... 
[23:25:51] Analytics: Workshop to teach analysts, etc about Quarry, Hive, Wikimetrics and EL {flea} - https://phabricator.wikimedia.org/T105544#1980184 (DarTar) [23:26:23] Analytics, Research consulting: Too few page views for June/July 2015 - https://phabricator.wikimedia.org/T106034#1980186 (leila) [23:27:05] Analytics, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1980188 (leila) [23:28:09] Analytics, MediaWiki-API, Reading-Infrastructure-Team, MW-1.27-release-notes, and 2 others: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618#1980191 (leila) [23:59:41] (PS3) Madhuvishy: Development environment for wikimetrics using docker [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/267172 (https://phabricator.wikimedia.org/T123749)