[01:12:54] Hi! Any insight details of the x-analytics-map['proxy'] field in Hive webrequests, i.e., how it works, how reliable and supported it is? Or maybe I should ask on #wikimedia-mobile? [01:14:08] Specifically looking for info on IORG (internet.org) proxy [01:14:16] thx in advance! [01:14:18] :) [02:04:16] 10Analytics, 06Operations: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10fgiunchedi) [02:44:03] milimetric: nuria: hi! Happy new year! A quick question to start 2017: do u know offhand where the code that generates the x_analytics header lives? thx!!!!! ;D [03:01:15] ^ just FWIW, found the answer... :) [03:01:24] many greetings still :) [04:46:06] 10Analytics, 10MediaWiki-API, 06Wikipedia-Android-App-Backlog, 06Wikipedia-iOS-App-Backlog: Add page_id and namespace to X-Analytics header in App / api requests - https://phabricator.wikimedia.org/T92875#1122344 (10Krinkle) >>! In T92875#1281526, @gerritbot wrote: > Change 202801 merged by jenkins-bot: >... [07:30:41] 06Analytics-Kanban, 10Community-Wikimetrics, 13Patch-For-Review: Story: WikimetricsUser reports pages edited by cohort {kudu} [13 pts] - https://phabricator.wikimedia.org/T75072#763343 (10Urbanecm) Why this isn't displayed in the UI? Because the missing deduplication? [08:26:54] 06Analytics-Kanban: Kill limn1 - https://phabricator.wikimedia.org/T146308#2915964 (10Nemo_bis) Yes, the dashboards which need to be archived are those which don't get migrated. [12:39:27] 10Analytics, 10MediaWiki-extensions-WikimediaEvents, 10The-Wikipedia-Library, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2916249 (10Samwalton9) @Legoktm Could you please look into the above issue? I've pinged you on... [13:50:28] o/ [14:03:30] o/ joal [14:03:34] & milimetric [14:03:39] nothing pressing this morning. [14:03:50] But would be very happy to meet. I'm hanging out in the call [14:18:56] OK dropping out now. I'm going to go have a proper breakfast. [14:19:02] o/ [14:19:25] (03CR) 10Mforns: Add mediawiki history spark jobs to refinery-job (0312 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: 10Joal) [14:30:43] off to get lunch at 3:30pm, naturally! [15:00:47] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#2916525 (10Lcanasdiaz) Guys, we are ready to deploy this but we are finding some issues with the data collection. We do not manage to download more than 600 ticket... [15:24:45] hi all. is there any job in the analytics oozie workflow that uses Wikipedia XML dumps as input? [15:31:18] mschwarzer: no, there isn't :/ [15:32:37] ottomata: :/ what would be then the best approach for a new job to use the dumps? [15:33:02] (03CR) 10Mforns: Add mediawiki history spark jobs to refinery-job (0314 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: 10Joal) [15:33:23] mschwarzer: I'm not totally sure, but there shouldn't be any real difference in oozie between different types of data inputs [15:33:31] it more depends on what type of job oozie will launch [15:33:41] you are doing flink, right? 
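
Picking up the x-analytics question that opens this log: below is a minimal sketch of how one might count requests per proxy value (e.g. the internet.org / IORG proxy) from Hive. It assumes the conventional wmf.webrequest table with an x_analytics_map column and year/month/day/hour partitions; the partition values, the webrequest_source filter, and running the query through the hive CLI are illustrative choices, not a documented interface.

    #!/usr/bin/env python
    """Rough sketch: count requests per X-Analytics 'proxy' value for one hour
    of webrequest data. Table, column and partition names follow the usual
    wmf.webrequest layout; adjust dates and source to taste."""
    import subprocess

    QUERY = """
    SELECT x_analytics_map['proxy'] AS proxy, COUNT(*) AS requests
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2017 AND month = 1 AND day = 4 AND hour = 0
      AND x_analytics_map['proxy'] IS NOT NULL
    GROUP BY x_analytics_map['proxy']
    ORDER BY requests DESC
    LIMIT 20;
    """

    # 'hive -e' runs a single query and prints tab-separated rows to stdout.
    out = subprocess.check_output(['hive', '-e', QUERY], universal_newlines=True)
    print(out)
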
[15:33:57] i don't have any experience with oozie and flink, but it should be possible, since ultimately its just a hadoop/yarn job [15:34:06] but, on the data input side of things [15:34:20] are you trying to use oozie to launch a regular job based on data input? [15:34:26] how are you getting the dumps into HDFS? [15:34:35] joal: can answer more about dumps in HDFS than I can [15:35:11] but IIRC it is difficult, especially since some particular page splits are SO large (soo many revisions, large revision changes, etc.) [15:35:21] ottomata: yes, i'm using yarn/flink. but that's shouldn't be the problem. its more about getting the dumps to HDFS [15:35:35] so, oozie is a hadoop/yarn job schedule [15:36:00] you use it to either: A. run jobs to run on a regular schedule, or more usefully B: run jobs based on data presence [15:36:04] that's mainly what we use it for [15:36:16] oozie doesn't import data into hdfs [15:37:22] can you access the dumps from the cluster directly? [15:40:12] mschwarzer: no [15:40:14] you'd have to put them there [15:40:49] they are NFS moujnted on stat1002 [15:40:57] so you can hdfs -put them [15:41:01] but really, we should ask joal (and maybe halfak) [15:41:04] because they've done this before [15:41:07] I haven't [15:41:15] o/ [15:41:44] ottomata: alright. thanks for the info! [15:41:49] I've not tried to work with XML dumps on our cluster in a long time. [15:42:04] Most recently, I've been transfering them to S3 and using them on the altiscale cluster [15:45:18] halfak: is the altiscale cluster still active? [15:45:25] ottomata, yup [15:45:37] Or rather, not *our* but the bigger shared one is. [15:45:42] *ours [15:45:53] We share it with a few professors and the InternetArchive [15:46:22] could mschwarzer get access to that? i betcha what he wants to do would be eaiser there...although hm, maybe not, he probably wants to feed info back into hdfs [15:46:23] sorry [15:46:28] back into elasticsearch [15:46:41] Ahh yeah. That might be more complicated. [15:47:03] joal was working on some nice scripts for loading data to altiscale. I'm not familiar with a good way of getting it out other than s3 buckets. [15:48:16] yea mschwarzer wants to feed the data back into ES so it can be used as weights in article similarity. It seems the end goal is for that to be a productionized pipe that runs regularly [15:48:54] Yeah, analytics hadoop seems better. Can it handle the capacity needed? [15:49:35] not sure, joal would know better [15:49:44] my job looks like this : xmp dump (hdfs) -> yarn/flink -> elastic search [15:49:48] halfak: the problem was large split outliers, right? [15:49:51] most pages worked fine? [15:50:02] does your xml -> json format fix that? [15:50:12] ottomata, problem was capacity. [15:50:21] The jobs I wanted to run were huge [15:50:37] But yeah, the json format makes working with the data in hadoop pretty straightforward [15:50:41] hm, but the capacity problems were because of certain large pages (splits) i thought? [15:50:59] depending on what hes doing, could maybe do es -> yarn/flink -> es, hadoop can pull an index scan on the live ES cluster [15:51:01] ottomata, that was an OOM issue [15:51:08] oh ok [15:51:23] were your jobs large because of the data, or just because you were doing that historical analysis [15:51:28] mschwarzer: you just need latest rev, right? [15:51:30] not all history? [15:51:40] ottomata: yes. no history. [15:51:50] do you need the xml dumps? or just the rev content? 
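
A minimal sketch of the "hdfs -put" step mentioned just above, assuming the dumps are readable under an NFS mount such as /mnt/data/xmldatadumps on stat1002 and that the HDFS target directory is one you can create; both paths are placeholders, not a canonical layout.

    #!/usr/bin/env python
    """Copy a latest-revisions XML dump from the NFS mount on stat1002 into HDFS.
    The local and HDFS paths are illustrative assumptions."""
    import subprocess

    LOCAL_DUMP = ('/mnt/data/xmldatadumps/public/enwiki/20170101/'
                  'enwiki-20170101-pages-articles.xml.bz2')
    HDFS_DIR = '/user/mschwarzer/dumps/enwiki/20170101'

    # Create the target directory if needed, then upload (overwrite on retry).
    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', HDFS_DIR])
    subprocess.check_call(['hdfs', 'dfs', '-put', '-f', LOCAL_DUMP, HDFS_DIR])
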
[15:52:48] if the rev content is in wiki-markup, it should be enough. [15:53:14] article title + rev content [15:54:08] mschwarzer: if you had all mediawiki page tables in hdfs, could your job just grab the latest rev content from the MW API? [15:54:19] not sure if the API would like that, but since its only asking for one rev per page, maybe it'd be ok? [15:54:42] you could iterate through the page table, and then ask the API for the latest rev content for each page [15:54:50] You don't want to get 5M pages from the API. [15:54:53] ok [15:54:57] i take it back then :) [15:55:00] We do have dumps that contain just the last edit to article pages* [15:55:05] oh, we do? [15:55:07] ok cool. [15:55:10] theoretical yes, but all 7M+ pages :D [15:55:11] *"article pages" include templates and stuff too [15:55:19] that would probably be much easier to deal with [15:55:44] mschwarzer: are you familiar with halfak's xmldump -> json processor job? [15:55:50] it might make things easier too [15:56:08] my job is currently using the all-articles-dump with latest rev only. [15:56:12] joal has a faster version, but I can't remember how to find the docs for it. [15:56:12] ah ok [15:56:24] haha, yeah we really need joal for these Qs :) [15:56:24] ottomata: no. but i'll have a look at it. [15:56:33] Converting last rev to JSON should be pretty fast [15:56:43] cool, mschwarzer i betcha it'd run fine in our cluster with just the latest revs [15:56:45] See https://github.com/mediawiki-utilities/python-mwxml [15:56:53] and http://pythonhosted.org/mwxml/ [15:56:56] you'll have to do the import of the xmldump yourself [15:57:05] halfak: thanks! [15:57:09] probably what you want is some cron that checks to see if the new xmldump is available [15:57:16] then a job that just hdfs dfs -puts it into hdfs [15:57:27] Looks like the docs are a little out of date. I'll update 'em quick. [15:57:29] then, an oozie job that would respond to the precense of new data [15:57:32] in hdfs [15:57:52] that would launch your job chain, maybe halfaks xml -> json job, then your flink stuff [15:57:55] or just your flink stuff [15:58:18] so: [15:58:18] 1. some script that puts xmldumps into hdfs [15:58:18] 2. oozie job to launch your workflow (flink, etc.) [15:58:24] there are also json dumps of article content already available on dumps.wikimedia.org [15:58:25] but really, xmldumps are generated so infrequently, right? [15:58:27] generated weekly [15:58:31] OH [15:58:32] ok right [15:58:35] since they are just latest [15:58:40] ok weekly is cool [15:58:51] hm, yeah, actually [15:58:51] the json dumps are in elasticsearch bulk import format, but it's easy to parse [15:59:07] and contains a bunch of extra structured information (but probably not useful for this particular job) [15:59:22] I didn't know we had JSON dumps at all [15:59:30] where are the json dumps? [15:59:32] don't see those [15:59:35] it's a dump of the live elasticsearch indices, they run every monday [15:59:52] https://dumps.wikimedia.org/other/cirrussearch/current/ [16:00:08] the -content.json.gz contains just article namespaces, the -general.json.gz contains all the other stuff [16:00:26] the format looks like: https://en.wikipedia.org/wiki/Foobar?action=cirrusdump [16:00:45] ebernhardson: we should team up and document this along w/ the relevancy forge server for use from labs/tools man [16:00:55] maybe like within the next few months that's realistic :) [16:01:05] so the dumps.wikimedia.org xml are available on the cluster? or only the JSON dumps? 
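
As a sketch of the "title + rev content" extraction discussed above, here is one way to read a pages-articles dump with the python-mwxml library linked by halfak and emit one JSON record per article. The dump path is a placeholder, and whether this is fast enough for a full wiki is exactly the capacity question being debated here.

    #!/usr/bin/env python
    """Extract (title, latest revision wikitext) pairs from a pages-articles dump
    using mwxml (https://github.com/mediawiki-utilities/python-mwxml).
    In a latest-revisions dump each page carries a single revision."""
    import bz2
    import json
    import mwxml

    DUMP_PATH = 'enwiki-latest-pages-articles.xml.bz2'  # placeholder path

    dump = mwxml.Dump.from_file(bz2.open(DUMP_PATH, mode='rt'))
    for page in dump:
        if page.namespace != 0:        # keep main-namespace articles only
            continue
        for revision in page:          # one revision per page in a "latest" dump
            print(json.dumps({'title': page.title, 'text': revision.text or ''}))
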
[16:01:14] mschwarzer: nothing is in hadoop [16:01:16] you'd have to grab it [16:01:26] you could either hdsf dsf -put them from the nfs mount on stat1002 [16:01:26] OR [16:01:33] your job could just get them via http [16:01:34] as it runs [16:01:37] joal, elukey, standuppppp? :] [16:01:46] ottomata: ah ok. so hdfs -put is fine for a oozie job? [16:01:52] hm, no [16:01:53] i mean [16:01:59] sort of, but not normally [16:02:04] chasemp: yea really should no-one knows these things exist [16:02:12] oozie is usually used to launch a hadoop job based on data presence [16:02:13] so [16:02:19] data shows up in HDFS [16:02:54] ebernhardson: access to a relevancy forge (like thing) from labs was brought up by commtech as a big wish at the product/tech offsite/onsite back in oct(?) and I *adjust neck tie* had to be like "yeah well...that exists" [16:03:07] ottomata: so i need something else that triggers the hdfs -put? [16:03:13] it's just we have no time to promote and organize docs and do a bit of handholding [16:03:44] so fyi it's on my near-term radar to kind of formalize an explanation page or two and nudge some folks [16:04:29] and oozie is like: cool! This weeks data is finally avaiable. time to launcha job! [16:04:47] mschwarzer: usually ,yes, that would probably be easier [16:04:54] a simple little script that runs via cron would probably be fine [16:06:40] how can i push changes to cron? oozie uses just the gerrit repo, right? [16:10:45] mschwarzer: depends on how 'production' this is going to be [16:11:09] ebernhardson: where will mschwarzer's code live? in the discovery search repo? [16:11:14] discovery analytics? [16:11:15] i mean? [16:11:20] (or, what is it called?) [16:11:31] mschwarzer: do you have access to the cluster? I forget. to stat1002? [16:13:10] ottomata: probably yes, because we would then ship the code back to es with our other data [16:13:19] s/ship the code/ship the data/ [16:13:47] aye, so mschwarzer, the cron part *could* be puppetized, or, at first at least, it could just run as your user on stat1002 [16:13:58] that might be best to do while you work out the kinkds [16:14:02] kinks [16:14:24] but, as long as the data isn't too big or annoying (i don't think it is), a puppetized hdfs -put of latest rev xmldumps (or json?) could be useful [16:14:35] hmmm, alternatively [16:14:42] we could sqoop the latest revs directly out of the mysql db [16:14:49] and regularly schedule that [16:15:02] then the data would be in Avro format already, and you could query it directly via Hive or whatever [16:15:11] hmm [16:15:12] actually [16:15:16] that is probably harder than it sounds [16:15:21] because the content is nasty in the db [16:15:37] yeah, i take that idea back, nm. [16:15:51] there has been talk about sqooping content eventually, but it is not a priority [16:24:48] ottomata: i don't have access to the cluster and if possible i would like to avoid that ;) thanks for all the infos. i'll now try to prepare my oozie changes and then we can see how to deploy the dump2hdfs script. [16:25:01] oook! [16:28:34] 06Analytics-Kanban: Kill limn1 - https://phabricator.wikimedia.org/T146308#2916734 (10Nuria) @Nemo_bis : sorry but we cannot possibly do that as it would equal maintaining them alive, you can see the list above that the dashboards killed were of no interest to their owners/had no updated data/had testing data. [16:59:41] mforns: , still there? [17:10:09] ottomata, mschwarzer: Heya ! [17:10:16] sorry for arriving late [17:10:21] HIII joal! 
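
To make the "simple little script that runs via cron" idea concrete, below is a sketch that extends the hdfs -put snippet earlier: it checks whether this week's dump has appeared on the NFS mount, skips work if it is already in HDFS, and otherwise uploads it and drops a _SUCCESS flag that a data-presence-triggered Oozie coordinator could key on. All paths, the wiki/date naming, and the _SUCCESS convention are assumptions for illustration; the same shape would work for the weekly cirrussearch -content.json.gz dumps instead of the XML.

    #!/usr/bin/env python
    """Cron-driven sketch: if a new latest-revisions dump is on the NFS mount and
    not yet in HDFS, upload it and create a _SUCCESS flag so a downstream Oozie
    coordinator can launch the processing job. Paths and names are illustrative."""
    import os
    import subprocess
    import sys

    WIKI = 'enwiki'
    DUMP_DATE = '20170101'  # in practice, discover the newest dated directory
    LOCAL = ('/mnt/data/xmldatadumps/public/{w}/{d}/'
             '{w}-{d}-pages-articles.xml.bz2').format(w=WIKI, d=DUMP_DATE)
    HDFS_DIR = '/user/mschwarzer/dumps/{w}/{d}'.format(w=WIKI, d=DUMP_DATE)

    def hdfs_exists(path):
        """'hdfs dfs -test -e' exits 0 if the path exists."""
        return subprocess.call(['hdfs', 'dfs', '-test', '-e', path]) == 0

    if not os.path.exists(LOCAL):
        sys.exit(0)                      # this dump run is not finished yet
    if hdfs_exists(HDFS_DIR + '/_SUCCESS'):
        sys.exit(0)                      # already imported, nothing to do

    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', HDFS_DIR])
    subprocess.check_call(['hdfs', 'dfs', '-put', '-f', LOCAL, HDFS_DIR])
    subprocess.check_call(['hdfs', 'dfs', '-touchz', HDFS_DIR + '/_SUCCESS'])

Run weekly from cron (on stat1002 at first, or puppetized later, as suggested above), this gives the Oozie side a single flag file whose presence marks a complete import.
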
[17:10:22] :) [17:10:24] :) [17:11:15] mforns: Hi, :) [17:11:56] a-team, I think I'm too decorelated from our work this week to go to scrum of scrum [17:12:10] anybody could go in place of me? [17:12:30] joal: i could go, but we are all disconnected [17:12:53] actually, yeah i'll go! [17:12:57] and mention that i'm blocked on ops :p [17:13:14] k ottomata, thanks a lot mate ! [17:14:33] joal: want to quickly brain bounce a name with me? :D [17:14:41] sure ottomata [17:14:46] batcave? [17:14:47] k [17:15:13] hey joal, sorry was afk [17:17:38] np mforns :) [17:18:44] mforns: currently with ottomata, will ping you in a minute about sqoop [17:18:54] joal, ok! [17:19:00] thx [17:24:20] mforns: ready ! [17:26:04] hey joal batcave? [17:26:09] sure mforns OMW [17:37:28] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#2916903 (10Aklapper) >>! In T138002#2916525, @Lcanasdiaz wrote: > Guys, we are ready to deploy this but we are finding some issues with the data collection. We do... [17:38:52] milimetric: Hi ? [17:39:29] milimetric: you tricked me ! I tried to launch a new sqoop today, but some was already launched ;) [17:47:41] joal I think that might be a hanging job that needs to be killed. I will try to look but am running around getting ready for flight. Just kill it in hue if you can [17:47:55] milimetric: KILLIIIIING ~! [17:48:01] K [18:16:09] ottomata: ahem ( cc ebernhardson ) how are these folks that wish to use dumps going to test their oozie changes w/o access to cluster, is ebernhardson testing those for them? [18:26:46] nuria: i'm not sure either if mschwarzer doesn't have access to the cluster, i imagine dcausse or myself will have to test [18:44:16] ottomata: around? [18:46:42] ja, in SoS [18:46:43] but ja [18:47:43] ottomata: milimetric has a running python sqoop script on stat1002, but it has failed and is not stopped [18:48:05] ottomata: could you try to see if you can kill that? [18:48:49] k [18:49:32] i see lots of processes there [18:49:39] do you know which I should target? [18:49:40] ottomata: hm [18:50:13] ottomata: checking as well [18:50:31] wow ottomata [18:50:34] yeah [18:50:37] is htis from a cron? [18:50:42] maybe its launching too often? [18:50:42] ottomata: nope [18:50:45] hm [18:51:20] ottomata: subprocesses for paralelisation? [18:52:19] ja maybe so [18:52:25] any can be killed? [18:52:30] ottomata: I think all can [18:52:32] i can try to kill them, just don't want to kill the wrong thing [18:52:33] ok [18:52:38] thanks ottomata [18:54:06] ok, killed the parent procs, there are 2 yarn jobs running [18:54:08] should I kill those too? [18:54:30] joal: ? [18:54:39] ottomata: I can do that ;) [18:54:45] ok [18:54:57] ottomata: I was willing to have python stuff killed first :) [18:56:23] Done ottomata [18:56:28] many thanks :) [18:57:12] yuppers :) [18:58:20] mforns, milimetric, ottomata : New sqoop job launched in a screen under joal user on stat1004 [18:58:27] for info :) [18:58:38] joal, without problems? [18:58:49] mforns: nothing so far, but still waiting [18:59:22] 10Analytics, 06Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2917263 (10Nuria) @Jsalsman : please read premise of ticket, this is about replacing IPs by hashes to avoid cut & paste errors. Given that hash salt will still be available for at least 60 days it doesn'... 
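
As an aside on cleaning up the stuck sqoop run above: once the parent Python processes are killed, the orphaned YARN applications can be found and killed with the standard yarn CLI. A small sketch follows; the user name is a placeholder and the tab-separated output parsing assumes the default 'yarn application -list' format.

    #!/usr/bin/env python
    """List running YARN applications and kill the ones owned by a given user
    (here, the leftover sqoop imports). Uses only standard yarn CLI commands."""
    import subprocess

    USER = 'milimetric'  # owner of the orphaned jobs, per the discussion above

    listing = subprocess.check_output(
        ['yarn', 'application', '-list', '-appStates', 'RUNNING'],
        universal_newlines=True)

    for line in listing.splitlines():
        if not line.startswith('application_'):
            continue                      # skip header lines
        fields = line.split('\t')
        app_id, app_user = fields[0], fields[3]
        if app_user == USER:
            subprocess.check_call(['yarn', 'application', '-kill', app_id])
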
[19:00:06] 10Analytics, 10Analytics-EventLogging: EventLogging fails to validate a Recentchanges event for he.wikipedia.org - https://phabricator.wikimedia.org/T154395#2909660 (10Nuria) Alarm: Notification Type: PROBLEM Service: Throughput of EventLogging EventError events Host: graphite1001 Address: 10.64.32.155 Stat... [19:00:51] 10Analytics, 10Analytics-EventLogging: EventLogging fails to validate a Recentchanges event for he.wikipedia.org - https://phabricator.wikimedia.org/T154395#2917267 (10Nuria) {F5229514} See spike on errors, it appears as a spike on EventError schema. [19:34:54] 10Analytics, 06Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2917418 (10Jsalsman) How is the 60 days figure derived? [19:56:47] 10Analytics, 10Analytics-EventLogging: EventLogging fails to validate a Recentchanges event for he.wikipedia.org - https://phabricator.wikimedia.org/T154395#2917508 (10Nuria) Looked at events on errors and agree with initial diagnosis about events ChangesListFilters being on error. Assigning to Roan which seem... [19:57:00] 10Analytics, 10Analytics-EventLogging: EventLogging fails to validate a Recentchanges event for he.wikipedia.org - https://phabricator.wikimedia.org/T154395#2917510 (10Nuria) a:03Catrope [19:58:22] ottomata: yt? [19:58:52] nuria: ya [19:59:51] ottomata: the icinga alarm trigerred about EL errors (https://phabricator.wikimedia.org/T154395#2917267) should be reset to no longer be opened ... makes sense? as issue has been resolved [20:00:09] ottomata: not sure how to do that on icinga ui [20:03:43] ottomata: is there a way to see alarms open on kafka hosts ? [20:04:41] ottomata: alarms i see for those hosts have nothing to do with EL [20:06:07] nuria: sorry [20:06:11] ottomata: np [20:06:50] nuria: these are all currently triggered alarms [20:06:50] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicestatustypes=28 [20:07:03] there is nothing related to eventlogging there [20:07:30] ottomata: ok, nothing else to do then [20:07:36] ja :) [20:08:01] 10Analytics-EventLogging, 06Analytics-Kanban: EventLogging fails to validate a Recentchanges event for he.wikipedia.org - https://phabricator.wikimedia.org/T154395#2917563 (10Nuria) [20:08:34] ottomata: thnksssirr [20:08:42] *thankssirrr that is [20:12:20] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream: Set charset=utf-8 in Content-Type response header from sse.js client - https://phabricator.wikimedia.org/T154328#2917597 (10Ottomata) [20:19:09] mforns: still there? [20:19:15] hey ottomata yes [20:19:21] looking for reviewer on this: https://phabricator.wikimedia.org/D524 [20:19:42] ottomata, will do! [20:19:45] danke! [21:26:59] thanks mforns! [21:27:06] ottomata, np! [21:42:33] 06Analytics-Kanban, 06Operations, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (10Nuria) Also here: https://issues.apache.org/jira/browse/HADOOP-11105 regarding class: org.apache.hadoop.metrics2.impl.MetricsSystemImpl which is retaining a lot of memory... 
[22:08:46] 10Analytics-Tech-community-metrics: Fix incorrect mailing list activity of AKlapper (=Phabricator) in Technical Community Metrics user data - https://phabricator.wikimedia.org/T132907#2917988 (10Aklapper) a:05Aklapper>03None [22:48:11] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2918168 (10cscott) FWIW, the offline content generation service (OCG, generates PDFs, ZIM files, books, etc) also has a be...