[01:02:34] Analytics-Kanban, VisualEditor: Schema:Edit seems to incorrectly set users as anonymous {lion} - https://phabricator.wikimedia.org/T92596#1208537 (Krenair) a:Krenair>None [01:49:46] Analytics-EventLogging, MediaWiki-General-or-Unknown, Patch-For-Review, Performance: Add event tracking queue to MediaWiki core for loose coupling with EventLogging or other interested consumers - https://phabricator.wikimedia.org/T95356#1208569 (bd808) What should the config setup for the EventLog... [03:08:55] (PS1) Nuria: Undoing Ide98e20eb54523353153ccd212df5... [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/204214 (https://phabricator.wikimedia.org/T93023) [03:21:04] (PS2) Nuria: Undoing Ide98e20eb54523353153ccd212df511a9298bd16 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/204214 (https://phabricator.wikimedia.org/T93023) [03:22:54] (CR) Nuria: [C: 2] Undoing Ide98e20eb54523353153ccd212df511a9298bd16 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/204214 (https://phabricator.wikimedia.org/T93023) (owner: Nuria) [03:23:01] (Merged) jenkins-bot: Undoing Ide98e20eb54523353153ccd212df511a9298bd16 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/204214 (https://phabricator.wikimedia.org/T93023) (owner: Nuria) [04:08:39] Analytics-EventLogging, Analytics-Kanban: Flow events not validating on EL - https://phabricator.wikimedia.org/T95169#1208672 (Mattflaschen) In the future, just associating the #Flow project will ensure we see it. [04:10:13] Analytics-EventLogging, Analytics-Kanban: Flow events not validating on EL - https://phabricator.wikimedia.org/T95169#1208674 (Mattflaschen) But this is actually an #Echo bug. [04:10:24] Analytics-EventLogging, Analytics-Kanban, Collaboration-Team, Echo: Flow events not validating on EL - https://phabricator.wikimedia.org/T95169#1208675 (Mattflaschen) [04:11:35] Analytics-EventLogging, Analytics-Kanban, Collaboration-Team, Echo: Echo events not validating on EL - https://phabricator.wikimedia.org/T95169#1208678 (Mattflaschen) [04:19:53] Analytics-EventLogging, Analytics-Kanban, Collaboration-Team, Echo, Patch-For-Review: Echo events not validating on EL - https://phabricator.wikimedia.org/T95169#1208688 (Mattflaschen) a:Mattflaschen [08:01:52] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April - https://phabricator.wikimedia.org/T96067#1208840 (JAllemandou) [08:02:22] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April - https://phabricator.wikimedia.org/T96067#1207516 (JAllemandou) Thanks for the reminder Aklapper ! I usually add projects, and forgot this time :) [13:07:28] o/ joal [13:17:02] Today I am focusing 100% on moving value-adding/WikiCredit forward. :) [13:33:20] hey halfak [13:33:37] Sorry, I was concentrating :) [13:33:50] I have 1/2 hour to talk now if you wish :) [13:36:11] Oh yeah! Batcave? [13:36:14] joal, ^ [13:36:15] sure :) [15:04:39] joal, something is totally weird [15:04:40] e.g.
[15:04:41] https://yarn.wikimedia.org/proxy/application_1424966181866_85838/mapreduce/job/job_1424966181866_85838 [15:04:48] that is the add partition job for a webrequest misc hour [15:04:50] very low data [15:04:54] it has been running for 20 hours [15:05:06] i can't find the actual job though, that is the oozie launcher task [15:05:16] i think it should spawn off another application [15:05:39] AttemptID:attempt_1424966181866_85838_m_000000_0 Timed out after 600 secs Container released on a *lost* node [15:05:52] maybe this has something to do with analytics1017 crashing this weekend [15:06:39] going to kill and restart that one...via oozie [15:06:40] let's see [15:21:25] joal: i'm going to kill your SELECT regexp_extract(uri_host,...100000(Stage-1) job [15:21:27] it hasn't started yet [15:38:54] joal, simplewiki XML on altiscale HDFS @ /user/halfak/streaming/simplewiki-20141122/xml-bz2 [15:56:57] thx halfak [16:00:44] joal, you talked to yurik a bunch yesterday about his jobs, right? [16:00:45] they have been running a long time. [16:02:18] mforns_brb: let me know if you need help with deploying EL so you get your logging change [16:05:11] nuria, I'm going to look at the docs for deploying, and if I need something I ping you [16:05:20] thanks! [16:12:06] joal, yt? [16:12:15] mforns: k [16:14:19] lunch! [16:15:31] nuria, I do not have access to tin.eqiad.wmnet :[ [16:15:58] ok, i think i figured out what was wrong with the cluster, although I don't know exactly why it went wrong. [16:16:01] now lunch! [16:16:43] ottomata, should I file a task for ops to request access to tin.eqiad.wmnet? [16:18:16] nuria, from the EL docs, it seems that I do not need root in tin.eqiad.wmnet to deploy EL. But I guess I do need root, right? [16:21:02] sorry ottomata didn't see your lunch, we'll talk later :] [16:22:17] o/ kevinator [16:22:21] WOops. offline. [16:22:24] * halfak --> email [16:26:16] ottomata: here now, but in meeting [16:26:23] Can we talk after ? [16:28:57] ottomata: I am eager to know what it is ! [16:30:02] mforns: i think so yes. [16:30:28] mforns: or maybe just ssh access, did you try that? [16:30:55] nuria, what do you mean? ssh tin? [16:31:00] yes [16:31:33] yes, that's what I tried [16:31:39] nuria, is there another way? [16:32:03] mforns: so you DO have ssh, correct? [16:32:35] when I execute: ssh tin.eqiad.wmnet [16:32:46] nuria, I get: Permission denied (publickey) [16:33:02] mforns: ok, you need to file a request for access [16:33:10] mforns: i will deploy [16:33:19] nuria, yes, and ask for root [16:33:33] mforns: no actually w/o root it works [16:33:46] mforns: i just forget cause it is different everywhere [16:33:51] nuria, does it take a long time, because, maybe I can talk with andrew and speed up the access [16:33:59] nuria, ok [16:34:28] mforns: it's fine, it'll take 1 week but it is ok, cause really, as long as you have access to the EL box you can change the code, right? [16:34:36] as we deploy from source here [16:35:45] nuria, I meant, does the deployment process take a long time? so that you don't need to do that [16:36:07] yurik: yt ? [16:36:20] joal, ? [16:36:41] As per ottomata email, please don't run jobs on the cluster now [16:36:55] I know that it has to catch up, but still :) [16:37:03] joal, i'm not, there is a script that autoruns things [16:37:05] There is some debugging and config change needed [16:37:13] Then please stop this script :) [16:37:23] it runs the thing under your name, you run it ;) [16:37:25] working on that already :))) [16:37:33] ok [16:37:38] eta?
[16:37:45] Just to let you know: if needed, we'll kill them [16:37:51] Thanks :) [16:37:55] when can i start them again? [16:38:02] asap, hopefully [16:38:11] ok, need about 10 min [16:38:27] we'll let you know asap it's ready [16:38:39] Thanks again :) [16:41:21] mforns: note that the master also includes a graphite change that you will need to follow up on: [16:41:25] https://www.irccloud.com/pastebin/56wRNqhU [16:42:25] nuria, ok [16:44:21] mforns: ok, deployed to EL box, try and see if code made it to hafnium and you can deploy there [16:44:34] nuria, ok [16:45:51] mforns: please check EL is doing well with your changes [16:46:01] mforns: brb in 40 mins [16:46:08] nuria, it seems [16:46:21] I have no access to hafnium either... [16:46:37] sigh... sorry, I'll ask for it too [16:46:38] mforns: ok, another one [16:47:00] mforns: looks like EL is fine but do check logs for a bit [16:47:08] ok [16:48:23] joal: back from lunch, lemme know when you want to chat [16:51:38] mforns: uh, yes, if you want deployment access ja [16:51:41] make a ticket [16:55:38] ottomata, thanks already doing that [17:38:48] Analytics, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1209816 (DarTar) NEW [17:42:26] oo, nuria, joal, i set up a stupid nginx reverse proxy on my mac, now I don't have to replace yarn.wikimedia.org when clicking on links anymore [17:46:47] hey Ironholds! [17:46:56] you starting jobs?! [17:47:53] o/ mili|lunch [17:48:02] When you get back, check out this: https://commons.wikimedia.org/wiki/File:Diff_time_density_(enwiki).svg [17:48:16] (PS5) Jdlrobson: Update limn graphs [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/199162 (https://phabricator.wikimedia.org/T93690) [17:49:00] The median time it takes to diff article revisions in Wikipedia is 0.05 seconds. The 90th percentile is 0.23 seconds. [17:50:31] So, we're super-duper fast at performing diffs. I'm now looking back at hadoop, which is behaving badly with the reducer. [17:51:26] halfak: that's with spark? [17:51:56] ottomata, streaming mapreduce where I do most of the work in the mapper. [17:53:45] aye cool [17:54:32] ottomata, did I tell you that the big enwiki job finished? [17:54:37] no! [17:54:39] awesome [17:54:45] On our smaller altiscale cluster, I was able to complete it in 5 days [17:54:48] :) [17:55:03] So, now I'm doing analysis on it to figure out how long the bits took and how big the diffs are. [17:55:43] I don't have a good sense for our capacity at altiscale right now, but joal mentioned that he thought it was only 4 boxes. [17:56:32] Analytics-Kanban: Safely reboot limn1, wikimetrics*, dan-pentaho, and any other labs instances running Ubuntu Precise - https://phabricator.wikimedia.org/T96175#1209914 (Milimetric) NEW a:Milimetric [17:57:33] nuria: The first step on Developer setup for EL - where does this code go? [18:03:04] ottomata, hi, can i run my scripts again? [18:06:45] yurik: i believe i have fixed the main problem, which had to do with a weirdly stuck job, not lots of folks using the cluster, so that is good.
[18:07:03] however, i would like to verify this, and i would like to make sure refinement jobs get caught up asap before I say you can submit more jobs [18:07:11] ok [18:07:39] Analytics-Tech-community-metrics: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1167361 (Qgil) fyi, I was right about this task being a high priority goal for this quarter. ;) >>! In T94165#1209948, @Qgil wrote: > The main piece... [18:10:06] ottomata: I have time now :) [18:10:40] halfak: that's music to my ears, great stuff [18:16:16] halfak: spark not really set up in altiscale from what I see [18:16:36] We can file a ticket for it. [18:16:38] I don't manage to get my job running, and spark-shell is not ready to use [18:16:42] It *should* be set up. [18:16:43] I did ;) [18:16:49] Great :) [18:17:09] Got help already about spark-shell (port between machines issue, should be fixed soon) [18:17:22] And just logged a new one about a job failing without usable logs [18:18:17] joal, on phone... [18:18:32] no problemo ottomata, will wait for you ;) [18:20:26] milimetric: Around? [18:20:53] hi madhu [18:21:37] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1209980 (kevinator) [18:21:40] question on EL again. The first step on Developer setup for EL (https://www.mediawiki.org/w/index.php?title=Extension:EventLogging#Specifying_a_dependency_on_EL) - where does this code go? [18:22:05] also what does it do and is it necessary [18:22:28] madhuvishy: so this is developer as in developer using EL, not working on EL [18:22:35] but working on EL includes using EL [18:22:36] milimetric: aaah [18:23:17] but are you wondering more generally about EL? We can hangout and talk if you like [18:24:34] madhuvishy: EL is a service [18:24:46] So this was the first task on my onboarding - https://phabricator.wikimedia.org/T89162. I'm in general wonderment on where to start. [18:24:51] madhuvishy: in vagrant you only have access to the client side part [18:25:07] madhuvishy: that logging goes on the server side [18:25:12] nuria: right. [18:25:21] madhuvishy: do you have the code chcked out? [18:25:26] *checked [18:25:38] yes [18:25:43] madhuvishy: and the role set up? [18:25:56] madhuvishy: ok, set up also navigation timing rile [18:25:57] *role [18:26:00] role set up in vagrant yes. [18:26:02] and we can go to batcave [18:26:25] ah okay, let me provision for that role [18:29:08] madhuvishy: also do you have permissions to ssh to machines in labs? [18:29:47] mforns: did you check your logging in the EL box? [18:30:08] nuria: https://phabricator.wikimedia.org/T96053 says it's done [18:30:23] i think. where do I check? [18:30:29] nuria, what do you mean? [18:30:57] madhuvishy: try to log in here: deployment-eventlogging02.eqiad.wmflabs [18:31:17] mforns: did you check the logging from the mysql consumer is what you expected? [18:31:33] nuria, yes! [18:31:38] mforns: ok, good [18:31:51] mforns: i looked ta it and it seems fine but just to triple check [18:32:02] *at it [18:32:14] it works perfectly, the event timestamps are almost identical to the consumer timestamps, all the time [18:32:42] nuria, so it seems like a db problem [18:33:00] nuria, I've observed a pattern in the master-slave replication lag [18:33:05] mforns: db not receiving records? [18:33:13] nuria: I think I need to set up something else first. ProxyCommand?
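For reference, the role provisioning madhuvishy mentions above is done with the MediaWiki-Vagrant roles plugin; a sketch, where "eventlogging" is an assumed role name (the real names come from the list command):

    # list the roles MediaWiki-Vagrant knows about, enable the EventLogging one,
    # then re-provision so puppet actually applies it
    vagrant roles list
    vagrant roles enable eventlogging   # role name assumed; check the list output
    vagrant provision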
[18:33:27] it *seems* to me that the replication lag gets progressively bigger with time [18:33:30] madhuvishy: did you check the ssh config on the onboarding page? [18:33:35] nuria, like 10 minutes increase per hour [18:33:40] Yeah seeing that now. [18:33:42] mforns: ahhhhhh [18:33:59] madhuvishy: ok, give it a try [18:34:12] nuria, until the data gap happens, and then the replication lag goes to less than 1 minute [18:34:13] mforns: and - as lag increases - so does the gap? [18:34:21] mforns: what? [18:34:35] mforns: springle is going to like that #not [18:34:45] nuria, the gap in data happens only once or twice per day [18:35:14] nuria, the thing is, I think the gap in data coincides with the replication lag going back to normal [18:35:36] mforns: so funny, in a ahem not-so-good-way [18:36:03] hehe [18:36:27] mforns: sounds like the master is unable to write when replication is being "sped up" or doing more heavy stuff [18:37:34] mforns: ok, then you can work on that with springle, i cannot think of an alarm on our end on that [18:38:10] nuria, yes I agree, the only thing that goes against this is springle's statement, that "the db did not see any inserts of the missing data" [18:38:25] nuria, springle concluded that grepping the binary logs [18:39:16] nuria, but yes, I think the best will be to reply to him, telling him about the replication pattern and giving the event examples that he asked for [18:39:59] mforns: ya, also, if you know when the next gap is going to happen you can "listen" to inserts with tcpdump and see those going to the box i think [18:40:04] ottomata: I am getting dinner [18:40:09] Will be back soon [18:40:41] nuria, good idea [18:40:54] mforns: sounds to me that the db acknowledges inserts and they are not actually being executed (and thus not being written to db transaction logs) but again i know next to nothing about the internals of the db setup [18:41:22] nuria, I've read about some innodb insert buffer [18:41:39] nuria, but my understanding of that was so shallow that I stopped reading [18:41:49] nuria: something something Permission denied (publickey). [18:42:20] nuria, what? [18:42:26] madhuvishy: ok, with the right ssh config? [18:42:58] mforns: tcpdump to listen on port 5000 (incoming/outgoing) [18:43:04] https://www.irccloud.com/pastebin/ojV7NUj0 [18:43:22] mforns: you should be able to listen on the mysql port (the port is in the config) [18:43:40] nuria, ok [18:44:33] madhuvishy: and you did do this right? https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Access#Creating_a_Labs_account_on_Wikitech [18:44:53] nuria: yup [18:45:05] madhuvishy: BTW, please update the onboarding docs that might be incorrect, the idea is that each hire keeps them current [18:45:08] my public key matches with the one on labs [18:45:21] nuria: okay sure [18:45:36] https://www.irccloud.com/pastebin/KcAlsB4w/.ssh%2Fconfig [18:46:38] are you guys watching the 'we're all remoties' talk? [18:46:46] madhuvishy: since this is outside our control, try to ask in #wikimedia-labs and see if you get help there [18:46:55] mforns: arghhh, forgot! [18:47:04] starts in 10 [18:47:08] mforns: do you have the link handy? [18:47:12] mmm [18:47:37] hi madhuvishy nuria [18:47:44] someone needs to add her to the deployment-prep project [18:47:49] (her name is not listed at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep) [18:48:01] nuria, no link yet [18:48:05] YuviPanda: someone being ....ahem.... [18:48:23] YuviPanda: a handsome ops indian guy?
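A minimal sketch of the tcpdump check nuria suggests above, for watching inserts actually arrive at the database box; the interface name and the MySQL default port 3306 are assumptions, the real port is in the EventLogging mysql consumer config:

    # print packet payloads in ASCII and grep them for insert statements
    sudo tcpdump -i eth0 -A 'tcp port 3306' | grep -i insert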
[18:48:26] nuria: you, for instance :) [18:48:36] or anyone in that list of ‘admins’ :) [18:48:50] YuviPanda: me? wait i do not even know what that is but checking ... [18:48:59] nuria: :D you got admin there :D [18:49:08] nuria: you can go to https://wikitech.wikimedia.org/wiki/Special:NovaProject and add people to the deployment-prep project [18:50:23] nuria: also I don’t think the ops team has any ‘handsome’ indian guys :) [18:50:25] * YuviPanda slinks away [18:50:40] YuviPanda: true that. :P [18:51:05] YuviPanda: finding myself admin of things i barely know anything about is unsettling but let me add madhu there [18:51:56] nuria: :D deployment-prep admin is given out fairly freely [18:52:02] you also have sudo there on all machines for example [18:53:19] madhuvishy: added you to : https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep, try again [18:55:38] nuria: YuviPanda same error. [18:56:06] YuviPanda: any ideas? [18:56:49] nuria: did you add her at https://wikitech.wikimedia.org/wiki/Special:NovaProject [18:57:03] I don’t see her there [18:58:02] joal, phone call over! [18:58:04] want to batcave? [18:58:37] YuviPanda: but i do not even see where to add people there... [18:58:51] I can help in an hour or two maybe, in the middle of another migration, sorry [18:58:55] ask in -labs? [18:59:32] nuria: actually, let me just add her [18:59:41] YuviPanda: sounds good, ok [18:59:47] madhuvishy: nuria done [18:59:55] but you should update your onboarding docs :) [19:00:08] YuviPanda: thank you, madhuvishy please try again, if it does not work we shall ask in labs [19:00:41] YuviPanda: had just done it: https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs#Give_people_access [19:00:48] cool :) [19:01:25] YuviPanda: i am all about docs otherwise i forget EVERYTHING [19:02:04] nuria: YuviPanda thanks, it seems to have worked. milimetric pointed out I had <> around my username in config. [19:02:34] ottomata, so: how come HDFS is so backlogged? [19:02:41] Like, I get the production tasks, but this is... new. [19:02:49] madhuvishy: all right, one less thing, i want to watch the "all remoties" talk (that i think has not started yet) [19:02:59] madhuvishy: we can go to batcave after [19:03:05] nuria: me too :D [19:03:16] nuria: sure. i will also get lunch. [19:03:20] ottomata: joining the remoties meeting [19:03:27] madhuvishy: make sure to install the navigationtiming extension and that your mw instance is running on http://localhost:8080 [19:03:28] Can chat though [19:03:39] nuria: done [19:03:51] madhuvishy: open your network panel [19:04:07] madhuvishy: every time you load that page you should see a req trying to hit bits [19:04:26] haha, ok, so, i'm not sure this is 100% fixed now...but [19:04:26] that comes back with an error (as bits is not set up on vagrant) [19:04:35] remember that job I mentioned yesterday that was taking forever?
[19:04:49] it was messed up because of analytics1020 [19:04:52] nuria: yes that seems to be happening [19:04:56] yesterday, cmjohnson fixed an20 [19:05:06] but, wasn't quite done fixing it, and had left the network cable unplugged for something [19:05:17] so, some containers got allocated there [19:05:18] madhuvishy: ok, that is the "client" of eventlogging, javascript code that lives on mediawiki [19:05:23] but an20 disappeared [19:05:27] i dunno why yarn didn't fix itself [19:05:32] but those containers were just stuck in timeout [19:05:35] waiting for response [19:05:42] weird [19:05:44] the app master just sat there doing nothing [19:05:44] madhuvishy: we own both client and server (and both are in the same repo, the EventLogging extension) [19:05:57] which took up a lot of the fair share, because it still had resources allocated to it [19:05:57] nuria: right, okay.. [19:06:09] anyway, i killed that, and am trying to get prod jobs caught up [19:06:18] webrequest load jobs are doing well [19:06:18] still a lot in the essential queue ottomata [19:06:26] once i saw load jobs were cool [19:06:26] yeah [19:06:33] ok [19:06:36] i unsuspended refine jobs [19:06:41] now things are looking pretty slow again [19:06:44] the one thing that is weird [19:06:45] madhuvishy: in vagrant the python server code is not deployed [19:06:53] is that the first job in the queue takes up most of the fair share [19:06:56] madhuvishy: but you can run unit tests and execute some of it [19:06:57] and things still don't get allocated [19:07:06] joal, our essential queue is fifo [19:07:10] i'm thinking about changing it to fair. [19:07:22] hmm [19:07:23] madhuvishy: see that you can run tests: http://www.mediawiki.org/wiki/Extension:EventLogging#How_to_run_tests [19:07:33] and also making another queue [19:07:34] . [19:07:37] or, maybe we keep essential fifo [19:07:41] and make a production queue [19:07:42] that is fair [19:07:45] Wouldn't it be better to keep it fifo, and add other queues ? [19:07:48] yeah [19:07:49] maybe so [19:07:50] yup [19:07:57] so queues. [19:08:23] default [19:08:23] _something else in between default and production_? [19:08:23] production [19:08:23] essential [19:08:24] ? [19:09:28] madhuvishy: will be watching talk, let me know if you run into trouble running tests [19:09:52] madhuvishy: it might be that the php unit tests docs need updating [19:10:32] default < production < essential < top_prio (to catch up on errors for instance) [19:10:38] ottomata: --^ [19:11:21] hm, naw, cause ellery and leila wanted a queue that they could submit things to that was > default, but not production [19:11:34] and, if we move more jobs out of essential [19:11:38] essential will be top prio [19:11:40] Yeah, but how to name that ? [19:11:46] i'm thinking essential just has [19:11:47] camus [19:11:49] load [19:11:52] mayyybe refine [19:11:55] ? [19:12:13] I think refine can go in production [19:12:17] aye ok. me too [19:12:22] Doesn't need to preempt [19:12:37] aye, i mean, all the other jobs will depend on it. [19:13:02] Yes, but we don't lose data if it doesn't run [19:13:30] But it's one of the important ones, correct [19:13:50] yes, i mean, we don't lose data if load doesn't run [19:13:51] only if camus doesn't run [19:13:57] ottomata: in that thread, we were asking for access to essential for a period of time, but if doing it in a different and more permanent way makes more sense, we're definitely open to that. [19:13:57] nuria: alright. running them.
brb lunch [19:14:09] leila, it does make sense [19:14:12] help us think of a name [19:14:15] it is not production [19:14:18] it is not default [19:14:20] what is the name of the queue? [19:14:23] privileged? [19:14:24] naw [19:14:55] humm. thinking. [19:14:55] ottomata, you mean root.essential? [19:14:55] bonus! [19:14:56] ? [19:14:58] ottomata: so it's either camus + load + refine in essentials, or camus only [19:15:04] or, to distinguish from root.essential? [19:15:07] ottomata: Jeopardy ! [19:15:08] new queue [19:15:11] Ironholds: [19:15:23] aha [19:15:34] joal: load is important toooooo, dunno, i could see camus and load in essential [19:15:46] so essential is for the root stuff where shit will explode and everything falls over if it doesn't work [19:16:04] ottomata: yeah, but as you said as well, refine is core to users [19:16:11] root.prioritised? [19:16:12] nothing happens with load only [19:16:12] ottomata, joal, so what you have in mind will go between default and production in terms of priority? [19:16:21] prioritised! is pretty good [19:16:24] correct leila [19:16:24] default, prioritised, essential [19:16:31] ottomata: I like it [19:16:35] + production [19:16:42] production > prioritized [19:16:46] does that make sense? [19:16:54] default < prioritized < production < essentials [19:16:57] a [19:16:58] ja [19:17:03] +1 to prioritized as long as it has a clear distinction from top_prio [19:17:05] priority* [19:17:58] ottomata: so camus + load + refine in essentials, FIFO, with maybe less preemption than currently ? [19:18:05] okay, so we're dropping top_prio in [19:18:05] default < production < essential < top_prio [19:18:05] and replacing it with [19:18:05] default < prioritized < production < essentials joal, ottomata? [19:18:14] good to me :) [19:18:15] yup [19:18:32] hmm, ja what should preemption be? i dunno. maybe we should just put camus in essential. it should preempt! [19:18:36] but maybe the other two shouldn't [19:18:54] Yeah, that was my point [19:18:54] the logic works well for me, as long as in practice the prioritized works well for us. [19:19:04] leila: ;) [19:19:13] ;-) joal. [19:19:15] leila, i think we will have to tweak as it goes [19:19:30] I agree. we won't know it until we try it. [19:19:59] thanks for looking into this joal, ottomata. [19:20:12] You're welcome leila [19:20:51] ottomata: then in production, we have everything from core (load, refine) to aggregate (media, pageviews etc) [19:21:46] yes? [19:22:16] load + refine are to be run aggregation ... [19:22:24] ? [19:22:26] before aggregation sorry [19:22:38] yes, i mean, the agg jobs depend on load and refine [19:22:44] yeah [19:22:49] Oozie does that for us [19:22:54] so, if the cluster is doing well, that should be fine, but if load and refine jobs back up, dunno [19:22:54] yeah [19:23:06] But I'd still like to have the former have more resources ;) [19:27:20] So ottomata essentials = camus, FIFO [19:27:36] Might have camus-webrequests and camus-eventlogging [19:27:55] then production, fair with high prio, no preemption [19:28:07] for load, refine, and aggregations [19:28:39] joal, i'm going to give production low preemption settings? [19:28:42] like, 10 minutes for min share [19:28:46] then prioritized, fair, medium prio, access on demand [19:28:46] no preemption for fair share [19:28:59] really ? [19:29:02] ja, no? [19:29:09] don't think so? [19:29:12] i mean, maybe not?
[19:30:20] I think it does (from doc) [19:30:28] If a pool's minimum share is not met for some period of time, the scheduler optionally supports preemption of jobs in other pools. [19:32:24] ja, right? [19:32:28] so production with low preemption -> sounds good [19:32:42] Just need to ensure it doesn't preempt from essentials ;) [19:33:18] that's what we want? that way we can be sure production jobs will be started at least within 10 minutes [19:33:18] even if they don't get their fair share [19:33:18] It's dumb, but we need to be sure [19:33:18] yeah, i'm not sure about that, i was thinking about making essential a subqueue of production [19:33:18] but i'm not sure [19:33:18] if it would matter [19:33:18] Don't know [19:33:23] Will look into it [19:36:07] having proper weights seems the way to go [19:36:57] yeah [19:37:00] joal, weights: [19:37:06] 1, 5, 5, 10 [19:37:06] ? [19:37:15] production is same as prior, but production can preempt? [19:37:32] hmm [19:37:54] We can try that, and adjust if it doesn't fit :) [19:37:58] k [19:38:31] I don't see why it wouldn't work, so trying seems a good way [19:39:06] Sounds good [19:39:10] queues are there [19:39:15] ottomata: --^ [19:39:21] yeah :) [19:39:25] But still a lot to catch up [19:40:00] Maybe we wait for things to catch up before restarting the oozie jobs ? [19:40:17] well, i'm worried about the fifo thing [19:40:29] for essentials ? [19:40:35] right now, yes. [19:40:41] Why ? [19:40:43] i think that the first job is taking up all the resources [19:40:53] having somebody blocking everything ? [19:40:53] hmm, actually [19:41:01] no it looks better now [19:41:06] before when I was looking [19:41:15] the first job was taking a huge portioin of the share [19:41:18] portion* [19:41:21] and everything else had 0 [19:41:27] now it is more evenly distributed [19:41:33] well, no [19:41:33] yup [19:41:34] not really [19:41:40] i mean, more jobs are evenly distributed [19:41:41] but [19:41:43] You've restarted the oozie already ? [19:41:45] the earliest submitted job [19:41:47] no [19:41:51] i haven't [19:41:56] hum [19:41:58] the earliest submitted job has 1123312 [19:42:02] true [19:42:24] If we only have camus in essentials, then fifo sounds correct [19:42:29] yeah [19:42:33] that's why i kinda want to restart oozie [19:42:39] i think i might do that for refine jobs now. [19:42:42] you want the thing to wait for the one that has issues [19:42:59] sounds good, please help yourself [19:43:11] or ask me to help, if you wish ;) [19:43:27] i just want validation, thank you! [19:43:27] :) [19:43:59] Well, I am not powerful enough to prevent you from doing the right thing ;) [19:45:41] joal, i think i can change the queues that new workflows will be submitted in by oozie [19:45:44] without restarting the jobs [19:45:54] reaaaaaly ? [19:46:44] https://oozie.apache.org/docs/4.1.0/DG_CommandLineTool.html#Updating_coordinator_definition_and_properties [19:47:40] sounds cooooool ! [19:47:55] But works for coordinators, not bundles, does it ?
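A sketch of a FairScheduler allocation file matching the queue layout and the 1/5/5/10 weights agreed above; the queue names, weights, FIFO essential queue, and the ~10-minute min-share preemption come from the discussion, while the file itself is an illustration, not the real puppet-managed config:

    <?xml version="1.0"?>
    <allocations>
      <queue name="default">
        <weight>1</weight>
      </queue>
      <queue name="prioritized">
        <weight>5</weight>
      </queue>
      <queue name="production">
        <weight>5</weight>
        <!-- "low preemption settings... like, 10 minutes for min share";
             only takes effect for queues that also define a min share -->
        <minSharePreemptionTimeout>600</minSharePreemptionTimeout>
      </queue>
      <queue name="essential">
        <weight>10</weight>
        <!-- camus only; FIFO so the oldest submitted run always goes first -->
        <schedulingPolicy>fifo</schedulingPolicy>
      </queue>
    </allocations>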
[19:48:14] hmMMmm [19:48:18] not sure [19:48:23] bundles are made of coordinators... sooo, hm [19:48:27] there is a dry run option [19:48:28] i will try it [19:48:33] ok [19:48:38] I think [19:49:02] you might have to change every coordinator started by the bundle [19:49:39] ja maybe [19:50:58] ahh cool joal [19:51:04] oozie job -configcontent 0072945-150220163729023-oozie-oozi-C [19:51:15] gets coordinator config [19:51:25] hmm, that is the xml though :/ [19:51:34] oh queue name is there...hmm [19:51:54] yeah, comes from bundle.properties [19:52:34] Man, jobs pile up in the essential queue currently [19:52:47] 23 2 hours ago, 25 one hour ago, 30 now [19:53:34] yea [19:53:44] gonna suspend the refine bundle again [19:54:00] Maybe kill it, change the config, then restart later ? [19:54:17] instead of just suspend, kill and change config [19:58:07] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1210287 (Eloquence) @DarTar, if you and Oliver can work out a plan that works for both of you, I am f... [20:00:52] ottomata: so, killing refine ? [20:01:06] i suspended, did not kill yet [20:01:09] no new jobs should come in [20:01:14] i'm still looking [20:01:21] the annoying thing about kill is [20:01:37] that if I kill a bundle, i have to make sure i backfill properly, since the jobs for each coordinator aren't all caught up [20:01:57] bah, but ja [20:01:58] yeah, need to double check on folders [20:02:02] this -update thing is annoying [20:02:17] coordinators only, no bundle heh [20:03:09] meh [20:03:18] hang on, lemme try a few more mins on -update [20:03:31] sure, no prob [20:03:38] it should work, but is currently saying [20:03:47] Error: E0803 : E0803: IO error, E1023: Coord Job update Error: [Coord name can't be changed. Old name = refine-webrequest_misc new name = refine-wmf_raw.webrequest->wmf.webrequest-misc-coord] [20:03:59] which, i dunno why it thinks i'm trying to change names [20:04:11] i don't even know what property that would be [20:04:15] it is an attribute :/ [20:04:33] hmm, maybe it changes the name by default, (as an id), and since name is tied to bundle, not possible ? [20:04:45] yeah that could be it [20:05:00] in the future we should make the bundle's coordinator name and the coordinator's own name match [20:05:13] yeaaa [20:05:13] hm [20:05:18] They don't ? [20:05:22] no i just noticed [20:05:27] bundle.xml does [20:05:33] name="refine-${webrequest_source}" [20:05:35] and coordinator.xml does [20:05:41] name="refine-${source_table}->${destination_table}-${webrequest_source}-coord" [20:06:01] ok [20:09:32] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1210304 (Nuria) >We're handing off the generation of this data to Analytics Eng, and the team will s... [20:14:51] huhu, ottomata , jobs restarted in production queue ? [20:15:50] yes, i haven't restarted the refine bundle yet [20:15:56] just the few coordinator jobs that had been lagging [20:16:03] cool :) [20:16:08] i think i'm going to do the same for load... [20:16:17] so the coordinator still runs. with the new config ? [20:16:32] and the bundle knows it ?
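The -update attempt above follows the Oozie 4.1 CLI linked earlier; a sketch reusing the coordinator id quoted in the log, where the oozie server URL and properties file name are placeholders:

    # dry-run first to see what oozie thinks would change
    oozie job -oozie http://localhost:11000/oozie \
        -config coordinator.properties \
        -update 0072945-150220163729023-oozie-oozi-C -dryrun

    # then for real; this is the step that fails with E1023 when the
    # coordinator name in the new definition does not match the old one
    oozie job -oozie http://localhost:11000/oozie \
        -config coordinator.properties \
        -update 0072945-150220163729023-oozie-oozi-C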
[20:16:44] i had to start the coordinators manually [20:16:47] outside of the bundle [20:16:53] right [20:17:01] so, i pick the date at which all coordinators have not yet run, and choose that for my bundle start time [20:17:23] then, I start manual coordinators for each of the ones I need to backfill, and set start_date AND stop_date [20:17:39] e.g. [20:17:39] -Dstart_time=2015-04-15T07:00Z \ [20:17:39] -Dstop_time=2015-04-15T15:00Z \ [20:17:39] -Dwebrequest_source=misc \ [20:18:04] you killed the bundle, re-run the coordinators manually, and will restart the bundle with new config ? [20:18:04] joal, one thing we might want to do [20:18:13] is move sequence stats calculation to the refine table. [20:18:21] might be faster [20:18:22] YEAHHHHH ! [20:18:26] Agreed [20:18:41] ha, or i should just figure out how to convert from json to parquet in camus [20:18:44] i'm sure that is possible. [20:18:56] Other question: the distinct in refine, does it really need to happen on all fields ? [20:19:03] i don't think so [20:19:07] qchris did that part :) [20:19:13] Maybe we could change the refine job to spark ? [20:19:17] hostname, sequence and something else... [20:19:19] sure! [20:19:19] :) [20:19:40] pageviews is priority, but I'll keep that in mind for sure [20:20:05] ottomata: can we submit new jobs already? [20:20:05] hostname, sequence, timestamp ? [20:20:59] Question though about sequence stat : when there is an error, what happens ? [20:21:05] ottomata: -- [20:21:09] nuria: not yet [20:21:17] ja joal that should do it [20:21:28] joal, as in, there is missing or dup data? [20:21:35] then the _SUCCESS file is not written [20:21:36] yeah [20:21:40] and ? [20:21:42] but, there is a _PARTITIONED file [20:21:43] no load ? [20:21:49] sequence stat is part of load [20:21:55] load does: partition + check data [20:21:55] So no refine ? [20:21:58] no, because [20:22:06] refine depends on _PARTITIONED file [20:22:08] not _SUCCESS [20:22:10] this is not idea :) [20:22:13] ideal* [20:22:19] but, because esams is still lossy. [20:22:30] and the _SUCCESS job is not tolerant [20:22:37] i want refine to proceed by default [20:22:43] i get an email with a report about the partitions [20:23:01] if I see that there are contiguous hours with no _SUCCESS [20:23:03] i look into it [20:23:10] ok [20:23:12] otherwise...i assume it is due to lossiness [20:23:25] hmm [20:23:28] yea, not ideal [20:23:31] i want this [20:23:35] refine is looking for _PARTITIONED, you sure ? [20:23:50] https://phabricator.wikimedia.org/T92499 [20:23:55] yes [20:24:16] <dataset name="..." [20:24:28] frequency="${coord:hours(1)}" [20:24:28] initial-instance="${start_time}" [20:24:28] timezone="Universal"> [20:24:28] <uri-template>${webrequest_raw_data_directory}/webrequest_bits/hourly/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template> [20:24:28] <done-flag>_PARTITIONED</done-flag> [20:24:29] </dataset> [20:24:49] Right [20:25:03] thanks for pointing that out, too late for my brain ;) [20:26:38] BUT -> since we have the distinct in refine, we don't catch duplicates [20:26:44] ottomata: --^ [20:27:19] ? [20:27:29] yes, if there are duplicates in raw, no _SUCCESS [20:27:33] Well, in refine, we remove duplicates [20:27:37] but _PARTITIONED happens as soon as the hive partition is added [20:27:38] yes [20:27:48] which is another reason i don't always check if I see no _SUCCESS [20:27:56] i need better detail in this report [20:28:07] all the report tells me is that there is something not 100% perfect with a raw partition [20:28:08] I don't understand this check then [20:28:17] right [20:28:29] either duplicates, or nulls [20:28:34] right ?
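A sketch of the manual backfill submission ottomata describes at the top of this exchange, built around the -D fragments quoted there; the oozie URL and properties file name are placeholders, not the actual refinery invocation:

    # submit a one-off refine coordinator covering only the window to backfill
    oozie job -oozie http://localhost:11000/oozie \
        -config refine.properties \
        -Dstart_time=2015-04-15T07:00Z \
        -Dstop_time=2015-04-15T15:00Z \
        -Dwebrequest_source=misc \
        -run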
[20:28:36] Analytics-Cluster: Augment refinery-dump-status-webrequest-partitions script to show more useful status of webrequest raw partitions - https://phabricator.wikimedia.org/T92499#1210322 (Ottomata) [20:28:48] yup [20:29:01] joal, have you looked at the webrequest_sequence_stats table? [20:29:02] in wmf_raw? [20:29:04] and then, I think this part of the process should be part of the refine [20:29:21] I'm looking at the hql for oozie [20:29:58] Since we read raw webrequest for refine, let's compute stats in the meantime [20:30:06] aye [20:30:10] that would be cool [20:30:12] that would have to be in spark then :) [20:30:14] Right [20:30:14] but ja [20:30:19] Yeah, spark ftw [20:30:34] nuria: Hi! [20:30:45] Particularly knowing that we could cache most of the stuff [20:30:48] madhuvishy: still on talk but it is wrapping up [20:30:57] ok, joal, i'm inclined to let the running load jobs finish up [20:31:02] i've suspended the creation of new ones [20:31:12] nuria: cool let me know when done [20:31:14] once they finish i will kill and restart load bundle [20:31:36] joal: also todo: really look into camus [20:31:41] it is consuming much slower than it used to [20:31:48] hmmmm [20:31:51] it used to be able to consume the last 10 minutes within 10 minutes [20:32:19] although, maybe it is slow because it is competing for resources on the cluster [20:32:25] perhaps having just camus in the essential queue will help [20:32:28] That is my guess [20:32:37] But we'll see ! [20:32:50] Agreed on the process to restart load jobs [20:33:00] what time has it been suspended ? [20:33:19] hour = 17 ? [20:37:39] ottomata: there is a load in production already, normal ? [20:37:43] yes, actually [20:37:59] i started the load bundle in the production queue at 18 [20:38:08] the previous bundle is suspended [20:38:13] and i will kill it once it catches up through 17 [20:38:17] huhu, pre-launching, cool :) [20:39:41] essential only has camus :) [20:40:19] and production only has catch-up on some refinery jobs [20:41:33] I think it would have been better to use for refine the same process you used for load [20:41:40] ottomata: --^ [20:42:00] refine is way behind it seems [20:42:58] load only had a few jobs to finish [20:43:00] refine had a lot [20:43:11] you mean leaving refine in essential to hurry it up? [20:43:55] na, waiting instead of manually relaunching [20:44:01] i'm going to start refine bundle now [20:44:03] haven't done that yet [20:44:07] but it works fine [20:46:57] joal, i set up a local nginx proxy that makes it so I don't have to replace the url with yarn.wm.org anymore :) [20:47:30] https://gist.github.com/ottomata/833d7d771f74c932c739 [20:47:43] and I edited my local /etc/hosts so that analytics1001 was localhost [20:48:11] That's cool ! [20:48:20] I'm gonna do that as well ;) [20:48:27] Thanks for the tip [20:49:19] Looks like refine needs some optimisation [20:49:22] madhuvishy: yt? [20:49:24] nuria: so the python 3 tests all failed. i have python3 installed. also, it's not really sending requests to bits.wikimedia - I was looking at the wrong place. [20:49:33] ottomata: time for me to go to bed [20:49:41] madhuvishy: python 3 will not work [20:49:51] let's talk about having a better refine stage tomorrow [20:50:04] ok, thanks joal [20:50:05] nuria: aah okay. [20:50:05] yeah [20:50:19] madhuvishy: as the mysql driver for sqlalchemy does not support 3 (not used in tests, but it is kind of pointless to test 3 as we know it will not work in prod quite yet) [20:50:20] camus--> done !
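The gist linked above isn't quoted in the log; a sketch of what such a local reverse proxy could look like — an /etc/hosts entry pointing the internal hostname at localhost, plus nginx forwarding to the public yarn.wikimedia.org endpoint. The listen port and header details are assumptions:

    # /etc/hosts entry: 127.0.0.1  analytics1001.eqiad.wmnet

    server {
        listen 8088;                       # port the internal yarn links use (assumed)
        server_name analytics1001.eqiad.wmnet;

        location / {
            # forward to the public proxy so the internal links just work
            proxy_pass https://yarn.wikimedia.org;
            proxy_set_header Host yarn.wikimedia.org;
        }
    }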
[20:50:34] madhuvishy: were you able to run php tests? [20:50:58] madhuvishy: so having 2.7 should be sufficient [20:51:13] nuria: no. [20:51:19] https://www.irccloud.com/pastebin/EMQkHqV8 [20:51:33] madhuvishy: ya, we need to point to the mw unit tests docs [20:52:04] Bye team ! [20:52:17] madhuvishy: try this and if it works you can remove our info and point to this wiki: http://www.mediawiki.org/wiki/Manual:PHP_unit_testing/Running_the_unit_tests [20:52:30] madhuvishy: no point in us having that duplicated [20:52:41] alright let me do that [20:59:30] nuria: I think I had to install phpunit - but I get the same error after that too. [21:01:27] https://www.irccloud.com/pastebin/TIhBp7o4 [21:01:49] madhuvishy: cause i think the command listed on the doc is outdated [21:02:20] ohh [21:02:46] this one worked [21:05:07] madhuvishy: this should work: [21:05:11] https://www.irccloud.com/pastebin/m79ddhz3 [21:05:42] madhuvishy: ah ok, you got it too, just please update our doc to point to this other one so we do not have outdated info [21:06:00] nuria: yeah, this worked. will do [21:10:37] ottomata, how about now? [21:11:58] yurik, i'm still working things out, but it is probably ok. [21:12:02] queues have been rearranged [21:12:07] can you launch things limitedly? [21:12:10] like, one job at a time? [21:12:48] ottomata, there could be two - one manual and one croned [21:12:55] croned is one once a day [21:13:02] *only [21:14:11] yurik: i think it's ok, i'm just not ready to pronounce it so yet :) [21:14:17] nuria: done - https://www.mediawiki.org/wiki/Extension:EventLogging#How_to_run_tests [21:14:19] :) [21:14:29] refine jobs still are catching up [21:14:30] madhuvishy: ok, great, now, onto the python code [21:14:40] and i haven't started the other production (non-data-generating) jobs yet [21:15:44] nuria: also don't know why requests are not being sent to bits.wikimedia - they are being sent to http://localhost:8080/event.gif - and failing. This is correct behavior perhaps? Given that server is not running? [21:15:48] madhuvishy: remember this machine we talked about earlier that you can ssh to? You will need to get sudo there, so you can see the whole system in action (eventlogging server is deployed such that it requires sudo to see logs) [21:15:58] madhuvishy: I asked for that on wikimedia labs [21:16:16] madhuvishy: the machine is the testing environment for event logging [21:16:39] madhuvishy: try ssh -ing and looking into /var/upstart/logs [21:18:02] i don't see upstart in var [21:18:55] nuria: ^ [21:19:09] sorry ! [21:19:24] var/log/upstart [21:19:53] try to tail logs (tail -f ) [21:20:24] madhuvishy: you probably will get denied access [21:20:29] nuria: yeah... denied [21:21:00] Has the command that you run from stat1003 to see EventLogging data streaming in changed recently? [21:21:08] madhuvishy: ok, since you are changing logging and we are going to deploy it to the testing environment to see it in action you will need sudo, ask in wikimedia-labs [21:21:28] Analytics-Volunteering, Engineering-Community, Phabricator, Project-Creators, and 2 others: Analytics-Volunteering and Wikidata's Need-Volunteer tags; "New contributors" vs "volunteers" terms - https://phabricator.wikimedia.org/T88266#1210475 (Qgil) a:kevinator>Aklapper [21:21:51] Deskana: i do not think that command works anymore since we swapped the box [21:22:04] Deskana: we should remove it from docs [21:22:12] nuria: Aha! Do we have a new command we can use instead?
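The phpunit commands pastebinned in this exchange aren't preserved; for MediaWiki extensions of that era, the runner linked in the manual was typically invoked from the MediaWiki core checkout roughly like this — a sketch, not the exact command from the pastebin:

    # from inside the vagrant box, in the MediaWiki core directory
    cd /vagrant/mediawiki
    php tests/phpunit/phpunit.php extensions/EventLogging/tests/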
[21:22:29] madhuvishy: but as always yuvi knows best [21:22:37] Deskana: Thanks for asking the question. :) [21:22:52] nuria: ha ha. i ask in labs for sudo for logs right? [21:23:00] Deskana: no, that command created a bunch of load in vanadium earlier and -ops wise- was not the best [21:23:43] nuria: Good to know. What's the best way for us to verify that our instrumentation is correct and that the data is being received and recorded? [21:23:46] Deskana: i would need to try it to see if it works on the new machine and see if it causes load [21:23:56] Deskana: testing in beta labs [21:24:22] nuria: please keep me also in the loop about this [21:24:32] Deskana: using it as the endpoint for your logging: https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs [21:24:54] Analytics-Volunteering, Engineering-Community, Phabricator, Project-Creators, and 3 others: Analytics-Volunteering and Wikidata's Need-Volunteer tags; "New contributors" vs "volunteers" terms - https://phabricator.wikimedia.org/T88266#1210492 (Qgil) This task was part of #ECT-March-2015, and it is st... [21:25:03] Deskana: if it is apps traffic, they just need to hit (when developing new events) the bits beta labs endpoint [21:25:12] Analytics, Engineering-Community, ECT-April-2015: Analytics Team Offsite - Before Wikimania - https://phabricator.wikimedia.org/T90602#1210529 (Qgil) This task was part of #ECT-March-2015, and it is still open and assigned. Assuming that it belongs to #ECT-April-2015 as well. Otherwise please edit acco... [21:25:26] bearND: Does it sound doable to update our EL endpoint like that? [21:25:38] Deskana: if we are talking events already in production (that have been tested to work prior, as we do not want to test in prod) and you want to see the "rate" [21:25:51] Deskana: for an app you are building? ya, it should be easy [21:25:59] Deskana: ah sorry [21:26:38] nuria: Deskana: you mean pointing to a different EL endpoint for testing only, right? [21:26:39] Deskana: for additional monitoring, you can also look at events on graphite: http://graphite.wikimedia.org [21:26:59] bearND: right, you compile the app with a different endpoint configuration [21:27:20] bearND: and you can log into the machine (beta-labs) and see your stuff on the db, errors and such [21:27:25] nuria: Deskana: Yeah, that's easy [21:27:26] bearND: all in teh wiki i sent earlier [21:28:06] *the [21:28:11] nuria: ok, going through that [21:28:26] bearND: k [21:28:33] nuria: it seems i just needed to prepend sudo. [21:28:37] bearND: to see db records you need sudo on the box [21:28:48] madhuvishy: and you can tail logs excellent! [21:28:56] nuria: Yup! [21:29:21] madhuvishy: ok, read a bit about EL: https://wikitech.wikimedia.org/wiki/EventLogging [21:29:59] madhuvishy: poke around in the box and we shall talk later to see where your code goes [21:30:14] nuria: on it. cool [21:30:23] nuria: So, if I want to see some EL for the apps I tail -f client-side-events.log ? [21:31:43] nuria: Who do I ask for sudo on deployment-eventlogging02.eqiad.wmflabs? [21:31:54] is that YuviPanda? [21:32:25] bearND: if you can log in to that box you already have sudo :) [21:33:04] YuviPanda: Nope, Connection closed by UNKNOWN [21:33:36] please ask the releng team, I think [21:33:46] they’re responsible for beta labs [21:33:48] I know I *can* do it [21:33:57] but this is like the 5th time I’ve been asked beta questions today :) [21:34:02] and I *am* doing other things too [21:34:11] bearND: sorry, not snapping at you, but do ask releng.
[21:34:33] YuviPanda: were all the first 4 times me? [21:34:39] madhuvishy: nope [21:34:43] YuviPanda: That's cool. I guess this means via Phab [21:34:54] bearND: there’s also #wikimedia-releng [21:37:37] bearND: alright, I’ve added you too :) [21:37:46] (the other stuff I was doing just finished up) [21:37:55] but in the future, please poke betacluster people for beta stuff [21:38:04] I’ll be happy to help with any labs stuff in general [21:38:17] bearND: you should be able to ssh in and have sudo now [21:38:32] YuviPanda: isn't beta part of labs (betalabs)? [21:38:48] YuviPanda: great. I'm in [21:38:50] thanks [21:38:53] bearND: it is called beta cluster (greg-g levies a fine on anyone calling it betalabs) [21:38:57] bearND: it just happens to run on labs [21:39:22] for example, uploadwizard / mediawiki runs on our prod platform, but doesn’t mean poking ops for a JS error is going to be of much use :) [21:39:30] * bearND pleads ignorance of that rule [21:39:35] bearND: yeah, it’s not just you [21:39:38] bearND: i made a page for it [21:39:48] bearND: https://wikitech.wikimedia.org/wiki/Labs_labs_labs [21:40:08] bearND: we just suck at naming :) [21:40:16] YuviPanda: I like the title of the page, tho [21:40:19] :D [21:40:57] wikipmediawiki! [21:43:00] madhuvishy: https://wikitech.wikimedia.org/wiki/We_suck_at_naming [21:43:01] err [21:43:02] marktraceur: https://wikitech.wikimedia.org/wiki/We_suck_at_naming [21:43:22] Mmmhmm. [21:44:11] ottomata, https://phabricator.wikimedia.org/T88366 is something that zero has been asking for constantly of me. Unless zero team (non tech) removes it from our queue, it should stay as an inter-team dependency [21:44:26] intra [21:44:26] YuviPanda: I can hear this page in your voice. *this* [21:44:38] madhuvishy: :D [21:49:16] yurik: your manager and ops manager need to talk [21:49:23] ops does not have resources for it at this time [21:50:10] ottomata, i was under the impression that this fits perfectly with what bblack was about to do anyway - unifying all of the varnish infrastructure [21:51:31] yurik: all i know is that there is nothing that SoS can do for that task at the moment, you and bblack both know what is up and are in communication. if you need something more, then you need to work it out. it isn't a task that needs poking, it's a task that needs collaboration between either you and bblack, and/or your manager and mark [21:51:52] SoS is for greasing the wheels for squeaky tasks :) [21:52:00] sigh ) [21:52:04] let me check with bblack [21:52:09] k [21:58:37] bearND: would you be so kind as to add to the doc what you needed to get sudo on the box? [21:59:37] bearND: that way we are set for other users (I had no issue, so there must be something that changed) [22:00:45] nuria: I think that's a question for YuviPanda. Not sure what the official way is and if he wants to keep doing it. [22:01:05] it’s basically ‘get added to deployment-prep project' [22:01:21] bearND: ah, then it is there! [22:01:33] bearND: cause i added it myself, ahem ... [22:01:39] you should talk to release engineering to see if they want to formalize that more than ‘poke someone on http://wikitech.wikimedia.org/wiki/Nova_Resource:deployment-prep' [22:02:09] bearND: ta-ta-channnn: https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs#Give_people_access [22:02:29] bearND: did you get sudo alredy? [22:02:34] *already?
[22:03:24] nuria: yes, I've got it [22:03:41] bearND: k, you are set then [22:03:52] madhuvishy: let me know when you are done looking at docs [22:03:56] nuria: your new section still would not have helped me, though [22:04:23] bearND: I know it's so cryptic... like "mumbo-jumbo-wikitech" [22:04:30] nuria: it's not clear who to ask for access. This is info for releng engineers maybe [22:05:20] nuria YuviPanda: maybe you could add some blurb about asking in the releng channel, or whatever the process should be [22:06:46] bearND: i did but these things keep changing continuously [22:07:09] nuria: I also have a general question about this. Am I understanding this correctly, that from now on, if I want to test any EL code, I need to do it this way? [22:08:03] nuria: So, we should get all other devs doing anything with EL (pretty much everyone, I guess) hooked up with this as well. [22:08:44] nuria: sorta done. too many terms. think they'll clarify as i dig into it more [22:08:56] i understand the overall flow though [22:10:26] bearND: they mostly are, teams that use it the most are mobile web, apps, flow-whatever-it's-called, mediaviewer [22:10:49] bearND: and they know about beta labs, we got it working a couple months ago [22:10:55] beta cluster :) [22:11:06] YuviPanda helped a bunch [22:11:23] YuviPanda: write a bot for that [22:11:33] very tempting idea, actually [22:11:37] might do that later today [22:12:00] bearND: it is obviously a testing env, so we also test our software there as madhuvishy is going to do today. [22:12:08] nuria: I'm in apps and currently the tech lead for the Android app, and did not know about this. This was clearly not communicated well. [22:12:13] madhuvishy: can you send me your phab ticket again? [22:12:27] nuria: https://phabricator.wikimedia.org/T89162 [22:12:31] bearND: you are probably not in the analytics@ list then [22:12:46] nuria: probably [22:12:58] bearND: or you were not when it was announced, which is fine [22:13:06] (PS1) Ottomata: Make oozie actions that need to submit MR jobs run the dummy oozie:launcher job in a low priority queue [analytics/refinery] - https://gerrit.wikimedia.org/r/204394 [22:13:07] bearND: i would subscribe to analytics@ [22:13:29] bearND: dan & adam are there, and as i said it has been a while [22:13:43] (CR) Ottomata: [C: 2 V: 2] Make oozie actions that need to submit MR jobs run the dummy oozie:launcher job in a low priority queue [analytics/refinery] - https://gerrit.wikimedia.org/r/204394 (owner: Ottomata) [22:14:31] madhuvishy: ok, the idea is that wherever exceptions are logged, they should be logged with a timestamp [22:14:41] nuria: right [22:15:14] nuria: ok, I just subscribed [22:15:40] bearND: ok, analytics communications go there as it is a more targeted audience [22:15:47] bearND: it's a public list [22:16:40] nuria: do you remember roughly when this EL change was announced, so I can look it up in the archive?
[22:18:10] bearND: puf, no, i know it sprang from a very/very long thread with content translation but it was a while back [22:18:41] bearND: ah, i know, i started at wiki in nov 2014 [22:18:55] bearND: so probably after that [22:19:47] ottomata, i looked at the history of my jobs - first they took about 3hrs each, now they are taking 15hrs+ [22:20:02] madhuvishy: or actually any logging should probably incorporate a friendly timestamp [22:20:08] yurik: there are problems with resource allocation [22:20:10] i am working on them [22:20:13] gotcha [22:20:21] thx [22:20:32] madhuvishy: but it is more important in exceptions [22:20:54] nuria: thanks. I think I found it https://lists.wikimedia.org/pipermail/analytics/2015-January/003020.html [22:20:56] also whaaatthevrap now, oozie isn't responding to my queriesssss [22:21:00] growwlll [22:21:18] ottomata, i was worried it is due to this patch (it's already running in manual mode) - https://gerrit.wikimedia.org/r/#/c/204153/1/scripts/countrycounts.hql,cm [22:21:33] there i set reducers to 1 [22:21:43] yuri, i dunno [22:22:08] madhuvishy: take a look at the logging code and see where you could add a timestamp best [22:22:23] nuria: looking into it [22:22:37] madhuvishy: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/utils.py [22:22:54] yay, in the right file already [22:22:54] madhuvishy: we will deploy your patch and test it in beta labs [22:23:07] madhuvishy: jaja it's because you are a pro [22:24:38] nuria: ha ha. now for noob questions - where do i actually write the patch? local machine/vagrant? also what are the practices around git etc. does this go into a separate branch/things like that [22:26:11] madhuvishy: it's all gerrit, and you will access the repo outside vagrant (where you are madhuvishy, not the vagrant user) [22:26:23] madhuvishy: have you used gerrit before? [22:26:31] madhuvishy: there are no branches, just changesets [22:26:40] nuria: okay. no not used gerrit [22:26:48] (there could be branches!) [22:26:51] (but we don't use them much) [22:27:19] ottomata: ya ... ahem... trying to keep it simple [22:27:38] madhuvishy: so you do code changes outside vagrant and you run tests in vagrant via mount [22:27:54] nuria: aah [22:27:55] but gerrit/git needs your "real person" credentials [22:28:15] nuria: right, that i think i have set up [22:28:24] madhuvishy: so tests are here: vagrant@mediawiki-vagrant:/vagrant/mediawiki/extensions/EventLogging/server [22:28:30] are "run" from there [22:28:51] Analytics-EventLogging, Analytics-Kanban, Mobile-Web: Follow up with mobile team on instrumentation sampling rate (%50) - https://phabricator.wikimedia.org/T88363#1210774 (Jdlrobson) What needs to happen here? [22:29:29] madhuvishy: code for all your roles is here: /some-dir/vagrant/mediawiki/extensions [22:29:41] madhuvishy: outside vagrant (for simplicity) [22:30:08] madhuvishy: do change some code, see effects and we can go over how to submit patches later [22:34:54] nuria: It seems like timestamps are already being logged. [22:35:00] https://www.irccloud.com/pastebin/fvG11q94 [22:36:05] madhuvishy: do you see timestamps on exceptions in the logs in "upstart" dir on the beta-cluster machine? [22:36:17] checking [22:38:29] nuria: are these exceptions? [22:38:32] https://www.irccloud.com/pastebin/78lTJfDr [22:38:59] that is a "caught" exception [22:39:06] nuria: aah [22:39:57] madhuvishy: but yes, the logging was updated (did i do that?)
for timestamps on the regular logger [22:40:12] madhuvishy: see if you find any uncaught [22:40:29] nuria: they will be in the same files too? [22:40:35] madhuvishy: yes [22:46:28] nuria: not really sure how to identify them, anything i can grep for? [22:46:40] madhuvishy: exception [22:46:54] as in rgrep -i exception . [22:47:43] no occurrences. [22:53:05] nuria: ^ [22:53:30] madhuvishy: let me look in prod [22:59:04] nuria: okay [23:03:31] madhuvishy: i think we are good... [23:03:35] let me see 1 more thing [23:05:38] madhuvishy: i think we are good with timestamps, we can close the phab ticket, you got a lot of setup done today with beta cluster and tests and vagrant [23:06:24] nuria: ah great! yup :) [23:09:20] madhuvishy: what is your next (tomorrow's) item? [23:11:13] nuria: https://phabricator.wikimedia.org/T88433 [23:11:39] madhuvishy: i would hold off on that one as the cluster is oversaturated now, what was the next one? [23:12:11] nuria: Wikimetrics - https://phabricator.wikimedia.org/T78339 [23:12:39] madhuvishy: this is a better one, you would need to set up wikimetrics [23:13:01] nuria: alright [23:13:08] madhuvishy: another role in vagrant, prod instance is here: https://metrics.wmflabs.org/reports/ you can play with it a bit [23:13:15] madhuvishy: but again, no rush [23:13:52] nuria: cool. i'll check it out [23:14:19] thank you :) [23:17:51] madhuvishy: np [23:27:17] (( job froze for half an hour, not a millisecond added with each pass [23:36:50] Analytics-EventLogging, Analytics-Kanban, Easy: Eventlogging should log timestamps for all error messages {oryx} - https://phabricator.wikimedia.org/T89162#1210914 (madhuvishy) a:madhuvishy [23:39:50] Analytics-EventLogging, Analytics-Kanban, Easy: Eventlogging should log timestamps for all error messages {oryx} - https://phabricator.wikimedia.org/T89162#1210918 (madhuvishy) Open>Invalid Closing because this seems to have already been accomplished in this changeset - https://gerrit.wikimedia.o...
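For reference, the timestamp formatting T89162 asked for (and which the changeset linked in the closing comment had already added) boils down to standard Python logging configuration; a minimal sketch of the idea, not the actual EventLogging code in server/eventlogging/utils.py:

    import logging

    # prefix every log record, including caught exceptions, with a timestamp
    logging.basicConfig(
        format='%(asctime)s [%(levelname)s] %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S',
        level=logging.INFO,
    )

    try:
        raise ValueError('event failed validation')
    except ValueError:
        # logging.exception logs at ERROR level and appends the traceback
        logging.exception('Unable to validate event')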