[01:51:03] Analytics-Tech-community-metrics, DevRel-January-2016: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1946757 (jayvdb) >>! In T123808#1945858, @Aklapper wrote: > @jayvdb, Thanks for finding this and raising this! And I think y... [04:10:38] Analytics-Tech-community-metrics, pywikibot-core, DevRel-January-2016: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1946980 (jayvdb) [05:07:34] (PS1) Milimetric: Update hostnames to analytics-store [analytics/geowiki] - https://gerrit.wikimedia.org/r/265213 [05:08:03] (CR) jenkins-bot: [V: -1] Update hostnames to analytics-store [analytics/geowiki] - https://gerrit.wikimedia.org/r/265213 (owner: Milimetric) [05:10:01] (CR) Milimetric: Update hostnames to analytics-store (1 comment) [analytics/geowiki] - https://gerrit.wikimedia.org/r/265213 (owner: Milimetric) [06:24:26] o/ [07:36:51] Analytics-EventLogging, MediaWiki-API, Easy, Google-Code-In-2015, Patch-For-Review: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454#1947125 (Florian) >>! In T91454#1946228, @greg wrote: > Well, we missed today's branching, but othe... [08:39:55] quick question for you guys: do we need to set any particular git config for https://github.com/wikimedia/operations-puppet ? [08:40:14] I am trying to send a code review with git review but it doesn't get my ssh cert [08:40:27] * elukey is probably doing something terribly wrong [08:57:55] * elukey should have used ssh:// 30 minutes ago probably [08:57:58] https://gerrit.wikimedia.org/r/#/c/265227/ :) [09:01:02] So the patch could be discared, I wanted to try the workflow [09:01:13] but it might be useful for the newbies like me [09:28:07] Hi elukey [09:28:22] Have you managed to git review ? [09:41:50] yep already merged by Yuvi :) [09:42:16] I don't have palladium's access so I couldn't do puppet apply [09:53:32] k [09:55:49] all right going out for ~2hrs, talk with you later! [09:58:24] laters ! [12:05:16] back! :) [12:07:23] joal: do you have a minute for https://phabricator.wikimedia.org/T123942 ? I am trying to figure out one thing [12:46:15] Analytics-Kanban, Patch-For-Review: Burrow should be restarted automatically when config changes - https://phabricator.wikimedia.org/T123942#1947617 (elukey) Tentative for a patch: https://gerrit.wikimedia.org/r/#/c/265246/ [12:46:29] Analytics-Kanban, Patch-For-Review: Burrow should be restarted automatically when config changes - https://phabricator.wikimedia.org/T123942#1947619 (elukey) p:Triage>Normal [12:46:55] ottomata: https://gerrit.wikimedia.org/r/265246 - if you have time :) [12:48:05] I created a krypton-test instance in labs, reading the docs this seems to be the next step after the change has been reviewed and before the merge to prod [12:48:05] why the hells do I have to be awake [12:48:52] Ironholds: too early? [12:49:20] well, that, but also I've been continuously awake foor..22 hours. [12:49:57] ah ok now I got your question :D [13:10:29] Hi elukey, sorry I was away for lunch [13:14:43] hey joal! 
Nothing super important, I just wanted to ask you some things before sending the CR but I proceeded anyway :) [13:14:57] okayy [13:15:19] Also, I'm not very burrow knowledgeable, so you probably will teach some ) [13:19:23] I've read about it this morning :d [13:28:26] elukey, update [13:28:28] the plane is tomorrow [13:28:44] I have been awake for 22 hours to guarantee I will be asleep on a plane that does not take off for another 26 hours [13:28:51] * Ironholds hurls everything into the sun [13:30:53] Ironholds: what's the plan? Can we bet on your next 26 hours? :D [13:31:21] joal: https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules - this seems nice! Still need to understand it fully, but it seems well done [13:34:51] Quarry: Cannot download data from a query with Unicode characters in its title - https://phabricator.wikimedia.org/T123031#1947747 (XXN) I confirm this. I encountered the same problem with http://quarry.wmflabs.org/query/6945 [14:39:29] Analytics-Kanban: Projections of cost and scaling for pageview API. {hawk} [8 pts] - https://phabricator.wikimedia.org/T116097#1947959 (Milimetric) so if we need 3T per year, we'll naively need 15T for 5 years. But we shouldn't keep daily per-article resolution for that long. We could cut it dramatically by... [15:41:12] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948124 (Cmjohnson) @ottomata @nuria let's coordinate a time that we can get this done. [15:41:52] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948126 (Ottomata) I think we need to coordinate with @jcrespo. This box is more than just eventlogging db proxy. [15:47:04] milimetric: Where did you say logs would get dumped if my scripts failed again? [15:47:28] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948150 (jcrespo) No I think dbproxy1004 only serves m4/eventlogging. But we can failover to another machine without needing downtime, I just need time to setup another proxy temp... [16:11:39] milimetric: let me know if you want to talk about piwik [16:13:47] Analytics-EventLogging, MediaWiki-API, Easy, Google-Code-In-2015, Patch-For-Review: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454#1948220 (greg) Yeah, getting it on the deployments wiki page so it's not missed is important (I'm m... [16:49:11] Analytics-Cluster, EventBus, Services, operations: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. 
- https://phabricator.wikimedia.org/T123954#1948340 (aaron) [16:59:05] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review, Puppet: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [21 pts] - https://phabricator.wikimedia.org/T101763#1948417 (Nuria) Open>Resolved [16:59:20] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review: Use fabric to deploy wikimetrics {dove} [13 pts] - https://phabricator.wikimedia.org/T122228#1948418 (Nuria) Open>Resolved [16:59:22] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review, Puppet: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [21 pts] - https://phabricator.wikimedia.org/T101763#1347110 (Nuria) [16:59:45] Analytics-Kanban: Create a set of celery tasks that can handle the global metric API input {kudu} [0 pts] - https://phabricator.wikimedia.org/T117288#1948423 (Nuria) [16:59:47] Analytics-Kanban, Patch-For-Review: Create celery chain or other organization that handles validation and computation {kudu} [13 pts] - https://phabricator.wikimedia.org/T118308#1948422 (Nuria) Open>Resolved [17:00:06] Analytics-Kanban: Implement a simple public API to calculate global metrics {kudu} [0 pts] - https://phabricator.wikimedia.org/T117285#1948424 (Nuria) Open>Resolved [17:00:30] Analytics-Kanban: Implement a simple public API to calculate global metrics {kudu} [0 pts] - https://phabricator.wikimedia.org/T117285#1770422 (Nuria) [17:00:32] Analytics-Kanban, Patch-For-Review: Build a public form that can hit the new API {kudu} [8 pts] - https://phabricator.wikimedia.org/T117289#1948436 (Nuria) Open>Resolved [17:00:46] Analytics-Kanban, Community-Wikimetrics, Patch-For-Review: Story: WikimetricsUser reports pages edited by cohort {kudu} [13 pts] - https://phabricator.wikimedia.org/T75072#1948439 (Nuria) Open>Resolved [17:02:00] a-team: joining standup in 2 minutes [17:02:39] madhuvishy: ok [17:07:12] Analytics-Kanban: Create Piwik cron to optimize dashboarding - https://phabricator.wikimedia.org/T124187#1948469 (Nuria) [17:08:04] Analytics-Kanban: Create Piwik cron to optimize dashboarding [3 pts] - https://phabricator.wikimedia.org/T124187#1948473 (Milimetric) [17:09:13] Analytics-Kanban: Create Piwik cron to optimize dashboarding [3 pts] - https://phabricator.wikimedia.org/T124187#1948478 (Nuria) Disable querying of data on db ever ytime you look at the dashboard [17:14:44] ottomata: after your ops meeting, can we chat about https://github.com/linkedin/Burrow/issues/4#issuecomment-172944046 [17:24:27] sho [17:37:51] !log restarted EventLogging because of Kafka consumption lag [17:37:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [17:50:58] nuria, I was thinking to grab one (or both) of those tasks next: https://phabricator.wikimedia.org/T108599 https://phabricator.wikimedia.org/T108867 is that OK? [17:52:20] a-team, so that you know: hive has handled the query in 88 seconds, no problem [17:52:23] :( [17:52:28] mforns: on meeting , will look in a sec [17:53:06] sure no rush [17:54:44] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [18:02:45] a-team, I have found a workaround using dataframes instead of raw sql [18:02:59] so nevermond sql :) [18:03:07] joal: coool :) [18:03:19] joal, there's a method in rdd called cartesian I think [18:03:48] ottomata: wanna chat now? 
if not i'll go to office [18:03:49] there is a cartesian join in dataframes which works (by opposition to the one in SQL :-P) [18:03:54] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 28.57% of data under the critical threshold [10.0] [18:04:53] I see [18:05:54] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [18:07:55] madhuvishy: ja [18:07:58] just ordered lunch [18:08:07] ottomata: batcave? [18:08:09] sure [18:08:13] very loud here :) [18:16:08] mforns, yt [18:16:13] shoudl we EL to prod? [18:16:14] ottomata, yes [18:16:18] aha [18:16:24] let's do that :] [18:16:25] looks good in beta :) [18:16:26] ok [18:16:31] Analytics, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API from the Event Bus perspective - https://phabricator.wikimedia.org/T112956#1948744 (Aklapper) Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. **If the session in this task took place**, plea... [18:16:40] batcave ottomata? [18:16:51] sure i just got a giant salad [18:16:51] or do you want me to do that? [18:16:56] hehe [18:16:57] you can run deploy, and I hang out and eat? [18:16:58] and backup? [18:17:06] sure [18:17:14] k lets do it [18:17:18] omw [18:20:39] milimetric: jynus on ops says that EL is lagging only 3 hrs [18:21:30] mforns: the wikimedi abot sounds like a better task, [18:21:34] *wikimedia bot [18:23:23] mforns: are the mobile patches https://gerrit.wikimedia.org/r/#/c/264297/ [18:23:23] mforns: stopped until the mobile cache switch? [18:23:31] nuria, yes I guess so [18:23:40] mforns: btw, i have added madhuvishy as cr-er [18:25:44] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948817 (Nuria) @Cmjohson: the update should only be a few minutes right? If so let's do it today/tomorrow if possible. [18:30:27] !log deployed EL in production with removal of queue [18:30:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:35:35] ottomata: https://gerrit.wikimedia.org/r/#/c/265246/ updated :) [18:36:44] nuria, I'll take wikimedia bot, and setup the meeting. but in the meantime should I do the other one? [18:37:58] ah haha [18:37:59] elukey_: almost! [18:38:03] the subscribe goes on the service [18:38:13] the service subscribes to file changes [18:38:34] ottomata: sorry I need to sleep [18:39:01] subscribing a file to itself is a clear sign that my brain is not working [18:39:31] Analytics-Kanban: Communicate the WikimediaBot convention {hawk} [5 pts] - https://phabricator.wikimedia.org/T108599#1948902 (mforns) a:mforns [18:40:04] RECOVERY - Overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [100.0] [18:40:11] Analytics-Tech-community-metrics, DevRel-January-2016: Make GrimoireLib display *one* consistent name for one user, plus the *current* affiliation of a user - https://phabricator.wikimedia.org/T118169#1948905 (Lcanasdiaz) It is fixed now. [18:40:14] mforns: :) yay! [18:40:16] ees working! 
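[Editor's sketch, re the Burrow review at 18:35-18:39 (T123942, Gerrit 265246): ottomata's point is that the notify relationship belongs on the service resource, not on the config file itself. A minimal Puppet sketch of that pattern follows; resource titles, paths and the template name are placeholders, not copied from the actual operations/puppet module.]

    # Hedged sketch only: names and paths are illustrative.
    file { '/etc/burrow/burrow.cfg':
        ensure  => present,
        content => template('burrow/burrow.cfg.erb'),
    }

    # The service subscribes to its config file, so Puppet restarts Burrow
    # whenever the rendered config changes; subscribing the file to itself
    # (the first attempt) does nothing useful.
    service { 'burrow':
        ensure    => running,
        enable    => true,
        subscribe => File['/etc/burrow/burrow.cfg'],
    }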
[18:40:25] :] ottomata [18:40:33] Analytics-Tech-community-metrics, DevRel-January-2016: Many profiles on profile.html do not display identity's name though data is available - https://phabricator.wikimedia.org/T117871#1948909 (Lcanasdiaz) [18:40:35] Analytics-Tech-community-metrics, DevRel-January-2016: Make GrimoireLib display *one* consistent name for one user, plus the *current* affiliation of a user - https://phabricator.wikimedia.org/T118169#1948908 (Lcanasdiaz) Open>Resolved [18:40:36] mforns: we should be careful not to repeat this work: https://phabricator.wikimedia.org/T123546 [18:40:37] Analytics-Tech-community-metrics, Developer-Relations, DevRel-February-2016: Who are the top 50 independent contributors and what do they need from the WMF? - https://phabricator.wikimedia.org/T85600#1948910 (Lcanasdiaz) [18:41:11] madhuvishy: can you coordinate this small outage (minutes) for EL since you are on ops-duty this week: https://phabricator.wikimedia.org/T123546 [18:41:48] madhuvishy: i have added you to the ticket [18:42:46] nuria, I don't understand [18:43:10] mforns: sorry, wrong ticket [18:43:26] nuria, ah! fiu... [18:43:35] mforns: give a sec [18:43:37] sure [18:43:39] *give me a sec [18:43:49] xD [18:44:31] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948935 (Nuria) @Cmjohson: @madhuvishy is on ops duty this week and she can help coordinate this small maintenance window. We just need to: 1. communicate to list 2. stop el,... [18:44:49] mforns: this is it https://phabricator.wikimedia.org/T117945 [18:45:45] madhuvishy: let me know if you feel Ok coordinating the small maintenance window for hardware for EL [18:45:57] nuria: sure but i don't understand - why do we need to stop all of EL? [18:46:17] we can just stop the mysql consumers right? [18:46:23] ottomata: ^ [18:46:26] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1948959 (Nuria) @jcrespo: let's do this hardware update before the conversion Ok? https://phabricator.wikimedia.org/T123546 [18:46:44] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 45.45% of data under the critical threshold [10.0] [18:47:03] doh! [18:47:10] mforns: this check probably isn't tuned right anymore [18:47:31] yeahhhh now its just gonna be really spikey [18:47:44] because it will just wait til it gets 5 mins or 3000 events [18:47:46] for each schema [18:48:12] ummmm [18:48:27] nuria, yes it's linked from the task itself, OK [18:48:39] madhuvishy: that would work too, you can update ticket telling jaime we can take teh downtime easy [18:48:39] what are we talking about? [18:48:46] ottomata, yes, it will be spikey for a while [18:49:13] nuria: yeah - we can just leave EL running - and no data loss hopefully [18:49:15] madhuvishy: sorry, the hardware outage : https://phabricator.wikimedia.org/T123546 [18:49:19] it will just reconsume [18:49:33] but ottomata, this happened before also [18:49:35] madhuvishy: just FYI to jaime that we do not need to fallback [18:49:40] oh utnil it gets out of sync [18:49:41] right [18:49:48] maybe we should graph and alert on a moving average? [18:50:06] madhuvishy: makes sense? 
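[Editor's sketch, re "maybe we should graph and alert on a moving average?" at 18:49: one way to smooth the now-spiky insertion-rate metric is to alert on a Graphite movingAverage() of it rather than on the raw series. The metric name is the one used by the existing alert (see T124204 below); the window length and how the Icinga check would consume the target are assumptions, not deployed configuration.]

    Current target (spiky since the batching change):
        eventlogging.overall.inserted.rate
    Smoothed alternative to alert on (15-minute window is illustrative):
        movingAverage(eventlogging.overall.inserted.rate, '15min')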
[18:50:16] nuria: yeah let me comment [18:50:34] ottomata: https://gerrit.wikimedia.org/r/#/c/265246/ - should be good now :) [18:50:37] madhuvishy: cause i think we should that before taking teh longer outage for toku db conversion [18:50:49] oooo this could be a fun one for elukey :) i will make a task [18:51:08] ottomata, yes :] [18:51:23] elukey: merged! :) [18:51:39] nuria: downtime for m4-master will still be there [18:51:43] that's not a problem [18:51:44] ? [18:52:12] i see ottomata mentioned that there might be other users of m4-master that are not EL [18:52:31] Analytics, Reading-Web, Wikipedia-iOS-App-Product-Backlog: As an end-user I shouldn't see non-articles in the list of trending articles - https://phabricator.wikimedia.org/T124082#1948995 (Milimetric) Doing this kind of heuristic as part of the API call is possible, then clients would get less than 1... [18:52:42] madhuvishy: i do not think that is the case from my prior conversations with sean but it will be something to confirm [18:53:11] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948996 (madhuvishy) @jcrespo @Cmjohnson: EL can handle downtime - We will just stop the EL mysql consumers, and restart them after maintenance window - and data should get recons... [18:54:14] nuria: okay i commented on the ticket [18:54:34] I'll communicate to the list anyway saying mysql consumers are being stopped for a while [18:54:41] when it happens [18:54:52] ottomata: mmm so now the change can be merged in Palladium and pushed, but we'd need to test it in labs before that.. how can we prevent anybody from pushing out a newer change without pulling mine in? [18:55:03] (maybe I am confused about this workflow) [18:55:07] MarkTraceur: here's your logs (location's in the cmd prompt) [18:55:12] madhuvishy: thank you, sorry not to have mentioned this on standup earlier but i think is a good fit for ops duty as it i s all about commnunication mostly [18:55:12] https://www.irccloud.com/pastebin/fQNllp0k/ [18:56:01] Shoot. [18:56:35] milimetric: So my existing files couldn't get overwritten because they have data that doesn't match? Can we 1. delete the files and 2. run the generator again manually? [18:56:39] elukey: do you want to test burrow change in self hosted labs instance? [18:56:41] i.e. non-cron run [18:56:45] Analytics-Kanban: Tune eventlogging.overall.inserted.rate alert to use a movingAverage transformation - https://phabricator.wikimedia.org/T124204#1949004 (Ottomata) NEW a:elukey [18:56:48] nuria: yeah alright [18:57:00] elukey: you can't :) [18:57:04] MarkTraceur: I'll delete the files if you back them up [18:57:10] if you mean test in a non self hosted puppet master in labs [18:57:14] you can't [18:57:29] merge into production is applied everywhere in labs eventually [18:57:32] that uses it [18:57:40] in prod, it is a manual merge into local puppet repo clone on palladium [18:57:45] milimetric: Sold, where do you want them? Somewhere else in the public data? [18:57:46] and then it will be applied everywhere [18:57:52] Analytics-EventLogging, MediaWiki-API, Easy, Google-Code-In-2015, Patch-For-Review: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454#1949019 (Florian) :P Ok, thanks for the answer! :) I'll edit wikitech:Deployments, [[ https://wikit... [18:58:01] MarkTraceur: grrr... 
I have no permission to chown them [18:58:09] Heh, yeah [18:58:12] MarkTraceur: I don't care about the backups that's just for you :) [18:58:12] madhuvishy: yep I tried it today but didn't finish it.. [18:58:29] elukey: okay - let me know if you need help there [18:58:31] Oh, then screw it, it's the same data, but better formatted and easier to maintain [18:58:35] but maybe ottomata can help with this. Andrew I can't do this: [18:58:35] milimetric@stat1003:/a/limn-public-data/metrics$ sudo -u stats chown -R stats multimedia-health/ [18:58:57] madhuvishy: I followed the things that you told me last time and it worked, all good :) [18:59:03] oh wait MarkTraceur you can just delete that directory completely if you're done with it [18:59:16] KK [18:59:16] elukey: cool :) [18:59:26] {{done}} [18:59:42] I gotta go, sorry, but if you delete that dir you should be good. [19:00:00] milimetric: Can I force the job to run again? [19:04:33] ottomata: it's possible to kill just the mysql consumers in EL right? [19:04:43] yes [19:05:24] ottomata: do we have to kill them individually? [19:05:38] * madhuvishy looks at puppet [19:05:57] all right, going offline.. byeeee o/ [19:06:49] Analytics-Kanban: Communicate the WikimediaBot convention {hawk} [5 pts] - https://phabricator.wikimedia.org/T108599#1524599 (Nuria) Please connect with @bd808 [19:10:34] ottomata: hmmm - not super sure - do you have a way to kill all processes in a consumer group? [19:15:09] Analytics-Tech-community-metrics, DevRel-January-2016: "Unavailable section name" displayed on repository.html - https://phabricator.wikimedia.org/T121102#1949076 (Lcanasdiaz) It is fixed and already deployed in production. [19:15:19] Analytics-Tech-community-metrics, DevRel-January-2016: "Unavailable section name" displayed on repository.html - https://phabricator.wikimedia.org/T121102#1949077 (Lcanasdiaz) Open>Resolved [19:16:16] (CR) Nuria: "Adding @joal. We are still looking forward to merge this, correct?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255105 (owner: DCausse) [19:22:43] madhuvishy: i have kill them with grep mysql | xargs kill -9 <> in the past, would love to know how to make it more sophisticated [19:26:10] milimetric: the piwik ui is still trying to run queries, did you restarted after changing the archiving settings? [19:34:24] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [19:34:41] we started the research showcase, streamed at https://t.co/tdESrvwEhd [19:34:57] Analytics-Kanban: Communicate the WikimediaBot convention {hawk} [5 pts] - https://phabricator.wikimedia.org/T108599#1949173 (bd808) >>! In T108599#1949029, @Nuria wrote: > Please connect with @bd808 Yes, please. @Anomie and I can help with getting the new guidelines broadcasted out to bot developers, librar... [19:45:04] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [19:46:24] nuria, which queries, the real-time visitor one? 
That's apparently safe [19:47:14] I restarted it, just to be safe, but then I changed some settings and verified they get picked up whether you restart or not [19:49:37] Analytics-Tech-community-metrics, pywikibot-core, DevRel-January-2016: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1949253 (Aklapper) ...that dropdown also lists quite some duplicate entries like wikipedia, vendor, varn... [19:49:56] hello. i would like to check traffic stats for a bunch of domain names we have that aren't the regular project domains [19:50:03] stuff like wikiepdia.com [19:50:31] i was looking at 1000-sampled.json on oxygen, but if they are very low usage, it should probably check unsampled logs [19:50:45] could you point me to instructions to get that out of hadoop? [19:51:10] milimetric: was going to help point, but then i was wondering about projectview files [19:51:10] i want to justify deactivating a bunch of them , if they are rarely ever used [19:51:13] etc. [19:51:23] are bad domains like that in projectview/projectcounts? [19:51:24] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [19:51:46] we have over 600 domain names :p [19:51:57] all kinds of typo and weird stuff [19:52:06] but just a fraction of them get traffic [19:52:09] (CR) Joal: "I Think so @nuria" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255105 (owner: DCausse) [19:53:34] oh joal is here, maybe he can answer ^^^ [19:53:38] mutante: in lieu of an answer from milimetric, check out: [19:53:42] https://upload.wikimedia.org/wikipedia/commons/5/53/Introduction_to_Hive.pdf [19:53:56] thanks! [19:53:58] some good stuff here too [19:53:59] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries [19:54:03] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive [19:54:22] (Reading up) [19:54:27] mutante: if you are after special domain names (not subdomains), then you'll only find them in webrequestlog [19:54:43] webrequest log sorry mutante [19:54:51] (in hive, webrequest table) [19:55:16] thanks, both of you [19:55:23] looks on stat1002 [19:55:35] mutante: ja, hive --database wmf [19:55:38] webrequest [19:55:40] table [19:55:44] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [19:55:46] ...you might need to be in the analytics-privatedata-users group [19:55:49] The thing is, this is a huge table even if it contains only 2 month of data. [19:56:12] mutante: 80Tb for 2 month ... [19:56:24] WARNING: Hive CLI is deprecated and migration to Beeline is recommended. [19:56:27] hive (wmf)> [19:56:29] mutante: let me know if it gets frustrating and I can help with the sql. But my first thought is maybe grouping and counting by hostname for an hour of traffic? [19:56:36] joal: as long as it's ok i run a query ? [19:56:37] So, your query might be long-ish depending on how much you are reqesuting to analyze :) [19:56:53] ignore the beehive thing, Hive cli is fine [19:56:54] well, if i just check an hour of traffic [19:57:07] then i could also use the sampled log on oxygen? 
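[Editor's sketch, re checking the typo domains against the 1:1000 sampled webrequest log on oxygen: count matching lines per candidate domain and multiply by 1000 for an order-of-magnitude estimate, as joal notes a domain absent from the sample very likely has under ~1000 hits. The file path and the assumption that the hostname appears verbatim in each JSON line are guesses, not the actual layout on oxygen.]

    #!/bin/bash
    # Hypothetical path; adjust to wherever the sampled log actually lives.
    LOG=/srv/log/webrequest/sampled-1000.json
    for domain in wikiepdia.com wikiepdia.org; do
        hits=$(grep -c -F "$domain" "$LOG")
        # 1:1000 sampling, so each matching line represents ~1000 requests.
        echo "$domain ~ $((hits * 1000)) requests (from $hits sampled lines)"
    done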
[19:57:07] mutante: for an hour, you'll be fiiiiiine :) [19:57:12] even a day is manageable [19:57:24] it get's trickier for more [19:57:30] ideally i want to say " nobody ever used this in the last year " :p [19:57:39] i dont know what our threshold is , heh [19:57:46] i want reasons to _remove_ domains :) [19:57:51] mutante: only 2 month of data anyway ... [19:57:58] ah, right [19:58:06] data-retention,, ack [19:58:10] privacy first, mutante :) [19:58:10] yeah, but looking at 2 months of data will be *really* slow [19:58:19] yeah, really really [19:58:31] like, kill the cluster slow :) [19:58:32] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1949298 (Ottomata) Since we are about to have an EL downtime anyway, can we fit this in as well? [19:58:59] milimetric: hopefully not, but long to get results :) [19:59:15] hmm, i'm wondering if i make a real difference whether i do this (and limit it to days) or just rely on sampled-1000.log [19:59:18] milimetric: it would kill the cluster if we run that in a queue preventing other jobs to happen [19:59:21] yeah. Hm... mutante: do you have a list of candidates for removal? [19:59:43] sampled-1000 is usually ok [19:59:58] milimetric: here's a random selection to start with https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/dns+branch:master+topic:parking,n,z [20:00:03] the only problem is if you expect that there are less than 1000 hits for something over whatever period you're looking at [20:00:31] Well, using sampled-1000 is actually a good trick: if it's not in there, there a strong probability that it has not past 1000 hits [20:00:31] mutante: ok, that might help, you can look for hits where uri_host is equal to those [20:00:35] wikiepdia had 2 lines in the sampled-1000 [20:01:02] milimetric: thx [20:01:34] and mutante one idea would be to run for an hour first, then a day, and see how the performance looks, then if you're comfortable look at a few days [20:01:43] mutante: if you only filter a subset of project and count, you could go for a month long query and wait for results [20:01:52] But make sure you have it tested on hours before :) [20:02:07] milimetric: You always are faster than me :) [20:02:10] ok! [20:02:30] YuviPanda: Hellooo-oo? [20:02:56] Ironholds: helloooo-oo as well? [20:03:18] ottomata: what EL downtime are you talking about on the ticket? [20:03:23] Given that Ironholds had not slept for 22 hours at my morning time, I strongly suspect he won't answer :) [20:03:58] maybe madhuvishy knows [20:03:58] joal: Ironholds was in the office a few minutes back [20:04:05] Ah, maybe then :) [20:04:36] madhuvishy: do you know if YuviPanda has private notebooks working on our network or not yet ? [20:04:49] joal: not yet as far as I know [20:05:01] Crap ... k, thx madhuvishy :) [20:07:06] ottomata: hah , Exception: Permission denied: user=root, access=READ, [20:07:12] i'll ask for access :p [20:07:35] i can get on the client and use "describe" on tables but not select data [20:07:42] s/client/shell [20:07:46] right [20:07:47] good it works! [20:07:49] :) [20:09:11] mutante: yeah it lets you do everything until the point you try to read from hdfs [20:09:31] the desc etc is metadata that hive has I think [20:09:55] are you gonna use that "Beeline" stuff later? [20:10:19] hehe, that shoudl be a task too! [20:10:23] madhuvishy: did we make a task for that? 
[20:10:32] that's another good little fun one, maybe luca can do that too [20:10:39] ottomata: yeah i think we did [20:10:52] found it [20:10:52] https://phabricator.wikimedia.org/T116123 [20:11:11] Analytics: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1949356 (Ottomata) @elukey this is another that could be fun and very helpful! [20:11:17] we should may be have some script or sth that can launch it with the configuration set up [20:11:46] yeah, probably there is some way to set some default vars, if not then ja a wrapper of some kind that reads from hive-site.xml or seomethign [20:13:36] one more question. what's the issue here [20:13:37] FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" [20:14:04] that happens when i just do example commands like SELECT agent_type FROM webrequest LIMIT 5; [20:15:28] ignore me, i just had to keep reading [20:16:21] :) [20:16:32] rookie mistake #1 ! :D [20:16:45] yes [20:18:21] hive (wmf)> SELECT agent_type FROM webrequest where year=2016 and month=1 and day=19 limit 5; [20:18:28] OK [20:18:28] agent_type [20:18:28] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". [20:18:39] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1949407 (Ottomata) @cmjohnson and I will do this Jan 21 16:00 UTC (11am EST). Should be a very short and unnoticeable downtime. [20:22:12] ut you can ignore the SLF4J error [20:22:16] did you get no results? [20:22:25] btw, that is a pretty large query you are trying to launch [20:22:39] it would be best if, while exploring, you limited it to the smallest partition spec possible [20:22:50] where year, month, day, hour, webrequest_source [20:22:52] with all of those set [20:22:57] webrequest_source='misc' will give you the smallest [20:23:04] mutante: ^ [20:23:22] ok, i already canceled that again [20:23:36] sets hour too [20:25:04] source='misc' show me all the 15.wp hits, nice [20:27:42] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1949448 (Ottomata) Just worked this out in IRC. The downtime will start at 16:00 UTC. @madhuvishy will email the analyti... [20:28:30] joal: https://phabricator.wikimedia.org/T109286 [20:28:34] "The traffic move from mobile->text is now on hold (we did convert codfw, then we rolled back) due to purge-related issues that need to be addressed first, in blocking task https://phabricator.wikimedia.org/T124165. [20:28:34] " [20:28:53] MarkTraceur: yeah, we can force it if we have to but it's kind of a pain in the butt. I think it runs again in about 3 hours [20:28:57] ottomata: Yes, I followed the talk on traffic chan earlier on today [20:29:01] nice [20:29:06] Problems of cache invalidation [20:29:08] i actually found 3 hits on my typo domain in a one-hour range.. hmmm hmmm [20:29:11] so, we need to remember to not deploy unless we roll back our refinery patches [20:29:16] milimetric: All right, fair enough [20:29:18] :/ [20:29:22] that's what I get for merging too soon [20:29:25] Mwarf ... [20:29:37] ottomata: My bad, I have been pushing for that [20:29:42] milimetric: If it fails again, though... 
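[Editor's sketch, putting the Hive advice above together: a per-domain count for one hour with every partition column pinned (year, month, day, hour, webrequest_source), which avoids the "No partition predicate found" error and keeps the scan small. The domain list and the choice of webrequest_source are illustrative; typo domains may well be served from a source other than 'misc'.]

    -- Run from the wmf database (hive --database wmf).
    SELECT uri_host, COUNT(*) AS hits
    FROM webrequest
    WHERE year = 2016 AND month = 1 AND day = 19 AND hour = 14
      AND webrequest_source = 'misc'
      AND uri_host IN ('wikiepdia.com', 'wikiepdia.org')
    GROUP BY uri_host
    ORDER BY hits DESC;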
[20:30:22] ottomata: there was no deploy plan soon (still work ongoing both on mforns_gym app-sessions and nuria last_access_uniques [20:30:54] MarkTraceur: if it fails again I'll run it myself and fix any problems :) deal? [20:31:18] I mean, I'd be happy to fix it [20:31:26] But I don't know how [20:31:40] no worries, that's the kind of customer service we pride ourselves on here in -analytics [20:32:00] Heh [20:32:05] All right, fair enough [20:32:33] aye [20:32:36] joal: we'll leave it for now [20:32:55] ok ottomata [20:33:11] From what I have understood ottomata they still want to have it done soon-ish [20:33:19] cool, aye, just asked that :) [20:33:24] oh 1:1 with nuria! [20:33:34] ottomata: yessir [20:34:35] Analytics, Wikipedia-iOS-App-Product-Backlog, Patch-For-Review, iOS-app-v5-production: Puppetize Piwik to prepare for production deployment - https://phabricator.wikimedia.org/T103577#1949492 (Fjalapeno) Open>Resolved a:Fjalapeno Deployed [20:35:16] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-v5-production: Support Piwik in production - https://phabricator.wikimedia.org/T116308#1949505 (Fjalapeno) Open>Resolved a:Fjalapeno [20:37:31] (CR) Milimetric: Add a graph tracking how many people have enabled the cross-wiki notifications beta feature (1 comment) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/264700 (owner: Catrope) [20:37:57] nuria: can we push our 1:1 by 30 minutes. I was at research showcase and would like to get lunch [20:38:09] (CR) Milimetric: [C: 2 V: 2] Add a graph tracking how many people have enabled the Flow beta feature [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/264698 (https://phabricator.wikimedia.org/T114111) (owner: Catrope) [20:39:43] hm, ottomata I've been reading brandon answer and now I wonder about reverting now [20:42:16] Analytics-General-or-Unknown, Community-Advocacy, Wikimedia-Extension-setup: enable Piwik on ru.wikimedia.org - https://phabricator.wikimedia.org/T91963#1949533 (Fjalapeno) [20:44:49] ottomata1: [20:44:53] https://www.irccloud.com/pastebin/rTpzNm2E/ [20:44:54] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [20:51:14] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [20:52:58] ergh ^ [20:53:02] annoying [20:55:54] ottomata1: what is this coming from? [20:56:12] also pasted email draft above. if it's fine i'll send [20:56:23] analytics and engg lists? [21:00:02] (in 1:1, with you shortly) [21:01:14] madhuvishy: i have a 1:1 with dario shortly after ours [21:01:23] madhuvishy: but i can do it later, at 3pm [21:01:25] nuria: ya it's fine - lets do now [21:01:31] i'll go for lunch after [21:01:36] +1 for email madhuvishy [21:01:40] ottomata: cool [21:01:55] madhuvishy: i think that alert is fireing more often now because of mforns queueiung change [21:02:01] made a ticket for fixing this morning [21:02:18] joal: aye [21:02:19] madhuvishy: ok, let's do it then [21:02:20] ja :/ [21:02:51] ottomata: so, revert tomorrow? (I'm about to go to bed) [21:03:37] ja [21:03:38] k [21:03:43] lets do later [21:03:50] nighters joal! 
[21:03:54] Thx ottomata [21:04:14] ottomata: I'll bug you tomorrow not to forget (I just made myself a post it :) [21:04:26] k :) [21:37:17] (PS6) Nuria: Drop support for message without rev id in avro decoders and make latestRev mandatory [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255105 (owner: DCausse) [21:38:39] ottomata, yt? [21:40:16] ja hey [21:40:47] hey! [21:40:56] are you planning to rollback the EL deployment? [21:41:40] ottomata, ^ [21:42:54] oh no, I think it was another thing, now that I read the scrollback [21:43:17] mforns: no [21:43:20] yeah another thing :) [21:44:11] anyway ottomata I thought we might add a random number of seconds to the time limit, so that batches are not inserted in spikes, it seems this version of the code is not going to unsync [21:46:36] yeah [21:46:37] hm [21:46:50] I'm writing the commit msg [21:46:51] heh, i dunno [21:46:55] no? [21:46:57] or a random sleep before starting [21:47:02] would do it too [21:47:08] randomly changing the time might be funky [21:47:13] really, we might want to do a shorter time limit [21:47:21] which would also smooth this out [21:47:31] dunno though [21:47:42] yes, the spikes would be shorter and more frequent [21:48:39] but wouldn't we lose the performance gain we reached when batching inserts? [21:50:08] maybe, i guess, but, hm. i'd keep the max batch size large [21:50:13] so that if events come in real fast [21:50:16] they will be in large batches [21:50:28] but for schemas with small numbers of events, maybe its ok to insert frequently? [21:50:32] not sure. [21:50:37] I see, I think you're right [21:50:50] dunno though [21:50:54] i'm reluctant to change this now though [21:51:02] and where would you put the random sleep befor start? [21:51:05] since we are going to do a downtime tomorrow, and then heavily overload the thing after [21:51:06] aha, ok [22:16:14] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [22:37:39] Analytics, Analytics-Cluster: https://yarn.wikimedia.org/cluster/scheduler should be behind ldap - https://phabricator.wikimedia.org/T116192#1950054 (Tbayer) OK, I've added this alternative to the documentation for now (not fully sure what's going on though, or whether the variant with bast1001 should be... [22:41:44] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [22:58:37] Analytics: Remove LegacyPageviews from vital-signs - https://phabricator.wikimedia.org/T124244#1950171 (madhuvishy) NEW [22:59:46] Analytics: Move vital signs to its own instance {crow} [5 pts] - https://phabricator.wikimedia.org/T123944#1950184 (madhuvishy) [22:59:47] Analytics: Remove LegacyPageviews from vital-signs - https://phabricator.wikimedia.org/T124244#1950185 (madhuvishy) [23:00:14] Analytics: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1950188 (Tbayer) One blocker I have personally encountered in using beeline is that there does not seem to be an option to raise the heap memory limit akin to [[https://wikitech.wikimedia.org/wiki/Analyt... [23:01:56] Analytics: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1950196 (madhuvishy) @Tbayer This should not be a problem anymore. @Ottomata configured it to have 1024m as the default. 
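[Editor's sketch, to make the 21:44-21:51 exchange concrete: a Python outline of the batching loop being discussed, where a large max batch size keeps busy schemas efficient, a time limit bounds latency for quiet schemas, and an optional random startup sleep staggers the flush timers so consumers don't all insert at once. Names and numbers are illustrative, not the real eventlogging consumer code, which batches per schema and reads from Kafka.]

    import random
    import time

    def consume(events, insert_batch, max_batch_size=3000,
                time_limit=300, startup_jitter=30):
        """Batch items from `events` and flush them via `insert_batch(list)`.

        Illustrative only: both callables stand in for the real Kafka
        consumer and MySQL insert code.
        """
        # Optional jitter so multiple consumers don't flush in lockstep.
        time.sleep(random.uniform(0, startup_jitter))
        batch, last_flush = [], time.time()
        for event in events:
            batch.append(event)
            if len(batch) >= max_batch_size or time.time() - last_flush >= time_limit:
                insert_batch(batch)
                batch, last_flush = [], time.time()
        if batch:  # flush whatever is left at shutdown
            insert_batch(batch)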
[23:19:45] Analytics: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1950359 (Tbayer) >>! In T116123#1950196, @madhuvishy wrote: > @Tbayer This should not be a problem anymore. @Ottomata configured it to have 1024m as the default. Thanks, that's great! Although there ar... [23:23:19] Analytics: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1950366 (madhuvishy) @Tbayer aah, yes that'd be a problem them. We'll try to figure out a way to make configuring these options easier, but it's probably something better fixed in beeline upstream. (http... [23:44:04] madhuvishy: btw, http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and-open-source-proposal-to-join-the-Apache-Incubator.html [23:48:43] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1950435 (Tbayer) [23:48:45] Analytics: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 - https://phabricator.wikimedia.org/T123595#1950434 (Tbayer) [23:48:47] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1950436 (Tbayer) [23:48:59] YuviPanda: google compute engine hhmmmm [23:49:15] madhuvishy: but as an apache project [23:49:21] right [23:49:41] looks interesting [23:51:08] Analytics: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 - https://phabricator.wikimedia.org/T123595#1950448 (Tbayer) >>! In T123595#1934514, @Nuria wrote: > @Tbayer: > > Given the many issues we have in our data store right now. > > Hardware: https://phabricator.wikimedia.org/... [23:51:44] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] [23:55:55] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0]
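[Editor's sketch, re the beeline usability thread (T116123): one shape the "script or sth that can launch it with the configuration set up" idea could take is a small wrapper that pre-fills the JDBC URL and raises the client-side heap. Everything here is a guess at a reasonable default: the HiveServer2 host, port and the 2 GB heap are placeholders, and whether this covers the other options Tbayer mentions is untested.]

    #!/bin/bash
    # Hypothetical "beeline-wmf" wrapper sketched for T116123; not deployed code.
    # HADOOP_CLIENT_OPTS is the standard knob for the client-side JVM heap.
    export HADOOP_CLIENT_OPTS="-Xmx2048m ${HADOOP_CLIENT_OPTS}"
    # Host, port and default database are placeholders for the real
    # analytics HiveServer2 endpoint.
    exec beeline -u 'jdbc:hive2://hiveserver.example.org:10000/wmf' "$@"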