[02:32:04] HaeB: You around?
[02:32:41] kaldari: hi
[02:33:22] Haeb: I tried running that query on Tool Labs and Production, but I don't get the first 2 results (the deleted pages):
[02:33:33] https://www.irccloud.com/pastebin/Zgh2O8KJ/
[02:33:59] Haeb: Which server were you running the query on?
[02:34:19] analytics-store
[02:34:32] Does it still show the 2 deleted pages?
[02:35:25] 49995280 and 50001687
[02:35:43] yes
[02:36:17] interesting
[02:36:29] I guess there must be a lag for analytics-store being updated (but a 1 year lag??)
[02:38:03] Haeb: One other weird thing is that those 2 pages exist in the page table on Tool Labs, but not in the page table on Production.
[02:39:22] i just noticed that pagetriage_page actually has a "ptrp_deleted" field... does that mean what it looks like it means?
[02:39:28] https://www.irccloud.com/pastebin/JmGejCXb/
[02:39:55] HaeB: Unfortunately, no
[02:40:14] That field stores whether or not the page is nominated for deletion.
[02:41:01] ok
[02:41:51] I'm really mystified about these database discrepancies. Are such anomalies common?
[02:42:17] I guess I should ask jcrespo about it
[02:42:48] btw, no lag:
[02:42:52] https://www.irccloud.com/pastebin/ZZ0CgoNF/
[02:44:47] about a year ago we had some discrepancies for eventlogging tables between analytics-store and -slave https://phabricator.wikimedia.org/T131236
[02:45:35] HaeB: Can you confirm the discrepancies for me, maybe I'm high.
[02:46:49] kaldari: i guess the order of rows doesn't need to be the same (for the "same" table on different servers)
[02:47:10] Oh, maybe it's just not in that result set
[02:47:17] i.e. SELECT .... LIMIT 10 is not deterministic
[02:47:23] need to ORDER BY page_id
[02:48:42] HaeB: Nope, still get the same results (without the 2 deleted pages)
[02:48:45] kaldari: can you try
[02:48:46] select ptrp_page_id from pagetriage_page, pagetriage_page_tags, page where ptrp_reviewed = 0 and page_id = ptrp_page_id and page_namespace = 0 AND page_is_redirect=0 AND ptrp_page_id = ptrpt_page_id AND ptrpt_tag_id = 13 AND ptrpt_value = 0 AND page_id IN (49995280, 50001687);
[02:48:49] ah ok
[02:49:31] Empty set for that query on both Tool Labs and Prod
[02:51:33] oh, on s1-analytics slave too (empty)
[02:52:26] https://www.irccloud.com/pastebin/n0SKx9KE/
[02:52:37] What does...
[02:52:39] select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever';
[02:52:47] give you on s1-analytics slave?
[02:54:51] kaldari: empty set
[02:55:39] OK, I guess I'll just file a bug about this
[02:56:05] but it makes me a bit worried about our ability to generate accurate stats
[02:57:21] Can you give me the output of select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'; on analytics-store as well, just so I can paste it into the bug report?
[02:57:31] Haeb: ^
[02:57:57] https://www.irccloud.com/pastebin/lUUQxBWR/
[03:01:25] kaldari: here are all three of your queries repeated on analytics-store:
[03:01:33] https://www.irccloud.com/pastebin/czKKPxhJ/
[03:03:11] ...the numbers are pretty close: 22021 (your result) vs. 21999, 906 vs. 908, 21115 vs. 21092
[03:03:23] thanks. At least it doesn't differ by a huge number from prod.
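As the exchange above shows, a SELECT ... LIMIT without an ORDER BY is not deterministic, so the "same" query can legitimately return different rows on different servers. A minimal sketch of the fix, reusing columns from kaldari's query above (trimmed to the page/pagetriage_page join for illustration):

```sql
-- Pin the row order so LIMIT returns the same slice on every server
SELECT ptrp_page_id
FROM pagetriage_page
JOIN page ON page_id = ptrp_page_id
WHERE ptrp_reviewed = 0
  AND page_namespace = 0
  AND page_is_redirect = 0
ORDER BY ptrp_page_id
LIMIT 10;
```

Only with an explicit sort key does comparing the result sets from Tool Labs, Production and analytics-store become a meaningful consistency check.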
[03:03:26] ...easily explainable with having run the query a bit later
[03:03:39] yes but i would say that's a bad sign, not a good sign ;)
[03:04:07] ...because those numbers include deleted pages on store
[03:08:20] Analytics, DBA, Labs, MediaWiki-Page-deletion, and 2 others: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (kaldari)
[03:08:29] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288008 (kaldari)
[03:09:31] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (kaldari)
[03:09:51] HaeB: ^
[03:09:55] bug filed
[03:10:59] thanks!
[03:12:50] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288010 (Tbayer)
[03:13:00] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288011 (kaldari) FWIW, this doesn't seem to be a lag issue as all the pages af...
[03:20:31] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288012 (kaldari) p:Triage>High Marking high priority since this is aff...
[03:22:24] kaldari: i guess it's still worthwhile running an explicit query for all deleted pages in that set (in production and tool labs too)
[05:58:21] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3288072 (jcrespo) Open>Resolved a:jcrespo This is a known issue, wa...
[07:27:48] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3287995 (Marostegui) Just for the record of this ticket, Jaime kindly fixed it...
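The "explicit query for all deleted pages in that set" that HaeB suggests above could look roughly like the sketch below: unreviewed PageTriage entries whose page_id also appears in archive, i.e. page rows that should have gone away with the deletion. This is an illustrative guess, not the query that was actually run on the ticket:

```sql
-- Unreviewed PageTriage entries whose page_id also shows up in archive,
-- i.e. deleted pages whose page row still lingers on this replica.
-- (Pages deleted and later recreated get a new page_id, so false positives
-- should be rare.)
SELECT ptrp_page_id, page_title
FROM pagetriage_page
JOIN page ON page_id = ptrp_page_id
WHERE ptrp_reviewed = 0
  AND page_namespace = 0
  AND page_id IN (SELECT ar_page_id FROM archive WHERE ar_page_id IS NOT NULL);
```

Run against Production, Tool Labs and analytics-store, the three result sets should differ only by the rows affected by the replication issue.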
[08:36:42] (PS1) Joal: Correct last_access_uniques daily/monthly bug [analytics/refinery] - https://gerrit.wikimedia.org/r/355387 (https://phabricator.wikimedia.org/T165661)
[08:36:44] Hi a-team
[08:37:34] I'm going to start my day with merging my bugfixes and re-deploying, if you folks agree
[08:41:29] ok, looks like I'm alone this morning, going with bugfixes
[08:42:01] (CR) Joal: [V: 2 C: 2] "Self merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/355221 (https://phabricator.wikimedia.org/T164713) (owner: Joal)
[08:43:35] (PS2) Joal: Upgrade jar version for restbase job [analytics/refinery] - https://gerrit.wikimedia.org/r/355266 (https://phabricator.wikimedia.org/T163479)
[08:43:55] (CR) Joal: [V: 2 C: 2] "Self merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/355266 (https://phabricator.wikimedia.org/T163479) (owner: Joal)
[08:44:36] (PS2) Joal: Correct last_access_uniques daily/monthly bug [analytics/refinery] - https://gerrit.wikimedia.org/r/355387 (https://phabricator.wikimedia.org/T165661)
[08:45:19] (CR) Joal: [V: 2 C: 2] "Self merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/355387 (https://phabricator.wikimedia.org/T165661) (owner: Joal)
[08:48:06] !log Deploying refine
[08:48:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:52:12] !log Deploy refinery to HDFS
[08:52:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:01:59] !log Restart oozie restbase job after bug fixing deploy
[09:02:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:04:20] !log Restart oozie last_access_uniques daily/monthly job after bug fixing deploy
[09:04:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:06:10] !log Restart oozie mediawiki_history denormalize/metrics job after bug fixing deploy
[09:06:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:55] * elukey back!
[09:32:29] Analytics, RESTBase, Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3288504 (JAllemandou) Hi @Pchelolo, We use scala 2.10.4 for the moment (I'd like to move to 2.11 soon though). You can find most of our scala code in here: - h...
[09:41:22] leaving for a medical appointment, will be back in ~1h30
[11:56:59] Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#3288840 (Dvorapa) Is this still an issue? I haven't seen this issue for a while.
[11:57:53] * elukey lunch!
[12:08:48] ottomata: I sent the announcement after tillman suggested it, so I think we're go for deploy :)
[12:50:02] great
[12:50:23] fdans: i want to do druid upgrade first this morning, elukey lemme know when you are back
[12:50:27] i'm going to start prepping some things
[12:52:03] ottomata: already back, doing some ops stuff.. can we do it in ~20 mins?
[12:52:31] ya that's perfect
[13:02:27] ottomata / elukey: Should I suspend hourly pageview loading while you upgrade?
[13:02:46] joal: i thought you said most frequent was daily? :)
[13:03:22] Arf ottomata - Most frequent frequency is daily, we have one hourly and one real-time
[13:03:37] ottomata: sorry for the misunderstanding
[13:04:24] i'll just stop tranquility for a bit, i think that should be fine
[13:04:29] joal: still don't understand
[13:04:45] if most frequent is daily, it is unlikely that we will run into a druid indexing job in the next couple of hours, correct?
[13:04:56] i was just going to check and make sure there were no running indexing jobs as we bounce nodes
[13:05:49] ottomata: my weird mind understood "frequency of frequency" as which jobs-frequencies do we have the most? This is daily (equals to monthly)
[13:06:12] But, the highest frequency for our jobs is hour, we have one of the
[13:06:32] ohhh
[13:06:32] haha
[13:06:36] :0
[13:06:39] ok, then yes please pause :)
[13:07:03] Doing that now
[13:07:48] !log Suspend pageview-druid-hourly-coord for druid upgrade
[13:07:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:09:54] thanks
[13:10:03] np
[13:10:10] thanks for upgrading :)
[13:10:44] by the way while you're at it ottomata, would you mind upgrading the superset version you have and restart it? It has a bug preventing us from using it
[13:11:15] sure
[13:11:18] thanks a lot
[13:11:31] elukey: read whenever you are
[13:11:39] ready*
[13:12:01] * fdans lunch! 🇪🇸
[13:12:30] enjoy fdans :)
[13:27:53] I am ok!
[13:28:05] sorry ottomata I was reviewing a task and got lost
[13:28:10] whenever you want
[13:29:33] ok!
[13:29:38] batcave!
[13:31:03] elukey: ^
[13:32:13] coming! Hangouts hates me
[13:38:46] joal: i have stopped tranquility
[13:38:49] k
[13:38:54] but the index_realtime tasks are still around
[13:38:57] one on each node
[13:39:04] i should expect them to finish, right?
[13:39:11] ottomata: I think we hit the same problem I had the other day
[13:39:23] oh?
[13:39:24] ottomata: I actually have no idea :(
[13:39:27] haha
[13:39:28] ok
[13:39:37] ottomata: let's kill them (restarting druid)
[13:39:57] ottomata: today is a special day and data will be reindexed daily anyway
[13:40:05] ok, so we will ignore
[14:15:16] hey joal, can you come to bc?
[14:15:20] Hey all, is there an Easy Way™ to query a table across databases? I was asked to find a count of all files on WMF wikis, and my sql-fu is not immediately up to the task
[14:15:25] I sure can
[14:16:50] Hi marktraceur - I don't think we have file info in our cross-wiki processed data :(
[14:17:05] Sad days.
[14:19:53] I'll send back some big numbers from a few wikis and ask for more specific requirements.
[14:20:19] marktraceur: which tables would you be looking at in mysql? I'm assuming images?
[14:20:45] joal: Just "select count(*) from image where img_timestamp > 20170417000000;"
[14:21:01] marktraceur: We indeed don't have that
[14:21:38] (sorry, < not >)
[14:21:46] marktraceur: you're not the first one to ask similar requirements, I'll suggest the team we ingest MOAR datas onto hadoop to facilitate - But not satisfactory answer for now I'm sorry
[14:24:18] That's fine, I sent five counts from what I think must be some of the biggest file collections, but who knows
[14:24:26] If he's not satisfied I can script something
[14:26:45] ottomata / elukey: I guess we skip ops sync?
[14:26:53] ya lets skip
[14:27:29] I have one thing to talk - druid is broken now!
[14:27:31] :D
[14:27:39] haahaha
[14:27:46] (a little)
[14:29:43] mwahaha :)
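There is no single cross-wiki table behind marktraceur's file count above, but per-wiki totals can be stitched together by hand. A sketch against a few of the larger file collections, assuming the Labs replica database names (*wiki_p), that the wikis queried live on the same replica server, and marktraceur's timestamp cutoff:

```sql
-- Per-wiki counts of files uploaded before the cutoff; extend the UNION
-- (or script a loop over the wiki list) for full coverage
SELECT 'commonswiki' AS wiki, COUNT(*) AS files
  FROM commonswiki_p.image WHERE img_timestamp < 20170417000000
UNION ALL
SELECT 'enwiki', COUNT(*)
  FROM enwiki_p.image WHERE img_timestamp < 20170417000000
UNION ALL
SELECT 'dewiki', COUNT(*)
  FROM dewiki_p.image WHERE img_timestamp < 20170417000000;
```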
[14:41:28] he, i was wondering if someone might be able to create an analytics view of a subset of namespaces...
[14:42:36] thedj: do you have a data request?
[14:43:28] basically i would like to see some specific numbers about edits of User:*.js/css and Mediawiki:*.js/css
[14:44:14] i'm wondering if there is a trend there, and if that trend lines up with generic wikipedia/wikimedia edit trends or if it is divergent
[14:44:49] i've been having a suspicion that those numbers have dropped significantly over the last couple of years.
[14:51:46] thedj: of the mediawiki platform itself? commits via gerrit?
[14:54:56] thedj: I would create a phab ticket and ping akappler, see: https://wikitech.wikimedia.org/wiki/Analytics/DataRequests and https://meta.wikimedia.org/wiki/Research:FAQ i am not sure such data is anywhere to be easily collected though
[14:55:12] i mean wikipage edits
[14:57:03] thedj: what wikipages (so i can understand who can help you best)
[14:57:18] User:*.js/css and Mediawiki:*.js/css
[14:57:23] of all properties
[14:59:29] thedj: can you link to some examples?
[14:59:45] https://en.wikipedia.org/wiki/MediaWiki:Common.js
[14:59:57] https://en.wikipedia.org/wiki/User:TheDJ/mobileVector.css
[15:00:36] https://en.wikipedia.org/wiki/MediaWiki:Gadget-imagelinks.js
[15:06:02] joal: tranquility for banner impressions is weird after upgrade, it is upset about the lost indexing tasks!
[15:06:23] ottomata: :(
[15:06:33] ottomata: can we kill its state?
[15:06:40] not sure, let's figure it out after standup
[15:06:48] we should also verify that a hadoop loading job works
[15:06:53] so maybe we can do a manual run of one together
[15:08:16] thedj: We have the Mediawiki History dataset, but it's not public yet
[15:12:54] Analytics-Kanban, Patch-For-Review: Count global unique devices per top domain (like *.wikipedia.org) - https://phabricator.wikimedia.org/T143928#3289216 (JAllemandou) Some deeper view of the underestimate computation (since difference happens on that): ``` SUM(CASE -- Last access not set and cl...
[15:27:25] !log Resume pageview-druid-hourly-coord after druid upgrade
[15:27:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:30:18] mforns: this is the exception - NotSupportedError: (1235, u"This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'")
[15:30:38] hmmm....
[15:30:46] crazy :]
[15:31:59] thedj: i see, the best data on that regard that exists publicly is the data on the labs replicas, correct?
[15:32:29] elukey, we could always do it in 2 steps, first select, then update, it's the same order of magnitude computation-wise, no?
[15:33:02] thedj: you can put a phabricator request to neilpquinn (who is the data analyst for the editing team) please do add the value of data gathering so he can prioritize such an ask
[15:33:52] mforns: I am checking if it is supported in all our dbs, I think it is only a testing issue
[15:34:48] elukey, aha, well anyway, if we need to do it, I don't see it as a problem... a couple more lines of code, np
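thedj's request above amounts to counting edits to pages ending in .js or .css in the User (2) and MediaWiki (8) namespaces. A per-wiki sketch of the kind of query that could feed such a trend (it would have to be repeated for each wiki, or redone on the cross-wiki MediaWiki History dataset once that is public):

```sql
-- Yearly edit counts to User:*.js/css and MediaWiki:*.js/css pages on one wiki
SELECT LEFT(rev_timestamp, 4) AS year,
       page_namespace,
       COUNT(*) AS edits
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace IN (2, 8)   -- User, MediaWiki
  AND (page_title LIKE '%.js' OR page_title LIKE '%.css')
GROUP BY year, page_namespace
ORDER BY year, page_namespace;
```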
[15:36:12] yeah but it complicates the code a bit, I prefer the current version
[15:37:25] mforns: ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
[15:37:38] O.o
[15:37:39] * elukey plays sad_trombone.wav
[15:38:00] I tried select timestamp from Echo_7731316 WHERE timestamp IN (select timestamp from Echo_7731316 limit 10 offset 100);
[15:38:08] aha
[15:38:28] well
[15:38:32] so yeah plan B
[15:38:35] xD
[15:42:03] mforns: the other problem is that we need to check beforehand if the table has the id field or not
[15:42:14] * elukey loves corner cases
[15:42:34] aha
[15:42:59] interestinnggg
[15:43:03] if not we use.. uuid? But is it indexed ? Checking..
[15:43:18] still didn't get when we use uuid and when id though
[15:44:07] elukey, I recall that either id or uuid was removed and then added again, so new tables should have it
[15:44:36] we can pick that field, and assume bad performance in the 1st run
[15:45:01] for the interval of data without... wait, what I'm saying is stupid
[15:46:24] elukey, oh! if we're going for 2 separate queries, we may as well use: select min(timestamp), max(timestamp) ... LIMIT X OFFSET Y
[15:46:51] and then UPDATE FROM T ... WHERE timestamp between min and max
[15:47:32] it's probably faster than the IN comparison, and all tables have timestamps
[15:47:36] yeah we could but it is not super precise (I mean, with respect to ids)
[15:48:49] elukey, it is almost totally precise, the only case where it isn't is when several events are happening at the same second, and in this case, we'll have a batch of 1003-1004 instead of 1000, which is perfectly fine I guess
[15:49:13] mmmmh
[15:49:32] yea, maybe we'd be updating the same row twice...
[15:49:48] which is not incorrect, but ugly still
[15:50:16] mforns: what if we select min/max id rather than timestamp?
[15:50:24] and then we delete the range
[15:50:55] elukey, makes sense if ids are consecutive
[15:51:08] but still we have the problem of some tables having id and some others not
[15:52:28] oh sure sure
[15:52:49] I am trying to get which tables are weird
[15:53:52] elukey, if we do use the min(timestamp) and max(timestamp) and then carefully comparing with timestamp >= min and timestamp < max (min->inclusive and max->exclusive) we ensure that we'll not update the same row twice, no?
[15:54:16] and also that we won't leave any row without update
[15:55:22] well we don't care much if a row is updated twice no?
[15:56:18] elukey, not really, but still we can avoid it I think
[16:02:07] so SELECT table_name FROM information_schema.columns WHERE column_name = 'id' and TABLE_SCHEMA = 'log'; returns 322 on -store
[16:02:15] the equivalent with uuid 384
[16:02:30] err 381
[16:02:40] so 59 tables without id?
[16:05:34] nuria_: ping for 1on1? :)
[16:05:47] sorry, i forgot i had changed it! omw
[16:08:40] mforns: horrible query - SELECT distinct table_name FROM information_schema.columns WHERE TABLE_SCHEMA = 'log' AND table_name NOT IN (select table_name FROM information_schema.columns WHERE TABLE_SCHEMA = 'log' and column_name = 'id');
[16:08:52] but returns the 59 tables
[16:10:33] and the uuid seems to be an index
[16:10:37] (and populated)
[16:21:43] joal,urandom - https://www.youtube.com/watch?v=2l0_onmQsPI
[16:22:13] urandom: elukey: good, i guess?
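A rough sketch of mforns' two-step idea above, since MariaDB refuses a LIMIT inside the IN subquery. Echo_7731316 is just the test table from the error above, and the column being cleared is a hypothetical stand-in for whatever the purging script actually sanitizes:

```sql
-- Step 1: find the timestamp bounds of the next batch of ~1000 rows
-- (same-second events can push a batch to 1003-1004 rows, as noted above)
SELECT MIN(timestamp), MAX(timestamp) INTO @batch_min, @batch_max
FROM (
    SELECT timestamp
    FROM Echo_7731316
    ORDER BY timestamp
    LIMIT 1000 OFFSET 0
) AS batch;

-- Step 2: sanitize only that range; min inclusive, max exclusive,
-- so consecutive batches do not touch the same rows twice
UPDATE Echo_7731316
SET clientIp = NULL      -- hypothetical field to purge
WHERE timestamp >= @batch_min
  AND timestamp < @batch_max;
```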
[16:23:35] urandom: sure sure, I was only sending the link to interested people :)
[16:23:45] i'll have a look
[16:23:46] (the miami videos are all out afaics)
[16:30:45] elukey: I might ask you for a summary
[16:31:06] nuria_: do you have 5 minutes now?
[16:36:16] oof, a-team
[16:36:20] we have to roll the druid upgrade back
[16:36:23] :(
[16:36:26] bye ottomata !
[16:36:29] byyyeeee
[16:36:35] ottomata: what happened?
[16:36:44] new druid runs java 8, hadoop java 7
[16:36:45] ottomata: ah ok, you can tell us later
[16:36:46] fine
[16:36:46] but
[16:36:51] indexing tasks that use hadoop don't work
[16:36:56] so either we upgrade hadoop to java 8
[16:37:00] or we don't upgrade druid
[16:37:07] we can't do a willy nilly hadoop java upgrade
[16:37:10] or we try to run druid on java 7
[16:37:13] nope
[16:37:14] won't work
[16:37:36] https://groups.google.com/forum/#!topic/druid-user/aTGQlnF1KLk
[16:37:49] they polled a year ago to see if they should update
[16:37:53] most folks said sure!
[16:43:10] ottomata: i see
[16:43:22] ottomata: will spark work with java8 or is it only on java7
[16:43:33] unsure, it would need a lot of testing
[16:43:38] which is why i'm going to roll back
[16:43:46] i think updating to java 8 is a multi week project
[16:44:24] ottomata: much agreed
[16:45:10] ottomata: ya, java8 will need thorough testing for spark/scala for sure
[16:47:45] plus it will need a stop-of-the-world cluster upgrade
[16:48:08] ottomata: let me know if you need help rolling back (not sure if you have already done it)
[16:53:00] ottomata: should we send e-mail about druid/pivot not working? or do we think rolling back is a matter of minutes?
[16:53:24] elukey, sorry I missed your msg with query
[16:54:20] mforns: no problem :)
[16:54:22] ottomata: will send e-mail
[16:54:28] still wondering why we have id vs uuid
[16:54:41] so, I understand that all tables have uuid and that it is indexed? is that correct? is it unique across all tables? is it consecutive? :] barraging you with questions
[16:56:33] mforns: uuid will be unique per event
[16:56:37] so uuid seems populated in all the tables, indexed and using UNIQUE KEY
[16:56:41] mforns: but it should not be indexed by default
[16:56:53] mforns: as tables are created w/o indexes
[16:57:07] aha
[16:57:08] mforns: it could be that a process is adding indexes later
[16:57:30] all the ones that I checked with show indexes from have an index for uuid
[16:57:35] nuria_, I see, I guess, after 90 days the indexes will be in place
[16:57:51] after 90 days?
[16:58:03] elukey, yes when the purging applies
[16:58:18] new tables aren't going to be touched for 90 days no?
[16:58:24] mforns: wait, why would that be the case ? /me no follows
[16:58:45] mforns: well, unless there is a db trigger running that adds that index
[16:58:55] aha
[16:59:00] mforns: if all tables have it then there is a process that addsa it
[16:59:03] *adds it
[16:59:22] still not sure about the part that all the tables have it, need to double checke..
[16:59:27] *check..
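elukey's "need to double check" above (whether every log table really has the uuid index) can be answered in one pass against information_schema instead of running SHOW INDEXES table by table; a possible sketch:

```sql
-- Tables in the log database that have a uuid column but no index on it
SELECT c.table_name
FROM information_schema.columns c
LEFT JOIN information_schema.statistics s
       ON s.table_schema = c.table_schema
      AND s.table_name = c.table_name
      AND s.column_name = 'uuid'
WHERE c.table_schema = 'log'
  AND c.column_name = 'uuid'
  AND s.index_name IS NULL;
```

An empty result would confirm that some process is indeed adding the uuid index everywhere.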
[16:59:49] nuria_: it is weird that 59 tables do not have id but only uuid
[17:00:10] elukey: no, it makes sense, cause we altered table structure
[17:00:36] elukey: there is uuid which all tables should have, it is part of eventlogging itself
[17:00:47] elukey: so that uuid is also in hadoop
[17:00:54] elukey: and there is autoincrement id
[17:01:02] nuria_: it works
[17:01:06] druid/pivot
[17:01:09] it's just that data isn't being updated
[17:01:17] elukey: which was added/removed/added again for mysql performance purposes
[17:01:18] elukey: i haven't rolled back yet
[17:01:23] was getting lunch, am about to
[17:01:49] elukey: do you think i should just dpkg -i the 0.9 jars?
[17:02:03] or should I put them back in apt and remove the 0.10 ones?
[17:02:05] elukey, mforns ; as replicating with an autoincrement id makes it possible to have parallel consumers and data is still inserted in sequence
[17:02:57] ottomata: it could be an option but we'd risk that some ops would upgrade it by mistake for some reason (running apt-get install etc..).. I remember a ticket for something similar with piwik
[17:04:51] ottomata: maybe we could upload to reprepro 0.9 again, and then apt remove/reinstall 0.9
[17:05:02] not sure if druid will like it though
[17:06:15] nuria_: ah ok so the id was added manually and not strictly part of EL as uuid right? But why are 59 tables left without it? Not needed for perf?
[17:06:35] elukey: no, it was added to eventlogging mysql consumer code
[17:06:52] elukey: because those tables predate the id change
[17:07:08] elukey: and we have not yet changed to parallel consumption for mysql
[17:07:17] elukey: if we did those tables would need an id
[17:07:31] elukey: thus far the only consumer is working fine w/o help
[17:08:40] nuria_: ah ok so I'll consider id not usable for the purge logic
[17:09:42] elukey: ok
[17:10:18] elukey: going to do the following
[17:10:18] - revert https://gerrit.wikimedia.org/r/#/c/351691/3
[17:10:18] - remove druid 0.10 from apt, re-add druid 0.9
[17:10:18] - run puppet
[17:10:18] - apt-get remove && apt-get install druid-*
[17:10:18] - sudo /usr/local/bin/druid-hdfs-storage-cdh-link to regenerate cdh extension symlinks
[17:10:19] - restart services in order
[17:10:23] bottom of https://etherpad.wikimedia.org/p/analytics-ops-druid
[17:10:30] nuria_: thanks :)
[17:10:48] elukey: ok, let me know if you need anything else
[17:11:20] ottomata: looks ok to me!
[17:13:16] oof, except elukey, i don't really have the 0.9 .debs anymore...well, i have the .debs
[17:13:19] but not the full .changes or tarball that i used before
[17:13:26] i should be able to get it out of the deb repo
[17:13:27] buuuuuut
[17:13:29] sorry elukey, nuria_, needed to attend kid for a sec, so we can theoretically use uuid then right?
[17:13:36] not sure exactly how to do that, with git-buildpackage
[17:13:40] hmmm
[17:14:10] mforns: assume that uuid is always filled? yes
[17:14:19] mforns: but it is not sequential in any way
[17:14:53] ottomata: lemme check
[17:16:25] nuria_, elukey, I think uuid being non sequential is not a problem, provided we use the uuid IN (...) clause (as opposed to uuid between min_uuid and max_uuid)
[17:16:39] elukey: --git-upstream-branch=0.9.0 --git-debian-branch=debian-0.9.0 might do it, trying...
[17:16:40] something like that
[17:16:47] if i check those branches out properly
[17:17:35] I checked on copper if we had 0.9 in /var/cache but nothing.. yeah that should do the trick if the repo has the right tags
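mforns' point above, that uuid does not need to be sequential as long as the batch is passed as an explicit IN list, suggests a shape like the following for the purge batching. Table, cutoff and field are hypothetical placeholders; the real script would build the IN list between the two statements:

```sql
-- First round trip: the script fetches one batch of uuids
SELECT uuid
FROM Echo_7731316
WHERE timestamp < '20170224000000'   -- e.g. rows older than 90 days
ORDER BY uuid
LIMIT 1000;

-- Second round trip: the script interpolates the fetched uuids
UPDATE Echo_7731316
SET userAgent = NULL                 -- hypothetical field to purge
WHERE uuid IN ('...', '...', '...');
```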
[17:17:44] Analytics-EventLogging, Analytics-Kanban, DBA, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3289504 (Nuria) @tbayer: no, every table not in whitelist whose entire data was older than 90 days was dropped. >FWIW, it looks like these three are cur...
[17:28:47] mforns: definitely, I am trying something similar
[17:29:27] ok
[17:38:15] nuria_: pivot is not down
[17:45:32] nuria_: ready when you are for some uniques
[17:46:04] ottomata: sorry, i must have misunderstood joal when he told me it wasn't working. my mistake
[17:46:17] nuria_: Ah yes, sorry
[17:46:43] nuria_: data not updated, and possibly at some point down today - I didn't want to explicitly ask people to look at it now
[17:47:42] thus far we haven't had any downtime
[17:47:57] since we haven't had all copies of any given druid down at once
[17:48:03] been doing rolling bounces
[17:48:07] which has been working great
[17:48:42] awesome - I was just being on the careful side given some issues arose
[17:48:47] Maybe I shouldn't :/
[17:50:11] joal: I think it might be better to provide a link in SoS as otherwise people might not know/be familiar with what you are talking about
[17:50:28] right
[17:50:35] joal: uniques on cave whenever you are ready?
[17:50:42] let's go nuria_ :)
[17:51:46] ottomata: I hope I didn't upset you by wondering about downtime when you were doing your best to not have one :/
[17:52:50] haha, i am not upset! :)
[17:53:09] Ah, good :)
[17:55:06] nuria_, is there an estimated date for the MySQL EventLogging purge?
[17:55:20] This is something our team knows about in theory, but I don't think we've properly audited it.
[17:57:38] matt_flaschen, hi! The idea is to purge as soon as we finish the purging script and test it thoroughly
[17:58:42] in theory, this purge was intended for August/September 2015, but we've been dragging this since then for various reasons.
[17:59:34] mforns, yeah, I kind of thought you were already doing it, but I also don't specifically remember our team ever really auditing it (do we need older events for comparison, graphs, etc.?), so I'm somewhat concerned.
[18:00:03] mforns, I'll treat that as Real Soon Now, but if you do come up with a date, please let me know: T166245
[18:00:05] T166245: Audit Collaboration Team schemas and see if we want to request different purging policies for any - https://phabricator.wikimedia.org/T166245
[18:00:52] we have done 2 comprehensive audits on EL data and the purging strategies for all schemas, one in summer 2015, the other in summer 2016, in both cases I contacted all teams and the corresponding schema owners and discussed all details, no?
[18:01:06] the audits took several months
[18:02:26] it may be that since summer 2016 there are new schemas that have not been audited, but when the previous audits finished, we established the convention that new schemas would have the full-purge after 90 days policy by default
[18:03:15] mforns, understood. I am just saying I want to make sure we're ready, not that anything was done wrong by Analytics. (I'm sure the past discussions were thorough, but as you mention there are new schemas since both those dates).
[18:03:30] yes, that's right
[18:06:55] matt_flaschen, so, do you want us to wait until you guys know if there are new schemas (since summer 2016) that need a non-default purging strategy?
[18:08:13] ok, now I see the task, cool
[18:08:47] mforns, yeah, if you don't mind.
[18:08:55] I'll keep you updated.
[18:09:08] thanks a lot matt_flaschen :]
[18:09:13] Thank you.
[18:09:56] oook joal
[18:10:01] let's try pageview hourly load again
[18:10:03] 0.9.0 back up
[18:12:52] ottomata: nice!
[18:20:17] Analytics, Analytics-EventLogging, Community-Tech, Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3289760 (kaldari)
[18:25:01] Analytics, Analytics-EventLogging, Community-Tech: Remove EventLogging for cookie blocks - https://phabricator.wikimedia.org/T166247#3289784 (kaldari)
[18:30:47] elukey: , your friend, unless, for review: https://gerrit.wikimedia.org/r/#/c/355469/
[18:34:39] ottomata: looks good
[18:34:56] tested and working fine?
[18:36:27] not tested! :p
[18:36:38] just wanted to get it into gerrit before i stopped thinking about it
[18:36:56] i mean, tested on the CLI, but not via puppet
[18:37:26] okok :)
[18:39:27] Analytics-Kanban, Patch-For-Review: Update druid to latest release - https://phabricator.wikimedia.org/T164008#3289855 (Ottomata) We did this upgrade today, but ended having to roll back to 0.9.0. Druid 0.10 requires Java 8, which is fine. But the Analytics Hadoop cluster runs Java 7, and Hadoop indexi...
[18:40:36] Analytics, Analytics-Cluster: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3289857 (Ottomata)
[18:41:58] elukey: i wonder if we should consider java 8 for the new kafka cluster
[18:42:51] ottomata: let's read what is suggested by the confluent people.. surely it could be less hassle to upgrade later on
[18:43:09] mforns: https://gerrit.wikimedia.org/r/#/c/353265 - works on my testing env!
[18:43:27] it grabs the uuids and then update where in
[18:43:30] super simple
[18:43:37] needs more work and testing
[18:46:27] ottomata: going afk now, let's chat about it tomorrow ok?
[18:46:32] * elukey goes afk team!
[18:46:57] elukey, awesooome :] !! bye cya
[18:47:01] Analytics, Analytics-EventLogging, Community-Tech, Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3289902 (Ottomata) FYI, the Edit Review project (see T145164 and T143743) is also interested in so...
[18:48:13] (CR) Nuria: "Thanks for catching this." [analytics/refinery] - https://gerrit.wikimedia.org/r/355387 (https://phabricator.wikimedia.org/T165661) (owner: Joal)
[18:49:56] Analytics, Analytics-EventLogging, Community-Tech, Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3289913 (Ottomata) We were talking about adding things like this to the [[ https://github.com/wiki...
[18:51:48] Analytics, Analytics-Cluster, Operations, ops-eqiad: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3289924 (Ottomata) @Cmjohnson estimate on these? We'd like to get them up and running by the end of this quarter, so I'm going to nee...
[18:55:39] Analytics, Analytics-EventLogging: Bulk/Batch event endpoint - https://phabricator.wikimedia.org/T166249#3289933 (AMroczkowski)
[19:01:37] Analytics, Analytics-EventLogging: Bulk/Batch event endpoint - https://phabricator.wikimedia.org/T166249#3289933 (Ottomata) +1 to this idea. At our recent offsite, we talked a bit about having a more scalable and less fragmented event intake infrastructure, but it was mostly talk. Not sure what the prio...
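For context on the default policy mforns describes above (full purge after 90 days for schemas not in the whitelist), the per-table clean-up the purging script has to perform amounts to something like the following. This is only a sketch with a hypothetical table name, assuming EventLogging-style 14-character timestamps; the actual script under review is more involved:

```sql
-- Hypothetical non-whitelisted table: drop events older than 90 days,
-- in small batches to keep replication happy
DELETE FROM SomeSchema_12345678
WHERE timestamp < DATE_FORMAT(NOW() - INTERVAL 90 DAY, '%Y%m%d%H%i%s')
LIMIT 1000;
```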
[19:05:46] fdans: not sure if you are still there, but i'm gonna deploy your thang
[19:11:18] Analytics, Analytics-EventLogging: Bulk/Batch event endpoint - https://phabricator.wikimedia.org/T166249#3289933 (Nuria) > Doing so can help reduce battery usage and allow for event collection when there isn't an active internet connection. Right, this is the 'traditional' way to support offline events,...
[19:11:33] ottomata: okeis, let's deploy
[19:15:17] Analytics, Analytics-EventLogging: Bulk/Batch event endpoint - https://phabricator.wikimedia.org/T166249#3290005 (Ottomata) Which is why it would be cool to have a public API with a POST endpoint :)
[19:17:23] Analytics-EventLogging, Analytics-Kanban, DBA, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290043 (Tbayer) >>! In T161855#3289504, @Nuria wrote: > @tbayer: no, every table not in whitelist whose entire data was older than 90 days was dropped....
[19:18:07] whYYyYYY did scap work yesterday but not today?!
[19:18:28] Analytics-EventLogging, Analytics-Kanban, DBA, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290046 (Nuria) Amend: Here is the list of tables now, note that running this script was a one-off, purging script is being worked on in the prior ticket:...
[19:26:50] Analytics-EventLogging, Analytics-Kanban, DBA, WMF-NDA: Drop tables with no events in last 90 days. - https://phabricator.wikimedia.org/T161855#3290067 (Tbayer) Thanks for the list! Agree that leaving an empty table is much preferable to dropping it without a trace.
[19:27:18] Hey ottomata - Do you need help with druid?
[19:29:33] joal: yes, need to try / restart loading jobs
[19:29:41] ottomata: now?
[19:29:43] uhhh
[19:29:46] well, 0.9.0 is back up
[19:29:57] do we want to try and restart pageview hourly?
[19:30:00] or can it wait?
[19:30:02] I can do that
[19:30:22] oook, if you like!
[19:30:24] ottomata: I think it's the easiest way to test
[19:30:26] i'm watching middlemanager logs now
[19:30:27] yea
[19:30:33] ottomata: however, only if you feel it's ok
[19:30:44] ok, restarting then
[19:31:17] k
[19:31:20] yeah we need to try
[19:31:27] druid should be back to the way it was
[19:31:31] ok
[19:31:38] then we should be ok
[19:33:37] i see it
[19:33:43] hadoop job launched!
[19:33:54] started, looks so far
[19:33:54] its failing?!
[19:34:12] no no, I don't think
[19:34:36] Error: io/druid/storage/hdfs/HdfsStorageDruidModule : Unsupported major.minor version 52.0
[19:34:37] everything looks good I think
[19:34:48] wow
[19:34:51] maps keep failing
[19:34:57] https://yarn.wikimedia.org/proxy/application_1492691387549_112939/mapreduce/attempts/job_1492691387549_112939/m/FAILED
[19:35:05] oh wait
[19:35:07] maybe wrong job?
[19:35:12] possibly
[19:35:33] naw same deal
[19:35:33] https://yarn.wikimedia.org/proxy/application_1492691387549_112940/mapreduce/attempts/job_1492691387549_112940/m/FAILED
[19:35:34] this one
[19:35:45] waiiiiit
[19:36:06] looks ok to me
[19:36:27] some jobs have failed, but most have succeeded
[19:36:32] oh yeah?
[19:36:39] hm yeah maybe so
[19:36:41] that's pretty weird
[19:39:17] SUCCESS
[19:39:18] ooook
[19:39:30] but
[19:39:30] Failed map tasks=3
[19:39:30] Failed reduce tasks=4
[19:39:31] why?
[19:39:39] hm, don't know
[19:40:36] Error running child : java.lang.UnsupportedClassVersionError: io/druid/storage/hdfs/HdfsStorageDruidModule : Unsupported major.minor version 52.0
[19:41:33] OoooOk joal, i mean, i guess we don't know if this wasn't happening before
[19:41:37] but um, we call that a success?
[19:41:39] uhhh
[19:42:16] That's indeed a bit weird
[19:42:45] maybe some are using the 0.10 hdfs extension jar?!
[19:42:46] but, they can't
[19:42:50] it doesn't exist
[19:43:05] ottomata: or some different java version on certain nodes of the cluster?
[19:43:11] That'd be weirdoooooh
[19:43:31] But yes ottomata - Let's call that success
[19:44:49] hm, no don't think so. hm
[19:45:04] hm, joal one thing that is different
[19:45:22] i don't think that i ever updated symlinks for the druid hdfs extension after the cdh 5.10 upgrade from a month or so ago
[19:45:44] hm
[19:46:01] hmm, no, they are linked against the non versioned jars
[19:46:05] hadoop-mapreduce-client-app.jar -> /usr/lib/hadoop/client/hadoop-mapreduce-client-app.jar
[19:46:07] so it wouldn't matter
[19:46:16] That's weird
[19:46:34] ok, dunno
[19:46:43] let's keep an eye on those oozie loading jobs
[19:46:47] i guess you can reenable them joal
[19:46:53] and it's late for you so seee yaaaaz
[19:47:46] looks like something has changed ottomata - this is a previous job (before upgrade/rollback): https://yarn.wikimedia.org/jobhistory/job/job_1492691387549_111913
[19:47:52] Anyway, seems to work as you said
[19:47:55] Going to bed
[19:47:58] See you tomorrow
[19:48:39] maybe there are some leftovers
[19:55:13] oook
[19:55:16] laters!
[20:02:19] ottomata, mforns: what's the link again for the current version of the EL purging whitelist?
[20:02:35] HaeB, one sec :]
[20:02:50] no rush ;)
[20:03:12] HaeB, https://gerrit.wikimedia.org/r/#/c/298721/ I haven't added the userAgent fields yet
[20:05:45] mforns: ah ok so that version is still current... i think there should be a stable link at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging (that always points to the latest version) so that people can check whether a particular schema is whitelisted currently
[20:12:38] HaeB, makes sense, when that whitelist is merged, we'll have a file in puppet that will be linkable from wikimedia github i.e.
[20:22:21] mforns: i see. note that (independent of the apps thing) there are likely quite a few new schemas created since july 2016 with whitelist information on the schema talk page. i will file tasks for some that i'm aware of. ideally that kind of information would be synced automatically...
[20:25:03] Haeb, yea... now, the idea behind the white-list being the single source of truth is that all purging strategies get reviewed by Analytics
[20:25:25] if we set up an automatic sync from talk page to white-list, we lose the review no?
[20:26:40] also, talk page information is not detailed enough to get translated to a patch to the white-list I guess
[20:27:41] HaeB, oh! you mean the other way round? like, from white-list to talk page? that would be very nice!
[20:30:12] mforns: yes, or at least an automated "this information is not consistent with the current version of the actual whitelist" type warning
[20:30:29] for now i have added a link to the wikitech documentation https://meta.wikimedia.org/w/index.php?title=Template:SchemaDoc&diff=prev&oldid=16808766
[20:30:33] aha
[20:30:49] thx
[20:34:09] Analytics-Kanban: Blog post about druid - https://phabricator.wikimedia.org/T157978#3290233 (Milimetric)
[20:50:40] Analytics, Analytics-EventLogging, Community-Tech, Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3290278 (Nuria) > I'm going to reopen this task. Here are my reasons: As you wish as I imagine is...
[20:52:08] woah superset is getting cooler
[20:52:17] it connects to hive and mysql pretty easily
[20:52:19] as well as druid
[20:54:13] Analytics, Analytics-EventLogging, Community-Tech, Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3290282 (kaldari) Another reason we would like to have this data in EventLogging is that it would...
[20:54:30] :) too bad I had to like type in each column one by one with JSON dimension specs
[20:54:39] (to get superset to see druid)
[20:54:51] maybe not anymore?
[20:54:55] milimetric: just updated new version
[20:55:05] checkin
[20:55:06] i didn't run it with the db you guys had done previously though
[20:55:37] milimetric: ssh -N druid1001.eqiad.wmnet -L 8088:druid1001.eqiad.wmnet:8088
[20:55:42] localhost:8088
[20:55:43] admin/admin
[20:56:05] i added database sources for both hive wmf and mysql eventlogging
[20:58:41] hm... it works with druid now, that's nice
[20:59:20] it's a bit hard to look at, I'm kind of mourning pivot. I think for my own personal use I'm leaning towards just dockerizing the updated pivot and using that.
[20:59:35] also, I hope we don't kill the old pivot either.
[21:04:51] rats, a pageview hourly druid job failed
[21:04:52] looking
[21:05:21] milimetric: yeah, superset is def more for prebuilt dashboards
[21:06:15] yeah, the new pivot has dashboarding too
[21:06:21] but spilled milk
[21:06:39] thing is, I usually just drink spilled milk right off the floor/table. Nothing wrong with it :P
[21:09:14] !log pausing all druid oozie coordinators until hadoop loading is fixed
[21:09:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:11:38] Analytics-Kanban, Patch-For-Review: Update druid to latest release - https://phabricator.wikimedia.org/T164008#3290363 (Ottomata) Oof, Hadoop druid loading jobs are still failing, even after rolling back: ``` 2017-05-24T20:59:07,294 ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child :...
[22:12:45] Analytics, DBA, Labs, MediaWiki-Page-deletion, Tool-Labs-tools-Database-Queries: Database replication issues with deleted pages (affecting Tool Labs and Analytics Store) - https://phabricator.wikimedia.org/T166194#3290582 (Tbayer) >>! In T166194#3288072, @jcrespo wrote: > This is a known issu...