[03:14:03] (03PS1) 10Milimetric: Remove decomissioned dashboards [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340)
[04:16:33] (03CR) 10Nuria: "Looks good, please also remember to remove hiera configs and references to these dashboards from meta." [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340) (owner: 10Milimetric)
[04:16:43] (03CR) 10Nuria: [V: 032 C: 032] Remove decomissioned dashboards [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340) (owner: 10Milimetric)
[04:27:03] 10Quarry: Show the execution time in the table of queries - https://phabricator.wikimedia.org/T71264 (10zhuyifei1999)
[04:27:09] 10Quarry, 10Patch-For-Review: Include query execution time - https://phabricator.wikimedia.org/T126888 (10zhuyifei1999) 05Resolved>03Open @Framawiki: I think using the time of `cur.execute` to get the execution time is flawed. There is no way [[https://quarry.wmflabs.org/query/17928|this query]] coul...
[04:37:58] 10Quarry, 10Patch-For-Review: Include query execution time - https://phabricator.wikimedia.org/T126888 (10zhuyifei1999) Tested with https://quarry.wmflabs.org/query/30399: ``` SELECT SLEEP(10); ``` waits for completion in cur.execute(), while ``` USE enwiki_p; SELECT SLEEP(10); ``` doesn't.
[04:45:10] 10Quarry, 10Patch-For-Review: Include query execution time - https://phabricator.wikimedia.org/T126888 (10mahmoud) It would be nice to have both execute time and combined execute + fetch time, as the latter more accurately represents the time an application would spend waiting (which would really help with prototyp...
[04:53:29] 10Quarry, 10Patch-For-Review: Include query execution time - https://phabricator.wikimedia.org/T126888 (10zhuyifei1999) I guess we could just change to execute + fetch + store. Beware though, that can be heavily lagged due to the use of SQLite on NFS for storing the results,
[05:01:18] 10Quarry: Example queries for Quarry - https://phabricator.wikimedia.org/T207098 (10zhuyifei1999) https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Example_queries ?
[06:44:31] 10Analytics, 10Operations, 10ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10elukey) p:05Triage>03High
[06:45:23] 10Analytics, 10Operations, 10ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10elukey) @Cmjohnson this server is OOW but the replacement will take time to arrive (still in procurement..) and this host is really important for the research users. Do we have a spare disk tha...
[09:30:53] joal: o/
[09:31:08] whenever you have time I'd need to ask you a couple of things about banner impression data
[09:31:16] since I am a n00b in Druid indexing
[09:31:18] :)
[10:34:09] * elukey lunch!
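(Editorial note: the T126888 thread above contrasts timing `cur.execute()` alone with timing execute plus fetch. A minimal sketch of the two measurements using Python's DB-API — the pymysql driver, the connection, and the `run_timed` helper are assumptions for illustration, not Quarry's actual code:)

```python
import time

import pymysql  # assumption: any DB-API 2.0 driver behaves the same way here


def run_timed(conn, sql):
    """Illustrative only: time cur.execute() and the fetch separately.

    As noted in the thread, a multi-statement query can return from
    execute() before the server has finished the real work, so the
    execute time alone can be misleadingly small.
    """
    with conn.cursor() as cur:
        t0 = time.monotonic()
        cur.execute(sql)
        execute_time = time.monotonic() - t0

        t1 = time.monotonic()
        rows = cur.fetchall()
        fetch_time = time.monotonic() - t1

    # execute + fetch approximates what an application actually waits for
    return rows, execute_time, execute_time + fetch_time
```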
[11:02:49] PROBLEM - Check the last execution of eventlogging_db_sanitization on db1108 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_db_sanitization
[11:05:39] PROBLEM - Check the last execution of eventlogging_db_sanitization on db1107 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_db_sanitization
[12:20:27] 10Analytics, 10Analytics-EventLogging: eventloggiong_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Marostegui)
[12:21:31] ACKNOWLEDGEMENT - Check the last execution of eventlogging_db_sanitization on db1107 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_db_sanitization Marostegui T207165 - The acknowledgement expires at: 2018-10-17 15:21:05.
[12:21:31] ACKNOWLEDGEMENT - Check the last execution of eventlogging_db_sanitization on db1108 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_db_sanitization Marostegui T207165 - The acknowledgement expires at: 2018-10-17 15:21:05.
[12:22:38] 10Analytics, 10Analytics-EventLogging: eventloggiong_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Marostegui) I've ack'ed the alerts for 3 hours on db1107 and db1108.
[12:23:02] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Marostegui)
[12:23:44] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Marostegui)
[12:28:48] Heya elukey - druid time?
[12:30:05] here I am :)
[12:30:29] elukey: How may I help you?
[12:32:33] elukey: I'm surprised, pageview-hourly-wf-2018-10-15-17 was still red in oozie but supposedly reran yesterday?
[12:32:41] !log rerun pageview-hourly-wf-2018-10-15-17
[12:32:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:40] joal: I re-ran a pageview hourly job this morning, I've sent an email about it
[12:33:45] did you re-run the same?
[12:34:13] yes I've seen you reran one - one was still failed in pageview-hourly-coord :(
[12:34:50] ah wait I missed the other one you are right
[12:34:58] just seen the most recent one this morning :(
[12:35:19] so about druid :)
[12:35:29] Yes I have just double checked: I reran 14241, yours was 14256
[12:35:31] druid
[12:35:53] oh yes my bad sorry, thanks for triple checking :)
[12:35:54] https://phabricator.wikimedia.org/T203669
[12:35:58] context in --^
[12:36:09] so in theory we can use eventlogging_CentralNoticeImpression
[12:36:14] with KIS
[12:36:15] Right - I have seen that
[12:36:48] now I am wondering one thing - can we use KIS to index "real time" and keep the existing daily banner impression batch version from webrequest?
[12:37:01] Now I have a related question - What about https://phabricator.wikimedia.org/T204396 ??
[12:37:24] I wasn't aware of it
[12:37:26] :(
[12:37:47] elukey: even without considering the issue above, having realtime and batch from different datasources sounds like a bad idea
[12:38:23] elukey: for various reasons - schema reconciliation and possible discrepancies
[12:38:29] most prominently
[12:38:43] okok makes sense, I was only trying to figure out if it was possible or not :)
[12:39:19] so the best course of action would be to set up KIS and a "batch" daily/hourly job for eventlogging_CentralNoticeImpression as separate entities
[12:39:34] in this case, we'll not be able to backfill anything
[12:39:43] elukey: +1 for that solution
[12:39:56] CentralNoticeImpression data is available from month=3
[12:40:19] ah so maybe we could start from there
[12:40:34] elukey: However there are big discrepancies in data sizes
[12:40:37] by month
[12:40:46] (03PS7) 10Fdans: Add change_tag to mediawiki_history sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465416
[12:41:33] heya fdans - Have you found the problem with the AQS npm-install thing?
[12:42:20] elukey: I think it would be interesting to triple check data from CentralNoticeImpression before moving to that
[12:42:29] elukey: just as a matter of safety
[12:43:19] joal: yep yep I am going to ask and add some questions, I needed to verify with you first :)
[12:44:13] cool elukey :)
[12:44:18] so failures from the el databases
[12:44:19] Oct 16 11:00:00 db1107 eventlogging_cleaner[27817]: ERROR: line 645: Some table prefixes in the whitelist do not match any table name retrieved from the database. Please review the following entries of the whitelist: ['ResourceTiming']
[12:45:01] so if I have to guess, ResourceTiming was added to the whitelist and no event has landed yet
[12:49:55] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10elukey) Thanks! This should be a protection mechanism that in this case caused a false positive. So https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/466607/ introduces a new...
[12:50:36] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10elukey) @Gilles hi! Do you know when ResourceTiming will start registering events in Eventlogging?
[12:51:25] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Gilles) It already is
[12:55:52] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Gilles) ``` 0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> SELECT COUNT(*) FROM event.resourcetiming WHERE year = 2018; [...] 6536590 1 row selected (48.308 seconds) ```
[12:56:07] heloooo
[12:56:30] yohoo
[12:58:03] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10elukey) Thanks! So this might be the case of schema present only on Hadoop and not on Mysql? If so the logic that triggered the above check needs to be removed :)
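(Editorial note: the protection mechanism elukey describes amounts to a prefix check like the sketch below — every whitelist entry must match at least one table actually present in the database, so a schema like ResourceTiming that exists in Hive but has no MySQL table trips the error quoted at 12:44:19. Illustrative Python, not the actual eventlogging_cleaner code:)

```python
def check_whitelist_prefixes(whitelist_prefixes, db_tables):
    """Flag whitelist entries that match no table in the database.

    A prefix like 'ResourceTiming' is expected to match tables such as
    'ResourceTiming_18207514' (schema name plus revision suffix); if it
    matches nothing, the schema probably never landed in MySQL.
    """
    unmatched = [
        prefix for prefix in whitelist_prefixes
        if not any(table.startswith(prefix) for table in db_tables)
    ]
    if unmatched:
        raise RuntimeError(
            'Some table prefixes in the whitelist do not match any table '
            'name retrieved from the database. Please review the following '
            'entries of the whitelist: %s' % unmatched)
```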
[12:58:05] o/
[12:58:11] mforns: ---^
[12:58:43] elukey, reading
[13:00:49] I don't remember how to check what EL events are whitelisted for mysql and what not, but I am pretty sure that ResourceTiming isn't :)
[13:02:05] 10Analytics, 10Analytics-EventLogging: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10mforns) @Gilles @elukey Since we changed the EL blacklist that prevented schemas from being loaded to MySQL to a whitelist, new schemas are being loaded only to Hive by default. So this s...
[13:03:48] elukey: https://github.com/wikimedia/puppet/blob/production/modules/eventlogging/files/plugins.py#L7-L12
[13:04:07] elukey, yes, resourceTiming was created this september
[13:04:46] elukey, all new schemas from now on will fall in this case by default
[13:04:50] ottomata: thanks!
[13:04:54] elukey, you want to pair on a fix?
[13:04:58] mforns: sending a code change now :)
[13:05:05] oh! ok
[13:05:57] ninja speed
[13:06:59] mforns: it is super easy, but let's triple check that I have not missed anything
[13:07:03] k
[13:07:18] mforns: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467679
[13:10:00] 10Analytics, 10Analytics-Wikistats, 10User-Elukey: Git push and pull don't complete - https://phabricator.wikimedia.org/T206331 (10ezachte) 05Open>03Resolved @elukey yes, it works like a charm now, thanks so much :-)
[13:13:19] elukey, I think it's great
[13:13:58] \o/
[13:15:15] 10Analytics, 10Tool-Pageviews: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Amire80)
[13:17:09] elukey: ooo https://engineering.salesforce.com/open-sourcing-mirus-3ec2c8a38537
[13:17:47] wooowww
[13:17:53] also https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
[13:23:57] RECOVERY - Check the last execution of eventlogging_db_sanitization on db1108 is OK: OK: Status of the systemd unit eventlogging_db_sanitization
[13:24:55] mforns: restarted the cleaner on db1108, seems working fine!
[13:25:14] * elukey likes icinga alarms
[13:30:06] 10Analytics, 10Analytics-EventLogging, 10Patch-For-Review: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10elukey) 15:24 RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational 15:29 RECOVERY - Check sys...
[13:30:27] 10Analytics, 10Analytics-Kanban: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10elukey) p:05Triage>03High a:03elukey
[13:36:57] RECOVERY - Check the last execution of eventlogging_db_sanitization on db1107 is OK: OK: Status of the systemd unit eventlogging_db_sanitization
[13:43:13] 10Analytics, 10User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (10elukey) @AndyRussG sorry for the lag but I had to clarify with Joseph some details :) So first of all, we'd need to upgrade Druid to 0.12.3 before proceeding, to have a robust Kafka Index...
[13:47:47] elukey, great :]
[13:50:19] mforns: if you have time, https://phabricator.wikimedia.org/T128623
[13:50:28] (even during the next days)
[13:50:36] we should decide what to do with that task :)
[13:52:17] elukey, looking
[13:54:36] elukey, I think: 1) Remove Echo from Whitelist, 2) delete tables from Hive and MySQL
[13:54:57] ok maybe something that we can do tomorrow?
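(Editorial note: the plugins.py whitelist ottomata links to acts as a consumer-side filter — only whitelisted schemas reach MySQL, everything else lands in Hive only, which is the behaviour mforns describes. A minimal sketch of that idea; the schema names and function below are illustrative assumptions, not the real plugin code:)

```python
MYSQL_SCHEMA_WHITELIST = {
    # Only schemas listed here are written to MySQL; any schema not
    # listed (e.g. a new one like ResourceTiming) goes to Hive only.
    'NavigationTiming',  # hypothetical entries for illustration
    'SaveTiming',
}


def mysql_filter(event):
    """Drop events whose schema is not whitelisted for MySQL."""
    if event.get('schema') in MYSQL_SCHEMA_WHITELIST:
        return event
    return None  # returning None filters the event out of the MySQL stream
```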
[13:55:05] elukey, yes, totally
[13:55:10] pairing so we don't accidentally drop database log
[13:55:11] :P
[13:55:21] today I can write the puppet change to the whitelist
[13:55:26] tomorrow we do the drops
[13:55:30] ack
[13:55:34] k
[13:56:26] 10Analytics, 10User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (10AndyRussG) >>! In T203669#4670568, @elukey wrote: > the size of every month varies a lot, so we were wondering about the consistency of the data over the past months. Thanks!!!!!! That so...
[14:03:02] elukey: ok datalake/labs/cloud/hadoop/presto node naming time!
[14:03:06] ok
[14:03:06] so
[14:03:17] the HA name of our hadoop cluster is 'analytics-hadoop'
[14:03:25] maybe the new one can be called 'cloud-hadoop'
[14:03:26] ?
[14:03:37] and
[14:03:44] we could call nodes abbreviated 'ch'
[14:03:45] so
[14:03:51] ch-worker1001
[14:03:52] ch-master1001
[14:03:52] etc.
[14:04:01] ?
[14:04:44] or maybe we use the word 'public' rather than 'cloud'?
[14:04:53] public-hadoop
[14:04:56] ph-worker1001, etc.?
[14:05:54] (I am thinking :)
[14:06:51] so in theory if we assume that eventually all our analytics hadoop nodes are refactored, we'll get to an-worker1028, an-worker1029, etc.. hence this one should be cl-master, cl-worker, etc..?
[14:07:14] ch-
[14:07:15] yes
[14:07:23] why ch?
[14:07:26] cloud-hadoop
[14:07:34] sure but we use 'an-' now
[14:07:37] not ah-
[14:07:39] yayya
[14:07:41] but oh well
[14:08:07] maybe ah would have been better, dunno, but
[14:08:09] 'cloud' is not ours
[14:08:13] so i don't want to call these cloud-worker
[14:08:22] cloud-hadoop-worker => ch-worker
[14:08:22] ?
[14:08:43] (btw, this is a naming discussion I don't like, because we have too many constraints :( )
[14:08:55] ok ok it is fine to me, I was assuming that we needed to keep consistency with an-etc.. my bad :) +1 for ch-something
[14:08:56] there is no good name for this thing
[14:09:25] ok, i will file ticket, if we need more name bikeshedding we can do more
[14:09:37] ah quick summary from Chris
[14:09:38] (03CR) 10Milimetric: "The hiera config stays, because there are a couple of other dashboards on that domain. I'll look through meta." [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340) (owner: 10Milimetric)
[14:10:02] I think that we'll get the last 5 nodes to the cloud setup, he would prefer to rack the hadoop nodes now
[14:10:07] (analytics hadoop)
[14:10:09] those are 13
[14:10:10] why?
[14:10:42] I think it should be related to how he organizes the DC, but the next batch arrives very soon
[14:10:47] so it doesn't really matter
[14:10:55] I think that whatever works for him is good no?
[14:10:56] if we take 5 for cloud-hadoop
[14:11:04] there are still 18 to rack, no?
[14:11:36] ya i guess it doesn't matter
[14:11:54] would kinda like to work on those this week if i could, since petr is kind of unavailable, which slows down MEP work a bit
[14:12:00] I mean, I tend to follow Chris' suggestions since he is the one sitting there among all the boxes :)
[14:12:04] since the hw is in, was hoping we could rack 5 asap and do it
[14:12:18] yeah but timing wise it shouldn't change much
[14:12:22] i guess i can make puppet patches either way
[14:12:23] anyhow, he proposed this config
[14:12:25] A2=2 servers, A4=1 server, A7=2 servers, B2=2 servers, B4=1 server, B7=2 servers, C2=1, C4=2, C7=1, D2=2, D7=2
[14:12:25] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7
[14:12:26] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[14:12:36] 10G ports/racks of course
[14:13:01] looks fine to me
[14:13:25] I'll ask if we can prioritize the cloud ones
[14:13:27] timing does make a little difference, just for my own work prioritization and free time, but i guess it'll be fine
[14:13:30] ok
[14:13:34] i'll make the ticket anyway
[14:13:34] :)
[14:13:46] sure sure
[14:13:50] elukey: hm naming tho
[14:13:53] these nodes all have the same hw
[14:14:09] i guess we are going to do colocated worker + master on the first 2 with HA master stuff?
[14:14:45] I'd prefer not to if possible, but maybe only 3 worker nodes are not enough?
[14:15:42] well, i wouldn't want to use this hardware for just hadoop masters
[14:15:53] maybe we should ask for a couple of miscs? :/
[14:15:57] miscs
[14:16:09] or...dundundunnnn ganeti?
[14:17:10] could be a good idea
[14:17:48] better than to have things co-located in my opinion
[14:17:59] (for ops sake, to have workloads separated, etc..)
[14:18:41] hm, we will have another hive server and metastore too I think? or hm, we could use the same mysql db we use, just a different metastore database
[14:19:03] so hive-server and hive-metastore procs could be ganeti
[14:19:30] elukey: should I ask for +3 ganeti instances then: ch-master1001, ch-master1002, ch-coord1001? and then ask for 5 racked hw nodes: ch-worker100[12345] ?
[14:20:42] HMMMMMM i'm having second thoughts on using the word 'cloud' too. maybe public is better?
[14:21:23] elukey: that "The last packet successfully received from the server was 31,700,034" happened yesterday too, I replied to the oozie email about it
[14:21:33] I think it was the same job, pageview hourly, one sec
[14:22:40] no, it was the ApiAction oozie job
[14:23:17] ah, you're not on that list, weird, we should fix that job's mailing list, I'll send a patch
[14:23:26] anyway, here's the failure: https://hue.wikimedia.org/oozie/list_oozie_workflow/0006012-181009135629101-oozie-oozi-W/?coordinator_job_id=0051411-170829140538136-oozie-oozi-C&bundle_job_id=0051409-170829140538136-oozie-oozi-B
[14:23:36] elukey: also, am considering using the same naming convention for other clusters we have too, so the hadoop cluster name would be 'public-eqiad' ? so Hadoop clusters: public-eqiad ?
[14:23:38] so it looks like the mysql metastore is acting up?
[14:24:18] (03PS2) 10Milimetric: [SPIKE] [Don't merge] [analytics/refinery] - 10https://gerrit.wikimedia.org/r/466730
[14:24:44] elukey: we could then prefix the nodes with ph ?
[14:24:50] ph-worker1001
[14:24:51] ?
[14:26:37] seems like phabricator nodes, might be confusing to sres
[14:27:03] public-eqiad seems fine
[14:30:41] hmm that's true.
[14:30:42] hm
[14:31:16] pu-worker ?
[14:31:17] haha
[14:31:17] no
[14:31:20] puh
[14:31:21] man
[14:31:23] sucks so bad
[14:39:43] what about ch-worker?
[14:39:53] I thought we were settling for that one
[14:40:14] i started walking back on using 'cloud' at all
[14:40:34] yes these are accessible from cloud vps
[14:40:41] but that's the only reason really to call them 'cloud', eh?
[14:42:19] ah ok I thought that the scope of the hadoop cluster was limited to the cloud virts
[14:42:47] but in theory we'll have something like the labsdb
[14:42:49] right?
[14:43:27] yeah
[14:44:53] yar ok we will discuss post standup!
[14:44:53] (03PS1) 10Milimetric: Standardize emails and remove former colleagues [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467700
[14:45:29] (03CR) 10Ottomata: [C: 031] Standardize emails and remove former colleagues [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467700 (owner: 10Milimetric)
[14:47:00] elukey: just making sure you saw my ping about the metastore above
[14:48:35] milimetric: ah no sorry didn't see it!
[14:49:23] wow after almost 3y I discover a new email!
[14:54:01] joal: sorry I missed your msg earlier
[14:54:20] I've no idea what I'm doing wrong but npm fails to install sqlite
[14:54:50] like, I think the problem is that it gets a 403 when trying to download from mapbox's repo
[14:55:02] node-pre-gyp ERR! Tried to download(403): https://mapbox-node-binary.s3.amazonaws.com/sqlite3/v3.1.13/node-v64-darwin-x64.tar.gz
[14:55:09] joal ´
[14:55:35] :(
[14:55:55] joal: the funny thing is that i don't have this problem using my patch on aqs
[14:56:11] 10Analytics, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics, 10Patch-For-Review: Decommision edit analysis dashboard - https://phabricator.wikimedia.org/T199340 (10Milimetric) Ok, the dashboards are offline, data generation is stopped, there are some placeholder pages in place, and I've...
[14:56:13] joal: do you get a 403 here? https://mapbox-node-binary.s3.amazonaws.com/sqlite3/v3.1.13/node-v64-darwin-x64.tar.gz
[14:56:16] fdans: I'm not sure of what it means
[14:56:31] I do fdans - 403
[14:56:43] you don haz credentials
[14:56:54] or you cannot access on purpose :)
[14:57:16] joal: how are you downloading sqlite then?
[14:57:19] when npm installing
[14:57:43] fdans: I never checked - doing it again
[14:57:48] cd
[14:58:02] * joal is now home
[14:58:03] joal: just checking, you're removing node_modules?
[14:58:08] yup
[14:59:19] https://github.com/mapbox/node-sqlite3/issues/918
[14:59:24] if it helps
[14:59:57] fdans: https://mapbox-node-binary.s3.amazonaws.com/sqlite3/v3.1.13/node-v59-linux-x64.tar.gz
[15:03:42] joal: wat but why do you get that and i dont???
[15:03:55] fdans: node-vXXXX changes
[15:04:15] fdans: Mine is v59, yours is v64
[15:04:21] and I have no clue why
[15:04:29] joal: OH WAIT
[15:04:36] joal: mac vs linux?
[15:04:43] must be !
[15:21:31] 10Analytics, 10Analytics-Wikistats, 10User-Elukey: Git push and pull don't complete - https://phabricator.wikimedia.org/T206331 (10ayounsi) >>! In T206331#4666622, @elukey wrote: > @ayounsi Let me know if this is not ok to you. I swapped cobalt's ipv4/6 in the analytics-in4/6 filters, it seems not correct to...
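(Editorial note: the mystery above resolves neatly — node-pre-gyp derives the prebuilt-binary URL from the node ABI version and the platform, so fdans' Mac (ABI v64, darwin) and joal's Linux box (ABI v59, linux) request different tarballs, and only some of them exist in the mapbox bucket. A rough Python reconstruction of the URL pattern, inferred from the two URLs in the log; the real template is declared in node-sqlite3's package.json:)

```python
# Illustrative reconstruction only, based on the two URLs seen above.
BASE = 'https://mapbox-node-binary.s3.amazonaws.com/sqlite3'


def prebuilt_url(pkg_version, node_abi, platform, arch):
    """Compose the S3 URL node-pre-gyp asks for, per ABI and platform."""
    return f'{BASE}/v{pkg_version}/node-v{node_abi}-{platform}-{arch}.tar.gz'


print(prebuilt_url('3.1.13', 64, 'darwin', 'x64'))  # fdans' 403
print(prebuilt_url('3.1.13', 59, 'linux', 'x64'))   # joal's working URL
```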
[15:50:22] (03CR) 10Nuria: [V: 032 C: 032] "I think what it used to be hiera is now 1 proxy config per dashboard that needs to be deleted on the horizon tool: https://horizon.wikimed" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340) (owner: 10Milimetric)
[15:51:00] milimetric: do we use at all wmui-base in wikistats?
[15:51:58] I think so, but we should keep it and use it more
[15:52:05] *I'm not sure / I think so
[15:55:07] milimetric: I'm getting a bit of a cryptic npm error because of it
[15:55:17] https://www.irccloud.com/pastebin/LN8nvFo4/
[15:56:45] (03CR) 10Nuria: "Looks good, let's make a ticket so we can group more changes if there is any and we can remember to start bundle when deploying these." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467700 (owner: 10Milimetric)
[16:31:41] joal: https://prestodb.io/overview.html
[16:31:47] Additionally, a remote Hive metastore is required
[16:31:54] (when using hadoop as data store)
[16:31:56] ok ottomata :(
[16:32:34] it can query cassandra tho, which is interesting...maybe we will want to make the aqs cassandra cluster also queryable via presto in labs
[16:32:35] WHO KNOWS
[16:32:37] :)
[16:33:59] ottomata: so the final count of hadoop nodes for each row should be A:16 B:16 C:9 D:13
[16:34:08] (decom + expansion)
[16:34:28] adding nodes to row c is a bit more problematic
[16:34:42] but overall it looks good to me
[16:34:42] aye that's fine
[16:34:45] super
[16:34:46] sounds good!
[16:37:56] ottomata, is it possible/likely that cron launches the EL2Druid job and it sits there waiting to get executed for more than 1 hour?
[16:38:09] ottomata: other question - new hosts starting from an-worker1001 or an-worker1078?
[16:38:14] I vote for the latter
[16:38:29] mforns: sure that's possible
[16:38:33] not likely but possible
[16:38:41] elukey: i vote for latter too plz
[16:38:44] ack
[16:38:51] ottomata, but even the code that runs in the coordinator, like parameter parsing?
[16:39:01] driver
[16:39:02] oh, no that mforns
[16:39:18] even with deploy-mode cluster?
[16:39:22] cron will run, but hadoop job might delay or be slow
[16:39:25] hmmmm
[16:39:30] considering...
[16:39:39] i think you are right
[16:39:50] the master process won't start until hadoop schedules it to run
[16:39:59] hm
[16:40:39] this is a problem...
[16:41:03] is deploy-mode client an option?
[16:41:38] mforns: didn't you say you were going to set the dates with bash in the cron though?
[16:42:04] the spark-submit command (with the shell params expanded/evaled) will be executed locally
[16:42:05] ottomata, bash is not going to interpret params, because they are in the config file
[16:42:11] oh right.
[16:42:17] mforns: you could set the dates on the cli.
[16:42:20] hm
[16:42:21] in this case it doesn't work
[16:42:59] I also tried to use generate(...) but this executes for each puppet run, not cron run...
[16:43:12] --config_file <> for most things, and then in this (temporarily?) case --since $(date ...) --until $(date ...) in cron CLI
[16:43:33] right, those will override the config file, makes sense
[16:43:35] in the $job_opts
[16:43:36] ya
[16:43:39] ok!
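(Editorial note: the workaround settled on above — keep most options in the config file, but pass --since/--until on the command line so cron's shell expands $(date ...) at execution time — relies on CLI flags taking precedence over config-file values. A minimal sketch of that precedence, with hypothetical option and section names mirroring the discussion, not the actual EL2Druid code:)

```python
import argparse
import configparser


def load_options(argv=None):
    """CLI --since/--until override values from the config file, so a
    cron entry can pass --since "$(date --iso-8601=seconds)" while the
    rest of the job configuration stays in the file."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--config_file', required=True)
    parser.add_argument('--since')
    parser.add_argument('--until')
    args = parser.parse_args(argv)

    config = configparser.ConfigParser()
    config.read(args.config_file)
    options = dict(config['job'])  # assumption: options live in a [job] section

    # CLI values win when present
    if args.since:
        options['since'] = args.since
    if args.until:
        options['until'] = args.until
    return options
```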
[16:44:00] but a big ol' comment there in puppet about why you are doing that :)
[16:44:06] yeaaaaa
[16:44:21] hm, so the since and until will be actually full ISO dates
[16:44:47] generated at cronjob exec time
[16:45:16] yea, this is hacky but somehow robust
[16:45:20] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10RobH) p:05Triage>03Normal
[16:45:28] ottomata: --^
[16:46:13] cooo
[16:46:47] (03CR) 10Milimetric: "@nuria the proxies are configured per domain name, so while /compare and /multimedia-health are removed, we still have other dashboards li" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/467557 (https://phabricator.wikimedia.org/T199340) (owner: 10Milimetric)
[16:46:51] elukey: i was about to make the task for 5 ca-worker nodes
[16:46:56] @nuria: no working while sick :)
[16:47:02] should I talk to chris and let him make it?
[16:47:08] or just copy that format there?
[16:47:29] ottomata: can you jump in #wikimedia-dcops and ask rob? So we can chat in there all together :)
[16:49:22] @nuria: I can do the presentations if you need
[16:55:03] https://etherpad.wikimedia.org/p/Analytics-Reseach
[16:55:10] 2018-07-17
[16:55:10] Attendees
[16:55:10] Baha
[16:55:10] Discussed
[16:55:11] * Purpose of life
[16:55:16] lol
[16:55:33] * elukey wants Baha for president
[16:56:23] haha
[17:01:29] (03PS1) 10Fdans: Fix important vulnerabilities on Wikistats 2 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/467730 (https://phabricator.wikimedia.org/T206474)
[17:04:29] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) p:05Triage>03Normal
[17:05:19] joal: you coming to research hangout?
[17:11:52] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) a:05Cmjohnson>03Ottomata So, we are not sure about what vlan these will be going into. This could affect what row they go into. @Ottomata: Can y...
[17:12:44] (03Abandoned) 10Fdans: Get rid of critical vulnerabilities in the aqs project [analytics/aqs] - 10https://gerrit.wikimedia.org/r/467398 (https://phabricator.wikimedia.org/T206474) (owner: 10Fdans)
[17:17:40] (03PS1) 10Fdans: Upgrade packages and commit package-lock to remove vulnerabilities [analytics/aqs] - 10https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474)
[17:30:45] 10Analytics, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics, 10Patch-For-Review: Decommision edit analysis dashboard - https://phabricator.wikimedia.org/T199340 (10Neil_P._Quinn_WMF) >>! In T199340#4670813, @Milimetric wrote: > Ok, the dashboards are offline, data generation is stopped,...
[17:45:09] 10Analytics, 10Analytics-Wikistats: Audit and address performance issues - https://phabricator.wikimedia.org/T207197 (10Milimetric) p:05Triage>03High
[17:50:35] elukey, have you seen my question about python version on stat1004?
[18:02:54] nope!
[18:03:11] ottomata: camus gives me
[18:03:11] 18/10/16 17:20:08 ERROR kafka.CamusJob: failed to create decoder
[18:03:11] com.linkedin.camus.coders.MessageDecoderException: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
[18:03:35] because I didn't specify the json decoder.. so I guess I have to anyway, but it will fall back by itself to system time for bucketing?
[18:03:40] dsaez: what was the question?
[18:04:16] AHHHH SORRY ELUKEY
[18:04:19] you do need a decoder
[18:04:20] not the json one
[18:04:22] just the string one
[18:04:48] ah snap I thought there wasn't one
[18:05:01] looking up proper configs...
[18:05:22] dsaez: in the meantime, your home dir on stat1004 now has "notebook1003" :)
[18:07:33] ottomata: because in https://github.com/linkedin/camus/tree/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders I didn't see one that would fit
[18:08:03] ah but maybe JsonStringMessageDecoder works fine
[18:08:09] If the JSON does not have
[18:08:09] * a timestamp or if the timestamp could not be parsed properly, then
[18:08:12] * System.currentTimeMillis() will be used.
[18:08:25] is this the one that you were referring to ottomata --^ ?
[18:09:22] hmmmmmmMMMM
[18:09:31] ok i am confused am trying to remember
[18:09:41] but elukey sorry i have to run out for a
[18:09:42] bit
[18:09:47] will be back in 30ish minutes!
[18:09:50] maybe 40!
[18:10:03] ack! Will try that one
[18:13:32] (03PS14) 10Joal: Add python script importing xml dumps onto hdfs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/456654 (https://phabricator.wikimedia.org/T202489)
[18:13:57] (03PS4) 10Joal: Add mediawiki-history-wikitext oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/463548 (https://phabricator.wikimedia.org/T202490)
[18:20:38] elukey, the python version on stat1004 it's 3.5, and on stat1005 it's 3.6
[18:29:19] dsaez: python3 seems 3.5 on both nodes no?
[18:29:44] same on notebook1003
[18:30:51] elukey, you are right, I'm wrong. I'm calling my virtualenv 3.6, but it's 3.5.3 indeed
[18:31:05] ahhh okok
[18:41:33] ottomata: it seems working!
[18:41:40] will re-check later :)
[18:41:55] dsaez: going offline but feel free to write me! (email or in here)
[18:42:04] * elukey off!
[18:42:15] Bye elukey
[18:42:22] elukey, cool, thx
[18:51:29] (03PS5) 10Joal: Add mediawiki-history-wikitext oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/463548 (https://phabricator.wikimedia.org/T202490)
[18:53:48] 10Analytics, 10Analytics-Kanban, 10Release-Engineering-Team: How to remove outdated and not used repo? - https://phabricator.wikimedia.org/T207204 (10Nuria)
[18:59:01] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) They need to be reachable by the Analytics VLAN, so I would normally propose that one. Since this is a special case, maybe it makes more sense to...
[18:59:35] elukey: great!
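(Editorial note: the JsonStringMessageDecoder behaviour quoted above — use the event's own timestamp when it parses, otherwise fall back to System.currentTimeMillis() — reduces to logic like the following, sketched in Python for brevity; the real decoder is Java in camus-kafka-coders, and the field name here is an assumption:)

```python
import json
import time


def bucketing_timestamp_ms(message_bytes, timestamp_field='timestamp'):
    """Pick the Camus-style bucketing timestamp (in milliseconds).

    Use the event's own timestamp when the message is JSON and the field
    parses; otherwise fall back to current system time — which is why
    elukey's string-decoded events still get bucketed somewhere.
    """
    try:
        event = json.loads(message_bytes)
        return int(event[timestamp_field])
    except (ValueError, KeyError, TypeError):
        return int(time.time() * 1000)
```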
[19:10:59] 10Analytics, 10Analytics-Kanban: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster - https://phabricator.wikimedia.org/T207205 (10Ottomata) p:05Triage>03High
[19:11:27] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster - https://phabricator.wikimedia.org/T207205 (10Ottomata)
[19:16:10] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10CCicalese_WMF)
[19:32:26] 10Analytics: [EL2Druid] Make RefineTarget compatible with Druid and use it from EventLoggingToDruid - https://phabricator.wikimedia.org/T207207 (10mforns)
[19:34:19] 10Analytics: API endpoint for mediacounts - https://phabricator.wikimedia.org/T207208 (10Nuria)
[21:54:25] 10Analytics, 10Analytics-Kanban, 10GitHub-Mirrors, 10Release-Engineering-Team, 10Repository-Admins: How to remove outdated and not used repo? - https://phabricator.wikimedia.org/T207204 (10greg)
[22:02:39] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics: Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10MMiller_WMF)
[22:44:41] I have a Q about EventLogging deployment/testing. Is testing of that only available for testwiki, or is it somehow also possible on betalabs? I've found testwiki data in the Data Lake, but not sure where betalabs data would go.
[22:54:59] Nettrom: there's a separate mysql instance in the beta cluster that EventLogging writes to
[22:55:04] lemme see if I can find the docs
[22:55:15] Nettrom: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
[22:56:11] (I hate to plug google, but it's so good. I just google "wikitech Whatever I want to remember about analytics" and if it exists it's the top result)
[22:56:14] milimetric: ah, that seems to document what I need to know. wonderful, thanks so much! :)
[22:56:44] and you are right, I should've googled for this! adding the wikitech keyword (or searching wikitech specifically) is a neat way to do it
[22:57:46] you're always welcome to ask, I didn't mean to imply you did it wrong :) But it does do an awesome job of indexing our stuff
[22:59:18] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Growth-Team, 10Product-Analytics: Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10MMiller_WMF)
[22:59:21] exactly! and when I know the keywords I'm looking for, as in this case, I should try that first :)
[23:00:10] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Growth-Team, 10Product-Analytics: Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10MMiller_WMF) Growth team is going to pursue this, since we need the data in Hadoop for our "[[ https://phabricat...