[00:16:12] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (later): Enable EventBus on all wikis - https://phabricator.wikimedia.org/T185170#3908258 (10Pchelolo) [00:22:52] nuria_: i'd use spark2-shell now whenever possible [01:37:05] 10Analytics-Tech-community-metrics, 10Possible-Tech-Projects, 10ECT-June-2015, 10ECT-March-2015, and 2 others: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1215552 (10srishakatux) @jgbarah @Dicortazar @Acs Would there be any interest in... [07:11:21] !log re-run webrequest-load-wf-misc-2018-1-18-3 via Hue [07:11:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:42:38] Hi elukey - I have some time now - Is there anything urgent you wish me to help with? [08:44:23] hi joal! How are things!?? [08:44:56] elukey: better! Mélissa is back seeing and Naé doesn't vomit anymore :) [08:44:57] all good, yesterday druid1002's reboot caused a bit of weirdness for Druid, had to restart overlords/middlemanagers to get things back in shape [08:45:04] nice! :) [08:45:27] elukey: even with the procedure we have? [08:46:45] yep, and this time also the "batch" indexing jobs were acting weirdly [08:46:53] maaaan [08:46:58] elukey: :( [08:47:16] elukey: and from my reading of the chan, it seemed the reboot worked in the end, right? [08:47:23] I think that this version of druid is not tolerant of zookeeper/overlord changes at the same time [08:47:37] need to do druid1001, the only one left :D [08:47:50] if you have time we can do it now [08:48:06] then the reboots will be completed [08:48:15] elukey: Let's go ! I'll feel a bit useful ;) [08:50:39] all right! [08:51:00] elukey, thanks for: re-run webrequest-load-wf-misc-2018-1-18-3 via Hue [08:51:16] so the druid overlord commander is druid1002 [08:51:17] :) [08:51:37] I didn't get exactly why it failed though, I checked briefly but it was before coffee :D [08:51:59] 0001008-180117090906503-oozie-oozi-W was the id [08:52:45] !log disable druid1001's middlemanager as prep step for reboot [08:52:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:53:17] !log temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted) [08:53:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:54:41] I imagine that superset also uses druid1001 only [08:54:57] elukey: I think so yes [08:58:23] !log temporarily set druid1002 in superset's druid cluster config (via UI) [08:58:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:00:24] !log suspended hourly druid batch index jobs via Hue [09:00:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:02:23] all right going to wait for druid1001's middlemanager to completely drain [09:02:43] elukey: about 1 hour of waiting - hourly jobs just started :( [09:03:01] joal: analytics1003's reboot went super smooth yesterday, one good thing [09:03:13] awesome :) [09:03:20] argh so I haven't stopped them in time? [09:03:49] elukey: there is no "in time" thing - You stopped them just after the hour instead of just before :) [09:04:15] elukey: I'm talking about the realtime jobs [09:05:11] Oh elukey !
Actually, new jobs are indexing on druid100[23] only - We should be good to go in the next 10 minutes :) [09:05:26] excuse me elukey - Didn't look precisely enough :) [09:07:48] ahhh okok [09:12:58] joal: another thing that I wanted to ask - yesterday I tried to kill the banner streaming job to get it restarted, in an attempt to figure out why real time indexers were not running.. All was good except the fact that indexers were spawned only after the hour, not before. Now I am *sure* that you already told me why 100 times, would you be so kind to redo it for the 101st? [09:13:41] huhuhu elukey :) [09:14:02] elukey: this problem is due to tranquility realtime-tasks naming scheme [09:14:46] elukey: in order to be able to reconnect to existing tasks, it creates them with a naming convention using dates [09:15:34] elukey: However when those tasks get killed (druid master overlord restart, or manual kill), tranquility tries to re-instantiate them with the same name, and druid doesn't allow it [09:17:26] elukey: does it make more sense? [09:18:05] it does, yes [09:18:07] <3 [09:18:52] joal: do you think that I can do the netflow scala task ? [09:19:17] elukey: I think you can - only issue is that data is not present anymore :( [09:19:26] it probably takes 30 mins for you and 3 days for me but I'd learn a lot :D [09:19:36] it is in jumbo! [09:20:00] I restarted the pmacct daemon a while ago [09:20:13] so afaik data is pushed to jumbo as we speak [09:20:39] in the meantime, druid1001 is ready to be rebooted [09:21:19] !log reboot druid1001 for kernel upgrades [09:21:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:25:08] not seeing any data in kafkacat -C -b kafka-jumbo1006.eqiad.wmnet -t netflow -p0 [09:25:11] mmmmm [09:25:28] maybe I need to ask Faidon what is the status of pmacct, he is working on it these days [09:25:31] elukey: I was about to say that I don't see any netflow topic in jumbo from grafana [09:25:42] ah no just seen some events [09:25:55] {"tag": 0, "as_dst": 0, "as_path": "", "peer_as_dst": 0, "stamp_inserted": "2018-01-18 09:09:00", "stamp_updated": "2018-01-18 09:25:02", "packets": 51000, "bytes": 74460000} [09:26:18] now I have no idea how to read the data [09:26:18] elukey: weird ! No topic available in grafana :( [09:27:21] I suspect that kafka by-topic metrics have not been ported to prometheus [09:27:38] Ah! makes sense [09:27:41] ok :) [09:27:58] then, if we haz dataz, you canz go for netflow :) [09:33:05] \o/ [09:33:21] druid1001 is up with the new kernel [09:33:27] awesome elukey :) [09:37:18] !log resumed druid hourly index jobs via hue and restored pivot's configuration [09:37:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:37:31] now I need to reboot thorium [09:38:43] !log reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites [09:38:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:39:52] rebooting in a minute, all websites down now [09:43:00] all good! [09:43:18] joal: do you want to know something funny? [09:43:43] there is going to be soon another jvm upgrade to do :P [09:43:45] elukey: tell me !
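To make joal's tranquility explanation above concrete, here is a purely illustrative Scala sketch of a date-based task-naming scheme. This is not Tranquility's actual source; the name format is an assumption modeled on the index_realtime_* task ids visible in the Druid console, and the datasource name is hypothetical.

```scala
import java.time.format.DateTimeFormatter
import java.time.{ZoneOffset, ZonedDateTime}

// Deterministic names let tranquility reconnect to a running task after a
// restart, but they also mean a killed task's replacement for the same hour
// maps to the id Druid already knows, so the overlord rejects it.
def realtimeTaskName(dataSource: String, intervalStart: ZonedDateTime, partition: Int): String = {
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssX").withZone(ZoneOffset.UTC)
  s"index_realtime_${dataSource}_${fmt.format(intervalStart)}_${partition}_0"
}

val hour = ZonedDateTime.of(2018, 1, 18, 9, 0, 0, 0, ZoneOffset.UTC)
realtimeTaskName("banner_activity_minutely", hour, 0)
// Both the original task and its post-kill replacement yield
// "index_realtime_banner_activity_minutely_2018-01-18T09:00:00Z_0_0",
// which is why fresh indexers only appear once the next hour starts.
```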
[09:44:00] Bwaaahhhhahhhahah ;( [09:44:45] but since Moritz is merciful he told us to couple it with the java 8 upgrade [09:45:05] so in theory we'll do everything in one go [09:45:22] great elukey [09:45:34] * joal bows to moritzm's mercy [09:45:46] elukey: about j8 tests - anything new on the labs cluster? [09:46:16] I was about to say that, we can definitely work on it before SF and make everything work [09:46:31] Andrew agrees with us that doing it before would be too aggressive :D [09:46:54] so I propose to work on it (depending on your work schedule in these days) up to Tue [09:47:04] and then write down a procedure for the prod cluster [09:47:31] elukey: works for me [09:48:25] elukey: I'll be on-and-off today, as Naé is still home, but should be full-time tomorrow, can make time this weekend and full-time monday (with late start) [09:52:52] joal: ack, I'll try to gently help the labs cluster understand that it needs to work [09:53:31] if you want to laugh a bit https://phabricator.wikimedia.org/T184794 [09:53:45] elukey: Thing is - I'm not sure what we need - first is to make sure spark1 and 2 work with Hive, ok [09:54:00] elukey: Then, testing upgrade? [09:55:05] joal: I am going to set via Puppet the JAVA_HOME to openjdk 8 (at the moment it is only set manually, with puppet disabled) [09:55:16] but we should already be good with upgrade testing no? [09:55:25] I mean, everything now has been working with java 8 [09:55:37] elukey: should we try to have spark2 working with hive with j7, and test upgrade again? [09:55:50] elukey: or let's just fix it with j8, works for me as well [09:58:51] elukey: just read the oozie/hive prometheus thing - Man there's something wrong in there [09:58:52] 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10MoritzMuehlenhoff) This is solely for T174110 or are we anticipating other use cases? The high level implementation idea seem... [10:04:14] joal: I wanted your opinion to share my inner pain, I felt really bad while checking those bash scripts [10:04:21] there is something horrible in those [10:04:37] now I understand why those packages do not have any systemd unit [10:04:45] elukey: I'm actually fairly sure there are MANY things horrible in those [10:04:50] hhahahaah [10:05:02] the symlink was really weird [10:06:21] I also have no idea who is the owner of the debian packaging [10:06:32] I guess cloudera ? [10:08:15] elukey: hm, I have no clue [10:08:38] elukey: if so, this might actually be a good argument to try hortonworks ;) [10:08:43] :D [10:08:54] * joal runs and hides [10:16:09] if they have proper systemd units I'd be happy to test it :D [10:16:15] I mean, it would be a really nice pro [10:19:05] 10Analytics-Kanban, 10Operations, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3908967 (10elukey) @faidon whenever you have time do you mind to explain a bit what data is currently pushed to the netflow...
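On the netflow task just filed above: elukey's kafkacat check earlier showed pmacct emitting flat JSON records, and "now I have no idea how to read the data". As a hedged starting point - not the eventual T181036 job - here is a Spark Structured Streaming sketch that reads the topic and parses only the fields visible in that one sample message. It assumes the spark-sql-kafka package is on the classpath, the standard broker port, and that the real feed may carry more fields than the sample shows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("netflow-peek").getOrCreate()

// Only the fields visible in the sample record from the log.
val netflowSchema = StructType(Seq(
  StructField("tag", LongType),
  StructField("as_dst", LongType),
  StructField("as_path", StringType),
  StructField("peer_as_dst", LongType),
  StructField("stamp_inserted", StringType),
  StructField("stamp_updated", StringType),
  StructField("packets", LongType),
  StructField("bytes", LongType)
))

val netflow = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-jumbo1006.eqiad.wmnet:9092")
  .option("subscribe", "netflow")
  .load()
  .select(from_json(col("value").cast("string"), netflowSchema).as("event"))
  .select("event.*")

// Dump parsed records to the console while eyeballing the feed.
netflow.writeStream.format("console").start().awaitTermination()
```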
[10:22:00] elukey: quickly scanning https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ch_getting_ready_chapter.html [10:22:09] elukey: some things make me tick [10:25:48] elukey: And it seems .sh scripts are used to start/stop hadoop all along the doc [10:25:51] :( [10:26:29] I suspect that it is a mess that upstream needs to fix first [10:26:43] very much probable elukey [10:27:35] 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10akosiaris) I'll echo Moritz on this one. It does look like adding system users to the admin module adds some complexity and do... [10:32:07] 10Analytics-Kanban, 10RESTBase-API, 10Patch-For-Review, 10Services (done): Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3909087 (10mobrovac) The RESTBase side of things has been deployed. [10:42:31] going for an errand + early lunch, ttl! [10:42:34] * elukey afk! [10:43:03] (reachable via hangouts/phone of course if needed) [10:47:42] 10Analytics-Tech-community-metrics, 10Possible-Tech-Projects, 10ECT-June-2015, 10ECT-March-2015, and 2 others: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#3909164 (10Aklapper) Note: The task description is very outdated and links to tec... [11:23:40] 10Analytics, 10TCB-Team, 10Documentation: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111#3909240 (10Legoktm) Just use `README` or `docs/metrics.txt` or wherever you want? :) [12:40:47] !log set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot) [12:40:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:43:22] !log piwik on bohrium re-enabled [12:43:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:44:35] aaand bohrium also updated [12:44:43] last one is probably archiva [12:47:16] elukey: something just hit my mind: when updating the cluster to j8, it means we need to upgrade druid as well, no? [12:50:41] joal: not sure, can we do it later on or does it need to be done in parallel? [12:51:23] elukey: We failed to upgrade druid last time because of java version discrepancies - this is why I'm wondering [12:51:47] yep I remember, new druid requires java8 [12:52:11] My question would then be: if the cluster is java8, would druid under j7 work with it? [12:53:11] do you remember how druid was failing last time? I mean, what specifically was failing [12:54:38] https://phabricator.wikimedia.org/T164008 [12:54:53] "Druid 0.10 requires Java 8, which is fine. But the Analytics Hadoop cluster runs Java 7, and Hadoop indexing tasks were failing. See: https://groups.google.com/forum/#!topic/druid-user/aTGQlnF1KLk" [12:57:44] so it is possible that indexing jobs may fail, not sure what could happen [12:57:53] but afaik we'd be ready tomorrow to upgrade druid [12:58:07] so we can think about coupling the two [12:59:56] but I'd prefer to keep them separate [13:00:21] I can see two issues: [13:00:30] sounds good to me that way elukey :) [13:00:56] 1) HDFS read/write impaired on druid due to java mismatch -> big issue since it would affect wikistats [13:01:27] 2) only index jobs failing -> not a big problem for wikistats, but some problems for pivot/superset/etc..
[13:04:52] elukey: very much agreed [13:05:07] elukey: I wonder if it would be possible to test issue 1 in labs [13:05:42] pretty sure we can, I am currently working on the last settings for hadoop in labs, after that we can spin up a druid cluster [13:05:49] it should be easy (famous last words) [13:06:12] elukey: those are my TM :) [13:48:26] joal: now yarn nodemanagers in labs should be healthy [13:54:26] and also I deployed the new puppet changes for JAVA_HOME set via hiera, all good [14:10:12] joal: also configured the coordinator not to run the crons for the moment [14:10:21] without the need to disable puppet [14:10:32] now if we need stuff I'll add it via horizon ui [14:10:42] like camus etc.. [14:17:13] spark2 still acting a bit weirdly [14:30:57] joal: whenever you can shall we reload per country data? [14:32:15] 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909685 (10elukey) Fixed all except j1.analytics.eqiad.wmflabs - @Ottomata do we still need this? It seems running superset, and puppet is broken in there.. [14:34:12] (also joal I wanted to ask your opinion about minor inconsistencies in rank order between wikistats and aqs by country) [14:37:54] fdans, joal: I will deploy refinery now [14:38:14] mforns: can I shadow you in the cave? [14:38:34] fdans, sure, gimme 5 mins to change rooms [14:39:18] cool! [14:43:41] fdans, omw to bc [14:48:02] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Explain decrease in number of patchset authors for same time span when accessed 3 months later - https://phabricator.wikimedia.org/T184427#3909716 (10Aklapper) p:05High>03Normal I saved the CSV data for Dec2017 and Oct-Dec2017 locall... [14:57:06] milimetric, this change https://gerrit.wikimedia.org/r/#/c/403946/ can be deployed as is? or needs some manual table maintenance? [14:58:34] mforns: yes, can be deployed as-is, it's just updating the table definition, I already dropped and re-created the table [14:58:43] milimetric, ok! [14:58:45] so this is more just for reference and in the future if we drop the table again [14:59:02] thx for checking, sorry I wasn't more clear in the comment, we should start adding a Deployment: section [14:59:57] fdans: makes sense about the aggregates on the top pages. I was thinking like maybe averaging out the buckets for per-country, adding up the view_counts for top articles, but that gets too complicated, it's fine to leave it out for now [15:02:24] milimetric: maybe we need yet another property at the config detailing whether the metric is bucketed or not [15:02:27] Hey elukey / fdans - Was away, will be again [15:02:46] we're in the batcave deploying refinery joal :) [15:02:55] fdans: great :) [15:03:27] fdans: Will restart the jobs later tonight if ok [15:03:58] yess great joal :) [15:04:08] cool :) [15:04:10] Gone again ! [15:04:18] elukey, joal, the repo in deployment.eqiad.wmnet:/srv/deployment/analytics/refinery/scap has uncommitted modifications, do you know something about that?
[15:04:22] elukey: Will also investigate spark2 this evening [15:04:34] mforns: aouch - no idea [15:04:38] ok [15:06:40] elukey, joal, solved :] [15:07:29] !log starting refinery deployment [15:07:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:08:52] mforns: just seen the ping sorry [15:09:13] np :] [15:21:09] !log refinery deployment using scap and then deploying onto hdfs finished [15:21:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:41:16] 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909876 (10Ottomata) j1 deleted! [15:42:05] (03CR) 10Milimetric: Map component and Pageviews by Country metric (037 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans) [15:42:37] 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909884 (10elukey) 05Open>03Resolved Closing the task since puppet should be ok now, please re-open otherwise! [15:44:23] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3909891 (10Jgreen) >>! In T185136#3907424, @Ottomata wrote: > @Jgreen FYI, we'll need to coordinate this soon :) No problem. I tried a little puppetspelunking... [15:45:11] ottomata: hiiiiiiiii [15:45:38] whenever you have 10 mins can we check why in labs spark2-shell behaves weirdly? [15:46:01] it may be a config that needs to be applied somewhere that I am not aware of [15:46:18] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3909912 (10Ottomata) > I'm guessing we're talking about a new pool of kafka hosts Yup! Mostly just changing settings and bouncing the kafkatee instances, but... [15:46:25] if we try to use the hive context it doesn't work (tested with show databases, spark-shell works) [15:46:38] hiii [15:46:49] hmm [15:50:41] elukey: how are you testing? [15:50:43] I'm trying [15:51:27] spark.sql("SHOW DATABASES").collect() [15:51:34] but i don't get any results [15:51:36] it works though [15:52:01] hmm [15:52:42] OK [15:52:47] hive-site.xml is supposed to be symlinked [15:52:50] in /etc/spark2/conf [15:52:51] hmmm [15:52:57] i'd think that was puppetized... [15:52:58] looking [15:54:28] ah elukey on install, the .deb package does the symlink [15:54:32] if hive-site.xml exists [15:54:35] maybe there was a race condition [15:54:59] I've added the symlink [15:57:31] (03PS14) 10Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) [15:57:48] milimetric: pong :) [15:57:59] (03CR) 10Fdans: Map component and Pageviews by Country metric (035 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans) [15:59:10] looking :) [16:01:50] ottomata: sorry I was doing another thing, checking now! [16:02:08] it worked for me but no result was returned [16:03:03] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3909987 (10Jgreen) >>!
In T185136#3909912, @Ottomata wrote: >> I'm guessing we're talking about a new pool of kafka hosts > Yup! Mostly just changing settings... [16:03:59] ottomata: workssssss \o/ [16:04:00] you rock [16:04:14] (03CR) 10Milimetric: [C: 032] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans) [16:04:27] awyisss [16:04:28] so /etc/spark2/conf has the same trick as /etc/hadoop/conf etc.. [16:04:43] joal: spark2 works with java 8 :) [16:05:02] yessssssssssssssssssssssss [16:05:21] :) nice fdans, you're definitely meant for greatness, I'm just meant for laziness :) [16:05:45] milimetric: haha I didn't intend to merge just yet though [16:06:04] I still gotta uncomment those couple lines once the iso codes are live on the endpoint [16:06:11] ahh no hive-site.xml -> /etc/hive/conf.analytics-hadoop-labs/hive-site.xml [16:06:19] so only hive-site is symlinked [16:06:23] okok got it now [16:06:41] fdans: oh, sorry, just self-merge follow-ups [16:06:51] maybe in puppet we need to add the File to the requires of spark2 pkg [16:06:58] yup, and because the deb package does it on install, if the hive/conf/hive-site.xml file doesn't exist (yet), it won't symlink it [16:07:08] super [16:07:53] ottomata: no idea why spark2 picks up java8 straight away without any config, buuut everything works [16:08:06] I applied in labs the JAVA_HOME hiera change to force it to java8 [16:08:15] all good as far as I can see [16:08:55] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Provision new Kafka cluster(s) with security features - https://phabricator.wikimedia.org/T152015#3910001 (10Ottomata) [16:09:01] 10Analytics, 10Beta-Cluster-Infrastructure, 10Puppet: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3909999 (10Ottomata) 05Open>03Resolved deployment-kafka03 has been deleted. [16:09:08] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure: deployment-kafka01 - disk is full - https://phabricator.wikimedia.org/T174742#3910003 (10Ottomata) 05Open>03Resolved a:03Ottomata deployment-kafka01 has been deleted. [16:15:39] 10Analytics, 10Analytics-Cluster: Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#3910019 (10Ottomata) p:05Triage>03Normal [16:15:42] elukey: it has to [16:15:50] spark2 doesn't work with java 7 [16:16:10] this is a good explanation :D [16:17:34] ottomata: I think that require_package('spark2') in profile::hadoop::worker comes before include ::profile::hive::client [16:17:41] so the symlink doesn't get installed [16:17:59] aye, maybe we can add a dependency [16:18:03] or manually ensure the symlink exists [16:18:15] we'll probably need to make a spark2 class or profile or something if we do either [16:18:24] eventually [16:22:37] ottomata: another question that we might need to test - do we need to upgrade druid as well? [16:22:55] or maybe move it to 0.10 [16:27:12] elukey: we should probably test a loading job if we can? but i'm fairly sure it will work with hadoop running java 8 [16:27:23] if we can upgrade druid to java 8 at the same time or soon, that would be good [16:27:32] we should do the actual druid upgrade as a separate piece [16:28:08] yep I'd really like to keep things separate.. [16:28:25] so you are suggesting to create a small cluster in labs and then test a load/indexing job? [16:32:28] elukey: maybe? hm. [16:32:47] don't have one right now, eh?
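The SHOW DATABASES check above is the right smoke test for the hive-site.xml symlink; a slightly fuller spark2-shell version is sketched below. Hedged: "wmf" and "webrequest" are the production names and the labs cluster may differ, but the key point stands - with hive-site.xml missing, Spark typically falls back to its built-in catalog and reports only "default", so inspecting the database list is more telling than waiting for an error.

```scala
// Runnable in spark2-shell, where `spark` is predefined.
val dbs = spark.sql("SHOW DATABASES").collect().map(_.getString(0))
println(dbs.mkString(", "))
assert(dbs.contains("wmf"), "Hive metastore not wired up - check the hive-site.xml symlink in /etc/spark2/conf")

// And a real metastore round-trip (table name is just an example):
spark.sql("DESCRIBE wmf.webrequest").show(5, truncate = false)
```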
[16:32:53] elukey: a single node cluster would be fine...i'll help [16:32:55] i'll make one now :) [16:33:12] if you have time sure, otherwise I can do it tomorrow :) [16:33:53] i'll at least spawn up the cluster now [16:34:02] then maybe you and/or jo al can test a loading job tomorrow [16:44:42] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3910130 (10Ottomata) > first step on the frack side is to whitelist the new hosts at the firewalls, can you point me to the list and I'll add a phabricator tas... [16:46:59] 10Analytics-Kanban, 10Discovery-Analysis, 10MobileApp, 10Wikipedia-Android-App-Backlog: Bug behavior of QTree[Long] for quantileBounds - https://phabricator.wikimedia.org/T184768#3910132 (10Nuria) a:03Nuria [16:52:05] heya ottomata and elukey - Just here for a minute - Awesome catch on hive for spark2 :) [16:52:27] I'll try a loading job later tonight with the new 1-node cluster [16:59:45] 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3910160 (10Nuria) [16:59:48] 10Analytics-Kanban, 10DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3910159 (10Nuria) 05Open>03Resolved [16:59:57] 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3910162 (10Nuria) [16:59:59] 10Analytics-Kanban: Read the python code and design the Hadoop version - https://phabricator.wikimedia.org/T182944#3910161 (10Nuria) 05Open>03Resolved [17:00:14] 10Analytics-Kanban, 10Patch-For-Review: Make superset more scalable - https://phabricator.wikimedia.org/T182688#3910163 (10Nuria) 05Open>03Resolved [17:00:26] ping fdans [17:00:45] ping joal [17:02:04] 10Analytics-EventLogging, 10Analytics-Kanban: {tick} Schema Audit - https://phabricator.wikimedia.org/T102224#3910168 (10elukey) [17:02:06] 10Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#3910167 (10elukey) [17:02:09] 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3910166 (10elukey) 05Open>03Resolved [17:02:17] 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618887 (10elukey) [17:16:07] (03PS10) 10Mforns: Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) [17:23:06] 10Analytics: Investigate why data was missing - https://phabricator.wikimedia.org/T185229#3910241 (10Milimetric) [17:30:54] 10Analytics: Investigate why data was missing from mediawiki events around January 3rd - https://phabricator.wikimedia.org/T185229#3910283 (10Nuria) [17:35:44] ping ottomata [17:35:48] groskinnn [17:36:31] 10Analytics-Kanban: Investigate why data was missing from mediawiki events around January 3rd - https://phabricator.wikimedia.org/T185229#3910305 (10fdans) [17:36:57] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Move EventStreams to new jumbo cluster. 
- https://phabricator.wikimedia.org/T185225#3910306 (10fdans) [17:40:52] 10Analytics-Kanban: Eventlogging of the Future - https://phabricator.wikimedia.org/T185233#3910345 (10fdans) [17:41:38] 10Analytics-Kanban: Eventlogging of the Future - https://phabricator.wikimedia.org/T185233#3910355 (10fdans) [17:41:41] 10Analytics, 10TCB-Team, 10Documentation: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111#3910356 (10fdans) [17:57:09] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3910415 (10fdans) p:05Normal>03High [18:11:31] 10Analytics-EventLogging, 10Analytics-Kanban: Lookout for duplicates in EL refine - https://phabricator.wikimedia.org/T185237#3910470 (10Nuria) p:05Triage>03Normal [18:11:42] 10Analytics, 10Analytics-EventLogging: Lookout for duplicates in EL refine - https://phabricator.wikimedia.org/T185237#3910470 (10Nuria) [18:15:48] 10Analytics, 10Analytics-EventLogging: Implement purging scheme for eventlogging data on top of eventlogging refine - https://phabricator.wikimedia.org/T176426#3910499 (10Nuria) One idea is to refine incoming data so we have, say, popups popups_refined according to whitelist and those two tables are fille... [18:19:59] 10Analytics, 10Analytics-EventLogging: Lookout for duplicates in EL refine - https://phabricator.wikimedia.org/T185237#3910511 (10Ottomata) a:03Ottomata [18:22:29] btw, nuria_, milimetric, i forget, we have a page draft with eventlogging + druid schema guidelines, right? [18:22:35] i'm writing a reply to this page previews email [18:22:39] ottomata: yes [18:22:40] we do [18:22:44] ottomata: one sec [18:22:52] oh i found [18:22:53] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines [18:22:57] i shoulda just wikitech searched [18:22:58] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines [18:22:58] thanks nuria_ [18:23:00] beat me to it [18:23:03] yeah, google has everything [18:31:02] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Services (watching), 10User-Elukey: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3910556 (10fdans) [18:32:30] 10Analytics: Configure superset to query mysql slaves - https://phabricator.wikimedia.org/T167427#3910559 (10fdans) 05Open>03declined [18:32:33] 10Analytics-Kanban, 10Patch-For-Review: Productionize Superset - https://phabricator.wikimedia.org/T166689#3910560 (10fdans) [18:33:19] 10Analytics-Kanban, 10Operations, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910563 (10fdans) [18:37:35] 10Analytics, 10Operations, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910569 (10fdans) [18:41:08] 10Analytics-Kanban, 10Operations, 10User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3910612 (10fdans) [18:44:00] oh, nuria_ / fdans: I have to also disable the per-country metric, right? [18:44:01] joal: when you're back let me know if you're going to reload tonight :) [18:44:11] fdans: Here I am :) [18:44:16] milimetric: if you're going to deploy yes [18:44:20] fdans: Do you wish me to restart the jobs? [18:44:42] joal: that includes backfilling right? 
[18:44:44] fdans: another question: should we truncate first? [18:44:47] fdans: correct [18:44:58] joal: yeah I think we agreed to truncate [18:45:02] fdans: I'll actually restart the jobs with a start-end at the beginning of backfilling [18:45:38] fdans: ok, doing that: Truncate table then kill-restart loading jobs since beginning of existing data [18:46:13] you're the bestest! [18:47:07] I need to be afk for a couple hours but i'll be back later tonight [18:47:19] np fdans :) [18:47:32] ottomata: did you see all ksql_transient_.. topics in jumbo? [18:48:12] milimetric: Can you tell me the plan for metric-disabling? Is that in WKS2 front end? [18:50:27] joal: yeah, just changing the config: https://github.com/wikimedia/analytics-wikistats2/blob/master/src/config/metrics/reading.js#L27 [18:50:51] k milimetric - just triple checking [18:51:04] milimetric: Am I safe to truncate the table now or should I wait? [18:52:46] 10Analytics-Kanban, 10Operations, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3910670 (10elukey) Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the ne... [18:53:25] joal: totally safe, yea [18:53:29] * elukey off! [18:53:36] great - Thanks milimetric [18:53:44] Bye elukey ! [18:56:18] !log Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload [18:56:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:10:37] !log Add fake data to cassandra to silence alarms (Thanks again ema) [19:10:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:10:45] :) [19:11:35] !log Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05 [19:11:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:26:32] (03CR) 10Joal: [C: 031] "LGTM :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) (owner: 10Joal) [19:27:20] (03CR) 10Joal: [C: 031] "Good for me as well :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404662 (https://phabricator.wikimedia.org/T185100) (owner: 10Mforns) [19:29:13] Backfilling of top-by-country confirmed [19:35:37] ottomata: in case you missed it - Everything in a single place ;) https://docs.confluent.io/current/tutorials.html [19:47:41] joal cool! [19:56:49] joal: saw your comment about abstractdataclass, need to remove and will do. gerrit is working on/off today for me so i probably cannot take care of that today [19:56:59] joal: this code still needs to be tested on cluster though [19:57:08] nuria_: I can test if you wish [19:57:26] joal: either way, i can do it later on today if you have not gotten there yet, [19:57:41] nuria_: Was updatin [19:57:45] again ...
[19:58:08] nuria_: was updating druid loading to prepare for WKS2 data loading v2 [19:58:09] joal: gerrit is not loading but they will fix that eventually, saw your comments on e-mail [19:58:52] nuria_: maxmind code is better than it was for sure :) [19:58:58] nuria_: Thanks for that [20:00:14] joal: ok, i'm glad you like it [20:00:33] nuria_: I would have been lazy to get there though ;) [20:00:41] jaja [20:06:22] (03PS1) 10Joal: Add optional datasource to druid loading workflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053 [20:17:51] joal: btw, druid running on d-1 in analytics labs [20:18:05] ottomata: noted ! [20:18:08] if you have some mins to test a druid loading job from hadoop [20:18:09] ottomata: will test [20:18:17] gr8 [20:19:28] ottomata: can we relaunch camus for some time? No more data in cluster :*( [20:21:49] i think the kafka cluster is busted [20:21:50] looking [20:22:47] gonna try to wipe it.. [20:25:28] ottomata: I can also generate a bunch of fake json, but it would be interesting to do it with fake data that looks real :) [20:26:06] like my animals!? [20:30:29] :D [20:30:38] ottomata: real webrequest-style data [20:30:50] ottomata: I can also fake some if your cluster is gone [20:32:34] pssh i dunno what is up with this cluster, i'm totally wiping it, time for new! [20:35:07] ottomata: shall I generate some data, or do you prefer us testing your new cluster with camus as well? [20:35:43] oh joal, we kinda already tested that I think? i dunno, i think i'll set it up anyway... [20:36:03] 10Quarry: Quarry runs thousands times slower in last months - https://phabricator.wikimedia.org/T160188#3911039 (10zhuyifei1999) Could be related to the load increase of the servers as a result of the recent *.labsdb redirection to the analytics database servers. [20:36:37] ottomata: we tested indeed [20:36:58] since java 8 upgrade? [20:37:39] hm, I think so, ya? no? [20:37:57] ottomata: not since Luca pushed java8 with puppet [20:38:03] but with j8 enabled, yes [20:38:09] let's try it again :) [20:38:12] to be sure [20:38:16] ok' [20:46:03] (03PS1) 10Milimetric: Fix blocking problems with map component [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405061 [20:46:32] (03CR) 10Milimetric: [V: 032 C: 032] "fdans, take a look at this when you're back, just going to self-merge for now" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405061 (owner: 10Milimetric) [20:47:04] (03PS11) 10Milimetric: Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: 10Mforns) [20:53:43] (03CR) 10Milimetric: [C: 032] Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: 10Mforns) [20:54:06] allrighty joal [20:54:09] data rolling back in [20:54:14] camus running [20:54:42] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3911114 (10Jgreen) >>! In T185136#3910130, @Ottomata wrote: >> first step on the frack side is to whitelist the new hosts at the firewalls, can you point me to... [20:54:47] ottomata: awesome, many thanks ! [20:54:53] (03CR) 10Milimetric: [C: 032] "I'm going to merge this, we'll look at the chain thing together later. The code does look cleaner, but it was already good.
Another thin" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: 10Mforns) [20:55:17] ottomata: I think to prevent the issue we already had about disk being full, maybe we can stop it later on tonight? [20:56:02] ottomata: I wanted to link people to your thoughts on Stream Data Platform, or the best thing to read about it for the summit. I feel like that's the best project to describe what our team does and can do [20:56:29] (03CR) 10Milimetric: [V: 032 C: 032] Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: 10Mforns) [20:57:03] ottomata: this is the task where I'd link to it: https://phabricator.wikimedia.org/T183320 [20:58:18] 10Analytics-Cluster, 10Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3911119 (10Ottomata) > Another question, does kafka use a different port for TLS service? Yes, :9093. > As long as we can create certs for the two frack hosts... [20:59:22] joal: hmm, it's one req per second [20:59:22] but [20:59:27] which disk ? [20:59:29] kafka? [20:59:45] ottomata: it can wait tomorrow morning, but hadoop got full pretty fast (like a couple days) [20:59:53] ah ok [21:00:07] hm, well if we stop my curl loop, it'll stop [21:00:07] i can stop whenever you like [21:00:18] milimetric: sure! [21:00:26] we are still wordsmithing i think [21:00:27] but [21:00:28] https://wikitech.wikimedia.org/wiki/User:Ottomata/Stream_Data_Platform [21:00:40] this is kinda more than 'analytics' though [21:01:06] but, i'd love it if you disseminated it [21:01:55] (03PS1) 10Milimetric: Release 2.1.5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405065 [21:02:16] ottomata: yeah, it's more than Analytics, and that's exactly the point I'm trying to make [21:02:32] like, if we are to hit this extremely ambitious strategy by 2030, we need to start working together [21:02:44] and this is the perfect example for that [21:04:42] (03CR) 10Milimetric: [C: 032] Release 2.1.5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405065 (owner: 10Milimetric) [21:04:51] (03CR) 10Milimetric: [V: 032 C: 032] Release 2.1.5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405065 (owner: 10Milimetric) [21:06:45] oh nice [21:06:48] +1 milimetric [21:07:05] yeah man, put all ur knowledge events into stream data platform [21:07:15] then everyone can put them wherever they want!
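Earlier, joal offered to generate fake webrequest-style JSON if the test kafka cluster stayed empty. A hedged sketch of what such a generator could look like - the field subset below is illustrative, not the full webrequest schema, and the host/agent values are made up:

```scala
import java.time.LocalDateTime
import scala.util.Random

// Emit webrequest-flavoured JSON records for feeding a test topic.
def fakeWebrequest(i: Int): String = {
  val hosts  = Seq("en.wikipedia.org", "fr.wikipedia.org", "commons.wikimedia.org")
  val agents = Seq("Mozilla/5.0 (fake)", "WikipediaApp/fake")
  val ts = LocalDateTime.of(2018, 1, 18, Random.nextInt(24), Random.nextInt(60), Random.nextInt(60))
  s"""{"hostname":"cp-fake${i % 10}","sequence":$i,"dt":"$ts","uri_host":"${hosts(Random.nextInt(hosts.size))}","uri_path":"/wiki/Page_${Random.nextInt(1000)}","http_status":"200","response_size":${Random.nextInt(100000)},"user_agent":"${agents(Random.nextInt(agents.size))}"}"""
}

// Print a few records; the output could then be piped into e.g.
// kafkacat -P -t <test-topic> to populate the cluster.
(1 to 5).map(fakeWebrequest).foreach(println)
```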
[21:07:22] KNOWLEDGE AS A SERVICE [21:10:36] (03PS1) 10Milimetric: Release 2.1.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/405067 [21:10:43] (03CR) 10jerkins-bot: [V: 04-1] Release 2.1.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/405067 (owner: 10Milimetric) [21:11:37] (03Abandoned) 10Milimetric: Release 2.1.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/405067 (owner: 10Milimetric) [21:14:07] (03PS1) 10Milimetric: Release 2.1.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/405069 [21:15:10] (03CR) 10Milimetric: [V: 032 C: 032] Release 2.1.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/405069 (owner: 10Milimetric) [21:17:05] ottomata: that's even more ambitious than the 2030 strategy :) [21:20:40] :) [21:58:14] 10Analytics-Cluster, 10Analytics-Kanban: Add IPv6 addresses for kafka-jumbo hosts - https://phabricator.wikimedia.org/T185262#3911265 (10Ottomata) p:05Triage>03Normal [22:01:01] hey ottomata - had you stopped your loop, or camus? [22:06:54] have not [22:07:17] ottomata: I only have a single file in hdfs :( [22:07:24] looks like something is wrong [22:07:38] looking [22:09:18] ottomata: it's as if camus was thinking it is running, while there is no job [22:09:49] ottomata: Actually, camus IS running, but no yarn job [22:09:53] :( [22:10:47] weird joal [22:10:50] i just ran it manually [22:10:53] it seemed to work? [22:11:26] dunno why it was failing in cron [22:11:44] ottomata: Seems related to an existing stale process [22:16:31] 10Quarry: Quarry runs thousands times slower in last months - https://phabricator.wikimedia.org/T160188#3911330 (10IKhitron) Can't understand, but nevermind. Is this permanent? [22:17:19] oh there's a running yarn camus now? [22:17:45] not in yarn no, but as a java process [22:17:50] oh hm [22:17:51] weird ok [22:17:55] ok now though? [22:18:10] ottomata: very much [22:18:18] ottomata: you can kill your loop, I have data [22:18:21] ottomata: many thanks :) [22:18:33] ottomata: let's not fill disks with animals :) [22:19:00] ok, will kill loop :) [22:19:59] refine ongoing ottomata - Will test druid loading when finished [22:21:29] grrr8 [22:49:55] 10Quarry: Quarry runs thousands times slower in last months - https://phabricator.wikimedia.org/T160188#3911404 (10zhuyifei1999) > Is this permanent? Yes. [22:52:01] ottomata: not working - probably network issue [22:52:16] ottomata: oozie doesn't even manage to launch the indexation job [22:54:02] Arf, actually no, nevermind ottomata - sorry for the noise [22:56:58] ? oook:) [22:57:19] still not working, but looking for why now :) [22:58:04] ottomata: looks like the indexation failed on druid side [22:58:07] going over there :) [22:58:29] Ohhh ! I think I know :) [22:58:46] oh? [22:58:57] parameters in oozie templates [22:59:27] actually, no [22:59:44] ottomata: is yarn master configured correctly in d-1? [23:03:39] ottomata: in druid middlemanager logs: java.net.UnknownHostException: analytics-hadoop-labs [23:04:22] OH i think i know [23:04:24] luca had this problem... [23:04:25] hmmm [23:04:54] ottomata: I had it as well on cluster for oozie, but now seems solved - What magic have you done? [23:05:11] i did nothing! [23:06:21] it's working joal? [23:06:43] in oozie yes, but from druid seems no [23:07:41] it's weird ottomata, cause hdfs command works from d-1, but it seems that middlemanager (maybe peons?) can't find it by name [23:17:21] joal i'm not sure where it's getting that hostname from...
analytics-hadoop-labs [23:17:26] ... [23:17:31] that's not in oozie or something somewhere? [23:17:47] I passed it as hostname (it's defined as-is in hdfs-site.xml) [23:18:09] OHHH [23:18:10] OH [23:18:11] i think i know... [23:19:14] misconfiguration on my part...luca's hadoop configs only apply to nodes that start with hadoop* [23:19:19] so i had to copy paste some [23:19:29] could be ottomata :) [23:21:26] joal try now [23:21:29] sure [23:22:51] wait wah [23:22:53] sorry joal [23:22:58] it didn't change what I thought it did... what's wrong here.. [23:24:04] ottomata: At least druid launched an indexation [23:24:45] ottomata: seems to have worked (another issue, but indexation job launched) [23:27:32] it didn't? [23:28:18] ottomata: druid indexation succeeded :) [23:28:21] Awesome [23:28:35] ottomata: is d-1 running j7? [23:28:48] AH i know [23:28:50] yes [23:28:54] but the hostname is still not right [23:29:05] because of more copy paste problems... [23:29:06] on it [23:29:20] i should have named this node hadoop-druid1 [23:29:24] then everything would just work [23:29:30] ottomata: I have no clue why it succeeded then :) [23:32:43] ottomata: Got results from druid - This is a success :) [23:35:14] ottomata: Gone to bed for tonight [23:35:19] ottomata: Many thanks for your help :) [23:35:27] ottomata: See you tomorrow [23:39:36] great!
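On the UnknownHostException chased above: "analytics-hadoop-labs" is the HDFS HA nameservice name, not a resolvable host, so any JVM without the matching dfs.* properties in its hdfs-site.xml - like the druid middlemanager peons before ottomata's copy-paste fix - fails exactly this way. A hedged sketch for checking what a JVM on the node actually sees, using the standard Hadoop client API; the config paths are the usual defaults and may differ on the labs instance:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// Explicitly load the node's Hadoop client config (assumed locations).
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))

println(conf.get("fs.defaultFS"))                           // e.g. hdfs://analytics-hadoop-labs
println(conf.get("dfs.nameservices"))                       // should include analytics-hadoop-labs
println(conf.get("dfs.ha.namenodes.analytics-hadoop-labs")) // should list the namenode ids

// Without the dfs.* HA properties on its classpath, a client treats the
// nameservice name as a plain hostname, falls back to DNS, and throws
// java.net.UnknownHostException: analytics-hadoop-labs right here:
val fs = FileSystem.get(conf)
println(fs.exists(new Path("/")))
```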