[06:52:24] morning :)
[06:52:45] I am going to restart NameNodes and Resource Managers to pick up the new zookeeper settings
[06:57:43] aaand it should be done
[06:57:56] this time for HDFS I just used the following procedure:
[06:58:08] 1) failover NN from 1001 to 1002
[06:58:28] 2) restart the zkfc daemon on 1001 (since it is the only one talking with zk for HDFS afaik)
[06:58:43] 3) failover NN from 1002 to 1001
[06:58:50] 4) restart the zkfc daemon on 1002
[06:59:06] and then a regular restart for the RM (since it talks directly to zk)
[06:59:15] less invasive and quicker :)
[06:59:24] Let me know if it makes sense or not
[07:05:14] !log re-run webrequest-load-wf-text-2018-5-29-1
[07:05:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:07:00] Hi elukey - Makes sense !
[07:07:35] elukey: about webrequest failings, my tests have been finished since yesterday afternoon - I wonder if something has changed :9
[07:11:44] :(
[07:16:47] joal: what do you think about https://gerrit.wikimedia.org/r/#/c/435966/ ?
[07:17:02] so my idea is to limit the deployment of ores::base (and its packages) to the hadoop worker nodes
[07:17:06] does it sound good?
[07:21:16] elukey: sounds good except for stat machines
[07:21:29] is it needed in there too?
[07:21:32] elukey: I think we need them to be set up in order to run interactive spark with ores
[07:23:03] joal: cr updated :)
[07:23:10] Yay elukey :) Many thanks
[07:37:40] joal: ok for me to reimage druid1002 ?
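The four HDFS steps from 06:58 can be sketched as an ordered command plan. This is a minimal sketch, not the commands that were actually run: `hdfs haadmin -failover` is the stock Hadoop HA command, but the NameNode service IDs (`nn1001`, `nn1002`) and the `hadoop-hdfs-zkfc` systemd unit name are assumptions for illustration. The function only builds the plan as strings; it does not execute anything.

```python
from typing import List


def rolling_zkfc_restart_plan(
    active: str,
    standby: str,
    zkfc_unit: str = "hadoop-hdfs-zkfc",  # assumed unit name, adjust to the local packaging
) -> List[str]:
    """Return the ordered steps for restarting zkfc on both NameNode hosts.

    `active` and `standby` are hypothetical HA service IDs. The zkfc daemon
    is restarted on a host only after its NameNode has become standby, and
    the original active NameNode ends up active again.
    """
    return [
        f"hdfs haadmin -failover {active} {standby}",   # 1) move active role away
        f"[{active}] systemctl restart {zkfc_unit}",    # 2) restart zkfc on the old active
        f"hdfs haadmin -failover {standby} {active}",   # 3) move active role back
        f"[{standby}] systemctl restart {zkfc_unit}",   # 4) restart zkfc on the other host
    ]


if __name__ == "__main__":
    for step in rolling_zkfc_restart_plan("nn1001", "nn1002"):
        print(step)
```

The ordering is the point of the sketch: each zkfc restart happens while the local NameNode is standby, so no restart touches the active NameNode.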
[07:37:48] +1 elukey
[07:37:58] super thanks :)
[07:38:10] elukey: I actually think druid-public will be easier - A lot less data ;)
[07:44:56] joal: ah yes for sure, but I'll have to do druid100[12] anyway :D
[07:47:00] :)
[07:49:28] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4237876 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1002.eqiad.wmnet']...
[08:15:43] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4237937 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1002.eqiad.wmnet'] ``` and were **ALL** successful.
[08:25:28] druid1002 is back in production
[08:25:36] it is loading segments :)
[08:43:08] are we going to deploy refinery in the next few days?
[08:57:23] Analytics-Kanban, Patch-For-Review: Update per-domain uniques fresh-sessions computation - https://phabricator.wikimedia.org/T167005#3314230 (Tbayer) @JAllemandou Did the "about 10% of the offset" estimate in the task description refer to the daily metric? For the monthly unique devices, the impact may...
[09:00:44] joal: as heads up, in 1/2 hours the network maintenance that I mentioned in one of my past emails to internal@
[09:00:50] is going to happen
[09:01:06] should be a network blip from the hosts' point of view
[09:25:55] helloooo
[09:27:04] Analytics, EventBus, MediaWiki-JobQueue, Notifications, and 4 others: Make EchoNotification job JSON-serializable - https://phabricator.wikimedia.org/T192945#4238148 (Pchelolo) This still didn't quite help. Although the `event` property is not there anymore, the job is still not serializable, bec...
[09:35:45] Analytics: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238182 (Tbayer)
[09:36:14] Analytics, Analytics-Kanban, Patch-For-Review: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4238193 (Tbayer)
[09:36:16] Analytics: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238192 (Tbayer)
[09:39:11] Analytics, ChangeProp, EventBus, MassMessage, and 2 others: Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500#4238203 (mobrovac) > So this is T193471 probably. Indeed, this is the case. We switched `MassMessageSubmitJob` for all wikis, but...
[09:40:50] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4238218 (mobrovac)
[09:40:54] Analytics, ChangeProp, EventBus, MassMessage, and 2 others: Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500#4238217 (mobrovac)
[09:59:26] Analytics, Product-Analytics, Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4238244 (Tbayer) Thanks @fdans! Related to question #3 in the task description, I noticed that the number of IE7 pageviews has dropped a lot from May 21 to M...
[10:04:05] !log re-run pageview-druid-hourly-wf-2018-5-29-7
[10:04:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:04:23] I think that --^ is due to druid1002 being down at the time, let's see if it succeeds
[10:10:20] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4238262 (Pchelolo)
[10:11:22] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4238271 (mobrovac)
[10:11:26] Analytics, EventBus, MassMessage, MediaWiki-JobQueue, Services (done): Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500#4238267 (mobrovac) Open>Resolved a:mobrovac Both jobs are now on the EventBus JobQueue, so this should b...
[10:13:07] checking druid1002, there seems to be some trouble with indexing
[10:14:11] elukey: any spontaneous thoughts about https://phabricator.wikimedia.org/T195819 ?
[10:15:43] HaeB: interesting! I'll check with Joseph what's happening later on, it is probably a misconfiguration!
[10:15:53] thanks!
[10:16:00] so about druid1002 - I can see the following in the logs
[10:16:01] io.druid.java.util.common.ISE: Hadoop dependency [/usr/share/druid/hadoop-dependencies/hadoop-client/cdh] didn't exist!?
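The path in that exception looks like a Hadoop dependency coordinate mapped onto the filesystem. A minimal sketch of that mapping, assuming a coordinate of the form `group:artifact:version` resolves to `<root>/<artifact>/<version>` (which is what the error path suggests); the function name is hypothetical and this is not Druid's own code:

```python
from pathlib import PurePosixPath


def hadoop_dependency_dir(
    coordinate: str,
    root: str = "/usr/share/druid/hadoop-dependencies",  # root taken from the error message
) -> PurePosixPath:
    """Map a 'group:artifact:version' coordinate to the directory Druid would look for.

    Assumption: the group part is not used in the path, only artifact and version.
    """
    group, artifact, version = coordinate.split(":")
    return PurePosixPath(root) / artifact / version


if __name__ == "__main__":
    # Example coordinate; a host that was reimaged without this directory
    # would hit the "didn't exist" error for it.
    print(hadoop_dependency_dir("org.apache.hadoop:hadoop-client:cdh"))
```

Under that assumption, any coordinate still listed in the middlemanager config needs a matching directory on disk, otherwise indexing tasks fail at startup.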
[10:17:46] I am a bit confused :D
[10:17:53] the list of extensions is correct
[10:20:18] /etc/druid/middlemanager/runtime.properties:10:druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:cdh"]
[10:20:22] whattttttttt
[10:20:59] ah ok now I get it
[10:21:11] reimaging removes the old dependency
[10:21:17] and the middle manager doesn't find it
[10:21:24] lovely
[10:25:29] joal: https://gerrit.wikimedia.org/r/#/c/435983/
[10:30:01] !log roll restart of druid-middlemanagers on druid* to pick up the new runtime settings (no more references to hadoop-client-cdh)
[10:30:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:30:20] next round of hourly indexers should tell us if the issue is solved or not
[10:31:51] I see you're busy elukey, let me know if I can help
[10:34:35] mforns: o/ - there seems to be something weird with pageviews-daily
[10:34:55] I cannot see its dimensions
[10:35:08] in turnilo?
[10:35:40] I also tried curl druid1003.eqiad.wmnet:8082/druid/v2/datasources/pageviews-daily
[10:36:52] elukey, I think there's a problem with the job, it seems there are 2 metric names: "count" and "view count"
[10:37:02] one seems to be populated: "count"
[10:37:11] the other seems to be the default: "view count"
[10:37:20] but empty
[10:37:57] mforns: I have another theory, and it might be due to today's reimage of druid1002
[10:38:10] I thought that all the datasources were replicated two times
[10:38:14] hmmm but wait, pageview-hourly does also have both metrics, and both correct
[10:38:25] yea, you must be right
[10:38:28] pageviews-hourly for example has replication factor 2
[10:38:34] pageviews-daily no :(
[10:38:39] I see
[10:40:09] basically all the ones in https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=50&fullscreen&orgId=1 that are not replicated two times are hit
[10:40:42] so in theory, all those segments are on HDFS
[10:40:47] but druid needs to pick them up
[10:40:56] and load them to the historicals again
[10:41:57] HaeB: I think I know what happened, it is due to this morning's reimage of druid1002
[10:41:57] hmm
[10:42:51] segments are not replicated across multiple historicals unless explicitly set in the coordinator's settings
[10:43:15] elukey, I think you're right, not sure though how to reload segments without recomputing, we need the jo-man
[10:43:18] so if not replicated, wiping the segments cache on the host running the historical that is responsible for them means caput
[10:43:21] kaput
[10:43:52] mforns: I think that we need only to wait, there shouldn't be any need to recompute
[10:44:06] but we'll have to think about replicating more data
[10:44:10] aha ok
[10:47:33] Analytics: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238182 (elukey) Thanks a lot for the notification, I think that this is due to the reimage of druid1002 that happened this morning. pageviews-daily is not replicated, so the segments cache is prese...
[10:47:55] replicating is also useful if a druid node reboots/dies/etc.. for some reason
[10:48:13] if the historical is down all the subqueries for its segments will fail
[10:55:00] Analytics, Analytics-Kanban, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238380 (elukey) p:Triage>Normal a:elukey
[10:55:54] as FYI the network maintenance planned for today is starting
[10:56:04] some of our hosts will likely complain :)
[10:56:45] k
[10:57:55] Analytics, Product-Analytics, Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578#4238395 (Nuria)
[10:57:59] Analytics, Analytics-Kanban, Patch-For-Review: Update UA parser - https://phabricator.wikimedia.org/T189230#4238394 (Nuria)
[11:03:55] Analytics, Analytics-Kanban, Patch-For-Review: Update version of ua-parser in refinery source - https://phabricator.wikimedia.org/T192463#4238411 (Nuria)
[11:03:58] Analytics, Analytics-Kanban, Patch-For-Review: Update ua-parser package. Both uap-java and uap-core - https://phabricator.wikimedia.org/T192464#4238410 (Nuria) Open>Resolved
[11:04:24] Analytics, Analytics-Kanban, Patch-For-Review: Update UA parser - https://phabricator.wikimedia.org/T189230#4238415 (Nuria)
[11:04:30] Analytics, Analytics-Kanban, Patch-For-Review: Update version of ua-parser in refinery source - https://phabricator.wikimedia.org/T192463#4139693 (Nuria) Open>Resolved
[11:04:32] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid clusters to 0.11 - https://phabricator.wikimedia.org/T193712#4238416 (Nuria) Open>Resolved
[11:04:55] Analytics, Analytics-Kanban, Pageviews-API, Patch-For-Review: Add nyc.wikimedia to pageviews whitelist - https://phabricator.wikimedia.org/T194309#4238417 (Nuria) Open>Resolved
[11:06:05] Analytics, Analytics-Kanban, User-Elukey: Restart Analytics hosts for Java 8 Security upgrades - https://phabricator.wikimedia.org/T194268#4238419 (Nuria) Open>Resolved
[11:06:20] Analytics, Analytics-Kanban, Patch-For-Review: Update UA parser - https://phabricator.wikimedia.org/T189230#4238421 (Nuria)
[11:06:24] Analytics, Analytics-Kanban, Patch-For-Review: Update version of ua-parser in eventlogging - https://phabricator.wikimedia.org/T192529#4238420 (Nuria) Open>Resolved
[11:07:12] hola EU nuria_ :)
[11:07:25] hola fellow EU team member
[11:08:21] i have cancelled staff elukey as i think there is nothing pressing
[11:14:38] Analytics, Analytics-Kanban, Patch-For-Review: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4238456 (Nuria) @Ottomata do we need to re-generate this file every time we add a new data source? https://gerrit.wikimedia.org/r/#/c/432530/4/modules/turnilo/template...
[11:16:16] nuria_: ack
[11:18:31] Analytics-Kanban, Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#4238470 (Nuria)
[11:18:34] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Add druid datasources as configuration parameter in AQS - https://phabricator.wikimedia.org/T193387#4238469 (Nuria) Open>Resolved
[11:20:18] Analytics, Analytics-Kanban, Patch-For-Review: Index and store page preview aggregates on Druid so they are visible in pivot/superset - https://phabricator.wikimedia.org/T192305#4238474 (Nuria) Open>Resolved
[11:20:48] Analytics-Kanban, Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#4238478 (Nuria)
[11:20:51] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Index by-snapshot mediawiki-history-reduced in druid - https://phabricator.wikimedia.org/T193388#4238477 (Nuria) Open>Resolved
[11:21:01] Analytics-Kanban, Patch-For-Review, Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4238479 (Nuria) Open>Resolved
[11:21:17] Analytics, Analytics-Kanban: Update user_history and page_history column naming convention - https://phabricator.wikimedia.org/T188669#4238480 (Nuria) Open>Resolved
[11:22:10] Analytics-Kanban, Patch-For-Review: Refinery Hive python utils don't support month=2018-02 style partitions - https://phabricator.wikimedia.org/T194304#4238482 (Nuria) Open>Resolved
[11:22:40] Analytics, Analytics-Kanban: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642#4238484 (Nuria)
[11:22:44] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4238483 (Nuria) Open>Resolved
[11:23:03] Analytics-Kanban: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#4238486 (Nuria)
[11:23:05] Analytics, Analytics-Kanban: Rename new geowiki to geoeditors - https://phabricator.wikimedia.org/T193429#4238485 (Nuria) Open>Resolved
[11:25:06] Analytics, Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Add pivot parameter to tabular layout graphs - https://phabricator.wikimedia.org/T126279#2009951 (Nuria) Ping @milimetric Let's make sure we document how to use this on dashiki docs
[11:37:13] mforns: I reviewed the unavailable vs unreplicated metrics and something is off
[11:37:20] (the druid metrics I mean)
[11:37:30] druid's docs say
[11:37:31] Number of segments (not including replicas) left to load until segments that should be loaded in the cluster are available for queries.
[11:37:45] this is segments unavailable count --^
[11:37:46] what? hehe
[11:37:51] Number of segments (including replicas) left to load until segments that should be loaded in the cluster are available for queries.
[11:37:56] this is unreplicated count
[11:38:10] mmmm
[11:38:11] at the moment unavailable is zero
[11:38:21] meanwhile unreplicated is showing the diff
[11:38:42] I'd have expected unavailable to grow for those datasources replicated only once
[11:40:07] I don't get this completely
[11:41:59] can you show me in bc?
[11:43:10] sure!
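One way to read the two doc definitions quoted at 11:37 is sketched below. This is an interpretation under stated assumptions, not Druid's implementation: `required` and `loaded` are hypothetical maps from segment id to replica counts, "unavailable" counts segments with no loaded copy at all, and "under-replicated" counts every missing copy, replicas included.

```python
from typing import Dict, Tuple


def segment_load_gaps(required: Dict[str, int], loaded: Dict[str, int]) -> Tuple[int, int]:
    """Return (unavailable, under_replicated) for a set of segments.

    required: segment id -> number of copies the load rules ask for
    loaded:   segment id -> number of copies currently served by historicals
    """
    unavailable = 0
    under_replicated = 0
    for segment, wanted in required.items():
        have = loaded.get(segment, 0)
        if wanted > 0 and have == 0:
            unavailable += 1                           # not queryable at all
        under_replicated += max(wanted - have, 0)      # every missing copy counts
    return unavailable, under_replicated


if __name__ == "__main__":
    # One historical wiped while every segment is required twice:
    # each segment still has one copy, so nothing is unavailable.
    required = {"seg-a": 2, "seg-b": 2}
    loaded = {"seg-a": 1, "seg-b": 1}
    print(segment_load_gaps(required, loaded))
```

Under this reading, "unavailable is zero while unreplicated shows the diff" is what a cluster looks like when every affected segment still has at least one loaded copy, which would only fit datasources that are required more than once.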
[11:46:56] mforns: I am in bc
[11:47:02] oops omw
[11:48:57] Analytics, Analytics-Wikistats, Accessibility, Easy, Patch-For-Review: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238526 (sahil505) a:sahil505
[11:53:40] Analytics, Analytics-Kanban, Analytics-Wikistats, Accessibility, and 2 others: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238551 (sahil505)
[12:00:37] elukey, I've been kicked out of hangouts
[12:08:39] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4100724 (Anomie) Did this break logging of jobs to runJobs.log on mwlog1001, and detection of jobs via maintena...
[12:13:21] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4238643 (Pchelolo) @Anomie > Did this break logging of jobs to runJobs.log on mwlog1001 The `runJobs.log` co...
[12:15:01] Analytics, Analytics-Kanban, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238647 (elukey) This is clearly not right, segments have been loaded and nothing changed. Moreover, from the title (that I didn't pay attention to before) say...
[12:15:13] Analytics, Analytics-Kanban, Analytics-Wikistats, Accessibility, and 2 others: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238648 (sahil505) @mforns @Volker_E : I was going through the task description and I couldn't see the need o...
[12:19:11] Analytics, Analytics-Kanban, Analytics-Wikistats, Accessibility, and 2 others: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238661 (sahil505) Color contrast footer text & links are taken care of in T191672 so updating the description.
[12:19:18] Analytics, Analytics-Kanban, Analytics-Wikistats, Accessibility, and 2 others: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238663 (Volker_E) @sahil505 Neither the “Explore topics” “heading” (which isn't marked up as heading, it's j...
[12:19:42] Analytics, Analytics-Kanban, Analytics-Wikistats, Accessibility, and 2 others: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4238664 (sahil505)
[12:39:21] joal: let's chat when you are not busy about pageviews-daily in druid, we can't find the issue :(
[12:44:12] Analytics, Analytics-Kanban, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238725 (Tbayer) >>! In T195819#4238647, @elukey wrote: > This is clearly not right, segments have been loaded and nothing changed. Moreover, from the title (t...
[12:59:05] Quarry, DBA, Data-Services: Cannot reliably get the EXPLAIN for a query on analytics wiki replica cluster - https://phabricator.wikimedia.org/T195836#4238772 (zhuyifei1999)
[12:59:49] elukey: here !
[13:06:28] elukey: I'm assuming the problem comes from turnilo --> Manually requesting Druid works
[13:07:12] I still didn't figure out how to query pageviews-daily via druid-sql, the '-
[13:07:17] is giving me some issues
[13:07:47] elukey: I think I've already faced that, and that it was a good reason for us to move toward _ as a global convention :)
[13:07:53] but I tried to tcpdump on druid1001 and I don't see a POST query for pageviews-daily coming from turnilo
[13:08:23] elukey: Here is what I used: https://gist.github.com/jobar/889b58c041cfd18144342982095d345d
[13:08:46] thanks :)
[13:08:49] ;)
[13:08:55] elukey: do you want us to batcave?
[13:09:04] so confirmed, I tried pageviews-hourly via tcpdump and I can see the query on druid1001
[13:09:09] but not for pageviews-daily
[13:09:13] so the issue is not druid
[13:09:21] seems turnilo/js related
[13:09:29] sure we can bc
[13:09:57] Analytics, Analytics-Kanban, Patch-For-Review: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4238800 (Ottomata) I'm not 100%, but I know that Turnilo has the ability to discover datasources on its own. Perhaps the use of the config.yaml file is preventing i...
[13:13:02] Analytics, Operations, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (Gilles)
[13:21:53] elukey o/
[13:21:59] re https://gerrit.wikimedia.org/r/#/c/435966/
[13:22:04] ottomata: o/
[13:22:14] why not just move the hunspell-bs package into the stretch conditional in ores::base
[13:22:14] ?
[13:22:23] (don't see hunspell bs...)
[13:22:45] e.g. https://github.com/wikimedia/puppet/blob/production/modules/ores/manifests/base.pp#L47
[13:23:01] ottomata: yes that was the other option, but after a chat with Joseph I thought it was better to deploy ores packages only where needed (rather than in all hosts..)
[13:23:16] but I can revert and do what you suggested, no preference
[13:23:24] aye, that makes sense too, i just don't like the extra param :)
[13:23:24] hm
[13:23:36] maybe we can just include ores::base in the places needed
[13:23:41] rather than conditionally in ::common
[13:23:53] e.g. hadoop worker, stat packages
[13:24:01] hadoop client
[13:25:03] come ooonnnnn
[13:25:04] :D
[13:25:09] only for an extra parameter? :P
[13:25:32] haha
[13:25:39] i dunno, that's what we do for the other ones, right?
[13:25:55] elukey: i am not hugely opinionated here :)
[13:26:31] i don't actually love the fact that we include that class in hadoop::common
[13:26:37] probably shouldn't have done that in the first place, no?
[13:26:51] it's not that it is an extra parameter
[13:27:00] it is that it ties the hadoop stuff to ores, whiiiich is fine i guess
[13:27:01] i dunno
[13:27:04] you like it there?
[13:27:40] not a lot, as you are saying it ties ores to hadoop config, ideally we'd deploy ores stuff only where needed (like hadoop workers and stat boxes)
[13:27:41] probably what would be best would be to have an analytics common packages or something
[13:27:46] and merge all those stats / hadoop common ones
[13:28:07] could be an option yes :)
[13:28:48] yeah, this might just be an artifact of the terrible statistics module
[13:28:51] that makes us do ugly things
[13:28:53] :)
[13:30:22] ah sorry for the pages yesterday, I hope I didn't disturb your day off :(
[13:30:48] one set of pages was burrow, the other one was mirror maker on old kafkas not liking the restarts
[13:30:56] joal: https://gerrit.wikimedia.org/r/435997
[13:32:04] OH! that's all elukey!? that is a relief i saw them, but they recovered quickly
[13:32:06] and i was not near my computer
[13:32:09] so i was going to look into it today
[13:32:15] good to know that it was known, phew :)
[13:32:41] elukey: ^ nice :)
[13:32:49] (the turnilo pageview thing)
[13:33:38] joal is the master fixer as always, I assist and merge :D
[13:33:46] :-P
[13:34:10] HaeB: turnilo should be ok now!
[13:36:03] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238965 (elukey) The issue should be gone now!
[13:36:27] joal: one qs about druid that came up today - a dataset like pageviews-daily is not replicated 2 times right?
[13:36:53] hm, elukey, ores::base is already in statistics::packages
[13:37:00] hmm
[13:37:01] elukey: I think it is
[13:37:12] ottomata: ah I didn't check!
[13:37:16] elukey: rules say - loadForever (default rule)
[13:37:17] 2 in _default_tier
[13:37:19] hmmmmmMMMMmMM
[13:37:21] :)
[13:37:25] elukey: looks good - thanks!
[13:37:28] i'll figure it out make a patch and see what you think
[13:37:53] joal: ah ok so default is two, I misread the legend then, the graphs of today's reimage now make sense
[13:37:55] (PS1) Sahil505: Fixed accessibility/markup issues of Wikistats 2.0 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533)
[13:38:01] np elukey :)
[13:38:02] I thought that by default it was only replicated once
[13:38:12] elukey: 2 is the default rule
[13:38:12] I mean, no replication :)
[13:39:14] joal: did you see https://gerrit.wikimedia.org/r/#/c/435983/ too? Indexations are good but let me know if you see any issue
[13:39:40] elukey: I saw some errors, but since you fixed them, I trust :)
[13:40:05] so one of the middlemanager's settings was druid.indexer.task.defaultHadoopCoordinates: ["org.apache.hadoop:hadoop-client:cdh"]
[13:40:10] that was ok before 0.11
[13:40:14] but not now
[13:40:16] elukey: seems very reasonable to make sure this dep is removed indeed :)
[13:40:21] super
[13:40:33] Thanks for that elukey
[13:44:19] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4239020 (elukey)
[13:48:18] ottomata: ok if I swap the last zookeeper node?
[13:48:26] so this nightmare will be finished
[13:48:32] (then I'll need to roll restart again..)
[13:48:53] elukey: please do!
[13:49:04] (CR) Sahil505: "- I couldn't make the labels explicit (with `for=""` attribute) as the inputs are missing an id." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[14:16:39] zookeeper moved to conf100[4-6] \o
[14:16:40] \o/
[14:16:47] now I need to roll restart kafka etc..
[14:17:01] but the zk cluster itself is not anymore on the old hosts
[14:19:57] yeehawww
[14:26:34] elukey, so the problem was underscore vs hyphen???
[14:26:59] mforns: nono simply that turnilo was trying to auto-fill the dimensions and failing for some reason
[14:27:15] aaaahhh
[14:28:09] elukey: https://gerrit.wikimedia.org/r/#/c/436012/
[14:29:15] ottomata, I think I found the problem with EL Sanitization, and it might affect other parquet data sets... My guess is it's https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/RefineTarget.scala#L341
[14:29:34] it does not recognize the file as parquet, it returns 'text'
[14:29:52] I'm adding some logs to see what's in that buffer
[14:31:00] OH really!?
[14:31:02] huh.
[14:31:14] mforns: you can just hdfs dfs -text the file
[14:31:16] and see what comes out
[14:31:41] ottomata, I know, the file starts with PAR, but it's still tagged as text
[14:35:08] hm
[14:35:39] joal, i'm looking at Nifi a little bit, am very unfamiliar
[14:35:47] everything I'm seeing requires GUI use?
[14:37:10] ottomata, the log says the contents of buffer when reading event db parquet files are: (,,2)
[14:37:18] weird
[14:37:22] mforns: these are the files that have already been refined?
[14:37:22] yes?
[14:37:26] ones in event. db?
[14:39:53] OH
[14:39:54] i bet you
[14:40:00] it is reading the _REFINED success flag
[14:40:05] instead of the .parquet one!
[14:40:54] hmm no it is supposed to skip those.
[14:40:55] hm
[14:42:07] DUH
[14:42:07] !f.getPath.toString.startsWith("_")
[14:42:10] is never going to work
[14:42:15] because the path starts with hdfs:// always
[14:43:09] mforns: sorry about that!
[14:43:11] fix coming...
[14:43:38] ottomata, oh! good catch
[14:43:59] ottomata, don't worry, I can fix that
[14:44:52] but shouldn't it then try to read the correct ones?
[14:45:07] Maybe then the table is already created and not mergeable?
[14:45:26] (PS1) Ottomata: RefineTarget.inferInputFormat should filter out file names starting with _ [analytics/refinery/source] - https://gerrit.wikimedia.org/r/436023
[14:45:56] ok, lookin
[14:46:30] using https://hadoop.apache.org/docs/r2.7.5/api/org/apache/hadoop/fs/Path.html#getName--
[14:47:27] (CR) Mforns: [C: 2] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/436023 (owner: Ottomata)
[14:48:23] ottomata, just testing sanitization with that fix
[14:49:27] k
[14:50:22] ottomata: AFAIK NiFi is API-oriented, using XML as job definitions
[14:51:47] aye
[14:52:03] i guess same as kafka connect, just has a fancy gui that all the tutorials use
[14:52:16] kafka connect also api (or config file) oriented, using json rest
[14:54:03] right ottomata
[14:54:36] ottomata: I actually think NiFi was built with this idea in mind that non-coders could use it
[14:55:06] Man -- We have an issue with webrequest jobs
[14:55:21] oh?
[14:55:26] Frequency of error has increased dramatically in the past few days
[14:56:17] `dramatically` sounds a bit dramatic ...
[14:56:40] But the frequency really has increased (Luca noticed it as well)
[14:58:18] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4239425 (Anomie) >>! In T190327#4238643, @Pchelolo wrote: > The kafka-based queue logs can be found either in `...
[14:58:21] (Merged) jenkins-bot: RefineTarget.inferInputFormat should filter out file names starting with _ [analytics/refinery/source] - https://gerrit.wikimedia.org/r/436023 (owner: Ottomata)
[14:58:32] also ottomata - looks like yarn master is analytics1002 - is that expected?
[15:00:41] nope! doesn't hurt but we try to keep it on 1001
[15:00:49] hm
[15:01:57] joal: argh it might be due to zookeeper
[15:02:01] I was about to check
[15:02:07] thanks elukey
[15:03:49] Analytics, EventBus, MediaWiki-JobQueue, Services (doing): Make JobExecutor debug-log to mwlog - https://phabricator.wikimedia.org/T195858#4239441 (Pchelolo) p:Triage>Normal
[15:03:54] !log rerun webrequest-load-wf-upload-2018-5-29-13
[15:03:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:08] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4239457 (Pchelolo) > That link takes me to a dashboard that does no filtering at all, so it shows every log mes...
[15:26:27] elukey: I let you patch the datasource name?
[15:26:39] joal: sure!
[15:26:43] Thanks :)
[15:27:24] joal: an1001 is again the master
[15:27:57] mforns: Just added you to https://gerrit.wikimedia.org/r/#/c/435169/
[15:28:14] mforns: I was expecting Dan to do it, but if you could that'd be great
[15:28:17] Thanks elukey !P
[15:28:44] Oh by the way elukey - I +1 the idea that something is bizarre with webrequest
[15:30:10] PROBLEM - HDFS corrupt blocks on analytics1001 is CRITICAL: 6 ge 5 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=39&fullscreen
[15:30:33] Aouch
[15:30:39] elukey: Any idea about --^
[15:30:40] ?
[15:31:03] probably the restarts, 6 blocks, might be temporary
[15:31:06] let's wait a sec
[15:31:06] k
[15:44:26] nuria_: While at it playing with druid datasources - Shall I rename the few that use dashes instead of underscores?
[15:47:06] joal: no cause bookmarks will stop working for people
[15:47:10] right?
[15:47:37] joal: for webrequest is Ok cause a few people use it. Others are linked all over phab
[15:47:42] nuria_: true - but they have stopped working when we moved to turnilo, and nobody complained :(
[15:48:01] joal: in that case there is only one answer!
[15:48:04] nuria_: I actually think a lot of people use them :)
[15:48:18] joal: then let's go for it
[15:48:41] ok let's do it - It'll also make SQL querying work for those datasources
[15:48:59] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[15:49:01] We've not managed to successfully SQL-query with-dashes datasources
[15:49:16] hm hm
[15:49:26] ok nuria_ - Doing it - It'll be cleaner
[15:49:34] * elukey checks analytics1031
[15:49:38] Thanks elukey
[15:50:07] java.net.NoRouteToHostException: No Route to Host from analytics1031.eqiad.wmnet/10.64.36.131 to analytics1001.eqiad.wmnet:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
[15:50:12] no bueno
[15:52:24] elukey: network maintenance is supposedly finished, right?
[15:52:58] yeah, but I think this is a misconfiguration.. an1030 was also cut off the network
[15:53:02] this seems more subtle though
[15:53:07] :(
[15:53:07] I am chatting with Arzhel
[15:53:23] joal: I think that the failover to an1002 was part of the network maintenance
[15:53:29] I forgot that an1001 was impacted
[15:53:34] k makes sense elukey
[15:53:46] no prob, it's just good to know :)
[15:55:37] joal: I also need a course on how to use these stats: https://gerrit.wikimedia.org/r/#/c/434987/
[15:55:43] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4239821 (ArthurPSmith) Here's a specific question that might be detailed enough in description: suppose we have a collection of facts (say the names, countries, inception dates, and...
[15:56:28] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Add maxmind ip info to webrequest dataset on druid - https://phabricator.wikimedia.org/T194055#4239823 (Nuria) Per our conversation let's rename datasource before we re-start these jobs to webrequest_sampled_128
[16:10:28] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4239894 (Anomie) The new link works. Although I don't see any messages in there about jobs being run, just erro...
[16:12:01] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4239901 (Pchelolo) > The new link works. Although I don't see any messages in there about jobs being run, just...
[16:12:11] nuria_: can do :)
[16:13:52] Analytics: Problems with external referrals? - https://phabricator.wikimedia.org/T195880#4239904 (Nuria)
[16:14:03] joal: ticket created now: https://phabricator.wikimedia.org/T195880
[16:14:18] Thanks nuria_
[16:14:40] Analytics, Analytics-Kanban: Problems with external referrals?
- https://phabricator.wikimedia.org/T195880#4239918 (Nuria) a:Nuria
[16:18:01] Analytics-Kanban: Update oozie druid loading job to facilitate test indexation and prevent prod indexation by mistake - https://phabricator.wikimedia.org/T195882#4239952 (JAllemandou)
[16:18:06] another one nuria_ --^
[16:18:11] Analytics-Kanban: Update oozie druid loading job to facilitate test indexation and prevent prod indexation by mistake - https://phabricator.wikimedia.org/T195882#4239962 (JAllemandou) a:JAllemandou
[16:20:00] Analytics, EventBus, MediaWiki-JobQueue, Services (doing): Make JobExecutor debug-log to mwlog - https://phabricator.wikimedia.org/T195858#4239992 (Pchelolo)
[16:47:53] joal: about webrequest_sampled_128 - do I simply need to change the datasource in the druid indexing json template right?
[16:52:30] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:52:41] we are working on --^
[16:52:47] should be back soon
[16:54:40] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:58:48] goooood
[17:02:12] !log re-run webrequest-load-text 29th May 2018 12:00:00
[17:02:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:03:17] ok the zookeeper roll restart is almost completed, I need to do kafka1012->23
[17:03:23] but I'll do them tomorrow, no rush
[17:07:28] nuria_ et al: do you understand the question on analytics public list re promotional user names? I attempted to respond, but I think I don't fully understand what statistics is in question. If you're unsure, too, I'll ask for clarification.
[17:18:05] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4238182 (JKatzWMF) @elukey it is, thanks!!
[17:24:00] elukey: actually I'll do it - I'm working on a patch involving both adding datasource and rename some of them, so I'll rename webrequest
[17:24:05] sorry for bothering about it
[17:27:02] ack!!
[17:27:11] joal: as FYI I am running a fsck / on hdfs
[17:27:30] okey elukey
[17:27:34] just finished
[17:27:35] Corrupt blocks:0
[17:27:37] elukey: any hint about something wrong?
[17:27:41] ok:)
[17:27:44] The filesystem under path '/' is HEALTHY
[17:27:47] That answers my question :)
[17:28:03] so I guess that the metric needs to recover (possibly jmx is holding a weird value)
[17:28:06] so all gooooood
[17:28:12] lzia: sorry no idea :(
[17:29:03] elukey: no worries. I'll ask for clarification
[17:32:24] going off team! Talk with you tomorrow :)
[17:32:29] Bye elukey
[17:35:25] Analytics, Operations, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4240357 (MoritzMuehlenhoff) p:Triage>Normal
[17:36:10] Analytics, Operations, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (MoritzMuehlenhoff) @gilles: Since this is a non-sudo change, it needs to only pass the three day waiting period.
[17:41:26] lzia: i could not understand the question nor do i think is directed to wikimedia spaces
[17:42:24] lzia: i would let him clarify
[18:15:23] (PS1) Joal: Parameterize datasource of druid loading jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/436080 (https://phabricator.wikimedia.org/T195882)
[18:33:40] Analytics, Operations, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (JAllemandou) @Gilles : Feel free to ping when you're in if you want some help on the data or the way to play with it.
[18:39:08] (CR) Joal: [V: 1] "Tested on webrequest_sampled_128." [analytics/refinery] - https://gerrit.wikimedia.org/r/436080 (https://phabricator.wikimedia.org/T195882) (owner: Joal)
[19:15:59] Analytics, Operations, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4240743 (Nuria) Approved, please by all means use hadoop.
[19:55:26] nuria_: yup. doing that. thanks.
[20:39:49] Analytics, Analytics-Kanban, EventBus, Patch-For-Review: SSL and inter broker encryption for Kafka main - https://phabricator.wikimedia.org/T193778#4241111 (Ottomata)
[20:45:15] Analytics, Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (Imarlier)
[20:54:31] Analytics, Analytics-Kanban, EventBus, Patch-For-Review: SSL and inter broker encryption for Kafka main - https://phabricator.wikimedia.org/T193778#4241172 (Ottomata)
[23:51:55] RECOVERY - HDFS corrupt blocks on analytics1001 is OK: (C)5 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=39&fullscreen