[00:23:18] <HaeB>	 joal: yesterday, milimetric added me to the restricted task, so i can see it myself now. that solves it for my own immediate purposes, but i still think we may want to make it public. (Also considering that it is linked for reference in the PV definition change log, and the issue itself is already publicly documented in that village pump discussion and the
[00:23:18] <HaeB>	 gerrit patch, plus may get noticed by other users, as we are not planning to correct past data.) Dan suggested I should just do that myself, but i may take a moment to ping people on the task in case anyone else still has objections.
[07:53:05] <HaeB>	 joal: heads-up: running a longer webrequest query right now (slightly modified version of your monthly uniques code... complete with the slow reducer start trick you recently added)
[09:27:06] <elukey>	 joal: o/ 
[09:27:11] <joal>	 Hi elukey 
[09:27:19] <joal>	 what's up?
[09:28:00] <elukey>	 morning! I'd like to increase the replication factor for system_auth in cassandra
[09:28:18] <joal>	 elukey: should be a noop, I think you can go :)
[09:29:19] <elukey>	 well not a no-op but really quick :)
[09:29:46] <joal>	 noop from the outside world ;)
[09:30:58] <joal>	 Thanks HaeB for the heads-up - Just as a quick note, I think zareen extracted table contained all the fields needed for you to run your query
[09:32:05] <elukey>	 joal: ALTER KEYSPACE "system_aut" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; - sounds good?
[09:32:11] <elukey>	 system_auth
[09:32:13] <elukey>	 sorry :)
[09:32:39] <joal>	 elukey: sounds correct if you don't forget `use system;` before :)
[09:33:12] <joal>	 elukey: and to be done with cassandra user (other don't have write rights on system I think)
[09:33:28] <elukey>	 joal: mmm why use system? They are separate keyspaces no?
[09:34:00] <joal>	 elukey: You're absolutely right! My mistake :)
[09:34:47] <HaeB>	 joal: the extract table doesn't have ip, user_agent and accept_language
[09:35:18] <elukey>	 joal: ah good!
[09:35:49] <joal>	 HaeB: Arf :( I quick read your query and noticed user agent, I was expecteing to be map
[09:36:25] <joal>	 HaeB: From what it means if I don't mistake, is that zareen table is missing important fields for queries - right?
[09:38:52] <HaeB>	 joal: no, don't worry, this is just a one-off for now, for a different purpose (trying to shed some more light on the year-over-year drop in monthly uniques - may want to pick your brain about that too at some point BTW, nuria already had some thoughts)
[09:39:18] <HaeB>	 joal: also, TBH, reliably rewriting the standard query to use the extract table would not have  been worth the effort ;)  
[09:39:33] <joal>	 hehe :)
[09:40:04] <joal>	 HaeB: Feel free to ask for brain as you wish (I don't garanty effeciency, but time I can)
[09:41:27] <joal>	 AFK for a moment, doctor appointment - will be back soon
[10:07:28] <elukey>	 a-team: the AQS API is completely unavailable, working on it in operations
[10:07:43] <elukey>	 it seems that the 'aqs' user is not available anymore for restbase
[10:07:53] <elukey>	 it started right after the first nodetool repair
[10:26:55] <elukey>	 the aqsloader user is also gone, this is why oozie complains
[11:04:46] <elukey>	 outage finished, one hour of downtime ;(
[11:08:46] <joal>	 elukey: let me guess - You changed the replication from one of the new servers?
[11:08:55] <joal>	 elukey: or something else:
[11:12:20] <elukey>	 joal: nope, from aqs1004
[11:12:52] <elukey>	 for some misterious reason, when I ran nodetool-a repair system_auth the aqs user/role disappeared 
[11:13:00] <elukey>	 from most of the nodes, except the new ones
[11:13:10] <elukey>	 the only thing that fixed that was re-creating the user
[11:13:19] <elukey>	 I waited to complete the repairs on all nodes
[11:13:22] <elukey>	 but nothing changed
[11:13:44] <joal>	 elukey: weird - possibly an inconsistent state of aqs1004 :(
[11:13:58] <elukey>	 I have no idea.. :(
[11:14:40] <joal>	 Th
[11:14:41] <joal>	 anks
[11:14:46] <joal>	 for fixing elukey !
[11:15:42] <joal>	 elukey: Shall I restart oozie jobs?
[11:16:30] <elukey>	 I also added the aqsloader again because some nodes were missing it, so it should be good for oozie now
[11:16:42] <joal>	 k elukey, that is weird
[11:16:53] <joal>	 elukey: oozie restart? (me)
[11:18:54] <elukey>	 sure sure please go :)
[11:19:29] <joal>	 !log Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07 and 08 to recover from cassandra issue
[11:19:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:25:27] <joal>	 elukey: jobs failed because of auth errors :(
[11:26:04] <elukey>	 yeah I was sure about it, aqsloader missing
[11:26:41] <joal>	 elukey: Ah, I thought you had this one fixed already
[11:27:04] <elukey>	 ah you mean again??
[11:27:08] <elukey>	 :O
[11:27:14] <joal>	 yessir
[11:27:15] <elukey>	 I might have not added all the GRANTs
[11:27:20] <elukey>	 SELECT/MODIFY
[11:27:24] <elukey>	 need more?
[11:27:26] <joal>	 elukey: INSERT!
[11:27:37] * elukey cries in a corner
[11:27:38] <joal>	 elukey: I don't know rights system in cassandra
[11:28:51] <elukey>	 modify is INSERT, DELETE, UPDATE, TRUNCATE
[11:29:02] <joal>	 hm, weird elukey !
[11:29:13] <joal>	 elukey: maybe not all keyspaces listed?
[11:30:05] <elukey>	 well it was on all the keyspaces
[11:30:13] <joal>	 rhm
[11:31:51] <elukey>	 what is the error?
[11:55:21] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:56:21] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[12:07:31] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:08:21] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[12:18:45] <joal>	 !log Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07, 08, 09 to recover from cassandra issue - Worked !
[12:18:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:18:54] <joal>	 Thanks elukey for the cleanup and fixing !
[12:26:44] <wikibugs_>	 10Analytics, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3049306 (10elukey) Preliminary report since we don't know the exact root cause:  I executed `ALTER KEYSPACE "system_auth" WITH REPLICATION = {...
[12:29:16] * elukey lunch!
[13:04:26] <wikibugs_>	 10Analytics, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3049435 (10elukey) One of the explanations that might be plausible is that after bumping the replication factor to 12 we moved to a state in w...
[13:06:54] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3002384 (10elukey)
[13:15:37] <joal>	 Taking a reak a-team, later 
[13:15:48] <mforns>	 cya joal 
[13:18:25] <wikibugs_>	 10Analytics: Review the recent Varnishkafka patches - https://phabricator.wikimedia.org/T158854#3049481 (10elukey)
[13:20:16] <wikibugs_>	 10Analytics, 06Operations-Software-Development: Review the recent Varnishkafka patches - https://phabricator.wikimedia.org/T158854#3049495 (10elukey) p:05Triage>03Normal
[14:30:22] <elukey>	 joal,mforns - do you use jconsole by any chance to connect to prod?
[14:30:36] <mforns>	 elukey, no...
[14:30:45] <elukey>	 I can't find a way to use it.. even for labs, I need to get some mbeans exported by jmx
[14:30:49] <elukey>	 sigh
[14:46:26] <elukey>	 I got it!
[14:46:33] <elukey>	 now I am going to write it down somewhere
[14:46:39] <elukey>	 so frustrating
[14:51:14] <milimetric>	 mforns / joal: thanks for covering yesterday, my sanity is restored
[14:51:26] <mforns>	 milimetric, np :]
[15:00:24] <elukey>	 urandom: hello! I'd really need some help from you whenever you have time :)
[15:02:55] <elukey>	 I am currently adding the jmxport to the Mapreduce History server.. it does expose JVM metrics, so we'll be able to monitor it
[15:08:04] <wikibugs_>	 (03PS1) 10Fdans: Format timestamps in per-project aggregation so that comparison in Cassandra returns the correct months [analytics/aqs] - 10https://gerrit.wikimedia.org/r/339419 (https://phabricator.wikimedia.org/T156312)
[15:08:42] <urandom>	 elukey: ciao!
[15:08:50] <urandom>	 elukey: i was looking at your issue already
[15:08:54] <elukey>	 :(
[15:09:03] <urandom>	 i think i know what happened
[15:09:17] <elukey>	 what did I mess up?
[15:10:00] <urandom>	 well... maybe not
[15:10:03] <elukey>	 (we can chat in hangouts if you prefer)
[15:10:12] <urandom>	 if you'd like
[15:10:22] <urandom>	 might save the channel my yammering :)
[15:10:40] <elukey>	 and also it might take less for me to understand :)
[15:10:58] <urandom>	 or more!
[15:11:00] <urandom>	 :)
[15:11:25] <urandom>	 shall i ring you, you me, should we join your batcave?
[15:11:26] <elukey>	 https://hangouts.google.com/hangouts/_/wikimedia.org/we-love-cassandra
[15:11:30] <urandom>	 gotcha
[15:15:27] <wikibugs_>	 (03PS1) 10Mforns: Add oozie workflow to load projectcounts to AQS [analytics/refinery] - 10https://gerrit.wikimedia.org/r/339421 (https://phabricator.wikimedia.org/T156388)
[15:15:57] <wikibugs_>	 (03CR) 10Mforns: [C: 04-1] "Still WIP (needs to be tested)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/339421 (https://phabricator.wikimedia.org/T156388) (owner: 10Mforns)
[15:42:05] <joal>	 halfak: Good morning
[15:42:13] <halfak>	 o/ joal 
[15:42:24] <joal>	 halfak: I finally got that job working :)
[15:43:01] <joal>	 If we exclude the json transformation part (done for all wikis etc), it took a bit more than an hour I think
[15:43:05] <joal>	 halfak: --^
[15:43:12] <halfak>	 That sounds pretty good. 
[15:43:29] <wikibugs_>	 (03PS3) 10Fdans: Add secondary table endpoint to populate Cassandra with correct timestamps [analytics/aqs] - 10https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312)
[15:43:46] <halfak>	 comparing frwiki revision pairs with regex matching in one hour is nice. 
[15:43:51] <joal>	 halfak: Have you had a chance to run the thing on your side ?
[15:43:53] <halfak>	 I'm not sure how long my process took :/
[15:44:05] <halfak>	 Oh yeah.  Pinged here about it last week.  let me look for it again. 
[15:44:19] <joal>	 halfak: Arf, missed it
[15:44:23] <halfak>	 https://datasets.wikimedia.org/public-datasets/all/wp10/20170101/
[15:44:28] * joal diggs down irc logs
[15:44:49] <joal>	 halfak: awesome, will try to double check :)
[15:44:53] <joal>	 Thanks !
[15:51:19] <milimetric>	 elukey: mind taking a look at the commit message of https://gerrit.wikimedia.org/r/339424 and chown/mv/merge?
[15:51:33] <milimetric>	 (I don't has the roots)
[15:55:44] <elukey>	 milimetric: o/ - will do after standup.. qq - are 1. 2. points indipendent from puppet 
[15:55:47] <elukey>	 ?
[15:56:01] <elukey>	 I mean, is puppet going to complain for missing dirs after their execution etc.. ?
[15:56:13] <milimetric>	 no, elukey, shouldn't complain
[15:56:30] <elukey>	 all right
[15:56:33] <milimetric>	 and I have rights to sudo -u hdfs so I can clean up anyway.  But these jobs run weekly so we won't be in trouble until Sudnay
[16:03:26] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3049937 (10Milimetric) First, @Rafaesrey is this data ok with you?  Can you do what you need with the 71 countries?  Leila, based on the spreadsheet does Rafa get what he needs here?  If that'...
[16:17:22] <wikibugs_>	 (03CR) 10Milimetric: [C: 04-1] "biggest question is the versioning, not sure" (033 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans)
[16:28:20] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Add jmxtrans metrics from Hadoop yarn-mapreduce-historyserver - https://phabricator.wikimedia.org/T156272#3050001 (10elukey) Restarted the history server daemon on an1001, all good, port 9986 is now used by JMX. I verified via jconsole  that the JV...
[16:31:20] <milimetric>	 fdans: grooming
[16:35:46] <wikibugs_>	 06Analytics-Kanban, 06Operations-Software-Development: Review the recent Varnishkafka patches - https://phabricator.wikimedia.org/T158854#3050011 (10Milimetric)
[16:39:46] <wikibugs_>	 10Analytics, 10EventBus, 10MediaWiki-Vagrant, 06Services (watching): Kafka logs are not pruned on vagrant - https://phabricator.wikimedia.org/T158451#3050022 (10Milimetric) p:05Triage>03Low
[16:39:52] <wikibugs_>	 10Analytics, 06Analytics-Kanban, 06Operations-Software-Development: Review the recent Varnishkafka patches - https://phabricator.wikimedia.org/T158854#3050023 (10Milimetric)
[16:40:54] <wikibugs_>	 06Analytics-Kanban, 06Operations-Software-Development: Review the recent Varnishkafka patches - https://phabricator.wikimedia.org/T158854#3049481 (10Milimetric)
[16:44:41] <wikibugs_>	 06Analytics-Kanban, 10Pageviews-API: Pageviews missing for article that received on-wiki edits - https://phabricator.wikimedia.org/T158681#3050053 (10Milimetric) p:05Triage>03Normal a:03Milimetric
[16:46:09] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Monitor Hadoop cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#3050059 (10Milimetric) 05Open>03Resolved
[16:46:21] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Monitor Hadoop cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#1016773 (10Milimetric) 05Resolved>03Open
[16:52:27] <joal>	 HaeB: another query?
[16:53:09] <wikibugs_>	 06Analytics-Kanban: Security Upgrade for piwik - https://phabricator.wikimedia.org/T158322#3050091 (10Milimetric) p:05Triage>03Normal a:03Milimetric
[16:55:07] <wikibugs_>	 10Analytics, 13Patch-For-Review, 15User-Elukey: Puppetize clickhouse - https://phabricator.wikimedia.org/T150343#3050097 (10Milimetric)
[16:56:24] <wikibugs_>	 06Analytics-Kanban, 06Research-and-Data: Coordinate with research to vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#3050099 (10Milimetric) What were the plans for looking at this?  We have some new numbers from the public labs import that are just slightly off from our pr...
[17:00:58] <wikibugs_>	 (03CR) 10Fdans: Add secondary table endpoint to populate Cassandra with correct timestamps (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans)
[17:03:09] <milimetric>	 joal: you wanna de-confuse stat1002/stat1003 directories?
[17:03:11] <milimetric>	 in cave?
[17:03:37] <joal>	 milimetric: sure
[17:09:13] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3050139 (10Rafaesrey) Dear Leila,  This is great news and would be extremely useful for the GII.    I have to point out, however, that the indicator would be going from perfect coverage (128 c...
[17:10:24] <elukey>	 milimetric: done!
[17:11:07] <milimetric>	 thanks elukey 
[17:11:22] <elukey>	 milimetric: do you know how to get in touch with Katie by any chance?
[17:11:28] <elukey>	 for the CPS consultants and Pivot
[17:11:35] <elukey>	 I can't reach her via email or Phab
[17:12:25] <wikibugs_>	 (03CR) 10Milimetric: [C: 04-1] Add secondary table endpoint to populate Cassandra with correct timestamps (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: 10Fdans)
[17:12:40] <milimetric>	 elukey: katie horn?
[17:13:34] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3050162 (10Rafaesrey) Hi all,  Or how about giving a set number to those below 100k so that we still have the coverage? Just another idea.    Best,      *Rafael Escalona Reynoso, PhD, MPA. *...
[17:13:57] <milimetric>	 aha, yes, katie horn
[17:14:14] <elukey>	 yes yes sorry
[17:14:15] <milimetric>	 elukey: as far as I know she's on vacation and generally extremely busy 'cause she's in charge of both Discovery and FR tech.
[17:15:34] <elukey>	 milimetric: still on vacation? It might explain it.. moritzm, shall we pick up somebody else for the final approval/verification? (CPS consultants and Pivot, nda LDAP access)
[17:22:23] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3050192 (10Eevans) >>! In T157354#3049435, @elukey wrote: > One of the explanations that might be plausible is that after bumping the r...
[17:32:13] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3050323 (10Eevans) >>! In T157354#3049306, @elukey wrote: > Preliminary report since we don't know the exact root cause: >  > I execute...
[17:34:47] <urandom>	 elukey: i wonder if the longer roles/perms cache bit you
[17:35:20] <urandom>	 elukey: when you say that you finished the repair, but the errors persisted, does that mean the 503s persisted, or that you didn't always see results in cqlsh?
[17:38:33] <HaeB>	 joal: uh, that was just a rather short one, to the aggregate daily uniques table
[17:38:33] <elukey>	 urandom: both.. 
[17:38:51] <elukey>	 urandom: the 503s went away after the user creation
[17:38:58] <HaeB>	 analysts sometime need to do queries, you know ;)
[17:39:18] <HaeB>	 btw what's the normal execution time of the monthly uniques job these days?
[17:39:28] <HaeB>	 (ballpark)
[17:39:38] <urandom>	 elukey: ok
[17:39:42] <elukey>	 urandom: but I have cassandra::permissions_validity_in_ms: 600000
[17:39:51] <elukey>	 so this might have played its role
[17:39:52] <urandom>	 10 minutes?
[17:39:54] <urandom>	 well
[17:40:24] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3050357 (10leila) >>! In T131889#3049937, @Milimetric wrote: > First, @Rafaesrey is this data ok with you?  Can you do what you need with the 71 countries?  Leila, based on the spreadsheet doe...
[17:40:30] <urandom>	 yeah, i was thinking that if it cached a nack, then maybe the 503s would persist for the period of that value
[17:40:40] <urandom>	 but a query of the tables should succeed, if the repair worked
[17:41:01] <urandom>	 elukey: but you're saying that direct queries of the table failed, even after the repair, yes?
[17:44:02] <elukey>	 urandom: yeah IIRC the aqs user was not there on some nodes
[17:44:18] <elukey>	 random select * from system_auth.roles
[17:44:51] <urandom>	 yeah, that wouldn't be explained by roles/perms caching
[17:47:36] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3050390 (10leila) >>! In T131889#3050139, @Rafaesrey wrote: > I have to point out, however, that the indicator would be going from > perfect coverage (128 countries) to one with low coverage (...
[17:54:12] <joal>	 halfak: just double checked our datasets - they very much correct :)
[17:55:08] <urandom>	 elukey: see #4 here? https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureConfigNativeAuth.html
[17:55:26] <urandom>	 3 to 5 replicas per data center
[17:56:12] <elukey>	 indipendently of the cluster size?
[17:56:17] <urandom>	 ¯\_(ツ)_/¯
[17:56:23] <urandom>	 that's what it is saying
[17:56:32] <urandom>	 contradicts advice elsewhere, i know
[17:57:01] <urandom>	 also, apparently only the default cassandra user uses LOCAL_QUORUM consistency
[17:57:06] <urandom>	 all others use LOCAL_ONE
[17:57:24] <urandom>	 if true, that would make me doubt the wisdom in making replication factor == number of nodes
[17:57:43] <joal>	 halfak: 98.28% of the rows in your dataset are matched by mine :)
[17:58:23] <urandom>	 elukey: if it's a question of not being able to auth on a cluster with 5 down nodes, well, at 5 outages you've probably got some serious problems
[17:59:01] <elukey>	 urandom: sure.. but if LOCAL_ONE is used, and the replication is == number of nodes, shouldn't it be better since system_auth will be available locally?
[17:59:08] <elukey>	 (asking to understand your point)
[18:00:46] <elukey>	 (argh it is super late for me, going afk but will read afterwards! sorry :( )
[18:00:49] * elukey afk!
[18:01:08] <joal>	 HaeB: I see a second huge query currently running - is that normal?
[18:01:28] <urandom>	 elukey: better?  i dunno, i guess.
[18:01:47] <urandom>	 elukey: but if you have 5 nodes down, that's a pretty serious problem, no?
[18:02:00] <urandom>	 elukey: let's saying the AQS case we're talking 6
[18:02:35] <HaeB>	 joal: i only see one by me on https://yarn.wikimedia.org/cluster/scheduler ...?
[18:02:57] <urandom>	 if half the cluster is down, you have pretty serious problems and i don't know that not being able to authenticate application queries will be the concern
[18:03:20] <joal>	 HaeB: The one we discussed this morning finished long ago (application_1486634611252_38239)
[18:03:44] <urandom>	 elukey: also, using '6' as the example, it'd have to be the right 6 that were down
[18:04:29] <urandom>	 elukey: let's say that you changed replication to 6 and used NetworkTopologyStrategy, so that you ad 2 replicas per rack
[18:04:36] <urandom>	 s/ad/had/
[18:05:15] <urandom>	 elukey: can you imagine the failure scenario that would leave you unable to authenticate?
[18:06:11] <urandom>	 i think you are in cluster-wide outage territory there no matter what
[18:06:18] <urandom>	 elukey: not necessarily suggesting you change it now
[18:12:14] <joal>	 HaeB: ?
[18:24:22] <HaeB>	 joal: (still in meetings) that's weird that https://yarn.wikimedia.org/cluster/app/application_1486634611252_38239 shows as finished ...
[18:24:47] <HaeB>	 joal: ... my bash script/screen job that executes it is still running, and it's producing messages. last one:
[18:25:12] <HaeB>	 joal: INFO  : 2017-02-23 18:24:05,376 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 2416879.34 sec
[18:25:27] <joal>	 HaeB: There definitely is one query running
[18:25:49] <joal>	 Now the question is is it the same as previous, a new one, a new stage ...
[18:31:41] <joal>	 HaeB: Looks like stage1 is the last of your request - my bad - It's really long !!
[18:35:36] <HaeB>	 joal: yup, looks like WITH queries generate separate jobs?
[18:36:10] <HaeB>	 joal: again though, this is basically just  a modified copy of your own monthly uniques job, which should show the same behavior ;)
[18:43:40] <mforns>	 :[ the projectcounts job failed in the final stage
[18:44:09] <mforns>	 I think I'm going to split the execution into yeas
[18:44:58] <mforns>	 years
[18:46:14] <joal>	 mforns: What was the error?
[18:46:36] <mforns>	 joal, org.apache.spark.SparkException: Error communicating with MapOutputTracker
[18:46:50] <mforns>	 on saveAsTextFile
[18:47:18] <mforns>	 I think that was caused by org.apache.spark.SparkException: Map output statuses were 280460322 bytes which exceeds spark.akka.frameSize (134217728 bytes).
[18:47:21] <joal>	 mforns: what setting do you launch your job with?
[18:47:40] <mforns>	 --executor-memory 1G        --driver-memory 4G        --executor-cores 1        --conf spark.dynamicAllocation.enabled=true        --conf spark.shuffle.service.enabled=true        --conf spark.dynamicAllocation.maxExecutors=64
[18:49:50] <joal>	 hm... 
[18:50:29] <ottomata>	 o/
[18:51:39] <joal>	 hi ottomata 
[18:54:33] <joal>	 mforns: wierd :(
[18:57:25] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3050708 (10Halfak) Just quickly noting that it seems we still have a memory leak in `ores precached` (the utility that runs in la...
[18:59:57] <ottomata>	 heyyy cool check this one out! https://esjewett.github.io/wm-eventsource-demo/
[19:00:23] <joal>	 wow ottomata :)
[19:03:13] <mforns>	 joal, the thing is, even when the whole data set is in a single RDD, and the processing of the files is parallelized, there are so many files (~75000) that there is a ratio of ~1000 files per executor, meaning those 1000 files will be processed sequentially
[19:03:32] <joal>	 mforns: I don't get it
[19:04:33] <mforns>	 joal, my understanding is that each executor will handle ~1000 files, one after the other
[19:04:49] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3050726 (10Pchelolo) > Just quickly noting that it seems we still have a memory leak in ores precached (the utility that runs in...
[19:05:33] <joal>	 mforns: this sounds reasonable, no?
[19:05:35] <mforns>	 so, my idea was to split execution by year, and we will not loose parallelization, because each year has already >8000 files and all nodes will be occupied as well
[19:06:04] <joal>	 mforns: it'll work
[19:06:54] <mforns>	 joal, yes, sounds reasonable, but it turns out to be risky, because the job takes 20 hours to finish, and only then the output is written
[19:07:22] <mforns>	 so, in case it fails, you loose everything, like today
[19:08:09] <mforns>	 if it was yearly execution, we could have results written every 1-2 hours, and not loose cluster power, because parallelization would still be happening
[19:08:31] <joal>	 mforns: no problem with that :)
[19:08:39] <mforns>	 ok
[19:09:01] <joal>	 mforns: particularly if you have an oozie workflow, shoue be very easy
[19:11:02] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3050759 (10Halfak) Yeah.  We have an experimental installation at ores.wmflabs.org where we deploy experimental models that aren'...
[19:12:07] <mforns>	 joal, I think in this case having it in scala code is sufficient, because the job is historical, and doesn't need to be scheduled. Also, it will be a couple lines of code, as opposed to the workflow file
[19:12:40] <joal>	 mforns: You're absolutely right, it's better
[19:13:38] <mforns>	 ok will do
[19:13:45] <halfak>	 joal, that % sounds pretty good. 
[19:13:58] <joal>	 halfak: Cool :)
[19:14:01] <halfak>	 Could you produce a set of non-matched rows so that we can look into them?
[19:14:10] <joal>	 halfak: I can do that
[19:14:36] <joal>	 halfak: There a ~ 40k rows that don't exist in my dataset and exist in yours
[19:14:50] <halfak>	 Perfect.  That seems totally manageable for review :) 
[19:15:08] <halfak>	 If you'd show me where that lives, and where your code lives, I'll take a pass at it. 
[19:15:41] <joal>	 halfak: and about 85k rows that exists in both yours and mine and don't match
[19:16:17] <halfak>	 Oh interesting.  Might that be related to comment patterns -- like "<!-- stuff with aa <ref> -->"
[19:16:36] <joal>	 halfak: hm, normally I treat that corerctly, maybe not
[19:16:56] <halfak>	 I might treat that wrong.  Anyway, it'll probably be obvious once we look at the data :) 
[19:18:43] <wikibugs_>	 (03PS2) 10Joal: [WIP] Add job computing citations diffs over text [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/337900
[19:18:47] <joal>	 halfak: code --^
[19:18:56] <joal>	 halfak: some tests included :)
[19:19:18] <halfak>	 joal, we should have a task.  Is there a task for this. 
[19:19:35] <joal>	 halfak: nope, no task - Creating one
[19:19:47] <halfak>	 Awesome. 
[19:20:02] * halfak is totally going to forget links and stuff without that :) 
[19:20:06] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3050811 (10Pchelolo) > Is there a ChangeProp in labs that will allow us to track the production wikis?  Nope, we don't have acces...
[19:21:10] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3050812 (10Halfak) Boo.  OK.  No worries.  Onward with `ores precached` for now.
[19:22:47] <wikibugs_>	 10Analytics, 06Research-and-Data: Provide a spark job processing history and text to extract citations diffs - https://phabricator.wikimedia.org/T158896#3050826 (10JAllemandou)
[19:22:54] <joal>	 halfak: --^
[19:23:15] <wikibugs_>	 (03PS3) 10Joal: [WIP] Add job computing citations diffs over text [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/337900 (https://phabricator.wikimedia.org/T158896)
[19:25:32] <wikibugs_>	 10Analytics, 06Research-and-Data: Provide a spark job processing history and text to extract citations diffs - https://phabricator.wikimedia.org/T158896#3050861 (10Halfak) I produced this dataset with mwrefs for comparison to the spark job: https://datasets.wikimedia.org/public-datasets/all/wp10/20170101/frwik...
[19:37:53] <wikibugs_>	 10Analytics, 06Research-and-Data: Provide a spark job processing history and text to extract citations diffs - https://phabricator.wikimedia.org/T158896#3050906 (10JAllemandou) Comparing the two datasets (in spark-shell): ``` spark.read.json("/user/joal/frwiki-20170101.diffs.json.bz2").createOrReplaceTempView(...
[19:38:34] <wikibugs_>	 10Analytics, 10EventBus, 06Services (watching): EventBus logs don't show up in logstash - https://phabricator.wikimedia.org/T153029#3050913 (10Ottomata)
[19:40:29] <wikibugs_>	 10Analytics, 10Analytics-EventLogging, 06Performance-Team: Stop using global eventlogging install on hafnium (and any other eventlogging lib user) - https://phabricator.wikimedia.org/T131977#3050946 (10Ottomata) a:05ori>03Krinkle Assigning to Timo instead of Ori.  Feel free to unassign this or reassign a...
[20:33:34] <wikibugs_>	 10Analytics, 06Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3051159 (10Rafaesrey) Dear Leila,  Thank you for this reply. I understand. Let’s move on with the initial 71 and try to explore the possibility of expanding as you suggest.    I have two furth...
[20:48:30] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3051239 (10Eevans) Some have suggested that you can temporarily switch to `AllowAllAuth{enticator,orizer}`, bump the replication factor...
[21:11:47] <joal>	 Away for now ! Tomorrow a-team :)
[21:12:01] <milimetric>	 ooh, metrics meeting
[21:12:05] <milimetric>	 nite jo
[21:50:37] <wikibugs_>	 (03PS1) 10Milimetric: Update dataset location [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/339536 (https://phabricator.wikimedia.org/T125854)
[21:50:50] <wikibugs_>	 (03CR) 10Milimetric: [V: 032 C: 032] Update dataset location [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/339536 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric)
[21:51:23] <wikibugs_>	 (03CR) 10Milimetric: [V: 032 C: 032] Move datasets to analytics.wikimedia.org [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/337642 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric)
[21:51:32] <wikibugs_>	 (03CR) 10Milimetric: [V: 032 C: 032] Fix style on worldmap [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/339114 (owner: 10Milimetric)
[22:20:03] <wikibugs_>	 (03PS1) 10Milimetric: Remember piwik instrumentation [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/339571
[22:20:17] <wikibugs_>	 (03CR) 10Milimetric: [V: 032 C: 032] Remember piwik instrumentation [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/339571 (owner: 10Milimetric)
[22:48:51] <wikibugs_>	 06Analytics-Kanban, 13Patch-For-Review: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#3051562 (10Milimetric) I've deployed the dashboards we control and they're all looking good.  But I tracked what data they use and we have some data on analytics.wikimedia.org/datasets now that n...