[06:51:14] <wikibugs_>	 Analytics-Kanban, User-Elukey: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303#3561161 (Marostegui) And here we are again: ``` root@dbstore1002:/srv# df -hT /srv/ Filesystem            Type  Size  Used Avail Use% Mounted on /dev/mapper/tank-data xfs   6.4T  5.9T  556G  92% /srv ```
[07:45:30] <wikibugs_>	 Analytics, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561225 (fgiunchedi) Resolved>Open Reopening, rdkafka metrics for eventstreams have been out of control for a couple of days  ``` graphite1001:~$ du -hcs /var/lib/carbon/whisper/eventstreams/rdkaf...
[07:45:50] <wikibugs_>	 Analytics, Operations: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561228 (fgiunchedi)
[07:46:02] <wikibugs_>	 Analytics, Operations, monitoring: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3106537 (fgiunchedi)
[08:01:51] <wikibugs_>	 Analytics, Operations, monitoring, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561239 (elukey) The other step to take would be to limit the amount of data that we store for librdkafka, because with so many clients it is impossible to keep track o...
[08:25:31] <wikibugs_>	 Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561268 (elukey) ``` MariaDB [(none)]> show create database log; +----------+----------------------------------------------------------------+ | Database | Create Dat...
[08:37:55] <joal>	 Hi elukey, anything to discuss this morning?
[08:39:07] <elukey>	 joal: o/ - I am about to start to restart yarn/hdfs daemons for the jvm updates
[08:39:21] <joal>	 okay, I'll watch oozie
[08:40:38] <elukey>	 joal: archiva is also going to be restarted by Moritz as FYI
[08:40:50] <joal>	 elukey: thanks for letting me know
[08:41:12] <joal>	 elukey: I wonder if those restarts don't trigger errors for us, it'll be interesting to check :)
[08:42:04] <moritzm>	 archiva restarted, let me know if you spot any problems
[08:43:39] <joal>	 moritzm: hi ! Thanks for that, will check
[09:37:55] <wikibugs_>	 (PS3) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[09:39:12] <elukey>	 nodemanagers almost restarted, after that will proceed with hdfs
[09:39:39] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[09:39:55] <wikibugs_>	 Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561391 (jcrespo) Are you sure you want utf8 and not utf8mb4? You may be losing extended area characters (e.g. emojis).
[09:41:23] <wikibugs_>	 Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561394 (Marostegui) >>! In T170952#3561391, @jcrespo wrote: > Are you sure you want utf8 and not utf8mb4? You may be losing extended area characters (e.g. emojis)....
[09:46:51] <elukey>	 started with hdfs, with 2 hosts at a time and 120s of sleep between each batch
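The batching described above (2 hosts at a time, 120s between batches) can be sketched roughly as follows. This is only an illustration, not the tooling actually used for the restart: the `restart` callable, the function names, and any host names passed in are hypothetical.

```python
import time
from typing import Callable, Iterable, List


def batches(hosts: Iterable[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive groups of at most `size`."""
    hosts = list(hosts)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts the daemon on one host
    batch_size: int = 2,                 # "2 hosts at a time"
    pause_s: float = 120.0,              # "120s of sleep between each batch"
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts batch by batch, pausing between batches (not after the last)."""
    done: List[str] = []
    groups = batches(hosts, batch_size)
    for idx, group in enumerate(groups):
        for host in group:
            restart(host)
            done.append(host)
        if idx < len(groups) - 1:
            sleep(pause_s)
    return done
```

Passing `sleep` as a parameter makes it possible to dry-run the sequence without actually waiting.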
[09:49:05] <wikibugs_>	 Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561399 (elukey) Yep I made it equal to db1047, but I am open to suggestions :)
[09:52:23] <wikibugs_>	 (PS4) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[09:54:16] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[10:28:54] <wikibugs_>	 (PS5) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[10:28:56] <wikibugs_>	 (PS6) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[10:28:59] <wikibugs_>	 (PS5) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[10:31:15] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[10:31:17] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101) (owner: Joal)
[10:31:37] <wikibugs_>	 (CR) jerkins-bot: [V: -1] [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550) (owner: Joal)
[10:37:27] <wikibugs_>	 Analytics, Contributors-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3561513 (Halibutt) @Neil_P._Quinn_WMF; @Nuria, any successes?
[10:45:55] <elukey>	 joal: all worker nodes restarted, now we'd need to do the usual hive/oozie/etc.. and finally the master nodes
[10:46:08] <joal>	 ok elukey 
[10:46:25] <joal>	 I'm still here, then I'll take my usual break
[10:48:46] <elukey>	 joal: I think we can do it after the break
[10:48:51] <joal>	 k
[10:53:02] <elukey>	 all right going to lunch!
[10:53:29] <joal>	 Bye elukey :)
[10:54:41] <mforns>	 helooooo
[10:55:04] <joal>	 Hi mforns :)
[10:55:12] <joal>	 mforns: you missed elukey by not much ;)
[10:55:21] <mforns>	 hi joal :]
[10:55:26] <mforns>	 I know...
[10:55:37] <mforns>	 will ping him later
[11:26:13] <wikibugs_>	 (PS6) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[11:26:23] <elukey>	 I am back!
[11:26:58] <elukey>	 I am going to restart/failover the master nodes now, should be available in ~30mins
[11:32:23] <wikibugs_>	 (PS7) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[11:32:25] <wikibugs_>	 (PS6) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[11:33:22] <elukey>	 restarted daemons on an1002
[11:35:53] <wikibugs_>	 (CR) Mforns: "LGTM in general! Awesome that you did it with so few code! Some comments inline." (6 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: Joal)
[11:37:53] <elukey>	 failed over hdfs to 1002
[11:44:18] <elukey>	 aaand done, 1001 back as master
[11:44:40] <joal>	 looks awesomely good elukey :)
[11:49:09] <elukey>	 oozie not complaining! \o/
[12:04:21] <elukey>	 joal: speaking about oozie, I'd stop jobs and let everything drain to allow a better restart of hive/oozie daemons
[12:04:32] <icinga-wm>	 PROBLEM - HDFS missing blocks on analytics1001 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [5.0]
[12:04:55] <elukey>	 oh boy
[12:05:01] <wikibugs_>	 (CR) Fdans: "Just a couple comments generally agreeing with mforns and one nit, but it's looking great!" (4 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: Joal)
[12:05:01] <elukey>	 this is the new alarm
[12:05:08] <joal>	 sounds good to me elukey - Please pause and resume, don't stop and restart :)
[12:05:13] <elukey>	 yep yep
[12:07:30] <elukey>	 weird I can only see corrupt blocks
[12:07:31] <elukey>	 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=39&fullscreen&orgId=1&from=now-3h&to=now
[12:08:10] <elukey>	 ahhaah the title of the alarms is flipped
[12:08:17] <elukey>	 in puppet, going to fix it now
[12:09:25] <elukey>	 joal: I'd execute hdfs fsck -list-corruptfileblocks
[12:10:29] <elukey>	 mmm not sure if the option is valid
[12:11:27] <elukey>	 ah yes seems to be there
[12:13:00] <elukey>	 elukey@analytics1001:~$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
[12:13:03] <elukey>	 Connecting to namenode via http://analytics1001.eqiad.wmnet:50070
[12:13:05] <elukey>	 The filesystem under path '/' has 0 CORRUPT files
[12:14:07] <joal>	 Mwarf :(
[12:16:28] <elukey>	 ok fixed the alarms description in puppet
[12:18:49] <elukey>	 lemme see what a regular fsck reports
[12:23:05] <elukey>	 cluster is healthy, nothing reported
[12:23:25] <joal>	 that's weird that this alarm fired then
[12:24:21] <elukey>	 I am re-running it with -openforwrite
[12:24:25] <elukey>	 just in case
[12:27:17] <elukey>	 all good again
[12:27:20] <elukey>	 the metric is Hadoop.NameNode.analytics1001_eqiad_wmnet_9980.Hadoop.NameNode.FSNamesystem.CorruptBlocks.mean
[12:30:03] <elukey>	 ah no
[12:30:06] <elukey>	 elukey@analytics1001:~$ sudo -u hdfs hdfs dfsadmin -report
[12:30:06] <elukey>	 Configured Capacity: 1937161485090816 (1.72 PB)
[12:30:06] <elukey>	 Present Capacity: 1915101328101666 (1.70 PB)
[12:30:06] <elukey>	 DFS Remaining: 538274591326004 (489.56 TB)
[12:30:06] <elukey>	 DFS Used: 1376826736775662 (1.22 PB)
[12:30:08] <elukey>	 DFS Used%: 71.89%
[12:30:11] <elukey>	 Under replicated blocks: 0
[12:30:13] <elukey>	 Blocks with corrupt replicas: 22
[12:30:16] <elukey>	 Missing blocks: 0
[12:30:18] <elukey>	 Missing blocks (with replication factor 1): 0
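As a cross-check of the report pasted above, here is a small sketch that parses those lines and recomputes the derived figures. It assumes, as the numbers themselves suggest, that DFS Used% is DFS Used divided by Present Capacity and that PB means 2^50 bytes; the function names are made up for illustration.

```python
import re
from typing import Dict, Union

# Report lines as pasted in the channel.
REPORT = """\
Configured Capacity: 1937161485090816 (1.72 PB)
Present Capacity: 1915101328101666 (1.70 PB)
DFS Remaining: 538274591326004 (489.56 TB)
DFS Used: 1376826736775662 (1.22 PB)
DFS Used%: 71.89%
Under replicated blocks: 0
Blocks with corrupt replicas: 22
Missing blocks: 0
Missing blocks (with replication factor 1): 0
"""

Number = Union[int, float]


def parse_report(text: str) -> Dict[str, Number]:
    """Map each 'Key: value' line to the leading number of its value."""
    fields: Dict[str, Number] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        match = re.match(r"\d+(\.\d+)?", value.strip())
        if match:
            num = match.group(0)
            fields[key.strip()] = float(num) if "." in num else int(num)
    return fields


def used_percent(fields: Dict[str, Number]) -> float:
    """Recompute DFS Used% from the raw byte counts (assumed formula)."""
    return 100.0 * fields["DFS Used"] / fields["Present Capacity"]


def to_pib(n_bytes: Number) -> float:
    """Convert bytes to units of 2**50 bytes."""
    return n_bytes / 2**50


fields = parse_report(REPORT)
print(round(used_percent(fields), 2))                    # compare with the reported 71.89%
print(round(to_pib(fields["Configured Capacity"]), 2))   # compare with the reported 1.72 PB
print(fields["Blocks with corrupt replicas"])            # the counter behind the alarm
```

The only non-zero health counter in this paste is "Blocks with corrupt replicas", which matches the metric named earlier in the conversation rather than the missing-blocks count.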
[12:32:51] <elukey>	 so analytics1055 is down now because of a broken disk, it is maybe the source of the confusion
[12:34:45] <elukey>	 for some reason it might have been picked up now after the restart
[12:40:12] <elukey>	 so from https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=39&fullscreen&orgId=1&from=now-3h&to=now it seems that the corrupt blocks reported by HDFS were rising in the sequence used to restart the HDFS master node daemons
[12:40:22] <elukey>	 first 1002 then 1001 after minutes
[12:41:04] <elukey>	 I am pretty sure this is a weirdness of restart + an1055 down
[12:41:42] <icinga-wm>	 RECOVERY - HDFS missing blocks on analytics1001 is OK: OK: Less than 60.00% above the threshold [2.0]
[12:43:01] <joal>	 elukey: if we only get alerts after restarts, that's not really efficient :)
[12:43:14] <elukey>	 ah recovery
[12:44:08] <elukey>	 no this is only icinga switching names after my puppet patch
[12:44:35] <elukey>	 joal: I agree, but it must be a special weird case for dfs report etc..
[12:44:41] <elukey>	 or maybe a little bug
[12:46:04] <elukey>	 !log suspend oozie jobs from Hue to allow an easier restart of oozie/hive daemons
[12:46:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:46:30] <wikibugs_>	 (PS7) Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[12:46:51] <icinga-wm>	 PROBLEM - HDFS corrupt blocks on analytics1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0]
[12:50:23] <ottomata>	 o/ :)
[12:51:54] <joal>	 Taking my break a-team
[12:52:03] <mforns>	 o/
[12:55:33] <wikibugs_>	 Analytics, Wikimedia-Stream: Stop tracking EventStreams client lag in graphite - https://phabricator.wikimedia.org/T174435#3561871 (Ottomata)
[13:13:42] <elukey>	 bearloga: o/ - I'd need to restart hive server/metastore during the next couple of hours for jvm updates, I saw that you are running a job so if you could stop after it I'd be super happy :)
[13:22:31] <wikibugs_>	 Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481107 (Krenair) Was the maintain-views step not performed? ```MariaDB [hiwikiversity_p]> show tables; Empty set (0.00 sec)```
[13:23:02] <ottomata>	 elukey:  are you restarting whole cluster?
[13:23:05] <ottomata>	 just curious
[13:23:26] <elukey>	 ottomata: o/ - sorry just seen the pages :(
[13:23:39] <elukey>	 so yes I've restarted the whole thing
[13:23:42] <elukey>	 including masters
[13:23:59] <elukey>	 and after restarting the HDFS master daemons corrupt blocks rose 
[13:24:07] <elukey>	 but fsck doesn't report anything
[13:24:17] <elukey>	 meanwhile dfsadmin -report does
[13:24:23] <elukey>	 that is super weird
[13:24:36] <elukey>	 analytics1055 is down for maintenance due to a broken disk
[13:24:48] <elukey>	 so I suspect that it might be related
[13:26:26] <ottomata>	 hmm, seems unlikely, since it's been down so long, i would expect for blocks on an55 to be replicated elsewhere
[13:26:31] <ottomata>	 and if they weren't, i wouldn't think they'd be corrupt
[13:26:33] <ottomata>	 but missing
[13:26:40] <ottomata>	 buuut, maybe it has something to do with restarts
[13:26:48] <ottomata>	 some writes get funky during a restart
[13:26:54] <ottomata>	 and when it comes back online it has to fix them
[13:27:29] <ottomata>	 elukey:  mostly asking so I can resolve T172018 :)
[13:27:29] <stashbot>	 T172018: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018
[13:28:13] <wikibugs_>	 Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561980 (Reedy) >>! In T171829#3553943, @Marostegui wrote: > The blocker is fixed and so is this one too: > ``` > mysql:root@localhost [hiwikiversity_p]> show t...
[13:29:57] <wikibugs_>	 Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561986 (Marostegui) >>! In T168765#3561969, @Krenair wrote: > Was the maintain-views step not completely performed? > ```MariaDB [hiwikiversity_p]> show tables...
[13:30:09] <elukey>	 ottomata: ahahah yes you can resolve it
[13:31:02] <wikibugs_>	 Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561990 (Marostegui) >>! In T168765#3561986, @Marostegui wrote: >>>! In T168765#3561969, @Krenair wrote: >> Was the maintain-views step not completely performed...
[13:31:46] <elukey>	 ottomata: anyhow, I have also checked fsck for open files for write, all reported as healthy
[13:31:53] <elukey>	 this is why I was a bit puzzled
[13:32:38] <wikibugs_>	 Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3561991 (Ottomata) Done!  Looking ok!  You should really use `base::service_unit` in puppet to manage your systemd servi...
[13:32:52] <ottomata>	 aye
[13:33:55] <wikibugs_>	 Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018#3561992 (Ottomata) @EBernhardson, Luca just restarted the cluster.  Can you tell if the change we merged fixes this?
[13:38:25] <elukey>	 ottomata: as FYI I restarted kafka1012->1014 too today
[13:39:14] <ottomata>	 k cool
[13:39:38] <ottomata>	 elukey:  KSQL looks really fun
[13:39:43] <ottomata>	 i'm pretty eager to try it!
[13:40:49] <elukey>	 it looks cool indeed!
[13:42:53] <ottomata>	 ha!  they support JSON before Avro?!
[13:42:55] <ottomata>	 KSQL currently supports formats:
[13:42:56] <ottomata>	 	•	DELIMITED (e.g. CSV)
[13:42:56] <ottomata>	 	•	JSON
[13:42:56] <ottomata>	 Support for Apache Avro is expected soon.
[14:07:55] <elukey>	 ok oozie restarted on an1003, just need to do hive when bearloga's hive jobs are completed
[14:08:01] <elukey>	 for the moment our jobs are stopped
[14:20:11] <wikibugs_>	 Analytics-Kanban, User-Elukey: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3562106 (mforns) Hi all!  I'm seeing insertions into mariaDB for Popups events. About 3000 every 4 minutes (12 evt per second approx). ``` 2017-08-29 13:54:59,108 [22699] (Mai...
[14:38:04] <joal>	 elukey: I think bearND|afk jobs are run from reportupdater, and given the dates it's currently working, I think it'll not be done for a while
[14:38:31] <wikibugs_>	 Analytics-Kanban, User-Elukey: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3562155 (elukey) I'd prefer to wait a bit before closing, space consumption went a bit up afaics:  {F9205031}  It was of course expected but since dbstore1002 is still not out...
[14:38:35] <joal>	 elukey: I think a failed job is rerun automatically in reportupdater, mforns, can you confirm?
[14:40:49] <elukey>	 yeah just restarted hive daemons as soon as I saw a new query popping up
[14:42:42] <elukey>	 re-enabling camus and then oozie jobs
[14:47:16] <elukey>	 Cluster restart completed
[14:52:44] <joal>	 Thanks a lot elukey :)
[14:56:00] <wikibugs_>	 Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3483663 (chasemp) >>! In T168765#3561990, @Marostegui wrote: >>>! In T168765#3561986, @Marostegui wrote: >>>>! In T168765#3561969, @Krenair wrote: >>> Was the m...
[14:57:27] <mforns>	 joal, elukey, reading about RU
[14:58:26] <mforns>	 joal, elukey, yes, RU should rerun the missing data
[14:58:35] <joal>	 great :)
[14:58:57] <joal>	 thinking of that, I think I'll ask bearND|afk if he can update his jobs to run in nice queue :)
[14:59:54] <wikibugs_>	 Analytics-Kanban, User-Elukey: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303#3562196 (elukey) Back to acceptable levels:  ``` /dev/mapper/tank-data  6.4T  5.5T  946G  86% /srv ```
[15:02:12] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Use daily granularity for 1-month time ranges - https://phabricator.wikimedia.org/T173372#3562201 (fdans)
[15:02:31] <ottomata>	 fdans:  yoohoo
[15:16:53] <elukey>	 joal: +1 for the nice queue
[15:28:38] <wikibugs_>	 Analytics-Kanban, User-Elukey: Archive PageContentSaveComplete in hdfs while we continue collecting data - https://phabricator.wikimedia.org/T170720#3440380 (Ottomata) Before we close this, we should move the tables that Nuria created in Hive to a new 'archive' database.
[15:30:17] <wikibugs_>	 Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3448710 (Ottomata) Hm, for now I think this is good, so that the slave matches the master.  If we want everything to be a different charset, we should change it as a...
[15:37:36] <ottomata>	 elukey:  quick cumin q: can I use it to tail logs?
[15:37:39] <ottomata>	 on multiple hosts?
[15:43:23] <bearND>	 joal: which jobs are you referring to? I'm missing context here.
[15:43:37] <joal>	 ottomata: yarn logs --applicationId application_1504006918778_0236 | less
[15:44:28] <joal>	 bearND: The jobs that are currently running under your username on the cluster: they use a lot of resources in the user queue, and since they are automated (or so I guess), having them run in the nice queue could be, well, nice :)
[15:45:04] <elukey>	 ottomata: I don't think so
[15:45:08] <bearND>	 joal: on what machine? I don't know anything about them.
[15:45:21] <bearND>	 on scb cluster?
[15:45:29] <joal>	 bearND: hadoop cluster
[15:45:49] <bearND>	 joal: like a hive query? I've never run a hive query
[15:46:09] <joal>	 bearND: I think I have made a very wrong mistake of names - I apologize
[15:46:21] <joal>	 bearND - I was after bearloga ...
[15:46:26] * joal hides in shame
[15:46:32] <bearND>	 joal: oh, i see.
[15:47:08] <bearND>	 joal: Yeah, that is a different person. He started after me but it is easy to type the wrong one in chat.
[15:47:20] <joal>	 Please excuse me again
[15:47:24] <joal>	 bearND: --^
[15:47:46] <joal>	 bearloga: those comments on reportupdater jobs in nice queue were actually for you :)
[15:47:47] <bearND>	 joal: no worries. I'm just glad you get to ping the correct person :)
[15:47:57] <joal>	 So am I bearND !
[15:50:45] <bearloga>	 joal: fwiw we run our reportupdater scripts with nice and ionice
[15:51:07] <joal>	 bearloga: It's about using the nice queue in hadoop :)
[15:53:18] <bearloga>	 joal: ah. i don't know anything about that but if there's something we can do to make our daily runs of scripts & queries even friendlier, i'm open to recommendations :)
[15:55:23] <elukey>	 bearloga: o/
[15:55:41] <bearloga>	 elukey: hi! o/
[15:55:58] <joal>	 bearloga: We added a new queue in the cluster for this purpose - Let me see how we do that in hive
[15:56:01] <elukey>	 ottomata: whenever you want we can chat about certpy, or do it tomorrow during your morning
[15:57:56] <joal>	 bearloga: In the hive script, before executing the query: set mapred.job.queue.name=nice;
[15:58:31] <joal>	 This queue has less priority than the user one, so users get results faster - but given the cluster activity globally, jobs will finish fast enough I think :)
[15:58:37] <joal>	 bearloga: --^
[16:02:22] <bearloga>	 joal: so if foo.hql is the query being executed (for example), does `set mapred.job.queue.name=nice` go inside foo.hql, or is it mapred.job.queue.name set with `hive -hiveconf` or it doesn't matter?
[16:04:56] <joal>	 bearloga: I think it doesn't matter - I tried both successfully (actually, 2 dashes for hiveconf: --hiveconf mapred.job.queue.name=nice)
[16:06:07] <bearloga>	 joal: got it, thank you!
[16:06:14] <joal>	 thank you bearloga :)
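The two options discussed above (a `set` statement inside the query file, or `--hiveconf` on the command line) could be wrapped in a small helper along these lines. This is only a sketch: the helper names are invented here, `foo.hql` is the example file name from the conversation, and `-f` is the usual Hive CLI flag for running a query file.

```python
from typing import List


def hive_command(query_file: str, queue: str = "nice") -> List[str]:
    """Build a hive invocation that sets the job queue via --hiveconf."""
    return [
        "hive",
        "--hiveconf", f"mapred.job.queue.name={queue}",
        "-f", query_file,
    ]


def with_queue_prefix(hql: str, queue: str = "nice") -> str:
    """Alternative: prepend the `set` statement to the query text itself."""
    return f"set mapred.job.queue.name={queue};\n{hql}"


# Example: the argument list could be handed to subprocess.run() on a host with hive installed.
print(" ".join(hive_command("foo.hql")))
print(with_queue_prefix("SELECT 1;"))
```

Either form targets the same setting, so which one to use is mostly a matter of whether the queue choice should live with the query or with the script that launches it.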
[16:08:36] <bearloga>	 joal: is the new nice queue documented anywhere on wikitech or in a phab ticket? (just so I have something to link to in the patch notes)
[16:09:52] <ottomata>	 elukey:  i got 20 mins til next meeting, want a certpy overview?
[16:10:00] <elukey>	 ottomata: sure
[16:10:06] <ottomata>	 bc
[16:10:07] <joal>	 bearloga: we have a  phab ticket: T156841
[16:10:07] <stashbot>	 T156841: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841
[16:10:18] <joal>	 bearloga: We should document it as well, but not yet done
[16:14:34] <bearloga>	 joal: thanks!
[16:18:52] <joal>	 bearloga: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Run_long_queries_in_a_screen_session_and_in_the_nice_queue
[16:18:55] <joal>	 as well :)
[16:25:08] <bearloga>	 joal: *thumbs up*
[16:36:11] <wikibugs_>	 Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018#3562540 (EBernhardson) `hdfs getconf` now reports 32, and spinning up a spark repl with 8 cores per executor is able to get executors and run code. Lo...
[16:41:43] <wikibugs_>	 (CR) Mforns: "Hi Nettrom, sorry for the delay." (3 comments) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/373373 (owner: Nettrom)
[16:52:46] <bearloga>	 joal: could you please take a look at https://gerrit.wikimedia.org/r/#/c/374569/ and +1 if I did it correctly? just want to make 100% sure before chelsyx and I merge it
[16:56:51] <wikibugs_>	 Analytics-Kanban, RESTBase-API, WMF-Legal, Patch-For-Review, Services (done): License for pageview data - https://phabricator.wikimedia.org/T170602#3562642 (mforns) Hi @Pchelolo :] I saw that https://wikimedia.org/api/rest_v1/#/ now has the updated licenses. Can I move this to done in our boa...
[16:57:27] <wikibugs_>	 Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562647 (RobH)
[16:57:46] <wikibugs_>	 Analytics-Kanban, RESTBase-API, WMF-Legal, Patch-For-Review, Services (done): License for pageview data - https://phabricator.wikimedia.org/T170602#3562693 (Pchelolo) Open>Resolved All done here indeed, resolving the ticket.
[17:25:41] <wikibugs_>	 Analytics-Kanban, Discovery, Discovery-Analysis: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3562852 (Ottomata) Ok, met with Chase and Luca, and we decided that Option 2 is the way to go.  I'll make a subtask...
[17:30:37] <wikibugs_>	 Analytics, Operations: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (Ottomata)
[17:30:47] <ottomata>	 chasemp: https://phabricator.wikimedia.org/T174465
[17:30:57] <ottomata>	 does my description match what we just talked about?
[17:31:03] * elukey off !
[17:31:08] <ottomata>	 laters!
[17:34:14] <Amir1>	 Hey, I'm running query against ApiAction data for wikidata but it seems it only records core actions and not wikibase actions
[17:34:25] <Amir1>	 I'm talking about this: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/ApiAction
[17:34:58] <Amir1>	 When I run it on action = 'query' it works but when I run it on action = 'wbgetentities' it doesn't
[17:35:06] <Amir1>	 (gives me null results)
[17:35:35] <Amir1>	 I probably need to file a phab card but was checking if it's something obvious that I'm missing
[17:40:51] <wikibugs_>	 Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562940 (RobH)
[17:45:56] <wikibugs_>	 Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562979 (RobH) So I realized that: d-i     partman-auto/choose_recipe      es  was in the recipe, and isn't needed since it doe...
[18:21:07] <wikibugs_>	 Analytics: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563202 (Ladsgroup)
[18:21:18] <Amir1>	 Filed: https://phabricator.wikimedia.org/T174474
[18:21:39] <wikibugs_>	 Analytics, MediaWiki-extensions-WikibaseRepository, Wikidata: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563216 (Ladsgroup)
[18:23:18] <wikibugs_>	 Analytics, MediaWiki-extensions-WikibaseRepository, Wikidata: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563221 (Ladsgroup)
[18:33:05] <joal>	 ottomata: Based on https://groups.google.com/forum/#!topic/druid-development/wYHQznyonW0, can we try to add the "druid.indexer.task.chathandler.type=announce" to overlord and middlemanager conf?
[18:37:08] <ottomata>	 looking
[18:39:25] <ottomata>	 joal:  i'm not so sure that is a current config propertly
[18:39:28] <ottomata>	 property
[18:39:34] <joal>	 hm
[18:39:39] <ottomata>	 looking though
[18:40:05] <joal>	 I can't think of something else (in /etc/druid/[overlord|middlemanager]/runtime.properties)
[18:40:51] <ottomata>	 i can't find docs for it
[18:41:13] <joal>	 Arf :(
[18:42:50] <ottomata>	 but joal in that thread you linked to, the error was something about time stamp not being parsed
[18:43:00] <ottomata>	 were you able to get rid of that warning we saw?
[18:43:02] <joal>	 ottomata: from https://groups.google.com/forum/#!topic/druid-user/rCmhhJ67iw4, seems to be in runtime.properties, but no real doc indeed
[18:43:04] <ottomata>	 about partitiong string?
[18:43:16] <ottomata>	 yeah, joal it must be an old deprecated property maybe?
[18:43:17] <joal>	 ottomata: trying again to triple check
[18:43:19] <ottomata>	 not sure though
[18:43:31] <joal>	 also ottomata, I double checked task creation, looks good
[18:46:45] * imandes waves to everyone
[18:48:05] <joal>	 ottomata: still the same warning - investigating in that direction - thanks :)
[18:48:15] <imandes>	 Hi @ottomata, seems that I don't get any events from Eventstreams. Everything worked fine a few days ago (I am writing a bot that feeds on it). I checked on my computer and on: https://codepen.io/ottomata/pen/VKNyEw/?editors=0010.
[18:48:44] <ottomata>	 imandes:  hi!
[18:48:47] <ottomata>	 interesting!
[18:48:48] <ottomata>	 looking
[19:08:56] <wikibugs_>	 Analytics: Add redirect and pagelinks tables for partition repair in sqoop job for mediawiki history - https://phabricator.wikimedia.org/T174484#3563442 (mforns)
[19:17:51] <wikibugs_>	 Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563493 (Mattflaschen-WMF) Thanks for running this check.  Flow looks good.  ``` File: /home/reedy/git/mediawiki/core/extens...
[19:18:17] <wikibugs_>	 Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563503 (Mattflaschen-WMF)
[19:22:19] <wikibugs_>	 Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563526 (Reedy) `func_get_args` is very false positive/naive check.  Upstream issue filed, with various examples provided at...
[19:30:40] <joal>	 ottomata: you were RIGHT ! As usual :)
[19:31:14] <joal>	 ottomata: Tranquility doc on how to represent data for it to work is really not good :)
[19:33:13] <ottomata>	 joal?!
[19:33:15] <ottomata>	 you got it working?
[19:33:15] <ottomata>	 !
[19:33:23] <joal>	 I DID IT :)
[19:33:32] <ottomata>	 !!!!!
[19:33:35] <ottomata>	 WHAAA AAMZING
[19:33:57] <joal>	 Not that difficult once you know the hidden glitches
[19:34:39] <joal>	 Now I'm interested to see if my spark job goes well through the night and more :)
[19:35:02] <joal>	 ottomata: Do you have a minute for a scala question?
[19:35:42] <ottomata>	 joal:  in a few i will
[19:35:48] <ottomata>	 petr and i broke eventstreams yesterday
[19:35:51] <ottomata>	 well, we broke mirror maker
[19:36:00] <ottomata>	 meaning no events were making it to analytics cluster
[19:36:01] <ottomata>	 :o
[19:36:04] <ottomata>	 fixing now
[19:36:09] <ottomata>	 !log restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/
[19:36:21] <joal>	 wow - 7 years for a mirror ... What about a mirror-maker?!!
[19:36:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:36:48] <ottomata>	 haha
[19:38:24] <ottomata>	 elukey:  ahh i need to restart all kafka brokers
[19:38:29] <ottomata>	 did you want me to do jvm upgrade?
[19:38:31] <ottomata>	 how do I do it?
[19:38:33] <ottomata>	 moritzm: ?
[19:39:55] <moritzm>	 java is already upgraded on the kafka brokers
[19:40:24] <moritzm>	 I'm not sure which servers were already restarted by Luca, didn't follow that closely today, but it's likely in SAL
[19:40:48] <ottomata>	 ok great
[19:40:50] <ottomata>	 thanks
[19:58:14] <joal>	 ottomata: My spark job tells me you've restarted some brokers :)
[19:58:54] <ottomata>	 ohyeah
[20:00:18] <joal>	 ottomata: Thanks for having pushed me to investigate in format more :)
[20:00:29] <ottomata>	 yaaaa! so glad that works
[20:00:33] <ottomata>	 sucks that that wasn't anywhere in logs
[20:00:43] <ottomata>	 well, more about the error
[20:02:51] <wikibugs_>	 (PS1) Mforns: Add pagelinks and redirect to refinery-drop-mediawiki-snapshots [analytics/refinery] - https://gerrit.wikimedia.org/r/374623 (https://phabricator.wikimedia.org/T174484)
[20:24:16] <wikibugs_>	 Analytics, Wikimedia-Stream: Alerts for common/import EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3563830 (Ottomata)
[20:28:59] <wikibugs_>	 Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3563854 (RobH) Ok, so putting the recipe info to ignore noswap requires:  partman-basicfilesystems partman-basicfilesystems/no_...
[20:29:44] <HaeB>	 Amir1: (re https://phabricator.wikimedia.org/T174474 ) i guess bd808 may have some knowledge about ApiAction
[20:30:08] <Amir1>	 HaeB: thanks
[20:32:21] <wikibugs_>	 (CR) Ottomata: [C: 1] Add pagelinks and redirect to refinery-drop-mediawiki-snapshots [analytics/refinery] - https://gerrit.wikimedia.org/r/374623 (https://phabricator.wikimedia.org/T174484) (owner: Mforns)
[20:32:41] <wikibugs_>	 Analytics, Wikimedia-Stream: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3563894 (Ottomata)
[20:40:53] <vacio>	 ottomata: eventstreams seems to be back to normal, thanks! :)
[20:44:03] <wikibugs_>	 (PS7) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[20:44:59] <wikibugs_>	 (Abandoned) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101) (owner: Joal)
[20:52:28] <wikibugs_>	 Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3563960 (Krinkle) @Ottomata Thanks a lot for doing that. Looking at the Navigation Timing metrics in Graphite through Gr...
[20:52:49] <wikibugs_>	 Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3563963 (Krinkle)
[21:25:02] <wikibugs_>	 (PS8) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[21:25:05] <joal>	 ottomata: if you have a minute, do you mind having a look --^ ?
[21:57:16] <wikibugs_>	 (PS2) Nettrom: Add page creation configuration and queries [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/373373
[22:30:28] <wikibugs_>	 Analytics, Wikimedia-Stream, Wikimedia-Incident: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3564376 (greg)
[22:55:21] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564431 (fdans)
[23:01:13] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564463 (fdans)
[23:05:58] <wikibugs_>	 Analytics, DBA: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3564480 (demon) Been about a month and a half. Bump?
[23:08:09] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Productionise line graph - https://phabricator.wikimedia.org/T171766#3564488 (fdans)
[23:08:11] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564489 (fdans)
[23:08:13] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (2/4) - Wiki selector - https://phabricator.wikimedia.org/T170936#3564490 (fdans)
[23:08:15] <wikibugs_>	 Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (1/4) - Dashboard and general UI - https://phabricator.wikimedia.org/T170933#3564491 (fdans)
[23:08:23] <wikibugs_>	 Analytics-Kanban, Reading-analysis: Final Vetting of Family Wide unique devices data - https://phabricator.wikimedia.org/T169550#3401448 (ksmith) @Tbayer : Do you have a status update on this? Thanks!
[23:20:18] <icinga-wm>	 RECOVERY - HDFS corrupt blocks on analytics1001 is OK: OK: Less than 60.00% above the threshold [2.0]