[07:19:25] <wikibugs_>	 (03PS2) 10Amire80: Add new error types and abuse filter details printout [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/340982 (https://phabricator.wikimedia.org/T158834)
[08:41:48] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#3074562 (10MoritzMuehlenhoff) @Milimetric : When new people should be added to the privileged LDAP groups ("wmf" for WMF staff or "nda" for volunteers/researchers/external contractors), please op...
[08:48:06] <joal>	 Hi a-team
[08:48:41] <joal>	 My IRC screen server got issues this weekend, I can't backlog the chan
[08:49:39] <joal>	 elukey: would you mind letting me know if anything needs my attention?
[08:51:04] <elukey>	 joal: Morning! Nothing critical, the only thing that heppened was a disk issue with an1028 (there is an email about it)
[08:51:47] <joal>	 elukey: I had seen that from alert email
[08:52:26] <elukey>	 I am wondering if it is worth or not to move the journal node away from an1028
[08:54:34] <joal>	 elukey: it probably depends on what the plans are in term of disk replacement
[08:55:11] <joal>	 elukey: If disks can't be replaced soon, could be nice to move the journal out of an1028 and remove it entirely from the cluster for maintenance (maybe)
[09:02:51] <elukey>	 yep, I think that Chris will not be able to replace the disk before ~1 week
[09:11:04] <joal>	 elukey: Then let's chane it ? could also be good to have the procedure for swapping from one node to another documented soewhere :)
[09:14:12] <elukey>	 it is already documented in https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#JournalNodes
[09:14:17] <elukey>	 but it requires restarting the namenodes
[09:17:24] <joal>	 arf
[09:21:07] <joal>	 elukey: Just need u to monitor closely, but shouldn't be too much of a difficulty?
[09:25:54] <elukey>	 joal: I am thinking if we need to do it or not.. atm only the journalnode daemon is active on the host (so HDFS datanote and yarn resource managers are down and puppet disabled). The journalnode daemons seems to work fine and afaics turning down the datanode "shields" the other daemons to be affected
[09:26:09] <elukey>	 (we could also think to start yarn again on it)
[09:27:52] <elukey>	 (I am not in slacker mode, only trying to mess with namenodes metadata as much as possible :D)
[09:28:03] <elukey>	 *not to mess
[09:29:55] <elukey>	 as far as I can see the journalnodes are writing to /var/lib/hadoop/journal/analytics-hadoop
[09:30:00] <elukey>	 (not on hdfs)
[09:31:18] <mforns>	 hello a-team :]
[09:35:29] <elukey>	 helloooooo mforns!!
[09:35:32] <elukey>	 how are things??
[09:35:38] <mforns>	 hey elukey 
[09:35:47] <mforns>	 everything's good :], you?
[09:37:08] <elukey>	 not bad :)
[09:37:22] <elukey>	 how is the new (Temporary) life?
[09:38:18] <mforns>	 good! quite hot, doing yoga and chilling out so far :]
[09:38:31] <elukey>	 nice :)
[09:40:21] <mforns>	 elukey, this is my ops week, is there any unplanned from last week concerning ops that I can help? Haven't read the emails yet
[09:47:28] <elukey>	 not really, but IIRC joal has been working on a lot of oozie jobs failing recently (after the cdh upgrade) 
[09:47:37] <elukey>	 so he might need some info from you, but not sure
[09:49:00] <mforns>	 elukey, OK
[09:55:17] <joal>	 Hi mforns !
[09:55:27] <joal>	 Glad to hear everything good for you :)
[09:55:32] <mforns>	 hello joal! :]
[09:55:41] <mforns>	 yea, all good!
[09:55:44] <joal>	 mforns: oozie is back on track, so nothing really on my side
[09:55:50] <mforns>	 ok
[09:56:00] <mforns>	 what happened to it?
[09:57:30] <joal>	 mforns: After CDH upgrade, some errors happened
[09:57:57] <joal>	 1- Unplanned error for all jobs about memory for oozie launchers (lead to restart all jobs)
[09:58:20] <mforns>	 I see
[09:58:38] <joal>	 2- Restart of all jobs lead to errors of long-not-restarted jobs havings references to suppressed jars
[09:59:28] <joal>	 3- Oozie + HiveContext in Spark seems buggy with the versions we have (ottomata commented on a ticket with cloudera°
[09:59:52] <joal>	 So, given those errors:
[09:59:55] <mforns>	 oh
[10:00:07] <joal>	 1 - We modified all jobs to have the new oozie launcher conf param
[10:00:48] <joal>	 2 - We modified jobs with old jars to use new ones (1 non-backward compatible modif, the rest just version change)
[10:00:56] <mforns>	 aha
[10:01:14] <joal>	 3- We modified spark wikidata jobs to use sqlContext insteqd of hiveContext
[10:01:33] <joal>	 And now oozie has not complained through the weekend (YES !!°
[10:01:40] <mforns>	 awesome
[10:01:46] <mforns>	 thanks for the update
[10:02:04] <mforns>	 :]
[10:03:12] <joal>	 np mforns, it's been a bit busy last week :)
[10:03:21] <mforns>	 looks like it, eheheh
[10:26:16] <elukey>	 joal: I used a super hacky way to leave everything running on an1028 except the hdfs datanode, namely replacing /etc/init.d/hadoop-hdfs-datanode with exit 0
[10:26:28] <elukey>	 in this way yarn and the journal node daemon will keep running
[10:26:37] <elukey>	 together with puppet
[10:26:50] <elukey>	 and the datanode will be left sleeping
[10:27:01] <elukey>	 this is a horrible version of the hammer
[10:27:15] <elukey>	 but eventually with Debian we'll have all the magic of "systemctl mask"
[10:34:09] <joal>	 elukey: ok :)
[10:35:31] <elukey>	 joal: what do you think about reimaging an1040 to debian?
[10:54:00] * elukey proceeds silently..
[10:54:48] <joal>	 elukey: sorry missed your ping
[10:55:00] <joal>	 elukey: I don't really see how it affects ...
[10:55:19] <elukey>	 joal: all right I am going to do it! :)
[10:55:24] <elukey>	 so we'll have good data
[10:55:35] <elukey>	 maybe we could check if anything big is running on it
[10:55:43] <elukey>	 otherwise I'll stop the daemons and start the reimage
[10:56:59] <elukey>	 seems good
[10:59:19] <moritzm>	 are there any hadoop instances in labs?
[10:59:59] <elukey>	 moritzm: yep on the analytics project (one is already running debian jessie, it is an hadoop worker node)
[11:00:09] <elukey>	 cdh5-3.eqiad.wmflabs
[11:00:21] <moritzm>	 ok, does it currently use base::firewall? https://gerrit.wikimedia.org/r/341292 could potentially affect it, then
[11:00:44] <moritzm>	 I mean, if it doesn't currently use it, then base::firewall would affect the labs instance as well
[11:01:54] <elukey>	 I am not aware of any use of base:firewall in labs for our things.. but the cluster is only for test purposes, nothing super important
[11:02:02] <elukey>	 so we can fix it later on if we see any issue
[11:02:04] <elukey>	 not a blocker
[11:02:13] <moritzm>	 since the ferm rules for hadoop make extensive use of $ANALYTICS_NETWORKS to restrict access, which isn't available in labs
[11:02:13] <elukey>	 (not sure if I have answered your question)
[11:02:36] <moritzm>	 yeah, you have, but then my patch can't be used as-is
[11:03:29] <moritzm>	 I'll abandon, it's not really worth the effort to convert all the hadoop rules to be compliant with base::firewall in labs
[11:04:23] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1040.eqiad....
[11:07:22] <joal>	 elukey: just checked hadoop metrics on grafana: nodemanager heap is looking really better :)
[11:07:48] <joal>	 elukey: however, hdfs files count is not - do you know if we've restarted files deletion cron ?
[11:08:03] <elukey>	 joal: good point, let me check
[11:08:13] <elukey>	 in the meantime, an1040 is being reimaged \o/
[11:09:20] <joal>	 Aawesome elukey :)
[11:10:09] <elukey>	 # Puppet Name: hdfs-balancer
[11:10:10] <elukey>	 0 6 * * * /usr/local/bin/hdfs-balancer >> /var/log/hadoop-hdfs/balancer.log 2>&
[11:10:15] <joal>	 mforns: I think webrequest cron deletion stopped was related to banners data recomputation, then blocked by data deletion strategy - any news on that?
[11:10:35] <elukey>	 is it the hdfs-balancer cron?
[11:10:41] <joal>	 elukey: balancer and deletion are different :)
[11:10:47] <mforns>	 joal, I think Andrew switched it on again
[11:10:55] <elukey>	 yeah I suspected that
[11:12:06] <joal>	 mforns: ok, let's double check elukey, but maybe file growthv is just natural given jobs logs
[11:12:14] <mforns>	 joal, I don't think that banner data deletion/sanitizing strategy needs webrequest data no?
[11:12:33] <elukey>	 joal: do you know where the cron should be running
[11:12:34] <elukey>	 ?
[11:12:54] <joal>	 mforns: I think the reason for which we stopped deletion was to allow recomputation of december data using deployed oozie jobs
[11:13:02] <joal>	 But, those have not been started mforns
[11:13:12] <joal>	 elukey: trying to remeber
[11:14:48] <joal>	 elukey: I don't remeber, and I have not found it online
[11:14:55] <joal>	 elukey: Let's wait for ottomata
[11:15:18] <joal>	 mforns: do those oozie stuff ring bells, or not really?
[11:15:18] <mforns>	 joal, yes, but that is done, and even we have no sanitizing strategy yet, we should reset 60-day-purging for webrequest
[11:15:18] <mforns>	 joal, but I'm almost sure Andrew did that already
[11:16:01] <joal>	 mfournier: you said that is done - what is done?
[11:16:17] <joal>	 mforns: --^ sorry
[11:16:37] <joal>	 mfournier: My apologizes, was not adressed to you (tab error)
[11:17:00] <mforns>	 joal, reloading of december with new data format is done
[11:17:13] <elukey>	 joal: I didn't find it on puppet, not sure what it should do.. I thought it was part of the hdfs balancer script
[11:17:17] <mforns>	 joal, and also I think Andrew did reset the 60 day limit for webrequest purging
[11:18:04] <joal>	 mforns: ok
[11:18:35] <joal>	 mforns: should we start regular loading for banners then?
[11:18:56] <joal>	 elukey: mforns says cleanup has been started back by ottomata
[11:19:06] <mforns>	 joal, sure, I thought we already did... sorry my bad
[11:19:17] <elukey>	 mmmmmmmmmmmmmmmm
[11:19:22] <elukey>	 :D
[11:19:29] * mforns searches for gerrit change
[11:19:37] <joal>	 np mforns, I thought we did as well (realized that when fixing the mess last wekk)
[11:20:29] <joal>	 mforns: The info missing for me to stqrt them was: what should we pick as start date?
[11:20:43] <mforns>	 joal, I see
[11:20:52] <mforns>	 looking
[11:21:54] <mforns>	 joal, for the daily job, Feb 1st
[11:22:46] <mforns>	 joal, for the monthly, I don't remember if january was already compacted using the monthly job or not... jan has full data, but maybe we could start the monthly job at Jan 1st to be sure
[11:23:13] <joal>	 mforns: feb has full data as well: realtime provides it
[11:23:54] <mforns>	 joal, if I look at pivot, I can see a hole, and anyway, the daily job adds a couple fields no?
[11:23:59] <joal>	 mforns: But, when looking at pivot, it seems to be missing fields that are computed by oozie
[11:24:31] <joal>	 Ah mforns !
[11:24:35] <mforns>	 joal, from feb 1st to feb 7th, I see a hole, am I looking at it right>
[11:24:36] <mforns>	 ?
[11:24:46] <joal>	 mforns: I have not restarted my RT job after cluster upĝrade !!!
[11:25:12] <mforns>	 Ah! OK OK
[11:25:31] <mforns>	 then yes, monthly job: 1st Jan; daily job: 1st Feb
[11:25:44] <mforns>	 joal, no wait!
[11:25:55] <mforns>	 we can let the monthy job calculate feb as well, sorry
[11:26:28] <mforns>	 monthly job: 1st Jan; daily job: 1st Mar
[11:26:41] <mforns>	 no?
[11:27:40] <joal>	 mforns: I don't know :)
[11:28:14] <joal>	 mforns: restarting realtime job
[11:28:18] <mforns>	 joal, ok
[11:28:24] <joal>	 mforns: This needs to run anyway
[11:28:45] <joal>	 mforns: on monthly / daily jobs, best could be to check segments in druid
[11:29:05] <mforns>	 aha
[11:30:00] <mforns>	 joal, actually, I remember having run monthly for january
[11:30:47] <mforns>	 joal, elukey: that's Andrew's patch resetting webrequest purge to 62 days: https://gerrit.wikimedia.org/r/#/c/336458/
[11:31:59] <joal>	 awesome mforns, thanks for checking
[11:32:13] <mforns>	 np :]
[11:32:51] <mforns>	 and joal, I'm totally sure that the monthly job was run for jan, we can start monthly at 1st Feb and daily 1st Mar
[11:33:53] <mforns>	 joal, do you want me to change the default properties in the oozie job? or will we -D override those in the oozie call?
[11:35:13] <joal>	 mforns: just checked data on druid --> indeed jan got compacted monthly
[11:35:22] <mforns>	 aha
[11:35:35] <joal>	 mforns: I'm not sure which properties you're talking about :)
[11:35:45] <mforns>	 joal, xD
[11:35:59] <mforns>	 the properties file for the oozie job has the start and stop parameters
[11:36:14] <mforns>	 do you want me to change them? to make the start dates "official"?
[11:36:26] <mforns>	 although, the data starts at 2016-11-28...
[11:36:28] <joal>	 mforns: no need - We always use -D to override when we restart
[11:36:33] <mforns>	 ok ok
[11:37:45] <elukey>	 joal: ahh the script runs on an1027
[11:37:53] <elukey>	 nice to know
[11:37:56] <elukey>	 afaics it is enabled
[11:41:12] <elukey>	 Debian GNU/Linux 8 analytics1040 ttyS1
[11:42:07] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075096 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1040.eqiad.wmnet'] ```  and were **ALL** successful.
[11:44:21] <elukey>	 what the hell... all the hadoop partitions are not mounted? 
[11:44:24] <elukey>	 on an1040
[11:44:26] <elukey>	 sigh
[11:44:30] <elukey>	 it was too good to be true
[11:44:32] <joal>	 elukey: then as I said, file growth must be related to new jobs
[11:45:00] <joal>	 elukey: unfortunately, I was expecting at least some things to break
[11:45:18] <joal>	 mforns: D you mind restarting the oozie jobs for banners?
[11:45:34] <mforns>	 joal, sure will do
[11:45:38] <elukey>	 ahhhh no it is me the problem
[11:45:40] <elukey>	 as always
[11:45:45] <elukey>	 the partition creation is manual!
[11:45:49] <joal>	 elukey: it is usual :)
[11:46:01] <joal>	 elukey: I'm my own problem most of the time ;)
[11:46:12] <joal>	 Guys, away for some time to lunch
[11:46:16] <joal>	 see you in a biy
[12:05:11] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075124 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1040.eqiad....
[12:05:21] <elukey>	 of course I messed up with the partitions :D
[12:05:33] <elukey>	 re-installing
[12:11:37] <elukey>	 and I confirmed that the partman recipe doesn't work as expected
[12:11:49] <elukey>	 it stops before saying "Yes" to the new partitions
[12:11:49] <elukey>	 sigh
[12:23:07] * elukey lunch!
[12:38:52] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075210 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1040.eqiad.wmnet'] ```  and were **ALL** successful.
[12:59:01] <wikibugs_>	 (03CR) 10Mforns: "Hey, I can not see the changes that deploy to production by default. Am I missing something?" (031 comment) [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 (owner: 10Milimetric)
[13:15:01] <fdans>	 hiiii, sorry, been all day with my irc window closed >.<
[13:31:31] <wikibugs_>	 (03Draft1) 10Amire80: Add Google Spreadsheet editor [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/341315
[13:32:05] <wikibugs_>	 (03Abandoned) 10Amire80: Add Google Spreadsheet editor [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/341315 (owner: 10Amire80)
[13:51:58] <mforns>	 hi fdans :]
[13:54:22] <wikibugs_>	 (03CR) 10Milimetric: "mforns: I just removed -test from the hostname of each dashboard in config.yaml" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 (owner: 10Milimetric)
[13:55:22] <mforns>	 milimetric, oh yea, I was blind indeed
[13:56:32] <milimetric>	 cool, sorry I merged before you looked, but the dashiki extension merged so now we have nice pages on meta: https://meta.wikimedia.org/wiki/Config:Dashiki:VitalSigns
[13:56:55] <milimetric>	 mforns: interested in your thoughts about the other configs, like CategorizedMetrics, that are still left in raw-json
[13:57:15] <milimetric>	 I made this task: https://phabricator.wikimedia.org/T159269
[13:57:21] <mforns>	 milimetric, awesome!
[13:57:47] * mforns looks
[13:59:13] <elukey>	 so now I am really puzzled
[13:59:42] <elukey>	 the partman recipe to set up our worker nodes creates a swap lvm partition of 240GB :D
[14:00:36] <wikibugs_>	 10Analytics, 10Analytics-Cluster, 06Operations, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ladsgroup) Regarding GPU options. I just want to note that their drivers are propriety software and not open source (or partiall...
[14:00:49] <mforns>	 milimetric, but the patch is not merged yet no?
[14:01:05] <mforns>	 oh, you deployed without merging?
[14:01:31] <elukey>	 I need master ottomata :D
[14:02:24] <mforns>	 ops people are so cryptic...
[14:02:38] <elukey>	 come oooonnn
[14:02:50] <mforns>	 xD
[14:02:50] <elukey>	 :)
[14:03:49] <elukey>	 basically I am trying to figure out why partman (a lovely thing that configures for you disk setups with a very <sarcasm>understandable</sarcasm> syntax) is creating partitions in a weird way
[14:03:49] <milimetric>	 mforns: oh! oops, yeah, deployed without merging
[14:03:57] <mforns>	 ok ok
[14:04:00] <mforns>	 so...
[14:04:10] <elukey>	 ottomata: hhhhhhhhhhhhhhhhiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
[14:04:10] <wikibugs_>	 (03CR) 10Mforns: [V: 032 C: 032] Clean up Config [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 (owner: 10Milimetric)
[14:04:24] <ottomata>	 hiiiii
[14:04:40] <elukey>	 whenever you are coffeinated and ready to go I'd need to ask you some questions
[14:06:56] <ottomata>	 k eatin some cereal, checking emails...
[14:06:58] <ottomata>	 :)
[14:08:14] <elukey>	 ottomata: does it mean that I can start? :D
[14:09:07] <ottomata>	 haha, uhh, give me a few!
[14:09:07] <wikibugs_>	 10Analytics, 10Analytics-Dashiki: Clean up remaining Dashiki configs on meta - https://phabricator.wikimedia.org/T159269#3062327 (10mforns) @Milimetric   > Available projects should be phased out (it already partially is) in favor of just parsing the site matrix. Totally agree.  > Annotations, out-of-service,...
[14:09:14] <mforns>	 milimetric, ^
[14:09:30] <elukey>	 ottomata: ahhh okok! even one hour, not really urgent
[14:13:17] <wikibugs_>	 10Analytics, 10Analytics-Dashiki: Clean up remaining Dashiki configs on meta - https://phabricator.wikimedia.org/T159269#3075565 (10Milimetric) >> Annotations, out-of-service, and metrics should be moved to a new sub-domain, maybe Config:DashikiMeta:. > Agree as well, but... Does this mean that we have to go t...
[14:17:51] <wikibugs_>	 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#3075598 (10Milimetric) Could we just change the phrasing of this task to say that?  So the description would be:  To use Pivot, Piwik, and other Analytics tools that require an LDAP login, please...
[14:29:42] <elukey>	 going to start writing..
[14:29:52] <elukey>	 so today I reimaged analytics1040 to debian
[14:30:26] <elukey>	 all good but when I tried to create the journalnode partition I got an error that the LVM VGS was full, no more space to allocate the 10G partition
[14:30:58] <elukey>	 I checked and the partman recipe created, as requested, a 30GB root partition and the rest was allocated for SWAP
[14:31:08] <elukey>	 in Trusty nodes, the swap is 1G
[14:31:18] <elukey>	 on analytics1040, it is 230GB
[14:31:42] <elukey>	 so I checked and we have tons of space in the VGS on a regular worker ndoe
[14:31:45] <elukey>	 *node
[14:31:52] <elukey>	 root@analytics1041:/home/elukey# vgs -o +vg_free_count,vg_extent_count VG               #PV #LV #SN Attr   VSize   VFree   Free  #Ext analytics1041-vg   1   3   0 wz--n- 232.09g 193.22g 49465 59415
[14:31:59] <elukey>	 horrible format sigh
[14:32:08] <elukey>	 root@analytics1041:/home/elukey# vgs -o +vg_free_count,vg_extent_count
[14:32:16] <elukey>	 VG               #PV #LV #SN Attr   VSize   VFree   Free  #Ext
[14:32:22] <elukey>	 analytics1041-vg   1   3   0 wz--n- 232.09g 193.22g 49465 59415
[14:32:27] <elukey>	 this --^ is a trusty node
[14:32:37] <elukey>	 that shows 193GB of free space
[14:32:41] <elukey>	 am I reading it wrong?
[14:33:24] <elukey>	 I checked pvdisplay and lvdisplay, it seems that we have space remaining in the flex bays
[14:34:08] <elukey>	 I also filed https://gerrit.wikimedia.org/r/#/c/341318/3/modules/install_server/files/autoinstall/partman/analytics-flex.cfg becuase the Debian installer stops before creating the partitions
[14:34:12] <elukey>	 asking for a Yes or No
[14:52:56] <wikibugs_>	 (03PS2) 10Joal: [WIP] Add oozie jobs for mw history denormalized [analytics/refinery] - 10https://gerrit.wikimedia.org/r/341030
[14:54:20] <elukey>	 in the meantime, I merged https://gerrit.wikimedia.org/r/#/c/341318/
[14:55:19] <wikibugs_>	 (03PS20) 10Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548)
[14:59:59] <elukey>	 milimetric: Hello! I am reading your comments for the LDAP access for Pivot, but I am wondering why the one-task-per-user solution is not ok for you (or better you'd prefer another way)
[15:01:46] <milimetric>	 elukey: one-task-per-user is great, it's just that if I point someone to a wiki page that describes that, they almost always come back with questions.  If I point them at a task in phabricator, they tend to follow the directions there better.  And it's easy for me to find the task and link it
[15:01:52] <ottomata>	 elukey:  ok!
[15:02:08] <ottomata>	 so, iirc, i could never get partman to work properly for JBOD for hadoop
[15:02:20] <ottomata>	 especially with special cases for the flex bays
[15:02:22] <milimetric>	 I totally agree with you and moritz that you shouldn't handle those requests as comments, that's what I was saying, did that come across different?
[15:02:46] <ottomata>	 as for free space in flex bay vgs
[15:02:59] <ottomata>	 that sounds fine i think.  i don't fully remember, but i think i would have done something like that
[15:03:07] <ottomata>	 if there wasn't a reason to allocate all of the space
[15:03:17] <ottomata>	 i like to leave some in the VGs so that we can allocate as necessary later
[15:03:21] <ottomata>	 its easier to add space than remove it
[15:04:10] <elukey>	 ahhh okok
[15:04:27] <elukey>	 makes sense, maybe I'll document it so everybody will know it :)
[15:04:32] <elukey>	 it was a bit strange 
[15:04:47] <elukey>	 milimetric: nono sorry I thought you wanted to handle the future cases in a different way, nevermind!
[15:04:56] <ottomata>	 also, elukey, its good that you are testing this partman repartitioning stuff for jessie...but i would think hopefully we wouldn't have to wipe the non root partitions
[15:05:59] <elukey>	 ottomata: yep yep this would be the second step, I wanted to make sure that the recipe was working fine
[15:06:29] <ottomata>	 ok cool
[15:07:17] <elukey>	 ottomata: weird thing is, the swap partition part in the partman recipe seems to get the whole space available
[15:07:21] <elukey>	 rather than 1GB
[15:07:37] <elukey>	 there must be a little/subtle issue with partman
[15:07:42] <elukey>	 will keep investigating, thanks
[15:07:43] <ottomata>	 hm, yeah that's weird
[15:07:54] <ottomata>	 i know how much you love working with partman
[15:07:56] <ottomata>	 go get em!
[15:10:36] <elukey>	 ottomata: last thing - today I used a horrible trick to force only the hdfs datanode daemon to stop on an1028
[15:10:49] <elukey>	 namely replacing all the content of its init.d file with exit 0
[15:11:04] <elukey>	 I realized that yarn and journal node daemons were not affected by the disk failure
[15:11:09] <elukey>	 so I haven't moved the journal node
[15:11:17] <elukey>	 but we can do it
[15:12:53] <ottomata>	  hmm
[15:16:02] <elukey>	 atm an1028 works fine with puppet enabled etc..
[15:16:08] <elukey>	 it was basically a "systemctl mask"
[15:16:10] <elukey>	 horrible
[15:19:49] <ottomata>	 hm
[15:19:58] <ottomata>	 so chris is out all this week?
[15:21:14] <elukey>	 IIRC yes, but today he might be in the DC
[15:21:15] <elukey>	 not sure
[15:21:33] <ottomata>	 oh just today?  ideally we can just swap the disk and auto rebuild
[15:21:44] <ottomata>	 buuuut, ja if a week, it might be good to actively move the journalnode
[15:21:49] <ottomata>	 and, hey, maybe good practice for us anyway?
[15:22:23] <elukey>	 yes it might but I saw the procedure and I wanted to ask you first, since I don't like to mess with the Master nodes :)
[15:22:30] <elukey>	 maybe we could do it tomorrow together?
[15:22:37] <elukey>	 if nothing changes
[15:23:06] <elukey>	 (we also need to apply my script to generate the fstab before doing anything)
[15:25:17] <ottomata>	 k ya let's do tomorrow
[15:25:25] <ottomata>	 we can practice in labs, ja?
[15:25:30] <elukey>	 sure
[15:25:31] <ottomata>	 k
[15:25:32] <elukey>	 another thing
[15:25:39] <elukey>	 do we need a 1g swap partition?
[15:25:42] <elukey>	 on worker nodes
[15:26:20] <elukey>	 plus a root bigger than 30GB would be good
[15:26:22] <elukey>	 like 60
[15:26:31] <elukey>	 we have space in there :)
[15:26:52] <ottomata>	 i like bigger roots too, but i was told by other ops folks: what's the point, you should have good log rotation, we default to 30ish G roots blabla
[15:27:09] <ottomata>	 1g swap?  no idea really.
[15:27:16] <ottomata>	 i think swap probably doesn't really help us much
[15:27:23] <elukey>	 I'd vote to nuke it
[15:27:27] <ottomata>	 +1
[15:27:28] <elukey>	 go for 60GB of root
[15:27:37] <ottomata>	 i'm into it
[15:27:40] <ottomata>	 we do what we want.
[15:27:43] <ottomata>	 :)
[15:27:46] <elukey>	 yesss
[15:28:03] <elukey>	 and also maybe 20GB journal by default? Rather than adding it via script
[15:33:36] <ottomata>	 if you have learned the proper partman incantation, i bow to your wisdom :)
[15:34:35] <wikibugs_>	 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3075984 (10Aklapper) Wondering if this is related (I should check once this task is resolved): The account https://phabricator.wikimedia.org/p/EddieGP/ ha...
[15:42:57] <elukey>	 ottomata: the mountpoint might be weird, since it is /var/etc..
[15:43:24] <ottomata>	 ho right
[15:43:30] <ottomata>	 those dirs are created by puppet
[15:43:36] <ottomata>	  /var/lib/hadoop/journal
[15:43:39] <elukey>	 yep.. 
[15:43:42] <ottomata>	  /var/lib/hadoop might be created by the package
[15:43:45] <ottomata>	 but in either case, ja
[15:43:45] <ottomata>	 hm
[15:43:54] <ottomata>	 woudl it be possible to create the partition, but not mount it?
[15:43:55] <ottomata>	 with partman?
[15:44:04] <elukey>	 the main weird thing is that partman seems to allocate all the free space rather than leaving something free
[15:44:13] <elukey>	 not sure about it
[15:48:32] <ottomata>	 ya tha'ts pretty weird
[15:48:37] <ottomata>	 in the swap partitoin, eh?
[15:53:36] <elukey>	 yes exactly..
[15:53:42] <elukey>	 not sure if it is a new behavior
[15:53:52] <elukey>	 or an error in the partman recipe
[15:53:57] <elukey>	 but it worked in the past right?
[15:54:17] <elukey>	 I mean, the current nodes have been installed with the current analytics-flex recipe right?
[15:59:32] <wikibugs_>	 10Analytics, 10Analytics-Cluster, 06Operations, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3076068 (10Ottomata) Q:  Does T159165 mean that we no longer need to get a new stat box with a GPU?  Or is this ticket still valid?  I'm ab...
[15:59:41] <ottomata>	 yeah
[15:59:48] <ottomata>	 prettys sure that worked in the past
[16:01:17] <nuria>	 elukey, milimetric : standdupp
[16:01:37] <wikibugs_>	 (03CR) 10Nuria: Clean up Config (031 comment) [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 (owner: 10Milimetric)
[16:02:21] <nuria>	 joal: standduppp
[16:03:18] <joal>	 trying to join
[16:03:36] <joal>	 rebooting
[16:13:23] <wikibugs_>	 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Change userAgent field to user_agent_map in EventCapsule - https://phabricator.wikimedia.org/T153207#3076096 (10Nuria) a:05Nuria>03fdans
[16:29:33] <joal>	 ottomata: Was asking if you needed me for event 
[16:29:39] <joal>	 bus meeting
[16:30:06] <ottomata>	 oh
[16:30:10] <ottomata>	 naw joal you can skip
[16:30:19] <joal>	 k thanks :)
[16:30:30] <joal>	 with a bad connection, what point anyway  ottomata ;)
[16:31:18] <ottomata>	 :)
[16:32:28] <joal>	 Bye team, sorry for bad connection
[16:36:20] <elukey>	 ottomata: https://lwn.net/Articles/690079/ - really nice
[16:36:44] <elukey>	 it is a very clear view of swap, I was maybe too aggressive in removing it.. not sure
[16:37:24] <elukey>	 1G seems really low, but maybe something more could be usful
[16:37:26] <elukey>	 *useful
[16:37:36] <elukey>	 but it is always a tradeoff
[16:45:22] <joal>	 mforns: still with us?
[16:45:30] <mforns>	 joal, yes
[16:45:39] <mforns>	 s'up
[16:45:50] <wikibugs_>	 10Analytics, 10Analytics-Cluster, 06Operations, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3076221 (10Halfak) I think that having a GPU in a stats machine for modeling work will be critical for the research team and any other mode...
[16:45:52] <mforns>	 :]
[16:45:54] <joal>	 heyq, is it on purpose you've launched banners jobs in default queue?
[16:46:09] <mforns>	 joal, O.o
[16:46:23] <mforns>	 oops
[16:46:32] <mforns>	 ok, let me fix that
[16:47:43] <joal>	 mforns: no big deal, don't kill currently running one maybe?
[16:48:36] <elukey>	 ottomata: I think that in Debian partman assigns all the free space left to the last LVM volume :(
[16:48:44] <elukey>	 and there seems to be no option to avoid it
[16:49:23] <wikibugs_>	 10Analytics, 10Recommendation-API: productionize recommendation vectors - https://phabricator.wikimedia.org/T158973#3076230 (10Fjalapeno)
[16:54:59] <joal>	 mforns: it also seems you've relaunched jobs for January, no?
[16:55:49] <mforns>	 joal, january? I hope not...
[16:56:11] <mforns>	 hue says 2017-2
[16:56:25] <joal>	 mforns: it's weird, look at the title of the currently running job in yarn
[16:57:17] <mforns>	 joal, it says 2017-2 no?
[16:57:38] <mforns>	 oozie:launcher:T=hive:W=banner_activity-druid-monthly-wf-2017-2
[16:58:59] <joal>	 mforns: was looking at druid job, but it looks my eyes have fooled me
[16:59:47] <joal>	 mforns: sorry for making this fals alarm - I'm gonna stop doing operations for today I think :)
[17:00:04] <mforns>	 joal, don't worry, thanks for spotting the queue_name issue
[17:01:05] <mforns>	 will wait for the daily job to finish and restart the daily coord
[17:03:17] <ottomata>	 elukey:  :/
[17:03:32] <ottomata>	 i guess if we do 60G for root, and just let it to do the rest for journalnode partition
[17:03:34] <ottomata>	 it'll be ok 
[17:06:16] <elukey>	 ottomata: I was more for the opposite way around, namely root with all the remaining space, but maybe if we find a way to set up the journalnode partion it might be an option
[17:06:26] <elukey>	 and then we can add a step to resize it after installing
[17:07:17] <ottomata>	 elukey: , or, just add a fake space partitoin
[17:07:20] <ottomata>	 that we won't ever use
[17:07:26] <ottomata>	 and just can delete it if/when we need the space
[17:07:47] <elukey>	 yeah
[17:08:20] <elukey>	 the only weird thing is that it might require a mountpoint
[17:08:30] <elukey>	 I saw somebody on the internetz using /tmp/hack
[17:20:14] <elukey>	 https://www.mediawiki.org/wiki/Architecture_committee/2017-03-01 - last news is really interesting :)
[17:30:38] <nuria>	 elukey: looking
[17:31:04] <nuria>	 elukey: ahhh yess, for got to share that
[17:31:17] <nuria>	 elukey: because we have not had staff
[17:31:36] <elukey>	 it was super sneaky but I am really happy about it!
[17:31:50] <elukey>	 I think we need to say thank you to Victoria :)
[17:33:32] <nuria>	 elukey: sorry, i should have shared that earlier as it s been on teh works for a while
[17:37:05] <elukey>	 nuria: nono not your fault! This news should have needed a wmf-all email probably :)
[17:58:38] <nuria>	 elukey: i think the $$$ need to be moved before announcing how big is the team, that is wip
[18:00:01] <elukey>	 ahhh
[18:03:50] <wikibugs_>	 10Analytics: Refactor monthly banner oozie job to use already indexed daily data - https://phabricator.wikimedia.org/T159727#3076363 (10JAllemandou)
[18:04:38] <joal>	 ottomata: Should we in meeting?
[18:04:46] <joal>	 Or am I in an alternate place?
[18:18:06] <wikibugs_>	 10Analytics, 10EventBus, 06Services (watching): EventBus logs don't show up in logstash - https://phabricator.wikimedia.org/T153029#3076457 (10mobrovac)
[18:25:31] * elukey off!
[18:25:47] <elukey>	 ottomata: I left an1040 with hadoop daemons masked and icinga silenced
[18:25:54] <elukey>	 will restart tomorrow on partman
[18:27:04] <ottomata>	 oook!
[18:30:19] <wikibugs_>	 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 10Reading Epics (Trending Edits), and 3 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3076581 (10Jdlrobson)
[18:40:39] <milimetric>	 ashgrigas: is it ok if I post your dropbox links publicly on the repository where we're doing the prototype?
[18:40:50] <milimetric>	 I mean, they're public here too, but just making sure
[18:42:44] <milimetric>	 I can remove it if you want: https://github.com/milimetric/wikistats-prototype
[18:46:04] <ashgrigas>	 milimetric sure
[18:46:11] <ashgrigas>	 thats fine!
[18:46:25] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3065770 (10Ottomata)
[18:52:14] <wikibugs_>	 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 10Reading Epics (Trending Edits), and 3 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3076688 (10mobrovac) Will try to get it scheduled for this week. Apologies for the delay.
[18:55:22] <wikibugs_>	 10Analytics, 06Operations, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3076697 (10Ottomata) p:05Triage>03Low a:03Ottomata I'll take this on, low priority though.  Remind me about it if you get fidgety! :)
[18:56:59] <wikibugs_>	 10Analytics, 06Operations, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3066309 (10MoritzMuehlenhoff) jessie has jq 1.4, so this would also be fixed once stat1002 is migrated to jessie.
[18:57:45] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3076717 (10Pchelolo) After some testing of driver-librdkafka compatibility, here's the deal: 1. Currently we are using `nod...
[19:07:41] <wikibugs_>	 10Analytics, 06Operations, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3076750 (10Krinkle) @MoritzMuehlenhoff Thanks. Is there a ticket for that?  I've transferred my data to terbium for post-processing for the time being because the python/ua-parser package...
[19:41:27] <wikibugs_>	 10Analytics, 10Analytics-Cluster: Enable hyperthreading on analytics100[12] - https://phabricator.wikimedia.org/T159742#3076950 (10Ottomata)
[20:08:25] <Niharika>	 Is this the right place to request some eventlogging data? Pertaining to https://meta.wikimedia.org/wiki/Schema:CookieBlock Revision ID 16241436
[20:09:38] <milimetric>	 hi Niharika, I can help
[20:09:52] <Niharika>	 Hey milimetric. 
[20:09:55] <Niharika>	 Awesome.
[20:09:57] <milimetric>	 are you looking to query the data ad-hoc or make reports?
[20:10:20] <Niharika>	 milimetric: Just querying ad-hoc. 
[20:10:30] <milimetric>	 ok, Niharika do you have access to stat1003.eqiad.wmnet?
[20:10:40] <milimetric>	 EventLogging data is replicated there
[20:10:48] <Niharika>	 Let me see. 
[20:11:15] <Niharika>	 milimetric: I don't think so. 
[20:11:26] <Niharika>	 Asks for a password if I try to ssh. 
[20:11:46] <milimetric>	 ok, and is your ssh setup so you can access other machines on eqiad?
[20:12:04] <milimetric>	 like, using the bastion and forwarding your key and all that?
[20:12:22] <Niharika>	 milimetric: Yep. 
[20:13:56] <milimetric>	 ok, Niharika, then you'll need to request access.  So you file a task with Ops-Access-Requests and you ask for the groups "researchers and statistics-users"
[20:14:01] <milimetric>	 more info can be found here: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups
[20:14:31] <milimetric>	 you'll have to wait though because they have a waiting period on all access requests.  In the meantime, is there anything you need urgently Niharika ?
[20:15:50] <milimetric>	 there are 681 records in that table (the version of CookieBlock you need), so I can check whatever you're looking for pretty easily probably
[20:22:07] <milimetric>	 Niharika: did I lose you?
[20:22:31] <wikibugs_>	 10Analytics, 10EventBus, 06Services (watching): EventBus logs don't show up in logstash - https://phabricator.wikimedia.org/T153029#3077255 (10Pchelolo) The https://phabricator.wikimedia.org/T150106#2777178 should've been resolved by https://github.com/wikimedia/change-propagation/pull/133
[20:23:00] <Niharika>	 milimetric: Sorry, I'll Be right back. 
[20:23:12] <milimetric>	 np, just making sure I wasn't being confusing
[20:31:05] <Niharika>	 milimetric: Sorry again. I'm filing a request now. It's not urgent, so I can wait till I get access. Thanks!
[20:32:13] <milimetric>	 k, Niharika when you get access what you'll need is to mysql into analytics-store.eqiad.wmnet and use the "log" database, and select count(*) from CookieBlock_16241436;
[20:32:27] <Niharika>	 Noted.
[20:45:59] <nuria>	 ottomata (or joal) yt?
[20:46:12] <ottomata>	 ya
[20:46:14] <ottomata>	 nuria:  hey
[20:46:32] <nuria>	 ottomata: one question that might be very easy
[20:46:43] <nuria>	 ottomata: 
[20:46:52] <nuria>	 I am trying to insert data into a table on my db
[20:46:56] <nuria>	 https://www.irccloud.com/pastebin/P5pfPULR/
[20:47:29] <nuria>	 ottomata: 
[20:47:31] <nuria>	 like:
[20:47:32] <nuria>	 hive -f ./insert_test.hql -d destination_table=nuria.last_access_uniques_daily_asiacell -d year=2017 -d month=01 -d day=10
[20:47:48] <nuria>	 ottomata: so destination table is nuria.last_access_uniques_daily_asiacell 
[20:48:22] <nuria>	 ottomata: but i get an error about "moving" files
[20:48:24] <nuria>	 ottomata: Failed with exception Unable to move source hdfs://analytics-hadoop/tmp/hive-staging_hive_2017-03-06_20-43-00_938_6844714293285814446-1/-ext-10000 to destination hdfs://analytics-hadoop/wmf/data/wmf/last_access_uniques/daily/year=2017/month=01/day=10
[20:49:27] <ottomata>	 nuria:  do show create table
[20:49:28] <ottomata>	 on your table
[20:49:32] <ottomata>	 what is the 'location'?
[20:49:39] <ottomata>	 it looksl ike it is pointing into /wmf/data/wmf/last_access_uniques
[20:49:46] <ottomata>	 which is the prod table, which your user does not have perms to write to
[20:50:06] <ottomata>	 looks like you copy/pasted the create table statement, but didn't alter the location
[20:50:38] <nuria>	 ottomata: 
[20:50:42] <nuria>	 https://www.irccloud.com/pastebin/gB3Up40M/
[20:50:52] <nuria>	 ottomata: argh, it is!!!!!
[20:50:56] <ottomata>	 :)
[20:51:08] <nuria>	 ottomata: thank youuuu
[20:53:52] <ottomata>	 yw
[20:54:56] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3065770 (10mobrovac) There's no need to have downtime at all for the upgrade - we have multiple hosts for these services an...
[20:55:11] <wikibugs_>	 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3077357 (10Ottomata) Ha!  OOPS I knew I would make a mistake when cleaning up old packages!  I accidentally removed almost all CDH pack...
[20:55:49] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077359 (10Pchelolo) >>! In T159379#3077355, @mobrovac wrote: > There's no need to have downtime at all for the upgrade - w...
[20:56:12] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077373 (10mobrovac)
[22:53:27] <wikibugs_>	 10Analytics, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077953 (10Ottomata) By adding a new version of librdkafka to our apt repo, it has the chance that it might also be install...