[07:09:11] hello yall
[07:10:36] hola!
[08:40:06] Analytics: Mediarequests Examples Giving Errors - https://phabricator.wikimedia.org/T241863 (fdans) Open→Resolved a: fdans Just updated it with working examples
[08:43:54] Analytics, Analytics-Wikistats: Wikistats New Feature - bot edits / new articles - https://phabricator.wikimedia.org/T241922 (fdans) @FocalPoint Thanks for asking! Have you checked out the edits metric? https://stats.wikimedia.org/#/all-projects/contributing/edits/normal|line|2-year|editor_type~anonymou...
[08:46:24] Analytics: Change link in wikis footer so that they point to stats.wikimedia.org - https://phabricator.wikimedia.org/T244961 (fdans) p: Triage→High
[08:47:36] Analytics, Analytics-Kanban, Patch-For-Review: Update the AMD ROCm prometheus metric exporter to take into account changes to rocm-smi - https://phabricator.wikimedia.org/T236007 (elukey) Open→Resolved
[08:48:14] Analytics, Analytics-Kanban, User-Elukey: Add request_bytes as measure in Druid's webrequest_sampled_128 - https://phabricator.wikimedia.org/T240681 (elukey) Open→Resolved
[08:49:01] Analytics: Statement of work for new designer in wikistats - https://phabricator.wikimedia.org/T223478 (fdans) This document is now in the Analytics Drive
[08:49:15] Analytics, Analytics-Kanban: Statement of work for new designer in wikistats - https://phabricator.wikimedia.org/T223478 (fdans) a: fdans
[08:51:11] Analytics, Operations, serviceops, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) ` elukey@ganeti2001:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 34 preferred ovs=False, ssh_po...
[08:51:18] Analytics, Analytics-Kanban, Patch-For-Review: Analytics datasets should be under a free license - https://phabricator.wikimedia.org/T244685 (fdans)
[08:58:41] Analytics, Operations, serviceops, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (MoritzMuehlenhoff) Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) uses a single CPU (and hardly uses it) and...
[09:00:55] Analytics, Operations, serviceops, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) >>! In T244719#5875487, @MoritzMuehlenhoff wrote: > Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) u...
[09:01:04] Analytics, Operations, serviceops, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey)
[09:09:41] Analytics, User-Elukey: Redesign architecture of irc-recentchanges on top of Kafka - https://phabricator.wikimedia.org/T234234 (Krenair) >>! In T234234#5875356, @elukey wrote: > 4) About the low usage of irc.wikimedia.org - yes I agree that few bots are using it (~300) Am I going mad or isn't that actua...
[09:14:45] Analytics, User-Elukey: Redesign architecture of irc-recentchanges on top of Kafka - https://phabricator.wikimedia.org/T234234 (elukey) >>! In T234234#5875512, @Krenair wrote: >>>! In T234234#5875356, @elukey wrote: >> 4) About the low usage of irc.wikimedia.org - yes I agree that few bots are using it (...
[09:18:00] Analytics, Operations, serviceops, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_B --link public --memory 8 --disk 40 --vcpus 4 irc2001.wikimedia.org START...
[10:05:43] Analytics: Create a Kerberos identity for foks - https://phabricator.wikimedia.org/T244773 (elukey) ` elukey@krb1001:~$ sudo manage_principals.py create foks --email_address=jsutherland@wikimedia.org Principal successfully created. Make sure to update data.yaml in Puppet. Successfully sent email to jsutherla...
[10:11:22] Analytics: Create a Kerberos identity for foks - https://phabricator.wikimedia.org/T244773 (elukey) Open→Resolved a: elukey
[10:36:36] FYI, there's a disk space icinga warning for notebook1004 for /srv
[10:40:33] sigh
[10:40:37] thanks!
[10:40:40] will check in a sec
[11:27:20] Analytics, Operations, serviceops, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) Ok current status: * irc2001.wikimedia.org is running * puppet is set to role::system::spare, waiting for a new role/cluster combinati...
[11:27:52] * elukey lunch!
[11:28:27] fdans: qq - are you doing something with oozie + mediarequest?
[11:30:27] anyway, nothing horribly urgent, will check later :)
[11:38:41] elukey: yes! backfilling of daily top mediarequests
[13:22:26] Hi team - I just joined as kids are asleep - There is something wrong with oozie
[13:26:13] the oozie lib referenced in jobs is different from the one present on HDFS
[13:26:27] /user/oozie/share/lib/lib_20200204183338
[13:26:46] in hdfs
[13:26:53] while jobs expect: /user/oozie/share/lib/lib_20191216144244
[13:27:15] I really have no clue why it's happening now
[13:36:07] !log Kill-restart webrequest bundle to see if it mitigates the error
[13:36:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:38:37] Restarting the jobs doesn't mitigate the problem - Waiting for elukey to try to shake oozie
[13:48:05] ottomata: sorry to rush you but I could do with some ops help
[13:48:14] oh ok, what's up?
[13:48:31] oozie is flipped - can't find its lib
[13:48:43] I have checked the folder, and indeed the expected one is not there
[13:49:05] the oozie sharelib?
[13:49:08] yes
[13:49:27] I have no clue why it started today
[13:49:48] the last lib was created 2020-02-04
[13:49:49] oozie admin -shareliblist seems ok?
[13:49:49] ok
[13:49:50] no?
[13:50:55] it does, but jobs fail complaining about files not present, and it seems to be because of folders not being present
[13:51:01] hello, just got back :)
[13:51:06] heya elukey
[13:51:17] hmmm
[13:51:25] there is one lib folder... but it says 20200204...
[13:51:26] Can we try to restart oozie?
[13:51:30] why is there a recent one from this month?
[13:51:46] sorry, don't get it --^
[13:51:55] there is a single oozie sharelib in hdfs
[13:51:59] created on feb 4
[13:51:59] yes
[13:52:03] I have seen that
[13:52:10] did we make a new sharelib this month?
[13:52:16] I can't recall!
[13:52:25] I actually don't think so
[13:52:41] yeah i wouldn't expect us to
[13:52:43] I restarted oozie for the spark-env changes, wondering if it made any change
[13:52:46] we usually only do that if we upgrade something
[13:52:47] when was it?
[13:53:00] elukey last tues feb 4
[13:53:24] that matches the date on the lib dir
[13:53:48] right
[13:54:02] heh, that's what i'm saying, a new one was created then
[13:54:10] we know because the sharelib dirs are named after their creation date
[13:54:15] lib_20200204183338
[13:54:20] yes yes
[13:54:25] This is weird
[13:54:33] but does oozie create a new sharelib when restarted?
[13:54:36] no
[13:54:44] puppet will do it if
[13:54:45] unless => '/usr/bin/hdfs dfs -ls /user/oozie | grep -q /user/oozie/share',
[13:55:00] yes I meant if puppet does it after oozie is restarted
[13:55:10] mmmm
[13:55:40] no it shouldn't
[13:56:11] I am not saying it should, but everything points in that direction
[13:56:19] also folks, from SAL, we restarted oozie on 2020-02-03, but not 04
[13:56:47] hm yeah
[13:56:48] File does not exist: hdfs://analytics-hadoop/user/oozie/share/lib/lib_20191216144244/hive2/libfb303-0.9.3.jar
[13:56:49] nothing seems to have happened particularly on 04
[13:56:51] very weird.
[13:58:14] and there is nothing on the 4th
[13:58:45] joal can I just rerun one of these webrequest load jobs to try stuff?
[13:58:47] i want to repro
[13:58:54] then i'm going to run sharelib update and see if anything changes
[13:59:10] sure ottomata, I tried to restart the webrequest bundle a while back, didn't work
[13:59:21] Do you want me to do it again? Or is a restart enough?
[13:59:46] hm
[13:59:49] i guess yeah hm
[13:59:54] does the job need a restart if the sharelib changes?
[14:00:05] ok
[14:00:09] i'll just run the sharelibupdate
[14:00:11] then you restart the bundle
[14:00:12] one sec
[14:00:17] sure
[14:00:58] hm, FYI something new
[14:01:04] This request requires HTTP authentication.
[14:01:05] when
[14:01:17] in the future we'll need to change how we update the sharelib
[14:01:23] whoa
[14:01:45] e.g. when we auto-add the spark2 sharelib after a spark upgrade, we use the REST api to update the sharelib, because the CLI had been flaky
[14:01:47] trying the CLI...
[14:02:06] yup
[14:02:09] [ShareLib update status]
[14:02:09] sharelibDirOld = hdfs://analytics-hadoop/user/oozie/share/lib/lib_20191216144244
[14:02:09] host = http://an-coord1001.eqiad.wmnet:11000/oozie
[14:02:09] sharelibDirNew = hdfs://analytics-hadoop/user/oozie/share/lib/lib_20200204183338
[14:02:09] status = Successful
[14:02:11] hm
[14:02:15] we might not even need a job restart?
[14:02:21] going to just rerun an individual hour
[14:03:31] one thing that I noticed now is that /usr/bin/hdfs dfs -ls /user/oozie | grep -q /user/oozie/share may run even if there is a temporary issue with the hdfs command, no?
[14:03:40] like a network timeout etc..
[14:03:50] yeah it might...
[14:03:54] hm
[14:04:44] maybe we could create a script that execs every time, doing the "unless" check in bash with some safeguards
[14:05:02] heh, we don't have syslogs from feb 4 to find out
[14:05:14] WAT?
[14:05:45] oh, we only have a week of them joal, that's all
[14:05:50] Ah ok
[14:06:05] looks better
[14:06:06] https://hue.wikimedia.org/oozie/list_oozie_workflow/0011413-200203112045319-oozie-oozi-W/?coordinator_job_id=0006212-200110143753542-oozie-oozi-C&bundle_job_id=0006211-200110143753542-oozie-oozi-B
[14:06:10] it's weird though that the thing only bites us now, isn't it?
[14:06:14] very weird
[14:06:31] ok, back in the game
[14:06:34] elukey: this command should only ever run the first time oozie is installed
[14:06:44] I'm gonna manually restart failed jobs
[14:07:13] ottomata: so you suggest to just move it into oozie's docs, rather than keeping it in puppet?
[14:07:21] yeah maybe
[14:07:34] could be an option yes
[14:07:58] although we also do the db create stuff too
[14:08:01] kinda nice to have it all done
[14:08:42] we could wrap those into some bash scripts, with more guards
[14:08:46] yeah
[14:09:02] I'd vote for this option, and then if it doesn't work we remove them
[14:09:07] hm we could also just add another command to the unless
[14:09:15] unless oozie admin -shareliblist | grep ...
[14:09:50] wonder what that retval is with no sharelib.
[14:10:30] I mentioned the bash script since we could use set -e and have it abort if the retval is not zero
[14:10:46] ottomata: you unintentionally restarted the coordinator I killed :)
[14:10:55] ?
[14:10:55] Will kill it anew, and rerun the new one
[14:11:04] I killed https://hue.wikimedia.org/oozie/list_oozie_coordinator/0006212-200110143753542-oozie-oozi-C/
[14:11:22] and restarted https://hue.wikimedia.org/oozie/list_oozie_bundle/0011398-200203112045319-oozie-oozi-B
[14:11:23] oh
[14:11:27] you killed the whole thing
[14:11:33] ok sorry i just went from an oozie hue url
[14:11:34] ok
[14:11:42] By rerunning an action from the killed bundle, it restarted the coord :0
[14:11:43] ya restart all new ones, they should work now
[14:11:47] ack
[14:11:49] sorry
[14:11:53] Will kill the old one again and rerun the new
[14:11:54] elukey: sure that sounds good too
[14:11:55] np :)
[14:12:49] ottomata: helloooo there's a lil issue with the v2 old link
[14:12:53] yes?
[14:13:05] this works https://stats.wikimedia.org/
[14:13:09] but this doesn't https://stats.wikimedia.org
[14:13:11] Analytics: Request for Kerberos identity for fsalutari - https://phabricator.wikimedia.org/T245024 (Fsalutari)
[14:13:15] sorry, add v2
[14:13:30] this works https://stats.wikimedia.org/v2/
[14:13:30] but this doesn't https://stats.wikimedia.org/v2
[14:16:29] hm
[14:21:49] good catch fdans
[14:21:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/571726/1/modules/statistics/templates/stats.wikimedia.org.erb
[14:21:51] should do it
[14:22:48] ottomata: oh cool, and that addition catches both cases then
[14:23:45] joal: if you need a hand with the jobs let me know
[14:25:08] almost done elukey
[14:25:39] Analytics: Request for Kerberos identity for fsalutari - https://phabricator.wikimedia.org/T245024 (elukey) ` elukey@krb1001:~$ sudo manage_principals.py create fsalutari --email_address=flavia.salutari@telecom-paristech.fr Principal successfully created. Make sure to update data.yaml in Puppet. Successfully...
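(Editor's note on the stats.wikimedia.org/v2 trailing-slash issue above: the gerrit patch itself is not quoted in the log, so the exact directive is an assumption. A typical Apache fix is a mod_alias rule along the lines of `RedirectMatch permanent ^/v2$ /v2/`. A minimal way to verify the behavior once something like that is deployed:)

```bash
# Illustrative checks only, not from the log: the slash-less URL should now
# redirect to the canonical trailing-slash form instead of failing.
curl -sI https://stats.wikimedia.org/v2 | head -1   # expect a 301/302 redirect
curl -sI https://stats.wikimedia.org/v2/ | head -1  # expect HTTP/1.1 200 OK
```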
[14:27:46] huh, the oozie admin -shareliblist just calls the REST API via java
[14:28:50] PROBLEM - yarn.wikimedia.org HTTPS on analytics-tool1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[14:30:01] wow
[14:30:29] I guess it's me overloading hue through restarts
[14:30:53] weird, lemme check
[14:31:00] fun day :D
[14:33:26] !log restart hue on analytics-tool1001
[14:33:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:34:00] better now
[14:34:19] Thanks elukey
[14:34:30] RECOVERY - yarn.wikimedia.org HTTPS on analytics-tool1001 is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[14:34:47] didn't check all the logs but requests were piling up due to hue
[14:34:58] elukey: my bad - sorry :(
[14:35:09] will try to go gently
[14:35:11] Analytics, Patch-For-Review: Request for Kerberos identity for fsalutari - https://phabricator.wikimedia.org/T245024 (elukey) Open→Resolved a: elukey Please re-open if anything is missing!
[14:37:12] PROBLEM - Hue CherryPy python server on analytics-tool1001 is CRITICAL: PROCS CRITICAL: 2 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue runcherrypyserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration
[14:37:41] hm i suspect oozie shareliblist always returns 0
[14:37:42] :/
[14:38:07] hmm maybe not
[14:38:41] what the hell hue
[14:39:08] ok I think I have restarted everything that was wrong - Will drop off as Lino is awake - See y'all at standup
[14:39:36] so hue's init scripts are so great that a restart leaves 2 processes running (old and new)
[14:39:39] sigh
[14:39:41] just killed the old one
[14:40:30] RECOVERY - Hue CherryPy python server on analytics-tool1001 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue runcherrypyserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration
[14:40:46] heh yeah it does, need to parse it i guess.
[14:42:21] hm, actually... we can't run oozie shareliblist... the oozie server has to be running for that, and on install at this point it isn't yet.
[14:46:10] hmmm elukey i think our hypothesis of what went wrong isn't right.
[14:46:18] that exec does
[14:46:19] require => [Cdh::Hadoop::Directory['/user/oozie'], File['/usr/bin/oozie-setup']]
[14:46:26] and those always issue hdfs dfs -ls commands
[14:46:38] so at least right before the unless command ran
[14:46:43] hdfs dfs -ls succeeded
[14:47:00] so, unless there was a really quick issue in between those
[14:47:09] i don't see how that could happen
[14:47:11] and, if there was
[14:47:15] a bash wrapper wouldn't help
[14:47:21] same thing could happen there
[14:48:07] well a timeout could have happened with the hdfs dfs in the unless, not really impossible.. and with a bash script we should gather the output of hdfs -ls etc., and then check it, with set -e. In this way it shouldn't fail
[14:48:17] sorry, I mean the same thing shouldn't happen
[14:48:44] the other alternative is to restart oozie now in, say, hadoop test, and see what happens
[14:48:48] maybe we can repro there
[14:49:00] (just to rule out the restart event)
[14:49:30] i think a bash script would fail in the same way if the problem was hdfs dfs -ls failing very temporarily
[14:49:39] puppet ends up running successfully
[14:51:26] # for require
[14:51:26] hdfs dfs -test -e /user/oozie
[14:51:26] # for unless
[14:51:26] hdfs dfs -ls /user/oozie | grep -q /user/oozie/share
[14:51:26] # then if unless returns 1
[14:51:26] oozie-setup sharelib create
[14:51:28] hdfs dfs -ls would return a non-zero exit code, and set -e would abort. Maybe not using unless
[14:51:45] right, but so should puppet.
[14:52:09] yes ok we can change puppet as well
[14:52:15] no i mean right now as is.
[14:52:19] this exec should not run if
[14:52:27] hdfs dfs -test -e /user/oozie fails
[14:52:39] we grep afterwards, no?
[14:52:46] the first check is the require
[14:52:50] require => [Cdh::Hadoop::Directory['/user/oozie']
[14:52:58] which runs hdfs dfs -test
[14:53:02] if that fails, the exec won't run
[14:53:05] since its prereq fails
[14:53:25] yes but nothing prevents a timeout from happening in the exec after the require was ok
[14:53:37] true, but that is true for the bash script too
[14:53:38] they are separate things
[14:54:00] it is just a series of commands run in succession, checking retvals
[14:54:05] which is what a bash wrapper would do too
[14:54:44] if hdfs dfs etc. fails, its output does not contain any /user/oozie/share etc., and grep would then return 1
[14:54:50] it is not the same as a bash script
[14:55:11] because we'd need to do more checks in there, not a simple |
[14:55:14] this is my point
[14:56:30] hm, ah i think i see, you want to catch the potential failure of the exec's unless hdfs -ls before the grep, and prevent running the sharelib create if it fails... ok sorry i get it
[14:56:59] it seems so unlikely to me that this is what happened; the require succeeded but the hdfs dfs -ls right after failed
[14:57:31] yes I agree, mine was only an idea to rule out corner cases.. it didn't happen before, so it must be some weird corner case
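(Editor's note: a minimal sketch of the guarded bash wrapper elukey is arguing for, assuming the same three commands as the pseudo-script pasted at 14:51; the `oozie-setup sharelib create` arguments are elided here exactly as they are there. The point is that `set -e` plus an explicit capture makes a transient HDFS failure abort the script instead of being misread as "no sharelib, create a new one".)

```bash
#!/bin/bash
# Sketch only: mirrors the require/unless/create steps discussed above.
set -e  # abort on any non-zero exit code

# "require": fail hard if HDFS is unreachable or /user/oozie is missing
hdfs dfs -test -e /user/oozie

# "unless": capture the listing explicitly; a timeout in hdfs dfs -ls now
# kills the script instead of silently feeding grep empty output
listing="$(hdfs dfs -ls /user/oozie)"

# only bootstrap a new sharelib if one is genuinely absent
if ! grep -q '/user/oozie/share' <<<"$listing"; then
    oozie-setup sharelib create   # arguments elided, as in the paste above
fi
```

(After a new sharelib is created this way, a running server still needs `oozie admin -sharelibupdate` — the CLI ottomata ran at 14:02 — to start pointing at it.)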
[15:01:49] tried to restart oozie in test and run puppet, can't repro (just to rule this out)
[15:30:52] so in hadoop test I just moved the namenodes to 2.8.5 as part of the rolling upgrade procedure
[15:30:55] so far nothing explodes
[15:31:04] the datanodes are the next ones
[15:51:23] Analytics, Wikidata, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Add time limits to scripts executed on stat1007 as part of analytics/wmde/scripts - https://phabricator.wikimedia.org/T243894 (Rosalie_WMDE) a: Rosalie_WMDE
[15:56:09] ah lovely, if I upgrade yarn together with hdfs there is a problem
[15:56:35] PROBLEM - Zookeeper Alive Client Connections too high on an-conf1001 is CRITICAL: 1091 ge 1024 https://wikitech.wikimedia.org/wiki/Zookeeper https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen
[15:56:49] ouch
[15:58:11] stopped all the new daemons
[15:58:30] oh oof
[15:58:31] what the hell
[15:59:04] ok it is going down
[15:59:32] RECOVERY - Zookeeper Alive Client Connections too high on an-conf1001 is OK: (C)1024 ge (W)512 ge 0 https://wikitech.wikimedia.org/wiki/Zookeeper https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen
[15:59:37] I have a meeting now, will check in a bit
[15:59:40] sigh
[16:03:56] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Patch-For-Review: Refining is failing to refine centralnoticeimpression events - https://phabricator.wikimedia.org/T244771 (Ottomata) > since the plan was to switch the datatype back to array eventually. FYI, if you need a new da...
[16:50:55] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (mpopov)
[16:52:22] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (mpopov)
[16:58:12] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (elukey) Joseph and I tried a lot to make this work, but we think that the solution might be to have kerberos auth where datagrip runs (so laptop or own pc), tha...
[17:01:19] :S
[17:23:42] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Patch-For-Review: Refining is failing to refine centralnoticeimpression events - https://phabricator.wikimedia.org/T244771 (AndyRussG) >>! In T244771#5877159, @Ottomata wrote: > FYI, if you need a new datatype, you should just ma...
[17:27:11] Analytics, Analytics-Cluster: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (RobH) So to put some of the figures I just posted in IRC about this: In eqiad 10G racks, we have the following port totals using SFP-T (and thus using 1G in a 10G rack): row a: 64, row b: 33, ro...
[17:30:02] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (mpopov) Aw, bummer :( thank you so much for trying though!
[17:46:26] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (akosiaris) >>! In T238658#5830499, @Ottomata wrote: > @akosiaris I just did a bit of benchmarking in staging. As I added mor...
[17:53:29] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Patch-For-Review: Refining is failing to refine centralnoticeimpression events - https://phabricator.wikimedia.org/T244771 (Nuria) @Ottomata we keep 90 days of raw data right? If so i vote for dropping all refined data and re-re...
[17:55:53] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (kzimmerman) >>! In T245040#5877446, @elukey wrote: > Joseph and I tried a lot to make this work, but we think that the solution might be to have kerberos auth w...
[18:33:54] going off, for the moment the hadoop test cluster is ok
[18:34:08] zookeeper seems quiet, and puppet is disabled on the hadoop nodes that I am working on
[18:34:24] BUT, in the unlikely event that anything explodes, stop all java daemons on
[18:34:29] ok!
[18:34:32] analytics1028/1029/1031
[18:34:55] ottomata: it was yarn causing that mess with zookeeper :(
[18:35:08] will try to sort it out tomorrow
[18:35:09] sigh
[18:35:16] u da best !
[18:35:52] o/
[18:42:27] Analytics, Operations, serviceops, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Dzahn) ` Debug: Augeas[ens5_v6_token](provider=augeas): sending command 'set' with params ["/files/etc/network/interfaces/iface[. = 'ens5']/pre...
[18:46:36] Analytics, Operations, serviceops, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Dzahn) The primary network interface is missing from /etc/network/interfaces. There is only loopback in there. Why that is is another question....
[18:46:45] Analytics, Analytics-Cluster: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (Ottomata) What matters most for us in terms of row placement is an even-ish spread. Hm, 10 of the nodes we are replacing are in Row B. We also currently only have 9 hosts in row C anyway, so pe...
[18:48:58] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (Ottomata) Ok thanks! Will try that!
[19:16:31] milimetric: Heya - would you have a minute?
[19:20:44] joal: in 2 min!
[19:20:51] sure milimetric :)
[19:23:57] Analytics: Spike [2019-2020]. GPU enabled computations. How to do that best - https://phabricator.wikimedia.org/T217367 (Nuria) Open→Declined
[19:24:56] ok all yours joal
[19:25:00] \o/
[19:25:05] To the cave :)
[19:39:58] Analytics, Operations, serviceops, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (MoritzMuehlenhoff) Given that Luca also had an error during initial setup related to name resolution, this sounds like some error related to th...
[19:52:58] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (JAllemandou) Thinking of the future, if we decide presto is the way to go for analysts, [[ https://github.com/airbnb/airpal | Airpal ]] seems a good candidate. I...
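(Editor's note: a hedged sketch of what "kerberos auth where datagrip runs" from the T245040 discussion above typically means in practice, assuming MIT Kerberos on the client machine and Hive's standard kerberized JDBC URL syntax. The realm, host, and port below are illustrative assumptions, not values from this log.)

```bash
# All names here are placeholders for illustration.
# 1) Obtain a Kerberos ticket on the machine where DataGrip runs:
kinit someuser@EXAMPLE.REALM      # ticket lands in the local credential cache

# 2) Point the Hive JDBC driver at a kerberized HiveServer2. This is the
#    standard Hive JDBC URL form for Kerberos (principal= names the *server*
#    principal, not the end user):
#    jdbc:hive2://hive-server.example.net:10000/default;principal=hive/_HOST@EXAMPLE.REALM
```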
[20:35:10] Analytics, Analytics-Kanban, serviceops, Patch-For-Review: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) > Should we use main_app.name instead of service.name? I think yes is the answer. I just updated [[...
[21:48:12] Analytics, Analytics-Wikistats: Wikistats New Feature - bot edits / new articles - https://phabricator.wikimedia.org/T241922 (FocalPoint) @fdans thank you, indeed, not exactly the same, but with a bit of processing, I may get something similar to what I was looking for. The tables of Wikistats 1 seem ted...
[22:22:58] Analytics, Multimedia, Tool-Pageviews: Allow users to query mediarequests using a file page link - https://phabricator.wikimedia.org/T244712 (Nuria) > Personally I think you should just use {project}/{filename} That is what I am proposing but I think "project" is not the right term. In the wikimedia e...
[22:33:29] Analytics, Analytics-Kanban, serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) Ok, applied for staging eventgate-analytics. I think it works! First, because the 'analytics' release already existed, I...
[23:29:30] Analytics, Product-Analytics: Request for instructions for using DataGrip in the Kerberos paradigm - https://phabricator.wikimedia.org/T245040 (Nuria) i think airpal is going to be the future hue, ya.
[23:32:14] thanks elukey for setting up my kerberos auth. still having trouble running hive queries though :(
[23:32:25] "The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
[23:32:41] oh, it's this "Cannot create directory /tmp/hive/foks. Name node is in safe mode."
[23:39:33] Analytics, Product-Analytics: Develop a consistent rule for which special pages count as pageviews - https://phabricator.wikimedia.org/T240676 (kzimmerman) p: High→Medium a: nshahquinn-wmf→None The scope of this is (increasingly) extensive and requires changing existing definitions; current...
[23:48:32] foks: mmm, what queries are you running?
[23:49:14] nuria: I'm running one against wmf.webrequest
[23:49:28] I'll be honest, I don't have a whole lot of idea what I am doing
[23:49:28] foks: can you paste the query here?
[23:49:55] It's Legal-related so I'll replace what I'm running it for with foobar, but sure
[23:50:13] foks: i think reading the docs before using the cluster might help, querying petabytes of data is not an intuitive thing
[23:50:20] foks: let me see
[23:50:20] yes that is fair
[23:50:23] hive -e "use wmf; select dt, ip, client_ip, uri_host, uri_path, uri_query, agent_type, pageview_info['page_title'] as page_title, page_id, namespace_id, year, month, day, hour from wmf.webrequest where year=2020 and month=2 and day=3 and is_pageview=true and uri_host in ("fr.wikipedia.org", "fr.m.wikipedia.org") and uri_path="/wiki/Foobar" order by dt, ip limit 1000000000" > ./stat-legal-fr-2020-02-10.tsv
[23:50:39] I will probably also narrow by hour
[23:50:52] foks: narrowing by hour 1st would help
[23:51:02] nod
[23:52:01] foks: let me rerun your query
[23:52:05] I actually only need ten minutes of data
[23:52:20] nuria: let me PM you
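(Editor's note: as pasted, the query's unescaped inner double quotes terminate the shell string early, so it would not run as shown. A hedged corrected sketch follows — shell quoting fixed, `use wmf;` dropped since the table is already fully qualified, and narrowed to a single hour as nuria suggests. The hour value and the Foobar title are placeholders, not values from the log; the limit is also reduced, since Hive's `order by` funnels everything through a single reducer.)

```bash
# Single-quote the HQL so the double quotes inside survive the shell;
# Hive accepts double-quoted string literals. The year/month/day/hour
# partition filters keep the scan to one hour of webrequest data.
hive -e '
select dt, ip, client_ip, uri_host, uri_path, uri_query, agent_type,
       pageview_info["page_title"] as page_title, page_id, namespace_id,
       year, month, day, hour
from wmf.webrequest
where year=2020 and month=2 and day=3 and hour=0
  and is_pageview=true
  and uri_host in ("fr.wikipedia.org", "fr.m.wikipedia.org")
  and uri_path="/wiki/Foobar"
order by dt, ip
limit 1000000
' > ./stat-legal-fr-2020-02-10.tsv
```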