[00:03:55] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui could you expand on why the check isn't realistic? From what I can tell all it's monitoring is the total u...
[01:05:13] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:02:41] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) >>! In T269211#6978486, @razzi wrote: > @Marostegui could you expand on why the check isn't realistic? From what...
[05:03:23] PROBLEM - Check unit status of wikimedia-discovery-golden on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:51:46] good morning
[06:20:21] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:34:34] 10Analytics, 10Analytics-Kanban: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (10elukey)
[06:35:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10elukey)
[06:35:47] 10Analytics, 10Analytics-Kanban, 10Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (10elukey) a:05elukey→03None
[06:36:39] 10Analytics, 10Analytics-Kanban: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10elukey) This task should be done! Please let me know if any other step is needed :)
[06:37:27] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) We can call this done, I'll open new tasks to progress this initial work :)
[06:56:19] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Escape edit count bucket for metrics tag name [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight)
[07:00:30] (03CR) 10Thiemo Kreuz (WMDE): "Even if this patch is tiny, this is crazy command line magic I can't really review." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[07:10:10] 10Analytics-Clusters, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10elukey) I have a plan for this task, let me know your thoughts! Assumption: the SRE team decided to consider `/srv` as the canonical place where to put raid-based-volumes/partitions and...
[07:25:18] 10Analytics-Clusters, 10Analytics-Kanban: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (10elukey) `export HADOOP_CLASSPATH=/usr/share/java/apache-log4j-extras.jar:${HADOOP_CLASSPATH}` in `hadoop-env` did the trick!
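For context on that last fix: apache-log4j-extras.jar provides the org.apache.log4j.rolling package, so once it is on the HADOOP_CLASSPATH the namenode's log4j.properties can declare a rolling appender whose file name pattern ends in .gz, which makes log4j compress each rotated file. A sketch of the kind of appender definition this enables (the appender name and paths are illustrative, not the actual production config):

    # Hypothetical rolling gzip appender, usable once apache-log4j-extras.jar
    # is on the classpath. A FileNamePattern ending in .gz tells
    # TimeBasedRollingPolicy to gzip each rotated file.
    log4j.appender.RFA=org.apache.log4j.rolling.RollingFileAppender
    log4j.appender.RFA.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
    log4j.appender.RFA.rollingPolicy.ActiveFileName=${hadoop.log.dir}/${hadoop.log.file}
    log4j.appender.RFA.rollingPolicy.FileNamePattern=${hadoop.log.dir}/${hadoop.log.file}-%d{yyyy-MM-dd}.gz
    log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n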
[07:27:29] I found a need in https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/676297 to repeat a trivial transformation in a dozen different HiveQL scripts. Would it be reasonable to write a UDF for this and similar operations? Is there any precedent for doing so?
[07:36:30] Hi awight - writing UDFs is completely fine - We usually do them in 2 steps, core definition (actual computation) in refinery-source/refinery-core, and hive stuff in refinery-source/refinery-hive
[07:47:41] (03CR) 10Joal: [C: 03+1] "One nit to discuss, but non-blocking." (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[07:49:04] (03CR) 10Joal: "Another nit :)" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 (owner: 10Hnowlan)
[07:56:04] good morning elukey - would you have a minute for me this morning?
[07:57:18] 10Analytics, 10SRE, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) +1 on the approach (updating the task description for details)
[07:58:21] 10Analytics, 10SRE, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou)
[08:04:07] joal: bonjour! Sure
[08:04:46] elukey: let me know when it's easier for you :)
[08:07:44] joal: maybe in 30 mins?
[08:08:01] When you wish elukey - I have nothing before 4pm
[08:09:22] joal: Thank you, I may have UDFs to review in the future, in that case. Are the refinery UDFs already available any time Hive is used on stat* and an-launcher* by chance, perhaps by the wrapper script? Or would I have to e.g. add the jar to each tool's launch configuration such as reportupdater's?
[08:14:15] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10JAllemandou) Thank you for the explanation @ArielGlenn. Let me clarify my 2 concerns (they are minor): - job names are different for the same outpu...
[08:22:03] awight: IIRC you need to add the jar in the HQL session - This was done to avoid problems when testing new versions of stuff if defaults were included
[08:27:44] Oh that's fancy, I literally "add jar" from a script. Now I see diverse examples in `refinery`. Looking forward to tinkering!
[08:29:03] awight: in an-launcher or stat boxes deployed jars can be found at: /srv/deployment/analytics/refinery/artifacts
[08:29:17] TY
[08:30:32] awight: While you can use the `current` version of the jar (/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar for instance), we strongly encourage you to use the versioned ones (for instance /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-1.0.3.jar not sure about the paths)
[08:31:12] awight: Given that code changes sometimes, having a versioned pointer to the jar prevents unexpected post-deployment failures
[08:34:55] * awight sweats at the thought of importing 41MB of bytecode just to reuse two "replace" statements. Are there efficiencies if the same jar is loaded in many scripts...
[08:36:12] awight: in oozie we use HDFS paths for the jars (hdfs:///wmf/refinery...) - there is data movement, but at least it's distributed
[08:38:37] Well this gives me a nice option to consider, I'll try to educate myself about the trade-offs and identify what else of my team's HQL can be usefully pushed into UDFs.
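To make the two-step pattern above concrete: the actual computation lives in a plain class in refinery-core, and a thin wrapper in refinery-hive exposes it to HiveQL. A minimal sketch with hypothetical class names (the real refinery classes differ):

    // refinery-core: plain, unit-testable logic with no Hive dependency.
    package org.wikimedia.analytics.refinery.core;

    public class MetricNameEscaper {
        // Replace characters that are not valid in a metric tag name.
        public static String escape(String raw) {
            return raw == null ? null : raw.replace(' ', '_').replace('.', '_');
        }
    }

    // refinery-hive: thin Hive wrapper around the core logic.
    package org.wikimedia.analytics.refinery.hive;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    import org.wikimedia.analytics.refinery.core.MetricNameEscaper;

    public class EscapeMetricNameUDF extends UDF {
        public Text evaluate(Text raw) {
            return raw == null ? null : new Text(MetricNameEscaper.escape(raw.toString()));
        }
    }

An HQL script would then register it with ADD JAR (pointing at a versioned refinery-hive jar, as recommended above) followed by CREATE TEMPORARY FUNCTION escape_metric_name AS 'org.wikimedia.analytics.refinery.hive.EscapeMetricNameUDF'.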
[08:40:41] awight: IIUC your current need is about making sure the same code is reused in many queries
[08:46:06] joal: Yes, that's all for now. If it were some logic that couldn't be easily performed in HQL I wouldn't be considering this small matter of carbon overhead.
[08:48:32] 10Analytics, 10SRE, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10ayounsi) >>! In T279429#6976000, @ayounsi wrote: > There is also a term permitting UDP fragments, I added a "count" to know if/why we're using it. Looks like we're not. I'll remove it as well.
[08:59:41] joal: I am free if you want!
[09:00:06] elukey: in 2 mins - bathroom break first
[09:03:24] 10Analytics, 10SRE, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10elukey) @razzi this is a good task to get started with the firewall rules of our VLAN :)
[09:04:50] elukey: to the cave?
[09:08:04] sure
[09:28:47] mgerlach: heya - your job is using 4.5TB on the cluster, and half the CPUs - Can you please set limits within the usual boundaries?
[09:30:12] joal: thanks for the ping. apologies.
[09:30:16] I am checking the job
[09:31:17] I am submitting a spark-job with the following parameters spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G
[09:31:39] do you have recommendations on how to set the usual limits?
[09:32:59] mgerlach: you need '--conf spark.dynamicAllocation.maxExecutors=XX' I recommend no more than 128 executors given your settings
[09:33:10] mgerlach: see --conf spark.dynamicAllocation.maxExecutors=64
[09:33:16] mwarf - again sorry
[09:33:23] mgerlach: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Start_a_spark_shell_in_yarn
[09:35:16] joal: thanks for this. I will try your suggestion. sorry again
[09:35:26] np mgerlach, nothing broken :)
[09:36:23] joal: puh, I am always sweating from your pings : )
[09:36:45] I'm sorry for that mgerlach - I'll try to ping more regularly for chit-chat :)
[09:36:50] afraid that I might have broken something
[09:37:28] joal: but I should know this, so thanks for directing my attention towards this for better practice
[09:37:34] mgerlach: the cluster is resilient, it's more about sharing resources than broken stuff for real
[09:41:59] (03PS3) 10Hnowlan: Add makefile and dockerfile for local tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402
[09:52:29] /win 11
[09:52:32] ufff
[10:21:12] (03PS1) 10Awight: Fix date range parameters for native hive scripts [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/677511 (https://phabricator.wikimedia.org/T193169)
[10:25:07] (03PS3) 10Awight: Validate the native "hive" report type [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169)
[10:25:14] (03CR) 10Awight: Validate the native "hive" report type (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[10:35:59] * elukey lunch!
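For reference, joal's recommendation above amounts to capping Spark's dynamic allocation so a single job cannot grab half the cluster. The same cap can also be set programmatically when building the session; a minimal sketch (the app name is illustrative, and memory/cores are still best passed on the spark2-submit command line):

    import org.apache.spark.sql.SparkSession;

    public class BoundedJob {
        public static void main(String[] args) {
            // Command-line equivalent:
            //   spark2-submit --master yarn --executor-memory 8G --executor-cores 4 \
            //     --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 ...
            SparkSession spark = SparkSession.builder()
                .appName("bounded-example")
                // Cap dynamic allocation to stay within the usual boundaries.
                .config("spark.dynamicAllocation.maxExecutors", "64")
                .getOrCreate();

            // ... job logic ...

            spark.stop();
        }
    }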
[10:43:36] (03PS1) 10Hnowlan: scap: add analytics WMCS hosts instead of old deploy-prep hosts [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517
[10:57:15] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix date range parameters for native hive scripts [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/677511 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[11:05:53] I'm at a loss on how deployment works in the wmcs cluster - it looks like the service is actually deployed via scap but the deployment-prep scap can't sync to the analytics cluster
[11:06:11] deploy-local is breaking with a weird error I don't quite understand
[11:06:35] Getting started with refinery, but surprised to see that git-fat requires python 2.
[11:15:37] hnowlan: could this be related to the fact that the analytics wmcs cluster got wiped out, then recreated?
[11:17:08] (03CR) 10Joal: [C: 03+1] "LGTM! +1 to let you merge when you wish" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 (owner: 10Hnowlan)
[11:18:07] (03PS1) 10Awight: Remove unused jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521
[11:18:28] joal: I'm not sure - were the hosts moved from deployment-prep to the new cluster or were they recreated from scratch?
[11:19:26] I think they were recreated hnowlan - IIRC we hadn't followed up on wmcs cluster reconfiguration/cleanup, and ended up with our cluster deleted
[11:20:49] joal: ah, okay, makes sense
[11:21:00] the new hosts got a valid scap deploy in January 2020 but I dunno how :)
[11:26:09] hnowlan: I think Luca did that so that lex could test his patch about pageview-per-country
[12:27:22] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10JAllemandou) I have implemented some more logic to get the files we need, so no real need to change here. This task was more about things to keep i...
[12:27:52] hnowlan: hello! No scap deploy, we don't have any scap/deploy server available in our wmcs namespace
[12:28:13] they got deployed via scap pull when puppet ran the first time
[12:28:59] wow - thanks for explaining elukey - I was far from understanding
[12:29:38] (03CR) 10Joal: Remove unused jar (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[12:30:19] it is my bad joal, I should have added docs to wikitech, will do it later on
[12:30:27] (still have to complete that task)
[12:30:54] hnowlan: checking puppet in there to see what settings were applied
[12:32:28] so in profile::aqs we have profile::aqs::git_deploy, which defaults to false, I thought we had it enabled in wmcs
[12:32:31] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn >>! In T279055#6979924, @JAllemandou wrote: > I have implemented some more logic to get the file...
[12:34:23] *they got deployed via git pull, not scap pull
[12:36:26] (03CR) 10Elukey: "Let's remove any wmcs/labs reference, we are not in deployment-prep anymore (and we were discouraged to add another cluster in there from " [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517 (owner: 10Hnowlan)
[12:37:07] fatal: remote error: mediawiki/services/analytics/aqs/deploy unavailable
[12:37:12] ah snap
[12:38:25] that is weird, it shouldn't happen
[12:38:38] hnowlan: see service::node, we use the git deploy option in there
[12:38:46] (it is created by the aqs class)
[12:39:03] now why it adds mediawiki/services is weird
[12:42:31] define service::deploy::gitclone( String[1] $prefix = 'mediawiki/services',
[12:42:34] sigh
[12:43:52] ok hnowlan my bad, we should try to make the prefix configurable in service::node when using the git clone
[12:44:56] be back in 10 mins!
[12:45:20] 10Analytics-Clusters, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10Ottomata) +1!
[12:49:20] (03CR) 10Ottomata: Switch to eventutilities 1.0.4 (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[13:06:41] hnowlan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677569 - lemme know what you think about it
[13:06:45] elukey: ohhhh, thanks for filling me in!
[13:07:13] hnowlan: sorry for the mess, my brain was convinced about something different :(
[13:07:38] the thing that really threw me for a loop is that on the aqs hosts aqs HEAD is at scap/sync/2021-04-07/0004, and it's the same on deployment-deploy01:/srv/deployment/analytics/aqs/deploy
[13:10:22] hnowlan: I think it got deployed via git during the first puppet run :(
[13:11:34] ahhh heh - makes sense - deploy-local works but it only pulls the same version
[13:27:32] dcausse: in that code review...i think gehel is suggesting that I keep a reference to the response body InputStream somehow
[13:27:57] but, in an earlier review, you suggested that I change the constructor so it doesn't look like it is keeping e.g. HttpResponse
[13:28:15] was that just because HttpResponse wasn't actually being kept as an instance property?
[13:28:31] or because holding onto it at all would be bad?
[13:28:34] ottomata: I think gehel is suggesting to stream and do http request -> json
[13:28:45] ah, that's a bit harder with the abstraction i'm trying to make
[13:28:52] which is not possible without a major refactor
[13:29:25] hmmm actually, that could be done if we make an interface in resourceloader to get an inputstream?
[13:29:26] yep, my point is that if you go from HttpRequest -> InputStream -> String / byte[] -> Json
[13:29:45] you could actually do better by removing the String / byte[] part. Less memory allocation
[13:29:53] right that makes sense
[13:30:07] I guess BasicHttpClient should really only be used for very basic small things, so this might be ok
[13:30:09] to keep as is
[13:30:10] But as David said, that's a non-trivial refactoring, so it might not make sense to do it here
[13:30:10] if you expose the response out of the BasicHttpClient then you trust clients to properly close the response
[13:30:21] hm, right.
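A sketch of the streaming shape being discussed: go straight from the response InputStream to JSON with no intermediate String/byte[] copy, and use try-with-resources so the response is always closed, which addresses dcausse's point about trusting clients to close it. The names here are illustrative, not the actual refinery-source API:

    import java.io.IOException;
    import java.io.InputStream;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StreamingJsonFetch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Parse JSON directly off the wire; the response and its stream are
        // closed here, so callers never see a half-open connection.
        public static JsonNode getJson(String url) throws IOException {
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url));
                 InputStream body = response.getEntity().getContent()) {
                return MAPPER.readTree(body);
            }
        }
    }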
[13:30:43] or you could add JSON parsing to the BasicHttpClient (but it is then a little bit less basic)
[13:31:01] you also lose the ability to read that body multiple times, but that's probably a good thing
[13:31:04] well it's more complicated than that, ResourceLoader is abstracting away the protocol used to get the content
[13:31:07] it could also be from a file
[13:31:08] url
[13:31:11] or maybe even hdfs
[13:31:26] so, the Json parsing needs to be above the HttpClient
[13:32:12] I'll add some comments about how BasicHttpClient makes a copy of the response body in memory and should only be used for very simple http requests
[13:32:19] simple / small
[13:32:28] or maybe the shared abstraction should not be a String/byte[] but a stream
[13:32:39] that could work, but what about what dcausse said about closing the http request?
[13:33:29] it's like returning CloseableHttpResponse
[13:35:35] heya team :]
[13:35:55] or passing a reader from the client: public R get(URL resource, Function reader) {}
[13:38:12] yeah, harder to ensure it is closed properly :/
[13:38:51] Not using streams to process potentially large chunks of data is one of my pet peeves, but don't be bound by my OCDs :)
[13:39:47] well i added a note not to use it for large chunks of data so it never will be used that way now right? :p
[13:39:59] :)
[13:40:03] gehel: the other comment is the one about copying the byte[] in getBody()
[13:40:43] spot bugs was giving me http://findbugs.sourceforge.net/bugDescriptions.html#EI_EXPOSE_REP
[13:41:11] that's convoluted :)
[13:41:23] would it be better to just return the byte[] and then add a spotbugs ignore for that?
[13:42:09] I think (if possible) it would be better to directly return byte[] and not use BasicHttpResult
[13:42:15] (03CR) 10Joal: [C: 03+1] "discussion done - merge when you wish :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[13:43:18] hm
[13:43:50] I don't see how you could access the underlying char array from the String, so not sure what FB thinks this is exposing
[13:44:17] gehel: i assumed it meant one could do
[13:44:26] byte[] mybody = result.getBody();
[13:44:33] mybody[0] = 'X';
[13:44:48] and then someone else might call result.getBody() and see your modification
[13:44:55] Oh yes, exposing the byte array is an issue
[13:45:30] what I was proposing is that getBodyAsString should just be `return new String(this.body, UTF_8)`
[13:45:44] (03PS2) 10Hnowlan: scap: add analytics WMCS hosts instead of old deploy-prep hosts [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517
[13:45:46] hnowlan: puppet works now in analytics wmcs, you should be unblocked!
[13:45:55] elukey: sweet, thank you!
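On the SpotBugs point: EI_EXPOSE_REP fires because returning the stored byte[] lets callers mutate the result's internal state, which is exactly the mybody[0] = 'X' scenario above. A sketch of the defensive-copy shape being discussed (a hypothetical simplification, not the actual BasicHttpResult code):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public final class HttpResult {
        private final byte[] body;

        public HttpResult(byte[] body) {
            // Copy on the way in: later mutation of the caller's array
            // cannot change this result.
            this.body = Arrays.copyOf(body, body.length);
        }

        // Copy on the way out: callers cannot mutate our internal state.
        // This is what silences EI_EXPOSE_REP.
        public byte[] getBody() {
            return Arrays.copyOf(body, body.length);
        }

        // No extra copy needed: the String constructor already copies,
        // per the `return new String(this.body, UTF_8)` suggestion above.
        public String getBodyAsString() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }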
[13:46:05] (03CR) 10Elukey: [C: 03+1] "<3" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517 (owner: 10Hnowlan)
[13:46:07] well, you have to decide how much of an issue it is to expose the byte array, vs the cost of copying it
[13:46:12] oh i see
[13:46:19] ok i can change the asString method for sure
[13:49:26] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey)
[13:50:12] String will create its own copy anyway
[13:50:15] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) @Cmjohnson this is kind of strange, I don't see any problem reported by megacli for the BBU but I cannot enforce WriteBack on the RAID controller, as if the BBU wasn't working. Any i...
[13:50:41] we could even cache the String in a private field to not create a new copy each time, but that's probably unneeded optimization
[13:50:47] yeah that makes sense gehel i think that was just there from before i changed getBody to make a copy
[13:50:50] fixed in latest patch
[13:52:33] LGTM!
[13:59:43] dcausse: you ok to merge? would love to get this into refinery asap so it can maybe make it in our deployment train
[14:00:24] ottomata: lgtm!
[14:00:42] ty, guess i'm waiting on jenkins too
[14:00:47] thanks so much for reviews!
[14:02:25] ottomata: meeting?
[14:02:33] OH NO
[14:02:34] thank you
[14:02:35] coming
[14:03:41] !log setting profile::aqs::git_deploy: true in aqs-test1001 hiera config
[14:03:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:08:54] hnowlan: It should already be applied to the prefix aqs tab (for all nodes)
[14:09:02] (I added it 30 mins before)
[14:11:26] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10Cmjohnson) @elukey that's a first! Maybe the raid bios settings are wrong?
[14:15:09] elukey: ah nice - it's not quite behaving like I'd expect but I'm looking into it
[14:29:24] (03PS2) 10Awight: Remove unused jar in oozie/cassandra/monthly/pageview_top_bycountry.hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521
[14:29:33] (03CR) 10Awight: Remove unused jar in oozie/cassandra/monthly/pageview_top_bycountry.hql (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[14:30:20] (03CR) 10Joal: [V: 03+2 C: 03+2] "Thank you :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[14:55:44] razzi: wanna talk gobblin?
[14:55:54] (anytime is ok, not necessarily now)
[14:59:47] (03PS6) 10Ottomata: Switch to eventutilities 1.0.5 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[15:01:24] mforns: are you doing train today?
[15:01:34] ottomata: yes
[15:01:38] awesooome just in time :)
[15:01:48] is there something you want me to deploy?
[15:01:58] just refinery source changes, i'll apply them after deploy
[15:02:06] they are in the etherpad
[15:02:10] i've already merged one
[15:02:15] waiting for jenkins on the other then will merge
[15:02:16] I'll be doing it scattered during the day, because I have this BUOD presentation meeting, but yes
[15:02:27] oooo right cool yeah am looking forward to that
[15:02:30] ok, no problemo
[15:05:19] (03CR) 10Ottomata: Switch to eventutilities 1.0.5 (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[15:14:39] milimetric: yeah, good time to talk now?
[15:14:48] sure, to the cave!
[15:15:47] ok!
[15:22:03] mforns: hola hola, we should deploy the new mediawiki snapshot config for aqs
[15:22:30] elukey: sure
[15:23:03] elukey: does this depend on any change to refinery or refinery-source?
[15:24:15] 10Analytics, 10Cloud-Services, 10Data-Persistence (Consultation): Sqoop on multi-instance clouddb1021 is very slow for some tables - https://phabricator.wikimedia.org/T279095 (10JAllemandou) 05Open→03Declined Thanks for your suggestion @Marostegui. The global drift is not big (this month took 4h more tha...
[15:24:19] mforns: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677591
[15:24:59] elukey: oh! I see, this is unrelated to the refinery deployment
[15:27:19] mforns: yes yes sorry I thought you had time now, it can be done later
[15:27:27] the new datasource is there
[15:27:34] (just checked the coordinator)
[15:27:46] I can deploy the puppet change and then run the cookbook when you are ok
[15:27:55] so we can test aqs1004 separately
[15:27:58] elukey: code looks good to me, I was going to check the data, but if you checked already, I'm all for merging!
[15:27:59] (the canary)
[15:28:23] mforns: part of the cookbook is to depool aqs1004 and wait for a manual confirmation, is it ok?
[15:28:56] elukey: yes!
[15:30:09] mforns: all right, +1 to do it now? Or do you prefer to wait?
[15:30:27] elukey: good to do it now!
[15:30:32] you wanna batcave?
[15:31:16] mforns: in here is fine if you are ok.. I just depooled aqs1004, you can test!
[15:31:24] ok!
[15:35:14] bearloga: hi! I see that the discovery golden timer is broken on stat1007, did you change anything recently?
[15:35:24] it complains about the pid python module not being available
[15:35:49] elukey: tested on aqs1004, seems to work fine
[15:36:44] mforns: ok to proceed with the rest of the nodes?
[15:36:49] elukey: yes
[15:37:43] perfect
[15:37:51] elukey: quick q for you
[15:38:01] I just noticed in https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m that the 2G cache on the historicals is filled for public
[15:38:09] elukey: how can I check IOs on a linux machine in realtime?
[15:38:38] joal: for a specific process or cluster wide?
[15:38:40] err host wide?
[15:38:53] elukey: single host
[15:39:02] elukey: stat1008 is very unresponsive
[15:39:40] ok elukey: grafana tells me it's disk
[15:40:32] hm
[15:40:34] I was about to suggest iotop
[15:40:34] elukey: isn't the cache supposed to be filled all the time?
[15:40:54] not available on stat1008 elukey :/
[15:41:13] mforns: yes yes it is nice, the hitrate is ~0.85 so I am wondering if adding more RAM for it (since it is heap based) could make it better
[15:41:19] elukey: hi! that must explain https://phabricator.wikimedia.org/T279443
[15:41:44] bearloga: yes! :D
[15:42:17] joal: yeah it requires sudo :(
[15:42:23] hm
[15:42:34] elukey: would you sudo me a favor?
[15:43:45] joal: so there is somebody logged as root doing a find in /srv/home that is quite intensive
[15:43:59] meh
[15:44:10] ok
[15:44:31] and also mgerlach's python3 script seems heavy
[15:44:35] "would you sudo me a favor?" xDDDDDD
[15:44:37] elukey: question for you and mforns. if a kerberos systemd unit in puppet runs a script as a system user ('analytics-search' in this case) which calls reportupdater, where is the python environment?
[15:45:21] bearloga: in theory it should be added via PYTHONPATH or something similar
[15:45:30] mforns: deployed!
[15:46:56] joal: ah so we have one cron that does
[15:46:57] /usr/bin/find /srv/home -type d -regex "/srv/home/.+/\.local/share/Trash" -exec rm -rf {} >/dev/null 2>&1 \;
[15:46:58] elukey: so theoretically I would create a venv that has everything reportupdater needs (as declared in requirements.txt) and then could set PYTHONPATH in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/statistics/manifests/discovery.pp ?
[15:48:00] elukey: I'm seeing that with 1G of ram consumption, the hit ratio is already 70%, and when the cache reaches 2G, the ratio is 81%. From there on, even if there's no increase in cache size, the ratio continues to grow (83%); this might be due to the cache optimizing the most accessed paths.
[15:48:05] bearloga: in theory the venv should be shipped with the repo if needed, but not sure what mforns did in the past for RU
[15:48:12] elukey: I assume the script stopped - perf is back to usual
[15:49:15] mforns: yep yep, but the trend seems to be sitting at 0.85 after a month, and there are evictions, this is why I thought about increasing the cache a little (maybe it could lead to zero evictions and more hits)
[15:49:28] joal: yes, we need to make that find better sih
[15:49:30] *sigh
[15:50:03] elukey: no big deal - just mentioning - maybe we could change the execution time for that thing to a moment when fewer people use the tools?
[15:50:44] elukey: aha, I see the evictions, sure! It might be worth increasing the heap
[15:51:08] joal: yes yes, I think that we could also make it smarter, it is a little dumb since it removes files instead of the whole dir (guess who's the one to blame?)
[15:51:38] sorry s/files/dirs
[15:53:59] mforns: what was your solution to venvs for reportupdater?
[15:54:06] elukey, bearloga: I never had to use venvs with reportupdater, I avoided using any library within reportupdater that was not already part of the stats machines' environments...
[15:54:12] mforns: you haven't run the train yet right? :D
[15:54:22] fdans: no train yet :]
[15:54:26] elukey, klausman: that could be of interest - https://medium.com/riselab/feature-stores-the-data-side-of-ml-pipelines-7083d69bff1c
[15:54:45] Havin' a look-see
[15:54:51] bearloga: we could add the venv to the repo, as elukey suggests?
[15:54:53] bearloga: it would be good if we could deploy the venv (a frozen one basically) as part of the discovery golden scap repo (if it is deployed via scap)
[15:55:04] and reference it in the timer
[15:55:14] klausman, elukey: non-technical but I think it puts high-level concepts in a nice way
[15:55:34] fdans: do you want me to deploy sth?
[15:55:36] > Unfortunately, many feature stores being built today are Frankensteinien amalgamations of batch, streaming, caching, and storage systems.
[15:55:44] Ain't that the truth for many things
[15:56:00] let's write our own feature store!
[15:56:02] klausman: I have no idea what you could possibly mention
[15:56:09] * joal looks away
[15:56:18] I propose nodjs
[15:56:20] *nodejs
[15:56:46] joal: can you join #wikimedia-ai ?
[15:56:47] * joal wonders about feeding the troll
[15:56:59] elukey: the deployment is just a clone of the git repo on gerrit https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/golden/
[15:57:01] people will be interested in there too :)
[15:57:02] sure elukey
[15:57:46] mforns: just start a job, I added the command in the etherpad
[15:57:56] elukey: I think it would be better to use a mixture of Haskell and Ocaml
[15:58:03] fdans: will do!
[15:58:03] bearloga: ah yes then seems fine as well, if we add a "venv" or similar dir in there with PYTHONPATH it should work in theory
[15:58:17] thank youuuu
[15:58:19] klausman, elukey: scalajs is what you're after, best of both worlds
[15:58:23] ahahahah
[15:58:35] joal: wait, I want that
[15:58:51] * joal has failed to refrain from feeding the troll
[15:59:02] oh wait
[15:59:09] this is to materialize scala code into js
[15:59:13] I don't want that
[15:59:22] :D
[15:59:24] I want the opposite thing
[15:59:43] how dare they call something .js when no javascript is written
[16:00:40] elukey: got it! thanks, I'll document your recommendations in the ticket.
[16:00:42] is stat1008 back to normal? i get a 500 on newpyter
[16:01:01] fkaelin: working on it
[16:01:06] fkaelin: it's been under IO pressure - back to normal for me
[16:01:10] no idea why it worked before and not now??
[16:01:15] no there is another problem
[16:01:22] https://phabricator.wikimedia.org/T279480
[16:01:24] Arf ok sorry
[16:01:52] elukey: looking at aqs, I see the change in hit ratio from 70% to 83% has not improved the response time percentiles significantly.
[16:02:40] mforns: let's wait a bit for things to settle, it may take a bit of time, but it is a good point to check that relationship!
[16:02:51] bearloga: elukey, you might be able to get away without setting PYTHONPATH, if you just launch python out of your venv
[16:03:10] ottomata: yes true!
[16:03:27] fdans: standup?
[16:03:35] sorry
[16:03:42] also, would it be easy to upgrade the node version installed on the stat machines? Node version v6.13.1 < 10.12.0 when using vim plugins
[16:04:21] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10EYener) Update for those on this task: All three tables have been whitelisted and are live in the event_sanitized db: event_s...
[16:06:12] thanks ottomata, will have a look at the phab
[16:06:25] fkaelin: we have 10.23.1~dfsg-1~deb10u1 installed on the stat100x hosts, no idea where 6.x comes from
[16:07:31] fkaelin: node is in conda, so if you use your conda env, you can install whatever version you want! :)
[16:08:21] ai, thanks.
[16:27:58] 10Analytics, 10Analytics-Wikistats: New Wikivoyages are only partially included in Stats - https://phabricator.wikimedia.org/T279564 (10KuboF)
[16:28:40] hey a-team: do you have any standard tools to compact data in HDFS?
I found spark-compaction on GitHub, but thought I'd check in and see if you have other recommendations before testing it
[16:30:00] to make it easier, spark-compaction is this one: https://github.com/KeithSSmith/spark-compaction
[16:35:38] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 2 others: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (10Ottomata)
[16:47:34] ottomata, razzi let me know if https://gerrit.wikimedia.org/r/c/operations/puppet/+/677576 is ok for you later on
[16:47:38] if it is, I'll merge it tomorrow
[16:47:56] (I should be able to get the /srv/ logging config working too)
[16:53:12] going afk earlier today, ttl folks!
[17:32:28] 10Analytics: Review request: New datasets for WMCZ published under analytics.wikimedia.org - https://phabricator.wikimedia.org/T279567 (10Urbanecm)
[17:44:57] the wmcs analytics cluster seems to have no users configured
[17:45:22] oh also the aqs service has no password configured either heh, might be something to fix tomorrow
[18:26:31] ottomata: are there any workarounds for the Newpyter issue or should i move notebooks i need to run from stat1008 to a different machine in the meantime and use SWAP?
[18:32:14] isaacj: i'm really sorry, i have no idea why this all of a sudden started happening, i'm trying to get it fixed asap. got one coming i think, but building a new anaconda-wmf takes a bit
[18:32:21] no workaround. :/
[18:32:54] ok -- just wanted to make sure there wasn't a quick fix. i can wait a few hours so i'll hope for that to work. thanks!
[18:33:14] ok thanks, again sorry about this.
[18:38:50] 10Analytics, 10Analytics-Kanban: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) FYI, I am working on this now and hope to get it fixed within a few hours. "hope" :)
[18:46:34] stat1008 is really difficult to work with currently
[18:48:37] joal: it seems kind of ok to me? at least the shell is responsive
[18:48:50] joal: sorry joal, this could have been me. I killed the process
[18:49:25] still trying to understand what the problem is. do you have any insight into what could be the cause?
[18:49:29] not sure if it was you mgerlach, but my processes go a lot faster :)
[18:49:44] mgerlach: not without more info :)
[18:50:32] mgerlach: tell me more about what you're doing
[18:51:14] joal: I am running a package called wikipedia2vec https://github.com/wikipedia2vec/wikipedia2vec/blob/master/docs/commands.md
[18:52:09] the input is one of the dump-files (in this case I am trying to work with enwiki)
[18:52:39] I restrict it to 8 cpus
[18:53:05] mgerlach: you could add 'ionice' before your command
[18:53:32] mgerlach: not sure it'll solve the issue, but it could help
[18:53:52] mgerlach: 8 CPUs is not a lot for that host
[18:55:27] 10Analytics-Clusters, 10Analytics-Kanban: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10razzi)
[18:55:32] full command is this: wikipedia2vec train --min-entity-count=0 --dim-size 50 --pool-size 8 "/mnt/data/xmldatadumps/public/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
[18:56:07] joal: so you suggest calling "ionice wikipedia2vec ..." instead?
[18:56:26] yes mgerlach - ottomata can you confirm this looks ok?
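On the HDFS compaction question above: tools like spark-compaction essentially read a fragmented directory, repartition it down to a target number of files, and write it back, which is simple enough to do directly in Spark. A minimal sketch (paths and the target file count are illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CompactSmallFiles {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("compact-small-files")
                .getOrCreate();

            // Read the many small files and rewrite them as a handful of
            // larger ones. coalesce() avoids a full shuffle; use
            // repartition() instead if you need evenly sized output files.
            Dataset<Row> data = spark.read().parquet("hdfs:///path/to/small-files");
            data.coalesce(8)
                .write()
                .mode(SaveMode.Overwrite)
                .parquet("hdfs:///path/to/compacted");

            spark.stop();
        }
    }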
[18:56:54] 10Analytics-Clusters, 10Analytics-Kanban: Upgrade furud/flerovium to Debian Buster - https://phabricator.wikimedia.org/T278421 (10razzi)
[18:58:14] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10razzi)
[19:07:49] mgerlach: no ottomata around, let's try with the ionice param :)
[19:08:12] thanks joal: let me try that. I hope it helps
[19:10:33] ok gone for tonight
[19:22:01] hi!
[19:22:02] sorry
[19:28:02] (03CR) 10Ottomata: [C: 03+2] Switch to eventutilities 1.0.5 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[19:34:52] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, 10Product-Analytics (Kanban): [MEP] [BUG] Timestamp format changed in migrated client-side EventLogging schemas - https://phabricator.wikimedia.org/T277253 (10kzimmerman) 05Open→03Resolved
[19:43:06] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, 10MW-1.36-notes (1.36.0-wmf.35; 2021-03-16): Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Krinkle) You can and should d...
[19:53:00] !log upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
[19:53:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:53:03] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[19:53:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) Unreverted and merged the puppet fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677403
[19:55:35] * razzi out for a walk
[20:15:33] !log rebalance kafka partitions for webrequest_text partitions 15, 16
[20:15:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:28:53] heya team starting deployment train :]
[20:29:54] yeehaw
[20:49:48] (03PS1) 10Mforns: Update changelog.md for v0.1.4 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/677651
[20:50:31] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging for deployment train." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/677651 (owner: 10Mforns)
[20:51:49] (03CR) 10Bstorm: [C: 03+1] "I did it one better. I built a stretch vagrant box and tested the entire cycle." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[20:57:37] (03CR) 10Bstorm: [C: 03+2] Expand dbname validation regex [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[21:00:17] (03Merged) 10jenkins-bot: Expand dbname validation regex [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[21:09:36] 10Quarry, 10Patch-For-Review, 10User-bd808: Database name check excludes valid names (like nds_nlwiki) - https://phabricator.wikimedia.org/T278715 (10Bstorm) 05Open→03Resolved a:03bd808 https://quarry.wmflabs.org/query/53945 Fixed by @bd808's patch after I deployed it.
[21:19:15] Noticed this alert: Icinga/DPKG "DPKG CRITICAL dpkg reports broken packages" on stat1008 and an-worker1100, looking into it
[21:20:06] Ok, looks like this is related to another alert, Puppet has 1 failures.
Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[anaconda-wmf], guessing this is ottomata ?
[21:23:43] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) > I'm so sorry, is this stuck on me? No worries -- as you said, privacy review is still ongoing. FYI @JFi...
[21:23:54] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.4 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677660
[21:24:47] Going to try to reinstall anaconda-wmf on stat1008
[21:25:04] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging for deployment train." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677660 (owner: 10Maven-release-user)
[21:25:08] !log sudo apt-get install --reinstall anaconda-wmf on stat1008
[21:25:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:46] Looks like there is some apt process running holding the apt lock
[21:26:35] !log deployed refinery-source v0.1.4
[21:26:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:39:32] Ok, the apt install finished by itself and anaconda-wmf is installed as intended
[21:42:16] oh hey sorry missed ping
[21:42:20] was afk, that install was taking a while
[21:42:44] razzi: yeah i think the install for anaconda-wmf takes so long that the alert fires
[21:43:26] strange, I wonder why it takes so long ottomata
[21:43:32] it is 2.2G
[21:44:58] !log starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
[21:44:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:46:37] isaacj: fkaelin Urbanecm jupyterhub-conda should be fixed on stat1008
[21:46:45] and others as well, but not all quite yet.
[21:46:54] there is actually a small bug still that will affect first-time users creating a new conda env
[21:46:57] like you Urbanecm :/
[21:47:00] am working on that now
[21:47:14] so I'm not supposed to try it yet? :D
[21:48:20] heh, i guess not?
[21:48:37] ok :)
[21:57:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) Getting the stat boxes back in place. stat1008 and stat1004 are good to go. Along the way, I introduced another small bug for users that are creating new...
[22:39:55] !log deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
[22:39:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:40:30] mforns: was that on all hadoop test hosts?
[22:40:35] razzi: you still there? I had a small issue with the deployment train, the scap deployment to hadoop-test failed with a permissions error (see above)
[22:40:43] ottomata: oh! you still there
[22:40:47] I am here too :)
[22:40:48] ya still here!
[22:40:51] ottomata: no, only one
[22:40:56] which one?
[22:43:17] ottomata: I couldn't tell :[, actually both were listed in green before the error message appeared
[22:43:42] ottomata: want me to retry?
[22:44:19] mforns: both?
[22:44:22] an-test-client1001
[22:44:25] what's the other one?
[22:44:47] ottomata: I think it was an-test-master1001, but not sure
[22:45:10] I closed the screen and lost the output
[22:45:16] retry?
[22:46:03] lemme look
[22:46:20] hm no it isn't deployed there
[22:46:22] i guess retry mforns
[22:46:29] ok, doing
[22:47:00] !log finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
[22:47:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:48:33] ottomata: it was an-test-coord1001.eqiad.wmnet
[22:48:57] ottomata: rolling back again
[22:49:05] huh
[22:49:24] alluxio!!
[22:49:44] the files are group-owned by alluxio
[22:49:59] mforns: looks like we need to ask luca
[22:50:21] ottomata: ok, luckily the rest of the deployment is fine
[22:50:22] seems
[22:50:25] aye
[22:50:25] nice
[22:51:17] !log installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
[22:51:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:51:20] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[22:52:53] razzi: FYI am doing an anaconda-wmf installation again, dunno if that alarm will fire or not
[22:53:09] ok cool ottomata
[22:56:26] bye ottomata and razzi :] see ya tomorrow
[22:56:31] byeaaa
[23:02:16] Urbanecm: ok! try now?
[23:02:31] ottomata: trying. any stat box?
[23:02:34] yup
[23:03:45] !log installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
[23:03:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:03:48] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[23:05:48] ottomata: looks much better so far https://usercontent.irccloud-cdn.com/file/hENlDSfT/image.png
[23:06:59] good stuff, that shouldn't take too long
[23:07:05] a minute or 2 when you create a new env
[23:07:48] i guess I'm in ottomata! https://usercontent.irccloud-cdn.com/file/PaVbhIgA/image.png
[23:07:57] you're in!
[23:08:08] thx for the fix
[23:08:17] ya thanks for the report!
[23:08:28] do i need to somehow stop the server when i stop working?