[00:03:55] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui could you expand on why the check isn't realistic? From what I can tell all it's monitoring is the total u...
[01:05:13] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:02:41] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) >>! In T269211#6978486, @razzi wrote: > @Marostegui could you expand on why the check isn't realistic? From what...
[05:03:23] PROBLEM - Check unit status of wikimedia-discovery-golden on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:51:46] good morning
[06:20:21] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:34:34] 10Analytics, 10Analytics-Kanban: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (10elukey)
[06:35:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10elukey)
[06:35:47] 10Analytics, 10Analytics-Kanban, 10Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (10elukey) a:05elukey→03None
[06:36:39] 10Analytics, 10Analytics-Kanban: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10elukey) This task should be done! Please let me know if any other step is needed :)
[06:37:27] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) We can call this done, I'll open new tasks to progress this initial work :)
[06:56:19] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Escape edit count bucket for metrics tag name [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight)
[07:00:30] (03CR) 10Thiemo Kreuz (WMDE): "Even if this patch is tiny, this is crazy command line magic I can't really review." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[07:10:10] 10Analytics-Clusters, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10elukey) I have a plan for this task, let me know your thoughts! Assumption: the SRE team decided to consider `/srv` as the canonical place where to put raid-based-volumes/partitions and...
[07:25:18] 10Analytics-Clusters, 10Analytics-Kanban: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (10elukey) `export HADOOP_CLASSPATH=/usr/share/java/apache-log4j-extras.jar:${HADOOP_CLASSPATH}` in `hadoop-env` did the trick!
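For context on that last fix: apache-log4j-extras.jar provides the org.apache.log4j.rolling package, so once it is on the HADOOP_CLASSPATH the namenode's log4j.properties can declare a rolling appender whose file name pattern ends in .gz, which makes log4j compress each rotated file. A sketch of the kind of appender definition this enables (the appender name and paths are illustrative, not the actual production config):

    # Hypothetical rolling gzip appender, usable once apache-log4j-extras.jar
    # is on the classpath. A FileNamePattern ending in .gz tells
    # TimeBasedRollingPolicy to gzip each rotated file.
    log4j.appender.RFA=org.apache.log4j.rolling.RollingFileAppender
    log4j.appender.RFA.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
    log4j.appender.RFA.rollingPolicy.ActiveFileName=${hadoop.log.dir}/${hadoop.log.file}
    log4j.appender.RFA.rollingPolicy.FileNamePattern=${hadoop.log.dir}/${hadoop.log.file}-%d{yyyy-MM-dd}.gz
    log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n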
[07:27:29] I found a need in https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/676297 to repeat a trivial transformation in a dozen different HiveQL scripts. Would it be reasonable to write a UDF for this and similar operations? Is there any precedent for doing so?
[07:36:30] Hi awight - writing UDFs is completely fine - We usually do them in 2 steps, core definition (actual computation) in refinery-source/refinery-core, and hive stuff in refinery-source/refinery-hive
[07:47:41] (03CR) 10Joal: [C: 03+1] "One nit to discuss, but non-blocking." (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[07:49:04] (03CR) 10Joal: "Another nit :)" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 (owner: 10Hnowlan)
[07:56:04] good morning elukey - would you have a minute for me this morning?
[07:57:18] 10Analytics, 10SRE, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) +1 on the approach (updating the task description for details)
[07:58:21] 10Analytics, 10SRE, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou)
[08:04:07] joal: bonjour! Sure
[08:04:46] elukey: let me know when it's easier for you :)
[08:07:44] joal: maybe in 30 mins?
[08:08:01] When you wish elukey - I have nothing before 4pm
[08:09:22] joal: Thank you, I may have UDFs to review in the future, in that case. Are the refinery UDFs already available any time Hive is used on stat* and an-launcher* by chance, perhaps by the wrapper script? Or would I have to e.g. add the jar to each tool's launch configuration such as reportupdater's?
[08:14:15] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10JAllemandou) Thank you for the explanation @ArielGlenn. Let me clarify my 2 concerns (they are minor): - job names are different for the same outpu...
[08:22:03] awight: IIRC you need to add the jar in the HQL session - This was done to avoid problems when testing new versions of stuff if defaults were included
[08:27:44] Oh that's fancy, I literally "add jar" from a script. Now I see diverse examples in `refinery`. Looking forward to tinkering!
[08:29:03] awight: in an-launcher or stat boxes deployed jars can be found at: /srv/deployment/analytics/refinery/artifacts
[08:29:17] TY
[08:30:32] awight: While you can use the `current` version of the jar (/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar for instance), we strongly encourage you to use the versioned ones (for instance /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-1.0.3.jar not sure about the paths)
[08:31:12] awight: Given that code changes sometimes, having a versioned pointer to the jar prevents unexpected post-deployment failures
[08:34:55] * awight sweats at the thought of importing 41MB of bytecode just to reuse two "replace" statements. Are there efficiencies if the same jar is loaded in many scripts...
[08:36:12] awight: in oozie we use HDFS paths for the jars (hdfs:///wmf/refinery...) - there is data movement, but at least it's distributed
[08:38:37] Well this gives me a nice option to consider, I'll try to educate myself about the trade-offs and identify what else of my team's HQL can be usefully pushed into UDFs.
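To make the two-step pattern above concrete: the actual computation lives in a plain class in refinery-core, and a thin wrapper in refinery-hive exposes it to HiveQL. A minimal sketch with hypothetical class names (the real refinery classes differ):

    // refinery-core: plain, unit-testable logic with no Hive dependency.
    package org.wikimedia.analytics.refinery.core;

    public class MetricNameEscaper {
        // Replace characters that are not valid in a metric tag name.
        public static String escape(String raw) {
            return raw == null ? null : raw.replace(' ', '_').replace('.', '_');
        }
    }

    // refinery-hive: thin Hive wrapper around the core logic.
    package org.wikimedia.analytics.refinery.hive;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    import org.wikimedia.analytics.refinery.core.MetricNameEscaper;

    public class EscapeMetricNameUDF extends UDF {
        public Text evaluate(Text raw) {
            return raw == null ? null : new Text(MetricNameEscaper.escape(raw.toString()));
        }
    }

An HQL script would then register it with ADD JAR (pointing at a versioned refinery-hive jar, as recommended above) followed by CREATE TEMPORARY FUNCTION escape_metric_name AS 'org.wikimedia.analytics.refinery.hive.EscapeMetricNameUDF'.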
[08:40:41] awight: IIUC your current need is about making sure the same code is reused in many queries
[08:46:06] joal: Yes, that's all for now. If it were some logic that couldn't be easily performed in HQL I wouldn't be considering this small matter of carbon overhead.
[08:48:32] 10Analytics, 10SRE, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10ayounsi) >>! In T279429#6976000, @ayounsi wrote: > There is also a term permitting UDP fragments, I added a "count" to know if/why we're using it. Looks like we're not. I'll remove it as well.
[08:59:41] joal: I am free if you want!
[09:00:06] elukey: in 2 mins - bathroom break first
[09:03:24] 10Analytics, 10SRE, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10elukey) @razzi this is a good task to get started with the firewall rules of our VLAN :)
[09:04:50] elukey: to the cave?
[09:08:04] sure
[09:28:47] mgerlach: heya - your job is using 4.5TB on the cluster, and half the CPUs - Can you please set limits within the usual boundaries?
[09:30:12] joal: thanks for the ping. apologies.
[09:30:16] I am checking the job
[09:31:17] I am submitting a spark-job with the following parameters spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G
[09:31:39] do you have recommendations on how to set the usual limits?
[09:32:59] mgerlach: you need '--conf spark.dynamicAllocation.maxExecutors=XX' I recommend no more than 128 executors given your settings
[09:33:10] mgerlach: see --conf spark.dynamicAllocation.maxExecutors=64
[09:33:16] mwarf - again sorry
[09:33:23] mgerlach: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Start_a_spark_shell_in_yarn
[09:35:16] joal: thanks for this. I will try your suggestion. sorry again
[09:35:26] np mgerlach, nothing broken :)
[09:36:23] joal: puh, I am always sweating from your pings : )
[09:36:45] I'm sorry for that mgerlach - I'll try to ping more regularly for chit-chat :)
[09:36:50] afraid that I might have broken something
[09:37:28] joal: but I should know this, so thanks for directing my attention towards this for better practice
[09:37:34] mgerlach: the cluster is resilient, it's more about sharing resources than broken stuff for real
[09:41:59] (03PS3) 10Hnowlan: Add makefile and dockerfile for local tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402
[09:52:29] /win 11
[09:52:32] ufff
[10:21:12] (03PS1) 10Awight: Fix date range parameters for native hive scripts [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/677511 (https://phabricator.wikimedia.org/T193169)
[10:25:07] (03PS3) 10Awight: Validate the native "hive" report type [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169)
[10:25:14] (03CR) 10Awight: Validate the native "hive" report type (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[10:35:59] * elukey lunch!
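For reference, joal's recommendation above amounts to capping Spark's dynamic allocation so a single job cannot grab half the cluster. The same cap can also be set programmatically when building the session; a minimal sketch (the app name is illustrative, and memory/cores are still best passed on the spark2-submit command line):

    import org.apache.spark.sql.SparkSession;

    public class BoundedJob {
        public static void main(String[] args) {
            // Command-line equivalent:
            //   spark2-submit --master yarn --executor-memory 8G --executor-cores 4 \
            //     --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 ...
            SparkSession spark = SparkSession.builder()
                .appName("bounded-example")
                // Cap dynamic allocation to stay within the usual boundaries.
                .config("spark.dynamicAllocation.maxExecutors", "64")
                .getOrCreate();

            // ... job logic ...

            spark.stop();
        }
    }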
[10:43:36] (03PS1) 10Hnowlan: scap: add analytics WMCS hosts instead of old deploy-prep hosts [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517
[10:57:15] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix date range parameters for native hive scripts [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/677511 (https://phabricator.wikimedia.org/T193169) (owner: 10Awight)
[11:05:53] I'm at a loss on how deployment works in the wmcs cluster - it looks like the service is actually deployed via scap but the deployment-prep scap can't sync to the analytics cluster
[11:06:11] deploy-local is breaking with a weird error I don't quite understand
[11:06:35] Getting started with refinery, but surprised to see that git-fat requires python 2.
[11:15:37] hnowlan: could this be related to the fact that the analytics wmcs cluster got wiped out, then recreated?
[11:17:08] (03CR) 10Joal: [C: 03+1] "LGTM! +1 to let you merge when you wish" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 (owner: 10Hnowlan)
[11:18:07] (03PS1) 10Awight: Remove unused jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521
[11:18:28] joal: I'm not sure - were the hosts moved from deployment-prep to the new cluster or were they recreated from scratch?
[11:19:26] I think they were recreated hnowlan - IIRC we hadn't followed up on wmcs cluster reconfiguration/cleanup, and ended up with our cluster deleted
[11:20:49] joal: ah, okay, makes sense
[11:21:00] the new hosts got a valid scap deploy in January 2020 but I dunno how :)
[11:26:09] hnowlan: I think Luca did that so that lex could test his patch about pageview-per-country
[12:27:22] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10JAllemandou) I have implemented some more logic to get the files we need, so no real need to change here. This task was more about things to keep i...
[12:27:52] hnowlan: hello! No scap deploy, we don't have any scap/deploy server available in our wmcs namespace
[12:28:13] they got deployed via scap pull when puppet ran the first time
[12:28:59] wow - thanks for explaining elukey - I was far from understanding
[12:29:38] (03CR) 10Joal: Remove unused jar (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[12:30:19] it is my bad joal, I should have added docs to wikitech, will do it later on
[12:30:27] (still have to complete that task)
[12:30:54] hnowlan: checking puppet in there to see what settings were applied
[12:32:28] so in profile::aqs we have profile::aqs::git_deploy, which defaults to false, I thought we had it enabled in wmcs
[12:32:31] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn >>! In T279055#6979924, @JAllemandou wrote: > I have implemented some more logic to get the file...
[12:34:23] *they got deployed via git pull, not scap pull
[12:36:26] (03CR) 10Elukey: "Let's remove any wmcs/labs reference, we are not in deployment-prep anymore (and we were discouraged to add another cluster in there from " [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517 (owner: 10Hnowlan)
[12:37:07] fatal: remote error: mediawiki/services/analytics/aqs/deploy unavailable
[12:37:12] ah snap
[12:38:25] that is weird, it shouldn't happen
[12:38:38] hnowlan: see service::node, we use the git deploy option in there
[12:38:46] (it is created by the aqs class)
[12:39:03] now why it adds mediawiki/services is weird
[12:42:31] define service::deploy::gitclone( String[1] $prefix = 'mediawiki/services',
[12:42:34] sigh
[12:43:52] ok hnowlan my bad, we should try to make the prefix configurable in service::node when using the git clone
[12:44:56] be back in 10 mins!
[12:45:20] 10Analytics-Clusters, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10Ottomata) +1!
[12:49:20] (03CR) 10Ottomata: Switch to eventutilities 1.0.4 (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[13:06:41] hnowlan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677569 - lemme know what you think about it
[13:06:45] elukey: ohhhh, thanks for filling me in!
[13:07:13] hnowlan: sorry for the mess, my brain was convinced about something different :(
[13:07:38] the thing that really threw me for a loop is that on the aqs hosts aqs HEAD is at scap/sync/2021-04-07/0004, and it's the same on deployment-deploy01:/srv/deployment/analytics/aqs/deploy
[13:10:22] hnowlan: I think it got deployed via git during the first puppet run :(
[13:11:34] ahhh heh - makes sense - deploy-local works but it only pulls the same version
[13:27:32] dcausse: in that code review...i think gehel is suggesting that I keep a reference to the response body InputStream somehow
[13:27:57] but, in an earlier review, you suggested that I change the constructor so it doesn't look like it is keeping e.g. HttpResponse
[13:28:15] was that just because HttpResponse wasn't actually being kept as an instance property?
[13:28:31] or because holding onto it at all would be bad?
[13:28:34] ottomata: I think gehel is suggesting to stream and do http request -> json
[13:28:45] ah, that's a bit harder with the abstraction i'm trying to make
[13:28:52] which is not possible without a major refactor
[13:29:25] hmmm actually, that could be done if we make an interface in resourceloader to get an inputstream?
[13:29:26] yep, my point is that if you go from HttpRequest -> InputStream -> String / byte[] -> Json
[13:29:45] you could actually do better by removing the String / byte[] part. Less memory allocation
[13:29:53] right that makes sense
[13:30:07] I guess BasicHttpClient should really only be used for very basic small things, so this might be ok
[13:30:09] to keep as is
[13:30:10] But as David said, that's a non-trivial refactoring, so it might not make sense to do it here
[13:30:10] if you expose the response out of the BasicHttpClient then you trust clients to properly close the response
[13:30:21] hm, right.
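A sketch of the streaming shape being discussed: go straight from the response InputStream to JSON with no intermediate String/byte[] copy, and use try-with-resources so the response is always closed, which addresses dcausse's point about trusting clients to close it. The names here are illustrative, not the actual refinery-source API:

    import java.io.IOException;
    import java.io.InputStream;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StreamingJsonFetch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Parse JSON directly off the wire; the response and its stream are
        // closed here, so callers never see a half-open connection.
        public static JsonNode getJson(String url) throws IOException {
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url));
                 InputStream body = response.getEntity().getContent()) {
                return MAPPER.readTree(body);
            }
        }
    }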
[13:30:43] or you could add JSON parsing to the BasicHttpClient (but it is then a little bit less basic)
[13:31:01] you also lose the ability to read that body multiple times, but that's probably a good thing
[13:31:04] well it's more complicated than that, ResourceLoader is abstracting away the protocol used to get the content
[13:31:07] it could also be from a file
[13:31:08] url
[13:31:11] or maybe even hdfs
[13:31:26] so, the Json parsing needs to be above the HttpClient
[13:32:12] I'll add some comments about how BasicHttpClient makes a copy of the response body in memory and should only be used for very simple http requests
[13:32:19] simple / small
[13:32:28] or maybe the shared abstraction should not be a String/byte[] but a stream
[13:32:39] that could work, but what about what dcausse said about closing the http request?
[13:33:29] it's like returning CloseableHttpResponse
[13:35:35] heya team :]
[13:35:55] or passing a reader from the client: public R get(URL resource, Function reader) {}
[13:38:12] yeah, harder to ensure it is closed properly :/
[13:38:51] Not using streams to process potentially large chunks of data is one of my pet peeves, but don't be bound by my OCDs :)
[13:39:47] well i added a note not to use it for large chunks of data so it never will be used that way now right? :p
[13:39:59] :)
[13:40:03] gehel: the other comment is the one about copying the byte[] in getBody()
[13:40:43] spot bugs was giving me http://findbugs.sourceforge.net/bugDescriptions.html#EI_EXPOSE_REP
[13:41:11] that's convoluted :)
[13:41:23] would it be better to just return the byte[] and then add a spotbugs ignore for that?
[13:42:09] I think (if possible) it would be better to directly return byte[] and not use BasicHttpResult
[13:42:15] (03CR) 10Joal: [C: 03+1] "discussion done - merge when you wish :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[13:43:18] hm
[13:43:50] I don't see how you could access the underlying char array from the String, so not sure what FB thinks this is exposing
[13:44:17] gehel: i assumed it meant one could do
[13:44:26] byte[] mybody = result.getBody();
[13:44:33] mybody[0] = 'X';
[13:44:48] and then someone else might call result.getBody() and see your modification
[13:44:55] Oh yes, exposing the byte array is an issue
[13:45:30] what I was proposing is that getBodyAsString should just be `return new String(this.body, UTF_8)`
[13:45:44] (03PS2) 10Hnowlan: scap: add analytics WMCS hosts instead of old deploy-prep hosts [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517
[13:45:46] hnowlan: puppet works now in analytics wmcs, you should be unblocked!
[13:45:55] elukey: sweet, thank you!
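On the SpotBugs point: EI_EXPOSE_REP fires because returning the stored byte[] lets callers mutate the result's internal state, which is exactly the mybody[0] = 'X' scenario above. A sketch of the defensive-copy shape being discussed (a hypothetical simplification, not the actual BasicHttpResult code):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public final class HttpResult {
        private final byte[] body;

        public HttpResult(byte[] body) {
            // Copy on the way in: later mutation of the caller's array
            // cannot change this result.
            this.body = Arrays.copyOf(body, body.length);
        }

        // Copy on the way out: callers cannot mutate our internal state.
        // This is what silences EI_EXPOSE_REP.
        public byte[] getBody() {
            return Arrays.copyOf(body, body.length);
        }

        // No extra copy needed: the String constructor already copies,
        // per the `return new String(this.body, UTF_8)` suggestion above.
        public String getBodyAsString() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }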
[13:46:05] (03CR) 10Elukey: [C: 03+1] "<3" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517 (owner: 10Hnowlan)
[13:46:07] well, you have to decide how much of an issue it is to expose the byte array, vs the cost of copying it
[13:46:12] oh i see
[13:46:19] ok i can change the asString method for sure
[13:49:26] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey)
[13:50:12] String will create its own copy anyway
[13:50:15] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) @Cmjohnson this is kind of strange, I don't see any problem reported by megacli for the BBU but I cannot enforce WriteBack on the RAID controller, as if the BBU wasn't working. Any i...
[13:50:41] we could even cache the String in a private field to not create a new copy each time, but that's probably unneeded optimization
[13:50:47] yeah that makes sense gehel i think that was just there from before i changed getBody to make a copy
[13:50:50] fixed in latest patch
[13:52:33] LGTM!
[13:59:43] dcausse: you ok to merge? would love to get this into refinery asap so it can maybe make it in our deployment train
[14:00:24] ottomata: lgtm!
[14:00:42] ty, guess i'm waiting on jenkins too
[14:00:47] thanks so much for reviews!
[14:02:25] ottomata: meeting?
[14:02:33] OH NO
[14:02:34] thank you
[14:02:35] coming
[14:03:41] !log setting profile::aqs::git_deploy: true in aqs-test1001 hiera config
[14:03:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:08:54] hnowlan: It should already be applied to the prefix aqs tab (for all nodes)
[14:09:02] (I added it 30 mins before)
[14:11:26] 10Analytics-Clusters, 10ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10Cmjohnson) @elukey that's a first! Maybe the raid bios settings are wrong?
[14:15:09] elukey: ah nice - it's not quite behaving like I'd expect but I'm looking into it
[14:29:24] (03PS2) 10Awight: Remove unused jar in oozie/cassandra/monthly/pageview_top_bycountry.hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521
[14:29:33] (03CR) 10Awight: Remove unused jar in oozie/cassandra/monthly/pageview_top_bycountry.hql (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[14:30:20] (03CR) 10Joal: [V: 03+2 C: 03+2] "Thank you :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677521 (owner: 10Awight)
[14:55:44] razzi: wanna talk gobblin?
[14:55:54] (anytime is ok, not necessarily now)
[14:59:47] (03PS6) 10Ottomata: Switch to eventutilities 1.0.5 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[15:01:24] mforns: are you doing train today?
[15:01:34] ottomata: yes
[15:01:38] awesooome just in time :)
[15:01:48] is there something you want me to deploy?
[15:01:58] just refinery source changes, i'll apply them after deploy
[15:02:06] they are in the etherpad
[15:02:10] i've already merged one
[15:02:15] waiting for jenkins on the other then will merge
[15:02:16] I'll be doing it scattered during the day, because I have this BUOD presentation meeting, but yes
[15:02:27] oooo right cool yeah am looking forward to that
[15:02:30] ok, no problemo
[15:05:19] (03CR) 10Ottomata: Switch to eventutilities 1.0.5 (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[15:14:39] milimetric: yeah, good time to talk now?
[15:14:48] sure, to the cave!
[15:15:47] ok!
[15:22:03] mforns: hola hola, we should deploy the new mediawiki snapshot config for aqs
[15:22:30] elukey: sure
[15:23:03] elukey: does this depend on any change to refinery or refinery-source?
[15:24:15] 10Analytics, 10Cloud-Services, 10Data-Persistence (Consultation): Sqoop on multi-instance clouddb1021 is very slow for some tables - https://phabricator.wikimedia.org/T279095 (10JAllemandou) 05Open→03Declined Thanks for your suggestion @Marostegui. The global drift is not big (this month took 4h more tha...
[15:24:19] mforns: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677591
[15:24:59] elukey: oh! I see, this is unrelated to the refinery deployment
[15:27:19] mforns: yes yes sorry I thought you had time now, it can be done later
[15:27:27] the new datasource is there
[15:27:34] (just checked the coordinator)
[15:27:46] I can deploy the puppet change and then run the cookbook when you are ok
[15:27:55] so we can test aqs1004 separately
[15:27:58] elukey: code looks good to me, I was going to check the data, but if you checked already, I'm all for merging!
[15:27:59] (the canary)
[15:28:23] mforns: part of the cookbook is to depool aqs1004 and wait for a manual confirmation, is it ok?
[15:28:56] elukey: yes!
[15:30:09] mforns: all right, +1 to do it now? Or do you prefer to wait?
[15:30:27] elukey: good to do it now!
[15:30:32] you wanna batcave?
[15:31:16] mforns: in here is fine if you are ok.. I just depooled aqs1004, you can test!
[15:31:24] ok!
[15:35:14] bearloga: hi! I see that the discovery golden timer is broken on stat1007, did you change anything recently?
[15:35:24] it complains about the pid python module not being available
[15:35:49] elukey: tested on aqs1004, seems to work fine
[15:36:44] mforns: ok to proceed with the rest of the nodes?
[15:36:49] elukey: yes
[15:37:43] perfect
[15:37:51] elukey: quick q for you
[15:38:01] I just noticed in https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m that the 2G cache on the historicals is filled for public
[15:38:09] elukey: how can I check IOs on a linux machine in realtime?
[15:38:38] joal: for a specific process or cluster wide?
[15:38:40] err host wide?
[15:38:53] elukey: single host
[15:39:02] elukey: stat1008 is very unresponsive
[15:39:40] ok elukey: grafana tells me it's disk
[15:40:32] hm
[15:40:34] I was about to suggest iotop
[15:40:34] elukey: isn't the cache supposed to be filled all the time?
[15:40:54] not available on stat1008 elukey :/
[15:41:13] mforns: yes yes it is nice, the hitrate is ~0.85 so I am wondering if adding more RAM for it (since it is heap based) could make it better
[15:41:19] elukey: hi! that must explain https://phabricator.wikimedia.org/T279443
[15:41:44] bearloga: yes! :D
[15:42:17] joal: yeah it requires sudo :(
[15:42:23] hm
[15:42:34] elukey: would you sudo me a favor?
[15:43:45] joal: so there is somebody logged as root doing a find in /srv/home that is quite intensive
[15:43:59] meh
[15:44:10] ok
[15:44:31] and also mgerlach's python3 script seems heavy
[15:44:35] "would you sudo me a favor?" xDDDDDD
[15:44:37] elukey: question for you and mforns. if a kerberos systemd unit in puppet runs a script as a system user ('analytics-search' in this case) which calls reportupdater, where is the python environment?
[15:45:21] bearloga: in theory it should be added via PYTHONPATH or something similar
[15:45:30] mforns: deployed!
[15:46:56] joal: ah so we have one cron that does
[15:46:57] /usr/bin/find /srv/home -type d -regex "/srv/home/.+/\.local/share/Trash" -exec rm -rf {} >/dev/null 2>&1 \;
[15:46:58] elukey: so theoretically I would create a venv that has everything reportupdater needs (as declared in requirements.txt) and then could set PYTHONPATH in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/statistics/manifests/discovery.pp ?
[15:48:00] elukey: I'm seeing that with 1G of ram consumption, the hit ratio is already 70%, and when the cache reaches 2G, the ratio is 81%. From there on, even if there's no increase in cache size, the ratio continues to grow (83%); this might be due to the cache optimizing the most accessed paths.
[15:48:05] bearloga: in theory the venv should be shipped with the repo if needed, but not sure what mforns did in the past for RU
[15:48:12] elukey: I assume the script stopped - perf is back to usual
[15:49:15] mforns: yep yep, but the trend seems to be sitting at 0.85 after a month, and there are evictions, this is why I thought about increasing the cache a little (maybe it could lead to zero evictions and more hits)
[15:49:28] joal: yes, we need to make that find better sih
[15:49:30] *sigh
[15:50:03] elukey: no big deal - just mentioning - maybe we could change the execution time for that thing to a moment when fewer people use the tools?
[15:50:44] elukey: aha, I see the evictions, sure! It might be worth increasing the heap
[15:51:08] joal: yes yes, I think that we could also make it smarter, it is a little dumb since it removes files instead of the whole dir (guess who's the one to blame?)
[15:51:38] sorry s/files/dirs
[15:53:59] mforns: what was your solution to venvs for reportupdater?
[15:54:06] elukey, bearloga: I never had to use venvs with reportupdater, I avoided using any library within reportupdater that was not already part of the stats machines' environments...
[15:54:12] mforns: you haven't run the train yet right? :D
[15:54:22] fdans: no train yet :]
[15:54:26] elukey, klausman: that could be of interest - https://medium.com/riselab/feature-stores-the-data-side-of-ml-pipelines-7083d69bff1c
[15:54:45] Havin' a look-see
[15:54:51] bearloga: we could add the venv to the repo, as elukey suggests?
[15:54:53] bearloga: it would be good if we could deploy the venv (a frozen one basically) as part of the discovery golden scap repo (if it is deployed via scap)
[15:55:04] and reference it in the timer
[15:55:14] klausman, elukey: non-technical but I think it puts high-level concepts in a nice way
[15:55:34] fdans: do you want me to deploy sth?
[15:55:36] > Unfortunately, many feature stores being built today are Frankensteinien amalgamations of batch, streaming, caching, and storage systems.
[15:55:44] Ain't that the truth for many things
[15:56:00] let's write our own feature store!
[15:56:02] klausman: I have no idea what you could possibly mention
[15:56:09] * joal looks away
[15:56:18] I propose nodjs
[15:56:20] *nodejs
[15:56:46] joal: can you join #wikimedia-ai ?
[15:56:47] * joal wonders about feeding the troll
[15:56:59] elukey: the deployment is just a clone of the git repo on gerrit https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/golden/
[15:57:01] people will be interested in there too :)
[15:57:02] sure elukey
[15:57:46] mforns: just start a job, I added the command in the etherpad
[15:57:56] elukey: I think it would be better to use a mixture of Haskell and Ocaml
[15:58:03] fdans: will do!
[15:58:03] bearloga: ah yes then seems fine as well, if we add a "venv" or similar dir in there with PYTHONPATH it should work in theory
[15:58:17] thank youuuu
[15:58:19] klausman, elukey: scalajs is what you're after, best of both worlds
[15:58:23] ahahahah
[15:58:35] joal: wait, I want that
[15:58:51] * joal has failed to refrain from feeding the troll
[15:59:02] oh wait
[15:59:09] this is to materialize scala code into js
[15:59:13] I don't want that
[15:59:22] :D
[15:59:24] I want the opposite thing
[15:59:43] how dare they call something .js when no javascript is written
[16:00:40] elukey: got it! thanks, I'll document your recommendations in the ticket.
[16:00:42] is stat1008 back to normal? i get a 500 on newpyter
[16:01:01] fkaelin: working on it
[16:01:06] fkaelin: it's been under IO pressure - back to normal for me
[16:01:10] no idea why it worked before and not now??
[16:01:15] no there is another problem
[16:01:22] https://phabricator.wikimedia.org/T279480
[16:01:24] Arf ok sorry
[16:01:52] elukey: looking at aqs, I see the change in hit ratio from 70% to 83% has not improved the response time percentiles significantly.
[16:02:40] mforns: let's wait a bit for things to settle, it may take a bit of time, but it is a good point to check that relationship!
[16:02:51] bearloga: elukey, you might be able to get away without setting PYTHONPATH, if you just launch python out of your venv
[16:03:10] ottomata: yes true!
[16:03:27] fdans: standup?
[16:03:35] sorry
[16:03:42] also, would it be easy to upgrade the node version installed on the stat machines? Node version v6.13.1 < 10.12.0 when using vim plugins
[16:04:21] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10EYener) Update for those on this task: All three tables have been whitelisted and are live in the event_sanitized db: event_s...
[16:06:12] thanks ottomata, will have a look at the phab
[16:06:25] fkaelin: we have 10.23.1~dfsg-1~deb10u1 installed on the stat100x hosts, no idea where 6.x comes from
[16:07:31] fkaelin: node is in conda, so if you use your conda env, you can install whatever version you want! :)
[16:08:21] ai, thanks.
[16:27:58] 10Analytics, 10Analytics-Wikistats: New Wikivoyages are only partially included in Stats - https://phabricator.wikimedia.org/T279564 (10KuboF)
[16:28:40] hey a-team: do you have any standard tools to compact data in HDFS?
I found spark-compaction on GitHub, but thought I'd check in and see if you have other recommendations before testing it
[16:30:00] to make it easier, spark-compaction is this one: https://github.com/KeithSSmith/spark-compaction
[16:35:38] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 2 others: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (10Ottomata)
[16:47:34] ottomata, razzi let me know if https://gerrit.wikimedia.org/r/c/operations/puppet/+/677576 is ok for you later on
[16:47:38] if it is, I'll merge it tomorrow
[16:47:56] (I should be able to get the /srv/ logging config working too)
[16:53:12] going afk earlier today, ttl folks!
[17:32:28] 10Analytics: Review request: New datasets for WMCZ published under analytics.wikimedia.org - https://phabricator.wikimedia.org/T279567 (10Urbanecm)
[17:44:57] the wmcs analytics cluster seems to have no users configured
[17:45:22] oh also the aqs service has no password configured either heh, might be something to fix tomorrow
[18:26:31] ottomata: are there any workarounds for the Newpyter issue or should i move notebooks i need to run from stat1008 to a different machine in the meantime and use SWAP?
[18:32:14] isaacj: i'm really sorry, i have no idea why this all of a sudden started happening, i'm trying to get it fixed asap. got one coming i think, but building a new anaconda-wmf takes a bit
[18:32:21] no workaround. :/
[18:32:54] ok -- just wanted to make sure there wasn't a quick fix. i can wait a few hours so i'll hope for that to work. thanks!
[18:33:14] ok thanks, again sorry about this.
[18:38:50] 10Analytics, 10Analytics-Kanban: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) FYI, I am working on this now and hope to get it fixed within a few hours. "hope" :)
[18:46:34] stat1008 is really difficult to work with currently
[18:48:37] joal: it seems kind of ok to me? at least the shell is responsive
[18:48:50] joal: sorry joal, this could have been me. I killed the process
[18:49:25] still trying to understand what the problem is. do you have any insight into what could be the cause?
[18:49:29] not sure if it was you mgerlach, but my processes go a lot faster :)
[18:49:44] mgerlach: not without more info :)
[18:50:32] mgerlach: tell me more about what you're doing
[18:51:14] joal: I am running a package called wikipedia2vec https://github.com/wikipedia2vec/wikipedia2vec/blob/master/docs/commands.md
[18:52:09] the input is one of the dump-files (in this case I am trying to work with enwiki)
[18:52:39] I restrict it to 8 cpus
[18:53:05] mgerlach: you could add 'ionice' before your command
[18:53:32] mgerlach: not sure it'll solve the issue, but it could help
[18:53:52] mgerlach: 8 CPUs is not a lot for that host
[18:55:27] 10Analytics-Clusters, 10Analytics-Kanban: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10razzi)
[18:55:32] full command is this: wikipedia2vec train --min-entity-count=0 --dim-size 50 --pool-size 8 "/mnt/data/xmldatadumps/public/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
[18:56:07] joal: so you suggest calling "ionice wikipedia2vec ..." instead?
[18:56:26] yes mgerlach - ottomata can you confirm this looks ok?
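On the HDFS compaction question above: tools like spark-compaction essentially read a fragmented directory, repartition it down to a target number of files, and write it back, which is simple enough to do directly in Spark. A minimal sketch (paths and the target file count are illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CompactSmallFiles {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("compact-small-files")
                .getOrCreate();

            // Read the many small files and rewrite them as a handful of
            // larger ones. coalesce() avoids a full shuffle; use
            // repartition() instead if you need evenly sized output files.
            Dataset<Row> data = spark.read().parquet("hdfs:///path/to/small-files");
            data.coalesce(8)
                .write()
                .mode(SaveMode.Overwrite)
                .parquet("hdfs:///path/to/compacted");

            spark.stop();
        }
    }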
[18:56:54] 10Analytics-Clusters, 10Analytics-Kanban: Upgrade furud/flerovium to Debian Buster - https://phabricator.wikimedia.org/T278421 (10razzi)
[18:58:14] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10razzi)
[19:07:49] mgerlach: no ottomata around, let's try with the ionice param :)
[19:08:12] thanks joal: let me try that. I hope it helps
[19:10:33] ok gone for tonight
[19:22:01] hi!
[19:22:02] sorry
[19:28:02] (03CR) 10Ottomata: [C: 03+2] Switch to eventutilities 1.0.5 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse)
[19:34:52] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, 10Product-Analytics (Kanban): [MEP] [BUG] Timestamp format changed in migrated client-side EventLogging schemas - https://phabricator.wikimedia.org/T277253 (10kzimmerman) 05Open→03Resolved
[19:43:06] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, 10MW-1.36-notes (1.36.0-wmf.35; 2021-03-16): Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Krinkle) You can and should d...
[19:53:00] !log upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
[19:53:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:53:03] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[19:53:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) Unreverted and merged the puppet fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677403
[19:55:35] * razzi out for a walk
[20:15:33] !log rebalance kafka partitions for webrequest_text partitions 15, 16
[20:15:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:28:53] heya team starting deployment train :]
[20:29:54] yeehaw
[20:49:48] (03PS1) 10Mforns: Update changelog.md for v0.1.4 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/677651
[20:50:31] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging for deployment train." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/677651 (owner: 10Mforns)
[20:51:49] (03CR) 10Bstorm: [C: 03+1] "I did it one better. I built a stretch vagrant box and tested the entire cycle." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[20:57:37] (03CR) 10Bstorm: [C: 03+2] Expand dbname validation regex [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[21:00:17] (03Merged) 10jenkins-bot: Expand dbname validation regex [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/676846 (https://phabricator.wikimedia.org/T278715) (owner: 10BryanDavis)
[21:09:36] 10Quarry, 10Patch-For-Review, 10User-bd808: Database name check excludes valid names (like nds_nlwiki) - https://phabricator.wikimedia.org/T278715 (10Bstorm) 05Open→03Resolved a:03bd808 https://quarry.wmflabs.org/query/53945 Fixed by @bd808's patch after I deployed it.
[21:19:15] Noticed this alert: Icinga/DPKG "DPKG CRITICAL dpkg reports broken packages" on stat1008 and an-worker1100, looking into it
[21:20:06] Ok, looks like this is related to another alert, Puppet has 1 failures.
Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[anaconda-wmf], guessing this is ottomata ?
[21:23:43] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) > I'm so sorry, is this stuck on me? No worries -- as you said, privacy review is still ongoing. FYI @JFi...
[21:23:54] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.4 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677660
[21:24:47] Going to try to reinstall anaconda-wmf on stat1008
[21:25:04] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging for deployment train." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/677660 (owner: 10Maven-release-user)
[21:25:08] !log sudo apt-get install --reinstall anaconda-wmf on stat1008
[21:25:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:46] Looks like there is some apt process running holding the apt lock
[21:26:35] !log deployed refinery-source v0.1.4
[21:26:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:39:32] Ok, the apt install finished by itself and anaconda-wmf is installed as intended
[21:42:16] oh hey sorry missed ping
[21:42:20] was afk, that install was taking a while
[21:42:44] razzi: yeah i think the install for anaconda-wmf takes so long that the alert fires
[21:43:26] strange, I wonder why it takes so long ottomata
[21:43:32] it is 2.2G
[21:44:58] !log starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
[21:44:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:46:37] isaacj: fkaelin Urbanecm jupyterhub-conda should be fixed on stat1008
[21:46:45] and others as well, but not all quite yet.
[21:46:54] there is actually a small bug still that will affect first-time users creating a new conda env
[21:46:57] like you Urbanecm :/
[21:47:00] am working on that now
[21:47:14] so I'm not supposed to try it yet? :D
[21:48:20] heh, i guess not?
[21:48:37] ok :)
[21:57:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (10Ottomata) Getting the stat boxes back in place. stat1008 and stat1004 are good to go. Along the way, I introduced another small bug for users that are creating new...
[22:39:55] !log deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
[22:39:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:40:30] mforns: was that on all hadoop test hosts?
[22:40:35] razzi: you still there? I had a small issue with the deployment train, the scap deployment to hadoop-test failed with a permissions error (see above)
[22:40:43] ottomata: oh! you still there
[22:40:47] I am here too :)
[22:40:48] ya still here!
[22:40:51] ottomata: no, only one
[22:40:56] which one?
[22:43:17] ottomata: I couldn't tell :[, actually both were listed in green before the error message appeared
[22:43:42] ottomata: want me to retry?
[22:44:19] mforns: both?
[22:44:22] an-test-client1001
[22:44:25] what's the other one?
[22:44:47] ottomata: I think it was an-test-master1001, but not sure
[22:45:10] I closed the screen and lost the output
[22:45:16] retry?
[22:46:03] lemme look
[22:46:20] hm no it isn't deployed there
[22:46:22] i guess retry mforns
[22:46:29] ok, doing
[22:47:00] !log finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
[22:47:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:48:33] ottomata: it was an-test-coord1001.eqiad.wmnet
[22:48:57] ottomata: rolling back again
[22:49:05] huh
[22:49:24] alluxio!!
[22:49:44] the files are group-owned by alluxio
[22:49:59] mforns: looks like we need to ask luca
[22:50:21] ottomata: ok, luckily the rest of the deployment is fine
[22:50:22] seems
[22:50:25] aye
[22:50:25] nice
[22:51:17] !log installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
[22:51:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:51:20] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[22:52:53] razzi: FYI am doing an anaconda-wmf installation again, dunno if that alarm will fire or not
[22:53:09] ok cool ottomata
[22:56:26] bye ottomata and razzi :] see ya tomorrow
[22:56:31] byeaaa
[23:02:16] Urbanecm: ok! try now?
[23:02:31] ottomata: trying. any stat box?
[23:02:34] yup
[23:03:45] !log installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
[23:03:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:03:48] T279480: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480
[23:05:48] ottomata: looks much better so far https://usercontent.irccloud-cdn.com/file/hENlDSfT/image.png
[23:06:59] good stuff, that shouldn't take too long
[23:07:05] a minute or 2 when you create a new env
[23:07:48] i guess I'm in ottomata! https://usercontent.irccloud-cdn.com/file/PaVbhIgA/image.png
[23:07:57] you're in!
[23:08:08] thx for the fix
[23:08:17] ya thanks for the report!
[23:08:28] do i need to somehow stop the server when i stop working?