[00:54:49] Analytics, GLOW: Optimization tips and feedback - https://phabricator.wikimedia.org/T245373 (Iflorez)
[03:01:52] Analytics, Event-Platform, serviceops, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Krinkle)
[05:29:08] nuria: Sorry for the super late notice, won't be able to make standup tomorrow due to an early morning flight. Will be at Tuesday's standup
[05:31:27] nuria: tomorrow as in Monday
[06:22:13] Analytics: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (Quiddity)
[06:22:37] Analytics: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (Quiddity)
[06:22:59] Analytics: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (Quiddity)
[06:27:53] Analytics, Event-Platform, serviceops, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Joe) The number of errors has gone up during the weekend, making this even more absurd. This is the debug log for a session that fails:...
[06:45:47] Quarry, DBA, Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Marostegui) >>! In T246970#5969163, @Jdx wrote: > @zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects? It cou...
[07:08:31] Amir1: re: stat1005, yes :(
[07:21:45] Analytics, Event-Platform, serviceops, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Joe) This seems to be a recurring issue with envoy and some upstream applications, see for instance https://github.com/envoyproxy/envoy/...
[07:59:57] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (Pginer-WMF) a:santhosh
[09:20:04] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (santhosh) >>! In T247245#5964528, @elukey wrote: > @santhosh stat1008 is ready, you can ssh to it and copy your stat1006'...
[09:41:11] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (elukey) >>! In T247245#5971389, @santhosh wrote: >>>! In T247245#5964528, @elukey wrote: >> @santhosh stat1008 is ready,...
[10:29:46] Analytics, Event-Platform, serviceops, Patch-For-Review, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (akosiaris) >>! In T247484#5971229, @Joe wrote: > This seems to be a recurring issue with envoy and some upstream a...
[11:06:51] Analytics, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (elukey)
[11:26:21] Analytics, DC-Ops, Operations, netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (elukey) I had a chat with Arzhel today and we didn't find a lot. From his perspective, it seems that something in the middle between the switch and stat1005 is not worki...
[11:44:11] git status
[11:44:15] oops
[11:44:20] heya teamm :]
[11:49:11] hello :)
[11:51:03] * elukey lunch!
[12:07:52] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (fgiunchedi) >>! In T238658#5964111, @Ottomata wrote: >> Perhaps using a single metric name e.g. 'express_router_request_durat...
[12:20:26] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (santhosh) Ok, I think I misunderstood it. Sorry. I thought we can install intel MKL and reimage after testing. // havin...
[12:46:14] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (nshahquinn-wmf)
[12:52:15] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (nshahquinn-wmf) I remember @SNowick_WMF also experienced this when she first started using SWAP, and she may still be using the `~/venv/bin/pip` workaround. S...
[13:11:07] ༼ຈل͜ຈ༽ノ
[13:11:14] :D
[13:11:17] good morning
[13:11:36] Hi folks :)
[13:41:50] mforns: you there?
[13:42:13] heya elukey
[13:47:53] mforns: qq - did we try to remove your venv on stat1008 before doing the test?
[13:48:06] elukey, no I didn't
[13:48:16] should I?
[13:48:33] mforns: if you have time can you try later on? Might not be it but just want to rule out all possibilities
[13:53:10] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (Ottomata) > service from service-runner isn't very useful as a tag and we could key dashboards on app instead Ya makes sense,...
[13:53:33] Analytics, Analytics-Kanban, Research, User-Elukey: Add SWAP profile to stat1005 - https://phabricator.wikimedia.org/T245179 (elukey) Marcel experienced a problem with the Spark Yarn kernel, namely the same thing reported in https://issues.apache.org/jira/browse/TOREE-485. It seems a problem with...
[13:55:39] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (Ottomata) > While things do indeed look way better, the memory leak is most certainly still there. Indeed. The code is sligh...
[13:56:32] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (akosiaris) > That's why I mentioned that IMHO `service` from service-runner isn't very useful as a tag and we could key dashb...
[14:01:36] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (Ottomata) > service isn't from service-runner. It's from the sidecar prometheus-statsd-exporter in most services, eventstream...
[14:04:22] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (elukey) @santhosh the suggestion came from Faidon and Moritz, in two parts: 1) Faidon's point was that it doesn't seem w...
[14:06:57] Analytics, ContentTranslation, Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (elukey) >>! In T247245#5971942, @santhosh wrote: > Are we expecting that openBLAS will perfrom better in stats1008 than s...
[14:09:12] mforns: I am going to restart jupytherhub to allow your venv creation
[14:09:15] ok?
[14:09:43] done, created
[14:09:58] now you should be able to restart your notebook
[14:27:30] (PS1) Elukey: Update kernel's README to match last changes in Toree kernels [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/579956 (https://phabricator.wikimedia.org/T245179)
[14:32:58] joal: not sure if it makes sense, tried to start the conversation --^
[14:34:50] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (akosiaris) >>! In T238658#5972100, @Ottomata wrote: >> service isn't from service-runner. It's from the sidecar prometheus-st...
[14:35:47] elukey, thanks
[14:37:23] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (akosiaris) >>! In T238658#5972092, @Ottomata wrote: >> While things do indeed look way better, the memory leak is most certai...
[14:38:18] elukey, no luck: The kernel for Untitled1.ipynb appears to have died. It will restart automatically.
[14:38:41] mforns: yep let's restart all to be sure
[14:38:51] elukey, the error is different now
[14:39:08] Socket is not alive to be able to send messages!
[14:39:13] buuuu
[14:39:16] ack thanks :)
[14:41:40] elukey, no, I created a new notebook with the same scala spark yarn engine and it gives the same error as before...
[14:42:04] org.zeromq.ZMQException: Errno 48 : Address already in use
[14:42:17] same bug yes, I need to redo the kernels
[14:47:33] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (fgiunchedi) Thanks for the context on `service` @akosiaris , now it is much more clear in my mind what the status quo is. In...
[15:04:14] nuria: fdans standup?
[15:04:34] ottomata: cannot do standup today, will be off until 1pm
[15:04:38] k
[15:09:46] sorry team - 2 workers at home is difficult
[15:16:15] Analytics, Event-Platform, serviceops, Patch-For-Review, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Joe) a:Joe
[15:23:51] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (Ottomata) > I just logged into stat1004 to check out its new JupyterHub installation. Did you do this via JupyterHub in the browser, or just in the terminal v...
[15:32:05] mforns: back to the cave about backfilling?
[15:32:34] joal, yes, gimme one sec, internet hiccup
[15:32:38] sure
[15:32:39] np
[15:34:12] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (nshahquinn-wmf) >>! In T247752#5972405, @Ottomata wrote: >> I just logged into stat1004 to check out its new JupyterHub installation. > Did you do this via Ju...
[15:45:07] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (elukey) @nshahquinn-wmf if you have a Jupyter terminal handy, what does the following command say `which pip3`?
[15:49:05] elukey: shall I just update your patch with my typo, and create a new one about jars?
[15:49:12] or do everything at once?
[15:50:40] (PS2) Joal: Make kernel README and definition consistent [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/579956 (https://phabricator.wikimedia.org/T245179) (owner: Elukey)
[15:50:57] elukey: Just pushed a minimal change (the typo I mentionned) ---^ Will do a new patch for jars
[15:52:05] joal: very nice
[15:52:27] joal: how did you guys deployed the last change in kernels that you did?
[15:52:51] elukey: ottomata did that - It's kinda magic to me :S
[15:52:51] IIRC the kernels are copied over to all the venvs
[15:52:56] ah ok
[15:53:32] joal: shall I merge the change? It looks very good
[15:53:39] yessir
[15:53:43] Analytics, GLOW: Optimization tips and feedback - https://phabricator.wikimedia.org/T245373 (Iflorez) @JAllemandou I tried running these spark queries over the weekend on a small batch of articles and they timed out. Might you have tips or insights? I didn't receive any error messages, simply the querie...
[15:53:47] elukey: will come back with a new patch
[15:53:52] (CR) Elukey: [V: +2 C: +2] Make kernel README and definition consistent [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/579956 (https://phabricator.wikimedia.org/T245179) (owner: Elukey)
[15:56:55] Analytics, GLOW: Optimization tips and feedback - https://phabricator.wikimedia.org/T245373 (JAllemandou) Hi @Iflorez - This kinda feels like Kerberos. Can you confirm you have run `kinit` and entrered your password in a notebook-terminal (see https://wikitech.wikimedia.org/wiki/SWAP#Kerberos).
[16:03:22] > IIRC the kernels are copied over to all the venvs
[16:03:22] the kernels are copied to a shared path that all venvs use
[16:03:28] /usr/local/share maybe?
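
The open question of where the shared kernels actually live can be checked from a venv rather than guessed. A minimal sketch, assuming the `--json` flag of `jupyter kernelspec list` and a venv-relative `venv/bin/jupyter` path; the function names `kernel_dirs` and `list_installed_kernels` are hypothetical helpers, not part of the jupyterhub deploy repo:

```python
import json
import subprocess


def kernel_dirs(kernelspec_json: str) -> dict:
    """Map kernel name -> resource directory from `jupyter kernelspec list --json` output."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    return {name: info.get("resource_dir", "") for name, info in sorted(specs.items())}


def list_installed_kernels(jupyter_bin: str = "venv/bin/jupyter") -> dict:
    """Run the venv's jupyter and return the kernels it can see."""
    # Assumption: jupyter_bin points at the jupyter executable inside the user's venv.
    result = subprocess.run(
        [jupyter_bin, "kernelspec", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return kernel_dirs(result.stdout)


if __name__ == "__main__":
    for name, path in list_installed_kernels().items():
        print(f"{name}\t{path}")
```

Kernels whose directory sits outside the venv would be the shared ones, so a `git pull` plus an update of that shared path should be enough to roll out a kernel change.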
[16:03:38] ahhhh okok didn't get it
[16:03:58] makes sense yes, so basically it is just a matter of git pull + update those
[16:06:08] (PS1) Joal: Add dependency to Vegas (charts drawing) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/580077
[16:13:18] Analytics, GLOW: Optimization tips and feedback - https://phabricator.wikimedia.org/T245373 (Iflorez) Thank you. Yes, I can confirm that I had run kinit and entered my kerberos credentials in a notebook-terminal.
[16:22:02] ya
[16:22:35] elukey: l if you do a venv/bin/jupyter kernelspec list
[16:22:40] it will show you all the installed kernels
[16:23:34] ottomata: TIL
[16:23:56] I'll update the docs in case something is missing
[16:24:05] so I learn in the process :)
[16:27:24] ah joal - did you see that all the yarn node managers are running g1 ?
[16:27:34] Ahhh ! No I had not :)
[16:27:58] elukey you're great :) Thanks a lot! I'm gonna try some heavy soon :)
[16:28:30] one good thing is https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&fullscreen&panelId=17&from=now-7d&to=now
[16:28:42] I am so happy about it
[16:28:55] ottomata: --^
[16:29:06] Nice :)
[16:29:14] the gc timing are more or less the same
[16:29:23] I've set ~400ms as "suggested deadline"
[16:29:24] I was looking at that
[16:29:50] and I want to rise the one for Namenode, it is 400ms now that is too tight I think
[16:29:59] something like 1s is more appropriate
[16:31:34] joal: I also added UseStringDeduplication, that might have played a role
[16:31:48] hehe - nice catch
[16:32:11] at this point I am inclined to open a "G1 everywhere" task
[16:38:58] Kudos elukey from Product-Analytics about the move to bigger machines for notebooks
[16:40:49] ottomata, remember there was something odd the other day when deploying refinery source?
[16:41:31] I just tried to launch an oozie job and it failed because the new jar (v0.0.118) is not there in hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/
[16:41:40] joal: <#
[16:41:42] <3
[16:45:01] hm, wait
[16:46:28] ottomata, my fault, maaan...
[17:05:10] !log roll restart of hadoop namenodes to get the new GC setting (MaxGCPauseMillis 400 -> 1000)
[17:05:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:05:53] 2020-03-16 17:04:48,209 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 40443577 INodes
[17:05:59] 40M!!
[17:07:14] Analytics, Analytics-SWAP, Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (nshahquinn-wmf) >>! In T247752#5972722, @elukey wrote: > @nshahquinn-wmf if you have a Jupyter terminal handy, what does the following command say `which pip3...
[17:15:20] elukey: just saw your email to nathante about rsyncing notebook home
[17:15:32] i think your suggested commands as run might cause some issues? not sure.
[17:15:39] e.g. you need rsync -r (or -a) for the copy
[17:15:49] and asking someone to rm -rf their home sounds a little harsh :p
[17:20:35] ottomata: yes I followed up with him with some adjustments, but there were some perm issues
[17:20:40] ottomata: why harsh?
[17:21:05] I asked if he could move to stat100X, and free space on notebooks since there is not enough space
[17:21:28] I am not following
[17:22:51] anyway, I need to follow up with him for the perms issues, will do it later on
[17:27:54] (PS1) Joal: Add refinery jars tp spark scala and sql kernels [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/580083
[17:28:42] (PS2) Joal: Add refinery jars tp spark scala and sql kernels [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/580083
[17:34:28] ottomata: one question for the rsyncs - nathan reported that some files yield a perm denied, and he is right, since 'nobody' can't read those
[17:34:42] is there a workaround to this problem?
[17:35:47] probably using uid 0
[17:36:06] but it is not really great probably
[17:36:17] (even without probably)
[17:38:41] Analytics, Better Use Of Data, Desktop Improvements, Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (Ottomata) The SRE Observability folks are about to [[ https://docs.google.com/document/d/1HYHCPvuz93nAYX...
[17:39:33] elukey: hm, i mean rm -rf /home/username also removees ssh keys
[17:39:38] sure puppet will re create them
[17:39:40] buuuuut
[17:39:43] still
[17:40:23] ok good point, I'll try to be more precise next time
[17:42:35] I am still a bit puzzled about the rsync
[17:42:42] not sure if I am missing something or not
[17:43:17] but if rsyncd runs as 'nobody' then not all files can be copied from one host to the other one
[17:43:46] (like the ones with x00 )
[17:49:54] ah yes it is on purpose, reading the old tasks
[17:53:20] they have to rsync pull, right?
[17:53:41] biking home, will be back in a bit!
[17:54:50] ottomata: yes I mean when you rsync from say notebook to stat
[17:55:08] https://phabricator.wikimedia.org/T205157#4790176
[18:01:51] ok going to check later when you are onlinez, be back in a bit
[18:12:28] Analytics, GLOW: Optimization tips and feedback - https://phabricator.wikimedia.org/T245373 (Iflorez) In an effort to run these queries from a Python3 notebook without needing to change the notebook type, I've switched these queries to run as spark queries using the wmf data package's spark.run function....
[18:30:29] !log Deployed refinery using scap, then deployed onto hdfs
[18:30:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:40:27] ah elukey i see ya
[18:40:29] hmmmmmmMMM
[18:40:56] that willl make things different if we chgrp a-pd-users + 770 all homedirs
[18:41:03] Analytics, Analytics-Kanban, ArticlePlaceholder, Wikidata, and 4 others: ArticlePlaceholder dashboard stopped tracking page views - https://phabricator.wikimedia.org/T236895 (mforns) We've deployed the patch, now. It has already started to crunch data starting at 2020-01-01. It will take a couple...
[18:57:37] ottomata: yes exactly
[18:58:14] that's tricky
[19:08:26] ottomata: I think that kerberos might play a role in the future, namely if a user provides a valid ticket then it can do anything as that user on the remote host
[19:09:03] but in theory now we don't auth anything
[19:09:59] for these situations we could write a script to use as helper, not neat but helpful
[19:10:50] something that on the remote host 1) saves the list of files/dir without go+r 2) chmods o+rx if needed
[19:10:56] (to be executed manually)
[19:11:06] and then does the reverse on the local
[19:11:30] what do you think?
[19:17:42] Analytics, Analytics-Kanban, Release Pipeline, Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (colewhite) >>! In T238658#5972237, @akosiaris wrote: > Sure it might very well be. I am fine with dropping it from statsd-exp...
[19:23:45] * elukey off!
[19:23:58] oof could work
[19:23:58] hm
[19:23:59] but
[19:24:15] maybe errrrrr maybe we should not do the chmod stuff afterall?
[19:24:19] not sure
[19:25:42] ottomata: yeah I am not happy either but if people want to copy their stuff around completely we'd need it
[19:25:56] ah wait the chmod stuff you mean locking down homes?
[19:26:05] I do think we should do it..
[19:26:24] I'll talk with Moritz tomorrow about our use case
[19:26:29] let's see if anything can be done
[19:26:36] yes
[19:26:50] going to dinner, o/
[19:26:55] i mean that, i mean, if we forced thin clients only, we could get away with this because all files wouldl be in hdfs
[19:27:02] or if we had a dist filesystem
[19:27:04] OK let's discuss later!
[19:40:13] Analytics, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (wiki_willy) a:Christopher→Cmjohnson
[19:43:08] mforns: for when you're around - You should update your pattern for oozie-jobs-start adding `-Dqueue_name=production -Doozie_launcher_queue_name=production`
[19:43:31] oh joal, sorry, my head is in the moon
[19:43:34] !log Kill-restart wikidata-articleplaceholder_metrics-coord to fix yarn queue
[19:43:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:43:38] I always do that, but today...
[19:44:09] np mforns :) I have pre-written patterns for restarts, that's why I mentionned updating patterns :)
[19:44:33] joal, me too, I have all my non-trivial commands written in my notes
[19:44:42] I just copy pasted the wrong one
[19:44:51] :)
[19:44:58] thanks for spotting this
[19:45:15] np - Thanks for the deploy!
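
The rsync helper proposed at 19:09-19:11 (record which files lack read access for others, open them up for the copy, then put the original modes back) could look roughly like the following. This is only a sketch under stated assumptions: the function names `open_for_others` and `restore_modes` and the JSON state file are hypothetical, nothing like this was agreed in the channel, and temporarily making a home directory world-readable carries the privacy risk the discussion is worried about.

```python
import json
import os
import stat
from pathlib import Path


def open_for_others(root, state_file):
    """Add o+r (and o+x on directories) under root, saving the original modes."""
    root = Path(root)
    saved = {}
    # Build the full list first so that chmod calls do not affect the walk.
    for path in [root, *root.rglob("*")]:
        if path.is_symlink():
            continue  # symlink modes are not meaningful; skip them
        mode = stat.S_IMODE(path.lstat().st_mode)
        wanted = mode | stat.S_IROTH | (stat.S_IXOTH if path.is_dir() else 0)
        if wanted != mode:
            # Paths are stored relative to root so the same state file can be
            # replayed on the destination host after the copy.
            saved[str(path.relative_to(root))] = mode
            os.chmod(path, wanted)
    Path(state_file).write_text(json.dumps(saved))
    return saved


def restore_modes(root, state_file):
    """Reapply the saved modes under root; returns how many paths were reset."""
    saved = json.loads(Path(state_file).read_text())
    # Deepest paths first, so parent directories are locked down last.
    for rel, mode in sorted(saved.items(), key=lambda kv: kv[0].count(os.sep), reverse=True):
        os.chmod(Path(root) / rel, mode)
    return len(saved)
```

Run manually, `open_for_others` would go before the rsync pull on the source host and `restore_modes` afterwards on both source and destination, which is the "does the reverse on the local" step.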
[19:45:32] I also confirm analytics user is not usable anymore from stat1004 :)
[22:42:36] (PS1) Nuria: Do not count pages that are just redirects [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/580136 (https://phabricator.wikimedia.org/T247101)
[22:43:42] Analytics, Analytics-Kanban, Patch-For-Review, Product-Analytics (Kanban): SQL definition for structure data in commons metrics - https://phabricator.wikimedia.org/T247101 (Nuria) >Also I want to exclude redirected page from counting, Nice, thanks for the correction. Code patch submitted.
[22:45:12] Analytics, Better Use Of Data, Desktop Improvements, Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (Nuria) >We should review the ECS with @colewhite and modify our schema to conform where we can +1
[22:54:08] Analytics, Better Use Of Data, Desktop Improvements, Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (Nuria) >Maybe let's do error_name? The field we're going to be using to populate this will be Error.nam...
[23:30:59] (CR) Nuria: "This seems like it should not exist on the deps we deploy to hadoop, right? Can't we add this to the classpath in notebooks in a differen" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/580077 (owner: Joal)