[00:26:46] isaacj, yt?
[00:27:01] dsaez: o/
[00:27:21] hey, I've been playing your wikidata-topic code ...
[00:28:10] I've written some code to replace the call to Wikidata API, because this (I thought) was the bottle neck to process a lot of items.
[00:28:44] now I can generate the Bag of Words, (Ps and Qs) for all items in couple of minutes ....
[00:28:46] buuuuut
[00:29:32] Now I'm getting an error from fasttext. I'm trying to understand what it is, it says can connect with socket... I'm wondering if could be the size of the BoW
[00:29:50] have you experienced any problem with items with large number of Ps and Qs?
[00:30:25] and, do you think that if just cut to the first N (for example 500 ) characters, the model precision will suffer too much?
[00:31:04] hmm...never ran into errors like that i think. what's the full text of the error? not clear to me why fasttext would need sockets...
[00:32:33] the challenge with limiting the size of the wikidata BoW is that there is no good order to wikidata properties. but if you have an item that has that many statements, i suspect you can trim some of them out and still get good signal. just seems unnecessary
[00:33:45] dsaez: also did you see: https://github.com/geohci/wikidata-topic-model/blob/master/bulk/wikidata_ids_to_topics_dumps.py
[00:34:11] let me see, maybe it is a problem from pyspark , i'll prepare the dataset with Spark and then run normal python
[00:36:50] isaacj, so should I run --input_qids and and a jsonl like '{'Q1': PX QY ... }
[00:37:28] jsonl file?
[01:17:38] found it.
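
The trimming idea raised above (cutting a bag of words to roughly the first N characters) could be sketched as follows. This is only an illustration under stated assumptions: the BoW is taken to be a space-separated string of property and item IDs, `trim_bow` is a hypothetical helper that is not part of the wikidata-topic-model code, and whole tokens are kept so that no ID is cut in half. Because Wikidata properties have no meaningful order, which tokens survive is arbitrary.

```python
def trim_bow(bow: str, max_chars: int = 500) -> str:
    """Keep leading whole tokens of a space-separated BoW within max_chars.

    Hypothetical helper for illustration; not taken from the original code.
    """
    kept = []
    used = 0
    for token in bow.split():
        # One extra character for the separating space, except before the first token.
        extra = len(token) + (1 if kept else 0)
        if used + extra > max_chars:
            break
        kept.append(token)
        used += extra
    return " ".join(kept)


if __name__ == "__main__":
    print(trim_bow("P31 Q5 P21 Q6581097", max_chars=10))
```

Whether precision suffers would still have to be measured against the untrimmed input; the sketch only shows the mechanics of the cut.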
[02:58:48] Analytics, Analytics-Cluster, ORES, Research, Scoring-platform-team: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub) - https://phabricator.wikimedia.org/T249078 (diego)
[05:55:53] Analytics, Analytics-Kanban, Product-Analytics: Definition of not text content metrics for tunning session (rich media,: images and the linke) - https://phabricator.wikimedia.org/T247417 (jwang) Non-text contents are images, audio, video, documents (pdfs), and data (e.g. JSON, et. al). They are store...
[06:07:28] Analytics, Analytics-Cluster, ORES, Research, Scoring-platform-team: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub) - https://phabricator.wikimedia.org/T249078 (elukey) Hey Diego, not sure if you have seen https://docs.google.com/document/d/1r-oqMXViWvQCqsYz0qze...
[06:12:21] snowick: o/
[06:35:14] snowick: my bad for the mysql credentials, I am fixing it. Basically since you are in analytics-privatedata-users you have now access only to /etc/mysql/conf.d/analytics-research-client.cnf (the filename is different)
[06:35:28] you should be able to unblock yourself using the new file
[06:45:34] Analytics, Analytics-Cluster, ORES, Research, Scoring-platform-team: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub) - https://phabricator.wikimedia.org/T249078 (diego) Hey @elukey . Thanks for sharing, @Ottomata has talked about the general idea, but I was not...
[07:03:12] hello!
[07:03:22] I was looking for a way to add query logging to Presto
[07:03:38] and I keep finding guides about creating custom plugings via https://prestodb.io/docs/current/develop/event-listener.html
[07:04:01] I am really shocked that something so simple is not already shipped by Presto
[07:04:46] like https://github.com/rchukh/presto-querylog
[07:10:01] even https://eng.lyft.com/presto-infrastructure-at-lyft-b10adb9db01 talks about a query log plugin
[07:10:57] * elukey afk for a couple of hours (hopefully), will have my laptop with me
[09:26:37] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Addshore) Thanks all! The next step for data access will be https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos setup Ping #analyti...
[09:26:48] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Addshore) Thanks all! The next step for data access will be https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos setup Ping #analytics...
[09:27:03] elukey: ^^ ping :D
[09:27:26] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Addshore)
[09:27:35] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Addshore)
[09:27:51] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Addshore) a:Volans→None
[09:27:56] need some kerberos love
[09:47:40] Analytics, Analytics-Kanban, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), and 2 others: Migrate analytics/refinery/source release jobs to Docker - https://phabricator.wikimedia.org/T210271 (akosiaris) A couple of things I 'd like to add: 1. Having CI u...
[10:07:59] !log updating icu packages
[10:08:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:04:25] addshore: 5 euros!
[11:05:20] noooooo :P
[11:05:23] that sounds expensive!
[11:05:59] do you need anything from them to setup kerbos? or you have their email from ldap and that is it?
[11:06:11] then I can have a quick call with them later and walk them through some things :)
[11:07:29] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (elukey) ` elukey@krb1001:~$ sudo manage_principals.py create tarrow --email_address=thomas.arrow_ext@wikimedia.de Principal successfully created. Ma...
[11:07:51] addshore: all done, the email with the temp password has been sent
[11:09:40] Fanks
[11:17:19] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (elukey) Open→Resolved
[11:19:26] elukey: did you see the other ticket too?
[11:19:32] https://phabricator.wikimedia.org/T248482
[11:19:36] =]
[11:23:25] addshore: I can do it in a sec
[11:42:29] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (elukey) ` elukey@krb1001:~$ sudo manage_principals.py create itamar --email_address=itamar.givon@wikimedia.de Principal successfully created. Ma...
[11:43:41] Analytics, Analytics-Kanban, Patch-For-Review: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (JAllemandou) @Ottomata : I see. @Nuria : shall we rename before recomputing the features, or do we keep that name?
[11:44:47] addshore: done!
[11:44:57] ty!!!
[11:45:18] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (elukey) Open→Resolved
[11:50:20] Analytics, Operations, netops: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey) We could start with TLS authentication only, with: ` security.protocol=SSL ssl.ca.location=/etc/ssl/certs/Puppet_Internal_CA.pem ssl.cipher.suites=ECDHE-EC...
[11:56:46] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE) @jcrespo I confirm that I was able to log in. Thank you. @elukey Thanks for the prompt response!
[12:21:23] Analytics: Analytics Kerberos Welcome Email contains hostname typo - https://phabricator.wikimedia.org/T249103 (Tarrow)
[12:38:22] Analytics, Analytics-Cluster, ORES, Research, Scoring-platform-team: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub) - https://phabricator.wikimedia.org/T249078 (Ottomata)
[12:38:26] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata)
[13:00:53] @dsaez don't spend time on this yet. I'll try to make a "formal" request that you can or not integrate in your workflow.
[13:01:13] I don't want to disrupt anything for mere curiosity :)
[13:03:41] delphine, ok :)
[13:04:31] Analytics, Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE) @elukey Should this also provide me with access to hue.wikimedia.org?
[13:40:42] Analytics, Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (elukey) >>! In T248482#6018579, @ItamarWMDE wrote: > @elukey Should this also provide me with access to hue.wikimedia.org? Needs another access, just added you!
[13:49:21] dsaez: Hi :) You have 8 spark kernels open in parallel on the cluster - I bet you're not using them all at the same time :) Could you please turn some of them off?
[14:00:23] Analytics, Operations, Performance-Team, Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (ema) Open→Resolved Done: ` $ curl -v -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wmnet" https://en.wikipedia.org/wiki/Main_Page 2>&1...
[14:13:18] Analytics, Analytics-Kanban, Patch-For-Review: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (Nuria) On my opinion I do not think we need to rename, the context is very different than the MW one.
[14:23:55] Analytics, Analytics-Kanban, Product-Analytics: Import regularly via scoop mediawiki_imagelinks table - https://phabricator.wikimedia.org/T249113 (Nuria)
[14:24:53] (CR) Joal: "Minor details, otherwise looks great :)" (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584580 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[14:25:28] (CR) Joal: [C: +2] "Approved for deploy later this week" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584923 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[14:26:11] Analytics, Analytics-Kanban, Patch-For-Review: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (Ottomata) The context is different, but you will be evaluating traffic as actors on mediawiki websites. There will also be 'actor*' table(s) that we sqoop from Mediawi...
[14:27:20] ottomata, nuria: is it ok for me to move forward with the Actor name? I understand both sides, and I'm not afraid to keep it as is as it seems a distant context from mediawiki - However if you feel strongly about it andrew, let's rename now that we need to recompute everything
[14:29:34] (CR) Ottomata: Introduce RefineFailuresChecker.scala (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584580 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[14:29:37] (CR) Joal: [C: +1] "ok for me but I'm no shell expert - ottomata if you don't mind triple checking :S" [analytics/refinery] - https://gerrit.wikimedia.org/r/584949 (owner: Hashar)
[14:30:12] (CR) Joal: Introduce RefineFailuresChecker.scala (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584580 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[14:30:50] (Merged) jenkins-bot: document -DdeveloperConnection [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584923 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[14:31:38] (CR) Joal: "Minor thing: by convention we use spaces, not tabs ;)" [analytics/refinery] - https://gerrit.wikimedia.org/r/584984 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[14:32:51] (CR) Elukey: [V: +1] Introduce RefineFailuresChecker.scala (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584580 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[14:35:09] (CR) Joal: Introduce RefineFailuresChecker.scala (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584580 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[14:36:29] Analytics, Discovery, Discovery-Search (Current work): Data for events from wdqs needs to be deleted after 90 days and/or sanitized - https://phabricator.wikimedia.org/T247034 (TJones) Open→Resolved a:TJones
[14:41:51] Analytics, Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE) @elukey Thank you, am able to access hue now :)
[14:47:50] Analytics, Analytics-Kanban, Product-Analytics: Technical contributors metrics definition - https://phabricator.wikimedia.org/T247419 (Nuria) Tools maintainers: https://toolsadmin.wikimedia.org/ maintainers 2020/03 1919 Baseline 2020/01 1880
[15:05:21] Analytics, Event-Platform, Wikimedia-Extension-setup, Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Wikimedia-extension-review-queue: Deploy EventStreamConfig extension - https://phabricator.wikimedia.org/T242122 (Jdforrester-WMF)
[15:12:22] (CR) Milimetric: [V: +2] Fix wikidata item_page_link job [analytics/refinery] - https://gerrit.wikimedia.org/r/582500 (https://phabricator.wikimedia.org/T248228) (owner: Joal)
[15:55:56] Analytics, Analytics-Kanban, Pageviews-API: Pageviews missing for titles with emojis since April 23, 2019 - https://phabricator.wikimedia.org/T245468 (Nuria) a:Nuria→lexnasser
[15:56:15] awight: hello! adam met lexnasser , our intern who is going to be working in: https://phabricator.wikimedia.org/T245468
[15:56:31] awight: and for which he could use a CR when the time comes
[15:56:37] (PS3) Hashar: Fetch commit hook over https and skip if already present [analytics/refinery] - https://gerrit.wikimedia.org/r/584984 (https://phabricator.wikimedia.org/T210271)
[15:56:47] (CR) Hashar: "> Minor thing: by convention we use spaces, not tabs ;)" [analytics/refinery] - https://gerrit.wikimedia.org/r/584984 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[16:01:45] Analytics, Analytics-Kanban, Patch-For-Review: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (Nuria) Per post-standup conversation: - The intent of this "actor" identifier is more inline with the "bad actor" semantics used when assessing security risk - Actor_i...
[16:10:41] ottomata: how do you plan on commenting for names? phabTask or CR? Cause there is already quite a lot of code committed :)
[16:10:57] on code sorry
[16:11:02] now am prepping for interview ah!
[16:11:06] hm
[16:11:06] will do commetns real quck
[16:11:11] sure, we can do here
[16:15:27] joal posted
[16:15:30] oncr
[16:16:43] (CR) Ottomata: Update actors for pageview labelling (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[16:22:11] (PS1) Elukey: RefineFailuresChecker: fix parameters documentation and defaults [analytics/refinery/source] - https://gerrit.wikimedia.org/r/585266 (https://phabricator.wikimedia.org/T240230)
[16:23:14] joal: hope it is fixed with --^
[16:24:27] Analytics, Analytics-Kanban, Operations, netops, Patch-For-Review: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey) p:Triage→High
[16:26:05] encrypt all the things :D
[16:42:37] (CR) Nuria: [C: +2] Update actors for pageview labelling (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[16:51:20] will read after diner ottomata - thanks a lot
[16:52:02] (CR) Joal: [C: +2] "All good :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/585266 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[16:52:09] (CR) Ottomata: Update actors for pageview labelling (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[16:52:23] nuria: commented again
[16:55:55] (CR) Nuria: [C: +2] Update actors for pageview labelling (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[16:57:25] (Merged) jenkins-bot: RefineFailuresChecker: fix parameters documentation and defaults [analytics/refinery/source] - https://gerrit.wikimedia.org/r/585266 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[16:57:36] \o/
[16:58:57] (CR) Ottomata: Update actors for pageview labelling (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[16:59:23] * elukey off!
[17:10:10] Analytics, Operations, decommission, ops-codfw, serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (RobH)
[17:25:32] @elukey thanks, I'm in.
[17:45:16] Analytics, Analytics-Kanban, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, and 2 others: Migrate analytics/refinery/source release jobs to Docker - https://phabricator.wikimedia.org/T210271 (hashar) a:JAllemandou→hashar I am working on it with @Joal providing...
[17:50:14] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[17:56:59] wow- what's wrong with hive here --^ ?
[17:57:05] ottomata: can you have a look?
[18:01:44] mwarf - plenty jobs failed
[18:01:58] elukey: would you be nearby by any chance?
[18:03:19] joal just finished an interview looking
[18:03:31] ack ottomata - thanks
[18:05:34] only thin I can view is that: https://grafana.wikimedia.org/d/000000379/hive?orgId=1&fullscreen&panelId=9
[18:05:52] no bump in heap
[18:06:22] !log restarted hive-server2
[18:06:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:06:50] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:06:59] Will restart the jobs
[18:08:00] Hm - usage spike at the error time: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics
[18:08:33] I don't understand how so much memor is used on an-coord1001 - this feels a lot
[18:09:42] yeah also not sure i know what happened
[18:09:59] there are some presto timeouts?
[18:10:04] (CR) Nuria: [C: +2] Update actors for pageview labelling (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal)
[18:10:15] (CR) Hashar: "I might add that I got them reported to me via "shellcheck" ;)" [analytics/refinery] - https://gerrit.wikimedia.org/r/584949 (owner: Hashar)
[18:10:26] ottomata: currently the machine is full of mysql (sqoop time), and presto
[18:11:11] actually I say sqoop but I'm wrong - It's mysql for hive/oozie taking memory
[18:13:00] hm
[18:13:03] oh beginning of month
[18:13:09] right
[18:13:15] but still, feels bizarre
[18:13:18] ya
[18:13:36] cluster is full - I think this is realted
[18:14:12] !log Kill groceryheist job taking half the cluster
[18:14:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:14:39] better
[18:14:57] but still - 4 pages of running jobs lag :(
[18:15:47] dsaez: I reiterate my request- can you please kill some notebooks?
[18:16:12] hm - sqoop runs in default queue - not good
[18:20:03] ok, back in more reasonable state
[18:21:15] !log restart webrequest-load-wf-upload-2020-4-1-16 and webrequest-load-wf-text-2020-4-1-16 after hive failure
[18:21:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:21:43] thanks joal
[18:23:01] !log Restart unique_devices-per_project_family-monthly-wf-2020-3 and aqs-hourly-wf-2020-4-1-15 after hive fialure
[18:23:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:24:07] !log Kill learning-features-actor-hourly as new version to come
[18:24:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:25:54] wow hive failed??
[18:26:04] yup elukey - HiveServer2
[18:26:11] elukey: ya i logged int nad it was in active exited state
[18:26:18] no info in hive server2 logs
[18:26:22] i didn't see an OOM in syslog
[18:26:32] mostly just presto timeouts (is there a 10s query timeout right now?)
[18:26:34] Spike in usage at that time on the machine - Seems memory constraints
[18:26:50] no hive-server2 process was running
[18:27:28] very weird, there is one oom for java but this morning
[18:27:45] but https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&fullscreen&panelId=4 doesn't look great :(
[18:28:16] see last week
[18:28:17] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&fullscreen&panelId=4&from=now-7d&to=now
[18:28:40] indeed elukey - there have a change in usage today
[18:28:49] We are 1st of the month
[18:28:53] But still
[18:29:29] joal: yeah but we basically saturated the memory, cached went down a lot
[18:29:40] yes
[18:29:53] ottomata: no idea about the presto timeouts :(
[18:29:54] We are using more and more memory
[18:30:01] but we need to log the presto queries
[18:30:04] on an-coord1001
[18:30:50] joal: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=now-6h&to=now
[18:31:11] at around :50 there is a change in all metrics, probably a heavy query
[18:31:29] now in theory presto should query the metastore IIRC, not the server
[18:32:01] at 17:50 GC + heap yes
[18:32:39] Actually just heap - GC started before (17:00
[18:33:20] there is also a hole in metrics at around 17:40
[18:33:37] Bizarre elukey - it seems the error happened while no presto query was running :(
[18:33:46] See the running queries chart
[18:34:31] elukey: we don't have charts for mysql on that box do we?
[18:35:26] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fanalytics&var-server=an-coord1001&var-port=13306
[18:36:14] \o/
[18:36:18] Thanks elukey
[18:37:19] elukey: nothing special :(
[18:37:52] ottomata: I checked in kern.log, I think that dmesg has a weird timestamp
[18:37:55] Apr 1 17:46:42 an-coord1001 kernel: [16913130.925971] Out of memory: Kill process 141869 (java) score 304 or sacrifice child
[18:38:02] that makes sense --^
[18:38:39] so I think that there was memory pressure and our dear oom killed hive
[18:38:47] elukey: that feels a query doing too much on the coord-side
[18:38:50] yeah
[18:39:38] ottomata: I am wondering if we could move hive server+metastore on notebook1003 when we move people to stat boxes
[18:40:15] 32 cores, 64g of ram
[18:40:43] even oozie on it, and then notebook1004 could get the presto coordinator, or similar
[18:40:44] AH OOM cool
[18:41:07] sure why not!
[18:41:14] joal: what do you think?
[18:41:30] anyway, be back in a bit, dinner!
[18:41:48] because it has more mem?
[18:41:56] sounds great elukey - more space for hive and friends - Enjoy diner, thanks for help
[18:54:42] (PS1) Joal: Add imagelinks to mediawiki-history-load oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113)
[18:55:12] (CR) Joal: "This needs to be merged after https://gerrit.wikimedia.org/r/585292" [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113) (owner: Joal)
[19:02:05] Analytics, Analytics-Kanban: Make sqoop run in proiduction queue - https://phabricator.wikimedia.org/T249155 (JAllemandou)
[19:02:44] Analytics, Analytics-Kanban: Make sqoop run in proiduction queue - https://phabricator.wikimedia.org/T249155 (JAllemandou) a:JAllemandou
[19:03:30] (PS1) Joal: Parameterize sqoop to run in a specific yarn queue [analytics/refinery] - https://gerrit.wikimedia.org/r/585295 (https://phabricator.wikimedia.org/T249155)
[19:07:24] Analytics, Fundraising-Backlog, fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (Dwisehaupt) Spent some time digging into superset and how to get it packaged up for use on our hosts. There is no debian package, and the install proce...
[19:26:53] I have mantra for these days: "wash your hands, run kinit, wash your hands, run kinit, wash your hands, run kinit ..." ;)
[19:41:29] dsaez: juas
[19:42:42] Analytics, Analytics-Kanban, Patch-For-Review: Make sqoop run in production queue - https://phabricator.wikimedia.org/T249155 (JAllemandou)
[19:42:48] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Import regularly via sqoop mediawiki_imagelinks table - https://phabricator.wikimedia.org/T249113 (JAllemandou)
[19:42:50] Analytics, Analytics-Kanban, Patch-For-Review: Make sqoop run in production queue - https://phabricator.wikimedia.org/T249155 (JAllemandou)
[19:44:29] joal: I am going to run a scoop of tables, should i do that on an-coord1001?
[19:44:51] nuria: sure - the script is ready for that
[19:46:07] nuria: is that you having restarted the pageview job?
[19:46:35] cause I didn't do it and it still is ready mode
[19:48:46] joal: me no touch nothing
[19:48:56] ack nuria - no idea :S
[19:52:26] Analytics, Fundraising-Backlog, fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (Nuria) @Dwisehaupt you can take advantage of the already existing puppet modules for superset. See a recent example of changes to those: https://gerri...
[19:52:48] joal: this hive issue happened befor when we had two much to heavy queries
[19:53:57] nuria: timing of email is actually the same for all errors (7:46PM)
[19:54:04] when hive got down
[20:05:59] Quarry, DBA, Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) Hmm, Quarry is still not completing the query, and now paws has stopped working. :-(
[20:07:30] (PS3) Joal: Add ActorSignatureGenerator and GetActorSignatureUDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584948 (https://phabricator.wikimedia.org/T247342)
[20:08:00] nuria: this is the java part renamed --^
[20:08:10] Will move to the oozie part now
[20:09:23] (CR) Nuria: [C: +1] Add ActorSignatureGenerator and GetActorSignatureUDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584948 (https://phabricator.wikimedia.org/T247342) (owner: Joal)
[20:09:42] joal: naming change there looks good, i think is much better ottomata was right
[20:12:19] Analytics, Fundraising-Backlog, fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (Nuria) Also https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Administration https://wikitech.wikimedia.org/wiki/SWAP#Administration
[20:16:21] (CR) Nuria: "One small typo" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113) (owner: Joal)
[20:17:23] (CR) Joal: Add imagelinks to mediawiki-history-load oozie job (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113) (owner: Joal)
[20:17:45] (PS2) Joal: Add imagelinks to mediawiki-history-load oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113)
[20:26:35] joal: maybe not the best time to run my scoop it being the 1st of the month as the other , larger scoop, must be happening
[20:27:07] nuria: plenty things is happening :) if you can wait tomorrow or the day after, better, if not, well it'll do :)
[20:27:17] joal: i can wait
[20:27:21] ack
[20:27:42] joal: can i help you with the hiver server issues?
[20:28:01] ottomata, nuria: The comment about interaction length- does it mean using _s instead of _secs at the end of the field?
[20:28:44] nuria: it's all back to normal
[20:29:03] nuria: ottomata restarted the server, I restarted the jobs, so far so good
[20:29:26] thanks joal
[20:29:41] joal: if you don't mind, it'll help us be consistent, but that one is just a convention, not a rule
[20:29:55] maybe even
[20:29:56] time_s
[20:30:00] (CR) Nuria: [C: +2] Add ActorSignatureGenerator and GetActorSignatureUDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/584948 (https://phabricator.wikimedia.org/T247342) (owner: Joal)
[20:30:00] as suffix for a duration
[20:30:01] ottomata: no problem, I was just not sure about what you were asking for :)
[20:30:25] joal: k, question answered then , was double checking
[20:31:02] ottomata: currently we use interaction_length_secs - is interaction_length_s ok, or something like: interaction_duration_s?
[20:31:06] nuria: --^
[20:31:19] hmmmm
[20:31:29] joal this is just the time between the first and last interaction?
[20:31:36] correct ottomata
[20:32:00] interaction_duration_period_s
[20:32:01] ?
[20:32:15] interaction_period_s
[20:32:15] hmm
[20:32:37] duration makes me feel like it is the interaction_duration and interaction_length make me think itis the duration of a single interaction
[20:32:39] (CR) Nuria: [C: +2] Add imagelinks to mediawiki-history-load oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/585294 (https://phabricator.wikimedia.org/T249113) (owner: Joal)
[20:32:41] I'd go for duration and not period
[20:33:02] hm
[20:33:04] interaction_length_total_s
[20:33:04] ?
[20:33:21] session_duration_s ?
[20:33:37] interaction_frame_duration_s
[20:33:37] ?
[20:33:43] hm
[20:34:03] interaction_window_s
[20:34:04] ?
[20:34:13] interaction_window_length_s
[20:34:14] ?
[20:34:26] interaction_window_time_s
[20:34:40] ottomata: you know what - that field is NEVER used
[20:34:43] interaction_window_duration_s
[20:34:44] oh???
[20:34:45] ha
[20:34:50] nuria: shall I remove it?
[20:35:22] nuria: we use first and last, but never the duration [20:35:34] (03Merged) 10jenkins-bot: Add ActorSignatureGenerator and GetActorSignatureUDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/584948 (https://phabricator.wikimedia.org/T247342) (owner: 10Joal) [20:35:53] joal: ah, cause we divide by firats and last later to calculate ratio? [20:35:57] joal: let me look at code [20:36:00] correct nuria [20:37:01] joal: i see, here: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/584602/2/oozie/learning/features/actor/rollup/hourly/calculate_features_actor_rollup_hourly.hql [20:37:06] joal: totally my fault [20:37:14] joal: ya, let's remove it cc ottomata [20:37:25] ack nuria - problem solved ottomata [20:37:29] sorry for the noise :) [20:37:36] DELETING ALL THE THINGS! [20:37:37] nice :) [20:37:50] nuria: less code, less bugs :) [20:43:51] 10Quarry, 10DBA, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Bstorm) labsdb1011 is suffering badly lately, it seems. I see some rough replag and the connection errors spike periodically on that host https://grafana.wikimedia.org/d/00000... [20:44:56] (03CR) 10Joal: Update actors for pageview labelling (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: 10Joal) [20:45:43] 10Quarry, 10DBA, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Bstorm) I can also point out that I just tried to query a table with in wikidatawiki with a simple `select * from limit 1` and it just hung. It was a full view, not a... [20:52:26] 10Quarry, 10DBA, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Bstorm) I just found that I got the same result when I did it against the underlying table locally. 
@Marostegui I think there is a problem with the wikidatawiki database on t... [20:54:32] ottomata: i leave it up to you to merge https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/584602/ if you do not have further comments [20:54:55] not yet ready - finishing my patch :) [20:54:58] nuria: --^ [20:55:01] huhu [20:55:29] joal: yaya, i meant that I am leaving the +2 to ottomata [20:55:34] ack [20:56:01] nuria: please have a look at the patch as well if you have a minute: I have removed the first_interaction_dt and last_interaction [20:56:10] in the rollup table- not needed anywhere [20:56:15] joal: k [20:58:26] (PS3) Joal: Update actors for pageview labelling [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) [20:58:34] here we are --^ [21:01:35] Analytics, Fundraising-Backlog, fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (Dwisehaupt) @Nuria Thanks for those links. I hadn't dug into the puppet portions yet but will do so. Hopefully there will be some bits we could reuse g... [21:06:39] (PS5) Joal: Add automated agent-type to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/578373 (https://phabricator.wikimedia.org/T238363) [21:11:44] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Services (watching): Add examples to all event schemas - https://phabricator.wikimedia.org/T242454 (Ottomata) Review please! https://github.com/wikimedia/jsonschema-tools/pull/13 [21:13:26] joal did you decide to leave interaction_length_s in after all? [21:13:34] WAST? [21:14:06] meh - my bad [21:14:13] git up [21:14:15] oops [21:14:30] Analytics, Analytics-Kanban, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, and 2 others: Migrate analytics/refinery/source release jobs to Docker - https://phabricator.wikimedia.org/T210271 (hashar) >>!
In T210271#6017936, @akosiaris wrote: > A couple of things I 'd... [21:15:08] (CR) Ottomata: [C: +1] "+1 with removal of interaction_length_s." [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) (owner: Joal) [21:15:18] gonna head out! byyeee alll [21:15:22] joal: feel free to merge after that [21:15:39] hello! I had a quick question and it's not important but just wanted to share in case it was missed: Turnilo seems to be missing the data for 2020-04-01T15:00:00. is that expected? [21:15:54] I am looking at pageviews_hourly [21:16:16] Hi sukhe - we had a failure at that hour, and I just realized the job had not been restarted - it is now, data should be up soon [21:16:36] ah great thanks joal! [21:16:55] (PS4) Joal: Update actors for pageview labelling [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) [21:19:43] !log restart pageview-hourly-wf-2020-4-1-15 [21:19:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:29:46] (PS5) Joal: Update actors for pageview labelling [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) [21:43:16] (PS6) Joal: Update actors for pageview labelling [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) [22:13:13] (PS6) Joal: Add automated agent-type to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/578373 (https://phabricator.wikimedia.org/T238363) [22:48:06] (PS7) Joal: Update actors for pageview labelling [analytics/refinery] - https://gerrit.wikimedia.org/r/584602 (https://phabricator.wikimedia.org/T238363) [23:08:25] (PS7) Joal: Add automated agent-type to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/578373 (https://phabricator.wikimedia.org/T238363) [23:20:52] (PS8) Joal: Add automated
agent-type to pageview_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/578373 (https://phabricator.wikimedia.org/T238363) [23:54:55] Analytics, Operations, decommission, serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (Papaul)