[01:01:18] I'm testing a Cassandra-loading Oozie job with the aqs-test1003.analytics.eqiad1.wikimedia.cloud Cassandra test cluster, which is accessible from the command line, but fails to connect during the Oozie job execution. Any suggestions?
[01:02:15] So far, I've tried replacing the hostname with its IP address and trying with another machine (aqs-test1002) to no avail
[01:02:48] Here's the log for reference: https://hue.wikimedia.org/hue/jobbrowser/#!id=job_1612875249838_79220 (under syslog)
[06:27:13] lexnasser: hey!! there is a network partition between cloud/labs and production, we cannot load from oozie :(
[06:27:58] (for security reasons the realms are split)
[06:34:55] elukey: morning! makes sense, can I just assume my Oozie job works then if it produces the appropriate output file even if it can’t load into Cassandra?
[06:42:55] lexnasser: I don't have a lot of context on it, but what are the output files?
[06:51:21] elukey: just the HDFS output files that would be loaded into Cassandra
[06:51:50] ahhh okok
[06:52:09] yes I think it is fine, but follow up with Joseph or Dan or Marcel just to be sure :(
[06:52:12] err :)
[06:52:37] Sounds good, thanks for your answer and advice!!
[06:57:38] anytime!
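Editor's note: the exchange above comes down to where the connection is attempted from. A host that answers from an interactive shell can still be unreachable from the machines that run the Oozie job. A minimal Python sketch of a reachability probe follows, assuming plain TCP and Cassandra's default native-protocol port 9042 (the log does not say which port the job uses); `can_connect` is a hypothetical helper name, not part of any tool mentioned in the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Name-resolution failures, refused connections and timeouts all raise
    OSError subclasses, so they are all reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Usage would look like `can_connect("aqs-test1003.analytics.eqiad1.wikimedia.cloud", 9042)`. To say anything about the partition described above, the probe would have to run on the nodes where the job executes rather than on the host used for the command-line test.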
[06:58:00] I am sorry that we didn't discuss the oozie -> cassandra test cluster thing before :(
[06:59:17] PROBLEM - Hadoop DataNode on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:59:33] !log restart datanode on an-worker1099 - soft lockup kernel errors
[06:59:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:01:42] !log reboot an-worker1099 to clear out kernel soft lockup errors
[07:01:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:08:01] RECOVERY - Hadoop DataNode on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:11:20] good
[07:15:02] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Resolved→Open @wiki_willy I am terribly sorry to re-open this task, please be patient, but I discovered that I made an error (got fooled by...
[07:29:02] !log added journalnode partition to all hadoop workers not having it in the Analytics cluster
[07:29:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:48:50] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1058.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[07:50:34] !log attempt to reimage analytics1058 (part of the cluster, not a new worker node) to Buster
[07:50:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:51:00] ah lovely the reimage gets stuck for 1058, and I cannot reach the serial console
[07:51:03] GOOD START
[08:10:31] all right unblocked
[08:11:23] (CR) Awight: Rewrite date match to avoid buggy UDF (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666932 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[08:18:40] (PS1) Gerrit maintenance bot: Add tay.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/667108 (https://phabricator.wikimedia.org/T275803)
[08:21:11] (CR) Awight: [C: -1] "Hmm, some of these seem to have problems still. This one falls into an interactive session, for example: `visualeditor/hive/template_dial" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666942 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[08:25:44] (Abandoned) Awight: Switch to beeline to avoid stray logging [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666942 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[09:08:14] (PS1) Gehel: Minimal configuration of Sonar maven plugin. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/667114 (https://phabricator.wikimedia.org/T264873)
[09:11:00] (PS1) Gehel: Minimal configuration of Sonar maven plugin. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/667115 (https://phabricator.wikimedia.org/T264873)
[09:50:49] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) Final script to use for workers: ` #!/bin/bash set -x change_uid() { # $1 new uid # $2 username if id "$2" &>/dev/null then OLD_UID=$(id -u $2) use...
[09:51:59] !log reimaged analytics1058 to debian buster (preserving datanode partitions)
[09:52:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:52:09] going to test a GPU node now
[09:52:15] (since it has more partitions etc..)
[09:52:29] if it works, then we'll be able to deploy a 5.x kernel + rocm 38
[09:52:51] aaaand with yarn labels, maybe tensorflow on yarn?
[09:58:19] (CR) Addshore: [C: +2] Use Archiva to download Maven Wrapper. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/666906 (owner: Gehel)
[09:59:58] (Merged) jenkins-bot: Use Archiva to download Maven Wrapper. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/666906 (owner: Gehel)
[10:00:35] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1096.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[10:04:17] addshore: thanks for the merge! I have another one for you: https://gerrit.wikimedia.org/r/c/analytics/wmde/toolkit-analyzer/+/667115
[10:04:27] if you're up to it (almost as trivial as the previous one)
[10:06:56] * elukey bbiab
[10:13:26] zpapierski: if you have a minute: https://gerrit.wikimedia.org/r/c/wikimedia/discovery/discovery-maven-tool-configs/+/667129
[10:13:47] sure
[10:13:54] (I have a few more CRs needing review, but that one is blocking me)
[10:13:58] thanks!
[10:14:13] for context: https://checkstyle.org/config_sizes.html#LineLength_Parent_Module
[10:14:37] and done
[10:14:44] Analytics-Radar, WMDE-Templates-FocusArea, Patch-For-Review, WMDE-TechWish (Sprint-2021-02-03), WMDE-TechWish-Sprint-2021-02-17: Adjust edit count bucketing for TemplateWizard, segment all metrics - https://phabricator.wikimedia.org/T273475 (awight)
[10:14:49] thanks!
[10:18:23] Analytics-Radar, WMDE-Templates-FocusArea, Patch-For-Review, WMDE-TechWish (Sprint-2021-02-03), WMDE-TechWish-Sprint-2021-02-17: Adjust edit count bucketing for TemplateWizard, segment all metrics - https://phabricator.wikimedia.org/T273475 (awight) The metrics are landing in Graphite, but we...
[10:29:46] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (phuedx) Resolved→Open >>! In T210106#6862686, @awight...
[11:04:03] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1096.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[11:14:47] (CR) Awight: Glue for pure-setup.cfg project (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/649296 (owner: Awight)
[11:31:59] very interesting, on an-worker1096 (host with GPU) we have rocm 3.3 on stretch, and if we reimage to buster the same settings somehow prevent the host from booting (the kernel tries to boot, then stalls on "Relocating xyz etc.." and the host reboots
[11:32:30] but only when the reimage script reboots after the first puppet run
[11:32:38] so I am 99% positive it is rocm
[11:32:45] (the dkms blobs)
[11:33:01] I just reimaged again but this time with rocm 3.8 settings
[11:34:26] Weird. But I do remember manually installing rocm would on occasion hang if the vanilla packages' DKMS was loaded (I think it couldn't unload the old module cleanly)
[11:35:11] Only more reason to insist on Buster for ML
[11:35:41] the script is about to reboot after the first puppet run, let's see
[11:36:10] but yes definitely better to rely on kernel drivers rather than dkms, seems brittle
[11:37:15] Or at least drivers shipped with the distro, binary or no.
[11:42:53] yep confirmed that it worked this time
[11:52:37] (CR) Joal: [C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/667108 (https://phabricator.wikimedia.org/T275803) (owner: Gerrit maintenance bot)
[11:52:48] (CR) Joal: [V: +2 C: +2] Add tay.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/667108 (https://phabricator.wikimedia.org/T275803) (owner: Gerrit maintenance bot)
[11:53:07] klausman: weird thing - rocm-dev wants rocm-gdb, that wants libpython38 (and hence it doesn't get installed)
[11:53:11] does it ring a bell?
[11:53:30] (CR) Thiemo Kreuz (WMDE): "Sorry, I can't say much about this. But if it helps, why not? Is there a way this could cause issues? I don't see any at the moment." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/649296 (owner: Awight)
[11:53:35] Didn't we roll our own packages because of that?
[11:54:38] what do you mean?
[11:55:07] I vaguely remember us hacking upstream packages because of dependencies that were in them, but not strictly needed for our use case
[11:55:45] so for some reason
[11:55:45] elukey@an-worker1096:~$ apt-cache show rocm-gdb
[11:55:45] Package: rocm-gdb
[11:55:46] Version: 9.2-rocm-rel-3.8-30
[11:55:52] elukey@stat1005:~$ sudo apt-cache show rocm-gdb
[11:55:52] Package: rocm-gdb
[11:55:52] Version: 9.2-rocm-rel-3.8-30
[11:56:29] but on stat1005 we also have
[11:56:29] Version: 9.2-rocm-rel-3.7-20
[11:56:37] that is installed
[11:56:57] Is it a dependency or a leaf?
[11:57:23] it is a dependency of rocm-dev
[11:57:33] Hurm.
[11:58:11] the only explanation that I can give is that 9.2-rocm-rel-3.7-20 was from another rocm version
[11:58:17] on stat1005 I mean
[11:58:20] Sounds like it.
[11:58:26] Py3.7 is pretty old by now
[11:59:28] even worse:
[11:59:28] Version: 9.2-rocm-rel-3.7-20
[11:59:28] Depends: libexpat1, libtinfo5, libncurses5, rocm-dbgapi, libpython2.7 (>= 2.7), libbabeltrace-ctf1 (>= 1.2.1), libbabeltrace1 (>= 1.2.1)
[11:59:44] Oh.
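Editor's note: the comparison above was done by reading `apt-cache show` output by eye. A small Python sketch of the same check follows, assuming the output is a set of `Key: value` stanzas separated by blank lines. `parse_stanzas` and `python_deps` are hypothetical helper names, and the sample text is only the stanza fragment quoted in the log. The 3.8-30 stanza's `Depends` line is not shown in the log, so it is not reproduced here.

```python
from typing import Dict, List


def parse_stanzas(text: str) -> List[Dict[str, str]]:
    """Split `apt-cache show` style output into one dict per package stanza."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Continuation lines start with whitespace; this sketch skips them.
            if line.startswith((" ", "\t")):
                continue
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def python_deps(stanza: Dict[str, str]) -> List[str]:
    """Return the libpython* entries from a stanza's Depends field."""
    deps = [d.strip() for d in stanza.get("Depends", "").split(",")]
    return [d for d in deps if d.startswith("libpython")]


# Stanza fragment copied from the log above (rocm-gdb as installed on stat1005).
SAMPLE = """\
Package: rocm-gdb
Version: 9.2-rocm-rel-3.7-20
Depends: libexpat1, libtinfo5, libncurses5, rocm-dbgapi, libpython2.7 (>= 2.7), libbabeltrace-ctf1 (>= 1.2.1), libbabeltrace1 (>= 1.2.1)
"""

for stanza in parse_stanzas(SAMPLE):
    print(stanza["Version"], python_deps(stanza))
```

Feeding it the full output of `apt-cache show rocm-gdb` from both hosts would list, per version, which libpython package each build asks for.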
[11:59:54] Something tells me we should backup its files and purge it
[12:01:56] root@apt1001:/srv/wikimedia# reprepro ls libpython3.8
[12:01:56] libpython3.8 | 3.8.1-2~buster1 | buster-wikimedia | amd64
[12:02:03] I'll try with this
[12:02:53] mmmmm
[12:03:50] ahhh it is in thirdparty/pyall, that is deployed on worker nodes, but I think puppet failed before adding the repo configs
[12:05:15] but of course that package is virtual
[12:05:28] so cannot be installed
[12:09:26] rocm-gdb | 9.2-rocm-rel-3.7-20 | buster-wikimedia | thirdparty/amd-rocm37 | amd64
[12:09:44] they changed it with rocm 3.8 sigh
[12:17:57] (PS1) Awight: Always run as "funnel" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170)
[12:18:34] Hi gehel - I tried the patch you sent for sonar, and from what I read it needs the SONAR_API_KEY to be set for the plugin to execute correctly - I assume that this will be set up on jenkins, but how will we test locally?
[12:20:35] (CR) Addshore: [C: +2] Minimal configuration of Sonar maven plugin. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/667115 (https://phabricator.wikimedia.org/T264873) (owner: Gehel)
[12:21:43] (Merged) jenkins-bot: Minimal configuration of Sonar maven plugin. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/667115 (https://phabricator.wikimedia.org/T264873) (owner: Gehel)
[12:27:47] elukey: what fix do you propose?
[12:28:37] Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174 (awight) Since we have control over all jobs using this tool, I think we can move quickly with the migration. It's still nice to include a soft cutover, in case of roll...
[12:29:57] klausman: so for the moment I did something horrible, namely using transfer.py to copy the rocm-gdb deb from stat1005 to an-worker1096, to unblock things. The follow up should be better, I'll open a task
[12:30:16] also hcc is not present in rocm38, we kept the old version on stat1005 instead of removing it
[12:30:35] I think that we should also figure out how to cleanup/upgrade rocm, it seems very weird
[12:30:38] sigh
[12:32:21] klausman: also IIRC debian is packaging rocm in unstable, so we might use it in the future, but I am afraid of getting very slow updates..
[12:32:32] using ubuntu packages is not great either for these problems
[12:33:30] !log reimaged an-worker1096 (GPU node) to Debian buster (preserving datanode dirs)
[12:33:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:54] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` and were **ALL** successful.
[12:36:15] Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174 (awight) Another detail to mention: the output writer currently includes a date column, and I believe that removing it would cause the header change detection to invalid...
[12:45:05] elukey: roger. I will give this a think over lunch
[12:45:13] me too :)
[12:45:45] * elukey afk! Lunch
[12:59:19] (PS1) Awight: [WIP] The date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174)
[13:00:29] (CR) jerkins-bot: [V: -1] [WIP] The date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[13:05:04] (CR) Thiemo Kreuz (WMDE): "I'm afraid I don't fully understand what a "funnel" is. ;-) But what I see in the code makes sense. It looks like the code was able to pro" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[13:14:43] joal, elukey: is this something one of you would be happy to review / merge? https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/667114
[13:15:10] gehel: hi - I asked a question earlier on about that
[13:15:32] gehel: the patch as-is doesn't work locally (needs SONAR_API_KEY env set)
[13:15:40] then I should probably read the backlog and try to answer it :)
[13:15:58] it's not supposed to work locally, only in CI
[13:16:27] unless you do have an API key and you supply it via an environment variable
[13:16:29] gehel: shouldn't we have a way to make it work locally?
[13:17:14] gehel: In my view CI is not a test-bed, and we shouldn't have stuff running in CI that have not been tested locally (it happens, but shouldn't :)
[13:18:06] not really, the sonar plugin is only for sending data to sonarcloud, if you want to have early analysis of your code, it should be done with sonarlint (https://www.sonarlint.org/)
[13:18:34] yes, I know, it would be better if this was a single tool with a single configuration, but that's not how sonar works
[13:45:10] (PS2) Gehel: Minimal configuration of Sonar maven plugin. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/667114 (https://phabricator.wikimedia.org/T264873)
[13:45:31] joal: and here is your update ^
[13:45:33] thanks again!
[13:46:21] awesome gehel - Thanks :)
[13:46:45] (CR) Joal: [C: +1] "keeping it as +1 so that Andrew and Luca see it :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/667114 (https://phabricator.wikimedia.org/T264873) (owner: Gehel)
[13:47:51] (CR) Framawiki: "fyi I plan to self-merge this change soon if no comment is made" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/664353 (owner: Framawiki)
[14:48:19] (CR) Awight: "> I'm afraid I don't fully understand what a "funnel" is. ;-) But what I see in the code makes sense. It looks like the code was able to p" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[15:21:12] hello teammm
[15:23:51] holaaa
[15:33:31] (PS2) Awight: Input table date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174)
[15:35:04] Analytics, Patch-For-Review, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174 (awight)
[15:35:08] (CR) jerkins-bot: [V: -1] Input table date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[15:35:35] Analytics, Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174 (awight)
[15:36:32] Analytics, Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170 (awight)
[15:45:58] G'day team
[15:48:02] (PS3) Awight: Input table date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174)
[15:54:21] (CR) Erin Yener: "> Patch Set 1:" (7 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/666223 (owner: Erin Yener)
[16:03:38] !log rebalance kafka partitions for webrequest_upload partition 4
[16:03:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:04:51] (PS1) Awight: [WIP] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169)
[16:06:17] (CR) jerkins-bot: [V: -1] [WIP] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[16:13:09] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) a:RobH→wiki_willy I would recommend opening a new task rather than reopening a resolved racking task and adding to the 'racking' timeline for...
[16:18:23] (CR) Mforns: [C: +2] "Thanks a lot for this change!!" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[16:19:16] (Merged) jenkins-bot: Always run as "funnel" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[16:24:14] elukey: sqoop confirmed working on an-launcher1002 - Many thanks again for having me checking this
[16:25:58] \o/
[16:26:19] joal: can we also do a quick check with the other way (no /usr/share/java) ?
[16:26:24] I can quickly hack an-launcher1002
[16:26:34] please elukey - I have the command at hand :)
[16:27:27] joal: done!
[16:27:41] Analytics, SRE, ops-eqiad: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (Cmjohnson) Open→Resolved a:Cmjohnson Fixed, loose cable
[16:27:43] Testing
[16:28:43] elukey: works for me
[16:28:50] elukey: how can I check the change you made?
[16:30:05] joal: cat /usr/bin/sqoop
[16:30:09] ack
[16:30:12] just did that
[16:30:45] basically I removed my workaround and /usr/share/java from SQOOP_JARS=`ls /var/lib/sqoop/*.jar 2>/dev/null`
[16:30:57] joal: so did it work right?
[16:31:05] it did work indeed
[16:31:08] wow
[16:31:10] Analytics-Radar, WMDE-Templates-FocusArea, Patch-For-Review, WMDE-TechWish (Sprint-2021-02-03), WMDE-TechWish-Sprint-2021-02-17: Adjust edit count bucketing for TemplateWizard, segment all metrics - https://phabricator.wikimedia.org/T273475 (lilients_WMDE)
[16:31:22] yup, confirmed elukey
[16:31:38] nice thanks!
[16:31:42] np :)
[16:33:47] wow /var/lib/sqoop is.. empty?
[16:34:07] all is in /usr/lib/sqoop
[16:35:00] (CR) Awight: "> If you have tested it by running update_reports.py with real data, please add a Verified+2 and I will merge." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[16:36:52] (CR) Mforns: "> Patch Set 1:" (7 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/666223 (owner: Erin Yener)
[16:39:48] (CR) Mforns: "Oh, hmm, my bad... didn't think this would get merged automatically." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[16:40:31] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey)
[16:41:28] (CR) Awight: ":-) Thanks for handling it!" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667150 (https://phabricator.wikimedia.org/T193170) (owner: Awight)
[16:42:11] razzi: o/
[16:42:13] hello
[16:42:16] hiya
[16:42:24] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) Current status: * reimaged analytics1058 (regular hadoop worker, 12 disks) - all good! (the reuse partman recipe preserved the datanode dirs) * reimaged an-worker1096 (GPU worker, 2...
[16:42:37] I added a comment to --^
[16:42:47] we have now 4 hadoop workers running buster
[16:43:08] 2 new nodes (from the backup cluster) and two reimaged today (that were already part of the cluster)
[16:43:19] nothing on fire, but let's keep it in mind
[16:43:28] next week we can talk about how to split the reimages
[16:43:31] would it be ok?
[16:45:06] Sounds good, yeah
[16:46:11] we can also discuss https://gerrit.wikimedia.org/r/c/operations/puppet/+/667180
[16:46:34] basically this is the awesome work that Stevie did to have debian install to preserve partitions (and format only root)
[16:46:49] reuse-test forces the admin to join the serial console and hit "YES"
[16:47:04] it is meant for testing on the first use cases
[16:47:20] so I checked the 12 and 24 disks use cases, all good
[16:47:24] and I removed the -test
[16:47:46] it may be confusing at first but I promise that I'll explain!
[16:48:06] (partman is already horrible, I spent so many hours crying because of it)
[16:49:51] haha ok
[16:56:49] (CR) Erin Yener: "> Patch Set 1:" (7 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/666223 (owner: Erin Yener)
[16:57:09] elukey: do you know what to make of this alert: "Icinga/Stale file for node-exporter textfile in eqiad"? (https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DIcinga%2FStale%20file%20for%20node-exporter%20textfile%20in%20eqiad)
[16:58:29] Analytics, Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169 (awight)
[16:58:34] razzi: never seen it before :D
[16:59:12] Analytics, Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169 (awight) a:awight
[16:59:14] (CR) Mforns: [V: +2 C: +2] "LGTM!" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666932 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[16:59:41] elukey: ok :) no notes URL, but I see via git the author of the alert was Filippo so I can ask
[16:59:58] razzi: yes exactly, good approach
[17:01:39] (PS2) Awight: [WIP] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169)
[17:02:39] (CR) Mforns: "Yes, no problem, we can take the conversation to the Phabricator task, no?" [analytics/refinery] - https://gerrit.wikimedia.org/r/666223 (owner: Erin Yener)
[17:04:28] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (Cmjohnson) Open→Resolved Replaced the disk
[17:07:53] going afk people, have a good weekend!
[17:08:11] have a good weekend elukey :)
[17:10:02] Analytics, FR-Tech-Analytics, Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (mforns) Hi @Jdrewniak and @mpopov I ping you here to discuss about WikipediaPortal schema. I've seen you listed as schema owne...
[17:18:57] (PS3) Awight: [WIP] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169)
[17:19:57] (CR) jerkins-bot: [V: -1] [WIP] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[17:33:11] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Jdlrobson) Given it relates to our existing language instrumen...
[17:40:40] Gone for tonight - see you next week team
[17:46:33] (CR) Bstorm: "Sounds pretty awesome to me. Vagrant is pretty heavy and quirky." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/664353 (owner: Framawiki)
[18:02:19] (PS4) Awight: Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169)
[18:03:46] (CR) jerkins-bot: [V: -1] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[18:06:01] (CR) Awight: "Looks good in manual testing. Convert hiveql scripts by renaming e.g. "toggles" to "toggles.hql", remove bash boilerplate, any "\" escape" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[18:07:04] Analytics, Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2021-02-17: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169 (awight) a:awight→None
[18:08:09] Analytics-Radar, Machine-Learning-Team, ORES: Emit synthetic mediawiki.revision-score events for both datacenters - https://phabricator.wikimedia.org/T214545 (Ottomata) BTW, we have an automated way do this now. Artifical canary events are automatically filtered out of the refined Hive tables, but...
[18:08:18] mforns: went a bit rogue on the reportupdater patches :-). Nice list of TODOs, you had already made tasks for everything!
[18:22:01] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (wiki_willy) No worries @elukey, it looks like I missed the double count in rack A4 as well. If these hosts need to stay in row A though, the only other 10...
[18:26:40] awight: thanks a lot for all those changes <3 those were improvements that we identified for RU, but never prioritized. We have been talking for a while now of trying to unify all our scheduling tools (Oozie, Refine, reportupdater, some systemd timers) into 1 tool (most probably Airflow). So, we kinda left reportupdater stalled. But since it seems to be used more and more, your patches are super welcome.
[18:26:40] Thanks again.
[18:29:45] mforns: Ah I see, yeah a standardized scheduler will be an improvement, for sure.
[18:31:19] awight: one q: your team is using reportupdater basically for the graphite metrics right? Or are you also using the report files?
[18:33:49] reportupdater was initially designed to generate tsv files that would be read and displayed by Dashiki, a dashboarding tool.
[18:34:12] but that use case has diminished, there are few Dashiki dashboards now.
[18:35:01] however, several teams are using the graphite capability more (which was introduced later by someone outside the team, Max I think).
[18:36:13] Maybe when we have Airflow set up, we can implement a Hive2Graphite operator that takes a query and an interval and does the same thing as reportupdater.
[18:47:46] mforns: We're strictly using it for the graphite export... Of course, I'd rather drop the data into something better if that existed. We would probably have chosen superset, but wanted the option to expose the dashboards publicly.
[18:48:21] I see
[18:48:55] long story short, airflow+hive2graphite would be a great next step for us.
[18:49:09] We also have a few sql queries, fwiw.
[18:49:40] yes, Hive2GraphiteOperator seems something useful to me too
[18:51:04] if we had more fine-grained access control for the hadoop cluster, we could maybe have a public superset instance? not sure about that, though
[18:53:00] +1 I don't mean it like a vote for any particular product btw, just to say that we aren't committed to Grafana in any way.
[18:59:32] ok
[19:12:12] (CR) Awight: "Strange, I don't get this error locally." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[20:48:04] Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) Found a client-side error: when creating a new chart from `pageviews_hourly`, when attempting to add a metric in the chart creator, the frontend app crashes. {F34124131} {F34124134} I'm guessing this has to do with the...
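Editor's note: the Hive2Graphite operator floated at 18:36 above does not exist in the log beyond the idea. As a rough illustration of its final step only, the Python sketch below formats already-computed query results as Graphite plaintext-protocol lines (`<path> <value> <timestamp>`) and sends them to a Carbon listener. Everything here is an assumption: the function names, the metric prefix, the row shape, and the use of Carbon's default plaintext port 2003. It is not reportupdater's code and it does not touch Hive or Airflow.

```python
import socket
from typing import Iterable, List, Tuple, Union

Number = Union[int, float]


def to_graphite_lines(
    prefix: str, rows: Iterable[Tuple[str, Number]], timestamp: Number
) -> List[str]:
    """Format (metric_name, value) rows as Graphite plaintext-protocol lines."""
    ts = int(timestamp)
    return [f"{prefix}.{name} {value} {ts}" for name, value in rows]


def send_lines(lines: List[str], host: str, port: int = 2003) -> None:
    """Send the lines to a Carbon plaintext listener, one metric per line."""
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=5.0) as conn:
        conn.sendall(payload)
```

A call such as `to_graphite_lines("hypothetical.prefix", [("edits", 12)], 1614300000)` gives `["hypothetical.prefix.edits 12 1614300000"]`. Scheduling and the query itself would be the job of whichever tool wraps this step.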
[21:43:22] Analytics: Prep for replacing jupyter conda migration - https://phabricator.wikimedia.org/T262847 (Ottomata)
[22:07:45] (PS3) Lex Nasser: Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint [analytics/refinery] - https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171)
[22:13:51] (CR) Lex Nasser: "This latest patch set fixed the issues introduced by the Hadoop upgrade." [analytics/refinery] - https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[22:15:40] (PS4) Lex Nasser: Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint [analytics/refinery] - https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171)
[22:42:11] (PS5) Lex Nasser: Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint [analytics/refinery] - https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171)
[22:51:49] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (lexnasser) Just finished fixing up the Hive query for the Oozie job to load the data into Cassandra for the top per-country AQS pageviews endpoint.\ In my descri...