[01:02:22] Analytics: Make data quality stats alert only if anomalous metrics change - https://phabricator.wikimedia.org/T263030 (mforns)
[06:33:12] good morning
[06:37:34] Analytics, ops-eqiad: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (elukey)
[06:59:36] (Abandoned) Elukey: Update changelog.md with skipped releases [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661187 (owner: Elukey)
[07:03:57] Analytics, Analytics-Kanban, observability: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (elukey) @Ottomata I have been seeing in icinga brief UNKNOWNs like the following for various kafka clusters (but they keep repeating): {F34121909}
[08:50:33] !log move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher)
[08:50:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:05] Good morning
[09:17:36] bonjour
[09:21:32] elukey: shall we finish that thorium backup?
[09:22:42] joal: sure sure
[10:20:43] A nice read about schedulers: https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow
[10:21:46] elukey: thorium?
[10:21:51] elukey: only 2 files left :)
[10:23:36] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1117.eqiad.wmnet', 'an-worker1118.eqiad.wmnet'] ` The log can be found in...
[10:24:22] joal: sure!
[10:25:36] elukey: I'm gonna run a hash-check on the file with the _COPYING_ extension - if they both match, I'll rename, otherwise we'll need to copy anew - ok?
[10:25:47] and, there is a file to copy again (different sizes)
[10:25:52] +1
[10:26:19] elukey: the file to drop/copy again: thorium:/srv/backup/backup_wikistats_1/htdocs/FR/PlotDatabaseEdits1.svg
[10:27:19] going to copy it in a sec
[10:27:48] just kicked off the reimage of an-worker111[7,8]
[10:28:00] then I'll init/wipe the partitions, check that all is good
[10:28:07] and add them to the cluster
[10:28:07] ack elukey
[10:28:25] let's see if the fixed gid/uid thing works :D
[10:28:38] elukey: md5 are the same - I rename the file removing the _COPYING extension
[10:28:39] if it doesn't I am going to throw my laptop out of the window
[10:28:44] super
[10:29:39] Done
[10:29:45] one file to go
[10:32:48] joal: file swapped on hdfs
[10:32:55] ack elukey - checking
[10:33:29] confirmed elukey
[10:33:46] elukey: do you wish we run a last triple check with file-list, or do we say we're good?
[10:34:33] joal: nah we are good!
[10:34:37] ack elukey
[10:34:45] elukey: writing in the task
[10:35:55] <3
[10:36:04] Analytics, Analytics-Kanban: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (JAllemandou) Validation has been made file by file, checking names and sizes. Data present on `thorium` at folder `/srv/backup` is now in HDFS at folder...
[10:36:09] elukey: -^
[10:36:14] elukey: shall I close?
[10:37:39] joal: ah no I need to drop later on and check that the data is gone :D
[10:38:04] * elukey bbiab!
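A rough sketch of the hash-check-and-rename step joal describes above, for readers unfamiliar with the `_COPYING_` leftovers that an interrupted `hdfs dfs -put` leaves behind. This is illustrative only: the HDFS destination path is hypothetical, and the exact commands run that day are not in the log.

```
#!/bin/bash
# Sketch: verify a partially-copied HDFS file left with a ._COPYING_ suffix,
# and promote it to its final name only if it matches the local source.
# Both paths below are placeholders, not the real backup layout.
set -euo pipefail

LOCAL=/srv/backup/backup_wikistats_1/htdocs/FR/PlotDatabaseEdits1.svg
HDFS_TMP=/wmf/data/archive/backup/thorium/htdocs/FR/PlotDatabaseEdits1.svg._COPYING_
HDFS_DST=${HDFS_TMP%._COPYING_}

local_md5=$(md5sum "$LOCAL" | awk '{print $1}')
hdfs_md5=$(hdfs dfs -cat "$HDFS_TMP" | md5sum | awk '{print $1}')

if [ "$local_md5" = "$hdfs_md5" ]; then
    # Checksums match: drop the _COPYING_ suffix instead of copying again.
    hdfs dfs -mv "$HDFS_TMP" "$HDFS_DST"
else
    # Mismatch: remove the partial file and copy anew.
    hdfs dfs -rm "$HDFS_TMP"
    hdfs dfs -put "$LOCAL" "$HDFS_DST"
fi
```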
[10:38:29] ack elukey
[10:45:07] * klausman lunch
[10:50:08] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1117.eqiad.wmnet', 'an-worker1118.eqiad.wmnet'] ` and were **ALL** successful.
[10:53:37] * joal likes when it's **ALL** successful.
[10:55:30] Analytics-Radar, Editing-team, MediaWiki-Page-editing, Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (daniel) Codesearch shows 22 extensions that use these hooks: https://codesearch.wmcloud.org/extensions/?q=Ed...
[11:08:45] a-team: How can I debug reportupdater jobs, without runner machine access? Are the logs mirrored anywhere? Can I ask a one-time favor of copying some logs to stat1004, maybe?
[11:09:02] Hi awight
[11:09:15] logs are not mirrored anywhere AFAIK
[11:10:38] awight: I'm happy to help - I assume the runner is an-launcher1002?
[11:11:00] awight: then which job
[11:11:01] ?
[11:11:02] joal: Hi :-) Yes pretty sure that's it. It would also be helpful to see the latest output files for my jobs
[11:11:16] codemirror, templatedata, and visualeditor
[11:11:56] I've run the scripts directly, and output looks fine to me... but what's copied to graphite makes no sense so far.
[11:12:30] We could think about creating a VM only for RU in theory...
[11:12:56] really sorry for the lack of logs awight, you are the first one requesting them :D
[11:13:02] but we can definitely do something about it
[11:13:17] another alternative is to send logs to logstash, but it is a partial solution since you also need to check files right?
[11:13:21] elukey: hehe I was about to apologize for the trouble, we went a bit overboard writing these queries.
[11:13:47] elukey, awight: I wish we had better ways for teams to troubleshoot their workflows
[11:14:05] !log add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes)
[11:14:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:14:18] elukey, awight: after/during the move to airflow, maybe we'll make sure to send logs to logstash?
[11:14:27] joal: if the end goal is to replace RU with airflow we will! (hopefully)
[11:14:30] awight: https://gist.github.com/jobar/5ea4a54ab92412891abae51af4f27f69
[11:14:36] elukey: it sure is!
[11:14:52] elukey: the goal is to replace all scheduling with airflow (ok, maybe not timers :)
[11:16:04] joal: O_O interesting output. I guess I might need to debug more of the integration, by actually calling a local checkout of reportupdater on stat*.
[11:18:03] awight: https://gist.github.com/jobar/4ab93f5d57b5fbe9ddf4cde27d0244bb
[11:18:59] and https://gist.github.com/jobar/85dd5fd5d254dace6d9979e29e4363a9 awight
[11:19:08] awight: sorry for nothing better :(
[11:19:54] Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Patch-For-Review, and 2 others: Adjust edit count bucketing for CodeMirror - https://phabricator.wikimedia.org/T273471 (awight) @JAllemandou has kindly copied the logged failures: ` Feb 25 10:56:50 an-launcher10...
[11:20:23] joal: Yeah those are some unfortunate logs, but very helpful! I can think of some things to check, starting with the `date` reserved keyword change.
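The failures shared above were pulled from journald on an-launcher1002. Someone with access could gather them along these lines; this is a hedged sketch, and the systemd unit names are an assumption guessed from a `reportupdater-*` naming pattern, not confirmed against puppet.

```
# Sketch only: how an admin on the runner host might collect reportupdater logs.
ssh an-launcher1002.eqiad.wmnet

# Find the reportupdater timers (naming pattern assumed, listed before guessing).
systemctl list-timers | grep -i reportupdater

# Dump recent logs for one job so they can be handed over as a gist or a file on stat1004.
# "reportupdater-codemirror.service" is a hypothetical unit name.
sudo journalctl -u reportupdater-codemirror.service --since "2021-02-25" > /tmp/reportupdater-codemirror.log
```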
[11:20:40] yessir :)
[11:20:55] there can't be only positive aspects to moving to new stuff :)
[11:21:01] haha
[11:21:34] joal: btw, I see that the job "Succeeded", perhaps there are also problems with wiring the reportupdater error code through to job output.
[11:22:09] awight: possible - I'm no good at reportupdater internals so I can't really say :S
[11:26:19] Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Patch-For-Review, and 3 others: Adjust edit count bucketing for VisualEditor, segment all metrics - https://phabricator.wikimedia.org/T273474 (awight) Lots of errors in the logs: ` Feb 25 11:00:01 an-launcher100...
[11:35:25] Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), WMDE-TechWish (Sprint-2021-02-03), WMDE-TechWish-Sprint-2021-02-17: Adjust edit count bucketing for TemplateData - https://phabricator.wikimedia.org/T272569 (awight)
[11:44:32] (PS4) Neil P. Quinn-WMF: Fix inconsistent Hive query fail [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/665406 (owner: Milimetric)
[11:44:46] (CR) Neil P. Quinn-WMF: [V: +2 C: +2] Fix inconsistent Hive query fail [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/665406 (owner: Milimetric)
[11:48:29] Analytics-Radar, Product-Analytics, wmfdata-python: Consider rewriting wmfdata-python to use omniduct - https://phabricator.wikimedia.org/T275038 (nshahquinn-wmf) p:Low→Medium
[11:52:28] Analytics-Radar, Datasets-General-or-Unknown, Dumps-Generation, Product-Analytics, Structured-Data-Backlog (Current Work): Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (Miriam) @ArielGlenn thanks for clarifying this. I chatted with @Cormac...
[11:55:32] Hive is no longer printing column headers in the output. Maybe another upgrade casualty.
[11:58:31] awight: can you try with 'set hive.cli.print.header=true;'?
[12:04:32] going to have a quick lunch
[12:04:35] * elukey afk! lunch
[12:07:57] joal: It had no effect--also, I found that other queries are printing column headers, so this isn't an upgrade thing, it's weirder.
[12:09:45] (I'm playing with the escaped backticks now...)
[12:14:59] ack awight - weird indeed!
[12:15:30] * joal is gone for a break
[12:24:39] FWIW, it looks like a concurrent output issue, the `grep -v parquet.hadoop` is a brute-force method of removing loglines, however one of those loglines pauses mid-buffer, then my output header and result table are written, then the logs continue mid-line again. So my header gets stripped, and some junk gets concatenated to the end of the result table. I think the correct way to solve this is that we
[12:24:45] control log levels explicitly rather than using grep -v.
[12:29:08] Analytics-Radar: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight)
[12:30:56] both workers in service!!
\o/
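A minimal sketch of the fix awight suggests above: silence Hive's logging at the source instead of grepping it out of the result stream. The query and output paths are placeholders, and whether every logger involved (parquet's in particular) honours `hive.root.logger` is exactly what T275757 is about, so this is a starting point rather than a confirmed fix.

```
# Sketch: keep Hive/parquet log noise out of the report instead of `grep -v`-ing it away.
# Query and file paths are placeholders.
hive \
  --hiveconf hive.root.logger=WARN,console \
  --hiveconf hive.cli.print.header=true \
  -e "SELECT wiki, COUNT(*) AS edits FROM some_db.some_table GROUP BY wiki" \
  > /tmp/report.tsv 2> /tmp/report.log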
[13:23:24] Analytics, SRE, ops-eqiad: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (jbond) p:Triage→Medium
[13:35:25] !log drop /srv/backup/backup_wikistats_1 from thorium
[13:35:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:36:08] !log drop /srv/backup/wikistats from thorium
[13:36:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:40:15] Analytics, Analytics-Kanban: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (elukey) @JAllemandou in /srv/backup there was a hardlink to `/srv/analytics.wikimedia.org/published/datasets/archive/public-datasets`, so when I deleted...
[13:40:44] sorry joal --^ :(
[13:43:10] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) All right an-worker111[7,8] (previously in the backup cluster) were bootstrapped fine with the new fixed gid/uid, will proceed with the rest of the backup cluster in T274795 before pr...
[13:43:25] Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (elukey) Stalled→Open
[13:50:55] Analytics: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (elukey)
[13:59:54] Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (elukey) Current status: - an-worker111[7,8] already reimage/wiped and added to the Analytics Hadoop cluster - newest/last 6 nodes to be configured in https://pha...
[14:03:30] Analytics-EventLogging, Analytics-Radar, Better Use Of Data, Event-Platform, and 4 others: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId - https://phabricator.wikimedia.org/T263041 (Mholloway) a:Mholloway Thanks @jlinehan and @Krinkle for re...
[14:04:14] nshahquinn: did you deploy that oozie job and is it working or still giving problems?
[14:06:12] (PS1) Gehel: Standardize CI builds on Maven Wrapper. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/666898
[14:09:09] awight, cc joal: apologies for those errors, basically the first one when row[0] is None means the header of the output file used to have date in the first column, and the new output has something else as the first column. ReportUpdater tries to be smart and adapt the output to fit new fields, but it doesn't do a good job of protecting the first column. The result is that old output is merged with new output and the new
[14:09:09] output has None in the first column. That's why I had to revert awight's nice patch that renamed the first column to report_date, I tried to write the reasons in that patch but it was late at night so it might've come out as my usual troll-speak.
[14:09:45] Analytics-Radar: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight) A related issue is that SLF4J is complaining, and I believe this topic is unmaskable without changing config. Can we fix the root cause? ` SLF4J: Class path contains multiple SLF4J bind...
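To make milimetric's first-column point concrete: a reportupdater "script" report is just an executable that prints TSV, and the first header column needs to stay `date` so reportupdater can merge old and new report files. The sketch below is hypothetical — the argument convention (start and end date) and the wrapper shape are assumptions based on the existing query scripts, not a documented contract.

```
#!/bin/bash
# Hypothetical reportupdater-style query script.
# Assumption: reportupdater invokes it with the report's start and end dates.
start_date="$1"
end_date="$2"

# Keep the first output column literally named "date": renaming it (e.g. to
# report_date) leaves None in the first column when old output is merged in.
# `date` is a reserved keyword in newer Hive, hence the escaped backticks.
hive \
  --hiveconf hive.root.logger=WARN,console \
  --hiveconf hive.cli.print.header=true \
  -e "
    SELECT '$start_date' AS \`date\`,
           COUNT(*)       AS edits
    FROM   some_db.some_table
    WHERE  dt >= '$start_date' AND dt < '$end_date'
  " 2> /dev/null
```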
[14:13:13] awight: the other error, where it's trying to make the empty row with [None] * blah..., I got that one too but I'm so sorry I wracked my brain and can't remember why. Either way, the sad way to debug is to just hack the python files and add print statements, it'll tell you pretty quickly what's wrong. And I hesitate to improve any of this since RU jobs are first up for migration to airflow
[14:13:41] (my guess is that's another symptom of renaming the date column, so maybe just clean that up first and see how it goes)
[14:13:58] Analytics, Analytics-Kanban, observability: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (Ottomata) Weird, and it is very intermittent and seems to happen to all broker checks. I just ran a bunch of manual check_prometheus_metric.py comma...
[14:14:33] milimetric: ty, I have a guess about one of my errors, it seems to be caused by concurrent logging.
[14:22:27] (CR) Ottomata: [C: +1] Standardize CI builds on Maven Wrapper. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/666898 (owner: Gehel)
[14:26:50] ottomata: thanks for the +1 ^
[14:27:02] I don't have +2 rights on that repo. Could you merge it for me?
[14:27:54] gehel: the only thing i don't love is
[14:27:54] wrapperUrl=https://repo.maven.apache.org/maven2/io/takari/maven-wrapper/0.5.6/maven-wrapper-0.5.6.jar
[14:28:02] could we get that from archiva instead?
[14:28:17] yes, we could, let me fix it!
[14:28:18] actually, this is a problem for the discovery.pom too
[14:28:43] when I build wikimedia-event-utilities on e.g. stat1004 i have to set java -D http proxy settings
[14:29:03] discovery pom already uses archiva
[14:29:07] hm
[14:29:20] problem is the wrapper config in wikimedia-event-utilities
[14:29:38] i actually always forget to use the mvnw
[14:29:57] ok i don't remember what needed http proxy
[14:30:03] i'll look next time i try and build like that
[14:30:57] elukey: do you wish me to triple check files from the other folder? if the hardlink was at whole folder, it's probably ok?
[14:31:49] ottomata: that should fix the issue: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/666903
[14:33:33] (PS2) Gehel: Standardize CI builds on Maven Wrapper. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/666898
[14:33:58] ottomata: and that should fix the issue for refinery as well: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/666898
[14:42:31] (CR) Ottomata: [C: +2] Standardize CI builds on Maven Wrapper. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/666898 (owner: Gehel)
[14:42:33] (PS1) Gehel: Use Archiva to download Maven Wrapper. [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/666906
[14:42:35] ty!
[14:48:37] joal: it is yes, should be fine! I just wanted your opinion
[14:48:51] then, let's drop :)
[14:49:28] ottomata: thanks for the merges!
[15:10:36] Analytics, SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (jbond) p:Triage→Medium
[15:20:22] Analytics, Analytics-Kanban, observability: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (fgiunchedi) My hunch here is that the `2m` modifier is quite close to the scrape time (`1m`) so there might not be enough data from time to time, chan...
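For the proxy problem ottomata mentions in the Maven Wrapper thread above, a sketch of how a build on a stat box without direct internet access might be pointed at the HTTP proxy and, per Gehel's follow-up patches, at Archiva for the wrapper jar. The proxy host/port are assumptions from memory, not values taken from the merged changes.

```
# Sketch: building wikimedia-event-utilities / refinery-source from a stat box.
# Proxy host and port are assumptions; check the merged patches for the real setup.
export MAVEN_OPTS="-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 \
                   -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080"

# Use the wrapper so everyone builds with the same Maven version.
./mvnw clean package

# The wrapper jar itself is fetched from the wrapperUrl in
# .mvn/wrapper/maven-wrapper.properties; pointing that at archiva.wikimedia.org
# instead of repo.maven.apache.org avoids needing the proxy for that first download.
```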
[15:20:36] joal: so the directory was a hardlink itself (that makes sense, now I get why backup was only a link) to /srv/published-rsynced/2/datasets/archive/public-datasets
[15:31:33] Analytics-Radar: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight) Playing with logging is uncovering more errors. When run like `hive --hiveconf hive.root.logger=ERROR,console`, you can see each of our boilerplate `cast(split('2021-02-11', '-')[2] as...
[15:32:14] Is there any reason I should avoid writing reportupdater-query scripts using "beeline -e" rather than "hive -e" as the commandline? It seems to come with better logging defaults.
[15:43:08] (CR) Mforns: "Thanks again for adding this schema to the include-list!" (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/666227 (owner: Erin Yener)
[15:44:19] (PS1) Awight: Rewrite date match to avoid buggy UDF [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666932 (https://phabricator.wikimedia.org/T275757)
[15:46:37] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Jdlrobson) Stalled→Open @phuedx should we call this re...
[15:46:59] (PS1) Awight: Filter out oversampled events [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666933 (https://phabricator.wikimedia.org/T273454)
[15:54:23] (PS1) Awight: Switch to beeline to avoid stray logging [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666942 (https://phabricator.wikimedia.org/T275757)
[16:00:40] ^ this is what I mean. According to the manual btw, `hive` is deprecated in favor of `hiveserver2`, but I get errors trying to run that on stat1004.
[16:01:15] (CR) Mforns: "And thanks again for this work!" (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/666229 (owner: Erin Yener)
[16:03:04] Analytics, FR-Tech-Analytics, Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (mforns) Hi all! I reviewed the include-list patches and left some comments there. Please, don't feel overwhelmed by the review...
[16:04:25] Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (elukey) Procedure to add the LVM volume (script to be tested): ` #!/bin/bash if [ ! -e /dev/mapper/*unused* ] then echo "Dropping unused volume"; /usr/s...
[16:07:52] joal: can I pick your brain about something?
[16:09:16] awight: beeline is fine, I just find it annoying for daily use, so I don't know much about debugging problems with it. I think if you `which beeline` you'll see we have a wrapper around the actual command (I forget the details)
[16:12:14] milimetric: Okay, thanks for the info! May be preferable to change our default log4j settings in that case, if you agree I can prepare a patch for review.
[16:12:52] awight: sure, change away
[16:15:46] milimetric: yes! I just redeployed it and it's working great :D
[16:15:56] sweet, thx
[16:15:58] (the wikipediapreview_stats Oozie job, that is)
[16:16:08] Thanks for finding the solution!
[16:17:30] Analytics: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (Ottomata)
[16:18:27] Analytics: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (Ottomata) This will allow us to stop including python specific versions of packages with our spark2 distribution.
[16:21:00] * elukey bbiab!
[16:24:40] (CR) Awight: "Alternative, more permanent solution: Ibe7534928b21a3a566e0e699dfcb41154a5c4601" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666942 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[16:25:11] hey awight can I help with reportupdater?
[16:31:14] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (phuedx) Open→Resolved a:phuedx Yes! Thanks for the...
[16:54:07] (CR) Mforns: "Now that I'm thinking..." [analytics/refinery] - https://gerrit.wikimedia.org/r/661799 (owner: Ebernhardson)
[16:58:03] Analytics, Analytics-Kanban, SRE, Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (mforns) a:mforns
[16:59:40] away
[16:59:42] uff
[16:59:44] :)
[17:04:46] !log rebalance kafka partitions for webrequest_upload_3
[17:04:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:47:41] mforns: Sure, and thanks for all the review! There's a new issue, T275757, and I think my preferred solution is the attached puppet patch. Does it seem reasonable?
[17:47:41] T275757: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757
[17:58:46] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (nshahquinn-wmf) I just tried out Newpyter on stat1008 and had one big issue: I can't deactivate my current conda environment. When I try `source deactivate`, I get `bash: deact...
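On the `source deactivate` failure nshahquinn-wmf reports just above: in recent conda releases `source deactivate` was replaced by `conda deactivate`, which only works after conda's shell hook has been loaded. A possible workaround is sketched below; the anaconda-wmf install path is an assumption, not confirmed from the Newpyter setup.

```
# Sketch: newer conda drops `source deactivate` in favour of `conda deactivate`.
# The base install path below is an assumption; adjust to where anaconda-wmf lives.
source /usr/lib/anaconda-wmf/etc/profile.d/conda.sh
conda deactivate        # leave the currently active (e.g. cloned) environment
conda env list          # confirm which environment is now active
```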
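The beeline switch discussed above might look roughly like the following inside a query script. This is a sketch, not the contents of the Gerrit patch: the flags shown are standard beeline options, and it assumes the local `beeline` wrapper (which milimetric notes exists) already supplies the JDBC URL and authentication.

```
# Sketch: emit clean TSV with a header via beeline instead of the hive CLI.
# Query and table are placeholders; connection details come from the wrapper.
beeline \
  --silent=true \
  --showHeader=true \
  --outputformat=tsv2 \
  -e "SELECT wiki, COUNT(*) AS edits FROM some_db.some_table GROUP BY wiki"
```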
[16:17:30] 10Analytics: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (10Ottomata) [16:18:27] 10Analytics: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (10Ottomata) This will allow us to stop including python specific versions of packages with our spark2 distribution. [16:21:00] * elukey bbiab! [16:24:40] (03CR) 10Awight: "Alternative, more permanent solution: Ibe7534928b21a3a566e0e699dfcb41154a5c4601" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/666942 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [16:25:11] hey awight can I help with reportupdater? [16:31:14] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10phuedx) 05Open→03Resolved a:03phuedx Yes! Thanks for the... [16:54:07] (03CR) 10Mforns: "Now that I'm thinking..." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/661799 (owner: 10Ebernhardson) [16:58:03] 10Analytics, 10Analytics-Kanban, 10SRE, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10mforns) a:03mforns [16:59:40] away [16:59:42] uff [16:59:44] :) [17:04:46] !log rebalance kafka partitions for webrequest_upload_3 [17:04:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:47:41] mforns: Sure, and thanks for all the review! There's a new issue, T275757, and I think my preferred solution is the attached puppet patch. Does it seem reasonable? [17:47:41] T275757: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 [17:58:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10nshahquinn-wmf) I just tried out Newpyter on stat1008 and had one big issue: I can't deactivate my current conda environment. When I try `source deactivate`, I get `bash: deact... [18:40:16] * elukey afk! [19:55:15] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Jdlrobson) 💪