[04:58:17] 10Quarry, 10DBA, 10Data-Services: Quarry query became work much slower - https://phabricator.wikimedia.org/T247978 (10Marostegui) Unfortunately, the servers that we use for Quarry and for the all wikireplicas in general is very specific (and very costly) so we do not have hot spares ready to take over any mo... [05:08:27] (03PS9) 10Nuria: Replace numeral with numbro and fix bytes formatting [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/585725 (https://phabricator.wikimedia.org/T199386) (owner: 10Fdans) [05:09:02] nuria: o/ [05:31:11] (03CR) 10Fdans: [V: 03+2 C: 03+2] "Checked everything good, merging, thanks for the last PS @nuria" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/585725 (https://phabricator.wikimedia.org/T199386) (owner: 10Fdans) [05:37:47] (03PS1) 10Fdans: Release 2.7.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/593976 [05:37:54] (03CR) 10Fdans: [V: 03+2 C: 03+2] Release 2.7.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/593976 (owner: 10Fdans) [06:13:44] (03PS1) 10Elukey: Assign execute permission to cx/daily_abuse_filter_count [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/593978 [06:21:10] good morning [06:21:58] Hi [06:50:51] !log upgrade druid-exporter on all druid nodes [06:50:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:55:56] hellooooo elukey [06:56:43] elukey: luca I need to ask you a lil something [06:57:53] could you take a look at ezachte's crontab and let me know the full command of DammitCompactHourlyOrDailyPageCountFiles [07:00:42] fdans: sure [07:57:50] 10Analytics, 10Analytics-Kanban, 10Dumps-Generation: Document missing project types in pagecount dumps - https://phabricator.wikimedia.org/T249984 (10fdans) [08:20:41] 10Analytics, 10Analytics-Kanban: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10MoritzMuehlenhoff) >>! In T251006#6080835, @Ottomata wrote: > I think we could do this by only pushing the debian/ dir to gerrit, and including in instructions how to set up... [08:24:44] 10Analytics, 10Analytics-Kanban: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10elukey) >>! In T251006#6103548, @MoritzMuehlenhoff wrote: >>>! In T251006#6080835, @Ottomata wrote: >> I think we could do this by only pushing the debian/ dir to gerrit, an... [08:43:34] 10Analytics, 10Analytics-Kanban: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10MoritzMuehlenhoff) The size mentioned by Otto (6G) should be fine, IIRC there's some limitation within ar which imposes a maximum size of 10 digit bytes (so ~ 9.5 GiB), but... [09:37:52] joal: bonjour! I deployed the new druid exporter, and with a quick puppet change [09:37:55] https://grafana.wikimedia.org/d/000000538/druid?panelId=59&fullscreen&orgId=1 [09:37:58] ta daaan [09:38:00] :) [09:43:49] milimetric: hellooo when you're in can we chat for a couple mins? [09:45:12] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10ema) >>! In T237993#6074940, @elukey wrote: > - the HTTP status `000` seems to be used for clients that have some trouble doing a HTTP request to ats-tls, without ev... 
[10:32:10] 10Analytics, 10Pywikibot, 10Wikimedia-Site-requests, 10User-Urbanecm: Provide some Pywikibot usage statistics for Python2.7 and Python3.x - https://phabricator.wikimedia.org/T242157 (10Urbanecm) Sorry @Multichill, missed your message. Seems the warning messages that got sent out helped a lot: {F31801803} [10:34:59] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (10elukey) I would seriously think about evaluating Gobblin (see https://gobblin.readthedocs.io/en/latest/miscellaneo... [10:37:54] 10Analytics, 10Operations: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10elukey) Maybe we could add a flag to use `programname` selectively and apply to analytics timers? Then if nothing break the rest of the timers could be migrated by... [11:04:38] (03PS1) 10Gilles: LayoutJank schema is deprecated, now LayoutShift [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594135 (https://phabricator.wikimedia.org/T216594) [11:18:17] Hi folks :) [11:18:42] elukey: thanks a lot about druid-exporter - having a good vision of what happens on those big beasts is so very important :) [11:18:45] <3 [11:23:21] <3 [11:32:33] joal: very interesting - druid.broker.http.numConnections is set to 20 [11:32:44] "Size of connection pool for the Broker to connect to Historical and real-time processes. If there are more queries than this number that all need to speak to the same process, then they will queue up." [11:33:16] in the public cluster we are definitely queuing then [11:33:48] (in the brokers) [11:34:01] right [11:34:07] hm [11:35:02] elukey: given the CPU usage of those druid machines, we probably can raise that number! [11:35:08] yeah [11:35:16] also we'll get two more nodes for each cluster [11:35:20] that will smooth those numbers [11:35:27] elukey: I assume raising the number means more parallel queries, which ould be good [11:35:33] ri [11:35:33] ght [11:35:36] super [11:35:45] Let;s try that :) [11:36:08] ack! I am going away now to get groceries, will send a patch after lunch :) [11:36:17] elukey: also, I have experienced some spark difficulties again with shuffle stage - do you think we could raise NM memory to 8G as we discussed ? [11:36:33] ack elukey - later ) [11:38:18] joal: yep will do it today :) [11:50:47] 10Analytics, 10Analytics-Kanban: Add page_restrictions table to sqoop list - https://phabricator.wikimedia.org/T251749 (10JAllemandou) [11:51:20] 10Analytics, 10Dumps-Generation, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: page_restrictions field incomplete in current and historical dumps - https://phabricator.wikimedia.org/T251411 (10JAllemandou) I created T251749 to add the `page_restrictions` table to the tables we sqoop. [12:05:06] PROBLEM - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:07:51] 10Analytics, 10Operations: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10fgiunchedi) IIRC the startswith is there to cater for multi-instance systemd units (e.g. prometheus, elasticsearch, etc) so they all log to the same file. Having a... 
[12:09:58] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10Amire80) [12:22:29] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10Huji) @fdans What is the message group name on Translatewiki? I might be able to provide Persian translations as well. [12:24:24] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10Amire80) >>! In T251376#6104367, @Huji wrote: > @fdans What is the message group name on Translatewiki? I might be able to provide Persian translations as well. Description: https://translatewiki.net/w... [12:26:42] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: [Wikistats v2] Default selection for (active) editors is confusing for inexperienced users - https://phabricator.wikimedia.org/T213800 (10Nemo_bis) This bug continues to be highly confusing for users. Even experienced users may get lost trying to access... [12:27:43] hey teammm, good afternoon [12:27:50] slash morning [12:34:08] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM! Thanks!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/593978 (owner: 10Elukey) [12:35:01] hi elukey, I saw there were errors with cx/daily_abuse_filter_count [12:35:23] but I thought that job was inactive at the moment! I didn't yet switch it on in puppet! [12:35:33] suprised it ran [12:35:39] *surprised [12:48:55] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10fdans) @Huji that would be fantastic! Thank you so much for your help. We should definitely add a "help translate this site" link on Wikistats. [12:54:04] elukey: can you please let me know when you're back? [13:03:31] ottomata: WDQS meeting if you are around... [13:03:41] and you want to talk to us :) [13:04:05] hiya just signed on, eating bfast, drinking coffee, checking emails, will skip unless you need me for something in particular [13:05:12] ottomata: all good! We'll have notes in https://etherpad.wikimedia.org/p/streaming-wdqs and we'll ping you if we have anything [13:05:18] great thank you [13:06:02] gehel: likely i'll do this for future meetings too unless, (this time just isn't the best for me!) I can always make it if there is something in particular I should be there for so please don't hesitate to let me know! [13:06:14] oh gehel [13:06:17] * questions about deployment on yarn [13:06:17] s [13:06:25] sounds like something I could help with? [13:06:28] sounds good [13:06:34] for today? [13:06:38] just saw that in the agenda [13:06:39] should I come? [13:06:44] nope, seems Flink specific and dcausse already knows how to do it [13:06:51] ok great [13:06:52] thanks [13:08:43] RECOVERY - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:09:24] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10Huji) Yes, that would be a good idea. I am currently adding Persian translations. A couple issues: * There is a lot of redundancy in the messages. The phrase "Access site" needed to be translated mult... 
[13:09:41] ottomata: we'll be trying to talk to ververica to see what kind of training they can provide, we're already quite a few people in that conversation, but if you want to be part of that, let me know [13:10:16] mforns: ah! didn't know it! [13:10:21] do you want to absent it? [13:10:24] joal: I am back [13:10:41] elukey: no no, I was planning to switch it on today! [13:10:51] if it's running, cool! [13:10:57] ahahha okok [13:11:02] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10Ottomata) > I propose renaming it to prometheus-rdkafka-exporter and using it every Sounds great! We could then use it in node instead of https://github.com/wikimed... [13:11:12] elukey: Hi! [13:11:56] elukey: sqoop has failed without message :( [13:12:03] buuuuu [13:12:09] elukey: failure is entirely my fault, but no email is not cool :( [13:12:32] joal: this is the first on an-launcher right? [13:12:39] if so it is surely Luca's fault [13:12:49] I might have missed some parameter etc.. [13:12:52] checking [13:13:56] elukey: I'm also working on fixing my bug [13:14:51] gehel: no its ok! I'd love to join for training but don'tneed to help plan it [13:14:52] thanks [13:19:45] joal: so sqoop failed but the logs are there right? What do you mean with "without a message" ? [13:19:58] elukey: no email from timer is what I mean [13:20:31] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (10Ottomata) [13:20:35] joal: so it must be that the script didn't return non-zero [13:20:48] Meh : [13:26:05] joal: found it - python/refinery/sqoop.py line 86 onward [13:26:26] the try/except catches the exception, and it only logs it [13:26:39] so the script ends up returning zero at the end [13:26:55] pffffff [13:27:04] ok will check that [13:27:56] elukey: see sqoop-mediawiki-tables (calling script) line 254 [13:30:30] joal: so in theory should exit 1 [13:30:34] I think it should [13:30:53] elukey: cause the `ERROR generating ORM jar for` log line I see in logs [13:31:19] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (10Ottomata) > I would seriously think about evaluating Gobblin Gobblin does look better than Camus, and I think it... [13:34:37] joal: can we repro? maybe executing manually [13:35:19] if I check refinery-sqoop-whole-mediawiki.service [13:35:25] Active: inactive (dead) since Fri 2020-05-01 00:04:43 UTC; 3 days ago [13:35:28] Main PID: 25227 (code=exited, status=0/SUCCESS) [13:35:38] :( [13:35:50] elukey: I can definitely repro the error - (and found the bug) [13:35:57] 10Analytics, 10Analytics-Kanban: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Ottomata) > The size mentioned by Otto (6G) That's the uncompressed size, the actually .deb size is something more like 2ish GB (IIRC). > you can simply run "DIST=buster p... [13:36:36] joal: bc? [13:37:01] elukey: in meeting, will join in minutes [13:37:06] ahh okok [13:38:46] ready elukey [13:39:32] joal: we can do on IRC np, can we try to re-run refinery-sqoop-whole-mediawiki? [13:39:38] it should break right? 
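The missing-alert behaviour elukey traces to python/refinery/sqoop.py above is the classic swallowed-exception pattern: the error is caught and only logged, the loop continues, and the interpreter still exits 0, so systemd sees a successful run and never sends the email. A minimal sketch of that pattern and of the fix (turning the failure into a non-zero exit code); the helper name sqoop_wiki and the wiki list are invented for illustration, this is not the actual refinery code:

```python
import logging
import sys


def sqoop_wiki(wiki):
    """Stand-in for the per-wiki import step (hypothetical helper)."""
    raise RuntimeError('ERROR generating ORM jar for {}'.format(wiki))


def sqoop_all(wikis):
    failed = False
    for wiki in wikis:
        try:
            sqoop_wiki(wiki)
        except Exception:
            # Catching and only logging swallows the failure: the loop keeps
            # going and, unless the caller checks the flag, the process still
            # exits 0 and the systemd timer never reports an error.
            logging.exception('sqoop failed for %s', wiki)
            failed = True
    return failed


if __name__ == '__main__':
    # Turning the failure flag into a non-zero exit code is what lets
    # systemd (and its email/Icinga hooks) see the problem.
    sys.exit(1 if sqoop_all(['etwiki']) else 0)
```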
[13:39:46] I want to test the script with set -e [13:39:47] correct [13:40:13] ok restarting [13:43:14] !log restart refinery-sqoop-whole-mediawiki to test failure exit codes [13:43:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:44:17] nope still exit 0 [13:44:20] will keep digging [13:44:51] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) [13:45:12] on my end elukey I'll have a CR to correct the bug in minutes [13:45:58] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), 10MW-1.34-notes (1.34.0-wmf.20; 2019-08-27): Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 (10Ottomata) > So, you want to leave only EventBus::getInstance( $n... [13:46:42] (03PS1) 10Joal: Fix bug introduced adding yarn queue to sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594171 [13:46:46] elukey: --^ [13:47:11] I'll do a quick refinery deploy after it gets merged [13:48:38] looks good, lemme try to see what's wrong first [13:50:26] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) [13:51:56] joal: ahhhhh kerberos-run-command doesn't propagate the subprocess.call's return code [13:51:59] * elukey cries in a corner [13:52:03] :/ [13:52:07] lemme check now [13:52:28] elukey: I wouldn't have thought about that for a loooooong time - meh [13:52:53] elukey: 1 error, 2 bugs corrected - not that bad ;) [13:54:52] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10ema) >>! In T237993#6104570, @Ottomata wrote: >> I propose renaming it to prometheus-rdkafka-exporter and using it every > Sounds great! We could then use it in nod... [13:55:02] mforns: was just responding to that EditAtttempStep error [13:55:08] about to re-run it and am creating a ticket too [13:55:11] will post how I did [13:55:28] ottomata: oh, cool [13:55:36] you filtered out the event? [13:55:46] Refine has [13:55:50] a --dataframereader_options [13:55:51] flag [13:55:53] so [13:55:54] --dataframereader_options=mode=DROPMALFORMED [13:56:04] https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html [13:56:12] oh great! [13:56:13] at least, am tryiing that now :) [13:56:25] a few months ago we changed modes to FAILFASTT [13:56:29] to catch things like this [13:56:37] i think there was some table we were silenting faling for [13:57:01] ottomata: I have seen the error and was happy :) [13:57:15] yeh! [13:57:29] cool [14:03:36] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10Ottomata) > We need to essentially add custom metrics to the data structure dumped to disk as JSON, Wouldn't a prometheus-rdkafka-exporter expose the metrics via HTT... 
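The second bug, kerberos-run-command not propagating the wrapped command's exit status, is the same failure one layer up: subprocess.call() returns the child's exit code, and a wrapper that drops it leaves the unit at status=0/SUCCESS no matter what the wrapped script did. A sketch of the propagation fix, assuming a deliberately simplified wrapper (the real script's Kerberos and keytab handling is left out):

```python
import subprocess
import sys


def main(argv):
    if len(argv) < 2:
        print('usage: kerberos-run-command <user> <command> [args...]',
              file=sys.stderr)
        return 2
    # The Kerberos part (kinit as <user> with its keytab) is omitted in this
    # sketch; only the exit-code handling is shown.
    _user, command = argv[0], argv[1:]
    ret = subprocess.call(command)
    # subprocess.call() hands back the child's exit code; discarding it is
    # what made systemd report status=0/SUCCESS for a failed sqoop run.
    return ret


if __name__ == '__main__':
    # sys.exit(main(...)) is the actual fix: the child's failure becomes the
    # unit's failure, so the timer can alert.
    sys.exit(main(sys.argv[1:]))
```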
[14:10:08] 10Analytics, 10Analytics-EventLogging: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (10Ottomata) [14:14:59] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594191/ [14:15:09] /o\ [14:15:37] Process: 23332 ExecStart=/usr/local/bin/kerberos-run-command analytics /usr/local/bin/refinery-sqoop-whole-mediawiki (code=exited, status=1/FAILURE) [14:15:40] Main PID: 23332 (code=exited, status=1/FAILURE) [14:15:46] yep, alert incoming [14:15:59] PROBLEM - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:16:00] ok elukey :) Let's merge both and redeploy I guess? [14:16:11] joal: yes definitely, mistery solved :) [14:16:32] (03CR) 10Elukey: [C: 03+1] Fix bug introduced adding yarn queue to sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594171 (owner: 10Joal) [14:18:02] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594171 (owner: 10Joal) [14:18:06] (03PS1) 10Fdans: Change "Active Editors" to registered user editors only [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/594194 (https://phabricator.wikimedia.org/T213800) [14:18:17] elukey: refinery fix merged - shall I deploy? [14:18:32] ok you're faster than I am - deploying :) [14:19:20] Actually while we are at it elukey, can I ask for a patch in sqoop script to add prod queue? [14:21:07] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10ayounsi) [14:23:08] joal: sure [14:23:17] getting a coffee [14:23:20] elukey: currently making it [14:24:10] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:24:43] PROBLEM - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:24:43] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1008 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [14:26:33] hmmm ^ [14:26:42] sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet [14:26:42] active [14:26:45] at leats that looks ok [14:26:55] ottomata: I think it is kerberos-run-command [14:27:04] now it returns the error codes of the calling script [14:27:13] ooh? [14:27:14] oh [14:27:23] ok [14:27:30] yes sorry just merged, then I went for coffee :D [14:27:41] I am going to check/fix all [14:29:27] moritzm: o/, yt? [14:30:15] yep [14:30:31] so your comment [14:30:32] https://phabricator.wikimedia.org/T251006#6103548 [14:30:36] means no git-buildpackage, right? [14:30:39] so no gbp.conf? [14:31:02] yeah, just the command should be enough [14:31:02] and if not [14:31:06] how can I also do [14:31:07] https://phabricator.wikimedia.org/T233020#6104607 [14:31:08] ? [14:31:14] or do I not need to? 
[14:31:58] these are only needed for git-buildpackage, regular "pdebuild" doesn't have the problem, I made various stretch builds without problems so far [14:37:47] oh ok great [14:37:50] thank you [14:41:22] (03PS1) 10Joal: Fix sqoop yarn queue (bis) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594201 [14:41:26] (03CR) 10Mforns: Change "Active Editors" to registered user editors only (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/594194 (https://phabricator.wikimedia.org/T213800) (owner: 10Fdans) [14:43:19] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (10Ottomata) p:05Triage→03High [14:43:33] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [14:43:48] (03CR) 10Fdans: Change "Active Editors" to registered user editors only (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/594194 (https://phabricator.wikimedia.org/T213800) (owner: 10Fdans) [14:43:58] mforns: thanks for taking a look :) [14:44:07] ahhh what a joy kerberos-run-command [14:44:26] fdans: np! [14:44:45] elukey: I actually found a leftover bug again in sqoop - finalizing my CRs, and then I'll deploy [14:45:07] here we [14:45:27] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594201 (owner: 10Joal) [14:45:41] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on an-airflow1001 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [14:47:33] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: EventGate validation errors should be visible in logstash - https://phabricator.wikimedia.org/T116719 (10Ottomata) Update: EventGate service logs are already collected into logstash, including validation error logs.... [14:48:40] ok ready for refinery deploy (sqoop patch) [14:49:55] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on an-coord1001 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [14:50:22] fdans: but there is no 'name' property in the metric config no? 
[14:50:35] !log Deploy refinery using scap to fix sqoop [14:50:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:51:03] Wow the icinga messages in IRC I didn't miss ::) [14:54:06] how the hell we didn't see the issue before [14:54:09] (03CR) 10Mforns: Change "Active Editors" to registered user editors only (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/594194 (https://phabricator.wikimedia.org/T213800) (owner: 10Fdans) [14:54:51] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1008 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [14:55:12] elukey: it was ALL WORKING :) [14:55:57] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (10Nuria) [14:56:31] for example [14:56:32] User class threw exception: java.lang.ClassCastException: java.util.LinkedHashMap cannot be cast to scala.runtime.Nothing [14:56:38] in refine_sanitize_eventlogging_analytics_immediate [14:56:46] I am not sure since when this fails [14:57:02] at org.wikimedia.analytics.refinery.job.refine.EventLoggingSanitization$.apply(EventLoggingSanitization.scala:140) [14:57:06] mforns: --^ [14:57:07] wow [14:57:10] :S [14:57:34] elukey: O.o [14:58:18] elukey: only the immediate? [14:58:24] for the moment yes :D [14:58:42] ok, that's good news [14:59:15] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1006 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [15:00:13] yes yes ufff [15:01:38] ping ottomata [15:12:11] ok - we're ready restart sqoop jobs [15:13:58] elukey: do you do it or shall I --^ ? [15:14:08] joal: need to run puppet first, 1 min [15:14:14] elukey: done ;) [15:14:21] elukey: I checked an-launcher1001 :) [15:14:22] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on an-tool1006 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [15:14:27] ah okok [15:16:46] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [15:35:19] elukey: ping on sqoop - you or me? [15:35:26] joal: gogo [15:35:29] Ack [15:36:46] elukey: confirming strategy: I reset-failed the service, and start it manually after - correct? 
[15:36:54] joal: just restart it [15:37:01] ack [15:37:53] !log restart refinery-sqoop-mediawiki-private.service [15:37:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:39:21] !log restart refinery-sqoop-whole-mediawiki.service [15:39:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:40:19] 10Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (10Nuria) [15:40:29] 10Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (10Nuria) a:03fdans [15:40:51] hm - sqoop fails [15:41:21] elukey: sqoop private failed :( [15:41:33] elukey: and acutally whole as well [15:41:42] elukey: I suspect tmp folders issue [15:42:06] elukey: we need /tmp/sqoop-jars folder with analytics write rights [15:42:12] on an-launcher [15:42:46] PROBLEM - Check the last execution of refinery-sqoop-mediawiki-private on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:42:54] elukey@an-launcher1001:~$ ls -ld /tmp/sqoop-jars [15:42:54] drwxr-xr-x 2 analytics analytics 4096 May 4 15:42 /tmp/sqoop-jars [15:42:56] done! [15:43:01] we should add it in puppet though joal [15:43:07] We should !!!! [15:43:17] Creating a task [15:43:28] Actually, checking before creating the task [15:46:35] still failing :( [16:00:24] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:01:54] elukey: driver is linked on an-launcher :( ls -la /usr/lib/sqoop/lib [16:02:27] joal: and what is the error again? [16:03:46] java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver [16:04:03] meh [16:15:15] joal: I am wondering if it is a problem of jar not picked up or jar different than the mysql version [16:15:23] but I suspect the former [16:15:23] elukey: I think it is [16:15:39] elukey: the latter sorry [16:15:56] but it seems strange that they break compatibility [16:16:38] elukey: If I change the jdbc:mysql: to jdbc:mariadb: in the connection url the error changes [16:16:47] tool.BaseSqoopTool: Got error creating database manager: java.io.IOException: No manager for connect string: jdbc:mariadb://labsdb1012.eqiad.wmnet/etwiki_p [16:17:08] so maybe it's actually the former :S [16:20:00] elukey: can I suggest comething? [16:20:07] sure sure [16:20:16] elukey: could you try to rebuild the link mysql-connector-java.jar -> /usr/share/java/mariadb-java-client-2.3.0.jar [16:20:33] using mariadb-connector-java.jar [16:20:36] ? [16:20:47] meh - nevermind elukey [16:21:02] given java uses classes to link, shouldn't change anything [16:22:18] RECOVERY - At least one Hadoop HDFS NameNode is active on an-master1001 is OK: Hadoop Active NameNode OKAY: an-master1001-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:23:36] ok elukey - got it working adding explicitely the driver name in command [16:23:46] elukey: shall I patch the script adding that explicitely? 
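The workaround found above, keeping the jdbc:mysql: connection string but naming the driver class explicitly, corresponds to Sqoop's --driver option. A hedged sketch of the kind of command the refinery wrapper ends up building; the credentials, password file and target paths below are placeholders rather than the real job's values:

```python
import subprocess

# Placeholder values for illustration; the real wrapper derives these from
# its configuration and from the wiki being sqooped.
sqoop_cmd = [
    'sqoop', 'import',
    '--connect', 'jdbc:mysql://labsdb1012.eqiad.wmnet/etwiki_p',
    # Explicit driver class: org.mariadb.jdbc.Driver on buster,
    # com.mysql.jdbc.Driver on stretch. Without --driver, Sqoop guesses from
    # the URL prefix: jdbc:mysql: wants com.mysql.jdbc.Driver (not shipped by
    # the MariaDB client jar on buster) and jdbc:mariadb: has no built-in
    # manager at all, which are the two errors seen above.
    '--driver', 'org.mariadb.jdbc.Driver',
    '--username', 'research',
    '--password-file', '/user/analytics/mysql-password.txt',
    '--table', 'page',
    '--target-dir', '/tmp/sqoop-driver-test',
]
subprocess.check_call(sqoop_cmd)
```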
[16:24:16] joal: +1 [16:24:21] ok [16:29:08] also elukey - should I put sqoop default driver to com.mysql.jdbc.Driver (stretch) or org.mariadb.jdbc.Driver (buster) [16:29:24] I think stretch is safer for now, or are we planning to move to buster soon? [16:30:01] joal: but sqoop runs on buster now no? Or do you mean on an-coord? [16:30:22] elukey: we use sqoop on stat1004 manually [16:30:52] hm - the script is for prod - let's make default for buster? [16:30:56] elukey: --^ [16:31:08] +1 [16:36:02] (03PS1) 10Joal: Add driver_class option to sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594232 [16:36:09] elukey: --^ [16:36:52] (03CR) 10Elukey: [C: 03+1] "neat!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594232 (owner: 10Joal) [16:37:10] Let me test before I merge [16:42:43] ok confirmed [16:43:09] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging to fix sqoop" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594232 (owner: 10Joal) [16:43:15] 10Analytics: Cannot see SQL lab tab on UI - https://phabricator.wikimedia.org/T251787 (10Nuria) [16:44:43] !log Deploy refinery again using scap (trying to fox sqoop) [16:44:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:46:10] Also elukey - If we could put some thoughts about trying speeding up deployment of refinery, I'd be very happy :) [16:46:25] elukey: getting jars from archiva is looooooooooooong [16:46:39] sure [16:51:37] 10Analytics: Add folder creation for sqoop initial installation in puppet - https://phabricator.wikimedia.org/T251788 (10JAllemandou) [16:51:44] bearloga: o/ - when you have a moment, let me know if it works [16:53:16] (03CR) 10Nuria: [C: 03+2] LayoutJank schema is deprecated, now LayoutShift [analytics/refinery] - 10https://gerrit.wikimedia.org/r/594135 (https://phabricator.wikimedia.org/T216594) (owner: 10Gilles) [16:53:49] elukey: yep! SQL Lab in the UI, querying works. thanks! I'll check with jennifer too [16:55:19] bearloga: I got what it is the issue, will keep it in mind, thanks for the feedback! [16:57:46] PROBLEM - Check the last execution of refinery-drop-webrequest-refined-partitions on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-refined-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:58:36] Permission denied: user=analytics, access=ALL, inode="/wmf/data/wmf/webrequest/webrequest_source=text/year=2019/month=12/day=14/hour=18 [16:59:03] the hour is hdfs:analytics-privatedata-user [16:59:16] and the script is trying hdfs dfs -rm -R -skipTrash /wmf/data/wmf/webrequest/webrequest_source=text/year=2019 [16:59:59] ah right drwxr-x--- - hdfs analytics-privatedata-users [17:00:13] so analytics gets group permissions that cannot write [17:01:05] there are a couple of hours with wrong perms, fixing [17:02:58] !log chown analytics (was: hdfs) /wmf/data/wmf/webrequest/webrequest_source=text/year=2019/month=12/day=14/hour={13,18} [17:03:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:03:17] !log Restart refinery-sqoop-whole-mediawiki.service on an-launcher1001 [17:03:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:03:41] !log restart refinery-drop-webrequest-refined-partitions after manual chown [17:03:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:06:29] elukey: IT HAS STARTED ! 
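The refinery-drop-webrequest-refined-partitions failure above is a plain ownership mismatch: the two hours were owned by hdfs with group analytics-privatedata-users and mode drwxr-x---, so the analytics user running the drop job only had read-only group permissions for the recursive delete. A sketch of the check and of the chown that was applied, written as the subprocess calls a small maintenance script might make (one of the two affected partitions shown; the sudo -u hdfs superuser step is an assumption of this sketch):

```python
import subprocess

PART = ('/wmf/data/wmf/webrequest/webrequest_source=text/'
        'year=2019/month=12/day=14/hour=18')

# Show the directory entry itself (not its children) to inspect the owner,
# group and mode of the offending partition.
subprocess.check_call(['hdfs', 'dfs', '-ls', '-d', PART])

# Re-own the partition to the user that runs the drop job so its recursive
# "-rm -R -skipTrash" delete can succeed; run as the hdfs superuser.
subprocess.check_call(
    ['sudo', '-u', 'hdfs', 'hdfs', 'dfs', '-chown', '-R', 'analytics', PART])
```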
[17:07:00] \o/ [17:08:19] !log Restart refinery-sqoop-mediawiki-private.service on an-launcher1001 [17:08:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:08:34] RECOVERY - Check the last execution of refinery-drop-webrequest-refined-partitions on an-launcher1001 is OK: OK: Status of the systemd unit refinery-drop-webrequest-refined-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:09:16] ok team - gone for diner with kids, will be back checking sqoop after [17:09:16] RECOVERY - Check the last execution of refinery-sqoop-mediawiki-private on an-launcher1001 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:09:50] RECOVERY - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1001 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:13:34] mforns: o/ are you going to open a task for the sanitize immediate failure? [17:17:04] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on an-coord1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:20:30] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: EventGate validation errors should be visible in logstash - https://phabricator.wikimedia.org/T116719 (10Krinkle) @Ottomata Where do they end up in Logstash exactly? What's an example query for someone interested to... [17:23:48] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on an-airflow1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:23:48] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:23:48] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on an-tool1006 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:23:48] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1006 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:24:43] ottomata: so icinga looks clear now, I hope everything works.. will check later! [17:25:04] * elukey afk for a couple of hours! [17:28:25] elukey: yes, will create a task and start on it [17:47:50] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [17:48:03] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Patch-For-Review: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) Failure at https://analytics.frdev.wikimedia.org/users/userinfo/ ` Sorry, something went wrong 500 - Internal Server Err... 
[17:49:18] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Patch-For-Review: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) Failure at https://analytics.frdev.wikimedia.org/users/list/ ` Sorry, something went wrong 500 - Internal Server Error S... [17:49:28] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) [17:50:05] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Patch-For-Review: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) Failure at https://analytics.frdev.wikimedia.org/roles/list/ Sorry, something went wrong 500 - Internal Server Error Sta... [18:08:10] 10Analytics, 10Better Use Of Data, 10Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10LGoto) [18:22:18] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [18:26:15] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) Te delayed job did not alert, because it's executed once a day only. But it has the same problem. [18:26:38] 10Analytics: check leftovers of jmorgan - https://phabricator.wikimedia.org/T251600 (10leila) @elukey I checked with Jonathan. You can purge them all. [18:26:50] 10Analytics: check leftovers of jmorgan - https://phabricator.wikimedia.org/T251600 (10leila) a:05leila→03None [18:26:52] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) Good news is that the issue has been going on only since the last refinery/refinery-source deployment. [18:27:42] mforns: can you describe issue in sanitization ticket (https://phabricator.wikimedia.org/T251794)? [18:28:03] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10Product-Analytics (Kanban): SQL definition for structure data in commons metrics - https://phabricator.wikimedia.org/T247101 (10jwang) [18:28:59] 10Analytics, 10Patch-For-Review, 10Product-Analytics (Kanban): SQL definition for wikidata metrics for tunning session - https://phabricator.wikimedia.org/T247099 (10jwang) [18:30:54] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10Nuria) @JAllemandou I think we also need to update docs on wikitech with this change. 
[18:35:07] 10Analytics, 10Analytics-Kanban: Support language variations on Wikistats - https://phabricator.wikimedia.org/T251091 (10Nuria) 05Open→03Resolved [18:35:43] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review, 10good first task: Wikistats - move from numeral to numbro for better localization support - https://phabricator.wikimedia.org/T199386 (10Nuria) 05Open→03Resolved [18:35:54] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review, 10good first task: Wikistats - move from numeral to numbro for better localization support - https://phabricator.wikimedia.org/T199386 (10Nuria) [18:44:00] ottomata: do you think you could take a look at this jenkins job, https://integration.wikimedia.org/ci/job/wikidata-query-rdf-maven-release-docker-wdqs/7/console and see if you know what the issue with the archiva credentials could be? [18:45:47] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10Nuria) Added example to AQS api wikitech docs: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Slice_and_dice_pageview_counts Closing [18:45:53] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10Nuria) 05Open→03Resolved [18:46:12] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10Nuria) [18:46:44] 10Analytics, 10Analytics-Kanban: Language selector is not pressable in mobile site - https://phabricator.wikimedia.org/T246971 (10Nuria) @fdans I think you mentioned this was still an issue on iOS [18:47:18] @nuria: not still an issue, it’s just that the change isn’t merged [18:47:46] fdans: but there is no change in ticket [18:48:17] 10Analytics, 10Analytics-Kanban: Add "automated" dimension to Total Page Views metric on Wikistats - https://phabricator.wikimedia.org/T251170 (10Nuria) 05Open→03Resolved [18:48:19] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10Nuria) [18:48:33] nuria: ya gerrit is bad https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/589606/ [18:48:44] 10Analytics, 10Analytics-Kanban: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (10Nuria) closing, all documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection [18:48:51] 10Analytics, 10Analytics-Kanban: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (10Nuria) 05Open→03Resolved [18:48:53] 10Analytics: Deploy high volume bot spike detector to hungarian wikipedia - https://phabricator.wikimedia.org/T238358 (10Nuria) [18:48:58] 10Analytics, 10Analytics-Kanban: Vet high volume bot spike detection code - https://phabricator.wikimedia.org/T238363 (10Nuria) [18:49:21] 10Analytics, 10Analytics-Kanban: Language selector is not pressable in mobile site - https://phabricator.wikimedia.org/T246971 (10Nuria) chnageset: https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/589606/ [18:50:44] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2 pageviews trend figure is wrong - https://phabricator.wikimedia.org/T212032 (10Nuria) 05Open→03Resolved [18:52:58] 10Analytics, 10Analytics-Cluster, 10Analytics-Wikistats: Add proper trend numbers to wikistats metrics - https://phabricator.wikimedia.org/T251813 (10Nuria) [18:54:33] 10Analytics, 10Analytics-Kanban: Tune up thresholds of data quality hourly alarms - 
https://phabricator.wikimedia.org/T251814 (10Nuria) [18:54:51] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) The job fails with the following error: ` 20/05/04 06:10:15 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.Cla... [18:55:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix non MapReduce execution of GeoCode UDF - https://phabricator.wikimedia.org/T238432 (10Nuria) 05Open→03Resolved [18:55:27] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Users having issues with presto sqllab on superset - https://phabricator.wikimedia.org/T249923 (10Nuria) 05Open→03Resolved [18:55:43] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) Hm, see https://bitbucket.org/asomov/snakeyaml/issues/392/bug-upgrading-from-118-119. They seem to have the same problem when upgrading from snakeyaml 118 to 119. [18:55:52] 10Analytics, 10Analytics-Kanban: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 (10Nuria) 05Open→03Resolved [18:55:54] 10Analytics: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10Nuria) [18:56:08] 10Analytics, 10Analytics-Kanban: Add hourly resolution to data quality outage/censhorship alarms - https://phabricator.wikimedia.org/T249759 (10Nuria) 05Open→03Resolved [18:56:17] 10Analytics, 10Analytics-Kanban: Add hourly resolution to data quality outage/censhorship alarms - https://phabricator.wikimedia.org/T249759 (10Nuria) [19:01:05] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add new dimensions to druid's pageview_hourly datasource - https://phabricator.wikimedia.org/T243090 (10Nuria) I think turnilo needs a re-start cause the dimensions are not available [19:07:30] 10Analytics: Deploy high volume bot spike detector to hungarian wikipedia - https://phabricator.wikimedia.org/T238358 (10Nuria) Closing as we have deployed spike detector code to all wikipedias. See hungarian: https://stats.wikimedia.org/#/hu.wikipedia.org/reading/total-page-views/normal|line|1-month|agent~user... [19:07:41] 10Analytics: Deploy high volume bot spike detector to hungarian wikipedia - https://phabricator.wikimedia.org/T238358 (10Nuria) 05Open→03Resolved [19:07:45] 10Analytics, 10Patch-For-Review: Label high volume bot spikes in pageview data as automated traffic - https://phabricator.wikimedia.org/T238357 (10Nuria) [19:13:03] 10Analytics, 10Analytics-Kanban: Add TLS to Kafka Mirror Maker - https://phabricator.wikimedia.org/T250250 (10Nuria) 05Open→03Resolved [19:13:05] 10Analytics: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10Nuria) [19:14:36] https://www.irccloud.com/pastebin/tycBtfjz/ [19:14:48] maryum: please see above [19:15:29] nuria: thanks,I'll take a look [19:16:03] ottomata,nuria: figured out the issue and put a comment in the phab ticket as a heads up [19:16:38] maryum: the issue with archiva? want to send ticket along? 
[19:16:56] yes, https://phabricator.wikimedia.org/T247123 [19:18:46] 10Analytics, 10Analytics-Kanban: Make spark-refine resilient to incorrectly formatted _REFINED files - https://phabricator.wikimedia.org/T246706 (10Nuria) 05Open→03Resolved [19:19:27] 10Analytics, 10Analytics-Kanban: geoeditors-yearly job times out - https://phabricator.wikimedia.org/T246753 (10Nuria) 05Open→03Resolved [19:20:13] 10Analytics, 10Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (10mforns) OK, executing `mvn dependency:tree` I get: ` [INFO] +- com.github.ua-parser:uap-java:jar:1.4.4-core0.6.10~1-wmf:compile [INFO] | +- org.yaml:snakeyaml:jar:1.20:compile ` Eve... [19:21:45] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Run a script to check REFINE_FAILED flags daily - https://phabricator.wikimedia.org/T240230 (10Nuria) Nice work on this. [19:22:28] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Run a script to check REFINE_FAILED flags daily - https://phabricator.wikimedia.org/T240230 (10Nuria) 05Open→03Resolved [19:23:00] 10Analytics: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10Nuria) [19:23:02] 10Analytics, 10Analytics-Kanban: Enable TLS encryption from Eventgate to Kafka - https://phabricator.wikimedia.org/T250149 (10Nuria) 05Open→03Resolved [19:23:50] 10Analytics, 10Analytics-Kanban: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (10Nuria) Pinging @Quiddity to confirm this is been fixed [19:23:52] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [19:25:05] mforns: Did you checked that the yaml in the sanitization list is valid? (just to be triple sure) [19:25:12] (03PS1) 10Mforns: Adapt EventLoggingSanitization to snakeyaml version 1.20 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/594283 (https://phabricator.wikimedia.org/T251794) [19:25:32] nuria: yes, I checked [19:25:39] mforns: nice [19:25:46] I think I found the problem [19:25:51] it's in the task [19:26:24] mforns: ya, i just wanted to make sure the basics were covered [19:26:28] mforns: did you tested: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/594283/1/refinery-spark/pom.xml? [19:26:57] nuria: not yet [19:27:05] will -1 for now [19:27:30] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (10Mayakp.wiki) Hi @DLynch , can we look at what may have caused this issue? Wanted to be sure if this was just a lone incident or so... [19:27:35] (03CR) 10Mforns: [C: 04-1] "Still needs to be tested!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/594283 (https://phabricator.wikimedia.org/T251794) (owner: 10Mforns) [19:28:13] but couldn't find any use of snakeyaml other than eventloggingsanitization [19:29:11] mforns: looks like it is going to work [19:30:06] ok, tomorrow I will test it and deploy if OK [19:34:20] mforns: ya, np! [19:37:08] 10Analytics, 10Analytics-Kanban: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (10Quiddity) LGTM, thanks! 
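The sanitization breakage mforns pins down in T251794 reproduces in isolation once uap-java pulls in snakeyaml 1.20 transitively: from snakeyaml 1.19 onwards load is a generic <T> T load(...), and a bare yaml.load(text) call from Scala lets the compiler infer T = Nothing, giving exactly the LinkedHashMap to scala.runtime.Nothing ClassCastException seen in the job. A minimal Scala sketch of the failure and of the type-pinning workaround; it is not the EventLoggingSanitization code itself, and the YAML string is an invented stand-in for the sanitization allowlist:

```scala
import org.yaml.snakeyaml.Yaml

object SnakeyamlLoadSketch {
  def main(args: Array[String]): Unit = {
    val yaml = new Yaml()
    val text = "eventlogging: {some_field: keep}"

    // With snakeyaml <= 1.18, load(String) returned Object and callers cast
    // the result themselves. From 1.19, load is <T> T load(...), and with no
    // expected type Scala infers T = Nothing, so this line throws
    // "java.util.LinkedHashMap cannot be cast to scala.runtime.Nothing$":
    // val broken = yaml.load(text)

    // Pinning the type parameter (or using loadAs) restores the old behaviour.
    val ok = yaml.load[java.util.Map[String, Object]](text)
    println(ok)
  }
}
```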
[19:37:53] 10Analytics, 10Analytics-Kanban: Stats menu says {{$t(`areas-${a.path}`)} - https://phabricator.wikimedia.org/T247725 (10Nuria) 05Open→03Resolved [19:46:57] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (10DLynch) If it's only happened once, I'm inclined to call it some sort of fluke. That said... it came from WikiEditor, so we can te... [19:57:37] 10Analytics, 10I18n, 10RTL: Support right-to-left languages - https://phabricator.wikimedia.org/T251376 (10Amire80) >>! In T251376#6104565, @Huji wrote: > Yes, that would be a good idea. > > I am currently adding Persian translations. A couple issues: > > * There is a lot of redundancy in the messages. The... [20:03:26] RECOVERY - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is OK: OK: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:16:06] PROBLEM - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:22:55] 10Analytics, 10Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (10Nuria) [20:22:57] 10Analytics, 10Pageviews-Anomaly: Abnormal peaks @ huwiki - https://phabricator.wikimedia.org/T249792 (10Nuria) [20:25:29] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: EventGate validation errors should be visible in logstash - https://phabricator.wikimedia.org/T116719 (10Ottomata) I don't have a validation error in logstash as an example atm. But, to get service logs, you could q... [20:57:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Ottomata) @MoritzMuehlenhoff I wonder if you have some tips for building a correct .orig.tar.gz file. Upstream does not release this, it is provided e... [22:07:54] 10Analytics, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Patch-For-Review: Install superset on front end server for analytics - https://phabricator.wikimedia.org/T245755 (10Jgreen) Alrighty, I finally figured out that superset does not play well with 'binary' as a database character set. Worked around... [23:04:04] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1004 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [23:13:28] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1007 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs [23:14:46] 10Analytics, 10Operations, 10observability: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10colewhite) p:05Triage→03Medium