[00:31:10] 10Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (10Ottomata) Not sure what the 'right' thing to do is, but a quick search for that error brought me to https://stackoverflow.com/questions/30761867/mysql-error-the-maximum-column-size-is-767-bytes with some suggestions. Been a while... [00:55:57] (03PS6) 10Mforns: Add oozie job for session length computation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) [01:35:09] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) Ok, so an-worker11[23]9 needs the network stuff figured out by onsite still, but the installer loop issue i was having is due t... [02:11:46] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1133.eqiad.wmnet ` Th... [02:14:08] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [02:15:44] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [02:19:11] (03CR) 10Milimetric: [C: 03+1] "Hm, Joseph points out we have this UDF in Java already, just a matter of updating the regexes and using it here. 
Up to you, Baho, if you " [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [02:33:47] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1133.eqiad.wmnet'] ` and were **ALL** successful. [02:42:42] 10Analytics, 10Product-Analytics: Big increase in traffic for projects except 'wikipedia' family since Feb 14th - https://phabricator.wikimedia.org/T274823 (10Joseagush) Hi all, I also found that big increase traffic for projects in most local wikipedias in Indonesia has same problem, except bug.wiki. Please c... [02:45:04] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [05:50:13] I'm running into an issue when testing my AQS pageviews/per-country Oozie job that didn't appear when I tested it a few weeks ago [05:50:54] This log is from one of my runs, and can be replicated when running the Oozie job's hive query through Hue as well: https://hue.wikimedia.org/hue/jobbrowser/#!id=task_1612875249838_45243_m_000000 [05:52:19] This is the main part of the error: ```Error: Error running query: java.lang.AssertionError: Internal error: While invoking method 'public org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator$Frame org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel(org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject) throws [05:52:19] org.apache.hadoop.hive.ql.parse.SemanticException' (state=,code=0)``` [05:53:44] Furthermore, when I run the query through the command line (hive), I receive a different error: `NoViableAltException(350@[()* loopback of 430:20: ( ( LSQUARE ^ expression RSQUARE !) 
| ( DOT ^ identifier ) )*])` [05:54:07] Again, this error did not appear when I first ran the query several weeks ago [05:55:45] The query file is located at `/home/lexnasser/oozie/cassandra/daily/pageview_top_percountry.hql` on stat1007, and can be run for example as: ```hive -f pageview_top_percountry.hql -d refinery_hive_jar_path=hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.115.jar -d destination_directory=/user/lexnasser/test -d source_table=wmf.pageview_actor -d [05:55:45] country_blacklist_table=wmf.geoeditors_blacklist_country -d separator=\\t -d year=2021 -d month=2 -d day=16``` [05:56:43] Because this error is only occurring after the Hadoop upgrade, I'm thinking the issue is related to that, but not sure. Any help or suggestions would be greatly appreciated :) [06:50:04] hey lexnasser ! [06:51:26] qq - are you running the oozie job with the last version of refinery deployed to hdfs? [06:51:32] a lot of things have changed [06:52:07] I'm not, I'll try that out! [06:52:23] for example, I see refinery-hive-0.0.115.jar [06:52:36] I think we have 0.1.1 now, or even more [06:52:59] (I am checking all the jars in the .properties config) [06:53:35] not sure if it will fix but I recall some trouble with UDFs right after the upgrade [06:53:45] Any idea why the query alone is failing? [06:53:55] Just because of the UDF I'd guess? [06:54:11] The thing that confuses me is that it's a ParseException [06:55:01] Either way, I'll try re-pulling the latest refinery [06:55:56] yes exactly let's see how it goes with the latest [06:57:59] lexnasser: is it using a UDF? I don't see it in Hue.. 
also the exception seems more something hive related, it really feels like old jars not working with the new version [06:58:14] Joseph updated refinery source to drop old cdh deps [06:58:32] Yeah, it uses UDF [06:58:36] here's the file: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/654924/2/oozie/cassandra/daily/pageview_top_percountry.hql#73 [06:59:02] line 32 [07:00:57] super ignorant, anyway let's kick it off with the new jars :) [07:01:02] (in the property I mean) [07:01:30] because in the stderr I see [07:01:31] Caused by: java.lang.RuntimeException: java.lang.AssertionError:false at org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel(HiveRelDecorrelator.java:683) [07:01:48] that looks really a hive-internal lib mismatching for some reason [07:03:28] ah but https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelDecorrelator.java#L683 is interesting [07:04:37] it may be due to the query itself lexnasser [07:05:21] yeah just tried re-running with both hive and oozie with the 0.0.1 jar, and I'm still getting both issues [07:05:39] the jar is 0.1.1 right? (just to double check) [07:05:50] oops, that's what i meant [07:06:31] sorry I am re-reading now what you wrote above, I missed that you were testing the query in hive directly sorry [07:06:39] (coffee still not doing its work :) [07:06:45] ok then it is the query [07:08:11] lexnasser: if you want add your case to https://phabricator.wikimedia.org/T274322 so people can check alter on [07:08:18] *later [07:09:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10lexnasser) [07:10:07] added! 
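The `hive -f pageview_top_percountry.hql -d key=value` invocation quoted earlier works by substituting `${key}` placeholders inside the .hql before execution. A rough Python sketch of that hivevar-style substitution (a simplification of Hive's actual variable substitution; the script text and jar path below are illustrative, not the real query):

```python
import re

def substitute_hivevars(hql, hivevars):
    """Replace ${name} placeholders with values passed via -d name=value.

    Unknown variables are left untouched, mirroring how an undefined
    hivevar surfaces later as a parse/semantic error."""
    def repl(m):
        name = m.group(1)
        return str(hivevars.get(name, m.group(0)))
    return re.sub(r"\$\{(\w+)\}", repl, hql)

# Illustrative script fragment; the jar path is a made-up example.
hql = (
    "ADD JAR ${refinery_hive_jar_path};\n"
    "INSERT OVERWRITE DIRECTORY '${destination_directory}' ..."
)
out = substitute_hivevars(hql, {
    "refinery_hive_jar_path": "hdfs://analytics-hadoop/tmp/refinery-hive-example.jar",
    "destination_directory": "/user/lexnasser/test",
})
```

This is also why testing with an old `refinery_hive_jar_path` value silently pins the query to the old UDF jar.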
[07:14:54] I'm just searching Google rn for any similar HiveRelDecorrelator issues, lmk if you want me to test anything out [07:17:24] 10Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (10elukey) I have something in my notes: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Superset#Upgrade_DB We already have `innodb_file_format=Barracuda` in the config, can you check if doing `ALTER TABLE row_level_secu... [07:20:35] lexnasser: can you also add a summary (if it is not too late) about what breaks in the task? [07:21:10] sure! I'll probably be on for another 30 minutes [07:22:11] since you think it's likely an issue with the query, I've switched to just removing chunks of the query to see if I can narrow down the general area thats causing the error [07:23:07] yeah I am wondering this bit [07:23:08] if (rel.getGroupType() != Aggregate.Group.SIMPLE) { [07:23:08] throw new AssertionError(Bug.CALCITE_461_FIXED); [07:23:08] } [07:24:22] see https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/core/Aggregate.html [07:24:40] "It corresponds to the GROUP BY operator in a SQL query statement, together with the aggregate functions in the SELECT clause. " [07:24:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10lexnasser) I'm running into an issue when testing my AQS pageviews/per-country Oozie job that didn't appear when I tested it a few weeks ago This log is from one... 
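The Calcite check quoted above throws unless the aggregate's group type is SIMPLE, which is presumably what a GROUPING SETS query trips, since grouping sets produce non-SIMPLE group types. A toy Python sketch of what a GROUPING SETS aggregation computes (hypothetical rows, not the production pageview query):

```python
from collections import Counter

# Hypothetical (country, agent_type, views) rows, for illustration only.
rows = [
    ("US", "user", 10),
    ("US", "spider", 4),
    ("FR", "user", 7),
]

def grouping_sets(rows, sets):
    """Aggregate once per grouping set, like GROUP BY ... GROUPING SETS.

    Columns absent from a set are NULLed out (None here), which is how
    the special 'all-agents' style rollup rows get produced."""
    out = Counter()
    for keys in sets:
        for country, agent, views in rows:
            group_key = (
                country if "country" in keys else None,
                agent if "agent" in keys else None,
            )
            out[(tuple(sorted(keys)), group_key)] += views
    return out

# GROUPING SETS ((country, agent), (country)) in one pass:
result = grouping_sets(rows, [{"country", "agent"}, {"country"}])
```

The `(country,)` set yields the rollup rows (e.g. US total 14 with agent NULL) alongside the fine-grained `(country, agent)` rows.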
[07:26:07] I suspect it may be the GROUPING SETS [07:27:17] yeah, I was just able to replicate the org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator error in Hive, it's in that first `raw ` CTE [07:28:12] But I get a slightly different message: ```Exception in thread "main" java.lang.AssertionError: Internal error: While invoking method 'public org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator$Frame org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel(org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit)'``` [07:28:19] HiveSortLimit? [07:30:34] lexnasser: is there a specific line of code to check? [07:31:29] I'd like to see what that assert is about in the code [07:32:27] do u mean a specific line in the query? [07:33:16] or a hadoop code line in the stack trace? [07:33:35] https://www.irccloud.com/pastebin/NIgiu5Ck/Per%20country%20stack%20trace [07:34:45] lexnasser: nice, see at line 95, same assert failing [07:36:14] lexnasser: does it work if you remove the GROUPING SETS? [07:36:41] I mean from the testing of the raw bit [07:37:53] just removed GROUPING SETS but now I get an error `No partition predicate for Alias "pageview_actor" Table "pageview_actor"` [07:38:18] here's my full query [07:38:33] https://www.irccloud.com/pastebin/YORj6Ohb/ [07:39:06] I am very ignorant about grouping sets, never really used it [07:40:51] yeah, I just learned about it from the other queries, it just aggregates the results of certain group-bys (e.g., adding the result of each agent type for a special 'all-agents' access parameter) [07:41:56] do you know why i'm getting the partition predicate error for this query? I have WHERE year =, month =, day = for the pageview_actor table [07:44:41] no idea [07:46:06] elukey: no worries, feel free to skip on this for now. its possible someone else has seen this issue. 
I'll head to bed, and also maybe I'll have an epiphany in my dream like joal ;) [07:46:27] ack! [07:46:52] goodnight! [07:48:09] gnight! [07:50:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10elukey) Reporting a chat with Lex from IRC: from the stacktrace I see an assertion error for https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apac... [08:09:48] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering (if we should) - https://phabricator.wikimedia.org/T152731 (10Aklapper) [09:12:17] going to stop the backup cluster in a bit! [09:16:41] !log stop and decom the hadoop backup cluster [09:16:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:31:20] all right cluster down, masked all systemd units so they will not come up [09:31:29] puppet is disabled and we have downtime [09:31:54] I am going to merge a change in a bit to remove hadoop settings in puppet for the nodes [09:32:12] in this way alerts should be gone as well [09:32:27] and we'll be able to reimage/re-init workers and add them to the main cluster [09:42:33] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering (if we should) - https://phabricator.wikimedia.org/T152731 (10Acagastya) I can see advantages of having ES for a specific wiki. There are dozens if not 100s of edits taking place all over sisterhood at any moment. And in some cas... [09:43:37] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering (if we should) - https://phabricator.wikimedia.org/T152731 (10Acagastya) @Aklapper Can you please reopen this issue? [09:49:48] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) @elukey I think it is the Apache Sqoop call that fails. 
Example: **Command:** ` sudo -u analytics-privatedata kerberos-run-command... [09:57:13] GoranSM: thanks a lot for the precise report! [10:12:59] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10elukey) Very interesting, thanks a lot for the report! ` /usr/bin/sqoop does `SQOOP_JARS=`ls /var/lib/sqoop/*.jar /usr/share/java/*.jar 2>/dev/null`` `... [10:29:07] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) @elukey It works! Thank you! **Q:** > It is not the final solution of course, just a temporary hack :) There is a regular system up... [10:30:54] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10elukey) @GoranSMilovanovic we can keep stat1004 in this state for the weekend so your regular update goes through, we'll revert it only when a final sol... [10:36:52] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) @elukey > we can keep stat1004 in this state for the weekend so your regular update goes through, we'll revert it only when a final... [10:36:57] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10Aklapper) @Acagastya: If there are actual new technical arguments anyone is free to do so. 
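The sqoop wrapper quoted above builds its classpath by globbing every jar it can find (`ls /var/lib/sqoop/*.jar /usr/share/java/*.jar`), which is fragile after an upgrade: unrelated or mismatched jar versions get pulled in. A hypothetical Python sketch of spotting the resulting conflicts (jar names and paths are illustrative, not what is actually on stat1004):

```python
from collections import defaultdict
from pathlib import PurePosixPath
import re

def classpath_conflicts(jar_paths):
    """Group jars by artifact name and flag artifacts present in more
    than one version -- a common source of NoClassDefFoundError or
    method-signature mismatches after an upgrade."""
    versions = defaultdict(set)
    for p in jar_paths:
        stem = PurePosixPath(p).stem              # e.g. "guava-14.0.1"
        m = re.match(r"(.+?)-(\d[\w.]*)$", stem)  # split name / version
        if m:
            versions[m.group(1)].add(m.group(2))
        else:
            versions[stem].add("unversioned")
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}

jars = [
    "/var/lib/sqoop/guava-14.0.1.jar",
    "/usr/share/java/guava-19.0.jar",
    "/usr/share/java/slf4j-api-1.7.25.jar",
]
conflicts = classpath_conflicts(jars)
```

A narrower glob (or an explicit jar list), as in the temporary hack mentioned on the task, avoids the problem.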
[10:37:11] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10Aklapper) a:05Ottomata→03None [10:40:09] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) p:05High→03Medium [10:43:50] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) @WMDE-leszek @Lydia_Pintscher The updates are unblocked now; we expect everything to be found back in the expected state until Februa... [10:46:01] 10Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) [10:56:18] 10Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) > [x] Stop the backup cluster daemons, remove all puppet config and set role(insetup) to all new workers. @razzi adding some notes about what I did, so y... [11:04:41] 10Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) @razzi to add more confusion: you'll note that among the new workers (that were part of the Backup cluster and that now we have to repurpose) there are so... 
[12:24:49] (03PS1) 10Jdrewniak: SearchSatisfaction: Add editBucketCount [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/665317 (https://phabricator.wikimedia.org/T272991) [13:11:42] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10Acagastya) @Aklapper Should I update the task description to better explain the current arguments? [13:26:41] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10Aklapper) No, comments picking up arguments of previous comments are fine. [13:33:17] heya teammm [13:33:38] Hi mforns :) [13:33:44] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10JAllemandou) > I am glad that I was able to help Indeed ! We would have been caught by surprise next 1st of the month with our sqoop failing! Thanks a l... [13:33:45] joal or elukey: can any of you pair with me for a second to delete a couple tables from hive? [13:33:54] sure mforns [13:33:57] batcave? [13:34:00] :] batcave? [13:34:04] yes ghehe [13:53:14] elukey: Hi :) Asking for permission to manually launch an instance of drop-el-homepagevisit-events.service (an-launcher1002) to check that manual data deletion doesn't break them in a way that would alert us [13:55:31] hey joal - no need to ask permissions :) [13:55:45] 10Analytics, 10Event-Platform, 10EventStreams: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10Acagastya) 05Declined→03Open [13:55:48] 10Analytics-Kanban, 10Event-Platform, 10EventStreams, 10Services (watching), 10User-mobrovac: EventStreams - https://phabricator.wikimedia.org/T130651 (10Acagastya) [13:56:09] elukey: sudo systemctl start drop-el-homepagevisit-events.service [13:56:10] right? 
[13:56:20] yep! [13:57:01] great :) all good mforns, no rush - the timers don't break :) [13:57:17] actually I'd also need some help, if people have time later on, to check the thorium's backup for https://phabricator.wikimedia.org/T265971 [13:57:23] this time it should be better :D [13:57:30] elukey: I can do that :) [13:57:46] joal and elukey thanks! [14:03:17] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10mforns) The data has been deleted! Once those patches get merged (unused jobs), we can close this task. [14:06:10] 10Analytics, 10Analytics-Kanban, 10Growth-Scaling, 10Growth-Team, and 2 others: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10mforns) a:03mforns [14:06:35] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10mforns) I think this is done! [14:08:38] elukey: can I have your help for the backup please? [14:08:44] elukey: some files are readbal [14:08:57] readable by ezachte only [14:09:17] elukey: could you please do: find . -type f > /home/joal/backup_files.txt [14:09:20] please? [14:09:28] with a sudo in front (i forgot) [14:11:33] joal: on what host and path? thorium:/srv ? 
[14:11:48] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10MW-1.36-notes (1.36.0-wmf.22; 2020-12-15), and 2 others: [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10CBogen) [14:11:50] (to be sure) [14:11:53] elukey: thorium:/srv/backup please [14:12:13] 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure, and 4 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10CBogen) 05Open→03Resolved Confirmed that I can see the data now. Thanks all! [14:14:03] joal: done [14:20:21] thanks elukey [14:25:04] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) Luca from the past already added hdfs/yarn/mapred users to puppet! Completely forgot about it.. Of course we didn't set any specific uid/gid Some cumin magic: ` elukey@cumin1001:~$... [14:26:47] 10Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) 05Open→03Stalled Before starting we'll need to solve the uid/gid issue, see T231067 :( [14:30:56] Quick question: my team got an alert about a failed oozie job (https://hue.wikimedia.org/jobbrowser/apps#!id=0010868-210211141618764-oozie-oozi-W) will it be automatically re-tried or do we need to do that manually? (all the previous days have worked successfully and we didn't change anything) [14:31:17] bearloga: hi! Manually :) [14:31:42] elukey: ah, okay! thank you! 
[14:32:33] bearloga: to avoid issues, rerun it from https://hue.wikimedia.org/jobbrowser/apps#!id=0006702-210107075406929-oozie-oozi-C [14:32:49] so it will be re-executed with the correct user (not yours!)) [14:32:52] Hue is sneaky [14:34:36] !log rerun mobile_apps-uniques-daily-wf-2021-2-18 [14:34:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:44:02] ok, second time failed [14:46:57] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10MoritzMuehlenhoff) >>! In T231067#6843624, @elukey wrote: > After this we should be able to just reimage new nodes with fixed gid, and get consistency once all hosts will be migrated to Bust... [14:57:47] hm - this is weird [15:00:20] hi joal, looking at the mobile uniques job? [15:00:23] can I help? [15:02:58] milimetric: the failure is due to the fact that the output file of the hive job doesn't end with .gz [15:03:12] What I don't get is: Why this one and not others???? [15:03:46] milimetric: I need to drop to care children - If you wish you can have a look, I'll help when back [15:03:56] joal: I've been seeing strange things like this when looking at reportupdater queries. One example: two queries had a semicolon at the end, one worked and one failed until I removed the semicolon [15:04:08] k, I'll take a look [15:04:10] MEH? [15:05:47] 10Analytics, 10Patch-For-Review: Repackage spark without hadoop, use provided hadoop jars - https://phabricator.wikimedia.org/T274384 (10Ottomata) Ok, Option 3. is not looking good. I installed spark2 with my manually added Hadoop 2.10.1 jars on an-test-client1001, and I can't start spark locally. I get `Exc... [15:14:49] I'm not sure why these files used to be .gz and are no longer (I'm assuming something changed in the script) but isn't it just as easy as changing expected_file_name to EMPTY? 
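The failure mode joal describes above — the hive job's output file no longer ending in `.gz`, tripping the workflow's expected-file check — can be sketched like this (names are hypothetical; `expected_file_name` comes from the discussion, the rest is illustrative of the check, not the real oozie workflow code):

```python
import fnmatch

def check_output(files, expected_pattern):
    """Return the output files matching the expected pattern; an empty
    result is what surfaces as the workflow failure."""
    if not expected_pattern:      # EMPTY pattern ~ skip the check
        return list(files)
    return [f for f in files if fnmatch.fnmatch(f, expected_pattern)]

old_output = ["000000_0.gz"]      # pre-upgrade: compressed output
new_output = ["000000_0"]         # post-upgrade: no .gz suffix

ok = check_output(old_output, "*.gz")        # passes
failed = check_output(new_output, "*.gz")    # empty -> job marked failed
skipped = check_output(new_output, "")       # EMPTY pattern passes
```

Which is why relaxing `expected_file_name` to EMPTY sidesteps the symptom even before the compression change itself is understood.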
[15:22:49] a-team: I need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352 and restart oozie, it may break our ability to re-run jobs temporarily, in case blame me :D [15:23:18] thx for the ping, no worries we're not restarting anything soon [15:24:41] milimetric: /me now https://makeameme.org/meme/worked-fine-in-55u5ji [15:27:12] yeah... but you can't use this meme if you're both dev and ops :) [15:27:52] milimetric: in this case ops is the ops week person [15:27:54] not me [15:27:55] :P [15:31:46] !log restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352 [15:31:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:32:40] elukey: looks like joal is not alone in having problems with feb 18th (re: mobile_apps-uniques-daily-wf-2021-2-18). our workflow (wikipediapreview_stats-daily-wf-2021-2-18) just died again, even though everything has been fine in the days before [15:34:37] bearloga: yep yep going to check it in a second, can you retry a re-run? [15:36:50] elukey: still not enabled for me. have tried logging out and back in. going to try logging out, clearing cache, and seeing if that helps. [15:37:57] bearloga: ack, but I think there may be something else missing then sigh :( [15:38:16] milimetric: the weird 400 displayed by hue when we try to filter oozie coords/etc.. is due to [15:38:19] org.apache.oozie.servlet.XServletException: E0420: Invalid jobs filter [text=user: analytics-product], invalid name [text] [15:38:32] if you add in the search bar user:etc.. it doesn't break [15:38:36] * elukey cries in a corner [15:39:06] now there might be a filter that we can use [15:39:08] elukey: still disabled in hue and hue-next. I will file a phab task for the re-run stuff. [15:40:16] thanjs [15:40:20] *thanks! [15:41:29] elukey: should I file a separate task for the failing job or do you think that's related to whatever is suddenly going on with mobile_apps-uniques-daily? 
[15:41:49] bearloga: I'll take a look at that [15:42:05] bearloga: let's wait a sec, there might be an explanation [15:42:13] milimetric: thanks! [15:43:17] !log installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384 [15:43:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:43:19] T274384: Repackage spark without hadoop, use provided hadoop jars - https://phabricator.wikimedia.org/T274384 [15:46:58] bearloga: it's not the same issue but more like the weird Hive errors we've been having since the upgrade. I noticed this wasn't restarted since Jan so it won't have the new sharelibs. I will restart it, even though it seems crazy (it worked fine till now) because I also saw how some of that is data-dependent. Like the old sharelib was failing to write/read particular types of data. [15:47:27] good test +! [15:47:28] +1 [15:47:44] I don't find any good log that can explain the problem [15:49:52] oh bearloga I'm sorry I killed that thinking it was in refinery but I have no idea where/how you launch the analytics-product jobs from [15:49:54] wanna teach em? [15:50:00] 10Analytics, 10Product-Analytics: Can't re-run failed Oozie workflows in Hue/Hue-Next (as non-admin) - https://phabricator.wikimedia.org/T275212 (10mpopov) [15:51:38] something like... [15:51:40] https://www.irccloud.com/pastebin/c7dlujEE/ [15:52:02] milimetric: that one is from https://gerrit.wikimedia.org/g/analytics/wmf-product/jobs (cc nshahquinn ) [15:52:32] bearloga: and do you know where it's deployed? [15:53:18] it is probably on a stat node [15:53:39] !log restart oozie again to test another setting for role/admins [15:53:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:54:04] bearloga: I have another test to re-run for later to ask you :) [15:54:50] millimetric: I deployed it from stat1008, but there's no centralized version of the repo. 
The procedure for restarting a job is to clone the repo to the appropriate place, update the settings as necessary, and then run the deployment script in that repo :) [15:55:14] milimetric: elukey is right (I think), I just can't remember which node specifically. I want to say 1008 or maybe 1007, but hoping nshahquinn can clarify [15:55:16] I mean, there's no central checkout of the repo on the stats machines [15:55:42] thx nshahquinn [15:55:48] and bearloga! :) [15:59:49] so fancy... deployment scripts... not like me just brute forcing my way through like 300 restarts [16:03:18] milimetric: I mean, with that script, you'd still have to brute force your way through 300 script invocations 😂 [16:04:01] oh I'm not allowed to sudo -u analytics-product, so either one of you has to do it nshahquinn or I have to wait to get permissions [16:04:10] oh lol [16:04:15] I'll redeploy now [16:04:29] milimetric: what start date? 2021-02-18? [16:04:32] nshahquinn: the only param change is the start, 2021-02-18 is the first instance that failed, so start there [16:04:33] yea [16:04:38] 👍🏽 [16:04:50] thx for covering, I'll be more careful/knowledgeable next time [16:05:05] this is the original coord that failed: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0006702-210107075406929-oozie-oozi-C/ [16:05:18] I'll take a look to see if the new one has the same error [16:08:08] milimetric: not at all! thanks for helping out. when this is done, please let us know if there's anything we should change about the setup/deployment process for greater maintainability. The idea is that my team has to take primary responsibility for fixing any errors with our jobs, but of course if it's anything difficult, we'd probably need your help :) [16:14:32] nshahquinn: if you want us restarting these, maybe file a task to get us sudo rights, but otherwise no it's great. Maybe if the deployment script could pass parameters like -Dstart_time=... 
we should adapt that to our repo too [16:14:46] well, maybe too late, just wait for AirFlow at this point [16:17:14] milimetric: yeah, I see your point about the start time parameter in particular. Would be helpful to be able to override it from the command line, but as you said, probably better to wait for Airflow :) [16:17:14] Anyway, new coordinator deployed: https://hue.wikimedia.org/jobbrowser/apps#!id=0000028-210219155348120-oozie-oozi-C [16:21:52] milimetric: failed again :( [16:23:54] yeah, same error. It's some bug in the new Hive, we got around it in two other jobs with hacks, but don't yet understand what's going on. And the hack wouldn't even make sense here (we were writing to a temp table to prevent a schema misunderstanding in a CTE, but you're already writing to temp tables) [16:23:59] it is not great that we get a sper generic error [16:24:09] *super [16:24:37] milimetric: did you find better logs than "generic mr failure" ? [16:24:40] it's generic but consistent with the other ones that failed, Joseph dug into it and (unless I'm confusing with another one) got to some bug in how Hive reads from parquet? [16:25:01] (03CR) 10Mholloway: [C: 03+2] Enforce numeric bounds for all schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661959 (https://phabricator.wikimedia.org/T273069) (owner: 10Ottomata) [16:25:09] I'll check the old logs if they're still there, sec [16:25:13] it is weird that I don't see much in the hive server 2 logs [16:25:39] bearloga: time to attempt a re-run? 
Just to check if perms are better now [16:26:00] nah, logs are gone but I'm more sure that was it [16:26:06] sure, just a moment [16:26:24] (https://hue.wikimedia.org/oozie/list_oozie_workflow/0000580-210210190626872-oozie-oozi-W/?coordinator_job_id=0000573-210210190626872-oozie-oozi-C is one of the two that I think were failing for the same reason) [16:27:12] hmm, wow...sounds like quite a puzzle [16:28:08] elukey: cleared cookies & cache and logged back in to hue-next. still disabled [16:28:20] very strange [16:32:10] ahhhh wait bearloga! There is one thing to add to the properties file [16:32:46] you'd need to add oozie.job.acl = analytics-product-users [16:35:46] (03PS1) 10Elukey: wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 [16:35:59] bearloga, nshahquinn --^ [16:36:18] elukey: hah! beat me to it, I was about to upload patch [16:36:23] thank you! [16:36:30] (03PS2) 10Elukey: wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 [16:36:32] ah typo [16:36:33] users :D [16:36:40] bearloga: just updated! [16:38:32] (03PS3) 10Bearloga: wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 (https://phabricator.wikimedia.org/T275212) (owner: 10Elukey) [16:38:35] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1134.eqiad.wmnet',... 
[16:39:03] (03PS4) 10Awight: Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 [16:39:10] (03CR) 10Bearloga: [C: 03+1] wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 (https://phabricator.wikimedia.org/T275212) (owner: 10Elukey) [16:39:28] (03CR) 10Awight: "> Patch Set 3: Code-Review-1" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:39:36] (03CR) 10jerkins-bot: [V: 04-1] Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:39:36] we'll need to merge/re-run the job, nshahquinn sorryyyy :) [16:39:47] (03CR) 10Awight: "> I had missed pushing the fix--PS 3!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:43:40] (03PS5) 10Ottomata: Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:43:59] elukey: no problem! but I can't kill the current coordinator. I don't have the option in Hue (although I used to, before the Bigtop upgrade I guess). [16:43:59] And when I run `sudo -u analytics-product oozie job -kill 0000028-210219155348120-oozie-oozi-C` I get a weird error: `Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 1 sec. 
Retry count = 1` [16:44:13] (03CR) 10jerkins-bot: [V: 04-1] Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:44:26] (03CR) 10Ottomata: "OO, FYI, https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/661959 was just merged (I think maybe a we early with so many schema c" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:45:13] I tried it with `kerberos-run-command` but that didn't work either [16:45:18] Heya team - back I am [16:45:35] (03PS3) 10Ottomata: Update schema to handle quickview playback events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154) (owner: 10Eric Gardner) [16:45:55] I get the same error with `kerberos-run-command` [16:46:06] (03CR) 10jerkins-bot: [V: 04-1] Update schema to handle quickview playback events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154) (owner: 10Eric Gardner) [16:47:29] 10Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (10razzi) `ALTER TABLE row_level_security_filters ROW_FORMAT=DYNAMIC;` fixed it, thanks! Here's the full procedure so the order is clear: On an-coord1001: ` $ sudo mysql > drop database superset_staging; > create database superset... [16:49:05] nshahquinn: yeah I think it is a problem with the current perms, lemme kill it [16:49:57] nshahquinn: done! 
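The retry loop nshahquinn pastes ("Connection refused ... Trying after 1 sec.") is what the Oozie CLI prints when it cannot reach the Oozie server at all, which usually means the client has no server URL configured. The CLI reads the URL from the `-oozie` flag or the `OOZIE_URL` environment variable; a hedged sketch follows, where the coordinator hostname is an assumption for illustration (11000 is Oozie's default HTTP port):

```shell
# Point the oozie CLI at the server explicitly. The hostname below is an
# assumption for illustration, not taken from the cluster config.
export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie
echo "oozie CLI will talk to: $OOZIE_URL"

# Equivalent per-invocation form, using the coordinator ID from the log:
# sudo -u analytics-product oozie job -oozie "$OOZIE_URL" -kill 0000028-210219155348120-oozie-oozi-C
```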
[16:50:10] we are experiencing several issues with Hue, upstream is not really helpful [16:50:23] (03PS6) 10Awight: Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 [16:50:53] (03CR) 10jerkins-bot: [V: 04-1] Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [16:51:07] (03PS1) 10Ottomata: Re-materialize 2 schemas with enforced numeric bounds [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/665363 (https://phabricator.wikimedia.org/T273069) [16:51:40] milimetric: how may I help [16:51:40] ? [16:52:00] (03PS4) 10Ottomata: Update schema to handle quickview playback events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154) (owner: 10Eric Gardner) [16:52:02] (03CR) 10Ottomata: [C: 03+2] Re-materialize 2 schemas with enforced numeric bounds [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/665363 (https://phabricator.wikimedia.org/T273069) (owner: 10Ottomata) [16:52:30] sorry joal I gotta make lunch now but I can ping after [16:52:50] Anybody who's interested can test out Superset 1.0 on the staging instance; it's running on an-tool1005 so run `ssh -NL 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet` then go to http://localhost:8080! [16:52:55] elukey: for the oozie.job.acl patch you uploaded do you want to wait for razzi to +2 it? [16:52:58] sure np - will take it with what I have understood from irc logs [16:53:07] nice razzi !
[16:53:46] hi bearloga elukey; irc disconnected so looks like I missed some context here [16:53:52] (03CR) 10Ottomata: "We just merged a change that materializes schema versions with enforced numeric bounds, so I updated a rematerialized patch for ya so the " [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154) (owner: 10Eric Gardner) [16:53:59] bearloga: I think we can proceed, I thought it was your repo, I was kinda waiting for a +2 from you or Neil :D [16:54:10] razzi: nice for superset! [16:54:40] razzi: re: https://gerrit.wikimedia.org/r/c/analytics/wmf-product/jobs/+/665361 and sure elukey I can change my +1 to +2 :D [16:55:00] (03CR) 10Ottomata: SearchSatisfaction: Add editBucketCount (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/665317 (https://phabricator.wikimedia.org/T272991) (owner: 10Jdrewniak) [16:55:02] (03CR) 10Bearloga: [V: 03+2 C: 03+2] wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 (https://phabricator.wikimedia.org/T275212) (owner: 10Elukey) [16:55:04] (03CR) 10Elukey: [C: 03+2] wikipediapreview_stats: add ACL to allow job re-run [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665361 (https://phabricator.wikimedia.org/T275212) (owner: 10Elukey) [16:55:11] ahahah [16:55:31] razzi: so we are applying the rules to allow analytics-product to re-run their jobs in hue/oozie [16:56:34] (03CR) 10Ottomata: "FYI:" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/615562 (https://phabricator.wikimedia.org/T255302) (owner: 10Bearloga) [16:57:37] razzi: also nice summary in the superset task [16:57:40] nshahquinn: merged; do you want to re-deploy and retry that workflow? [16:58:29] (03CR) 10Razzi: "We may find some issues on superset staging but we can begin to review this code in the meantime." 
[analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (owner: 10Razzi) [17:00:56] awight: thanks for all the legacy schema patches! We'll try to get to them soon, been bogged down with meetings and hadoop upgrade fallout recently. [17:01:22] FYI, https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/661959 was just merged, so you should rebase those patches locally and re-materialize them with the new setting [17:01:24] ottomata: No rush from my side, I'm just trying to reciprocate a bit. [17:01:30] git rebase origin/master [17:01:41] razzi: I am trying to fix the Presto TLS cert config (after the database re-creation), but I don't find a way to do it.. [17:01:51] $(npm bin)/jsonschema-tools materialize ./jsonschema/analytics/legacy/.../current.yaml [17:02:09] bearloga: yup, will retry now [17:02:13] (it won't change anything if those schemas don't have any numeric fields though) [17:02:14] TY! [17:03:19] (03CR) 10Ottomata: "LGTM and simple! Nice find with the cryptography/cargo problem." [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (owner: 10Razzi) [17:03:25] (03CR) 10Ottomata: [C: 03+1] Upgrade superset to 1.0.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (owner: 10Razzi) [17:04:11] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqia... [17:05:02] (03CR) 10Elukey: [C: 03+1] "LGTM as well, let's wait for more user-testing but it looks good!"
[analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (owner: 10Razzi) [17:05:09] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [17:06:45] milimetric, bearloga - I think I have a culprit from our hive-parquet issues: https://issues.apache.org/jira/browse/HIVE-16958 [17:07:10] elukey: I was able to look at the config at http://localhost:8080/databaseview/edit/14, but now strangely I'm not seeing the navigation option to view databases [17:07:30] yeah weird [17:07:43] I can see the page [17:07:48] if I follow the link [17:10:23] elukey: I re-ran it but still can't kill it from the Hue interface. Is the `oozie.job.acl` property supposed to be `analytics-product-users`, or `analytics-product-user`, without the final S? [17:11:13] nshahquinn: it should be the group of people allowed to make changes [17:11:28] Oh, `analytics-product`? [17:11:41] Oh, the group is `analytics-product-users`, I think [17:11:46] So the patch was correct [17:11:59] But I still don't have the right permissions [17:12:36] nshahquinn: also from the UI? [17:12:45] ah yes sorry [17:12:52] very strange [17:13:04] ottomata: I see 37 files that will be rematerialized with changed content. Shall I push a patch for everything at once? [17:13:05] elukey: no can't do it in the UI :( [17:14:36] I see from Hue that there is a new col now, Group, and analytics-product-users is correctly displayed [17:15:09] Interesting, most of the changes are due to reordering fields, I'll ignore those. [17:15:42] nshahquinn: same thing from the CLI too? [17:16:03] ah well from there you can use analytics-product [17:16:04] nevermind [17:16:06] Should the schema materialization script be fixed so that the output has a consistent ordering? [17:16:57] elukey: oh, I actually _was_ able to kill it from the CLI without sudoing, which I couldn't before.
So, that's a nice partial fix :) [17:17:57] elukey: I confirm I have exactly the same list of files in your backup and on thorium - I realized I gave you a wrong command for the file-list to check size [17:18:11] elukey: if you have a minute we can do it now [17:18:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 (owner: 10Awight) [17:19:45] joal: sure! [17:20:11] elukey: on thorium - sudo rm /home/joal/backup_files.txt [17:20:47] elukey: sudo du /srv/backup > /home/joal/backup_files.txt [17:20:49] thanks elukey [17:21:03] elukey: that will do checking, and double checking :) [17:21:31] done! well thank you joal! [17:23:28] ottomata: qq - is https://phabricator.wikimedia.org/T231067#6843624 ok for you? If so I'll proceed on Monday, otherwise I can wait : [17:23:31] :) [17:26:08] milimetric joal: interesting find! I guess I'm still confused about why 2-17 would be OK and then 2-18 keeps failing, and why other coordinators/workflows aren't affected [17:26:23] also razzi, same question for you for the task above --^ [17:27:00] (it is very confusing and weird at first so we can follow up next week with all the details) [17:27:22] (03PS7) 10Awight: Add VisualEditorTemplateDialogUse schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/664804 [17:27:41] ottomata: sorry for the static, I rebased to master and everything is fine now (I think). [17:29:52] yeah elukey, still trying to wrap my head around why uids / gids would change [17:33:01] razzi: I added some info earlier on in the task for you, if those are not clear we can chat about them [17:33:20] Maybe this isn't the right place to be asking, but how do we preserve datanode dirs before a reimage? Are they copied outside the host? [17:34:17] razzi: ah no that's another thing! We do it with partman, thanks to work that Stevie did a while ago.. 
see https://gerrit.wikimedia.org/r/c/operations/puppet/+/664788/6/modules/install_server/files/autoinstall/partman/custom/reuse-analytics-hadoop-worker-12dev.cfg [17:34:25] it is part of the things to go through together :D [17:34:37] (see the "keep" parts) [17:34:50] basically debian install has an option to leave partitions untouched [17:35:16] and Stevie created a while ago a custom workflow for us to set the "keep" values if needed [17:35:36] so during a reimage the datanode dirs are not touched [17:35:45] only root is re-installed [17:36:02] but because of that, when the hdfs/yarn/etc.. users are created during package install [17:36:12] they get as uid/gid the first one available [17:36:32] and might not be the one that they had in the previous version of the OS [17:36:38] ok yeah that makes sense [17:36:54] we have never really tried to tackle this issue [17:37:35] for this round of installs we'll have to do some chown -R hdfs:hdfs /var/lib/hadoop/data/etc.. for usre [17:37:38] *sure [17:37:50] but once done, we should have a fixed uid/gid everywhere [17:38:00] so the next os upgrade will be smoother [17:38:21] we pay a price now but hopefully we'll get the benefits long term [17:39:23] the analytics/druid users are other candidates for the fixed uid/gid [17:40:37] awight: i think the output ordering should be deterministic...? [17:40:42] joal: if you're still here, remember the changes to the traffic anomaly query we discussed in standup a couple weeks ago, can you please review https://gerrit.wikimedia.org/r/c/analytics/refinery/+/659306 and see if it's OK? :] [17:40:53] elukey: yes that sounds good to me! [17:41:12] ack!! [17:41:35] ottomata: if you run `npx jsonschema-tools materialize-all`, it gets crazy. All changes are reordering, I believe. 
[17:42:05] awight: i think there was a change recently to make the sort thing consistent [17:42:09] elukey: for `find / -group Y -exec chgrp -h hdfs {} \;`, is the idea to rename the group to Y to make it not conflict? Or does Y mean something else? [17:42:11] but not all existing schemas were rematerialised [17:43:53] i think this [17:43:53] https://github.com/wikimedia/jsonschema-tools/commit/182c99e9c0148323b22e88648f42ced8db875921 [17:43:54] razzi: nono it is me writing silly things, I didn't specify that Y is the old gid [17:44:19] hm yes so some is undefined [17:45:01] razzi: say that hdfs currently has uid 117 and gid 118 on a host, Y in this case was the 118 value [17:45:06] sorry it was not super clear :( [17:45:08] awight: yeah you are maybe right, we should probably change that return 0 there to just return alphanumeric compare [17:45:28] ottomata: +1 I think the same thing. I'll try to patch next week. [17:45:41] awight you da best [17:46:25] elukey: all good; last question for now, would the `find /` be really slow? Now I think I see how it all works, and it makes sense [17:49:29] razzi: yes it may be for some nodes, but I am currently betting that most of the ones on buster without fixed uid/gid have very few files owned by hdfs/yarn/etc.. [17:50:06] I'll do some tests on Monday, but hopefully this will not be super painful :D [17:54:03] elukey: the new file contains about 1/3 of the rows of the previous one :( [17:54:06] * joal doesn't understand [17:56:20] (03CR) 10Joal: "Minimal nits" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/659306 (https://phabricator.wikimedia.org/T272052) (owner: 10Mforns) [17:56:39] thanks joal :] [17:56:42] joal: I think that before you asked for find, this time du [17:56:44] possible?
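elukey's `find / -group Y -exec chgrp -h hdfs {} \;` pattern (with Y being the old numeric gid) can be sketched safely on a throwaway directory. In this demo the current user's gid and group name stand in for the stale gid and the `hdfs` group, since the real values only exist on the reimaged workers:

```shell
# Demo of the "re-group files left behind with a stale gid" pattern.
# OLD_GID / NEW_GROUP are stand-ins: on a reimaged worker OLD_GID would
# be the gid hdfs had under the old OS, and NEW_GROUP would be hdfs.
tmp="$(mktemp -d)"
touch "$tmp/blk_0001" "$tmp/blk_0002"

OLD_GID="$(id -g)"     # stand-in for the stale gid (e.g. 118)
NEW_GROUP="$(id -gn)"  # stand-in for the hdfs group under the new OS

# -h re-groups symlinks themselves instead of following them.
find "$tmp" -gid "$OLD_GID" -exec chgrp -h "$NEW_GROUP" {} \;

# Every file should now carry the new group name.
find "$tmp" -type f -group "$NEW_GROUP"
rm -r "$tmp"
```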
[17:56:52] (03CR) 10Anne Tomasevich: [C: 03+1] Update schema to handle quickview playback events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154) (owner: 10Eric Gardner) [17:56:55] I assume that is the thing [17:56:58] elukey: -^ [17:57:01] :( [17:57:03] do you want find? [17:57:06] I can redo it now :) [17:57:15] elukey: I'd like find but with size info [17:57:25] I know that the file list is correct [17:57:31] I wanted to check for sizes [17:57:42] maybe I'm too paranoid :) [17:57:56] ahhhh [17:58:10] nono on the contrary [17:58:40] bearloga: I indeed wonder as well about why the jobs fail now :( [17:59:25] find /srv/backup -type f -exec du {} \; > /home/joal/backup_files_size.txt ? [18:00:54] elukey: possibly [18:01:08] elukey: when I try it I get: find: ‘du’ terminated by signal 13 [18:01:16] probably because I cut it with |head [18:01:20] let's try elukey [18:01:46] elukey: let's also remove the file first (I know, > should overwrite, but eh, who am I to trust a computer) [18:02:10] I am writing to a new file! [18:02:11] :) [18:02:23] joal: on Monday if you have time I'd like to introduce you to https://phabricator.wikimedia.org/T231067#6843624 [18:02:41] that should be transparent to you but better if you know what I am doing :D [18:03:05] this is to standardize the uid/gid for users like hdfs/yarn/mapred/etc.. [18:03:17] I read that elukey [18:03:36] elukey: Does the procedure involve changing IDs on the already existing nodes?
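The `find: 'du' terminated by signal 13` joal hits is a broken pipe: truncating the output with `| head` closes the pipe while find is still spawning `du` processes, so a later `du` gets SIGPIPE. Redirecting to a file, as in the command above, avoids it. A small runnable sketch on a scratch directory (the paths are placeholders for /srv/backup):

```shell
# Build a scratch tree standing in for /srv/backup.
tmp="$(mktemp -d)"
for i in 1 2 3; do echo "some data" > "$tmp/file$i"; done

# One du invocation per file, full output captured to a file *outside*
# the tree being scanned -- no pipe to close early, so no SIGPIPE.
out="$(mktemp)"
find "$tmp" -type f -exec du {} \; > "$out"

wc -l < "$out"   # one size line per file
rm -r "$tmp" "$out"
```

Using `-exec du {} +` instead of `\;` would also batch many files into a single `du`, which is both faster and less exposed to per-process SIGPIPE.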
[18:05:46] joal: on the buster ones yes [18:06:08] Ah I think I get it [18:06:11] (you have /home/joal/backup_files_size.txt in your home) [18:06:35] elukey: You will enforce uids for users for reimaged hosts [18:06:46] elukey: newly added hosts will follow that pattern [18:06:57] joal: correct [18:07:05] elukey: then older hosts will be reimaged, following same process [18:07:11] so next time if we reinstall it will be a piece of cake [18:07:20] elukey: then everything is the same [18:07:25] am I right? [18:07:41] thanks for the file elukey [18:07:41] joal: exactly, same uid/gid everywhere for hdfs/yarn/etc... [18:07:50] a dream basically [18:08:11] elukey: to confirm - now we have different values on the existing machines [18:08:37] elukey: we need the reimage with the code you already wrote to actually make it consistent (on reimaged hosts) [18:08:41] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) [18:08:41] joal: lemme give you an overview of the mess [18:09:05] elukey: sorry to have followed :S [18:10:18] joal: https://phabricator.wikimedia.org/P14421 [18:10:24] I am super happy to get questions :) [18:11:07] the latter is buster nodes [18:11:18] the former the hadoop stretch nodes (master + workers etc..) [18:11:27] see the beauty of the uids? :D [18:12:09] mhhh [18:12:34] That means the code to make uids stable exists but has not been activated, if I'm not mistaken [18:15:20] joal: we never enforced it yes, since you cannot change it on the fly on a system [18:15:47] right elukey - So the buster reimage is a good moment to enforce it I imagine [18:17:34] yes exactly [18:20:13] all right I think that I am going to log off for today, have a nice weekend folks :) [18:20:33] ok elukey I figured out why the databases menu wasn't showing! had to `superset init` [18:20:49] I did that in a previous iteration, but forgot to this time. cya though!
[18:20:58] (03PS3) 10Mforns: Factor out traffic anomaly countries into a Hive table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/659306 (https://phabricator.wikimedia.org/T272052) [18:21:33] (03CR) 10Mforns: Factor out traffic anomaly countries into a Hive table (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/659306 (https://phabricator.wikimedia.org/T272052) (owner: 10Mforns) [18:26:39] hi razzi :] I added you as a reviewer today of 2 puppet patches to remove some deletion jobs that are not used any more, it's not urgent, but I think they are also not dangerous (the data they were deleting does not exist any more), let me know if you have questions! [18:28:14] razzi: nice! [18:28:22] I'll retest on monday :) [18:30:57] (03CR) 10BrandonXLF: "This change is ready for review." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/665185 (https://phabricator.wikimedia.org/T254847) (owner: 10BrandonXLF) [18:39:02] thanks razzi! [18:40:03] np, thank you mforns! [18:41:59] razzi, is it OK to merge today, or should we wait to next week? Just asking, to know if I should move the task to done, or leave it there? 
[18:46:38] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) an-worker1139 corrected dac cable for host moved to port 25 [18:50:06] also, a-team, if someone is interested, here's the session length oozie job: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/664885, I think the query is quite cool :] [18:50:21] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) an-worker1129 verified host is plugged into xe-4/0/3 [19:02:47] mforns: you should walk us through that query after standup on monday [19:03:06] ottomata: ok :] [19:09:19] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet ` The log can be found in... [19:12:09] milimetric: still nearby? [19:12:55] yes joal, cave? [19:13:00] yessir [19:15:21] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (10Mayakp.wiki) QA checks performed for validating data between the raw table (event.mediawiki_client_session_tick) and in... [19:19:15] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1139.eqiad.wmnet ` The log can be found in...
[19:32:34] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` and were **ALL** successful. [19:34:20] (03PS1) 10Milimetric: Fix inconsistent Hive query fail [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665406 [19:35:11] (03CR) 10Milimetric: "this seems to solve the problem, I ran the query manually" [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665406 (owner: 10Milimetric) [19:41:35] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1139.eqiad.wmnet'] ` and were **ALL** successful. [19:41:37] (03PS1) 10Milimetric: Fix UDF typing problem more permanently [analytics/refinery] - 10https://gerrit.wikimedia.org/r/665408 [19:45:36] (03Abandoned) 10Milimetric: Fix UDF typing problem more permanently [analytics/refinery] - 10https://gerrit.wikimedia.org/r/665408 (owner: 10Milimetric) [19:50:18] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [19:51:20] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) 05Open→03Resolved a:05Jclark-ctr→03RobH All hosts installed and staged in netbox. [20:49:59] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10JAllemandou) @lexnasser: I run into the same error testing the query manually (`NoViableAltException(350@[()* loopback of 430:20: ( ( LSQUARE ^ expression RSQUARE... 
[21:09:27] 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (10Jdlrobson) I believe a review here is required from our team. [21:32:18] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10lexnasser) @JAllemandou Thanks for finding the `hive.cbo.enable` option! That fixed the HiveRelDecorrelator issue, but now I'm getting another error: ` Error: Err... [21:34:57] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Patch-For-Review: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (10Mayakp.wiki) a:05mforns→03Mayakp.wiki [21:44:31] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10JAllemandou) > I'm thinking of just iteratively rewriting the query from a blank slate and seeing what components cause the issue and how I can circumvent them.... [22:06:40] (03PS1) 10Joal: Make hive temporary-tables storage format explicit [analytics/refinery] - 10https://gerrit.wikimedia.org/r/665425 (https://phabricator.wikimedia.org/T168554) [22:09:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (10JAllemandou) The change has broken some jobs: - mobile_apps-uniques-daily-coord (job failed on 2021-02-18) - pageview-daily_dump-coord (successful jo... 
[22:09:54] milimetric: if you're still there --^ [22:10:37] milimetric: that finding could actually be the problem for the analytics-product job (even if we found a solution for the fact that job broke, maybe the data generated is incorrect) [22:10:40] thanks joal, will check it out [22:11:02] ooh, good to know, don't want bad dara [22:11:06] *data [22:11:33] milimetric: hdfs dfs -cat /wmf/data/archive/pageview/complete/2021/2021-02/pageviews-20210218-spider.bz2 | less [22:11:49] milimetric: the data is actually not bz2, and stored in parquet [22:12:22] * joal is sorry for not having thought of this impact when changing the default table storage format :( [22:14:23] joal: thanks for the fix! I'll merge and deploy since it affects pageview dumps [22:14:46] milimetric: I think there is no rush - the dumps are the complete one that is kinda not yet done [22:14:59] Could be propagated next week with reruns - as you wish [22:22:54] oh ok, I'll wait for Monday then, less tired == risk [22:22:58] *less risk :) [22:23:04] nite yall, I'm around if anything blows up [22:25:06] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.,org to get stream config - https://phabricator.wikimedia.org/T274951 (10Ottomata) @razzi, @elukey, Analytics VLAN should be able to access api-... [22:31:13] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Patch-For-Review: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (10Mayakp.wiki) I ran QA checks similar to T271455 but for data from Jan 6 to Jan 21 of the raw table (since mforns.session_length_daily (intermediate t...
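joal's check above (`hdfs dfs -cat ... | less`) showed the `.bz2` dump actually contained parquet. Both formats start with a fixed magic, so the check can be scripted; a sketch on a local file follows (on the cluster the bytes would come from `hdfs dfs -cat ... | head -c 4` instead, and the filename here is made up):

```shell
# Fabricate a mislabeled file: parquet bytes behind a .bz2 name.
# Real bzip2 streams start with "BZh"; parquet files start with "PAR1".
tmp="$(mktemp -d)"
printf 'PAR1...rest of parquet file...' > "$tmp/pageviews-sample.bz2"

magic3="$(head -c 3 "$tmp/pageviews-sample.bz2")"
if [ "$magic3" = "BZh" ]; then
    echo "bzip2 as expected"
else
    echo "NOT bzip2 (first bytes: $magic3)"
fi
rm -r "$tmp"
```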