[00:10:33] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) Ok, next week I'll work on merging https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/635304, ma...
[05:53:32] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Marostegui)
[05:53:46] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Marostegui) @KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot see it on the NDA tracking sheet.
[05:54:50] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Marostegui)
[05:55:06] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Marostegui)
[05:59:26] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Marostegui) Confirmed that @rmaung is staff by checking via ldap-corp. @Rmaung we'd also need your manager to sign off this re...
[05:59:38] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Marostegui)
[06:00:51] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Marostegui)
[06:01:02] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Marostegui) Confirmed janstee@wikimedia.org via ldap corp as staff. @JAnstee_WMF we'd need your manager to sign this off. Thanks!
[06:11:07] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh 16 nodes in the Hadoop Analytics cluster - https://phabricator.wikimedia.org/T255140 (10elukey) All old nodes were removed from Hadoop!
[06:11:51] 10Analytics, 10Analytics-Kanban: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (10elukey) 05Open→03Resolved
[06:11:53] 10Analytics: Analytics Hardware for Fiscal Year 2019/2020 - https://phabricator.wikimedia.org/T244211 (10elukey)
[06:25:50] good morning
[06:26:08] I am running a test on an-presto1001, namely setting the Xmx of the jvm to 60G
[06:26:15] (the others run at 110G)
[06:38:01] 10Analytics: Check home/HDFS leftovers of jkumarah - https://phabricator.wikimedia.org/T263715 (10elukey) 05Open→03Resolved a:03elukey homes deleted.
[06:38:25] good morning
[06:38:34] morning!
[06:39:29] Could it be that /srv/published/datasets is not updating https://analytics.wikimedia.org/published/datasets/ from stat1005? I have updated some files hours ago and I still can't see any change via https?
[06:41:29] 10Analytics: Check home/HDFS leftovers of shiladsen - https://phabricator.wikimedia.org/T264269 (10elukey) Deleted all the home dirs on stat100x, only hdfs files are left :)
[06:42:36] GoranSM: what is the link of the dir not updating?
[06:43:42] 10Analytics: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081 (10elukey) 05Open→03Resolved a:03elukey It seems way more stable now, closing for the moment :)
[06:45:25] 10Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (10elukey)
[06:46:33] 10Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (10elukey)
[06:46:54] 10Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (10elukey)
[06:48:01] https://issues.apache.org/jira/browse/BIGTOP-3434 - Hadoop-3.3.0 deb packaging support
[06:55:54] elukey: Hm, I had the same directories under /srv/published/datasets on stat1005 and stat1007, could that be the origin of the problem? I have just removed the directories from stat1007.
[06:56:10] elukey: As for your question: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/etl, https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/ml, https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/geo
[06:58:22] GoranSM: having dirs on multiple nodes may cause issues, does it work now?
[07:00:22] 10Analytics: Check home/HDFS leftovers of nathante - https://phabricator.wikimedia.org/T264268 (10elukey) All stat100x home dirs purged, only hdfs/hive left!
[07:03:25] 10Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (10elukey) Sent an email to John to get a final confirmation.
[07:05:18] 10Analytics: Check home/HDFS leftovers of joewalsh - https://phabricator.wikimedia.org/T265447 (10elukey) 05Open→03Resolved a:03elukey All stat100x home dirs removed!
[07:06:37] * elukey bbiab
[08:32:53] (03PS6) 10Elukey: Add oozie webrequest test bundle [analytics/refinery] - 10https://gerrit.wikimedia.org/r/491791 (https://phabricator.wikimedia.org/T212259)
[08:43:33] elukey: It works. Mea culpa: switched some ML operations to stat1005 and forgot to remove /srv/published/datasets related things from stat1007. Thx.
[08:45:46] np! glad that it is fixed :)
[09:00:54] ebernhardson: re integration environment - can you give us a little bit more details? :)
[09:49:05] ok the analytics-test-hive.eqiad.wmnet trick seems to work in hadoop test
[09:49:42] the main downside though is that it will require a restart/update of all clients when we change the settings on the hive server/metastore (since they cannot run two service principals)
[09:52:16] --
[09:52:33] while trying to run refine on hadoop test (with bigtop) I got
[09:52:34] org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoSuchMethodError: com.maxmind.geoip2.DatabaseReader
[09:52:40] (webrequest_load's refine)
[09:53:43] refinery_jar_version = 0.0.137
[09:55:01] mforns: o/ is it related to the work that you are doing by any chance?
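A NoSuchMethodError like the one pasted above usually means the JVM resolved com.maxmind.geoip2.DatabaseReader from a different geoip2 jar than the one refinery-source was compiled against. A minimal Scala sketch of one way to check which jar actually won; the object name is made up for illustration, and it would need to run with the same classpath as the failing Refine job:

```scala
// Hedged diagnostic sketch (not run in this log): print the jar that supplied
// DatabaseReader, to spot a clash between refinery's bundled geoip2 version
// and whatever Hive 2.x puts on the classpath.
object WhichGeoip2Jar {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("com.maxmind.geoip2.DatabaseReader")
    val location = Option(cls.getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("(no code source: likely loaded by the bootstrap classloader)")
    println(s"DatabaseReader loaded from: $location")
  }
}
```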
[10:02:19] mmm no last change seems to be https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/588715
[10:04:03] I think it is an issue with hive 2.x
[10:23:28] 10Analytics: Possible between Maxmind and Hive 2.x libs in Refinery source - https://phabricator.wikimedia.org/T266322 (10elukey)
[10:23:41] 10Analytics: Possible issue between Maxmind and Hive 2.x libs in Refinery source - https://phabricator.wikimedia.org/T266322 (10elukey)
[10:26:55] 10Analytics: Check home/HDFS leftovers of shiladsen - https://phabricator.wikimedia.org/T264269 (10mforns) I deleted HDFS and HIVE files. Resolving!
[10:27:19] 10Analytics: Check home/HDFS leftovers of shiladsen - https://phabricator.wikimedia.org/T264269 (10mforns) 05Open→03Resolved a:03mforns
[10:45:25] hey elukey
[10:45:30] just joined
[10:46:19] elukey: I believe those might be related to changes to remove unused fields derived from maxmind
[10:46:28] I deployed those on tuesday
[10:48:54] ah!
[10:49:13] I am going afk for lunch + errand, let's check later on if you have time
[10:49:18] nothing super urgent :)
[10:50:01] yes ok
[11:06:41] 10Analytics: Check home/HDFS leftovers of nathante - https://phabricator.wikimedia.org/T264268 (10mforns) 05Open→03Resolved a:03mforns Deleted both HDFS and HIVE directories, plus the corresponding database in HIVE. Marking this as resolved!
[11:07:00] 10Analytics: Check home/HDFS leftovers of nathante - https://phabricator.wikimedia.org/T256356 (10mforns)
[13:03:26] mforns: o/
[13:03:46] was the last refinery that you deployed 0.0.137?
[13:04:13] if so I can try 0.0.136, but I am afraid that the issue is more subtle (like hive 2.x dependent)
[13:04:27] it is my bad that I kept running oozie with an old version of refinery, and never saw issues
[13:11:13] 10Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (10JBennett) Nothing we need to keep, good to cleanup, thanks!
[13:17:29] 10Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (10elukey) 05Open→03Resolved a:03elukey All stat100x homes cleaned up, HDFS home also cleaned up!
[13:23:09] mforns: checked with 0.136, same error
[13:45:16] 10Analytics-Clusters, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) Summary of actions done: * created a dns CNAME analytics-test-hive.eqiad.wmnet -> an-test-coord1001.eqiad.wmnet * created the kerberos principal `hive/analytics-test-h...
[14:23:29] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10elukey) stat100x homes done (content moved under `/home/leizi`) For HDFS: ` ======= HDFS ======== Found 6 items drwx------ - leila leila 0 2018-06-27 00:37 /user/leila/.staging drwxr-xr-x - leila bma...
[14:24:00] elukey: hi, sorry was having lunch
[14:24:12] mforns: how dare you marcel to eat?
[14:24:16] :D
[14:24:18] xD
[14:24:27] please don't say sorry :)
[14:24:31] hehehehe
[14:25:00] I haven't understood the error, is there a task or log I can look at?
[14:25:05] or alarm?
[14:25:05] yep!
[14:25:08] I opened one
[14:25:16] https://phabricator.wikimedia.org/T266322
[14:25:21] it is on the Test cluster
[14:25:23] with hive 2.x
[14:25:37] so no alarm, it is just me trying to run webrequest-load in there
[14:25:50] it used to work, but with an older version of refinery, stuff might have changed
[14:26:02] but if it doesn't work in test we cannot really migrate to Bigtop :(
[14:26:02] aha
[14:28:00] it is ops week so don't spend time on it, I pinged you in case you had any idea since I recalled maxmind changes during the last deployment
[14:28:11] but it seems a more complicated issue
[14:30:36] elukey: I worked with maxmind, but didn't change it...
[14:31:03] but... IIUC maxmind is not shipped with BigTop, no? It is imported externally, right?
[14:34:53] elukey: I think there's only one place in refinery-source where we use a DatabaseReader constructor: refinery-core...maxmind/AbstractDatabaseReader.java
[14:35:07] yep yep
[14:35:32] I added a stackoverflow link, they seem to have had the same issue
[14:35:38] ah
[14:35:46] but they solved it by changing the dependencies
[14:37:18] elukey: we can check in the mvn tree that the maxmind version is the same in prod and in test
[14:38:52] we can yes, what changes (I think) is hive 1.x vs hive 2.x libs
[14:43:30] in prod: [INFO] +- com.maxmind.geoip2:geoip2:jar:2.1.0:compile
[14:44:07] mforns: does DataFrameToHive automatically write _SUCCESS flags here somehow? https://github.com/wikimedia/analytics-refinery-source/blob/fed14e6dbad3eb5a65069d80182c6070e203dbe6/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/Refine.scala#L521
[14:48:42] oh my bad that's in our codebase, looking
[14:49:53] milimetric: don't remember...
[14:50:38] it's very weird! The _REFINED and _SUCCESS flags are written at the same time, so they must be written around there somewhere, but I search for _SUCCESS and it doesn't show up anywhere
[14:50:56] (this is for refined event streams like mediawiki_page_move)
[14:56:06] AHA!!!
[14:56:08] Spark does it
[14:56:30] out of outputDf.df.write.parquet(...)
[14:56:40] oof, I'm gonna add a comment
[14:57:14] ok, ottomata, I figured it out, Spark writes _SUCCESS at the same time as DataFrameToHive writes _REFINED by calling the callback from Refine.scala
[14:58:09] the dataset definition uses _SUCCESS, so when data is coming into codfw, it can't use the canaries from eqiad, because those don't represent the availability of the data in codfw
[15:01:26] interesting
[15:04:52] (03PS1) 10Milimetric: Make explicit that _SUCCESS flag is written [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636051
[15:05:17] (03CR) 10Milimetric: [C: 03+2] Make explicit that _SUCCESS flag is written [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636051 (owner: 10Milimetric)
[15:11:01] (03Merged) 10jenkins-bot: Make explicit that _SUCCESS flag is written [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636051 (owner: 10Milimetric)
[15:37:53] 10Analytics, 10Analytics-Wikistats: pagecounts-ez uploads stopped after 9/24 - https://phabricator.wikimedia.org/T265378 (10Danilo) I didn't find the total per month in those files, will it not be provided anymore? I have some tools that use the total pagecounts per month, that is the only data I need from the...
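For reference, a minimal sketch of the _SUCCESS behavior milimetric tracked down above: the marker is written by Spark's Hadoop output committer on job commit, not by DataFrameToHive, and it is controlled by the standard Hadoop property mapreduce.fileoutputcommitter.marksuccessfuljobs. Paths and session setup here are illustrative, not taken from the log:

```scala
import org.apache.spark.sql.SparkSession

object SuccessFlagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("success-flag-demo")
      .getOrCreate()

    // Any DataFrame write goes through Hadoop's FileOutputCommitter, which
    // drops an empty _SUCCESS marker into the output directory on job commit.
    // This is why it appears "at the same time" as Refine's _REFINED flag.
    spark.range(10).write.mode("overwrite").parquet("/tmp/success_flag_demo")

    // To suppress the marker (relevant when, as above, a dataset definition
    // should not read an eqiad _SUCCESS as proof of codfw availability),
    // flip the property on the Hadoop configuration before writing:
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    spark.range(10).write.mode("overwrite").parquet("/tmp/no_success_flag_demo")
  }
}
```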
[15:40:16] hi team, looking into the wikipediapreview alerts
[15:56:33] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki)
[15:57:49] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki)
[16:06:07] mforns: yt?
[16:06:16] hey nuria yes
[16:06:28] elukey: in terms of an integration environment, i'm trying to setup a string of docker containers in docker-compose that runs airflow and all the things it talks to, such that i can trigger a task in the environment and see the results in elasticsearch at the end
[16:06:31] mforns: ahem, question about the data_quality_stats job
[16:06:39] yep
[16:06:54] jobs are running ok but the update steps
[16:07:09] that move data to my local data_quality_stats table
[16:07:16] are succeeding
[16:07:20] but no data is present
[16:07:25] ebernhardson: ah okok that is more clear :)
[16:07:59] mforns: so hdfs://analytics-hadoop/user/nuria/data/data_quality_stats is empty (no partitions)
[16:08:04] ebernhardson: we can work on specific issues if you want, but I am afraid that all our configs are in puppet (there are guidelines but it is fairly complicated to make all the pieces work together)
[16:08:37] nuria: can you paste the command that you're using?
[16:08:42] mforns: is there anything i am forgetting
[16:08:45] https://www.irccloud.com/pastebin/slk5vC8u/
[16:11:28] elukey: for the moment i'm leaning on a cloudera quickstart image for hadoop (but it's only cdh5.9, uses java 7. Basically i can't submit anything to the cluster, can only access hive/hdfs via apis).
[16:11:42] i might get around to trying to replace that with 5.14 on debian ... but not today :)
[16:11:58] ebernhardson: keep in mind that we are moving to apache bigtop, so don't invest too much in cloudera
[16:12:16] ok, then it's certainly not worth doing anything beyond the quickstart image they are providing
[16:12:22] (they offer a lot of docker images to use)
[16:13:00] is the timeline next fiscal? Or still in early planning?
[16:13:25] (or maybe much closer than i expect :)
[16:13:52] mforns: will keep on looking, i think data is being moved to a diff location, i am executing this as 'nuria' so no prod data is overridden
[16:15:49] ebernhardson: should be this quarter or the next :)
[16:15:59] elukey: awesome!
[16:16:10] we are going to upgrade hdfs to 2.8.5, hive to 2.3.3, etc..
[16:25:50] mforns: do not look at this deeply, i can bypass this issue
[16:25:59] mforns: really
[16:26:07] just looking if I find something eviden
[16:26:10] evident
[16:27:14] mforns: do not worry, i will just do away with the updater step
[16:27:24] nuria is there source data for 2020-05?
[16:27:52] yea yea, pageview_hourly right, of course...
[16:27:57] mforns: right
[16:28:03] mforns: really, do not worry
[16:28:09] mforns: will shortcut
[16:38:42] mforns: got it
[16:38:51] nuria: oh, what was it?
[16:38:52] mforns: *i think*
[16:38:56] mforns: wait
[16:41:17] mforns: i think the query_name is missing
[16:45:23] nuria: the query_name should be in the bundle file, in the coordinator snippet, no?
[16:45:40] mforns: yes, but i must have a snafu somewhere
[16:47:19] nuria: could it be that the query is not in /home/nuria/workplace/refinery/refinery_main/ ?
[16:47:44] oh, it's there
[16:47:46] mforns: as in the hql file?
[16:47:51] yes
[16:48:14] mforns: this smells so much of one of my famous STUPID TYPOS
[16:48:26] ains
[16:48:40] mforns: nvm will continue later
[17:14:17] * elukey afk!
[17:30:09] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10leila) @elukey thanks. Just drop the Hive tables, please. No need to move them.
[17:49:27] 10Analytics, 10Product-Analytics, 10Structured-Data-Backlog: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (10mpopov) The team will review and prioritize this during our next board review meeting (October 26th).
[18:47:32] 10Analytics-Radar, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10CBogen)
[18:47:34] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10CBogen)
[22:08:53] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10KFrancis) >>! In T266250#6573604, @Marostegui wrote: > @KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot s...
[22:31:43] 10Analytics, 10Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (10nettrom_WMF)
[22:36:06] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Nuria) @Rmaung: can you describe what data you are looking to access? This is so we can see what is the appropriate level of acces...
[22:38:02] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Nuria) Also, @Rmaung please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and ask any qu...
[22:42:23] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Nuria)
[22:43:17] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Nuria) NDA signed now but I do not have access to https://phabricator.wikimedia.org/L2?
[22:45:24] 10Analytics, 10Product-Analytics: Add timestamps of important revision events to mediawiki_history - https://phabricator.wikimedia.org/T266375 (10nettrom_WMF)
[22:48:45] 10Analytics, 10Product-Analytics: Add timestamps of important revision events to mediawiki_history - https://phabricator.wikimedia.org/T266375 (10nettrom_WMF) @Isaac : you wanted me to tag you when I filed the task for getting information about revision tag changes into MediaWiki history. Here's said tag. I do...
[22:49:43] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Dzahn) @Nuria Try again now, I just added you to the project called "WMF-NDA-Requests" (https://phabricator.wikimedia.org/project/profile/974/) which seems like it's needed to allow you...
[22:52:37] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Nuria) done!
[22:58:29] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Dzahn) >>! In T266086#6575705, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/5BKtV3UBpU87LSFJgL3r} [2020-10-23T22:5...
[23:14:19] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10Framawiki) From my records it was down for 9 hours and 20 minutes. Logs on that day are full of: ` sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'quarry-db-01.quarry...
[23:19:40] 10Analytics, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10KFrancis) @Dzahn because they are an employee of the WMF, the NDA is kept on file by T&C.
[23:25:33] mforns: all my issues are just permission issues, somewhere the table that script is trying to update is under analytics
[23:25:52] mforns: diagnostics: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=nuria, access=WRITE, inode="/tmp/analytics/data_quality_stats_updater":analytics:hdfs:drwxr-xr-x
[23:26:46] mforns: but funny how the job succeeds, this is the spark problem of errors not being surfaced, i think
[23:35:50] mforns: ok, got it, the temp directory needs to be overridden or it will default to what spark has, which is "/tmp/analytics/"; this can be fixed with docs or a small change in the workflow
[23:50:07] 10Quarry, 10cloud-services-team (Kanban): Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10bd808) a:03Bstorm The database had crashed as I remember. @Bstorm did things to get it back up and running. She may have a better memory of what was broken and why.
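A minimal sketch of the temp-directory fix nuria describes at 23:35: derive the updater's scratch path from the submitting user instead of the shared /tmp/analytics/ default, so a run as 'nuria' does not hit AccessControlException. The helper object and path layout below are assumptions for illustration; the real updater's mechanism for overriding its temp directory may be named differently:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper: create a per-user HDFS temp directory for the
// data_quality_stats updater. Because the directory is created by the
// calling user, writes to it do not require WRITE on /tmp/analytics/.
object PerUserTmpDir {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val user = UserGroupInformation.getCurrentUser.getShortUserName
    val tmp = new Path(s"/tmp/$user/data_quality_stats_updater")
    if (!fs.exists(tmp)) fs.mkdirs(tmp) // owned by the caller, not analytics
    println(s"Using temp dir: $tmp")
  }
}
```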