[01:22:32] Analytics, Analytics-Kanban, EventBus, Operations, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Dzahn) When adding a new type of server name please add them in these 2 places: - wikitech https://wikitech.wikimedia.org/wiki/In...
[04:55:12] I'm doing a batch of Wikidata queries and getting throttled. It isn't a big deal, but in the near future I'd like to scale this up by an order of magnitude. Is there a good place to go for tips?
[04:55:38] or maybe there's a way I can get around the limit?
[05:48:54] groceryheist: hi! Our main point of contact for Wikidata is currently on holidays, maybe joal knows more.. what I'd suggest though is creating a task with the Wikidata label explaining your use case, so people that have context can chime in and help.. we can follow up if nobody picks up your request and find somebody by asking around!
[05:52:21] @elukey thanks, things are ok for now, so I'll do that should the need arise again
[06:02:44] Analytics, Research: Check home leftovers of ISI researchers - https://phabricator.wikimedia.org/T215775 (elukey) @leila it shouldn't be a problem, but let's keep in mind when we review these use cases the 90 days of PII data retention guidelines, this is my only concern :)
[06:36:02] Analytics, Product-Analytics, Epic, User-Elukey: Add wikidata ids to data lake tables - https://phabricator.wikimedia.org/T221890 (JAllemandou) Thanks for elaborating @Groceryheist :) We have a non-productionized version of the `item_id --> wiki_db/page_id` dataset built for research: ` val wd =...
[06:38:57] Morning elukey - I provided an answer to groceryheist that would only use the cluster - I hope it'll do :)
[06:42:46] ack thanks!
[07:23:09] fdans: Good morning - Would you be there by any chance?
[07:23:34] hello joal!
[07:24:06] I have a question for you, about a mediawiki feature that was discussed last year, and which I can't recall
[07:24:11] fdans: --^
[07:24:52] hmmmmmm
[07:24:58] fdans: Do you recall about the right for anonymous users to create new articles?
[07:25:16] joal: you mean actrial?
[07:25:43] fdans: I can't recall if actrial was about anonymous users, or new users
[07:26:40] I'll be clearer: I found articles created on enwiki by anonymous users on namespace 0 last month (97 to be precise) - And my gut was telling me that it was not supposed to happen, but I can't recall why
[07:26:47] joal: i think that doesn't matter
[07:26:52] i mean
[07:27:26] can an IP user become an autoconfirmed user?
[07:27:42] Wow - that's a question!
[07:28:01] if so, there's no irregularity since this affects users that are not autoconfirmed
[07:28:04] however joal
[07:28:36] this is enwiki right? this policy only affects enwiki
[07:28:46] enwiki indeed
[07:28:47] ah yes you already clarified
[07:28:58] hm
[07:29:08] I can't imagine IPs being autoconfirmed
[07:29:22] I'm gonna ask a fellow CL
[07:29:31] joal: can we take a look at the contrib list of that IP?
[07:32:06] fdans: is `autoconfirmed` computed on the flag based on the number of edits made by the username?
[07:33:27] joal: https://doc.wikimedia.org/mediawiki-core/master/php/classUser.html#a464e8aee6df0fea2412b200cf16fd8e9
[07:33:28] s/flag/fly
[07:34:03] fdans: autoconfirmed == !newbie?
[07:35:36] joal: yeah this goes out of what i know, some mw person should be able to answer these questions easily
[07:35:52] but https://www.mediawiki.org/wiki/Manual:Autoconfirmed_users
[07:36:05] doesn't say anything about anonymous users, I'm not sure about newbie
[07:36:37] fdans: --> `number of seconds account needs to have existed to qualify`
[07:36:43] fdans: IPs shouldn't qualify
[07:37:05] right, that makes sense
[07:37:58] joal: is this problem coming from many ips or just a few?
[07:38:54] fdans: not really a problem per se, just a confirmation of data correctness - 97 pages created in March 2019 in namespace 0 of enwiki from IPs - Currently double-checking the historical namespace
[07:40:45] fdans: I think I have my answer - current namespace is 0, but historical was different :)
[07:40:48] \o/ !
[07:40:55] fdans: sorry to have bothered :)
[07:41:24] ooooo nice catch joal
[07:41:37] never a bother, I like that you ask me!
[07:42:26] fdans: Now the question coming from a usage perspective: WKS2 will report pages created by anonymous users on namespace 0, while they actually were created on a different namespace and then moved
[07:42:46] this is misleading, isn't it?
[07:44:36] (PS4) Joal: Fix mediawiki-history user event join [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504834
[07:44:39] hmmmm a little, but i think it's fine joal, it makes sense
[07:44:53] hard to document though
[07:45:37] Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (MoritzMuehlenhoff)
[07:45:44] indeed - But, it reflects how data has been reported so far: using current values (user_text, page_title, page_namespace ...) and not historical ones (because they were not good enough)
[07:50:05] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, Performance-Team (Radar): Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) It seems to be breaking navtiming (coal is fine, though): ` Apr 30 07:43:37 webperf1001 python[5681]: 2019-04-30 07:43:37,515 [...
[07:58:39] Analytics, Analytics-Kanban: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (JAllemandou)
[07:59:01] (PS1) Joal: Fix mediawiki_history_reduced checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/507254 (https://phabricator.wikimedia.org/T222141)
[08:14:39] Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (elukey) These are the leftovers of Michele: ` ====== stat1004 ====== total 427280 drwxrwxr-x 24 19707 wikidev 4096 Dec 28 05:58 anaconda3 -rw-r--r-- 1 19707 wikidev 437519351 Dec 24 22:02 links.txt drwxrwxr-x 2 197...
[08:15:01] Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (elukey) p: Triage→Normal
[08:17:01] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) Fixed navtiming for now, I'll investigate further to make sure that this is a proper fix and not a hack. Right now I'm not sure that the metadat...
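[editor's note] For readers following joal's check at 07:38–07:40 above: below is a minimal PySpark sketch of the kind of query involved, run against the Data Lake's wmf.mediawiki_history table. This is not joal's actual query; the snapshot value and the field names (event_entity, event_type, event_user_is_anonymous, page_namespace, page_namespace_historical, event_timestamp) are assumptions based on how that table is usually described, so check them against the real schema before relying on this.

    # Hypothetical sketch, not the query joal ran: count enwiki page-create
    # events by anonymous users in March 2019 whose namespace today is 0 but
    # whose namespace at creation time was different (i.e. pages later moved
    # into the main namespace). Field and partition names are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("anon-ns0-creation-check").getOrCreate()

    result = spark.sql("""
        SELECT page_namespace_historical,
               COUNT(*) AS pages_created
        FROM wmf.mediawiki_history
        WHERE snapshot = '2019-03'              -- assumed snapshot partition
          AND wiki_db = 'enwiki'
          AND event_entity = 'page'
          AND event_type = 'create'
          AND event_user_is_anonymous = TRUE
          AND page_namespace = 0                -- namespace as of today
          AND page_namespace_historical != 0    -- namespace at creation time
          AND event_timestamp >= '2019-03-01'
          AND event_timestamp <  '2019-04-01'
        GROUP BY page_namespace_historical
        ORDER BY pages_created DESC
    """)

    result.show(50, truncate=False)

If the counts returned here account for the 97 pages mentioned at 07:38, that supports the explanation at 07:40–07:45: the dataset reports the current namespace, so pages created elsewhere by IPs and later moved into namespace 0 show up as anonymous main-namespace creations.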
[08:29:51] Analytics, Analytics-Kanban, Patch-For-Review: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (JAllemandou)
[08:30:12] (PS1) Joal: Fix mediawiki_page_history undefined userId [analytics/refinery/source] - https://gerrit.wikimedia.org/r/507258 (https://phabricator.wikimedia.org/T222141)
[08:37:43] (PS2) Joal: Fix mediawiki_page_history undefined userId [analytics/refinery/source] - https://gerrit.wikimedia.org/r/507258 (https://phabricator.wikimedia.org/T222141)
[08:38:14] (PS1) Joal: Fix mediawiki_history eventUserIsAnonymous [analytics/refinery/source] - https://gerrit.wikimedia.org/r/507260
[08:38:23] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/507259/ - whenever you have time :)
[08:40:18] (PS2) Joal: Fix mediawiki_history eventUserIsAnonymous [analytics/refinery/source] - https://gerrit.wikimedia.org/r/507260 (https://phabricator.wikimedia.org/T222141)
[08:50:59] fixed joal, thanks! used ln not log2
[08:51:25] 10 -> 115 looks like a big jump but the masters are not doing much
[08:52:05] ok for me elukey :)
[08:54:11] joal: ok if I roll restart now?
[08:54:53] yessir
[09:06:04] restarted namenode on 1002
[09:09:56] Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (Miriam) Thanks @elukey! @tizianopiccardi anything in Michele's home that we should keep?
[09:14:07] Analytics: Update mediawiki_history with username being an IP to better define isAnonymous - https://phabricator.wikimedia.org/T222147 (JAllemandou)
[09:15:19] failover to 1002 done
[09:15:54] I'd like to leave it as it is for a couple of hours joal
[09:16:05] to see if anything comes up
[09:16:09] ack elukey - no problem for me
[09:16:09] if so, we can easily rollback
[09:16:57] \o/ I have finally understood and fixed all widely-visible differences in mediawiki-history-reduced :)
[09:17:18] Currently testing on small wikis, then ready to be merged and deployed for next month :)
[09:17:24] pffff - just in time
[09:19:40] woooww
[09:19:50] that means data vetting is over?
[09:20:23] elukey: over for the bunch of patches to be deployed this month - Then comes next month's batch of it ;)
[09:20:32] but yes, vetting over for a few days ;)
[09:21:15] ahahah okok
[09:21:25] good job :)
[09:21:25] elukey: Thanks for being understanding and supportive in the vetting swamp-crawling :)
[09:21:41] elukey: it means to me ;)
[09:21:48] well we should say thank you for all this endeavour!
[09:21:58] :)
[09:22:18] interesting thing that I found today while reviewing metrics
[09:22:19] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1554146781429&to=1554375075571&panelId=54&fullscreen
[09:22:42] that mess is when Tiziano executed the spark job
[09:23:04] so I think that the major problem was worker exhaustion on the namenode
[09:23:09] (processing rpc calls)
[09:23:47] elukey: hm - here is what I understand of it - We are deploying a patch that will make sure the cluster doesn't fail when users make a mess? :D
[09:24:06] kinda?
[09:24:11] yep :)
[09:24:15] Goooood :)
[09:24:17] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1554290800610&to=1554333729404 this is a better view
[09:24:43] it should help, in theory, to have more powa when the namenode is "attacked" :D
[09:25:01] elukey: last link seems incorrect
[09:25:26] elukey: or I don't know where to look :)
[09:25:40] joal: ah yes, namenode metrics
[09:25:45] RPCs etc..
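[editor's note] On the NameNode metrics elukey points joal at just above: the RPC counters presumably feeding those Grafana panels (call queue length, open connections, queue/processing times) are exposed by Hadoop's standard /jmx JSON servlet on the NameNode. The sketch below reads them directly; the hostname, the HTTP port (50070 is the Hadoop 2.x NameNode default) and the RpcActivityForPort8020 bean name are assumptions about this particular cluster, not something confirmed in the log.

    # Minimal sketch: fetch the NameNode RPC metrics discussed in the chat
    # (call queue length, open connections, queue time) from the /jmx servlet.
    # Host, HTTP port and RPC port are assumptions.
    import json
    from urllib.request import urlopen

    NAMENODE_JMX = (
        "http://an-master1001.eqiad.wmnet:50070/jmx"
        "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"
    )

    with urlopen(NAMENODE_JMX, timeout=10) as resp:
        rpc = json.load(resp)["beans"][0]

    # CallQueueLength is the value joal and elukey later propose alarming on:
    # it is normally ~0, and a sustained backlog means the handler threads
    # are not keeping up with incoming RPC calls.
    for key in ("CallQueueLength", "NumOpenConnections",
                "RpcQueueTimeAvgTime", "RpcProcessingTimeAvgTime"):
        print(key, rpc.get(key))

An alert that fires when CallQueueLength stays above zero for more than a few minutes would capture the "alarm early" idea discussed right after this, at 09:27–09:30.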
[09:26:19] elukey: yes yes yes :)
[09:26:20] Nice
[09:26:53] I remember that the SRE from Apple presenting the Hadoop talk last year told me while chatting that they had alarms for RPC calls to the namenode
[09:27:06] now elukey we should really find a way to enforce quotas :)
[09:27:15] Makes sense
[09:27:28] yep that is the second step, first of all making some sense of them
[09:27:40] too many RPC calls means abnormal NN activity
[09:27:56] yeah, I think we should add an alarm
[09:28:09] +1
[09:28:18] the length of the call queue seems to me something worth checking
[09:28:25] since usually it should be 0
[09:28:29] Particularly now that the NN should be able to better sustain them
[09:28:34] if RPC calls start to pile up we are in trouble
[09:28:48] it means, IIUC, NN workers/threads not keeping up
[09:29:01] yeah - particularly again with the last change
[09:29:40] exactly
[09:30:04] ok let's alarm early, so that we don't let the cluster die
[09:31:31] elukey: https://www.youtube.com/watch?v=6D9vAItORgE
[09:33:11] joal: https://www.youtube.com/watch?v=etAIpkdhU9Q
[09:33:13] ahahhaah
[09:33:24] :D
[09:33:58] I picture the RPC call queue alarm as a sign of hell :D
[09:39:37] Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (tizianopiccardi) Thanks @elukey! You can proceed with the deletion. I don't see anything important
[09:54:54] completing the roll restart since everything looks good
[09:58:14] \o/ :)
[10:17:30] 1001 is active again
[10:29:06] Analytics, Patch-For-Review, User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (elukey) Rationale of the change listed above. During the execution of the Spark job the following happened: {F28865898} Usua...
[10:44:39] Analytics, Research-management: Test GPUs with an end-to-end training task (Photo vs Graphics image classifier) - https://phabricator.wikimedia.org/T221761 (Miriam)
[10:47:13] Analytics, Research-management: Test GPUs with an end-to-end training task (Photo vs Graphics image classifier) - https://phabricator.wikimedia.org/T221761 (Miriam) Data collection is over, it took 271044s (~75 hours) for 320815 images (~160k per class), i.e. 0.85 sec/image. I downloaded 600-px thumbnails...
[11:05:05] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) I've tracked down the root cause of the issue: https://github.com/dpkp/kafka-python/issues/1774 For other uses of python-kafka we have, we simp...
[11:10:48] Analytics, Research-management: Test GPUs with an end-to-end training task (Photo vs Graphics image classifier) - https://phabricator.wikimedia.org/T221761 (dr0ptp4kt) To clarify, was that via the internet or the internal cluster?
[11:13:04] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles)
[11:13:57] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) @Ottomata all our services are good now, you can go ahead with upgrading EventLogging and Hadoop.
[11:18:04] Analytics, Patch-For-Review: trying to get a clean master branch - https://phabricator.wikimedia.org/T221466 (Aklapper) No news for a week on this open UBN task and the patch is merged.
Is there more to do here or can this task be resolved?
[11:36:02] Analytics, Research-management: Test GPUs with an end-to-end training task (Photo vs Graphics image classifier) - https://phabricator.wikimedia.org/T221761 (Miriam) This was via the internet. But we should try to do this from the internal cluster, too, for comparison, if possible. I just need few instruc...
[12:03:07] Analytics, Patch-For-Review: trying to get a clean master branch - https://phabricator.wikimedia.org/T221466 (JAllemandou) This task is resolved, closing it. Sorry @Aklapper for the delay.
[12:03:17] Analytics, Patch-For-Review: trying to get a clean master branch - https://phabricator.wikimedia.org/T221466 (JAllemandou) Open→Resolved
[12:05:36] joal: today is cleanup day :D https://gerrit.wikimedia.org/r/#/c/operations/puppet/cdh/+/507306/
[12:05:40] (if you have time)
[12:07:32] elukey: commented - Please feel free to ignore
[12:07:37] Code looks correct
[12:07:57] nono I mean code + your thoughts :D
[12:08:03] I think it is essential to have it
[12:08:09] (the audit)
[12:19:23] joal: added the comment!
[12:53:49] taking the lunch break, heading out for an errand
[12:53:55] will be back in ~1h!
[12:53:56] ttl
[13:07:00] Analytics, Analytics-Kanban, EventBus, Operations, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) Done, thanks. Also edited https://wikitech.wikimedia.org/wiki/Ganeti#Assign_a_hostname%2FIP with instructions for futur...
[13:25:11] !log restarting eventlogging processes to upgrade to python-kafka 1.4.6 - T221848
[13:25:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:25:14] T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[13:29:14] o/
[13:40:14] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Ottomata)
[13:41:00] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Ottomata)
[13:48:47] (CR) Ottomata: [V: +2 C: +2] Oozie article recommender: use version 0.0.2 [analytics/refinery] - https://gerrit.wikimedia.org/r/506218 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[13:51:06] (PS2) Ottomata: Oozie: wait for new Wikidata dumps before generating article recommendations [analytics/refinery] - https://gerrit.wikimedia.org/r/503393 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[13:51:10] (PS3) Ottomata: Oozie article recommender: use version 0.0.2 [analytics/refinery] - https://gerrit.wikimedia.org/r/506218 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[13:51:45] (CR) Ottomata: [V: +2 C: +2] Oozie: wait for new Wikidata dumps before generating article recommendations [analytics/refinery] - https://gerrit.wikimedia.org/r/503393 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[13:51:53] (CR) Ottomata: [V: +2 C: +2] Oozie article recommender: use version 0.0.2 [analytics/refinery] - https://gerrit.wikimedia.org/r/506218 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[13:53:32] Analytics, Research, Article-Recommendation, Patch-For-Review: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (Ottomata) @bmansurov, I've merged your
changes, but haven't deployed them. They should go out with the next deploy. The oo...
[14:35:26] ottomata: o/
[14:35:31] thanks for the code reviews!
[14:35:35] o/
[14:35:42] anything against me turning on the hdfs audit log?
[14:35:47] also Cc: joal
[15:15:45] ottomata: what was the api ticket that we were going to cc the pm for apis on?
[15:19:13] nuria: https://phabricator.wikimedia.org/T214080#5143973
[15:20:17] Analytics, Core Platform Team, MediaWiki-API, Patch-For-Review, User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (kchapman)
[15:33:08] ottomata: something "interesting" happens with the new log4j (still haven't restarted namenodes)
[15:33:14] elukey@an-master1001:/var/log/hadoop-hdfs$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
[15:33:17] log4j:ERROR setFile(null,true) call failed.
[15:33:19] java.io.FileNotFoundException: /usr/lib/hadoop/logs/hdfs-audit.log (No such file or directory)
[15:33:33] after a bit of research this goes away when setting export HADOOP_LOG_DIR=/var/log/hadoop-hdfs
[15:33:52] yeah, dunno about the hadoop/logs dir
[15:34:05] is that some default that we override?
[15:34:11] HADOOP_LOG_DIR=/var/log/hadoop-hdfs sounds fine
[15:34:25] we override it per daemon
[15:34:28] aye
[15:34:45] and I think that the hdfs tool uses the default if it doesn't find anything
[15:34:57] but it is really annoying that we need to set that variable
[15:36:36] so I am wondering how to get around it mmmm
[15:45:46] restarting the namenode on 1002
[15:56:18] and failed over to test the new log
[15:57:09] works!
[15:57:14] a bit spammy though
[15:57:54] 2019-04-30 15:44:26,157 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Enabling async auditlog
[15:57:57] joal: --^
[15:57:58] elukey: ya, it would be better if that setting was in the log4j stuff i guess
[15:57:59] works :)
[15:58:07] \o/ :)
[15:58:07] but ¯\_(ツ)_/¯
[16:00:59] (PS1) Fdans: Add 137 wikis that haven't been sqooped so far [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456)
[16:06:47] (CR) Nuria: Add 137 wikis that haven't been sqooped so far (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[16:14:50] (CR) Joal: [C: -1] "We also need the production list to be updated, as we use the actor and comments from there." [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[16:16:15] (CR) Joal: [C: -1] Add 137 wikis that haven't been sqooped so far (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[16:18:50] (PS2) Fdans: Add 137 wikis that haven't been sqooped so far [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456)
[16:19:03] (CR) Fdans: Add 137 wikis that haven't been sqooped so far (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[18:21:08] a-team: there is currently an annoyance with the yarn/hdfs cli tools, they emit an error like the following before the "regular" output
[18:21:16] log4j:ERROR setFile(null,true) call failed.
[18:21:16] java.io.FileNotFoundException: /usr/lib/hadoop-yarn/logs/hdfs-audit.log (No such file or directory)
[18:21:25] for example, say hdfs dfs -ls /
[18:21:56] I still haven't found a simple and clean solution, but the issue is that we use a single log4j config for everything
[18:22:08] and the hdfs/yarn commands override the log dir
[18:22:13] I'll fix it tomorrow
[18:22:25] for the moment, beware that the commands do work but with some spam
[18:23:07] Thanks for the heads-up elukey :)
[18:23:51] * elukey afk for dinner o/
[18:30:29] (CR) Joal: "Checked the new wikis against labsdb and some of them are missing:" [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[18:34:30] thaaanks luca!
[20:08:20] (CR) Nuria: "Let's clean up the list a bit, removing wikis like fixcopyright and open a ticket to dbas so we can gauge whether the wikis that do not ex" (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/507355 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
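[editor's note] A footnote on the CLI annoyance described at 18:21–18:23: until the log4j setup is split so the audit appender only applies to the daemons, the workaround elukey found at 15:33 (pointing HADOOP_LOG_DIR at the real log directory) can also be applied from scripts that shell out to the CLI tools. This is only a sketch of that stopgap; the env var value comes straight from the chat, while the wrapper itself is hypothetical.

    # Stopgap sketch: run an hdfs CLI command with HADOOP_LOG_DIR pointed at
    # the real log directory, so the audit appender does not try to open
    # /usr/lib/hadoop/logs/hdfs-audit.log and spam log4j errors.
    import os
    import subprocess

    def run_hdfs(*args):
        env = dict(os.environ)
        # workaround reported at 15:33 in this log
        env["HADOOP_LOG_DIR"] = "/var/log/hadoop-hdfs"
        return subprocess.run(
            ["hdfs", *args],
            env=env,
            check=True,
            capture_output=True,
            text=True,
        ).stdout

    if __name__ == "__main__":
        print(run_hdfs("dfs", "-ls", "/"))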