[02:54:46] ottomata: hey I was OOO today, I'll poke at it tonight/tomorrow morning
[03:06:15] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Wikistats 2.0: Page heading style varies - https://phabricator.wikimedia.org/T187412#4110885 (Krinkle) Open>Resolved
[03:06:29] Analytics, Analytics-Kanban, Analytics-Wikistats: Wikistats 2.0: Page heading style varies - https://phabricator.wikimedia.org/T187412#3974402 (Krinkle)
[06:24:38] Analytics, Analytics-Kanban, Patch-For-Review: Mount dumps on SWAP machines (notebook1003.eqiad.wmnet / notebook1004.eqiad.wmnet) - https://phabricator.wikimedia.org/T176091#4111022 (madhuvishy) Fixed by running `sudo exportfs -ra` on the nfs servers and remounting on notebook*.
[06:56:45] Hi team
[07:06:14] o/
[07:12:47] Analytics, Research, WMDE-Analytics-Engineering, User-Addshore, User-Elukey: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#4111064 (elukey)
[08:18:10] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#3961955 (elukey) Nothing varnish-related happened on Feb 6th as far as I can see from the ops SAL: https://tools.wmflabs.org/sal/productio...
[08:34:16] joal: qq if you have time - do you have any quick way to read/inspect a .dat maxmind db?
[08:34:29] curious about how we read it and apply to webrequests
[08:34:36] elukey: I didn't do that no
[08:34:42] (also curious about https://phabricator.wikimedia.org/T187014)
[08:34:48] elukey: we use the java maxmind DB reader
[08:36:02] ack
[08:43:26] (PS29) Joal: Upgrade scala to 2.11.7 and Spark to 2.3.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[08:50:08] (PS2) Sahil505: Updated Readme & corrected syntax errors [analytics/wikistats2] - https://gerrit.wikimedia.org/r/424464 (https://phabricator.wikimedia.org/T191567)
[08:52:28] (CR) Sahil505: "> Uploaded patch set 2." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/424464 (https://phabricator.wikimedia.org/T191567) (owner: Sahil505)
[08:58:01] For anyone that likes pretty pictures, https://addshore.com/2018/04/wikidata-map-march-2018/
[08:58:13] I wonder if one day I'll get around to productionizing that....
[09:27:18] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111348 (ema) >>! In T187014#4110582, @Nuria wrote: > Varnish5 rollout might have something to do with this? https://gerrit.wikimedia.org/...
[10:25:34] (PS13) Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507)
[10:38:21] (PS14) Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507)
[10:38:23] (PS15) Joal: Update mediawiki-history spark job for performance [analytics/refinery/source] - https://gerrit.wikimedia.org/r/419516 (https://phabricator.wikimedia.org/T189449)
[11:18:15] (PS1) Joal: Move spark library code to refinery-spark package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025)
[11:20:02] (CR) jerkins-bot: [V: -1] Move spark library code to refinery-spark package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025) (owner: Joal)
[11:21:38] Taking a break a-team - Will be back for standup, and will test spark2.3 for Refine with Andrew
[11:41:35] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111691 (mbaluta) Please note that number of page views prior to 6th February seems incorrect from our perspective too - number of Opera M...
[12:58:58] Analytics, Analytics-Kanban, Patch-For-Review: Mount dumps on SWAP machines (notebook1003.eqiad.wmnet / notebook1004.eqiad.wmnet) - https://phabricator.wikimedia.org/T176091#4111902 (Ottomata) Ahh, great, thanks Madhu!
[13:06:58] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111935 (ema) >>! In T187014#4111691, @mbaluta wrote: > If you provided IP address of our server, we could at least tell whether it is com...
[13:13:02] ok to install the apache security update on thorium? should just be a few seconds of non-availability during the restart
[13:13:34] moritzm: ya i think should be fine
[13:13:48] k, doing that now
[13:14:27] done
[13:14:34] thanks!
[13:14:43] thank you!
[13:31:49] hellooo team
[13:39:09] !log Rerun mediawiki-history-druid-wf-2018-03
[13:39:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:42:06] joal, any problems with the sqooped data_
[13:42:15] ?
[13:42:19] I don't think so
[13:42:31] mforns: denormalize finished correctly
[13:42:36] But following ones didn't
[13:42:45] ah
[13:43:05] hello o/
[13:43:15] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
[13:43:32] is anybody familiar with the code generating "Country"? ^
[13:45:27] hm.. not really, ema
[13:47:04] the context is T187014, I'd like to know if there's any way to see which IPs w/ User-Agent ~ "Opera Mini" get identified by that code as "United States"
[13:47:05] T187014: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014
[13:47:24] hmm, I wonder how hard it would be for me to load all of the wikidata statements meta data into hadoop
[13:48:06] addshore: have a look at /user/joal/wikidata/parquet
[13:48:13] :D
[13:49:25] joal: how? :D (I'm not a hadoop master) ;)
[13:52:52] * elukey afk for ~30m
[13:55:01] ema, it looks like the UAparser code is unchanged, and also the code that populates the country field. I'd say that the difference is in the maxmind db
[13:56:07] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112076 (ema) @mbaluta: note that the problem I've mentioned in my comment above is probably unrelated to the stats issue discussed here (...
[13:57:50] mforns: so if I'd collect a bunch of Opera Mini proxy IPs and share them with you, could you run them through the code generating Country and see which ones say "United States"?
[13:57:55] ooooh https://wikitech.wikimedia.org/wiki/User:Joal/Wikidata_Graph
[13:59:41] joal: so I was looking / thinking of a statement level history, with meta data about a statement, so i can query it and ask for all statements added between 2 dates with a main snak with a property P123 for example
[14:00:10] mforns: s/proxy IPs/XFF/ :)
[14:01:22] Analytics-Kanban, Patch-For-Review: Spark 2 as cluster default (working with oozie) - https://phabricator.wikimedia.org/T159962#4112080 (Ottomata)
[14:02:18] ema, not sure, never done that, but I can look into finding an entry point to that, one minute
[14:02:32] mforns: sure, no rush
[14:04:38] ema, I think I can do that :], how many IPs are that?
[14:05:33] heayyy joal, got a patch prepped for spark 2 yarn shuffle jar
[14:05:38] so i understand how that works now
[14:05:41] ready to test some jobs
[14:05:54] ema, can you leave them in a file in /home/mforns in stat1004, or in hdfs under /user/mforns ?
[14:05:55] have you tested refine yet? and/or is your spark2 patch now working with spark 2.3 and rebased on master?
[14:06:18] OH IT SURE IS!
[14:07:48] mforns: sure
[14:10:04] (CR) Ottomata: "Remove derby.log; maybe add to .gitignore?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[14:10:23] Analytics, Operations, Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4112106 (herron)
[14:28:18] mforns: I've left some X-Forwarded-For values captured on cp2023 here: stat1004:~mforns/xff-om.log
[14:28:46] they might well all be non-US of course but still worth a shot
[14:28:51] ema, thanks! will look into it, how urgent is it?
[14:29:15] mforns: that's a question for your team! :)
[15:06:13] ema, I created /home/mforns/xff-om-geo.log with the IPs and attributed country name, I think there are no instances of United States, please check if it makes sense.
[15:07:29] mforns: yeah that's what I got on my workstation with geoiplookup too
[15:07:46] mforns: I've updated ~mforns/xff-om.log with more juicy data
[15:08:18] mforns: that's 30 seconds of X-Forwarded-Fors from Opera Mini captured on all cache-text hosts
[15:09:51] given this https://bit.ly/2q9Qz4Y you'd expect to find quite a few ones from the US!
[15:13:43] ema, yea, that's crazy
[15:14:21] will comment that in stand-up today, see if someone has more context
[15:16:37] joal we might have a spark refine problem... :o
[15:17:51] mforns: spoiler alert, my analysis w/ geoiplookup of those 30 seconds of XFF puts the US at the 39th place (top 5: India, Nigeria, Indonesia, Bangladesh, South Africa)
[15:18:01] with spark 2
[15:22:38] ema, this file has 3 or sometimes 4 ips per row, the last one seems local, but which is the one to use?
[15:24:22] mforns: that's the whole X-Forwarded-For header, I think the code you're testing should be doing the job of extracting the proper one? It's the left-most one anyways.
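(Editor's note: the check mforns describes above presumably ran through the refinery's Java geocoding code; the sketch below is only a rough stand-alone equivalent using the Python geoip2 reader, taking the left-most X-Forwarded-For entry as ema suggests. The database and input file names are assumptions, not the paths actually used on stat1004.)

```python
# Hedged sketch: geolocate the left-most X-Forwarded-For IP of each captured line.
# 'GeoIP2-Country.mmdb' and 'xff-om.log' are assumed file names for illustration.
import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader('GeoIP2-Country.mmdb')

def country_for_xff(xff_header):
    # The left-most entry is the original client; later entries are proxies.
    client_ip = xff_header.split(',')[0].strip()
    try:
        return reader.country(client_ip).country.name
    except (geoip2.errors.AddressNotFoundError, ValueError):
        return 'Unknown'

with open('xff-om.log') as f:
    for line in filter(None, (l.strip() for l in f)):
        print(line.split(',')[0].strip(), country_for_xff(line))
```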
[15:24:47] k
[15:28:32] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112360 (Nuria) @ema: > I think we should try to debug the code that sets Country to "United States" for User-Agent: ~ "Opera Mini" and se...
[15:28:59] ema: hola
[15:29:33] ema: just added one more comment to ticket, will look at xff
[15:31:45] nuria_: interesting issue, the whole opera mini thing seems mildly messed up (I can't imagine users from Bangladesh enjoying being routed to Texas for example)
[15:47:44] ema, https://pastebin.com/QLmGn6qY
[15:49:05] hey guys
[15:49:47] joal heya
[15:49:49] ottomata: spark problems?
[15:49:50] i got some bad news
[15:49:57] https://issues.apache.org/jira/browse/SPARK-14130
[15:50:06] https://github.com/apache/spark/pull/12714
[15:50:34] it only affects our alter table change column for new struct fields
[15:50:38] adding new fields still works
[15:50:42] but we can't alter column types anymore
[15:50:45] which includes adding new struct fields
[15:51:03] i think they would allow an alter table change type for a struct type column, but the code isn't there
[15:51:07] going to file a bug report...
[15:51:09] but man oh man
[15:51:29] maybe we can find some work around...issue the alter statement directly to hive using shell CLI or jdbc connection
[15:51:30] oof
[15:52:00] ottomata: batcave?
[15:52:03] ya
[15:53:21] addshore: I think easiest way to play with the parquet data is using spark
[15:53:52] addshore: spark will load the data with single line command, and tell you about how it's structured (same as JSON mostly)
[13:55:19] addshore: if you're after dates, I think it's edits you're after - And edit content to know which property it refers to - I have nothing close to that
[16:06:12] ema: will look at your files and report back
[16:07:36] nuria_: ok!
[16:13:10] (PS30) Ottomata: Upgrade scala to 2.11.7 and Spark to 2.3.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[16:28:20] joal: edit content (wikitext) isn't actually in hadoop is it?
[16:28:58] addshore: it is -- Not productionized, but a version of it is
[16:29:12] current only? or historical?
[16:29:18] addshore: historical
[16:29:23] interesting
[16:29:54] There is a bunch of Java code (WikidataToolkit) that should be able to parse the JSON from all wikibase items as saved in the DB
[16:30:29] addshore: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01
[16:30:37] addshore: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/wikidata actually
[16:30:44] addshore: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/wikidatawiki actually-biws
[16:30:47] sorry :)
[16:30:49] :D
[16:31:07] * ebernhardson is surprised thats < 20TB
[16:31:14] i suppose to be fair it's probably compressed too
[16:31:21] * addshore is curious to know how large it is
[16:31:25] ebernhardson: parquet snappy
[16:31:31] addshore: $ hdfs dfs -du -s -h /path/in/hdfs
[16:31:46] wikidatawiki is 1.2Tb
[16:31:53] addshore: first number is "primary" data. second number includes replicas across hdfs
[16:32:05] And enwiki is 7.8Tb
[16:32:14] ebernhardson: nice
[16:33:28] joal: also random thought, probably not much to do ... but i tried reading an eventlogging schema yesterday with spark and it decided 90 days was ~5k partitions. I can of course coalesce, but seems storing that many partitions on disk may be wasteful
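(Editor's note: a minimal pyspark sketch of the "single line command" joal mentions for loading the parquet data, plus a coalesce along the lines ebernhardson raises above. The output path and the target partition count of 64 are illustrative assumptions only.)

```python
# Minimal sketch: load joal's parquet dump, let Spark describe its structure,
# and rewrite it with fewer, larger partitions. Output path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wikidata-parquet-peek').getOrCreate()

wd = spark.read.parquet('/user/joal/wikidata/parquet')
wd.printSchema()                    # structure is close to the entity JSON
print(wd.rdd.getNumPartitions())    # how many partitions Spark decided on

# Collapse many small partitions before writing, as discussed above.
wd.coalesce(64).write.parquet('/tmp/wikidata_sample')
```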
[16:33:49] ebernhardson: very much agreed
[16:33:57] ottomata: --^
[16:34:18] ebernhardson: I've been gently telling about this, but we move gently
[16:34:24] sure :)
[16:34:29] ebernhardson: Thanks for helping pushing ;)
[16:35:24] i dont know if its applicable, but for our query_clicks_daily table there is a daily job that rolls up all the hourlies into a single daily partition with ~256M per partition and then deletes the hourlies
[16:36:28] (but thats not the only reason for it, the daily job also sessionizes and drops some private data)
[16:38:52] (CR) Nuria: [V: 2 C: 2] Updated Readme & corrected syntax errors [analytics/wikistats2] - https://gerrit.wikimedia.org/r/424464 (https://phabricator.wikimedia.org/T191567) (owner: Sahil505)
[16:39:04] ebernhardson: That's probably what we'll end up with :)
[16:46:01] (CR) Nuria: "This changes paths for all classes that are moved which means that to deploy we need to update the refinery version we use in those jobs a" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025) (owner: Joal)
[16:46:35] Analytics-Kanban: Execute mediawiki history scoping jobs in the 5th of the month - https://phabricator.wikimedia.org/T191645#4112567 (Nuria)
[16:48:23] ebernhardson: aye yeah, the trouble is the job that creates those tables is generic for all eventlogging
[16:48:33] we can't easily change the partitioning scheme for just one
[16:48:57] and we don't want to have to wait for a whole day before data is available
[16:48:59] nuria_: if interested the mediawiki-reduced job failed a second time
[16:49:15] joal: ah!
[16:49:30] nuria_: I leave the oozie interface as is )
[16:49:38] joal: shouldn't that have sent an e-mail?
[16:49:42] it did
[16:49:57] joal: https://issues.apache.org/jira/browse/SPARK-23890
[16:50:02] joal: ah, ok, to analytics+alarms@?
[16:50:10] yes
[16:51:31] Thanks ottomata for the ticket
[16:53:30] Analytics-Kanban: Execute mediawiki history scoping jobs in the 5th of the month - https://phabricator.wikimedia.org/T191645#4112592 (Nuria) March run was stopped by DBAs cause too many scripts were hitting labs dbs at the beginning of month.
[16:53:54] going off team! Have a good weekend :)
[16:53:55] * elukey off
[16:54:06] (PS1) Nuria: Alarms should take into account that scooping was moved to 5th of month [analytics/refinery] - https://gerrit.wikimedia.org/r/424626 (https://phabricator.wikimedia.org/T191645)
[16:55:27] Bye elukey )
[16:55:39] Analytics, Operations, Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4112615 (herron) p: Triage>Normal
[16:58:08] Analytics-Kanban, Patch-For-Review: Execute mediawiki history scoping jobs in the 5th of the month - https://phabricator.wikimedia.org/T191645#4112567 (Nuria)
[16:59:15] Analytics, Analytics-Cluster, Patch-For-Review: Hadoop jobs that generate large temporary files can take down nodes - https://phabricator.wikimedia.org/T187139#4112638 (EBernhardson)
[17:00:57] ping ottomata coming to call with marshall?
[17:01:59] nuria_: call w marshall?
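(Editor's note: on the SPARK-14130 / SPARK-23890 problem discussed above, ottomata's suggested workaround is to issue the ALTER statement directly to Hive rather than through Spark. Below is a hedged sketch of that idea via PyHive/HiveServer2 rather than the CLI or JDBC; the host, database, table, and struct definition are all placeholders, not the real Refine tables.)

```python
# Hedged sketch of the workaround floated above: send the ALTER TABLE
# statement straight to Hive. All names below are placeholders.
from pyhive import hive

conn = hive.connect(host='hiveserver.example.org', port=10000)
cursor = conn.cursor()

# Widen a struct column with a newly added field -- the kind of statement
# Spark 2.3 refuses to run itself (SPARK-23890).
cursor.execute(
    "ALTER TABLE event.some_schema "
    "CHANGE COLUMN event event struct<old_field:string,new_field:bigint>"
)
cursor.close()
conn.close()
```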
[17:02:37] nuria_: i don't have anything else on my cal today
[17:03:23] please look again, i think i left it half done cc ottomata
[17:03:29] nuria_: from a deeper log analysis, looks like the hive query fails because of memory (mappers memory too small)
[17:03:35] Will try with a patch
[17:03:40] joal: on meeting can talk in a bit
[17:03:47] ok it is there now
[17:09:15] (PS2) Joal: Move spark library code to refinery-spark package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025)
[17:11:03] (PS1) Joal: Set up memory for mediawiki-history-reduced [analytics/refinery] - https://gerrit.wikimedia.org/r/424630
[17:14:11] !log Launch manual mediawiki-history-reduced job to test memory setting (and index new data) -- mediawiki-history-reduced-wf-2018-03
[17:14:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:14:19] https://hue.wikimedia.org/oozie/list_oozie_workflow/0009362-180330093100664-oozie-oozi-W
[17:14:38] A-team - Need to drop now, but will be back later on tonight to check job
[17:14:57] bye, have a nice weekend!
[17:17:03] byyyee
[18:54:34] (CR) Nuria: "I can see some errors by doing:" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/424630 (owner: Joal)
[19:17:09] ema: you are probably not around but ... what that xff tells me is that what we are getting in kafka might not be the xff ip but rather the IP for opera proxy
[19:33:12] Analytics, Analytics-Data-Quality, Analytics-Kanban, New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4113030 (Nuria) >number of Opera Mini users in US is far far below India, Indonesia and Nigeria. Note these are "pageviews", not users....
[19:38:54] (PS11) Mforns: [WIP] Label map and top metrics with the month they belong to [analytics/wikistats2] - https://gerrit.wikimedia.org/r/423144 (https://phabricator.wikimedia.org/T182990) (owner: Amitjoki)
[19:50:45] Analytics, Analytics-Wikistats: Wikistats2 GraphPanel computeds and watchers do not update as expected when using table-chart. - https://phabricator.wikimedia.org/T191661#4113090 (mforns)
[19:51:39] (CR) Mforns: "I think the code now is enough to close this task for the moment. However, I created another task (T191661) to look into and take care of " (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/423144 (https://phabricator.wikimedia.org/T182990) (owner: Amitjoki)
[20:18:15] ema: you are probably not around but ... what that xff tells me is that what we are getting in kafka might not be the xff ip but rather the IP for opera proxy, let's talk Monday
[20:31:42] ottomata: do you happen to know where is our trusted proxy db?
[20:35:08] ?
[20:35:21] nuria_: don't know what your question means :p
[20:35:53] ottomata: from varnish code
[20:35:59] https://www.irccloud.com/pastebin/Xiu6pYd9/
[20:38:25] ottomata: does that make more sense?
[20:45:24] ah
[20:45:25] ok
[20:45:28] didn't know context
[20:46:38] ah set req.http.X-Trusted-Proxy = netmapper.map("proxies", req.http.X-Client-IP);
[20:48:55] nuria_: grepping puppet I get:
[20:49:38] ottomata: i also saw: 235 netmapper.init("proxies", "<%= @netmapper_dir %>/proxies.json", 89);
[20:49:54] https://github.com/wikimedia/puppet/blob/production/modules/varnish/manifests/zero_update.pp
[20:49:54] https://github.com/wikimedia/puppet/blob/production/modules/varnish/files/zerofetch.py
[20:49:55] so
[20:50:08] i think it gets it from mw api
[20:50:09] https://github.com/wikimedia/puppet/blob/production/modules/varnish/files/zerofetch.py#L75-L77
[20:50:19] something like /w/api.php/zeroportal
[20:50:20] ?
[20:50:21] dunno exactly
[20:50:23] something like that
[20:50:50] /w/api.php/zeroportal?type=ztype
[20:52:02] which maybe comes from https://github.com/wikimedia/mediawiki-extensions-ZeroPortal
[20:52:02] ottomata: ah so we hold it in our api? ok, will check
[20:52:03] ?
[20:52:19] ottomata: i thought it was hidden in some repo somewhere
[20:52:24] https://zero.wikimedia.org/w/api.php?action=help&modules=zeroportal
[20:52:52] nuria_: looks like it is periodically requested from this api, and then written to local netmapper files which are read by varnish to populate that header
[20:54:10] which i guess you need to be logged in to use
[20:54:10] https://zero.wikimedia.org/w/api.php?action=zeroportal&type=carriers
[22:00:06] Analytics, Analytics-Wikistats: Upgrading Wikistats 2.0 footer UI/design - https://phabricator.wikimedia.org/T191672#4113442 (sahil505)
[22:01:02] Analytics, Analytics-Wikistats: Upgrading Wikistats 2.0 footer UI/design - https://phabricator.wikimedia.org/T191672#4113455 (sahil505) I have prepared a mockup of a possible design of the new footer. Please share your feedback on this. {F16766529}
[22:15:36] (PS2) EBernhardson: Drop hive partitions before checking hdfs paths [analytics/refinery] - https://gerrit.wikimedia.org/r/419953
[22:16:22] (CR) EBernhardson: "that sounds reasonable, I've adjusted it to not collect paths to delete if a cli flag is provided." [analytics/refinery] - https://gerrit.wikimedia.org/r/419953 (owner: EBernhardson)
[22:17:06] (CR) EBernhardson: "also tested this against discovery.query_clicks_daily, works as expected." [analytics/refinery] - https://gerrit.wikimedia.org/r/419953 (owner: EBernhardson)
[23:00:41] Analytics-Kanban, Patch-For-Review: Refresh SWAP notebook hardware - https://phabricator.wikimedia.org/T183145#4113594 (Tbayer) >>! In T183145#4104514, @Ottomata wrote: > Hm, in the meantime, I’ve also installed pyhive, which I think has a > similar interface. https://github.com/dropbox/PyHive > > Try...
[23:06:20] Manual job for mediawiki-history reduced succeeded - Problem was wrong default value for oozie_launcher_memory
[23:23:49] Analytics-Kanban, Patch-For-Review: Refresh SWAP notebook hardware - https://phabricator.wikimedia.org/T183145#4113623 (Tbayer) >>! In T183145#4104026, @elukey wrote: >>>! In T183145#4103901, @Tbayer wrote: >> I have been using [[https://pypi.python.org/pypi/impyla |impyla]] on notebook1003 to run Hive q...
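(Editor's note: related to the pyhive/impyla discussion in the last two tickets, a hedged sketch of notebook-style Hive access with PyHive and pandas. The HiveServer2 host below is a placeholder and the query is purely illustrative.)

```python
# Hedged sketch of querying Hive from a SWAP notebook with PyHive + pandas.
# Host is a placeholder; the query is illustrative only.
import pandas as pd
from pyhive import hive

conn = hive.connect(host='hiveserver.example.org', port=10000)

df = pd.read_sql(
    "SELECT uri_host, count(*) AS requests "
    "FROM wmf.webrequest "
    "WHERE year = 2018 AND month = 4 AND day = 6 AND hour = 0 "
    "GROUP BY uri_host "
    "ORDER BY requests DESC LIMIT 10",
    conn,
)
print(df)
```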