[00:22:14] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037147 (10Nuria) Right, we are creating date objects in users' local tz (which is the js default), which is not correct. Will look into this a bit later today. [00:44:51] nuria_: I got the date problem; if you're not working on it, I'll submit a patch in a bit [01:03:24] nuria_: https://gerrit.wikimedia.org/r/#/c/417476/ [01:04:02] (03PS1) 10Fdans: Formats date objects always according to UTC timezone [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/417476 (https://phabricator.wikimedia.org/T189266) [04:13:26] going to sleep now, denormalize job still running [06:50:27] Checking in on denormalization - still running [07:01:41] joal: o/ [07:02:07] everything looks ok afaics (kafka, etc.) [07:02:24] going to the conf, will be online in an hour or so probably [07:03:00] * elukey will re-discover Bologna's traffic at rush hours [08:55:46] I am back again, had an issue with my VPS (so didn't see the previous logs :) [09:00:05] Hi elukey [09:00:08] nothing happened after "elukey will re-discover Bologna's traffic at rush hours" [09:00:09] No issue as of now [09:01:52] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037597 (10jcrespo) [09:18:42] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037613 (10jcrespo) The lower concurrency is better, but the problem is still ongoing - it is too "bursty" - moments where many connecti... [09:49:22] (03CR) 10Joal: [C: 04-1] "Comments inline (mostly not-breaking, except 1)." (036 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [10:06:36] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037678 (10jcrespo) In other order of things, is it normal to still get errors from 127.0.0.1, which I think points to the older queue... [12:08:36] moritzm, joal: ack, thanks! [13:44:52] mediawiki-history run on spark2 without dynamic-allocation has succeeded :) [14:05:57] joal: \o/ [14:06:15] hell yea joal :) [14:06:23] :) [14:07:04] milimetric: we are still in incorrect-mode for sqoop imports on the pagelinks table [14:11:59] joal: eh? [14:12:12] 10Analytics-Tech-community-metrics, 10Developer-Relations: Review entries in https://github.com/Bitergia/mediawiki-repositories/ to exclude/include - https://phabricator.wikimedia.org/T187711#4038175 (10Aklapper) Trying to gather understanding by running `diff -pu gerrit-repo-list-from-ssh-T187711.txt gerrit-r...
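The timezone bug at [00:22:14] comes from building JS Date objects in the viewer's local timezone, so the same datapoint gets a different date label depending on where it is viewed; the linked patch (417476) formats dates always in UTC instead. A minimal sketch of the same idea in Python (the real fix is in the Wikistats2 JS code; the timestamp below is illustrative):

```python
from datetime import datetime, timezone

ts = 1520553600  # illustrative epoch seconds: 2018-03-09 00:00:00 UTC

# Local-timezone labeling (the bug): the label depends on the viewer.
# A viewer in UTC-5 sees "2018-03-08", a viewer in UTC+1 sees "2018-03-09".
local_label = datetime.fromtimestamp(ts).strftime("%Y-%m-%d")

# UTC labeling (the fix): every viewer sees the same date label.
utc_label = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
print(utc_label)  # "2018-03-09", everywhere
```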
[14:12:58] milimetric: 8 wikis are missing pagelinks data [14:13:25] milimetric: cawiki, cswiki, fawiki, huwiki, idwiki, kowiki, metawiki, srwiki [14:13:55] milimetric: folders exist but are empty [14:14:10] joal: oh, I didn't know, sorry if you told me and I missed it [14:14:21] I can rerun those [14:17:06] milimetric: the state was incomplete when we were modifying the sqoop code, and we left it like that [14:19:29] joal: wait, these were the wikis that failed last month [14:19:37] and I reran those and they reran fine, I added success [14:19:44] so this is new, they failed again [14:20:41] What that means is that writing _SUCCESS files by folder-in-error works, but email doesn't :) [14:22:31] (03CR) 10Ottomata: "+1 to your comments joal, except one. :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [14:23:26] joal: wait, but this is a new set of failures, that you just found out about, right? I'm still trying to figure out what you mean [14:24:22] milimetric: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033606-170829140538136-oozie-oozi-C/ [14:26:04] ok, so a bunch of tables didn't sqoop...? But then how did denormalize run? [14:26:15] you mean the denormalize run didn't have all the wikis sqooped when it started? [14:26:57] because that's not just pagelinks missing, that's revision, user, etc. [14:27:06] milimetric: I double-checked which _SUCCESS files were missing - only pagelinks' are [14:27:21] That's why denormalize started [14:27:44] However clickstream won't :) [14:28:23] Also, the oozie metrics, druid and reduced (AQS) jobs have started as well [14:28:26] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Patch-For-Review, 10Services (doing): Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052#4038214 (10Pchelolo) [14:28:27] ok, so status is: the *new* sqoop code failed to write SUCCESS flags for revision, user, etc. for unknown reasons, AND on top of that, pagelinks failed to sqoop for those wikis [14:28:30] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (done): Failed to acquire page lock in LinksUpdate - https://phabricator.wikimedia.org/T188106#4038212 (10Pchelolo) 05Open>03Resolved After merging the patch to wait for replicas to catch up, in the last 24 hours we've seen only 9 occurrences of... [14:28:35] nope [14:28:55] milimetric: The new sqoop code succeeds at writing success flags by table-folder [14:29:21] The files are present for every table except pagelinks, where data is missing for the wikis pasted above [14:29:55] but hue says SUCCESS is missing for revision for 2018_02 [14:30:10] So current status is: sqoop code doesn't email us about errors (we knew that from the cu_changes run) [14:30:22] milimetric: you really trust hue? [14:30:46] oh :( so you sent me the hue link as a trick, I see :) [14:31:13] Oh no !!! - Arf - I sent you that link so that you knew the job failed [14:31:24] And in the meantime told you about the files :( [14:31:46] rumblr rumblr rumblr --- I don't like when I don't manage to communicate correctly :( [14:32:13] it's ok, I should've just checked the files myself, I'm doing that now to make sure I understand what happened. But why is hue saying something crazy?
[14:32:17] anyway - the hive load job has not started because of pagelinks only [14:32:48] milimetric: I don't know - I have already noticed that for file dependencies, I shouldn't trust hue [14:33:07] oof, the fuse hdfs mount is borked I think [14:33:15] cd-ing into it hangs [14:34:51] indeed /wmf/data/raw/mediawiki/tables/revision/snapshot=2018-02/_SUCCESS exists [14:35:06] oh wow, I never knew hue is lying to me!! [14:35:59] hdfs dfs -ls /wmf/data/raw/mediawiki/tables/*/snapshot=2018-02/_SUCCESS confirms what you say - only pagelinks is missing [14:36:03] milimetric: The cake IS a lie ! [14:36:27] _that_ whole thing is unreasonable, it's much easier to bake a cake than go through all that [14:36:33] :D [14:42:38] nice work guys with the mw-history job :) [14:43:10] elukey: Thanks - I'd however have preferred for the bloody thing to JUST WORK ! [14:43:29] o/ [14:43:45] joal btw i'm going to work on that null string/long jsonrefine thing today [14:43:52] elukey: I've been putting effort today into understanding perf aspects better, and I have a plan [14:43:53] joal: I know I know, it required a higher Joseph's worker concurrency but it eventually worked [14:44:03] :D [14:44:12] * joal smiles in a devil-ish way [14:44:23] ottomata: o/ [14:44:27] Hi ottomata [14:44:35] cool ottomata [14:44:55] also people you rock, Apple is basically using all the things that we use for their Hadoop infrastructure [14:45:02] ottomata: I'd also like to discuss the move to spark2 now that the shuffle thing is confirmed [14:45:50] joal: sounds like we can't do the move to spark2 before the refine deploy, right? [14:45:53] since we need that asap now [14:46:10] * elukey afk again [14:46:57] ottomata: I'd actually have gone for a fast move to spark2, but the dep on discovery makes it more tricky :( [14:47:13] ottomata: I'm afraid of us never moving if we don't do it soon [14:47:38] The more jobs in spark1, the more difficult to move [14:47:39] (03CR) 10Nuria: "This fixes the display format but I think one additional change is needed, give me a sec." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/417476 (https://phabricator.wikimedia.org/T189266) (owner: 10Fdans) [14:48:23] joal: but i thought we couldn't do it? because of shuffle? [14:49:34] ottomata: if we manage to sync with discovery, we move every prod job to spark2, and therefore also change the shuffle jar [14:51:50] joal: do we have a shuffle jar for spark2? [14:52:06] joal: i think this is actually very hard to coordinate [14:52:16] and even though we've tested some, it's very likely to cause problems here and there [14:52:20] it's a lot of things moving at once [14:52:28] and we need to do the page previews geocode refine stuff like now [14:52:36] if we do it and everything breaks while we do it, it will look pretty bad [14:52:37] ok [14:53:00] ottomata: my concern is that without doing it, we'll actually never do it :) [14:53:15] we can make it a goal next quarter or something, i want to do it [14:53:40] ottomata: the only coord thing is discovery for their ES loading jobs [14:53:48] ottomata: the rest is ready for us [14:53:50] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034945 (10Pchelolo) > In other order of things, is it normal to still get errors from 127.0.0.1, which I think points to the older qu...
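The by-hand check above (`hdfs dfs -ls` over every table's snapshot folder) can be scripted. A rough sketch, assuming the standard `hdfs` CLI is on the path; the table list is illustrative, not the complete set of sqooped mediawiki tables:

```python
import subprocess

# Illustrative subset of the sqooped mediawiki tables, not the full list.
TABLES = ["archive", "logging", "page", "pagelinks", "redirect",
          "revision", "user"]

def tables_missing_success(snapshot):
    """Return the tables whose per-folder _SUCCESS flag is absent."""
    missing = []
    for table in TABLES:
        flag = (f"/wmf/data/raw/mediawiki/tables/{table}"
                f"/snapshot={snapshot}/_SUCCESS")
        # `hdfs dfs -test -e` exits 0 if and only if the path exists.
        if subprocess.run(["hdfs", "dfs", "-test", "-e", flag]).returncode != 0:
            missing.append(table)
    return missing

print(tables_missing_success("2018-02"))  # per the log above: ['pagelinks']
```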
[14:53:55] Mwarf [14:53:58] anyway [14:54:37] ottomata: I'm gonna correct Marcel's patch later today (kids time now), and we can merge, deploy and test if you wish [14:54:38] yeahhhh [14:54:42] ok cool [14:55:25] ottomata: And, if we manage to get a +1 from Erik on moving to spark2, I'd really like to do it while page-previews are not yet productionized [14:55:37] But hey, days only have 24h [14:56:07] joal: if we can, i want to deploy refine monday and start it [14:56:13] i have to do eventlogging jumbo next week too [14:56:16] :) [14:56:24] ottomata: I don't see why not [14:57:31] ottomata: /usr/lib/spark2/yarn/spark-2.2.1-yarn-shuffle.jar [14:57:36] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4038252 (10Pchelolo) [14:58:33] ohhh, :) [15:00:17] joal: verrrry strange with this string/long thing [15:00:20] it isn't nulls or missing values [15:00:34] ottomata: really???? [15:00:38] i filtered an offending hour for records where this msToDisplayResults was present, and afaict it always has a long [15:00:41] or integer field [15:00:46] but spark still loads it as a string [15:01:06] ottomata: are there any rows where it is missing? [15:01:08] also, there's another field in this schema, hitsReturned [15:01:16] which is missing sometimes, but it seems to be inferred as a long just fine [15:01:23] joal: no, i filtered for where it's present [15:01:32] basically dumped the json | grep msToDisplayResults [15:01:35] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Support claimTTL and rootClaimTTL in change-prop - https://phabricator.wikimedia.org/T189303#4038259 (10Pchelolo) p:05Triage>03Normal [15:01:39] super weird [15:01:53] formatting issue in json? [15:01:55] i'm trying to find which record is causing this, but all values look sane [15:02:02] gonna start bisecting it :p [15:02:03] yeah I did that as well [15:02:09] ok [15:02:44] ottomata: dumb idea - comma instead of dot in numeric format? [15:03:00] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038274 (10jcrespo) Thanks, that last comment was indeed *very very* useful. > one-by-one and that means each job establishes a new conn... [15:05:46] can't find any commas or dots in this field [15:05:51] they all look like ints [15:06:14] ottomata: Let's try to parse the file with python? [15:07:08] joal worth a try, if you want to; the file i'm working with is in hdfs at /tmp/ss/onlymstodisplayresults/data.json [15:07:17] i'm trying to make splits of it and load them in pieces [15:09:00] ottomata: which machine? [15:09:28] oh hdfs [15:09:39] Ha ! [15:14:23] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038288 (10Pchelolo) > Proxy/connection pool is something that we are going to use for crossdc connections, so it was already in the back...
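The "make splits of it and load them in pieces" approach at [15:07:17] is a bisection: split the dump in half, re-run the check on each half, and recurse into whichever half still misbehaves. A generic sketch; `breaks_inference` is a hypothetical stand-in for the real oracle (re-running Spark's JSON schema inference on a slice and seeing whether msToDisplayResults comes back as string):

```python
def bisect_bad_record(records, breaks_inference):
    """Narrow a list of records down to one that trips the predicate.

    Assumes at least one bad record exists and that any slice containing
    it keeps tripping the predicate as the slice shrinks.
    """
    while len(records) > 1:
        mid = len(records) // 2
        left = records[:mid]
        # Recurse into whichever half still reproduces the problem.
        records = left if breaks_inference(left) else records[mid:]
    return records[0]
```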
[15:16:58] ottomata: in python, your file is parsed as either int or long [15:17:04] need to drop now - super weird [15:17:44] joal: ya, that happens, but we always choose long [15:18:34] joal i found the bad value [15:18:34] 9223372036854776000 [15:22:54] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, 10Services (doing): FY17/18 Q3 Program 8 Services Goal: Migrate two high-traffic jobs over to EventBus - https://phabricator.wikimedia.org/T183744#4038303 (10mobrovac) [15:22:58] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#4038304 (10mobrovac) [15:23:02] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, 10Services (doing): Support claimTTL and rootClaimTTL in change-prop - https://phabricator.wikimedia.org/T189303#4038302 (10mobrovac) [15:23:10] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, 10Services (doing): Support claimTTL and rootClaimTTL in change-prop - https://phabricator.wikimedia.org/T189303#4038259 (10mobrovac) [15:24:22] fdans: thanks for jumping on the patch, will continue on it [15:32:13] joal: when you come back, would love to brain-bounce this problem [15:32:16] not sure what to do about it. [15:32:46] maybe we can do extra validation at the EL level to not allow this type of value [15:32:48] ergh. [15:35:35] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038391 (10jcrespo) > I think it would be much better to do it on your side of the "fence" I can own this no problem, but if I do, I will a... [15:42:27] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 6 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038398 (10jcrespo) [16:11:15] I am back :) [16:49:55] Hi joal, it's been a while! Do you have a few minutes to check in? [16:54:41] (03CR) 10Mforns: "Thanks for your review Joal, all comments make sense." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [17:02:48] shilad - I'm in standup right now, will have time in about 1/2 hour [17:03:18] Got it. Shoot me a message when / if you are free afterwards. [17:03:25] ack shilad :) [17:29:22] !lof Rerun mediawiki-history-reduced job after having manually repaired wmf_raw.mediawiki_project_namespace_map [17:29:26] !log Rerun mediawiki-history-reduced job after having manually repaired wmf_raw.mediawiki_project_namespace_map [17:29:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:31:28] 10Analytics: Changes to map projection in wikistats - https://phabricator.wikimedia.org/T188927#4038721 (10ezachte) @Nuria, could you please resend the url to the new map projection? I think you already sent it but I can't find that mail (sorry). [17:36:50] (03PS1) 10Joal: Correct oozie mediawiki-reduced job dependency [analytics/refinery] - 10https://gerrit.wikimedia.org/r/417994 [17:42:56] elukey: what was that mediawiki doc that giuseppe made? [17:43:57] nuria_: https://wikitech.wikimedia.org/wiki/MediaWikiEtcdConfig [17:44:54] * elukey rebrands himself as Hadoop SRE after today's overload of buzzwords from the conference [17:45:12] elukey: +1000 !
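The bad value found at [15:18:34] explains the whole mystery: 9223372036854776000 is just past Long.MAX_VALUE (2^63 - 1 = 9223372036854775807). Python ints are arbitrary precision, so parsing the file in Python looked fine, but Spark's JSON schema inference cannot fit the value in a LongType and falls back to string for the whole column. A small demonstration; the "rounded double" origin in the last line is an assumption, not something confirmed in the log:

```python
import json

LONG_MAX = 2**63 - 1          # 9223372036854775807, the largest JVM long
bad = 9223372036854776000     # the offending value from the log

# Python parses it happily, which is why the python check looked sane:
assert json.loads(str(bad)) == bad

# ...but it does not fit in a long, so Spark cannot infer LongType:
assert bad > LONG_MAX

# Plausible origin (an assumption): a client serializing numbers through
# IEEE-754 doubles rounds anything near 2**63 to exactly this decimal form.
assert float(LONG_MAX) == float(bad) == 2.0**63
```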
[17:45:22] ahhaah [17:46:36] joal: next step is a cluster of 100 nodes [17:47:13] * joal supports that idea :) [17:47:42] * elukey sends a formal request for nuria_ to push for a 100-node cluster [17:47:47] :D [17:52:58] elukey: on it no worries [17:53:15] rightttt [17:53:19] \o/ [17:53:41] super good that this chan has logs so I have you on record :D [17:55:04] all right people, logging off! [17:55:07] have a nice weekend [17:58:15] Bye elukey :) [17:58:23] Hey shilad - I'm a bit late, but here :) [17:58:29] Is now a good moment, shilad? [17:58:43] Yes! Batcave? [17:58:48] OMW! [18:08:27] hi nuria_! just following up to see if I could get added to Hue? I still can't seem to access it. [18:23:42] ottomata: can you add Mneisler to the list of hue-allowed users? [18:35:29] Mneisler: ottomata would need to add you; it can be done as he has time [18:37:21] thanks! not urgent, but by next week would be great if possible. [18:41:07] gotcha! [18:41:45] Done Mneisler! [18:41:53] log in with your shell name and your ldap password [18:43:31] got it! Thanks ottomata! [19:44:13] joal: have a min to go over what I got? [19:44:16] i think it's working [19:44:19] sure ottomata [19:44:50] cool ! [20:01:20] joal: should I catch any exception when casting? or just the ones I've run into? [20:01:34] joal: should I catch any exception when casting? [20:01:40] oh ^ delay [20:02:04] ottomata: if casting fails for wrong reasons, it should be an error [20:02:05] IMO [20:02:53] ok [20:26:08] Hey milimetric [20:26:40] milimetric: Have you started a sqoop for the missing tables or shall I do it? [20:33:22] ah, joal sorry I forgot [20:33:27] I'll do it now [20:33:29] no worries [20:36:03] Thanks milimetric :) [20:36:10] I'll call it a day then [20:36:16] Have a good weekend team [20:38:10] 10Analytics-Kanban: Get fancy with type casting in Refine job - https://phabricator.wikimedia.org/T189332#4039207 (10Ottomata) [20:38:15] 10Analytics-Kanban: Get fancy with type casting in Refine job - https://phabricator.wikimedia.org/T189332#4039218 (10Ottomata) [20:38:51] (03PS1) 10Ottomata: Get smart and hacky about compatible type casting [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332) [21:32:40] something's strange with HDFS, it's really really slow [21:56:36] (03PS9) 10Milimetric: Compute geowiki statistics from cu_changes data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 [22:49:53] ugh, I am getting a fatal error when opening this page on wikitech https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly/Sanitization [22:50:04] the rest of wikitech seems to load fine [22:58:07] 10Analytics, 10Data-release, 10Research, 10Privacy: An expert panel to produce recommendations on open data sharing for public good - https://phabricator.wikimedia.org/T189339#4039479 (10DarTar) [22:58:24] 10Analytics, 10Data-release, 10Research, 10Privacy: An expert panel to produce recommendations on open data sharing for public good - https://phabricator.wikimedia.org/T189339#4039479 (10DarTar) p:05Triage>03Normal [22:59:40] 10Analytics, 10Data-release, 10Research, 10Privacy: An expert panel to produce recommendations on open data sharing for public good - https://phabricator.wikimedia.org/T189339#4039479 (10DarTar) [23:09:42] 10Analytics, 10Operations, 10Ops-Access-Requests, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4039520 (10DarTar) [23:55:41] DarTar: the page works for me
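On the casting question at [20:01:20] ("should I catch any exception when casting?"), joal's answer amounts to: handle only the expected bad-data case, and let anything else fail loudly. A hypothetical sketch of that policy, not the actual refinery-source patch (418052):

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def cast_to_long(value):
    """Coerce a value to a JVM-long-compatible int, per the policy above:
    handle the known bad-data case explicitly, and let a cast that fails
    for 'wrong reasons' propagate as a real error."""
    n = int(value)  # non-numeric input raises ValueError/TypeError: a bug
    if not LONG_MIN <= n <= LONG_MAX:
        # The known bad case (e.g. 9223372036854776000): flag it rather
        # than silently truncating or stringifying.
        raise OverflowError(f"{n} does not fit in a long")
    return n
```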