[02:24:57] PROBLEM - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[02:37:50] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2621981 (Ottomata)
[02:41:42] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2621999 (Ottomata) Also, megacli shows:
```
$sudo megacli -PDList -aAll
...
Enclosure Device ID: 32
Slot Number: 3
...
Firmware state: Failed
```
[02:42:35] ACKNOWLEDGEMENT - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode ottomata https://phabricator.wikimedia.org/T145170
[03:51:09] RECOVERY - Hadoop DataNode on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[03:58:49] PROBLEM - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[07:06:34] (CR) Joal: [C: 1] "LGTM, we should probably try that patch on new aqs cluster before merging/deploying?" [analytics/aqs] - https://gerrit.wikimedia.org/r/309386 (https://phabricator.wikimedia.org/T144521) (owner: Nuria)
[07:21:43] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[07:22:44] scheduling downtime --^
[07:23:40] thanks elukey
[07:24:42] (CR) Joal: [C: -1] "I expected errors, but not that bad (I checked page 331414, it's really ugly ...)." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric)
[07:25:42] need to commute to the office, will be back in 30 mins
[07:25:45] :)
[07:27:20] ok I stopped puppet and Yarn/hdfs on 1032 so it will be out of the cluster
[07:27:32] it is not a journal node so we are ok :)
[07:28:29] thx elukey
[07:33:29] (PS7) Joal: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric)
[07:34:05] milimetric, mforns --^ Last updates with making it easier to run parts of the script in shell (remove private)
[08:32:04] !log executed apt-get clean on analytics1032 to free space
[08:32:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:36:45] I am checking analytics1032
[08:36:47] 1.5G syslog
[08:36:48] 3.9G jmxtrans
[08:36:48] 6.7G syslog.1
[08:36:48] 8.2G kern.log
[08:36:52] lol
[08:42:53] so the host is spamming tons of logs
[08:42:56] this is weird
[08:49:34] sounds like broken hardware
[08:50:34] sde seems dying
[08:50:44] moritzm: super weird, there was a "du" process launched by "hdfs" causing the massive spam
[08:50:57] in syslog, kern.log, etc..
[08:51:01] I killed it an now it is fine
[08:54:09] it might just be the case that the du process caused I/O which made it logged the failure, and now without that command it's idling/serving from caches, maybe try excercising some I/O
[08:55:37] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622515 (elukey) kern.log, syslog and jmxtrans kept getting errors logged ending up filling the disks, the major cause seemed to be a "du" process launched by the "hdf...
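(Editor's note: a minimal shell sketch of the kind of checks described in the analytics1032 investigation above — confirming the failed physical disk with megacli and finding which log files are filling the partition. The /var/log paths and grep patterns are typical defaults assumed for illustration, not the exact commands from the log.)
```bash
# Hedged sketch; assumes a MegaRAID controller managed by megacli and a standard /var/log layout.
sudo megacli -PDList -aAll | grep -E 'Slot Number|Firmware state'   # look for slots in "Failed" state
df -h /var/log                                                      # how full is the log partition?
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head                # which log files are largest?
```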
[09:02:41] moritzm: oh yes I think the du command was causing the mess, not sure why it was executed :D
[10:25:45] (PS3) Addshore: WikidataArticlePlaceholderMetrics also send search referral data [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955)
[10:26:35] (PS4) Addshore: WikidataArticlePlaceholderMetrics also send search referral data [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955)
[10:26:53] (CR) Addshore: WikidataArticlePlaceholderMetrics also send search referral data (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955) (owner: Addshore)
[11:08:11] Hey, one question. I have access to stat1003 and made a file there for this: https://phabricator.wikimedia.org/T135684 I was thinking if there is public endpoint (specially an apache-like one) in stat machines so I can put my file there and they will be accessible through something like https://analytics.wikimedia.org/files/foo.bz2 or files/ladsgroup/foo.bz2
[11:09:08] I just saw this: https://datasets.wikimedia.org/
[11:09:16] nice
[11:15:50] Okay, I put something in /a/public-datasets/enwiki/article_quality/ in stat1003, can someone run the job to copy it to stat1001?
[11:18:40] Hey Amir1, I can help but I didn't get the use case.. is it a one time thing only or should it be done periodically?
[11:19:15] elukey: It's one time thing now
[11:20:09] If we want to do it periodically, we need a more sophisticated method. Planning to do it later
[11:20:47] ok I'll double check later on, would it be fine for you? (today of course)
[11:24:11] * elukey lunch!
[11:24:27] (will read messages later on!)
[11:27:18] elukey: thanks but the file is not in https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/
[11:27:24] (for when you're back)
[11:41:19] it's there now
[11:41:21] thanks
[11:43:06] addshore: btw. Do you have some time to review this? https://gerrit.wikimedia.org/r/#/c/307077/
[11:43:16] oops wrong channel :D
[11:43:30] haha!
[11:43:41] I'll add it to my review list, and see what happens!
[11:58:05] (CR) Thiemo Mättig (WMDE): "I believe this patch is incomplete and does not do what it's supposed to do. But I don't know enough to justify a -1." (2 comments) [analytics/statsv] - https://gerrit.wikimedia.org/r/308959 (owner: Addshore)
[12:10:54] (CR) Addshore: Use ^ and $ while spliting metric value and type (2 comments) [analytics/statsv] - https://gerrit.wikimedia.org/r/308959 (owner: Addshore)
[12:18:12] (CR) Thiemo Mättig (WMDE): [C: 1] Use ^ and $ while spliting metric value and type (1 comment) [analytics/statsv] - https://gerrit.wikimedia.org/r/308959 (owner: Addshore)
[12:19:21] (PS3) Addshore: Use ^ and $ while spliting metric value and type [analytics/statsv] - https://gerrit.wikimedia.org/r/308959
[12:19:32] (CR) Addshore: Use ^ and $ while spliting metric value and type (1 comment) [analytics/statsv] - https://gerrit.wikimedia.org/r/308959 (owner: Addshore)
[12:47:12] a-team: https://pivot.wikimedia.org/ :)
[12:47:45] oh that's beautiful elukey
[12:48:27] still using Andrew's instance on stat1002
[12:48:52] but it will be super easy to switch after I'll have the correct version deployedd
[12:48:55] :)
[12:49:07] in the meantime, let me know if you see weirdness
[13:10:51] Analytics-Kanban, Patch-For-Review: Productionize Pivot UI - https://phabricator.wikimedia.org/T138262#2622968 (elukey) https://pivot.wikimedia.org/ is up and running, at the moment this is the status: Varnish --> stat1001 VHost (Basic Auth + LDAP) --proxied--> stat1001 (nodejs run using screen) Next s...
[13:15:07] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622972 (Cmjohnson) The disk on analytics1032 is failed. Replaced the failed disk, cleared the cache, added the disk back and all disks are back online @analytics1032...
[13:20:22] created partition on analytics1032, rebooting
[13:23:44] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:26:13] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622996 (elukey) Open>Resolved created the partition, rebooted the host since we don't have UUIDs and enabled puppet. All good! Thanks @Cmjohnson!
[13:44:47] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2623005 (Marostegui) @Jdforrester-WMF looks like ImageMetricsCorsSupport_11686678 was created yesterday. Can you check from your side...
[13:46:26] joal: hiyaaa, should I merge that camus puppet change?
[13:47:59] elukey I can make repos if you need, pivot-deploy?
[13:49:15] milimetric: I was about to ask you how to proceed, if we want to brutally copy/paste everything to gerrit or something different
[13:49:24] Analytics, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2623016 (Mrjohncummings)
[13:49:29] so if you want please go ahead!
[13:50:13] hm
[13:50:32] pivot/deploy? :)
[13:50:35] whatchall doin?
[13:50:38] right, normally we'd have analytics/pivot
[13:50:46] analytics/pivot-deploy would then have that as a submodule
[13:50:54] analytics/pivot/deploy i think
[13:50:56] not pivot-deploy
[13:50:57] right
[13:50:59] sorry
[13:51:08] but the main question is where do we put pivot
[13:51:14] do we leave it in github and submodule that
[13:51:18] or fork it in gerrit
[13:51:26] probably best to fork in gerrit
[13:51:43] or
[13:51:43] hm
[13:51:43] i see
[13:51:52] it makes updates a bit more complicated
[13:51:53] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2623030 (Cmjohnson) I used an on-site spare to swap the disk, ordered a new one from Dell. Congratulations: Work Order SR935921121 was successfully submitted.
[13:51:57] but seems fine...
[13:52:00] yeah, still probably better to fork in gerrit
[13:52:01] yeah, but it isn't too hard
[13:52:07] yea
[13:52:13] git pull origin
[13:52:14] git push gerrit master
[13:52:23] we do that with camus
[13:52:24] brutal :)
[13:52:28] haha
[13:53:22] milimetric: I'd need also a scap repo for the config, that will probably end up in $path/scap
[13:53:23] ok, making two repos
[13:53:32] $path?
[13:53:35] Analytics-Cluster, Analytics-Kanban, Operations: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2623032 (Ottomata) Thanks you two! So much action between my bedtime and my coffee! :D :D
[13:54:01] hmmm
[13:54:08] elukey: in this case you don't need a scap repo it hink
[13:54:14] you can put scap in deploy
[13:54:15] i think...
[13:54:20] i think that's how most services stuff works
[13:54:24] how is it with aqs?
[13:54:25] ottomata: or just a brutal git clone?
[13:54:29] scap repo is for if you don't need a deploy repo
[13:54:36] just a place for scap dir
[13:54:44] yes exactly
[13:54:49] I like that, separations of concern
[13:55:15] brb
[13:55:31] ottomata: if we use git clone + systemd unit instead of scap
[13:55:34] would it be easier?
[13:55:58] less control of course
[13:56:00] e..g https://github.com/wikimedia/mediawiki-services-citoid-deploy
[13:56:29] elukey: naw, you can't just git clone, because you need to have the node_modules checkced into the deploy repo
[13:57:21] ah ok I didn't know this part
[13:58:13] okok I thought it was easier and self contained
[13:58:20] didn't think about the node mess :D
[13:58:32] so we need two scap repos then
[13:58:38] elukey: ok so this is scap: https://github.com/wikimedia/analytics-aqs-deploy/tree/master/scap
[13:58:42] in aqs-deploy
[13:58:46] *aqs/deploy
[13:59:03] milimetric: go ahead with what you wanted to do I had some confusion in mind
[13:59:06] sorry
[13:59:21] I thought it was all self contained in the github repo
[13:59:24] but that means one repo that's a fork of pivot and one that's pivot-deploy
[13:59:26] didn't think about node dep
[13:59:37] pivot/deploy would have a scap directory in there and a submodule to the fork repo
[13:59:50] sure sure I am super ignorant about it
[13:59:57] would you mind to explain that to me while doing it?
[14:00:11] no problem
[14:00:14] batcave?
[14:00:22] or you mean here on irc
[14:00:45] even batcave! Let me grab a coffee, 2 mins?
[14:00:53] k
[14:00:56] (my brain is melting a bit :)
[14:01:07] ottomata: did you see pivot.wikimedia.org ?
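(Editor's note: a minimal shell sketch of the layout agreed on above. The repo names analytics/pivot and analytics/pivot/deploy come from the discussion; the clone URLs, remote names, and the upstream GitHub location are illustrative assumptions, not the exact commands that were run.)
```bash
# 1) Keep the gerrit fork of pivot in sync with its GitHub upstream, roughly the
#    "git pull origin / git push gerrit master" flow mentioned above:
git clone https://github.com/implydata/pivot.git            # upstream URL is an assumption
cd pivot
git remote add gerrit https://gerrit.wikimedia.org/r/analytics/pivot
git pull origin master
git push gerrit master

# 2) The deploy repo carries a scap/ config dir, the source fork as a submodule,
#    and the node dependencies committed in, so scap only copies files on targets:
cd ..
git clone https://gerrit.wikimedia.org/r/analytics/pivot/deploy
cd deploy
git submodule add https://gerrit.wikimedia.org/r/analytics/pivot src
git commit -m "Add pivot source as a submodule"
```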
[14:01:09] yeah, put it on pause, it'll be sufficient to watch me work
[14:01:13] I move like glaciers :)
[14:01:23] (also analytics1032 is up and running)
[14:05:57] milimetric: going in the cave
[14:11:42] hello?
[14:14:02] milimetric: can you read this?
[14:33:06] elukey: hi, am I on IRC now?
[14:33:52] ottomata: come to the cave
[14:33:56] we need to ask you about camus
[14:37:54] k!
[14:57:03] (PS1) Milimetric: First commit [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309579
[14:57:15] (CR) Milimetric: [C: 2 V: 2] First commit [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309579 (owner: Milimetric)
[14:58:50] (CR) Nuria: [C: -1] "Please add more info to commit message as to the reasons why are we doing these changes." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955) (owner: Addshore)
[15:00:34] joal: standdupp?
[15:02:57] (PS1) Milimetric: Add source repo submodule [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309582
[15:05:47] (PS1) Milimetric: Add node dependencies [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309585
[15:06:27] (CR) Milimetric: [C: 2 V: 2] Add source repo submodule [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309582 (owner: Milimetric)
[15:06:46] (CR) Milimetric: [C: 2 V: 2] Add node dependencies [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/309585 (owner: Milimetric)
[16:29:58] (PS1) Nuria: Change default compression scheme [analytics/aqs] (new-aqs-cluster) - https://gerrit.wikimedia.org/r/309602 (https://phabricator.wikimedia.org/T140866)
[16:31:20] elukey: see if this make sense: https://gerrit.wikimedia.org/r/#/c/309602/, see branch on gerrit UI is called "new-aqs-cluster"
[16:31:35] elukey: to push a change: git push origin HEAD:refs/for/new-aqs-cluster
[16:33:03] nuria_: you missed to delete a '}'
[16:33:09] line 54
[16:37:27] (PS1) Nuria: Map null count values to zeros in output [analytics/aqs] (new-aqs-cluster) - https://gerrit.wikimedia.org/r/309604 (https://phabricator.wikimedia.org/T144521)
[16:38:45] Amir1: sorry I didn't have time to check your request :(
[16:38:53] would it be ok to do it on Monday?
[16:39:05] otherwise I can try to check it now
[16:39:11] because we have rsyncs everywhere
[16:39:21] and I'd need to figure out the best way to do it
[16:39:47] (PS2) Nuria: Change default compression scheme [analytics/aqs] (new-aqs-cluster) - https://gerrit.wikimedia.org/r/309602 (https://phabricator.wikimedia.org/T140866)
[16:40:13] (PS2) Nuria: Map null count values to zeros in output [analytics/aqs] (new-aqs-cluster) - https://gerrit.wikimedia.org/r/309604 (https://phabricator.wikimedia.org/T144521)
[16:40:23] elukey: sorry, corrected now
[16:41:01] elukey: and now the null to zerp change is on top of this one on the branch, we can deploy this branch next week
[16:41:22] elukey: and do other changes we need for new cluster there
[16:47:01] nuria_: the } seems still there no? This time is line 55
[16:51:37] nuria_: I really like the idea of the new branch, let's also put ottomata and joal in the loop
[16:54:04] Amir1: will sync back with you on Monday, sorry again (/me takes notes to ping you)
[16:59:38] nuria_: going offline but we'll catch up on Monday from what I've read right? Please let me know otherwise, I'll double check later on
[17:21:03] hey mforns
[17:21:09] hey milimetric
[17:21:21] how goes
[17:21:53] I've looked at the code, looks good, also the latest changes, I'm executing right now
[17:21:59] getting errors though
[17:22:28] ok, I'll look over the code and if you're still getting errors when I'm done maybe we can look together?
[17:23:52] milimetric, if you want ok, but feel free to do other stuff, I guess you're tired of scala, and also I need to conquer this :]
[17:24:41] mmm, thanks but if I can help I'd like to, I think we need to get Erik some data
[17:24:53] milimetric, sure
[17:32:12] (PS3) Nuria: Change default compression scheme [analytics/aqs] (new-aqs-cluster) - https://gerrit.wikimedia.org/r/309602 (https://phabricator.wikimedia.org/T140866)
[17:32:34] elukey: sorry, ahem, pushed w/o add
[17:32:39] elukey: let's touch base mon
[17:33:59] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2623622 (Jdforrester-WMF) The reference to this was dropped from the code in https://gerrit.wikimedia.org/r/#/c/230982/ which was mer...
[17:34:44] ok mforns, I more or less read through it... :)
[17:34:58] milimetric, ok, it seems to be executing now
[17:35:01] cave?
[17:35:01] ooh
[17:35:06] ya, brt
[17:39:16] milimetric, when you get a minute, could you look at https://meta.wikimedia.org/wiki/User_talk:EpochFail#Useranalysis_in_Wikipedia.3F ?
[17:39:28] sure halfak
[17:40:07] Thanks! :)
[17:44:16] halfak: FYI that our instance of piwik is not mean for big projects like wikipedia though, same issues than unique tokens philosophically but also the more concrete blocker that is extremely small scale. I will let milimetric add on I think that is the gist
[17:45:28] nuria_, yeah. I figured that piwik was out, but if you read my proposal, I suggest that we just temporally correlate pageviews of articles when they are featured on the main page.
[17:45:39] I think it could work pretty well.
[17:45:55] There's some methods there that we can borrow from some of nettrom's work.
[17:46:26] http://www-users.cs.umn.edu/~bhecht/publications/qualityimprovement_cscw2015.pdf
[17:46:44] He looked at the effect that featuring an article had on editing patterns.
[17:46:54] Could ask similar questions for viewing patterns.
[17:47:32] I'm thinking that you could use the pageview API to query for the view rates of features pages before and during "featured" status on the mainpage.
[17:47:46] And do breakdowns based on where on the mainpage a particular link if shown.
[17:48:05] It would give you a basic understand of what elements drive traffic. Clicks are implicit.
[17:52:42] halfak: ah, yes, that indeed seems doable is "featured" status is on the scale of "days"
[17:53:14] nuria_, yeah. There might be some day-wise overlap, but I think we can work with that either way.
[17:53:20] *seems doable IF "featured" status spans at least a day or more
[17:54:01] Signal/noise-wise, we should be able to detect a substantial perturbation even if the featured window is smaller than a day and doesn't overlap with the UTC date
[18:05:57] ottomata: oozie job to load cassandra seems a bit stuck: oozie job -info 0017356-160826130408204-oozie-oozi-C
[18:07:28] ottomata: how can i know if it is stuck due to teh disk problems we had earlier?
[18:07:31] *the
[18:10:56] hm, nuria not sure, it seems like things are running though, ja?
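(Editor's note: a minimal sketch of the pageview-API query halfak suggests in the 17:47 exchange above — pulling an article's daily view counts so the days before its main-page run can be compared with the featured window. The article title and date range are placeholders, not taken from the log.)
```bash
# Hedged sketch: daily per-article view counts from the public pageview API.
ARTICLE="Example_article"   # placeholder title
curl -s "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/${ARTICLE}/daily/20160901/20160915" \
  | jq -r '.items[] | "\(.timestamp) \(.views)"'
```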
[18:12:00] this looks almost done
[18:12:00] https://yarn.wikimedia.org/proxy/application_1472219073448_61063/mapreduce/job/job_1472219073448_61063
[18:34:55] ottomata: you are right ! it was stalled for a loong time
[18:36:31] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2623910 (Tgr) All four records seem to be from the same user (geolocated to US, looking at mobile jawiki, using IE 9), over the span...
[18:39:53] Analytics-Kanban: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2623925 (Sadads) good to know, that delays one of my projects then: which is fine, it looked like it might have been too early in next quarter anyway, Alex
[19:09:44] (PS15) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[19:17:50] (PS16) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[19:33:34] mforns: yt?
[19:33:42] nuria_, yes!
[19:34:03] mforns: when you are testing these chnages: https://gerrit.wikimedia.org/r/#/c/309604/2
[19:34:10] do you test on 1002 ?
[19:35:53] *changes
[19:36:11] nuria_, just a sec :]
[19:41:20] mforns: /user/milimetric/testing/mediawiki/tables
[19:42:34] hi nuria_, sorry was in a meeting
[19:42:41] mforns: np
[19:43:16] nuria_, those are the aqs changes right?
[19:43:23] ok
[19:44:13] nuria_, I've never tested any change on aqs repo.. I don't know
[19:44:40] mforns: ah sorry
[19:44:56] mforns: these: https://gerrit.wikimedia.org/r/#/c/308977/2//COMMIT_MSG
[19:45:01] mforns: too many chats!
[19:45:37] do you mean executing the unit tests or doing a "staging" test?
[19:45:41] nuria_, ^
[19:48:02] nuria_, oh! the RU change, those I know :]
[19:48:13] mforns: ya, sorry for teh typo
[19:48:14] *the
[19:48:21] nuria_, no, from your machine
[19:48:41] in the commit messsage there's instructions
[19:49:14] nuria_, do you want to try it together in the batcave?
[19:51:23] mforns: sure, give me 15 mins, i will ping you
[19:51:24] it's kind a tricky
[19:51:27] ok sure
[19:57:51] Analytics, Discovery-Analysis: [REQUEST] Extract search queries from HTTP_REFERER field for a Wikibook - https://phabricator.wikimedia.org/T144714#2624297 (Tbayer) PS, two more remarks: - A caveat about the proposed approach: Many referrals from Google come without the query part of the URL; which is...
[20:15:47] mforns: batcave?
[20:15:52] nuria_, omw
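(Editor's note: a small sketch of the kind of check discussed at 18:05-18:34 above, when the cassandra-loading coordinator looked stuck — the coordinator id comes from the log, while the grep pattern for the launched application is an illustrative assumption.)
```bash
# Hedged sketch; run from a Hadoop client node with the oozie and yarn CLIs available.
oozie job -info 0017356-160826130408204-oozie-oozi-C           # coordinator and action status
yarn application -list -appStates RUNNING | grep -i cassandra  # is the launched job still progressing?
```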