[02:48:01] I've noticed the pagecounts data set has stopped updating since last night. (https://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-07/). I've been looking for outage info to no avail. Does anyone have info on this? Tha
[02:48:13] nks
[03:25:01] on stat1002 the load average is about 1500 :o
[03:25:24] CPU maxed out
[03:25:36] ellery is running some phting, ezachte some perl
[03:25:46] users report pagecounts stopped being reported
[03:25:58] s/pthing/python
[03:30:12] i think we need to kill some processes
[03:30:22] it's breaking stuff
[03:43:29] mutante, uhoh
[03:43:54] did you kill the things? I only see Erik's stuff live now
[03:44:08] i killed some but it's not enough
[03:44:40] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - rogue processes - https://phabricator.wikimedia.org/T107404#1494008 (Dzahn) NEW
[03:45:43] Ironholds: the python processes were shown as using all the CPU but load is still not down
[03:45:53] that's odd. I'm only seeing the perl
[03:46:00] what is the java stuff by ashwinpp?
[03:46:18] 56 zombies
[03:46:23] from the runtime but low CPU load I assume Hive processes
[03:46:25] but i cant get "ps aux" to even finish
[03:46:28] basically it's a client waiting for stream data
[03:46:29] ...ack
[03:46:42] see, every time I go "and this is why we notify people in advance of big jobs" I get told off.
[03:47:10] this is why production cron jobs and manual work hosts dont mix :)
[03:47:43] users report problems against dumps.wm but that merely gets the stuff from here
[03:48:07] considers rebooting
[03:48:09] what do you think
[03:48:33] i mean: 20:43 < icinga-wm> PROBLEM - configured eth on stat1002 is CRITICAL: Connection refused by host
[03:49:00] ah, i forgot to add i also killed a ton of nagios procs
[04:09:07] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - rogue processes - https://phabricator.wikimedia.org/T107404#1494020 (Dzahn) pid 9367 is the bad one, it's fuse_dfs 9367 root 20 0 3815280 484636 1124 D 0.0 0.7 1592:52 fuse_dfs...
[04:10:23] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - fuse_dfs hangs - https://phabricator.wikimedia.org/T107404#1494022 (Dzahn)
[04:47:55] mutante, probably would just hunt the zombies rather than reboot
[04:48:05] only because Erik's scripts are also fairly vital and I dunno how resistant to restarts they are
[04:48:22] (plus, we have a lot of other production jobs that kick in early morning UTC to avoid the Erik Crush but still get done before SF kick up)
[07:08:20] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1494162 (jgbarah) In fact, the main reason would be simplifying (and making it more accurate) the retrieving of historical stats of almost any kind. For example, if at some...
[08:57:26] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1494236 (Aklapper)
[08:59:17] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1490798 (Aklapper) I've corrected the task desc to say "closed as resolved" (=task status change) instead of "moved to done" (=workboard column change) to avoid confusion.
[12:35:51] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1494552 (Nemo_bis) Importing closures would be very welcome. While at it, it would be good to fix the status of RESOLVED DUPLICATE reports, which were incorrectly migrated...
[13:38:24] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1494633 (Aklapper) [off-topic] >>! In T107254#1494552, @Nemo_bis wrote: > While at it, it would be good to fix the status of RESOLVED DUPLICATE reports Very different top...
[14:00:46] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - fuse_dfs hangs - https://phabricator.wikimedia.org/T107404#1494681 (Ottomata) I just did this: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_.2Fmnt.2Fhdfs ``` sudo umount -f /mnt/hdfs s...
[14:03:26] Hi ottomata !
[14:03:27] Do you give me 5 minutes in batcave to pick-up on news ?
[14:03:39] HIya!
[14:03:41] surerre!~
[14:03:56] there now
[14:04:02] hallo
[14:04:33] is there a convenient way to see how many edits were made in a given Wikipedia on a certain day?
[14:04:37] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - fuse_dfs hangs - https://phabricator.wikimedia.org/T107404#1494695 (Dzahn) It does look much better now. yes. thank you. load is way down and: 07:03 < icinga-wm> RECOVERY - configured eth on stat1002 is OK - interfaces up 07:03 < icinga-wm...
[14:04:39] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - fuse_dfs hangs - https://phabricator.wikimedia.org/T107404#1494008 (Dzahn) a:Ottomata
[14:04:41] Analytics-Engineering: stat1002 - extreme load - no more pagecounts - fuse_dfs hangs - https://phabricator.wikimedia.org/T107404#1494697 (Dzahn) Open>Resolved
[14:04:41] or at least a daily average?
[14:04:53] (and as a bonus, how many of them were done with VisualEditor)
[14:32:04] ottomata: Have you seen the email from stat1002 about cluster datasets ?
[14:32:13] I have checked them manually, they seem ok
[14:32:23] cluster datasets?
[14:32:34] webrequests
[14:32:44] no?
[14:33:02] well today, everything is wrong according to the email
[14:33:09] email subject?
[14:33:15] oh the usual email
[14:33:21] yes, daily recap
[14:33:31] oh, i know why
[14:33:37] this
[14:33:37] https://phabricator.wikimedia.org
[14:33:40] oops
[14:33:42] hdfs-mount issue ?
[14:33:44] https://phabricator.wikimedia.org/T107404#1494697
[14:33:45] yes
[14:33:51] that script uses the mount to send status
[14:34:02] ok makes sense
[14:34:12] where's my email though!?
[14:34:17] I ran it manually 2 minutes ago and everything seems fine
[14:34:35] k
[14:34:37] thanks
[14:34:41] can't help on the email searching ottomata ;)
[14:34:44] heheh
[14:57:41] ottomata: hullo
[15:00:19] Analytics-Kanban: {lama} Wikistats 2.0 - https://phabricator.wikimedia.org/T107175#1494784 (Pginer-WMF) For content Translation (#cx), it has been useful to check the [[ https://stats.wikimedia.org/EN/ChartsWikipediaCA.htm | "New articles per day" chart for a given language ]] in order to get an idea of how t...
[15:00:52] madhuvishy: hello!
[15:05:37] Analytics-EventLogging, Analytics-Kanban: Build .deb package for pykafka {stag} - https://phabricator.wikimedia.org/T106252#1494793 (madhuvishy) Verified on my local setup in mw-vagrant.
[15:05:52] ottomata: I verified the pykafka package build
[15:05:58] should i move the task to done?
[15:06:34] yes please!
[15:07:27] ottomata: okay done :) do you have sometime to pair today(after standup?) on figuring out what happens when one of balanced consumers die?
[15:08:17] i think so. today might be toughuhhhhh, I'm outside of NYC right now, and I was going to try to drive back to a cafe in NYC during lunch break, buuuuuut i might not be able to make it because of staff meeting
[15:08:42] Hmmm, okay let me know, we can do tomorrow too
[15:08:47] madhuvishy: maybe after staff? or tomorrow yeah
[15:08:52] ya cool
[15:21:53] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Give the option of using the same parameters for all reports for a given cohort {dove} [21 pts] - https://phabricator.wikimedia.org/T74117#1494830 (kevinator) Open>Resolved
[15:22:27] Analytics-EventLogging, Analytics-Kanban: Can Search up sampling to 5%? {oryx} - https://phabricator.wikimedia.org/T103186#1494831 (kevinator) Open>Resolved
[15:23:07] Analytics-Kanban, MediaWiki-extensions-ExtensionDistributor, Patch-For-Review: Set up graphs and dumps for ExtensionDistributor download statistics {frog} [3 pts] - https://phabricator.wikimedia.org/T101194#1494839 (kevinator) Open>Resolved
[15:24:42] Analytics-Kanban, Analytics-Visualization: {Epic} Community reads pageviews per project in Vital Signs {crow} - https://phabricator.wikimedia.org/T95336#1494846 (kevinator)
[15:24:43] Analytics-Cluster, Analytics-Kanban: {musk} Pageviews in Vital Signs - https://phabricator.wikimedia.org/T101120#1494845 (kevinator)
[15:24:45] Analytics-Kanban, Analytics-Visualization, Patch-For-Review: Update Vital Signs UX for aggregations {musk} [13 pts] - https://phabricator.wikimedia.org/T95340#1494843 (kevinator) Open>Resolved It works! https://vital-signs.wmflabs.org/#projects=all/metrics=Pageviews or search for "totals" when...
[15:35:09] Analytics-Kanban: Event Logging sends mysql consumer stats to statsd [5 pts] {oryx} - https://phabricator.wikimedia.org/T105935#1494882 (kevinator) a:madhuvishy
[15:36:32] Analytics-EventLogging, Analytics-Kanban: Build .deb package for pykafka {stag} [3 pts] - https://phabricator.wikimedia.org/T106252#1494887 (kevinator)
[15:39:02] Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1494896 (kevinator)
[15:46:02] Analytics-Cluster, Analytics-Kanban: Test Cassandra as a storage strategy {slug} [5 pts] - https://phabricator.wikimedia.org/T101786#1494920 (kevinator)
[15:54:16] Analytics-Kanban: Add User Agent refined data to intermediate pageview aggregation. - https://phabricator.wikimedia.org/T107436#1494943 (JAllemandou) NEW a:JAllemandou
[15:55:08] Analytics-Kanban: Add User Agent refined data to intermediate pageview aggregation. [5 pts] {hawk} - https://phabricator.wikimedia.org/T107436#1494953 (JAllemandou)
[16:15:21] (PS1) Joal: Add user_agent_map to intermediate pageview [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436)
[16:22:46] Analytics-Tech-community-metrics, ECT-July-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1495009 (Acs) http://korma.wmflabs.org/browser/gerrit_review_queue.html Fixed the name of the metric and the n...
[16:24:09] (PS1) Joal: Remove empty projects from intermediate pageviews [analytics/refinery] - https://gerrit.wikimedia.org/r/228011
[17:05:46] Analytics-Backlog: Host a debrief of EventLogging cleanup {tick} - https://phabricator.wikimedia.org/T104351#1495101 (mforns) Hey folks, I cancelled our meeting today, basically because the EventLogging audit process is still ongoing and unfinished. Kevin agreed on this, and we will create another meeting whe...
[17:43:34] milimetric: ready !
[17:43:39] Batcave :)
[17:43:49] omw
[17:47:40] Analytics-Backlog: create first RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107053#1495266 (Milimetric)
[17:48:01] Analytics-Backlog: create second RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107054#1495268 (Milimetric)
[17:48:21] Analytics-Backlog: create third RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107055#1495269 (Milimetric)
[17:52:27] Analytics-Backlog: create second RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107054#1495300 (Milimetric)
[17:53:13] Analytics-Backlog: create third RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107055#1495301 (Milimetric)
[18:00:35] (CR) Madhuvishy: [C: 1 V: 1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[18:01:07] (CR) Madhuvishy: [C: 1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/228011 (owner: Joal)
[18:10:42] Analytics-Backlog, RESTBase-API: create first RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107053#1495404 (mobrovac)
[18:14:24] Analytics-Backlog, RESTBase-API: create second RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107054#1495432 (mobrovac)
[18:14:53] Analytics-Backlog, RESTBase-API: create third RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107055#1495438 (mobrovac)
[18:30:20] mforns: I forgot, did you say to add you to that late meeting for dashboards?
[18:30:27] it's in 1.5 hours
[18:30:41] milimetric, yes, but I'm on a meeting with Adam Baso right now.. :]
[18:41:54] hello!
[18:42:01] phew, that was much longer than I thought it would take
[18:42:03] madhuvishy: hi
[18:44:01] ottomata: hullo
[18:44:24] i have to get lunch, and then i'm scheduled to work with leila for 3 hours
[18:44:31] o/ mforns
[18:44:40] I saw the EventLogging cleanup meeting go away
[18:44:40] think we can do it tomorrow? ottomata
[18:44:56] (CR) Ottomata: "So, we were already a little concerned about privacy without this dimension, right? Won't this make things worse?" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[18:45:04] oook, madhuvishy, ja lets do tomorrow
[18:49:17] (CR) Joal: "True..." [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[18:53:50] I'm grabbing diner, will be back soon
[19:03:39] halfak, sorry I was in another meeting
[19:04:11] No worries. I just saw that the event got canceled wanted to make sure everything is good.
[19:04:12] milimetric, ok! yes, I'd like to go to the VE meeting
[19:04:40] halfak, yes everything is ok, we just spoke about this in standup
[19:05:03] the EL audit isn't still finished, so we'd like to wait until the end, to have the debrief
[19:05:24] Sounds good. :)
[19:05:24] after that we'll reschedule the debrief meeting
[19:05:27] ok, cool
[19:05:56] halfak, just sorry that I cancelled it with so little advance
[19:06:57] Not a problem. I just had a weird open spot on my calendar and decided to get lunch but realized later that I was supposed to have a meeting there. :)
[19:07:05] Happy surprise that I had time to get lunch though :)
[19:11:16] halfak, :] glad you got some time to "stop"
[19:44:25] (CR) Ottomata: "Also, how much larger does this make the dataset?" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[19:55:28] (CR) Mforns: [C: 1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[19:55:44] (CR) Joal: "testing over one hour: +35%" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[19:56:44] (PS2) Joal: Add user_agent_map to intermediate pageview [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436)
[20:02:48] (CR) Mforns: [C: 1] Remove empty projects from intermediate pageviews [analytics/refinery] - https://gerrit.wikimedia.org/r/228011 (owner: Joal)
[20:05:19] milimetric, o/
[20:05:24] hi mforns
[20:05:29] i'm in the meeting...
[20:05:35] what?!
[20:05:37] I got kicked out
[20:05:43] mmm, I think we are in parallel universes
[20:05:46] :]
[20:05:47] hm... now I'm in it again
[20:05:47] ok
[20:39:34] (CR) Joal: "Here the user agent data is not raw, it is parsed." [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[20:39:43] off Guys I'm off for today !
[20:39:55] See y'all tomorrow
[20:39:57] :)
[20:42:08] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[20:46:09] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[20:50:28] Analytics-Backlog, Analytics-Dashiki, VisualEditor: Improve the edit analysis dashboard {lion} - https://phabricator.wikimedia.org/T104261#1496079 (Milimetric) Suggestions for how to get at these improvements, in case someone wants to work on them before we prioritize: All visualizers use Timeseries...
[20:54:09] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[20:58:18] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[21:00:49] Analytics-Kanban: {lama} Wikistats 2.0 - https://phabricator.wikimedia.org/T107175#1496124 (Milimetric) By the way, Pau commenting reminded me that I should mention his Dashboard Directory work. So Pau designed this awesome UX for how people could find different dashboards: T92502. We are definitely going t...
[21:02:00] Analytics-Kanban: {lama} Wikistats 2.0 - https://phabricator.wikimedia.org/T107175#1496134 (Milimetric)
[21:02:46] Analytics-Kanban: {lama} Wikistats 2.0 - https://phabricator.wikimedia.org/T107175#1488807 (Milimetric)
[21:08:48] (CR) Mforns: "The thing is, the user agent info is still in the raw pageviews right? So unless we want to remove it from there too, we aren't adding new" [analytics/refinery] - https://gerrit.wikimedia.org/r/228010 (https://phabricator.wikimedia.org/T107436) (owner: Joal)
[21:11:44] Analytics-Kanban, Team-Practices-This-Week: Get regular traffic reports on TPG pages - https://phabricator.wikimedia.org/T99815#1496175 (ggellerman) Joel, Kevin L and Grace meeting on Fri July 31, 2015
[21:21:11] milimetric, do you have a min?
[21:21:39] what's up mf
[21:21:44] mforns :)
[21:22:00] hehe :]
[21:22:06] I need some help
[21:22:10] sure
[21:22:33] I'm a little bit confused with which limn- projects are currently running
[21:22:43] are those: https://gerrit.wikimedia.org/r/#/admin/projects/?filter=limn- all of them?
[21:22:45] it's in puppet, hang on
[21:23:14] if I search in github, I find also, i.e. limn-fundraising-data
[21:23:20] that's the only disabled one: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/statistics.pp#L58
[21:23:35] oh no, only the ones under there in puppet are running
[21:24:05] mobile is disabled?
[21:24:20] yes
[21:24:28] oh! ok
[21:24:29] it's just mobile that you're working on, right?
[21:24:48] yes, but I will need to check and maybe fix all the others
[21:25:01] so I'm creating tasks in phab for all of them
[21:25:14] mmm, sounds like scope creep :)
[21:26:21] mmm, well at least is something I can do, and do not need to wait for other teams
[21:27:04] and limn-edit-data needs no modification
[21:27:09] right
[21:27:21] well, fair enough, but remember not to make too much work for yourself
[21:27:30] and mobile is almost done
[21:27:42] ok
[21:27:54] thanks!
[21:33:32] Analytics-Backlog: Check and potentially timebox limn-flow-data reports {Tick} - https://phabricator.wikimedia.org/T107502#1496348 (mforns) NEW
[21:37:58] Analytics-Backlog: Check and potentially timebox limn-language-data reports {Tick} - https://phabricator.wikimedia.org/T107504#1496371 (mforns) NEW
[21:40:06] Analytics-Backlog: Check and potentially timebox limn-extdist-data reports {Tick} - https://phabricator.wikimedia.org/T107506#1496391 (mforns) NEW
[21:44:32] (PS2) Mforns: Clean up mobile-reportcard reports and dashboards [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/227911 (https://phabricator.wikimedia.org/T104379)
[21:45:21] (CR) Mforns: "Now ready for review!" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/227911 (https://phabricator.wikimedia.org/T104379) (owner: Mforns)
[21:46:00] good night folks, see you tomorrow!
[21:46:47] laters!
[21:58:42] leila: I'm on fifth floor if you need me
[22:06:07] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[22:07:00] madhuvishy: will ping or stop by once I have the couple of options. I'm sure I'll have questions in the mean time, too.
[22:07:27] leila: sure :)
[22:08:16] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[22:14:32] Analytics, Analytics-Backlog, Fundraising-Backlog, Wikimedia-Fundraising: Provide performant query access to banner show/hide numbers - https://phabricator.wikimedia.org/T90649#1496625 (atgo)
[22:39:23] Analytics, Analytics-Backlog: Provide performant query access to banner show/hide numbers - https://phabricator.wikimedia.org/T90649#1497001 (atgo)
[22:39:42] Analytics, Analytics-Backlog, Fundraising research: Provide performant query access to banner show/hide numbers - https://phabricator.wikimedia.org/T90649#1064105 (atgo)
[22:40:26] Analytics, Fundraising-Backlog, Wikimedia-Fundraising: Public dashboards for CentralNotice and Fundraising - https://phabricator.wikimedia.org/T88744#1497015 (atgo) p:Normal>Low
[22:47:25] milimetric: around?
[23:21:55] madhuvishy: you should come back to 3. :-)
[23:22:04] leila: coming :)
[23:22:11] close to Mushroom
[23:24:29] aah coming