[02:49:24] Coren, you're on top of whatever it is that's happening?
[02:49:27] * andrewbogott just got 280 emails
[04:11:38] andrewbogott: That was supposed to have stopped hours ago. When was the last one you got?
[04:12:22] Coren: It's been almost 2 hours now.
[04:12:34] So probably it was resolved before I asked.
[04:13:13] You mean the gazillion sudo notices, right?
[04:27:10] Coren: No, they're things like 'GE 6.2u5: Job-array task 2083349.265 failed'
[04:27:22] ... I didn't get any of /those/.
[04:27:58] Starting… 9:06, ending at 9:15 CDT
[04:28:13] Some of them are just "Job xxx failed", most are "Job-array task xxx failed"
[04:28:43] example: https://dpaste.de/e0Xd
[04:28:43] I'm seeing a bunch of merlbot failed tasks, but no stuck queues.
[04:29:39] lemme look and see if they're all merlbot
[04:30:14] Hm, yep, looks like they're all that same user
[04:30:39] so, my mistake -- last time I got a flood like this it was from 200 different users. This time not so much, probably just one tool that's flapping
[07:59:08] Hi, is there anyone who can help me?
[07:59:26] I need to extract all text between (..) tags on Wikipedia from the database, and their instances of occurrence (news websites in particular).
[07:59:59] using MySQL syntax, thanks in advance.
[09:09:55] Coren: last night many of my merlbot jobs failed with: ERROR 145 (HY000) at line 3: Table './mysql/proc' is marked as crashed and should be repaired
[11:25:03] Coren: lots of spam from tools failing to start, Merlissimo's I think
[11:25:06] unsure what's happening
[11:26:08] YuviPanda: i got many sql client errors last night. nearly all like: ERROR 145 (HY000) at line 3: Table './mysql/proc' is marked as crashed and should be repaired
[11:26:27] hmm
[11:28:39] some also because of the still missing globalimagelinks, but that's my fault, because i have not disabled this one.
[11:29:59] YuviPanda: i think it's because of missing memory at the sql server, according to ganglia
[11:30:18] right, and I guess that's the replica databases
[11:30:25] Merlissimo: where did you look in ganglia? can you point me to it?
[11:30:56] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=labsdb1002.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1404559836&v=49295428&m=mem_free&vl=KB&ti=Free%20Memory&z=large
[11:32:14] the time of the mysql/proc crash error fits the graph
[11:33:04] ah
[11:33:06] right
[11:33:10] looks to be thrashing :|
[11:33:29] Merlissimo: there isn't much I can do about it, sadly :( I don't have access to these machines
[11:33:38] Merlissimo: can you open a bug, so I can cc the people most likely able to fix it?
[11:33:50] (andrewbogott, Coren and springle)
[11:34:36] i think the problem is caused by federated tables
[11:35:58] Coren is currently working on https://bugzilla.wikimedia.org/show_bug.cgi?id=59683 . Maybe we should wait until i can really test my scripts using the federated commonswiki
[11:37:32] last night i ran a query which used commonswiki_f_p.templatelinks on s5, which may have caused this error
[11:40:58] Merlissimo: hmm, ok.
[12:00:28] I blatantly took http://tools.wmflabs.org/whois/ for yet another tool.
[12:06:01] they need to bring overlordQ's whois tool to wmflabs :\
[12:34:48] I'm trying to install numpy in a virtualenv, but I'm getting this error: SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
[12:35:06] Can anyone tell me how to install these packages so I can install numpy using pip? Or is there some other workaround?
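For reference, the usual fix on Debian/Ubuntu is to install the Python headers and then retry pip; a minimal sketch (the virtualenv path is made up, and on a shared host like tools-login you'd have to ask an admin to install the package rather than use sudo yourself):

    # Install the C headers numpy's build needs, then retry inside the venv.
    sudo apt-get install python-dev     # python-devel on RPM-based systems
    source ~/myenv/bin/activate         # hypothetical virtualenv path
    pip install numpy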
[13:31:13] Wikimedia Labs / tools: create procedure to show all granted permission to foreign user for own user databases - https://bugzilla.wikimedia.org/67552 (merl) NEW p:Unprio s:normal a:Marc A. Pelletier normal users have no read access to mysql.db, mysql.table_priv and mysql.columns_priv "show...
[15:35:54] Wikimedia Labs / tools: Add some of the missing tables in commonswiki_f_p - https://bugzilla.wikimedia.org/59683#c15 (krd) I'm also stuck with one or another script on the "Unknown column 'page_content_model' in 'field list'" problem.
[20:03:33] scfc_de: setting up grafana on tools directly :) http://tools.wmflabs.org/grafana/index.html#/dashboard/file/default.json
[20:04:05] scfc_de: just got it to work, now need to do actual dashboards
[20:10:44] jesus christ grafana is fucking annoying
[20:14:18] I'm going to try out giraffe
[20:14:26] YuviPanda: Well ... You picked it :-).
[20:14:46] scfc_de: well, mostly because there was a WIP patch in puppet that ori started, so...
[20:15:26] scfc_de: either way, maybe it is nice, but the fact you've to build dashboards purely with their UI is pissing me off
[20:15:27] atm
[20:16:32] Doesn't feel very Unixy either. Even Jenkins can be configured by XML files IIRC.
[20:16:51] scfc_de: so it can be configured by JSON files, but you have to generate those through the UI
[20:16:52] so...
[20:17:07] at least giraffe doesn't go 'omg look you can edit teh dashboard!'
[20:17:18] oh wait, and it uses JSONP as well
[20:17:19] nice
[20:17:23] (giraffe I mean)
[20:20:38] scfc_de: aaand already running. http://tools.wmflabs.org/grafana/giraffe/index.html#dashboard=Demo&timeFrame=1d
[20:20:39] nice
[20:20:43] now to create giraffe
[20:20:44] as a tool
[20:35:07] scfc_de: metrics!
[20:35:08] scfc_de: http://tools.wmflabs.org/grafana/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1d
[20:35:11] scfc_de: but kinda useless atm :)
[20:38:43] scfc_de: http://tools.wmflabs.org/grafana/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1d slightly more useful
[20:45:13] !ping
[20:45:13] !pong
[20:45:14] ok
[20:52:01] YuviPanda: Is it possible to see individual disk usage with the current setup?
[20:52:26] scfc_de: yes, click the legend button on the top (first button after the '7w' indicator)
[20:52:34] scfc_de: also you can hover over the lines themselves.
[20:52:50] so this makes it fairly easy to look at the graph, figure out which ones are gonna die, and take action
[20:53:04] scfc_de: you might want to refresh, I've made a few changes
[20:53:38] scfc_de: going to add puppet freshness next :D
[20:54:25] "Fairly easy" yes, but I'm lazy :-). I need alerts via mail (or IRC).
[20:54:49] scfc_de: yeaaah, so that's going to be my next thing
[20:54:59] scfc_de: but totally unsure how to do that, to be honest.
[20:55:06] am considering writing my own daemon, but that just feels... wrong
[20:55:16] scfc_de: I looked at the cabotapp code, doesn't inspire confidence.
[20:55:40] plus they use... django model objects to store their checks, and have terrible data modelling even there
[20:59:06] YuviPanda: Sorry, I have no good ideas for that.
[20:59:15] scfc_de: :) 'tis ok, I'll figure something out
[20:59:28] scfc_de: but I think the dashboard is still a step up...
[20:59:31] from where we are now
[21:02:18] Well ... When the Ganglia server was up, it didn't look that bad for a quick overview. But if it isn't maintained ...
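The "alerts via mail" idea could be prototyped against Graphite's render API without a full daemon; a rough sketch, assuming jq and bc are available — the endpoint, metric name, threshold and recipient below are invented for illustration:

    #!/bin/bash
    # Poll one Graphite metric and send mail if it drops below a threshold.
    GRAPHITE='http://tools.wmflabs.org/graphite/render'   # hypothetical endpoint
    METRIC='tools.tools-webgrid-01.diskspace.root.byte_percentfree'  # made up
    LIMIT=10
    # Take the most recent non-null datapoint from the JSON response.
    latest=$(curl -s "${GRAPHITE}?target=${METRIC}&from=-10min&format=json" \
      | jq -r '[.[0].datapoints[] | select(.[0] != null)][-1][0]')
    if [ -n "$latest" ] && [ "$latest" != "null" ] \
        && [ "$(echo "$latest < $LIMIT" | bc)" -eq 1 ]; then
      echo "${METRIC} is at ${latest}" | mail -s 'metric alert' root
    fi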
[21:03:50] yeah
[21:09:57] Wikimedia Labs / tools: let sge respect job priority set by users (weight_priority) - https://bugzilla.wikimedia.org/67555 (merl) NEW p:Unprio s:normal a:Marc A. Pelletier Currently a toollabs user has no influence of the order jobs are executed. Please also set weight_priority so that user...
[21:15:55] scfc_de: hmm, what else do you think I should add to the default? puppet freshness has problems, I'll check on it later
[21:16:23] dunno if memory / CPU count as 'critical'
[21:19:12] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556 (merl) NEW p:Unprio s:normal a:None All issues that needs to be solved to run or enhance tools by merl
[21:19:54] Wikimedia Labs / tools: let sge respect job priority set by users (weight_priority) - https://bugzilla.wikimedia.org/67555 (merl)
[21:19:55] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556 (merl)
[21:20:10] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556 (merl)
[21:20:11] Wikimedia Labs / tools: Add some of the missing tables in commonswiki_f_p - https://bugzilla.wikimedia.org/59683 (merl)
[21:21:10] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556 (merl)
[21:21:10] Wikimedia Labs / tools: Provide user_slot resource in grid - https://bugzilla.wikimedia.org/52976 (merl)
[21:25:08] YuviPanda: Hmmm. CPU not so much, but memory was an indicator in the past on tools-login when someone ran bots there (against the rules). And the recent webgrid hiccup was also caused by the instances running out of memory?
[21:25:39] scfc_de: hmm, I wonder which memory indicator I should use
[21:25:42] MemFree?
[21:25:59] YuviPanda: Not a clue.
[21:26:12] scfc_de: yeah, looking around. I'll add a memory thing similar to the first two graphs
[21:26:26] scfc_de: also, looks like not a lot of jobs on webgrid-03 and -04, wonder if they are in error state
[21:27:27] scfc_de: also, mail queue size has been 17 forever, are they stuck?
[21:27:46] YuviPanda: Looking at "qstat -f", it just appears as if no one is starting new webservices :-).
[21:27:56] scfc_de: heh :) no way for us to 'move' 'em either, I suppose
[21:28:54] YuviPanda: Re mail, we have some spam from non-existent domains, and when that bounces, we can't deliver it. Also, some users set a non-existent mail address on their wikitech account, which means that mail from cron & Co. causes the queue to never go to zero.
[21:29:08] right
[21:29:28] YuviPanda: Well, you could reschedule the webservices with "qmod -rj", but if it ain't broken ...
[21:29:36] scfc_de: true, true
[21:29:40] let's let it be
[21:30:46] * YuviPanda removes redis clients and stuff from the dashboard, don't think that's necessary
[21:34:44] scfc_de: of course, as it stands this will be useless if tools goes down, since it is hosted on tools... :)
[21:36:40] YuviPanda: If we move graf* to its own webgrid/exec node, I'd take that chance :-). But a super-duper catch-all solution for all Labs projects would be preferable, of course.
[21:36:58] scfc_de: indeed, I suspect we'll have giraffe.wmflabs.org at some point
[21:37:06] scfc_de: with the dashboard config living in git as well
[21:41:23] scfc_de: can you do 'become giraffe' and see what happens?
[21:41:28] I added you
[21:41:45] Need to log out and in. Moment.
[21:42:08] scfc_de: ok
[21:42:17] Uh! More animals.
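For context, the inspect-then-reschedule dance on a gridengine master looks roughly like this ("qmod -rj" is the reschedule command mentioned above; the job id is invented — take the real one from the qstat output):

    # Look at the queues, then ask gridengine to reschedule a stuck job.
    qstat -f                # full listing, one section per queue instance
    qstat -u '*'            # jobs of all users
    qmod -rj 12345          # reschedule (hypothetical) job 12345 elsewhere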
[21:42:27] yay :)
[21:43:17] scfc_de: hmm, also webproxy is getting more traffic in than out, seems suspicious
[21:44:06] YuviPanda: 0.1%? 1%? 10%?
[21:44:23] scfc_de: I added a graph, see http://tools.wmflabs.org/grafana/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h
[21:44:34] like, 2x?
[21:44:36] or more
[21:44:53] scfc_de: free space on /tmp would be useful, because some tools like java (perfdata) or the broken path in the peachy framework increase its usage
[21:45:51] YuviPanda: +1 for /tmp.
[21:46:01] scfc_de: /tmp is part of root
[21:46:05] we don't have a separate partition
[21:46:27] YuviPanda: Hard to tell from the graphs if they align ...
[21:46:37] scfc_de: hmm, true.
[21:46:39] YuviPanda: Ah, yeah, forgot about that (/tmp = /).
[21:46:53] scfc_de: yeah, so the solution is to use an lvm volume for /tmp
[21:47:01] or do some /srv magic
[21:47:15] anyway, new giraffe URL: http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1d
[21:47:18] old one gonna go away now
[21:49:29] I like the LVM solution technically & Co., but judging from the hoops we have to jump through and the bootstrapping problem (it's only applied after the initial Puppet run), I think providing a flat / partition and some quota system would save a lot of nerves :-).
[21:53:37] well, then you've to deal with the quota system :)
[21:53:49] (Note that Linux doesn't seem to provide a per-directory quota at the moment, but that's what wishes are for :-).)
[21:53:50] I think the *actual* fix is to fix the *image* itself, perhaps.
[21:53:59] I wonder if we can set up lvm in the image itself
[21:54:07] definitely a thing to do during the trusty migration
[21:56:58] It would be nice if the Puppet filesystem setup were evaluated and done outside the instance; i.e. the LVM configuration is evaluated before instance creation (after you check "*::biglogs" on the instance creation page) and set up before boot. On every reboot, the LVM configuration is evaluated again, then files are automatically shuffled from one volume to another (if the LVM changes say so), then the instance is powered up again.
[21:56:58] Etc. Just magic.
[21:58:15] scfc_de: right. but that seems fairly magical :)
[21:58:52] scfc_de: what we can do instead is write a script that does this for us ('shuffle files around'). We can probably even use lsof or something to figure out if processes are holding on to file handles from /var/log
[21:58:59] and if so, idk
[22:01:07] Yeah, but that sounds like a /lot/ more trouble than having the instance in a known, powered-down state. Just imagine that thing having to cope with a request to move /bin ... :-)
[22:01:52] heh
[22:01:59] YuviPanda: BTW, with the new dashboard I see the difference between webnodes and webproxy as well. Huge.
[22:02:13] scfc_de: yeah, I dunno where that's from.
[22:02:27] scfc_de: btw, webnodes are only showing tx
[22:02:29] not rx
[22:02:31] Do we have that auto-compression enabled at the moment?
[22:02:43] we do actually
[22:02:44] gzip
[22:02:56] I don't know if it can account for this much discrepancy though
[22:02:57] But that's only on the proxy?
[22:03:02] ya
[22:03:09] let me check lighty's default config
[22:03:20] YuviPanda: ssl overhead?
[22:03:25] Can you graph RX on the proxy? That should align with the webnodes' TX somewhat.
[22:04:03] Merlissimo: That would increase the proxy's traffic compared to the webnodes. At the moment, the graph shows the opposite :-).
[22:04:13] scfc_de: ah, so webnodes' tx and proxy rx on the same graph? that makes sense, actually.
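One quick way to test the gzip hypothesis from any shell: request the same page through the proxy and directly from a webnode and compare the response headers (the URL is just an example):

    # If only the proxy compresses, the proxy's response carries
    # "Content-Encoding: gzip" while the webnode's does not.
    curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' \
        http://tools.wmflabs.org/giraffe/ | grep -i '^Content-Encoding'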
[22:04:16] the rx is from the webnodes.
[22:04:34] rx counts both requests from clients and data received from the webnodes themselves
[22:04:44] which makes sense, I guess?
[22:04:48] ah, sorry
[22:04:49] Merlissimo: shouldn't affect traffic stats, no?
[22:08:02] scfc_de: submitting another patch to cut down diamond's logging even more
[22:09:54] two!
[22:10:10] :-)
[22:11:10] scfc_de: should've done this before I turned on diamond in all of labs tho
[22:11:45] Spilled milk ...
[22:11:53] heh yeah
[22:12:00] always next time
[22:14:16] scfc_de / YuviPanda: if you have time, could you fix the simple sge config change request https://bugzilla.wikimedia.org/show_bug.cgi?id=67555 ? It would take only a few seconds ("qconf -msconf" opens an editor to change this value)
[22:15:06] scfc_de: ^ wanna do it? I... haven't really done much grid engine configuration at ll
[22:15:07] *all
[22:16:40] Merlissimo: ok, done. try?
[22:16:46] scfc_de: (no need, I changed the config value)
[22:16:49] Merlissimo: is 0.1 now
[22:17:10] looks good, thx YuviPanda
[22:17:34] !log tools changed grid scheduling config, set weight_priority to 0.1 from 0.0 for https://bugzilla.wikimedia.org/show_bug.cgi?id=67555
[22:17:36] Logged the message, Master
[22:17:43] Merlissimo: yw! mind if I close the bug?
[22:17:59] yes, can be closed
[22:18:03] YuviPanda: No, let's wait till it's puppetized. Otherwise, it'll get lost on the way to Denver.
[22:18:53] Wikimedia Labs / tools: let sge respect job priority set by users (weight_priority) - https://bugzilla.wikimedia.org/67555#c1 (Yuvi Panda) NEW>RESO/FIX Changed per request.
[22:19:05] scfc_de: 1. too late, 2. I don't know how well this can be puppetized, and I think it's going to be a long while, and when it does happen it'll just be puppetizing the status quo as such.
[22:19:20] grid engine does seem to be using these kinda things to change config all the time, rather than config files :|
[22:20:56] hmm, there's some puppetization....
[22:21:17] YuviPanda: Yeah, I didn't find any pre-existing SGE module for Puppet, so we'll probably need to do some of our own. The problem with the SGE config files is that due to their format, it's hard to tell what is a conscious configuration choice and what is just the default. So I'll just add that bug to my personal list :-).
[22:21:33] scfc_de: ty, and sorry to add to that.
[22:34:44] YuviPanda: all this config is stored in files by sge
[22:35:50] Merlissimo: true, but our puppet config doesn't have them yet.
[22:36:18] so we'll have to figure out a nice puppet structure to do that, then look at the current files and then move them to puppet, and then we can call them 'puppetized'
[22:36:53] !log tools cleared diamond archive logs on a bunch of machines, submitted patch to get rid of archive logs
[22:36:55] Merlissimo: Sure? I recently looked into that, and didn't find anything plain. There was some script (save_sge.sh?) that extracted all the config.
[22:36:55] Logged the message, Master
[22:37:52] the file modified by this bug must be located at ${SGE_HOME}/default/common/sched_configuration
[22:38:05] but i'm still searching for where sge_home is
[22:39:17] ah, /usr/share/gridengine
[22:39:35] YuviPanda: Challenge is to detect when a Puppet change is needed. If you store the config file, any reordering/etc. by SGE causes Puppet to update that file again and again. So I'd prefer if we have something à la "puppet_setting { 'x': ensure => present, value => 'y'; }" that looks at the current config's setting of 'x' and updates/removes that only if necessary.
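A check-before-change wrapper of the kind scfc_de describes could look roughly like this — a sketch only; the EDITOR-as-sed trick for scripting "qconf -msconf" is a common SGE workaround, not something from this channel, and the desired value mirrors the weight_priority change above:

    #!/bin/bash
    # Only touch weight_priority if the live scheduler config differs from
    # the desired value, so a Puppet exec wrapping this stays idempotent.
    desired='0.100000'
    current=$(qconf -ssconf | awk '$1 == "weight_priority" {print $2}')
    if [ "$current" != "$desired" ]; then
      # qconf -msconf normally opens $EDITOR on a temp file; pointing
      # EDITOR at a sed one-liner scripts the edit non-interactively.
      EDITOR="sed -i 's/^weight_priority.*/weight_priority $desired/'" \
        qconf -msconf
    fi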
[22:40:09] scfc_de: ah, right. I suppose then the thing to do is to prevent SGE from updating anything :)
[22:40:17] scfc_de: or we can have a custom script (eugh?)
[22:40:23] that sorts and diffs before applying
[22:40:53] scfc_de: I also want to kill the current incarnation of toolsbeta, and run it the same as toollabs, except from a different puppetmaster
[22:41:00] scfc_de: as it is, I think it's pretty useless
[22:42:18] Merlissimo: On tools-master, there's /var/lib/gridengine/default/common, but no sched_configuration.
[22:43:24] * YuviPanda manually cleaned up a lot of diamond archive logs, lot less dangerous now
[22:43:39] YuviPanda: On my own box, for SELinux, I use settings like "exec { '/sbin/semanage port -a -t postgresql_port_t -p tcp 5433': unless => '/sbin/semanage -o - | /usr/bin/fgrep -qx "port -a -t postgresql_port_t -p tcp 5433"'; }". Pack that into a custom type ...
[22:43:52] right
[22:44:04] YuviPanda: What do you mean about Toolsbeta?
[22:44:11] scfc_de: have you used it recently?
[22:44:14] the toolsbeta project
[22:44:27] Yes, from time to time to test patches.
[22:44:36] scfc_de: how do you test patches?
[22:44:46] without having them merged
[22:44:58] But I need to set up SGE before I can test some of the tougher ones.
[22:45:05] Self-hosted puppetmaster?
[22:45:07] oh
[22:45:15] toolsbeta is fully on a self-hosted puppetmaster?
[22:45:18] like, all the nodes?
[22:45:20] toolsbeta-puppetmaster3, I believe.
[22:45:27] There are only three or so at the moment.
[22:45:45] ah, hmm
[22:45:45] ok
[22:45:55] it wasn't the case when i last looked at it, which was... ages ago
[22:46:38] scfc_de: hmm, so https://gerrit.wikimedia.org/r/#/c/143861 was merged yesterday, but doesn't seem to have taken effect. I still see the old collector files, and we have no puppet staleness stats
[22:46:59] any idea? an ensure problem?
[22:47:21] YuviPanda: One moment. (Can you see any instances at https://wikitech.wikimedia.org/wiki/Special:NovaInstance for any project?)
[22:47:29] looking
[22:47:40] scfc_de: yes I can
[22:47:53] YuviPanda: The last comment on that change says: "This patchset was reverted in change: I045d854ffde5a29e7dc574204ef3303688a19f08"?
[22:47:58] oh
[22:47:59] lol
[22:48:35] what, why isn't the sudo rule everywhere?
[22:48:35] WHY
[22:48:38] it was part of the role!
[22:49:00] On the instance page, I see the projects, but if I set the filter, the sections under the headings "tools" and "toolsbeta" (or whatever I chose) are empty.
[22:49:24] scfc_de: hmm, I see tools on mine
[22:49:24] YuviPanda: Perhaps the prod diamonds?
[22:49:39] scfc_de: yeah, but this collector was included only on labs
[22:49:42] won't be running in prod
[22:52:41] Hmmm. Perhaps the collector was called before the sudo role was installed? But it's probably easier to just ask Coren.
[22:53:02] yeah, probably. I'll just wait for him, I guess
[22:53:07] I should go to sleep. it's almost 4:30AM
[22:53:10] Re :NovaInstance, just to clarify, you see tools-exec-*, etc., yeah?
[22:53:37] scfc_de: I do, yeah
[22:54:00] Well, then wikitech doesn't like me anymore :-). Good night!
[22:54:05] deployment-prep seems to have a self-hosted puppetmaster that may or may not be able to merge things properly.
[22:54:21] Coren: were all of them from deployment-prep?
[22:54:28] * YuviPanda logs in to check
[22:55:20] Yes.
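When a merged change doesn't seem to take effect on an instance, a dry run usually narrows it down; a sketch — the grep pattern is illustrative and the repo path depends on the puppetmaster setup:

    # Show what Puppet *would* change on this instance, without applying it.
    sudo puppet agent --test --noop 2>&1 | grep -i diamond
    # On a self-hosted puppetmaster, check the commit actually landed:
    cd /var/lib/git/operations/puppet && git log --oneline -5   # path varies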
[22:56:20] YuviPanda: (Logging in and out of wikitech made :NovaInstance work again.)
[22:56:20] Coren: something unrelated seems to be fucked up there,
[22:56:21] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::beta::puppetmaster for i-0000015c.eqiad.wmflabs on node i-0000015c.eqiad.wmflabs
[22:56:39] Coren: but on the puppetmaster, I see that the commit is still not reverted, so theoretically it should still be spamming you
[22:57:53] Coren: also, pretty graphs! http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h (from graphite)
[23:00:32] oh wow, I really should sleep. night.
[23:35:41] Coren: after some tests i don't think that the federated commonswiki_f_p is usable for tools in production. coping a whole table takes really long and the mysql server runs out of memory even with simple queries. i just cancelled a script that has been running for 4 hours now. many "mysql gone away" and other errors caused by low memory. The script with all queries runs 3-4 minutes on the old toolserver
[23:37:10] have a look for the last 24 hours at https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=labsdb1002.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1404602983&v=23730676&m=mem_free&vl=KB&ti=Free%20Memory&z=large
[23:37:41] Merlissimo: "Coping" = "Copying"? The federated tables are only intended for JOINs.
[23:38:14] yes, joins with dewiki
[23:39:11] Do you have an example query?
[23:45:53] scfc_de: prepared statement which needs a page_id as argument: http://pastebin.com/5bBWpXuY
[23:46:44] runs 20 seconds on ts, never successfully on labs
[23:47:11] you must use commonswiki_f_p for the labs version
[23:51:15] Merlissimo: That should be the proper way (with _f_p) to do so in Labs (and a reasonable query). Could you please file a bug and assign it to the DBA, Sean Pringle (springle@wikimedia.org)? I remember him writing something about the way the views are defined at the moment not allowing some optimization on non-federated queries, so he's probably the best one to see if (how) that query or the DB setup can be optimized.
[23:52:45] scfc_de: Marc is working on commonswiki_f_p this weekend, that's why i pinged him
[23:53:08] Ah, okay.
[23:53:12] (or he promised so)
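The pastebin query itself isn't preserved here, but the general shape of such a federated join — dewiki tables joined against commonswiki_f_p on the same s5 server — would be something like this (the table choice, join condition and page_id are invented for illustration, not Merlissimo's actual query):

    # Run a cross-database join on the s5 replica, using the per-tool
    # credentials file that Tool Labs provides.
    mysql --defaults-file="$HOME/replica.my.cnf" -h dewiki.labsdb -e '
      -- Hypothetical: Commons pages transcluding a template that shares
      -- its title with a given dewiki page; 12345 stands in for the
      -- page_id argument of the prepared statement.
      SELECT tl.tl_from
      FROM dewiki_p.page AS p
      JOIN commonswiki_f_p.templatelinks AS tl
        ON tl.tl_namespace = 10 AND tl.tl_title = p.page_title
      WHERE p.page_id = 12345;'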