[05:28:26] do we have a proxy checking tool?
[05:31:44] comets: no, as there would be legal issues
[05:32:31] i could have sworn we did on toolsever..oh well :(
[06:39:58] comets: I recall a proposal to move procseebot over to the labs a while back, but I can't find record of it.
[07:46:43] a930913: tried on the user's talk?
[07:47:32] I once asked him to share his methods a bit but he doesn't want to open source it, IIRC
[07:58:41] andrewbogott_afk: Coren hmm, labs internal network issues?
[07:58:54] ping diamond-collector
[07:58:55] From bastion1.eqiad.wmflabs (10.68.16.5) icmp_seq=2 Destination Host Unreachable
[08:01:02] hmm, fine now
[08:03:39] !log graphite rebooted diamond-collector
[08:03:41] Logged the message, Master
[08:39:13] Bug: On https://tools.wmflabs.org/xtools/ec/?user=Hangsna&project=sv.wikipedia.org there is a bug. At "Latest edit (global)" it cuts out the first letter of the comments
[09:18:51] s51072 has more than max_user_connections active connections, meh
[09:19:54] and i cannot debug that
[10:34:29] Coren: A user wants me to allow them to generate page lists from raw SQL (mainly for live database reports.) Any idea?
[10:36:36] a930913: fyi https://tools.wmflabs.org/personabot/
[10:41:34] Nemo_bis: That's quite predefined though. They want to run the database reports.
[10:47:55] let them create a wikitech account etc.?
[11:08:56] gifti: Not that savvy I think. They just want to be able to copy a query from the db reports, paste it in and have it return them a list.
[11:09:19] hm
[11:09:50] is it possible to turn the db reports into a tool?
[12:09:02] !log tools tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged
[12:09:05] Logged the message, Master
[13:55:39] Wikimedia Labs / tools: Add some of the missing tables in commonswiki_f_p - https://bugzilla.wikimedia.org/59683#c16 (Marc A. Pelletier) REOP>RESO/FIX Both views should now work properly.
[14:00:23] YuviPanda: Morning!
I saw some mails stuck from -proxy-test on -mail and wanted to fix that, but no puppet master is running on -proxy-test thus "sudo puppet agent apply -tv" fails. Any ideas?
[14:00:32] oh
[14:00:33] unsure
[14:00:40] I think -proxy-test should be a self hosted puppetmaster?
[14:00:48] It is.
[14:00:48] unsure what's happening
[14:00:55] I know that deployment-prep's puppetmaster is also failing
[14:01:01] so might be a general issue with self hosted puppetmasters?
[14:01:34] Dunno. Would it be very disruptive to reboot the machine, or will it catch up without problems?
[14:01:39] (-test, that is.)
[14:02:07] scfc_de: not disruptive at all, go ahead
[14:02:16] k
[14:02:27] scfc_de: I'm investigating why data is sometimes missing in http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h
[14:02:31] I don't think it's packetloss
[14:02:54] I can see that diamond is transmitting, but nothing in graphite UI
[14:02:55] weird
[14:03:14] You mean the graphs don't extend to "now"?
[14:04:21] scfc_de: yes, by varying numbers across hosts
[14:04:23] Okay, on -proxy-test the reboot didn't bring up a puppetmaster either. Hmmm. On toolsbeta-puppetmaster3, it's still running, so I'll take a look what's the difference between them.
[14:04:26] not all of them as well
[14:04:28] scfc_de: ok
[14:04:38] Heisenbugs! Love them.
[14:05:58] scfc_de: heh :)
[14:06:00] so I see 43733: 0, 0
[14:06:03] and lots of it
[14:06:05] in the whisper files
[14:06:11] so it's not even recording timestamp?
[14:08:06] I've never looked at diamond & Co. Wait till tomorrow for Rush?
[14:08:28] I asked in #graphite
[14:08:36] but waiting for rush is my other solution
[14:08:54] I might as well go back to reading Foundation
[14:09:08] scfc_de: In a way I'm glad I'm doing this first before setting up alerts. otherwise everything would be blaring by now
[14:09:51] :-)
[14:10:20] scfc_de: what's weird is that some hosts are fine.
the tomcat node has been consistently reporting data in graphite, for example. webproxy was reporting rx but not tx for a while, and now nothing
[14:11:44] http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1404655894.043&target=carbon.relays.diamond-collector-a.metricsReceived
[14:11:45] hmm
[14:12:03] On toolsbeta-puppetmaster3, /etc/init.d/puppetmaster exists as part of the package puppetmaster, which is missing on tools-proxy-test. It should probably have been installed, so I install it manually and see what happens.
[14:12:41] scfc_de: btw, you can consider -proxy-test useless, fwiw :) No problems if it dies
[14:14:01] YuviPanda: I assumed so, but it exhibits a bug with puppetmaster::self that I'd like to fix, and rather reproduce it on another machine ... (stuck mails from puppet cron jobs that complain about a non-existing directory /var/lib/puppet/reports).
[14:14:29] scfc_de: oh sure, feel free to. just saying, no need to treat it with kid gloves
[14:17:20] The private password repo is missing as well. It looks like initial run to set up puppetmaster::self aborted and left a mess behind. Hmmm.
[14:17:57] ah
[14:17:58] that's possible
[14:18:30] Wait! It is there, looked at the wrong screen.
[14:19:04] ah
[14:21:35] And a "git pull" only works after moving labs-puppet-key to ~/.ssh/id_rsa.
[14:21:59] I might've seen something similar in graphite-test
[14:22:00] as well
[14:22:09] where I suspect the private repo wasn't being updated
[14:22:49] Yes, and because the puppet repo relies on the phabricator password, then Puppet stalls.
[14:22:56] (I think.)
[14:23:00] right
[14:23:09] scfc_de: so I 'got around' it by commenting out the phab stuff
[14:23:18] also I might've gotten what's ailing diamond-collector
[14:23:19] LVM | -local--disk | busy 100% | read 47 | write 3102 | KiB/r 15 | KiB/w 4 | MBr/s 0.07 | MBw/s 1.22 | avio 3.18 ms |
[14:23:45] the disk can't keep up
[14:23:47] Is that on the aggregating machine?
[14:24:12] yeah
[14:24:21] doesn't surprise me at all
[14:24:27] now that I think about it
[14:24:29] Okay, after these changes Puppet ran through on -proxy-test with several changes to nginx, so it may have hosed it :-).
[14:24:31] they are but virtual disks
[14:24:40] scfc_de: yeah, it will :) that's ok
[14:24:55] scfc_de: our proxy role assumes the existance of an ssl private key, so will fail without it
[14:25:44] Nothing's responding on :80, so you're right :-).
[14:25:53] scfc_de: :)
[14:26:11] 42% iowait on diamond-collector
[14:26:15] vs <1% elsewhere
[14:26:16] boom
[14:26:38] looks like I'll have to write my own diamond collectors and collect only useful data
[14:27:00] And now something created /var/lib/puppet/reports, so the bug that I wanted to fix was an artifact of the broken puppetmaster install, and not something inherent to puppetmaster::self. Problem solved!
[14:27:21] :D
[14:27:23] right
[14:27:30] now to figure out why it broke? :)
[14:28:07] YuviPanda: Shouldn't that be a common problem with a common solution? I think we don't need that much historical data -- oh wait, that's the current data that's making the trouble.
[14:29:15] scfc_de: it's partly my fault, let's say I went overboard with metrics :)
[14:29:24] scfc_de: we're collecting NFS read stats on all toollabs machines, for example
[14:30:25] Yeah, but Ganglia could cope with that previously, couldn't it? (Though doesn't Ganglia aggregate some data locally before sending it over the network? Dunno.)
[14:30:36] scfc_de: no, I don't think Ganglia was collecting as much data
[14:30:47] scfc_de: also, ganglia *couldn't* cope at all. see ganglia.wmflabs.org :)
[14:30:59] :-)
[14:31:04] scfc_de: and we weren't sending NFS stats
[14:31:53] scfc_de: and in general... we are sending a *lot* of stats :)
[14:31:58] scfc_de: and from about 260 instances
[14:32:31] let me remove some more
[14:33:04] I might end up writing 'minimal' versions of most of the inbuilt collectors.
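[Editor's note: the atop line and iowait figures above admit a quick back-of-envelope sanity check. The sketch below is a rough model, not a measurement from the log: it assumes atop's default 10-second sampling interval, one small random write per metric file per flush, and whisper's 12-byte datapoints (struct '!Ld': uint32 timestamp + float64 value).]

```python
# Back-of-envelope: can the aggregating host's virtual disk keep up with
# one small random whisper write per metric per second?
# Assumptions (not from the log): atop samples every 10 s; carbon-cache
# does one write per .wsp file per flush; no cross-flush batching.

POINT_SIZE = 12                  # bytes per whisper datapoint ('!Ld')

def required_iops(metrics_per_sec, batch=1):
    """Random writes/s needed if each metric lands in its own .wsp file;
    coalescing `batch` points per file per flush divides the cost."""
    return metrics_per_sec / batch

# "write 3102" over one atop sample at "busy 100%" gives the disk's ceiling:
atop_interval = 10                        # seconds (assumed default)
achieved_iops = 3102 / atop_interval      # ~310 writes/s when saturated

incoming = 1000                           # metrics/s, measured later in the log
print(required_iops(incoming), achieved_iops)
# Roughly 1000 random writes/s wanted vs ~310 achievable: datapoints queue
# up in carbon-cache and graphs stop extending to "now" -- the symptom seen.
```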
oh dear
[14:33:22] like, disk space reports both bytes and inode counts.
[14:33:27] don't think we need the second one much
[14:34:23] and should also increase graphite queueing
[14:35:25] Depends. On Toolserver we ran several times out of inodes on /var because there was a "while sudo something; do sleep x; done" loop that created a new sudo log file every x seconds.
[14:35:42] ah, right
[14:35:53] I think increasing the queue size and making graphite pace itself better will also help
[14:39:52] hmm
[14:40:06] this needs... MATH!
[14:40:14] and also someone who can merge stuff
[14:41:26] hmm, so we get over 1000 metrics per second atm
[14:42:52] Would be interesting to see the corresponding stats for prod.
[14:43:27] scfc_de: I can look, actually
[14:43:28] moment
[14:45:33] scfc_de: heh, about 250k metrics in prod per min
[14:46:39] Ooops. Is that all written to disk?
[14:46:58] scfc_de: yeah.
[14:47:04] scfc_de: it's on tungsten tho, so bare metal
[14:47:23] scfc_de: this is also things like mediawiki profiler info, slow queries, etc
[14:48:55] So 250k + more? Wow.
[14:49:05] scfc_de: no, that's all included in the 250k
[14:51:21] Still impressive :-). Gotta go, be back later.
[14:51:28] scfc_de: cya! me too
[15:36:46] a930913: join -tech?
[17:08:12] a930913: Arbitrary SQL is inherently insecure, FWIW.
[18:57:34] Carmela: Well yeah, that's why I'm asking for ideas.
[20:09:59] * DeltaQuad raises hand for help with SGE and python
[20:13:45] !ask DeltaQuad
[20:13:45] Hi, how can we help you? Just ask your question.
[20:14:39] k, so I put my task in cron a while back. It converted it automatically to SGE
[20:14:54] and now I get error files saying: /usr/bin/python2.7: can't find '__main__' module in '/data/project/deltaquad-bots/DeltaQuadBot/UAA'
[20:15:04] i'm wondering what I"m doing wrong
[20:17:14] DeltaQuad: care to pastebin the entire traceback?
[20:17:58] that's all that exists in cron-tools.deltaquad-bots-1.err
[20:18:31] I can try and wrap it so it spits it all out
[20:18:33] standby
[20:18:51] DeltaQuad: what about in .out
[20:18:59] blank
[20:19:29] whats the crontab command?
[20:19:56] */10 * * * * /usr/bin/jsub -N cron-tools.deltaquad-bots-1 -once -quiet python /data/project/deltaquad-bots/DeltaQuadBot/UAA > $HOME/UAA.out 2&>1
[20:20:20] I originally put */10 * * * * python /data/project/deltaquad-bots/DeltaQuadBot/UAA > $HOME/UAA.out 2&>1
[20:21:08] DeltaQuad: FAIL
[20:21:23] or wait...FUCK.
[20:21:27] I see it
[20:21:31] /data/project/deltaquad-bots/DeltaQuadBot/UAA isnt a script
[20:21:32] it must of removed .py
[20:21:39] fml
[20:21:48] DeltaQuad: shit happens
[20:30:51] Also, the redirections won't do what you (probably) intended to. They output the output of the jsub command to those files; that SGE direct the output of the command to those as well is just conincidence.
[23:02:34] Coren: how long will ot take until the new commonswiki_f_p tables are visible to tool users?
[23:02:57] o_O it should already be the case.
[23:03:16] still no globalimagelinks on s5
[23:03:42] thats why i mainly reopened the bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=59683#c5
[23:05:53] globalimagelinks was already mentioned in the first comment
[23:14:43] any lab admins around?
[23:15:06] could someone grant me access to this instance pls? https://wikitech.wikimedia.org/wiki/Analytics/gp.wmflabs.org
[23:19:02] yurikR: you must ask one of the project admins, not labs server admins
[23:19:23] you can see a list right at https://wikitech.wikimedia.org/wiki/Nova_Resource:Analytics
[23:19:47] Merlissimo, true, but analytics are out and about...its sunday :) :)
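[Editor's note: the crontab line quoted at 20:19:56 has two independent bugs. The dropped `.py` means `python` was handed a directory, and a directory is only runnable if it contains a `__main__.py`, hence the error message. And `2&>1` is not the stderr-merge operator: the shell parses it as an extra argument `2` followed by `&>1`, which truncates a file literally named `1`; the intended operator is `2>&1`. As noted at 20:30:51, the redirection also only captures jsub's own submission output, while the job's output is written by SGE to ~/cron-tools.deltaquad-bots-1.{out,err}. A corrected entry (the `.py` filename is an assumption) plus a quick demonstration of the operator difference:]

```shell
# Hypothetical corrected crontab entry -- assumes the script is UAA.py:
#   */10 * * * * /usr/bin/jsub -N cron-tools.deltaquad-bots-1 -once -quiet python /data/project/deltaquad-bots/DeltaQuadBot/UAA.py >> $HOME/UAA.out 2>&1
#
# Demonstration that "2>&1" merges stderr into stdout (the broken "2&>1"
# would instead pass a stray argument "2" and clobber a file named "1"):
merged=$( { echo on-stdout; echo on-stderr >&2; } 2>&1 )
printf '%s\n' "$merged" | grep -c on-   # both lines were captured
```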