[00:24:13] Wikimedia Labs / tools: Make the iptables initialization occur at boot time - https://bugzilla.wikimedia.org/53181 (Tim Landscheidt) PATC>RESO/WON [01:34:31] my labs-vagrant instance seems to be a bit borky [01:34:40] Error: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [01:35:57] Oct 7 01:34:25 ogvjs-testing kernel: [ 949.564469] init: hhvm respawning too fast, stopped [01:35:58] hrmmmm [01:40:28] lemme try switching back to zend [01:48:33] that works for now [02:12:56] andrewbogott, https://wikitech.wikimedia.org/wiki/Special:NovaProxy is a bit strange. [02:13:15] it doesn't seem to do anything useful [02:13:27] just lists the projects I selected as linked headings [02:14:14] also, too much escaping on Special:NovaProject?action=configureproject [02:44:57] hello, I’d like to request access to tool labs, but the request hasn’t been processed yet, could anybody help me? my lab account is sophiejjj [03:51:00] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731 (Santhosh Thottingal) NEW p:Unprio s:normal a:None Created attachment 16687 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16687&action=edit Screenshot showing empty instance lis... [06:03:30] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735 (Nemo) NEW p:Unprio s:major a:None Internal error [e96affb6] 2014-10-07 06:01:50: Fatal exception of type MWException The file (a copyvio) no longer has a duplicate on Commons, bu...
[08:43:07] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (22.22%) [08:57:56] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [09:06:44] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735 (Andre Klapper) p:Unprio>High [09:15:50] andrewbogott_afk: Coren: Just created a new labs instance with precise, it fails to do the initial provisioning thus denying ssh access [09:15:51] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=integration&instanceid=b65a604d-40ef-4b16-b527-bfb862ca3904&region=eqiad [09:15:55] Oct 7 08:55:28 integration-slave1004 nslcd[1059]: [3c9869] failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out [09:16:04] seems it still wants to do virt0 ? [09:21:08] seems it falls back to virt1000 so it should be fine [09:21:08] Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [3c9869] connected to LDAP server ldap://virt1000.wikimedia.org:389 [09:26:30] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741 (Krinkle) NEW p:Unprio s:normal a:None Creating a new instance with the precise image fails and leaves the instance inacces... [09:28:25] hashar: ^ [09:30:31] Krinkle: thanks [09:30:43] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c1 (Antoine "hashar" Musso) I suspect the labs image for Ubuntu Precise hasn't been updated to take in account the recent LDAP changes...
[09:31:03] Krinkle: the ops might be able to log in the instance and see what is going on [09:32:08] !log Apply I44d33af1ce85 instead of Ib95c292190d on integration-puppetmaster (remove php5-parsekit package) [09:32:09] Apply is not a valid project. [09:38:00] Wikimedia Labs: WMFLabs Graphite: Dashboard is empty (Uncaught exception in javascript) - https://bugzilla.wikimedia.org/71742 (Krinkle) NEW p:Unprio s:critic a:None http://graphite.wmflabs.org/ ext-all.js:7> Uncaught SyntaxError: Unexpected end of input (index):29> Uncaught TypeError: Cannot... [12:41:05] jackmcbarn: you around? [13:46:43] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c2 (Andrew Bogott) I just tested this a moment ago, and it worked fine for me. I installed a new precise base image on Friday that use... [14:08:15] Betacommand: i am now [14:11:35] Hi ! Is beta instance accessible directly to the outside world ? Like - we were planning to make a polonium equivalent for labs to test the bounce handler actions - but we were curious whether our mx would be accessible from the outer world ? [14:12:10] tonythomas: if you assign it a public IP address, it would be [14:12:38] Jeff_Green: that would do ? [14:13:23] YuviPanda how do we assign a public IP to an instance? [14:13:58] Jeff_Green: there's a 'manage addresses' link on the sidebar in wikitech [14:14:27] Jeff_Green: you just 'allocate IP' and then assign it to an instance, and then assign a DNS A record to it if you want [14:14:38] orly. ok great [14:14:42] Jeff_Green: and if you are out of IP quota, you can ask andrewbogott to increase it. [14:15:10] i didn't realize it was that straightforward, great! [14:15:35] Jeff_Green: :D [14:16:11] Jeff_Green: unrelated, but if you just want a public HTTP/HTTPS interface, there's a 'manage web proxies' on the sidebar that lets you do that as well.
proper https + spdy for free with that. [14:16:17] not useful for mail, but might be useful otherwise [14:16:31] ok [14:16:53] Jeff_Green: so we power up an mx instance ? [14:17:09] tonythomas: yup [14:17:17] Jeff_Green: should I ? [14:17:20] sounds as though once it's running I can assign it an IP [14:17:20] sure [14:17:34] * tonythomas got 2 in stock though :D [14:17:59] Jeff_Green: I don't know if you want to add MX records to it, though. Wikitech interface only lets you add A records. [14:18:36] the interface administers the DNS zone directly? [14:19:23] we can probably do most testing without an MX record [14:21:08] Jeff_Green: the interface just lets you add an A record, and nothing more. [14:21:14] ok [14:22:17] Jeff_Green: so - just create an MX instance - right ? [14:22:22] right [14:22:28] in the mediawiki-verp project would be good ? [14:22:42] do we need that in the beta project ? [14:23:00] i thought beta project? [14:23:16] I dont have create rights there :( I think [14:23:33] oh, ha. ok one moment [14:25:33] * Jeff_Green checks if I have access there... [14:26:09] k :) [14:26:30] beta == deployment-prep right? [14:26:43] yeah ! [14:27:23] I will be back in ~30 mins ( dinner ) [14:31:46] "failed to allocate new public IP address" [14:32:05] Jeff_Green: yeah, not enough quota [14:32:17] andrewbogott: ^ can you increase deployment-prep IP quota by 1? [14:32:21] "an error has occured" [14:32:50] yeah, wikitech has a very usable, intuitive interface with descriptive error messages... [14:33:21] * Jeff_Green wants to change that message to "what." [14:34:56] done [14:35:07] And, the message isn't /that/ cryptic. You are allowed to check your own quota. [14:35:44] the message doesn't say what caused the error, that's the part that's missing [14:36:11] i got a similar error trying to create a new instance, and it worked on the third try, no idea what was happening on the backend [14:37:21] i.e.
"Failed to allocate new public IP address, you've run out of IPs. See [wikitech link about requesting IPs]" [14:37:41] andrewbogott: btw, what happened to horizon? [14:37:53] YuviPanda: what do you mean? [14:38:01] andrewbogott: you were experimenting with it at some point, right? [14:38:05] ^^^[2] thanks for allocating ips [14:38:07] or was I tripping and seeing things? [14:38:15] Yeah, but it'll be the work of many many months to actually replace OSM with it [14:38:20] It doesn't have any of the features we use [14:38:26] *any*? [14:38:44] pretty much [14:38:49] sigh [14:38:55] it's basically a sketched-in framework. [14:39:08] hmm, and we'd have to build plugins / modules... [14:39:13] at least it's python and not PHP [14:43:19] jackmcbarn: sorry got pulled away from my desk [14:44:23] jackmcbarn: if you want a bug to fix: https://bugzilla.wikimedia.org/show_bug.cgi?id=63601 is a good one [14:44:48] Betacommand: that's actually something i'm already working on [14:45:56] I see your gmail hack with your bugzilla email address :P [14:46:18] if i start getting spam i like to know where they got my address :p [14:51:53] jackmcbarn: Just making a cheeky comment, I do the same thing at times [14:53:10] yuvipanda do you handle the web proxies? [14:53:20] for some definition of handle, sure :) [14:53:20] 'sup [14:54:16] can you add https://bugzilla.wikimedia.org/show_bug.cgi?id=71120 to the blocked UA list? Im seeing 60+% of web activity from them [14:54:38] Betacommand: hmm, we don't actually have a blocked UA list, but I could whip one up... [14:54:42] let me write a patch [14:55:44] yeah, these spammy web crawlers are putting excessive load on labs for no reason. Im already serving 403's to them but they just ignore it [14:56:06] hmm, ok [14:58:44] YuviPanda: Im probably one of the only people who actually uses their access.log and that spider UA fills 60%+ of it [14:59:39] oh hey guys [14:59:57] is hhvm totally borked on labs-vagrant?
i had to switch my server to zend to get it working again [15:00:25] brion: heh! shouldn't be... can you update your vagrant? git pull on /vagrant? [15:00:52] YuviPanda: well logs were something about ‘hhvm no longer supports build-in web server, use fastcgi’ [15:01:09] brion: huh, that was fixed like... many many months ago, I think... [15:01:15] it might have just been stuck in an inconsistent state though [15:01:19] brion: yeah [15:01:25] brion: also, trusty or precise? [15:01:28] lemme try switching it back now that it’s fully provisioned [15:01:55] ok [15:02:04] trusty [15:02:17] ah ok [15:03:08] ok when provisioning i get an error: [15:03:09] Error: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [15:03:09] Error: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]/ensure: change from stopped to running failed: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [15:03:21] and the web server returns 503s only http://ogvjs-testing.wmflabs.org/wiki/Demo [15:04:14] oh wait *self-slap* [15:04:18] forgot to update vagrant [15:04:38] i only updated mediawiki :D [15:04:45] * brion whistles innocently and walks away [15:05:12] brion: ah :) [15:05:54] Error: mwscript importDump.php --wiki=wiki /vagrant/puppet/modules/labs/files/labs_privacy_policy.xml returned 255 instead of one of [0] [15:05:54] Error: /Stage[main]/Role::Labs_initial_content/Mediawiki::Import_dump[labs_privacy]/Exec[import_dump_labs_privacy]/returns: change from notrun to 0 failed: mwscript importDump.php --wiki=wiki /vagrant/puppet/modules/labs/files/labs_privacy_policy.xml returned 255 instead of one of [0] [15:06:11] hmmmm and now something’s awry with the wiki: “Class undefined: WebVideoTranscode “ [15:08:11] bah [15:08:19] maybe something ate TMH [15:08:34] brion: switch back to zend, see if the problem goes away? [15:10:39] waiting on puppet… dum de dum [15:12:59] ok back to zend and ….
it’s fine :( [15:13:06] brion: :( file a bug? [15:13:13] brion: I haven't touched vagrant in a while [15:13:43] yeah i’ll see if i can narrow it down a little [15:13:51] brion: ok [15:14:03] brion: the import dump is kinda a knownish issue, but I thought that was fixed [15:14:08] brion: basically import not working with hhvm [15:14:13] bah [15:14:30] brion: but I thought that was fixed [15:14:36] what the [15:15:01] ok it started working under hhvm again [15:15:04] Jeff_Green: back [15:15:11] i’ll chalk it up to ‘puppet is magical and full of ghosts’ [15:15:17] brion: haha ;) [15:15:17] brion: ok [15:15:23] Betacommand: merged and deployed, btw [15:15:30] so - we are ready to create the instance ? [15:15:58] Wikimedia Labs / tools: Block TweetmemeBot UA - https://bugzilla.wikimedia.org/71120 (Yuvi Panda) PATC>RESO/FIX [15:16:48] brion: If you had not updated the mw-v puppet code in a while there were several fixes for hhvm config that were needed for the latest hhvm binaries. [15:17:05] Also I hope the dump import bugs are fixed now [15:17:39] And everyone needs to start testing everything on hhvm because as of yesterday 1% of anon traffic is being served by hhvm in production [15:17:59] and the plan is to have 10% on hhvm next week [15:19:37] bd808: yeah i think some of the updates didn’t fully track right because i left the thing alone for a few weeks [15:19:53] * bd808 nods [15:19:55] seems ok now though *fingers crossed* [15:20:23] Getting idempotent updates to run correctly from any initial state turns out to be hard :) [15:23:02] brion: Puppet isn't "magical", it's "eldritch" [15:23:20] lol [15:24:15] * bd808 checks under desk for tentacles and portals to the netherworld [15:24:58] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c1 (Sam Reed (reedy)) (In reply to Santhosh Thottingal from comment #0) > Created attachment 16687 [details] > Screenshot showing empty instance list > > I am not
able to see the instance listing f... [15:26:04] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c2 (Yuvi Panda) Indeed, turning it off and on fixes it... I'm unsure why this is happening, though. [15:30:48] once I assign a public IP to an instance, where do I configure filtering on the incoming traffic? [15:31:18] Jeff_Green: you can filter by IP with 'security groups' in 'manage security groups' [15:31:30] looking [15:31:48] Jeff_Green: but openstack itself is stupid, and you can't add or delete a security group from an instance, so first you'd have to create a security group, and then add the instance with that security group, and then you can modify the rules there... [15:32:00] Jeff_Green: you can also just allow everything in security groups, and just filter with ferm/iptables [15:32:20] ok [15:32:26] makes sense, thanks [15:37:09] YuviPanda I don't see where I can add an instance to a security group? [15:41:12] the wikitech docs make it look as if you have to have the security group made before you create the instance? [15:45:28] "once the group has been created it will be available in the “Add Instance” form under the “Manage Instances” section." [15:45:58] Wikimedia Labs / deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - https://bugzilla.wikimedia.org/70141#c18 (Greg Grossmeier) ASSI>RESO/FIX (In reply to Greg Grossmeier from comment #17) > Yuvi: Thanks for the first pass work! Once you remove yourse... [15:46:26] * Jeff_Green starts over. [15:48:13] YuviPanda now I see what you were saying [15:52:20] YuviPanda: thanks [15:53:28] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c3 (Andrew Bogott) OK -- that last comment was both right and wrong. New instances /do/ work. But there's still a smattering of virt0...
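The second option above — open the security group up and filter on the instance itself — can be sketched as a handful of iptables rules. A minimal sketch for a labs MX host: the ports and the 10.68.0.0/16 labs-internal range are illustrative assumptions, not the project's actual policy, and the script echoes the commands instead of executing them so the ruleset can be reviewed first.

```shell
#!/bin/sh
# Sketch: per-host filtering for an MX instance once the security group
# passes all traffic. Ports and the labs-internal range are assumptions;
# the rules are printed, not applied.
emit() { echo "iptables $*"; }

emit -A INPUT -i lo -j ACCEPT                                # loopback
emit -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
emit -A INPUT -p tcp --dport 25 -j ACCEPT                    # inbound SMTP
emit -A INPUT -p tcp --dport 22 -s 10.68.0.0/16 -j ACCEPT    # SSH, labs-only
emit -A INPUT -j DROP                                        # default deny
```

In practice the same policy would usually live in a ferm/puppet config so it survives a rebuild of the instance.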
[16:00:36] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c3 (Santhosh Thottingal) (In reply to Sam Reed (reedy) from comment #1) > It's a known session bug... If you log out, and back in again, it should fix > it for you Worked when logged out and logged... [16:02:27] YuviPanda: we dont have a polonium ( mx ) role readily in labs configuration right ? [16:02:38] we will have to manually edit ? so self::puppetmaster ? [16:04:47] andrewbogott: you around ? [16:04:55] tonythomas: yes, but in a meeting [16:05:31] ok. anyway - if you could look into my previous query -- do we have a ready made role::mx available as in configure instance in wikitech? [16:05:37] or we have to do it manually ? [16:06:32] I don't know -- best to look in the puppet source. [16:06:51] woo progress! telnet: connect to address 208.80.155.193: Connection refused [16:07:03] Jeff_Green: the mx is installed ? [16:07:16] it was on the previous instance, looking [16:07:29] but it looks like we're at least getting to the instance [16:07:46] ok. and andrewbogott : if its not there in wikitech -- then go for self::puppetmaster right ? [16:08:25] yeah exim is installed, but configured only outbound [16:08:29] tonythomas: if a class is available in puppet then you can add it to the wikitech interface for a specific project. https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [16:10:38] andrewbogott: I will try that one out [16:11:01] i just added role::mail::mx [16:11:54] Jeff_Green: and its puppet-applying ? [16:12:04] just added to the instance. running puppet [16:12:25] okey :) [16:12:26] and boom [16:12:28] fail.
[16:12:29] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735#c1 (Tim Landscheidt) NEW>RESO/DUP *** This bug has been marked as a duplicate of bug 71208 *** [16:12:40] conflicting modules [16:12:43] Wikimedia Labs / wikitech-interface: Not possible to delete files - https://bugzilla.wikimedia.org/71208#c2 (Tim Landscheidt) *** Bug 71735 has been marked as a duplicate of this bug. *** [16:12:55] yeah. How I done that one was [16:13:19] removing mail::sender from default { in site.pp [16:13:42] err. let me check that one again [16:13:50] Jeff_Green: yeah, mail based roles fail in labs because labs standard role includes a similar role... [16:13:52] anyway the conflicting one was a role::mail::sender [16:13:58] I guess using ensure_ in both places would be useful, perhaps... [16:14:20] one was there in role/labs.pp [16:14:32] YuviPanda i see [16:15:00] Jeff_Green: the conflict would be with role/labs.pp [16:15:14] yep, looking [16:15:24] this could get interesting [16:15:47] Jeff_Green: yup [16:15:51] :) [16:16:01] role::labs::instance includes role::mail::sender [16:16:23] yeah. I removed that one from role/labs.pp [16:16:30] right [16:17:05] my goal is to keep this integrated with normal labs puppet [16:17:43] yeah. now puppet apply is running ? [16:17:59] it's trying anyway :-) [16:18:11] :) [16:18:28] dies on the conflict between underlying classes in role::mail::sender and role::mail::mx [16:19:08] i don't suppose we can feed a class parameter to role::labs::instance :-( [16:19:30] it's setup by ldap, so I suppose not [16:20:12] Jeff_Green: you can add a global reference with a default to that class and then send that value via ldap. Maybe. [16:20:26] I don't know about order of operations though [16:20:34] You could also be a hiera pioneer [16:20:45] * Jeff_Green dies [16:20:46] I guess using ensure_* in both places might not be a bad idea....
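The `ensure_*` idea floated above would only help where both roles want the same declaration. A rough sketch with a hypothetical shared class name — the real manifests differ, and this does not solve the case where sender and mx genuinely need different exim configurations:

```puppet
# Hypothetical: both mail roles declare the shared class via
# ensure_resource(), so whichever role is evaluated second becomes a
# no-op instead of raising a duplicate-declaration compile error.
# Note this only works while both declarations carry identical parameters.
class role::mail::sender {
    ensure_resource('class', 'exim4', {})
}

class role::mail::mx {
    ensure_resource('class', 'exim4', {})
}
```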
[16:21:00] * Jeff_Green does not want to be a hiera anything :-) [16:21:18] Jeff_Green: I think there is one more role::labs::sender in site.pp ? [16:21:56] under class standard :{ [16:22:21] oop [16:23:01] mark has also been working on labs mail handling, might be good to coordinate with him if you actually write any puppet code [16:23:44] andrewbogott: thanks, I'll check in with him now [16:26:13] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c4 (Tim Landscheidt) IIRC Andrew once said that the authentication tokens for MediaWiki and OpenStack time out at different times, or something like that. [16:27:13] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c5 (Andrew Bogott) Yeah, when running openstack queries OpenStackManager really needs to detect expired tokens and say something rather than just displaying the page as though you have no rights at... [16:27:15] andrewbogott: I'd like to bring up the issue on a mailing list, which list do you think makes sense for labs discussions? [16:27:27] sorry to be such a noob [16:27:29] probably -labs [16:27:33] ok [16:27:43] https://lists.wikimedia.org/mailman/listinfo/labs-l [16:27:51] thank you [16:27:51] Which you should immediately subscribe to if you're doing anything with labs. [16:28:28] i think I was at some point, but everything I was doing pulled me away from labs [16:28:37] resubscribing! [16:31:22] s a [16:31:27] Jeff_Green: I occasionally send things to labs-l along the lines of "Do this or your labs instance will stop working forever" [16:31:33] so it's a good idea to keep an eye out [16:31:51] hah [16:33:56] andrewbogott: is it possible to have default classes enabled for every host which can also be toggled in wikitech? [16:33:58] in that case I'll subscribe too!
[16:34:28] Jeff_Green: I don't understand the question… but the answer is probably no :) [16:34:49] Or, well, it's software so everything is possible. But that's not supported at the moment [16:34:53] andrewbogott: I would like to remove "include role::mail::sender" from role::labs::default [16:35:02] and enable it on all instances [16:35:17] AND have a checkbox to leave it off a specific instance in the instance config page [16:36:00] That's related to what mark needs as well, I think. If y'all agree on a specific design I may be able to implement in a few weeks. [16:36:16] ok [16:39:14] Jeff_Green, but we will need to have that option by default anyway right ? [16:39:23] otherwise - the wiki won't send any emails :\ [16:40:44] tonythomas01_: right. one idea would be to include the class when you build an instance, but make it configurable [16:41:23] Jeff_Green, yeah. we want it to be configurable -- and another option there to have role::mail::mx enabled [16:41:23] I don't know enough about how all this works to suggest a great way to do it [16:42:21] yeah. I just meant that we have an option to make it an mx easily [16:42:30] ya [16:43:00] Jeff_Green, you send the mail ? [16:43:36] ya will do [16:44:19] ok. [17:08:44] Wikimedia Labs / tools: Block TweetmemeBot UA - https://bugzilla.wikimedia.org/71120 (Tim Landscheidt) a:Marc A. Pelletier>Yuvi Panda [17:27:04] Jeff_Green: Just sent (what I think is) a better solution to labs-l [17:28:24] Including a correction so that my second paragraph makes sense. :-) [17:29:58] Jeff_Green: A fix that'd work for you is, literally, a four-line diff in manifests/role/mail.pp [17:30:23] (Maybe 5) :-) [17:40:39] Jeff_Green: mail_full_mx [17:40:45] Jeff_Green: I mean, https://gerrit.wikimedia.org/r/165249 [18:01:40] Coren: :-) [18:12:07] Yeah, Faidon didn't like that (not without reason). [18:12:20] It /was/ a hack; though I felt it was a reasonable one.
:-) [18:20:09] !ping [18:20:09] !pong [18:24:29] !log integration /var/lib/jenkins-slave/tmpfs 100% full on gallium [18:24:32] Logged the message, Master [18:49:35] ^d: I've made sure the plugin upgrades (https://gerrit.wikimedia.org/r/#/c/164633) are ready by syncing them to beta. I haven't tried using elasticsearch 1.3.4 yet. that is next on my list [18:50:05] <^d> deployment-elastic01 is already running .4 [18:50:14] <^d> I was testing the .deb upload to apt.wm.o [18:51:28] sweet [18:51:33] bouncing it will get the new plugin [18:52:59] <^d> deployment-elastic01 experimental highlighter 0.0.12 j [18:52:59] <^d> deployment-elastic01 wikimedia-extra 0.0.1 j [18:53:10] <^d> (among others, obvs) [19:46:35] YuviPanda: meh, the phab email format is complete crap to parse :-( [19:47:11] valhallasw`cloud: I still think proper way is to patch upstream... [19:47:16] they already have an IRC bot... [19:47:43] YuviPanda: well, not really. It's marked as 'experimental' and 'an example of how you could use the API' [19:47:56] sure, so we should fix it and do things with it :) [19:48:06] rather than go phab -> email -> email list -> email in -> redis -> python... [19:48:31] well, email is still a great pub sub method ;-) [19:48:43] and one thing that's unclear to me is where the irc bot would run [19:48:43] tch tch ;) [19:48:53] as far as I can see, it's supposed to run on the same host as phab [19:48:55] which is not ideal [19:49:02] why not [19:49:13] because changes will then take a gazillion years [19:49:20] also getting it back up when it crashes [19:49:25] remember the old wikibugs?
:-p [19:49:29] ah, that :) [19:49:42] also from a security perspective it's not ideal [19:50:08] hmm, true [19:51:00] although it seems to run completely over the Conduit api, so maybe it just needs a phabricator checkout (not so much to be on the same server) [19:53:11] also PHID's everywhere [19:53:36] but this is the main code: https://secure.phabricator.com/diffusion/P/browse/master/src/infrastructure/daemon/bot/handler/PhabricatorBotFeedNotificationHandler.php [19:59:08] Phabricator appears phine, but I phear its phanciphul naming will pheel old aphter a while. [20:08:54] I'm not sure how I'm going to run this on the grid, but let's try. [20:26:43] YuviPanda: I'd also be happy with getting that bot to dump stuff in redis :-p [20:26:55] yeah, but it'll dump them in *prod* redis ;P [20:32:02] Coren, there's a whole list of mail.tools.wikibugs tasks in error state on the wikibugs account (state Eqw). Anything I can do about that, and maybe anything you can do about that? ;-) [20:33:39] valhallasw`cloud: Well, I can blindly clear the errored out jobs; but that's not going to be useful unless the maintainer looks at what happened and possibly makes corrective measures (or decides the jobs can be done away with) [20:33:58] Coren: I am the maintainer. I'm not sure how to check what went wrong :-p [20:35:09] valhallasw`cloud: Well, qstat the jobs; any error would show. But also, you may want to check the jobs' own error logs. [20:35:49] Coren: except they are mail jobs, so they don't have any output. [20:36:07] qstat says error reason 1: can't get password entry for user "tools.wikibugs". Either the user does not exist or NIS error! [20:36:54] Ah, those must date from when LDAP was throwing fits. You can simply clear the error state. [20:37:32] Or just delete them if they are not useful anymore. [20:37:51] (clear error state: qmod -cj [20:39:30] qdel is probably fine -- those are old emails [20:54:43] bd808: Hm.. 
that !log about gallium tmpfs should've gone to production log in -operations, or !log for RelEng/QA in -qa, not the 'integration' project in labs. [20:55:40] hashar was calling me off on it as well earlier. We shouldn't use the log for 'integration' and 'deployment-prep'/'beta' anymore as those aren't typical labs projects, they're usually of interest to prod or qa in general. https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:55:45] ah, I see you found that already :) [20:55:49] I did it here and in -qa [20:56:38] I've been duping some stuff but yeah I should just stick with -qa [21:04:35] bd808: YuviPanda: ganglia in labs, over or not over? It seems to be taking up massive amounts of RAM on integration slaves. [21:05:02] * bd808 doesn't know [21:05:08] ganglia is dead, and there are no current plans to bring it back [21:05:11] you can kill it [21:06:52] YuviPanda: how? [21:07:19] I've no idea :| [21:07:29] Krinkle: I guess ganglia collectors are defined in integration roles? [21:08:01] Krinkle: ah, I see [21:08:07] Krinkle: it is included in standard [21:08:15] Yes. [21:08:31] I guess I can put a realm guard around that... [21:19:23] YuviPanda: thx [21:21:55] Krinkle: yw. now to get that merged... [21:30:16] Wikimedia Labs / Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://bugzilla.wikimedia.org/71761 (Krinkle) NEW p:Unprio s:normal a:None The gmond process seems obsolete since the aggregator is down. ganglia.wmflabs.org is still up, but no longer being populated.... [21:30:30] Wikimedia Labs / Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://bugzilla.wikimedia.org/71761 (Yuvi Panda) a:Yuvi Panda [21:52:39] Hm...
that's concerning [21:52:52] YuviPanda: lol, clean brand new instance: not used for anything yet (not pooled) [21:52:53] https://graphite.wmflabs.org/render/?width=900&height=500&from=-4h&target=integration.integration-slave1009.memory.MemFree.value [21:52:57] down to the river we go [21:53:14] Krinkle: MemFree is a terrible metric, though [21:53:17] https://graphite.wmflabs.org/render/?width=900&height=500&from=-7h&target=integration.integration-slave1009.memory.MemFree.value [21:53:25] YuviPanda: OK [21:53:28] Got a better one? [21:53:32] let me see what you need to look at [21:53:33] moment [21:54:19] Krinkle: MemTotal - (Active + Buffers + Cached) [21:54:20] I think [21:54:57] MemTotal is total memory available, constant value [21:55:21] oh, I see [21:55:29] (PS1) BearND: Update build script for Gradle [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165375 [21:55:31] (PS1) BearND: Need JAVA_HOME for Gradle [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165376 [21:55:33] (PS1) BearND: Expand wildcards when copying apk [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165377 [21:55:44] YuviPanda: There's no subtract method in graphite is there.. [21:56:18] Krinkle: diffSeries [21:56:28] Can take two or more metrics, or a single metric and a constant. Subtracts parameters 2 through n from parameter 1. [21:56:52] YuviPanda: I don't need subtract though [21:56:57] just adding active,buffer,cached will do [21:56:58] hmm? [21:57:02] I'm trying to draw a stacked graph [21:57:03] hmm, that shall do too, yeah [21:57:08] ah, hmm [21:57:09] ok [21:57:11] stacked() doesn't work [21:57:22] sum() seems to work, but displays them as one new value, not stacked areas [21:58:36] oh, are you drawing them directly via graphite? [21:59:05] YuviPanda: do I have another option?
[21:59:09] https://graphite.wmflabs.org/render/?width=900&height=500&from=-6h&target=sum(integration.integration-slave1009.memory.{Active,Buffers,Cached}.value)&areaMode=stacked [21:59:35] Krinkle: well, my preferred way is to add &format=json, get the points in json, and plot them with a sane library [21:59:53] graphite's graphs aren't what I'll call nice [22:00:12] I agree. but I just want monitoring so I can keep the integration slaves healthy. [22:00:16] Can I help your efforts instead? [22:00:22] I don't want to reinvent [22:01:02] hmm, I'm currently setting up shinken, but perhaps I should setup the graphing first, and then shinken... [22:01:11] I guess lots of people would find that order more useful [22:02:29] I basically just want simple graphs that show me: cpu, memory and disk usage. And a set for each node and one for all nodes in a group. E.g. like http://ganglia.wikimedia.org/latest/?r=hour&c=Bits%2520caches%2520eqiad (the last hour memory graph) [22:02:49] and then one for each node as well. Just plain graphs like that so I can see what's going on. [22:03:18] and ideally alerts as well (like you set up), which should be pretty straightforward thanks to the infra you put in place via prod icinga [22:03:34] but right now they don't measure cpu and memory in a useful way [22:03:59] YuviPanda: Use the new grafana module that ori made? [22:04:18] bd808: yeah, but I'm not too much of a fan of grafana [22:04:21] but I'll give it a shot... [22:04:39] Krinkle: ok, so I'll setup grafana (you can try it out for prod at grafana.wmflabs.org) and see if it's useful [22:04:45] heh. what we need is yet another graphite front end.... [22:04:47] bd808: from what I've seen that's similar to regular graphite, it just makes it easier to write the function calls and store queries in a persistent dashboard. It's still limited to the functions graphite supports I think.
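Yuvi's diffSeries/sumSeries suggestion and the `&format=json` trick above combine into a single render request. A sketch that just composes the URL; the graphite host and metric prefix are the examples from this conversation, not guaranteed current:

```shell
#!/bin/sh
# Compose a Graphite render URL for "used" memory, computed server-side as
# MemTotal - (Active + Buffers + Cached). format=json returns raw datapoints
# for plotting with your own library instead of Graphite's PNG renderer.
HOST="https://graphite.wmflabs.org/render/"
P="integration.integration-slave1009.memory"
TARGET="diffSeries(${P}.MemTotal.value,sumSeries(${P}.{Active,Buffers,Cached}.value))"
echo "${HOST}?from=-6h&format=json&target=${TARGET}"
```

Piping the printed URL to `curl -s` then yields JSON datapoints that any charting library can draw, which sidesteps Graphite's built-in renderer entirely.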
[22:06:01] It renders client side I think, but I haven't worked with it deeply
[22:06:22] bd808: I'd want to autogenerate a set of graphs for most common metrics (to match ganglia, at least) per project by default
[22:06:25] I don't know if grafana does that
[22:06:31] I have worked with graphite directly a ton and there are very few things I've wanted that it can't be tricked into doing.
[22:06:38] plus the fact that it stores config in ElasticSearch(!?!) seems ugh
[22:07:14] YuviPanda: heh. I wrote a php library for doing that (replacing ganglia) at $DAYJOB-1
[22:07:22] heh
[22:07:28] I guess that isn't open source...
[22:07:48] https://github.com/bd808/graphite-graph-php
[22:08:01] I know nothing about monitoring, I probably used all the wrong metrics, but I wrote this in little over an hour https://github.com/wikimedia/integration-docroot/blob/master/org/wikimedia/integration/monitoring/index.php
[22:08:36] seems with a little more knowledge from someone who 1) knows graphite, 2) knows what to measure, one can come a long way
[22:08:40] YuviPanda: It is embedded into http://opennetadmin.com/ at least at Kount
[22:08:46] heh
[22:10:25] Krinkle: Yup. My library is basically a helper for doing that kind of stuff. You can use it as a DSL or to read from ini files -- http://bd808.com/graphite-graph-php/
[22:11:35] We hooked it into our network monitoring system so that when a host was added it would generate the right ini files and display the graphs on the host's page
[22:12:16] hm.. labs graphite is no longer responding
[22:12:27] Krinkle: I can spend time tomorrow figuring out if grafana is going to be good enough, or if we'll have to do something else
[22:13:21] Krinkle: hmm, you're right.
[22:13:26] Krinkle: the labs general proxy seems down
[22:13:28] andrewbogott: ^
[22:13:37] YuviPanda: OK. I'll do some research as well. One question: What's the name of the thing we use to collect these metrics? e.g. memory.MemTotal -- the names of those and their values, is that a standard of sorts? I assume that's not built-in into graphite
[22:13:56] andrewbogott: diamond is what we use to collect the metrics
[22:14:03] andrewbogott: https://github.com/BrightcoveOS/Diamond
[22:14:04] err
[22:14:05] Krinkle: https://github.com/BrightcoveOS/Diamond
[22:14:16] Krinkle: inside 'collectors' you can see the code that does the actual collection
[22:14:20] and see how exactly it gets the values
[22:14:37] andrewbogott: dynamicproxy-gateway the box seems dead. can you try logging in with root key?
[22:14:45] YuviPanda: thx
[22:15:11] YuviPanda: Why does the build.py script in wikipedia-android-builds only sometimes produce output to build.out/err? Often both files are empty. I'm trying to figure out why it doesn't publish the apk to the web site.
[22:15:34] YuviPanda: I'm looking… probably I'll just reboot it though
[22:16:03] andrewbogott: ok
[22:16:12] bearND: oh, it doesn't?! :|
[22:16:36] bearND: can you try running the jsub command from cron by hand to see if it runs?
[22:17:27] !log projectproxy dynamicproxy-gateway is dead, unpingable and not proxying requests anymore
[22:17:28] projectproxy is not a valid project.
[22:17:49] !log proxyproject dynamicproxy-gateway is dead, unpingable and not proxying requests anymore
[22:17:50] proxyproject is not a valid project.
[22:18:07] YuviPanda: Yes. I usually cd to wikipedia and then do a git reset --hard HEAD^ so there would be a new commit to pull
[22:18:42] bearND: hmm, so it works if you do it with jsub, but not when done via cron?
[22:19:30] YuviPanda: i've tried both but I haven't seen any consistent behavior when it produces output to the files
[22:21:04] YuviPanda: I just ran it manually and verified that a job was running with job -v build; said: Job 'build' has been running since 2014-10-07T22:18:27 as id 4632081
[22:21:23] bearND: did it produce the files?
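For reference, Diamond's collectors essentially read a system source such as /proc/meminfo and publish one value per Graphite path, which is where names like memory.MemTotal come from: they mirror the kernel's field names rather than any cross-tool standard. A standalone Python sketch of that parsing step (this is not Diamond's actual code, and the sample input below is made up):

```python
def meminfo_metrics(meminfo_text: str, prefix: str = "memory") -> dict:
    """Turn /proc/meminfo-style lines into Graphite-ready metric names
    like 'memory.MemTotal', roughly what a memory collector publishes."""
    metrics = {}
    for line in meminfo_text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if not parts:
            continue
        # /proc/meminfo reports most fields in kB; keep the raw number here
        metrics[f"{prefix}.{key.strip()}"] = int(parts[0])
    return metrics

# Fabricated sample input in /proc/meminfo's format
sample = "MemTotal:  2048000 kB\nMemFree:   512000 kB\nCached:    256000 kB\n"
print(meminfo_metrics(sample))
# {'memory.MemTotal': 2048000, 'memory.MemFree': 512000, 'memory.Cached': 256000}
```

The real collectors in Diamond's 'collectors' directory follow the same pattern but publish each value through the daemon instead of returning a dict.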
[22:21:32] YuviPanda: files are empty
[22:21:39] bearND: no apk file yet?
[22:21:48] bearND: just do a 'qstat', see what jobs are running?
[22:22:29] YuviPanda: qstat says only lighttpd is running
[22:22:43] ah, hmm
[22:22:48] bearND: did you add -mem 8G parameter to jsub?
[22:23:23] YuviPanda: yes: jsub -mem 8G -quiet -once
[22:23:49] YuviPanda: used the same command line as in crontab
[22:23:57] andrewbogott: btw, tools proxy is dead as well
[22:24:06] bearND: hmmm, I'm unsure :| can you file a bug and assign it to me?
[22:24:15] 3Wikimedia Labs / 3deployment-prep (beta): Beta: Cannot save any page "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764#c2 (10Krinkle) a:3None https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/4588/ https:/...
[22:24:20] YuviPanda: will do
[22:24:27] bearND: thanks
[22:24:44] 3Wikimedia Labs / 3deployment-prep (beta): Beta: Cannot save any page "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764 (10Krinkle) p:5Unprio>3Normal s:5normal>3critic
[22:24:54] bah, fucking logout-back-in bug
[22:25:03] YuviPanda: tools-webproxy?
[22:25:08] andrewbogott: yeah
[22:25:14] andrewbogott: general proxy is still dead
[22:25:15] as well
[22:26:00] andrewbogott: I issued a reboot of tools-webproxy
[22:26:07] It's the host I think.
[22:26:43] of course, I can't look at graphite stats since that is served by labs proxy
[22:26:59] andrewbogott: general proxy says still 'rebooting'?
[22:28:48] andrewbogott: Looks like toollabs is down.
[22:29:19] :(
[22:29:22] kaldari: underlying host seems a bit dead
[22:29:26] he's looking into it
[22:29:45] bearND: ^ this might be a cause as well. we're looking into it...
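For context, the jsub invocation being debugged above would normally run from the tool's crontab. A hypothetical entry is sketched below; only the jsub flags come from the log, while the schedule and script path are illustrative. By default the grid engine writes the job's stdout and stderr to <jobname>.out and <jobname>.err in the tool's home directory, which is where the empty build.out/build.err files come from.

```shell
# Hypothetical crontab entry for the wikipedia-android-builds tool.
# Flags as quoted in the log: -mem 8G (memory limit), -quiet (suppress
# informational output), -once (don't submit if the job is already running).
# The schedule and the script path below are illustrative, not from the log.
0 * * * * jsub -mem 8G -quiet -once /data/project/wikipedia-android-builds/build.py
```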
[22:34:06] YuviPanda: ok, thanks
[22:50:05] !log deployment-prep updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061
[22:50:08] Logged the message, Master
[22:52:12] http://en.wikipedia.beta.wmflabs.org/ seems to be dead :( i guess y'all know that.
[22:52:27] cscott: yah, see topic
[22:52:31] lotsamachines are dead
[22:54:32] 3Wikimedia Labs / 3tools: jsub command produces empty output/error files - 10https://bugzilla.wikimedia.org/71766 (10Bernd Sitzmann) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier To repro: login to tools and become wikipedia-android-builds # note output of: ls -l build.* -rw-rw---- 1 tools.wikiped...
[22:54:59] 3Wikimedia Labs / 3tools: jsub command produces empty output/error files - 10https://bugzilla.wikimedia.org/71766 (10Bernd Sitzmann) a:5Marc A. Pelletier>3Yuvi Panda
[22:55:22] WTF?
[22:55:36] YuviPanda: filed the bug for you
[22:55:46] bearND: thanks
[22:55:52] MarcosDias: ?
[22:56:09] this whitelist as problem?
[22:57:12] MarcosDias: hmm? which whitelist?
[22:58:11] https://huggle.wmflabs.org/data/wl.php?wp=pt.wikipedia&action=display
[22:58:56] MarcosDias: ah, you want petan
[23:00:29] yes
[23:01:04] MarcosDias: there's also the #huggle channel, which might be more useful
[23:01:14] thanks!
[23:19:44] YuviPanda: I just restarted the proxy gateway -- look like it's working to you?
[23:19:57] andrewbogott: ya
[23:20:06] ok, lemme set up a batch to restart everything else
[23:20:09] ok
[23:20:14] andrewbogott: tools-webproxy as well before the batch?
[23:20:27] YuviPanda: lemme try
[23:36:55] YuviPanda: May I delete instance 'boiledegg'? Just to give virt1005 some breathing room?
[23:37:04] andrewbogott: ya
[23:37:08] (That instance chosen at random due to a seemingly transitory name)
[23:37:09] thanks.
[23:37:31] andrewbogott: although, I was naming everything under the design project with 'what did I have for breakfast?'
[23:37:35] so not very transitiony :D
[23:37:45] Well, wait --
[23:37:53] so if that instance is actually good for something then I will not delete it :)
[23:38:02] andrewbogott: no, it isn't being used atm.
[23:38:11] andrewbogott: and hasn't been for a while.
[23:38:13] (I use 'breakfast foods' for a naming scheme too, but only for disposable things usually)
[23:38:18] ah :)
[23:38:18] ok, great! Killing...
[23:38:39] !log design deleting instance 'boiledegg'
[23:38:41] Logged the message, dummy
[23:38:54] matanya: may I delete instance 'etherpad-matanya'?
[23:41:02] anybody was restarting servers?
[23:41:52] Danny_B|webgate: yes, info sent to labs-l
[23:42:59] bd808: is labs instance bd808 a disposable/defunct instance, or still useful?
[23:43:51] andrewbogott: You can kill it. I built docker support for mw-vagrant there and haven't touched it in over a month
[23:44:17] * bd808 will recreate via a proper role some day
[23:44:25] hi all - I am getting the pieces together for this writeup on serving tiles for maps... most of the major looking done.. the other day I saw mention of a VM here doing some experimental tile serving ? any hints on that ?
[23:44:43] bd808: excellent, thanks!
[23:46:10] andrewbogott: by "died" is meant data are lost too?
[23:46:28] !log mediawiki-core-team deleting instance bd808 because bd808 said I could.
[23:46:30] Logged the message, dummy
[23:46:48] Danny_B|webgate: As I said in the email, your instance will experience the equivalent of an unexpected reboot.
[23:47:02] There shouldn't be any dataloss, other than immediate running-state data.
[23:47:13] your instances should be back up and running by now.
[23:47:52] ok. when we say die it means the disc is unrecoverably dead, so i was verifying...
[23:48:21] <^demon|brb> andrewbogott: On that note from !log...could you delete chad-test too? It's saying "The requested host does not exist" when I try.
[23:48:22] otoh, i can't connect to freenode, but i assume that's rather their issue with their settings
[23:48:43] ^demon|brb: sure, I'll try.
[23:50:10] ^demon|brb: done, I think...
[23:50:36] <^demon|brb> Looks like it, thx
[23:51:57] was the ip of the labs changed?
[23:53:54] Danny_B|webgate: nah, shouldn't have been any changes really.
[23:53:57] Just reboots.