[00:24:13] Wikimedia Labs / tools: Make the iptables initialization occur at boot time - https://bugzilla.wikimedia.org/53181 (Tim Landscheidt) PATC>RESO/WON [01:34:31] my labs-vagrant instance seems to be a bit borky [01:34:40] Error: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [01:35:57] Oct 7 01:34:25 ogvjs-testing kernel: [ 949.564469] init: hhvm respawning too fast, stopped [01:35:58] hrmmmm [01:40:28] lemme try switching back to zend [01:48:33] that works for now [02:12:56] andrewbogott, https://wikitech.wikimedia.org/wiki/Special:NovaProxy is a bit strange. [02:13:15] it doesn't seem to do anything useful [02:13:27] just lists the projects I selected as linked headings [02:14:14] also, too much escaping on Special:NovaProject?action=configureproject [02:44:57] hello, I’d like to request access to tool labs, but the request hasn’t been processed yet, could anybody help me? my lab account is sophiejjj [03:51:00] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731 (Santhosh Thottingal) NEW p:Unprio s:normal a:None Created attachment 16687 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16687&action=edit Screenshot showing empty instance lis... [06:03:30] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735 (Nemo) NEW p:Unprio s:major a:None Internal error [e96affb6] 2014-10-07 06:01:50: Fatal exception of type MWException The file (a copyvio) no longer has a duplicate on Commons, bu...
[08:43:07] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (22.22%) [08:57:56] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [09:06:44] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735 (Andre Klapper) p:Unprio>High [09:15:50] andrewbogott_afk: Coren: Just created a new labs instance with precise, it fails to do the initial provisioning thus denying ssh access [09:15:51] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=integration&instanceid=b65a604d-40ef-4b16-b527-bfb862ca3904&region=eqiad [09:15:55] Oct 7 08:55:28 integration-slave1004 nslcd[1059]: [3c9869] failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out [09:16:04] seems it still wants to do virt0 ? [09:21:08] seems it falls back to virt1000 so it should be fine [09:21:08] Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [3c9869] connected to LDAP server ldap://virt1000.wikimedia.org:389 [09:26:30] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741 (Krinkle) NEW p:Unprio s:normal a:None Creating a new instance with the precise image fails and leaves the instance inacces... [09:28:25] hashar: ^ [09:30:31] Krinkle: thanks [09:30:43] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c1 (Antoine "hashar" Musso) I suspect the labs image for Ubuntu Precise hasn't been updated to take in account the recent LDAP changes...
[09:31:03] Krinkle: the ops might be able to log in the instance and see what is going on [09:32:08] !log Apply I44d33af1ce85 instead of Ib95c292190d on integration-puppetmaster (remove php5-parsekit package) [09:32:09] Apply is not a valid project. [09:38:00] Wikimedia Labs: WMFLabs Graphite: Dashboard is empty (Uncaught exception in javascript) - https://bugzilla.wikimedia.org/71742 (Krinkle) NEW p:Unprio s:critic a:None http://graphite.wmflabs.org/ ext-all.js:7> Uncaught SyntaxError: Unexpected end of input (index):29> Uncaught TypeError: Cannot... [12:41:05] jackmcbarn: you around? [13:46:43] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c2 (Andrew Bogott) I just tested this a moment ago, and it worked fine for me. I installed a new precise base image on Friday that use... [14:08:15] Betacommand: i am now [14:11:35] Hi ! Is beta instance accessible directly to the outside world ? Like - we were planning to make a polonium equivalent for labs to test the bounce handler actions - but we were curious whether our mx would be accessible from the outer world ? [14:12:10] tonythomas: if you assign it a public IP address, it would be [14:12:38] Jeff_Green: that would do ? [14:13:23] YuviPanda how do we assign a public IP to an instance? [14:13:58] Jeff_Green: there's a 'manage addresses' link on the sidebar in wikitech [14:14:27] Jeff_Green: you just 'allocate IP' and then assign it to an instance, and then assign a DNS A record to it if you want [14:14:38] orly. ok great [14:14:42] Jeff_Green: and if you are out of IP quota, you can ask andrewbogott to increase it. [14:15:10] i didn't realize it was that straightforward, great! [14:15:35] Jeff_Green: :D [14:16:11] Jeff_Green: unrelated, but if you just want a public HTTP/HTTPS interface, there's a 'manage web proxies' on the sidebar that lets you do that as well.
proper https + spdy for free with that. [14:16:17] not useful for mail, but might be useful otherwise [14:16:31] ok [14:16:53] Jeff_Green: so we power up an mx instance ? [14:17:09] tonythomas: yup [14:17:17] Jeff_Green: should I ? [14:17:20] sounds as though once it's running I can assign it an IP [14:17:20] sure [14:17:34] * tonythomas got 2 in stock though :D [14:17:59] Jeff_Green: I don't know if you want to add MX records to it, though. Wikitech interface only lets you add A records. [14:18:36] the interface administers the DNS zone directly? [14:19:23] we can probably do most testing without an MX record [14:21:08] Jeff_Green: the interface just lets you add an A record, and nothing more. [14:21:14] ok [14:22:17] Jeff_Green: so - just create an MX instance - right ? [14:22:22] right [14:22:28] in the mediawiki-verp project would be good ? [14:22:42] do we need that in the beta project ? [14:23:00] i thought beta project? [14:23:16] I dont have create rights there :( I think [14:23:33] oh, ha. ok one moment [14:25:33] * Jeff_Green checks if I have access there... [14:26:09] k :) [14:26:30] beta == deployment-prep right? [14:26:43] yeah ! [14:27:23] I will be back in ~30 mins ( dinner ) [14:31:46] "failed to allocate new public IP address" [14:32:05] Jeff_Green: yeah, not enough quota [14:32:17] andrewbogott: ^ can you increase deployment-prep IP quota by 1? [14:32:21] "an error has occured" [14:32:50] yeah, wikitech has a very usable, intuitive interface with descriptive error messages... [14:33:21] * Jeff_Green wants to change that message to "what." [14:34:56] done [14:35:07] And, the message isn't /that/ cryptic. You are allowed to check your own quota. [14:35:44] the message doesn't say what caused the error, that's the part that's missing [14:36:11] i got a similar error trying to create a new instance, and it worked on the third try, no idea what was happening on the backend [14:37:21] i.e.
"Failed to allocate new public IP address, you've run out of IPs. See [wikitech link about requesting IPs]" [14:37:41] andrewbogott: btw, what happened to horizon? [14:37:53] YuviPanda: what do you mean? [14:38:01] andrewbogott: you were experimenting with it at some point, right? [14:38:05] ^^^[2] thanks for allocating ips [14:38:07] or was I tripping and seeing things? [14:38:15] Yeah, but it'll be the work of many many months to actually replace OSM with it [14:38:20] It doesn't have any of the features we use [14:38:26] *any*? [14:38:44] pretty much [14:38:49] sigh [14:38:55] it's basically a sketched-in framework. [14:39:08] hmm, and we'd have to build plugins / modules... [14:39:13] at least it's python and not PHP [14:43:19] jackmcbarn: sorry got pulled away from my desk [14:44:23] jackmcbarn: if you want a bug to fix: https://bugzilla.wikimedia.org/show_bug.cgi?id=63601 is a good one [14:44:48] Betacommand: that's actually something i'm already working on [14:45:56] I see your gmail hack with your bugzilla email address :P [14:46:18] if i start getting spam i like to know where they got my address :p [14:51:53] jackmcbarn: Just making a cheeky comment, I do the same thing at times [14:53:10] yuvipanda do you handle the web proxies? [14:53:20] for some definition of handle, sure :) [14:53:20] 'sup [14:54:16] can you add https://bugzilla.wikimedia.org/show_bug.cgi?id=71120 to the blocked UA list? Im seeing 60+% of web activity from them [14:54:38] Betacommand: hmm, we don't actually have a blocked UA list, but I could whip one up... [14:54:42] let me write a patch [14:55:44] yeah, these spammy web crawlers are putting excessive load on labs for no reason. Im already serving 403's to them but they just ignore it [14:56:06] hmm, ok [14:58:44] YuviPanda: Im probably one of the only people who actually uses their access.log and that spider UA fills 60%+ of it [14:59:39] oh hey guys [14:59:57] is hhvm totally borked on labs-vagrant?
i had to switch my server to zend to get it working again [15:00:25] brion: heh! shouldn't be... can you update your vagrant? git pull on /vagrant? [15:00:52] YuviPanda: well logs were something about ‘hhvm no longer supports build-in web server, use fastcgi’ [15:01:09] brion: huh, that was fixed like... many many months ago, I think... [15:01:15] it might have just been stuck in an inconsistent state though [15:01:19] brion: yeah [15:01:25] brion: also, trusty or precise? [15:01:28] lemme try switching it back now that it’s fully provisioned [15:01:55] ok [15:02:04] trusty [15:02:17] ah ok [15:03:08] ok when provisioning i get an error: [15:03:09] Error: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [15:03:09] Error: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]/ensure: change from stopped to running failed: Could not start Service[jobrunner]: Execution of '/sbin/start jobrunner' returned 1: [15:03:21] and the web server returns 503s only http://ogvjs-testing.wmflabs.org/wiki/Demo [15:04:14] oh wait *self-slap* [15:04:18] forgot to update vagrant [15:04:38] i only updated mediawiki :D [15:04:45] * brion whistles innocently and walks away [15:05:12] brion: ah :) [15:05:54] Error: mwscript importDump.php --wiki=wiki /vagrant/puppet/modules/labs/files/labs_privacy_policy.xml returned 255 instead of one of [0] [15:05:54] Error: /Stage[main]/Role::Labs_initial_content/Mediawiki::Import_dump[labs_privacy]/Exec[import_dump_labs_privacy]/returns: change from notrun to 0 failed: mwscript importDump.php --wiki=wiki /vagrant/puppet/modules/labs/files/labs_privacy_policy.xml returned 255 instead of one of [0] [15:06:11] hmmmm and now something’s awry with the wiki: “Class undefined: WebVideoTranscode “ [15:08:11] bah [15:08:19] maybe something ate TMH [15:08:34] brion: switch back to zend, see if the problem goes away? [15:10:39] waiting on puppet… dum de dum [15:12:59] ok back to zend and ….
it’s fine :( [15:13:06] brion: :( file a bug? [15:13:13] brion: I haven't touched vagrant in a while [15:13:43] yeah i’ll see if i can narrow it down a little [15:13:51] brion: ok [15:14:03] brion: the import dump is kinda a knownish issue, but I thought that was fixed [15:14:08] brion: basically import not working with hhvm [15:14:13] bah [15:14:30] brion: but I thought that was fixed [15:14:36] what the [15:15:01] ok it started working under hhvm again [15:15:04] Jeff_Green: back [15:15:11] i’ll chalk it up to ‘puppet is magical and full of ghosts’ [15:15:17] brion: haha ;) [15:15:17] brion: ok [15:15:23] Betacommand: merged and deployed, btw [15:15:30] so - we are ready to create the instance ? [15:15:58] Wikimedia Labs / tools: Block TweetmemeBot UA - https://bugzilla.wikimedia.org/71120 (Yuvi Panda) PATC>RESO/FIX [15:16:48] brion: If you had not updated the mw-v puppet code in a while there were several fixes for hhvm config that were needed for the latest hhvm binaries. [15:17:05] Also I hope the dump import bugs are fixed now [15:17:39] And everyone needs to start testing everything on hhvm because as of yesterday 1% of anon traffic is being served by hhvm in production [15:17:59] and the plan is to have 10% on hhvm next week [15:19:37] bd808: yeah i think some of the updates didn’t fully track right because i left the thing alone for a few weeks [15:19:53] * bd808 nods [15:19:55] seems ok now though *fingers crossed* [15:20:23] Getting idempotent updates to run correctly from any initial state turns out to be hard :) [15:23:02] brion: Puppet isn't "magical", it's "eldritch" [15:23:20] lol [15:24:15] * bd808 checks under desk for tentacles and portals to the netherworld [15:24:58] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c1 (Sam Reed (reedy)) (In reply to Santhosh Thottingal from comment #0) > Created attachment 16687 [details] > Screenshot showing empty instance list > > I am not
able to see the instance listing f... [15:26:04] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c2 (Yuvi Panda) Indeed, turning it off and on fixes it... I'm unsure why this is happening, though. [15:30:48] once I assign a public IP to an instance, where do I configure filtering on the incoming traffic? [15:31:18] Jeff_Green: you can filter by IP with 'security groups' in 'manage security groups' [15:31:30] looking [15:31:48] Jeff_Green: but openstack itself is stupid, and you can't add or delete a security group from an instance, so first you'd have to create a security group, and then add the instance with that security group, and then you can modify the rules there... [15:32:00] Jeff_Green: you can also just allow everything in security groups, and just filter with ferm/iptables [15:32:20] ok [15:32:26] makes sense, thanks [15:37:09] YuviPanda I don't see where I can add an instance to a security group? [15:41:12] the wikitech docs make it look as if you have to have the security group made before you create the instance? [15:45:28] "once the group has been created it will be available in the “Add Instance” form under the “Manage Instances” section." [15:45:58] Wikimedia Labs / deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - https://bugzilla.wikimedia.org/70141#c18 (Greg Grossmeier) ASSI>RESO/FIX (In reply to Greg Grossmeier from comment #17) > Yuvi: Thanks for the first pass work! Once you remove yourse... [15:46:26] * Jeff_Green starts over. [15:48:13] YuviPanda now I see what you were saying [15:52:20] YuviPanda: thanks [15:53:28] Wikimedia Labs / Infrastructure: WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible) - https://bugzilla.wikimedia.org/71741#c3 (Andrew Bogott) OK -- that last comment was both right and wrong. New instances /do/ work. But there's still a smattering of virt0...
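The second option above — open the security group up and filter on the instance itself — can be sketched as a handful of iptables rules. A minimal sketch for a labs MX host: the ports and the 10.68.0.0/16 labs-internal range are illustrative assumptions, not the project's actual policy, and the script echoes the commands instead of executing them so the ruleset can be reviewed first.

```shell
#!/bin/sh
# Sketch: per-host filtering for an MX instance once the security group
# passes all traffic. Ports and the labs-internal range are assumptions;
# the rules are printed, not applied.
emit() { echo "iptables $*"; }

emit -A INPUT -i lo -j ACCEPT                                # loopback
emit -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
emit -A INPUT -p tcp --dport 25 -j ACCEPT                    # inbound SMTP
emit -A INPUT -p tcp --dport 22 -s 10.68.0.0/16 -j ACCEPT    # SSH, labs-only
emit -A INPUT -j DROP                                        # default deny
```

In practice the same policy would usually live in a ferm/puppet config so it survives a rebuild of the instance.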
[16:00:36] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c3 (Santhosh Thottingal) (In reply to Sam Reed (reedy) from comment #1) > It's a known session bug... If you log out, and back in again, it should fix > it for you Worked when logged out and logged... [16:02:27] YuviPanda: we dont have a polonium ( mx ) role readily in labs configuration right ? [16:02:38] we will have to manually edit ? so self::puppetmaster ? [16:04:47] andrewbogott: you around ? [16:04:55] tonythomas: yes, but in a meeting [16:05:31] ok. anyway - if you could look into my previous query -- do we have a ready made role::mx available as in configure instance in wikitech? [16:05:37] or we have to do it manually ? [16:06:32] I don't know -- best to look in the puppet source. [16:06:51] woo progress! telnet: connect to address 208.80.155.193: Connection refused [16:07:03] Jeff_Green: the mx is installed ? [16:07:16] it was on the previous instance, looking [16:07:29] but it looks like we're at least getting to the instance [16:07:46] ok. and andrewbogott : if its not there in wikitech -- then go for self::puppetmaster right ? [16:08:25] yeah exim is installed, but configured only outbound [16:08:29] tonythomas: if a class is available in puppet then you can add it to the wikitech interface for a specific project. https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [16:10:38] andrewbogott: I will try that one out [16:11:01] i just added role::mail::mx [16:11:54] Jeff_Green: and its puppet-applying ? [16:12:04] just added to the instance. running puppet [16:12:25] okey :) [16:12:26] and boom [16:12:28] fail.
[16:12:29] Wikimedia Labs / wikitech-interface: Can't delete files on wikitechwiki - https://bugzilla.wikimedia.org/71735#c1 (Tim Landscheidt) NEW>RESO/DUP *** This bug has been marked as a duplicate of bug 71208 *** [16:12:40] conflicting modules [16:12:43] Wikimedia Labs / wikitech-interface: Not possible to delete files - https://bugzilla.wikimedia.org/71208#c2 (Tim Landscheidt) *** Bug 71735 has been marked as a duplicate of this bug. *** [16:12:55] yeah. How I done that one was [16:13:19] removing mail::sender from default { in site.pp [16:13:42] err. let me check that one again [16:13:50] Jeff_Green: yeah, mail based roles fail in labs because labs standard role includes a similar role... [16:13:52] anyway the conflicting one was a role::mail::sender [16:13:58] I guess using ensure_ in both places would be useful, perhaps... [16:14:20] one was there in role/labs.pp [16:14:32] YuviPanda i see [16:15:00] Jeff_Green: the conflict would be with role/labs.pp [16:15:14] yep, looking [16:15:24] this could get interesting [16:15:47] Jeff_Green: yup [16:15:51] :) [16:16:01] role::labs::instance includes role::mail::sender [16:16:23] yeah. I removed that one from role/labs.pp [16:16:30] right [16:17:05] my goal is to keep this integrated with normal labs puppet [16:17:43] yeah. now puppet apply is running ? [16:17:59] it's trying anyway :-) [16:18:11] :) [16:18:28] dies on the conflict between underlying classes in role::mail::sender and role::mail::mx [16:19:08] i don't suppose we can feed a class parameter to role::labs::instance :-( [16:19:30] it's setup by ldap, so I suppose not [16:20:12] Jeff_Green: you can add a global reference with a default to that class and then send that value via ldap. Maybe. [16:20:26] I don't know about order of operations though [16:20:34] You could also be a hiera pioneer [16:20:45] * Jeff_Green dies [16:20:46] I guess using ensure_* in both places might not be a bad idea....
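The `ensure_*` idea floated above would only help where both roles want the same declaration. A rough sketch with a hypothetical shared class name — the real manifests differ, and this does not solve the case where sender and mx genuinely need different exim configurations:

```puppet
# Hypothetical: both mail roles declare the shared class via
# ensure_resource(), so whichever role is evaluated second becomes a
# no-op instead of raising a duplicate-declaration compile error.
# Note this only works while both declarations carry identical parameters.
class role::mail::sender {
    ensure_resource('class', 'exim4', {})
}

class role::mail::mx {
    ensure_resource('class', 'exim4', {})
}
```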
[16:21:00] * Jeff_Green does not want to be a hiera anything :-) [16:21:18] Jeff_Green: I think there is one more role::labs::sender in site.pp ? [16:21:56] under class standard :{ [16:22:21] oop [16:23:01] mark has also been working on labs mail handling, might be good to coordinate with him if you actually write any puppet code [16:23:44] andrewbogott: thanks, I'll check in with him now [16:26:13] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c4 (Tim Landscheidt) IIRC Andrew once said that the authentication tokens for MediaWiki and OpenStack time out at different times, or something like that. [16:27:13] Wikimedia Labs / Infrastructure: labsconsole: Empty instance list - https://bugzilla.wikimedia.org/71731#c5 (Andrew Bogott) Yeah, when running openstack queries OpenStackManager really needs to detect expired tokens and say something rather than just displaying the page as though you have no rights at... [16:27:15] andrewbogott: I'd like to bring up the issue on a mailing list, which list do you think makes sense for labs discussions? [16:27:27] sorry to be such a noob [16:27:29] probably -labs [16:27:33] ok [16:27:43] https://lists.wikimedia.org/mailman/listinfo/labs-l [16:27:51] thank you [16:27:51] Which you should immediately subscribe to if you're doing anything with labs. [16:28:28] i think I was at some point, but everything I was doing pulled me away from labs [16:28:37] resubscribing! [16:31:22] s a [16:31:27] Jeff_Green: I occasionally send things to labs-l along the lines of "Do this or your labs instance will stop working forever" [16:31:33] so it's a good idea to keep an eye out [16:31:51] hah [16:33:56] andrewbogott: is it possible to have default classes enabled for every host which can also be toggled in wikitech? [16:33:58] in that case I'll subscribe too!
[16:34:28] Jeff_Green: I don't understand the question… but the answer is probably no :) [16:34:49] Or, well, it's software so everything is possible. But that's not supported at the moment [16:34:53] andrewbogott: I would like to remove "include role::mail::sender" from role::labs::default [16:35:02] and enable it on all instances [16:35:17] AND have a checkbox to leave it off a specific instance in the instance config page [16:36:00] That's related to what mark needs as well, I think. If y'all agree on a specific design I may be able to implement in a few weeks. [16:36:16] ok [16:39:14] Jeff_Green, but we will need to have that option by default anyway right ? [16:39:23] otherwise - the wiki won't send any emails :\ [16:40:44] tonythomas01_: right. one idea would be to include the class when you build an instance, but make it configurable [16:41:23] Jeff_Green, yeah. we want it to be configurable -- and another option there to have role::mail::mx enabled [16:41:23] I don't know enough about how all this works to suggest a great way to do it [16:42:21] yeah. I just meant that we have an option to make it an mx easily [16:42:30] ya [16:43:00] Jeff_Green, you send the mail ? [16:43:36] ya will do [16:44:19] ok. [17:08:44] Wikimedia Labs / tools: Block TweetmemeBot UA - https://bugzilla.wikimedia.org/71120 (Tim Landscheidt) a:Marc A. Pelletier>Yuvi Panda [17:27:04] Jeff_Green: Just sent (what I think is) a better solution to labs-l [17:28:24] Including a correction so that my second paragraph makes sense. :-) [17:29:58] Jeff_Green: A fix that'd work for you is, literally, a four-line diff in manifests/role/mail.pp [17:30:23] (Maybe 5) :-) [17:40:39] Jeff_Green: mail_full_mx [17:40:45] Jeff_Green: I mean, https://gerrit.wikimedia.org/r/165249 [18:01:40] Coren: :-) [18:12:07] Yeah, Faidon didn't like that (not without reason). [18:12:20] It /was/ a hack; though I felt it was a reasonable one.
:-) [18:20:09] !ping [18:20:09] !pong [18:24:29] !log integration /var/lib/jenkins-slave/tmpfs 100% full on gallium [18:24:32] Logged the message, Master [18:49:35] ^d: I've made sure the plugin upgrades (https://gerrit.wikimedia.org/r/#/c/164633) are ready by syncing them to beta. I haven't tried using elasticsearch 1.3.4 yet. that is next on my list [18:50:05] <^d> deployment-elastic01 is already running .4 [18:50:14] <^d> I was testing the .deb upload to apt.wm.o [18:51:28] sweet [18:51:33] bouncing it will get the new plugin [18:52:59] <^d> deployment-elastic01 experimental highlighter 0.0.12 j [18:52:59] <^d> deployment-elastic01 wikimedia-extra 0.0.1 j [18:53:10] <^d> (among others, obvs) [19:46:35] YuviPanda: meh, the phab email format is complete crap to parse :-( [19:47:11] valhallasw`cloud: I still think proper way is to patch upstream... [19:47:16] they already have an IRC bot... [19:47:43] YuviPanda: well, not really. It's marked as 'experimental' and 'an example of how you could use the API' [19:47:56] sure, so we should fix it and do things with it :) [19:48:06] rather than go phab -> email -> email list -> email in -> redis -> python... [19:48:31] well, email is still a great pub sub method ;-) [19:48:43] and one thing that's unclear to me is where the irc bot would run [19:48:43] tch tch ;) [19:48:53] as far as I can see, it's supposed to run on the same host as phab [19:48:55] which is not ideal [19:49:02] why not [19:49:13] because changes will then take a gazillion years [19:49:20] also getting it back up when it crashes [19:49:25] remember the old wikibugs?
:-p [19:49:29] ah, that :) [19:49:42] also from a security perspective it's not ideal [19:50:08] hmm, true [19:51:00] although it seems to run completely over the Conduit api, so maybe it just needs a phabricator checkout (not so much to be on the same server) [19:53:11] also PHID's everywhere [19:53:36] but this is the main code: https://secure.phabricator.com/diffusion/P/browse/master/src/infrastructure/daemon/bot/handler/PhabricatorBotFeedNotificationHandler.php [19:59:08] Phabricator appears phine, but I phear its phanciphul naming will pheel old aphter a while. [20:08:54] I'm not sure how I'm going to run this on the grid, but let's try. [20:26:43] YuviPanda: I'd also be happy with getting that bot to dump stuff in redis :-p [20:26:55] yeah, but it'll dump them in *prod* redis ;P [20:32:02] Coren, there's a whole list of mail.tools.wikibugs tasks in error state on the wikibugs account (state Eqw). Anything I can do about that, and maybe anything you can do about that? ;-) [20:33:39] valhallasw`cloud: Well, I can blindly clear the errored out jobs; but that's not going to be useful unless the maintainer looks at what happened and possibly makes corrective measures (or decides the jobs can be done away with) [20:33:58] Coren: I am the maintainer. I'm not sure how to check what went wrong :-p [20:35:09] valhallasw`cloud: Well, qstat the jobs; any error would show. But also, you may want to check the jobs' own error logs. [20:35:49] Coren: except they are mail jobs, so they don't have any output. [20:36:07] qstat says error reason 1: can't get password entry for user "tools.wikibugs". Either the user does not exist or NIS error! [20:36:54] Ah, those must date from when LDAP was throwing fits. You can simply clear the error state. [20:37:32] Or just delete them if they are not useful anymore. [20:37:51] (clear error state: qmod -cj [20:39:30] qdel is probably fine -- those are old emails [20:54:43] bd808: Hm.. 
that !log about gallium tmpfs should've gone to production log in -operations, or !log for RelEng/QA in -qa, not the 'integration' project in labs. [20:55:40] hashar was calling me off on it as well earlier. We shouldn't use the log for 'integration' and 'deployment-prep'/'beta' anymore as those aren't typical labs projects, they're usually of interest to prod or qa in general. https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:55:45] ah, I see you found that already :) [20:55:49] I did it here and in -qa [20:56:38] I've been duping some stuff but yeah I should just stick with -qa [21:04:35] bd808: YuviPanda: ganglia in labs, over or not over? It seems to be taking up massive amounts of RAM on integration slaves. [21:05:02] * bd808 doesn't know [21:05:08] ganglia is dead, and there are no current plans to bring it back [21:05:11] you can kill it [21:06:52] YuviPanda: how? [21:07:19] I've no idea :| [21:07:29] Krinkle: I guess ganglia collectors are defined in integration roles? [21:08:01] Krinkle: ah, I see [21:08:07] Krinkle: it is included in standard [21:08:15] Yes. [21:08:31] I guess I can put a realm guard around that... [21:19:23] YuviPanda: thx [21:21:55] Krinkle: yw. now to get that merged... [21:30:16] Wikimedia Labs / Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://bugzilla.wikimedia.org/71761 (Krinkle) NEW p:Unprio s:normal a:None The gmond process seems obsolete since the aggregator is down. ganglia.wmflabs.org is still up, but no longer being populated.... [21:30:30] Wikimedia Labs / Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://bugzilla.wikimedia.org/71761 (Yuvi Panda) a:Yuvi Panda [21:52:39] Hm...
that's concerning [21:52:52] YuviPanda: lol, clean brand new instance: not used for anything yet (not pooled) [21:52:53] https://graphite.wmflabs.org/render/?width=900&height=500&from=-4h&target=integration.integration-slave1009.memory.MemFree.value [21:52:57] down to the river we go [21:53:14] Krinkle: MemFree is a terrible metric, though [21:53:17] https://graphite.wmflabs.org/render/?width=900&height=500&from=-7h&target=integration.integration-slave1009.memory.MemFree.value [21:53:25] YuviPanda: OK [21:53:28] Got a better one? [21:53:32] let me see what you need to look at [21:53:33] moment [21:54:19] Krinkle: MemTotal - (Active + Buffers + Cached) [21:54:20] I think [21:54:57] MemTotal is total memory available, constant value [21:55:21] oh, I see [21:55:29] (PS1) BearND: Update build script for Gradle [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165375 [21:55:31] (PS1) BearND: Need JAVA_HOME for Gradle [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165376 [21:55:33] (PS1) BearND: Expand wildcards when copying apk [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/165377 [21:55:44] YuviPanda: There's no subtract method in graphite is there.. [21:56:18] Krinkle: diffSeries [21:56:28] Can take two or more metrics, or a single metric and a constant. Subtracts parameters 2 through n from parameter 1. [21:56:52] YuviPanda: I don't need subtract though [21:56:57] just adding active,buffer,cached will do [21:56:58] hmm? [21:57:02] I'm trying to draw a stacked graph [21:57:03] hmm, that shall do too, yeah [21:57:08] ah, hmm [21:57:09] ok [21:57:11] stacked() doesn't work [21:57:22] sum() seems to work, but displays them as one new value, not stacked areas [21:58:36] oh, are you drawing them directly via graphite? [21:59:05] YuviPanda: do I have another option?
[21:59:09] https://graphite.wmflabs.org/render/?width=900&height=500&from=-6h&target=sum(integration.integration-slave1009.memory.{Active,Buffers,Cached}.value)&areaMode=stacked [21:59:35] Krinkle: well, my preferred way is to add &format=json, get the points in json, and plot them with a sane library [21:59:53] graphite's graphs aren't what I'll call nice [22:00:12] I agree. but I just want monitoring so I can keep the integration slaves healthy. [22:00:16] Can I help your efforts instead? [22:00:22] I don't want to reinvent [22:01:02] hmm, I'm currently setting up shinken, but perhaps I should setup the graphing first, and then shinken... [22:01:11] I guess lots of people would find that order more useful [22:02:29] I basically just want simple graphs that show me: cpu, memory and disk usage. And a set for each node and one for all nodes in a group. E.g. like http://ganglia.wikimedia.org/latest/?r=hour&c=Bits%2520caches%2520eqiad (the last hour memory graph) [22:02:49] and then one for each node as well. Just plain graphs like that so I can see what's going on. [22:03:18] and ideally alerts as well (like you set up), which should be pretty straightforward thanks to the infra you put in place via prod icinga [22:03:34] but right now they don't measure cpu and memory in a useful way [22:03:59] YuviPanda: Use the new grafana module that ori made? [22:04:18] bd808: yeah, but I'm not too much of a fan of grafana [22:04:21] but I'll give it a shot... [22:04:39] Krinkle: ok, so I'll setup grafana (you can try it out for prod at grafana.wmflabs.org) and see if it's useful [22:04:45] heh. what we need is yet another graphite front end.... [22:04:47] bd808: from what I've seen that's similar to regular graphite, it just makes it easier to write the function calls and store queries in a persistent dashboard. It's still limited to the functions graphite supports I think.
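Yuvi's diffSeries/sumSeries suggestion and the `&format=json` trick above combine into a single render request. A sketch that just composes the URL; the graphite host and metric prefix are the examples from this conversation, not guaranteed current:

```shell
#!/bin/sh
# Compose a Graphite render URL for "used" memory, computed server-side as
# MemTotal - (Active + Buffers + Cached). format=json returns raw datapoints
# for plotting with your own library instead of Graphite's PNG renderer.
HOST="https://graphite.wmflabs.org/render/"
P="integration.integration-slave1009.memory"
TARGET="diffSeries(${P}.MemTotal.value,sumSeries(${P}.{Active,Buffers,Cached}.value))"
echo "${HOST}?from=-6h&format=json&target=${TARGET}"
```

Piping the printed URL to `curl -s` then yields JSON datapoints that any charting library can draw, which sidesteps Graphite's built-in renderer entirely.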
[22:06:01] It renders client side I think, but I haven't worked with it deeply
[22:06:22] bd808: I'd want to autogenerate a set of graphs for most common metrics (to match ganglia, at least) per project by default
[22:06:25] I don't know if grafana does that
[22:06:31] I have worked with graphite directly a ton and there are very few things I've wanted that it can't be tricked into doing.
[22:06:38] plus the fact that it stores config in ElasticSearch(!?!) seems ugh
[22:07:14] YuviPanda: heh. I wrote a php library for doing that (replacing ganglia) at $DAYJOB-1
[22:07:22] heh
[22:07:28] I guess that isn't open source...
[22:07:48] https://github.com/bd808/graphite-graph-php
[22:08:01] I know nothing about monitoring, I probably used all the wrong metrics, but I wrote this in little over an hour https://github.com/wikimedia/integration-docroot/blob/master/org/wikimedia/integration/monitoring/index.php
[22:08:36] seems with a little more knowledge from someone who 1) knows graphite, 2) knows what to measure, one can come a long way
[22:08:40] YuviPanda: It is embedded into http://opennetadmin.com/ at least at Kount
[22:08:46] heh
[22:10:25] Krinkle: Yup. My library is basically a helper for doing that kind of stuff. You can use it as a DSL or to read from ini files -- http://bd808.com/graphite-graph-php/
[22:11:35] We hooked it into our network monitoring system so that when a host was added it would generate the right ini files and display the graphs on the host's page
[22:12:16] hm.. labs graphite is no longer responding
[22:12:27] Krinkle: I can spend time tomorrow figuring out if grafana is going to be good enough, or if we'll have to do something else
[22:13:21] Krinkle: hmm, you're right.
[22:13:26] Krinkle: the labs general proxy seems down
[22:13:28] andrewbogott: ^
[22:13:37] YuviPanda: OK. I'll do some research as well. One question: What's the name of the thing we use to collect these metrics? e.g. memory.MemTotal -- the names of those and their values, is that a standard of sorts? I assume that's not built-in into graphite
[22:13:56] andrewbogott: diamond is what we use to collect the metrics
[22:14:03] andrewbogott: https://github.com/BrightcoveOS/Diamond
[22:14:04] err
[22:14:05] Krinkle: https://github.com/BrightcoveOS/Diamond
[22:14:16] Krinkle: inside 'collectors' you can see the code that does the actual collection
[22:14:20] and see how exactly it gets the values
[22:14:37] andrewbogott: dynamicproxy-gateway the box seems dead. can you try logging in with root key?
[22:14:45] YuviPanda: thx
[22:15:11] YuviPanda: Why does the build.py script in wikipedia-android-builds only sometimes produce output to build.out/err? Often both files are empty. I'm trying to figure out why it doesn't publish the apk to the web site.
[22:15:34] YuviPanda: I'm looking… probably I'll just reboot it though
[22:16:03] andrewbogott: ok
[22:16:12] bearND: oh, it doesn't?! :|
[22:16:36] bearND: can you try running the jsub command from cron by hand to see if it runs?
[22:17:27] !log projectproxy dynamicproxy-gateway is dead, unpingable and not proxying requests anymore
[22:17:28] projectproxy is not a valid project.
[22:17:49] !log proxyproject dynamicproxy-gateway is dead, unpingable and not proxying requests anymore
[22:17:50] proxyproject is not a valid project.
[22:18:07] YuviPanda: Yes. I usually cd to wikipedia and then do a git reset --hard HEAD^ so there would be a new commit to pull
[22:18:42] bearND: hmm, so it works if you do it with jsub, but not when done via cron?
[22:19:30] YuviPanda: i've tried both but I haven't seen any consistent behavior when it produces output to the files
[22:21:04] YuviPanda: I just ran it manually and verified that a job was running with job -v build; said: Job 'build' has been running since 2014-10-07T22:18:27 as id 4632081
[22:21:23] bearND: did it produce the files?
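For reference, Diamond's collectors essentially read a system source such as /proc/meminfo and publish one value per Graphite path, which is where names like memory.MemTotal come from: they mirror the kernel's field names rather than any cross-tool standard. A standalone Python sketch of that parsing step (this is not Diamond's actual code, and the sample input below is made up):

```python
def meminfo_metrics(meminfo_text: str, prefix: str = "memory") -> dict:
    """Turn /proc/meminfo-style lines into Graphite-ready metric names
    like 'memory.MemTotal', roughly what a memory collector publishes."""
    metrics = {}
    for line in meminfo_text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if not parts:
            continue
        # /proc/meminfo reports most fields in kB; keep the raw number here
        metrics[f"{prefix}.{key.strip()}"] = int(parts[0])
    return metrics

# Fabricated sample input in /proc/meminfo's format
sample = "MemTotal:  2048000 kB\nMemFree:   512000 kB\nCached:    256000 kB\n"
print(meminfo_metrics(sample))
# {'memory.MemTotal': 2048000, 'memory.MemFree': 512000, 'memory.Cached': 256000}
```

The real collectors in Diamond's 'collectors' directory follow the same pattern but publish each value through the daemon instead of returning a dict.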
[22:21:32] YuviPanda: files are empty
[22:21:39] bearND: no apk file yet?
[22:21:48] bearND: just do a 'qstat', see what jobs are running?
[22:22:29] YuviPanda: qstat says only lighttpd is running
[22:22:43] ah, hmm
[22:22:48] bearND: did you add -mem 8G parameter to jsub?
[22:23:23] YuviPanda: yes: jsub -mem 8G -quiet -once
[22:23:49] YuviPanda: used the same command line as in crontab
[22:23:57] andrewbogott: btw, tools proxy is dead as well
[22:24:06] bearND: hmmm, I'm unsure :| can you file a bug and assign it to me?
[22:24:15] 3Wikimedia Labs / 3deployment-prep (beta): Beta: Cannot save any page "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764#c2 (10Krinkle) a:3None https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/4588/ https:/...
[22:24:20] YuviPanda: will do
[22:24:27] bearND: thanks
[22:24:44] 3Wikimedia Labs / 3deployment-prep (beta): Beta: Cannot save any page "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764 (10Krinkle) p:5Unprio>3Normal s:5normal>3critic
[22:24:54] bah, fucking logout-back-in bug
[22:25:03] YuviPanda: tools-webproxy?
[22:25:08] andrewbogott: yeah
[22:25:14] andrewbogott: general proxy is still dead
[22:25:15] as well
[22:26:00] andrewbogott: I issued a reboot of tools-webproxy
[22:26:07] It's the host I think.
[22:26:43] of course, I can't look at graphite stats since that is served by labs proxy
[22:26:59] andrewbogott: general proxy says still 'rebooting'?
[22:28:48] andrewbogott: Looks like toollabs is down.
[22:29:19] :(
[22:29:22] kaldari: underlying host seems a bit dead
[22:29:26] he's looking into it
[22:29:45] bearND: ^ this might be a cause as well. we're looking into it...
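For context, the jsub invocation being debugged above would normally run from the tool's crontab. A hypothetical entry is sketched below; only the jsub flags come from the log, while the schedule and script path are illustrative. By default the grid engine writes the job's stdout and stderr to <jobname>.out and <jobname>.err in the tool's home directory, which is where the empty build.out/build.err files come from.

```shell
# Hypothetical crontab entry for the wikipedia-android-builds tool.
# Flags as quoted in the log: -mem 8G (memory limit), -quiet (suppress
# informational output), -once (don't submit if the job is already running).
# The schedule and the script path below are illustrative, not from the log.
0 * * * * jsub -mem 8G -quiet -once /data/project/wikipedia-android-builds/build.py
```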
[22:34:06] YuviPanda: ok, thanks
[22:50:05] !log deployment-prep updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061
[22:50:08] Logged the message, Master
[22:52:12] http://en.wikipedia.beta.wmflabs.org/ seems to be dead :( i guess y'all know that.
[22:52:27] cscott: yah, see topic
[22:52:31] lotsamachines are dead
[22:54:32] 3Wikimedia Labs / 3tools: jsub command produces empty output/error files - 10https://bugzilla.wikimedia.org/71766 (10Bernd Sitzmann) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier To repro: login to tools and become wikipedia-android-builds # note output of: ls -l build.* -rw-rw---- 1 tools.wikiped...
[22:54:59] 3Wikimedia Labs / 3tools: jsub command produces empty output/error files - 10https://bugzilla.wikimedia.org/71766 (10Bernd Sitzmann) a:5Marc A. Pelletier>3Yuvi Panda
[22:55:22] WTF?
[22:55:36] YuviPanda: filed the bug for you
[22:55:46] bearND: thanks
[22:55:52] MarcosDias: ?
[22:56:09] this whitelist as problem?
[22:57:12] MarcosDias: hmm? which whitelist?
[22:58:11] https://huggle.wmflabs.org/data/wl.php?wp=pt.wikipedia&action=display
[22:58:56] MarcosDias: ah, you want petan
[23:00:29] yes
[23:01:04] MarcosDias: there's also the #huggle channel, which might be more useful
[23:01:14] thanks!
[23:19:44] YuviPanda: I just restarted the proxy gateway -- look like it's working to you?
[23:19:57] andrewbogott: ya
[23:20:06] ok, lemme set up a batch to restart everything else
[23:20:09] ok
[23:20:14] andrewbogott: tools-webproxy as well before the batch?
[23:20:27] YuviPanda: lemme try
[23:36:55] YuviPanda: May I delete instance 'boiledegg'? Just to give virt1005 some breathing room?
[23:37:04] andrewbogott: ya
[23:37:08] (That instance chosen at random due to a seemingly transitory name)
[23:37:09] thanks.
[23:37:31] andrewbogott: although, I was naming everything under the design project with 'what did I have for breakfast?'
[23:37:35] so not very transitiony :D
[23:37:45] Well, wait --
[23:37:53] so if that instance is actually good for something then I will not delete it :)
[23:38:02] andrewbogott: no, it isn't being used atm.
[23:38:11] andrewbogott: and hasn't been for a while.
[23:38:13] (I use 'breakfast foods' for a naming scheme too, but only for disposable things usually)
[23:38:18] ah :)
[23:38:18] ok, great! Killing...
[23:38:39] !log design deleting instance 'boiledegg'
[23:38:41] Logged the message, dummy
[23:38:54] matanya: may I delete instance 'etherpad-matanya'?
[23:41:02] anybody was restarting servers?
[23:41:52] Danny_B|webgate: yes, info sent to labs-l
[23:42:59] bd808: is labs instance bd808 a disposable/defunct instance, or still useful?
[23:43:51] andrewbogott: You can kill it. I built docker support for mw-vagrant there and haven't touched it in over a month
[23:44:17] * bd808 will recreate via a proper role some day
[23:44:25] hi all - I am getting the pieces together for this writeup on serving tiles for maps... most of the major looking done.. the other day I saw mention of a VM here doing some experimental tile serving ? any hints on that ?
[23:44:43] bd808: excellent, thanks!
[23:46:10] andrewbogott: by "died" is meant data are lost too?
[23:46:28] !log mediawiki-core-team deleting instance bd808 because bd808 said I could.
[23:46:30] Logged the message, dummy
[23:46:48] Danny_B|webgate: As I said in the email, your instance will experience the equivalent of an unexpected reboot.
[23:47:02] There shouldn't be any dataloss, other than immediate running-state data.
[23:47:13] your instances should be back up and running by now.
[23:47:52] ok. when we say die it means the disc is unrecoverably dead, so i was verifying...
[23:48:21] <^demon|brb> andrewbogott: On that note from !log...could you delete chad-test too? It's saying "The requested host does not exist" when I try.
[23:48:22] otoh, i can't connect to freenode, but i assume that's rather their issue with their settings
[23:48:43] ^demon|brb: sure, I'll try.
[23:50:10] ^demon|brb: done, I think...
[23:50:36] <^demon|brb> Looks like it, thx
[23:51:57] was the ip of the labs changed?
[23:53:54] Danny_B|webgate: nah, shouldn't have been any changes really.
[23:53:57] Just reboots.