[00:02:11] 3Wikibugs, Wikimedia-Fundraising: Wikibugs bot is skipping many notifications - https://phabricator.wikimedia.org/T88747#1019780 (10awight) Awesome work, thank you! [00:43:42] 3Wikimedia-Fundraising-CiviCRM, Labs, Wikimedia-Fundraising: Create new labs project: fundraising-integration - https://phabricator.wikimedia.org/T88599#1019884 (10hashar) Or you can get the instance created in your labs project :-]  Might just need to add a security rule to your project to allow ssh from the J... [01:23:39] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<44.44%) [01:33:36] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK [01:44:37] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<33.33%) [02:16:56] I'm learning bash and have figured out how to get the job number from qstat using |awk '{print $1}'|grep '[0-9]\{7\}' [02:17:40] Is there a better way to get that number and how do I pass that to qstat -j #### [02:21:15] T13|sleeps: do you need it after you have just submitted the job, while running in the job or for a second script? [02:21:35] While the job is running. [02:23:01] I want to create an alias or script that checks if `mem=` > 1000.000 GBs and then if it is to run `webservice restart` [02:23:29] the job gets some environment variablen on startup which laos contains the job number and the tasj number [02:24:53] $JOB_ID current job ID [02:24:53] $JOB_NAME current job name [02:24:53] $TASK_ID array job task index number [02:25:07] T13|sleeps, ping [02:25:28] Cyberpower678: pong [02:25:43] Fixed. [02:25:47] For the most part/ [02:25:51] I've been doing damage control on enwp all day... [02:26:12] The tools are all working again. [02:27:00] Merlissimo: how do I query that $JOB_ID number? [02:27:06] For some reason, I just can't get the login sessions to carry over to xtools-ec and xtools-articleinfo, but they stay active on xtools and xecho is now fixed [02:27:13] T13|sleeps, ^ [02:27:51] Or can I just pass $JOB_ID to qstat -j $JOB_ID directly Merlissimo? [02:28:24] Cyberpower678: Is there currently any use for the login feature? [02:28:38] XEcho [02:28:57] What does that tool do that needs login? [02:29:11] Because it is a global echo tool [02:29:49] $JOB_ID contains the value, so it should work. [02:30:58] Thanks. I'll try it out tomorrow. Do you know if that in the qstat man page [02:32:06] Have you tried Echo yet? [02:32:13] I haven't [02:32:20] to be source you may better use $JOB_ID.$TASK_ID which should work for alls jobs including array tasks [02:32:25] Try it now!! [02:32:51] Cyberpower678: I'm sleeps. :p [02:33:05] I's only 9:30pm. :p [02:33:20] Merlissimo: I was thinking of trying it both ways actually. [02:33:52] Cyberpower678: it's 21:34 and I'm an old fart. [02:34:38] I thought you were in your 30s? [02:35:12] I am. [02:35:27] THEN YOU'RE NOT AN OLD FART!! [02:35:27] Makes me an old fart [02:35:30] :p [02:35:34] :p [02:35:46] I'm over 35... old enough.. [02:35:57] You know what an old fart is? [02:36:01] 80s [02:36:11] Like my grandparents [02:36:25] They're not farts though. [02:36:34] ... [02:38:05] it is 03:37 a.m. [02:38:07] T13|sleeps, ... [02:38:46] T13|sleeping too Cyberpower678? [02:38:58] No. [02:39:08] * T13|sleeps wonders why he pinged himself... 
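A minimal sketch of the restart check discussed above (02:16–02:32), assuming the script runs as the tool account, that the webservice job is the only seven-digit job id `qstat` reports, and that `qstat -j` prints a usage line containing `mem=<value> GBs`; the 1000 GBs threshold mirrors the value quoted in the conversation, and inside a running job `$JOB_ID` (or `$JOB_ID.$TASK_ID` for array tasks) can replace the qstat lookup entirely:

    #!/bin/bash
    # Hypothetical helper, not the tool's actual script: find the webservice
    # job, read its cumulative mem usage from qstat -j, restart if too high.
    jobid=$(qstat | awk '{print $1}' | grep -m1 '[0-9]\{7\}')
    [ -z "$jobid" ] && exit 0                      # nothing running

    # the "usage" line of qstat -j looks like: ... mem=1234.56780 GBs, io=..., vmem=...
    mem=$(qstat -j "$jobid" | sed -n 's/.*[ ,]mem=\([0-9.]*\).*/\1/p' | head -n1)

    if awk -v m="$mem" 'BEGIN { exit !(m > 1000.0) }'; then
        webservice restart
    fi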
[02:39:09] * Cyberpower678 doesn't go to bed until 1:00 am [02:39:25] Cyberpower678: is a yungin' [02:39:42] * Cyberpower678 will be 21 on the 25th [02:39:53] Kids... [02:40:00] Anyways... [02:40:07] 21 is hardly a kid. [02:40:25] T13|sleeps, have you tested xEcho now? [02:40:52] I did want to discuss how we (xtools) announces changes and expected downtime to the community. [02:41:12] T13|sleeps, xecho [02:41:12] T13|sleeps, xecho [02:41:12] T13|sleeps, xecho [02:41:13] T13|sleeps, xecho [02:41:16] :p [02:41:44] This 'meh, we're patching things and making fixes' when people complain is a bad plan. [02:42:04] We can create a mailing list. [02:42:15] Once we get xtools.wmflabs.org [02:42:15] I'll look at xecho tomorrow. [02:42:46] Fair enough. Should have an onwiki mailing list for MMS notices too. [02:43:31] Maybe even global for MMS. [04:49:34] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK [06:58:01] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [07:22:55] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [08:54:21] 3Labs: Fix documentation & puppetization for labs NFS - https://phabricator.wikimedia.org/T88723#1020329 (10faidon) p:5Triage>3High [08:54:43] 3Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1020330 (10faidon) p:5Triage>3Normal [09:55:37] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%) [10:00:37] 3Labs: Ensure that opsen are paged on failure of labstore1001's NFS service - https://phabricator.wikimedia.org/T76402#1020376 (10faidon) p:5Triage>3High [10:05:35] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK [10:05:55] 3Wikimedia-Labs-extdist: Create new extdist instance - https://phabricator.wikimedia.org/T88787#1020378 (10Legoktm) 3NEW [10:06:37] 3Wikimedia-Labs-extdist: /var is running out of space on extdist2 - https://phabricator.wikimedia.org/T72952#1020385 (10Legoktm) 5Open>3declined T88787 should fix this. 
[10:41:36] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%) [10:56:36] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK [13:51:01] hey yo why are the bastion hosts so slow [14:10:36] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:11:39] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:11:40] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:12:06] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:12:47] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:12:59] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:13:35] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:14:07] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [14:14:49] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:15:23] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:17:46] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:17:56] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:17:56] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:18:58] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:19:14] that's all from the sysctl thing, it's already fixing itself slowly [14:19:26] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [14:20:04] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:21:11] Bleh. Bad puppet. [14:21:48] bad paravoid [14:21:54] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [14:22:42] paravoid: Do you already have a grip on what broke? [14:22:49] Ah, I see it's already fix't. 
[14:23:04] yes [14:35:01] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:49] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [14:40:22] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:40:35] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0] [14:41:35] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [14:41:43] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:13] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:47] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:47] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:57] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [14:42:59] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [14:43:35] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0] [14:43:55] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [14:44:07] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0] [14:44:23] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [14:46:53] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:47:57] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:28:34] 3Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1020662 (10dnaber) 3NEW [15:29:17] 3Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1020669 (10dnaber) [15:53:44] 3Tool-Labs: log files not written - https://phabricator.wikimedia.org/T85775#1020681 (10scfc) If I look at the logs for that job (grep `/var/lib/gridengine/default/common/accounting` for the job numbers and feed them to `qacct -j`), they show that the jobs exited with status 143, which is 128 + 15, with 15 being... [15:58:37] YuviPanda, Nemo_bis: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Developing#Shared_files says /shared is supposed to have a full checkout of MW core and extensions, presumably in /shared/mediawiki, which is owned by your mediawiki-mirror tool. But there is no core checkout and the code in /shared/mediawiki/git.sh to do it is commented out? [15:59:40] anomie: The contents of /shared is, technically, community-maintained. I'm not sure who originally provided a checkout of core there, tbh [16:00:12] Coren: Since it's owned by the mediawiki-mirror tool and not world-writable, I thought it best to ping the maintainers of that tool [16:00:34] That does sound like a good bet. :-) [16:02:16] I was about to go check that uid owned it to be helpful; I should have realized that you own a clue and would have started with that yourself. :-) [16:02:43] anomie: I think I never cloned core there [16:03:01] Let me add you to maintainers [16:03:17] (I would have made the directory world-writable but others thought better not) [16:03:28] Yeah, world-writable is probably bad. 
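A rough sketch of the post-mortem scfc describes in T85775 above, with `1234567` as a placeholder job number; the accounting path is the one quoted in the task, and an `exit_status` above 128 means the job died from a signal (status minus 128), so 143 is SIGTERM:

    # was the job written to the accounting file yet? (fields are colon-separated)
    grep -c ':1234567:' /var/lib/gridengine/default/common/accounting

    # summarise the finished job; exit_status, failed and maxvmem are the interesting fields
    qacct -j 1234567 | egrep 'exit_status|failed|maxvmem'

    # 143 - 128 = 15; confirm which signal that is
    kill -l 15        # prints TERM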
[16:04:36] anomie: try now [16:04:45] Is there some sensible service group I can add to maintainers? [16:04:53] Nemo_bis: I have access. Which, I guess, means I need to fix it myself :/ [16:07:46] 3Tool-Labs: log files not written - https://phabricator.wikimedia.org/T85775#1020683 (10dnaber) JVMs can also request memory after start, depending on how you start them. I've now added `-Xms100M` to make sure they start with the maximim amount, i.e. no more memory requests. [16:08:16] :P [16:08:21] I don't use core [16:16:07] 3Tool-Labs: log files not written - https://phabricator.wikimedia.org/T85775#1020688 (10coren) Another thing that may help with memory is to use -jamvm, a different jvm implementation that is //considerably// more frugal with its memory allocation. [16:16:53] Coren, what's the verdict? [16:17:43] Cyberpower678: inspecting the tool's directory for credential is on my todo for after lunch; if everything checks out I'll add you to it. [16:18:11] Thanks. :-) [16:18:50] Coren, I also noted you put backlog on the request for our own xtools project? When can we expect a project of our own. [16:19:41] 'backlog' means only that "it's triaged, needs to be done" as opposed to "is being done atm" [16:20:01] I need to talk with andrewbogott_afk about it, but there should be no obstacle. [16:26:18] How much can we have allocated to us? [16:32:43] How much of what? [16:33:10] Easier yet is to actually list your requirements on the ticket. :-) [16:36:12] Hey yo guys why does the bastion server take so much time to load :/ [16:38:13] canaar: I'm not sure what you mean by "time to load" but there doesn't seem to be any outages at the moment. What issue are you running into? [16:39:28] that website bastion.wmflabs.org doesnt load [16:39:56] Btw, i am a beginner, can you help me get started? [16:41:12] bastion.wmflabs.org isn't a web site - it's a bastion for logging into with SSH. https://wikitech.wikimedia.org/wiki/Help:Access has more detailed instructions to help you get started, but if you run into specific issues just ask. [16:43:54] Okay thank you. [18:12:51] YuviPanda: can you restart https://tools.wmflabs.org/robin/?tool=uploadconfig ? [18:21:03] Coren: ^ [18:21:12] Nemo_bis: I think I'm on vacation now [18:21:20] YuviPanda: You should be. Go 'way. [18:21:30] :D [18:21:32] Yessir [18:22:23] Nemo_bis: 'tis up [18:28:09] Thanks [18:28:22] YuviPanda: good to know. :P [18:29:11] I wonder if it'd be a good idea to provide a "try to start the webservice" button maintainers could add to the error page. Hmmm. [18:29:51] Coren: with a simple rate-limiter it's probably fine [18:30:15] Well, it'd only be available at all in the first place if the service is already down. [18:30:59] right, not when it's in 'starting up' state, I guess. Although that edge case is currently also a bit awkward in terms of what the user sees [18:33:50] Coren, +! [18:33:51] ! [18:33:52] hello :) [18:33:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [18:33:54] ! [18:33:54] hello :) [18:33:56] 1 [18:34:05] dammnit I can't type. [18:36:30] MusikAnimal, I'm going to have to reduce my activity. College hectic is picking up. [18:36:49] I can understand that [18:36:54] don't let anything get in the way of your studies! [18:37:13] * Coren stares at tools-exec-15. [18:37:21] Why you no puppet on your own, dawg? [18:37:24] Cyberpower678: we're at stable point right now, code-wise, right? 
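Putting the two memory suggestions from T85775 together as a hedged example rather than the task's actual setup: `-Xms`/`-Xmx` pin the heap so the JVM does not grow it later, `-jamvm` selects the lighter JamVM implementation where it is installed alongside OpenJDK, and the job name, jar path and 500m grid limit below are placeholders:

    jsub -once -N feedcheck -mem 500m \
        java -jamvm -Xms100m -Xmx300m -jar /data/project/mytool/feedcheck.jar
    # keep -Xmx comfortably below the -mem limit, since the grid accounts for the
    # whole process (JVM overhead included), not just the Java heap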
[18:37:36] I'm going to try to stay active while I wait for Coren to find time to grant us xtools.wmflabs.org [18:38:18] Cyberpower678: Did you add your resource requirements to the ticket? [18:38:27] MusikAnimal, semi-stable? [18:38:45] well that's better than unstable [18:39:05] Coren, not yet. What resources do tool labs tools get? [18:39:49] Technical 13 knows better than I do because he's more active in that regards. [18:40:07] My goals have been to get the PHP to work. [18:40:11] Cyberpower678: They're not allocated the same way at all; with a project you get quota (cores, ram, disk space) which you then dole out as needed in instances. [18:40:44] if you're going off of instance type I'd say an m1.large [18:41:45] Dah, I guess with multiple instances m1.small. I don't know [18:42:10] I don't think we'd need much for storage, right? More about RAM and CPU power [18:43:15] I'm guessing a project has some sort of load balancer to divvy out traffic to all the instances? [18:43:45] Considering the large loads xtools gets and sometimes the heavy work it does, I'd say 4 cores, 10GB RAM, and 4 GB of Disk space. Does that sound unreasonable? [18:43:49] Coren, ^ [18:43:50] MusikAnimal: It has only exactly what you install and configure on it. If you want load balancing, you'll need a load balancer. :-) [18:44:32] Cyberpower678: that's about a m1.large which by itself I think is enough [18:44:35] Cyberpower678: That's actually sorta tiny, and will really not be an issue. [18:44:45] oh good [18:44:57] Coren, then let's double that. :D [18:45:09] or two m1.large's and a load balancer [18:45:22] that's got to be plenty [18:45:39] MusikAnimal, it will also allow for expansions, if we wanted. [18:46:03] But we also need to setup replication on that project right Coren |? [18:46:26] "replication"? [18:46:43] Do we have access to the replicas on our own project? [18:47:00] DBs [18:47:05] sorry. [18:47:08] Coren, ^ [18:47:13] Oh, yes, though you do have to either use the IPs directly or copy over the host file from tools. [18:47:39] Otherwise, it works the same and service groups get their credentials the same way. [18:47:41] Definitely the host file. [18:48:56] RECOVERY - Puppet staleness on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:50:31] Coren, when did you say you can create this? [18:51:09] Well, I wanted to talk with andrewbogott first but your requirements, after all, are low enough that I see no reason to delay. [18:51:41] What are the typical requirements of a project? [18:52:16] Coren: two large instances and a small one? Or are there other things at stake? [18:52:23] * andrewbogott should read the backscroll but is multitasking [18:52:34] andrewbogott: Nothing else at stake. It's a small project. [18:52:43] yeah, totally fine then [18:52:49] andrewbogott: And it's two small /or/ a large afaik [18:53:08] * Cyberpower678 high fives Coren, andrewbogott, and MusikAnimal [18:53:13] +1 [18:53:37] Cyberpower678: Your project, she exists. You may have to log off then back on wikitech to manage it though. [18:54:41] I see it. :DD [18:57:29] do we have to install everything on it ourselves? like the environment and PHP dependencies, etc? [18:57:34] MusikAnimal, I have added you. [18:57:57] thanks! [18:57:58] MusikAnimal: The price of freedom. You almost certainly want to puppetize your setup, too, so that it's easy to recover in case of issues. [18:58:29] How do I SSH into it? 
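What "copy over the host file" amounts to in practice for the new project, assuming the usual `*.labsdb` aliases from tools and the per-tool `replica.my.cnf` credentials; the alias and address below are illustrative, so copy the real lines from a tools host rather than these:

    # on any tools host: list the replica aliases worth copying
    grep labsdb /etc/hosts

    # on the new xtools instance: append the lines you need (address is a placeholder)
    echo '10.0.0.1  enwiki.labsdb' | sudo tee -a /etc/hosts

    # credentials work the same way for service groups
    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p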
[18:58:57] Coren, ^ [18:59:18] Cyberpower678: Well, once you actually create instances, you'll be able to ssh into them from the bastions. :-) [18:59:40] Cool. How is that done? [18:59:43] :p [19:00:20] From wikitech; look in your sidebar under 'Labs Projectadmins' [19:02:20] I have 51200 RAM. What units are they in? [19:03:13] Megs, by default [19:03:50] Cool. What's our disk space? [19:06:09] Cyberpower678: That depends on the instance size. IIRC, it should be around 120G of local storage, but you definitely want to use /data/project for anything you want to keep. Remember to treat i nstances as cattle, not pets. [19:09:47] MusikAnimal, I've created 2 instances. [19:10:05] A login instance for us to use and a web service instance. [19:10:09] both 51MB RAM? [19:10:25] err GB [19:10:47] No the login instance has 2GB. That's our SSH terminal. [19:10:50] MusikAnimal: Err, no, that's the quota for the entire project [19:11:32] okay, yeah cause that's a lot [19:13:36] 3Tool-Labs: log files not written - https://phabricator.wikimedia.org/T85775#1021044 (10dnaber) I forget to mention that I restart the jobs every 24 hours with e.g. `qmod -rj ca-feedcheck`. Could this be a problem? Will this keep the memory settings? [19:14:24] MusikAnimal, login uses 1 CPU, 2048MB RAM, and allows for 20GB. [19:15:03] MusikAnimal, web service uses 8 CPUs, 16384 MB RAM, and allows for 160GB. [19:15:31] very nice [19:16:13] CP678: thought of running two web services with a balancer? [19:16:26] yeah, I was thinking the same [19:16:26] In case one decides to die [19:16:32] I'm creating a second right now. [19:19:01] 3Tool-Labs: Make SGE be more informative about OOM kills - https://phabricator.wikimedia.org/T88824#1021110 (10scfc) 3NEW [19:19:41] MusikAnimal, https://wikitech.wikimedia.org/wiki/Special:NovaInstance [19:20:47] 3Wikimedia-Labs-Infrastructure: Move LabsDB aliases and NAT to DNS and LabsDB servers - https://phabricator.wikimedia.org/T63897#1021124 (10yuvipanda) Copying my comment from elsewhere: Currently, aliases for things like enwiki.labsdb and s1.labsdb are maintained as /etc/host entries, manually. Why aren't they... [19:21:51] Can you see it? [19:23:14] 3Tool-Labs: Make SGE be more informative about OOM kills - https://phabricator.wikimedia.org/T88824#1021150 (10valhallasw) If we provide qsub with `-ma` (mail on abort/reschedule) by default, and maybe continuous jobs also `-me` (mail on end of job, i.e. the continuous job crashed), the user will at least get an... [19:23:44] MusikAnimal, I created the instances, do you want to set up the installations? [19:25:38] so it just has bare bones Ubuntu right now? [19:25:46] Yep. [19:25:56] No DBs either I believe. [19:26:06] It's empty and untouched. [19:26:24] And hopefully a good home for xTools once it's ready/ [19:26:26] do we need a DB? aren't we just connecting to the wiki repl dbs? [19:26:46] It uses the local DBs as well. [19:26:55] gotcha [19:28:13] well I guess we'll have to sit down and figure out everything we need. I don't know where to start with the PHP, do you use a framework or anything? [19:28:50] Framework? [19:28:57] For what? [19:29:11] like cakephp [19:29:13] guess not [19:29:25] No clue. I used what Labs provided. [19:29:58] I'm more of the PHP person than a Linux person. Linux and server setups are your area of expertise I believe. :p [19:30:29] Install PHP 5.3 [19:30:34] That should do it. [19:31:23] I'm certainly more so linux than PHP! This ought to be fun, setting this all up from scratch. 
Hopefully I don't mess anything up [19:32:02] We'll know if a chunk of Lab's server explodes. That ought to make Coren happy. :p [19:38:11] valhallasw`cloud: set(next(iter(tags.values()))) wat [19:38:21] legoktm: i know right [19:38:29] legoktm: that's tags.values()[0].keys() [19:38:37] sorry, set(tags.values()[0].keys()) [19:38:44] BUT PYTHON 3 [19:39:04] <3 [19:39:06] "sorry, it's a view, so you can't get the first element' [19:40:20] (03CR) 10Legoktm: [C: 032] "I think it would be a good idea to also run this test against the live site so after upgrades we can figure out what's broken." [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/188936 (https://phabricator.wikimedia.org/T88747) (owner: 10Merlijn van Deen) [19:40:31] valhallasw`cloud: I assume you already deployed that? [19:40:35] (03Merged) 10jenkins-bot: Fix project tag screen scraping [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/188936 (https://phabricator.wikimedia.org/T88747) (owner: 10Merlijn van Deen) [19:40:42] legoktm: no [19:41:20] oh [19:41:27] legoktm: but the main issue was resolved when we added paragraph-marker-thingie to the matched projects [19:41:40] I'll deploy now, then :-) [19:41:59] !log tools.wikibugs legoktm: Deployed 1b6bbd391ad1f23a8270d3547b2540064e452d94 Fix project tag screen scraping wb2-phab [19:42:04] er, wait I'm deploying [19:42:05] Logged the message, Master [19:42:09] :D [19:42:32] well, should not be an issue [19:42:43] pull is idempotent and qdel -rj probably just restarts twice or so [19:49:50] 3Wikibugs, Wikimedia-Fundraising: Wikibugs bot is skipping many notifications - https://phabricator.wikimedia.org/T88747#1021253 (10Legoktm) a:3valhallasw [19:49:57] 3Wikibugs, Wikimedia-Fundraising: Wikibugs bot is skipping many notifications - https://phabricator.wikimedia.org/T88747#1021255 (10Legoktm) 5Open>3Resolved [19:50:42] Coren, I can't SSH into the instance. It won't resolve. What am I doing wrong? [19:50:51] legoktm: I propose to add the fundraising bug as live site test case when it's closed :-) [19:51:00] Cyberpower678: What's the instance name? [19:51:06] login [19:51:08] valhallasw`cloud: ok, sure [19:51:35] lemme write the test, and see if we can add testing to be automagic [19:54:47] (03PS1) 10Merlijn van Deen: Add online project scrape test [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/189051 [19:55:29] I'm surprised login wasn't already taken. Huh. [19:56:32] Cyberpower678: Give me a few, my auth token *had* to expire now, obviously. [19:57:56] :DDDD [20:00:12] are there problems at this time ? [20:01:15] GerardM-: None that I know of, or that have been raised. [20:02:03] wdq is down and it refers to http issues [20:02:11] Cyberpower678: Huh, that's odd - your firewall rules got created empty rather than with the proper defaults. Lemme fix those for you. [20:02:18] Warning: fopen(http://wdq.wmflabs.org/api?q=claim%5B31%3A5%5D+and+noclaim%5B54%3A1471265%5D): failed to open stream: HTTP request failed! HTTP/1.1 504 Gateway Time-out in /data/project/catscan2/public_html/omniscan.inc on line 132 [20:03:14] <^d> !log deployment-prep rebuilt deployment-elastic05 with new partition scheme [20:03:19] Logged the message, Master [20:04:42] GerardM-: Trying the manual interface seems to work, but seems to be insanely slow atm. I see no issues with the fileserver or the DB however, so it's not immediately clear why it times out. [20:07:29] <^d> !log deployment-prep scratch that, I rebuilt it as precise. why did I do that? 
[20:07:33] Logged the message, Master [20:07:40] Cyberpower678: I added icmp and ssh to your default security group; you may want to add more eventually depending on what you need. [20:23:43] 3Tool-Labs: Cannot submit sge jobs from tools-webgrid-tomcat - https://phabricator.wikimedia.org/T68882#1021358 (10Prolineserver) I suddenly get the error "Unable to run job: denied: host "tools-webgrid-06.eqiad.wmflabs" is no submit host." on https://tools.wmflabs.org/videoconvert/, so I guess tools-webgrid-06.... [20:26:40] Coren, so that's how it works. :p [20:27:11] Coren, nodename nor servname provided, or not known [20:35:54] 3Tool-Labs: Cannot submit sge jobs from tools-webgrid-tomcat - https://phabricator.wikimedia.org/T68882#1021376 (10Prolineserver) [20:53:12] 3Tool-Labs: Cannot submit sge jobs from tools-webgrid-tomcat - https://phabricator.wikimedia.org/T68882#1021423 (10coren) 5Open>3Resolved -tomcat has been a submit host for some time now, but the recently-added tools-webgrid-06 wasn't. Fix't. [21:41:21] !log mediawiki-core-team added legoktm as project member [21:41:23] Logged the message, Master [21:43:11] !log mediawiki-core-team added Smalyshev as project member [21:43:13] Logged the message, Master [21:58:49] 3Labs, Wikimedia-IEG-grant-review: Create "grantreview" labs project - https://phabricator.wikimedia.org/T88852#1021645 (10bd808) 3NEW [22:05:11] Betacommand: as for T88853 [22:05:12] wb2-irc.err:DEBUG irc3.wikibugs > PRIVMSG #wikimedia-collaboration :3Echo: Echo is disabled for blocked users - https://phabricator.wikimedia.org/T88853#1021664 (10Betacommand) 3NEW [22:05:27] ah [22:05:42] filtered to a channel im not in [22:14:50] Betacommand: I suppose we should just have an online log of PRIVMSGs sent :-p [22:15:04] improving logging is on the to-do, though [22:53:11] andrewbogott: I'm trying to ssh from one integration instance (integration-dev) to another (integration-slave100X). Connection isn't being established. Ping worksfine and security group has port 22 allowed for 10.0.0.0/8. [22:53:32] Are you trying to do it yourself, or set up an automated job? [22:53:42] Your key probably isn’t forwarded. [22:53:48] Unless you’re forwarding it :) [22:54:30] I am [22:54:38] hm, lemme try [22:54:47] I am using a tool and am forwarding it, but also tried plain ssh without special arguments or wrapeprs [22:55:16] Also, I had to log out and in again on wikitech as https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup was showing headings without content. Presumably the internal session lost track of openstack again? [22:55:49] Yes, that seems to happen every week or so :( I haven’t tracked it down [22:56:10] That bug is happening several times a week. It'd be a lot more helpful if at least the software recognised its own state and didn't output empty pages and errors claiming the host doesnt exist. [22:56:40] http://i.imgur.com/ipqVn4m.png [22:59:58] I see the ssh behavior you’re describing. I’m not sure what the story is — investigating [23:00:11] Thx! [23:04:38] <^d> Coren: About? I'm having an NFS problem with some new instances I just spun up [23:04:55] ^d: Elaborate away. [23:07:32] ^d: there’s an occasional race still, a reboot may fix it [23:08:09] Krinkle: I’ve confirmed that ssh’ing between instances within a project works in other projects. Your security groups look good, although I’m suspicious about the source group setting. Will it harm anything if I remove it? 
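Instance names only resolve from inside labs, which is the usual cause of the "nodename nor servname provided" error seen above when connecting directly; a `~/.ssh/config` sketch that hops through the bastion mentioned earlier, with the username and the `login` instance name as placeholders:

    Host bastion.wmflabs.org
        User cyberpower678

    Host *.eqiad.wmflabs
        User cyberpower678
        ProxyCommand ssh -W %h:%p bastion.wmflabs.org

    # afterwards, simply: ssh login.eqiad.wmflabs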
[23:08:28] I only barely understand how ‘source group’ is supposed to work :( [23:08:59] andrewbogott: I don't think it's ever possible to prevent the race short of adding a (long-ish) delay in firstboot or something. I can think of no way to synchronize the entirely asynchronous "export NFS directories" with "instance fires up and tries to mount them" given they aren't even in the same security domain. [23:09:27] Coren: hence the blunt solution of ‘reboot it’ :) [23:12:42] andrewbogott: I'm not deploying anything in the next 2 hours. As long as Jenkins is still able to ssh into the instances from the outside to run jobs, do as you like. Keep me posted :) [23:13:02] It should, I’m pretty sure that source group can only harm and not help [23:13:11] I didn't set that. [23:13:14] Was/is that the default? [23:14:14] Nope, I don’t know where it came from. [23:14:20] And, removing it didn’t help :( [23:14:32] <^d> I tried rebooting. [23:14:41] <^d> /home and /data/project [23:14:41] Coren, want in on this weirdness? Port 22 is blocked between instances in the ‘integration’ project for no reason that I can detect. [23:14:53] Which two? [23:14:59] And only those two? [23:15:06] Krinkle: Do those instances have their own ferm or iptables config outside of the nova security rules? [23:15:12] Coren: any two as best I can tell. [23:15:21] for instance integration-slave1001 vs integration-slave1002 [23:15:26] <^d> Pastebinning [23:15:29] andrewbogott: Not that I know of. Can you verify? [23:15:56] <^d> Coren: Earlier today, so ignore timestamps. But identical problem. https://phabricator.wikimedia.org/P268 [23:16:01] <^d> (rebooting earlier today fixed it) [23:16:10] * Coren looks into it [23:16:39] ^d: That a new instance? [23:16:41] Krinkle, those instances do have their own firewall rules. [23:16:43] <^d> Yep. These are fresh VMs, no extra roles or anything applied. [23:16:48] So that’s definitely the issue. [23:17:02] Krinkle: you probably need to dig in puppet, or I can in a bit [23:17:03] andrewbogott: Even integration-slave1006+ ? [23:17:29] yep [23:17:30] Antoine created the earlier ones. The newer ones I created and documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup - I don't create a firewall anywhere. [23:17:35] ^d: That does look like the "instance came up too fast" problem. [23:17:36] at least, ‘sudo iptables —list’ is full of entries [23:17:39] andrewbogott: Then I guess it's done by puppet [23:17:43] yep [23:17:58] andrewbogott: Krinkle: I confirm that networking gets the packets to the instances but they are then ignored. Local iptables. :-) [23:18:16] ^d: Reboot didn't work that time? [23:18:19] ^d: this is in the deployment-prep project, right? So NFS is working fine elsewhere? [23:18:32] <^d> Coren: Did earlier today, newest ones not. I can try again [23:19:10] Part of the issue is that there is negative caching of rejected mount attempts; depending on how recently the last attempt happened and how fast the reboot, it may end up refreshing the cache. [23:19:19] At least, once the mount works, it works forever. [23:19:31] <^d> `shutdown -r now "once more, with feeling"` [23:19:45] Coren: I’m trying to check the exports but not remembering where they live [23:19:55] andrewbogott: /etc/exports.d [23:20:06] well, that should’ve been easy to remembe [23:20:07] r [23:20:25] <^d> no dice [23:20:48] andrewbogott: Also, 'exportfs' lists them but that's a mess to search. [23:20:58] There’s no export for chad’s machine [23:21:17] Krinkle: Yep. 
The instances have iptables with an explicit list of hosts allowed to ssh. [23:21:34] manage-nfs-volumes is dead, it seems [23:22:15] andrewbogott: dafu? Why didn't icinga notice? [23:22:23] manage_nfs_volumes_running [23:22:23] OK 2015-02-06 23:21:01 1d 8h 6m 54s 1/3 PROCS OK: 1 process with regex args '^/usr/bin/python /usr/local/sbin/manage-nfs-volumes' [23:22:28] It’s running, but hasn’t touched the log for several hours [23:22:35] Gah! [23:22:38] so maybe hung up someplace [23:23:02] I just kicked it. [23:23:13] Also, that means we need to have some sort of heartbeat check. :-( [23:23:36] It’s still not touching the log [23:23:37] * Coren hates it when he's right. [23:23:39] not yet at least [23:23:49] * Coren tries it manually [23:24:10] I bet it’s erroring out early in each run :( [23:24:15] andrewbogott: Hm.. is it caused by the firewall that allows gallium to ssh inwards? https://github.com/wikimedia/operations-puppet/blob/211ea1f5d2cfe53c41ed376364c6d821d9d8af32/modules/contint/manifests/firewall/labs.pp#L7 [23:24:52] andrewbogott: but there's other entries in there as well, such as for shinken and bastion. Those are inserted by the base classes for labs instances, right? [23:25:08] andrewbogott: why in Baal's name is the logfile owned by root? [23:25:30] Coren: maybe this is another one of those ‘restarted by hand during ldap panic’ issues [23:25:31] leftover [23:25:41] There’s no firewall config that’s labswide. [23:25:58] Anytime you apply a ferm config on a machine, it closes ALL ports except those explicitly mentioned, as I understand it. [23:26:29] andrewbogott: But nobody would have done that in the past couples days! [23:26:32] * Coren boggles a bit. [23:26:53] true [23:27:41] Coren, try ps -ef | grep manage-nfs [23:27:51] Looks to me like there are several, in competition [23:28:14] Or, not several, but, ‘two' [23:28:16] o_O I'm just seeing the one, pid 18570 [23:28:29] 18569 is just the su, dude. :-) [23:30:07] Well, it works now with the chown [23:30:18] andrewbogott: OK. I've moved my script to bastion.wmflabs.org, but there's no dsh there. [23:30:24] Could we install it there? [23:30:42] andrewbogott: Ima keep an eye on the owner though, try to see if something is being dumb and chowning it [23:30:43] Krinkle: That’s easier than fixing the firewall? [23:31:10] That depends. [23:31:18] Coren: Chad’s box still doesn’t have an export… [23:31:29] 10.68.17.187 [23:31:32] Maybe it’ll catch up [23:32:40] Coren, any opinion re: dsh on bastions? [23:35:07] andrewbogott: Is there a good way of allowing instances within the project to ssh in? [23:35:14] allowing the entire subnet would be too much [23:35:24] is there a project-confined ip range? [23:35:37] Or safe nodename wildcard? [23:35:39] Not that I know [23:35:48] I know bastion and dsh works. I'm not ops :) [23:36:28] There’s not a predictable subnet — you’d have to allow all labs boxes. [23:36:33] Which is the default elsewhere on labs anyway [23:36:49] I guess it still uses ssh keys, right? [23:36:55] It's not a blanket allowance [23:39:19] right [23:39:34] there must already be a hole in the firewall for bastion [23:40:38] andrewbogott: I mean in addition to bastion. It'll still require the user to be in the integration group. [23:40:40] right? [23:41:06] yeah, I just mean — there must be a rule someplace in the puppetized firewall that allows bastion access. So it should be straightforward to alter that. [23:41:25] Right [23:42:00] ^d: I’m unconvinced, but can you try another reboot and let me know what you see? 
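One way to implement the heartbeat check Coren wants above: treat a silent log as a failure even while the process is alive. The process regex is the one the existing icinga check quotes; the log path and the 15-minute threshold are assumptions, and the exit codes follow the usual Nagios convention:

    #!/bin/bash
    LOG=/var/log/manage-nfs-volumes.log       # hypothetical location
    MAX_AGE=900                               # seconds of silence tolerated

    if ! pgrep -f '^/usr/bin/python /usr/local/sbin/manage-nfs-volumes' >/dev/null; then
        echo "CRITICAL: manage-nfs-volumes not running"; exit 2
    fi

    age=$(( $(date +%s) - $(stat -c %Y "$LOG") ))
    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "WARNING: manage-nfs-volumes has not logged for ${age}s"; exit 1
    fi
    echo "OK: last log entry ${age}s ago"; exit 0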
[23:42:02] OK. So copying the rule from puppet for gallium and using the ip range from the nova security group for ssh, I'd think something like this: rule => 'proto tcp dport ssh { saddr 10.0.0.0/8 ACCEPT; }' [23:42:08] (does it support ranges?) [23:42:22] I would expect that to work [23:42:35] Probably need to run it by hashar though before you open things up. [23:44:44] Sure [23:46:28] <^d> andrewbogott: Rebooting [23:46:40] * andrewbogott is not optimistic [23:51:54] ^d: if that doesn’t work, let me know what instances need access and I’ll add them by hand. This will take some digging in. [23:52:52] <^d> no dice. [23:53:02] <^d> deployment-elastic0[5-8] [23:53:14] ooh, ok. [23:53:16] Stay tuned... [23:54:59] ^d: ok, added those by hand [23:55:04] So, should work now. If you reboot again. [23:55:06] And again [23:55:08] and again [23:55:18] “Did you reboot it? How many times?” [23:56:33] andrewbogott: "Error: /Stage[main]/Role::Labs::Instance/Mount[/home]: Could not evaluate: Execution of '/bin/mount /home' returned 32: mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/mediawiki-core-team/home failed, reason given by server: No such file or directory" [23:56:47] is there something wacky about nfs and new instances? [23:56:54] bd808: yes! [23:57:03] But if you tell me the instance’s IP I can hack [23:57:09] well, project and IP [23:57:24] 10.68.17.178; mediawiki-core-team [23:58:04] ok, try a reboot [23:59:05] * bd808 twiddles thumbs while reboot happens [23:59:37] andrewbogott: sweet. I have a home dir again [23:59:39] thanks [23:59:47] bd808: did /data/project work too?
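A sketch of checking and retrying the NFS mounts on a fresh instance without a full reboot, using the server and project names from the error quoted above; whether `showmount` can reach labstore from an instance is an assumption, and because rejected mounts are negatively cached (as Coren notes) the retry may only succeed after a few minutes:

    # has the export appeared on the server side yet?
    showmount -e labstore.svc.eqiad.wmnet | grep mediawiki-core-team

    # retry the fstab entries by hand instead of rebooting
    sudo mount /home
    sudo mount /data/project
    df -h /home /data/project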