[03:40:08] legoktm (or anyone), do you know how to run a Python 3 web service on Tool Labs? [03:40:19] webservice2 uwsgi-python3 start (given at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_.28uwsgi.29 ) does not work. [03:40:21] It outputs: [03:40:39] invalid choice: 'uwsgi-python3' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs') [03:40:50] I'm on tools-trusty. [03:52:27] Trying Python 2, but haven't got that yet either. [03:53:31] Never mind, that's working now. [04:02:58] superm401: we (Yuvi and I) tried getting py3 to work, except there were some bugs in the underlying uwsgi library [04:03:32] legoktm, got it. I'll remove that for now. BTW, I'm using your fab library. :) [04:04:03] oh yay! [04:04:12] what are you using it for? [04:05:14] Fixing a Wikipedia gadget that tracks bug status. Kind of similar to fab-proxy, but JSONP instead of a straight proxy: https://phabricator.wikimedia.org/T539 [04:05:36] ah, neat :D [04:06:24] legoktm, have you made any Phabricator bots? [04:06:32] yes, wikibugs :) [04:06:39] Cool, should I just use LDAP? [04:06:50] With a separate account, of course. [04:07:20] you can get a special "bot" account that just has a certificate and can't be logged into normally, /me finds link [04:07:44] Oh, they don't even have passwords? I knew you used the cert for bot operations, but I didn't realize that part. [04:07:58] superm401: https://www.mediawiki.org/wiki/Phabricator/Bots [04:08:09] Thanks. :) [04:14:19] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1024066 (10yuvipanda) [04:14:21] 3operations, Tool-Labs: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1024064 (10yuvipanda) 5Open>3Resolved I think this is fixed at least now. I'll follow up monitoring for this when I'm back from vacation [04:53:17] legoktm, so I'm having the tool throw exceptions on purpose (it won't work anyway until I get a bot and cert). [04:53:19] My question is, how do I actually see what those exceptions are (as opposed to just that it's a 500, which I can see in uwsgi.log). [04:53:54] https://tools.wmflabs.org/phabricator-bug-status/queryTasks?ids=[539]&callback=foo links to 'common causes for errors' (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Logs) but there is no section called Logs. [04:56:45] superm401: I'm not sure if it's logged anywhere...log it yourself? [04:57:14] :( [04:57:34] except Exception, e: logging.error(traceback.fmt_exec()); raise e [04:57:43] something like that maybe? [04:57:45] I will if I have to. No real point, since I can probably just write it without logs once I have my cert. [05:05:49] legoktm, thanks for your help. I'll pick this up when the bot is created. 
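A rough sketch of the "log it yourself" approach legoktm suggests above, assuming the tool is a small Flask app served by uwsgi (the route, file names and placeholder handler are illustrative, not superm401's actual code); note the traceback helper is spelled format_exc(), not fmt_exec():

    import logging
    import traceback

    from flask import Flask  # assumption: the tool is a Flask app behind uwsgi

    app = Flask(__name__)

    # uwsgi.log only records that a 500 happened, so write tracebacks
    # somewhere readable yourself.
    logging.basicConfig(filename='error.log', level=logging.ERROR)

    def run_query():
        # stand-in for the real Phabricator lookup
        raise RuntimeError('placeholder error')

    @app.route('/queryTasks')
    def query_tasks():
        try:
            return run_query()
        except Exception:
            # format_exc() returns the current traceback as a string
            logging.error(traceback.format_exc())
            raise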
[05:06:31] :) [06:32:36] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:38:47] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [06:39:53] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [06:52:44] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:53:58] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [07:02:33] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [07:03:51] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [07:18:56] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [07:22:43] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:47:00] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-puppetmaster does not respond to other instances - https://phabricator.wikimedia.org/T88960#1024349 (10hashar) Maybe ops have some idea? :-( [09:51:41] <_joe_> anyone having network issues in labs? [09:52:53] O/ [10:27:06] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024429 (10hashar) [10:51:59] 3Labs: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1024515 (10faidon) p:5Triage>3High [10:52:17] 3Labs, operations: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1024517 (10faidon) p:5Triage>3Unbreak! [10:52:27] 3Labs: Move wikitech web interface to a dedicated server - https://phabricator.wikimedia.org/T88300#1024518 (10faidon) p:5Triage>3High [14:17:21] (03PS1) 10KartikMistry: Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 [14:18:06] (03CR) 10Hashar: [C: 031] Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:23:46] (03CR) 10Hashar: "I have added as reviewer all members of the tools labs group tools.lolrrit-wm as reported by 'getent group tools.lolrrit-wm'. Hopefully " [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:45:07] Coren: could you restart the webservice for drtrigonbot on tools? Has been offline since the reboots and DrTrigon has been inactive since August [14:45:24] Sure. Give me a minute. [14:45:32] thanks :-) [14:47:00] (03CR) 10Merlijn van Deen: [C: 032] Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:47:03] (03Merged) 10jenkins-bot: Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:49:15] sitic: That should be it. [14:49:33] jep works, thanks again [14:50:09] !log tools.lolrrrit-wm deployed https://gerrit.wikimedia.org/r/189482 [14:50:09] tools.lolrrrit-wm is not a valid project. 
[14:50:21] !log tools.lolrrit-wm deployed https://gerrit.wikimedia.org/r/189482 [14:50:26] Logged the message, Master [14:50:30] two rs, three rs, potato, potato [15:25:03] * hashar rolls the drums [15:25:55] Coren: good morning and breakfast! Labs security rules seems to be broken :-D [15:27:31] for some reason I have a bunch of firewalling issue that started occurring last Friday around 11pm UTC [15:27:39] got some traces at https://phabricator.wikimedia.org/T88960 [15:40:22] Hm. I don't think we changed anything. Lemme look at this for you. [15:40:45] hashar: ^^ [15:41:16] one example is attempting to ssh from gallium.wikimedia.org to the CI instances in labs. [15:42:01] hashar: 10.68.16.60 is your example, right? [15:42:06] I suspect the labs security rules to be borked. Maybe some cache yields an empty list [15:42:13] yeah [15:42:22] that is integration-slave1001.eqiad.wmflabs [15:42:45] we use ferm rules on CI slaves, and have one to allow ssh (port 22) from gallium ( 208.80.154.135 ) [15:43:02] so at least from prod to labs that is blocked by something :D [15:44:06] hashar: Indeed. I'm trying to track down where from. [15:44:07] on the same projects, the puppet agent can not reach the project puppetmaster on integration-puppetmaster.eqiad.wmflabs ( 10.68.16.96 port 8140) [15:44:23] so that is traffic local to the labs project [15:44:41] the puppet master works just fine (I can run puppet agent locally) [15:46:09] That's odd; I see the traffic reaching the labnet at least. [15:46:46] great [15:49:37] hashar: Your current security group rules (on Wikitech) only show port 22 open from 10.0.0.0/8 [15:50:10] hashar: Sadly, we have no change history over that config so I couldn't tell you if that was changed Friday. [15:50:21] pfff [15:50:22] wtf [15:51:38] Also, port 8140 is not open at all. [15:51:50] it never was [15:51:57] Only rule I see for gallium is to open 873 [15:51:58] since the traffic is local to the project [15:52:12] I have added 208.80.154.135/32 port 22 [15:52:47] And that does open up ssh from gallium. [15:52:52] yes [15:53:03] so seems the security rules got messed up on friday :-((( [15:53:21] 3Tool-Labs: Install byobu terminal multiplexer package on toollabs - https://phabricator.wikimedia.org/T88989#1024935 (10devunt) 3NEW [15:53:43] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024943 (10coren) 5Open>3Resolved a:3coren The project security group did not (was changed not to?) include allowing ss... [15:54:14] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024951 (10hashar) [15:54:27] one issue solved [15:54:30] now the puppetmaster one :D [15:54:48] deployment-prep has its own puppetmaster as well [15:54:50] and works just fine [15:54:55] 3Tool-Labs: Install byobu terminal multiplexer package on toollabs - https://phabricator.wikimedia.org/T88989#1024955 (10devunt) [15:55:00] though port 8140 is not in the security rules [15:58:34] Hm, that's odd. [15:58:49] Do you have a ticket for that one too? [15:59:05] same ticket :D [15:59:26] Oh, I hadn't noticed the "additionally" part. [15:59:28] Heh. 
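For reference, the fix Coren applies here ("I have added 208.80.154.135/32 port 22") is an ordinary OpenStack security-group rule. On the command line it would look roughly like this; the group name and the novaclient syntax are assumptions based on the tooling of that era, not something taken from the log:

    # allow ssh from gallium (208.80.154.135) to instances in the project's
    # security group ('default' is an assumed group name)
    nova secgroup-add-rule default tcp 22 22 208.80.154.135/32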
[15:59:56] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024977 (10hashar) 5Resolved>3Open The integration labs project was missing a security rule to allow ssh from gallium for... [16:00:13] ... I wonder... [16:00:21] someone must have messed the rules [16:01:54] or maybe prod was always allowed to ssh to instances? [16:02:16] or the rules are no applied at the instance level [16:02:17] hashar: I don't think it was, nor was it intended to. [16:02:24] and thus instances can no more communicate to each others [16:02:58] hashar: And, annoyingly enough, adding a rule to allow 8140 worked. [16:03:16] yeah something is definitely wrong [16:03:25] on deployment-prep we have no rule to allow 8140 [16:03:32] I saw. [16:03:55] I wonder if the intervening upgrade (between instance creation) changed a default. I'll have to consult with Andrew [16:05:17] Well, your immediate issue is fixed, at least, but it bears looking into [16:05:25] I guess you can comment on https://phabricator.wikimedia.org/T88960 [16:05:36] I know _joe_ had some related issue this morning [16:06:46] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025012 (10coren) The puppetmaster issue did appear related: adding an explcit rule to allow it fixed the immediate problem,... [16:07:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025014 (10coren) 5Open>3Resolved [16:07:43] _joe_: Moar details? ^^ [16:07:44] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025015 (10hashar) The deployment-prep labs project also uses a local puppetmaster but it does not need any specific security... [16:07:45] Coren: should I fill another task so ? [16:07:57] hashar: Might be wise. [16:08:07] So as to not lose track of the underlying question. [16:09:01] <_joe_> Coren: nope [16:10:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025027 (10akosiaris) @hashar root@integration-slave1002:~# telnet 10.68.16.96 8140 Trying 10.68.16.96... Connected to 10.68... [16:11:58] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025041 (10coren) Yeah, things are working fine now with an explicit rule - but the necessity of //having// the explicit rule... [16:12:41] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) 3NEW [16:13:06] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) Filled another ticket for investigation of the underlying issue: {T88995}. [16:25:10] <^d> Coren: Whatever you and andrewbogott_afk did seems to have worked. 
New instances are coming up fine with nfs for me now [16:47:53] Heh. Spam. "Reduce your mortgage interest by 4%". I'd love to know how they can find a bank that'll *give* me interest on a loan. :-) [16:51:14] Coren: in France the real-estate mortgage are veryyyy low [16:51:30] sometime less than 2% [16:51:40] I dunno about France, but I pay 2.5% on mine so that's pretty reasonable. [16:52:11] heck http://www.meilleurtaux.com even shows 1,75% (excluding insurance) for 15 years [16:52:29] (fixed rates) [16:52:52] the only problem is a house is half a million of euros :-] [16:58:53] <^d> Coren: Perhaps I spoke too soon. deployment-elastic0[568] all came up fine, but [7] did not. Same 2 mounts. [16:59:36] ^d: The race cannot be won all the times. 10:1 just rebooting it will fix it. [16:59:43] * ^d has rebooted 3 times [16:59:47] Huh. [17:00:12] I'm guessing you made them in that order? [17:00:39] <^d> Yep [17:04:24] ^d: That's odd - the LDAP entry for that instance is completely hosed. [17:04:32] <^d> boo :( [17:05:44] That's actually a little worrisome. andrewbogott_afk ping? [17:06:21] <^d> I had deleted an instance of the same name a bit ago. Chance the old entry wasn't fully deleted first? [17:06:22] ^d: Can you avoid touching it for a bit so it's left intact for investigation? [17:06:42] ^d: That might be part of the issue, or a possible cause. [17:07:26] <^d> I'll leave it alone [17:07:33] <^d> logged out too [17:11:00] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025233 (10coren) 3NEW [17:11:30] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025247 (10coren) p:5Triage>3High [17:11:50] ^d: Can you live without and/or make due with a -09 in the meantime? [17:12:00] <^d> I was just thinking that. [17:34:23] why are my python grid jobs failing with "KeyboardInterrupt"? [17:34:35] I'm definitely not ^C'ing them... [17:36:23] they also appear to be failing on network issues talking to gerrit? [17:43:40] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025343 (10Andrew) I've seen this too, but only on Friday. The addition of the DNS info to the ldap record is done by a mw job -- for the time being I'm blaming this on load but we should keep an eye out fo... [18:01:43] legoktm__: I think that might be the SIGKILL that's sent before the SIGTERM? not sure... [18:01:50] although ctrl-c is SIGINT I think [18:18:46] Jobkill goes INT -> TERM -> KILL [18:19:13] legoktm: But tell me more about the network issues. [18:19:32] Guest92790: ^^ [18:21:13] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025447 (10coren) I note there are also several missing puppetVar and puppetClass, but not all of them. Are they in fact added through different mechanism? [18:26:19] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025449 (10Andrew) No! The mw job only sets the arecord. If we're getting instances without puppetVars then... something interesting is happening :( [18:28:51] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025453 (10Andrew) Oh, nope, I'm wrong -- puppetVars are handled in OpenStackNovaPrivateHost.php, same as the private arecord. So this is at least only one problem rather than two. 
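The INT -> TERM -> KILL escalation Coren describes is also why legoktm's jobs die with "KeyboardInterrupt": in Python the initial SIGINT surfaces as exactly that exception. A minimal sketch of making a grid job report why it was stopped (the main() body is a placeholder, not legoktm's actual script):

    import signal
    import sys

    def on_sigterm(signum, frame):
        # second stage of the escalation: log and exit before SIGKILL arrives
        sys.stderr.write('received SIGTERM from the grid, exiting\n')
        sys.exit(1)

    signal.signal(signal.SIGTERM, on_sigterm)

    def main():
        pass  # placeholder for the real work (e.g. the git pull / page generation)

    try:
        main()
    except KeyboardInterrupt:
        # first stage: SIGINT shows up in Python as KeyboardInterrupt
        sys.stderr.write('interrupted by the grid (SIGINT)\n')
        raise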
[18:33:28] Coren: want to double-check my work and verify that the jobqueue on virt1000 is in fact working and empty? [18:36:39] andrewbogott: Looking now. [18:36:44] thx [18:36:53] It should be running 1/min on a cron [18:37:54] It does, and showJobs shows 0 [18:39:08] Is there any scenario where the jobqueue would just… throw jobs away? [18:39:20] andrewbogott: Do you have an opinion about the networking issue? T88960 [18:39:53] andrewbogott: I don't know of a way for jobs to not be run, but I don't think error recovery when a job fails is that robust. [18:41:22] The job resubmits itself in certain situations, maybe I can broaden that [18:41:58] Coren: is that bug a network issue, or just an ‘instances without an arec don’t have an arec’ issue? [18:42:01] * andrewbogott reads more closely [18:42:21] oh, sorry, misread [18:42:53] No, that one is unrelated. It looks like there is no communication between instances in the same project by default anymore; at least that always has been my understanding. [18:43:45] Coren: hashar’s instances use ferm… [18:43:57] So it could be an issue with the firewall local to the instance [18:44:00] That one didn't. [18:44:09] I checked to make sure first. [18:44:12] ok [18:44:17] sorry, I will actually read this to the end [18:44:22] :-) [18:45:56] I messed with that project’s security rules on Friday. So this is probably my fault one way or another [18:48:20] Huh. Is there a tunable knob about "allow connections within the project"? [18:48:42] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025519 (10Andrew) I messed with the security rules on Friday because someone on IRC (timo, I think?) was trying to ssh between instances and failing.... [18:49:47] Hm, I don’t think of security rules as having any sense of in- or out-of-project [18:58:03] Coren: sorry, I read the log wrong, there are no network issues. My python script is shelling out to git, does a git pull on a large repo which takes a while, and is dying with KeyboardInterrupt (/data/project/extreg-wos/generate.err)...any ideas? [19:01:47] legoktm: that could be SGE killing the job [19:02:23] valhallasw`cloud: should I throw more memory at it? [19:02:39] might help. qacct -j should tell the maxvmem, I think [19:02:48] and you can add -ma to get a mail on abort [19:05:56] ok yeah it's running out of memory [19:05:56] I bumped it up to 900M from 500M [19:06:54] I really think -ma should be default [20:01:53] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025736 (10hashar) 3NEW [20:02:35] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) The doc publishing jobs are failing as well and there is no workaround for it :( T89026 [20:02:50] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) The doc publishing jobs are failing as well and there is no workaround for it :( T89026 [20:04:26] andrewbogott: good morning :] [20:04:33] 'morning! [20:04:39] hashar: ready for me to merge those two patches? 
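Spelled out, valhallasw's diagnosis steps are: let qacct report the dead job's peak memory, then resubmit with a larger reservation (and mail on abort). The job name, script and exact flag spellings below are illustrative; check jsub --help and the qsub man page on the current grid:

    # after the job has died, see how much memory it peaked at
    qacct -j extreg-wos | grep maxvmem

    # resubmit with a larger memory reservation and ask for mail on abort
    jsub -N extreg-wos -mem 900m -m a python generate.py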
[20:04:44] the poor security rules are broken on integration :D [20:04:49] sorry lacking a bouncer [20:05:06] I guess some rule that allow communications between instances inside a project has been removed somehow [20:06:01] hashar: I responded to the phab. I was messing with security rules on (I think) Timo’s request. I probably broke something by accident. [20:06:07] Anyway, all working now right? [20:06:27] na :] [20:06:31] well let me check [20:06:47] Coren: can you check on the status of /public/backups? Labs instances are complaining; I’m on labstore1001 but don’t see what’s going on [20:06:58] there is a workaround to explicitly define security rules, but in some case that is not possible [20:07:48] andrewbogott: Hm, /public/backup is obsolete and shouldn't be even used since https://gerrit.wikimedia.org/r/#/c/184638/ [20:08:16] jgage is getting puppet errors about it, but maybe his puppet is broken [20:08:32] <^d> FOR ALL THAT IS GOOD AND HOLY IN THIS WORLD [20:08:47] ^d? [20:08:48] * ^d stabs nfs [20:08:51] <^d> *AGAIN* [20:08:56] o_O [20:09:03] ^d: What instance? [20:09:06] <^d> Spun up an 09 to work around the missing 07. [20:09:11] <^d> deployment-elastic09 [20:09:19] <^d> (and this is *not* a replacement. There's never been an 09) [20:09:37] ^d: Still broken in ldap [20:09:48] andrewbogott: ^^ ze problem, she is real. [20:10:34] andrewbogott: Same missing things. It's got no aRecord nor the non-default vars and classes [20:11:01] ^d: The issue isn't NFS but LDAP. We be on it. [20:11:16] <^d> Well, cascades into nfs, right? [20:11:20] <^d> When ldaps busted? [20:11:46] ^d: dns is in ldap, so nothing works without it [20:11:53] <^d> Makes sense [20:13:19] andrewbogott: still broken :] [20:13:37] hashar: the phab says otherwise, can you update the ticket accordingly? [20:13:41] andrewbogott: I have a use case were the labs instance ssh to another instance and I can't really allow any instance to ssh to it [20:13:42] Or am I misreading somehow? [20:13:56] andrewbogott: I want to restrict ssh to the integration labs instance. Filled a bug about it https://phabricator.wikimedia.org/T89026 [20:14:12] the way we worked around the underlying issue is by explicitly defining security rules [20:14:21] ok [20:14:22] such as allowing connections to puppet port [20:14:31] so convince me that this is a new problem and not just the way things have always worked? [20:14:50] I have no clue how the security rules work, but seems a rule that allow instances to communicate together inside a project has been dropped [20:15:03] ah, ok [20:15:25] I am 1000% sure that instances always have been able to communicate on whatever port / protocol as long as they are in the same project. [20:15:51] so you were making use of a ‘source group’ rule which is so obscure that I removed it from the GUI. you must’ve configured it back in the day. [20:15:52] on beta cluster, there is no security rules to communicate to the puppet port ( 8160 or something ) and puppet is still happy [20:15:58] I will replace that rule from the commandline [20:16:02] ohhhhh [20:16:10] andrewbogott: And, indeed, so am I. I always believed that this was the case. [20:16:22] so deployment-prep must have a similar rule [20:17:19] ah source group [20:17:21] \O/ [20:17:21] the openstack people removed use of ‘source groups’ from their docs as well — it caused a bunch of confusion and I judged it to be worthless. Clearly that was incorrect :( [20:17:45] so by default instances can't communicate with each other right ? 
[20:17:58] ugh, and removed from their usage statements as well. This should be interesting [20:18:01] * andrewbogott dives into source [20:18:04] whenever we created the contintcloud project (fairly recently) it cames with the source group by default [20:18:28] hashar: right, by default the project membership has no affect on firewall rules [20:19:37] andrewbogott: here what I got https://phabricator.wikimedia.org/F38628 :D [20:19:43] for contintcloud project (not used yet) [20:19:54] deployment-prep has the same [20:20:01] at least we know what is the underlying cause now! [20:20:07] hashar, please stop talking for a minute so I can fix your project? [20:20:45] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025803 (10hashar) [20:20:46] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025802 (10hashar) [20:20:59] sure [20:22:35] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) Andrew found out that the integration labs project is missing the security rule that allows communication between instances. That i... [20:27:20] hashar: it looks like the commandline has changed and now I’m required to specify a port range. What do you need besides 22? [20:30:12] aoriarhee [20:30:25] no clue really [20:30:30] try 'any' or '*' ? :-D [20:30:54] I can do 0-1000000 [20:31:01] just thought you might have a list of services [20:31:01] andrewbogott: it still works for deployment-prep so maybe whatever is set for that project can be reused? [20:31:34] hashar: yeah, I’m talking about the commandline tool changing not the firewall implementation [20:31:46] anyway, integration should be happy now [20:33:38] Coren: we have 464 instances and 589 host entries — so it’s not like we hit another luck number in ldap… [20:33:52] I can’t imagine why it’s rejecting those entries. [20:33:58] trying :] [20:34:43] andrewbogott: It looks like the job is failing, but I can't see why. [20:35:40] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025875 (10hashar) a:3Andrew This has been fixed on spot by Andrew. See T88995 for details. [20:35:48] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025877 (10hashar) 5Open>3Resolved [20:37:05] hashar: sorry i broke everything on Friday :( [20:37:22] nested firewalls are hard to debug [20:39:16] what kind of sec rule have you added ? [20:39:38] integration-publisher [10.68.16.255] ssh port can now be reached by any instance [20:39:47] I guess there is an allow source any isn't it ? [20:40:43] ah no different rule :-] [20:42:37] andrewbogott: and yeah stuff get broken from time no worries. I am quite happy to see you figured out the fix in a few minutes \O/ [20:43:02] can you comment and close the bug for later reference please? https://phabricator.wikimedia.org/T88995 thx! 
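The "source group" rule andrewbogott restores here is what lets instances in the same project reach each other. As a command line it would be roughly the following; the group names and port range are illustrative and, as he notes, newer novaclient versions insist on an explicit port range:

    # allow TCP between all members of the project's default security group
    # (group names are assumptions, not taken from the log)
    nova secgroup-add-group-rule default default tcp 1 65535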
[20:47:02] greetings [20:47:23] andrewbogott: yes my stale nfs mounts problem is only with the "backups" mount point [20:47:35] i tried rebooting [20:48:12] jgage and Coren can you confer on this please? [20:48:52] jgage: Hm. What instance is this? [20:49:35] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025923 (10Andrew) 5Open>3Resolved a:3Andrew Yeah, I deleted the 'source group' rule because I suspected it of interfering with inter-instance s... [20:49:37] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025926 (10Andrew) [20:49:37] coren - project: ipsec. affected hosts: ipsec-c{1-4}. not affected: ipsec-pm. nfs mounts of /public/backups are stale, and after reboot when i try to mount -a i get this: [20:49:40] mount.nfs: mounting labstore.svc.eqiad.wmnet:/backups failed, reason given by server: No such file or directory [20:50:57] thx coren [20:51:05] it should be noted that i'm using self hosted puppetmaster, so i wondered whether i'm missing an important puppet change [20:52:12] jgage: You have, at least one: https://gerrit.wikimedia.org/r/#/c/184638/ is the one that removed /public/backup/ [20:52:19] aha :) [20:53:38] * andrewbogott endorses rebasing every day when using self-hosted puppet [21:10:43] andrewbogott, jgage: This module will do that (sync with upstream puppet repo) -- https://github.com/wikimedia/operations-puppet/blob/production/modules/puppetmaster/manifests/gitsync.pp [21:11:32] Set $::puppetmaster_autoupdate and poof it should just work [21:18:19] grr, I’m remembering that things run from the jobqueue don’t write to the auth log :( [21:22:29] using cron, is there a specific place to add the -cwd flag? Seems not be working in a job [21:25:05] the line is this one: http://pastebin.com/0QaK1fsF [21:31:04] Alchimista: doesn't work in what sense? the cwd for the second jsub command is $HOME, not whatever you cd'ed to [21:31:09] in the other jsub job [21:32:31] Alchimista: I *think* the best way to get what you want is to make a small shell script that does [21:33:06] #!/bin/bash [21:33:06] cd /data/project/alchimista/bots/alch; [21:33:06] jsub -cwd -N stewie python pwb.py stewievo.py > ~/bots/logs/stewie.log 2>&1 [21:34:08] valhallasw`cloud: i was tryng to avoid it, it'll be 20 or more cron jobs, if it where possible to do it in cron, was more easily manageble :s [21:34:29] Alchimista: hmm. [21:34:48] it's a bit annoying the jsub-fixup-script doesn't understand cd et al [21:38:00] yah, specially when it needs more than one file [21:40:28] Alchimista: I'm not sure. Maybe with some bash, but getting bash -c syntax right is basically impossible [21:41:06] It may well be a good time to get rid of the fixup script now; there's not much reason for it anymore. [21:41:34] Coren: I’m looking at deployment-elastic09 in ldap now… it’s missing an arec but the puppet settings look fine to me. Do you disagree? [21:41:38] Though it does meant a bit of policing to make sure tools-submit isn't being misused. [21:41:59] andrewbogott: Lemme refresh [21:42:52] andrewbogott: Well, they are correct for the default, but I doubt ^d just didn't configure the rest like -08 [21:43:36] you mean like role::elasticsearch::server? Where would that come from if not hand-config by ^d? 
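bd808's gitsync pointer above automates what andrewbogott endorses doing by hand: keeping the self-hosted puppetmaster rebased onto upstream production. Done manually it is just a periodic pull, something like the cron entry below (the checkout path assumes the usual role::puppet::self layout; adjust it to wherever the local repo actually lives):

    # daily, on the self-hosted puppetmaster: rebase the local operations/puppet
    # checkout onto upstream production (path is an assumption)
    17 3 * * * cd /var/lib/git/operations/puppet && git pull --rebase origin production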
[21:46:00] That's was my point but I now notice that -07 and -09 were indeed /not/ configured (probably because they were broken). So the absence of vars and classes is not a problem, only the arec is missing and the problem [21:46:24] ok, great. [21:46:51] Too bad there’s no logging from things in the jobqueue :( [21:50:14] Coren: is there logging someplace that shows the jobs at least trying to run? [21:51:17] <^d> Hm? [21:51:21] * ^d catches scrollback [21:52:03] andrewbogott: on regular wikis we used to log start of jobs [21:52:13] <^d> So what I was doing was spinning up the instances, making sure there were ok, then adding my puppet roles. [21:52:27] <^d> Since 07 and 09 didn't come up with their mounts, I stopped at that point [21:52:41] <^d> (hence why 06/08 have role::elasticsearch::server etc) [21:52:43] hashar: log where? [21:52:52] ^d: yeah, that all fits with what I’m seeing [21:52:57] udp2log / fluorine [21:53:06] but nowadays it is a dedicated service I think [21:53:28] ^d: Right, that became obvious in retrospect. Originally, I thought the lack of configuration was a secondary issue. [21:54:43] <^d> Ah no, it's not :) [21:55:30] andrewbogott: in the udp2log bucket "runJobs" [21:55:39] andrewbogott: you should have a message on start of job and one on completion [21:56:03] hashar: oh, I guess that doesn’t help me w/wikitech [21:58:37] depends where it is sending its udp2log spam :-] [21:58:45] would be rather nice to have it send to some logstash [21:58:46] probably nowhere [22:00:31] bug fill it :-] [22:00:52] thanks again for the security rule fix. I am out to bed [22:01:28] logging a bug isn’t so much fun when I know I’ll have to fix it myself [22:37:37] Coren: review svp? https://gerrit.wikimedia.org/r/189602 [22:37:51] that doesn’t fix the bug but may allow us to see it [22:39:03] Of course, passing that global into another process space may be a Bad Thing [23:08:06] * Coren reviews [23:09:25] thx [23:14:51] andrewbogott: yo [23:15:47] tfinc: hello! [23:16:06] andrewbogott: thanks for the backstory on unicorn, go ahead an add me [23:16:15] and* [23:16:18] Thank you for volunteering :) I’ll add you as an admin [23:16:35] that should give you sudo on unicorn, among other things [23:17:28] tfinc: what’s your labs username? [23:17:41] andrewbogott: Tfinc [23:17:44] (And I hope my email wasn’t grumpy… I had a moment of panic when I realized someone might still be relying on that instance :) ) [23:18:10] andrewbogott: i learned about it for the first time on Friday [23:18:26] yikes [23:18:29] ok, you should have access now [23:20:18] andrewbogott: sigh, key mismatch to get into bastion, i'll have to fix that. can you update the file to be https://gist.githubusercontent.com/flyingclimber/addfb86f1c816f50205f/raw/19703b5bf066fc88f2abb56f0f23a31b8a8a5de1/discussion.json .. last update had the HTML and not JSON content [23:20:33] yep. Same file? [23:21:47] andrewbogott: yes [23:22:11] ok… how’s that? [23:24:43] andrewbogott: looks good [23:24:44] thanks [23:25:10] np [23:32:52] Coren: either my change doesn’t work, or that job is never running. So I’ve learned nothing! [23:48:02] Never or unreliably. 
[23:49:27] valhallasw`cloud: found the problem, -cwd doesn't make it run on the current dir, it was failing because the script needed a local .txt which wasn't loaded by the grid, after adding the full path to the txt, it works directly on crontab [23:50:55] It /is/ running and just not logging [23:56:15] hmm i just made a new node ipsec-c5 but it can't mount /home because its export doesn't include the new node's ip. it's been >30 mins. [23:57:26] jgage: I expect it's a known issue. We're having odd unreliability issues with LDAP that causes that.
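For the record, Alchimista's working setup boils down to a plain crontab entry that submits the bot with jsub and leaves every file the script opens as an absolute path, since the grid job does not start in the bot's directory. Something along these lines, with the schedule and names mirroring the earlier pastebin and purely illustrative:

    # m h dom mon dow   command (paths spelled out in full on purpose)
    0 */6 * * * jsub -N stewie python /data/project/alchimista/bots/alch/pwb.py /data/project/alchimista/bots/alch/stewievo.py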