[03:40:08] legoktm (or anyone), do you know how to run a Python 3 web service on Tool Labs? [03:40:19] webservice2 uwsgi-python3 start (given at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_.28uwsgi.29 ) does not work. [03:40:21] It outputs: [03:40:39] invalid choice: 'uwsgi-python3' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs') [03:40:50] I'm on tools-trusty. [03:52:27] Trying Python 2, but haven't got that yet either. [03:53:31] Never mind, that's working now. [04:02:58] superm401: we (Yuvi and I) tried getting py3 to work, except there were some bugs in the underlying uwsgi library [04:03:32] legoktm, got it. I'll remove that for now. BTW, I'm using your fab library. :) [04:04:03] oh yay! [04:04:12] what are you using it for? [04:05:14] Fixing a Wikipedia gadget that tracks bug status. Kind of similar to fab-proxy, but JSONP instead of a straight proxy: https://phabricator.wikimedia.org/T539 [04:05:36] ah, neat :D [04:06:24] legoktm, have you made any Phabricator bots? [04:06:32] yes, wikibugs :) [04:06:39] Cool, should I just use LDAP? [04:06:50] With a separate account, of course. [04:07:20] you can get a special "bot" account that just has a certificate and can't be logged into normally, /me finds link [04:07:44] Oh, they don't even have passwords? I knew you used the cert for bot operations, but I didn't realize that part. [04:07:58] superm401: https://www.mediawiki.org/wiki/Phabricator/Bots [04:08:09] Thanks. :) [04:14:19] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1024066 (10yuvipanda) [04:14:21] 3operations, Tool-Labs: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1024064 (10yuvipanda) 5Open>3Resolved I think this is fixed at least now. I'll follow up monitoring for this when I'm back from vacation [04:53:17] legoktm, so I'm having the tool throw exceptions on purpose (it won't work anyway until I get a bot and cert). [04:53:19] My question is, how do I actually see what those exceptions are (as opposed to just that it's a 500, which I can see in uwsgi.log). [04:53:54] https://tools.wmflabs.org/phabricator-bug-status/queryTasks?ids=[539]&callback=foo links to 'common causes for errors' (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Logs) but there is no section called Logs. [04:56:45] superm401: I'm not sure if it's logged anywhere...log it yourself? [04:57:14] :( [04:57:34] except Exception, e: logging.error(traceback.fmt_exec()); raise e [04:57:43] something like that maybe? [04:57:45] I will if I have to. No real point, since I can probably just write it without logs once I have my cert. [05:05:49] legoktm, thanks for your help. I'll pick this up when the bot is created. 
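A rough sketch of the "log it yourself" approach legoktm suggests above, assuming the tool is a small Flask app served by uwsgi (the route, file names and placeholder handler are illustrative, not superm401's actual code); note the traceback helper is spelled format_exc(), not fmt_exec():

    import logging
    import traceback

    from flask import Flask  # assumption: the tool is a Flask app behind uwsgi

    app = Flask(__name__)

    # uwsgi.log only records that a 500 happened, so write tracebacks
    # somewhere readable yourself.
    logging.basicConfig(filename='error.log', level=logging.ERROR)

    def run_query():
        # stand-in for the real Phabricator lookup
        raise RuntimeError('placeholder error')

    @app.route('/queryTasks')
    def query_tasks():
        try:
            return run_query()
        except Exception:
            # format_exc() returns the current traceback as a string
            logging.error(traceback.format_exc())
            raise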
[05:06:31] :) [06:32:36] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:38:47] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [06:39:53] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [06:52:44] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:53:58] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [07:02:33] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [07:03:51] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [07:18:56] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [07:22:43] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:47:00] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-puppetmaster does not respond to other instances - https://phabricator.wikimedia.org/T88960#1024349 (10hashar) Maybe ops have some idea? :-( [09:51:41] <_joe_> anyone having network issues in labs? [09:52:53] O/ [10:27:06] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024429 (10hashar) [10:51:59] 3Labs: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1024515 (10faidon) p:5Triage>3High [10:52:17] 3Labs, operations: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1024517 (10faidon) p:5Triage>3Unbreak! [10:52:27] 3Labs: Move wikitech web interface to a dedicated server - https://phabricator.wikimedia.org/T88300#1024518 (10faidon) p:5Triage>3High [14:17:21] (03PS1) 10KartikMistry: Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 [14:18:06] (03CR) 10Hashar: [C: 031] Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:23:46] (03CR) 10Hashar: "I have added as reviewer all members of the tools labs group tools.lolrrit-wm as reported by 'getent group tools.lolrrit-wm'. Hopefully " [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:45:07] Coren: could you restart the webservice for drtrigonbot on tools? Has been offline since the reboots and DrTrigon has been inactive since August [14:45:24] Sure. Give me a minute. [14:45:32] thanks :-) [14:47:00] (03CR) 10Merlijn van Deen: [C: 032] Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:47:03] (03Merged) 10jenkins-bot: Add cx, cx/deploy notification to #mediawiki-i18n [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/189479 (owner: 10KartikMistry) [14:49:15] sitic: That should be it. [14:49:33] jep works, thanks again [14:50:09] !log tools.lolrrrit-wm deployed https://gerrit.wikimedia.org/r/189482 [14:50:09] tools.lolrrrit-wm is not a valid project. 
[14:50:21] !log tools.lolrrit-wm deployed https://gerrit.wikimedia.org/r/189482 [14:50:26] Logged the message, Master [14:50:30] two rs, three rs, potato, potato [15:25:03] * hashar rolls the drums [15:25:55] Coren: good morning and breakfast! Labs security rules seems to be broken :-D [15:27:31] for some reason I have a bunch of firewalling issue that started occurring last Friday around 11pm UTC [15:27:39] got some traces at https://phabricator.wikimedia.org/T88960 [15:40:22] Hm. I don't think we changed anything. Lemme look at this for you. [15:40:45] hashar: ^^ [15:41:16] one example is attempting to ssh from gallium.wikimedia.org to the CI instances in labs. [15:42:01] hashar: 10.68.16.60 is your example, right? [15:42:06] I suspect the labs security rules to be borked. Maybe some cache yields an empty list [15:42:13] yeah [15:42:22] that is integration-slave1001.eqiad.wmflabs [15:42:45] we use ferm rules on CI slaves, and have one to allow ssh (port 22) from gallium ( 208.80.154.135 ) [15:43:02] so at least from prod to labs that is blocked by something :D [15:44:06] hashar: Indeed. I'm trying to track down where from. [15:44:07] on the same projects, the puppet agent can not reach the project puppetmaster on integration-puppetmaster.eqiad.wmflabs ( 10.68.16.96 port 8140) [15:44:23] so that is traffic local to the labs project [15:44:41] the puppet master works just fine (I can run puppet agent locally) [15:46:09] That's odd; I see the traffic reaching the labnet at least. [15:46:46] great [15:49:37] hashar: Your current security group rules (on Wikitech) only show port 22 open from 10.0.0.0/8 [15:50:10] hashar: Sadly, we have no change history over that config so I couldn't tell you if that was changed Friday. [15:50:21] pfff [15:50:22] wtf [15:51:38] Also, port 8140 is not open at all. [15:51:50] it never was [15:51:57] Only rule I see for gallium is to open 873 [15:51:58] since the traffic is local to the project [15:52:12] I have added 208.80.154.135/32 port 22 [15:52:47] And that does open up ssh from gallium. [15:52:52] yes [15:53:03] so seems the security rules got messed up on friday :-((( [15:53:21] 3Tool-Labs: Install byobu terminal multiplexer package on toollabs - https://phabricator.wikimedia.org/T88989#1024935 (10devunt) 3NEW [15:53:43] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024943 (10coren) 5Open>3Resolved a:3coren The project security group did not (was changed not to?) include allowing ss... [15:54:14] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024951 (10hashar) [15:54:27] one issue solved [15:54:30] now the puppetmaster one :D [15:54:48] deployment-prep has its own puppetmaster as well [15:54:50] and works just fine [15:54:55] 3Tool-Labs: Install byobu terminal multiplexer package on toollabs - https://phabricator.wikimedia.org/T88989#1024955 (10devunt) [15:55:00] though port 8140 is not in the security rules [15:58:34] Hm, that's odd. [15:58:49] Do you have a ticket for that one too? [15:59:05] same ticket :D [15:59:26] Oh, I hadn't noticed the "additionally" part. [15:59:28] Heh. 
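For reference, the fix Coren applies here ("I have added 208.80.154.135/32 port 22") is an ordinary OpenStack security-group rule. On the command line it would look roughly like this; the group name and the novaclient syntax are assumptions based on the tooling of that era, not something taken from the log:

    # allow ssh from gallium (208.80.154.135) to instances in the project's
    # security group ('default' is an assumed group name)
    nova secgroup-add-rule default tcp 22 22 208.80.154.135/32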
[15:59:56] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024977 (10hashar) 5Resolved>3Open The integration labs project was missing a security rule to allow ssh from gallium for... [16:00:13] ... I wonder... [16:00:21] someone must have messed the rules [16:01:54] or maybe prod was always allowed to ssh to instances? [16:02:16] or the rules are no applied at the instance level [16:02:17] hashar: I don't think it was, nor was it intended to. [16:02:24] and thus instances can no more communicate to each others [16:02:58] hashar: And, annoyingly enough, adding a rule to allow 8140 worked. [16:03:16] yeah something is definitely wrong [16:03:25] on deployment-prep we have no rule to allow 8140 [16:03:32] I saw. [16:03:55] I wonder if the intervening upgrade (between instance creation) changed a default. I'll have to consult with Andrew [16:05:17] Well, your immediate issue is fixed, at least, but it bears looking into [16:05:25] I guess you can comment on https://phabricator.wikimedia.org/T88960 [16:05:36] I know _joe_ had some related issue this morning [16:06:46] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025012 (10coren) The puppetmaster issue did appear related: adding an explcit rule to allow it fixed the immediate problem,... [16:07:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025014 (10coren) 5Open>3Resolved [16:07:43] _joe_: Moar details? ^^ [16:07:44] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025015 (10hashar) The deployment-prep labs project also uses a local puppetmaster but it does not need any specific security... [16:07:45] Coren: should I fill another task so ? [16:07:57] hashar: Might be wise. [16:08:07] So as to not lose track of the underlying question. [16:09:01] <_joe_> Coren: nope [16:10:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025027 (10akosiaris) @hashar root@integration-slave1002:~# telnet 10.68.16.96 8140 Trying 10.68.16.96... Connected to 10.68... [16:11:58] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025041 (10coren) Yeah, things are working fine now with an explicit rule - but the necessity of //having// the explicit rule... [16:12:41] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) 3NEW [16:13:06] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) Filled another ticket for investigation of the underlying issue: {T88995}. [16:25:10] <^d> Coren: Whatever you and andrewbogott_afk did seems to have worked. 
New instances are coming up fine with nfs for me now [16:47:53] Heh. Spam. "Reduce your mortgage interest by 4%". I'd love to know how they can find a bank that'll *give* me interest on a loan. :-) [16:51:14] Coren: in France the real-estate mortgage are veryyyy low [16:51:30] sometime less than 2% [16:51:40] I dunno about France, but I pay 2.5% on mine so that's pretty reasonable. [16:52:11] heck http://www.meilleurtaux.com even shows 1,75% (excluding insurance) for 15 years [16:52:29] (fixed rates) [16:52:52] the only problem is a house is half a million of euros :-] [16:58:53] <^d> Coren: Perhaps I spoke too soon. deployment-elastic0[568] all came up fine, but [7] did not. Same 2 mounts. [16:59:36] ^d: The race cannot be won all the times. 10:1 just rebooting it will fix it. [16:59:43] * ^d has rebooted 3 times [16:59:47] Huh. [17:00:12] I'm guessing you made them in that order? [17:00:39] <^d> Yep [17:04:24] ^d: That's odd - the LDAP entry for that instance is completely hosed. [17:04:32] <^d> boo :( [17:05:44] That's actually a little worrisome. andrewbogott_afk ping? [17:06:21] <^d> I had deleted an instance of the same name a bit ago. Chance the old entry wasn't fully deleted first? [17:06:22] ^d: Can you avoid touching it for a bit so it's left intact for investigation? [17:06:42] ^d: That might be part of the issue, or a possible cause. [17:07:26] <^d> I'll leave it alone [17:07:33] <^d> logged out too [17:11:00] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025233 (10coren) 3NEW [17:11:30] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025247 (10coren) p:5Triage>3High [17:11:50] ^d: Can you live without and/or make due with a -09 in the meantime? [17:12:00] <^d> I was just thinking that. [17:34:23] why are my python grid jobs failing with "KeyboardInterrupt"? [17:34:35] I'm definitely not ^C'ing them... [17:36:23] they also appear to be failing on network issues talking to gerrit? [17:43:40] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025343 (10Andrew) I've seen this too, but only on Friday. The addition of the DNS info to the ldap record is done by a mw job -- for the time being I'm blaming this on load but we should keep an eye out fo... [18:01:43] legoktm__: I think that might be the SIGKILL that's sent before the SIGTERM? not sure... [18:01:50] although ctrl-c is SIGINT I think [18:18:46] Jobkill goes INT -> TERM -> KILL [18:19:13] legoktm: But tell me more about the network issues. [18:19:32] Guest92790: ^^ [18:21:13] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025447 (10coren) I note there are also several missing puppetVar and puppetClass, but not all of them. Are they in fact added through different mechanism? [18:26:19] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025449 (10Andrew) No! The mw job only sets the arecord. If we're getting instances without puppetVars then... something interesting is happening :( [18:28:51] 3Labs: Wikitech created a broken LDAP entry for a new instance - https://phabricator.wikimedia.org/T89001#1025453 (10Andrew) Oh, nope, I'm wrong -- puppetVars are handled in OpenStackNovaPrivateHost.php, same as the private arecord. So this is at least only one problem rather than two. 
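The INT -> TERM -> KILL escalation Coren describes is also why legoktm's jobs die with "KeyboardInterrupt": in Python the initial SIGINT surfaces as exactly that exception. A minimal sketch of making a grid job report why it was stopped (the main() body is a placeholder, not legoktm's actual script):

    import signal
    import sys

    def on_sigterm(signum, frame):
        # second stage of the escalation: log and exit before SIGKILL arrives
        sys.stderr.write('received SIGTERM from the grid, exiting\n')
        sys.exit(1)

    signal.signal(signal.SIGTERM, on_sigterm)

    def main():
        pass  # placeholder for the real work (e.g. the git pull / page generation)

    try:
        main()
    except KeyboardInterrupt:
        # first stage: SIGINT shows up in Python as KeyboardInterrupt
        sys.stderr.write('interrupted by the grid (SIGINT)\n')
        raise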
[18:33:28] Coren: want to double-check my work and verify that the jobqueue on virt1000 is in fact working and empty? [18:36:39] andrewbogott: Looking now. [18:36:44] thx [18:36:53] It should be running 1/min on a cron [18:37:54] It does, and showJobs shows 0 [18:39:08] Is there any scenario where the jobqueue would just… throw jobs away? [18:39:20] andrewbogott: Do you have an opinion about the networking issue? T88960 [18:39:53] andrewbogott: I don't know of a way for jobs to not be run, but I don't think error recovery when a job fails is that robust. [18:41:22] The job resubmits itself in certain situations, maybe I can broaden that [18:41:58] Coren: is that bug a network issue, or just an ‘instances without an arec don’t have an arec’ issue? [18:42:01] * andrewbogott reads more closely [18:42:21] oh, sorry, misread [18:42:53] No, that one is unrelated. It looks like there is no communication between instances in the same project by default anymore; at least that always has been my understanding. [18:43:45] Coren: hashar’s instances use ferm… [18:43:57] So it could be an issue with the firewall local to the instance [18:44:00] That one didn't. [18:44:09] I checked to make sure first. [18:44:12] ok [18:44:17] sorry, I will actually read this to the end [18:44:22] :-) [18:45:56] I messed with that project’s security rules on Friday. So this is probably my fault one way or another [18:48:20] Huh. Is there a tunable knob about "allow connections within the project"? [18:48:42] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025519 (10Andrew) I messed with the security rules on Friday because someone on IRC (timo, I think?) was trying to ssh between instances and failing.... [18:49:47] Hm, I don’t think of security rules as having any sense of in- or out-of-project [18:58:03] Coren: sorry, I read the log wrong, there are no network issues. My python script is shelling out to git, does a git pull on a large repo which takes a while, and is dying with KeyboardInterrupt (/data/project/extreg-wos/generate.err)...any ideas? [19:01:47] legoktm: that could be SGE killing the job [19:02:23] valhallasw`cloud: should I throw more memory at it? [19:02:39] might help. qacct -j should tell the maxvmem, I think [19:02:48] and you can add -ma to get a mail on abort [19:05:56] ok yeah it's running out of memory [19:05:56] I bumped it up to 900M from 500M [19:06:54] I really think -ma should be default [20:01:53] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025736 (10hashar) 3NEW [20:02:35] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) The doc publishing jobs are failing as well and there is no workaround for it :( T89026 [20:02:50] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) The doc publishing jobs are failing as well and there is no workaround for it :( T89026 [20:04:26] andrewbogott: good morning :] [20:04:33] 'morning! [20:04:39] hashar: ready for me to merge those two patches? 
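Spelled out, valhallasw's diagnosis steps are: let qacct report the dead job's peak memory, then resubmit with a larger reservation (and mail on abort). The job name, script and exact flag spellings below are illustrative; check jsub --help and the qsub man page on the current grid:

    # after the job has died, see how much memory it peaked at
    qacct -j extreg-wos | grep maxvmem

    # resubmit with a larger memory reservation and ask for mail on abort
    jsub -N extreg-wos -mem 900m -m a python generate.py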
[20:04:44] the poor security rules are broken on integration :D [20:04:49] sorry lacking a bouncer [20:05:06] I guess some rule that allow communications between instances inside a project has been removed somehow [20:06:01] hashar: I responded to the phab. I was messing with security rules on (I think) Timo’s request. I probably broke something by accident. [20:06:07] Anyway, all working now right? [20:06:27] na :] [20:06:31] well let me check [20:06:47] Coren: can you check on the status of /public/backups? Labs instances are complaining; I’m on labstore1001 but don’t see what’s going on [20:06:58] there is a workaround to explicitly define security rules, but in some case that is not possible [20:07:48] andrewbogott: Hm, /public/backup is obsolete and shouldn't be even used since https://gerrit.wikimedia.org/r/#/c/184638/ [20:08:16] jgage is getting puppet errors about it, but maybe his puppet is broken [20:08:32] <^d> FOR ALL THAT IS GOOD AND HOLY IN THIS WORLD [20:08:47] ^d? [20:08:48] * ^d stabs nfs [20:08:51] <^d> *AGAIN* [20:08:56] o_O [20:09:03] ^d: What instance? [20:09:06] <^d> Spun up an 09 to work around the missing 07. [20:09:11] <^d> deployment-elastic09 [20:09:19] <^d> (and this is *not* a replacement. There's never been an 09) [20:09:37] ^d: Still broken in ldap [20:09:48] andrewbogott: ^^ ze problem, she is real. [20:10:34] andrewbogott: Same missing things. It's got no aRecord nor the non-default vars and classes [20:11:01] ^d: The issue isn't NFS but LDAP. We be on it. [20:11:16] <^d> Well, cascades into nfs, right? [20:11:20] <^d> When ldaps busted? [20:11:46] ^d: dns is in ldap, so nothing works without it [20:11:53] <^d> Makes sense [20:13:19] andrewbogott: still broken :] [20:13:37] hashar: the phab says otherwise, can you update the ticket accordingly? [20:13:41] andrewbogott: I have a use case were the labs instance ssh to another instance and I can't really allow any instance to ssh to it [20:13:42] Or am I misreading somehow? [20:13:56] andrewbogott: I want to restrict ssh to the integration labs instance. Filled a bug about it https://phabricator.wikimedia.org/T89026 [20:14:12] the way we worked around the underlying issue is by explicitly defining security rules [20:14:21] ok [20:14:22] such as allowing connections to puppet port [20:14:31] so convince me that this is a new problem and not just the way things have always worked? [20:14:50] I have no clue how the security rules work, but seems a rule that allow instances to communicate together inside a project has been dropped [20:15:03] ah, ok [20:15:25] I am 1000% sure that instances always have been able to communicate on whatever port / protocol as long as they are in the same project. [20:15:51] so you were making use of a ‘source group’ rule which is so obscure that I removed it from the GUI. you must’ve configured it back in the day. [20:15:52] on beta cluster, there is no security rules to communicate to the puppet port ( 8160 or something ) and puppet is still happy [20:15:58] I will replace that rule from the commandline [20:16:02] ohhhhh [20:16:10] andrewbogott: And, indeed, so am I. I always believed that this was the case. [20:16:22] so deployment-prep must have a similar rule [20:17:19] ah source group [20:17:21] \O/ [20:17:21] the openstack people removed use of ‘source groups’ from their docs as well — it caused a bunch of confusion and I judged it to be worthless. Clearly that was incorrect :( [20:17:45] so by default instances can't communicate with each other right ? 
[20:17:58] ugh, and removed from their usage statements as well. This should be interesting [20:18:01] * andrewbogott dives into source [20:18:04] whenever we created the contintcloud project (fairly recently) it cames with the source group by default [20:18:28] hashar: right, by default the project membership has no affect on firewall rules [20:19:37] andrewbogott: here what I got https://phabricator.wikimedia.org/F38628 :D [20:19:43] for contintcloud project (not used yet) [20:19:54] deployment-prep has the same [20:20:01] at least we know what is the underlying cause now! [20:20:07] hashar, please stop talking for a minute so I can fix your project? [20:20:45] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025803 (10hashar) [20:20:46] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025802 (10hashar) [20:20:59] sure [20:22:35] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025046 (10hashar) Andrew found out that the integration labs project is missing the security rule that allows communication between instances. That i... [20:27:20] hashar: it looks like the commandline has changed and now I’m required to specify a port range. What do you need besides 22? [20:30:12] aoriarhee [20:30:25] no clue really [20:30:30] try 'any' or '*' ? :-D [20:30:54] I can do 0-1000000 [20:31:01] just thought you might have a list of services [20:31:01] andrewbogott: it still works for deployment-prep so maybe whatever is set for that project can be reused? [20:31:34] hashar: yeah, I’m talking about the commandline tool changing not the firewall implementation [20:31:46] anyway, integration should be happy now [20:33:38] Coren: we have 464 instances and 589 host entries — so it’s not like we hit another luck number in ldap… [20:33:52] I can’t imagine why it’s rejecting those entries. [20:33:58] trying :] [20:34:43] andrewbogott: It looks like the job is failing, but I can't see why. [20:35:40] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025875 (10hashar) a:3Andrew This has been fixed on spot by Andrew. See T88995 for details. [20:35:48] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025877 (10hashar) 5Open>3Resolved [20:37:05] hashar: sorry i broke everything on Friday :( [20:37:22] nested firewalls are hard to debug [20:39:16] what kind of sec rule have you added ? [20:39:38] integration-publisher [10.68.16.255] ssh port can now be reached by any instance [20:39:47] I guess there is an allow source any isn't it ? [20:40:43] ah no different rule :-] [20:42:37] andrewbogott: and yeah stuff get broken from time no worries. I am quite happy to see you figured out the fix in a few minutes \O/ [20:43:02] can you comment and close the bug for later reference please? https://phabricator.wikimedia.org/T88995 thx! 
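The "source group" rule andrewbogott restores here is what lets instances in the same project reach each other. As a command line it would be roughly the following; the group names and port range are illustrative and, as he notes, newer novaclient versions insist on an explicit port range:

    # allow TCP between all members of the project's default security group
    # (group names are assumptions, not taken from the log)
    nova secgroup-add-group-rule default default tcp 1 65535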
[20:47:02] greetings [20:47:23] andrewbogott: yes my stale nfs mounts problem is only with the "backups" mount point [20:47:35] i tried rebooting [20:48:12] jgage and Coren can you confer on this please? [20:48:52] jgage: Hm. What instance is this? [20:49:35] 3Wikimedia-Labs-Infrastructure: Labs security rules changed on integration labs project around Friday Feb 6th 23:30 UTC - https://phabricator.wikimedia.org/T88995#1025923 (10Andrew) 5Open>3Resolved a:3Andrew Yeah, I deleted the 'source group' rule because I suspected it of interfering with inter-instance s... [20:49:37] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: integration labs instance can not rsync/ssh to integration-publisher [10.68.16.255] instance - https://phabricator.wikimedia.org/T89026#1025926 (10Andrew) [20:49:37] coren - project: ipsec. affected hosts: ipsec-c{1-4}. not affected: ipsec-pm. nfs mounts of /public/backups are stale, and after reboot when i try to mount -a i get this: [20:49:40] mount.nfs: mounting labstore.svc.eqiad.wmnet:/backups failed, reason given by server: No such file or directory [20:50:57] thx coren [20:51:05] it should be noted that i'm using self hosted puppetmaster, so i wondered whether i'm missing an important puppet change [20:52:12] jgage: You have, at least one: https://gerrit.wikimedia.org/r/#/c/184638/ is the one that removed /public/backup/ [20:52:19] aha :) [20:53:38] * andrewbogott endorses rebasing every day when using self-hosted puppet [21:10:43] andrewbogott, jgage: This module will do that (sync with upstream puppet repo) -- https://github.com/wikimedia/operations-puppet/blob/production/modules/puppetmaster/manifests/gitsync.pp [21:11:32] Set $::puppetmaster_autoupdate and poof it should just work [21:18:19] grr, I’m remembering that things run from the jobqueue don’t write to the auth log :( [21:22:29] using cron, is there a specific place to add the -cwd flag? Seems not be working in a job [21:25:05] the line is this one: http://pastebin.com/0QaK1fsF [21:31:04] Alchimista: doesn't work in what sense? the cwd for the second jsub command is $HOME, not whatever you cd'ed to [21:31:09] in the other jsub job [21:32:31] Alchimista: I *think* the best way to get what you want is to make a small shell script that does [21:33:06] #!/bin/bash [21:33:06] cd /data/project/alchimista/bots/alch; [21:33:06] jsub -cwd -N stewie python pwb.py stewievo.py > ~/bots/logs/stewie.log 2>&1 [21:34:08] valhallasw`cloud: i was tryng to avoid it, it'll be 20 or more cron jobs, if it where possible to do it in cron, was more easily manageble :s [21:34:29] Alchimista: hmm. [21:34:48] it's a bit annoying the jsub-fixup-script doesn't understand cd et al [21:38:00] yah, specially when it needs more than one file [21:40:28] Alchimista: I'm not sure. Maybe with some bash, but getting bash -c syntax right is basically impossible [21:41:06] It may well be a good time to get rid of the fixup script now; there's not much reason for it anymore. [21:41:34] Coren: I’m looking at deployment-elastic09 in ldap now… it’s missing an arec but the puppet settings look fine to me. Do you disagree? [21:41:38] Though it does meant a bit of policing to make sure tools-submit isn't being misused. [21:41:59] andrewbogott: Lemme refresh [21:42:52] andrewbogott: Well, they are correct for the default, but I doubt ^d just didn't configure the rest like -08 [21:43:36] you mean like role::elasticsearch::server? Where would that come from if not hand-config by ^d? 
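bd808's gitsync pointer above automates what andrewbogott endorses doing by hand: keeping the self-hosted puppetmaster rebased onto upstream production. Done manually it is just a periodic pull, something like the cron entry below (the checkout path assumes the usual role::puppet::self layout; adjust it to wherever the local repo actually lives):

    # daily, on the self-hosted puppetmaster: rebase the local operations/puppet
    # checkout onto upstream production (path is an assumption)
    17 3 * * * cd /var/lib/git/operations/puppet && git pull --rebase origin production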
[21:46:00] That's was my point but I now notice that -07 and -09 were indeed /not/ configured (probably because they were broken). So the absence of vars and classes is not a problem, only the arec is missing and the problem [21:46:24] ok, great. [21:46:51] Too bad there’s no logging from things in the jobqueue :( [21:50:14] Coren: is there logging someplace that shows the jobs at least trying to run? [21:51:17] <^d> Hm? [21:51:21] * ^d catches scrollback [21:52:03] andrewbogott: on regular wikis we used to log start of jobs [21:52:13] <^d> So what I was doing was spinning up the instances, making sure there were ok, then adding my puppet roles. [21:52:27] <^d> Since 07 and 09 didn't come up with their mounts, I stopped at that point [21:52:41] <^d> (hence why 06/08 have role::elasticsearch::server etc) [21:52:43] hashar: log where? [21:52:52] ^d: yeah, that all fits with what I’m seeing [21:52:57] udp2log / fluorine [21:53:06] but nowadays it is a dedicated service I think [21:53:28] ^d: Right, that became obvious in retrospect. Originally, I thought the lack of configuration was a secondary issue. [21:54:43] <^d> Ah no, it's not :) [21:55:30] andrewbogott: in the udp2log bucket "runJobs" [21:55:39] andrewbogott: you should have a message on start of job and one on completion [21:56:03] hashar: oh, I guess that doesn’t help me w/wikitech [21:58:37] depends where it is sending its udp2log spam :-] [21:58:45] would be rather nice to have it send to some logstash [21:58:46] probably nowhere [22:00:31] bug fill it :-] [22:00:52] thanks again for the security rule fix. I am out to bed [22:01:28] logging a bug isn’t so much fun when I know I’ll have to fix it myself [22:37:37] Coren: review svp? https://gerrit.wikimedia.org/r/189602 [22:37:51] that doesn’t fix the bug but may allow us to see it [22:39:03] Of course, passing that global into another process space may be a Bad Thing [23:08:06] * Coren reviews [23:09:25] thx [23:14:51] andrewbogott: yo [23:15:47] tfinc: hello! [23:16:06] andrewbogott: thanks for the backstory on unicorn, go ahead an add me [23:16:15] and* [23:16:18] Thank you for volunteering :) I’ll add you as an admin [23:16:35] that should give you sudo on unicorn, among other things [23:17:28] tfinc: what’s your labs username? [23:17:41] andrewbogott: Tfinc [23:17:44] (And I hope my email wasn’t grumpy… I had a moment of panic when I realized someone might still be relying on that instance :) ) [23:18:10] andrewbogott: i learned about it for the first time on Friday [23:18:26] yikes [23:18:29] ok, you should have access now [23:20:18] andrewbogott: sigh, key mismatch to get into bastion, i'll have to fix that. can you update the file to be https://gist.githubusercontent.com/flyingclimber/addfb86f1c816f50205f/raw/19703b5bf066fc88f2abb56f0f23a31b8a8a5de1/discussion.json .. last update had the HTML and not JSON content [23:20:33] yep. Same file? [23:21:47] andrewbogott: yes [23:22:11] ok… how’s that? [23:24:43] andrewbogott: looks good [23:24:44] thanks [23:25:10] np [23:32:52] Coren: either my change doesn’t work, or that job is never running. So I’ve learned nothing! [23:48:02] Never or unreliably. 
[23:49:27] valhallasw`cloud: found the problem, -cwd doesn't make it run on the current dir, it was failing because the script needed a local .txt which wasn't loaded by the grid, after adding the full path to the txt, it works directly on crontab [23:50:55] It /is/ running and just not logging [23:56:15] hmm i just made a new node ipsec-c5 but it can't mount /home because its export doesn't include the new node's ip. it's been >30 mins. [23:57:26] jgage: I expect it's a known issue. We're having odd unreliability issues with LDAP that causes that.
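For the record, Alchimista's working setup boils down to a plain crontab entry that submits the bot with jsub and leaves every file the script opens as an absolute path, since the grid job does not start in the bot's directory. Something along these lines, with the schedule and names mirroring the earlier pastebin and purely illustrative:

    # m h dom mon dow   command (paths spelled out in full on purpose)
    0 */6 * * * jsub -N stewie python /data/project/alchimista/bots/alch/pwb.py /data/project/alchimista/bots/alch/stewievo.py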