[01:24:30] bd808: how do you set a puppet class parameter via nova puppet groups?
[01:24:46] adding a parameter with :: in the name does not seem to work
[01:40:58] (PS1) Gergő Tisza: Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956)
[01:41:55] (PS2) Gergő Tisza: Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956)
[02:08:57] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1493979 (scfc) The disk space on `tools-webgrid-lighttpd-1406` was running low as well, so I rebooted that instance, too. Searching for instances that have a `pacct.0`...
[02:44:33] PROBLEM - SSH on tools-webgrid-lighttpd-1409 is CRITICAL: Server answer
[02:49:35] RECOVERY - SSH on tools-webgrid-lighttpd-1409 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[02:55:24] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1493990 (scfc) Rebooted: - `tools-webgrid-generic-1401`, - `tools-webgrid-generic-1402`, - `tools-webgrid-generic-1403`, and - `tools-webgrid-lighttpd-1401`.
[02:57:23] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1493993 (scfc) NEW
[03:40:34] PROBLEM - SSH on tools-webgrid-lighttpd-1409 is CRITICAL: Server answer
[04:20:34] RECOVERY - SSH on tools-webgrid-lighttpd-1409 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[04:56:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 30.00% of data above the critical threshold [0.0]
[05:36:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[06:01:33] Tool-Labs-tools-Erwin's-tools: "Fatal Error: Database query failed" for Related Changes tool - MySQL down? - https://phabricator.wikimedia.org/T104688#1494098 (Supernino) This tool was extremely important to patrollers, maybe developing a new one as an alternative would be a good idea.
[06:47:47] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 40.00% of data above the critical threshold [0.0]
[07:27:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[07:51:14] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494177 (valhallasw) Resolved>Open I see four: * tools-shadow-01 * Puppet failure UNKNOWN1M 5d UNKNOWN: execution of the check script exited with exception list index out of range * Puppet staleness UNKNOW...
[07:57:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 44.44% of data above the critical threshold [0.0]
[08:03:36] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494203 (yuvipanda) @Coren hasn't really responded on T104781 so I'm just going to delete tools-shadow-01 now. Not sure about the other host
[08:04:29] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494208 (yuvipanda) Bah, can't do that (unable to login to wikitech, 2fa phone dead atm). I'll do it tomorrow if nobody else gets to it in the meantime.
[08:06:12] Labs, Labs-Infrastructure, Labs-Sprint-105, Labs-Sprint-106, and 2 others: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1494210 (yuvipanda) Open>Resolved It's also fixed for user accounts now, and monitoring has been set up as well. Let me know if it doesn't work...
[08:20:33] Labs, Labs-Sprint-107: Identify user facing services labs provides - https://phabricator.wikimedia.org/T105721#1494217 (yuvipanda) I call this 'done'. See T107058 and children for implementation of checks.
[08:25:03] Labs, Incident-20150617-LabsNFSOutage, Labs-Sprint-102, Labs-Sprint-103: Labs: rewrite remaining labstore* scripts - https://phabricator.wikimedia.org/T102520#1494219 (yuvipanda) Only maintain-replicas is left.
[08:32:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[08:33:18] (PS1) Yuvipanda: Remove jgage's root key from labs instances [labs/private] - https://gerrit.wikimedia.org/r/227951
[08:33:20] (PS1) Yuvipanda: Replace Yuvi's key with new Key from new laptop [labs/private] - https://gerrit.wikimedia.org/r/227952
[08:33:44] (CR) Yuvipanda: [C: 2 V: 2] Remove jgage's root key from labs instances [labs/private] - https://gerrit.wikimedia.org/r/227951 (owner: Yuvipanda)
[08:33:56] (CR) Yuvipanda: [C: 2 V: 2] Replace Yuvi's key with new Key from new laptop [labs/private] - https://gerrit.wikimedia.org/r/227952 (owner: Yuvipanda)
[08:38:16] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1494226 (yuvipanda) So... @matanya wanted this project to be not-killed, so it hasn't been yet. However, the instance could just as well be dead - I"m unable to ssh into it. @matanya - can you?
[08:50:03] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494233 (yuvipanda) Anything else? /root still has test scripts, but that's presumably ok?
[10:07:33] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[11:15:44] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 60.00% of data above the critical threshold [0.0]
[11:31:35] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1494429 (mark) I'd like to see a migration plan be developped on this ticket.
[11:50:42] RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0]
[12:14:13] Wikibugs: Add wikibugs to #wikimedia-rename - https://phabricator.wikimedia.org/T107418#1494512 (Steinsplitter) NEW
[12:47:25] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494562 (faidon) Sounds good to me, although I'd really prefer it if we reinstalled labstore1001 at this point before we switch over t...
[13:36:29] YuviPanda: do you know what’s up with all the tools puppet warnings over night?
[14:03:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[14:05:17] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 33.33% of data above the critical threshold [0.0]
[14:05:55] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 30.00% of data above the critical threshold [0.0]
[14:05:57] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 40.00% of data above the critical threshold [0.0]
[14:08:41] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838311 bytes in 2.203 second response time
[14:25:22] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0]
[14:45:58] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0]
[14:45:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0]
[14:50:48] 10-minute warning! I’ll be rebooting tools-login shortly.
[14:50:49] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[14:54:44] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838932 bytes in 3.917 second response time
[15:00:26] !log tools rebooting tools-bastion-01 aka tools-login
[15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[15:02:07] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1494786 (Matanya) No, i don't have time to work on this now, although it is a very nice project. You can kill it from my POV :(
[15:07:53] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1494799 (Andrew) tools-bastion-01 is now rebooted as well.
[15:11:28] thanx Coren for the fast reboot and the countdown before
[15:11:43] doctaxon: That's all Andrew this time. :-)
[15:12:15] thanx andrewbogott for the fast reboot and the countdown before
[15:12:44] * andrewbogott hopes it actually solved the problem :)
[15:15:18] andrewbogott - but such jobs are great jobs - with countdown you can prepare up to the point very effectivel
[15:15:32] y
[16:06:14] tgr|away: I don't think you can directly. You either need to use hiera or resort to the pre-hiera trick of global variables -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/kibana.pp#L23-L26
[16:06:41] You can use your project
[16:06:54] You can use your project's page in the Hiera namespace -- https://wikitech.wikimedia.org/wiki/Hiera:Logstash
[16:08:27] yeah, I ended up using globals
[16:08:40] gross :)
[16:09:43] If you need per-host hiera settings you can do that with patches to ops/puppet -- https://github.com/wikimedia/operations-puppet/tree/production/hieradata/labs/deployment-prep/host
[16:10:07] but without a project puppetmaster that is pretty annoying
[16:14:18] I want to make using the class easy, and there are some parameters you need to set every time
[16:14:55] turning them into a variable seemed the nicest way to go
[16:15:24] although you still need to create a hiera file just to pass the variable so maybe not so much
[16:16:04] I'm hoping labs gets hiera role support at some point which would fix that
[16:26:09] tgr: Ori is actually trying to kill hiera roles in prod so it's not highly likely to show up in labs. https://phabricator.wikimedia.org/T106404
[16:37:37] does @reboot not actually work in crontab?
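The global-variable fallback bd808 points to in the 16:06–16:26 exchange above (modelled on manifests/role/kibana.pp) is easier to see with a sketch. The class and variable names below are hypothetical, not the actual kibana or Sentry ones, and this is only a sketch of the pattern, not the patch tgr ended up writing:

    # Sketch only: $::example_server is a hypothetical global that a Labs user
    # sets through the wikitech puppet configuration UI, since class parameters
    # with '::' in the name can't be set there directly.
    class role::example {
        $server = $::example_server ? {
            undef   => 'example.invalid',   # fall back to a default when unset
            default => $::example_server,
        }

        class { '::example::client':
            server => $server,
        }
    }

The Hiera alternative bd808 mentions is to put the fully qualified key (e.g. example::client::server) on the project's Hiera: page on wikitech instead of introducing a global.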
[17:20:15] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1495142 (scfc) Open>Resolved a:scfc Login works again.
[17:20:31] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1495150 (scfc) a:scfc>None
[17:22:42] Labs, Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1495162 (scfc)
[17:23:26] Labs, Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1490787 (scfc) (Geohack is indeed currently running on Precise.)
[17:30:34] Idea, possibly for a Phab task if you don't say it's mad: Shouldn't independent Labs projects have a tools.wmflabs.org/name redirection to name.wmflabs.org, and of course, be reserved as simple tool names? Or maybe this is taken into account somehow?
[17:35:02] jem: mm, maybe, but nontrivial to implement
[17:35:57] jem: I'm not sure if it's worth the effort. It's probably easier to act manually once problems arise
[17:36:19] doesn't that already happen for projects?
[17:36:41] hmm
[17:36:54] not tools redirects
[17:37:07] no, and service groups can have the same names as projects
[17:37:13] eg. tools.xtools and xtools
[17:39:07] I see
[17:40:20] In fact I wasn't thinking about "problems", just memory failures
[17:41:20] I'm not sure what's up with the memory failures
[17:41:25] we do have enough space on the trustyes
[17:41:45] Ehm
[17:41:57] Maybe I didn't explain well
[17:42:08] I mean human memory failures
[17:42:32] (And probably my English is not deep enough)
[17:43:07] ahaha :)
[17:43:11] :)
[17:43:17] but we have been having some memory issues on the trusties as well
[17:44:13] Well, alternatively, tools.wmflabs.org/whatever could point to an error/somehow helping page
[17:44:38] It seems it justs loads a blank page now
[17:45:00] So one could get a little confused
[17:52:00] Labs, Labs-Sprint-107, Patch-For-Review: Build proper monitoring for making sure that processes that need to run only once on one labstore only are running only once on one labstore only - https://phabricator.wikimedia.org/T106590#1495299 (yuvipanda) Taking stock, I see the following daemons running tha...
[17:55:33] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1495302 (yuvipanda) @andrew can you send out a scheduling email? If labvirt1009 is ok with you I can provide exact toollabs failover mechanisms.
[17:56:11] Labs, Labs-Sprint-107: Identify user facing services labs provides - https://phabricator.wikimedia.org/T105721#1495308 (yuvipanda) Open>Resolved a:yuvipanda
[17:56:11] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495310 (yuvipanda)
[17:56:56] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1485532 (yuvipanda)
[17:58:15] Labs: Setup an availability checker for all labsdb hosts - https://phabricator.wikimedia.org/T107449#1495326 (yuvipanda) NEW
[17:58:59] Labs: Have a checkpoint check for labs proxies - https://phabricator.wikimedia.org/T107450#1495332 (yuvipanda) NEW
[18:00:00] Labs: Have checkpoint check for public labs DNS - https://phabricator.wikimedia.org/T107451#1495338 (yuvipanda) NEW
[18:01:00] jem: yeah, there's an open bug for the empty response. It should just say 'we don't know this tool'
[18:01:12] and we could add a check 'maybe the project exists' there, I suppose
[18:01:29] Good
[18:02:03] Is the bug number easy to find?
[18:02:13] Labs: Setup checkpoint check for private DNS - https://phabricator.wikimedia.org/T107453#1495357 (yuvipanda) NEW
[18:03:18] Labs: Create a checkpoint check for labs LDAP - https://phabricator.wikimedia.org/T107454#1495364 (yuvipanda) NEW
[18:05:16] https://phabricator.wikimedia.org/T104870
[18:06:15] Labs: Have checkpoint check for external network access from labs - https://phabricator.wikimedia.org/T107455#1495371 (yuvipanda) NEW
[18:06:16] Suscribed, thanks
[18:07:07] Labs: Create a checkpoint check for labs puppetmaster - https://phabricator.wikimedia.org/T107456#1495380 (yuvipanda) NEW
[18:07:10] I'll add a comment about "project exists"
[18:08:05] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495386 (yuvipanda) NEW
[18:09:26] Labs, Tool-Labs: 'webservice not available' message no longer shown - https://phabricator.wikimedia.org/T104870#1495399 (-jem-) It could be interesting that the landing page showed more information about tools with a similar name or Labs projects with the same name.
[18:10:46] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495405 (yuvipanda) The enwiki login and edit check is 30k points a month and the read check is 21k points a month.
[18:12:46] YuviPanda, shouldn't these all have the monitoring project?
[18:12:53] Krenair: possibly
[18:13:06] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495416 (Krenair) and -static?
[18:13:27] also Gerrit just very helpfully decided to email me about Krinkle_'s CR of my maintain-meta_p commit
[18:13:34] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495418 (yuvipanda) Created subtasks for everything except instance availability.
[18:13:51] which he did two days ago
[18:14:40] ouch
[18:15:29] YuviPanda, see PS4 though!
[18:20:38] YuviPanda: https://github.com/dbcli/mycli *drool*
[18:23:33] valhallasw`cloud: nice!
[18:23:37] we can import it if we want
[18:24:45] I thought we weren't doing python packages anymore =p
[18:24:59] :P
[18:25:00] oh, there's a package on packagecloud, whatever that is
[18:28:11] seems nice, I tried to write something like that once adn gave up as it was a real pita
[18:31:13] Labs: Home directory does not contain replica.my.cnf - https://phabricator.wikimedia.org/T107034#1495512 (yuvipanda) Open>Resolved a:yuvipanda It should now! Fixed in T104453
[19:14:07] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495734 (coren) a:coren
[19:18:11] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1495747 (coren) a:coren
[19:21:50] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[19:24:32] Hm.
[19:25:55] Caused by toolsdb timing out.
[19:26:59] Whats wrong with https://tools.wmflabs.org/geohack/geohack.php P
[19:27:44] Geisotr: Don't know yet, but I expect it's the same root cause. One of the databases appears ill.
[19:28:49] emmm... https://tools.wmflabs.org/kmlexport/ seems to have the same problem
[19:29:08] for almost 20 hours: http://tools.freeside.sk/monitor/http-kmlexport.html
[19:29:28] Geisotr: Unlikely to be related then.
[19:30:24] is it possible to restart kmlexport then?
[19:30:45] I'll look into it once I'm done with the immediate db issue.
[19:31:25] hi
[19:31:31] I restarted kmlexport yesterday
[19:31:35] is there any indication that labsdb1005 is involved?
[19:31:44] i got a report that geohack is down
[19:31:58] but i also see this:
[19:32:00] 12:27 < icinga-wm> PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[19:32:01] mutante: Yes, that's the immediate issue - but there doesn't seem to be anything wrong with the DB itself. I'm looking at networking now.
[19:32:19] Coren: it's firewalling then :/
[19:32:24] but apparently it's completely broken again? the developers should put time in it to fix it...
[19:32:34] Coren: we limited the connections to 10.0.0.0/8
[19:32:39] mutante: That (labstore2001) warning is unrelaed - three backups are in progress and that's the destination.
[19:33:12] mutante: yeah, need to revert I guess
[19:33:20] mutante: Labs instances should be in there for sure; I'm still trying to figure out what gives.
[19:33:28] i am reverting it.. damn it
[19:33:42] Maybe there is natting somewhere along the line. :-(
[19:33:42] ok
[19:33:44] we _did_ check the connections before
[19:33:51] and it was only 10.x.x.x
[19:33:54] Hm.
[19:34:08] Then maybe the issue isn't what you tried to filter, but what the filter effectively did.
[19:34:09] discussed it quite a bit with jynus and all that ..
[19:34:10] hold on
[19:34:28] 0 0 ACCEPT tcp -- * * 10.0.0.0/8 0.0.0.0/0 tcp dpt:5666
[19:34:40] is that the right port?
[19:34:51] wow
[19:34:58] it is the right port
[19:35:05] no 3306?
[19:35:09] per netstat
[19:35:22] no, it's not
[19:35:27] it's nrpe.. crap
[19:35:31] ywa
[19:35:40] That explains this.
[19:36:24] mutante: So 3306 should work then. :-)
[19:36:34] unbelievable after all the talk about it.. fixing it
[19:36:45] both should be open, nrpe should be open by default
[19:36:55] as in, the base class
[19:37:18] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495779 (Magnus) NEW
[19:38:05] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495788 (coren) Issue is known, there is currently a bug in the firewall that is in the process of being fixed.
[19:38:34] the base class got applied but not the rule from the role class
[19:38:42] NRPE and SSH etc were just fine
[19:40:54] had to manually flush it all
[19:40:56] but done
[19:41:59] but geohack not back yet ?
[19:42:17] sorry users, really tried to double check all this before hand
[19:42:45] https://tools.wmflabs.org/geohack/geohack.php
[19:42:45] mutante: It's likely most of the geohack connections will need to timeout before it's fixed. Lemme kick it.
[19:42:48] there, looks normal?
[19:43:19] https://tools.wmflabs.org/geohack/
[19:43:20] Still lots of connections in exponential backoff - I expect it'll take some time to recover all.
[19:43:32] it looks ok in my browser now
[19:43:51] Yep. Most things are recovering now.
[19:45:41] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838925 bytes in 2.613 second response time
[19:45:52] sorry again
[19:46:08] mutante: Excrement occurs. :-)
[19:46:57] i really expected this to be it:
[19:47:04] 507 ferm::service{ 'mariadb-labsdb':
[19:47:04] 508 proto => 'tcp',
[19:47:04] 509 port => 3306,
[19:47:05] 510 srange => '$INTERNAL',
[19:47:12] gotta figure out why that role wasn't applied here
[19:47:19] that should have been
[19:47:49] at least it was only db1005 and we did not do the others that have enwiki etc on it
[19:51:23] root cause:
[19:51:25] i have a continous job running python script that fetch changes from recent changes IRC and analyze them (for copyright violations). I see that after few hours, the process stops getting updates from IRC - even if I recreate the thread that listens to the IRC
[19:51:28] the firewall hole we wanted is in role::mariadb::labs
[19:51:38] but that server only has role labs::db::master
[19:51:49] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495830 (coren) Open>Resolved a:coren This was fixed.
[19:51:54] we mixed up the roles
[19:53:17] in the job.err/job.out I see no indication for error, so I wonder what could be the problem
[19:54:07] eranroz: Perhaps the irc server has throttled you for some reason. Are you responding to pings and/or what IRC library are you using?
[19:55:55] pywikibot.botirc and I see the account is still connected in irc://irc.wikimedia.org
[19:56:36] and yes it responds to pings
[20:03:01] eranroz: i know this doesn't help with the current problem, but i wanted to say it anyways. do you know about https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org already? eventually it is supposed to replace irc.wikimedia.org. "Consuming RCStream is a cleaner approach than parsing the change messages from irc.wikimedia.org. " etc
[20:04:32] mutante: yes, I actually implemented it with stream, and failed to get updates after few hours. I implemented a different source class that use IRC with same API so I can easily switch between them
[20:04:58] eranroz: so it fails after a few hours with both methods? hmm
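On the firewall root cause mutante walks through above (19:47–19:53): the port-3306 hole was defined in role::mariadb::labs, while labsdb1005 only carried role labs::db::master, so ferm never rendered the rule. A hypothetical sketch of where the rule would have to live for it to take effect; the class layout here is illustrative, not the actual patch that was deployed:

    # Sketch only: the ferm::service rule must sit in (or be included from)
    # a role class that is actually applied to the database host, otherwise
    # iptables keeps rejecting port 3306 from Labs instances.
    class role::labs::db::master {
        # ... existing master configuration ...

        ferm::service { 'mariadb-labsdb':
            proto  => 'tcp',
            port   => 3306,
            srange => '$INTERNAL',   # internal networks, including Labs instances
        }
    }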
[20:05:06] (the reason I moved back to IRC is due to this problem - whcih unfourntaly wasn't solved with IRC client)
[20:05:35] that's good to know, we recently were wondering if/when irc.wikimedia.org could be deprecated and if stream is able to replace it yet or not
[20:05:42] that sounds like not :/
[20:07:37] mutante: I thought I stop getting updates due to https://phabricator.wikimedia.org/T86771
[20:08:02] (in the original implementation)
[20:09:50] eranroz: thanks, noting that we need to wait for a fix here before even considering to deprecate irc.wm
[20:18:29] andrewbogott: I didn't get your original 'rescheduled tools-login reboot' email. Is there something broken with -announce vs -l again?
[20:18:33] I'm only subscribed to -l.
[20:19:04] I don’t know. I think I sent the first one to -announce and the second to just labs-l
[20:19:31] yeah. I only got the 15 minute warning and follow-up emails
[20:19:44] so I can imagine there also were others who didn't get the original...
[20:19:57] me as well
[20:23:00] I’m sure I approved the message for forwarding to labs-l
[20:23:14] But (weekly refrain): we should only have one list
[20:25:01] and none of them should end in -l :)
[20:27:25] YuviPanda, Coren, what is this diagram missing? https://wikitech.wikimedia.org/wiki/File:Labs_cluster.png
[20:27:34] other than ‘clarity'
[20:37:25] * Coren looks
[20:38:27] andrewbogott: I'm a little confused about the "public ip"<->"private ip" delineation?
[20:43:24] Coren: how so?
[20:43:31] things in the public ip box have public ips
[20:43:35] things in the private ip box...
[20:43:55] maybe it’s needless detail though
[20:44:51] Hm. Well, it looked like it was meant to be highly significant given that it was the "outermost" boxes. Perhaps putting them on top, hollow with a dotted outline would still convey the data without it being confusing?
[20:44:59] (I suppose that fits in "clarity")
[20:45:20] ok, lemme try
[20:45:37] Missing labstore2001 in codfw though.
[20:46:36] I could just remove public/private entirely. It’s just if you’re trying to ssh to one, it’s nice to know if it’s .wikimedia.org or .eqiad.wmnet
[21:00:55] Coren: do instances ever talk to labstore2001, or does it only talk to the other labstores for backups?
[21:01:25] andrewbogott: Only the other labstores, at least for the forseeable future.
[21:09:16] YuviPanda: Puppet doesn't seem to like your "ensure => $is_active"
[21:09:37] Hm. Which makes sense since that is a boolean value.
[21:09:52] (CR) Legoktm: [C: 2] Filter WMDE- out of #wikimedia-fundraising [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/227746 (owner: Ejegg)
[21:10:08] (Merged) jenkins-bot: Filter WMDE- out of #wikimedia-fundraising [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/227746 (owner: Ejegg)
[21:10:25] Coren: puppet itself should be OK with it but ensure validate isn't
[21:10:32] I'm working on a patch for it
[21:10:35] Coren: better? https://wikitech.wikimedia.org/wiki/File:Labs_cluster.png
[21:11:43] andrewbogott: That looks like its got everything.
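The "ensure => $is_active" exchange above (21:09–21:10) is the usual boolean-versus-ensure mismatch: Puppet itself will often accept a bare boolean, but validation code expects the conventional string values. A minimal sketch of one way around it, assuming a hypothetical service resource and variable rather than YuviPanda's actual manifest:

    # Sketch only: map the boolean flag onto the ensure values that the
    # validation expects, instead of passing the boolean straight through.
    $service_ensure = $is_active ? {
        true    => 'running',
        false   => 'stopped',
    }

    service { 'example-daemon':
        ensure => $service_ensure,
        enable => $is_active,
    }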
[21:11:44] !log tools.wikibugs Updated channels.yaml to: f638b92139d824b52c8e37284b4e5bedf07cf52c Filter WMDE- out of #wikimedia-fundraising
[21:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master
[21:21:19] oops, except for keystone
[21:25:08] andrewbogott: and labmon
[21:25:15] dammit
[21:25:44] andrewbogott: and labstore box says mariadb
[21:25:56] ok, thanks
[21:27:40] YuviPanda: labmon only talks to instances and not to other labs services, right?
[21:27:49] And, what services does it run? Just shinken?
[21:27:58] andrewbogott: no, just graphite and statsd
[21:28:06] shinken itself runs on a labs instnace (shinken-01)
[21:28:09] and /not/ shinbone?
[21:28:14] * andrewbogott curses spellcheck
[21:28:15] no
[21:28:18] ok
[21:36:28] YuviPanda: better?
[21:37:20] andrewbogott: yup!
[21:41:52] YuviPanda, Coren, I’m updating https://wikitech.wikimedia.org/wiki/Labs_infrastructure — please add a few sentences about NFS and/or monitoring if you have a few minutes.
[21:43:37] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496411 (yuvipanda) NEW
[21:44:59] meanwhile, I’ve had enough of this lousy wifi. Out for now.
[21:48:17] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496436 (coren) I'd rather enable jessie-backports, I think, since the packages there will be maintained actively by the community; and they are deactivated by default.
[21:59:53] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496478 (scfc) Option c: Enable `jessie-backports` only for the hosts that need it and only for the package that is needed.
[22:07:01] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496511 (yuvipanda) I've just synced the sources for both of them, and puppet succeeds. Leaving this open to determine the right thing to do...
[22:07:53] Labs: Support bare-metal server allocation in labs - https://phabricator.wikimedia.org/T95185#1496515 (yuvipanda)
[22:09:58] Labs, Tool-Labs: shinken is too "volatile" and imprecise to be of use - https://phabricator.wikimedia.org/T107297#1496525 (Legoktm) I have the same problem with shinken alerts in the integration project.
[22:45:32] Labs, Tool-Labs, Database: Tool Labs enwiki_p replicated database missing rows - https://phabricator.wikimedia.org/T106470#1497039 (MZMcBride) Sweet, thanks!