[01:24:30] bd808: how do you set a puppet class parameter via nova puppet groups?
[01:24:46] adding a parameter with :: in the name does not seem to work
[01:40:58] (PS1) Gergő Tisza: Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956)
[01:41:55] (PS2) Gergő Tisza: Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956)
[02:08:57] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1493979 (scfc) The disk space on `tools-webgrid-lighttpd-1406` was running low as well, so I rebooted that instance, too. Searching for instances that have a `pacct.0`...
[02:44:33] PROBLEM - SSH on tools-webgrid-lighttpd-1409 is CRITICAL: Server answer
[02:49:35] RECOVERY - SSH on tools-webgrid-lighttpd-1409 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[02:55:24] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1493990 (scfc) Rebooted: - `tools-webgrid-generic-1401`, - `tools-webgrid-generic-1402`, - `tools-webgrid-generic-1403`, and - `tools-webgrid-lighttpd-1401`.
[02:57:23] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1493993 (scfc) NEW
[03:40:34] PROBLEM - SSH on tools-webgrid-lighttpd-1409 is CRITICAL: Server answer
[04:20:34] RECOVERY - SSH on tools-webgrid-lighttpd-1409 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[04:56:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 30.00% of data above the critical threshold [0.0]
[05:36:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[06:01:33] Tool-Labs-tools-Erwin's-tools: "Fatal Error: Database query failed" for Related Changes tool - MySQL down? - https://phabricator.wikimedia.org/T104688#1494098 (Supernino) This tool was extremely important to patrollers, maybe developing a new one as an alternative would be a good idea.
[06:47:47] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 40.00% of data above the critical threshold [0.0]
[07:27:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[07:51:14] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494177 (valhallasw) Resolved>Open I see four: * tools-shadow-01 * Puppet failure UNKNOWN1M 5d UNKNOWN: execution of the check script exited with exception list index out of range * Puppet staleness UNKNOW...
[07:57:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 44.44% of data above the critical threshold [0.0]
[08:03:36] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494203 (yuvipanda) @Coren hasn't really responded on T104781 so I'm just going to delete tools-shadow-01 now. Not sure about the other host
[08:04:29] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1494208 (yuvipanda) Bah, can't do that (unable to login to wikitech, 2fa phone dead atm). I'll do it tomorrow if nobody else gets to it in the meantime.
[08:06:12] Labs, Labs-Infrastructure, Labs-Sprint-105, Labs-Sprint-106, and 2 others: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1494210 (yuvipanda) Open>Resolved It's also fixed for user accounts now, and monitoring has been set up as well. Let me know if it doesn't work...
[08:20:33] Labs, Labs-Sprint-107: Identify user facing services labs provides - https://phabricator.wikimedia.org/T105721#1494217 (yuvipanda) I call this 'done'. See T107058 and children for implementation of checks.
[08:25:03] Labs, Incident-20150617-LabsNFSOutage, Labs-Sprint-102, Labs-Sprint-103: Labs: rewrite remaining labstore* scripts - https://phabricator.wikimedia.org/T102520#1494219 (yuvipanda) Only maintain-replicas is left.
[08:32:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[08:33:18] (PS1) Yuvipanda: Remove jgage's root key from labs instances [labs/private] - https://gerrit.wikimedia.org/r/227951
[08:33:20] (PS1) Yuvipanda: Replace Yuvi's key with new Key from new laptop [labs/private] - https://gerrit.wikimedia.org/r/227952
[08:33:44] (CR) Yuvipanda: [C: 2 V: 2] Remove jgage's root key from labs instances [labs/private] - https://gerrit.wikimedia.org/r/227951 (owner: Yuvipanda)
[08:33:56] (CR) Yuvipanda: [C: 2 V: 2] Replace Yuvi's key with new Key from new laptop [labs/private] - https://gerrit.wikimedia.org/r/227952 (owner: Yuvipanda)
[08:38:16] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1494226 (yuvipanda) So... @matanya wanted this project to be not-killed, so it hasn't been yet. However, the instance could just as well be dead - I"m unable to ssh into it. @matanya - can you?
[08:50:03] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494233 (yuvipanda) Anything else? /root still has test scripts, but that's presumably ok?
[10:07:33] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[11:15:44] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 60.00% of data above the critical threshold [0.0]
[11:31:35] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1494429 (mark) I'd like to see a migration plan be developped on this ticket.
[11:50:42] RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0]
[12:14:13] Wikibugs: Add wikibugs to #wikimedia-rename - https://phabricator.wikimedia.org/T107418#1494512 (Steinsplitter) NEW
[12:47:25] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494562 (faidon) Sounds good to me, although I'd really prefer it if we reinstalled labstore1001 at this point before we switch over t...
[13:36:29] YuviPanda: do you know what’s up with all the tools puppet warnings over night?
[14:03:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[14:05:17] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 33.33% of data above the critical threshold [0.0]
[14:05:55] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 30.00% of data above the critical threshold [0.0]
[14:05:57] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 40.00% of data above the critical threshold [0.0]
[14:08:41] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838311 bytes in 2.203 second response time
[14:25:22] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0]
[14:45:58] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0]
[14:45:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0]
[14:50:48] 10-minute warning! I’ll be rebooting tools-login shortly.
[14:50:49] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[14:54:44] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838932 bytes in 3.917 second response time
[15:00:26] !log tools rebooting tools-bastion-01 aka tools-login
[15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[15:02:07] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1494786 (Matanya) No, i don't have time to work on this now, although it is a very nice project. You can kill it from my POV :(
[15:07:53] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1494799 (Andrew) tools-bastion-01 is now rebooted as well.
[15:11:28] thanx Coren for the fast reboot and the countdown before
[15:11:43] doctaxon: That's all Andrew this time. :-)
[15:12:15] thanx andrewbogott for the fast reboot and the countdown before
[15:12:44] * andrewbogott hopes it actually solved the problem :)
[15:15:18] andrewbogott - but such jobs are great jobs - with countdown you can prepare up to the point very effectivel
[15:15:32] y
[16:06:14] tgr|away: I don't think you can directly. You either need to use hiera or resort to the pre-hiera trick of global variables -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/kibana.pp#L23-L26
[16:06:41] You can use your project
[16:06:54] You can use your project's page in the Hiera namespace -- https://wikitech.wikimedia.org/wiki/Hiera:Logstash
[16:08:27] yeah, I ended up using globals
[16:08:40] gross :)
[16:09:43] If you need per-host hiera settings you can do that with patches to ops/puppet -- https://github.com/wikimedia/operations-puppet/tree/production/hieradata/labs/deployment-prep/host
[16:10:07] but without a project puppetmaster that is pretty annoying
[16:14:18] I want to make using the class easy, and there are some parameters you need to set every time
[16:14:55] turning them into a variable seemed the nicest way to go
[16:15:24] although you still need to create a hiera file just to pass the variable so maybe not so much
[16:16:04] I'm hoping labs gets hiera role support at some point which would fix that
[16:26:09] tgr: Ori is actually trying to kill hiera roles in prod so it's not highly likely to show up in labs. https://phabricator.wikimedia.org/T106404
[16:37:37] does @reboot not actually work in crontab?
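The global-variable fallback bd808 points to in the 16:06–16:26 exchange above (modelled on manifests/role/kibana.pp) is easier to see with a sketch. The class and variable names below are hypothetical, not the actual kibana or Sentry ones, and this is only a sketch of the pattern, not the patch tgr ended up writing:

    # Sketch only: $::example_server is a hypothetical global that a Labs user
    # sets through the wikitech puppet configuration UI, since class parameters
    # with '::' in the name can't be set there directly.
    class role::example {
        $server = $::example_server ? {
            undef   => 'example.invalid',   # fall back to a default when unset
            default => $::example_server,
        }

        class { '::example::client':
            server => $server,
        }
    }

The Hiera alternative bd808 mentions is to put the fully qualified key (e.g. example::client::server) on the project's Hiera: page on wikitech instead of introducing a global.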
[17:20:15] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1495142 (scfc) Open>Resolved a:scfc Login works again.
[17:20:31] Labs, Tool-Labs: Cannot log into tools-webgrid-lighttpd-1409: "Connection closed by UNKNOWN" - https://phabricator.wikimedia.org/T107403#1495150 (scfc) a:scfc>None
[17:22:42] Labs, Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1495162 (scfc)
[17:23:26] Labs, Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1490787 (scfc) (Geohack is indeed currently running on Precise.)
[17:30:34] Idea, possibly for a Phab task if you don't say it's mad: Shouldn't independent Labs projects have a tools.wmflabs.org/name redirection to name.wmflabs.org, and of course, be reserved as simple tool names? Or maybe this is taken into account somehow?
[17:35:02] jem: mm, maybe, but nontrivial to implement
[17:35:57] jem: I'm not sure if it's worth the effort. It's probably easier to act manually once problems arise
[17:36:19] doesn't that already happen for projects?
[17:36:41] hmm
[17:36:54] not tools redirects
[17:37:07] no, and service groups can have the same names as projects
[17:37:13] eg. tools.xtools and xtools
[17:39:07] I see
[17:40:20] In fact I wasn't thinking about "problems", just memory failures
[17:41:20] I'm not sure what's up with the memory failures
[17:41:25] we do have enough space on the trustyes
[17:41:45] Ehm
[17:41:57] Maybe I didn't explain well
[17:42:08] I mean human memory failures
[17:42:32] (And probably my English is not deep enough)
[17:43:07] ahaha :)
[17:43:11] :)
[17:43:17] but we have been having some memory issues on the trusties as well
[17:44:13] Well, alternatively, tools.wmflabs.org/whatever could point to an error/somehow helping page
[17:44:38] It seems it justs loads a blank page now
[17:45:00] So one could get a little confused
[17:52:00] Labs, Labs-Sprint-107, Patch-For-Review: Build proper monitoring for making sure that processes that need to run only once on one labstore only are running only once on one labstore only - https://phabricator.wikimedia.org/T106590#1495299 (yuvipanda) Taking stock, I see the following daemons running tha...
[17:55:33] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1495302 (yuvipanda) @andrew can you send out a scheduling email? If labvirt1009 is ok with you I can provide exact toollabs failover mechanisms.
[17:56:11] Labs, Labs-Sprint-107: Identify user facing services labs provides - https://phabricator.wikimedia.org/T105721#1495308 (yuvipanda) Open>Resolved a:yuvipanda
[17:56:11] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495310 (yuvipanda)
[17:56:56] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1485532 (yuvipanda)
[17:58:15] Labs: Setup an availability checker for all labsdb hosts - https://phabricator.wikimedia.org/T107449#1495326 (yuvipanda) NEW
[17:58:59] Labs: Have a checkpoint check for labs proxies - https://phabricator.wikimedia.org/T107450#1495332 (yuvipanda) NEW
[18:00:00] Labs: Have checkpoint check for public labs DNS - https://phabricator.wikimedia.org/T107451#1495338 (yuvipanda) NEW
[18:01:00] jem: yeah, there's an open bug for the empty response. It should just say 'we don't know this tool'
[18:01:12] and we could add a check 'maybe the project exists' there, I suppose
[18:01:29] Good
[18:02:03] Is the bug number easy to find?
[18:02:13] Labs: Setup checkpoint check for private DNS - https://phabricator.wikimedia.org/T107453#1495357 (yuvipanda) NEW
[18:03:18] Labs: Create a checkpoint check for labs LDAP - https://phabricator.wikimedia.org/T107454#1495364 (yuvipanda) NEW
[18:05:16] https://phabricator.wikimedia.org/T104870
[18:06:15] Labs: Have checkpoint check for external network access from labs - https://phabricator.wikimedia.org/T107455#1495371 (yuvipanda) NEW
[18:06:16] Suscribed, thanks
[18:07:07] Labs: Create a checkpoint check for labs puppetmaster - https://phabricator.wikimedia.org/T107456#1495380 (yuvipanda) NEW
[18:07:10] I'll add a comment about "project exists"
[18:08:05] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495386 (yuvipanda) NEW
[18:09:26] Labs, Tool-Labs: 'webservice not available' message no longer shown - https://phabricator.wikimedia.org/T104870#1495399 (-jem-) It could be interesting that the landing page showed more information about tools with a similar name or Labs projects with the same name.
[18:10:46] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495405 (yuvipanda) The enwiki login and edit check is 30k points a month and the read check is 21k points a month.
[18:12:46] YuviPanda, shouldn't these all have the monitoring project?
[18:12:53] Krenair: possibly
[18:13:06] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1495416 (Krenair) and -static?
[18:13:27] also Gerrit just very helpfully decided to email me about Krinkle_'s CR of my maintain-meta_p commit
[18:13:34] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495418 (yuvipanda) Created subtasks for everything except instance availability.
[18:13:51] which he did two days ago
[18:14:40] ouch
[18:15:29] YuviPanda, see PS4 though!
[18:20:38] YuviPanda: https://github.com/dbcli/mycli *drool*
[18:23:33] valhallasw`cloud: nice!
[18:23:37] we can import it if we want
[18:24:45] I thought we weren't doing python packages anymore =p
[18:24:59] :P
[18:25:00] oh, there's a package on packagecloud, whatever that is
[18:28:11] seems nice, I tried to write something like that once adn gave up as it was a real pita
[18:31:13] Labs: Home directory does not contain replica.my.cnf - https://phabricator.wikimedia.org/T107034#1495512 (yuvipanda) Open>Resolved a:yuvipanda It should now! Fixed in T104453
[19:14:07] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1495734 (coren) a:coren
[19:18:11] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1495747 (coren) a:coren
[19:21:50] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[19:24:32] Hm.
[19:25:55] Caused by toolsdb timing out.
[19:26:59] Whats wrong with https://tools.wmflabs.org/geohack/geohack.php P
[19:27:44] Geisotr: Don't know yet, but I expect it's the same root cause. One of the databases appears ill.
[19:28:49] emmm... https://tools.wmflabs.org/kmlexport/ seems to have the same problem
[19:29:08] for almost 20 hours: http://tools.freeside.sk/monitor/http-kmlexport.html
[19:29:28] Geisotr: Unlikely to be related then.
[19:30:24] is it possible to restart kmlexport then?
[19:30:45] I'll look into it once I'm done with the immediate db issue.
[19:31:25] hi
[19:31:31] I restarted kmlexport yesterday
[19:31:35] is there any indication that labsdb1005 is involved?
[19:31:44] i got a report that geohack is down
[19:31:58] but i also see this:
[19:32:00] 12:27 < icinga-wm> PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[19:32:01] mutante: Yes, that's the immediate issue - but there doesn't seem to be anything wrong with the DB itself. I'm looking at networking now.
[19:32:19] Coren: it's firewalling then :/
[19:32:24] but apparently it's completely broken again? the developers should put time in it to fix it...
[19:32:34] Coren: we limited the connections to 10.0.0.0/8
[19:32:39] mutante: That (labstore2001) warning is unrelaed - three backups are in progress and that's the destination.
[19:33:12] mutante: yeah, need to revert I guess
[19:33:20] mutante: Labs instances should be in there for sure; I'm still trying to figure out what gives.
[19:33:28] i am reverting it.. damn it
[19:33:42] Maybe there is natting somewhere along the line. :-(
[19:33:42] ok
[19:33:44] we _did_ check the connections before
[19:33:51] and it was only 10.x.x.x
[19:33:54] Hm.
[19:34:08] Then maybe the issue isn't what you tried to filter, but what the filter effectively did.
[19:34:09] discussed it quite a bit with jynus and all that ..
[19:34:10] hold on
[19:34:28] 0 0 ACCEPT tcp -- * * 10.0.0.0/8 0.0.0.0/0 tcp dpt:5666
[19:34:40] is that the right port?
[19:34:51] wow
[19:34:58] it is the right port
[19:35:05] no 3306?
[19:35:09] per netstat
[19:35:22] no, it's not
[19:35:27] it's nrpe.. crap
[19:35:31] ywa
[19:35:40] That explains this.
[19:36:24] mutante: So 3306 should work then. :-)
[19:36:34] unbelievable after all the talk about it.. fixing it
[19:36:45] both should be open, nrpe should be open by default
[19:36:55] as in, the base class
[19:37:18] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495779 (Magnus) NEW
[19:38:05] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495788 (coren) Issue is known, there is currently a bug in the firewall that is in the process of being fixed.
[19:38:34] the base class got applied but not the rule from the role class
[19:38:42] NRPE and SSH etc were just fine
[19:40:54] had to manually flush it all
[19:40:56] but done
[19:41:59] but geohack not back yet ?
[19:42:17] sorry users, really tried to double check all this before hand
[19:42:45] https://tools.wmflabs.org/geohack/geohack.php
[19:42:45] mutante: It's likely most of the geohack connections will need to timeout before it's fixed. Lemme kick it.
[19:42:48] there, looks normal?
[19:43:19] https://tools.wmflabs.org/geohack/
[19:43:20] Still lots of connections in exponential backoff - I expect it'll take some time to recover all.
[19:43:32] it looks ok in my browser now
[19:43:51] Yep. Most things are recovering now.
[19:45:41] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838925 bytes in 2.613 second response time
[19:45:52] sorry again
[19:46:08] mutante: Excrement occurs. :-)
[19:46:57] i really expected this to be it:
[19:47:04] 507 ferm::service{ 'mariadb-labsdb':
[19:47:04] 508 proto => 'tcp',
[19:47:04] 509 port => 3306,
[19:47:05] 510 srange => '$INTERNAL',
[19:47:12] gotta figure out why that role wasn't applied here
[19:47:19] that should have been
[19:47:49] at least it was only db1005 and we did not do the others that have enwiki etc on it
[19:51:23] root cause:
[19:51:25] i have a continous job running python script that fetch changes from recent changes IRC and analyze them (for copyright violations). I see that after few hours, the process stops getting updates from IRC - even if I recreate the thread that listens to the IRC
[19:51:28] the firewall hole we wanted is in role::mariadb::labs
[19:51:38] but that server only has role labs::db::master
[19:51:49] Labs, Labs-Infrastructure: tools.labsdb down - https://phabricator.wikimedia.org/T107470#1495830 (coren) Open>Resolved a:coren This was fixed.
[19:51:54] we mixed up the roles
[19:53:17] in the job.err/job.out I see no indication for error, so I wonder what could be the problem
[19:54:07] eranroz: Perhaps the irc server has throttled you for some reason. Are you responding to pings and/or what IRC library are you using?
[19:55:55] pywikibot.botirc and I see the account is still connected in irc://irc.wikimedia.org
[19:56:36] and yes it responds to pings
[20:03:01] eranroz: i know this doesn't help with the current problem, but i wanted to say it anyways. do you know about https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org already? eventually it is supposed to replace irc.wikimedia.org. "Consuming RCStream is a cleaner approach than parsing the change messages from irc.wikimedia.org. " etc
[20:04:32] mutante: yes, I actually implemented it with stream, and failed to get updates after few hours. I implemented a different source class that use IRC with same API so I can easily switch between them
[20:04:58] eranroz: so it fails after a few hours with both methods? hmm
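On the firewall root cause mutante walks through above (19:47–19:53): the port-3306 hole was defined in role::mariadb::labs, while labsdb1005 only carried role labs::db::master, so ferm never rendered the rule. A hypothetical sketch of where the rule would have to live for it to take effect; the class layout here is illustrative, not the actual patch that was deployed:

    # Sketch only: the ferm::service rule must sit in (or be included from)
    # a role class that is actually applied to the database host, otherwise
    # iptables keeps rejecting port 3306 from Labs instances.
    class role::labs::db::master {
        # ... existing master configuration ...

        ferm::service { 'mariadb-labsdb':
            proto  => 'tcp',
            port   => 3306,
            srange => '$INTERNAL',   # internal networks, including Labs instances
        }
    }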
[20:05:06] (the reason I moved back to IRC is due to this problem - whcih unfourntaly wasn't solved with IRC client)
[20:05:35] that's good to know, we recently were wondering if/when irc.wikimedia.org could be deprecated and if stream is able to replace it yet or not
[20:05:42] that sounds like not :/
[20:07:37] mutante: I thought I stop getting updates due to https://phabricator.wikimedia.org/T86771
[20:08:02] (in the original implementation)
[20:09:50] eranroz: thanks, noting that we need to wait for a fix here before even considering to deprecate irc.wm
[20:18:29] andrewbogott: I didn't get your original 'rescheduled tools-login reboot' email. Is there something broken with -announce vs -l again?
[20:18:33] I'm only subscribed to -l.
[20:19:04] I don’t know. I think I sent the first one to -announce and the second to just labs-l
[20:19:31] yeah. I only got the 15 minute warning and follow-up emails
[20:19:44] so I can imagine there also were others who didn't get the original...
[20:19:57] me as well
[20:23:00] I’m sure I approved the message for forwarding to labs-l
[20:23:14] But (weekly refrain): we should only have one list
[20:25:01] and none of them should end in -l :)
[20:27:25] YuviPanda, Coren, what is this diagram missing? https://wikitech.wikimedia.org/wiki/File:Labs_cluster.png
[20:27:34] other than ‘clarity'
[20:37:25] * Coren looks
[20:38:27] andrewbogott: I'm a little confused about the "public ip"<->"private ip" delineation?
[20:43:24] Coren: how so?
[20:43:31] things in the public ip box have public ips
[20:43:35] things in the private ip box...
[20:43:55] maybe it’s needless detail though
[20:44:51] Hm. Well, it looked like it was meant to be highly significant given that it was the "outermost" boxes. Perhaps putting them on top, hollow with a dotted outline would still convey the data without it being confusing?
[20:44:59] (I suppose that fits in "clarity")
[20:45:20] ok, lemme try
[20:45:37] Missing labstore2001 in codfw though.
[20:46:36] I could just remove public/private entirely. It’s just if you’re trying to ssh to one, it’s nice to know if it’s .wikimedia.org or .eqiad.wmnet
[21:00:55] Coren: do instances ever talk to labstore2001, or does it only talk to the other labstores for backups?
[21:01:25] andrewbogott: Only the other labstores, at least for the forseeable future.
[21:09:16] YuviPanda: Puppet doesn't seem to like your "ensure => $is_active"
[21:09:37] Hm. Which makes sense since that is a boolean value.
[21:09:52] (CR) Legoktm: [C: 2] Filter WMDE- out of #wikimedia-fundraising [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/227746 (owner: Ejegg)
[21:10:08] (Merged) jenkins-bot: Filter WMDE- out of #wikimedia-fundraising [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/227746 (owner: Ejegg)
[21:10:25] Coren: puppet itself should be OK with it but ensure validate isn't
[21:10:32] I'm working on a patch for it
[21:10:35] Coren: better? https://wikitech.wikimedia.org/wiki/File:Labs_cluster.png
[21:11:43] andrewbogott: That looks like its got everything.
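The "ensure => $is_active" exchange above (21:09–21:10) is the usual boolean-versus-ensure mismatch: Puppet itself will often accept a bare boolean, but validation code expects the conventional string values. A minimal sketch of one way around it, assuming a hypothetical service resource and variable rather than YuviPanda's actual manifest:

    # Sketch only: map the boolean flag onto the ensure values that the
    # validation expects, instead of passing the boolean straight through.
    $service_ensure = $is_active ? {
        true    => 'running',
        false   => 'stopped',
    }

    service { 'example-daemon':
        ensure => $service_ensure,
        enable => $is_active,
    }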
[21:11:44] !log tools.wikibugs Updated channels.yaml to: f638b92139d824b52c8e37284b4e5bedf07cf52c Filter WMDE- out of #wikimedia-fundraising
[21:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master
[21:21:19] oops, except for keystone
[21:25:08] andrewbogott: and labmon
[21:25:15] dammit
[21:25:44] andrewbogott: and labstore box says mariadb
[21:25:56] ok, thanks
[21:27:40] YuviPanda: labmon only talks to instances and not to other labs services, right?
[21:27:49] And, what services does it run? Just shinken?
[21:27:58] andrewbogott: no, just graphite and statsd
[21:28:06] shinken itself runs on a labs instnace (shinken-01)
[21:28:09] and /not/ shinbone?
[21:28:14] * andrewbogott curses spellcheck
[21:28:15] no
[21:28:18] ok
[21:36:28] YuviPanda: better?
[21:37:20] andrewbogott: yup!
[21:41:52] YuviPanda, Coren, I’m updating https://wikitech.wikimedia.org/wiki/Labs_infrastructure — please add a few sentences about NFS and/or monitoring if you have a few minutes.
[21:43:37] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496411 (yuvipanda) NEW
[21:44:59] meanwhile, I’ve had enough of this lousy wifi. Out for now.
[21:48:17] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496436 (coren) I'd rather enable jessie-backports, I think, since the packages there will be maintained actively by the community; and they are deactivated by default.
[21:59:53] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496478 (scfc) Option c: Enable `jessie-backports` only for the hosts that need it and only for the package that is needed.
[22:07:01] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496511 (yuvipanda) I've just synced the sources for both of them, and puppet succeeds. Leaving this open to determine the right thing to do...
[22:07:53] Labs: Support bare-metal server allocation in labs - https://phabricator.wikimedia.org/T95185#1496515 (yuvipanda)
[22:09:58] Labs, Tool-Labs: shinken is too "volatile" and imprecise to be of use - https://phabricator.wikimedia.org/T107297#1496525 (Legoktm) I have the same problem with shinken alerts in the integration project.
[22:45:32] Labs, Tool-Labs, Database: Tool Labs enwiki_p replicated database missing rows - https://phabricator.wikimedia.org/T106470#1497039 (MZMcBride) Sweet, thanks!