[00:25:59] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701156 (10Krenair) @Coren found the log entry for me, as predicted it was a "Maximum execution time of 30 seconds exceeded". I...
[01:07:47] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701178 (10Krenair) It's somewhere beneath OpenStackNovaArticle::editArticle - I'm guessing the edit triggers slow+buggy SMW hoo...
[01:14:33] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701187 (10Krenair) a:3Krenair
[01:14:43] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1688874 (10Krenair)
[01:25:00] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701202 (10Krenair) The patch above is live hacked, uncommitted,...
[02:10:16] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701214 (10scfc) Thanks. Just to recap: The slowness comes not...
[08:18:57] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701416 (10Krenair) Yep.
[08:55:29] 6Labs, 10Tool-Labs, 7Database: tools.citationhunt can't access databases - https://phabricator.wikimedia.org/T109972#1701477 (10jcrespo) I cannot say anything about labs user accounts -I do not maintain them. However, the mysql account exists on ours database servers with the right permissions, so either h...
[09:03:52] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1701493 (10jcrespo)
[09:03:54] 6Labs, 10Tool-Labs, 7Database, 3Labs-Q4-Sprint-1, and 5 others: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1701490 (10jcrespo) 5Open>3Resolved I've manually imported the above tables. There is still some (minor) issues, but the general scope of the ticket has...
[09:12:04] 6Labs, 10Tool-Labs, 7Database, 3Labs-Q4-Sprint-1, and 5 others: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1701517 (10jcrespo) {F2659835}
[09:37:14] 6Labs, 10Tool-Labs, 7Database: Provide replication lag as a database function - https://phabricator.wikimedia.org/T50628#1701583 (10jcrespo) This will be possible very soon due to: T111266, and that may even be a duplicate of this.
[09:44:17] 6Labs, 10Tool-Labs, 7Database: Provide replication lag as a database function - https://phabricator.wikimedia.org/T50628#1701607 (10jcrespo)
[09:58:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[10:38:14] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:24:14] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:04:16] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:41:39] 6Labs, 10Tool-Labs: Define policy for tools-local packages - https://phabricator.wikimedia.org/T114645#1702017 (10valhallasw) 3NEW
[14:56:36] 6Labs: Eliminate SPOFs in Labs infrastructure (Tracking) - https://phabricator.wikimedia.org/T105723#1702084 (10Andrew)
[14:56:37] 6Labs, 3labs-sprint-116: Make labs domainproxies fully redundant - https://phabricator.wikimedia.org/T98556#1702083 (10Andrew) 5Open>3Resolved
[14:56:49] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1702088 (10Andrew)
[14:56:52] 6Labs: Eliminate SPOFs in Labs infrastructure (Tracking) - https://phabricator.wikimedia.org/T105723#1702086 (10Andrew) 5Open>3Resolved a:3Andrew
[15:39:10] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Labs virt capacity expansion - https://phabricator.wikimedia.org/T107624#1702230 (10RobH)
[17:02:06] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 5Patch-For-Review, and 2 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1702672 (10coren)
[17:02:08] 6Labs, 6Discovery, 7Elasticsearch, 3labs-sprint-116, 3labs-sprint-117: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1702673 (10yuvipanda)
[17:02:18] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 3labs-sprint-116, 3labs-sprint-117: Write admission controller disabling mounting of unauthorized volumes - https://phabricator.wikimedia.org/T112718#1702674 (10yuvipanda)
[17:02:37] 6Labs, 10Tool-Labs, 3labs-sprint-116, 3labs-sprint-117: Tool Labs: Provide anonymized view of the user_properties table - https://phabricator.wikimedia.org/T60196#1702676 (10coren)
[17:02:40] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1702677 (10yuvipanda)
[17:04:08] 6Labs, 10Tool-Labs, 3labs-sprint-117: Setup a way to store secrets and access them from puppet inside the Tool Labs project - https://phabricator.wikimedia.org/T112005#1702679 (10yuvipanda)
[17:04:39] 6Labs, 3Labs-Sprint-109, 7Monitoring, 5Patch-For-Review, and 2 others: Monitor nova services - https://phabricator.wikimedia.org/T90784#1702684 (10Andrew)
[17:09:33] 6Labs, 10Tool-Labs, 6Design Research Backlog, 6Learning-and-Evaluation, and 3 others: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1702697 (10yuvipanda)
[17:18:00] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4, 3labs-sprint-117: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1702723 (10coren)
[17:31:56] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702863 (10Dzahn) i used John's disable_list shell script on this list: ``` @fermium:/usr/local/sbin# ./disable_list toolserver-l toolserver-l disabled.
Archives should be available at current lo...
[17:33:45] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1702868 (10yuvipanda) @faidon instead suggested: # Make all tool home dirctories `g+w` - they already have `s` set so... # Set up the default umask to b...
[17:44:17] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702897 (10Dzahn) 5Open>3Resolved and removed all >500 subscribers with: fermium:/var/lib/mailman/bin# ./remove_members -a toolserver-l
[17:44:43] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702905 (10Dzahn) 5Resolved>3Open
[18:08:50] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1703079 (10Dzahn) 5Open>3Resolved eh, forgot the "last mail to list" to announce it. re-added members, set the variable goodbye_msg to a custom message linking back to this. re-unsubscribed ever...
[18:54:04] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703261 (10scfc) As I wrote in T48468, for normal users `umask` needs to be `022` to avoid all other users tampering with a user's file (because they sha...
[18:56:14] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703264 (10yuvipanda) This would be only for tools and not users - and doesn't every tool have its own group now?
[19:25:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:29:51] * Coren glares at puppet.
[19:32:04] Lies. All lies!
[19:34:47] Coren: probably memory issues --> fork failed
[19:35:20] there's a task about making shinken less volatile somewhere
[19:35:53] Yeah, but I'm looking at logs and stats and it looks like the same issue andrew ran into a few times - the fork() fails in puppet only, and free reports plenty of memory.
[19:36:13] It's not clear what the actual resource that's choked is.
[19:36:31] oh? I noticed it mostly on the host where kmlexport runs
[19:36:35] * valhallasw`cloud opens graphite
[19:38:43] That node has less than 20% memory use as far as I can tell; and graphite shows no deviation in hours.
[19:38:53] yeah, but graphite only samples each five minutes
[19:39:00] so if it's a short-term thing, graphite wouldn't notice
[19:39:28] Nevertheless; if we had run out of memory, you'd see a _huge_ dip in Buffers and Cached which would take several samples to recover.
[19:40:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:40:45] Note how stable (and large) Cached is - that's a system with no memory pressure.
[19:41:41] Huh. There's a *gap* in the data where that alert was though.
[19:42:07] yuvipanda: How does the alert work? Would it trigger if it got /no/ data?
[19:42:26] graphite
[19:42:29] Nah, that makes no sense - there wouldn't be traces in the log if that was the only issue.
[19:42:33] Coren: there is an error in the actual log, though, so the error is correct
[19:42:34] tirght.
[19:42:35] if diamond dies from memory issues
[19:42:36] right.
[19:42:40] and what valhallasw`cloud says
[19:42:46] but if diamond also fails..
[19:43:23] Right, but the graph clearly shows no memory issues. It's not possible to run out of memory and return, just one sample later, with all the metrics at the same levels.
[19:44:02] At the /very/ least, a memory crunch would flush the disk caches.
[19:44:35] And that vm has been stable at 5.3G of cache for at least 24h
[19:45:05] Wait - what if the issue is underneath? It's possible the hypervisor failed a memory allocation request on paging.
[19:45:17] IIRC, we overcommit ram.
[19:47:51] so the error is ENOMEM, which means 'fork() failed to allocate the necessary kernel structures because memory is tight.'
[19:48:03] I'm wondering whether that can also mean that some kernel structure allocation is full
[19:48:09] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703384 (10scfc) If `yuvipanda` creates a file in `~tools.admin/`, `yuvipanda`'s `umask` is applied.
[19:48:51] valhallasw`cloud: I'm guessing yes; though it's not clear which. Not process slots for sure, that box doesn't have that many.
[19:48:58] Well, it doesn't have that many /now/.
[19:49:42] Hm. Not that either, judging by the graph.
[19:49:46] But also that gap there.
[19:49:55] So *something* went amiss.
[19:50:12] Maybe a very short fork bomb?
[19:50:20] diamond is a fork bomb on itself...
[19:50:31] well, not really. but it does create a large number of threads at once, I think
[19:50:52] But then, what are the chances that it always occurs when puppet runs and not at other random times?
[19:51:10] we only notice it when puppet runs because most other stats are unaffected
[19:51:27] and it only happens rarely, maybe once a day for almost a hundred hosts
[19:51:34] True, but I've just looked back and every recent puppet fail matches a gap.
[19:51:49] And I don't see gaps elsewhere - but those are hard to spot.
[19:52:15] maybe puppet and diamond running at the same time? Just brainstorming here.
[19:52:21] There should be an "I don't have a datapoint" metric.
[19:52:40] valhallasw`cloud: That'd explain the very intermittent pattern.
[19:53:36] Or there is an edge condition that makes puppet fork bomb, making diamond bark. Most other things are already running and happily chugging along so you wouldn't notice unless a job *happened* to try to start at exactly the right time?
[19:53:45] s/bark/barf/
[19:54:19] but puppet is single-threaded
[19:54:30] or, at least, does everything in a linear fashion
[19:54:47] hum.
[19:54:52] Right, so not very likely - unless it's something puppet starts I suppose.
[19:55:09] yeah, but then puppet runs apt which fails to start
[19:55:27] I guess the accidental fork bomb could daemonize
[19:55:59] A point to note is that puppet /itself/ got fork()ed and exec()ed from cron, so the resources clearly were there when /it/ started.
[19:56:23] So it wasn't a preexisting issue.
[19:56:46] maybe the accounting log has some clues
[19:57:34] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703406 (10yuvipanda) Oh, duh. So that might not work.
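(A minimal shell sketch of the permissions point settled in T113979 above; the directory and file names are made up for illustration and are not the tools setup itself. A group-writable, setgid directory makes new files inherit its group, but whether they come out group-writable is still decided by the umask of whoever created them - which is why a 022 user umask defeats the `g+w` home-directory idea.)
```
# Sketch only - names and paths are illustrative.
demo=$(mktemp -d)        # stand-in for a tool home directory
chmod g+ws "$demo"       # group-writable + setgid, as proposed for tool homes

( umask 022; touch "$demo/made-with-022" )   # typical user umask -> -rw-r--r--
( umask 002; touch "$demo/made-with-002" )   # relaxed umask      -> -rw-rw-r--

# Both files carry the directory's group (setgid), but only the second one is
# group-writable: the creating user's umask, not the directory, decided that.
ls -l "$demo"
```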
[20:02:39] there are a lot of perl-sh-zip triplets from kmlexport
[20:03:03] nothing from diamond around that puppet run though
[20:04:03] eh, wait, I'm not looking at the right place
[20:04:23] the puppet run started at 7:12pm
[20:04:25] UTC
[20:05:49] and the accounting log is full of kmlexport stuff
[20:06:52] perl X tools.km __ 1.12 secs Mon Oct 5 19:15
[20:06:55] X = sigterm'ed
[20:08:17] Coren: sudo lastcomm | head -n 8000 | grep '19:13'
[20:08:20] or :15, etc.
[20:08:38] so I'm not entirely sure what happens, but most of what's happening at that time is kmlexport
[20:09:40] Yeah. It's not clear what's being consumed, but it seems likely kmlexport is the one consuming it.
[20:10:18] Unless we got this backwards; and kmlexport is flailing because of the same thing.
[20:10:33] With its children dying off.
[20:10:50] well, I'm also surprised by the raw amount of jobs to be honest
[20:11:32] Looks pretty typical for a lighttpd host.
[20:13:05] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703470 (10coren) Mind you, the umask doesn't prevent explicit mode setting - sftp supports it (and, by extension, so does scp -p)
[20:15:19] !log project-proxy delete dynamicproxy-gateway
[20:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL, Master
[20:15:41] http://stackoverflow.com/questions/21055536/puppet-fails-with-cannot-allocate-memory-fork2 seems to be comparable, but the answers aren't satisfying.
[20:38:55] PROBLEM - Puppet failure on tools-puppetmaster-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:39:37] * yuvipanda pats tools-puppetmaster-01
[20:39:39] it's ok
[20:39:51] valhallasw`cloud: Coren I suppose it just is kmlexport spinning and killing memory for everyone else
[20:40:10] ooooh, our own puppetmaster?
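(A hedged sketch of the process-accounting check being run above: filter lastcomm records by their start time - here the 19:12-19:15 UTC window of the failed puppet run - and count which commands dominate. It assumes GNU acct is enabled on the node, as it evidently is on tools-webgrid-lighttpd-1402; the exact time window is just this incident's.)
```
# Sketch, assuming GNU process accounting (acct) is enabled on the node.
# Each lastcomm record ends with the process start time, e.g.
#   perl   X  tools.km  __   1.12 secs Mon Oct  5 19:15
# so filtering on that field and counting command names shows what dominated
# the window of the failed puppet run (19:12-19:15 here).
sudo lastcomm | grep -E ' 19:1[2-5]$' | awk '{print $1}' | sort | uniq -c | sort -rn | head
```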
[20:40:22] valhallasw`cloud: yes and lighttpd hosts have too many processes since our strategy for them is 'hope'
[20:40:26] valhallasw`cloud: yeah but only for the k8s hosts
[20:40:32] ah
[20:40:45] don't think we can put 80 hosts onto one labs instance without it croaking now and then
[20:51:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:52:11] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:52:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:54:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:54:54] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:55:16] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:55:18] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:55:26] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:55:26] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:55:34] PROBLEM - Puppet failure on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:56:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:56:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[20:56:20] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:57:07] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:57:24] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:57:24] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:57:24] yuvipanda: That's a true one; you on it?
[20:57:43] Coren: this bunch? no let me look
[20:57:56] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:57:58] I already found the why, I just didn't want to ninja you.
[20:58:02] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:58:08] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:58:08] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:58:15] Coren: is this the typo in the role file?
[20:58:16] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:58:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:58:29] yup
[20:58:30] yep
[20:58:31] Coren: I merged a fix
[20:59:10] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:59:17] Ah, so I see. I had testes just before you merge kicked in it seems.
[20:59:18] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:59:21] tested*
[20:59:30] PROBLEM - Puppet failure on tools-packages is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:59:58] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:00:53] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:01:51] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:01:54] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:08:23] PROBLEM - Puppet failure on tools-proxy-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:12:12] RECOVERY - Puppet failure on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:14:00] PROBLEM - Puppet failure on tools-worker-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:14:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:15:34] PROBLEM - Puppet failure on tools-k8s-master-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:17:22] PROBLEM - Puppet failure on tools-k8s-bastion-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:19:08] PROBLEM - Puppet failure on tools-worker-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:31:31] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:31:41] 6Labs, 10Labs-Infrastructure, 10hardware-requests, 6operations: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1703697 (10RobH)
[21:32:13] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:32:15] My "bash" and "sal" tools are both failing because they can't connect to https://stashbot.wmflabs.org/ (Hosted in stashbot Labs project with instance proxy frontend). The Labs server works fine for external access.
[21:32:45] Was there a change to instance proxy that would have caused this?
[21:33:19] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:33:57] RECOVERY - Puppet failure on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:34:09] RECOVERY - Puppet failure on tools-redis-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:34:53] RECOVERY - Puppet failure on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:15] RECOVERY - Puppet failure on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:17] RECOVERY - Puppet failure on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:31] RECOVERY - Puppet failure on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:33] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:13] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:23] RECOVERY - Puppet failure on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:57] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:07] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:23] RECOVERY - Puppet failure on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:23] RECOVERY - Puppet failure on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:55] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:03] RECOVERY - Puppet failure on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:07] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:11] RECOVERY - Puppet failure on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:15] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:27] RECOVERY - Puppet failure on tools-packages is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:40:51] RECOVERY - Puppet failure on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:41:53] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:44:02] bd808: how long ago was that?
[21:45:06] yuvipanda: right now
[21:45:15] curl: (7) Failed to connect to stashbot.wmflabs.org port 443: No route to host
[21:45:36] bd808: yeah fixing
[21:46:33] PROBLEM - Puppet failure on tools-proxy-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:48:23] RECOVERY - Puppet failure on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:48:59] RECOVERY - Puppet failure on tools-worker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:50:25] bd808: should be fixed as dns caches expire...
[21:57:17] RECOVERY - Puppet failure on tools-k8s-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:01:40] yuvipanda: so it looks like it is going to take another hour based on the TTLs I'm seeing from dig :/
[22:02:07] 3383 from one master and 2763 from the other
[22:03:51] ugh I thought we had got it down to 5mins...
[22:04:11] of course
[22:04:18] we have gotten it down for things of form X.eqiad.wmflabs
[22:04:22] not for X.wmflabs.org
[22:07:39] I can live with them being broken for another hour, but maybe a good learning point for the next time you need to rotate the proxy to a new ip
[22:08:07] bd808: I'm making a patch now that'll drop the TTL to be 1min as well, in line with other stuff
[22:08:17] bd808: and yeah, Krenair has a WIP patch to automate this that I should look at soon
[22:41:29] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-117: Setup a way to store secrets and access them from puppet inside the Tool Labs project - https://phabricator.wikimedia.org/T112005#1703864 (10yuvipanda) Ok, there's now tools-puppetmaster-01 setup to serve as puppetmaster for the k8s master node, two...
[22:49:05] RECOVERY - Puppet failure on tools-worker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:50:33] RECOVERY - Puppet failure on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:06:32] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
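(For reference on the TTL discussion above, a sketch of the kind of dig check bd808 describes: the second column of the answer section is the TTL, which counts down on a caching resolver and so shows how much longer the old proxy address can keep being served from cache. The resolver address below is a placeholder, not the actual labs recursor.)
```
# Sketch: the second field of each answer line is the remaining TTL in seconds.
# Replace resolver.example.org with the recursor actually being checked.
dig +noall +answer stashbot.wmflabs.org A @resolver.example.org
dig +noall +answer stashbot.wmflabs.org A    # same query via the local resolver
```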