[00:25:59] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701156 (10Krenair) @Coren found the log entry for me, as predicted it was a "Maximum execution time of 30 seconds exceeded". I...
[01:07:47] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701178 (10Krenair) It's somewhere beneath OpenStackNovaArticle::editArticle - I'm guessing the edit triggers slow+buggy SMW hoo...
[01:14:33] 6Labs, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701187 (10Krenair) a:3Krenair
[01:14:43] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1688874 (10Krenair)
[01:25:00] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701202 (10Krenair) The patch above is live hacked, uncommitted,...
[02:10:16] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701214 (10scfc) Thanks. Just to recap: The slowness comes not...
[08:18:57] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 5Patch-For-Review: Adding a user to a project results in a blank page with the user added to the project but no shell access - https://phabricator.wikimedia.org/T114229#1701416 (10Krenair) Yep.
[08:55:29] 6Labs, 10Tool-Labs, 7Database: tools.citationhunt can't access databases - https://phabricator.wikimedia.org/T109972#1701477 (10jcrespo) I cannot say anything about labs user accounts -I do not maintain them. However, the mysql account exists on ours database servers with the right permissions, so either h...
[09:03:52] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1701493 (10jcrespo)
[09:03:54] 6Labs, 10Tool-Labs, 7Database, 3Labs-Q4-Sprint-1, and 5 others: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1701490 (10jcrespo) 5Open>3Resolved I've manually imported the above tables. There is still some (minor) issues, but the general scope of the ticket has...
[09:12:04] 6Labs, 10Tool-Labs, 7Database, 3Labs-Q4-Sprint-1, and 5 others: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1701517 (10jcrespo) {F2659835}
[09:37:14] 6Labs, 10Tool-Labs, 7Database: Provide replication lag as a database function - https://phabricator.wikimedia.org/T50628#1701583 (10jcrespo) This will be possible very soon due to: T111266, and that may even be a duplicate of this.
[09:44:17] 6Labs, 10Tool-Labs, 7Database: Provide replication lag as a database function - https://phabricator.wikimedia.org/T50628#1701607 (10jcrespo)
[09:58:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[10:38:14] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:24:14] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:04:16] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:41:39] 6Labs, 10Tool-Labs: Define policy for tools-local packages - https://phabricator.wikimedia.org/T114645#1702017 (10valhallasw) 3NEW
[14:56:36] 6Labs: Eliminate SPOFs in Labs infrastructure (Tracking) - https://phabricator.wikimedia.org/T105723#1702084 (10Andrew)
[14:56:37] 6Labs, 3labs-sprint-116: Make labs domainproxies fully redundant - https://phabricator.wikimedia.org/T98556#1702083 (10Andrew) 5Open>3Resolved
[14:56:49] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1702088 (10Andrew)
[14:56:52] 6Labs: Eliminate SPOFs in Labs infrastructure (Tracking) - https://phabricator.wikimedia.org/T105723#1702086 (10Andrew) 5Open>3Resolved a:3Andrew
[15:39:10] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Labs virt capacity expansion - https://phabricator.wikimedia.org/T107624#1702230 (10RobH)
[17:02:06] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 5Patch-For-Review, and 2 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1702672 (10coren)
[17:02:08] 6Labs, 6Discovery, 7Elasticsearch, 3labs-sprint-116, 3labs-sprint-117: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1702673 (10yuvipanda)
[17:02:18] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 3labs-sprint-116, 3labs-sprint-117: Write admission controller disabling mounting of unauthorized volumes - https://phabricator.wikimedia.org/T112718#1702674 (10yuvipanda)
[17:02:37] 6Labs, 10Tool-Labs, 3labs-sprint-116, 3labs-sprint-117: Tool Labs: Provide anonymized view of the user_properties table - https://phabricator.wikimedia.org/T60196#1702676 (10coren)
[17:02:40] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1702677 (10yuvipanda)
[17:04:08] 6Labs, 10Tool-Labs, 3labs-sprint-117: Setup a way to store secrets and access them from puppet inside the Tool Labs project - https://phabricator.wikimedia.org/T112005#1702679 (10yuvipanda)
[17:04:39] 6Labs, 3Labs-Sprint-109, 7Monitoring, 5Patch-For-Review, and 2 others: Monitor nova services - https://phabricator.wikimedia.org/T90784#1702684 (10Andrew)
[17:09:33] 6Labs, 10Tool-Labs, 6Design Research Backlog, 6Learning-and-Evaluation, and 3 others: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1702697 (10yuvipanda)
[17:18:00] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4, 3labs-sprint-117: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1702723 (10coren)
[17:31:56] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702863 (10Dzahn) i used John's disable_list shell script on this list: ``` @fermium:/usr/local/sbin# ./disable_list toolserver-l toolserver-l disabled.
Archives should be available at current lo...
[17:33:45] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1702868 (10yuvipanda) @faidon instead suggested: # Make all tool home dirctories `g+w` - they already have `s` set so... # Set up the default umask to b...
[17:44:17] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702897 (10Dzahn) 5Open>3Resolved and removed all >500 subscribers with: fermium:/var/lib/mailman/bin# ./remove_members -a toolserver-l
[17:44:43] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1702905 (10Dzahn) 5Resolved>3Open
[18:08:50] 6Labs, 10Wikimedia-Mailing-lists: Shutdown toolserver-l mailman list - https://phabricator.wikimedia.org/T113845#1703079 (10Dzahn) 5Open>3Resolved eh, forgot the "last mail to list" to announce it. re-added members, set the variable goodbye_msg to a custom message linking back to this. re-unsubscribed ever...
[18:54:04] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703261 (10scfc) As I wrote in T48468, for normal users `umask` needs to be `022` to avoid all other users tampering with a user's file (because they sha...
[18:56:14] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703264 (10yuvipanda) This would be only for tools and not users - and doesn't every tool have its own group now?
[19:25:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:29:51] * Coren glares at puppet.
[19:32:04] Lies. All lies!
[19:34:47] Coren: probably memory issues --> fork failed
[19:35:20] there's a task about making shinken less volatile somewhere
[19:35:53] Yeah, but I'm looking at logs and stats and it looks like the same issue andrew ran into a few times - the fork() fails in puppet only, and free reports plenty of memory.
[19:36:13] It's not clear what the actual resource that's choked is.
[19:36:31] oh? I noticed it mostly on the host where kmlexport runs
[19:36:35] * valhallasw`cloud opens graphite
[19:38:43] That node has less than 20% memory use as far as I can tell; and graphite shows no deviation in hours.
[19:38:53] yeah, but graphite only samples each five minutes
[19:39:00] so if it's a short-term thing, graphite wouldn't notice
[19:39:28] Nevertheless; if we had run out of memory, you'd see a _huge_ dip in Buffers and Cached which would take several samples to recover.
[19:40:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:40:45] Note how stable (and large) Cached is - that's a system with no memory pressure.
[19:41:41] Huh. There's a *gap* in the data where that alert was though.
[19:42:07] yuvipanda: How does the alert work? Would it trigger if it got /no/ data?
[19:42:26] graphite
[19:42:29] Nah, that makes no sense - there wouldn't be traces in the log if that was the only issue.
[19:42:33] Coren: there is an error in the actual log, though, so the error is correct
[19:42:34] tirght.
[19:42:35] if diamond dies from memory issues
[19:42:36] right.
[19:42:40] and what valhallasw`cloud says
[19:42:46] but if diamond also fails..
[19:43:23] Right, but the graph clearly shows no memory issues. It's not possible to run out of memory and return, just one sample later, with all the metrics at the same levels.
[19:44:02] At the /very/ least, a memory crunch would flush the disk caches.
[19:44:35] And that vm has been stable at 5.3G of cache for at least 24h
[19:45:05] Wait - what if the issue is underneath? It's possible the hypervisor failed a memory allocation request on paging.
[19:45:17] IIRC, we overcommit ram.
[19:47:51] so the error is ENOMEM, which means 'fork() failed to allocate the necessary kernel structures because memory is tight.'
[19:48:03] I'm wondering whether that can also mean that some kernel structure allocation is full
[19:48:09] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703384 (10scfc) If `yuvipanda` creates a file in `~tools.admin/`, `yuvipanda`'s `umask` is applied.
[19:48:51] valhallasw`cloud: I'm guessing yes; though it's not clear which. Not process slots for sure, that box doesn't have that many.
[19:48:58] Well, it doesn't have that many /now/.
[19:49:42] Hm. Not that either, judging by the graph.
[19:49:46] But also that gap there.
[19:49:55] So *something* went amiss.
[19:50:12] Maybe a very short fork bomb?
[19:50:20] diamond is a fork bomb on itself...
[19:50:31] well, not really. but it does create a large number of threads at once, I think
[19:50:52] But then, what are the chances that it always occurs when puppet runs and not at other random times?
[19:51:10] we only notice it when puppet runs because most other stats are unaffected
[19:51:27] and it only happens rarely, maybe once a day for almost a hundred hosts
[19:51:34] True, but I've just looked back and every recent puppet fail matches a gap.
[19:51:49] And I don't see gaps elsewhere - but those are hard to spot.
[19:52:15] maybe puppet and diamond running at the same time? Just brainstorming here.
[19:52:21] There should be an "I don't have a datapoint" metric.
[19:52:40] valhallasw`cloud: That'd explain the very intermittent pattern.
[19:53:36] Or there is an edge condition that makes puppet fork bomb, making diamond bark. Most other things are already running and happily chugging along so you wouldn't notice unless a job *happened* to try to start at exactly the right time?
[19:53:45] s/bark/barf/
[19:54:19] but puppet is single-threaded
[19:54:30] or, at least, does everything in a linear fashion
[19:54:47] hum.
[19:54:52] Right, so not very likely - unless it's something puppet starts I suppose.
[19:55:09] yeah, but then puppet runs apt which fails to start
[19:55:27] I guess the accidental fork bomb could daemonize
[19:55:59] A point to note is that puppet /itself/ got fork()ed and exec()ed from cron, so the resources clearly were there when /it/ started.
[19:56:23] So it wasn't a preexisting issue.
[19:56:46] maybe the accounting log has some clues
[19:57:34] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703406 (10yuvipanda) Oh, duh. So that might not work.
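(A minimal shell sketch of the permissions point settled in T113979 above; the directory and file names are made up for illustration and are not the tools setup itself. A group-writable, setgid directory makes new files inherit its group, but whether they come out group-writable is still decided by the umask of whoever created them - which is why a 022 user umask defeats the `g+w` home-directory idea.)
```
# Sketch only - names and paths are illustrative.
demo=$(mktemp -d)        # stand-in for a tool home directory
chmod g+ws "$demo"       # group-writable + setgid, as proposed for tool homes

( umask 022; touch "$demo/made-with-022" )   # typical user umask -> -rw-r--r--
( umask 002; touch "$demo/made-with-002" )   # relaxed umask      -> -rw-rw-r--

# Both files carry the directory's group (setgid), but only the second one is
# group-writable: the creating user's umask, not the directory, decided that.
ls -l "$demo"
```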
[20:02:39] there are a lot of perl-sh-zip triplets from kmlexport
[20:03:03] nothing from diamond around that puppet run though
[20:04:03] eh, wait, I'm not looking at the right place
[20:04:23] the puppet run started at 7:12pm
[20:04:25] UTC
[20:05:49] and the accounting log is full of kmlexport stuff
[20:06:52] perl X tools.km __ 1.12 secs Mon Oct 5 19:15
[20:06:55] X = sigterm'ed
[20:08:17] Coren: sudo lastcomm | head -n 8000 | grep '19:13'
[20:08:20] or :15, etc.
[20:08:38] so I'm not entirely sure what happens, but most of what's happening at that time is kmlexport
[20:09:40] Yeah. It's not clear what's being consumed, but it seems likely kmlexport is the one consuming it.
[20:10:18] Unless we got this backwards; and kmlexport is flailing because of the same thing.
[20:10:33] With its children dying off.
[20:10:50] well, I'm also surprised by the raw amount of jobs to be honest
[20:11:32] Looks pretty typical for a lighttpd host.
[20:13:05] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, 3labs-sprint-117: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1703470 (10coren) Mind you, the umask doesn't prevent explicit mode setting - sftp supports it (and, by extension, so does scp -p)
[20:15:19] !log project-proxy delete dynamicproxy-gateway
[20:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL, Master
[20:15:41] http://stackoverflow.com/questions/21055536/puppet-fails-with-cannot-allocate-memory-fork2 seems to be comparable, but the answers aren't satisfying.
[20:38:55] PROBLEM - Puppet failure on tools-puppetmaster-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:39:37] * yuvipanda pats tools-puppetmaster-01
[20:39:39] it's ok
[20:39:51] valhallasw`cloud: Coren I suppose it just is kmlexport spinning and killing memory for everyone else
[20:40:10] ooooh, our own puppetmaster?
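(A hedged sketch of the process-accounting check being run above: filter lastcomm records by their start time - here the 19:12-19:15 UTC window of the failed puppet run - and count which commands dominate. It assumes GNU acct is enabled on the node, as it evidently is on tools-webgrid-lighttpd-1402; the exact time window is just this incident's.)
```
# Sketch, assuming GNU process accounting (acct) is enabled on the node.
# Each lastcomm record ends with the process start time, e.g.
#   perl   X  tools.km  __   1.12 secs Mon Oct  5 19:15
# so filtering on that field and counting command names shows what dominated
# the window of the failed puppet run (19:12-19:15 here).
sudo lastcomm | grep -E ' 19:1[2-5]$' | awk '{print $1}' | sort | uniq -c | sort -rn | head
```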
[20:40:22] valhallasw`cloud: yes and lighttpd hosts have too many processes since our strategy for them is 'hope'
[20:40:26] valhallasw`cloud: yeah but only for the k8s hosts
[20:40:32] ah
[20:40:45] don't think we can put 80 hosts onto one labs instance without it croaking now and then
[20:51:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:52:11] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:52:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:54:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:54:54] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:55:16] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:55:18] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:55:26] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:55:26] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:55:34] PROBLEM - Puppet failure on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:56:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:56:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[20:56:20] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:57:07] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:57:24] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:57:24] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:57:24] yuvipanda: That's a true one; you on it?
[20:57:43] Coren: this bunch? no let me look
[20:57:56] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:57:58] I already found the why, I just didn't want to ninja you.
[20:58:02] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:58:08] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:58:08] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:58:15] Coren: is this the typo in the role file?
[20:58:16] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:58:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:58:29] yup
[20:58:30] yep
[20:58:31] Coren: I merged a fix
[20:59:10] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:59:17] Ah, so I see. I had testes just before you merge kicked in it seems.
[20:59:18] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:59:21] tested*
[20:59:30] PROBLEM - Puppet failure on tools-packages is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:59:58] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:00:53] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:01:51] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:01:54] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:08:23] PROBLEM - Puppet failure on tools-proxy-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:12:12] RECOVERY - Puppet failure on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:14:00] PROBLEM - Puppet failure on tools-worker-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:14:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:15:34] PROBLEM - Puppet failure on tools-k8s-master-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:17:22] PROBLEM - Puppet failure on tools-k8s-bastion-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:19:08] PROBLEM - Puppet failure on tools-worker-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:31:31] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:31:41] 6Labs, 10Labs-Infrastructure, 10hardware-requests, 6operations: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1703697 (10RobH)
[21:32:13] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:32:15] My "bash" and "sal" tools are both failing because they can't connect to https://stashbot.wmflabs.org/ (Hosted in stashbot Labs project with instance proxy frontend). The Labs server works fine for external access.
[21:32:45] Was there a change to instance proxy that would have caused this?
[21:33:19] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:33:57] RECOVERY - Puppet failure on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:34:09] RECOVERY - Puppet failure on tools-redis-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:34:53] RECOVERY - Puppet failure on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:15] RECOVERY - Puppet failure on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:17] RECOVERY - Puppet failure on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:31] RECOVERY - Puppet failure on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:33] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:13] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:23] RECOVERY - Puppet failure on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:36:57] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:07] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:23] RECOVERY - Puppet failure on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:23] RECOVERY - Puppet failure on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:55] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:03] RECOVERY - Puppet failure on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:07] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:11] RECOVERY - Puppet failure on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:15] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:27] RECOVERY - Puppet failure on tools-packages is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:40:51] RECOVERY - Puppet failure on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:41:53] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:44:02] bd808: how long ago was that?
[21:45:06] yuvipanda: right now
[21:45:15] curl: (7) Failed to connect to stashbot.wmflabs.org port 443: No route to host
[21:45:36] bd808: yeah fixing
[21:46:33] PROBLEM - Puppet failure on tools-proxy-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:48:23] RECOVERY - Puppet failure on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:48:59] RECOVERY - Puppet failure on tools-worker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:50:25] bd808: should be fixed as dns caches expire...
[21:57:17] RECOVERY - Puppet failure on tools-k8s-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:01:40] yuvipanda: so it looks like it is going to take another hour based on the TTLs I'm seeing from dig :/
[22:02:07] 3383 from one master and 2763 from the other
[22:03:51] ugh I thought we had got it down to 5mins...
[22:04:11] of course
[22:04:18] we have gotten it down for things of form X.eqiad.wmflabs
[22:04:22] not for X.wmflabs.org
[22:07:39] I can live with them being broken for another hour, but maybe a good learning point for the next time you need to rotate the proxy to a new ip
[22:08:07] bd808: I'm making a patch now that'll drop the TTL to be 1min as well, in line with other stuff
[22:08:17] bd808: and yeah, Krenair has a WIP patch to automate this that I should look at soon
[22:41:29] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-117: Setup a way to store secrets and access them from puppet inside the Tool Labs project - https://phabricator.wikimedia.org/T112005#1703864 (10yuvipanda) Ok, there's now tools-puppetmaster-01 setup to serve as puppetmaster for the k8s master node, two...
[22:49:05] RECOVERY - Puppet failure on tools-worker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:50:33] RECOVERY - Puppet failure on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:06:32] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
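(For reference on the TTL discussion above, a sketch of the kind of dig check bd808 describes: the second column of the answer section is the TTL, which counts down on a caching resolver and so shows how much longer the old proxy address can keep being served from cache. The resolver address below is a placeholder, not the actual labs recursor.)
```
# Sketch: the second field of each answer line is the remaining TTL in seconds.
# Replace resolver.example.org with the recursor actually being checked.
dig +noall +answer stashbot.wmflabs.org A @resolver.example.org
dig +noall +answer stashbot.wmflabs.org A    # same query via the local resolver
```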