[00:26:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[00:36:20] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[00:42:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[00:47:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[01:02:50] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Andrew) Open>Resolved a:Andrew That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dying or so...
[01:02:59] RECOVERY - Host deployment-maps03 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms
[01:06:51] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Andrew) Oh, and to answer your main question -- there isn't a great workaround for accessing VMs when ssh stops working. Salt was good for that but was...
[01:09:30] PROBLEM - Puppet errors on deployment-maps03 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[01:14:27] RECOVERY - Puppet errors on deployment-maps03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:24:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[02:29:17] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[02:50:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:11:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:21:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[05:42:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:52:20] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[06:13:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[08:21:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[08:26:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[08:35:26] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:45:17] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[09:07:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[09:12:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[09:33:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
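The repeating shinken alerts above all come from the same metric: the percentage of free space left on deployment-deploy01's root filesystem dropping below the threshold. A minimal way to check this by hand, assuming shell access to the instance (a diagnostic sketch, not the monitoring check itself; /tmp is singled out only because that is where the space later turns out to be going):

```
# Confirm how full the root filesystem actually is
df -h /
# List the largest items under /tmp, where scap l10n and captcha files accumulate
sudo du -xsh /tmp/* 2>/dev/null | sort -rh | head -n 20
```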
[10:04:09] PROBLEM - Puppet errors on deployment-deploy01 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[10:39:09] RECOVERY - Puppet errors on deployment-deploy01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:30:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[11:40:21] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[12:16:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[12:21:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:27:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[13:37:17] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:43:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[13:48:16] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:52:33] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) >>! In T205195#4608607, @Andrew wrote: > That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dyin...
[13:57:45] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) I've now run `iptables -P INPUT ACCEPT`, `iptables -F`, `apt-get remove ferm` and running puppet again shows ferm has not been re-installed, co...
[13:59:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[14:03:08] !log rm stuff in deployment-deploy01:/tmp to try to clear space and stop shinken whining
[14:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:04:04] (cleaned 2.5G of scap l10n and captcha stuff)
[14:19:33] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) It looks like my net-dns-users subscription got approved some time between July 25 and Aug...
[14:38:21] Krenair i wonder should we recreate deploy01 with a bigger disk?
[14:39:04] possibly paladox
[14:39:10] ok
[14:39:17] should also make stuff stop leaving huge amounts of stuff behind in /tmp though
[14:39:25] yeh
[14:41:40] okay
[14:42:14] deploy servers are c8.m8.s60
[14:43:48] Should probably be c8.m8.s80
[14:43:56] yeh
[14:44:16] don't want to go to xlarge as that's 16GB RAM and 160GB disk
[14:44:45] so make a #beta-cluster-infrastructure #cloud-vps task for that
[14:45:18] and a #beta-cluster-reproducible #scap task for scap_l10n stuff getting left in /tmp
[14:47:30] s/and/or/ ?
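For reference, the sequence Krenair describes in the 13:57 comment amounts to opening the firewall, removing ferm so the broken ruleset cannot come back, and re-running puppet to confirm it stays gone. A sketch of those steps, run as root on the affected VM (the `puppet agent --test` line is an assumption about how puppet was re-run; the rest mirrors the quoted commands):

```
# Default-accept inbound traffic, then flush all existing iptables rules
iptables -P INPUT ACCEPT
iptables -F
# Remove ferm so the ruleset is not reinstated on the next run
apt-get remove ferm
# Re-run puppet and verify ferm does not get reinstalled
puppet agent --test
```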
[14:51:32] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) And actually now that I've done this and investigated some more I'm confident enough to jus...
[14:51:49] paladox, actually wondering if we need to do both or not
[14:55:20] oh god, RT. I have forgotten about this thing.
[15:05:40] RT?
[15:12:28] Krenair: what’s RT?
[15:13:09] paladox, it's a ticketing system. My secondary school used it and so did Wikimedia Ops until around 4 years ago with the phab migration
[15:13:25] Oh
[15:13:38] But in this case, it's also used by CPAN/Perl
[15:14:20] Oh
[15:14:26] -> PM
[15:21:35] Project beta-scap-eqiad build #222824: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222824/
[15:21:37] Project beta-update-databases-eqiad build #28534: FAILURE in 1 min 36 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28534/
[15:34:18] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) Okay here's what I'm gonna send to upstream bug tracking: ```After a bit of investigation i...
[15:36:10] Yippee, build fixed!
[15:36:11] Project beta-scap-eqiad build #222825: FIXED in 13 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222825/
[15:37:54] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) Upstreamed, skipping ferm and going straight to Net::DNS: https://rt.cpan.org/Ticket/Displa...
[15:38:49] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (Krenair)
[15:49:44] also these guys are using SVN
[16:00:05] still, at least it looks like the project is active - latest release two days ago
[16:00:35] (I checked the changes file and didn't see anything related to our problem)
[16:04:59] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) I've been thinking about how rebooting fixed this - I think because ferm was still installed, rebooting it triggered ferm to replace the iptabl...
[16:22:12] Yippee, build fixed!
[16:22:12] Project beta-update-databases-eqiad build #28535: FIXED in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28535/
[16:30:16] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[16:35:31] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (MarcoAurelio) On the other hand, if purge_checkuser detects CheckUser is not installed it will just print that the...
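The T153468 rename above pins down the actual bug: when ferm's @resolve asks for AAAA records of a name that exists but only has an A record, the response is NOERROR with an empty answer section, which is not the same thing as NXDOMAIN, and the Net::DNS handling of that case is what breaks puppet. The two cases can be compared directly with dig (hostnames here are placeholders, not the real instances):

```
# Name exists but has no AAAA record: status NOERROR, ANSWER: 0
dig AAAA some-instance.eqiad.wmflabs +noall +comments +answer
# Name does not exist at all: status NXDOMAIN
dig AAAA no-such-host.invalid +noall +comments +answer
```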
[16:46:24] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:49] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Reedy) Make it do a file existence && run script
[16:59:36] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Krenair) We can try, but this is puppet.git, and we may just get a CR-2.
[17:16:15] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[21:31:54] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (Krenair) A comment on the Net::DNS t...
[22:39:48] Deployments, MediaWiki-Internationalization, Patch-For-Review, Performance-Team (Radar): Experiment with plain .php files for l10n cache instead of CDB - https://phabricator.wikimedia.org/T99740 (Seb35) I tried this option. I didn’t benchmark but I noticed the files are quite big – as noted in rM...
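Reedy's "file existence && run script" suggestion on T125976 would make the puppet-managed purge_checkuser job a no-op on wikis where CheckUser is not deployed. Roughly what that guard looks like in shell; the path, script name, and wiki are illustrative guesses, not the actual puppet definition:

```
# Only run the CheckUser purge if the extension's maintenance script is present
SCRIPT=/srv/mediawiki/php-master/extensions/CheckUser/maintenance/purgeOldData.php
[ -f "$SCRIPT" ] && php "$SCRIPT" --wiki=deploymentwiki
```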