[00:26:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[00:36:20] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[00:42:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[00:47:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[01:02:50] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Andrew) Open>Resolved a:Andrew That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dying or so...
[01:02:59] RECOVERY - Host deployment-maps03 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms
[01:06:51] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Andrew) Oh, and to answer your main question -- there isn't a great workaround for accessing VMs when ssh stops working. Salt was good for that but was...
[01:09:30] PROBLEM - Puppet errors on deployment-maps03 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[01:14:27] RECOVERY - Puppet errors on deployment-maps03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:24:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[02:29:17] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[02:50:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:11:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:21:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[05:42:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[05:52:20] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[06:13:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[08:21:18] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[08:26:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[08:35:26] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:45:17] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[09:07:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[09:12:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[09:33:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
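The repeating shinken alerts above all come from the same metric: the percentage of free space left on deployment-deploy01's root filesystem dropping below the threshold. A minimal way to check this by hand, assuming shell access to the instance (a diagnostic sketch, not the monitoring check itself; /tmp is singled out only because that is where the space later turns out to be going):

```
# Confirm how full the root filesystem actually is
df -h /
# List the largest items under /tmp, where scap l10n and captcha files accumulate
sudo du -xsh /tmp/* 2>/dev/null | sort -rh | head -n 20
```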
[10:04:09] PROBLEM - Puppet errors on deployment-deploy01 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[10:39:09] RECOVERY - Puppet errors on deployment-deploy01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:30:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[11:40:21] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[12:16:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[12:21:18] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:27:20] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[13:37:17] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:43:19] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[13:48:16] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[13:52:33] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) >>! In T205195#4608607, @Andrew wrote: > That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dyin...
[13:57:45] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) I've now run `iptables -P INPUT ACCEPT`, `iptables -F`, `apt-get remove ferm` and running puppet again shows ferm has not been re-installed, co...
[13:59:17] PROBLEM - Free space - all mounts on deployment-deploy01 is CRITICAL: CRITICAL: deployment-prep.deployment-deploy01.diskspace.root.byte_percentfree (<11.11%)
[14:03:08] !log rm stuff in deployment-deploy01:/tmp to try to clear space and stop shinken whining
[14:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:04:04] (cleaned 2.5G of scap l10n and captcha stuff)
[14:19:33] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) It looks like my net-dns-users subscription got approved some time between July 25 and Aug...
[14:38:21] Krenair i wonder should we recreate deploy01 with a bigger disk?
[14:39:04] possibly paladox
[14:39:10] ok
[14:39:17] should also make stuff stop leaving huge amounts of stuff behind in /tmp though
[14:39:25] yeh
[14:41:40] okay
[14:42:14] deploy servers are c8.m8.s60
[14:43:48] Should probably be c8.m8.s80
[14:43:56] yeh
[14:44:16] don't want to go to xlarge as that's 16GB RAM and 160GB disk
[14:44:45] so make a #beta-cluster-infrastructure #cloud-vps task for that
[14:45:18] and a #beta-cluster-reproducible #scap task for scap_l10n stuff getting left in /tmp
[14:47:30] s/and/or/ ?
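For reference, the sequence Krenair describes in the 13:57 comment amounts to opening the firewall, removing ferm so the broken ruleset cannot come back, and re-running puppet to confirm it stays gone. A sketch of those steps, run as root on the affected VM (the `puppet agent --test` line is an assumption about how puppet was re-run; the rest mirrors the quoted commands):

```
# Default-accept inbound traffic, then flush all existing iptables rules
iptables -P INPUT ACCEPT
iptables -F
# Remove ferm so the ruleset is not reinstated on the next run
apt-get remove ferm
# Re-run puppet and verify ferm does not get reinstalled
puppet agent --test
```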
[14:51:32] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) And actually now that I've done this and investigated some more I'm confident enough to jus...
[14:51:49] paladox, actually wondering if we need to do both or not
[14:55:20] oh god, RT. I have forgotten about this thing.
[15:05:40] RT?
[15:12:28] Krenair: what’s RT?
[15:13:09] paladox, it's a ticketing system. My secondary school used it and so did Wikimedia Ops until around 4 years ago with the phab migration
[15:13:25] Oh
[15:13:38] But in this case, it's also used by CPAN/Perl
[15:14:20] Oh
[15:14:26] -> PM
[15:21:35] Project beta-scap-eqiad build #222824: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222824/
[15:21:37] Project beta-update-databases-eqiad build #28534: FAILURE in 1 min 36 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28534/
[15:34:18] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) Okay here's what I'm gonna send to upstream bug tracking: ```After a bit of investigation i...
[15:36:10] Yippee, build fixed!
[15:36:11] Project beta-scap-eqiad build #222825: FIXED in 13 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222825/
[15:37:54] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468 (Krenair) Upstreamed, skipping ferm and going straight to Net::DNS: https://rt.cpan.org/Ticket/Displa...
[15:38:49] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (Krenair)
[15:49:44] also these guys are using SVN
[16:00:05] still, at least it looks like the project is active - latest release two days ago
[16:00:35] (I checked the changes file and didn't see anything related to our problem)
[16:04:59] Beta-Cluster-Infrastructure, Cloud-VPS: Please fix my screw-up - unbreak SSH access to deployment-maps03 VM - https://phabricator.wikimedia.org/T205195 (Krenair) I've been thinking about how rebooting fixed this - I think because ferm was still installed, rebooting it triggered ferm to replace the iptabl...
[16:22:12] Yippee, build fixed!
[16:22:12] Project beta-update-databases-eqiad build #28535: FIXED in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28535/
[16:30:16] RECOVERY - Free space - all mounts on deployment-deploy01 is OK: OK: All targets OK
[16:35:31] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (MarcoAurelio) On the other hand, if purge_checkuser detects CheckUser is not installed it will just print that the...
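The T153468 rename above pins down the actual bug: when ferm's @resolve asks for AAAA records of a name that exists but only has an A record, the response is NOERROR with an empty answer section, which is not the same thing as NXDOMAIN, and the Net::DNS handling of that case is what breaks puppet. The two cases can be compared directly with dig (hostnames here are placeholders, not the real instances):

```
# Name exists but has no AAAA record: status NOERROR, ANSWER: 0
dig AAAA some-instance.eqiad.wmflabs +noall +comments +answer
# Name does not exist at all: status NXDOMAIN
dig AAAA no-such-host.invalid +noall +comments +answer
```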
[16:46:24] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:49] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Reedy) Make it do a file existence && run script
[16:59:36] Beta-Cluster-Infrastructure, Operations, Wikidata, wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Krenair) We can try, but this is puppet.git, and we may just get a CR-2.
[17:16:15] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[21:31:54] Beta-Cluster-Infrastructure, DNS, Operations, Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (Krenair) A comment on the Net::DNS t...
[22:39:48] Deployments, MediaWiki-Internationalization, Patch-For-Review, Performance-Team (Radar): Experiment with plain .php files for l10n cache instead of CDB - https://phabricator.wikimedia.org/T99740 (Seb35) I tried this option. I didn’t benchmark but I noticed the files are quite big – as noted in rM...
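Reedy's "file existence && run script" suggestion on T125976 would make the puppet-managed purge_checkuser job a no-op on wikis where CheckUser is not deployed. Roughly what that guard looks like in shell; the path, script name, and wiki are illustrative guesses, not the actual puppet definition:

```
# Only run the CheckUser purge if the extension's maintenance script is present
SCRIPT=/srv/mediawiki/php-master/extensions/CheckUser/maintenance/purgeOldData.php
[ -f "$SCRIPT" ] && php "$SCRIPT" --wiki=deploymentwiki
```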