[00:01:12] <grrrit-wm>	 (03PS4) 10JanZerebecki: Add script to create a composer.local.json based on a list of extensions [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:02:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term - https://phabricator.wikimedia.org/T94322#1490271 (10Krinkle)
[00:13:41] <grrrit-wm>	 (03Merged) 10jenkins-bot: Make python tests verbose. [integration/jenkins] - 10https://gerrit.wikimedia.org/r/227619 (owner: 10JanZerebecki)
[00:14:06] <grrrit-wm>	 (03CR) 10JanZerebecki: "recheck" [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:18:46] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-fluorine is CRITICAL deployment-prep.deployment-fluorine.diskspace._srv.byte_percentfree (<50.00%)
[00:19:09] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 032] Add script to create a composer.local.json based on a list of extensions [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:20:07] <grrrit-wm>	 (03PS5) 10JanZerebecki: Add script to create a composer.local.json based on a list of extensions [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:21:54] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 032] Add script to create a composer.local.json based on a list of extensions [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:22:26] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add script to create a composer.local.json based on a list of extensions [integration/jenkins] - 10https://gerrit.wikimedia.org/r/192177 (owner: 10Legoktm)
[00:22:34] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[00:27:13] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[00:27:29] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[00:28:47] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[00:29:27] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[00:31:05] <shinken-wm>	 PROBLEM - HHVM Queue Size on deployment-mediawiki01 is CRITICAL 37.50% of data above the critical threshold [80.0]
[00:32:44] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[00:35:28] <grrrit-wm>	 (03PS1) 10JanZerebecki: Replace perl inline with proper script [integration/config] - 10https://gerrit.wikimedia.org/r/227625 (https://phabricator.wikimedia.org/T106433) 
[00:37:08] <shinken-wm>	 PROBLEM - HHVM Queue Size on deployment-mediawiki02 is CRITICAL 66.67% of data above the critical threshold [80.0]
[00:56:06] <shinken-wm>	 PROBLEM - HHVM Queue Size on deployment-mediawiki01 is CRITICAL 87.50% of data above the critical threshold [80.0]
[00:59:25] <Krenair>	 hmm...
[00:59:44] <Krenair>	 deployment-mediawiki01's /var/log/apache2/other_vhosts_access.log is showing everything getting 503
[00:59:49] <Krenair>	 even varnish health checks
[01:01:29] <Krenair>	 which just get /w/load.php, ok
[01:02:08] <grrrit-wm>	 (03CR) 10JanZerebecki: "Update Jenkins jobs: (['mwext-Wikibase-client-tests-mysql-hhvm', 'mwext-Wikibase-client-tests-mysql-zend', 'mwext-Wikibase-client-tests-sq" [integration/config] - 10https://gerrit.wikimedia.org/r/227625 (https://phabricator.wikimedia.org/T106433) (owner: 10JanZerebecki)
[01:02:18] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 032] Replace perl inline with proper script [integration/config] - 10https://gerrit.wikimedia.org/r/227625 (https://phabricator.wikimedia.org/T106433) (owner: 10JanZerebecki)
[01:03:03] <Krenair>	 same for mediawiki02 and presumably the others
[01:04:31] <grrrit-wm>	 (03Merged) 10jenkins-bot: Replace perl inline with proper script [integration/config] - 10https://gerrit.wikimedia.org/r/227625 (https://phabricator.wikimedia.org/T106433) (owner: 10JanZerebecki)
[01:09:37] <jzerebecki>	 Krenair: followup error from deployment-fluorine having no space left on /srv ?
[01:10:10] <Krenair>	 might be...
[01:10:26] <Krenair>	 do we really depend on that to be able to serve requests?
[01:11:08] <jzerebecki>	 mh or the other way around fatal log is full of [2015-07-29 00:15:17] Fatal error: /srv/mediawiki/wikiversions-labs.cdb has no version entry for `mediawikiwiki`.
[01:11:26] <jzerebecki>	 Server: deployment-videoscaler01
[01:11:48] <Krenair>	 -rw-r--r-- 1 udp2log udp2log  25G Jul 29 01:11 wfDebug.log
[01:12:18] <Krenair>	 fatal.log is "only" 1.9G
[01:13:26] <jzerebecki>	 are there mechanisms that are supposed to rotate these logs?
[01:17:01] <wikibugs>	 10Continuous-Integration-Infrastructure, 5MW-1.26-release, 5Patch-For-Review: Fetch dependencies using composer instead of cloning mediawiki/vendor for non-wmf branches - https://phabricator.wikimedia.org/T90303#1490412 (10JanZerebecki)
[01:17:10] <Krenair>	 I don't think we care about the debug log
[01:17:38] <Krenair>	 it's a labs-only thing
[01:19:23] <Krenair>	 jzerebecki, okay, I deleted it and it's already back above 1G
[01:19:57] <Krenair>	 going to disable the log
[01:21:57] <grrrit-wm>	 (03PS1) 10JanZerebecki: Stop running composer twice [integration/config] - 10https://gerrit.wikimedia.org/r/227631 
[01:22:04] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.001 second response time
[01:22:10] <Krenair>	 ooh
[01:22:34] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.002 second response time
[01:22:48] <Krenair>	 sigh
[01:23:05] <Krenair>	 they're recovering because they can now immediately throw an error saying I broke it even more than it already was
[01:23:10] <Krenair>	 -.-
[01:23:15] <jzerebecki>	 :)
[01:26:03] <Krenair>	 okay
[01:26:07] <Krenair>	 why the fuck did that fix anything?
[01:26:51] <Krenair>	 (I commented out the full "-wgDebugLogFile" block in InitialiseSettings-labs rather than just the default line)
[01:26:59] <Krenair>	 and the site came back up..
[01:27:21] <Krenair>	 Also I killed a random jenkins scap that was getting in my way
[01:27:22] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31062 bytes in 1.771 second response time
[01:27:23] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 44894 bytes in 1.466 second response time
[01:28:12] <jzerebecki>	 Krenair: well done.
[01:28:16] <jzerebecki>	 i'm out
[01:28:41] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 45163 bytes in 1.618 second response time
[01:28:47] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-fluorine is OK All targets OK
[01:30:40] <Krenair>	 I assume jenkins will be along in another few minutes to screw it all up again?
[01:33:29] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[01:33:33] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[01:33:43] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[01:37:11] <shinken-wm>	 PROBLEM - HHVM Queue Size on deployment-mediawiki02 is CRITICAL 33.33% of data above the critical threshold [80.0]
[01:38:07] <Krenair>	 sigh
[01:38:10] <Krenair>	 well now something else is broken
[01:38:14] <Krenair>	 or maybe it's related
[01:38:33] <Krenair>	 monolog errors in hhvm.log, probably related
[01:42:33] <aude>	 beta broken?
[01:42:36] <Krenair>	 very
[01:42:39] <aude>	 :(
[01:43:12] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[01:46:20] <aude>	 Fatal error: /srv/mediawiki/wikiversions-labs.cdb has no version entry for `mediawikiwiki`.
[01:46:25] <aude>	 see stuff like that
[01:46:54] <aude>	 but is for beta videoscalars
[01:47:46] <Krenair>	 fatal.log?
[01:48:13] <Krenair>	 ok
[01:48:55] <aude>	 fatal, yes
[01:49:11] <Krenair>	 so what is trying to run "mwscript something.php mediawikiwiki" on deployment-videoscaler01?
[01:49:31] <aude>	 no idea
[01:51:38] <Krenair>	 videoscalers do run jobrunners
[01:58:24] <Krenair>	 aude, okay so if you stop the jobrunner service on videoscaler01 it shuts up
[01:58:33] <Krenair>	 so it's definitely something coming from there..
[01:59:08] <aude>	 interesting
[01:59:41] <Krenair>	 ... and now it won't start again. oops
[02:00:58] <Krenair>	 okay well, /var/log/mediawiki/jobrunner.log has a ton of entries about this
[02:01:26] <Krenair>	 and it's all CA's fault
[02:01:27] <Krenair>	 legoktm
[02:01:42] <Krenair>	 nice -19 php /srv/mediawiki/multiversion/MWScript.php runJobs.php --wiki='mediawikiwiki' --type='CentralAuthCreateLocalAccountJob' --maxtime='60' --memory-limit='300M' --result=json
[02:01:57] <Krenair>	 in beta, where there is no mediawikiwiki
[02:07:20] <Krenair>	 oh, great
[02:07:24] <Krenair>	 I guess that's why it doesn't load
[02:07:29] <Krenair>	 /dev/vda2                                                  1.9G  1.9G     0 100% /var
[02:08:35] <Krenair>	 yup
[02:08:51] <Krenair>	 rm'd /var/log/mediawiki/jobrunner.log, service stayed running
[02:09:29] <Krenair>	 but it still fills up with that CentralAuth mw.o nonsense
[02:10:53] <aude>	 :(
[02:20:12] <Krenair>	 hmm - https://github.com/wikimedia/mediawiki-services-jobrunner/commits/master
[02:20:27] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK
[02:30:13] <grrrit-wm>	 (03PS1) 10Legoktm: Add experimental 'mwext-mw-selenium' for Echo and Flow [integration/config] - 10https://gerrit.wikimedia.org/r/227640 
[02:32:24] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Add experimental 'mwext-mw-selenium' for Echo and Flow [integration/config] - 10https://gerrit.wikimedia.org/r/227640 (owner: 10Legoktm)
[02:33:56] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add experimental 'mwext-mw-selenium' for Echo and Flow [integration/config] - 10https://gerrit.wikimedia.org/r/227640 (owner: 10Legoktm)
[02:34:30] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/227640
[02:34:34] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[02:37:53] <Krenair>	 aude, I tried to get rid of the mediawikiwiki:jobqueue:CentralAuthCreateLocalAccountJob keys from redis but that doesn't seem to have changed anything
[02:38:22] <aude>	 Krenair: i have no idea where it is set or why it broke now
[02:40:19] <aude>	 maybe ori or legoktm knows?
[02:41:09] <legoktm>	 kajhdfdkfh
[02:41:23] <legoktm>	 I know what the issue is,
[02:41:30] <legoktm>	 https://phabricator.wikimedia.org/T87398
[02:41:31] <aude>	 oh, ok
[02:41:37] <legoktm>	 probably related to that
[02:44:08] <Krenair>	 legoktm, even after you'd purged the nonsense jobs from redis?
[02:44:25] <legoktm>	 have you restarted the job runner service?
[02:44:30] <Krenair>	 ydp
[02:44:32] <Krenair>	 yep*
[02:45:06] <Krenair>	 stopped it, but then as soon as I start it again, the log starts flooding
[02:49:44] <Krenair>	 any ideas legoktm?
[02:49:51] <legoktm>	 nope :<
[02:54:17] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<10.00%)
[02:59:59] <ostriches>	 !log deployment-bastion: purged a bunch of atop and pacct logs, and apt cache...clogging up /var again.
[03:00:02] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[03:00:41] <ostriches>	 !log deployment-bastion: please please someone rebuild me to not have a stupid 2G /var partition
[03:00:44] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[03:09:22] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK
[03:12:31] <wikibugs>	 6Release-Engineering, 10Wikimedia-Site-requests: The favicon for test2 has been wrongly changed from Wikipedia logo to Wikimedia logo - https://phabricator.wikimedia.org/T107245#1490547 (10Ryasmeen) 3NEW
[03:16:21] <wikibugs>	 6Release-Engineering, 10Wikimedia-Site-requests: The favicon for test2 has been wrongly changed from Wikipedia logo to Wikimedia logo - https://phabricator.wikimedia.org/T107245#1490560 (10Krenair) a:3Krenair Yeah, someone from arbcom_enwiki complained about this too. https://gerrit.wikimedia.org/r/#/c/22735...
[03:16:58] <wikibugs>	 6Release-Engineering, 10Wikimedia-Site-requests: Special Wikipedias don't have wikipedia logo - https://phabricator.wikimedia.org/T107245#1490562 (10Krenair)
[04:09:46] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-fluorine is CRITICAL deployment-prep.deployment-fluorine.diskspace._srv.byte_percentfree (<40.00%)
[04:20:41] <shinken-wm>	 PROBLEM - Host deployment-cache-text03 is DOWN: CRITICAL - Host Unreachable (10.68.17.220)
[04:27:20] <shinken-wm>	 RECOVERY - Host deployment-cache-text03 is UPING OK - Packet loss = 0%, RTA = 1.20 ms
[04:28:59] <ostriches>	 Hrm, anyone remember where we stashed the "puppet autosigner" thingie for staging?
[04:30:04] <ostriches>	 Oh yeah, puppetmaster::certcleaner
[04:45:23] <shinken-wm>	 PROBLEM - Host deployment-mediawiki01 is DOWN: CRITICAL - Host Unreachable (10.68.17.170)
[04:45:41] <shinken-wm>	 PROBLEM - Host deployment-cache-text03 is DOWN: CRITICAL - Host Unreachable (10.68.17.220)
[04:45:54] <shinken-wm>	 RECOVERY - Host deployment-mediawiki01 is UPING OK - Packet loss = 0%, RTA = 2.18 ms
[05:45:35] <shinken-wm>	 PROBLEM - Host deployment-mediawiki01 is DOWN: CRITICAL - Host Unreachable (10.68.17.170)
[05:47:15] <shinken-wm>	 RECOVERY - Host deployment-mediawiki01 is UPING OK - Packet loss = 0%, RTA = 28.31 ms
[05:51:43] <wikibugs>	 10Beta-Cluster, 10Staging, 6Collaboration-Team, 7Database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1490740 (10Mattflaschen)
[06:03:33] <shinken-wm>	 PROBLEM - Host deployment-mediawiki01 is DOWN: CRITICAL - Host Unreachable (10.68.17.170)
[06:03:47] <shinken-wm>	 RECOVERY - Host deployment-mediawiki01 is UPING OK - Packet loss = 0%, RTA = 0.47 ms
[06:10:43] <shinken-wm>	 RECOVERY - Host deployment-cache-text03 is UPING OK - Packet loss = 0%, RTA = 1.67 ms
[06:33:22] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31064 bytes in 1.784 second response time
[06:33:34] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.002 second response time
[06:39:28] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[06:39:42] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[06:42:13] <shinken-wm>	 PROBLEM - HHVM Queue Size on deployment-mediawiki02 is CRITICAL 57.14% of data above the critical threshold [80.0]
[07:58:02] <wikibugs>	 5Continuous-Integration-Isolation: Investigate non blocking fs resizing when instance is booted - https://phabricator.wikimedia.org/T104974#1490837 (10hashar)
[07:58:20] <wikibugs>	 5Continuous-Integration-Isolation, 6Labs, 10Labs-Infrastructure: Investigate non blocking fs resizing when instance is booted - https://phabricator.wikimedia.org/T104974#1433499 (10hashar)
[07:58:59] <wikibugs>	 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation: Design the Jenkins isolation architecture - https://phabricator.wikimedia.org/T86171#1490843 (10hashar)
[08:16:33] <wikibugs>	 5Continuous-Integration-Isolation, 6operations: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1490849 (10hashar) Before rebuilding the system, I wanted to make sure all .deb package dependencies are on apt.wikimedia.org.  The Nodepool requirements.txt file list the python mo...
[08:23:26] <grrrit-wm>	 (03PS6) 10Paladox: Update HitCounters tests [integration/config] - 10https://gerrit.wikimedia.org/r/227438 
[08:23:33] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] Update HitCounters tests [integration/config] - 10https://gerrit.wikimedia.org/r/227438 (owner: 10Paladox)
[08:29:42] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1490853 (10hashar)
[08:32:14] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1490854 (10hashar) In operations/debs/nodepool.git ``` $ git diff debian..0.1.0 requirements.txt ... -python-novaclient +python-novaclient>=2.21.0 $ ```
[09:19:21] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 44868 bytes in 2.582 second response time
[09:19:21] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31062 bytes in 2.194 second response time
[09:19:35] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.003 second response time
[09:22:09] <shinken-wm>	 RECOVERY - HHVM Queue Size on deployment-mediawiki02 is OK Less than 30.00% above the threshold [10.0]
[09:22:25] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mx is CRITICAL 100.00% of data above the critical threshold [0.0]
[09:50:26] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491022 (10hashar) Build the package and put it at:   https://people.wikimedia.org/~hashar/debs/nodepool_0.1.0-wmf1/  terbium.eqiad.wmnet:/home/hashar...
[09:51:24] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491023 (10hashar) a:3hashar
[09:56:00] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491030 (10hashar) ``` root@labnodepool1001:/root# dpkg -i nodepool_0.1.0-wmf1_amd64.deb (Reading database ... 52061 files and directories currently i...
[09:56:31] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[09:59:05] <wikibugs>	 5Continuous-Integration-Isolation, 7Nodepool: Bump Nodepool package to v0.1.0 and propose it to Debian - https://phabricator.wikimedia.org/T98295#1491037 (10hashar) 5Open>3declined a:3hashar Bumping to 0.1.0 is was  T104971.
[10:01:48] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[10:02:05] <wikibugs>	 5Continuous-Integration-Isolation, 5Patch-For-Review: nodepool users should have OpenStack env variables set on login - https://phabricator.wikimedia.org/T103673#1491050 (10hashar) 5stalled>3Resolved a:3hashar /var/lib/nodepool/.profile is populated by puppet
[10:03:04] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3896 bytes in 0.004 second response time
[10:03:24] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 44876 bytes in 2.512 second response time
[10:06:24] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31062 bytes in 2.970 second response time
[10:06:40] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 45181 bytes in 1.291 second response time
[10:09:14] <wikibugs>	 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Epic, 3releng-201415-Q3, and 2 others: [Quarterly Success Metric] Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#1491060 (10hashar)
[10:09:17] <wikibugs>	 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 6operations: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#1491057 (10hashar) 5Open>3declined a:3hashar We are using labs infrastructure for now.  There is no plan to reuse the old...
[10:12:07] <shinken-wm>	 RECOVERY - HHVM Queue Size on deployment-mediawiki01 is OK Less than 30.00% above the threshold [10.0]
[10:14:16] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to  - https://phabricator.wikimedia.org/T107266#1491065 (10hashar) 3NEW a:3hashar
[10:14:29] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491076 (10hashar)
[10:14:45] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1433466 (10hashar)
[10:14:48] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491065 (10hashar)
[10:31:43] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (10hashar) 3NEW
[10:31:56] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (10hashar)
[10:31:59] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491100 (10hashar)
[10:33:34] <wikibugs>	 5Continuous-Integration-Isolation, 7Nodepool: Bump Nodepool to support statsd 0.3.0 - https://phabricator.wikimedia.org/T107268#1491102 (10hashar) 3NEW
[10:33:55] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (10hashar)
[10:33:57] <wikibugs>	 5Continuous-Integration-Isolation, 7Nodepool: Bump Nodepool to support statsd 0.3.0 - https://phabricator.wikimedia.org/T107268#1491113 (10hashar)
[10:34:00] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1491115 (10hashar)
[10:34:20] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491118 (10hashar)
[10:34:23] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (10hashar)
[10:34:30] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491065 (10hashar)
[10:34:47] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491065 (10hashar)
[10:34:49] <wikibugs>	 5Continuous-Integration-Isolation, 7Nodepool: Bump Nodepool to support statsd 0.3.0 - https://phabricator.wikimedia.org/T107268#1491102 (10hashar)
[10:39:44] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1491147 (10hashar) a:3hashar
[10:40:23] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1361446 (10hashar) https://gerrit.wikimedia.org/r/227461 causes Nodepool to no more rely on disk image builder. We will provid...
[10:52:31] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL deployment-prep.deployment-videoscaler01.diskspace._var.byte_percentfree (<30.00%)
[11:00:34] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL - Socket timeout after 10 seconds
[11:03:48] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[11:05:24] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 44894 bytes in 1.037 second response time
[11:06:28] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[11:11:21] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 44875 bytes in 1.313 second response time
[12:04:48] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL - Socket timeout after 10 seconds
[12:09:38] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 45180 bytes in 0.924 second response time
[12:36:15] <hashar_>	 !log salt on deployment-salt is missing most of the instances :-(((
[12:36:18] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[12:38:17] <hashar_>	 !log salt minions are back somehow
[12:38:20] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[12:54:32] <grrrit-wm>	 (03PS5) 10Paladox: Update Apex tests [integration/config] - 10https://gerrit.wikimedia.org/r/226994 
[12:54:56] <grrrit-wm>	 (03PS4) 10Paladox: Add extension-unittests-generic to metrolook [integration/config] - 10https://gerrit.wikimedia.org/r/226995 
[12:55:08] <grrrit-wm>	 (03PS4) 10Paladox: Add BlogPage to testextension [integration/config] - 10https://gerrit.wikimedia.org/r/227217 
[12:55:31] <grrrit-wm>	 (03PS5) 10Paladox: Update WikidataPageBanner tests [integration/config] - 10https://gerrit.wikimedia.org/r/226913 
[12:56:47] <grrrit-wm>	 (03PS10) 10Paladox: Add jenkings test for BoilerPlate [integration/config] - 10https://gerrit.wikimedia.org/r/226680 
[13:03:36] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1491403 (10hashar) git clone git://anonscm.debian.org/openstack/python-os-client-config.git  dch --bpo  Modify chang...
[13:07:35] <grrrit-wm>	 (03CR) 10Hashar: [C: 031] Stop running composer twice [integration/config] - 10https://gerrit.wikimedia.org/r/227631 (owner: 10JanZerebecki)
[13:08:16] <grrrit-wm>	 (03PS2) 10Hashar: Use MW-Selenium setup slave script [integration/config] - 10https://gerrit.wikimedia.org/r/227616 (owner: 10Dduvall)
[13:09:58] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] "Refreshed jobs:" [integration/config] - 10https://gerrit.wikimedia.org/r/227616 (owner: 10Dduvall)
[13:12:02] <grrrit-wm>	 (03Merged) 10jenkins-bot: Use MW-Selenium setup slave script [integration/config] - 10https://gerrit.wikimedia.org/r/227616 (owner: 10Dduvall)
[13:25:07] <grrrit-wm>	 (03PS11) 10Paladox: Add jenkins test for BoilerPlate [integration/config] - 10https://gerrit.wikimedia.org/r/226680 
[13:25:19] <grrrit-wm>	 (03PS12) 10Paladox: Add jenkins test for BoilerPlate [integration/config] - 10https://gerrit.wikimedia.org/r/226680 
[13:32:29] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL - Socket timeout after 10 seconds
[13:37:20] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 44894 bytes in 2.087 second response time
[13:38:06] <wikibugs>	 10Continuous-Integration-Infrastructure, 6operations: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia and trusty-wikimedia - https://phabricator.wikimedia.org/T106499#1491499 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi {{done}}  ``` root@carbon:~# reprepro -C main --ignore=wrongd...
[14:09:59] <shinken-wm>	 PROBLEM - Free space - all mounts on integration-slave-trusty-1016 is CRITICAL integration.integration-slave-trusty-1016.diskspace._mnt.byte_percentfree (<10.00%)
[14:28:52] <bblack>	 !log cherry-picked https://gerrit.wikimedia.org/r/#/c/215624 into deployment-puppetmaster ops/puppet
[14:28:55] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:38:43] <bblack>	 !log cherry-picked https://gerrit.wikimedia.org/r/#/c/215624 (updated to PS8) into deployment-puppetmaster ops/puppet
[14:38:46] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:53:32] <wikibugs>	 10Beta-Cluster: Could not login into Beta Cluster - https://phabricator.wikimedia.org/T107288#1491679 (10Luke081515) 3NEW
[14:57:48] <andrewbogott>	 Hey hey!  Can someone who works on deployment-prep have a look at ^ right away?
[14:57:57] <andrewbogott>	 The reported is in #wikimedia-labs if you need more details.
[14:58:00] <andrewbogott>	 *reporter
[15:00:37] <wikibugs>	 10Beta-Cluster: Could not login into Beta Cluster - https://phabricator.wikimedia.org/T107288#1491701 (10Luke081515)
[15:01:41] <ostriches>	 andrewbogott: Looking.
[15:01:51] <andrewbogott>	 thanks!
[15:02:19] <bblack>	 I don't think it's my recent cherrypick above, as the timeline doesn't fit "for more than two hours"
[15:02:21] <ostriches>	 Hmm, login page is timing out for me...
[15:03:57] <ostriches>	 bblack: We could just kill bits outright on deployment-prep and send it directly to the text cache.
[15:04:04] <ostriches>	 I don't see the need for back-compat in beta.
[15:09:26] <bblack>	 ostriches: we're going to do that in production as well (kill the bits cluster, send the traffic to the text cluster).  That patch is to support that mode of operation properly.  It's cherry-picked to beta just to test it functionally before trying it on production.
[15:09:39] <ostriches>	 andrewbogott: Jul 29 06:59:39 deployment-mediawiki01:  Fatal error: entire web request took longer than 290 seconds and timed out in /srv/mediawiki/php-master/includes/debug/logger/monolog/LegacyHandler.php on line 207 
[15:09:56] <ostriches>	 Bunch of those. Clearly timing out trying to login, somewhere in the logging code, but could be a false herring.
[15:10:07] <ostriches>	 bblack: fair 'nuff :)
[15:11:00] <andrewbogott>	 isn’t ‘legacyhandler’ exactly what bblack just patched?  Maybe I’m behind.
[15:11:06] <bblack>	 no
[15:11:13] <andrewbogott>	 ok :)
[15:11:23] <bblack>	 what I patched is varnish-level stuff related to bits.wikimedia(.beta.wmflabs)?.org
[15:12:08] <bblack>	 (but the patch is not to the bits-cluster VCL, it's to the text-cluster VCL so that the text-cluster supposedly supports handling requests for the bits hostname)
[15:23:26] <andrewbogott>	 Anyone know if hashar_ is gone for the day?
[15:23:27] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[15:25:44] <hashar>	 (I was busy coding :D )
[15:50:55] <ostriches>	 thcipriani: Better channel :p
[15:50:57] <ostriches>	 Yeah, that's how MWMultiversion blows up on invalid hostnames
[15:51:54] <thcipriani>	 yup, just a large amount of invalid hostnames, definitely clutters fatalmonitor
[16:21:18] <thcipriani>	 ostriches: lots of timeouts in different places, machines sure are working hard. How do you feel about kicking hhvm/apache quickly?
[16:21:28] <ostriches>	 Go for it
[16:24:38] <andrewbogott>	 y’all are working on beta, right?  Because now it’s broken in a new way :)
[16:24:59] <thcipriani>	 andrewbogott: oh boy
[16:25:15] <wikibugs>	 10Browser-Tests, 6Release-Engineering: We should not run Jenkins jobs for browser tests when beta labs is down - https://phabricator.wikimedia.org/T107305#1491965 (10Jdlrobson) 3NEW
[16:25:16] <thcipriani>	 andrewbogott: yeah, just kicked apache and hhvm, what are you seeing now?
[16:25:55] <andrewbogott>	 I was getting an error for the front page a minute ago, seems better now
[16:26:07] <andrewbogott>	 although the ‘log in’ link seems to time out
[16:26:51] <thcipriani>	 yeah, that's the same problem
[16:27:32] <thcipriani>	 !log deployment-prep login timeouts, tried restarting apache, hhvm, and nutcracker on mediawiki{01..03}
[16:27:35] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:28:31] <bd808>	 I saw a bunch of ferm rules merge in ops/puppet... hopefully we are missing important ones
[16:30:23] <Krenair>	 It was very broken earlier
[16:30:25] <Krenair>	 is it even more broken now?
[16:31:15] <thcipriani>	 Krenair: yeah, Special:UserLogin seems to timeout
[16:31:39] <Krenair>	 sounds less broken than earlier
[16:32:13] <bd808>	 thcipriani: ori was doing stuff with adding nutcracker to the redis servers; that might be broken
[16:33:09] <thcipriani>	 restarted nutcracker on the mediawiki instances, can connect via telnet to nutcracker but "stats" doesn't return anything :\
[16:33:56] <wikibugs>	 10Browser-Tests, 6Release-Engineering: We should not run Jenkins jobs for browser tests when beta cluster is down - https://phabricator.wikimedia.org/T107305#1492025 (10greg)
[16:34:39] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not run Jenkins jobs for browser tests when beta cluster is down - https://phabricator.wikimedia.org/T107305#1491965 (10greg)
[16:35:12] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not run Jenkins jobs for browser tests when beta cluster is down - https://phabricator.wikimedia.org/T107305#1491965 (10greg)
[16:35:13] <wikibugs>	 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1492035 (10greg)
[16:37:15] <thcipriani>	 oh, redis nutcracker. Well, FWIW nutcracker isn't a recognized service on either of deployment-prep redis boxes.
[16:37:40] <bd808>	 it would be on the mw servers pointed at the redis boxes
[16:40:24] <thcipriani>	 in the nutcracker yaml I see memcached pointed to deployment-memc02 and memc03 and that seems to be it
[16:41:26] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1890 bytes in 4.555 second response time
[16:41:26] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1534 bytes in 4.750 second response time
[16:49:07] <thcipriani>	 !log lots of "Error connecting to 10.68.16.193: Can't connect to MySQL server on '10.68.16.193'" deployment-db1 seems up and functional :(
[16:49:10] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:51:19] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 31061 bytes in 0.677 second response time
[16:55:42] <thcipriani>	 ping times are pretty high coming out of deployment-mediawiki01, like 1000ms+
[16:55:51] <thcipriani>	 to deployment-db1
[16:58:33] <thcipriani>	 well, now they seem back to normal...
[17:01:24] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 44893 bytes in 0.910 second response time
[17:02:17] <bd808>	 redis logs are full of "Could not connect to server". that would break all kinds of things
[17:02:42] <bd808>	 Aaron has been switching more and more caching from mc to redis
[17:04:02] <bd808>	 memecached is jacked up too
[17:04:12] <bd808>	 so basically caching is broken
[17:04:31] <bd808>	 Memcached error for key "WANCache:v:enwiki:messages:en:hash" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[17:06:22] <bd808>	 thcipriani: want me to jump in and try to help or would I just get in your way?
[17:06:50] <thcipriani>	 bd808: please, jump away
[17:07:14] <bd808>	 k. I'll get on mw02 and start tracing things from there
[17:07:28] <thcipriani>	 sounds good, thanks!
[17:09:34] <thcipriani>	 it's weird, looking at fatalmonitor in logstash looks like there's a giant error spike at midnight 7-29 that was sustained thereafter
[17:10:11] <wikibugs>	 5Continuous-Integration-Isolation, 6operations: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1492182 (10Andrew) I re-imaged labnodepool1001 and got a clean puppet run.  Nice work!  Hashar, you can verify that it's behaving adequately and then Chase or I will yank your root...
[17:10:12] <greg-g>	 thcipriani: btw, if you're mid-diagnosing, don't worry about our 1:1, getting beta stable again is important ;)
[17:10:23] <bd808>	 2015-07-29T00:01:13.130Z 	production 	ori 	Switching over the sessions ObjectCache instance to use nutcracker. Users with an existing edit session in progress will have...
[17:10:37] <bd808>	 ori did it!
[17:10:44] * bd808 knew that wold be the answer
[17:12:25] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1890 bytes in 4.367 second response time
[17:12:25] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1534 bytes in 3.527 second response time
[17:13:28] <bd808>	 there is supposed to be a nutcracker listening on 6380 now to connect to redis -- https://github.com/wikimedia/operations-mediawiki-config/commit/57dba051d99b615a411067724b6c878d60d08ed6
[17:13:37] <bd808>	 mw02 at least doesn't have that
[17:13:57] <thcipriani>	 yeah, I haven't seen any of the mws with that
[17:14:22] <thcipriani>	 nutcracker.yaml doesn't have that config
[17:14:31] <bd808>	 https://phabricator.wikimedia.org/T106986#1489920
[17:14:43] <bd808>	 maybe he changed something that was prod only in puppet?
[17:15:32] <thcipriani>	 OR puppet is just not running on these hosts, which is fairly likely
[17:15:41] <greg-g>	 can we get ori in here to fix it up?
[17:15:56] <thcipriani>	 Error 400 on SERVER: Could not find data item mediawiki_session_redis_servers in any Hiera data file
[17:16:04] <thcipriani>	 ^ from puppet.log
[17:16:26] <bd808>	 yup
[17:17:07] <bd808>	 lots of changes in the last couple of days for this -- https://github.com/wikimedia/operations-puppet/commits/production/manifests/role/mediawiki.pp
[17:17:31] <bd808>	 https://github.com/wikimedia/operations-puppet/commit/810c16248e8a5ad8882c06c7d8b5dcbf73ae3b7c
[17:18:28] <thcipriani>	 I got it, patch incoming
[17:18:32] <bd808>	 greg-g: is now the time to be a dick about breaking beta?
[17:18:49] <bd808>	 I fought this battle last summer :/
[17:23:17] <greg-g>	 we're just 1.5 hours into 2+ people diagnosing this issue
[17:25:08] <thcipriani>	 welp, something's wrong with the new nutcracker config on 01
[17:25:15] <thcipriani>	 refuses to start :\
[17:26:07] <bd808>	 I'm running puppet on mw02; I guess I'll find out if I have the same problem
[17:26:46] <bd808>	 nutcracker: configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid
[17:27:37] <thcipriani>	 I see the problem
[17:27:38] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not run Jenkins jobs for browser tests when beta cluster is down - https://phabricator.wikimedia.org/T107305#1492238 (10Jdlrobson)
[17:27:39] <thcipriani>	 fixing
[17:27:40] <wikibugs>	 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1492239 (10Jdlrobson)
[17:27:47] <bd808>	 dear nutcracker authors: thanks for the brilliant error message
[17:28:32] <thcipriani>	 kk, should be fixed after a puppet run
[17:30:06] <thcipriani>	 in /etc/nutcracker/nutcracker.yml the servers should be: 10.68.16.177:6380:1 and 10.68.16.231:6380:1 I left out the port the first time
[17:30:17] <shinken-wm>	 RECOVERY - Puppet failure on nodepool-t105406 is OK Less than 1.00% above the threshold [0.0]
[17:30:22] <bd808>	 ah.
[17:30:32] <bd808>	 that did look odd
[17:31:10] <shinken-wm>	 PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149)
[17:31:17] <thcipriani>	 and what do you know, you can get to Special:UserLogin now!
[17:31:45] <thcipriani>	 bd808: thanks for your help!
[17:33:58] <bd808>	 thcipriani: pro tip -- when beta randomly stops working look at ori's gerrit commits first ;)
[17:35:05] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK Less than 1.00% above the threshold [0.0]
[17:36:55] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK Less than 1.00% above the threshold [0.0]
[17:39:29] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki02 is OK Less than 1.00% above the threshold [0.0]
[17:41:34] <shinken-wm>	 RECOVERY - Host deployment-test is UPING OK - Packet loss = 0%, RTA = 1.78 ms
[17:41:46] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki03 is OK Less than 1.00% above the threshold [0.0]
[17:43:09] <bd808>	 Those "Could not connect to server" message on the redis log channel look like they mean that $server is empty -- wfDebugLog( 'redis', "Could not connect to server $server" );
[17:49:21] <shinken-wm>	 RECOVERY - Puppet failure on deployment-jobrunner01 is OK Less than 1.00% above the threshold [0.0]
[17:51:32] <shinken-wm>	 PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149)
[17:52:09] <Krenair>	 so is beta fixed now?
[17:59:24] <bd808>	 I think so Krenair. fatalmonitor looks pretty good
[18:00:55] <Krenair>	 bd808, /var on deployment-videoscaler01 is still full
[18:01:31] <Krenair>	 weird since I just moved a huge log file out of there (jobrunner log, filled with flood about one of the issues I was trying to deal with earlier)
[18:02:35] <bd808>	 !log rm deployment-videoscaler01:/var/log/atop.log.?*
[18:02:38] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:03:03] <thcipriani>	 atop log *le sigh*
[18:03:34] <bd808>	 deployment-videoscaler01 is an old base image with a 2G /var still
[18:03:57] <Krenair>	 did you find my live hack to the debug log?
[18:04:21] <Krenair>	 looks like someone reverted it and everything is OK now
[18:04:51] <Krenair>	 except /srv on deployment-fluorine is full
[18:05:13] <Krenair>	 because /srv/mw-log/wfDebug.log is 27G
[18:05:22] <Krenair>	 so maybe not all OK :)
[18:05:44] <thcipriani>	 yeah, whoa
[18:05:54] <bd808>	 heh
[18:06:02] <Krenair>	 anyway, I need to go afk now. have fun
[18:06:02] <bd808>	 LOG ALL THE THINGS!
[18:08:16] <bd808>	 !log rm deployment-fluorine:/a/mw-log/archive/*-201505*
[18:08:18] <bd808>	 !log rm deployment-fluorine:/a/mw-log/archive/*-201506*
[18:08:19] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:08:22] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:08:38] <bd808>	 that's 21G back
[18:09:37] <wikibugs>	 10Continuous-Integration-Infrastructure, 6operations: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1492366 (10Cmjohnson) 5Open>3Resolved Added lanthanum to server spares. Resolving this ticket
[18:09:39] <wikibugs>	 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Epic, 3releng-201415-Q3, and 2 others: [Quarterly Success Metric] Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#1492368 (10Cmjohnson)
[18:09:46] <thcipriani>	 hmm...I can't seem to login to beta still, but this just may be my bad password management
[18:10:08] <bd808>	 I was magically still logged in when it recovered
[18:10:15] <bd808>	 I can try logging out and back in
[18:10:51] <bd808>	 thcipriani: :( yeah I can't get back in either
[18:12:00] <thcipriani>	 bummer, ok, so something still whacked about redis connection, I guess.
[18:12:33] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK
[18:14:18] <bd808>	 !log upgraded nutcracker on deployment-videoscaler01
[18:14:21] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:15:20] <bd808>	 !log upgraded nutcracker on deployment-jobrunner01
[18:15:23] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:15:45] <bd808>	 I didn't do those 2 yesterday when I did the mw servers
[18:19:06] <bd808>	 captcha is broken on the new account screen too for enwiki.beta
[18:19:42] <bd808>	 http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Captcha/image&wpCaptchaId=311198436
[18:19:46] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-fluorine is OK All targets OK
[18:20:39] <thcipriani>	 yup, same on deployment
[18:20:58] <shinken-wm>	 RECOVERY - Host deployment-test is UPING OK - Packet loss = 0%, RTA = 0.79 ms
[18:21:54] <bd808>	 looks like deployment-videoscaler01 has some broken job queued -- Fatal error: /srv/mediawiki/wikiversions-labs.cdb has no version entry for `mediawikiwiki`. -- over and over
[18:23:48] <bd808>	 thcipriani: I'm not seeing the centralauth problem in the logs :/
[18:24:03] <thcipriani>	 mediawikiwiki isn't a wiki on beta
[18:24:24] <bd808>	 yeah, that's a junk job from somewhere
[18:27:00] <thcipriani>	 I think I see a problem, 6380 isn't the right port for redis
[18:27:23] <bd808>	 oh? prod redis are on strange ports?
[18:27:57] <chasemp>	 6379
[18:27:58] <thcipriani>	 nutcracker is on 6380 redis is 6379
[18:29:40] <thcipriani>	 updated in hiera
[18:29:51] <chasemp>	 where is that changeset?
[18:30:20] <thcipriani>	 I still haven't posted it in gerrit, I'm just using wikitech at the moment :\
[18:31:22] <chasemp>	 looks like this in prod too I think
[18:31:23] <chasemp>	 nutcracke 22581 nutcracker   63u  IPv4 3607400004      0t0  TCP localhost:6380 (LISTEN)
[18:32:09] <thcipriani>	 yup, I think the backend redis-hosts are port 6379 and nutcracker is 6380
[18:32:58] <thcipriani>	 I had nutcracker trying to connect to the redis hosts on 6380, no idea why it didn't blow up a little louder
[18:33:52] <chasemp>	 my worry is ...is this also bad in prod?
[18:33:59] <chasemp>	 ori switched over to nutcracker I think there too
[18:34:10] <chasemp>	 I'm guessing it was tested in beta first and so parity badness
[18:34:17] <bd808>	 chasemp: the problem was that he didn't really switch beta at all
[18:34:25] <bd808>	 he never tests in beta
[18:34:36] <bd808>	 please yell at him about that ;)
[18:34:56] <thcipriani>	 well, I'm getting different error messages on login, like my password is wrong (which it might be)
[18:35:28] <bd808>	 thcipriani: success! I got logged in
[18:35:38] <chasemp>	 well
[18:35:38] <chasemp>	 https://gerrit.wikimedia.org/r/#/c/227573/
[18:35:59] <bd808>	 thcipriani: you can use the createAndPromote.php script to reset your password
[18:37:40] <bd808>	 chasemp: ori didn't add hiera settings for beta to go with that series of changes. That's what thcipriani has been fixing
[18:37:51] <chasemp>	 ah
[18:42:41] <grrrit-wm>	 (03PS2) 10JanZerebecki: Stop running composer twice [integration/config] - 10https://gerrit.wikimedia.org/r/227631 
[18:51:31] <wikibugs>	 10Beta-Cluster: Could not login into Beta Cluster - https://phabricator.wikimedia.org/T107288#1492481 (10Luke081515) 5Open>3Resolved a:3Luke081515 Login works now, thanks.
[19:18:48] <shinken-wm>	 PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149)
[19:52:12] <wikibugs>	 5Continuous-Integration-Isolation, 6operations: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1492591 (10hashar) 5stalled>3Resolved a:3hashar Amazing thanks @andrew .  The daemon does not run right now because it depends on statsd 2.0 whereas Jessie has 3.0.   Will try...
[19:53:29] <wikibugs>	 5Continuous-Integration-Isolation, 6operations: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1492594 (10hashar) labnodepool has been reinstalled from scratch.  I might still need root over the next two days to install some new nodepool .deb...
[19:54:16] <wikibugs>	 5Continuous-Integration-Isolation, 6operations: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1492598 (10hashar)
[19:54:19] <wikibugs>	 5Continuous-Integration-Isolation, 6operations, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1492596 (10hashar) 5Open>3Resolved Nodepool no more rely on sudo / diskimage-builder
[20:09:02] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 5Patch-For-Review, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492626 (10hashar) + #releng , we want to have this task dealt...
[20:09:11] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492628 (10hashar)
[20:10:13] <shinken-wm>	 RECOVERY - Host deployment-test is UPING OK - Packet loss = 0%, RTA = 1.63 ms
[20:16:06] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492654 (10Niedzielski) @hashar, thanks again! Happy to pair.  We can probably get...
[20:20:30] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492670 (10hashar)
[20:25:55] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492677 (10hashar) I rephrased the task to get more cores eventually.   The `integr...
[20:28:47] <wikibugs>	 6Release-Engineering, 10Continuous-Integration-Config, 3Mobile-App-Sprint-62-Android-Summer-Breeze, 3Wikipedia-Android-App: Create jenkins slave instance dedicated to Android runs - https://phabricator.wikimedia.org/T107336#1492679 (10Niedzielski) @hashar, thanks!!
[21:04:22] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 032] "Jenkins jobs updated: (['mwext-Wikibase-client-tests-mysql-hhvm', 'mwext-Wikibase-client-tests-mysql-zend', 'mwext-Wikibase-client-tests-s" [integration/config] - 10https://gerrit.wikimedia.org/r/227631 (owner: 10JanZerebecki)
[21:12:35] <grrrit-wm>	 (03Merged) 10jenkins-bot: Stop running composer twice [integration/config] - 10https://gerrit.wikimedia.org/r/227631 (owner: 10JanZerebecki)
[21:18:27] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL 100.00% of data above the critical threshold [0.0]
[21:25:36] <jdlrobson>	 marxarelli: how can i remote debug phantomjs browser? I have a test failing there that doesn't fail anywhere else..
[21:38:47] <shinken-wm>	 PROBLEM - Puppet staleness on deployment-restbase01 is CRITICAL 100.00% of data above the critical threshold [43200.0]
[21:43:28] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1534 bytes in 3.249 second response time
[21:48:22] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 44628 bytes in 0.711 second response time
[21:58:00] <marxarelli>	 jdlrobson: you can use pry to debug the ruby/mw-selenium side, but you won't be able to inspect the browser very easily
[21:58:10] <marxarelli>	 jdlrobson: honestly, i recommend not using phantomjs anymore
[21:58:24] <marxarelli>	 partly for that reason ...
[21:58:34] <marxarelli>	 it's really hard to debug something you can't see :)
[21:59:39] <marxarelli>	 also, those kinds of edge cases might not even be relevant to real browsers
[22:51:06] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL 33.33% of data above the critical threshold [0.0]
[22:56:07] <wikibugs>	 10Continuous-Integration-Infrastructure, 6Release-Engineering: automatically build and commit mediawiki/vendor (composer) - https://phabricator.wikimedia.org/T101123#1493245 (10csteipp) >>>! In T101123#1354950, @Krinkle wrote: >> Automating this defeats the purpose of mediawiki/vendor and compromises our produ...
[23:01:03] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK Less than 1.00% above the threshold [0.0]
[23:01:33] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Wikidata: github.com is 403ing downloads from Wikimedia CI during composer update - https://phabricator.wikimedia.org/T106519#1493267 (10Legoktm) @csteipp has written a MITM packagist proxy: https://github.com/Stype/packagistproxy
[23:06:42] <wikibugs>	 10Continuous-Integration-Infrastructure, 6Release-Engineering: automatically build and commit mediawiki/vendor (composer) - https://phabricator.wikimedia.org/T101123#1493285 (10JanZerebecki) >>! In T101123#1493245, @csteipp wrote: > (pulled down with git, or some other utility that actually has integrity check...
[23:11:56] <shinken-wm>	 PROBLEM - Free space - all mounts on integration-slave-trusty-1012 is CRITICAL integration.integration-slave-trusty-1012.diskspace._mnt.byte_percentfree (<100.00%)
[23:18:13] <wikibugs>	 10Beta-Cluster, 6Collaboration-Team, 10Flow, 10VisualEditor: Parsoid broken in beta for VisualEditor but not for Flow (RESTbase breakage?) - https://phabricator.wikimedia.org/T107342#1493319 (10Catrope)
[23:33:11] <Luke081515>	 Hello, can somebody help me? The API action=edit works not at beta
[23:36:03] <greg-g>	 marxarelli: ^ that seems odd
[23:36:28] <greg-g>	 hi Luke081515, you also noticed the other Beta outage this morning, thanks :)
[23:36:58] <Luke081515>	 No problem ;)
[23:48:06] <marxarelli>	 greg-g: wish the sal entries were a little more verbose
[23:48:10] <marxarelli>	 "clearing disk space on trusty 1011 and 1012"
[23:48:15] <marxarelli>	 how?!
[23:49:26] <wikibugs>	 10Browser-Tests, 10Fundraising Tech Backlog, 10MediaWiki-extensions-CentralNotice: Write API for campaign creation and use it to create browser test fixtures - https://phabricator.wikimedia.org/T107376#1493504 (10awight) 3NEW
[23:50:46] <greg-g>	 marxarelli: luckily that was > a month ago, I bet this happened more recently
[23:50:50] <greg-g>	 but yeah, agreed.
[23:52:09] <marxarelli>	 greg-g: sanity check: does `find /mnt/jenkins-workspace/workspace -type d -mindepth 1 -maxdepth 1 -mtime +30 -exec rm -rf {} \;` seem sane?
[23:53:04] <greg-g>	 eye-balling, yes. delete 30+ day old directories sounds sane
[23:53:37] <greg-g>	 could echo it once to check :)
[23:54:05] <marxarelli>	 greg-g: you're no fun
[23:54:33] <greg-g>	 fine, type it with your eyes closed
[23:54:54] <marxarelli>	 haha
[23:55:45] <marxarelli>	 !log clearing disk space on integrations-slave-trusty-1012 with `find /mnt/jenkins-workspace/workspace -mindepth 1 -maxdepth 1 -type d -mtime +15 -exec rm -rf {} \;`
[23:55:51] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:56:00] <marxarelli>	 i'm going with 15 days
[23:56:14] <greg-g>	 chicken
[23:56:38] <greg-g>	 wait, take that back, I, uh.... shush
[23:56:59] <marxarelli>	 i'm not a chicken! you're a turkey!