[00:34:20] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[01:05:36] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:14:20] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:21:53] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:24:09] PROBLEM - Puppet staleness on deployment-jobrunner02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[02:01:53] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[02:10:36] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:48:10] Gerrit, Release-Engineering-Team (Next), Scap, ORES, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3534823 (greg)
[03:22:53] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[04:02:52] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[04:16:57] Yippee, build fixed!
[04:16:58] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #490: FIXED in 20 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/490/
[04:31:35] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[04:34:59] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[04:45:48] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[05:10:00] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:11:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:35:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[06:02:38] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:47:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:54:48] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:55:16] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:21:43] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:23:01] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:35] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39217 bytes in 1.061 second response time
[07:29:48] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0]
[07:30:16] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:32:51] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 50495 bytes in 0.847 second response time
[08:03:35] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[08:12:20] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[08:50:51] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[08:52:24] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:35] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:35:59] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[10:10:58] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:44:06] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:24:08] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[11:25:46] (PS1) AnotherLadsgroup: Enable jenkins (without any tests) on wheels repo of ORES [integration/config] - https://gerrit.wikimedia.org/r/372759 (https://phabricator.wikimedia.org/T173251)
[11:26:08] Continuous-Integration-Config, Scoring-platform-team, Easy, Patch-For-Review, User-Ladsgroup: Have CI merge research/ores/wheels changes - https://phabricator.wikimedia.org/T173251#3535037 (Ladsgroup)
[13:23:51] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[14:03:54] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[14:04:33] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:32:01] Beta-Cluster-Infrastructure: Disk full on deployment-jobrunner02 - https://phabricator.wikimedia.org/T173571#3535162 (Krenair) I moved the file to /srv/jobrunner-old-20170819.log (11GB) and restarted hhvm to prevent apache just responding with 503s (HHVM had failed to load earlier for some reason) However it...
[14:37:48] RECOVERY - Free space - all mounts on deployment-jobrunner02 is OK: OK: All targets OK
[14:39:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:04:08] RECOVERY - Puppet staleness on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[15:10:56] Beta-Cluster-Infrastructure: Disk full on deployment-jobrunner02 - https://phabricator.wikimedia.org/T173571#3535182 (Krenair) alright, managed to kill the dewikivoyage jobs: ``` krenair@deployment-tin:/srv/mediawiki-staging$ redis-cli -a $password_here -h deployment-redis01 deployment-redis01:6379> keys dew...
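A minimal sketch of the redis-cli approach quoted in T173571 above; the password variable and the key pattern are placeholders, since the actual keys are truncated in the task comment:

```
# Sketch only: connect to the beta-cluster job queue redis and drop the stuck jobs.
# $REDIS_PASSWORD and the dewikivoyage* pattern are illustrative, not from the log.
krenair@deployment-tin:~$ redis-cli -a "$REDIS_PASSWORD" -h deployment-redis01
deployment-redis01:6379> KEYS dewikivoyage*
deployment-redis01:6379> DEL <each key returned above>
```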
[15:26:18] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:38:20] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3535193 (Krenair)
[15:38:22] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-pdf01 - https://phabricator.wikimedia.org/T173552#3535190 (Krenair) Open>Resolved a:Krenair Fixed it by adding some stuff to puppet based on the role used in prod: ```profile::redis::master::instances: - 6379 profile::redis::mas...
[15:42:27] RECOVERY - Puppet errors on deployment-pdf01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:49:00] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535198 (Krenair)
[15:50:52] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535211 (Krenair) ```Aug 19 15:50:16 deployment-imagescaler02 systemd[1]: Starting The Salt Minion... Aug 19 15:50:16 deployment-imagescaler02 systemd[1]: Started The Salt Min...
[15:51:18] !!!
[15:51:28] https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:RecentChanges
[15:52:18] TabbyCat, ack
[15:53:20] Krenair: on it as well
[15:57:46] TabbyCat, think we got it
[15:59:57] mostly yep
[16:00:03] f- spammers
[16:01:16] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:39] did we start using debian stretch?
[16:05:31] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535232 (Krenair) am guessing this is md5 vs. sha256. also noticed this box is running Debian Stretch
[16:08:38] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535233 (Krenair) It works if I set `hash_type: md5` in /etc/salt/minion, guess I need to puppetise this
[16:10:13] globally (ahem) locked on beta cluster as well
[16:13:12] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535234 (Krenair)
[16:13:21] ty TabbyCat
[16:13:28] I probably should've done that instead of blocking
[16:13:29] oh well
[16:18:27] it's better blocking nonetheless since locking does not block the IPs, so they can go on socking
[16:20:15] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535249 (Krenair) For now I've just edited the individual instance's puppet config in horizon to use the sha256 fingerprint. @Andrew, we should think about which way we want t...
[16:21:41] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535252 (Krenair) a:Krenair
[16:27:59] hm, my old "ad04b266e4 hacks to fix puppet --krenair" is still sitting around on the puppetmaster :|
[16:37:24] Beta-Cluster-Infrastructure: deployment-logstash2 out of disk space - https://phabricator.wikimedia.org/T170521#3434408 (Krenair) ```alex@alex-laptop:~$ ssh deployment-logstash2 df -h /mnt Filesystem Size Used Avail Use% Mounted on /dev/mapper/vd-second--local--disk 139G 41G 92...
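For context on the T173628 fix discussed above, a minimal sketch of the /etc/salt/minion setting involved, assuming the stock YAML minion config format; the comment reflects the task's "md5 vs. sha256" diagnosis rather than anything beyond it:

```
# /etc/salt/minion (sketch; deployment-imagescaler02 in T173628)
# The minion verifies the salt master's key fingerprint using this hash, so it has
# to match the fingerprint the puppet config hands out: either force md5 here, or
# keep the default and configure a sha256 fingerprint (the route taken via Horizon).
hash_type: md5
```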
[16:39:38] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:43:50] Beta-Cluster-Infrastructure, Release-Engineering-Team (Backlog), Patch-For-Review: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#3535261 (Krenair) ```root@deployment-puppetmaster02:/var/lib/git/operations/puppet# git log --oneline origin/production..HEAD dd8fcf7762...
[17:41:27] * TabbyCat nice to see Krenair back on duty :)
[17:42:32] -> pm
[17:42:48] something very strange going on with captcha storage
[17:42:52] on deployment-mediawiki04:
[17:42:59] hphpd> =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:42:59] =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:42:59] false
[17:43:02] on deployment-tin:
[17:43:16] hphpd> =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:43:16] =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:43:16] Array
[17:43:16] (
[17:43:17] [salt] => "6aee9c0f"
[17:43:18] [hash] => "d9d8c7858e9805f6"
[17:43:19] [viewed] => false
[17:43:22] [index] => "1092669001"
[17:43:24] )
[19:07:49] Yippee, build fixed!
[19:07:50] Project selenium-MinervaNeue » chrome,beta,Linux,BrowserTests build #84: FIXED in 18 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/84/
[19:17:10] Yippee, build fixed!
[19:17:11] Project selenium-MinervaNeue » firefox,beta,Linux,BrowserTests build #84: FIXED in 28 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/84/
[19:33:06] PROBLEM - Free space - all mounts on integration-slave-jessie-android is CRITICAL: CRITICAL: integration.integration-slave-jessie-android.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-android.diskspace.root.byte_percentfree (<100.00%)
[19:59:57] (CR) XZise: [C: +1] "In theory looks good, but I'm wondering if in the future it should run on the "commit-msg" hook instead. But it would need some rewrite to" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:08:41] (PS2) XZise: Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515
[20:54:09] (CR) Legoktm: "Originally I had tried to use commit-msg but I found that it would need to interact properly with git-review's own hook, and that some GUI" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:55:32] (CR) Legoktm: "Also I fixed the permissions so you and John now have +2 in this repo" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:57:41] (CR) Legoktm: [C: +2] Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515 (owner: XZise)
[20:58:16] (Merged) jenkins-bot: Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515 (owner: XZise)
[20:59:33] (CR) Legoktm: "Right that all makes sense. As a side-effect of writing this test case, it pointed out that we're counting line length differently in Pyth" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372418 (owner: Legoktm)
[21:35:15] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[21:38:15] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0]
[21:43:20] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:53:18] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:55:32] PROBLEM - Puppet errors on deployment-imagescaler02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:18:21] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:22:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[22:30:31] RECOVERY - Puppet errors on deployment-imagescaler02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:30:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0]
[22:33:20] RECOVERY - Puppet errors on deployment-kafka01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:49:56] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[23:24:54] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]