[00:34:20] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[01:05:36] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:14:20] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:21:53] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:24:09] PROBLEM - Puppet staleness on deployment-jobrunner02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[02:01:53] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[02:10:36] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:48:10] Gerrit, Release-Engineering-Team (Next), Scap, ORES, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3534823 (greg)
[03:22:53] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[04:02:52] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[04:16:57] Yippee, build fixed!
[04:16:58] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #490: FIXED in 20 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/490/
[04:31:35] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[04:34:59] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[04:45:48] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[05:10:00] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:11:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:35:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[06:02:38] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:47:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:54:48] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:55:16] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:21:43] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:23:01] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:35] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39217 bytes in 1.061 second response time
[07:29:48] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0]
[07:30:16] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:32:51] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 50495 bytes in 0.847 second response time
[08:03:35] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[08:12:20] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[08:50:51] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[08:52:24] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:35] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:35:59] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[10:10:58] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:44:06] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:24:08] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[11:25:46] (PS1) AnotherLadsgroup: Enable jenkins (without any tests) on wheels repo of ORES [integration/config] - https://gerrit.wikimedia.org/r/372759 (https://phabricator.wikimedia.org/T173251)
[11:26:08] Continuous-Integration-Config, Scoring-platform-team, Easy, Patch-For-Review, User-Ladsgroup: Have CI merge research/ores/wheels changes - https://phabricator.wikimedia.org/T173251#3535037 (Ladsgroup)
[13:23:51] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[14:03:54] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[14:04:33] PROBLEM - Puppet errors on integration-r-lang-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:32:01] Beta-Cluster-Infrastructure: Disk full on deployment-jobrunner02 - https://phabricator.wikimedia.org/T173571#3535162 (Krenair) I moved the file to /srv/jobrunner-old-20170819.log (11GB) and restarted hhvm to prevent apache just responding with 503s (HHVM had failed to load earlier for some reason) However it...
[14:37:48] RECOVERY - Free space - all mounts on deployment-jobrunner02 is OK: OK: All targets OK
[14:39:34] RECOVERY - Puppet errors on integration-r-lang-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:04:08] RECOVERY - Puppet staleness on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[15:10:56] Beta-Cluster-Infrastructure: Disk full on deployment-jobrunner02 - https://phabricator.wikimedia.org/T173571#3535182 (Krenair) alright, managed to kill the dewikivoyage jobs: ``` krenair@deployment-tin:/srv/mediawiki-staging$ redis-cli -a $password_here -h deployment-redis01 deployment-redis01:6379> keys dew...
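A minimal sketch of the redis-cli approach quoted in T173571 above; the password variable and the key pattern are placeholders, since the actual keys are truncated in the task comment:

```
# Sketch only: connect to the beta-cluster job queue redis and drop the stuck jobs.
# $REDIS_PASSWORD and the dewikivoyage* pattern are illustrative, not from the log.
krenair@deployment-tin:~$ redis-cli -a "$REDIS_PASSWORD" -h deployment-redis01
deployment-redis01:6379> KEYS dewikivoyage*
deployment-redis01:6379> DEL <each key returned above>
```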
[15:26:18] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:38:20] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3535193 (Krenair)
[15:38:22] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-pdf01 - https://phabricator.wikimedia.org/T173552#3535190 (Krenair) Open>Resolved a:Krenair Fixed it by adding some stuff to puppet based on the role used in prod: ```profile::redis::master::instances: - 6379 profile::redis::mas...
[15:42:27] RECOVERY - Puppet errors on deployment-pdf01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:49:00] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535198 (Krenair)
[15:50:52] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535211 (Krenair) ```Aug 19 15:50:16 deployment-imagescaler02 systemd[1]: Starting The Salt Minion... Aug 19 15:50:16 deployment-imagescaler02 systemd[1]: Started The Salt Min...
[15:51:18] !!!
[15:51:28] https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:RecentChanges
[15:52:18] TabbyCat, ack
[15:53:20] Krenair: on it as well
[15:57:46] TabbyCat, think we got it
[15:59:57] mostly yep
[16:00:03] f- spammers
[16:01:16] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:39] did we start using debian stretch?
[16:05:31] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535232 (Krenair) am guessing this is md5 vs. sha256. also noticed this box is running Debian Stretch
[16:08:38] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535233 (Krenair) It works if I set `hash_type: md5` in /etc/salt/minion, guess I need to puppetise this
[16:10:13] globally (ahem) locked on beta cluster as well
[16:13:12] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535234 (Krenair)
[16:13:21] ty TabbyCat
[16:13:28] I probably should've done that instead of blocking
[16:13:29] oh well
[16:18:27] it's better blocking nonetheless since locking does not block the IPs, so they can go on socking
[16:20:15] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535249 (Krenair) For now I've just edited the individual instance's puppet config in horizon to use the sha256 fingerprint. @Andrew, we should think about which way we want t...
[16:21:41] Beta-Cluster-Infrastructure, Salt: deployment-imagescaler02 is not responding to salt - https://phabricator.wikimedia.org/T173628#3535252 (Krenair) a:Krenair
[16:27:59] hm, my old "ad04b266e4 hacks to fix puppet --krenair" is still sitting around on the puppetmaster :|
[16:37:24] Beta-Cluster-Infrastructure: deployment-logstash2 out of disk space - https://phabricator.wikimedia.org/T170521#3434408 (Krenair) ```alex@alex-laptop:~$ ssh deployment-logstash2 df -h /mnt Filesystem Size Used Avail Use% Mounted on /dev/mapper/vd-second--local--disk 139G 41G 92...
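For context on the T173628 fix discussed above, a minimal sketch of the /etc/salt/minion setting involved, assuming the stock YAML minion config format; the comment reflects the task's "md5 vs. sha256" diagnosis rather than anything beyond it:

```
# /etc/salt/minion (sketch; deployment-imagescaler02 in T173628)
# The minion verifies the salt master's key fingerprint using this hash, so it has
# to match the fingerprint the puppet config hands out: either force md5 here, or
# keep the default and configure a sha256 fingerprint (the route taken via Horizon).
hash_type: md5
```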
[16:39:38] RECOVERY - Puppet errors on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:43:50] Beta-Cluster-Infrastructure, Release-Engineering-Team (Backlog), Patch-For-Review: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#3535261 (Krenair) ```root@deployment-puppetmaster02:/var/lib/git/operations/puppet# git log --oneline origin/production..HEAD dd8fcf7762...
[17:41:27] * TabbyCat nice to see Krenair back on duty :)
[17:42:32] -> pm
[17:42:48] something very strange going on with captcha storage
[17:42:52] on deployment-mediawiki04:
[17:42:59] hphpd> =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:42:59] =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:42:59] false
[17:43:02] on deployment-tin:
[17:43:16] hphpd> =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:43:16] =ObjectCache::getMainStashInstance()->get( "enwiki:captcha:1092669001" );
[17:43:16] Array
[17:43:16] (
[17:43:17] [salt] => "6aee9c0f"
[17:43:18] [hash] => "d9d8c7858e9805f6"
[17:43:19] [viewed] => false
[17:43:22] [index] => "1092669001"
[17:43:24] )
[19:07:49] Yippee, build fixed!
[19:07:50] Project selenium-MinervaNeue » chrome,beta,Linux,BrowserTests build #84: FIXED in 18 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/84/
[19:17:10] Yippee, build fixed!
[19:17:11] Project selenium-MinervaNeue » firefox,beta,Linux,BrowserTests build #84: FIXED in 28 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/84/
[19:33:06] PROBLEM - Free space - all mounts on integration-slave-jessie-android is CRITICAL: CRITICAL: integration.integration-slave-jessie-android.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-android.diskspace.root.byte_percentfree (<100.00%)
[19:59:57] (CR) XZise: [C: +1] "In theory looks good, but I'm wondering if in the future it should run on the "commit-msg" hook instead. But it would need some rewrite to" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:08:41] (PS2) XZise: Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515
[20:54:09] (CR) Legoktm: "Originally I had tried to use commit-msg but I found that it would need to interact properly with git-review's own hook, and that some GUI" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:55:32] (CR) Legoktm: "Also I fixed the permissions so you and John now have +2 in this repo" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372424 (owner: Legoktm)
[20:57:41] (CR) Legoktm: [C: +2] Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515 (owner: XZise)
[20:58:16] (Merged) jenkins-bot: Don't include BAD_FOOTERS in FOOTERS [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/368515 (owner: XZise)
[20:59:33] (CR) Legoktm: "Right that all makes sense. As a side-effect of writing this test case, it pointed out that we're counting line length differently in Pyth" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/372418 (owner: Legoktm)
[21:35:15] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[21:38:15] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0]
[21:43:20] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:53:18] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:55:32] PROBLEM - Puppet errors on deployment-imagescaler02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:18:21] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:22:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[22:30:31] RECOVERY - Puppet errors on deployment-imagescaler02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:30:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0]
[22:33:20] RECOVERY - Puppet errors on deployment-kafka01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:49:56] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[23:24:54] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]