[00:00:23] Yeah, I guess that means we need to have two menu items
[00:00:34] But not sure what a good link label would be
[00:00:48] Or maybe just linked from the main Coverage page
[00:00:52] without being in the main menu
[00:00:55] I was thinking of maybe a submenu just for coverage
[00:00:56] yep
[00:00:58] that ;)
[00:01:49] huh, the first two nav links on https://doc.wikimedia.org/cover/extensions/ are broken
[00:03:33] Project beta-code-update-eqiad build #188154: STILL FAILING in 32 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/188154/
[00:04:44] Release-Engineering-Team (Kanban), Operations, Release Pipeline: Package/upload service-checker for Debian stretch - https://phabricator.wikimedia.org/T184224#3876861 (dduvall) p:Triage>Normal
[00:09:50] legoktm: Right. The Page thing expects a single depth of pages
[00:10:06] so it allows linking from /foo/ to /bar/ as ../bar but doesn't deal with /foo/bar/ to /baz/
[00:10:34] The main reason this complexity exists is that I want to be able to easily test this thing locally without it having its own doc root, e.g. from localhost/dev/integration-docroot/---- etc.
[00:10:42] me too...
[00:10:54] maybe cover-extensions/ ?
[00:13:45] Yippee, build fixed!
[00:13:46] Project beta-code-update-eqiad build #188155: FIXED in 45 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/188155/
[00:26:15] RECOVERY - Puppet errors on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:29:06] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:35:04] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:36:53] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:57:46] PROBLEM - Puppet errors on deployment-kafka04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:59:46] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[01:02:05] PROBLEM - Puppet errors on deployment-sca03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[01:02:47] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[01:05:00] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:07:48] PROBLEM - Puppet errors on deployment-changeprop is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:07:55] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[01:17:59] MediaWiki-Codesniffer: Undefined index: scope_opener - https://phabricator.wikimedia.org/T184232#3877014 (Reedy)
[01:28:14] RECOVERY - Puppet staleness on deployment-ms-be03 is OK: OK: Less than 1.00% above the threshold [3600.0]
[01:34:34] PROBLEM - Free space - all mounts on deployment-mx is CRITICAL: CRITICAL: deployment-prep.deployment-mx.diskspace._var_log.byte_percentfree (<100.00%)
[01:34:46] RECOVERY - Puppet errors on deployment-kafka01 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:37:04] RECOVERY - Puppet errors on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:37:44] RECOVERY - Puppet errors on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0]
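A sketch of the relative-link problem discussed at [00:09:50] above. The real helper is PHP in integration/docroot; this is just the idea in Python, with the function name and paths hypothetical. A hard-coded single `../` works only for one level of depth, which is why nested pages like /cover/extensions/ broke and the directory was later renamed to cover-extensions/:

```python
import posixpath

def relative_href(from_page: str, to_page: str) -> str:
    """Hypothetical helper: build a relative link between two doc pages,
    so the site also works from a sub-path such as
    localhost/dev/integration-docroot/ without its own doc root."""
    return posixpath.relpath(to_page, start=posixpath.dirname(from_page))

# One level of depth matches a hard-coded "../" prefix:
print(relative_href("/foo/index.html", "/bar/index.html"))      # ../bar/index.html
# ...but a nested page like /foo/bar/ needs two levels to reach /baz/:
print(relative_href("/foo/bar/index.html", "/baz/index.html"))  # ../../baz/index.html
```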
[01:37:45] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:37:54] Beta-Cluster-Infrastructure, Puppet, Tracking: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (Krenair) p:Triage>Normal
[01:38:32] Beta-Cluster-Infrastructure, Puppet: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (Krenair)
[01:39:07] RECOVERY - Puppet staleness on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [3600.0]
[01:39:40] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-cache-text04 due to varnishkafka issues - https://phabricator.wikimedia.org/T184234#3877051 (Krenair) p:Triage>Normal
[01:39:59] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:40:01] RECOVERY - Puppet staleness on deployment-ms-be04 is OK: OK: Less than 1.00% above the threshold [3600.0]
[01:42:30] Beta-Cluster-Infrastructure, Analytics, Puppet: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877064 (Krenair) p:Triage>Normal
[01:43:03] Beta-Cluster-Infrastructure, Analytics, Puppet: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877064 (Krenair) ```krenair@deployment-kafka03:~$ sudo puppet agent -tv Warning: Setting configtimeout is deprecated. (at /usr/lib/ruby/vendor_ruby/pup...
[01:46:45] Beta-Cluster-Infrastructure, Analytics, Puppet: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877076 (Krenair) 2.7G /var/log/daemon.log 2.6G /var/log/daemon.log.1 221M /var/log/kafka/controller.log 257M /var/log/kafka/kafka-mirror-main-deployment-pr...
[01:47:47] RECOVERY - Puppet errors on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[01:47:55] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:49:19] hmph. zotero01 was having memory issues earlier. not anymore?
[01:51:12] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3877079 (Krenair) p:Triage>Normal
[01:53:10] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-eventlogging04 due to missing repo on deployment-tin? - https://phabricator.wikimedia.org/T184238#3877100 (Krenair) p:Triage>Normal
[01:54:52] Beta-Cluster-Infrastructure, Analytics, Puppet: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877117 (Krenair) Repeat of T174742?
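The disk triage pasted at [01:46:45] above was done with du(1); a minimal Python sketch of the same idea, for reference (the root path matches the kafka03 case, the output format is an assumption):

```python
import os

def largest_files(root: str, top: int = 10) -> None:
    """Print the biggest files under a directory, mirroring the du-style
    listing used to find the oversized daemon.log on deployment-kafka03."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:  # file vanished or unreadable; skip it
                continue
    for size, path in sorted(sizes, reverse=True)[:top]:
        print(f"{size / 1024**3:5.1f}G  {path}")

largest_files("/var/log")
```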
[02:12:57] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3877128 (Krenair) p:Triage>Normal
[02:15:05] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-kafka-jump-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3877141 (Krenair) p:Triage>Normal
[02:17:43] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-trending01 due to removal of role - https://phabricator.wikimedia.org/T184241#3877153 (Krenair) p:Triage>Normal
[02:21:43] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-netbox, looks like it thinks it's a prod box - https://phabricator.wikimedia.org/T184242#3877167 (Krenair) p:Triage>Normal
[02:23:03] Beta-Cluster-Infrastructure: various .beta.wmflabs.org domains use an invalid ssl certificate - https://phabricator.wikimedia.org/T182927#3877181 (Krenair) https://community.letsencrypt.org/t/staging-endpoint-for-acme-v2/49605
[02:29:57] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-redis0[12] due to systemd on trusty - https://phabricator.wikimedia.org/T184243#3877182 (Krenair) p:Triage>Normal
[02:31:00] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#3877193 (Krenair) p:Triage>Normal
[02:48:07] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-cache-text04 due to varnishkafka issues - https://phabricator.wikimedia.org/T184234#3877214 (Krenair) hiera part: ```diff --git a/hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml b/hieradata/labs/deployment-prep/host/deploym...
[03:07:38] Project mwext-phpunit-coverage-publish build #31: FAILURE in 7 min 38 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/31/
[03:44:54] Continuous-Integration-Config, MinusX, Google-Code-in-2017, MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Patch-For-Review: Add MinusX to MediaWiki extensions and PHP library repos - https://phabricator.wikimedia.org/T175794#3877249 (Ryan10145)
[04:49:42] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[05:14:44] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0]
[05:15:18] (PS1) Legoktm: Safely handle incomplete clover.xml files [integration/docroot] - https://gerrit.wikimedia.org/r/402174
[05:16:40] (CR) Legoktm: [C: +2] Safely handle incomplete clover.xml files [integration/docroot] - https://gerrit.wikimedia.org/r/402174 (owner: Legoktm)
[05:17:02] (Merged) jenkins-bot: Safely handle incomplete clover.xml files [integration/docroot] - https://gerrit.wikimedia.org/r/402174 (owner: Legoktm)
[05:17:09] (CR) jenkins-bot: Safely handle incomplete clover.xml files [integration/docroot] - https://gerrit.wikimedia.org/r/402174 (owner: Legoktm)
[05:30:13] (PS1) Legoktm: Generate clover.xml files for tox-py27-coverage-publish [integration/config] - https://gerrit.wikimedia.org/r/402175 (https://phabricator.wikimedia.org/T179054)
[05:30:55] (CR) Legoktm: [C: +2] Generate clover.xml files for tox-py27-coverage-publish [integration/config] - https://gerrit.wikimedia.org/r/402175 (https://phabricator.wikimedia.org/T179054) (owner: Legoktm)
[05:32:11] (Merged) jenkins-bot: Generate clover.xml files for tox-py27-coverage-publish [integration/config] - https://gerrit.wikimedia.org/r/402175 (https://phabricator.wikimedia.org/T179054) (owner: Legoktm)
[05:34:05] Continuous-Integration-Infrastructure, MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017), Patch-For-Review: Create pretty landing page at https://doc.wikimedia.org/cover/ - https://phabricator.wikimedia.org/T146970#3877367 (Legoktm)
[05:34:07] Continuous-Integration-Config, Wiki-Loves-Monuments-Database, Patch-For-Review: Generate clover.xml for labs/tools/heritage - https://phabricator.wikimedia.org/T179054#3877365 (Legoktm) Open>Resolved https://doc.wikimedia.org/cover/ shows labs-tools-heritage at 40%, which matches https://doc....
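The "Safely handle incomplete clover.xml files" change above (402174) is PHP in integration/docroot; here is a minimal Python sketch of the same defensive idea, since an aborted PHPUnit run can leave a truncated clover.xml behind. The function name is hypothetical, and using statement counts alone is a simplification of how the report weights coverage:

```python
import xml.etree.ElementTree as ET

def clover_percentage(path: str):
    """Return line coverage from a clover.xml, or None if the file is
    missing, truncated, or carries no metrics at all."""
    try:
        metrics = ET.parse(path).find("./project/metrics")
    except (OSError, ET.ParseError):
        return None  # missing file or half-written XML
    if metrics is None:
        return None  # parsed fine, but no <metrics> element
    total = int(metrics.get("statements", 0))
    covered = int(metrics.get("coveredstatements", 0))
    return 100.0 * covered / total if total else None
```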
[05:47:07] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<22.22%)
[06:42:24] Continuous-Integration-Infrastructure, Operations, Traffic: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3877424 (Legoktm)
[07:07:09] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:17:09] (PS1) Legoktm: Rename to cover-extensions/ to avoid issues with subdirectories [integration/docroot] - https://gerrit.wikimedia.org/r/402184
[07:48:42] (PS1) Legoktm: Publish extension coverage to cover-extensions/ [integration/config] - https://gerrit.wikimedia.org/r/402186
[07:50:05] (PS2) Legoktm: Publish extension coverage to cover-extensions/ [integration/config] - https://gerrit.wikimedia.org/r/402186
[07:56:44] (CR) Legoktm: [C: +2] Rename to cover-extensions/ to avoid issues with subdirectories [integration/docroot] - https://gerrit.wikimedia.org/r/402184 (owner: Legoktm)
[07:56:58] (CR) Legoktm: [C: +2] Publish extension coverage to cover-extensions/ [integration/config] - https://gerrit.wikimedia.org/r/402186 (owner: Legoktm)
[07:57:10] (Merged) jenkins-bot: Rename to cover-extensions/ to avoid issues with subdirectories [integration/docroot] - https://gerrit.wikimedia.org/r/402184 (owner: Legoktm)
[07:57:16] (CR) jenkins-bot: Rename to cover-extensions/ to avoid issues with subdirectories [integration/docroot] - https://gerrit.wikimedia.org/r/402184 (owner: Legoktm)
[07:58:14] (Merged) jenkins-bot: Publish extension coverage to cover-extensions/ [integration/config] - https://gerrit.wikimedia.org/r/402186 (owner: Legoktm)
[07:59:42] (PS1) Legoktm: Install extension dependencies for coverage job [integration/config] - https://gerrit.wikimedia.org/r/402189
[08:00:11] (CR) Legoktm: [C: +2] Install extension dependencies for coverage job [integration/config] - https://gerrit.wikimedia.org/r/402189 (owner: Legoktm)
[08:01:20] (Merged) jenkins-bot: Install extension dependencies for coverage job [integration/config] - https://gerrit.wikimedia.org/r/402189 (owner: Legoktm)
[08:01:38] Yippee, build fixed!
[08:01:39] Project mwext-phpunit-coverage-publish build #32: FIXED in 3 min 1 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/32/
[08:14:02] Continuous-Integration-Config, Operations: tox 2.5.0 on phabricator-jessie-diffs fails with ERROR: Commands not specified - https://phabricator.wikimedia.org/T184060#3877476 (hashar) The revert commit for 2.7.0 https://github.com/tox-dev/tox/issues/454 which looks like a hack when one can achieve exactly...
[08:18:44] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Discovery, Wikimedia-Portals, and 2 others: Update Portals page on Beta to reflect head of master branch - https://phabricator.wikimedia.org/T181799#3877478 (hashar) Open>Resolved
[08:21:17] (CR) Hashar: [C: +2] "Ah indeed they are non voting :} Sorry for all the delay regarding the BlueSpice extensions." [integration/config] - https://gerrit.wikimedia.org/r/394578 (owner: Robert Vogel)
[08:21:24] (CR) jerkins-bot: [V: -1] Changed settings for BlueSpice-repos [integration/config] - https://gerrit.wikimedia.org/r/394578 (owner: Robert Vogel)
[08:25:35] (PS7) Hashar: Changed settings for BlueSpice-repos [integration/config] - https://gerrit.wikimedia.org/r/394578 (owner: Robert Vogel)
[08:26:05] (CR) Hashar: [C: +2] "Rebased. Some will probably fail but can be fixed later on :-}" [integration/config] - https://gerrit.wikimedia.org/r/394578 (owner: Robert Vogel)
[08:28:13] (Merged) jenkins-bot: Changed settings for BlueSpice-repos [integration/config] - https://gerrit.wikimedia.org/r/394578 (owner: Robert Vogel)
[08:33:21] (PS2) Hashar: Add BlueSpicePageAccess extension to zuul/layout.yaml [integration/config] - https://gerrit.wikimedia.org/r/401627 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:35:11] (PS4) Hashar: Add BlueSpiceNamespaceCSS extension to zuul/layout.yaml [integration/config] - https://gerrit.wikimedia.org/r/401628 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:36:33] (CR) Hashar: [C: +2] "Rebased and moved the definition to have the extension in alphabetical order." [integration/config] - https://gerrit.wikimedia.org/r/401627 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:36:37] (CR) Hashar: [C: +2] "Rebased and moved the definition to have the extension in alphabetical order." [integration/config] - https://gerrit.wikimedia.org/r/401628 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:37:02] Project mwext-phpunit-coverage-publish build #38: FAILURE in 34 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/38/
[08:37:44] (Merged) jenkins-bot: Add BlueSpicePageAccess extension to zuul/layout.yaml [integration/config] - https://gerrit.wikimedia.org/r/401627 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:38:06] (Merged) jenkins-bot: Add BlueSpiceNamespaceCSS extension to zuul/layout.yaml [integration/config] - https://gerrit.wikimedia.org/r/401628 (https://phabricator.wikimedia.org/T183674) (owner: Divadsn)
[08:39:48] ahhhhh
[08:39:52] not skins again :(
[08:40:09] Yippee, build fixed!
[08:40:09] Project mwext-phpunit-coverage-publish build #39: FIXED in 3 min 6 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/39/
[08:50:07] Continuous-Integration-Infrastructure, RemexHtml, Patch-For-Review: Figure out how to speed up RemexHtml coverage runs - https://phabricator.wikimedia.org/T179055#3877494 (Legoktm) ...and with PHP 7, it takes 7 minutes. Wonderful. https://integration.wikimedia.org/ci/job/remexhtml-phpunit-coverage-pu...
[08:53:27] Project mwext-phpunit-coverage-publish build #42: FAILURE in 5.6 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/42/
[08:54:50] ^ I'll fix the extension coverage job tomorrow
[08:55:11] Continuous-Integration-Config, Operations: tox 2.5.0 on phabricator-jessie-diffs fails with ERROR: Commands not specified - https://phabricator.wikimedia.org/T184060#3877497 (fgiunchedi) Open>Invalid Fair enough! Thanks @hashar!
[08:57:06] Yippee, build fixed!
[08:57:07] Project mwext-phpunit-coverage-publish build #43: FIXED in 3 min 38 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/43/
[09:12:48] PROBLEM - Puppet errors on deployment-snapshot01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[09:17:41] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3877516 (mobrovac)
[09:17:43] Beta-Cluster-Infrastructure, Puppet, Services (done): Puppet broken on deployment-trending01 due to removal of role - https://phabricator.wikimedia.org/T184241#3877513 (mobrovac) Open>Resolved The instance has been deleted and its puppet prefix and web proxy cleaned up.
[09:20:38] PROBLEM - Host deployment-trending01 is DOWN: CRITICAL - Host Unreachable (10.68.18.186)
[10:29:55] PROBLEM - Puppet staleness on deployment-kafka03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0]
[10:32:01] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:02:01] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:04:51] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-kafka-jump-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3877141 (Paladox) Probably want to include the OS too, like jessie or stretch?
[11:07:06] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3877128 (Paladox) Maybe stretch is pointing to an o...
[11:10:48] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3877079 (Paladox) I guess we can apply this https://github.com/wikimedia/mediawiki-vagrant/commit/ac6d19df598c75d97b635b026763ae7fd96f5970 fix at /...
[11:30:32] Continuous-Integration-Config, Wiki-Loves-Monuments-Database, Patch-For-Review: Generate clover.xml for labs/tools/heritage - https://phabricator.wikimedia.org/T179054#3877791 (Lokal_Profil) Thanks! Possibly a new task: could we also generate coverage for the PHP components (the API), and if so is it...
[13:07:30] Beta-Cluster-Infrastructure, Puppet: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (Paladox) This is probably because puppet has been broken on this host for a long while now. It probably needs to be recreated or deleted. It's been disconnected from getting any changes...
[13:12:43] Release-Engineering-Team (Kanban), Scap, Wikimedia-Incident: Investigate deployment that caused high error-rate but wasn't prevented by Scap - https://phabricator.wikimedia.org/T183952#3878049 (zeljkofilipin) Scap did fail during deployment. Since the commit that caused the failure was already merged...
[13:12:52] Continuous-Integration-Infrastructure: integration.integration-slave-jessie-1001 disk space full - https://phabricator.wikimedia.org/T184269#3878052 (Paladox)
[14:10:43] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[14:19:39] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878311 (MarcoAurelio)
[14:23:26] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878331 (MarcoAurelio) https://logstash-beta.wmflabs.org/goto/3da590c69d2896cf4d4cd227616fcd29 is one of them, but you should check the...
[14:25:24] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878311 (awight) @MarcoAurelio Thanks for the report! Our celery worker died three days ago, probably due to out-of-memory. It's not t...
[14:26:55] !log restarted celery-ores-worker on deployment-sca03
[14:26:59] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878342 (awight)
[14:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:28:12] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878346 (MarcoAurelio) Dear @awight; thanks for your quick response. Yesterday @Krenair was discussing at -releng that there were a numb...
[14:31:14] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878311 (Halfak) It looks like we might need more memory on sca03 (or whatever beta cluster node we're deploying to). Maybe it's time t...
[14:37:43] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878361 (Halfak) Alternatively, we could also reduce the # of workers from 8 to 4. I think we could still handle beta-capacity with th...
[14:37:46] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878362 (awight) Looking at /srv/log/ores/app.log, we've been down for at least 2 weeks. Any useful evidence has been rotated out of lo...
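Halfak's suggestion at [14:37:43] above, cutting the beta ORES Celery pool from 8 to 4 workers, would look roughly like this. A hypothetical inline sketch assuming Celery 4 setting names; the real ORES deployment drives this through its own configuration files, not inline Python:

```python
from celery import Celery

app = Celery("ores")
app.conf.worker_concurrency = 4          # was 8; halve to fit beta's memory
app.conf.worker_max_tasks_per_child = 100  # recycle workers to cap slow leaks
```

Fewer concurrent workers directly caps peak memory, which is the failure mode described above (the worker dying of out-of-memory and staying dead unnoticed).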
[15:00:11] Project mwext-phpunit-coverage-publish build #46: FAILURE in 11 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/46/
[15:12:32] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:15:03] PROBLEM - Puppet errors on deployment-redis06 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:21:54] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Move beta cluster ORES to its own machine - https://phabricator.wikimedia.org/T184282#3878447 (awight)
[15:27:44] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:33:42] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:37:54] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team, Wikimedia-log-errors: Move beta cluster ORES to its own machine - https://phabricator.wikimedia.org/T184282#3878472 (Halfak) FWIW, our staging machine for our CloudVPS install for ORES is 16GB and usually runs with 9.2GB free. It has 8 ce...
[15:41:02] MediaWiki-Codesniffer: Undefined index: scope_opener - https://phabricator.wikimedia.org/T184232#3878475 (Umherirrender)
[15:41:04] MediaWiki-Codesniffer: Undefined index: scope_opener in IfElseStructureSniff - https://phabricator.wikimedia.org/T183828#3878478 (Umherirrender)
[15:47:30] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:37] Yippee, build fixed!
[15:57:38] Project mwext-phpunit-coverage-publish build #47: FIXED in 31 min: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/47/
[16:04:56] RECOVERY - Puppet staleness on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [3600.0]
[16:16:38] (PS1) Umherirrender: Add BlueSpice extensions [integration/config] - https://gerrit.wikimedia.org/r/402374 (https://phabricator.wikimedia.org/T130811)
[16:22:02] Gerrit, Release-Engineering-Team (Kanban), Regression, Upstream: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640#3878553 (Paladox) See also https://groups.google.com/forum/m/#!topic/repo-discuss/rP3DdKXxHbI This problem may not have been fully fixed in 2.14 but...
[16:22:37] (PS2) Umherirrender: Add BlueSpice extensions [integration/config] - https://gerrit.wikimedia.org/r/402374 (https://phabricator.wikimedia.org/T130811)
[16:23:01] (CR) Umherirrender: "Removed BlueSpiceRSSFeeder, because it is an empty repo (waiting for initial commit)" [integration/config] - https://gerrit.wikimedia.org/r/402374 (https://phabricator.wikimedia.org/T130811) (owner: Umherirrender)
[16:32:05] PROBLEM - Puppet errors on deployment-kafka03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:12:57] PROBLEM - Puppet errors on deployment-memc06 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:47:57] RECOVERY - Puppet errors on deployment-memc06 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:10:52] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:39:22] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:45:51] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:53:38] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878841 (mobrovac) Resolved>Open This is still an issue even though the params are now quoted: ```lines=10 18:50:25 Started deploy [mathoid/deploy@c9957c...
[18:54:29] PROBLEM - Free space - all mounts on deployment-kafka03 is CRITICAL: CRITICAL: deployment-prep.deployment-kafka03.diskspace.root.byte_percentfree (<100.00%)
[18:55:28] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878848 (mmodell) @mobrovac: beta is where proper testing takes place.
[18:56:31] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878849 (mmodell) I'm working on reverting the problematic change. Just give me a few more minutes.
[18:58:20] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878867 (mobrovac) >>! In T184176#3878848, @mmodell wrote: > @mobrovac: beta is where proper testing takes place. If you define //testing// as //not working even...
[19:10:56] (PS1) Umherirrender: Archive mediawiki/extensions/DataTypes [integration/config] - https://gerrit.wikimedia.org/r/402412
[19:31:28] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878921 (mmodell) Ok, I've installed 3.7.5 on deployment-tin; this should resolve the issue with tagging.
[19:35:38] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:35:58] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[19:36:00] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:36:36] MediaWiki-Codesniffer: Undefined index: scope_opener in IfElseStructureSniff - https://phabricator.wikimedia.org/T183828#3878926 (Umherirrender) p:Triage>Normal a:Umherirrender
[19:37:10] PROBLEM - Puppet errors on deployment-sca02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:37:39] (PS1) Umherirrender: Fix Undefined index: scope_opener in IfElseStructureSniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/402418 (https://phabricator.wikimedia.org/T183828)
[19:38:12] PROBLEM - Puppet errors on deployment-cassandra3-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:38:32] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[19:38:50] PROBLEM - Puppet errors on deployment-changeprop is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:38:54] PROBLEM - Puppet errors on deployment-cpjobqueue is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[19:39:06] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3878942 (mmodell) well now I need to force the right scap version on all of beta hosts... I'm gonna try to figure out how to use cumin to do that.
[19:40:35] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:41:51] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:41:59] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:44:18] PROBLEM - Puppet errors on deployment-cassandra3-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:44:48] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:46:00] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:48:33] PROBLEM - Puppet errors on deployment-imagescaler01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[19:49:03] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[19:49:25] PROBLEM - Puppet errors on deployment-mediawiki06 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:50:41] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:51:02] PROBLEM - Puppet errors on deployment-eventlog02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:57:51] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:59:29] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:00:05] PROBLEM - Puppet errors on deployment-pdfrender02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:00:35] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:01:56] PROBLEM - Puppet errors on deployment-mcs01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:03:04] PROBLEM - Puppet errors on deployment-sca03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:09:09] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban), Scap: Scap not working in Beta - https://phabricator.wikimedia.org/T184176#3879021 (mmodell) Ok I force-downgraded scap on deployment-prep with cumin, as follows: ``` sudo cumin 'O{project:deployment-prep}' 'dpkg-query --status scap && D...
[20:09:23] puppet errors should be fixed now
[20:09:44] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[20:10:10] PROBLEM - Host deployment-phab is DOWN: CRITICAL - Host Unreachable (10.68.17.67)
[20:10:34] shinken-wm: deployment-phab is deleted, so of course it's down
[20:13:27] twentyafterfour: it usually takes shinken 30m or so to grok a delete
[20:14:08] yeah, I just enjoy mocking the bots :-o
[20:14:40] they will have their revenge one day when they take my job :P
[20:16:52] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:18:29] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:21:00] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:22:00] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[20:23:33] RECOVERY - Puppet errors on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:24:02] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[20:24:17] RECOVERY - Puppet errors on deployment-cassandra3-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:24:50] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:29:29] RECOVERY - Puppet errors on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:30:41] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:31:03] RECOVERY - Puppet errors on deployment-eventlog02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:34:30] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:34:30] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:35:08] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:36:58] RECOVERY - Puppet errors on deployment-mcs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:37:52] RECOVERY - Puppet errors on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:38:04] RECOVERY - Puppet errors on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:41] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:43] plenty more puppet bugs open if anyone is interested: https://phabricator.wikimedia.org/T132259
[20:40:59] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:59] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:41:28] Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban): Various puppet issues in deployment-prep - https://phabricator.wikimedia.org/T180935#3773712 (Krenair) >>! In T180935#3872752, @hashar wrote: > As for puppet being broken on several instances, indeed we could use some new tasks. The reasons...
[20:42:10] RECOVERY - Puppet errors on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:43:12] RECOVERY - Puppet errors on deployment-cassandra3-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:43:48] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879068 (mmodell)
[20:43:50] Beta-Cluster-Infrastructure, Puppet: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3879065 (mmodell) Open>Resolved a:mmodell I deleted the instance
[20:43:56] RECOVERY - Puppet errors on deployment-cpjobqueue is OK: OK: Less than 1.00% above the threshold [0.0]
[20:44:30] cpjobqueue?
[20:44:52] * twentyafterfour shrugs
[20:47:01] change prop job queue
[20:47:29] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879086 (Krenair) Nope, it just plain doesn't exist...
[20:48:47] RECOVERY - Puppet errors on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[20:52:12] Gerrit, Release-Engineering-Team (Kanban), Regression, Upstream: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640#3879094 (demon) *eyeroll* It'll be fixed when they stop putting canonical data in a secondary index.
[20:53:11] Krenair: try apt-get update && apt-get install prometheus-nutcracker-exporter
[20:54:02] same error as expected
[20:54:17] what apt errors do you get when doing apt-get update, please?
[21:02:44] !log legoktm@contint1001:/srv/org/wikimedia/doc/cover$ sudo -u jenkins-slave rm -rf extensions
[21:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:10:28] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879147 (Krenair) Dug into this a bit more with som...
[21:10:32] (PS1) Legoktm: Cleanup skins before setting up extension coverage job [integration/config] - https://gerrit.wikimedia.org/r/402428
[21:10:34] (PS1) Legoktm: Only generate coverage information for master [integration/config] - https://gerrit.wikimedia.org/r/402429
[21:10:43] Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879148 (Paladox) It's due to the experimental comp...
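The "Only generate coverage information for master" change uploaded at [21:10:34] (402429) lives in the Jenkins job definitions; a minimal sketch of the guard it implies, assuming Zuul v2's ZUUL_BRANCH environment variable and a hypothetical placement at the top of the job's build step:

```python
import os
import sys

# Skip publishing when the triggering change is on a feature branch,
# e.g. MobileFrontend's "specialpages" branch noted in the review below.
if os.environ.get("ZUUL_BRANCH", "master") != "master":
    print("Skipping coverage: only the master branch is published.")
    sys.exit(0)  # succeed without publishing anything
```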
[21:20:38] RECOVERY - Puppet errors on deployment-mediawiki07 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:31:16] (CR) Legoktm: [C: +2] Cleanup skins before setting up extension coverage job [integration/config] - https://gerrit.wikimedia.org/r/402428 (owner: Legoktm)
[21:31:29] (CR) Legoktm: [C: +2] "Noticed MobileFrontend's "specialpages" branch." [integration/config] - https://gerrit.wikimedia.org/r/402429 (owner: Legoktm)
[21:32:32] (Merged) jenkins-bot: Cleanup skins before setting up extension coverage job [integration/config] - https://gerrit.wikimedia.org/r/402428 (owner: Legoktm)
[21:32:39] (Merged) jenkins-bot: Only generate coverage information for master [integration/config] - https://gerrit.wikimedia.org/r/402429 (owner: Legoktm)
[21:40:52] (CR) Hashar: [C: +2] Archive mediawiki/extensions/DataTypes [integration/config] - https://gerrit.wikimedia.org/r/402412 (owner: Umherirrender)
[21:41:30] (CR) Hashar: [C: +2] Add BlueSpice extensions [integration/config] - https://gerrit.wikimedia.org/r/402374 (https://phabricator.wikimedia.org/T130811) (owner: Umherirrender)
[21:41:56] (Merged) jenkins-bot: Archive mediawiki/extensions/DataTypes [integration/config] - https://gerrit.wikimedia.org/r/402412 (owner: Umherirrender)
[21:42:38] (Merged) jenkins-bot: Add BlueSpice extensions [integration/config] - https://gerrit.wikimedia.org/r/402374 (https://phabricator.wikimedia.org/T130811) (owner: Umherirrender)
[22:08:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1
[22:09:39] jenkins still doesn't merge ores wheels: https://gerrit.wikimedia.org/r/#/c/401822/ I need to find out why
[22:11:00] is it in integration/zuul
[22:11:06] cannot use the noop template.
[22:11:43] https://github.com/wikimedia/integration-config/blob/c26fa8dab814f0075497293691cda267e90423e5/zuul/layout.yaml#L8030
[22:11:45] Amir1
[22:11:49] you cannot use noop
[22:11:53] it doesn't self-merge.
[22:12:23] well, I saw it in other repos and it worked just fine
[22:12:32] search for noop tests
[22:12:46] Yeh, it adds jenkins to the repo, but doesn't self-merge.
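For context on the noop exchange above: a rough sketch of what such a zuul/layout.yaml entry looks like (keys and structure from memory, and the repo name is taken from the discussion, so treat both as assumptions rather than a copy of the real file). The noop template only reports a trivial pass in the test pipeline, so Jenkins gets a vote on the change, but nothing runs in the gate-and-submit pipeline and Zuul never merges it:

```yaml
# Hypothetical zuul/layout.yaml fragment (Zuul v2 style)
projects:
  - name: research/ores/wheels
    template:
      - name: noop   # votes Verified, but has no gate-and-submit jobs
```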
[22:14:01] somehow
[22:14:12] is picked up when a change depends on a ton of other changes
[22:18:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1
[22:18:49] Amir1, paladox: yeah, that is transient
[22:19:00] yep
[22:19:20] Beta-Cluster-Infrastructure, Upstream: Non-existent wiki urls on beta cluster gives Unsafe/Insecure connection message - https://phabricator.wikimedia.org/T173469#3879271 (Krenair)
[22:19:22] Beta-Cluster-Infrastructure: various .beta.wmflabs.org domains use an invalid ssl certificate - https://phabricator.wikimedia.org/T182927#3879274 (Krenair)
[22:19:35] someone has sent a lot of dependent changes to Gerrit
[22:19:52] which triggers a lot of merge checks, and it takes a bit to process them
[22:19:57] yeh
[22:20:12] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/services/parsoid+is:open :]
[22:20:40] it happens from time to time
[22:20:48] can be solved by adding a few more zuul-merger instances
[22:22:07] hashar: that will be fixed with https://github.com/openstack-infra/zuul/commit/773651ad7bf0fc6adba2357173ffb657d874478a
[22:22:52] paladox: na that is slightly different :D
[22:22:57] oh
[22:28:29] Beta-Cluster-Infrastructure, Patch-For-Review, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879304 (Krenair) Patch handl...
[22:28:59] Jan 05 22:23:25 deployment-videoscaler01 ferm[24188]: DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN
[22:29:00] really.
[22:29:13] * paladox gets that error too
[22:29:19] but for different hosts
[22:30:00] wonder if it's the AAAA thing
[22:33:13] why do I feel like I've fought with ferm and its backing DNS library before
[22:33:50] oh yeah, here we go: https://phabricator.wikimedia.org/T153468
[22:34:05] RECOVERY - Puppet errors on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:34:33] RECOVERY - Puppet errors on deployment-imagescaler02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:35:39] RECOVERY - Puppet errors on deployment-redis06 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:39:56] Beta-Cluster-Infrastructure, Patch-For-Review, Puppet: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879347 (Krenair) Actually th...
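The ferm NXDOMAIN at [22:28:59] above, and the "wonder if it's the AAAA thing" hunch that led to T153468, can be narrowed down by comparing A and AAAA lookups directly; a minimal diagnostic sketch (hostname taken from the log line, nothing else assumed):

```python
import socket

# Compare IPv4 (A) and IPv6 (AAAA) resolution for the host ferm choked on;
# a host with an A record but no AAAA record reproduces the symptom.
host = "deployment-prometheus01.deployment-prep.eqiad.wmflabs"
for family, record in ((socket.AF_INET, "A"), (socket.AF_INET6, "AAAA")):
    try:
        infos = socket.getaddrinfo(host, None, family)
        print(record, [info[4][0] for info in infos])
    except socket.gaierror as err:
        print(record, "lookup failed:", err)
```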
[23:05:15] !log restarted staged ores-wmflabs-deploy:8d252de
[23:05:25] Woops
[23:05:32] wrong channel :|
[23:07:01] Seems that stashbot has died, so maybe no one will know my mistake
[23:18:59] RECOVERY - Puppet errors on deployment-kafka-jumbo-2 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:21:00] RECOVERY - Puppet errors on deployment-kafka-jumbo-1 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:22:05] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879427 (Krenair)
[23:23:08] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192864 (demon) Is this really best as a tracking task, or should we add it to the deployment-prep workboard column? The task by its nature is always gonna be...
[23:24:34] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879432 (Krenair) It's fine with me if you want to move them all to a particular workboard column instead of a tracking task
[23:36:08] PROBLEM - Puppet errors on deployment-netbox is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:38:19] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1
[23:48:19] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1