[01:26:00] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<50.00%)
[04:54:49] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[05:24:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[07:00:47] Project selenium-Wikibase » chrome,beta,Linux,BrowserTests build #554: FAILURE in 2 hr 20 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/554/
[07:06:02] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:34:58] PROBLEM - Free space - all mounts on integration-slave-jessie-1001 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1001.diskspace._mnt.byte_percentfree (No valid datapoints found)integration.integration-slave-jessie-1001.diskspace._srv.byte_percentfree (<10.00%)
[08:03:58] !log nodepool: manually rebuilding snapshot-ci-jessie
[08:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:07:51] (PS2) Hashar: debian-glue: use prev distro instead of UNRELEASED [integration/config] - https://gerrit.wikimedia.org/r/392789 (https://phabricator.wikimedia.org/T181120)
[08:09:03] (CR) Hashar: [C: 2] debian-glue: use prev distro instead of UNRELEASED [integration/config] - https://gerrit.wikimedia.org/r/392789 (https://phabricator.wikimedia.org/T181120) (owner: Hashar)
[08:09:18] Continuous-Integration-Config, Release-Engineering-Team (Kanban), MediaWiki-Debian, Patch-For-Review: jenkins-debian-glue job should support UNRELEASED - https://phabricator.wikimedia.org/T181120#3784846 (hashar) Open>Resolved
[08:10:13] (Merged) jenkins-bot: debian-glue: use prev distro instead of UNRELEASED [integration/config] - https://gerrit.wikimedia.org/r/392789 (https://phabricator.wikimedia.org/T181120) (owner: Hashar)
[08:13:27] !log upgrading blubber on contint2001
[08:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:14:22] !log nodepool: Image snapshot-ci-jessie-1511510623 in wmflabs-eqiad is ready
[08:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:16:06] !log pooling integration-slave-docker-1003 again | T179378
[08:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:16:10] T179378: some labvirt servers are at full CPU capacity - https://phabricator.wikimedia.org/T179378
[08:16:23] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused
[08:23:03] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused amusso ZeroMQ is not loading for some reason
[08:24:32] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888
[08:27:12] \o/
[08:41:43] (PS1) Hashar: wikibase/pywikibot to a single Docker job [integration/config] - https://gerrit.wikimedia.org/r/393184
[08:43:06] (CR) Hashar: [C: 2] wikibase/pywikibot to a single Docker job [integration/config] - https://gerrit.wikimedia.org/r/393184 (owner: Hashar)
[08:44:13] (Merged) jenkins-bot: wikibase/pywikibot to a single Docker job [integration/config] - https://gerrit.wikimedia.org/r/393184 (owner: Hashar)
[08:48:04] (PS1) Hashar: Remove Zuul template 'tox-jessie' [integration/config] - https://gerrit.wikimedia.org/r/393187
[08:48:14] (CR) Hashar: [C: 2] Remove Zuul template 'tox-jessie' [integration/config] - https://gerrit.wikimedia.org/r/393187 (owner: Hashar)
[08:49:28] (Merged) jenkins-bot: Remove Zuul template 'tox-jessie' [integration/config] - https://gerrit.wikimedia.org/r/393187 (owner: Hashar)
[09:14:42] MediaWiki-Releasing, Security: Consider using a single MediaWiki releases key instead of individual keys - https://phabricator.wikimedia.org/T181019#3784996 (MoritzMuehlenhoff) >>! In T181019#3782098, @Legoktm wrote: > Would we be creating a new key every year then? Not needed, I'd simply use the key ov...
[09:17:14] MediaWiki-Releasing, Security: Consider using a single MediaWiki releases key instead of individual keys - https://phabricator.wikimedia.org/T181019#3785022 (MoritzMuehlenhoff) >>! In T181019#3782121, @demon wrote: > As a complete idiot when it comes to this: are subkeys an option? Like, could we have a...
[09:21:47] MediaWiki-Releasing, Security: Consider using a single MediaWiki releases key instead of individual keys - https://phabricator.wikimedia.org/T181019#3785038 (MoritzMuehlenhoff) >>! In T181019#3782273, @greg wrote: > Just so everyone's on the same page, by "the Jenkins instance" Darian is referring to the...
[09:31:20] (PS1) Hashar: Migrate operations/software/conftool to Docker [integration/config] - https://gerrit.wikimedia.org/r/393194
[09:34:08] (CR) Hashar: [C: 2] Migrate operations/software/conftool to Docker [integration/config] - https://gerrit.wikimedia.org/r/393194 (owner: Hashar)
[09:35:36] (Merged) jenkins-bot: Migrate operations/software/conftool to Docker [integration/config] - https://gerrit.wikimedia.org/r/393194 (owner: Hashar)
[10:05:57] (PS1) Hashar: Clean up tox-jessie [integration/config] - https://gerrit.wikimedia.org/r/393202
[10:22:08] Release-Engineering-Team (Kanban), User-zeljkofilipin: Investigate replacing nodemw with mwbot - https://phabricator.wikimedia.org/T181284#3785137 (zeljkofilipin)
[10:22:20] Release-Engineering-Team (Kanban), User-zeljkofilipin: Investigate replacing nodemw with mwbot - https://phabricator.wikimedia.org/T181284#3785151 (zeljkofilipin) p:Triage>Normal
[10:24:17] Release-Engineering-Team (Kanban), Discovery, Discovery-Search (Current work), MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review: [Epic] Port Selenium tests from Ruby to Node.js for the Search Platform - https://phabricator.wikimedia.org/T174103#3785155 (zeljkofil...
[10:29:02] Release-Engineering-Team (Kanban), User-zeljkofilipin: Replace jshint with eslint in nodemw - https://phabricator.wikimedia.org/T181285#3785159 (zeljkofilipin)
[10:29:52] Release-Engineering-Team (Kanban), User-zeljkofilipin: Replace jshint with eslint in nodemw - https://phabricator.wikimedia.org/T181285#3785159 (zeljkofilipin) p:Triage>Low
[10:33:43] (CR) Hashar: [C: 2] Clean up tox-jessie [integration/config] - https://gerrit.wikimedia.org/r/393202 (owner: Hashar)
[10:35:00] (Merged) jenkins-bot: Clean up tox-jessie [integration/config] - https://gerrit.wikimedia.org/r/393202 (owner: Hashar)
[11:15:48] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[11:36:44] Continuous-Integration-Config, MediaWiki-Core-Tests, Documentation: Jenkins-bot should remind users to add documentation for new functions - https://phabricator.wikimedia.org/T179632#3785351 (hashar)
[11:37:03] Continuous-Integration-Config, MediaWiki-Codesniffer, MediaWiki-Core-Tests, Documentation: Jenkins-bot should remind users to add documentation for new functions - https://phabricator.wikimedia.org/T179632#3731487 (hashar) Maybe PHP_CodeSniffer has support to enforce documentation.
[11:44:43] Continuous-Integration-Infrastructure: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785380 (hashar)
[11:46:47] Continuous-Integration-Infrastructure: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785396 (hashar)
[11:53:01] Continuous-Integration-Infrastructure, Patch-For-Review: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785412 (hashar) We will have Graphite on labs to be cleared out of all metrics `servers.*.diskspace._var_lib_docker_devicemapper_mnt_*`.
[11:53:30] Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Patch-For-Review: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785413 (hashar) a:hashar
[11:53:52] Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Patch-For-Review: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785380 (hashar) p:Triage>High
[13:15:57] Release-Engineering-Team (Kanban): Identify Orphaned components/code - https://phabricator.wikimedia.org/T173349#3785605 (Aklapper) >>! In T173349#3703862, @Jrbranaa wrote: > Orphaned will essentially be all components that don't have a responsible team within the Foundation or sister chapter. Hmm, that w...
[14:03:15] Continuous-Integration-Config, MediaWiki-Codesniffer, MediaWiki-Core-Tests, Documentation: Jenkins-bot should remind users to add documentation for new functions - https://phabricator.wikimedia.org/T179632#3785697 (Huji) It does. But it checks an entire piece of code, not just the new additions....
[14:20:07] (PS1) Hashar: npm test with Xvfb and PhantomJS [integration/config] - https://gerrit.wikimedia.org/r/393232 (https://phabricator.wikimedia.org/T179360)
[14:21:53] (PS1) Hashar: Switch ArticlePlaceholder to npm-browser-test [integration/config] - https://gerrit.wikimedia.org/r/393233 (https://phabricator.wikimedia.org/T179360)
[14:24:45] (PS2) Hashar: npm test with Xvfb and PhantomJS [integration/config] - https://gerrit.wikimedia.org/r/393232 (https://phabricator.wikimedia.org/T179360)
[14:24:48] (PS2) Hashar: Switch ArticlePlaceholder to npm-browser-test [integration/config] - https://gerrit.wikimedia.org/r/393233 (https://phabricator.wikimedia.org/T179360)
[14:24:49] (PS1) Hashar: mediawiki/extensions/DataTypes no more needs npm test [integration/config] - https://gerrit.wikimedia.org/r/393235 (https://phabricator.wikimedia.org/T180172)
[14:28:23] (CR) Hashar: [C: 2] mediawiki/extensions/DataTypes no more needs npm test [integration/config] - https://gerrit.wikimedia.org/r/393235 (https://phabricator.wikimedia.org/T180172) (owner: Hashar)
[14:29:29] (Merged) jenkins-bot: mediawiki/extensions/DataTypes no more needs npm test [integration/config] - https://gerrit.wikimedia.org/r/393235 (https://phabricator.wikimedia.org/T180172) (owner: Hashar)
[14:32:22] (PS1) Hashar: Remove mwgate-npm-node-6-jessie [integration/config] - https://gerrit.wikimedia.org/r/393237
[14:33:34] Continuous-Integration-Infrastructure (shipyard), Patch-For-Review: Create "npm-browser" docker image with npm, xvfb, chromium, and firefox installed - https://phabricator.wikimedia.org/T179360#3785743 (hashar) I created a basic Docker container which starts Xvfb on display port 94 and runs the npm tests...
[15:30:30] (PS1) Hashar: Migrate some npm jobs to Docker [integration/config] - https://gerrit.wikimedia.org/r/393246
[15:40:09] Continuous-Integration-Infrastructure (shipyard), Cloud-Services, Graphite, Patch-For-Review, User-Addshore: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3785866 (hashar)
[15:40:12] Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Patch-For-Review: Shinken erroneous disk usage alert on integration-slave-docker hosts - https://phabricator.wikimedia.org/T181295#3785868 (hashar)
[15:41:04] Continuous-Integration-Infrastructure (shipyard), Cloud-Services, Graphite, Patch-For-Review, User-Addshore: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3645705 (hashar)
[15:42:42] Continuous-Integration-Infrastructure (shipyard), Cloud-Services, Operations, monitoring, and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3785870 (hashar)
[15:45:25] Continuous-Integration-Infrastructure, Nodepool: 2016-08-10 CI incident follow-ups - https://phabricator.wikimedia.org/T142952#3785875 (hashar)
[15:45:29] Continuous-Integration-Infrastructure, Operations, Nodepool, Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#3785873 (hashar) Open>declined Nodepool is legacy. I am not going to bother upgrading the python modules. We will...
[15:45:30] Continuous-Integration-Infrastructure, Nodepool: 2016-08-10 CI incident follow-ups - https://phabricator.wikimedia.org/T142952#2552198 (hashar) Open>Resolved a:hashar
[15:45:38] Continuous-Integration-Config, BlueSpice, Patch-For-Review: Enable unit tests on BlueSpice* repos - https://phabricator.wikimedia.org/T130811#3785878 (Osnard) @Paladox I have the feeling that CI and gate-and-submit jobs don't work properly on a lot of our (new) BlueSpice* repos. I've checked the "int...
[15:46:22] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team (Watching / External), Cloud-Services, Patch-For-Review: Fix ci puppet role to support stretch - https://phabricator.wikimedia.org/T166611#3785879 (hashar) @Paladox sounds like you have finished thi...
[16:01:22] (PS1) Jdrewniak: [WIP] Setting portals JJ to commit to portals/deploy [integration/config] - https://gerrit.wikimedia.org/r/393252 (https://phabricator.wikimedia.org/T180777)
[16:05:40] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team (Watching / External), Cloud-Services, Patch-For-Review: Fix ci puppet role to support stretch - https://phabricator.wikimedia.org/T166611#3785911 (Paladox) I think so :).
[16:09:59] (CR) Paladox: "@Hashar would you be able to review this please? :)" [integration/config] - https://gerrit.wikimedia.org/r/389412 (owner: Robert Vogel)
[18:39:22] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:27:58] Release-Engineering-Team (Kanban): Identify Orphaned components/code - https://phabricator.wikimedia.org/T173349#3786148 (Jrbranaa) Yeah, that's a good point. One of the thoughts that I've been having is to recognize "stewardship" outside of the foundation and sister orgs. Meaning, anyone can agree to be a...
[21:38:24] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (MarcoAurelio)
[21:38:54] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786200 (MarcoAurelio) p:Triage>Unbreak! Tentatively UBN as this is not normal and has jobs stuck.
[21:45:18] Reedy, no_justification : could any of you restart zuul?
[21:46:00] or Krinkle ?
[21:57:19] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786259 (Paladox) p:Unbreak!>High Changing to high as UBN means a site is down. Tests getting stuck in the post merge pipeline happened...
[21:58:58] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786262 (MarcoAurelio) Okay. Would a restart of zuul help unlock those jobs?
[22:07:00] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (hashar) That happens from time to time and it is T72597. There is no magic solution to remove the lock though ;(
[22:09:27] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786267 (hashar) That happens from time to time and it is T72597. There is no magic solution to remove the lock though ;( I went to https://in...
[22:10:43] Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Jenkins, Upstream: Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#3786268 (hashar) ...
[22:11:04] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (hashar) Open>Resolved a:hashar
[22:11:36] Continuous-Integration-Infrastructure, Operations, Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786275 (MarcoAurelio) Thank you!
[23:30:56] PROBLEM - Puppet errors on deployment-kafka-jumbo-2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:36:09] PROBLEM - Puppet errors on deployment-netbox is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:37:26] PROBLEM - Puppet errors on deployment-kafka-jumbo-1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]