[00:02:08] Beta-Cluster-Infrastructure, MediaWiki-General-or-Unknown: LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686528 (greg) >>! In T147240#2686492, @Krenair wrote: > Might be something to do with https://gerrit.wikimedia.org/r/#/c/310757/ ? > Not...
[00:16:10] Project selenium-Flow » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #163: FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/163/
[00:16:14] Project selenium-Flow » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #163: FAILURE in 13 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/163/
[00:20:02] Project beta-update-databases-eqiad build #11807: STILL FAILING in 1.6 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11807/
[01:05:19] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[01:08:51] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:09:31] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:12:24] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:13:43] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[01:14:21] !log New scap command line autocompletions are now installed on deployment-tin and deployment-mira refs T142880
[01:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[01:14:42] thcipriani|afk: ^
[01:17:07] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:18:25] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[01:19:03] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:19:35] PROBLEM - Puppet run on deployment-parsoid09 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[01:20:05] Project beta-update-databases-eqiad build #11808: STILL FAILING in 5.1 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11808/
[01:20:47] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[01:20:57] PROBLEM - Puppet run on deployment-pdfrender is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:22:36] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[01:22:50] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[01:26:56] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:27:06] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[01:28:12] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:28:46] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[01:29:33] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[01:30:29] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[01:33:51] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:34:31] PROBLEM - Puppet run on deployment-tmh01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[01:36:08] Continuous-Integration-Config: Banana-checker reports spurious error about supposedly undefined key feedback-error-title - https://phabricator.wikimedia.org/T147245#2686603 (Tgr)
[01:45:20] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:52:23] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:53:27] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:53:43] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[01:54:37] RECOVERY - Puppet run on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:55:29] Continuous-Integration-Config, MediaWiki-General-or-Unknown, translatewiki.net, Patch-For-Review: Banana-checker reports spurious error about supposedly undefined key feedback-error-title - https://phabricator.wikimedia.org/T147245#2686626 (matmarex) Actually, let's add back #Continuous-Integrati...
[01:55:57] RECOVERY - Puppet run on deployment-pdfrender is OK: OK: Less than 1.00% above the threshold [0.0]
[01:57:37] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:57:49] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[01:59:03] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:00:46] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:01:58] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:03:46] RECOVERY - Puppet run on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:05:28] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:08:13] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:13:03] Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #175: FAILURE in 3 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/175/
[02:13:10] Continuous-Integration-Config, MediaWiki-General-or-Unknown, translatewiki.net, Patch-For-Review: Banana-checker reports spurious error about supposedly undefined key feedback-error-title - https://phabricator.wikimedia.org/T147245#2686633 (Tgr) Thanks! I guess I still don't quite understand how...
[02:14:29] RECOVERY - Puppet run on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:20:01] Project beta-update-databases-eqiad build #11809: STILL FAILING in 1.4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11809/
[02:41:08] Project selenium-CirrusSearch » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #172: FAILURE in 8.5 sec: https://integration.wikimedia.org/ci/job/selenium-CirrusSearch/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/172/
[03:20:01] Project beta-update-databases-eqiad build #11810: STILL FAILING in 1.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11810/
[03:56:06] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #161: FAILURE in 5.9 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/161/
[03:56:07] Project selenium-MultimediaViewer » chrome,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #161: FAILURE in 5.8 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/161/
[04:20:01] Project beta-update-databases-eqiad build #11811: STILL FAILING in 1.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11811/
[05:16:22] Undefined index: wmgExtraLanguageNames in /srv/mediawiki-staging/php-master/includes/SiteConfiguration.php on line 312
[05:16:42] PHP Fatal error: Class name must be a valid object or a string in /srv/mediawiki-staging/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php on line 217
[05:20:01] Project beta-update-databases-eqiad build #11812: STILL FAILING in 1.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11812/
[05:23:51] beta-update-databases-eqiad is broken because of https://phabricator.wikimedia.org/rMW09ca28d01a170d4859e68d4eba1c861ffb576f43
[05:28:05] Beta-Cluster-Infrastructure, MediaWiki-General-or-Unknown, MW-1.28-release-notes, Patch-For-Review, WMF-deploy-2016-10-04_(1.28.0-wmf.21): LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686432 (mmodell) beta-update-databases-eqiad...
[05:31:15] Beta-Cluster-Infrastructure, MediaWiki-General-or-Unknown, MW-1.28-release-notes, Patch-For-Review, WMF-deploy-2016-10-04_(1.28.0-wmf.21): LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2687221 (mmodell)
[05:31:49] Beta-Cluster-Infrastructure, MediaWiki-General-or-Unknown, MW-1.28-release-notes, Patch-For-Review, WMF-deploy-2016-10-04_(1.28.0-wmf.21): LoadBalancer fatals on Beta cluster rendering pages inaccessible - https://phabricator.wikimedia.org/T147240#2686432 (mmodell)
[05:52:41] PROBLEM - Host deployment-phab1001 is DOWN: CRITICAL - Host Unreachable (10.68.23.160)
[05:54:53] PROBLEM - Host deployment-phab2001 is DOWN: CRITICAL - Host Unreachable (10.68.23.140)
[06:20:01] Project beta-update-databases-eqiad build #11813: STILL FAILING in 1.3 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11813/
[07:02:07] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[07:20:01] Project beta-update-databases-eqiad build #11814: STILL FAILING in 1.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11814/
[07:35:30] PROBLEM - Puppet run on deployment-tmh01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:39:33] RECOVERY - Puppet run on deployment-apertium01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:09:26] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T146998#2687382 (MarcoAurelio)
[08:10:29] RECOVERY - Puppet run on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:12:25] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T146998#2687396 (MarcoAurelio)
[08:12:27] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2687397 (MarcoAurelio)
[08:20:02] Project beta-update-databases-eqiad build #11815: STILL FAILING in 1.4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11815/
[08:31:08] Beta-Cluster-Infrastructure, Labs, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2686048 (hashar) Would it be possible to have for each class the list of instances having the class applied ? Would ease the migration toward roles :]
[08:36:22] Beta-Cluster-Infrastructure, Puppet: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2687554 (hashar) deployment-apertium01 is a Trusty instance. Maybe we can move apertium to the deployment-sca* instances which are jessie? that is {T14...
[08:48:36] Browser-Tests-Infrastructure, Continuous-Integration-Infrastructure, SpamBlacklist, Documentation, User-zeljkofilipin: Figure out a system to override default settings when in test context - https://phabricator.wikimedia.org/T89096#2687566 (zeljkofilipin)
[09:01:57] (CR) Hashar: [C: 2] Replace deprecated logrotate with build-discarder [integration/config] - https://gerrit.wikimedia.org/r/313306 (owner: Paladox)
[09:02:57] (Merged) jenkins-bot: Replace deprecated logrotate with build-discarder [integration/config] - https://gerrit.wikimedia.org/r/313306 (owner: Paladox)
[09:02:59] !log Regenerating configuration of all Jenkins job due to https://gerrit.wikimedia.org/r/#/c/313306/
[09:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:03:27] (CR) Hashar: "Regenerating all 208 jobs configurations via:" [integration/config] - https://gerrit.wikimedia.org/r/313306 (owner: Paladox)
[09:03:42] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[09:03:54] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[09:04:43] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2687581 (akosiaris)
[09:04:45] Beta-Cluster-Infrastructure, Puppet: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2687578 (akosiaris) Open>Resolved a:akosiaris https://gerrit.wikimedia.org/r/#/c/308679/ fixes this. I 'll close as resolved
[09:20:01] Project beta-update-databases-eqiad build #11816: STILL FAILING in 1.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11816/
[09:26:27] Beta-Cluster-Infrastructure, Citoid, VisualEditor: Move citoid to deployment-sca* hosts in Beta Cluster - https://phabricator.wikimedia.org/T142150#2687638 (akosiaris)
[09:34:37] (PS1) Hashar: Remove leftover 'mediawiki-gate' job [integration/config] - https://gerrit.wikimedia.org/r/313973
[09:34:59] (CR) Hashar: [C: 2] Remove leftover 'mediawiki-gate' job [integration/config] - https://gerrit.wikimedia.org/r/313973 (owner: Hashar)
[09:35:35] (Merged) jenkins-bot: Remove leftover 'mediawiki-gate' job [integration/config] - https://gerrit.wikimedia.org/r/313973 (owner: Hashar)
[09:43:44] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:43:54] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:04:14] Release-Engineering-Team, Wikimedia-Developer-Summit, Developer-Relations (Oct-Dec-2016): Developer Summit 2017: Work with TPG and RelEng on solution to event documenting - https://phabricator.wikimedia.org/T132400#2687791 (Qgil)
[10:13:57] (PS1) Addshore: Turn on tests for InterwikiSorting extension [integration/config] - https://gerrit.wikimedia.org/r/313981
[10:14:32] (PS1) Addshore: Rename 2ColConflict -> TwoColConflict [integration/config] - https://gerrit.wikimedia.org/r/313982
[10:15:43] (PS1) Hashar: [oojs-ui] point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983
[10:15:56] (CR) Hashar: [C: 2] [oojs-ui] point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983 (owner: Hashar)
[10:16:54] (CR) jenkins-bot: [V: -1] [oojs-ui] point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983 (owner: Hashar)
[10:18:27] (PS2) Hashar: Point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983
[10:18:39] (CR) Hashar: [C: 2] Point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983 (owner: Hashar)
[10:19:39] (Merged) jenkins-bot: Point to the proper Nodepool jobs [integration/config] - https://gerrit.wikimedia.org/r/313983 (owner: Hashar)
[10:20:01] Project beta-update-databases-eqiad build #11817: STILL FAILING in 1.3 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11817/
[10:20:57] (PS2) Hashar: Rename 2ColConflict -> TwoColConflict [integration/config] - https://gerrit.wikimedia.org/r/313982 (owner: Addshore)
[10:21:04] (CR) Hashar: [C: 2] Rename 2ColConflict -> TwoColConflict [integration/config] - https://gerrit.wikimedia.org/r/313982 (owner: Addshore)
[10:21:43] (Merged) jenkins-bot: Rename 2ColConflict -> TwoColConflict [integration/config] - https://gerrit.wikimedia.org/r/313982 (owner: Addshore)
[10:23:22] (PS2) Hashar: Turn on tests for InterwikiSorting extension [integration/config] - https://gerrit.wikimedia.org/r/313981 (owner: Addshore)
[10:23:24] (PS2) Hashar: [XenForoAuth] Add jenkins tests [integration/config] - https://gerrit.wikimedia.org/r/313924 (owner: Paladox)
[10:23:36] (CR) Hashar: [C: 2] Turn on tests for InterwikiSorting extension [integration/config] - https://gerrit.wikimedia.org/r/313981 (owner: Addshore)
[10:23:40] (CR) Hashar: [C: 2] [XenForoAuth] Add jenkins tests [integration/config] - https://gerrit.wikimedia.org/r/313924 (owner: Paladox)
[10:24:41] (Merged) jenkins-bot: Turn on tests for InterwikiSorting extension [integration/config] - https://gerrit.wikimedia.org/r/313981 (owner: Addshore)
[10:25:09] cheers hashar
[10:25:29] (Merged) jenkins-bot: [XenForoAuth] Add jenkins tests [integration/config] - https://gerrit.wikimedia.org/r/313924 (owner: Paladox)
[10:31:48] going to hunt
[10:32:52] Release-Engineering-Team, Wikimedia-Developer-Summit, Developer-Relations (Oct-Dec-2016): Developer Summit 2017: Work with TPG and RelEng on solution to event documenting - https://phabricator.wikimedia.org/T132400#2196957 (Qgil) @Rfarrand @ksmith should we still pursue this task for #wikidev17 and...
[10:57:31] hasharFood: uh oh, is beta down? https://en.wikipedia.beta.wmflabs.org/
[10:57:46] MediaWiki internal error.
[10:58:14] hm
[10:58:17] main page works fine
[10:58:18] https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
[11:01:26] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: "MediaWiki internal error" at https://en.wikipedia.beta.wmflabs.org/ - https://phabricator.wikimedia.org/T147298#2687950 (zeljkofilipin)
[11:01:39] zeljkof: check logstash ?
[11:01:51] hasharFood: https://phabricator.wikimedia.org/T147298
[11:01:54] Original exception: [V-OMGgpEE4AAAAiw1BEAAAAL] / InvalidArgumentException from line 357 of /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php: Database::factory no viable database extension found for type ''
[11:01:55] bah
[11:02:04] looks like there is just a problem with redirect
[11:02:49] that is similar to https://phabricator.wikimedia.org/T147240
[11:02:49] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: "MediaWiki internal error" at https://en.wikipedia.beta.wmflabs.org/ - https://phabricator.wikimedia.org/T147298#2687964 (zeljkofilipin)
[11:03:10] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: "MediaWiki internal error" at https://en.wikipedia.beta.wmflabs.org/ - https://phabricator.wikimedia.org/T147298#2687950 (zeljkofilipin)
[11:03:54] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: "MediaWiki internal error" at https://en.wikipedia.beta.wmflabs.org/ - https://phabricator.wikimedia.org/T147298#2687972 (hashar)
[11:03:56] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2687971 (hashar)
[11:04:32] zeljkof: I have made it a blocker of the new mw version
[11:04:42] hasharFood: great, thanks
[11:20:02] Project beta-update-databases-eqiad build #11818: STILL FAILING in 1.4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11818/
[11:32:53] PROBLEM - Puppet run on deployment-db03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[11:44:38] PROBLEM - Puppet run on deployment-elastic08 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:44:42] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[12:01:29] Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #166: FAILURE in 28 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/166/
[12:01:29] Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #166: FAILURE in 28 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/166/
[12:12:54] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:20:01] Project beta-update-databases-eqiad build #11819: STILL FAILING in 1.3 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11819/
[12:22:04] Project selenium-GettingStarted » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #166: FAILURE in 4.5 sec: https://integration.wikimedia.org/ci/job/selenium-GettingStarted/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/166/
[12:24:37] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:24:41] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[12:34:42] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[12:34:57] (PS1) Hashar: (WIP) experiment with makefile (WIP) [integration/config] - https://gerrit.wikimedia.org/r/313998
[13:04:06] Project selenium-Math » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #165: FAILURE in 6.3 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/165/
[13:04:07] Project selenium-Math » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #165: FAILURE in 6.3 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/165/
[13:14:44] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:20:02] Project beta-update-databases-eqiad build #11820: STILL FAILING in 1.5 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11820/
[13:27:57] (CR) Aude: [C: 2] Update Wikidata branch - wmf/1.28.0-wmf.21 [tools/release] - https://gerrit.wikimedia.org/r/313861 (owner: Aude)
[13:30:32] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[13:31:57] (Merged) jenkins-bot: Update Wikidata branch - wmf/1.28.0-wmf.21 [tools/release] - https://gerrit.wikimedia.org/r/313861 (owner: Aude)
[13:53:36] hashar: where did we leave off w/ our move back to nodepools? long last week and yuvi is out this week, iirc we have 1 revert to go?
[14:00:33] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[14:04:00] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 45356 bytes in 5.498 second response time
[14:04:54] chasemp: good morning!
[14:05:08] chasemp: yeah with the offsite coming last week I havent pushed. What is left are the PHP based jobs
[14:05:11] which are a good chunk of builds
[14:05:12] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 45779 bytes in 1.129 second response time
[14:05:16] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 19088 bytes in 1.191 second response time
[14:05:44] also Nodepool no longer queries for a list of floating IPs on each server deletion
[14:05:58] which removed maybe 20-30% of queries
[14:06:57] is that 1 distinct revert then left?
[14:07:02] Okay
[14:07:06] yeah but will have to rebase it
[14:07:08] RECOVERY - App Server Main HTTP Response on deployment-mediawiki06 is OK: HTTP OK: HTTP/1.1 200 OK - 45339 bytes in 4.141 second response time
[14:07:12] and probably split it in chunks
[14:07:15] Tried to fix beta by cherry-picking Aaron's patch
[14:07:18] Had to patch scap first
[14:07:49] hashar: ok I guess if you want to split that up and whatnot let's plan on starting monday when yuvi is back and we aren't short handed?
[14:08:08] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 45339 bytes in 1.147 second response time
[14:08:16] chasemp: are you still worrying about the consequence on the labs infra / OpenStack ?
[14:08:44] well yeah since we've never been here before
[14:08:59] I can surely split them in chunks that would let us control how many new builds we add to nodepool
[14:10:03] whatever you think is a reasonable approach there is ok by me
[14:10:22] well I would just raise the pool and switch everything in one go :D
[14:10:31] but that is because I dont really deal with the aftermath on the labs infra
[14:10:39] so lets be careful and do it in chunks when yuvi is back
[14:10:42] yeah that sounds safe
[14:10:54] also I could use the rate to be lowered a bit
[14:11:24] I finally managed to confirm the impact of the rate with the deletion/spawn rate of instance https://phabricator.wikimedia.org/T146813 for shiny graphs
[14:12:01] I am tempted to lower it from 8 seconds (or 7.5 requests per minute) to 6 seconds (or 10 requests per minute)
[14:13:56] if you mean like outside of puppet that wouldn't be cool
[14:13:59] I appreciate https://phabricator.wikimedia.org/T146813 a lot
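(Editor's aside: the requests-per-minute figures above are just the inverse of Nodepool's "rate" delay between OpenStack API calls. A minimal sketch of the arithmetic; the function name is ours, for illustration only.)

    # Nodepool's "rate" setting is a delay in seconds between OpenStack API
    # calls; 60 / rate gives the resulting requests per minute.
    def requests_per_minute(rate_seconds: float) -> float:
        return 60.0 / rate_seconds

    print(requests_per_minute(8.0))  # 7.5  -> the current setting
    print(requests_per_minute(6.0))  # 10.0 -> the proposed setting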
[14:14:29] Continuous-Integration-Infrastructure, Nodepool, Patch-For-Review: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2688648 (hashar) From a discussion with Chase. What is left todo is to migrate back the PHP based jobs: https://gerrit.wikimedia.org/r/#/c/306727/ Revert "Tempo...
[14:14:33] what it is lacking is tying it back to some meaningful metric that isn't for its own sake? if we are slow to reach ready state ok
[14:14:42] but what is that impacting that we want to start getting more aggressive?
[14:15:31] I was somewhat miffed by the last dive down this hole where it surfaced (for me) that there is a higher level throttle than the nodepool consumption of jobs making the actual job wait times more opaque than we had talked about
[14:15:37] PROBLEM - Puppet run on deployment-elastic08 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:15:52] ah yeah the wait_time :(
[14:17:20] I guess I haven't dug enough in how the wait time is composed
[14:17:24] my thinking atm is to revert back the jobs still left behind and wait to see how the releng CI offsite talks go, I think that's really soon?
[14:17:25] or at least havent been careful enough
[14:17:38] the offsite is in a couple weeks
[14:17:44] but one sure thing, we want to drop the permanent slaves
[14:17:55] they are unmanageable
[14:18:19] I felt like greg/chad/I had a different convo a few weeks ago where we said some very small/short/unstateful jobs like linting
[14:18:22] may make no sense on nodepool
[14:18:25] and it is unlikely we will come with a new solution any time soon (that is more in the scale of several months)
[14:18:54] for the short jobs, yeah
[14:19:03] there are some experiments to regroup them in just a single job
[14:19:07] I'm not entirely in on the scope of the coming talks, nodepool entirely is up for debate I thought? in a few weeks it may be that nodepool is an anti-pattern?
[14:19:35] which is more or less what we did with introducing "composer test" "npm test" etc. Basically reduce the # of jobs to consume less vm
[14:19:47] well
[14:19:49] as I see it
[14:19:55] we have permanent slaves + nodepool slaves
[14:20:04] I would get rid of permanent slaves first
[14:20:10] then migrate from nodepool to whatever new system
[14:20:27] cause really I dont want to maintain three different sort of slaves
[14:20:37] what percentage of work is remaining on permanent slaves?
[14:20:56] sure but putting in work to migrate to a declared dead system doesn't seem to make sense if that's how it goes
[14:20:57] Yippee, build fixed!
[14:20:58] Project beta-update-databases-eqiad build #11821: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11821/
[14:21:49] chasemp: the mediawiki/PHP related jobs
[14:21:54] they are ready
[14:22:09] I have been holding the switch since ~ May since labs infra lacked power to host them
[14:22:15] after that we can delete permanent slaves?
[14:22:24] power in what way do you recall?
[14:22:29] yeah at least half of them
[14:22:31] probably more
[14:22:32] I mean, labs is already hosting them
[14:22:36] so that's a bit of a misnomer
[14:23:18] more power in the sense we needed more capacity for nodepool slaves while keeping the capacity of permanent slaves
[14:23:18] ok but, can you quantify in some way what it would take to remove permanent slaves entirely?
[14:23:23] kind of a temporary bump of usage
[14:23:41] so at some point you would have the permanent slaves untouched + additional nodepool instances
[14:23:52] and once migration is done, we can reclaim resources by deleting the perm slaves
[14:24:01] but it was not possible to get the 15-20 more instances we needed
[14:25:42] chasemp: the quantity argument is on https://phabricator.wikimedia.org/T133911
[14:26:19] we got it to 20
[14:26:23] and then down to 10 in an emergency
[14:26:42] due to labs being exhausted for whatever reason and nodepool was the low hanging fruit to quickly reclaim capacity
[14:29:19] anyway, will prepare some patches to move back the PHP jobs
[14:29:32] ok thanks
[14:29:32] with links to expected build/days added to nodepool instances
[14:33:38] in audio with chasemp
[14:34:18] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2688739 (Krenair)
[14:38:26] err with Tyler
[14:39:26] heh :)
[14:47:01] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:52:02] hashar thanks for merging my patches today :)
[15:01:14] !log shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734
[15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:05:17] PROBLEM - Host deployment-poolcounter02 is DOWN: CRITICAL - Host Unreachable (10.68.23.77)
[15:09:25] Hi, I seem to have lost all my groups on my Jenkins login, or have a different account I'm unaware of or something. Anyone here have some troubleshooting steps or can look at my account or something? I'm currently unable to start new builds on the extensions we maintain, which is a problem.
[15:10:14] old session that hasn't expired properly?
[15:10:17] logged out and in again?
[15:13:19] jhobs, you're logged in?
[15:13:24] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:13:26] Krenair: yep
[15:13:33] username?
[15:13:38] same as IRC
[15:13:49] and user profile shows no groups
[15:14:59] what page is that?
[15:15:14] Krenair: the page when I click my username is what I was referring to
[15:15:16] oh, a url like https://integration.wikimedia.org/ci/user/alex%20monk/
[15:15:20] I get a load of groups there
[15:15:31] at https://integration.wikimedia.org/ci/user/jhobs/ I see a couple of groups
[15:15:36] PROBLEM - Puppet run on deployment-parsoid09 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:15:44] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:15:45] Krenair: I see virtually nothing on this page: https://integration.wikimedia.org/ci/user/jhobs/
[15:16:28] Just my user ID and the sidebar
[15:16:37] do you see project-bastion, project-mobile and the equivalent ROLE_PROJECT_s?
[15:17:03] Krenair: here's exactly what I see: https://i.gyazo.com/0138a2329bca93b6c5aeece169f422a5.png
[15:17:38] huh
[15:17:41] did you log out and back in?
[15:17:48] yep
[15:18:14] you can understand my baffled-ness :D
[15:18:36] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:18:48] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:18:58] Yippee, build fixed!
[15:18:59] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #181: FIXED in 16 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/181/
[15:19:05] I logged out, back in, and still see a ton of groups on mine
[15:19:26] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:19:32] try clearing integration.wm.o cookies?
[15:20:06] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:20:49] Krenair: no dice, same problem
[15:21:20] yep okay this is beyond what I can help with
[15:21:22] hashar, around?
[15:21:33] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:21:47] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:22:59] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:24:49] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:24:51] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:25:45] Yippee, build fixed!
[15:25:46] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #181: FIXED in 22 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/181/
[15:27:01] Continuous-Integration-Infrastructure (phase-out-gallium), releng-201617-q1, Wikimedia-Incident: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#2688915 (hashar)
[15:28:57] Continuous-Integration-Infrastructure (phase-out-gallium), releng-201617-q1, Wikimedia-Incident: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#1199594 (hashar)
[15:30:52] Continuous-Integration-Infrastructure (phase-out-gallium), releng-201617-q1, Wikimedia-Incident: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#2688926 (hashar) @thcipriani and I have overhauled this task. The task details highlight the migration overview. https://docs.go...
[15:41:19] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #182: FAILURE in 15 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/182/
[15:44:45] PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:44:51] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:48:19] Krenair: was in an audio what is up ?
[15:48:27] I only have a couple minutes though
[15:49:03] jhobs has issues with his account in jenkins
[15:50:31] might be a case sensitive issue with the login name
[15:50:43] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:00] hashar: I can login; the problem is I'm missing groups/roles
[15:51:01] eg Jhobs vs jhobs
[15:51:04] ohh
[15:51:11] arent you in the ldap group "wmf" ?
[15:51:36] you are not
[15:51:38] this is a recent change. I used to have build rights on MobileFrontend, but now I don't
[15:51:56] yeah you are no more in the "wmf" ldap group
[15:52:01] as far as I can tell
[15:52:03] does he need to be?
[15:52:09] jenkins showed him having no groups
[15:52:12] for rebuild? yeah
[15:52:18] (when viewed from his account)
[15:52:22] did non-wmf users lose that right?
[15:52:22] so I guess someone got borked in LDAP somehow
[15:53:02] jhobs: so in short get a task on https://phabricator.wikimedia.org/tag/ldap-access-requests/
[15:53:13] to get added to the "wmf" ldap group
[15:53:22] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:53:28] and that will fix it.
[15:53:34] hashar: ok thanks. Not sure why it got removed :/
[15:53:52] jhobs: we can in theory add a hack in Jenkins to specifically allow you, but really I am not willing to maintain yet another access list in jenkins :D
[15:54:03] some provisioning exploded somehow
[15:54:17] well I should be on the "wmf" group anyways, so that's fine
[15:54:28] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:54:31] jhobs: ask your manager if there's any news you didn't get ;)
[15:54:44] jhobs: that will also grant you access to grafana, logstash and a bunch of other interfaces
[15:55:05] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:10] greg-g: talking with him right now, there are no problems there haha
[15:55:34] RECOVERY - Puppet run on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:35] as for why you got renamed, I have no idea really
[15:55:41] one will have to dig in the audit logs
[15:56:00] Krenair: thank you :]
[15:56:04] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:56:07] I am off be back later tonight
[15:56:46] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:38] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:48] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:12] twentyafterfour: so deployment-prep is now running scap master seems like?
[16:00:31] is the debian glue job doing that? Or was that a manual thing?
[16:00:43] thcipriani: debian-glue
[16:00:50] neat :)
[16:01:31] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:08] jhobs: :)
[16:02:57] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:45] RECOVERY - Puppet run on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:48] anyone ever seen certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster.deployment-prep.eqiad.wmflabs]
[16:13:10] my instances apparently don't trust the deployment puppetmaster's cert
[16:14:24] there was a cherry-pick by Yuvi on one of my projects that caused that. It made the puppet runs fail fast and hard though with a compilation error
[16:15:08] it could be caused by the labs-wide to local pm switch failing initially
[16:15:19] hmm
[16:15:40] it did fail initially, but eventually switched
[16:16:00] my puppet runs are failing fast and hard, too
[16:16:36] phab + scap just don't like each other or they are trolling me I think
[16:16:55] lol. yeah that happens some days
[16:17:21] this was fun to get working -- https://wikitech.wikimedia.org/wiki/User:BryanDavis/Scap3_in_a_Labs_project
[16:18:12] looks like deployment-puppetmaster has merge conflicts with upstream too.
[16:18:19] I'll see if I can figure it out
[16:19:23] bd808: a heroic effort!
[16:19:57] !log deployment-puppetmaster: removing cherry-pick of https://gerrit.wikimedia.org/r/#/c/305256/; conflicts with upstream changes
[16:19:59] mobrovac: ^
[16:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:21:05] why did ha.shar not remove the pick? He commented on the patch
[16:22:36] !log Restarted puppetmaster process on deployment-puppetmaster
[16:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:22:59] twentyafterfour: forgot to post this somewhere yesterday, but FWIW here's the full scap update process to build a new package for prod: https://wikitech.wikimedia.org/wiki/How_to_deploy_code/Scap
[16:23:07] twentyafterfour: things are up to date now at least. no idea if it will fix any of your woes
[16:28:19] PROBLEM - Puppet run on deployment-stream is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[16:30:02] How do I debug "This change or one of its cross-repo dependencies was unable to be automatically merged"? -- https://gerrit.wikimedia.org/r/#/c/313774/
[16:30:18] the dep is merged so ...
[16:30:35] * bd808 could remove the constraint I guess
[16:31:23] hrm, I saw this yesterday, hasharAw.ay did something to zuul
[16:31:43] I'm going to try without the Depends-On in the commit message and see what happens
[16:32:04] nope :/
[16:34:22] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:35:37] this is what I see in the zuul merger log: https://phabricator.wikimedia.org/P4155
[16:36:02] hmm.. ok
[16:38:42] so it looks like there is a conflict in /srv/ssd/zuul/git/mediawiki/extensions/OATHAuth
[16:38:45] the whole chain needed to be rebased apparently. Thanks twentyafterfour
[16:38:49] er thcipriani
[16:40:35] It would be swell if that nice bit of logging got added to the patch instead of the generic "something is wrong somewhere" message
[16:43:12] woo! deployment-phab01 works...so whatever you guys did fixed it
[16:43:17] thanks bd808 and thcipriani
[16:45:37] oh, you guys were looking at puppet on that host too?
[16:46:36] what was the fix?
[16:47:08] rebased patches on deployment-puppetmaster
[16:47:45] huh
[16:47:59] interesting error message for a compilation error
[16:48:08] er
[16:48:13] the ssl verification error
[16:48:18] would expect it for a compilation error
[16:48:41] something must have gone wrong in the middle of the switch from labs puppetmaster to local puppetmaster
[16:49:18] I restarted the puppetmaster too just for good measure
[16:49:49] I did that earlier but it didn't fix the cert error
[16:51:06] https://asciinema.org/a/7exvd6ltvwyjnnw9ha294eguy
[16:51:15] Fixed 02 by moving /var/lib/puppet/ssl out of the way
[16:51:59] I have my ssh config set up so I can just 'ssh deployment-tin' and it figures out the rest
[16:52:10] does your 'beta tin' do something similar?
[16:52:42] lol at "monkey@yourmom"
[16:53:11] LOL
[16:53:35] no paladox
[16:53:47] ?
[16:53:58] what does your ssh config look like? I can do: ssh [whatever].eqiad.wmflabs and it gets it right, but ssh deployment-tin would be nice.
[16:54:08] https://gist.github.com/thcipriani/5f26708108265c9c5c5e57e7319f0651
[16:55:21] Host deployment-* !*.deployment-prep.eqiad.wmflabs
[16:55:21] Hostname %h.deployment-prep.eqiad.wmflabs
[16:55:51] ah, nice :)
[16:56:12] probably made you happy when mira moved to a sane domain :P
[16:56:42] yes
[16:57:01] the !*.deployment-prep.eqiad.wmflabs is there purely for the purpose of allowing me to connect to mira.deployment-prep.eqiad.wmflabs, it is otherwise unused
[16:57:55] actually it probably was made obsolete a while ago, I'm not sure deployment-* alone would match the deployment-prep part of mira's old fqdn
[16:58:48] I guess I used that when the first match was against *deployment-*, which mira.deployment-prep.eqiad.wmflabs would've hit and broken
[16:59:55] fancy. stole that one thanks Krenair :)
[17:00:37] Krenair: yeah `beta $name` is essentially an alias for `ssh deployment-$name`
[17:01:22] with a little bonus that beta autocompletes all deployment-* hostnames (and a few from other projects as well)
[17:01:42] huh
[17:01:46] how'd you get that set up?
[17:02:13] a bit of bash and some openstack api client I lifted from yuvi
[17:02:29] * twentyafterfour will publish it since I find it pretty handy
[17:03:20] RECOVERY - Puppet run on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0]
[17:04:07] that would be interesting to see
[17:04:12] sounds like it would be pretty slow...
[17:05:42] Krenair: I cache the api client output so completion isn't slow
[17:05:47] ah
[17:05:55] how do you update the cache?
[17:06:59] don't you have to log into silver or labnodepool to connect to keystone+nova?
[17:07:14] (you can connect to nova from inside labs, but there's no way to auth)
[17:10:04] Krenair: https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niregion=eqiad&format=json&niproject=deployment-prep
[17:10:16] oh
[17:10:18] I guess it's not so much openstack api as it is wikitech api?
[17:10:19] that's not the openstack api
[17:10:34] it hits the openstack api behind the scenes
[17:10:40] I suppose that's one way that'll work for a while
[17:11:19] I just use it to keep a flat file of hostnames up to date ..
[17:13:57] twentyafterfour, https://phabricator.wikimedia.org/T104575
[17:14:21] https://phabricator.wikimedia.org/T143136 or https://phabricator.wikimedia.org/T104588
[17:18:41] (PS1) Niedzielski: Android: add IRC notification to alpha builds [integration/config] - https://gerrit.wikimedia.org/r/314036
[17:19:17] Krenair: https://gist.github.com/20after4/742dba22b6ea5a42072e7933ab055036
[17:19:23] I'm ok with it going away some day
[17:19:24] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:19:36] I'm sure I can figure out another way to maintain the list ;)
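(Editor's aside: the hostname-cache idea above is easy to sketch. A hypothetical Python version, not 20after4's actual gist; the cache path and the assumption that each result entry carries a "name" field are ours.)

    # Refresh a flat file of deployment-prep hostnames from the wikitech API
    # (the same novainstances query linked above) for bash completion to read.
    import json
    import pathlib
    import urllib.request

    API = ("https://wikitech.wikimedia.org/w/api.php?action=query"
           "&list=novainstances&niregion=eqiad&format=json"
           "&niproject=deployment-prep")

    def refresh_cache(path="~/.cache/deployment-prep-hosts"):
        with urllib.request.urlopen(API) as resp:
            data = json.load(resp)
        # Standard MediaWiki list=... response shape; the per-entry "name"
        # key is an assumption, not confirmed from the log.
        names = sorted(i["name"] for i in data["query"]["novainstances"])
        pathlib.Path(path).expanduser().write_text("\n".join(names) + "\n")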
[17:19:49] (CR) Niedzielski: "This is deployed and seems to be working properly." [integration/config] - https://gerrit.wikimedia.org/r/314036 (owner: Niedzielski)
[17:22:32] also https://horizon.wikimedia.org/project/access_and_security/ seems to show some public hostnames but I haven't tried accessing them
[17:26:35] twentyafterfour, compute (nova) is accessible, identity (keystone) is not
[17:27:01] hostnames can be misleading
[17:27:13] labnet100[12].eqiad.wmnet runs nova-api and is accessible to labs
[17:27:21] labcontrol1001.wikimedia.org runs keystone and is not
[17:32:57] Krinkle, know anything about "Call to a member function canExist() on null in /home/alex/Development/MediaWiki/includes/skins/Skin.php:212" on load.php?
[17:34:46] Beta-Cluster-Infrastructure, Operations, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2689359 (hashar) The deployment servers on beta cluster are now fully migrated to Jessie. We ended up keeping the same hostname and have: * deploymen...
[17:37:08] legoktm
[17:37:15] Scap3: Unhandled(?) exceptions in scap3 - https://phabricator.wikimedia.org/T147334#2689369 (Yurik)
[17:38:39] hi
[17:39:13] Krenair: traceback?
[17:39:13] ah, I had 090d0267daa4721ffb154e7e604804201365f9dd but hadn't updated vendor
[17:39:31] was a weird error for that though
[17:39:41] huh
[17:40:48] Krenair: no idea, resolved?
[17:40:52] yep
[18:03:11] Scap3, Kartotherian: Break Kartotherian scap3 deployment into 2 groups - https://phabricator.wikimedia.org/T147337#2689471 (Yurik)
[18:11:00] Scap3, Discovery, Kartotherian, Maps, Interactive-Sprint: Break Kartotherian scap3 deployment into 2 groups - https://phabricator.wikimedia.org/T147337#2689519 (Yurik)
[18:15:43] Browser-Tests-Infrastructure, Reading-Web-Backlog, Reading-Web-Tech-Debt, Browser-Tests, and 4 others: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2689567 (MBinder_WMF)
[18:18:05] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2689640 (Tgr)
[18:35:32] (CR) Hashar: [C: 2] Android: add IRC notification to alpha builds [integration/config] - https://gerrit.wikimedia.org/r/314036 (owner: Niedzielski)
[18:36:30] (Merged) jenkins-bot: Android: add IRC notification to alpha builds [integration/config] - https://gerrit.wikimedia.org/r/314036 (owner: Niedzielski)
[18:57:28] Release-Engineering-Team, User-greg: Create SOW for contractor - https://phabricator.wikimedia.org/T146711#2689785 (greg) Open>Resolved
[19:05:43] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:17:42] Beta-Cluster-Infrastructure, Labs, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2689877 (Andrew)
[19:22:48] Scap3: scap l10n-purge broken - https://phabricator.wikimedia.org/T147349#2689918 (thcipriani)
[19:23:15] Scap3: scap l10n-purge broken - https://phabricator.wikimedia.org/T147349#2689930 (thcipriani) p:Triage>Unbreak!
[19:31:54] Scap3: scap l10n-purge broken - https://phabricator.wikimedia.org/T147349#2689972 (thcipriani)
[19:32:59] What are deployment-sca01 and deployment-sca02, and why do they include contint::slave_scripts?
[19:33:14] image scalers?
[19:33:31] services cluster
[19:33:45] like the sca hosts in prod
[19:34:02] 10:58 hashar: Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02 from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
[19:34:04] contint is probably there for Parsoid deploy chain via jenkins
[19:34:08] They gave me the technically correct, yet completely useless answer? :)
[19:34:43] I dont think it is needed anymore
[19:34:49] well it might
[19:34:51] as far as I can tell, those are the only hosts in deployment-prep that include those scripts. The parsoid deploy is specific to image scaling?
[19:35:00] but deployment-sc* hosts are no more Jenkins slaves afaik
[19:35:25] what we aim at is to have Jenkins to run "scap deploy" from the deployment server which will be a jenkins slave
[19:35:43] and last week I have migrated the CI stuff from /mnt to /srv
[19:35:55] the deployment server is deployment-tin and deployment-mira, right?
[19:36:04] yeah
[19:36:32] so most probably contint::slave_scripts got copy pasted from some other instance
[19:36:35] and can be dropped
[19:36:49] if I drop them right now are we in a position to notice if it breaks something?
[19:36:56] that class basically git clones a bunch of repos from integration/*.git
[19:37:00] (^ exact duplicate of conversation yesterday just before I broke everything)
[19:37:04] na we cant
[19:37:08] cause puppet is lame
[19:37:19] removing the resource in the manifest does not remove it from the instance
[19:37:31] Well, I can remove the classes /and/ purge those directories on the hosts if that helps
[19:37:32] so we would only figure out whether it got broken when an instance is rebuilt
[19:38:13] yeah lets do that
[19:38:19] ok, here goes...
[19:38:27] should be just a few dirs under /srv/deployment/
[19:40:16] done
[19:40:40] now we just wait for alerts?
[19:41:09] Beta-Cluster-Infrastructure, Labs, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2689985 (Andrew)
[19:42:36] andrewbogott: !log it maybe, and move to the next class :]
[19:42:42] we can catch up with whatever failure later on
[19:42:46] but I guess it is going to be fine
[19:43:17] !log removed contint::slave_scripts and associated files from deployment-sca01 and deployment-sca02
[19:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[19:49:58] how long does it take for the beta puppetmaster to rebase?
[19:52:33] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:53:07] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:53:27] Browser-Tests-Infrastructure, Reading-Web-Backlog, Browser-Tests, Patch-For-Review, and 3 others: Add helper to Selenium that allows you to query whether JavaScript module has loaded - https://phabricator.wikimedia.org/T146292#2690024 (MBinder_WMF)
[19:54:07] deployment-puppetmaster syncs every 10 minutes
[19:54:16] syncs with upstream
[19:56:01] hm… thcipriani looks like I just broke something :( Do you have a minute to look?
[19:56:10] (Otherwise I can do a blind revert, but… this should have been a no-op.)
[19:56:27] here is the diff that puppet just applied:
[19:56:27] https://www.irccloud.com/pastebin/f12TVGMM/
[19:56:29] eh, I can poke real quick
[19:56:41] that is a result of https://gerrit.wikimedia.org/r/#/c/313904/
[19:56:52] which should not have changed anything since it includes all the same classes
[19:57:03] but there must be some path-specific hiera or something
[19:57:11] oh...I've actually noticed that happen before
[19:57:22] I think it may be a beta-weird-transient thing
[19:57:24] oh, maybe that oscillates naturally and is unrelated to my change?
[19:57:37] * andrewbogott runs puppet again to see if they come back
[19:57:48] was just going to suggest that
[19:59:46] hm, no change this time
[19:59:52] :(
[20:01:46] weirder yet, those values come from a class default
[20:01:53] so why would we be overriding them with ""?
[20:03:06] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[20:03:45] this must be an enc bug
[20:03:46] dammit
[20:05:50] hrm, I suppose given the current layout it's sort of surprising to me that this worked at all.
[20:06:24] how so?
[20:06:28] like they're class defaults, but not for the class that creates that script.
[20:07:18] it looks like it was always using the class defaults for scap::master wasn't it?
[20:07:25] (which should have been meaningless for beta)
[20:07:29] https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/scripts.pp#L126 vs https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/master.pp#L4-L13
[20:07:33] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[20:07:55] but the file gets created by scap::scripts I wouldn't think those vars would be in-scope
[20:08:21] oh yeah
[20:08:27] I think that those variables trickle down
[20:08:34] but maybe scap::scripts is included in multiple places?
[20:08:41] oh, possible
[20:08:41] some without those vars in the scope?
[20:08:52] I agree that it's implicit and therefore bad
[20:09:31] eh, it is in face included in 5 places according to git grep
[20:09:35] *fact
[20:10:55] so probably the actual defs of those variables is… unpredictable :(
[20:11:26] anyway, I see how to partially revert my patch to get things back the (broken) way they were...
[20:11:31] AND one of those places is beta::autoupdater
[20:11:32] do you want to open a bug about this or shall I?
[20:12:11] I can write a task a bit later that describes this.
[20:12:30] I'm running the mw train currently, don't want to get too far away from it.
[20:13:24] it has a bad habit of just rolling forward without a conductor at the... proverbial whell
[20:13:27] wheel
[20:15:41] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:15:43] thcipriani: I'm going to merge this immediately, but we should be able to revert it once that mess is sorted out, so maybe make a note when you create your task: https://gerrit.wikimedia.org/r/#/c/314068/
[20:16:03] andrewbogott: ok
[20:16:23] thcipriani: thanks for following up, and sorry about poking the beast :(
[20:16:53] heh, it needs poking for sure.
[20:17:20] Beta-Cluster-Infrastructure, Labs, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2690204 (Andrew)
[20:23:59] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[20:27:31] andrewbogott: I caught up with some mails. If I understand it the point is to migrate from wikitech to horizon and to do so only use role:: classes ?
[20:28:03] hashar: that's right
[20:28:21] the only role:: classes thing is almost already the case, I'm just mopping up the stragglers.
[20:28:24] that is like cleaning XX years of legacy stuff :D
[20:28:42] Of course each straggler turns out to be its own kind of mess
[20:29:05] on some tasks I have commented it would be nice to have a list of instance_name: [class1, class2]
[20:29:12] would save up time if it is easy to extract
[20:29:46] heck if it can be extracted with some ldap commands, we could even have a Jenkins job to run daily and mail with a progress status :D
[20:30:41] it's easy to get from ldap, although it might be paged responses (which make querying hard without the novaadmin creds)
[20:31:20] basically just a dump of 'ou=hosts,dc=wikimedia,dc=org' contains everything
[20:31:46] don't you wish we had all of that in elastic search? :D
[20:32:05] LOL
[20:32:13] Im testing that with phabricator at phab-01
[20:33:54] hashar: btw, I let mark, faidon, chasemp, and bblack know that you and/or tyler will be pinging the ops list looking for an opsen pair and timeslot for gallium (in the technology manager's meeting today during my team update)
[20:34:18] E_TOO_MANY_RECURSIONS
[20:34:20] I will confirm audible reception of said signal
[20:34:40] greg-g: I have that on my list to get out today.
[20:35:03] greg-g: yeah Tyler and I had a one hour long discussion. We refined the plan, polished a task for sos/ops list
[20:35:16] thcipriani: hashar rock, thanks
[20:35:25] and I am cowardly taking advantage of Tyler's better english to write the actual mail
[20:35:35] he does write good
[20:35:48] and he definitely convinced me to migrate after our offsite
[20:35:49] (that's a joke/bad grammar, don't do that, "he writes well" is correct)
[20:35:58] even if that delays the migration further. That is the safe thing to do
[20:35:59] :D
[20:36:01] sorry greg I only take your text literally
[20:36:07] jokes that you have to explain are the best
[20:36:09] all that fancy book lurnin
[20:36:24] greg-g: word (though I cant remember if that is the proper context to use)
[20:36:48] hashar: :)
[20:36:52] andrewbogott: might try to clean up a few classes for beta/integration. If I remember about it.
[20:37:01] thcipriani: all my best jokes are
[20:37:07] andrewbogott: the ultimate end goal is to drop OpenStackManager is it ?
[20:37:32] yeah
[20:38:21] andrewbogott: if you have a ldap search command line to use for a report, I am pretty sure I can hack some Jenkins job to craft basic report
[20:38:26] at least I can try
[20:38:32] and give up early if that is too complicated
[20:38:51] but given I am currently writing a makefile, I am not too worried about parsing ldapsearch result :D
[20:40:08] I'm really not doing anything more sophisticated than ldapsearch 'ou=hosts,dc=wikimedia,dc=org'
[20:40:23] and then grepping for puppetClass:
[20:50:49] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2690373 (thcipriani)
[20:53:49] (PS2) Hashar: (WIP) experiment with makefile (WIP) [integration/config] - https://gerrit.wikimedia.org/r/313998
[20:55:59] # numResponses: 2049
[20:55:59] # numEntries: 2048
[20:56:03] andrewbogott: off by one! :)
[20:56:26] I think one of the 'responses' is a message saying "there's more of this stuff"
[20:58:22] associatedDomain: ci-jessie-wikimedia-81876.contintcloud.eqiad.wmflabs
[20:58:37] looks like the contintcloud / nodepool instances are added in LDAP
[20:59:02] might want to hack something to drop them
[20:59:34] (PS1) Jean-Frédéric: Publish code coverage post-merge in labs/tools/heritage [integration/config] - https://gerrit.wikimedia.org/r/314171
[21:02:54] (CR) Jean-Frédéric: "To get the conversation started :) I’d like to get coverage published, but it seems no one else has wanted to do that on a Python project " [integration/config] - https://gerrit.wikimedia.org/r/314171 (owner: Jean-Frédéric)
[21:03:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:10:13] (CR) Legoktm: "Nice!" (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/314171 (owner: Jean-Frédéric)
[21:11:01] andrewbogott: any idea how many instances we have on labs ?
[21:11:03] ldapsearch -x -b 'ou=hosts,dc=wikimedia,dc=org' '(!(dc=*.contintcloud.eqiad.wmflabs))' dc puppetClass
[21:11:07] yields 829 entries
[21:11:19] sounds about right
[21:11:23] I can get an exact count, hang on
[21:11:58] nova says 696
[21:14:52] ldap has a bunch of ghost entries so :D
[21:16:14] (CR) Jean-Frédéric: Publish code coverage post-merge in labs/tools/heritage (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/314171 (owner: Jean-Frédéric)
[21:16:17] as always, yeah
[21:16:19] 490 not in contintcloud and having puppetClass
[21:16:19] ldapsearch -x -b 'ou=hosts,dc=wikimedia,dc=org' '(&(!(dc=*.contintcloud.eqiad.wmflabs))(puppetClass=*))' dc puppetClass
[21:16:21] bah
[21:16:36] I fixed some thread-unsafe stuff a few weeks ago that should cut down on leaks
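(Editor's aside: the instance_name: [class1, class2] report hashar asks for above falls out of the same ldapsearch query. A rough Python sketch; the -LLL flag and the parsing are our additions, and it naively assumes unwrapped, non-paged ldapsearch output.)

    # Fold `dc` / `puppetClass` attributes from the ldapsearch query quoted
    # above into {instance: [puppet classes]}.
    import subprocess
    from collections import defaultdict

    CMD = ["ldapsearch", "-x", "-LLL", "-b", "ou=hosts,dc=wikimedia,dc=org",
           "(&(!(dc=*.contintcloud.eqiad.wmflabs))(puppetClass=*))",
           "dc", "puppetClass"]

    def class_report():
        out = subprocess.run(CMD, capture_output=True, text=True,
                             check=True).stdout
        report, host = defaultdict(list), None
        for line in out.splitlines():
            if line.startswith("dc: "):
                host = line[len("dc: "):]
            elif line.startswith("puppetClass: ") and host:
                report[host].append(line[len("puppetClass: "):])
        return dict(report)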
[20:37:32] yeah [20:38:21] andrewbogott: if you have an ldapsearch command line to use for a report, I am pretty sure I can hack up a Jenkins job to craft a basic report [20:38:26] at least I can try [20:38:32] and give up early if that is too complicated [20:38:51] but given I am currently writing a makefile, I am not too worried about parsing ldapsearch results :D [20:40:08] I'm really not doing anything more sophisticated than ldapsearch 'ou=hosts,dc=wikimedia,dc=org' [20:40:23] and then grepping for puppetClass: [20:50:49] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2690373 (10thcipriani) [20:53:49] (03PS2) 10Hashar: (WIP) experiment with makefile (WIP) [integration/config] - 10https://gerrit.wikimedia.org/r/313998 [20:55:59] # numResponses: 2049 [20:55:59] # numEntries: 2048 [20:56:03] andrewbogott: off by one! :) [20:56:26] I think one of the 'responses' is a message saying "there's more of this stuff" [20:58:22] associatedDomain: ci-jessie-wikimedia-81876.contintcloud.eqiad.wmflabs [20:58:37] looks like the contintcloud / nodepool instances are added in LDAP [20:59:02] might want to hack something to drop them [20:59:34] (03PS1) 10Jean-Frédéric: Publish code coverage post-merge in labs/tools/heritage [integration/config] - 10https://gerrit.wikimedia.org/r/314171 [21:02:54] (03CR) 10Jean-Frédéric: "To get the conversation started :) I’d like to get coverage published, but it seems no one else has wanted to do that on a Python project " [integration/config] - 10https://gerrit.wikimedia.org/r/314171 (owner: 10Jean-Frédéric) [21:03:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:10:13] (03CR) 10Legoktm: "Nice!" (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/314171 (owner: 10Jean-Frédéric) [21:11:01] andrewbogott: any idea how many instances we have on labs? [21:11:03] ldapsearch -x -b 'ou=hosts,dc=wikimedia,dc=org' '(!(dc=*.contintcloud.eqiad.wmflabs))' dc puppetClass [21:11:07] yields 829 entries [21:11:19] sounds about right [21:11:23] I can get an exact count, hang on [21:11:58] nova says 696 [21:14:52] ldap has a bunch of ghost entries then :D [21:16:14] (03CR) 10Jean-Frédéric: Publish code coverage post-merge in labs/tools/heritage (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/314171 (owner: 10Jean-Frédéric) [21:16:17] as always, yeah [21:16:19] 490 not in contintcloud and having puppetClass [21:16:19] ldapsearch -x -b 'ou=hosts,dc=wikimedia,dc=org' '(&(!(dc=*.contintcloud.eqiad.wmflabs))(puppetClass=*))' dc puppetClass [21:16:21] bah [21:16:36] I fixed some thread-unsafe stuff a few weeks ago that should cut down on leaks [21:22:04] looks like the contintcloud / nodepool instances are added in LDAP [21:22:11] yep, all instances are added to and removed from LDAP [21:22:35] we could probably use a rule to skip contintcloud [21:22:41] submit a puppet patch [21:22:44] but maybe that is needed by nova/openstack [21:22:49] no [21:22:59] the ldap host entries are consumed by our software [21:23:13] andrewbogott, we should clean up ldap at some stage [21:23:46] andrewbogott, also, I ran the designate cleanup script before the ops offsite week; now it's time to deal with this properly [21:24:21] doesn't dealing with it properly mean not storing hosts in ldap at all? [21:24:28] or do you have something more specific/short-term in mind?
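Two details from this exchange are worth spelling out: the numResponses/numEntries "off by one" is the final search-done/result message being counted as a response, and andrewbogott's earlier worry about paged responses can be handled client-side. A sketch using the ldap3 library's paged-search helper with the same exclusion filter as above; the server hostname is a guess, and everything else mirrors the commands in the log:

# Sketch, assuming the ldap3 library; the server name is a placeholder.
from ldap3 import Server, Connection, SUBTREE

server = Server("ldap-labs.eqiad.wikimedia.org")  # hostname is an assumption
conn = Connection(server, auto_bind=True)  # anonymous bind, like ldapsearch -x

entries = conn.extend.standard.paged_search(
    search_base="ou=hosts,dc=wikimedia,dc=org",
    search_filter="(&(!(dc=*.contintcloud.eqiad.wmflabs))(puppetClass=*))",
    search_scope=SUBTREE,
    attributes=["dc", "puppetClass"],
    paged_size=500,   # stay under server-side size limits without admin creds
    generator=True,   # stream results page by page instead of buffering
)

# Count only real entries; referrals and the final result message are the
# "extra" responses that inflate numResponses past numEntries.
count = sum(1 for e in entries if e.get("type") == "searchResEntry")
print(count)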
[21:25:36] no, I was changing the subject; it has nothing to do with ldap [21:25:44] ah, ok :) [21:26:07] it just came to mind because the dns sink plugin sits right next to the ldap sink plugin [21:26:22] so you're talking about the reverse dns designate leaks? [21:27:21] yeah [21:27:39] and any potential forward dns leaks that may exist that I haven't written a script to detect yet [21:27:49] Krenair: if you have a pointer to a puppet file, maybe I can dig in the code and add a bypass for "contintcloud" [21:28:01] can probably skip the DNS as well [21:28:16] pretty sure nodepool uses the IP to connect to instances, and I am not sure having them in dns is needed [21:28:44] 2 or 3 files; it's not really clear, since these live under the per-version directories of the openstack module [21:29:12] modules/openstack/files/{kilo,liberty,mitaka}/designate/nova_ldap/base.py [21:29:37] vim -O modules/openstack/files/{kilo,liberty,mitaka}/designate/nova_ldap/base.py [21:29:40] ... [21:29:43] for dns, swap 'ldap' for 'fixed_multi' [21:35:12] ohh [21:35:17] 10Continuous-Integration-Config, 10Fundraising-Backlog, 07FR-2016-17-Q2-Bugs, 13Patch-For-Review: mediawiki/extensions/DonationInterface/vendor repo needs CI V+2 jobs - https://phabricator.wikimedia.org/T143025#2690682 (10DStrine) [21:52:49] Krenair: andrewbogott: maybe something like https://gerrit.wikimedia.org/r/314188 openstack: skip DNS update for contintcloud [21:52:50] for liberty [21:52:53] untested, obviously [21:53:04] I have no idea what 'context' contains when the event is passed [21:53:08] will want to do mitaka too [21:53:12] don't know about kilo [21:53:21] yeah [21:53:26] then I have no clue how to test it [21:54:12] hm… I can't decide if that's a good idea or a bad one :) [21:55:07] :D [21:55:14] hashar, I think it's payload['tenant_id'] [21:55:35] look at how _create gets it [21:55:38] yeah, that is in the notification event [21:55:48] but there is also a context which has a lot more info [21:56:03] but then I found various references: 'tenant', '_object_tenant', '_object_tenant_name', etc. [21:56:09] so really I have no clue [21:56:16] hashar, also this is the ldap update, not the DNS update [21:56:22] yeah, same deal [21:56:27] no [21:57:28] dns hasn't been backed by ldap for a long time now [21:58:04] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2690772 (10thcipriani) [21:58:30] what I mean is that both are similar: [21:58:40] a high-level class that announces which events it is interested in, [21:58:43] then a dispatcher, [21:58:55] and another class that does the low-level commands [21:59:20] the patch I sent short-circuits the event handling at the high level, but I have no idea what is in the context object :D [21:59:49] and we can surely copy-paste it in novamulti [22:05:42] anyway, bed time! [22:09:47] ah, yeah. they work on the same plugin system [22:10:07] oh, he's gone [22:43:01] PROBLEM - Puppet run on deployment-elastic08 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:54:16] 10Continuous-Integration-Infrastructure (phase-out-gallium), 03releng-201617-q1, 07Wikimedia-Incident: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#1199594 (10Dzahn) https://gerrit.wikimedia.org/r/313579 https://gerrit.wikimedia.org/r/313581 merged, ran puppet on gallium, gangl...
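To make the short-circuit being debated here concrete: a hypothetical sketch of the high-level bail-out, not the actual patch in https://gerrit.wikimedia.org/r/314188. The handler shape follows designate's notification-handler plugins (the kind that lives in nova_ldap/base.py), but the payload['tenant_id'] lookup is Krenair's unverified guess from the conversation, and the import path and class name are assumptions for this designate version:

# Hypothetical sketch; names and the tenant-lookup field are assumptions.
from designate.notification_handler.base import BaseAddressHandler

SKIPPED_TENANTS = {"contintcloud"}

class ContintAwareHandler(BaseAddressHandler):
    """Same event plumbing as the nova_ldap / nova_fixed_multi handlers,
    with an early bail-out for throwaway nodepool tenants."""

    def process_notification(self, context, event_type, payload):
        # Bail out before any LDAP/DNS record is touched: nodepool connects
        # to its instances by IP, so they do not need host records at all.
        if payload.get("tenant_id") in SKIPPED_TENANTS:
            return
        # Otherwise: the usual _create()/_delete() dispatch that the real
        # handler performs for compute.instance.*.end events (elided here).

Because nova_ldap and nova_fixed_multi share this plugin structure, the same guard could, as hashar says, be copy-pasted into the DNS-side handler once the tenant field is confirmed against a live notification.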
[22:57:50] 10Beta-Cluster-Infrastructure, 10Flow, 06Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), 13Patch-For-Review, 07Performance: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#2691120 (10jmatazzoni) [23:23:03] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]