[00:05:45] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:45:44] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:54:40] Hi
[00:54:57] does anyone know if (and how) one can trigger manual runs of a (browsertest) job?
[00:55:23] nvm, got it
[02:13:11] Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #146: FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/146/
[02:45:30] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[03:02:36] What is wrong with shinken...
[03:08:46] Project selenium-Wikibase » firefox,test,Linux,contintLabsSlave && UbuntuTrusty build #100: FAILURE in 2 hr 13 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/100/
[03:10:29] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:39] Yippee, build fixed!
[04:18:40] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #134: FIXED in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/134/
[04:35:14] !log depooled integration-slave-jessie-1005 in jenkins so I can test puppet stuff on it
[04:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[04:59:56] !log added Krenair to integration project to help debug puppet stuff
[05:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[05:11:09] PROBLEM - Puppet run on integration-slave-jessie-1005 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[05:22:02] legoktm@integration-slave-jessie-1005:~$ PHP_BIN=php7.0 php --version
[05:22:02] PHP 7.0.10-1+0~20160829102714.10+jessie~1.gbpd58428 (cli) ( NTS )
[05:22:06] ta-daaaaa
[05:25:58] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/308918/ onto integration-puppetmaster with a hack that has it only apply to integration-slave-jessie-1005
[05:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[05:26:10] RECOVERY - Puppet run on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:41:39] Continuous-Integration-Infrastructure: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786#2613662 (Legoktm) Precise LTS support ends in April 2017. MediaWiki 1.23 goes EOL in May 2017 (last version to support 5.3). If the labs team is okay with us havin...
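That PHP_BIN trick at [05:22:02] works because the slaves put a small wrapper named php ahead of the real interpreters on PATH. The wrapper itself is not shown in this log, so the following is only a minimal sketch of the idea, and the default interpreter name is a guess:

```bash
#!/bin/bash
# Hypothetical php dispatch shim, not the actual slave-scripts wrapper:
# exec whatever interpreter PHP_BIN names, defaulting to the system PHP.
exec "${PHP_BIN:-php5}" "$@"
```

With something like that on PATH, `PHP_BIN=php7.0 php --version` behaves exactly as in the paste above.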
[05:49:29] (PS1) Legoktm: Get rid of manual PHP_BIN handling [integration/jenkins] - https://gerrit.wikimedia.org/r/308931
[05:50:22] (PS2) Legoktm: Get rid of manual PHP_BIN handling [integration/jenkins] - https://gerrit.wikimedia.org/r/308931
[05:54:01] (PS3) Legoktm: Publish Doxygen and jsduck documentation for Kartographer [integration/config] - https://gerrit.wikimedia.org/r/299697 (https://phabricator.wikimedia.org/T140657) (owner: MaxSem)
[05:54:03] (PS1) Legoktm: Add mwext-doxygen-publish generic job [integration/config] - https://gerrit.wikimedia.org/r/308932
[05:55:26] (CR) Legoktm: [C: 2] Add mwext-doxygen-publish generic job [integration/config] - https://gerrit.wikimedia.org/r/308932 (owner: Legoktm)
[05:55:32] (CR) Legoktm: [C: 2] Publish Doxygen and jsduck documentation for Kartographer [integration/config] - https://gerrit.wikimedia.org/r/299697 (https://phabricator.wikimedia.org/T140657) (owner: MaxSem)
[05:56:08] (Merged) jenkins-bot: Add mwext-doxygen-publish generic job [integration/config] - https://gerrit.wikimedia.org/r/308932 (owner: Legoktm)
[05:56:10] (Merged) jenkins-bot: Publish Doxygen and jsduck documentation for Kartographer [integration/config] - https://gerrit.wikimedia.org/r/299697 (https://phabricator.wikimedia.org/T140657) (owner: MaxSem)
[05:57:03] !log deploying https://gerrit.wikimedia.org/r/308932 https://gerrit.wikimedia.org/r/299697
[05:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[06:08:27] (PS1) Legoktm: doc: Fix alpha-sort of MediaWiki extensions [integration/docroot] - https://gerrit.wikimedia.org/r/308934
[06:08:29] (PS1) Legoktm: doc: Add Kartographer extension [integration/docroot] - https://gerrit.wikimedia.org/r/308935 (https://phabricator.wikimedia.org/T140657)
[06:09:13] (CR) Legoktm: [C: 2] doc: Fix alpha-sort of MediaWiki extensions [integration/docroot] - https://gerrit.wikimedia.org/r/308934 (owner: Legoktm)
[06:09:16] (CR) Legoktm: [C: 2] doc: Add Kartographer extension [integration/docroot] - https://gerrit.wikimedia.org/r/308935 (https://phabricator.wikimedia.org/T140657) (owner: Legoktm)
[06:09:31] (Merged) jenkins-bot: doc: Fix alpha-sort of MediaWiki extensions [integration/docroot] - https://gerrit.wikimedia.org/r/308934 (owner: Legoktm)
[06:09:36] (Merged) jenkins-bot: doc: Add Kartographer extension [integration/docroot] - https://gerrit.wikimedia.org/r/308935 (https://phabricator.wikimedia.org/T140657) (owner: Legoktm)
[07:23:15] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[07:43:16] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2613749 (Paladox) @legoktm hi, how would we manage to support php 5.6, and 7 on Jessie, or are we going to use php 7 instead of php 5.6?
[08:03:13] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[08:53:39] Continuous-Integration-Infrastructure, Patch-For-Review, Puppet, Technical-Debt, Zuul: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2613906 (hashar) p: Triage>Normal a: hashar As part of migrating CI from gallium to a new host, I...
[08:55:50] Continuous-Integration-Infrastructure, Patch-For-Review, Puppet, Technical-Debt, Zuul: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2613923 (Paladox) @hashar I can adjust but I don't know how I can run puppet without it deleting the /var/ww...
[09:25:58] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[09:35:14] (PS2) Hashar: Revert "Move `rake` jobs off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306723
[09:35:16] (PS2) Hashar: Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724
[09:35:58] (CR) jenkins-bot: [V: -1] Revert "Move `rake` jobs off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306723 (owner: Hashar)
[09:36:22] (CR) jenkins-bot: [V: -1] Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724 (owner: Hashar)
[09:38:11] (PS3) Hashar: Revert "Move `rake` jobs off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306723
[09:38:13] (PS3) Hashar: Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724
[09:40:43] (CR) Hashar: [C: 2] Revert "Move `rake` jobs off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306723 (owner: Hashar)
[09:40:45] (CR) Hashar: [C: 2] Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724 (owner: Hashar)
[09:41:22] !log Moving rake jobs back to Nodepool ( T143938 ) with https://gerrit.wikimedia.org/r/#/c/306723/ and https://gerrit.wikimedia.org/r/#/c/306724/
[09:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:41:52] (Merged) jenkins-bot: Revert "Move `rake` jobs off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306723 (owner: Hashar)
[09:42:30] (CR) Hashar: [C: 2] Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724 (owner: Hashar)
[09:42:50] Gerrit: Project access history links broken - https://phabricator.wikimedia.org/T120658#2614057 (Paladox) Open>declined I'm declining this based on the fact that this is not going to work for the parent mediawiki project.
[09:44:00] (Merged) jenkins-bot: Revert "rake: Fix bundle install path" [integration/config] - https://gerrit.wikimedia.org/r/306724 (owner: Hashar)
[09:50:23] Continuous-Integration-Infrastructure, Nodepool, Patch-For-Review: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2614078 (hashar) I have moved the rake and oojs-ui-rake jobs to Nodepool. Validated them by hitting `recheck` on a couple dummy changes: | https://gerrit.wikimed...
[09:51:12] Continuous-Integration-Infrastructure, Nodepool, Patch-For-Review: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2614082 (hashar)
[10:00:57] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:19:19] (PS2) Hashar: Revert "Move npm-node-4 off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306722
[10:19:36] (PS3) Hashar: Revert "Move npm-node-4 off of nodepool" [integration/config] - https://gerrit.wikimedia.org/r/306722 (https://phabricator.wikimedia.org/T143938)
[10:25:57] Gerrit, grrrit-wm, Patch-For-Review: Merges of l10n updates by Jenkins should not be reported by grrrit-wm - https://phabricator.wikimedia.org/T93082#1128887 (Paladox)
[10:37:13] !log beta: cleaning up salt-keys on deployment-salt02. Bunch of instances got deleted
[10:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[10:55:13] !log integration: dropped PHP7 cherry pick from puppet master. https://gerrit.wikimedia.org/r/#/c/308918/ has been merged. Pushing it to the fleet of permanent Jessie slaves. T144872
[10:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:03:09] !log integration: cherry pick https://gerrit.wikimedia.org/r/#/c/308955/ "contint: prefer our bin/php alternative" T144872
[11:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:04:49] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2614187 (hashar) p: Triage>Normal
[11:41:53] Gerrit, grrrit-wm, Upstream: Patchsets created through web interface attributed to the wrong user - https://phabricator.wikimedia.org/T141329#2614232 (Paladox) Yes.
[11:53:57] !log Force refreshing Nodepool jessie snapshot to get PHP7 included T144872
[11:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:57:06] bah
[11:57:07] Error: Could not find dependency Apt::Repository[sury-php] for Package[php7.0-cli] at /puppet/modules/contint/manifests/packages/php.pp:44
[12:01:16] Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #137: FAILURE in 15 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/137/
[12:09:02] (PS1) Hashar: dib: include contint::packages::apt [integration/config] - https://gerrit.wikimedia.org/r/308960 (https://phabricator.wikimedia.org/T144872)
[12:09:46] (CR) Hashar: [C: 2] dib: include contint::packages::apt [integration/config] - https://gerrit.wikimedia.org/r/308960 (https://phabricator.wikimedia.org/T144872) (owner: Hashar)
[12:10:28] (Merged) jenkins-bot: dib: include contint::packages::apt [integration/config] - https://gerrit.wikimedia.org/r/308960 (https://phabricator.wikimedia.org/T144872) (owner: Hashar)
[12:20:30] (PS1) Hashar: dib: add 'apt' class [integration/config] - https://gerrit.wikimedia.org/r/308962
[12:21:00] (CR) Hashar: [C: 2] dib: add 'apt' class [integration/config] - https://gerrit.wikimedia.org/r/308962 (owner: Hashar)
[12:21:58] (Merged) jenkins-bot: dib: add 'apt' class [integration/config] - https://gerrit.wikimedia.org/r/308962 (owner: Hashar)
[12:31:38] (PS1) Hashar: dib: apt-get update before php7 [integration/config] - https://gerrit.wikimedia.org/r/308963
[12:31:57] (CR) Hashar: [C: 2] dib: apt-get update before php7 [integration/config] - https://gerrit.wikimedia.org/r/308963 (owner: Hashar)
[12:32:58] (Merged) jenkins-bot: dib: apt-get update before php7 [integration/config] - https://gerrit.wikimedia.org/r/308963 (owner: Hashar)
[12:36:19] hey zeljkof, i may have found a bug in either the jenkins setup in mwext-mw-selenium or in mediawiki-ruby-api
[12:36:43] here's an example of a related build failure: https://integration.wikimedia.org/ci/job/mwext-mw-selenium/9801/consoleFull
[12:37:30] Mediawiki::Client#get_wikitext is opening $MEDIAWIKI_URL/w/index.php
[12:38:00] but jenkins is setting $MEDIAWIKI_URL to ".../w/index.php/" anyway
[12:39:13] phuedx: huh
[12:39:23] can you create a task in phab, please?
[12:39:30] yup
[12:39:57] was wondering if i'd messed something up and you'd tell me in 2s flat ;)
[12:40:16] phuedx: it's strange
[12:40:20] did something change recently?
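For readers following the MEDIAWIKI_URL exchange above: the suspected failure mode is plain string concatenation duplicating the /w/index.php path. The base URL below is a made-up example, not the job's real value:

```bash
MEDIAWIKI_URL='http://example.beta.wmflabs.org/w/index.php/'   # what Jenkins is said to set
echo "${MEDIAWIKI_URL}w/index.php"                             # what the client then opens
# -> http://example.beta.wmflabs.org/w/index.php/w/index.php
```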
[12:40:38] i introduced the get_wikitext call
[12:40:48] in the change that's being tested
[12:41:48] hm, I think we are using it elsewhere, it should not fail
[12:44:43] (PS1) Hashar: dib: swap puppet dep ordering [integration/config] - https://gerrit.wikimedia.org/r/308965
[12:44:59] (CR) Hashar: [C: 2] dib: swap puppet dep ordering [integration/config] - https://gerrit.wikimedia.org/r/308965 (owner: Hashar)
[12:45:32] (Merged) jenkins-bot: dib: swap puppet dep ordering [integration/config] - https://gerrit.wikimedia.org/r/308965 (owner: Hashar)
[12:47:14] Browser-Tests-Infrastructure, Jenkins, Ruby: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614350 (phuedx)
[12:47:35] ^ zeljkof: hopefully that's explanatory
[12:48:36] Browser-Tests-Infrastructure, Jenkins, Ruby: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614375 (zeljkofilipin) a: zeljkofilipin
[12:48:47] Browser-Tests-Infrastructure, Jenkins, Ruby: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614350 (zeljkofilipin) p: Triage>Normal
[12:50:23] Browser-Tests-Infrastructure, Jenkins, Ruby, User-zeljkofilipin: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614350 (zeljkofilipin)
[12:52:35] Browser-Tests-Infrastructure, Jenkins, Ruby, User-zeljkofilipin: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614385 (phuedx)
[12:52:40] phuedx: thanks for the task, I will take a look, but probably tomorrow, I am in the middle of something else now
[12:53:13] Browser-Tests-Infrastructure, Jenkins, Ruby, User-zeljkofilipin: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2614350 (phuedx) ^ I mustn't disregard the possibility (strong likelihood) that I've done something wrong.
[12:53:17] zeljkof: no worries
[12:56:58] (PS1) Hashar: dib: an extra dependency for php7 class [integration/config] - https://gerrit.wikimedia.org/r/308969
[13:02:37] (PS1) Hashar: dib: drop puppet chain, use serial execution [integration/config] - https://gerrit.wikimedia.org/r/308972
[13:07:56] (PS1) Hashar: dib: require_package('apt-transport-https') [integration/config] - https://gerrit.wikimedia.org/r/308974
[13:13:51] !log Image ci-jessie-wikimedia-1473253681 in wmflabs-eqiad is ready, has php7 packages. T144872
[13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:13:58] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2614420 (hashar) I had to do a few other puppet tweaks in integration/config to get the sury apt repo to be updated before the PHP7 package resources get realized. The Jessie Nodepo...
[13:28:58] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2614474 (faidon) Yes, this is in line with what I've previously said and it sounds fine with me. This is really no...
[13:30:11] Continuous-Integration-Infrastructure: Install php7 and the php-ast extension so etsy/phan can be run from jenkins - https://phabricator.wikimedia.org/T132636#2614487 (hashar)
[13:30:13] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2614486 (hashar) Open>Resolved
[13:38:27] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2614527 (hashar)
[14:00:00] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2614563 (hashar) The /usr/bin/php is wrong when provisioning images. The alternatives::install for php was run before the php7 packages, which in turn override it. Had to rebuil...
[15:17:53] hashar, hey
[15:18:00] what host did you build the nodepool image on?
[15:19:21] Krenair: home machine
[15:19:26] ah
[15:19:36] which is really just Jessie + libvirt-tools + diskimage-builder python software
[15:19:36] iirc
[15:19:44] I got one on labs at one point but it got deleted
[15:19:46] I just ask because I was looking at integration tasks and found T126613
[15:19:58] once the base image has been uploaded to labs, we can get nodepool to refresh it
[15:20:23] with something like: (get fix merged in puppet or integration/config) then: nodepool update wmflabs-eqiad ci-jessie-wikimedia
[15:20:36] that spawns an instance out of the image, runs puppet and then snapshots the instance
[15:20:41] the snapshot is then used to boot the instances being consumed
[15:20:45] Continuous-Integration-Infrastructure: Give @mobrovac access to CI instances - https://phabricator.wikimedia.org/T129880#2118616 (Krenair) The blocked task is resolved... still want this @mobrovac?
[15:24:10] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2614974 (hashar) Looping #netops. We would need contint1001 to be moved to the public network with
[15:24:36] Krenair: yeah integration-dev was the instance
[15:24:38] FYI: https://integration.wikimedia.org/ci/job/mwext-qunit-composer-jessie/1793/console
[15:24:50] "The requested PHP extension ext-mbstring * is missing from your system. Install or enable PHP's mbstring extension." etc.
[15:26:14] Continuous-Integration-Infrastructure, WorkType-Maintenance: Rebuild integration-dev (instance to build images) - https://phabricator.wikimedia.org/T126613#2614994 (hashar) Open>declined No longer needed. To build an image one would need libvirt-tools and diskimage-builder then run the shell scrip...
[15:26:19] Krenair: I have closed it. The doc is on http://wikitech.wikimedia.org/wiki/Nodepool
[15:27:12] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #150: FAILURE in 5 min 11 sec: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/150/
[15:27:15] ok
[15:27:36] one less task, thanks! ;)
[15:29:42] yay
[15:35:28] thcipriani: nodepool dead :(
[15:35:36] can't reach the openstack api
[15:35:37] wat!?
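Re the ext-mbstring failure Krinkle pastes at [15:24:50]: a quick way to check whether a given interpreter actually has the extension loaded. The binary names follow the versions discussed in this channel; the job's real PATH setup is an assumption:

```bash
# list loaded modules for each candidate interpreter
php -m | grep -i mbstring || echo 'mbstring missing from default php'
php7.0 -m | grep -i mbstring || echo 'mbstring missing from php7.0'
```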
[15:35:39] labnet1002.eqiad.wmnet: no route to host
[15:35:44] from labnodepool1001.eqiad.wmnet
[15:36:06] * hashar files a task first
[15:36:42] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:37:02] hashar, what about labnet1001?
[15:37:32] Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: labnet1002.eqiad.wmnet: no route to host - https://phabricator.wikimedia.org/T144945#2615093 (hashar)
[15:37:34] pretty sure the active one is 1001
[15:38:13] Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: labnet1002.eqiad.wmnet: no route to host - https://phabricator.wikimedia.org/T144945#2615108 (hashar)
[15:39:02] Krenair: well the keystone entry point seems to disagree / be out of date
[15:39:11] huh, ok
[15:39:57] hashar, I can ping it from inside labs...
[15:40:05] wonder why you can't from labnodepool
[15:40:08] Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: labnet1002.eqiad.wmnet: no route to host - https://phabricator.wikimedia.org/T144945#2615093 (hashar) Steps to reproduce: ``` ssh labnodepool1001.eqiad.wmnet user@labnodepool1001:~$ become-nodepool nodepool@labnodepool1001:~$ openst...
[15:40:32] andrewbogott: ^ keystone directory may still say labnet1002 fyi
[15:41:08] Yeah, I'm rebuilding it
[15:41:10] oh!
[15:41:19] yeah, I bet I broke the nova api, sorry. One moment...
[15:41:28] seems something does not gracefully fail over
[15:43:50] at least nodepool is not hammering the api as fast as it can. It has a 60-second timeout on api queries
[15:44:06] seems to be enough to throttle nodepool requests
[15:45:02] andrewbogott: it is back \O/
[15:45:10] good :)
[15:45:34] * andrewbogott updates the labnet failover docs
[15:45:47] I am monitoring nodepool
[15:47:05] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:48:12] Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: labnet1002.eqiad.wmnet: no route to host - https://phabricator.wikimedia.org/T144945#2615165 (hashar) Open>Resolved a: Andrew labnet1002 is in maintenance but the failover did not update Keystone. The openstack CLI tool on...
[15:50:39] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure: Ensure /srv/deployment/integration/slave-scripts is latest master on deployment-tin - https://phabricator.wikimedia.org/T97324#2615186 (Krenair)
[15:50:40] hashar: thank you for the follow up on php7 :)
[15:50:52] legoktm: that gave me a bunch of headaches :D
[15:50:58] legoktm, did you see my ping re integration-puppetmaster?
[15:51:08] thanks to puppet, but php7 should be on both permanent and nodepool slaves now
[15:51:44] Krenair: yes, I assume we need to move it to jessie somehow?
[15:51:53] yes
[15:52:07] or just not-precise really
[15:52:18] bah nodepool is off
[15:52:28] it has four instances flagged for deletion
[15:52:42] but somehow the requests are made to the old labnet1001 entry point
[15:52:46] so they fail
[15:58:15] !log Restarting Nodepool. Lost state when labnet got moved T144945
[15:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:09:13] !log Nodepool back in action. Had to manually delete some instances in labs
[16:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[16:11:26] Continuous-Integration-Infrastructure: Move integration-puppetmaster off of precise (probably to jessie) - https://phabricator.wikimedia.org/T144951#2615279 (Legoktm)
[16:11:44] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:45] Krenair: ^ is there a task we should block on or depend upon?
[16:22:05] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:25:35] legoktm, asking
[16:29:06] legoktm, maybe not
[16:31:22] Continuous-Integration-Infrastructure: Install php7 and the php-ast extension so etsy/phan can be run from jenkins - https://phabricator.wikimedia.org/T132636#2615347 (Legoktm) We have PHP7 now, the external repo we're using doesn't have php-ast yet, so we need to ask them to include it, or package/build it...
[16:47:11] Release-Engineering-Team, DBA, MediaWiki-Maintenance-scripts, Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2615428 (greg)
[16:47:23] hashar: is the process for getting puppet updates to nodepool images documented somewhere? I'd like to try it (have a new puppet patch to push)
[16:48:06] legoktm: sure! https://wikitech.wikimedia.org/wiki/Nodepool#Manually_generate_a_new_snapshot
[16:48:22] which asks nodepool to create a new snapshot instance based on a stock image
[16:48:47] it boots the stock image (ci-jessie-wikimedia), runs the setup_node.sh script in it (as root)
[16:48:59] which really just runs puppet apply with the manifests in integration/config
[16:49:05] then if all goes fine, takes a snapshot of that instance
[16:49:13] and then uses that to boot new instances
[16:49:32] okay, my patch is https://gerrit.wikimedia.org/r/309039
[16:49:36] gotta get the puppet patch merged in puppet.git first though (there is no cherry-picking system)
[16:49:47] aha
[16:50:15] with europe, ops are usually quite fast at merging them at least
[16:50:27] and in case of emergency one can update the ciimage.pp in integration/config
[16:50:29] merge that
[16:50:36] and refresh the image
[16:50:40] then later upstream the bit from ciimage.pp to puppet.git
[16:51:21] oh
[16:51:35] legoktm: once the snapshot has been refreshed, the instances that got spawned with the previous one are still around
[16:51:43] so gotta wait for them to be consumed or manually delete the old instances
[16:51:51] via: nodepool delete
[16:52:06] then nodepool will refill the pool with instances based on the last snapshot
[16:55:35] I am off!
[16:55:55] o/
[17:06:14] * paladox watching apple event ;)
[17:06:18] :)
[17:07:35] Continuous-Integration-Infrastructure: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786#2615478 (greg) >>! In T103786#2613662, @Legoktm wrote: > Precise LTS support ends in April 2017. MediaWiki 1.23 goes EOL in May 2017 (last version to support 5.3)....
[17:10:30] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2615494 (Legoktm) >>! In T144872#2613749, @Paladox wrote: > @legoktm hi, how would we manage to support php 5.6, and 7 on Jessie, or are we going to use php 7 instead of php 5.6?...
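Condensing hashar's walkthrough above into one place. Only the two nodepool invocations are quoted from the log; the comments are paraphrase and <instance-id> is a placeholder:

```bash
# 1. get the puppet change merged into puppet.git (no cherry-pick system for Nodepool)
# 2. rebuild the snapshot: boots the stock ci-jessie-wikimedia image, runs
#    setup_node.sh (a puppet apply of the integration/config manifests), then snapshots
nodepool update wmflabs-eqiad ci-jessie-wikimedia
# 3. instances spawned from the previous snapshot linger until consumed;
#    deleting them makes nodepool refill the pool from the new snapshot
nodepool delete <instance-id>
```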
[17:12:18] Continuous-Integration-Infrastructure: Install PHP5.5 on jessie CI instances - https://phabricator.wikimedia.org/T144959#2615500 (Legoktm)
[17:12:26] Continuous-Integration-Infrastructure: Install PHP5.5 on jessie CI instances - https://phabricator.wikimedia.org/T144959#2615514 (Legoktm) p: Triage>Low
[17:13:06] legoktm we can probably update integration/config now for php7
[17:13:28] probably create a basic composer test that uses php7? Just as a test to make sure things work
[17:14:54] hold on, I'm still filing tasks xD
[17:15:47] legoktm Oh, so we can't do that yet? or just wait for you to finish filing tasks before doing it?
[17:15:48] :)
[17:16:17] Continuous-Integration-Config: Create composer-php70 job - https://phabricator.wikimedia.org/T144961#2615547 (Legoktm)
[17:16:33] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2615561 (Legoktm)
[17:17:12] LOL, they brought mario to the app store :)
[17:17:47] Continuous-Integration-Config: Create composer-php70 job - https://phabricator.wikimedia.org/T144961#2615547 (Paladox) We can do this now :)
[17:19:15] Continuous-Integration-Config: Run MediaWiki tests on PHP 7 - https://phabricator.wikimedia.org/T144962#2615580 (Legoktm)
[17:19:25] Continuous-Integration-Config: Run MediaWiki tests on PHP 7 - https://phabricator.wikimedia.org/T144962#2615592 (Legoktm)
[17:19:27] Continuous-Integration-Infrastructure, Patch-For-Review: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2615593 (Legoktm)
[17:19:54] paladox: do you want to create the composer job? otherwise I can do it later today
[17:19:55] Continuous-Integration-Infrastructure: Support PHP 7 in CI infra - https://phabricator.wikimedia.org/T144872#2615612 (Paladox)
[17:20:22] Continuous-Integration-Infrastructure: Tracking php7 support in ci - https://phabricator.wikimedia.org/T144964#2615613 (Paladox)
[17:20:54] Continuous-Integration-Config: Run MediaWiki tests on PHP 7 - https://phabricator.wikimedia.org/T144962#2615629 (Paladox)
[17:20:56] Continuous-Integration-Config: Create composer-php70 job - https://phabricator.wikimedia.org/T144961#2615630 (Paladox)
[17:21:01] legoktm yeh i can do that
[17:21:51] legoktm do i do in parameter_functions
[17:21:52] if 'php7' in job.name:
[17:21:52] params['PHP_BIN'] = 'php7'
[17:21:59] legoktm also don't forget about php7.1
[17:22:08] let's name it "php70"
[17:22:10] Ok
[17:22:21] legoktm so do i do params['PHP_BIN'] = 'php70'
[17:22:23] ?
[17:22:36] and PHP_BIN needs to be set to php7.0
[17:22:45] ah
[17:22:46] thanks
[17:25:00] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615667 (RobH) So I can handle the vlan move and reimage. Just to confirm there is no data that is c...
[17:28:25] so hashar isn't about but can anyone else in release engineering confirm that contint1001 doesn't house data and i can reimage per https://phabricator.wikimedia.org/T140257
[17:28:27] ?
[17:28:55] i'm 99.99% sure it's not but I don't want to just do it, i'm paranoid.
[17:28:58] greg-g thcipriani ^^
[17:29:19] ostriches ^^
[17:30:16] I think it's fine, since it was supposed to be allocated as an emergency replacement but then not used. I expect all the data is just OS and puppet data, easily lost.
[17:30:16] robh: that is correct, nothing important on that box. It was rebuilt in an emergency and now we just need to reimage it.
[17:30:21] awesome
[17:30:27] I'll reimage it for you guys on the right vlan now
[17:30:29] thx!
[17:30:37] awesome! Thank you!
[17:31:04] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops, Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615745 (RobH) a: RobH checked in release engineering, it's cool for me to reimage this now (after...
[17:31:06] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Improve reminders for teams/people to address identified actionables from incident reports - https://phabricator.wikimedia.org/T141287#2615747 (greg)
[17:32:59] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Improve reminders for teams/people to address identified actionables from incident reports - https://phabricator.wikimedia.org/T141287#2493130 (greg)
[17:33:15] thanks much robh
[17:34:35] is there a preferred partitioning?
[17:34:47] or is the typical raid1 ext4 /srv mount fine?
[17:35:00] it's now in there as ci-master.cfg, which is a new one to me.
[17:35:01] (PS1) Paladox: Add support for php7.0 [integration/config] - https://gerrit.wikimedia.org/r/309048 (https://phabricator.wikimedia.org/T144961)
[17:35:09] legoktm ^^
[17:35:15] :)
[17:35:22] (PS2) Paladox: Add support for php7.0 [integration/config] - https://gerrit.wikimedia.org/r/309048 (https://phabricator.wikimedia.org/T144961)
[17:35:55] i can leave it as the ci-master recipe, but it's very non-standard compared to all the other recipes
[17:36:04] as it has a large /var, but perhaps needed here?
[17:36:42] 8gb swap (eww, we typically don't use swap because we just throw in enough ram), 50gb /, then 100g /srv and 250g /var
[17:36:49] thcipriani ^^
[17:36:50] is what it's set for on ci-master.cfg
[17:37:51] Release-Engineering-Team, User-greg: Ping tasks in #wikimedia-incident without recent activity for follow-up near the end of FY1617Q1 - https://phabricator.wikimedia.org/T144973#2615805 (greg)
[17:38:29] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Plan how to improve reminders for teams/people to address identified actionables from incident reports - https://phabricator.wikimedia.org/T141287#2493130 (greg)
[17:38:41] i'll likely leave it on the ci-master thing if there is no answer, assuming this will become said master since it replaces the existing ones.
[17:38:44] Release-Engineering-Team, User-greg: Ping tasks in #wikimedia-incident without recent activity for follow-up near the end of FY1617Q1 - https://phabricator.wikimedia.org/T144973#2615831 (greg)
[17:38:46] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Plan how to improve reminders for teams/people to address identified actionables from incident reports - https://phabricator.wikimedia.org/T141287#2493130 (greg)
[17:38:53] robh: I don't think that the ci-master.cfg is correct...looking at how gallium is partitioned and what you're describing
[17:39:12] ok, so basically gallium is my template for partitioning, that makes more sense =]
[17:39:19] cool, i'd rather eliminate odd partitioning
[17:39:29] paladox: looks good, will merge/deploy later
[17:39:40] cool, yeah, I'm unclear where we used ci-master.cfg :\
[17:39:45] Release-Engineering-Team, User-greg: Ping tasks in #wikimedia-incident without recent activity for follow-up near the end of FY1617Q1 - https://phabricator.wikimedia.org/T144973#2615805 (greg)
[17:39:47] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Plan how to improve reminders for teams/people to address identified actionables from incident reports - https://phabricator.wikimedia.org/T141287#2493130 (greg) Open>Resolved a: greg With the retitling and documentin...
[17:39:49] legoktm ok thanks :)
[17:40:12] thcipriani: it seems it was only used for the emergency reinstall of contint1001 a while back, so glad to eliminate it.
[17:58:49] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2615947 (RobH)
[18:07:42] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:23:28] Browser-Tests-Infrastructure, Release-Engineering-Team, MediaWiki-extensions-Examples, Documentation, and 3 others: Improve documentation around running/writing (with lots of examples) browser tests - https://phabricator.wikimedia.org/T108108#1512435 (greg) https://www.mediawiki.org/wiki/Selenium...
[18:35:44] thcipriani: about?
[18:35:59] chasemp: ish, doing a SWAT deploy now
[18:36:01] I think the jobs hashar reverted this morning have some long running components
[18:36:02] https://graphite.wikimedia.org/render/?width=2064&height=1100&_salt=1472565940.394&target=nodepool.job.*jessie*.runtime.mean&hideLegend=false&from=-2h&lineMode=connected
[18:36:14] k
[18:36:16] not urgent
[18:36:17] just fyi https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(zuul.pipeline.gate-and-submit.label.ci*.wait_time.mean)&hideLegend=false&lineMode=connected&from=-6h
[18:36:30] I'm not entirely clear on what jobs were reverted in practical terms
[18:36:41] but I think that's them causing delay here atm
[18:42:00] Continuous-Integration-Infrastructure: Tracking php7 support in ci - https://phabricator.wikimedia.org/T144964#2615613 (hashar) Please hold from adding jobs on all MediaWiki repos, and specially mediawiki/core. We do not have enough capacity to run them :-) Though that would be fine in experimental.
[18:42:40] Continuous-Integration-Infrastructure: Tracking php7 support in ci - https://phabricator.wikimedia.org/T144964#2616200 (Paladox) Ok, but don't we have two Jessie instances we could use?
[18:42:46] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:47:31] (CR) Hashar: "JJB def looks fine. We do not have enough capacity on labs now to have the php70 jobs triggered automatically so "experimental" sounds s" [integration/config] - https://gerrit.wikimedia.org/r/309048 (https://phabricator.wikimedia.org/T144961) (owner: Paladox)
[18:49:20] chasemp: wowza, yeah, I'm not sure what jobs have all been reverted either. Could the maintenance where we couldn't communicate with openstack account for some of this? Or is the timing wrong?
[18:49:27] * thcipriani looks through backscroll
[18:49:42] I think this (really recent backlog) is way later
[18:50:30] yeah, looks like
[18:50:38] (CR) Hashar: [C: 1] "Sounds sane. One less legacy thingy to deal with!" (1 comment) [integration/jenkins] - https://gerrit.wikimedia.org/r/308931 (owner: Legoktm)
[18:54:18] hrm, I don't see any changes to config that happened around that time...
[18:54:33] thcipriani: no I think it was the reverts from this morning lying in wait for actual work etc
[18:54:40] iiuc the jobs here https://graphite.wikimedia.org/render/?width=2064&height=1100&_salt=1472565940.394&target=nodepool.job.*jessie*.runtime.mean&hideLegend=false&from=-2h&lineMode=connected
[18:54:54] that are long running and recent are those
[18:55:03] https://phabricator.wikimedia.org/T143938#2614056
[18:55:33] PROBLEM - Puppet run on deployment-mediawiki02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:58:21] sounds true: more jobs, same pool available, wait times are longer
[18:59:04] plus during SWAT there were a ton of patches I was stuck behind when I tried to recheck one. Waiting on rake-jessie, IIRC
[19:00:03] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops, Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2616279 (hashar) I confirm the server content on contint1001.eqiad.wmnet can be...
[19:00:06] I bet that's what it was
[19:00:12] I pinged you about a huge backlog right as you were doing swat
[19:00:19] and neither of us immediately connects those dots
[19:00:22] heh
[19:00:24] :D
[19:00:50] and I bet a lot of that work was tox and linting (moved over today)
[19:00:54] iiuc
[19:01:41] (PS3) Paladox: Add support for php7.0 [integration/config] - https://gerrit.wikimedia.org/r/309048 (https://phabricator.wikimedia.org/T144961)
[19:01:47] robh: if you are going to reimage contint1001.eqiad.wmnet, we will want to drop the roles in puppet before it's moved to the new vlan/reimaged https://phabricator.wikimedia.org/T140257
[19:02:08] robh: else puppet might well spawn a new Zuul and an empty Jenkins
[19:03:14] chasemp: I have moved the rake jobs to Nodepool. It is only a few builds
[19:03:52] could be just jobs already reverted before today that aligned w/ the swat
[19:04:16] been running with other meetings most of my afternoon. So I haven't moved anything else.
[19:05:10] Tyler mentioned to me the spike from yesterday. I haven't really looked into it
[19:05:13] sure I mean even last week or last few days, something not triggered until x happens
[19:05:18] where x is something from swat possibly
[19:05:42] something triggering?
[19:05:54] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:07:36] just had a big spike in wait time for jobs. Lots of patches came in to the test queue all at the same time. Triggered big wait times.
[19:08:31] : https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle%28zuul.pipeline.gate-and-submit.label.ci*.wait_time.mean%29&hideLegend=false&lineMode=connected&from=-6h
[19:08:56] we were talking about what exactly those jobs are etc
[19:10:00] ah in gate
[19:10:10] the gate also has a window to limit the jobs it is triggering
[19:11:13] does that mean wait-times can be artificially high no matter how idle nodepool is?
[19:11:30] for the gate-and-submit yeah
[19:11:40] if there are like 100 patches that get a CR+2 and put in the queue
[19:12:00] only a handful of them actually get processed; the others are idling, pending the changes ahead in the queue being complete
[19:12:45] where is the limit defined?
[19:14:47] doc is http://docs.openstack.org/infra/zuul/zuul.html?highlight=window
[19:14:52] we have window-floor: 12
[19:15:01] defined in integration/config.git /zuul/layout.yaml
[19:15:38] that is a rate limiting system for the number of jobs triggered by the pipeline
[19:15:42] with a default of 20
[19:15:47] so for long wait-time spikes how do we know which is the culprit?
[19:15:55] is there a metric for jobs blocked by this?
[19:16:00] and window-floor is to have the minimum of jobs running be 12
[19:16:04] not that I know
[19:16:22] (I should have thought about that one, sorry :( )
[19:16:30] let me check the sources
[19:17:00] at the moment it seems like we are blind as far as tuning nodepool goes if there is a secondary higher-up mechanism inflating wait-times randomly
[19:17:21] or periodically anyway when it would be the most illustrative
[19:18:43] from a quick check of the sources, the "window" is not reported to statsd
[19:19:00] but it is log.debug and there are a few messages in gallium.wikimedia.org /var/log/zuul/debug.log
[19:20:02] 2016-09-07 18:18:14 window size increased to 22>
[19:20:14] 2016-09-07 18:18:15 window size decreased to 12
[19:20:33] 2016-09-07 18:31:26 window size increased to 13>
[19:22:34] 2016-09-07 18:18:15,150 INFO zuul.DependentPipelineManager: Reported change status: all-succeeded: True, merged: False
[19:22:35] 2016-09-07 18:18:15,150 DEBUG zuul.DependentPipelineManager: Reported change failed tests or failed to merge
[19:22:42] that is what triggered the window decrease
[19:22:49] (all of that from /var/log/zuul/debug.log)
[19:23:24] caused by https://gerrit.wikimedia.org/r/#/c/309059/ force merged
[19:23:58] why do we use this windowing behavior?
[19:24:00] zuul had the tests completed, then tried to --submit the change
[19:24:09] since the change got closed due to the forced merge, it considered the change failed to merge
[19:24:30] thus triggered a window decrease to the window-floor value
[19:24:43] then cancels all the jobs, rebuilds the queue and reprocesses
[19:27:59] How do I give someone the 'advanced' option to create tasks in phab?
[19:28:05] Continuous-Integration-Infrastructure: Move integration-puppetmaster off of precise (probably to jessie) - https://phabricator.wikimedia.org/T144951#2615279 (yuvipanda) we're considering not supporting precise puppetmasters that are self-hosted on labs soon, so yes please :)
[19:28:07] I need to give Pam (new buyer) the ability to create s4 tasks
[19:29:27] robh: spaces and advanced are kind of perpendicular to each other so to speak, you want to add this person to the relevant acl I think?
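To make the window discussion above concrete: the window size is not exported to statsd, but it is logged, so it can be watched on gallium. The log path and messages are as quoted at [19:19:00]-[19:20:33]; the grep pattern is the only invented part:

```bash
# watch zuul's gate window move on gallium
grep 'window size' /var/log/zuul/debug.log | tail -n 5
# 2016-09-07 18:18:14 ... window size increased to 22
# 2016-09-07 18:18:15 ... window size decreased to 12
```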
[19:29:30] chasemp: the window flavor is always enabled
[19:29:49] hashar: it says it can be disabled
[19:29:50] A value of 0 disables rate limiting on the DependentPipelineManager. Default: 20.
[19:30:00] oh
[19:30:33] RECOVERY - Puppet run on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:30:58] from the commit that added window-floor https://gerrit.wikimedia.org/r/#/c/199169/
[19:31:09] that was to ensure that at least 12 jobs can run
[19:31:27] but I think window-floor is only relevant if windowing is enabled at all
[19:31:29] right?
[19:31:31] else in case of multiple failures it would go down to 3 which is typically not enough (a mediawiki core job is 10-12 jobs iirc)
[19:31:38] if even
[19:36:15] is beta having problems?
[19:37:05] mobrovac: have you checked https://logstash-beta.wmflabs.org/ > mediawiki-errors ?
[19:41:20] Continuous-Integration-Infrastructure: Move integration-puppetmaster off of precise (probably to jessie) - https://phabricator.wikimedia.org/T144951#2616415 (Legoktm) Okay, I'll try and do this later in the week. Notes from IRC: * create new puppetmaster instance on jessie, with same puppet roles * copy over...
[19:45:52] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:56:41] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:00:01] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-ORES, User-Ladsgroup: User contribs seems to be empty when ores enabled - https://phabricator.wikimedia.org/T144999#2616468 (Ladsgroup)
[20:06:41] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T142855#2616509 (Jdforrester-WMF)
[20:07:46] mobrovac, we are investigating the missing text in #wikimedia-dev.
[20:08:00] kk thnx matt_flaschen
[20:09:24] hashar, I am trying to run sync-dir manually on Beta Cluster to test a revert, but it just hangs.
[20:09:26] Any thoughts?
[20:09:27] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-ORES, User-Ladsgroup: User contribs seems to be empty when ores enabled - https://phabricator.wikimedia.org/T144999#2616541 (Ladsgroup) I guess that's old data in wikidata and English Wikipedia in beta cluster.
[20:10:58] matt_flaschen: no idea
[20:11:19] matt_flaschen: remember that Jenkins updates all repos every 10 minutes and then triggers a scap. That will override your hack
[20:11:42] so you will want to disable the job https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/
[20:11:52] (which does the update / reset --hard)
[20:11:53] hashar, well, if scap isn't working it's a moot point.
[20:12:00] hashar, but you're right, thank you.
[20:12:04] then you can trigger the scap either manually or via https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/
[20:12:26] hashar, if I trigger it in the web UI, will it change which code is checked out?
[20:13:03] it just runs "scap sync"
[20:13:11] hashar, okay, cool, thank you.
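A recap of the beta live-revert procedure hashar lays out above. The first and last steps happen in the Jenkins UI; "scap sync" is the only command quoted in the log, and passing it a log message is an assumption:

```bash
# 1. disable https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/
#    so the 10-minute update / reset --hard doesn't clobber the local change
# 2. apply the revert on the deployment host
# 3. push it out (this is all the beta-scap-eqiad job runs):
scap sync 'testing live revert'
# 4. re-enable beta-code-update-eqiad when done
```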
[20:13:14] you will get a bunch of logs via the console https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/119020/console :D
[20:14:06] Project beta-code-update-eqiad build #120474: ABORTED in 1 min 5 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/120474/
[20:14:52] !Temporarily disabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ to test live revert of aa0f6ea
[20:14:55] !log Temporarily disabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ to test live revert of aa0f6ea
[20:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:21:39] hashar, it's hanging there too, maybe at the same place: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/119021/console
[20:22:23] Never mind, just slow.
[20:26:04] Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2616610 (hashar)
[20:27:20] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2616631 (hashar)
[20:27:23] Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (hashar) stalled>Open contint1001 has been moved to the production public network with fqdn c...
[20:30:43] !log Updated security group for contintcloud and integration labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
[20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:31:37] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:35:30] !log Updated security group for deployment-prep labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
[20:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:44:15] !log Re-enabled beta-code-update-eqiad.
[20:44:20] Thanks, hashar
[20:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:46:29] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:50:54] greg-g: I've claimed https://phabricator.wikimedia.org/T141985
[20:53:01] robh: noticed you got contint1001 a new vlan/ip and even got it reimaged! impressive :) (ping thcipriani )
[20:53:39] thcipriani: new hostname is contint1001.wikimedia.org with IP 208.80.154.17 . I have sent a few patches to puppet to adjust the conf
[20:53:48] nice
[20:54:13] hashar: yep, signed the puppet keys but not salt yet
[20:54:24] got distracted with other stuff, will do so now and it's ready for you guys to take over
[20:54:43] :)
[20:54:53] robh: do not enable puppet on it though
[20:55:06] oh, i already had, and ran an initial run
[20:55:10] it has nothing but base
[20:55:14] oh
[20:55:16] as it has no site.pp entries, does it need to start over?
[20:55:19] ah yeah site.pp does not match it
[20:55:23] will update site.pp
[20:55:35] yeah, the old site.pp entry was for eqiad.wmnet
[20:55:46] so i assumed it was ok to run puppet and sign so you can implement as needed via site.pp
[20:56:00] yeah definitely
[20:56:03] you are way more careful than me!
[20:56:21] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops, Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2616802 (RobH)
[20:57:12] Continuous-Integration-Infrastructure (phase-out-gallium), Operations, hardware-requests, netops, Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2458291 (RobH) a: RobH>hashar contint1001.wikimedia.org is online with p...
[20:57:18] it's all yours now =]
[21:01:50] robh: if you feel adventurous, the updated site.pp is https://gerrit.wikimedia.org/r/#/c/309069/
[21:02:07] else will loop it with Europe ops tomorrow
[21:08:58] hashar: yeah rebasing and will merge and run no problem
[21:09:09] robh: awesome. This way site.pp is clean
[21:09:48] Krinkle: ty sir
[21:12:49] so contint1001.timer.start()
[21:13:03] :)
[21:14:25] waiting on auto verification
[21:14:27] =P
[21:14:42] CI is slow to approve the patch to start the process of making a new ci server.
[21:14:55] yeah
[21:15:08] https://integration.wikimedia.org/zuul/ there is a bunch of -trusty and -jessie jobs waiting for a node to be available
[21:15:20] I guess we should probably detangle all mw jobs.
[21:15:42] Demand from gearman: ci-jessie-wikimedia: 6 ci-trusty-wikimedia: 4 (jobs pending)
[21:16:03] PROBLEM - Puppet run on deployment-conf03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:16:29] it does not garbage collect fast enough
[21:18:20] it's merged now
[21:19:37] thcipriani: chasemp: the same issue as yesterday reproduced just now!
[21:20:01] basically got a bunch of jobs running with the pool being fully occupied
[21:20:20] Release-Engineering-Team, DBA, MediaWiki-Maintenance-scripts, Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2616900 (greg) I'm also a big +1 on having those long running maint sc...
[21:20:34] with up to 7 requests to delete servers (which free up the pool and allow spawning new instances)
[21:20:56] and at the given rate, that takes a while to reclaim all those slots
[21:21:18] then eventually once the deleted nodes got freed, a surge of new ones are created
[21:21:28] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:22:46] PROBLEM - Puppet run on deployment-mx is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:23:00] so whenever the pool has consumed all nodes, it takes a good minute and a half to delete all the nodes
[21:23:12] on labnodepool, a good view would be: grep 'wmflabs-eqiad running task' /var/log/nodepool/debug.log
[21:24:54] hashar i guess openstack never ran into that since they have 1000+ instances, plus their list in zuul is long
[21:25:21] so they don't really mind the bad performance, so maybe it is a bug openstack haven't seen yet?
[21:26:13] paladox: all queries made to the cloud provider are serialized in a queue (eg processed one by one) every X seconds
[21:26:36] yep
[21:26:46] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[21:28:02] that grep does give an interesting view of the log
[21:28:30] the queue: X
[21:28:48] is the number of any tasks in the queue (I thought it was for a given task / with shared queues)
[21:29:00] yeah, listservertask, createservertask, deleteservertask
[21:29:17] so queue: 8 --> 8 tasks waiting, can be anything
[21:29:02] and I checked the code, the queue size is not sent to statsd
[21:31:08] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2616944 (hashar) contint1001 now has the default set of rules from contint::firewall....
[21:31:45] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[21:32:00] thcipriani: the ferm rules are now enabled on contint1001. So gotta connect to it via the bastions
[21:32:22] :)
[21:32:30] (if only ferm could set descriptions on the iptables rules...)
[21:32:53] (Question, will you be doing zuul and jenkins tomorrow on contint1001?)
[21:34:01] a very long oneliner that could probably all be done with awk: grep 'wmflabs-eqiad running task' /var/log/nodepool/debug.log | cut -d ':' -f5 | tr -d ')' | awk '{sum+=$1} END { print "Average = ",sum/NR}'
[21:34:23] seems to average around 3 tasks in the queue at any time.
[21:35:03] beautiful
[21:36:32] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2616971 (hashar)
[21:37:50] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (hashar)
[21:40:16] thcipriani: feel free to puppetize the one-liner! nodepool-queue.sh !
[21:42:01] checked a few network flows https://phabricator.wikimedia.org/T137323
[21:42:07] +1, I love me a good one-line .sh (I have a few)
[21:42:21] (well, two lines, #! first)
[21:43:00] I puppetized the sudo command line to become the "nodepool" user
[21:43:03] become-nodepool
[21:43:19] inspired by toollabs, turns out to be a huge time saver
[21:43:29] (I checked a few network flows https://phabricator.wikimedia.org/T137323 )
[21:43:37] iridium I have no idea if I have access to it
[21:43:53] some others are pending the puppet classes for jenkins and zuul which add the ferm rules
[21:43:58] twentyafterfour has access to ^^
[21:44:30] anyway sleep time
[21:44:42] hashar: you don't: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Access_list
[21:44:45] ;)
[21:44:52] will check with ops tomorrow to enable backup and maybe load Jenkins data to contint1001
[21:45:06] greg-g: nice table!
[21:45:14] hashar: copy/pasted from you!
[21:45:17] g'night!
[21:45:22] o really
[21:45:30] hashar: nice table!
[21:45:30] hashar: thank you
[21:45:34] :)
[21:47:16] sleep well folks!
[21:47:23] you too, hashar
[21:47:31] oh Platonides !!!
[21:48:03] * Platonides waves :)
[21:48:06] Platonides: are you back / active again???
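Following up on thcipriani's note at [21:34:01] that the pipeline could be a single awk call, here is one way to collapse it, assuming the same log format the cut/tr version relies on (the queue depth in the fifth colon-separated field, trailed by a closing parenthesis):

```bash
awk -F: '/wmflabs-eqiad running task/ { gsub(/\)/, "", $5); sum += $5; n++ }
         END { if (n) print "Average =", sum / n }' /var/log/nodepool/debug.log
```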
[21:49:15] Platonides: whenever you are there, poke me so we can chat a bit :D For now I really need sleep time, sorry
[21:50:05] np
[21:54:05] hashar we could probably get your zuul change merged tomorrow?
[21:54:28] the one that cleans it up and supports hiera
[21:55:49] maybe
[21:55:52] ok
[21:55:53] gotta discuss them with ops
[21:55:56] ok
[21:56:00] cya
[21:56:00] can I help?
[21:56:14] * twentyafterfour got pinged
[21:56:56] twentyafterfour sorry, i did it when hashar was going through https://phabricator.wikimedia.org/T137323
[21:57:08] and said he couldn't test iridium since he didn't have access
[21:58:53] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323 (mmodell) ``` twentyafterfour@iridium:~$ telnet 208.80.154.17 4730 Trying 208.8...
[21:58:58] connection refused
[21:59:29] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2617146 (mmodell)
[22:02:44] RECOVERY - Puppet run on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0]
[22:02:52] Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) - https://phabricator.wikimedia.org/T137323#2617183 (hashar) Yup the service is not enabled (that is the Gearman server embedded in...
[22:13:14] twentyafterfour: uhhh, your update submodules commit just got linked to a bunch of irrelevant tasks
[22:14:02] [15:11:23] MediaWiki-General-or-Unknown: Change user namespace from usuário to wikipedista in portuguese wikipedia - https://phabricator.wikimedia.org/T11587#2617200 (mmodell) declined>Resolved
[22:19:46] twentyafterfour: ????? stop
[22:20:42] this happens every time...
[22:21:45] Oh phabricator's repo needs switching to notify off
[22:23:03] I assume he'll clean up his mess, but it would be nice to not create it in the first place, given how many times it's happened in the past.
[22:24:45] legoktm: ?
[22:25:24] twentyafterfour the upstream branch you pulled in linked to other tasks and closed them with statuses they shouldn't have
[22:25:29] see also -devtools
[22:25:50] autoclose should be disabled
[22:27:07] Yep, i think it may be phabricator deployment
[22:27:12] twentyafterfour ^^
[22:27:24] https://phabricator.wikimedia.org/diffusion/PHDEP/manage/actions/
[22:33:43] freakin gerrit decided to quote the commit messages from submodule changes that got merged
[22:33:56] * twentyafterfour did not make this commit https://phabricator.wikimedia.org/rPHDEP81dc55d04fec633ae3958e9849a43429468208aa
[22:36:00] LOL
[23:06:55] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:15:42] Continuous-Integration-Infrastructure, Release-Engineering-Team: doc.wikimedia.org should be running PHP 5.5+, not 5.3 -> demos etc. don't work - https://phabricator.wikimedia.org/T127504#2045488 (Krinkle) Moving integration.wikimedia.org is harder due to the Zuul and Jenkins proxies. {F4099754 size=ful...
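Once the Gearman server embedded in Zuul is enabled on contint1001 (see hashar's comment at [22:02:52] above), twentyafterfour's telnet check from [21:58:53] should succeed; Gearman also answers a plain-text status command on the same port. The nc flags and this usage are assumptions, not from the log:

```bash
# list registered gearman functions once the service is up; 4730 as quoted above
(echo status; sleep 1) | nc -w 2 208.80.154.17 4730
```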
[23:37:14] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T142855#2617850 (Ladsgroup)
[23:46:54] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:21] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T142855#2617920 (Jdforrester-WMF)