[00:02:45] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2577999 (Dzahn) Resolved>Open Is this the same thing as "contint: Java 8 on Jessie slaves" ?
[00:07:46] RECOVERY - Puppet run on integration-slave-jessie-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:06:49] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #118: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/118/
[07:05:59] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2578246 (hashar) >>! In T138506#2577999, @Dzahn wrote: > Is this the same thing as "contint: Java 8 on Jessie slaves" ? Ye...
[07:59:07] Yippee, build fixed!
[07:59:08] Project selenium-Core » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #131: FIXED in 6 min 27 sec: https://integration.wikimedia.org/ci/job/selenium-Core/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/131/
[08:05:10] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 587 bytes in 0.002 second response time
[08:06:18] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.m.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 589 bytes in 0.002 second response time
[08:22:19] PROBLEM - Puppet run on deployment-salt02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[08:39:50] Beta-Cluster-Infrastructure, Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578340 (hashar) That is really neat @AlexMonk-WMF ! Is there anything left to do?
[08:45:36] Beta-Cluster-Infrastructure, Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578342 (AlexMonk-WMF) Convince someone with ops rights to merge the patch
[08:47:37] PROBLEM - SSH on deployment-salt02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:52:32] RECOVERY - SSH on deployment-salt02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:53:20] Continuous-Integration-Infrastructure, Operations, puppet-compiler, Patch-For-Review: OSError: [Errno 28] No space left on device on compiler02.puppet3-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T143671#2578346 (fgiunchedi) yup @greg ! I've proposed a patch to cleanup old compilation...
[08:55:58] Continuous-Integration-Infrastructure, Puppet: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2578347 (hashar)
[08:58:36] PROBLEM - SSH on deployment-salt02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:17:30] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578374 (zeljkofilipin) Tried again. ``` $ ls screenshots/...
[09:18:30] RECOVERY - SSH on deployment-salt02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[09:19:40] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578377 (zeljkofilipin) I am doing something wrong, but not...
[09:19:57] hashar: do you know what am I doing wrong here? :| https://phabricator.wikimedia.org/T143655#2578377
[09:20:50] zeljkof: no clue ? :D
[09:21:00] hashar: thanks, same here
[09:21:16] or that reqId:V71kiwpEEaoAAEaiEmoAAAAF does not exist ?
[09:21:16] I am not at all familiar with kibana, so I thought it was me
[09:21:23] or was made later / earlier ?
[09:21:26] hashar: but it does
[09:21:43] hm, timing might be important
[09:21:43] it was created a few minutes ago
[09:21:55] yeah that is the date/time filter
[09:22:03] change it to last 7 days and it find something
[09:22:41] hashar: thanks, figured it out
[09:22:46] https://logstash-beta.wmflabs.org/goto/b2d8adc1f8c4ec2b673a5a1fdc30e2c6
[09:22:56] https://logstash-beta.wmflabs.org/goto/0c416dc6114efee609ed469aa4ded6c3
[09:23:19] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2575006 (hashar) Date filter prevent that reqId from being s...
[09:23:33] Could not acquire lock for "mwstore://local-multiwrite/local-public/d/de/VisualEditor_category_item-hr.png".
[09:23:42] something is off with the filebackend somehow
[09:23:50] then I have absolutely no idea how it works
[09:23:58] nor which file it points to (should be swift)
[09:24:13] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578387 (zeljkofilipin) Thanks @hashar, did not notice time...
[09:24:16] then it is marked as local-multiwrite which would indicate local file system
[09:24:46] hashar: I will leave it to anomie, looks like he knows what to do
[09:25:33] that filebackend local-multiwrite use as a backend local-swift-eqiad
[09:25:38] which is definitely swift
[09:25:51] Release-Engineering-Team, releng-201617-q1, User-greg: Perform a technical debt analysis of software and services maintained by WMF Release Engineering - https://phabricator.wikimedia.org/T138225#2578389 (zeljkofilipin)
[09:25:58] Release-Engineering-Team, User-greg: Plan basic logistics of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T134830#2578390 (zeljkofilipin)
[09:26:05] Release-Engineering-Team, User-greg: Determine location of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T137721#2578391 (zeljkofilipin)
[09:26:25] with Redis to manage locks
[09:26:55] Browser-Tests-Infrastructure, Continuous-Integration-Config, Wikidata, Patch-For-Review, User-zeljkofilipin: Ownership for wikidata/wikibase selenium tests - https://phabricator.wikimedia.org/T143309#2578397 (zeljkofilipin) @Tobi_WMDE_SW do you accept ownership of wikidata browser tests?
[09:26:57] if you look at operations/mediawiki-config.git that is configured in wmf-config/fileback-labs.php
[09:27:18] the lock manager is configured line 157 via $wgLockManagers[]
[09:27:27] 'lockServers' => $wmfMasterServices['redis_lock'],
[09:28:00] which from wmf-config/LabsServices.php yields:
[09:28:01] 'redis_lock' => [
[09:28:02] 'rdb1' => '10.68.16.177', // deployment-redis01.deployment-prep.eqiad.wmflabs
[09:28:02] 'rdb2' => '10.68.16.231', // deployment-redis02.deployment-prep.eqiad.wmflabs
[09:28:04] ],
[09:28:08] so maybe one of those redis server has some issue
[09:28:31] hashar: it might be causing T142600 too
[09:28:35] or be somehow related
[09:33:51] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578414 (hashar) On beta in `wmf-config/fileback-labs.php` t...
[09:34:32] zeljkof: redis has some issue I believe https://logstash-beta.wmflabs.org/goto/12f5196d7094fae3e85a4dd1d35a0c76
[09:35:43] uh oh
[09:38:10] !log deployment-redis02 initctl stop redis-instance-tcp_6379 && initctl start redis-instance-tcp_6379 | That did not fix it magically though T143655
[09:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:39:06] so
[09:39:16] zeljkof: on deployment-redis02
[09:39:19] we have several redis instances running in parallel
[09:39:31] you can see them with: initctl list|grep redis
[09:39:42] the one that is borked is redis-instance-tcp_6379
[09:39:57] if you ask for the status of it with: initctl status redis-instance-tcp_6379
[09:40:02] you will see the pid changing over and over
[09:40:08] indicating it keep dieing / being respawned
[09:40:29] to find logs: find /var/log -name '*6379*'
[09:40:36] eg /var/log/redis/tcp_6379.log
[09:40:38] Browser-Tests-Infrastructure, Continuous-Integration-Config, Wikidata, Patch-For-Review, User-zeljkofilipin: Ownership for wikidata/wikibase selenium tests - https://phabricator.wikimedia.org/T143309#2578453 (Tobi_WMDE_SW) @zeljkofilipin sorry for the delay - I was out of office for some time...
[09:43:03] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578461 (hashar) On deployment-redis02: ``` name=/var/log/r...
[09:43:19] Browser-Tests-Infrastructure, Continuous-Integration-Config, Wikidata, Patch-For-Review, User-zeljkofilipin: Ownership for wikidata/wikibase selenium tests - https://phabricator.wikimedia.org/T143309#2563730 (zeljkofilipin) a:Tobi_WMDE_SW>zeljkofilipin Great, I will take care of it.
[09:43:33] Browser-Tests-Infrastructure, Continuous-Integration-Config, Wikidata, Patch-For-Review, User-zeljkofilipin: Ownership for wikidata/wikibase selenium tests - https://phabricator.wikimedia.org/T143309#2578464 (zeljkofilipin) p:Triage>Normal
[09:43:51] !log T143655 stopping redis 6379 on deployment-redis02 : initctl stop redis-instance-tcp_6379
[09:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:48:51] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578477 (hashar) Found the config in /etc/redis/tcp_6379.con...
[09:50:33] !log deployment-redis02 fixed AOF file /srv/redis/deployment-redis02-6379.aof and restarted the redis instance should fix T143655 and might help T142600
[09:50:36] zeljkof: ^^^^
[09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:50:53] zeljkof: retest both bugs ?
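The respawn check described in the conversation (asking `initctl status redis-instance-tcp_6379` repeatedly and seeing the pid change) can be sketched as a small script. This is only an illustration, not what was actually run on deployment-redis02: the helper names `extract_pid` and `is_respawning` are hypothetical, the sample status lines and PIDs are made up, and the line format is assumed to be Upstart's usual `<job> start/running, process <pid>`.

```python
import re

# Assumed Upstart status format: "<job> start/running, process <pid>"
PID_PATTERN = re.compile(r"process (\d+)")


def extract_pid(status_line):
    """Return the PID found in one status line, or None if no process is listed."""
    match = PID_PATTERN.search(status_line)
    return int(match.group(1)) if match else None


def is_respawning(status_lines):
    """Return True if successive status samples report more than one distinct PID."""
    pids = {extract_pid(line) for line in status_lines}
    pids.discard(None)
    return len(pids) > 1


# Hypothetical samples, as if taken a few seconds apart.
flapping = [
    "redis-instance-tcp_6379 start/running, process 4211",
    "redis-instance-tcp_6379 start/running, process 4230",
    "redis-instance-tcp_6379 start/running, process 4252",
]
stable = [
    "redis-instance-tcp_6379 start/running, process 1042",
    "redis-instance-tcp_6379 start/running, process 1042",
]

print(is_respawning(flapping))  # True
print(is_respawning(stable))    # False
```

A changing PID only shows that the job is being restarted; the reason still has to come from the instance log, for example /var/log/redis/tcp_6379.log as mentioned above.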
[09:51:16] hashar: in the middle of something else, will retest as soon as am I done with this
[09:51:20] thanks a lot!
[09:51:41] zeljkof: the tldr is that some redis instance had corrupted data file which I have fixed
[09:56:12] Scap3, scap, Operations, User-mobrovac: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2578491 (Joe) a:Joe
[10:02:50] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2578515 (zeljkofilipin) Open>Resolved a:zeljkofil...
[10:27:35] Continuous-Integration-Infrastructure, Operations, Jenkins, Patch-For-Review, Wikimedia-Incident: Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected - https://phabricator.wikimedia.org/T126552#2578554 (hashar) p:Normal>High Would need someone famil...
[10:27:38] hashar: redis fix ftw! so far so good https://integration.wikimedia.org/ci/view/Selenium/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/132/console
[10:28:46] zeljkof: neat
[10:28:46] Yippee, build fixed!
[10:28:47] Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #132: FIXED in 3 min 56 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/132/
[10:28:51] zeljkof: recheck the other task as well :}
[10:30:48] (PS3) Urbanecm: Whitelist Urbanecm e-mail adresss [integration/config] - https://gerrit.wikimedia.org/r/302314
[10:30:57] (PS4) Hashar: Whitelist Urbanecm e-mail adresss [integration/config] - https://gerrit.wikimedia.org/r/302314 (owner: Urbanecm)
[10:32:12] (PS5) Hashar: Whitelist Urbanecm e-mail adresss [integration/config] - https://gerrit.wikimedia.org/r/302314 (owner: Urbanecm)
[10:32:37] (CR) Hashar: [C: 2] "Updated commit message and escaped the dots '.' in the regex." [integration/config] - https://gerrit.wikimedia.org/r/302314 (owner: Urbanecm)
[10:33:57] (Merged) jenkins-bot: Whitelist Urbanecm e-mail adresss [integration/config] - https://gerrit.wikimedia.org/r/302314 (owner: Urbanecm)
[10:34:22] (CR) Hashar: "Deployed :)" [integration/config] - https://gerrit.wikimedia.org/r/302314 (owner: Urbanecm)
[10:35:00] hashar: the other task is resolved
[10:35:12] I am uploading images to beta commons right now
[10:39:36] great
[10:46:30] Yippee, build fixed!
[10:46:30] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #119: FIXED in 10 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/119/
[10:57:04] (PS1) Hashar: debian-glue: let us override the build timeout [integration/config] - https://gerrit.wikimedia.org/r/306414 (https://phabricator.wikimedia.org/T143546)
[10:57:26] Continuous-Integration-Config, Patch-For-Review: Make debian-glue job timeout configurable - https://phabricator.wikimedia.org/T143546#2578621 (hashar) p:Triage>Normal a:hashar
[10:58:43] (CR) Hashar: [C: 2] debian-glue: let us override the build timeout [integration/config] - https://gerrit.wikimedia.org/r/306414 (https://phabricator.wikimedia.org/T143546) (owner: Hashar)
[10:58:49] Yippee, build fixed!
[10:58:49] Project selenium-MultimediaViewer » chrome,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #119: FIXED in 23 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/119/
[10:59:49] (Merged) jenkins-bot: debian-glue: let us override the build timeout [integration/config] - https://gerrit.wikimedia.org/r/306414 (https://phabricator.wikimedia.org/T143546) (owner: Hashar)
[11:05:38] (PS1) Hashar: Fix BUILD_TIMEOUT token extension in debian-glue [integration/config] - https://gerrit.wikimedia.org/r/306416 (https://phabricator.wikimedia.org/T143546)
[11:06:56] Continuous-Integration-Config, Patch-For-Review: Make debian-glue job timeout configurable - https://phabricator.wikimedia.org/T143546#2578636 (hashar) I have added a BUILD_TIMEOUT parameter to both debian-glue jobs with a default value of 30 (minutes). Made Zuul to inject 180 (minutes) for `operations...
[11:07:38] (CR) Hashar: [C: 2] Fix BUILD_TIMEOUT token extension in debian-glue [integration/config] - https://gerrit.wikimedia.org/r/306416 (https://phabricator.wikimedia.org/T143546) (owner: Hashar)
[11:07:56] Thanks hashar
[11:08:07] Lets wait for few hours now :)
[11:08:15] kart__: hopefully that will do it with 3 hours timeout
[11:08:24] (Merged) jenkins-bot: Fix BUILD_TIMEOUT token extension in debian-glue [integration/config] - https://gerrit.wikimedia.org/r/306416 (https://phabricator.wikimedia.org/T143546) (owner: Hashar)
[11:08:48] kart__: there might be way to speed up the compilation using some kind of cache or distributed building
[11:09:41] hashar: I have used ccache in the past, but not recently.
[11:09:49] Let me check with upstream.
[11:10:13] also there is only one CPU being used with htwolcpre3
[11:10:25] when the instances have two cpu
[11:10:44] maybe there is not much way to parallelize
[11:11:37] Continuous-Integration-Config, Patch-For-Review: Make debian-glue job timeout configurable - https://phabricator.wikimedia.org/T143546#2578652 (hashar) IF https://integration.wikimedia.org/ci/job/debian-glue/562/console manages to complete, we can mark this task as solved.
[11:12:47] Continuous-Integration-Infrastructure, Patch-For-Review: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2578654 (hashar) I got it installed manually on the permanent Jessie slaves and indeed that is now pending upload to apt.wikimed...
[11:13:25] Continuous-Integration-Infrastructure, Operations, Patch-For-Review: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2578655 (hashar) Need #operations to publish jenkins-debian-glue packages to apt.wikimedia.org T141114#2488638
[11:13:38] Continuous-Integration-Infrastructure, Operations: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2578657 (hashar) p:Triage>Normal
[11:25:02] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, DBA, MediaWiki-Database, WorkType-NewFunctionality: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#2578701 (Nikerabbit)
[11:37:41] Release-Engineering-Team, releng-201617-q1, User-greg: Perform a technical debt analysis of software and services maintained by WMF Release Engineering - https://phabricator.wikimedia.org/T138225#2578712 (hashar) From our weekly team meeting, @dduvall mentioned we had a tendency to start new things w...
[11:39:23] Beta-Cluster-Infrastructure, Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578713 (hashar) So that is now pending review / merge of https://gerrit.wikimedia.org/r/247587 //beta: Use Let's Encrypt cert// which is already on beta cluster.
[11:47:05] Continuous-Integration-Config, Patch-For-Review: Make debian-glue job timeout configurable - https://phabricator.wikimedia.org/T143546#2578716 (hashar) Note that the Jenkins slaves have two CPU and run only a single build. I have noticed the operations/debs/contenttranslation/giella-sme package only use...
[11:47:36] PROBLEM - SSH on deployment-salt02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:51:26] PROBLEM - Puppet staleness on deployment-changeprop is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[11:52:32] RECOVERY - SSH on deployment-salt02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[12:19:43] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:21:25] RECOVERY - Puppet staleness on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [3600.0]
[12:24:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 44818 bytes in 1.444 second response time
[12:34:45] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2578810 (hashar) Looked at the patch, we need both java 7 and java 8 on the Jessie slaves.
[12:53:20] hashar: as far as I can see, MediawikiApi::LoginError error is no longer with us!!!!11!!1!
[12:53:48] we can wait a day so all the tests run, and then resolve it
[12:56:22] (CR) Ejegg: "How frustrating! So patches in this repo need to be submitted by a releng person within hours or minutes of being uploaded?" [integration/config] - https://gerrit.wikimedia.org/r/301025 (https://phabricator.wikimedia.org/T141309) (owner: Awight)
[12:56:35] Browser-Tests-Infrastructure, Reading-Web-Backlog, MW-1.28-release-notes, Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2578842 (zeljkofilipin) As far as I can see, this is resolved when @hashar fixed redis....
[12:57:06] zeljkof: try manually triggering the few jobs known to fail?
[12:57:10] but yeah that would fix a bunch of issues
[12:57:12] hashar: done
[12:57:26] inspected the failures, no more api login failures
[12:58:04] hashar: one patch for swat today
[12:58:26] will join the hangout in a minute, have to go outside, my son is still sleeping in my office :)
[13:28:21] hashar: we've success right for build timeout?
[13:28:37] no clue havent checked the build result
[13:32:06] https://integration.wikimedia.org/ci/job/debian-glue/562/consoleFull - says OK.
[13:32:31] Let me do recheck on giella-sme again.
[13:32:43] 561 was failure.
[13:48:36] PROBLEM - SSH on deployment-salt02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:26] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[13:53:34] RECOVERY - SSH on deployment-salt02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:58:40] kart__: https://integration.wikimedia.org/ci/job/debian-glue/561/consoleFull is me failing
[13:58:46] BUILD_TIMEOUT was improperly set
[13:58:50] got it fixed in the job config
[13:59:02] https://integration.wikimedia.org/ci/job/debian-glue/562/consoleFull is passing so that is all good to me :)
[13:59:19] might be worth looking at having some build steps to be run in parallel
[13:59:24] dh build has --parallel
[13:59:33] but looking at the process list there was a make -j1
[13:59:43] maybe dh enforces -j1 somehow
[14:00:09] would have to ask operations
[14:06:44] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[14:20:37] PROBLEM - SSH on deployment-salt02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:24] hashar: looking at debian/rules again.
[14:36:31] Release-Engineering-Team (Long-Lived-Branches), Scap3: Create `scap swat` command to automate patch merging & testing during a swat deployment - https://phabricator.wikimedia.org/T142880#2579083 (mmodell)
[14:41:19] godog: gentle poke: puppet has been disabled on deployment-imagescaler01.deployment-prep.eqiad.wmflabs for 6 days with "Puppet is disabled. filippo" as the reason.
[14:41:30] Browser-Tests-Infrastructure, VisualEditor, VisualEditor-MediaWiki, Patch-For-Review, User-zeljkofilipin: Fix font support on SauceLabs VE screenshots - https://phabricator.wikimedia.org/T141369#2579085 (zeljkofilipin) @Esanders @Amire80 @Elitre: Please take a look at the screenshots and let...
[14:42:06] bd808: gah, sorry about that, I'm reenabling it now
[14:42:31] no worries! I just got on a kick of paying attention this week :)
[14:43:02] heheh
[14:43:03] {{done}}
[14:44:04] while we're here, I'd like to try prometheus for beta too if there no objections/concerns, in practice it means having role::prometheus::node_exporter on all machines to start with and one machine with role::prometheus::labs_project for the server
[14:45:01] seems reasonable to me. Will you need a new VM or will it report to the one that Tools is using?
[14:45:37] I don't think that anyone is in love with icinga in beta cluster
[14:46:39] also we've stopped putting hearts in icinga messages some time ago
[14:47:03] it can use an existing vm with enough ram/disk I'd say
[14:47:20] but no generally each project gets its own prometheus server
[14:50:50] *nod* K.renair spent some time putting the project on a diet so finding somewhere to co-locate might be nice. deployment-fluorine02 might work for it.
[14:51:15] that's a new jessie vm that just does udp2log aggregation
[14:51:37] lots of disk, maybe light on ram/cpu
[14:53:24] RECOVERY - Puppet staleness on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:58:19] thanks bd808 I'll check that
[15:02:54] hashar: jenkins strangly failed this job: https://integration.wikimedia.org/ci/job/mwext-testextension-php55-composer/4430/console I did "recheck" and it seems ok-ish but just wanted to let you know this happened
[15:07:54] (PS27) Zfilipin: WIP Run language screenshots script for VisualEditor in Jenkins [integration/config] - https://gerrit.wikimedia.org/r/300035 (https://phabricator.wikimedia.org/T139613)
[15:08:01] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2579149 (hashar) @thcipriani pointed to some IRC logs from OpenStack infrastructure that reflected them having the sa...
[15:11:32] (PS28) Zfilipin: WIP Run language screenshots script for VisualEditor in Jenkins [integration/config] - https://gerrit.wikimedia.org/r/300035 (https://phabricator.wikimedia.org/T139613)
[15:14:51] !log deploying ores d00171
[15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:23:35] (PS1) Zfilipin: Tobi is owner of selenium-Wikibase and selenium-Wikidata jobs [integration/config] - https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309)
[15:25:07] (CR) Zfilipin: "Tobi, please +1/+2 the commit." [integration/config] - https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309) (owner: Zfilipin)
[15:32:45] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2579209 (hashar) As for the quota glitches, the is two ghost instances that needs to be deleted. Also Nova codes see...
[15:38:35] Browser-Tests-Infrastructure, VisualEditor, VisualEditor-MediaWiki, Patch-For-Review, and 2 others: Fix font support on SauceLabs VE screenshots - https://phabricator.wikimedia.org/T141369#2579224 (Elitre) The last one for Tamil doesn't look right?
[15:38:38] Release-Engineering-Team (Long-Lived-Branches): Static asset time on disk - https://phabricator.wikimedia.org/T140921#2579225 (demon) I'm thinking that option (1) will be the best to be honest. The FIFO could be managed by .gitignore--automatically during RelEng's weekly branch work. Images would Just Work i...
[15:45:38] Release-Engineering-Team (Long-Lived-Branches): Static asset time on disk - https://phabricator.wikimedia.org/T140921#2579238 (demon) Addendum: I read static.php and the related Apache config a bit further, and it seems like it will Mostly Work The Way I Want as it is right now. The main difference in the co...
[15:56:42] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[15:56:54] hashar: we're good at, https://integration.wikimedia.org/ci/job/debian-glue/564/
[15:57:09] hashar: so you can close build timeout task.
[15:57:36] kart__: neat! please close it :)
[15:57:54] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2579277 (Dzahn) Thanks @hashar for the clarification here and on gerrit. merged!
[15:58:09] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2579278 (Dzahn) Open>Resolved a:Dzahn
[15:58:28] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2402389 (Dzahn) a:Dzahn>None
[15:58:39] Continuous-Integration-Config, Wikipedia-Android-App-Backlog, WorkType-NewFunctionality: Install and use JDK 8 for Android CI testing - https://phabricator.wikimedia.org/T138506#2402389 (Dzahn) a:hashar
[15:59:05] Continuous-Integration-Config, Patch-For-Review: Make debian-glue job timeout configurable - https://phabricator.wikimedia.org/T143546#2579287 (KartikMistry) Open>Resolved https://integration.wikimedia.org/ci/job/debian-glue/564/ is passing tests, so build timeout issue is fixed.
[15:59:19] hashar: done.
[15:59:25] kart__: awesome :}
[15:59:32] thank you for the task!
[15:59:42] hashar: thanks a lot for quick work!
[16:00:21] Browser-Tests-Infrastructure, VisualEditor, VisualEditor-MediaWiki, Patch-For-Review, and 2 others: Fix font support on SauceLabs VE screenshots - https://phabricator.wikimedia.org/T141369#2579292 (zeljkofilipin) @Elitre sorry, you will have to be more specific :) there are a lot of screenshots a...
[16:09:26] Deployment-Systems: Move test.wikipedia.org from mw1017 and make it a more "normal" test wiki - https://phabricator.wikimedia.org/T45722#2579320 (greg) Open>declined mw1017 is a part of the special group of app servers that serve requests with the X-Wikimedia-Debug header set. See also https://wikite...
[16:11:04] actually
[16:12:02] Deployment-Systems: Move test.wikipedia.org from mw1017 and make it a more "normal" test wiki - https://phabricator.wikimedia.org/T45722#2579334 (ori) declined>Resolved a:ori This was actually done in {185a40c060}.
[16:12:17] ori: we have been doing our tests with mw1099 during the European SWAT window
[16:13:02] and man that Chromium extension + header is SOOO useful
[16:13:02] makes swat a breeze
[16:13:20] that's awesome, glad to hear it :)
[16:14:23] ori: better than $wgDBname === 'enwiki' && user_id = '21' /* brion */ { debug stuff }
[16:15:10] and all volunteers and patch proposers knew about that extension and were already waiting to test on mw1099 even before the change get CR+2
[16:18:53] Release-Engineering-Team, User-greg: Plan basic logistics of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T134830#2579372 (greg)
[16:18:54] Release-Engineering-Team, User-greg: Determine location of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T137721#2579368 (greg) Open>Resolved a:greg Washington, DC
[16:19:22] Release-Engineering-Team, User-greg: Plan basic logistics of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T134830#2278940 (greg)
[16:19:24] Release-Engineering-Team, User-greg: Determine timing of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T137720#2579375 (greg) Open>Resolved a:greg Week of October 17th
[16:20:15] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2579386 (hashar) From #openstack-infra: > [16:15:12Z] hashar: There was a quota mismatch with our proj...
[16:27:02] (PS29) Zfilipin: WIP Run language screenshots script for VisualEditor in Jenkins [integration/config] - https://gerrit.wikimedia.org/r/300035 (https://phabricator.wikimedia.org/T139613)
[16:28:30] (PS30) Zfilipin: WIP Run language screenshots script for VisualEditor in Jenkins [integration/config] - https://gerrit.wikimedia.org/r/300035 (https://phabricator.wikimedia.org/T139613)
[16:31:47] Browser-Tests, MediaWiki-extensions-MultimediaViewer: Download menu.Clicking the image closes the download menu browser test fails in Firefox - https://phabricator.wikimedia.org/T143801#2579459 (Jdlrobson)
[17:07:22] Release-Engineering-Team, MediaWiki-Vagrant, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2579656 (greg) >>! In T136429#2572477, @greg wrote: > Not there, sadly. There are just only so many SF<->European slots in the week and our team meetings overlap....
[17:13:17] (Abandoned) 20after4: Fix debian-glue by working around it's exported path for where debs are saved [integration/config] - https://gerrit.wikimedia.org/r/300790 (owner: Paladox)
[17:14:42] Browser-Tests, MediaWiki-extensions-MultimediaViewer, Reading-Web-Backlog: Browser test "Download menu.Clicking the image closes the download menu browser test" fails in Firefox - https://phabricator.wikimedia.org/T143801#2579672 (Jdlrobson)
[17:14:44] Browser-Tests, MediaWiki-extensions-MultimediaViewer, Reading-Web-Backlog: Browser test "Download menu.Clicking the image closes the download menu browser test" fails in Firefox - https://phabricator.wikimedia.org/T143801#2579459 (Jdlrobson)
[17:14:46] (Abandoned) 20after4: outline for `scap swat` command line tool [tools/release] - https://gerrit.wikimedia.org/r/304855 (owner: 20after4)
[17:15:25] Browser-Tests, MediaWiki-extensions-MultimediaViewer, Reading-Web-Backlog: Browser test "Download menu.Clicking the image closes the download menu" fails in Firefox - https://phabricator.wikimedia.org/T143801#2579459 (Jdlrobson)
[17:15:52] Browser-Tests, MediaWiki-extensions-MultimediaViewer, Reading-Web-Backlog: Browser test "Download menu.Clicking the image closes the download menu" fails in Firefox - https://phabricator.wikimedia.org/T143801#2579459 (Jdlrobson) p:Triage>Normal
[17:23:38] Release-Engineering-Team (Long-Lived-Branches), Scap3 (Scap3-MediaWiki-MVP): Deploy mediawiki release tools repo (rMREL) with scap3 - https://phabricator.wikimedia.org/T142588#2579697 (mmodell)
[17:23:40] Release-Engineering-Team (Long-Lived-Branches), Scap3 (Scap3-MediaWiki-MVP): make scap3 look in PWD to find local CLI extensions - https://phabricator.wikimedia.org/T142590#2579696 (mmodell)
[17:23:49] Release-Engineering-Team (Long-Lived-Branches), Scap3 (Scap3-MediaWiki-MVP): Deploy mediawiki release tools repo (rMREL) with scap3 - https://phabricator.wikimedia.org/T142588#2540290 (mmodell) p:Triage>Low
[17:23:51] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2579700 (chasemp) >>! In T143016#2579386, @hashar wrote: > From #openstack-infra: > >> [16:15:12Z] has...
[17:24:19] Deployment-Systems, Scap3 (Scap3-MediaWiki-MVP): Deploy mediawiki release tools repo (rMREL) with scap3 - https://phabricator.wikimedia.org/T142588#2540290 (mmodell)
[17:42:38] Beta-Cluster-Infrastructure, Commons, MediaWiki-API, MediaWiki-Uploading, and 3 others: internal_api_error_LocalFileLockError while uploading file via API to commons.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T143655#2575006 (AlexMonk-WMF) This instance probably crashed or som...
[17:44:12] Hmm, did something massively improve about CI? VE-core's merge pipeline is now running in 1–2 minutes, previously it was about 4–5…
[17:46:27] James_F: I'm not sure, but maybe the quota was improved?
[17:46:40] I know at least, that there was the plan to do that
[17:52:47] James_F: yes there were some massive improvements thanks to heroic work by chasemp and thcipriani last week
[17:53:38] twentyafterfour: Ah, awesome. Thanks chasemp and thcipriani (and anyone else who also helped)!
[17:55:07] "massive improvements" == "slowed rate of horrible breakages" :P
[17:55:38] also I blame legoktm and chasemp mostly. I just flailed and fretted at it :)
[18:05:44] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:10:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 44818 bytes in 1.299 second response time
[18:22:14] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2579863 (dduvall)
[18:42:17] RECOVERY - Puppet run on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:17:35] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2580118 (hashar)
[19:18:07] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2502733 (hashar) Couple issue that might be worth investigating: * {T143817} * {T143818} Others might show up
[19:47:36] twentyafterfour: begining of august I have sprinted a bit the debian-glue job with akosiaris
[19:47:48] looks like they are more or less working for most of the operations/debs/* repo
[19:48:19] (noticed you abandoned
https://gerrit.wikimedia.org/r/#/c/300790/ ) [20:06:15] hashar: the scap package has been broken for a while [20:06:40] time to fix it up!?! [20:09:31] PROBLEM - Host integration-raita is DOWN: CRITICAL - Host Unreachable (10.68.16.53) [20:10:00] yeah [20:16:09] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2580419 (10hashar) [20:16:38] ninja fixed by dereckson [20:16:51] the other ones are the implicit database commit [20:17:03] I dont think they are serious [20:20:48] hashar: yeah, I think those are mostly aaron forcing them to warn so he can find and fix in support of the multi-dc work he's doing [20:29:56] yeah that is what I have understood [20:32:48] * greg-g nods [20:36:17] (03PS1) 10Hashar: maps/kartotherian/deploy noop jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306537 [20:36:39] (03CR) 10Hashar: [C: 032] maps/kartotherian/deploy noop jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306537 (owner: 10Hashar) [20:37:34] (03Merged) 10jenkins-bot: maps/kartotherian/deploy noop jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306537 (owner: 10Hashar) [20:38:33] 03Scap3, 06Discovery, 06Maps: Failed to rollback scap3 deployment - https://phabricator.wikimedia.org/T142792#2580458 (10thcipriani) 05Open>03Resolved a:03thcipriani Should be fixed in latest release. Thank you for filing the task—bad bug, glad it's gone. [20:40:08] 03Scap3: TypeError: unsupported operand type(s) for %: 'dict' and 'tuple' - https://phabricator.wikimedia.org/T142364#2580463 (10thcipriani) 05Open>03Resolved a:03thcipriani This //should// be fixed by the scap 3.2.3-1 package that went out yesterday. [20:41:59] Yippee, build fixed! 
[20:41:59] Project selenium-Echo » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #127: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/127/
[20:42:11] Yippee, build fixed!
[20:42:12] Project selenium-Echo » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #127: FIXED in 1 min 10 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/127/
[20:49:07] Continuous-Integration-Config, Operations, Puppet: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2580487 (hashar) Thanks for the task. It is due to what Bryan said: [[https://tickets.puppetlabs.com/browse/PUP-3261|puppet doc passes fil...
[20:55:36] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2580531 (hashar) Yeah the quota usage links you have posted earlier have lead me to figure out how to look at the act...
[21:03:36] Yippee, build fixed!
[21:03:37] Project selenium-Wikidata » firefox,test,Linux,contintLabsSlave && UbuntuTrusty build #96: FIXED in 2 hr 13 min: https://integration.wikimedia.org/ci/job/selenium-Wikidata/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/96/
[21:14:09] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2580595 (hashar) Right now with 2 jessie and 2 trusty instances (min-ready values). On the Horizon project page at h...
[21:17:34] Continuous-Integration-Infrastructure, Labs, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2580596 (hashar) I have looked at all the projects I have access too and `tools` seems to be off by one with the pie...
[22:08:42] Scap3, User-mobrovac: Sequential execution should be per-deployment, not per-phase - https://phabricator.wikimedia.org/T142990#2580727 (dduvall)
[22:10:48] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2580735 (Dereckson)
[23:12:06] PROBLEM - Host deployment-poolcounter01 is DOWN: CRITICAL - Host Unreachable (10.68.19.181)
[23:12:56] Beta-Cluster-Infrastructure, Operations, Puppet, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2580877 (AlexMonk-WMF)
[23:12:57] Beta-Cluster-Infrastructure, Patch-For-Review, Wikimedia-Incident: Setup poolcounter daemon in Beta Cluster - https://phabricator.wikimedia.org/T38891#2580874 (AlexMonk-WMF) Open>Resolved a:AlexMonk-WMF Still seems to work after pointing it to poolcounter02, a trusty instance I set up the...
[23:34:36] Beta-Cluster-Infrastructure, Commons, MediaWiki-File-management, Multimedia, Tracking: Thumbnail generation should happen via the same setup in the beta cluster and in production (tracking) - https://phabricator.wikimedia.org/T84950#2580904 (Krenair)
[23:34:40] Beta-Cluster-Infrastructure, Operations, Puppet, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2580903 (Krenair)
[23:52:40] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2580975 (Dereckson)