[00:46:00] (03CR) 10Krinkle: [C: 04-1] "The bootstrap change will need careful review to ensure the layout has no breaking changed that require further updates to our HTML conten" (032 comments) [integration/docroot] - 10https://gerrit.wikimedia.org/r/311345 (https://phabricator.wikimedia.org/T109747) (owner: 10Paladox) [01:07:19] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:12:49] 10Deployment-Systems, 06Operations, 13Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#2651178 (10Dzahn) The last comment "Definitely better than hardcoding uids in the puppet tree." sounds like this ticket might be rejected? [01:17:46] 10Gerrit, 06Repository-Ownership-Requests, 10grrrit-wm: Give paladox a +2 in labs/tools/grrrit - https://phabricator.wikimedia.org/T145416#2629393 (10Legoktm) @yuvipanda, could you comment that you're okay with this? [01:19:06] 10Gerrit, 06Repository-Ownership-Requests, 10grrrit-wm: Give paladox a +2 in labs/tools/grrrit - https://phabricator.wikimedia.org/T145416#2651189 (10yuvipanda) I'm not a maintainer anymore :) [01:31:52] 10Gerrit, 06Repository-Ownership-Requests, 10grrrit-wm: Give paladox a +2 in labs/tools/grrrit - https://phabricator.wikimedia.org/T145416#2651199 (10Legoktm) 05Open>03Resolved a:03Legoktm Alright then, done. Now that I re-read the request this seems uncontroversial to me since he already can deploy ch... [01:42:18] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:55:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:55:40] 10Continuous-Integration-Infrastructure: Move integration-puppetmaster off of precise (probably to jessie) - https://phabricator.wikimedia.org/T144951#2651207 (10Legoktm) 05Open>03Resolved a:03Legoktm https://lists.wikimedia.org/pipermail/qa/2016-September/002552.html I will delete integration-puppetmaste... 
[02:30:36] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [02:34:45] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:11:17] Project mediawiki-core-code-coverage build #2273: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2273/ [03:14:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:26:40] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [04:01:37] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [05:05:43] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [05:34:54] PROBLEM - Puppet staleness on deployment-db03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [05:45:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:35:32] 10Gerrit, 06Repository-Ownership-Requests, 10grrrit-wm: Give paladox a +2 in labs/tools/grrrit - https://phabricator.wikimedia.org/T145416#2651409 (10Paladox) Thank you :) [07:00:30] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [07:35:29] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [08:13:40] 06Release-Engineering-Team, 10Monitoring, 06Operations, 07Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2651497 (10hashar) [08:14:21] 10Browser-Tests-Infrastructure, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, 15User-zeljkofilipin: CentralNotice: Intermittent unexplained browser test failures - https://phabricator.wikimedia.org/T145718#2651498 (10zeljkofilipin) Only three faile... [08:27:15] 06Release-Engineering-Team, 06Editing-Department, 10Monitoring, 06Operations, 07Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2651523 (10hashar) The graph from https://grafana.wikimedia.org/dashboard/db/authenti... [08:33:30] 06Release-Engineering-Team, 10Monitoring, 06Operations, 06Performance-Team, 07Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (10hashar) [08:34:03] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2564761 (10hashar) From T146099 > load times increased by approx 600ms: > {F4488077} That got introduced by wmf.18 and left unnoticed for almost 1... 
[08:34:32] 06Release-Engineering-Team, 10Monitoring, 06Operations, 07Tracking, 07Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (10hashar) [08:36:55] 06Release-Engineering-Team, 10Monitoring, 06Operations, 06Performance-Team, 07Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (10hashar) [08:39:25] (03PS1) 10Nschaaf: Add research/recommendation-api to integration [integration/config] - 10https://gerrit.wikimedia.org/r/311667 (https://phabricator.wikimedia.org/T146057) [08:39:32] 10Continuous-Integration-Config, 06Wikipedia-Android-App-Backlog, 13Patch-For-Review: [Dev] Fix periodic tests - https://phabricator.wikimedia.org/T139137#2651585 (10hashar) @Mholloway definitely aced this :] Well done! [08:43:30] 06Release-Engineering-Team, 06Editing-Department, 10Monitoring, 06Operations, 07Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2651615 (10Tgr) centrallogin is not interesting, it can be added to web or just ignor... [08:47:47] 10Browser-Tests-Infrastructure, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, 15User-zeljkofilipin: CentralNotice: Intermittent unexplained browser test failures - https://phabricator.wikimedia.org/T145718#2651635 (10zeljkofilipin) [09:02:02] 10Browser-Tests-Infrastructure, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, 15User-zeljkofilipin: CentralNotice: Intermittent unexplained browser test failures - https://phabricator.wikimedia.org/T145718#2651654 (10zeljkofilipin) All three failur... [09:26:08] 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651679 (10elukey) As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: https://phabricator.wikimedia.org/T146053 As follo... [09:47:57] hashar: --^ [10:13:00] elukey: hello :) [10:13:06] o/ o/ [10:13:17] sorry I forgot the client hello part :D [10:13:19] so yeah restbase points to some deployment-mw host [10:13:31] and puppet does not restart the restbase service on configuration change [10:13:35] (intentionally) [10:14:02] as I rember it [10:14:04] remember it [10:15:33] oh yes I wasn't blaming anybody but me for the outage, only suggesting the "tribal knowledge" page [10:15:42] if it is not there already [10:15:43] well it is hard to catch really [10:16:17] on other news mira02.deployment-prep.eqiad.wmflabs needs more disk :( [10:34:01] hashar: since we now have sufficient quota again, I'll simply "reimage" mira02 with a new instance (and also change the name since people seem to prefer the redundant form :-) [10:34:18] that way it's also a clean rebuiild [10:34:44] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: mediawiki_selenium feature to show/capture Selenium WebDriver requests to remote browser. - https://phabricator.wikimedia.org/T94577#2651805 (10zeljkofilipin) @Jhernandez, @hashar: Looks like this works great for Chrome, but it is not... 
[10:35:28] moritzm: sounds good [10:35:41] and I am 99 sure it is going to recreate all fine [10:35:47] then we can dish out the mira one [10:36:15] yep, will start in half an hour or so [10:39:57] elukey: do you remember about my Debian packages building not including orig.tar.gz in .changes? [10:40:15] I got the issue for another package (Nodepool) and I am pretty sure I shared a link to debian policy but cant find it anymore :( [10:40:36] I tried instructing git buildpackage to pass -sa to force orig.tar.gz to be included in the .changes but without any result :( [10:42:48] mmm IIRC you shared to me a link with a contact us from random people explaining the issue.. [10:42:55] was it the one that you are looking? [10:43:02] yeah maybe :] [10:43:22] I want to fix that once for good [10:45:15] !log created deployment-jobrunner02 in deployment-prep [10:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:45:55] hashar: https://www.logilab.org/ticket/22071 [10:46:04] you are awesome! [10:48:34] say thank you to my Chrome history :P [10:48:58] 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Get the Nodepool/Zuul debian packaging to pass -sa so orig.tar.gz is always in the .changes file - https://phabricator.wikimedia.org/T145797#2651824 (10hashar) [11:04:23] 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Get the Nodepool/Zuul debian packaging to pass -sa so orig.tar.gz is always in the .changes file - https://phabricator.wikimedia.org/T145797#2651860 (10hashar) I have updated the task details, but apparently not including the original tarba... [11:05:24] I found the culpirt code " dpkg-genchanges " [11:05:32] now gotta figure out how to pass it -sa :D [11:08:00] (reads its 6th man page in a raw) [11:13:49] hashar: when you have time https://gerrit.wikimedia.org/r/#/c/311681/1 [11:13:57] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 07Upstream, 15User-zeljkofilipin: Firefox v47 breaks mediawiki_selenium - https://phabricator.wikimedia.org/T137561#2651868 (10zeljkofilipin) I am not sure what needs to be done here to resolved the task. Firefox 47.0.1 is released and workin... [11:14:08] (trying to figure out how to have debian/gpb.conf to pass -sa reliably :D ) [11:14:26] elukey: yeah be bold :] [11:14:57] elukey: the jobrunner to Jessie should be fine since the prod one got moved back in July. 
I am more worried about deployment-tmh01 which deal with video scaling [11:15:28] ah yeah that one is still pending for production too [11:15:51] but we can survive for a while with a few ubuntu trustys rather than tons of them :D [11:16:02] guess whoever takes care of that for prod can spawn a Jessie videoscaler on beta [11:16:07] and catch up issues / test on beta [11:20:19] !log applied beta::deployaccess, role::labs::lvm::srv, role::mediawiki::jobrunner to jobrunner02 [11:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:22:16] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:26:18] Project beta-scap-eqiad build #120853: 04FAILURE in 1 min 44 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120853/ [11:36:16] Project beta-scap-eqiad build #120854: 04STILL FAILING in 1 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120854/ [11:41:10] Project beta-scap-eqiad build #120855: 04STILL FAILING in 1 min 48 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120855/ [11:42:03] (03Abandoned) 10Hashar: Pass option -sa to git-pbuilder [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/310959 (https://phabricator.wikimedia.org/T145797) (owner: 10Paladox) [11:42:38] HURRAH FIXED !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [11:42:57] (and it only took me an hour to figure it out) [11:46:12] Project beta-scap-eqiad build #120856: 04STILL FAILING in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120856/ [11:46:37] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:52:03] hashar: can you try to deploy again to jobrunner02? [11:52:07] it should be ok now [11:52:15] not sure about the puppet run since it worked [11:54:10] 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Get the Nodepool/Zuul debian packaging to pass -sa so orig.tar.gz is always in the .changes file - https://phabricator.wikimedia.org/T145797#2651885 (10hashar) So after much trial and errors, reading 6 or 7 different man pages and digging i... [11:54:56] elukey: you can login in Jenkins with your wmflabs account from https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [11:55:03] elukey: then hit the "rebuild" link on the left [11:55:11] * elukey learns [11:55:36] or eventually the job that runs every 10 minutes or so to pull new code will end up triggering the beta-scap-eqiad job [11:55:42] which it actually did already ! 
[11:55:50] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120857/ [11:56:00] it shows: Started by upstream project beta-code-update-eqiad build number 122288 [11:56:15] Project beta-scap-eqiad build #120857: 04STILL FAILING in 1 min 44 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120857/ [11:56:20] I guess it ran too late :] [11:56:21] weird [11:56:25] ahhh okok [11:56:29] let me re-run it [11:56:38] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:57:02] thanks [11:58:20] Project beta-scap-eqiad build #120858: 04STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120858/ [11:58:28] grrr [11:59:14] (03PS2) 10Zfilipin: WIP Marionette [selenium] - 10https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) [11:59:52] * elukey realized that he accepted the key on puppet master [11:59:59] * elukey cries a bit [12:00:40] 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651889 (10AlexMonk-WMF) >>! In T144006#2651679, @elukey wrote: > As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: http... [12:01:27] (03CR) 10jenkins-bot: [V: 04-1] WIP Marionette [selenium] - 10https://gerrit.wikimedia.org/r/310286 (https://phabricator.wikimedia.org/T137540) (owner: 10Zfilipin) [12:01:56] Yippee, build fixed! [12:01:57] Project beta-scap-eqiad build #120859: 09FIXED in 1 min 42 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120859/ [12:02:14] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0] [12:03:27] 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651892 (10elukey) >>! In T144006#2651889, @AlexMonk-WMF wrote: >>>! In T144006#2651679, @elukey wrote: >> As @AlexMonk-WMF reported, we caused an issue... [12:04:21] goooood deployment fixed [12:04:27] the jobrunner seems working \o/ [12:06:02] hey elukey w [12:06:10] what's the status of the migrations to jessie? [12:06:24] o/ [12:06:42] the videoscaler is the only one left, but we are still working on it in production [12:06:51] so it might wait a bit [12:06:56] have the old instances been terminated? [12:07:20] only the trusty jobrunner is still active, I'll kill it today if nothing comes up [12:07:49] so we have room in the quota to create a new instance without using up the quota increase you got for that? [12:08:20] 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Get the Nodepool/Zuul debian packaging to pass -sa so orig.tar.gz is always in the .changes file - https://phabricator.wikimedia.org/T145797#2651911 (10hashar) a:03hashar Definitely solved on my local installation by adjusting my `~/.gbp.... [12:08:52] so the quota increase was needed since I had to replace one by one old/new instances and I didn't want it to do for the jobrunner since it was alone [12:09:10] plus godog needs to room for his prometheus instances [12:09:18] prometheus was a separate ticket [12:10:07] I ask because yuvipanda and I are going to swap the puppetmaster soon [12:10:07] I think he can use the extra quota for it? 
[12:10:13] yes [12:10:43] but until he does use it, we have to make sure we don't accidentally use it for something else [12:11:36] hashar hi, thankyou for reasearching on how to include the original source today +1 :) [12:11:39] from my side I will not need more instances, and moritzm is going to replace mira in place IIRC [12:12:01] the videoscaler is the only question mark but not sure when it will happen [12:12:04] we can coordinate [12:12:37] should be no problem, the old mira instance can also go away after more tests [12:13:29] 10Continuous-Integration-Infrastructure, 06Labs, 06Operations, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2651917 (10hashar) I have refreshed the package on https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf5/ .... [12:13:38] paladox__: that is rather messy :( but I got the doc updated! [12:13:54] Yep [12:14:04] moritzm: elukey mira is not really used so I would just dish out in favor of mira02 [12:14:08] But at least we now know a solution we can use :) [12:15:49] yeah [12:16:05] and the root cause is really that I create package for a new upstream version [12:16:12] but that first iteration is not always send to apt.wm.o [12:16:22] so I then get a 2nd iteration (eg: 1.0.0-wmf2) [12:16:23] Oh [12:16:41] which does not have the original tarball since 1.0.0-wmf1 has it and is supposed to have been uploaded already [12:16:51] Yep [12:17:04] regardless, wikimedia system requires the original to always be included [12:17:10] so -sa all the time and we are covered \o/ [12:17:11] Oh [12:17:18] Yep [12:17:22] Lol [12:17:28] * paladox__ is going now, use paladox so I can read messages later :) [12:21:43] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2651940 (10zeljkofilipin) ``` $ bundle exec cucumber tests/browser/features/create_account.feature:10 /usr/local/lib/ruby/gems/2.3.0/gems/page-obje... [12:30:12] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2651942 (10zeljkofilipin) https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities#firefox-specific [13:04:28] Project selenium-Math » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #150: 04FAILURE in 27 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/150/ [13:07:33] !log add deployment-prometheus01 instance T53497 [13:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:10:36] elukey: did puppet work on jobrunner02 today? I'm getting a self-signed cert on prometheus01 when running puppet [13:10:46] Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster.deployment-prep.eqiad.wmflabs] [13:16:04] ah yes it is because of the self hosted puppet master [13:16:21] delete the cert on the host, re-run puppet and it will go [13:16:40] there is a weird use case in which the self hosted puppet master is not recognized [13:16:57] BUT I didn't have the time to follow up :) [13:18:02] elukey: which cert should be deleted? 
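The T145797 fix discussed above ends up in hashar's ~/.gbp.conf, but the log never shows the exact lines. A minimal sketch of the idea, assuming git-pbuilder's documented DEBBUILDOPTS hook is the knob involved; package names and option placement are illustrative, not the actual change:

    # Plain dpkg-buildpackage: -sa is handed down to dpkg-genchanges, so the
    # orig.tar.gz is listed in the .changes file even when this is not the
    # first Debian revision (the 1.0.0-wmf2 case described above).
    dpkg-buildpackage -us -uc -sa

    # With git-buildpackage + git-pbuilder, DEBBUILDOPTS is the documented way
    # to pass extra dpkg-buildpackage options into the build (assumed here to
    # be what the ~/.gbp.conf adjustment boils down to).
    DEBBUILDOPTS="-sa" gbp buildpackage --git-pbuilder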
[13:19:28] the ones on the target host [13:19:31] *one [13:19:52] the brutal find /var/lib/puppet/ssl -type f -exec rm {} \; [13:20:44] you'll see that when the first puppet run will go a little snippet for the self hosted puppet master is added [13:20:58] but don't ask me why :D [13:22:13] heh, that also means that puppet won't work by default for freshly-provisioned instances? [13:22:32] it does work indeed after removing the files in /var/lib/puppet/ssl [13:22:50] only with self-hosted puppet masters yes [13:28:47] godog: yeah new instances have broken puppet [13:29:07] gotta dish bunch of material from /var and iirc redo the puppet.conf manually [13:34:49] PROBLEM - Puppet run on deployment-kafka04 is CRITICAL [13:34:49] PROBLEM - Puppet run on deployment-zookeeper01 is CRITICAL [13:34:49] PROBLEM - Puppet staleness on integration-slave-jessie-1003 is CRITICAL [13:34:50] PROBLEM - Puppet run on deployment-imagescaler01 is CRITICAL [13:34:51] PROBLEM - Puppet staleness on deployment-sentry01 is CRITICAL [13:34:53] PROBLEM - Free space - all mounts on deployment-elastic06 is CRITICAL [13:34:55] PROBLEM - Free space - all mounts on deployment-kafka05 is CRITICAL [13:34:56] PROBLEM - Puppet staleness on mira02 is CRITICAL [13:34:58] PROBLEM - Puppet run on deployment-pdf01 is CRITICAL [13:35:01] PROBLEM - Puppet run on deployment-sentry01 is CRITICAL [13:35:02] PROBLEM - Puppet staleness on integration-slave-precise-1002 is CRITICAL [13:35:02] PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL [13:35:03] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL [13:35:05] PROBLEM - Free space - all mounts on deployment-redis01 is CRITICAL [13:35:06] PROBLEM - Free space - all mounts on deployment-cache-text04 is CRITICAL [13:35:08] PROBLEM - Puppet run on deployment-conf03 is CRITICAL [13:35:12] PROBLEM - Free space - all mounts on deployment-mediawiki04 is CRITICAL [13:35:13] PROBLEM - Free space - all mounts on integration-slave-jessie-android is CRITICAL [13:35:15] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL [13:36:31] PROBLEM - Free space - all mounts on deployment-jobrunner02 is CRITICAL [13:36:31] PROBLEM - Puppet run on deployment-sca02 is CRITICAL [13:36:35] PROBLEM - Puppet staleness on integration-slave-trusty-1003 is CRITICAL [13:36:44] PROBLEM - Puppet staleness on deployment-mediawiki05 is CRITICAL [13:36:44] PROBLEM - Puppet staleness on deployment-elastic05 is CRITICAL [13:36:44] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL [13:36:48] PROBLEM - Puppet run on integration-slave-jessie-android is CRITICAL [13:36:48] PROBLEM - Puppet run on integration-slave-trusty-1014 is CRITICAL [13:36:52] PROBLEM - Puppet run on integration-slave-trusty-1004 is CRITICAL [13:36:53] PROBLEM - Puppet run on deployment-logstash2 is CRITICAL [13:36:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL [13:36:54] PROBLEM - Puppet run on integration-slave-trusty-1017 is CRITICAL [13:36:55] PROBLEM - Puppet run on deployment-db03 is CRITICAL [13:36:57] PROBLEM - Free space - all mounts on deployment-logstash2 is CRITICAL [13:36:58] PROBLEM - Puppet staleness on castor is CRITICAL [13:36:58] PROBLEM - Free space - all mounts on deployment-tin is CRITICAL [13:36:59] PROBLEM - Puppet staleness on deployment-eventlogging04 is CRITICAL [13:37:01] PROBLEM - Puppet staleness on deployment-sca02 is CRITICAL [13:37:01] PROBLEM - Puppet staleness on integration-slave-trusty-1011 is CRITICAL [13:37:03] PROBLEM - 
Puppet run on deployment-puppetmaster is CRITICAL [13:37:03] PROBLEM - Puppet staleness on deployment-ores-redis is CRITICAL [13:37:06] PROBLEM - Puppet staleness on deployment-db2 is CRITICAL [13:37:06] PROBLEM - Free space - all mounts on integration-slave-precise-1011 is CRITICAL [13:37:07] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL [13:37:08] PROBLEM - Free space - all mounts on deployment-mediawiki05 is CRITICAL [13:37:09] PROBLEM - Free space - all mounts on integration-slave-precise-1012 is CRITICAL [13:37:09] PROBLEM - Free space - all mounts on deployment-eventlogging03 is CRITICAL [13:37:10] PROBLEM - Free space - all mounts on deployment-mx is CRITICAL [13:37:10] PROBLEM - Free space - all mounts on deployment-memc04 is CRITICAL [13:37:12] PROBLEM - Puppet run on integration-slave-trusty-1018 is CRITICAL [13:37:13] PROBLEM - Free space - all mounts on mira02 is CRITICAL [13:37:55] bah [13:38:19] same in -labs [13:38:23] shinken is dead :D [13:39:24] PROBLEM - Puppet staleness on integration-slave-jessie-1005 is CRITICAL [13:39:25] PROBLEM - Puppet run on deployment-restbase02 is CRITICAL [13:39:26] PROBLEM - Puppet run on deployment-stream is CRITICAL [13:39:27] PROBLEM - Puppet staleness on deployment-redis02 is CRITICAL [13:39:28] PROBLEM - Free space - all mounts on deployment-eventlogging04 is CRITICAL [13:39:28] PROBLEM - Free space - all mounts on deployment-sentry01 is CRITICAL [13:51:36] 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Get the Nodepool/Zuul debian packaging to pass -sa so orig.tar.gz is always in the .changes file - https://phabricator.wikimedia.org/T145797#2652108 (10hashar) 05Open>03Resolved Documentation patch has been reviewed and merged by @akosi... [14:08:10] hashar: how are things going? [14:08:21] problems with labs now? [14:12:07] aude: back again to wmf.18 [14:12:19] due to a huge regression in MediaWiki load time :( [14:12:40] !sal [14:12:41] https://tools.wmflabs.org/sal/releng [14:13:02] elukey: may i please mess up with deployment-jobrunner02 ? Would like to try a rsyslog config change :D [14:13:26] hashar: is https://phabricator.wikimedia.org/T146044 in your scope ? [14:14:01] please go, I think that we could delete 01 [14:14:07] what do you think? [14:15:25] elukey: yeah lets do that [14:15:30] can further tweak as needed [14:15:49] matanya: I will let mobile folks do the first investigation :D [14:16:04] thanks hashar [14:16:38] FORAZEIOAZ eppuet [14:16:42] disabling puppet [14:16:50] hashar: hey! checking in, since you've been awake for a while. i'm about to put beta cluster in read-only mode for the data migration [14:17:04] any problem with that right now? 
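Going back to the certificate error godog hit on deployment-prometheus01 above (before the alert flood): a minimal sketch of the workaround elukey describes, run on the new instance; the explicit signing step at the end is an assumption, since the project puppetmaster may simply autosign the new request.

    # Fresh instance failing with "self signed certificate in certificate chain
    # for /CN=Puppet CA: deployment-puppetmaster...": drop the agent SSL state,
    # as suggested in the log.
    sudo find /var/lib/puppet/ssl -type f -exec rm {} \;

    # Re-run the agent so it requests a new certificate from the project
    # puppetmaster (this first run also lays down the self-hosted-puppetmaster
    # snippet mentioned above).
    sudo puppet agent --test

    # Only if the new CSR is not auto-signed (assumption), sign it on the master:
    # sudo puppet cert sign <instance-fqdn>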
[14:18:10] hashar: at this point, i don't know about trying to deploy any new wikidata code [14:18:35] elukey: looks like the whole migration of jobrunner/jobchron to systemd is fucked up [14:18:42] might be too confusing but could at least get some backports out for the code we have now [14:18:45] https://gerrit.wikimedia.org/r/#/c/311453/ [14:18:45] elukey: service status jobrunner: status: unrecognized service [14:18:46] at swat [14:19:00] marxarelli: yeah be bold let s do that :] [14:19:12] marxarelli: if somebody complains we can point at the maintenance anouncement you wrote :D [14:19:16] hashar: alrighty :] [14:19:54] hashar: you are systemdinzing init :D [14:20:03] service jobrunner status [14:20:09] !log disabling beta cluster jenkins jobs in preparation for data migration (T138778) [14:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:20:21] elukey: ho for god sake [14:20:25] really [14:20:41] they switched the order of parameters ccompared to upstart... [14:21:20] hashar: it should be systemctl [action] [unit] for systemd i believe [14:21:46] I am just going to create a shell function [14:22:07] catch the wrongly ordered args and reorder :] [14:23:40] yes so systemctl status jobrunner or service jobrunner status [14:23:46] I confuse them all the times [14:27:14] !log stopped puppet, jobrunner and jobchron on deployment-jobrunner01 [14:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:37:05] PROBLEM - Puppet run on deployment-mira02 is CRITICAL [14:37:48] elukey: is that common practice to have the systemd units to have stdout/stderr send to syslog? [14:37:57] we used to just > /var/log/foo/soft.log [14:38:06] which is really super easier compared to handling rsyslog config :] [14:39:01] PROBLEM - Free space - all mounts on deployment-mira02 is CRITICAL [14:39:07] yeah that's how production is done [14:39:20] > redirection opens a can of worms wrt to e.g. rotation [14:39:25] PROBLEM - Puppet staleness on deployment-mira02 is CRITICAL [14:39:47] ahh i see [14:39:58] so the soft just write to the same io channels [14:40:04] and we can logrotate as needed [14:40:49] yeah, whatever needs to be run is a binary that stays in the foreground and logs to stdout/stderr and that's it [14:41:05] in production using syslog has the added benefit of central logging too [14:41:12] yeah [14:41:18] which is exactly what I need actually [14:41:38] know I am trying to figure out how to have rsyslog to not write to the generic syslog :D [14:42:32] what do you mean? [14:44:25] !log entering read-only mode on beta cluster [14:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:44:31] godog: I crafted a rsyslog rule to catch my program and write to a given file [14:44:40] but the messages are also written to /var/log/syslog [14:45:59] ah, check puppet there should be some examples there [14:46:02] for e.g. thumbor [14:46:26] found it [14:46:59] & ~ [14:47:50] also if I am not mistaken everything is also handled by journald [14:48:06] and journald pushes to rsyslog if instructed to [14:48:19] E_TOOMANY_TERMINALS [14:48:21] so you have journalctl -u unit.service available [14:48:23] .... can you believe it ? 
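A sketch of the rsyslog setup hashar and godog converge on above: give the jobrunner/jobchron output its own file and discard it afterwards so it stops duplicating into /var/log/syslog. The file and program names are illustrative, and "& stop" is the newer spelling of the "& ~" discard mentioned in the log; this is not the actual puppet change.

    # Drop-in rule; jessie's rsyslog accepts both "& ~" and "& stop".
    sudo tee /etc/rsyslog.d/20-mediawiki-jobrunner.conf >/dev/null <<'EOF'
    if $programname == 'jobrunner' or $programname == 'jobchron' then /var/log/mediawiki/jobrunner.log
    & stop
    EOF
    sudo systemctl restart rsyslog

    # journald keeps its own copy regardless (as noted above), so per-unit logs
    # stay available with:
    journalctl -u jobrunner.service --since today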
[14:49:50] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2652243 (10zeljkofilipin) Explodes here: ```lang=ruby def new_browser(options) Watir::Browser.new(browser_name, options) end ``` ```lang=ruby... [14:52:47] hashar: https://gerrit.wikimedia.org/r/#/c/311717 [14:53:16] after I merge the last step is to kill the jobrunner [14:53:24] if you agree [14:54:14] !log beta: cherry picking fix up for the jobrunner logging https://gerrit.wikimedia.org/r/#/c/311702/ and https://gerrit.wikimedia.org/r/311719 T146040 [14:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [14:54:53] elukey: the jobchron service fails for some reason :( [14:54:56] on the jessie instance [14:57:17] really? [14:57:22] what is the error? [14:57:23] didn't see it [14:57:24] snap [15:00:01] so it failed for LightProcess::closeShadow failed due to exception: Failed in afdt::sendRaw: Broken pipe [15:00:06] but then systemd restarted it [15:00:33] mmm no sorry it was stopped [15:00:33] Sep 20 14:57:08 deployment-jobrunner02 systemd[1]: Stopping "Mediawiki job queue chron loop"... [15:00:37] Sep 20 14:57:08 deployment-jobrunner02 jobchron[10839]: Caught signal (15) [15:01:18] elukey: yeah that was me play testing it [15:01:22] to fix rsyslog [15:01:28] I have reenabled puppet and done with tests sorry [15:02:13] ah okok [15:02:27] but what "elukey: the jobchron service fails for some reason :(" was about then? [15:06:50] PROBLEM - Keyholder status on deployment-mira02 is CRITICAL [15:12:37] Project mediawiki-core-code-coverage build #2274: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2274/ [15:19:20] elukey: on deployment-jobrunner02 in /var/log/mediawiki/jobchron.log there are some suspicious messages. [15:19:30] havent had time to investigate though and I am about to leave :( [15:20:54] hashar: sure.. do you want me to re-enable jobrunner01? [15:21:32] not sure [15:21:35] ah the 2016-09-20T14:57:07+0000 NOTICE: Raced out of periodic tasks. [15:21:39] maybe the issue also existed on runner01 :] [15:21:42] yeah those messages [15:21:55] one might want to compare with runner01 [15:21:58] hashar: deployment-db1 is in read-only mode while i migrate the beta cluster dbs, in case that may be breaking jobchron [15:22:08] elukey: maybe it was just a transient issue [15:22:27] or what marxarelli said :] [15:22:46] elukey: lets forget about the trusty instance just delete it indeed [15:23:04] we can then tune the jessie one as needed, but it is most probably all fine since prod runs on it [15:23:11] we can wait a bit and do it tomorrow, I'll keep it stopped (via puppet) but ready to go [15:23:31] would it be ok? [15:25:32] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Update mediawiki_selenium to use Marionette - https://phabricator.wikimedia.org/T137540#2652378 (10zeljkofilipin) Firefox 47.0.1 mediawiki_selenium 1.7.2 watir-webdriver 0.9.3 selenium-webdriver 2.53.4 firefox driver ```lang=bash $ b... [15:28:22] elukey: yeah [15:28:37] elukey: and we got a quota bump so we can afford to get some unused instances :] [15:28:46] also marxarelli is migrating the huge dbx Precies instances [15:28:59] which I believe will soonish free up ton of resources AND get us prometheus on beta \o/ [15:47:46] !log completed innobackupex on deployment-db1. 
copying backup to deployment-db03 for restoration [15:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:23:40] !log applied innodb transaction logs to deployment-db1 backup and successfully restored on deployment-db03 [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:30:16] !log rebooting deployment-mira02 [16:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:31:41] !log cherry picking operations/puppet patches (T138778) to deployment-puppetmaster [16:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:54:52] !log upgraded package and data to mariadb 10 on deployment-db03 [16:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:58:15] 10Beta-Cluster-Infrastructure, 07Puppet, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2652713 (10AlexMonk-WMF) [16:58:17] 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 07Puppet: Puppet failing on deployment-conf03 due to missing files - https://phabricator.wikimedia.org/T144703#2652712 (10AlexMonk-WMF) 05Open>03Resolved [17:07:35] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2652761 (10MoritzMuehlenhoff) I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20 GB partition and deleted the mira02 instance... [17:09:29] PROBLEM - Host mira02 is DOWN: CRITICAL - Host Unreachable (10.68.22.226) [17:26:48] Hey e.g. greg-g, I'm looking at https://www.mediawiki.org/wiki/Review_queue - we already have a tracking task for beta cluster deployment (which is all I want to accomplish for now), should I add that to Wikimedia-Extension-setup? [17:28:11] yup [17:28:39] Cool cool cool [17:30:13] alright alright alright [17:37:06] !log deployment-db04 restored from backup and replication started [17:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:37:21] wee [17:39:08] nice, does this mean we're migrated? [17:39:26] Krenair: almost. just need to merge and deploy the mediawiki-config changes [17:43:51] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2652884 (10thcipriani) >>! In T144578#2652761, @MoritzMuehlenhoff wrote: > I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20... [17:44:54] hmm, copied up the new wmf-config/db-labs.php but still seeing the read-only message [17:45:13] db03 and db04 are definitely not read-only [17:53:54] 10Beta-Cluster-Infrastructure, 10Wikimedia-Extension-setup, 07Category, 10FileAnnotations (Beta Cluster Release): Release FileAnnotations on the Beta Cluster - https://phabricator.wikimedia.org/T144302#2652949 (10greg) [18:00:59] marxarelli, you didn't deploy your change? [18:01:04] you just merged it on deployment-tin [18:01:05] !log deployed mediawiki-config changes on beta cluster. back in read/write mode using new database instances [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [18:01:35] Krenair: no, worse than that. 
i didn't pull it after merging :) we're good now [18:02:00] doesn't look okay to me [18:02:17] scap sync still failed on deployment-jobrunner02, but everything else synced [18:02:18] `/home/krenair/foreachapache 'grep db1 /srv/mediawiki/wmf-config/db-labs.php'` still shows old db config [18:02:18] no? [18:02:32] <|404> I don't get at least the read-only message anymore [18:02:36] ah, deployment-jobrunner02 [18:02:43] deployment-tin too :/ [18:03:06] well that's not good, using two different databases :/ [18:05:19] also `diff ../../mediawiki/wmf-config/db-labs.php db-labs.php` from /srv/mediawiki-staging/wmf-config [18:05:25] didn't sync fully [18:05:44] yeah, i saw that [18:06:35] going to try a full scap by letting the jenkins job run [18:07:03] btw, like the colour prompts on new hosts? [18:10:16] wait, i don't have color prompts! [18:10:21] * marxarelli feels cheated [18:11:08] which host? [18:11:41] db03 and db04 [18:11:50] oh, those might be just old enough [18:11:59] try mira02 [18:12:21] basically I fixed up the code in the skeleton bashrc files that linux copies to new users' home directories on their first login to the system [18:12:38] Krenair: fancy! :D [18:12:52] k. looks like a full scap worked from jenkins [18:12:58] maybe a permissions issue [18:14:43] if you want colour prompts on existing hosts you can copy it in from /etc/skel/ [18:14:51] well, that was fun. haven't done a database migration in a while and the tools are so much better now. only an hour and 15 over the window. no bigs, right greg-g? :) [18:16:05] :) [18:16:23] Krenair: nice! thanks for the tip [18:16:26] I'm supposed to be getting dinner but something is up with keyholder on -mira02 [18:16:30] i always want more color [18:16:50] it has the same private key as -tin but `SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-tin` says "Agent admitted failure to sign using the key." [18:16:56] so if someone wants to look into that.. [18:17:29] diamond says in syslog 'diamond[426]: CRITICAL: Keyholder is not armed. Run 'keyholder arm' to arm it.', but keyholder arm says it successfully added the key [18:20:49] Krenair: i still know embarrassingly little about keyholder but i can poke at it [18:22:34] 06Release-Engineering-Team, 06Editing-Department, 10Monitoring, 06Operations, 07Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2653054 (10Tgr) Note that failure means the authentication code ran successfully but... 
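A short sketch of how the keyholder symptoms above can be narrowed down, using only the proxy socket path and the `keyholder arm` command that appear in the log plus stock OpenSSH; whether arming needs sudo on that host is an assumption.

    # Re-arm keyholder, then check what is actually visible through the
    # filtering proxy that jenkins-deploy/scap connect with.
    sudo keyholder arm
    SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l

    # An empty key list here despite a successful "arm" points at the proxy /
    # permissions layer rather than the agent itself (assumption).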
[18:25:57] Project beta-scap-eqiad build #120875: 04FAILURE in 16 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120875/ [18:29:03] RECOVERY - Free space - all mounts on deployment-kafka01 is OK: OK: All targets OK [18:29:04] RECOVERY - Puppet staleness on integration-slave-trusty-1006 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:05] RECOVERY - Puppet staleness on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:06] RECOVERY - Free space - all mounts on deployment-restbase01 is OK: OK: deployment-prep.deployment-restbase01.diskspace._var_log.byte_percentfree (No valid datapoints found) [18:29:06] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:08] RECOVERY - Free space - all mounts on deployment-memc05 is OK: OK: All targets OK [18:29:09] RECOVERY - Free space - all mounts on deployment-kafka03 is OK: OK: All targets OK [18:29:10] RECOVERY - Puppet staleness on deployment-conftool is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:11] RECOVERY - Free space - all mounts on deployment-salt02 is OK: OK: All targets OK [18:29:11] RECOVERY - Puppet run on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:14] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:14] RECOVERY - Puppet staleness on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:14] RECOVERY - Free space - all mounts on deployment-urldownloader is OK: OK: All targets OK [18:29:17] RECOVERY - Free space - all mounts on deployment-apertium01 is OK: OK: All targets OK [18:29:18] RECOVERY - Puppet run on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:19] RECOVERY - Puppet staleness on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:21] RECOVERY - Puppet run on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:21] RECOVERY - Puppet staleness on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:23] RECOVERY - Free space - all mounts on deployment-eventlogging04 is OK: OK: All targets OK [18:29:23] RECOVERY - Free space - all mounts on deployment-sentry01 is OK: OK: All targets OK [18:29:25] RECOVERY - Free space - all mounts on deployment-pdf02 is OK: OK: All targets OK [18:29:26] RECOVERY - Puppet run on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:26] RECOVERY - Puppet run on deployment-conftool is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:27] RECOVERY - Puppet staleness on deployment-stream is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:28] RECOVERY - Puppet staleness on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:30] RECOVERY - Puppet staleness on deployment-mira02 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:30] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:31] RECOVERY - Puppet run on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:31] RECOVERY - Free space - all mounts on deployment-ores-redis is OK: OK: All targets OK [18:29:35] RECOVERY - Puppet staleness on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:35] RECOVERY - Puppet run on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above 
the threshold [0.0] [18:29:36] RECOVERY - Puppet staleness on integration-slave-trusty-1017 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:36] RECOVERY - Puppet staleness on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:38] RECOVERY - Puppet staleness on deployment-tin is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:39] RECOVERY - Puppet staleness on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:39] RECOVERY - Puppet staleness on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:40] RECOVERY - Free space - all mounts on deployment-tmh01 is OK: OK: All targets OK [18:29:42] RECOVERY - Puppet run on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:43] RECOVERY - Free space - all mounts on integration-puppetmaster is OK: OK: All targets OK [18:29:44] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:45] RECOVERY - Puppet staleness on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:46] RECOVERY - Puppet staleness on integration-puppetmaster is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:47] RECOVERY - Free space - all mounts on deployment-zotero01 is OK: OK: All targets OK [18:29:47] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:48] RECOVERY - Puppet staleness on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:49] RECOVERY - Puppet staleness on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:53] all that spam is check_graphite that got fixed up [18:30:02] w005 [18:30:06] -5+t [18:30:56] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [18:30:58] RECOVERY - Puppet staleness on integration-slave-jessie-1002 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:00] RECOVERY - Puppet staleness on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:02] RECOVERY - Puppet staleness on integration-slave-trusty-1004 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:06] RECOVERY - Puppet run on deployment-ms-be01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:08] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:09] RECOVERY - Puppet staleness on integration-publisher is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:09] RECOVERY - Puppet staleness on integration-slave-jessie-1004 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:10] RECOVERY - Puppet staleness on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:13] RECOVERY - Free space - all mounts on integration-slave-trusty-1004 is OK: OK: All targets OK [18:31:16] RECOVERY - Free space - all mounts on integration-slave-trusty-1013 is OK: OK: All targets OK [18:31:17] RECOVERY - Puppet staleness on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:31] RECOVERY - Free space - all mounts on deployment-jobrunner02 is OK: OK: All targets OK [18:31:31] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:35] RECOVERY - Puppet staleness on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:36] Project 
beta-scap-eqiad build #120876: 04STILL FAILING in 4 min 1 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120876/ [18:31:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:45] RECOVERY - Puppet staleness on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:47] RECOVERY - Puppet run on integration-slave-jessie-android is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:47] RECOVERY - Puppet staleness on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:49] RECOVERY - Puppet run on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:49] RECOVERY - Puppet run on integration-slave-trusty-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:54] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:56] RECOVERY - Free space - all mounts on deployment-logstash2 is OK: OK: deployment-prep.deployment-logstash2.diskspace._srv.byte_percentfree (No valid datapoints found) [18:31:57] RECOVERY - Puppet run on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:57] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:57] RECOVERY - Puppet staleness on castor is OK: OK: Less than 1.00% above the threshold [3600.0] [18:31:58] RECOVERY - Puppet run on integration-slave-trusty-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [18:31:59] RECOVERY - Free space - all mounts on deployment-tin is OK: OK: All targets OK [18:32:00] RECOVERY - Puppet staleness on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:32:02] RECOVERY - Puppet run on deployment-puppetmaster is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:03] RECOVERY - Puppet staleness on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [3600.0] [18:32:04] RECOVERY - Puppet staleness on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:32:04] RECOVERY - Puppet staleness on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:32:05] RECOVERY - Free space - all mounts on integration-slave-precise-1011 is OK: OK: All targets OK [18:32:05] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:06] deployment-mira02.deployment-prep.eqiad.wmflabs returned [255]: Host key verification failed. 
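That "Host key verification failed" is the known issue referenced just below; a sketch of the workaround that gets !logged shortly after, i.e. accepting the new instance's host key as the user the scap job really connects as (run on the deploying host, deployment-tin in the log):

    # Same shape as the logged command, with the failing target substituted.
    # Answer "yes" at the host key prompt, then let beta-scap-eqiad retry.
    sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock \
      ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs true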
[18:32:06] RECOVERY - Free space - all mounts on deployment-eventlogging03 is OK: OK: All targets OK [18:32:06] RECOVERY - Free space - all mounts on integration-slave-precise-1012 is OK: OK: All targets OK [18:32:07] RECOVERY - Free space - all mounts on deployment-mx is OK: OK: All targets OK [18:32:07] known issue [18:32:08] RECOVERY - Free space - all mounts on deployment-mediawiki05 is OK: OK: All targets OK [18:32:08] RECOVERY - Free space - all mounts on deployment-memc04 is OK: OK: All targets OK [18:32:10] @q shinken-wm [18:32:10] RECOVERY - Puppet run on deployment-mira02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:11] RECOVERY - Puppet run on integration-slave-trusty-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:12] RECOVERY - Free space - all mounts on integration-slave-trusty-1011 is OK: OK: All targets OK [18:32:13] RECOVERY - Puppet run on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:16] RECOVERY - Puppet run on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:17] RECOVERY - Puppet run on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:36] documented on https://phabricator.wikimedia.org/T144006 [18:33:31] !log on deployment-mira02 ran `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs` per T144006 [18:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [18:34:09] marxarelli: have you completed the switch of db to jessie yet? :] [18:36:14] Project beta-scap-eqiad build #120877: 04STILL FAILING in 1 min 42 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120877/ [18:37:30] hashar: yep yep [18:38:10] !log on tin: `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs` - T144006 [18:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [18:38:25] marxarelli: awesome :] [18:39:04] tom29739: was that meant to silent shinken-wm ? if so can you restore it ? thx! [18:39:39] hashar, it was but it doesn't work in here because wm-bot doesn't have ops in here. [18:44:45] Project beta-scap-eqiad build #120878: 04STILL FAILING in 6 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120878/ [18:48:10] @uq shinken-wm [18:48:13] Project beta-scap-eqiad build #120879: 04STILL FAILING in 1 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120879/ [18:48:18] I know: add, changepass, channel-info, channellist, commands, configure, drop, github-, github+, github-off, github-on, grant, grantrole, help, info, instance, join, language, notify, optools-off, optools-on, optools-permanent-off, optools-permanent-on, part, rc-ping, rc-restart, reauth, recentchanges-bot-off, recentchanges-bot-on, recentchanges-minor-off, recentchanges-minor-on, recentchanges-off, recentchanges-on, reload, restart, revoke, revokerole, seen, seen-host, seen-off, seen-on, seenrx, suppress-off, suppress-on, systeminfo, system-rm, time, traffic-off, traffic-on, translate, trustadd, trustdel, trusted, uptime, verbosity--, verbosity++, wd, whoami [18:48:18] @commands [18:48:19] ah [18:51:45] blerg. Looks like deployment-mira02 needs to be added to network constants as a deploy host. 
[18:56:21] Project beta-scap-eqiad build #120880: 04STILL FAILING in 1 min 48 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120880/ [19:00:40] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/311760/ to deployment-puppetmaster to fix failing beta-scap-eqiad job, had to manually start rsync, puppet failed to start [19:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:05:47] (03CR) 1020after4: [C: 032] "testing" [integration/config] - 10https://gerrit.wikimedia.org/r/311497 (owner: 10Paladox) [19:06:20] twentyafterfour ^^ thanks [19:06:49] (03Merged) 10jenkins-bot: [mediawiki/extensions] Add noop jenkins test [integration/config] - 10https://gerrit.wikimedia.org/r/311497 (owner: 10Paladox) [19:07:44] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [19:08:01] PROBLEM - Puppet run on deployment-puppetmaster is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [19:08:31] Yippee, build fixed! [19:08:32] Project beta-scap-eqiad build #120881: 09FIXED in 3 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120881/ [19:08:47] ^ puppetmaster puppet issue is me & yuvi, we're working on it [19:21:58] thcipriani: deployment-mira02 looks all in good shape [19:22:04] I guess we can get rid of mira (trusty) [19:29:00] hashar: last I looked deployment-mira02 still had a small /srv/ of 20GB [19:29:16] hashar: https://phabricator.wikimedia.org/T144578#2652884 [19:29:24] oh man [19:29:38] but as far as all the deployment tooling stuff goes: it looks good to me. [19:30:07] it might be nice to actually cut over to running a deploy from it, although that is probably a fair amount of work :\ [19:30:39] well on beta it is really just to test comaster switch isn't it ? 
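For the cherry-pick !logged above (Gerrit change 311760 onto deployment-puppetmaster), a hedged sketch of what that usually involves; the repository path and patchset number are assumptions, while the refs/changes layout is Gerrit's standard one.

    # On deployment-puppetmaster (path assumed; patchset "1" is illustrative).
    cd /var/lib/git/operations/puppet
    sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/60/311760/1
    sudo git cherry-pick FETCH_HEAD

    # Then run the agent on an affected instance to confirm the scap/rsync
    # prerequisites for beta-scap-eqiad now converge.
    sudo puppet agent --test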
[19:32:15] yeah, It should just be changing a hiera var and letting puppet run everywhere [19:32:38] would have to turn off beta-scap-eqiad or re-point it to the right master as part of that, too [19:34:20] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:38:01] RECOVERY - Puppet run on deployment-puppetmaster is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:43] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:45:37] PROBLEM - Puppet run on deployment-elastic08 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:48:04] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:48:12] PROBLEM - Puppet run on deployment-elastic06 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:48:38] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:49:20] PROBLEM - Puppet run on deployment-salt02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [19:49:46] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [19:49:56] PROBLEM - Puppet run on deployment-pdfrender is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:51:50] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:53:22] > Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class base::environment [19:53:24] why [20:05:37] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [20:11:12] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 06Labs, 10Labs-Infrastructure, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653709 (10hashar) [20:11:21] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2604309 (10hashar) Lets get a custom flavor for the deployment servers. 8 CPUs to get faster l10n rebuild 8 GB RAM: 2G for system, 6G for cache,... [20:12:36] thcipriani: transient issue :( [20:13:48] thcipriani: filled that has https://phabricator.wikimedia.org/T145631 which I later merged with https://phabricator.wikimedia.org/T131946 [20:14:03] I suspect the auto git rebase causing puppet master to miss some files from time to time [20:14:09] eg it is not atomic enough :] [20:14:18] RECOVERY - Puppet run on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:18] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:22] I just had to manually fix rebasing the puppet repo [20:18:15] (03PS2) 10Hashar: Add research/recommendation-api to integration [integration/config] - 10https://gerrit.wikimedia.org/r/311667 (https://phabricator.wikimedia.org/T146057) (owner: 10Nschaaf) [20:24:18] (03CR) 10Hashar: [C: 032] "I have granted jenkins-bot the ability to submit patches." 
[integration/config] - 10https://gerrit.wikimedia.org/r/311667 (https://phabricator.wikimedia.org/T146057) (owner: 10Nschaaf) [20:25:17] (03Merged) 10jenkins-bot: Add research/recommendation-api to integration [integration/config] - 10https://gerrit.wikimedia.org/r/311667 (https://phabricator.wikimedia.org/T146057) (owner: 10Nschaaf) [20:25:18] twentyafterfour: one of your patch did not get deployed on zuul server :/ [20:25:34] the one adding noop jobs to mediawiki/extensions that is rather harmless though [20:25:45] CR+2 does not deploy on gallium. Has to be done manually [20:26:48] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0] [20:27:50] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 06Labs, 10Labs-Infrastructure, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653807 (10Andrew) [20:28:04] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:28:12] RECOVERY - Puppet run on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [20:28:36] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:29:47] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0] [20:29:59] RECOVERY - Puppet run on deployment-pdfrender is OK: OK: Less than 1.00% above the threshold [0.0] [20:31:09] hashar: sorry [20:31:54] twentyafterfour: np :] [20:32:50] twentyafterfour: actually the related doc to deploy a zuul config change was outdated and zeljkof prompted me to update it https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Deploy_configuration :D [20:32:51] in short: [20:32:56] fab deploy_zuul [20:32:58] :] [20:33:25] twentyafterfour theres https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Actually_upgrade which hashar made [20:33:33] and super easy to follow, just copy and paste :) [20:33:45] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2653842 (10Andrew) [20:33:50] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 06Labs, 10Labs-Infrastructure, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653840 (10Andrew) 05Open>03Resolved a:03Andrew [20:47:54] !log Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578 [20:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:53:19] !sal [20:53:19] https://tools.wmflabs.org/sal/releng [20:54:29] !log from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs [20:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:02:38] greg-g: So what's up with the deployment train this week? I see we're still on wmf.18? Is today's train going forward? Are we going to go to 19 or skip it and go straight to 20? [21:02:58] Oh, I now see there was an email to wikitech but not engineering [21:03:02] RoanKattouw: on hold due to a performance regression that occured with wmf.18 :/ [21:03:19] Ugh, branching of 20 is paused? [21:03:23] Can we not do that, please? 
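As the exchange above notes, a CR+2 on an integration/config change does not deploy it to the Zuul server; the layout has to be pushed separately. The documented shortcut is the Fabric task run from a local integration/config checkout, roughly:

    # from a local clone of integration/config, with Fabric installed
    git clone https://gerrit.wikimedia.org/r/integration/config
    cd config
    fab deploy_zuul    # deploys the merged Zuul configuration to the CI master, per the doc linked above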
[21:03:46] My week is scheduled around the branch cut happening on Tuesdays, I have patches waiting to be merged post-cut [21:03:57] we havent branched wmf.20 afaik [21:04:02] since wmf.19 hasnt been deployed [21:04:05] Yeah that's whta bothers me [21:04:09] But I guess it makes some sense [21:04:18] and wikidata team is in a similar position [21:04:20] I'll ask on the list [21:04:34] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [21:04:42] specially with mediawiki/core going to have a few breaking changes in wmf.20 :( [21:04:50] (such as DBFactory that got renamed) [21:05:02] yeah better on list or poke Tyler about it [21:05:37] Info: /Stage[main]/Redis/Sysctl::Parameters[vm.overcommit_memory]/Sysctl::Conffile[vm.overcommit_memory]/File[/etc/sysctl.d/70-vm-overcommit_memory.conf]: Scheduling refresh of Exec[update_sysctl] [21:05:47] sounds scary :D [21:09:31] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [21:10:43] we could cut it today but not do anything with it. We might not even deploy it, we might skip to wmf.21 if we can't resolve the issue in wmf.18/19 (cc thcipriani thoughts?) [21:11:57] we can cut so folks can continue with their cadence [21:13:53] what are the consequences of that on the other side? A few more backports? [21:16:00] may make it even harder to troubleshoot? Since now we'd be juggling 3 branches in limbo: wmf.18 (which has the regression), wmf.19 (which has the regression and is still untested), and wmf.20 (which has the regression and would be completely untested) [21:17:03] but, it's possible, that by delaying wmf.20 we'd be creating a more unstable branch since it seems like folk hold-off a merge to master until Tuesday afternoons [21:18:15] right, also end of quarter complications :/ [21:18:47] I think we should have a policy of no non-urgent fixes deployed during SWATs while we're in a reverted state, as well [21:18:52] gut check? :) [21:20:33] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [21:20:44] ehm, maybe. Seems like that could help isolate issues, but would likely result in other complications that I can't anticipate, particularly as SWAT is the only way to get *any* change into mw-config [21:21:10] yeah, I'd say those are OK, but backports of code code are not so much [21:21:10] is made more complicated by the freeze next week? [21:21:17] chasemp: yes [21:21:21] sadly [21:21:24] (just confirming I'm following) [21:21:41] I bet a lot of teams were counting on wmf.20 cutting today and going out this week with their end of quarter goals related code :) [21:21:55] sure yeah [21:22:19] the thing about that is...nothing is actually *more* broken in wmf.20 [21:22:25] right [21:22:41] (that we know of) and/or (just related to this one perf issue, probably) [21:23:03] blug. I'm going to cut the branch now. I'll update the mailing list. [21:23:47] If we haven't heard anything about performance by train time tomorrow, I'll go to group0 and keep a close eye on things and try to run a shortened schedule from there. [21:23:51] how does that sound for a plan? [21:24:16] with 19 or 20? 
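The Info line about vm.overcommit_memory above is the Redis puppet module laying down a sysctl fragment and scheduling a reload. In plain shell it amounts to roughly the following; the value 1 is Redis's usual recommendation, not something read from the puppet source:

    # hand-rolled equivalent of the puppet resources in the Info line; the value is an assumption
    echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/70-vm-overcommit_memory.conf
    sudo sysctl --system    # re-read every sysctl.d fragment, the equivalent of the Exec[update_sysctl] refresh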
[21:24:34] with .20 and skip wmf.19 entirely [21:24:39] * greg-g nods [21:25:08] wmf.20 would then have whatever backports are in 19 now (modulu this evening's swat) [21:25:22] right [21:25:23] s/modulu/modulo/ #effing coffe shop conversations next to me [21:25:52] Thanks, that's exactly what I was hoping for [21:25:57] ok, I'll cut the branch and then check in with performance folks then update wikitech-l [21:26:38] Approved. [21:26:44] :D [21:26:54] ;) [21:27:39] Ooh I also approve of the releng offsite coinciding with my team's offsite [21:27:45] No deployment worries during my offsite :) [21:29:24] heh, you're welcome :P [21:30:33] RoanKattouw: second opinions on "no non emergency or simple config chanes during SWAT deploys while we're in a reverted state" (mostly looking for language improvements, I feel we mostly do that now by default) [21:33:17] greg-g: "Only emergency hotfixes during SWAT while production is reverted to a previous branch" ? [21:33:22] Hmm, maybe [21:33:42] I'm not too convinced that simple config changes should be blocked in a reverted state, can you explain why? [21:34:03] oh, bad english [21:34:04] Naive thinking: the reverted state is a stable state, so what's wrong with adding some import sources on the Kazakh Wikipedia [21:34:07] simple configs ok [21:34:10] OK [21:34:27] but not "enable new feature" obvs [21:34:32] Project beta-scap-eqiad build #120896: 04FAILURE in 0.3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120896/ [21:34:35] OK, so only simple config changes and emergency fixes are allowed in SWAT while revertd? [21:34:44] right [21:35:01] That makes sense to me [21:35:09] word [21:35:12] * greg-g documents more [21:35:21] Rules like that don't have to be too precise anyway, they can say "when in doubt ask Greg for approval" at the end [21:35:31] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [21:35:56] yup, or "the on-point train conductor" [21:35:57] scap failure on beta is me [21:36:13] * greg-g still can't find suitable adult sized train conductor hats [21:38:28] ahem. Will just leave this here for greg-g https://www.amazon.com/Mineola-Hat-Store-mhs-eng-Engineer/dp/B000BBS82K/ref=sr_1_1 [21:39:29] they make tons of them. source: my old neighbor was an adult model train guy [21:39:36] those ppl are serious business [21:39:39] engineer != conductor [21:39:52] https://www.amazon.com/Large-Navy-Blue-Conductor-Hat/dp/B000JGAF3C/ref=pd_sim_193_6 [21:40:10] :) [21:40:31] Why didn't I like this one before (I remember seeing it).. I think I started looking for customizable options... [21:41:19] " Only 3 left in stock. " crap [21:41:22] needs an embroidered scappy the scap pig [21:41:32] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:41:39] Its on sale in the us but not in the uk LOL [21:41:50] thcipriani: that would be so \m/ [21:42:03] oh right, the quality, many complaints of quality (for $9 not surprising) [21:42:08] $9 = £39 [21:42:25] Project beta-scap-eqiad build #120897: 04STILL FAILING in 4 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120897/ [21:42:29] Your technolly right but look at it in the uk ^^ [21:42:34] Question: [21:42:35] What size(s) does it come in? I wear 7 3/8. 
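For the "cut wmf.20, skip wmf.19" plan discussed above, the quick way to see which branch each wiki is actually serving is the wikiversions map on the deployment host. A small sketch, assuming the standard mediawiki-config staging path:

    # path assumed; wikiversions.json maps each wiki dbname to its MediaWiki branch
    jq -r 'to_entries | group_by(.value) | map("\(.[0].value): \(length) wikis") | .[]' \
        /srv/mediawiki-staging/wikiversions.json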
[21:42:35] Answer: [21:42:35] Trying googling and see what you get.l [21:42:41] ugh amazon [21:43:33] It would have to be great quality here otherwise for that price no one would buy it here [21:43:54] it looks like lots of people buy it here but wont in the us so there probaly doing a sale :) [21:44:01] Interesting question, why do grown men all want to be train engineers and not conductors? One for the ages. [21:44:07] it's not, the reviews are pretty clear on the quality [21:44:31] Project beta-scap-eqiad build #120898: 04STILL FAILING in 0.33 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120898/ [21:45:03] race condition [21:45:37] what's happening? [21:45:42] But if i order from america here it is £56 [21:45:51] 57 i mean [21:45:52] https://www.amazon.co.uk/gp/offer-listing/B000JGAF3C/ref=dp_olp_new_mbc?ie=UTF8&condition=new [21:46:14] Plus delivery [21:47:49] thcipriani: for info, I got a new deployment-mira host (that is like the 3rd or 4th) this time [21:47:57] has 8cpu and 40G on /srv :] [21:48:03] :) [21:48:07] that ought to do it [21:48:22] Yippee, build fixed! [21:48:23] Project beta-scap-eqiad build #120899: 09FIXED in 1 min 42 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120899/ [21:48:42] ah evening sprint completed [21:49:07] !log Deleting deployment-mira02 /srv was too small. Replaced by deployment-mira [21:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:51:59] PROBLEM - Host deployment-mira02 is DOWN: CRITICAL - Host Unreachable (10.68.19.67) [21:53:38] 06Release-Engineering-Team, 06Operations, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2654077 (10hashar) I did a sprint tonight: * Got a new flavor in openstack with larger disk T146209, huge thanks to Andrew to have created it up... [21:54:48] sprint complete :] [21:56:10] g'night :) [21:57:32] I also wonder whether deployment-db1 and deployment-db2 can be shutdown (not deleted) [21:57:37] seems the migration is complete [22:01:32] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [22:09:57] ok deployment-mira good for service as far as I can tell [22:10:01] bed crash \o/ [22:16:23] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T142855#2654227 (10thcipriani) 05Open>03Resolved `1.28.0-wmf.18` is deployed and live everywhere. [22:19:17] 06Release-Engineering-Team, 10DBA, 10MediaWiki-Maintenance-scripts, 06Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654239 (10greg) For the task at hand, I've added https://wikitech.wikim... [22:20:20] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2654240 (10thcipriani) I just cut the `1.28.0-wmf.20` branch. The tentative plan is simply to skip the roll out of `1.28.0-wmf.19` and sync `wmf.20` to group0 wikis tom... 
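Once the replacement deployment-mira instance is up, the claimed c8.m8.s60 flavor (8 CPUs, 8 GB RAM, larger /srv) can be sanity-checked from a shell on the instance; a trivial sketch:

    # run on the new deployment-mira instance
    nproc          # expect 8 CPUs
    free -h        # expect about 8G of RAM
    df -h /srv     # confirm /srv is larger than the 20G that sank deployment-mira02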
[22:22:12] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2654246 (10thcipriani) [22:24:03] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2654251 (10thcipriani) 05Open>03Invalid Closing this task since the tentative plan is to skip the roll out of wmf.19 entirely. Any discussion... [22:30:27] 06Release-Engineering-Team, 10DBA, 10MediaWiki-Maintenance-scripts, 06Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654312 (10greg) Ok, emailed. Resolving. Thanks @jcrespo for the sugges... [22:30:35] 06Release-Engineering-Team, 10DBA, 10MediaWiki-Maintenance-scripts, 06Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654313 (10greg) a:03greg [22:31:13] 06Release-Engineering-Team, 10DBA, 10MediaWiki-Maintenance-scripts, 06Operations, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2606542 (10greg) 05Open>03Resolved p:05Triage>03Normal [22:33:23] thcipriani: Could I have https://phabricator.wikimedia.org/T143328#2627818 apply to wmf.20 too? cherry-pick is https://gerrit.wikimedia.org/r/#/c/311851/ [22:34:23] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 06Labs, 10Labs-Infrastructure, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653709 (10greg) (This isn't a quota increase request, if that was a mis-aligned upstream task addition ;) ). [22:35:14] legoktm: yup, sure. I won't do any actual checkout of wmf.20 code on tin until tomorrow. [22:37:46] 10Deployment-Systems, 03Scap3: Purge the hhvm fcgi and cli bytecache as part of deployment - https://phabricator.wikimedia.org/T146226#2654338 (10thcipriani) [22:37:49] thanks :) [22:38:17] 10Deployment-Systems, 03Scap3: Purge the hhvm fcgi and cli bytecache as part of deployment - https://phabricator.wikimedia.org/T146226#2654357 (10thcipriani) p:05Triage>03Low [22:38:27] greg-g, so if I wanted to run a script that takes 3 weeks to complete, no deployments that time? :P [22:38:58] MaxSem: I'll clarify that, but no, other deploys can happen at the same time :) [22:39:32] it can't go into the same table then [22:43:04] yeah it can [22:43:44] see https://wikitech.wikimedia.org/wiki/Deployments/Archive/2016/09#deploycal-item-20160912T1600 [22:59:51] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 06Labs, 10Labs-Infrastructure, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2654466 (10AlexMonk-WMF) [23:15:49] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2654510 (10dduvall) The migration to `deployment-db03` and `deployme... 
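The request above (applying the same fix to wmf.20) is a standard wmf-branch backport: fetch the change from Gerrit, cherry-pick it onto the branch, and push it back for review. A minimal sketch, with the repository and patchset ref assumed for illustration:

    # repository remote and patchset number are assumptions, not taken from the log
    git fetch origin wmf/1.28.0-wmf.20
    git checkout -b wmf20-backport origin/wmf/1.28.0-wmf.20
    git fetch origin refs/changes/51/311851/1
    git cherry-pick FETCH_HEAD
    git push origin HEAD:refs/for/wmf/1.28.0-wmf.20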
[23:21:35] 06Release-Engineering-Team (Long-Lived-Branches), 03Scap3 (Scap3-MediaWiki-MVP), 13Patch-For-Review: Create `scap swat` command to automate patch merging & testing during a swat deployment - https://phabricator.wikimedia.org/T142880#2654515 (10Krinkle) [23:22:23] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2654516 (10dduvall)