[00:40:58] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879116 (10Dzahn) compiled, checked it's noop on contint1001, merged provisioning change. contint2001 is getting Apache, ferm rules and all... [00:48:30] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879127 (10Dzahn) contint2001 now has: - contint-admin/roots users: ``` [contint2001:~] $ id hashar uid=1010(hashar) gid=500(wikidev) gro... [00:49:22] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879130 (10Dzahn) [01:00:10] PROBLEM - jenkins_zmq_publisher on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [01:05:34] ^ brand new provisioning [01:06:30] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused daniel_zahn T150771 [01:17:01] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zuul] [01:18:46] ACKNOWLEDGEMENT - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[zuul] daniel_zahn T150771 [01:29:18] PROBLEM - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [01:29:38] PROBLEM - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [01:31:23] ACKNOWLEDGEMENT - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused daniel_zahn T150771 [01:31:23] ACKNOWLEDGEMENT - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server daniel_zahn T150771 [01:48:08] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:58:26] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879239 (10Dzahn) rebased/amended/merged/follow-up fix done contint1001/2001 are now including identical roles in site.pp... [02:12:34] (03PS1) 10Filippo Giunchedi: Add prometheus-related repositories [integration/config] - 10https://gerrit.wikimedia.org/r/327692 [02:14:53] We getting more jenkins masters to speed up ci? [02:23:59] it's more about not having a single point of failure [02:24:07] so to have a backup server [02:24:11] and being able to switch datacenters [02:24:16] (for now) [02:43:20] 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10RESTBase, 13Patch-For-Review, and 2 others: Set up MCS in BetaCluster - https://phabricator.wikimedia.org/T149671#2879261 (10mobrovac) 05Open>03Resolved a:03mobrovac MCS is now set-up in Beta, so it can be now used for BetaCluster deployments... 
[02:44:24] 03Scap3, 10Mobile-Content-Service, 06Services (doing), 15User-mobrovac: Enable Scap3 config deploys for MCS - https://phabricator.wikimedia.org/T144598#2879266 (10mobrovac) a:03mobrovac [03:11:58] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [03:51:55] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0] [04:18:36] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #235: 04FAILURE in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/235/ [04:32:37] (03CR) 10Jforrester: [C: 032] Allow filtering documentation requirements based on visibility [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/327442 (owner: 10Legoktm) [04:33:55] (03Merged) 10jenkins-bot: Allow filtering documentation requirements based on visibility [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/327442 (owner: 10Legoktm) [06:53:21] Project selenium-Wikibase » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #209: 04FAILURE in 2 hr 13 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/209/ [06:55:23] PROBLEM - Puppet run on integration-slave-docker-1000 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [08:28:48] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879965 (10hashar) >>! In T150771#2879239, @Dzahn wrote: > rebased/amended/merged/follow-up fix done What a surprise to hav... 
[08:39:21] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: integration-slave-docker-1000 fails puppet: E: Version '1.12.3-0~jessie' for 'docker-engine' was not found - https://phabricator.wikimedia.org/T153419#2879986 (10hashar) [08:50:33] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724601 (10ArielGlenn) I have a few scripts that are generated from templates. Any thoughts about what we can do for these cases? [09:50:12] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2880204 (10Volans) @ArielGlenn it's surely depends on the specific cases, but I think that this is usually an anti-pattern. What... [10:04:15] !log deployment-puppetmaster02 updated puppet repo. Was stall due to a bump of the mariadb submodule [10:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:09:31] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: integration-slave-docker-1000 fails puppet: E: Version '1.12.3-0~jessie' for 'docker-engine' was not found - https://phabricator.wikimedia.org/T153419#2880256 (10hashar) The instance [[ https://horizon.wikimedia.org/project/instances/0e7bb95b... [10:13:36] !log integration-slave-docker-1000 changed docker::version from no more existent '1.12.3-0~jessie' to simply 'present'. Will have to manually upgrade it from now on. 
T153419 [10:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:14:20] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: integration-slave-docker-1000 fails puppet: E: Version '1.12.3-0~jessie' for 'docker-engine' was not found - https://phabricator.wikimedia.org/T153419#2880266 (10hashar) 05Open>03Resolved a:03hashar Changed the version to `present` for... [10:14:55] !sal [10:14:55] https://tools.wmflabs.org/sal/releng [10:15:37] !log integration: apt-get upgrade on all permanent slaves [10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:17:18] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T152563#2880278 (10phuedx) [10:25:23] RECOVERY - Puppet run on integration-slave-docker-1000 is OK: OK: Less than 1.00% above the threshold [0.0] [10:42:37] magic [10:57:05] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2880407 (10hashar) About spellcheck, Jessie has version 0.3.4 , would it make sense to backport 0.4.4 from testing and add it to... 
[11:36:25] 10Browser-Tests-Infrastructure, 07Easy: Remove lines from Gemfile that are used by RVM - https://phabricator.wikimedia.org/T1331#2880489 (10zeljkofilipin) [13:15:14] !log integration: update sudo policy for debian-glue to keep the env variable SHELL_ON_FAILURE (for https://gerrit.wikimedia.org/r/#/c/327720/ ) [13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:18:12] (03CR) 10Hashar: [] "recheck" [integration/uprightdiff] (debian) - 10https://gerrit.wikimedia.org/r/326821 (owner: 10Hashar) [13:18:26] (03CR) 10jenkins-bot: [V: 04-1] .gitreview: swap defaultbranch for track [integration/uprightdiff] (debian) - 10https://gerrit.wikimedia.org/r/326821 (owner: 10Hashar) [13:25:00] 03Scap3, 10Mobile-Content-Service, 06Services (done), 15User-mobrovac: Enable Scap3 config deploys for MCS - https://phabricator.wikimedia.org/T144598#2880842 (10mobrovac) 05Open>03Resolved MCS has been switched to use Scap3 for config deploys, so resolving. This means that from now on if any config ch... 
[13:35:33] (03PS1) 10Hashar: debian-glue: prevent shell on build failure [integration/config] - 10https://gerrit.wikimedia.org/r/327729 [13:38:11] (03CR) 10Hashar: [C: 032] debian-glue: prevent shell on build failure [integration/config] - 10https://gerrit.wikimedia.org/r/327729 (owner: 10Hashar) [13:39:29] (03Merged) 10jenkins-bot: debian-glue: prevent shell on build failure [integration/config] - 10https://gerrit.wikimedia.org/r/327729 (owner: 10Hashar) [13:44:07] !log integration / contintcloud : update security rules of labs projects to allow contint2001 [13:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:45:38] !log integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757 [13:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:45:58] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #245: 04FAILURE in 1 min 57 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/245/ [13:49:19] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2880915 (10hashar) I have updated the labs projects security rules to allow contint2001 to ssh to the labs instances [14:19:43] !log Refreshing Nodepool images. 
The snapshots were broken due to mariadb-client failing to upgrade [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:25:22] !log Nodepool Image ci-trusty-wikimedia-1481897961 in wmflabs-eqiad is ready [14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:33:58] !log Nodepool Image ci-jessie-wikimedia-1481897950 in wmflabs-eqiad is ready [14:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:47:28] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:55:15] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:00:25] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:00:59] ^ that's elukey's patch going through now, I'll patch things up in beta once gerrit is working again [15:01:23] ouch sorry Krenair [15:01:27] checking [15:02:10] it'll just be missing hiera data or something [15:02:21] don't worry about it right now, you haven't finished with prod yet I think? [15:02:31] yeah started eqiad [15:02:33] Could not find data item prometheus_nodes in any Hiera data file and no default supplied at [15:02:59] so yeah it needs the hiera variable [15:03:20] if you can take care of it I'd be really glad :) [15:11:13] elukey, it looks much happier after https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1168607&oldid=1125290 [15:11:29] thanks! [15:11:35] although... [15:11:36] I am about to finish eqiad [15:12:05] anything that doesn't look right? 
[15:12:11] hm, strange, puppet complained about ferm failing to start, but then it succeeded when I ran puppet again [15:12:42] weird [15:13:02] though ferm still doesn't want to start manually [15:15:51] oh, derp [15:16:16] /usr/sbin/ferm /etc/ferm/ferm.conf says DNS query for 'deployment-prometheus.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN [15:16:20] and it's right, I missed an 01 in there [15:17:42] :) [15:18:05] um [15:18:16] DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN [15:18:16] krenair@deployment-mediawiki06:~$ host deployment-prometheus01.deployment-prep.eqiad.wmflabs [15:18:16] deployment-prometheus01.deployment-prep.eqiad.wmflabs has address 10.68.20.247 [15:20:12] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:20:22] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0] [15:22:26] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:23:50] thanks! [15:24:07] no I mean [15:24:10] it's not fully fixed [15:24:27] ferm is still broken with a completely nonsense error [15:25:55] neither of the DNS recursors return NXDOMAIN when you ask for that domain, either A or AAAA [15:28:36] sure you can't get an AAAA record returned, but there's no NXDOMAIN [15:33:58] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#2881201 (10hashar) Jenkins just runs `tox` which has the env/commands to run defin... [15:38:33] pfff, what? 
[15:38:40] when I get rid of the AAAA rules it's happy [15:45:23] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #262: 04FAILURE in 23 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/262/ [15:47:46] root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring" [15:47:46] NXDOMAIN [15:48:01] root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'A'); print \$resolver->errorstring" [15:48:01] NOERROR [15:48:19] but... [15:48:24] root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs AAAA [15:48:28] ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53078 [15:48:37] root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs A [15:48:40] ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64731 [15:54:53] mmmm [15:55:36] greg-g: hi, are you around :) [15:56:13] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development, 13Patch-For-Review: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#2881303 (10hashar) a:03hashar https://gerrit.wikimedia.org... 
[16:02:27] 10Gerrit, 06Operations, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881321 (10Paladox) [16:03:29] root@deployment-mediawiki06:/usr/lib/x86_64-linux-gnu/perl5/5.20/Net/DNS# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$resolver->errorstring" [16:03:29] NOERROR [16:03:37] The difference? The trailing bloody full stop [16:05:06] 10Gerrit, 06Operations, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881333 (10Paladox) [16:05:12] I think otherwise, it does lookups for deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs and deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs, and obviously those are NXDOMAIN [16:06:14] 10Gerrit, 06Operations, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2724182 (10Paladox) [16:06:33] Nope. [16:06:35] This just gets worse [16:06:39] DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs.' 
failed: NOERROR [16:08:06] 10Gerrit, 06Operations, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881343 (10Paladox) [16:10:01] yeah I don't know enough perl to understand wtf is going on in there [16:10:40] 10Gerrit, 06Operations, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881366 (10hashar) For the December 16th issue, there were multiple threads at 100% CPU and HTOP reported ~ 150... [16:15:06] 10Gerrit, 06Operations, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881399 (10Paladox) @hashar that sounds like the prevous problems described on here, i noticed gerrit's cpu get... [16:15:43] Krenair: use dig ? [16:15:55] read up [16:17:24] Krenair: if you search without a dot eg: example.org [16:17:26] if that fails [16:17:35] it will try again by appending the domains in /etc/resolv.conf [16:17:55] ie: example.org.deployment-prep.eqiad.wmflabs. and example.org.eqiad.wmflabs. [16:17:58] and eventually bails out [16:18:11] AAAA is for an ipv6 record, and there is none in labs [16:18:37] a poor man debugger is to tcpdump on port 53 [16:19:00] I know all of this [16:19:16] though I haven't been dumping raw DNS traffic [16:21:38] 10Browser-Tests-Infrastructure, 07Easy: Remove lines from Gemfile that are used by RVM - https://phabricator.wikimedia.org/T1331#2881450 (10hashar) Try the github code search for `ruby-gemset`: https://github.com/search?q=org%3Awikimedia+%22ruby-gemset%22&type=Code [16:22:50] mutante: thanks for the contint2001 merges yesterday. I looked at your follow up change at breakfast this morning and eventually forgot about them today :(( sorry! 
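The search-list behaviour described above (a name without a trailing dot gets retried with each domain from /etc/resolv.conf appended, while a trailing dot marks it as fully qualified) can be sketched roughly as follows. This is a minimal illustration that assumes glibc-style `search`/`ndots` semantics; `candidate_names` is a hypothetical helper, not part of Net::DNS or ferm, and it performs no real DNS queries.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Assumption: glibc-style rules. A trailing dot disables the search list;
    otherwise the name is tried as-is first when it has at least `ndots` dots,
    and last when it has fewer.
    """
    if name.endswith("."):
        # Already fully qualified: no search-list expansion at all.
        return [name]

    as_is = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]

    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


# Search domains as quoted in the discussion for deployment-prep instances.
search = ["deployment-prep.eqiad.wmflabs", "eqiad.wmflabs"]
host = "deployment-prometheus01.deployment-prep.eqiad.wmflabs"

for candidate in candidate_names(host, search):
    print(candidate)
```

Without the trailing dot the list includes the two doubled-up names mentioned in the log, both of which would be NXDOMAIN; with the trailing dot only the exact name is queried.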
[16:23:23] hashar: you're welcome [16:23:49] I don't think there's anything wrong with what's happening on the wire [16:23:58] mutante: there are lot of good points and that gave me a few ideas. contint::firewall definitely has to be split :} [16:24:04] I think either the client DNS library, or ferm, is misinterpreting the result [16:24:05] 10Browser-Tests-Infrastructure, 07Easy: Remove lines from Gemfile that are used by RVM - https://phabricator.wikimedia.org/T1331#2881466 (10zeljkofilipin) [16:24:14] mutante: will work on that on monday I guess. or maybe a bit over the week-end if time allow [16:24:33] hashar: alright :) all good! [16:25:11] mutante: and we got to rethink our deployment so we keep the hot spare up-to-date [16:25:30] eg zuul config , jenkins jobs , generated doc etc :} [16:25:41] at least now we have something to switch to! [16:25:49] hashar: we should setup rsync cronjobs [16:25:56] like we did with gerrrit [16:25:57] i guess [16:26:05] yeah potentially [16:26:18] *nod* [16:26:38] the zuul conf we can deploy it on both, just have to not start/reload zuul on the hotspare [16:26:45] the doc can be copied to both servers once generated [16:27:00] and for Jenkins, I would like to eventually have two of them running in parallel [16:27:06] aka an active/active setup [16:27:14] but gotta play with that in labs first [16:27:39] but first, I will follow up on all your patches :} [16:27:42] ok, yes [16:27:54] like with the other systems like phab [16:28:02] i think the first step is warm standby [16:28:13] and only after that we get into "real" active/active or cluster setup [16:43:30] yeah sounds wise [16:43:50] warm standby is probably easy to reach [16:43:56] active/active will need a bit more work [16:44:05] anyway thanks a ton for the patches! 
[16:44:52] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2881502 (10hashar) status: got distracted with other things today. Have to follow up on Daniel follow up patches and refacto... [16:46:42] :) [16:48:06] That's my goal with gerrit. True master/master is Hard, but scorching-hot-warm-standby is totally doable [16:48:07] :) [16:51:14] yep, it's kind of like that for all 3, phab, CI and gerrit [16:57:45] FlorianSW: what's up? /me hasn't read backscroll yet, just got on [16:58:35] greg-g: thanks for your reply! legoktm already did what I wanted to ask you :) https://phabricator.wikimedia.org/T153438 (thanks @legoktm!) [16:58:51] * hashar waves [16:58:54] have a good week-end [16:58:59] hashar: you too! [16:59:08] FlorianSW: cool [17:05:00] 10Continuous-Integration-Config, 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2881618 (10fgiunchedi) >>! In T148494#2880204, @Volans wrote: > @ArielGlenn it's surely depends on the specific cases, but I thi... [17:41:53] What the heck did we deploy recently that breaks editing on some many sites? [17:42:04] Issue is always with something in MediaWiki:Common.js [17:42:32] like "TypeError: mw.cookie is undefined" for ce.wp and ru.wp currently. Or "mw.util.$content is null" on tt.wp [17:42:45] T153456, T153476 [17:50:25] andre__: are we sure that they are defining the RL dependencies properly? [17:51:45] They very likely don't. [17:51:55] However I wonder why it's barking suddenly now. [17:52:00] Maybe it's just coincidence. 
[17:52:41] Krinkle recently (few weeks ago) got rid of the position distinction in RL, so there's no bottom queue, so everything will load in top [17:55:46] most likely it was broken for a few users in the past, and now it's broken for everyone [17:58:00] andre__: i thought i told you not to put the candy cane in the deployments [18:00:17] Attention: grrrit-wm will be going under maintenance and will be down for at most 30 minutes. If you have any questions, please direct them to me; for more info, please go to E424 on Phab! [18:00:43] This will begin at 7 CST UTC-6 [18:07:27] Deployment-Systems, Release-Engineering-Team, Operations, Mediawiki SWAT Deployments, Puppet: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316#2876393 (greg) Let's do that. It makes sense. [19:02:59] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.29.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T152563#2881986 (mmodell) Open>Resolved [19:17:23] Project beta-scap-eqiad build #133515: FAILURE in 2 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133515/ [19:18:34] PROBLEM - Puppet run on deployment-parsoid09 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:24:52] Project beta-scap-eqiad build #133516: STILL FAILING in 0.33 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133516/ [19:34:50] Project beta-scap-eqiad build #133517: STILL FAILING in 0.29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133517/ [19:41:13] Gerrit, Release-Engineering-Team, Wikimedia-Logstash, Patch-For-Review, Technical-Debt: Look into shoving gerrit logs into logstash - https://phabricator.wikimedia.org/T141324#2882157 (Paladox) Ok, I've got it all working now with log4j :) [19:44:50] Project beta-scap-eqiad build #133518: STILL FAILING in 0.31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133518/ [19:46:48] ^ Should unbreak 
in a moment [19:53:34] RECOVERY - Puppet run on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [19:57:07] Gerrit: operations/software/hhvm_exporter to replicating to github - https://phabricator.wikimedia.org/T153489#2882229 (fgiunchedi) [19:59:18] Gerrit: operations/software/hhvm_exporter to replicating to github - https://phabricator.wikimedia.org/T153489#2882267 (demon) Open>Resolved a:demon Because I created it with the wrong name. Fixed. [20:11:41] Project beta-scap-eqiad build #133519: STILL FAILING in 16 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133519/ [20:17:39] Project beta-scap-eqiad build #133520: STILL FAILING in 4 min 2 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133520/ [20:19:28] Ugh, but I reverted.... [20:20:39] Ah, I think slaves might be lagging...? [20:24:17] Yippee, build fixed! [20:24:18] Project beta-scap-eqiad build #133521: FIXED in 4 min 44 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/133521/ [20:27:56] There we go [20:28:04] thcipriani: Fixed ^ [20:35:22] Ah. Cool beans. [21:30:15] legoktm: Interesting bug those two. It seems the problem in Common.js was able to break the WikiEditor toolbar because of cascading failures in jQuery core. [21:30:32] There was no dependency between WikiEditor and Common.js. Common.js throwing was not itself the problem [21:30:40] the problem was it threw inside a $() ready callback [21:31:07] and apparently 'site' registered its callback before WikiEditor, and our current jQuery version stops at that point in certain conditions. [21:31:20] A fresh callstack will allow further execution, but previously registered handlers are not run in that case.
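The cascading failure described above, where one handler throwing inside a shared ready queue keeps later handlers from running, can be illustrated with a small analogy. This is a sketch in Python rather than jQuery's actual implementation; `fire_fragile`, `fire_isolated`, `site_script` and `wiki_editor` are hypothetical names chosen for the illustration.

```python
def fire_fragile(handlers, log):
    """Run every handler in one call stack: the first exception aborts the rest."""
    for handler in handlers:
        handler(log)


def fire_isolated(handlers, log):
    """Run every handler, catching errors so later handlers still execute."""
    errors = []
    for handler in handlers:
        try:
            handler(log)
        except Exception as exc:
            errors.append(exc)
    return errors


def site_script(log):
    # Stands in for the site script (Common.js) throwing inside its ready callback.
    log.append("site")
    raise TypeError("mw.cookie is undefined")


def wiki_editor(log):
    # Stands in for WikiEditor, which registered its callback after 'site'.
    log.append("WikiEditor")


handlers = [site_script, wiki_editor]

fragile_log = []
try:
    fire_fragile(handlers, fragile_log)
except TypeError:
    pass
print(fragile_log)   # ['site']

isolated_log = []
errors = fire_isolated(handlers, isolated_log)
print(isolated_log)  # ['site', 'WikiEditor']
```

In the fragile variant the second handler never runs even though it has no dependency on the first, which mirrors why an error in Common.js could break the WikiEditor toolbar.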
[21:49:44] 10Gerrit, 06Operations, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2882577 (10Dzahn) @paladox pointed out these spikes today {F5076613} Here is the newest of the custom gc log f... [21:54:08] 03Scap3, 15User-mobrovac: Scap3 announces deploys even when it's just a restart - https://phabricator.wikimedia.org/T153500#2882580 (10mobrovac) [21:56:37] 10Gerrit, 10GitHub-Mirrors: operations/software/hhvm_exporter to replicating to github - https://phabricator.wikimedia.org/T153489#2882593 (10greg) [21:57:12] 03Scap3, 10Citoid, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, and 7 others: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602#2882594 (10mobrovac) [21:59:44] 03Scap3, 15User-mobrovac: Scap3 announces deploys even when it's just a restart - https://phabricator.wikimedia.org/T153500#2882598 (10demon) a:03demon I was looking at making this a nicer message in that scenario, just hadn't gotten to it yet. 
[22:02:25] 10Gerrit: `git submodule update --init --recursive` in mediawiki/extensions fails - https://phabricator.wikimedia.org/T153503#2882641 (10matmarex) [22:04:10] ostriches ^^ [22:04:29] 10Continuous-Integration-Infrastructure, 10Parsoid, 10RESTBase, 07Parsoid-Tests, 06Services (blocked): Move Parsoid and RESTBase testing from Travis CI to our Jenkins - https://phabricator.wikimedia.org/T78410#2882667 (10mobrovac) [22:06:36] 10Gerrit: `git submodule update --init --recursive` in mediawiki/extensions fails - https://phabricator.wikimedia.org/T153503#2882671 (10Paladox) Failed because of https://phabricator.wikimedia.org/diffusion/EWID/browse/master/.gitmodules [22:08:18] 06Release-Engineering-Team, 10MediaWiki-API, 10Monitoring, 06Operations, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2882677 (10mobrovac) [22:10:12] (03PS6) 10Bartosz Dziewoński: Add AbuseFilter & SpamBlacklist as UploadWizard dependency [integration/config] - 10https://gerrit.wikimedia.org/r/327202 (owner: 10Matthias Mullie) [22:10:36] anyone around on this fine friday afternoon? anyone wants to deploy a zuul config change for me? it's blocking some merges. https://gerrit.wikimedia.org/r/327202 [22:12:33] (03CR) 10Bartosz Dziewoński: [] "https://gerrit.wikimedia.org/r/#/c/326134/ needs this to be deployed." [integration/config] - 10https://gerrit.wikimedia.org/r/327202 (owner: 10Matthias Mullie) [22:31:11] ostriches: i'm sure you're around. can you merge and deploy this CI config change for me? it's blocking some merges i'd like to do before the long break. 
https://gerrit.wikimedia.org/r/327202 [22:31:41] Scap3, User-mobrovac: Scap3 announces deploys even when it's just a restart - https://phabricator.wikimedia.org/T153500#2882746 (demon) [22:32:04] (CR) Chad: [C: 2] Add AbuseFilter & SpamBlacklist as UploadWizard dependency [integration/config] - https://gerrit.wikimedia.org/r/327202 (owner: Matthias Mullie) [22:32:08] k what do I do now? [22:32:54] (Merged) jenkins-bot: Add AbuseFilter & SpamBlacklist as UploadWizard dependency [integration/config] - https://gerrit.wikimedia.org/r/327202 (owner: Matthias Mullie) [22:33:05] legoktm ^^ [22:33:29] I can deploy it I guess [22:33:32] but my internet is pretty laggy [22:34:22] !log deploying https://gerrit.wikimedia.org/r/327202 [22:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [22:34:43] ostriches: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Update_configuration [22:34:48] MatmaRex: it's deployed [22:34:53] I have to install fabric? [22:35:04] Ew. [22:36:11] ? [22:36:25] legoktm: ostriches: thanks <3 [22:36:35] Oh, i never used fabric on the test instance. [22:37:01] ostriches: you can ssh into the box and run all the commands manually if you want :) [22:37:22] ostriches: there's a bug about converting it to scap...that would be really nice too! [22:37:42] scap3 would make me happppyyyy [22:37:43] <3 [22:38:49] ostriches: should I assign https://phabricator.wikimedia.org/T129357 to you? ;) [22:39:01] Nope lol [22:42:02] Attention: Scheduled grrrit-wm maintenance in 3 hours [22:53:52] Scap3, User-mobrovac: Scap3 announces deploys even when it's just a restart - https://phabricator.wikimedia.org/T153500#2882772 (demon) Open>Resolved [22:55:09] Scap3, User-mobrovac: Scap deploy failed to sync git-fat artifacts - https://phabricator.wikimedia.org/T147856#2882774 (thcipriani) p:Normal>High Raising the priority since this is still happening. 
Again, all the logs say is that the pull was initiated: ``` Git fat pull '/srv/deployment/cassandr... [23:05:51] ostriches: greg-g has asked if I would be willing to turn stashbot task echos back on here. I wanted to give you a chance to yell at me first. [23:06:04] damnit, I tried to do this incognito! [23:06:14] ostriches: it is also possible to put your nick on a blacklist for it globally [23:06:29] * greg-g likes the echos [23:06:36] * greg-g pets the bot [23:07:04] that's fine. [23:07:12] I have it globally ignored now anyway [23:07:17] ooohhh. scap3 for tools would be sweet [23:07:57] yay [23:11:40] greg-g: should work again now [23:12:04] * bd808 sacrificed his 24d uptime for the config change [23:12:10] :) [23:12:18] T50000 [23:12:19] T50000: Test - https://phabricator.wikimedia.org/T50000 [23:12:23] heh [23:15:52] bd808: no reason it's not "possible" for tools to use scap3. would want to simplify the setup process though. namely the puppet bits [23:16:29] most are just sync some code and maybe bounce a service...we do that! [23:17:14] *nod* I guess that what I really want is to run scap on my laptop and have it update my tools :) [23:17:37] which ... hmmm... [23:18:11] push to a diffusion repo, fire a trigger that runs scap... [23:18:28] instant PaaS :) [23:19:32] main issue would be installing scap locally. We could make that easier if we publish scap via pip [23:20:03] (we have debian packages, but they're only on wmf apt & they're not useful on non-debian as a result [23:20:27] distribute software all the ways! [23:25:38] Take off the training bra and make it non debian combatiable [23:28:19] Zppix: not really a useful comment no metaphor. [23:28:22] nor* [23:28:55] I was rephrasing legoktm comment greg-g [23:29:14] ... [23:29:24] Zppix: I don't see lego mentioning any kind of bra nor clothing of any sort [23:29:55] It made sense to me damnit D: [23:30:05] Zppix: your comment doesn't make any sense, and isn't related to what I said... 
[23:30:12] Zppix: I understand, but please don't repeat things like that in the future [23:30:21] Ack [23:30:33] the software is already non-debian compatible, it's just not distributed like that. [23:30:40] Ah i see [23:30:48] I misunderstood [23:31:03] Zppix: thanks [23:47:51] Attention: 1 hour until scheduled grrrit-wm maintenance