[00:26:45] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:48] wfm [00:31:39] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30643 bytes in 0.613 second response time [00:31:39] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39583 bytes in 0.543 second response time [01:39:26] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce build #463: 04FAILURE in 22 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce/463/ [03:42:34] 7Browser-Tests, 10MediaWiki-extensions-MultimediaViewer: Failure in MMV browser tests - https://phabricator.wikimedia.org/T66249#1857495 (10Tgr) Task is over a year old, probably not relevant anymore? [03:55:26] It seems Jenkins is stuck. [04:09:16] kart_: what job? I don't see anything waiting for an executor at https://integration.wikimedia.org/ci/ [04:09:53] * bd808 now sees things stacked up in zuul [04:12:37] !log ci-jessie-wikimedia-10306 down and blocking many zuul queues [04:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [04:16:41] bd808: rake-jessie is queued. [04:17:14] yeah. Apparently there is only one node that runs all of the jessie jobs and it isn't connecting to jenkins [04:17:42] I'm trying to figure out where the instance actually is [04:24:17] !log The ip address in jenkins for ci-jessie-wikimedia-10306 now belongs to an instance named future-wikipedia.reading-web-staging.eqiad.wmflabs (obviously the config is wrong) [04:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [04:36:28] :) [04:42:22] I sent hashar an email. Hopefully he'll know how to fix things. I think it's an openstack problem [06:11:06] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:13:42] (03PS1) 10Phedenskog: Modified number of WebPageTest run to reflect production [integration/config] - 10https://gerrit.wikimedia.org/r/257277 [06:16:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39242 bytes in 0.730 second response time [06:22:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:14] bd808: Thanks. It seems hashar is still not around (should be online by now!). [09:01:58] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39266 bytes in 0.596 second response time [10:01:42] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [10:03:04] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [10:05:11] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857773 (10hashar) Nodepool (the CI process that spawn instances on labs) has been broken since Dec 6th 23:00 UTC at least. It is no more able to spawn instan... 
[10:08:54] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857775 (10hashar) p:5Triage>3Unbreak! On the `contintcloud` labs project, I can get a list of servers and they are reachable: ``` nodepool@labnodepool1001:... [10:13:36] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857793 (10hashar) The last instances spawned were apparently on 2015-12-06T14:00:17 UTC. [10:16:02] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857794 (10yuvipanda) There is a bunch of VMs that are stuck in the building/scheduling state: ```root@labcontrol1001:/home/yuvipanda# nova show 9bb7842a-4580-... [10:19:59] 7Browser-Tests, 10Continuous-Integration-Config, 10Wikidata: [Task] Move Wikidata browsertests into Wikibase repository - https://phabricator.wikimedia.org/T118727#1857815 (10adrianheine) [10:20:03] 7Browser-Tests, 10Continuous-Integration-Config, 10Wikidata, 5Patch-For-Review, 3Wikidata-Sprint-2015-12-01: create a Wikibase browser test job running against a fresh MediaWiki installation - https://phabricator.wikimedia.org/T118284#1857814 (10adrianheine) 5Open>3Resolved [10:27:43] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857830 (10yuvipanda) Ok, I restarted nova-conductor and nova-scheduler, and that seems to have fixed it. There is also a labsservices1001 puppet failure that m... [10:27:54] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857831 (10yuvipanda) p:5Unbreak!>3High [10:33:04] Project beta-scap-eqiad build #81549: 04FAILURE in 7 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/81549/ [10:49:17] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857898 (10yuvipanda) The puppet failures were because the ip alias generator found instances that had no address (because they were stuck in the scheduler): `... [10:54:30] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: Connection refused [10:59:30] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.078 second response time [11:12:37] (03CR) 10Addshore: [C: 032] Update squizlabs/php_codesniffer to 2.4.0 [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/255557 (owner: 10Paladox) [11:15:14] (03Merged) 10jenkins-bot: Update squizlabs/php_codesniffer to 2.4.0 [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/255557 (owner: 10Paladox) [11:15:16] (03CR) 10Addshore: [C: 04-1] "OK (12 tests, 0 assertions)" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/237899 (https://phabricator.wikimedia.org/T92751) (owner: 10TasneemLo) [11:15:37] 7Browser-Tests, 10Continuous-Integration-Config, 10Wikidata, 5Patch-For-Review, 3Wikidata-Sprint-2015-12-01: create a Wikibase browser test job running against a fresh MediaWiki installation - https://phabricator.wikimedia.org/T118284#1857938 (10Tobi_WMDE_SW) @adrianheine @janzerebecki I love to see move... 
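[Annotation on the T120586 exchange above: Nodepool could not get new instances because a batch of VMs was stuck in the building/scheduling state, and restarting nova-conductor and nova-scheduler on the control host cleared it. A minimal sketch of the kind of checks involved, assuming admin OpenStack credentials are already sourced on the control host and that the services are managed by the usual init scripts (service names may differ per deployment):]

```
# Look for instances wedged in BUILD/ERROR across all projects.
nova list --all-tenants | grep -E 'BUILD|ERROR'

# Inspect one of them; a task_state of "scheduling" that never progresses
# points at the scheduler/conductor rather than at a compute node.
nova show <instance-uuid>

# The blunt fix used in this incident: bounce the control-plane services.
sudo service nova-conductor restart
sudo service nova-scheduler restart
```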
[11:29:22] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857950 (10Luke081515) hm, creation of my instance rcm-5 worked. I start the creation at 21UTC, and one and a half hour later, it finisshed, so this is a long t... [11:58:03] PROBLEM - Host Generic Beta Cluster is DOWN: PING CRITICAL - Packet loss = 100% [12:07:58] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2908 bytes in 0.057 second response time [12:08:04] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2908 bytes in 0.070 second response time [12:08:54] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2908 bytes in 0.067 second response time [12:13:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 39267 bytes in 1.490 second response time [12:15:40] Yippee, build fixed! [12:15:41] Project beta-scap-eqiad build #81553: 09FIXED in 6 min 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/81553/ [12:17:57] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 39266 bytes in 0.548 second response time [12:18:04] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 39229 bytes in 0.631 second response time [12:28:01] 7Browser-Tests, 10Continuous-Integration-Config, 10Wikidata, 5Patch-For-Review, 3Wikidata-Sprint-2015-12-01: create a Wikibase browser test job running against a fresh MediaWiki installation - https://phabricator.wikimedia.org/T118284#1858035 (10JanZerebecki) > One problem I see here is, that if you modi... [12:34:13] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #871: 04FAILURE in 2 min 13 sec: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/871/ [12:56:16] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #682: 04FAILURE in 2 min 15 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/682/ [13:08:11] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #752: 04FAILURE in 2 min 9 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/752/ [13:08:29] hashar: hello [13:08:55] kart_: hello [13:09:46] hashar: it seems: https://integration.wikimedia.org/ci/job/mwext-qunit/8735/artifact/log/mw-error.log [13:09:59] like on Friday. [13:10:35] I've recheck. Lets wait. 
[13:15:55] kart_: I am not sure whether this issue causes any actual trouble for the qunit job [13:15:59] probably worth reporting though [13:16:15] jenkins_u2_mw is the $wgDBName crafted for the job [13:17:37] oh [13:17:57] 'Pending AJAX request #0' now actually give you the proper ajax query being pending [13:18:25] waits on some action=cxpublish [13:28:32] https://integration.wikimedia.org/ci/job/mwext-qunit/8736/console looks okay now [13:41:34] !log deleting a bunch of unmanaged Jenkins jobs (no more in JJB / no more in Zuul) [13:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:43:44] 10Continuous-Integration-Config, 10Math, 10MathSearch: Make Restbase availible to jenkins - https://phabricator.wikimedia.org/T120657#1858160 (10Physikerwelt) 3NEW a:3hashar [13:43:59] 10Continuous-Integration-Config, 10Math, 10MathSearch: Make Restbase availible to jenkins - https://phabricator.wikimedia.org/T120657#1858168 (10Physikerwelt) e.g. https://gerrit.wikimedia.org/r/#/c/257309/ [13:44:13] 10Continuous-Integration-Config, 10Math, 10MathSearch: Make Restbase availible to jenkins - https://phabricator.wikimedia.org/T120657#1858169 (10Physikerwelt) [13:46:55] !log Reloading Jenkins configuration from disk following up mass deletions of jobs directly on gallium [13:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [13:48:36] it seems the mw api is not reachable from inside deployment-prep [13:49:05] curl -v en.wikipedia.beta.wmflabs.org/w/api.php times out from deployment-restbase01 [13:49:11] parsoid has got the same problem [13:49:12] it should [13:49:17] there is a specific DNS hack [13:49:39] euh? it's been working until now [13:49:44] to have en.wikipedia.beta.wmflabs.org to resolved to the instance internal IP serving that domain [13:49:55] which is deployment-cache-textXX [13:50:23] if inside labs you are yield a public IP when resolving the name of en.wikipedia.beta.wmflabs.org , that is an issue with the labs dns server :-( [13:50:31] en.wikipedia.beta.wmflabs.org/w/api.php seems to resolve to 208.80.155.135 [13:50:38] which is the proxy iirc [13:50:57] hashar: so related to this morning's labs outage? [13:51:02] 10Continuous-Integration-Config, 10Math, 10MathSearch: Make Restbase availible to jenkins - https://phabricator.wikimedia.org/T120657#1858184 (10Physikerwelt) p:5Normal>3Low [13:51:02] no clue [13:51:13] this morning that was the labs scheduler being crazy :-} [13:51:55] so yeah [13:52:41] hashar: so, is there something we can/should do here? 
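[Context for the exchange above and hashar's answer that follows: inside labs, beta cluster hostnames are supposed to resolve through the labs DNS aliaser to the internal IP of the text cache, not to the public proxy address, so the first thing to check is what answer an instance actually gets. A sketch using the addresses quoted in this log:]

```
# From a deployment-prep instance (e.g. deployment-restbase01):
dig +short en.wikipedia.beta.wmflabs.org
# Inside labs this should return the internal address of deployment-cache-text04
# (10.68.18.103 in this log); getting the public IP 208.80.155.135 instead
# means the DNS aliasing is broken.

# Once the name resolves internally, the API should answer promptly:
curl -sv --max-time 10 http://en.wikipedia.beta.wmflabs.org/w/api.php -o /dev/null
```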
[13:54:23] en.wikipedia.beta.wmflabs points to 208.80.155.135 which is relayed to deployment-cache-text04 [13:54:42] and while inside labs en.wikipedia.beta.wmflabs should really yield 10.68.18.103 [13:54:48] so I guess [13:54:55] fill bug [13:55:00] or look up for the related labs bug if there is any :-} [13:55:38] mobrovac: nothing I can do on my sie [13:55:40] side [13:56:02] kk thnx hashar [14:00:34] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3104: 04FAILURE in 1 hr 14 min: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3104/ [14:04:23] 01:14:33.210 urllib2.URLError: [14:04:25] hehe [14:09:23] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858207 (10coren) a:3coren [14:34:53] hashar: which node is updated by https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/ ? [14:35:35] deployment-parsoid05 ? [14:36:41] mobrovac: if you look at one of the build page [14:36:44] eg https://integration.wikimedia.org/ci/job/beta-parsoid-update-eqiad/1489/ [14:36:57] on the top right it shows you when and how long the build tok [14:36:59] took [14:37:02] as well as the node [14:37:05] Took 3 min 18 sec on deployment-parsoid05 [14:37:08] yup, and where [14:37:09] so yes :-} [14:37:16] figured it right after asking [14:37:19] sorry to bug to [14:37:20] :P [14:38:27] see! [14:38:31] it serves a purpose to ask :-} [14:44:06] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:52:06] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:52:35] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs, 5Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858278 (10coren) 5Open>3Resolved @yuvipanda was correct that the puppet error was caused by the stall in the scheduler; the patch abov... [14:57:34] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs, 5Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858287 (10hashar) Since people mentioned issue with the DNS resolver aliasing, it is working again: ``` deployment-bastion:~$ dig +short... [15:01:46] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [15:02:39] Yippee, build fixed! [15:02:39] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #3105: 09FIXED in 4 min 59 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/3105/ [15:22:29] !log labs DNS had some issue. all solved now. [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [15:33:49] (03CR) 10Krinkle: [C: 032] "Recompiled and deployed." 
[integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [15:36:53] (03CR) 10jenkins-bot: [V: 04-1] Modified number of WebPageTest run to reflect production [integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [15:37:52] hashar: jjb is giving me errors [15:38:39] "cannot serialize %r (type %s)" % (text, type(text).__name__) [15:38:39] TypeError: cannot serialize {'cron': '@hourly'} (type dict) [15:38:48] when running jenkins-jobs test [15:38:56] I'm fixing hte zuul error now, unrelated [15:43:14] Krinkle: oh [15:43:28] Did someone merge without testing? [15:43:28] Krinkle: ah yeah the pollscm: cron: syntax has been changed [15:43:45] you probably want to upgrade your jjb install form integration/jenkins.git [15:43:47] Or did we upgrade jenkins-job-builder without migrating the syntax first? [15:43:57] the old syntax is: [15:44:05] grzh [15:44:06] I installed latest from integration/jenkins-job-builder [15:44:13] there was some warning spurt outs [15:44:26] and I thin kI have seen a change that use the new syntax [15:44:55] http://docs.openstack.org/infra/jenkins-job-builder/triggers.html#triggers.pollscm [15:45:07] so it used to be: - pollscm: '@hourly' [15:45:55] ahhhrg [15:46:07] Krinkle: TwitterCards / ViewFiles it is me. I deleted a bunch of jobs that were no more defined in jjb [15:46:31] Is the new syntax newer or older than integration/jenkins-job-builder repo version (last updatd 13 days ago) [15:46:40] yeah it should [15:47:04] OK. It's passing now. So we are using mixed syntaxes, but it's okay, the old syntax was depercted, not removed. [15:47:15] I think maybe one shell window was still referencing the old jjb install [15:47:19] 10Deployment-Systems, 10Salt, 5Patch-For-Review: trebuchet should expect salt APIs to be asynchronous and poll for status updates from all minions - https://phabricator.wikimedia.org/T103013#1858329 (10ArielGlenn) Currently trebuchet/trigger waits for some timeout number of seconds and then displays the resu... [15:47:26] potentially yeah, that is annoying :-( [15:47:43] OK. So TwitterCards, why is that one failing? Maybe it is still referenced in jjb or zuul? [15:48:03] they were no more defined in JJB [15:48:05] so I deleted them [15:48:13] but forgot to check whether they were triggered by Zuul [15:48:16] But they are implicitly used by zuul [15:48:16] yeah [15:48:36] so I guess the easy workaround is to regenerate them :-} [15:48:40] will do [15:48:42] Is that repo closed? 
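[A note on the jenkins-job-builder error above ("cannot serialize {'cron': '@hourly'} (type dict)"): it is what an older JJB install prints when the job definitions already use the newer pollscm trigger syntax. Roughly, the two forms look like this, per the upstream docs linked in the conversation; the trigger list shown is a sketch, not a real job from integration/config:]

```yaml
# Old, deprecated-but-still-accepted form:
triggers:
  - pollscm: '@hourly'

# New form, which a pre-upgrade JJB install cannot serialize into job XML:
triggers:
  - pollscm:
      cron: '@hourly'
```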
[15:48:51] I don't find an issue to that extent in Phabricator [15:48:55] in fact, there is an issue to review and deploy it [15:49:25] TwitterCards pass https://gerrit.wikimedia.org/r/#/c/245786/ [15:49:32] will make it generic [15:49:40] k [15:50:05] view files pass as well [15:52:38] (03PS2) 10Hashar: Modified number of WebPageTest run to reflect production [integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [15:52:40] (03PS1) 10Hashar: TwitterCards ViewFiles now use generic job [integration/config] - 10https://gerrit.wikimedia.org/r/257340 [15:52:58] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/257340" [integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [15:53:28] Krinkle: sorry should have thought about testing the zuul layout after I mass deleted jobs :-( [16:05:10] (03CR) 10Hashar: [C: 032] TwitterCards ViewFiles now use generic job [integration/config] - 10https://gerrit.wikimedia.org/r/257340 (owner: 10Hashar) [16:07:52] (03Merged) 10jenkins-bot: TwitterCards ViewFiles now use generic job [integration/config] - 10https://gerrit.wikimedia.org/r/257340 (owner: 10Hashar) [16:08:01] (03Merged) 10jenkins-bot: Modified number of WebPageTest run to reflect production [integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [16:08:39] hmm [16:09:12] !log Nodepool no more notice Jenkins slaves went offline. Delay deletions and repooling significantly. Investigating [16:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:09:18] something crazy is haqppening [16:10:13] (03CR) 10Hashar: "I have deployed the Zuul part. Assuming the Jenkins job has been refreshed." [integration/config] - 10https://gerrit.wikimedia.org/r/257277 (owner: 10Phedenskog) [16:14:45] (03CR) 10Hashar: "Validated on:" [integration/config] - 10https://gerrit.wikimedia.org/r/257340 (owner: 10Hashar) [16:15:57] oh [16:16:23] !log Nodepool no more listens for Jenkins events over ZeroMQ. No TCP connection established on port 8888 [16:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:32:23] zeljkof: Hey, is https://gerrit.wikimedia.org/r/#/c/201481/ purely an example or is it something that you guys are interested in pushing forward at some point [16:33:25] zeljkof: andre came through and embarrassed us about the UploadWizard review queue so if there's something that can be abandoned I'd like to do it [16:33:28] marktraceur: an experiment at this point, if there is interest, we finish it [16:34:02] zeljkof: Well that sounds like abandon-fodder to me [16:34:12] marktraceur: let me just see if it is linked to a phab task, so it does not get lost [16:34:15] KK [16:34:39] marktraceur: abandon at will [16:34:44] the link is there [16:35:09] get that review queue to zero! :) [16:35:23] Well [16:35:29] I'd settle for "manageable" [16:35:44] The biggest problem was the OSM patch stream, which we abandoned because it's just nowhere near ready [16:36:01] I guess I should abandon the flickr things too. Crap [16:39:21] 10Continuous-Integration-Infrastructure, 7Nodepool: Nodepool does not delete nodes in a timely manner (ZeroMQ dead) - https://phabricator.wikimedia.org/T120668#1858570 (10hashar) [16:40:35] marktraceur: have fun! 
;) [16:41:07] 10Continuous-Integration-Infrastructure, 7Monitoring: Monitor Jenkins master listens on TCP port 8888 (ZeroMQ) - https://phabricator.wikimedia.org/T120669#1858578 (10hashar) 3NEW [16:45:19] 10Continuous-Integration-Infrastructure, 7Monitoring: Monitor Jenkins master listens on TCP port 8888 (ZeroMQ) - https://phabricator.wikimedia.org/T120669#1858613 (10hashar) p:5Triage>3Normal [16:45:55] 10Continuous-Integration-Infrastructure, 7Nodepool: Nodepool does not delete nodes in a timely manner (ZeroMQ dead) - https://phabricator.wikimedia.org/T120668#1858618 (10hashar) That is probably because when restarting Jenkins there were duplicate process. Thus on the second instance ZMQ would have failed be... [16:46:05] 10Continuous-Integration-Infrastructure, 7Nodepool, 7WorkType-Maintenance: Nodepool does not delete nodes in a timely manner (ZeroMQ dead) - https://phabricator.wikimedia.org/T120668#1858619 (10hashar) [17:00:25] 10Continuous-Integration-Infrastructure, 7Nodepool, 7WorkType-Maintenance: Nodepool does not delete nodes in a timely manner (ZeroMQ dead) - https://phabricator.wikimedia.org/T120668#1858704 (10hashar) 5Open>3Resolved a:3hashar I have restarted Jenkins to get the ZeroMQ publisher started again. [17:29:46] 3Scap3: Build a dependency graph resolver for deployment stages and tasks - https://phabricator.wikimedia.org/T120684#1858895 (10mmodell) 3NEW a:3mmodell [17:31:37] 6Release-Engineering-Team, 15User-greg: Survey on needs related to Beta Cluster/Staging - https://phabricator.wikimedia.org/T115497#1858907 (10greg) 5Open>3declined a:3greg [17:56:28] 10Continuous-Integration-Config, 10OOjs-UI: Stop OOUI using js documentation patterns in PHP by Doxygen testing on merge - https://phabricator.wikimedia.org/T119198#1858968 (10Jdforrester-WMF) p:5Triage>3Normal [18:02:00] 10MediaWiki-Releasing: Ready-to-use Docker package for MediaWiki - https://phabricator.wikimedia.org/T92826#1859027 (10GWicke) A very bare-bones proof of concept docker-compose setup is now available at https://github.com/gwicke/mediawiki-docker-compose. It starts up three containers: - Apache, PHP & MediaWiki... [18:03:29] 3Scap3: Build a dependency graph resolver for deployment stages and tasks - https://phabricator.wikimedia.org/T120684#1859040 (10mmodell) p:5Triage>3High [18:07:31] thcipriani: so in #wikimedia-services urandom is willing to make sure a mw/core patch will be included in next wmf branch [18:07:55] thcipriani: that is a dependency for EventBus . https://gerrit.wikimedia.org/r/#/c/56567 "Add UUIDv1 function to UIDGenerator" [18:07:59] I guess that will be done by tomorrow branch cutting [18:41:37] [10:39:50] $ git review -s Problems encountered installing commit-msg hook The following command failed with exit code 104 "GET https://gerrit.wikimedia.org/tools/hooks/commit-msg" ----------------------- 404 Not Found

Not Found
The requested URL /tools/hooks/commit-msg was not found on this server.

----- [18:41:37] ----------- [18:41:42] GCI student is running into that [18:41:50] ostriches: ^? do you know why that would happen? [18:42:25] they're in #mediawiki if someone knows [18:42:32] Bleh, git-review. [18:43:43] https remote instead of ssh [18:43:53] 3Scap3: Build a dependency graph resolver for deployment stages and tasks - https://phabricator.wikimedia.org/T120684#1859271 (10mobrovac) I agree that //within// stages a DAG would be awesome to have. What worries me here is to use one for defining the relationship between stages as well. What would be the rati... [18:45:01] legoktm: my response to that err msg is to usually copy the needed hook file from an existing local gerrit-controlled repo [18:45:01] That commit message is only available over ssh. git-review isn't smart enough to figure out the ssh url when you use an https clone. [18:45:17] yeah, that [18:45:22] +1 [18:47:45] ok, passed that along, thanks [18:47:47] Do we have a checklist on wikitech somewhere for all the little bits that need to be ready to make a new extension ride the deploy train? [18:48:31] * bd808 squints at https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment [18:49:27] that page looks promising at first but peters out towards the end [18:51:37] Where can I find error logs for betacommons? [18:51:43] bd808: sort-of this: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1c:_new_submodule_.28extension.2C_skin.2C_etc..29 [18:52:31] thcipriani: *nod* except that's the degenerate "add to branch after cut" style. [18:52:41] * bd808 may try to write something [18:53:04] thcipriani: I think this is kind of the steps -- https://phabricator.wikimedia.org/T116676#1859167 [18:53:20] but I can't remember if that is missing anything [18:55:45] Never mind, found 'em [18:57:25] bd808: you need to add the extension to the old branch too [18:57:32] or create the special extension-list.$wgVersion thing [18:58:02] oh frack, yes [18:58:28] and I think Ori actually ripped that special casing out of the config [19:08:43] OK this is pretty odd, I'm getting 503s in the browser but there are no discernible errors in any of the logs [19:09:57] Can we access the varnish logs on deployment-fluorine too? [19:10:44] 503s where? [19:11:00] betacommons [19:11:11] On request for a stash thumbnail [19:12:21] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs, 5Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1859398 (10yuvipanda) 5Resolved>3Open The edge case was unrelated to anything here, let's keep this open for investigation. [19:12:35] I've been staring at /srv/mw-log for a good 20 minutes but can't grep anything useful [19:13:23] varnish logs don't go to the logging server. There is some magic way to watch requests fly by by logging into the right varnish server and running ... some command [19:13:35] bd808: imo trying to have it ride the train adds a lot of complexity, it's probably easier to get a separate deploy slot and do backports and whatnot [19:14:02] it shouldn't be hard :( [19:14:12] is that command documented somewhere? [19:14:13] this should be the right thing to do [19:14:27] Krenair: I'm looking. 
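[An aside on the git-review failure reported at 18:41 above: gerrit.wikimedia.org does not serve the commit-msg hook over the HTTPS path git-review tries, so the usual workarounds are to copy the hook over SSH or to give git-review an SSH remote. A sketch, run from the top of the clone, assuming the standard Gerrit SSH port 29418 and with USERNAME and <project> as placeholders:]

```
# Fetch the Change-Id hook directly over SSH:
scp -p -P 29418 USERNAME@gerrit.wikimedia.org:hooks/commit-msg .git/hooks/
chmod +x .git/hooks/commit-msg

# Or add an SSH remote so "git review -s" can set itself up:
git remote add gerrit ssh://USERNAME@gerrit.wikimedia.org:29418/<project>.git
git review -s
```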
I think it is on wikitech somewhere [19:14:36] My hero, thanks [19:14:58] https://wikitech.wikimedia.org/wiki/Varnish#See_request_logs [19:15:09] `varnishncsa` apparently [19:15:26] marktraceur: so now we have to figure out which varnish to look at [19:15:59] bd808: I assume I'll hit the same one every time [19:16:18] deployment-cache-upload04 I think... [19:16:28] if not then deployment-cache-text04 [19:16:30] It says text04 miss [19:16:39] there you go then [19:16:53] Perfect, thanks [19:17:29] "stash thumbnail" -- ugh thumbs in beta cluster are a wreck as I recall [19:20:16] tgr keeps trying to get me to do a cleanup sprint with him to fix the 404 handler logic for beta cluster [19:20:16] it of course does things completely differently from production [19:20:28] bd808: Well, ragesoss reported this bug for production, so I suspect the bug will show up fine [19:21:23] Hah, here's where it gets fun [19:21:27] deployment-cache-text04.deployment-prep.eqiad.wmflabs 1162 2015-12-07T19:20:46 2.853962421 127.0.0.1 miss/200 5202 GET http://commons.wikimedia.beta.wmflabs.org/w/thumb.php?f=20151207192038%21URL26b773d2b805-1.jpg&width=100&temp=1 - - - 10.68.16.127, 127.0.0.1, 10.68.18.109, 10.68.16.189 MediaWiki/1.27alpha [19:21:35] But my browser still gets 503s [19:21:42] So what's between Varnish and my browser [19:22:47] there shouldn't be anything between varnish and the browser [19:22:52] That's what I'm saying\ [19:23:10] but the miss in cache would pass to the backend [19:23:31] and then that response would pass back [19:23:46] Hm [19:24:02] So going to Special:UploadStash and clicking on the image... "Cannot serve a file larger than 1048576 bytes." [19:24:08] Which I guess I vaguely expected [19:26:46] Maybe the thumbnail process is timing out for some reason [19:27:26] twentyafterfour: Since you run the train, this question is all about you. Do you mind doing the dance required to get new extensions on the train -or- would you rather that adding a new extension was done as a separate deploy? [19:28:07] bd808: starting this week the train is a rotating responsibility ;) [19:28:16] awesome [19:28:27] who's the first victim? [19:28:27] but I'd be in favor of reducing the the number of dance steps involved [19:28:57] thcipriani volunteered, in fact petitioned for this fate [19:29:16] yeah, I'm dumb :) [19:29:51] its good to move it around a bit. Every time it changes hands it gets better [19:30:16] yup [19:30:29] I'm not sure that changing weekly is good though just because it will reduce the "nope I'm fixing that crap" incentive [19:30:42] which is why it gets better when it changes hands [19:31:26] I'm wonder if we should do like a month each. [19:31:45] Hopefully, it will result in more robust solutions to the problems we find; however, yeah, maybe weekly is a bit short. [19:31:45] Long enough that you can get a cadence going and fix problems you encounter, but short enough that it's not /forever/ [19:32:07] shorter than forever is certainly ideal. [19:32:19] 3-4 in a row seems reasonable [19:32:30] so monthly I guess would get my vote [19:32:50] * bd808 actually kind of misses the feeling of accomplishment of pushing a new branch out [19:32:51] +1, or start with 2 weeks for everyone, then after it's gone a full round, move to monthly [19:33:23] I vote not me. [19:34:39] no fun [19:34:57] Hey, I don't even have deployment access. That's my excuse and I'm sticking to it. [19:35:17] but you have +2 in the config repo ... [19:35:24] bd808: Inorite? 
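[On the varnish question in the exchange above: the "magic command" is varnishncsa (or varnishlog) run on the cache instance that fronts the request, here deployment-cache-text04. A lowest-common-denominator sketch, since the filtering flags differ between Varnish versions; it assumes sudo access on the cache host:]

```
# Stream NCSA-style request lines and pick out the failing thumbnail requests:
sudo varnishncsa | grep 'thumb.php'

# For full per-transaction detail (headers, backend status, hit/miss):
sudo varnishlog
```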
[19:36:53] * bd808 shushes himself [19:38:38] bd808: We don't talk about that, it just helps get shit done [19:39:34] * James_F coughs. [19:53:57] 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 6Labs, 5Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1859622 (10hashar) Nodepool uses the OpenStack API to delete instances. The first error occurred at 2015-12-06 08:00:30,030 (nothing earli... [20:03:35] mobrovac: thcipriani: so I guess kudos on having AQS set up on beta :-} [20:03:59] not entirely yet hashar [20:04:14] (ish) we still need to get cassandra running and the restbase module happy. [20:04:18] the scap3 part went ok-ish, but still need to work on ops/puppet stuff to get it properly working [20:04:27] :-( [20:04:38] thcipriani: re cass config, https://gerrit.wikimedia.org/r/#/c/257406/ [20:05:26] mobrovac: nice! we can cherry pick that on deployment-puppetmaster and try it out. [20:06:05] thcipriani: thought so, but i know we have stopped puppet on depoyment-aqs01 and i wonder whether there was something else we changed apart from the cass config [20:07:39] mobrovac: right, we stopped it because the restbase module ensures that the deployment location is /srv/deployment/restbase/deploy (which is not wanted for aqs) and we changed the systemd unit file to point to a different location. [20:08:27] hm, right [20:09:00] but i guess we can get rid of that since the plan is to try to work out the proper ops/puppet patch that would not need these manual changes anyway [20:09:25] right, my plan was to work on those puppet changes this afternoon. [20:11:36] could cherry pick your change now, run puppet, and see what happens (e.g. if cassandra starts/is happy) then fix the other aqs stuff later. [20:14:43] yup, sounds like a plan thcipriani [20:14:44] thnx! [20:15:09] thank you! [20:16:53] ostriches: are you willing to help with a gerrit testing thing tomorrow? Or if not, can you suggest someone else I should ask? [20:17:06] gerrit testing thing? [20:18:25] ostriches: like, I’m going to break gerrit, and I want someone designated to notice and tell me it’s broken [20:19:03] How are you going to break it? [20:19:30] ldap [20:20:02] ostriches: sorry, writing an email to explain right now [20:20:06] but I don’t know if I should send it to you or not : [20:20:07] :) [20:20:15] Plz do :D [20:20:21] We’re switching ldap servers. Want to test everything that relies on ldap right after the switch [20:20:24] because there may be subtle issues [20:20:29] Because if you break gerrit everyone's gonna come for me with the pitchforks :D :P [20:20:44] mobrovac: cherry picked patch, cassandra _seems_ happy on deployment-aqs01 [20:20:44] so that sounds like you /are/ interested then :) [20:20:46] unless it is done while you are sleeping ostriches :-D [20:20:53] I will include you in the mail :) [20:21:01] andrewbogott: CC me as well please ( hashar @ free .fr ) [20:21:04] ok [20:21:09] hashar: That's silly. I don't sleep. [20:21:11] since I will most probably be poked by Europe folks [20:21:53] ostriches: s/sleep/random activities/ :D [20:23:10] thcipriani: it is indeed! just connected to it via the cql shell :) [20:23:39] andrewbogott: anything in particular that's worrying you? [20:23:48] email'd [20:24:04] ostriches: not really. 
I think that gerrit pulls new user accounts from ldap and then they persist in gerrit’s db right [20:24:18] so we’ll have to create a new account on wikitech and make sure it lives on in gerrit [20:24:29] mobrovac: awesome. Now just need some puppet refactoring for ::restbase and it should Just Work™ (hopefully). [20:24:47] The accounts do. Passwords aren't obviously...those are queried on login [20:24:53] thcipriani: famous last words :P [20:24:56] Also we care about groups [20:25:29] All the LDAP config is done in gerrit.config.erb [20:28:06] ostriches: that should be handled by https://gerrit.wikimedia.org/r/#/c/256346/ I think [20:28:38] Yeah long as we just point to the new servers it should Just Work [20:29:08] As long as the schema or w/e ldap calls it is the same, I don't foresee any gerrit probs. [20:29:50] me neither, but it’s a long laundry list so surely something will surprise us [20:31:24] * ostriches adds calendar event so he doesn't forget [20:32:17] "Event 'Watch Andrew Break Everything' created" [20:39:39] andrewbogott: Expanded the gerrit test criteria and also added grafana-admin as another login to test [20:40:40] ostriches: great, thank you [20:41:12] np [21:03:48] (03PS1) 10Hashar: [WYSIWIG] Use generic PHPUnit jobs [integration/config] - 10https://gerrit.wikimedia.org/r/257424 [21:04:01] (03CR) 10Hashar: [C: 032] [WYSIWIG] Use generic PHPUnit jobs [integration/config] - 10https://gerrit.wikimedia.org/r/257424 (owner: 10Hashar) [21:06:56] (03Merged) 10jenkins-bot: [WYSIWIG] Use generic PHPUnit jobs [integration/config] - 10https://gerrit.wikimedia.org/r/257424 (owner: 10Hashar) [21:21:09] Yippee, build fixed! [21:21:09] Project browsertests-QuickSurveys-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #93: 09FIXED in 5 min 7 sec: https://integration.wikimedia.org/ci/job/browsertests-QuickSurveys-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/93/ [21:23:03] Project beta-scap-eqiad build #81588: 04FAILURE in 2 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/81588/ [21:23:55] (03PS1) 10Jdlrobson: Add Cards to list of extensions to branch [tools/release] - 10https://gerrit.wikimedia.org/r/257432 (https://phabricator.wikimedia.org/T116676) [21:31:01] Yippee, build fixed! [21:31:02] Project beta-scap-eqiad build #81589: 09FIXED in 6 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/81589/ [21:48:11] greg-g: mind if i grab a deploy window after parsoid to do a portal deploy? [21:57:42] nah, doit [21:57:49] thanks [22:06:17] James_F: thanks for the thanks on the appreciation thread :) [22:12:22] greg-g: Totally deserved. :_) [22:12:28] I just hope I didn't kill the thread! [22:15:31] heh [22:42:11] (03PS1) 10Gilles: Add thumbor/base-engine [integration/config] - 10https://gerrit.wikimedia.org/r/257492 [22:45:49] hey twentyafterfour just to let you know i have some changes i'd like to ride the train. I've added you to the patches in Gerrit and they are all listed in https://phabricator.wikimedia.org/T116676 [22:45:54] anything else you'd need for that to happen? [22:57:59] Just make sure they're merged, and they'll be in the next branch? [23:11:04] 3Scap3, 3releng-201516-q2, 3releng-201516-q3: [keyresult] Migrate all Service team owned services and MW deploys to scap3 - https://phabricator.wikimedia.org/T109926#1860478 (10greg) Adding our Q3 goal project because any of these we don't get to in Q2 should obviously flow over into Q3. 
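[On the Gerrit/LDAP switch discussed above ("All the LDAP config is done in gerrit.config.erb"): the Gerrit side amounts to the [ldap] block that template renders; accounts pulled from LDAP persist in Gerrit's own database, while passwords and group membership are looked up at login. A sketch of the documented settings involved; the server and base DNs below are placeholders, not the real Wikimedia values:]

```
[ldap]
    server = ldaps://new-ldap-server.example.org
    accountBase = ou=people,dc=example,dc=org
    groupBase = ou=groups,dc=example,dc=org
```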
[23:13:51] 10Continuous-Integration-Config, 10Math, 10MathSearch: Make Restbase availible to jenkins - https://phabricator.wikimedia.org/T120657#1860481 (10mobrovac) RESTBase is technically reachable from CI. However, in the case of Math, the problem is that the `/media/math/check` `POST` endpoint is available only to... [23:29:57] greg-g: Are we back to 'normal' for SWATs today? [23:30:06] (The calendar says yes, but want to check. :-)) [23:30:11] yup [23:30:14] Cool. [23:30:21] Now I need to find FR and tell them something. [23:30:32] James_F: normal as a beard in the mission [23:31:11] greg-g: So… achingly fashionable and used only by the chronically self-conscious? ;-) [23:31:34] ........ right, normal. [23:31:38] * James_F grins. [23:31:49] :) [23:43:01] 6Release-Engineering-Team, 6Developer-Relations, 6Phabricator, 10Phabricator-Sprint-Extension: Let's all stay in the loop on the Projects V3 update - https://phabricator.wikimedia.org/T120276#1860710 (10ksmith) At some point in the next couple months, it seems almost inevitable that we will do one of the f... [23:58:02] (03CR) 10BryanDavis: [C: 031] Add Cards to list of extensions to branch [tools/release] - 10https://gerrit.wikimedia.org/r/257432 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [23:59:39] 6Release-Engineering-Team, 6Developer-Relations, 6Phabricator, 10Phabricator-Sprint-Extension: Let's all stay in the loop on the Projects V3 update - https://phabricator.wikimedia.org/T120276#1849906 (10greg) First my on-topic point: >>! In T120276#1860710, @ksmith wrote: > Perhaps we should create a "Dep...