[00:58:43] PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:34] RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [01:19:03] Yippee, build fixed! [01:19:03] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #208: FIXED in 1 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/208/ [01:19:43] PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:33] RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [02:20:59] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:59] Project beta-scap-eqiad build #47316: FAILURE in 6 min 57 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47316/ [02:24:00] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 48330 bytes in 0.551 second response time [02:27:37] Project browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce build #54: FAILURE in 3 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.m.wikipedia.beta.wmflabs.org-linux-android-sauce/54/ [02:27:38] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:38] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:38] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:38] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:45] Yippee, build fixed! 
[02:36:45] Project beta-scap-eqiad build #47318: FIXED in 2 min 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47318/ [02:36:49] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #528: FAILURE in 3 min 48 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/528/ [02:43:33] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 48149 bytes in 1.351 second response time [02:43:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 48131 bytes in 5.441 second response time [02:46:19] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 48439 bytes in 8.948 second response time [02:53:02] Project beta-scap-eqiad build #47319: FAILURE in 5 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47319/ [02:53:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:30] Yippee, build fixed! [03:03:30] Project beta-scap-eqiad build #47320: FIXED in 2 min 39 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47320/ [03:03:30] 10Continuous-Integration: Zuul Status API must not be cached indefinitely - https://phabricator.wikimedia.org/T94796#1172819 (10Krinkle) 3NEW [03:09:09] Project beta-scap-eqiad build #47321: FAILURE in 5 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47321/ [03:25:44] PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:37] RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [03:38:01] Yippee, build fixed! 
[03:38:02] Project beta-scap-eqiad build #47324: FIXED in 4 min 2 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47324/ [03:42:26] 10Continuous-Integration: Zuul Status API must not be cached indefinitely - https://phabricator.wikimedia.org/T94796#1172875 (10Krinkle) [05:13:01] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce build #396: FAILURE in 2 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-9-sauce/396/ [05:13:10] Project beta-scap-eqiad build #47326: FAILURE in 3 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47326/ [05:13:12] PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:15] PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:16] PROBLEM - HHVM Queue Size on deployment-mediawiki01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [05:13:16] PROBLEM - HHVM Queue Size on deployment-mediawiki02 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [80.0] [05:13:17] (03CR) 10Legoktm: Create prepare-mediawiki-zuul-project builder macro (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/201022 (owner: 10Legoktm) [05:13:18] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #401: FAILURE in 2 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/401/ [05:13:18] (03PS2) 10Legoktm: Create prepare-mediawiki-zuul-project builder macro [integration/config] - 10https://gerrit.wikimedia.org/r/201022 [05:13:20] (03CR) 10Legoktm: [C: 032] Create prepare-mediawiki-zuul-project builder macro [integration/config] - 10https://gerrit.wikimedia.org/r/201022 
(owner: 10Legoktm) [05:13:20] RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [05:13:21] (03Merged) 10jenkins-bot: Create prepare-mediawiki-zuul-project builder macro [integration/config] - 10https://gerrit.wikimedia.org/r/201022 (owner: 10Legoktm) [05:13:21] (03PS2) 10Legoktm: Create generic phpunit job for extensions [integration/config] - 10https://gerrit.wikimedia.org/r/201032 (https://phabricator.wikimedia.org/T94327) [05:13:23] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:24] (03PS1) 10Krinkle: Remove obsolete '{name}-{ext-name}-phpcs-HEAD' template [integration/config] - 10https://gerrit.wikimedia.org/r/201414 [05:13:26] !log the shinken alerts about beta cluster issues are due to wmflabs having issues. [05:13:26] Logged the message, Master [05:13:27] RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 9.854 second response time [05:13:59] RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.002 second response time [05:14:02] !log and right when I log'd that, things seem to be recovering [05:14:07] Logged the message, Master [05:16:05] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 48131 bytes in 0.586 second response time [05:16:14] Yippee, build fixed! 
[05:16:15] Project beta-scap-eqiad build #47330: FIXED in 1 min 37 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/47330/ [05:16:35] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time [05:17:05] PROBLEM - Puppet failure on integration-slave1003 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [05:17:34] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29175 bytes in 0.568 second response time [05:25:10] RECOVERY - HHVM Queue Size on deployment-mediawiki02 is OK: OK: Less than 30.00% above the threshold [10.0] [05:31:56] RECOVERY - Puppet failure on integration-slave1003 is OK: OK: Less than 1.00% above the threshold [0.0] [07:07:08] (03PS2) 10Krinkle: Remove obsolete '{name}-{ext-name}-phpcs-HEAD' template [integration/config] - 10https://gerrit.wikimedia.org/r/201414 [07:08:30] (03CR) 10Krinkle: [C: 032] "Deployed mwext-*-phpcs-HEAD." [integration/config] - 10https://gerrit.wikimedia.org/r/201414 (owner: 10Krinkle) [07:12:49] (03Merged) 10jenkins-bot: Remove obsolete '{name}-{ext-name}-phpcs-HEAD' template [integration/config] - 10https://gerrit.wikimedia.org/r/201414 (owner: 10Krinkle) [07:33:33] (03PS4) 10Zfilipin: Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [07:34:29] (03CR) 10Zfilipin: [C: 031] Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [08:07:08] aharoni: will be here in a few minutes. I will poke you :) [08:07:40] cool, she'll probably be here later [08:12:16] !log upgrading packages on integration-dev [08:12:19] qa-morebots: ping [08:12:21] pff [08:12:22] Logged the message, Master [08:12:22] I am a logbot running on tools-exec-10. 
[08:12:22] Messages are logged to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL. [08:12:22] To log a message, type !log . [08:13:34] aharoni: so what she is willing to do? I have no idea what LanguageTool is :) [08:13:50] usually for dev work people tend to use MediaWiki Vagrant which "easily" let you start a dev env [08:13:53] https://www.languagetool.org/ [08:14:03] with a wiki and the ability to enable whatever extensions you want with their relevant backends [08:14:18] another solution is to use a labs instance with some puppet role which setup mediawiki on it [08:14:42] oh [08:14:50] that is roughly a spell checker ? [08:15:02] a smart spell checker [08:15:07] checks grammar, too [08:15:19] would be nice to have that integrated within VisualEditor :D [08:15:20] Free Software \o/ [08:15:30] I have a dream, to integrate it with VisualEditor. [08:15:33] there is even a plugin for vim! [08:15:53] and that's precisely what ankita-ks wants to do, so I want to help [08:16:46] hashar: try writing a French sentence with grammar mistakes in languagetool.org, and it *might* detect it [08:17:12] (scope of support is very different from language to language) [08:17:31] so what is the scope of the project? [08:17:39] is that to integrate it within VE N? [08:17:46] yes [08:18:05] I envision it as a button in the VE toolbar, and when you press it, it searches for mistakes in the article. [08:18:13] sounds fun ;) [08:18:19] and shows them in a bubble, or a side panel, or a bottom panel. [08:18:29] I guess one should contact the VE team (James Forrester) [08:18:37] add the language tool to the product roadmap [08:18:42] I proposed it as a GSoC project, and a lot of people want to do it, so I make up test tasks for everybody. [08:18:46] James knows it. [08:18:57] and whoever wants to achieve that would need at least a mentor from the ve team [08:19:23] but I am not sure it is appropriate for a GSoC project. 
I am afraid it might end up requiring a lot of work to make LanguageTool able to recognize text in VE [08:19:42] well... we'll try :) [08:20:11] seems to be a Java server [08:20:39] so I speculate VE would have to send some raw text to a backend service then interpret the result in a machine readable format [08:20:50] or maybe it can be integrated in Parsoid :D [08:21:56] hashar: yes [08:24:11] Beta-Cluster, MediaWiki-ResourceLoader: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences lacks normal styling - https://phabricator.wikimedia.org/T93050#1173089 (Krinkle) This has now hit production. https://en.wikipedia.org/wiki/Special:Preferences. It seems de.wikipedia.org and www.wikida... [08:24:15] Beta-Cluster, MediaWiki-ResourceLoader: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences lacks normal styling - https://phabricator.wikimedia.org/T93050#1173092 (TheDJ) This has now progressed towards en.wp production: T94808 [08:25:12] aharoni: so in short, looks like most of the work would have to happen with VE/Parsoid :D [08:25:29] to set up a dev env, sounds like MediaWiki Vagrant is the easiest path [08:25:37] since it comes with appropriate roles to easily add VE and Parsoid [08:25:42] hashar: yes, and setting up the server may be a challenge, too [08:25:45] though I am not familiar with that [08:25:59] I'm not even sure that the LT server has Ubuntu packages. [08:26:17] but I think it is too large of a project for a GSoC :/ [08:26:42] are you discouraging or challenging? :) [08:26:42] but a proof of concept would be fun [08:26:51] oh [08:26:54] I am just being pragmatic [08:27:18] I have the feeling it is not doing a favor to the student to enroll them in a project that they probably can't achieve [08:27:30] much better imho to work on something smaller which can be delivered [08:27:32] and deployed [08:27:48] Well... I actually think that it's possible.
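The "send raw text to a backend service, then interpret the machine-readable result" idea from the discussion above could look roughly like the sketch below. It is only an illustration, not anything VisualEditor or Parsoid does: the helper names are hypothetical, and the field names (`text`, `language`, `matches`, `offset`, `length`, `message`, `replacements`) follow what I understand LanguageTool's HTTP check API to use, so they should be verified against its documentation. No network call is made; the reply in the usage example is made up.

```python
import json
from urllib.parse import urlencode


def build_check_request(text, language):
    """Encode the form body for a LanguageTool-style check request.

    Hypothetical helper; parameter names are an assumption to verify.
    """
    return urlencode({"text": text, "language": language}).encode("utf-8")


def summarize_matches(response_json):
    """Turn a JSON reply into (offset, length, message, suggestions) tuples."""
    reply = json.loads(response_json)
    return [
        (
            m["offset"],
            m["length"],
            m["message"],
            # Each replacement is assumed to be an object with a "value" key.
            [r["value"] for r in m.get("replacements", [])],
        )
        for m in reply.get("matches", [])
    ]


if __name__ == "__main__":
    # Made-up reply, shaped like the assumed response format.
    sample = json.dumps({
        "matches": [
            {"offset": 0, "length": 3, "message": "Possible typo",
             "replacements": [{"value": "The"}]}
        ]
    })
    print(build_check_request("Teh cat", "en-US"))
    print(summarize_matches(sample))
```

An editor integration would map each `(offset, length)` pair back onto the document to place the bubble or side-panel entry mentioned earlier; that mapping is the part the conversation expects to be hard.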
[08:52:17] PROBLEM - Puppet staleness on deployment-bastion is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [08:56:32] 10Continuous-Integration: Zuul Status API cached too long by Varnish - https://phabricator.wikimedia.org/T94796#1173129 (10hashar) The upstream status page http://zuul.openstack.org/ queries http://zuul.openstack.org/status.json which hits the Zuul webapp directly. They have no caching layer in between. The web... [08:57:52] (03PS5) 10Hashar: Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [08:58:03] zeljkof: lets spam collaboration team :) [08:58:17] hashar: yeah! :) [09:00:16] (03CR) 10Hashar: [C: 032] "Thanks Matt. I have updated the jobs:" [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [09:00:58] (03PS6) 10Hashar: Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [09:02:30] (03CR) 10Hashar: "I have removed links to unrelated bugs:" [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [09:02:35] (03CR) 10Hashar: [C: 032] Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [09:03:03] 10Continuous-Integration, 6Release-Engineering, 7Browser-Tests, 7Tracking: Move browser test alerts to responsible teams' channels from -releng (tracking) - https://phabricator.wikimedia.org/T89375#1173153 (10hashar) [09:03:31] 10Continuous-Integration, 6Release-Engineering, 7Browser-Tests, 7Tracking: Move browser test alerts to responsible teams' channels from -releng (tracking) - 
https://phabricator.wikimedia.org/T89375#1034923 (10hashar) Disregard patch https://gerrit.wikimedia.org/r/201083 which change the email notification... [09:04:11] (03Abandoned) 1020after4: Add a rudimentary resume feature to make-wmf-branch [tools/release] - 10https://gerrit.wikimedia.org/r/192868 (owner: 1020after4) [09:04:20] 10Continuous-Integration, 6Collaboration-Team, 10Flow, 7Browser-Tests, 7Easy: send Flow browser test job notices to #wikimedia-corefeatures channel - https://phabricator.wikimedia.org/T66103#1173163 (10hashar) [09:05:58] 10Continuous-Integration, 6Collaboration-Team, 10Flow, 7Browser-Tests, 7Easy: send Flow browser test job notices to #wikimedia-corefeatures channel - https://phabricator.wikimedia.org/T66103#697529 (10hashar) Please disregard patch https://gerrit.wikimedia.org/r/201083 which changes the email notificatio... [09:06:03] aharoni: I am in th hangout. coming? [09:06:15] ow [09:06:23] summer time [09:06:25] hmm [09:06:31] aharoni: you did not switch? [09:06:34] can we do an hour later? [09:06:48] I switched, by my colleagues from India didn't [09:06:48] aharoni: sure, will move the meeting in calendar [09:06:52] thanks [09:07:28] (03Merged) 10jenkins-bot: Notify Collaboration team of failing browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/201083 (https://phabricator.wikimedia.org/T94152) (owner: 10Mattflaschen) [09:10:03] 10Continuous-Integration, 6Collaboration-Team, 10Echo, 7Browser-Tests, 5Patch-For-Review: Fix failed Echo browsertests Jenkins job - https://phabricator.wikimedia.org/T94152#1173192 (10hashar) The browser tests for Echo now send email notifications to each developers of the collaboration team. That shoul... 
[09:11:14] Continuous-Integration, Collaboration-Team, Flow, Browser-Tests, Patch-For-Review: Fix failed Flow browsertests Jenkins job - https://phabricator.wikimedia.org/T94153#1173194 (hashar) The browser tests for Echo now send email notifications to each developers of the collaboration team. That shoul... [09:27:00] hashar: Have you been able to install Zuul locally? [09:27:13] Krinkle: good evening :) [09:27:16] I tried last night (just to get a simple localhost:../status.json) [09:27:18] what do you mean? [09:27:25] But got stuck in so many layers of bull shit. [09:27:26] and which version are you using? [09:27:33] latest upstream master [09:28:07] should be all about something like: pip install -e --user . [09:28:18] (-e to have the install point to your working copy, makes hacking easier) [09:28:33] (--user to install somewhere in your home dir to avoid cluttering the global site-packages) [09:28:43] I tried with tox --develop --notest; .tox/py27/bin/zuul-server -d -c etc/zuul.conf-sample [09:28:51] as well as sudo python setup.py develop [09:28:59] then you need to craft a zuul.conf and start the zuul-server. That should spawn the embedded webserver on port 8080 iirc [09:29:07] both times it had weird errors about some class not having a method "poll" [09:29:13] OH [09:29:16] are you on a mac? [09:29:19] Yes [09:29:23] poll() is a system call [09:29:34] I also tried installing Gearman first, since it was coming from gear module [09:29:35] which is used to listen for events on a bunch of file descriptors [09:29:48] But then that was failing too [09:29:50] poll() is broken on mac os, it does not work with some kinds of file descriptors such as sockets (iirc) [09:29:58] There's like no documentation on installing zuul.
[09:30:06] so the python version built by Apple does not provide poll [09:30:11] the homebrew python doesn't either [09:30:14] right [09:30:26] though you could pass a parameter to homebrew to get it to compile python with the broken poll [09:30:32] and it more or less works for zuul [09:31:07] ori noticed that a long time ago and sent a patch to migrate to the system call select() (that is an old but reliable polling mechanism) [09:31:22] but ori's patches got overwritten / lost when the Gearman Zuul branch got merged in [09:31:28] so zuul is back to using poll() [09:31:41] hashar: Is there a bug about that? [09:31:46] I find that storyboard hard to use. [09:31:57] Too simple, no control/overview. [09:32:27] there is no bug filed for it [09:32:28] Do they actually use that as primary issue tracker? [09:33:11] here is a related patch by ori https://review.openstack.org/#/c/28126/2/tests/test_scheduler.py [09:33:18] So far I've been using ?source_url=https://integration.wikimedia.org/zuul/status.json in the zuul status part so that I don't need to set up zuul server [09:33:33] But for the bug I'm fixing now (broken HTTP 304) I need the server running [09:33:40] :/ [09:33:58] maybe you could use a vagrant instance ? [09:34:02] or one of the integration machines [09:34:11] then I need to forward the port [09:34:13] I think I got one with a very basic zuul server [09:34:14] ugh [09:34:31] I lost my energy on that after 2 hours. [09:34:33] Will try tomorrow.
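The poll() versus select() point above can be illustrated with a small Python sketch. It is not Zuul's or gear's actual code, and the helper name `wait_readable` is hypothetical: it waits for a socket to become readable with `select.poll()` when the interpreter provides it, and otherwise falls back to `select.select()`.

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if sock has data to read within timeout_s seconds.

    Hypothetical helper for illustration; not taken from Zuul or gear.
    """
    if hasattr(select, "poll"):
        # select.poll() is not available in every Python build.
        poller = select.poll()
        poller.register(sock, select.POLLIN)
        # poll() takes its timeout in milliseconds.
        events = poller.poll(int(timeout_s * 1000))
        return any(flags & select.POLLIN for _fd, flags in events)
    # Portable fallback: select.select() takes its timeout in seconds.
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return sock in readable


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(wait_readable(a, 0.1))  # nothing has been written yet
    b.sendall(b"ping")
    print(wait_readable(a, 0.1))  # data is now pending on a
    a.close()
    b.close()
```

The `hasattr` check is the design choice that matters here: code that calls `select.poll()` unconditionally fails with the kind of "no method poll" error described above on interpreters built without it.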
[09:34:35] :( [09:34:41] I understand your pain [09:34:49] ah [09:34:50] integration-zuul-server [09:35:01] that instance should have a basic zuul server + a local gerrit [09:35:29] with a proxy configured: integration.wmflabs.org http://integration-zuul-server.eqiad.wmflabs:80 [09:35:42] aha [09:36:04] can't remember how to access the status page though hehe [09:36:33] I can't even ssh to it [09:36:45] so [09:36:54] you can attempt to migrate from select.poll to select.select [09:36:57] might take a while [09:37:08] or set up a Debian VM on your machine to play with zuul [09:37:42] !log rebooting integration-zuul-server homedir seems to be stalled/missing [09:37:47] Logged the message, Master [09:38:23] Krinkle: another possibility would be to grab the upstream web source and publish it on our integration.wikimedia.org site [09:38:28] directly on gallium, and hack from there :) [09:38:43] this way you can add a cache breaker to the status query [09:39:00] hashar: we have a cache breaker already [09:39:03] I want to get rid of it [09:39:05] upstream already did that [09:39:11] yeah I replied on the task [09:39:20] I'm working on a major commit that pulls in all upstream changes to zuul status page [09:39:24] the end users hit directly the Zuul internal webserver [09:39:35] however I'm blocked on our Zuul server being too old [09:39:37] and to mitigate load there is a 1 second process cache to avoid regenerating the whole json [09:39:41] properties like 'live' don't exist in our version [09:39:51] yeah :( [09:40:08] luckily I have completed the Debian package for Zuul [09:40:10] hashar: Yeah, it's cached inside Zuul, but should also be cached in Varnish [09:40:14] and going to install it today on one of the slaves [09:40:19] we don't want to have concurrent requests going to gallium [09:40:21] no need to [09:40:25] then next week switch gallium to use that debian package [09:40:29] even if it's just 1 second [09:40:38] from there I will be able to update our setup to 
catch up with upstream [09:40:49] well [09:40:51] Deployment-Systems, Services, operations: Automate compiling service dependencies using production Jessie libraries - https://phabricator.wikimedia.org/T94611#1173275 (fgiunchedi) p:Triage>Normal [09:40:59] does it cause any problem to have multiple concurrent queries made ? [09:41:10] The problem is that Zuul server serves an invalid 304 header. Which Apache interprets as no-cache, and Varnish interprets as "whatever" (= 2 minutes in wmf-config) [09:41:26] hashar: It's wasteful. And not the point I'm making. [09:41:27] I tried overloading it with multiple concurrent status.json queries and the server was handling them just fine [09:41:38] Ie I haven't noticed any CPU surges [09:41:43] the point is that plain zuul.json is broken [09:41:49] and cached for 2 minutes by Varnish [09:41:52] yeah [09:41:55] that is already happening right now [09:42:13] which means we're forced to send cache busters [09:42:20] which makes the AJAX handling more complicated [09:42:20] but I did not bother pursuing the caching headers patch I wrote since we use a cache breaker [09:42:36] Yeah, but upstream removed the cache breaker from the status page, and I'd like to keep it that way [09:43:09] is the cache breaker causing any trouble? [09:43:26] or is that for the sake of having the cache handled via varnish?
[09:44:56] my point is that we could certainly come up with headers to be sent by the webapp [09:45:01] that would let you get rid of the cache breaker [09:45:09] not sure whether it is worth our time though [09:48:11] Continuous-Integration, operations: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173290 (fgiunchedi) p:Triage>Normal [09:48:34] Deployment-Systems, Release-Engineering, Services, operations: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1173292 (fgiunchedi) p:Triage>Normal [09:48:44] Deployment-Systems, Services, operations: Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#1173293 (fgiunchedi) p:Triage>Normal [09:49:27] hashar: It's simple. It just needs to send Cache-Control: public, max-age=.., s-maxage=..; and Expires: (date). [09:51:17] Krinkle: yeah I have updated the task to point to the abandoned change https://review.openstack.org/#/c/66583/ [09:51:41] it is quite lame since it just sends Cache-Control: no-cache [09:51:42] https://review.openstack.org/#/c/66583/4/zuul/webapp.py,unified [09:52:06] not sure which max-age / s-maxage value to set though [09:52:19] Yeah, but upstream doesn't want that. They explicitly want to cache it. [09:52:34] which will work fine. [09:53:01] Varnish and browsers don't cache by default. It is currently caching because zuul only sends a half-ass cache header. [09:53:32] https://github.com/openstack-infra/zuul/blob/db8b89b/zuul/webapp.py#L124 [10:03:07] Continuous-Integration: Zuul Status API cached too long by Varnish - https://phabricator.wikimedia.org/T94796#1173305 (Krinkle) >>! In T94796#1173129, @hashar wrote: > The upstream status page http://zuul.openstack.org/ queries http://zuul.openstack.org/status.json which hits the Zuul webapp directly. They h...
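The headers Krinkle lists above (Cache-Control with max-age and s-maxage, plus Expires) could be produced along these lines. This is a minimal sketch, not Zuul's webapp code: the function name `cache_headers` is hypothetical, and the 1 second default lifetime is an assumption borrowed from the 1 second process cache mentioned earlier, since the conversation leaves the actual values open.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_s=1, now=None):
    """Build explicit HTTP caching headers for a status.json response.

    max_age_s: assumed freshness lifetime in seconds (hypothetical value).
    now:       Unix timestamp of the response; defaults to the current time.
    """
    now = time.time() if now is None else now
    return {
        # max-age applies to browsers, s-maxage to shared caches like Varnish.
        "Cache-Control": "public, max-age=%d, s-maxage=%d" % (max_age_s, max_age_s),
        # Expires carries the same lifetime as an absolute HTTP date.
        "Expires": formatdate(now + max_age_s, usegmt=True),
    }


if __name__ == "__main__":
    for name, value in cache_headers().items():
        print("%s: %s" % (name, value))
```

With an explicit lifetime in the response, a shared cache no longer has to fall back to its own default, which is the 2 minute behaviour complained about above.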
[10:04:27] Krinkle: which comes from https://github.com/openstack-infra/zuul/commit/aa4f2e7a3ab2ff6f235c2b421342c8154542acbb [10:04:54] yeah, they didn't test it I guess :) [10:05:01] Apache won't cache anything without a Last-Modified header. Be a good [10:05:01] citizen and set the Last-Modified header when serving status.json data. [10:05:01] Set the value to the cache_time timestamp. [10:07:15] zeljkof: interestingly Flow browser test totally broke :( https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/383/testReport/ [10:27:05] 10Continuous-Integration, 6operations: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173325 (10hashar) As pointed by Filippo, the Alioth repository does not have an upstream branch containing the source. There is a debian/watch file though so one ca... [10:30:55] aharoni: https://github.com/wikimedia/mediawiki-selenium/blob/master/lib/mediawiki_selenium/support/hooks.rb [10:41:22] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » id,contintLabsSlave && UbuntuTrusty build #35: SUCCESS in 22 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=id,label=contintLabsSlave%20&&%20UbuntuTrusty/35/ [10:42:58] 10Continuous-Integration, 6operations, 7Blocked-on-Operations, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1173355 (10fgiunchedi) [10:42:59] 10Continuous-Integration, 6operations: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173352 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi python-gear 0.5.5-2 uploaded to both jessie-wikimedia and trusty-wikimedia [11:01:02] Project browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox » hi,contintLabsSlave && UbuntuTrusty 
build #35: FAILURE in 42 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=hi,label=contintLabsSlave%20&&%20UbuntuTrusty/35/ [11:03:12] 10Continuous-Integration, 6operations: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173378 (10hashar) Confirmed. Thanks a lot @fgiunchedi From T89952 : Debian Developer [[ https://qa.debian.org/developer.php?login=mika%40debian.org | Michael "mika"... [11:41:58] 10Quality-Assurance: Implement Ruby/Selenium/Cucumber/page-object coding conventions (tracking) - https://phabricator.wikimedia.org/T62335#1173436 (10zeljkofilipin) [12:19:54] !log preventing job to run on integration-slave1001 by replacing its label with 'DoNotLabelThisSlaveHashar'. Going to install Zuul debian package on it [12:20:00] Logged the message, Master [12:25:54] (03CR) 10Hashar: Package python deps with dh-virtualenv (031 comment) [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [12:37:35] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [12:41:39] (03PS1) 10Hashar: Drop /usr/local/bin from zuul-cloner invocation [integration/config] - 10https://gerrit.wikimedia.org/r/201451 [12:57:11] (03PS1) 10Krinkle: Enable 'npm' for mwext-Collection [integration/config] - 10https://gerrit.wikimedia.org/r/201458 [12:57:43] (03CR) 10Hashar: "Passing --version yields '2.0.0', I am going to make it output the version from the debian changelog by changing OSLO_PACKAGE_VERSION in d" (031 comment) [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [12:57:50] (03CR) 10Krinkle: [C: 031] Drop /usr/local/bin from zuul-cloner invocation [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) 
[12:58:04] (03PS16) 10Hashar: Package python deps with dh-virtualenv [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) [12:59:21] (03CR) 10Krinkle: [C: 032] Enable 'npm' for mwext-Collection [integration/config] - 10https://gerrit.wikimedia.org/r/201458 (owner: 10Krinkle) [13:00:39] (03Merged) 10jenkins-bot: Enable 'npm' for mwext-Collection [integration/config] - 10https://gerrit.wikimedia.org/r/201458 (owner: 10Krinkle) [13:01:19] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/201458 [13:01:23] Logged the message, Master [13:12:22] 10Continuous-Integration, 10pywikibot-core, 5Patch-For-Review: run pep8 and pep257 for pywikibot/core - https://phabricator.wikimedia.org/T87169#1173590 (10Aklapper) @jayvdb: This task has "Unbreak now" priority since January 2015 which [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Priorit... [13:18:37] 10Continuous-Integration, 10pywikibot-core, 5Patch-For-Review: run pep8 and pep257 for pywikibot/core - https://phabricator.wikimedia.org/T87169#1173616 (10jayvdb) @aklapper, I was hoping the CI team might fix this regression created by @hashar. https://lists.wikimedia.org/pipermail/pywikipedia-l/2015-Januar... 
[13:35:12] (CR) Hashar: [C: 2] "I have double checked that jenkins-slave user on gallium/production and jenkins-deploy on labs instances all have a PATH that contains /us" [integration/config] - https://gerrit.wikimedia.org/r/201451 (owner: Hashar) [13:35:42] !log reloading Jenkins configuration files from disk to make it know about a change manually applied to most jobs config.xml files for https://gerrit.wikimedia.org/r/#/c/201451/ [13:35:47] Logged the message, Master [13:35:52] (PS2) Hashar: Drop /usr/local/bin from zuul-cloner invocation [integration/config] - https://gerrit.wikimedia.org/r/201451 [13:36:03] (CR) Hashar: [C: 2] Drop /usr/local/bin from zuul-cloner invocation [integration/config] - https://gerrit.wikimedia.org/r/201451 (owner: Hashar) [13:36:53] hashar: how do I fix this? https://gerrit.wikimedia.org/r/#/c/200558/ [13:37:02] it just says "cannot merge" [13:37:18] zeljkof: have you read my comment ? [13:37:22] https://gerrit.wikimedia.org/r/#/c/200558/1/tests/browser/features/step_definitions/login_steps.rb [13:37:28] the indentation is bad :D [13:37:34] I guess the automatic fixing did something wrong [13:37:36] hashar: yes, I have fixed it [13:37:50] well the change is still at patchset 1 isn't it ? [13:38:13] note the parent change has received a new patchset [13:38:20] so you will need some rebasing :D [13:38:31] or just send it again using tip of the branch as a parent [13:39:15] hashar: will try rebasing [13:39:46] pff [13:39:48] that is a mess [13:40:12] what is the default for Style/StringLiterals ? [13:40:15] is that single_quotes ? 
[13:41:36] hashar: you should pick one [13:41:41] there is no default [13:41:42] so [13:41:48] on https://gerrit.wikimedia.org/r/#/c/200556/5 [13:41:54] you change the style from single_quotes to double_quotes [13:42:11] then update the rubocop todo to ignore the double quotes errors [13:42:18] I would left it as single_quotes as it was [13:42:24] and ignore the single_quotes error [13:42:31] this way the patch just refresh the todo file [13:42:42] then in a second commit, switch the repo to double_quotes and fix the single quotes occurences [13:42:52] (03CR) 10jenkins-bot: [V: 04-1] Drop /usr/local/bin from zuul-cloner invocation [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [13:42:56] damn [13:43:09] hashar: but todo file is autogenerated, I did not update it manually [13:43:19] so I had to manually update the main config file [13:43:26] (03CR) 10Hashar: [C: 032] "urllib2.HTTPError: HTTP Error 503: Service Unavailable" [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [13:43:46] (03CR) 10Hashar: [C: 032] Drop /usr/local/bin from zuul-cloner invocation [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [13:44:35] * zeljkof brb [13:45:16] !log pooling back integration-slave1001 and 1002 which are using zuul-cloner provided by a debian package [13:45:22] Logged the message, Master [13:46:49] (03CR) 10Hashar: "Regenerated the package. I have deployed it (after removing the pip installed version) on all our Precise labs slave integration-slave100[" [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [13:47:43] ahh [13:48:11] (03Merged) 10jenkins-bot: Drop /usr/local/bin from zuul-cloner invocation [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [13:49:16] (03CR) 10Hashar: "I have installed the package on slaves 1001 and 1002. 
Two wikidata jobs triggered on them and worked just fine:" [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [14:04:13] !log uninstall the pip installed zuul version from Precise labs slaves by doing: pip uninstall zuul && rm /usr/local/bin/zuul* . Switching them all to a Debian package [14:04:17] Logged the message, Master [14:10:48] !log integration-slave100[1-4] are now using Zuul provided by a Debian package as of https://gerrit.wikimedia.org/r/#/c/195272/ PS 16 [14:10:52] Logged the message, Master [14:11:42] !log reduced integration-slave1004 executors from 6 to 5 to make it on par with the other precise slaves [14:11:47] Logged the message, Master [14:24:40] 10Continuous-Integration, 6operations: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173711 (10hashar) [14:26:13] <^d> Sleep, wake up, coffee, mess with puppet. [14:26:29] <^d> Repeat as needed until desired results are observed. [14:27:39] ^d good morning! [14:27:45] do you always wake up so early? [14:36:15] <^d> hashar: I do :) [14:36:22] <^d> Rise with the sun most days [14:36:49] 10Continuous-Integration: Create CI slaves using Debian Jessie for debian-glue script - https://phabricator.wikimedia.org/T94836#1173747 (10hashar) 3NEW [14:38:12] ^d: that is a good habit to have :) [14:38:44] according to Google sunrise in SF was 6:53 [14:38:48] 7:42am in my city [14:39:09] but 6:37am in Berlin [14:39:15] <^d> Yeah 6:30 is roughly when I get up most days [14:40:08] 10Continuous-Integration, 7Technical-Debt, 7Tracking: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#1173763 (10Aklapper) [14:42:13] 10Continuous-Integration: Create CI slaves using Debian Jessie for debian-glue script - https://phabricator.wikimedia.org/T94836#1173779 (10hashar) Created instance i-00000a3b with image "debian-8.0-jessie" and hostname i-00000a3b.eqiad.wmflabs. 
[[ https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000a3b.e... [14:42:29] 10Continuous-Integration, 7Technical-Debt, 7Tracking: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#1173813 (10Aklapper) [14:42:54] !log Created [[Nova_Resource:I-00000a3b.eqiad.wmflabs|integration-slave-jessie-1001]] to try out CI slave on Jessie ([[T94836]]) [14:42:59] Logged the message, Master [14:43:50] 10Continuous-Integration, 7Technical-Debt, 7Tracking: All repositories should pass jshint test (tracking) - https://phabricator.wikimedia.org/T62619#658851 (10Aklapper) [14:49:17] !log integration: nice thing, newly created instances are automatically made to point to integration-pummetmaster via hiera! Just have to sign the certificate on the master using: puppet ca list ; puppet ca sign i-000xxxx.eqiad.wmflabs [14:49:21] Logged the message, Master [14:49:41] <^d> We did the same with staging :) [14:49:56] <^d> Plus a fun script running on our puppetmaster to autosign for us [14:50:06] that is nice [14:50:45] you would never believe how much of a pain it was to have ops to +2 changes to get them available on labs instance :) [14:51:51] !log applying role::ci::slave::labs::common on integration-slave-jessie-1001 [14:51:55] Logged the message, Master [14:52:43] Error: Failed to apply catalog: Could not find dependency File[/etc/ldap/ldap.conf] for Class[Puppet::Self::Config] at /etc/puppet/modules/puppet/manifests/self/client.pp:21 [14:52:45] pfff [14:54:41] hashar: just saw this this morning https://phabricator.wikimedia.org/T94834 [14:56:50] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Jessie puppet self instance has puppet erroring out ( Could not find dependency File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174061 (10hashar) 3NEW [14:58:37] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Jessie puppet self instance has puppet erroring out ( Could not find dependency 
File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174061 (10hashar) [14:58:51] oh man thcipriani thanks! [14:59:12] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Jessie puppet self instance has puppet erroring out ( Could not find dependency File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174077 (10hashar) [14:59:47] hashar: yw, andrewbogott was talking about looking into it over in -operations [15:00:01] good to know [15:00:09] good news for me, that is the end of the day [15:00:22] so hopefully by tomorrow morning that will have been fixed and I can resume :) [15:01:44] 10Continuous-Integration: Create CI slaves using Debian Jessie for debian-glue script - https://phabricator.wikimedia.org/T94836#1174092 (10hashar) [15:15:11] 10Continuous-Integration, 6operations, 7Blocked-on-Operations, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1174146 (10hashar) Thanks to all @fgiunchedi reviews, I used patchset 16 of https://gerrit.wikimedia.org/r/#/c/195272/... [15:16:11] (03CR) 10Hashar: [C: 031 V: 031] "From my update on T48552" [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [15:16:39] <^d> hashar: Yes I would. I've been here :p [15:16:42] (03CR) 10Hashar: "And updated slaves 1003 & 1004 as well." [integration/config] - 10https://gerrit.wikimedia.org/r/201451 (owner: 10Hashar) [15:17:57] and [15:17:59] I am off! [15:18:01] see you tomorrow [15:38:53] <^d> twentyafterfour: So, automating getting things on tin. What were your plans there? [15:40:34] <^d> (if you don't have firm plans I'll even ideas or the ramblings of a madman even) [15:41:10] <^d> s/even/take/ [16:02:24] which even? [16:02:26] :P [16:05:16] there was no global flag. 
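Editor's note on the `s/even/take/` exchange above: without a global flag, a sed-style substitution touches only the first match on the line, which is why "which even?" was a fair question. A minimal Python sketch of the same behaviour follows; the sample sentence is copied from the log, and `re.sub` with `count=1` stands in for the flagless substitution (an analogy, not what any IRC client actually runs).

```python
import re

# The sentence from the log; it contains the word "even" twice.
line = "if you don't have firm plans I'll even ideas or the ramblings of a madman even"

# Equivalent of s/even/take/ (no global flag): only the first match is replaced.
first_only = re.sub(r"\beven\b", "take", line, count=1)

# Equivalent of s/even/take/g: every match is replaced.
everywhere = re.sub(r"\beven\b", "take", line)

print(first_only)
print(everywhere)
```

The first result still ends in "madman even"; the second has no "even" left, which is not what the speaker meant either.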
[16:06:58] zeljko is doing a talk on vim in a little bit (if you're in Croatia you can attend), it might be good for this situation [16:08:37] marxarelli: https://phabricator.wikimedia.org/T93174 [16:10:27] marxarelli: http://firefogg.org/ [16:11:56] oh man, vim regex substitutions. Best tip is :set hlsearch get your search right /\v\beven\b then do :%s//take/ if you don't pass a search to :s it'll use the last search. Took me a year or so to figure that out. [16:12:36] thcipriani: huh, neat [16:31:45] marxarelli: https://phabricator.wikimedia.org/T64839 [16:41:07] 10Continuous-Integration: Create CI slaves using Debian Jessie for debian-glue script - https://phabricator.wikimedia.org/T94836#1174474 (10Andrew) [16:42:11] marxarelli: https://phabricator.wikimedia.org/T66211 [16:43:30] marxarelli: https://phabricator.wikimedia.org/T71725 [16:43:31] 6Release-Engineering, 6MediaWiki-Core-Team, 10MediaWiki-Debug-Logging, 10Wikimedia-Logstash, and 2 others: Log php fatals with full backtraces again (fatal.log on fluorine) - https://phabricator.wikimedia.org/T89169#1174494 (10hoo) [16:43:50] 6Release-Engineering, 6MediaWiki-Core-Team, 10MediaWiki-Debug-Logging, 10Wikimedia-Logstash, 7HHVM: Log php fatals with full backtraces again (fatal.log on fluorine) - https://phabricator.wikimedia.org/T89169#1029116 (10hoo) [16:48:27] 6Release-Engineering, 6operations: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1174515 (10EBernhardson) [16:51:36] 6Release-Engineering, 10MediaWiki-Logging, 6operations, 7HHVM: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1174518 (10greg) [17:04:46] 10Deployment-Systems, 7Documentation: document trebuchet - https://phabricator.wikimedia.org/T94619#1174568 (10greg) p:5Triage>3Normal [17:04:54] 10Deployment-Systems, 7Documentation: Document Scap - https://phabricator.wikimedia.org/T94618#1174570 (10greg) p:5Triage>3Normal 
[17:05:21] btw, who's doing the trebuchet one? [17:05:43] 6Release-Engineering, 10Architecture, 10Parsoid, 6Services, 7Service-Architecture: Distribution strategy option: Use Vagrant puppet modules - https://phabricator.wikimedia.org/T88151#1174575 (10Arlolra) p:5Triage>3Normal [17:07:15] 10Deployment-Systems, 6operations: Use FQDNs for mediawiki-installation - https://phabricator.wikimedia.org/T93983#1174579 (10greg) 5Open>3Resolved >>! In T93983#1157794, @Dzahn wrote: >>>! In T93983#1151838, @bd808 wrote: >> The fix will be to update `mediawiki-installation` which is currently maintained... [17:08:09] 10Deployment-Systems, 6Services: Evaluate Ansible as a deployment tool - https://phabricator.wikimedia.org/T93433#1174591 (10greg) p:5Triage>3Normal [17:08:22] 10Deployment-Systems: The future of MediaWiki deployment: Tooling - https://phabricator.wikimedia.org/T94620#1174597 (10greg) p:5Triage>3Normal [17:21:10] <^d> bd808: Working on staging-tin, /srv/mediawiki-staging/ perms aren't getting setup right (https://phabricator.wikimedia.org/P473) [17:21:24] <^d> It defaulted to root:project-staging, which was obvs wrong. [17:21:35] <^d> I tried changing it to mwdeploy:mwdeploy, that's what gave me P473 [17:21:54] root:wikidev [17:22:04] <^d> /srv/mediawiki-staging is 0755 [17:22:07] <^d> Gotcha [17:22:17] mwdeploy should only own /srv/mediawiki [17:22:29] deployers should own -staging [17:22:49] <^d> Mmmk. [17:22:56] <^d> ok, 2 things to note here [17:22:58] So you want 0775 root:wikidev [17:23:06] <^d> A) I should ensure this in puppet [17:23:14] That should match tin [17:23:14] <^d> B) Permissions are good now, it worked yay! [17:23:23] <^d> C) It used prod dsh lists so exploded, lol [17:23:30] <^d> That's 3 things, w/e [17:23:44] I thought thcipriani had a patch to fix this from last Friday? [17:23:46] ^d: it is _somewhat_ ensured in puppet. If it were a new install it would be ensured. [17:23:58] <^d> Ah fixed after the fact? 
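Editor's note on the permissions discussion above: the difference between the observed 0755 and the wanted 0775 is the group-write bit, which is what lets members of the owning group (root:wikidev in the log) write to the staging directory. A small sketch of checking that bit follows. The helper names are hypothetical, and the demo runs on a temporary directory rather than /srv/mediawiki-staging.

```python
import os
import stat
import tempfile

def has_mode(path, expected):
    """True if the permission bits of `path` equal `expected` (e.g. 0o775)."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected

def group_can_write(path):
    """True if the group-write bit is set on `path`."""
    return bool(os.stat(path).st_mode & stat.S_IWGRP)

# Demo on a throwaway directory, not on a real deployment host.
with tempfile.TemporaryDirectory() as d:
    os.chmod(d, 0o755)            # the state reported in the log
    print(group_can_write(d))     # group members cannot write
    os.chmod(d, 0o775)            # the state asked for in the log
    print(group_can_write(d), has_mode(d, 0o775))
```

This only inspects mode bits; ownership (user and group) would need a separate check against `st_uid` and `st_gid`.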
[17:24:07] <^d> Maybe we should just rebuild tin :) [17:24:14] exactly, which puppet is bad at. [17:25:05] <^d> So we need dsh lists in hiera I s'pose [17:26:24] probably, someway to customize deployment groups. [17:27:30] <^d> Right now we get 17:22:34 sudo -u mwdeploy -n -- /usr/bin/rsync -l staging-tin.eqiad.wmflabs::common/wikiversions*.{json,cdb} /srv/mediawiki on mw1090.eqiad.wmnet returned [255]: ssh: connect to host mw1090.eqiad.wmnet port 22: No route to host [17:27:48] <^d> Which is exactly the reason labs doesn't touch prod like that :p [17:28:21] oops, I deployed [17:28:47] heh, vlans are a Good Thing™ [17:43:34] (03PS1) 10Dduvall: Browser proxy support for Firefox/Chrome/Phantomjs [selenium] - 10https://gerrit.wikimedia.org/r/201492 (https://phabricator.wikimedia.org/T71725) [19:34:59] ^d: what plans were you referring to? [19:49:13] <^demon|lunch> twentyafterfour: I heard vague rumblings? [19:49:41] about tin? [19:51:06] <^demon|lunch> Yeah, about wanting to automate the "getting things on tin" so there's less copy+paste+chance for error [19:57:03] ^d: well I've had the most success so far, strangely enough, using php (cli) to automate all the deployment steps .. but what do you mean "getting things on tin" - what things? [19:57:23] you mean just getting all the right repositories set up and sync'd? [19:58:06] so that we can puppet up a new tin at will without fiddling with it? [20:02:58] <^demon|lunch> Eh something like that. Moreso the checkoutMediaWiki stuff and then the fetch/merge/rebase portions of a deploy [20:05:49] ^d: Well I was working on monitoring gerrit so that my script can tell when the merge happens and then trigger the git pull. 
that part isn't too hard, gerrit has a json api [20:07:11] <^demon|lunch> Yeah, that was the bit I'm interested in [20:07:18] <^demon|lunch> As I want to do the same in staging [20:07:30] 6Release-Engineering: Design a Test-Driven Development (TDD) survey - https://phabricator.wikimedia.org/T94472#1175305 (10Jdlrobson) @zeljkofilipin could be a good question for the survey? "Does Wikimedia have a QA team?" :) The fact that so few engineers have joined this conversation is problematic in itself. [20:08:59] 6Release-Engineering: Design a Test-Driven Development (TDD) survey - https://phabricator.wikimedia.org/T94472#1175318 (10Jdlrobson) [20:09:45] ^demon|lunch: https://gerrit.wikimedia.org/r/?q=status:merged%20owner:self%20project:operations/mediawiki-config#/ [20:12:22] <^demon|lunch> How often would you poll the rest api? [20:53:11] * ^demon|lunch is filing log-error bugs [21:24:58] !log Re-creating integration-dev-slave-precise in preparation of re-creating precise slaves [21:25:02] Logged the message, Master [21:31:18] PROBLEM - Host integration-dev-precise is DOWN: CRITICAL - Host Unreachable (10.68.16.72) [21:31:52] PROBLEM - Host integration-slave1410 is DOWN: CRITICAL - Host Unreachable (10.68.17.209) [21:32:29] legoktm: ^ alerts are showing up here [21:40:18] YuviPanda: Guest2989: legoktm: Hm.. those are bogus those, I deleted those instances. [21:41:06] !log It seems integration-slave-jessie-1001 has role::ci::slave::labs::common instead of role::ci::slave::labs. Intentional? 
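Editor's note on the "gerrit has a json api" idea above: a polling script would need to decode each response and work out which merged changes it has not acted on yet. The sketch below shows only that bookkeeping, offline. It relies on two documented details of the Gerrit REST API: responses begin with a `)]}'` line to prevent XSSI, and change records carry `_number` and `status` fields. The function names, the sample body, and the change numbers are hypothetical; fetching the data and running the git pull are left out.

```python
import json

XSSI_PREFIX = ")]}'"  # Gerrit prepends this line to JSON responses

def parse_gerrit_json(body):
    """Strip the anti-XSSI prefix, if present, and decode the JSON payload."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)

def newly_merged(changes, seen):
    """Return numbers of MERGED changes not yet in `seen`, and record them."""
    fresh = [c["_number"] for c in changes
             if c.get("status") == "MERGED" and c["_number"] not in seen]
    seen.update(fresh)
    return fresh

# Hypothetical response body standing in for one poll of the changes endpoint.
sample = ")]}'\n" + json.dumps([
    {"_number": 1001, "status": "MERGED"},
    {"_number": 1002, "status": "NEW"},
])

seen = set()
print(newly_merged(parse_gerrit_json(sample), seen))  # first poll: [1001]
print(newly_merged(parse_gerrit_json(sample), seen))  # second poll: []
```

Keeping a `seen` set means the poll interval only affects latency, not correctness: a repeated response never triggers a second pull for the same change.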
[21:41:08] Logged the message, Master [21:41:43] Krinkle: yeah, it takes about 5mins for those to disappear, so deletion brings up a notice [21:41:45] which is fine I think [21:41:55] Yeah, it's nice in a way [21:42:02] as long as it doesn't repeat [21:42:19] yeah, it won't give you a 'recovery' at all :) [21:51:16] PROBLEM - Host integration-dev-slave-precise is DOWN: CRITICAL - Host Unreachable (10.68.18.3) [21:57:10] 10Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1175877 (10Krinkle) 3NEW a:3Krinkle [22:03:01] 10Continuous-Integration: Re-create ci slaves (March 2015) - https://phabricator.wikimedia.org/T91524#1175910 (10Krinkle) [22:04:06] 10Continuous-Integration, 6operations, 7Puppet: Puppet (silently) fails to setup apache on some integration-slave14xx instances - https://phabricator.wikimedia.org/T91832#1175914 (10Krinkle) [22:04:08] 10Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1175913 (10Krinkle) [22:04:22] 10Continuous-Integration, 6operations, 7Puppet: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1175915 (10Krinkle) [22:06:35] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:11:34] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [22:22:53] PROBLEM - Puppet failure on integration-slave-precise-1012 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [22:23:47] PROBLEM - Puppet failure on integration-slave-precise-1013 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [22:24:28] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #551: FAILURE in 36 min: 
https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/551/ [22:25:20] 10Continuous-Integration, 6operations, 7Puppet, 7Regression: Puppet: "Package[git-core] is already declared in file modules/authdns/manifests/scripts.pp" - https://phabricator.wikimedia.org/T94921#1176059 (10Krinkle) 3NEW a:3Krinkle [22:27:15] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:27:54] RECOVERY - Puppet failure on integration-slave-precise-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [22:28:46] RECOVERY - Puppet failure on integration-slave-precise-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [22:32:12] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [22:39:37] 10Continuous-Integration, 6operations, 7Puppet, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) 3NEW a:3Krinkle [22:41:18] 10Continuous-Integration, 6operations, 7Puppet, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) [22:41:29] 10Continuous-Integration, 6operations, 7Puppet, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) [22:41:30] whether it fixes the error or not, I'm not fond of the approach of https://gerrit.wikimedia.org/r/#/c/201603 [22:43:29] bblack: There isn't really another approach possible without changing semantics. Which is unrelated to the issue and not what I should spend my time on. [22:43:55] CI is blocked on doing anything until this is resolved. So this is the begin and of of it for me, I can't afford more. [22:44:00] what caused this? 
none of the bigs being referenced or fixed are new [22:44:08] I dont know, I'm asking ops what changed. [22:44:18] All I know is that as of this week, I can't create intances. [22:44:34] it's not failing in production, right? [22:44:35] It seems weird, but this happens every month. every month I create a new instance applying the same roles as existing instances but yet new errors show up. [22:44:42] bblack: Not yet anyway :-) [22:45:00] (but also, why would you even notice a problem in the authdns module? do we do integration hosts with role::authdns? [22:45:07] I don't know. [22:45:15] that's the main thing that made me even take note, nobody really messes with authdns but me [22:45:28] and it hasn't been messed with in quite some time [22:45:29] bblack: Does puppet evaluate it not first? [22:45:39] the master or the client? [22:45:46] I mean the class definitino [22:45:53] Maybe it interprets and as such clases [22:45:55] clashes [22:46:00] if it's horribly broken, it could break the master, yes [22:46:09] is the error you're getting on the master, or on a client run? [22:46:22] bblack: a client ci instance, not our puppetmaster [22:47:33] bblack: incinga includes authdns::monitoring, contint::packages includes include authdns::lint [22:47:46] so it's probably that causing it [22:47:52] ah there we go, that explains some of this! [22:48:23] and I'm guessing the other piece of the puzzle is something got upgraded for labs and/or ci and you're running a newer puppet version than the rest of the world. [22:48:25] https://github.com/wikimedia/operations-puppet/commit/e76110c1 [22:49:12] in which case there could be *lots* other issues. 
we probably shouldn't be running a completely different version of puppet than prod [22:49:20] if that's the case, that should get fixed rather than this [22:49:21] we do a few packages 'latest', but afaik puppet is just the standard puppet provided on trusty images for wmflabs [22:49:34] yeah but our puppetmaster in prod is precise [22:49:52] I mean the client [22:50:01] going through the exercise of upgrading puppetmaster and updating all manifests for whatever new stupid changes happened is nontrivial and rare [22:50:11] I'm asking about the master, though, as in maybe that's what changed a week ago and broke this [22:50:16] although our puppetmaster also changed to trusty it seems [22:50:20] right [22:50:25] that's a bad idea, and should revert [22:50:36] I guess Antoine did that by accident when he had to re-create it due to lack of space. [22:50:37] Ugh.. [22:52:01] if we patched over these couple of Package issues but left that master on trusty, you're just gonna keep finding endless new issues to solve, and solving some of them will turn out to be risky for the prod environment, too. it's best to deal with a big leap forward on the master in sync,. [22:52:23] bblack: I'm curious how this can be different though [22:52:36] it's the same manifests, the errors aren't wrong, right? 
[22:52:52] why wouldn't this be failing on puppetmaster@precise [22:53:17] !log Most puppet failures blocking T94916 may be caused by the fact that intergration-puppetmaster was inadvertently changed to Trusty; puppetmaster version of Trusty is not yet supported by ops [22:53:20] Logged the message, Master [22:53:48] Krinkle: because puppet on trusty is a different version that precise, and that matters a lot [22:53:58] puppet tends to not be very compatible across versions :/ [22:55:05] there's issues with client compatibility levels too, but we've solved a lot of that for prod already because we have trusty/jessie clients in prod already [22:55:24] but we haven't moved the master forward because that's a whole other ball of problems to deal with [22:56:10] 10Continuous-Integration: Downgrade intergration-puppetmaster back to Ubuntu Precise (re-create instance) - https://phabricator.wikimedia.org/T94927#1176152 (10Krinkle) 3NEW a:3Krinkle [22:56:19] 10Continuous-Integration: Downgrade intergration-puppetmaster back to Ubuntu Precise (re-create instance) - https://phabricator.wikimedia.org/T94927#1176152 (10Krinkle) a:5Krinkle>3None [22:56:42] !log New integration-slave-precise-101x are unfinished and must remain depooled. See T94916. [22:56:45] Logged the message, Master [22:57:07] bblack: OK. I'll get back on this later. too many dependencies :-( [22:57:11] Thanks! [22:59:26] 6Release-Engineering, 6Phabricator: Adding users to CC on Phabricator security tasks doesn't add them to the view/edit policy - https://phabricator.wikimedia.org/T94565#1176178 (10Aklapper) p:5Triage>3Normal [23:23:57] Project beta-code-update-eqiad build #50193: FAILURE in 57 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/50193/ [23:41:05] 10Continuous-Integration, 6Collaboration-Team, 10Flow, 7Browser-Tests, 5Patch-For-Review: Fix failed Flow browsertests Jenkins job - https://phabricator.wikimedia.org/T94153#1176326 (10Mattflaschen) p:5Normal>3High