[00:01:52] chrismcmahon: yeah, it's a big improvement
[00:05:43] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:15:04] !log Permissions in deployment/integration/slave-scripts on integration-slave1003 are screwed up as well
[00:15:07] Logged the message, Master
[00:25:09] _fake !log Everything is fucked
[00:29:42] !log Tried reconnecting Gearman, relaunching slave agents. Force-restarting Zuul now.
[00:29:45] Logged the message, Master
[00:30:09] James_F: greg-g: CI is back. Queue not preserved.
[00:40:58] !log Permissions of deployment/integration/slave-scripts on labs slave are all screwed up (git-status says files are dirty, but when run as root git-status is clean and jshint also works fine via sudo)
[00:41:00] Logged the message, Master
[00:41:35] !log rm -rf slave-scripts and re-cloning from integration/jenkins.git on all slaves (under sudo, just like puppet originally did) - git-status and jshint both work fine now
[00:41:37] Logged the message, Master
[00:43:53] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:45:33] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:45:56] Krinkle: Thanks.
[00:46:44] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#958596 (Krinkle)
[00:48:33] Krinkle: :/ thanks
[00:49:09] Krinkle: how the heck did that happen?
[00:51:28] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[00:52:21] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#958607 (Krinkle) I found the same on integration-slave1003, slave1005 and 1007 and eventually all labs slaves.
Symptoms: * Running `git status` (as regular user)...
[00:52:28] greg-g: I have no idea
[00:52:52] It worked fine in the staging area and on integration-dev, which are also in labs, also running precise and also using plain git-clone (not trebuchet)
[00:53:54] :/
[01:05:31] greg-g: the next puppet run didn't break it again
[01:05:37] whew
[01:05:55] which is actually a bad thing. It means there's a hidden state that puppet can introduce but does not enforce.
[01:06:34] I've never seen this kind of permissions screw up to the point that even .git gets confused
[01:09:17] point
[01:16:28] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:37:57] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[02:02:54] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[02:18:54] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[02:29:44] Beta-Cluster: VE connection to Parsoid is broken again - https://phabricator.wikimedia.org/T85863#958706 (Aklapper) >>! In T85863#957929, @Ryasmeen wrote: > Verified the fix. Feel free to add the respective verified project tag.
[02:43:55] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[03:25:07] (PS1) Abartov: Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200
[03:25:48] (CR) jenkins-bot: [V: -1] Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[03:34:48] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #183: FAILURE in 32 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/183/
[03:45:32] Project browsertests-Flow-test2.wikipedia.org-windows_8-internet_explorer-sauce build #377: FAILURE in 44 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-test2.wikipedia.org-windows_8-internet_explorer-sauce/377/
[04:05:24] Yippee, build fixed!
[04:05:25] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #291: FIXED in 9 min 9 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/291/
[04:06:09] Yippee, build fixed!
[04:06:10] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #462: FIXED in 49 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/462/
[04:17:04] (PS2) Abartov: Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200
[04:17:27] (CR) jenkins-bot: [V: -1] Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[04:28:54] (PS3) Abartov: Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200
[04:29:11] (CR) jenkins-bot: [V: -1] Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[04:31:25] (PS4) Abartov: Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200
[04:31:46] (CR) jenkins-bot: [V: -1] Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[04:33:32] Yippee, build fixed!
[04:33:33] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #221: FIXED in 41 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/221/
[04:34:21] (PS5) Abartov: Tokens now auto-refresh on badtoken [ruby/api] - https://gerrit.wikimedia.org/r/183200
[04:41:41] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0]
[04:46:01] Yippee, build fixed!
[04:46:01] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #429: FIXED in 40 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/429/
[05:06:40] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:42:52] Yippee, build fixed!
[05:42:52] Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #270: FIXED in 11 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/270/
[06:57:33] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #391: FAILURE in 24 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/391/
[09:18:44] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[09:43:42] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:49:57] !log rebooting deployment-cache-bits01
[09:50:00] Logged the message, Master
[09:57:01] Beta-Cluster: Beta cluster text and mobile varnishes fails loading VCL (invalid syntax: req.http.X-Wikimedia-Debug = "1") - https://phabricator.wikimedia.org/T85993#959020 (hashar) NEW a:ori
[10:00:27] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: Connection refused
[10:01:24] !log restarted deployment-cache-mobile03 and deployment-cache-text02
[10:01:26] Logged the message, Master
[10:02:16] Quality-Assurance: Update QUnit to 1.16.0 - https://phabricator.wikimedia.org/T85994#959029 (adrianheine) NEW
[10:02:32] (CR) Zfilipin: [C: 1] Initialization command for new test suites [selenium] (env-abstraction-layer) - https://gerrit.wikimedia.org/r/183089 (owner: Dduvall)
[10:05:55] (CR) Zfilipin: [C: 1] Improved documentation on EAL [selenium] (env-abstraction-layer) - https://gerrit.wikimedia.org/r/183090 (owner: Dduvall)
[10:06:29] Beta-Cluster: Beta cluster text and mobile varnishes fails loading VCL (invalid syntax: req.http.X-Wikimedia-Debug = "1") - https://phabricator.wikimedia.org/T85993#959037 (hashar) Open>Invalid I have filled this task fearing the fault landed in operations/puppet and hence in
production. It is just a...
[10:06:52] (CR) Zfilipin: [C: 1] Merge branch 'env-abstraction-layer' [selenium] - https://gerrit.wikimedia.org/r/183093 (owner: Dduvall)
[10:08:58] Quality-Assurance: Update QUnit to 1.16.0 - https://phabricator.wikimedia.org/T85994#959045 (adrianheine)
[10:10:15] Quality-Assurance: Update QUnit to 1.16.0 - https://phabricator.wikimedia.org/T85994#959029 (adrianheine) I realized that @Krinkle already changed the HTML, so the update is really straight-forward now.
[10:19:50] Project beta-scap-eqiad build #37173: FAILURE in 22 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37173/
[10:22:06] Yippee, build fixed!
[10:22:06] Project beta-update-databases-eqiad build #6725: FIXED in 2 min 5 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/6725/
[10:24:15] pfff
[10:24:21] varnish text is broken :(
[10:24:43] !log beta varnish text cache is broken. The vcl refuses to load because of undefined probes
[10:24:46] Logged the message, Master
[10:25:23] !log deleting /etc/varnish on deployment-cache-text02 and running puppet
[10:25:25] Logged the message, Master
[10:27:42] Project beta-scap-eqiad build #37174: STILL FAILING in 5 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37174/
[10:33:52] Project beta-scap-eqiad build #37175: STILL FAILING in 4 min 37 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37175/
[10:34:55] !log varnish text cache is back up. Had to delete /etc/varnish and reinstall varnish from scratch + rerun puppet.
[10:34:56] Logged the message, Master
[10:40:27] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 49166 bytes in 0.560 second response time
[10:45:33] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:47:11] Yippee, build fixed!
[10:47:12] Project beta-scap-eqiad build #37176: FIXED in 12 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37176/
[10:56:30] !log installed openjdk 8 on CI Trusty labs slaves https://phabricator.wikimedia.org/T85964
[10:56:32] Logged the message, Master
[10:57:19] !log Taught Jenkins configuration about Java 8. Name: "Ubuntu - OpenJdk 8" JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64/ . Only available on Trusty slaves though
[10:57:20] Logged the message, Master
[12:31:40] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[12:54:57] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:02:17] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[13:24:57] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[13:27:12] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:35:56] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:48:27] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#959295 (ssastry) Doesn't seem fixed yet .. See https://gerrit.wikimedia.org/r/#/c/181250/ and the attempts to get that patch merged.
[13:49:44] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:05:55] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[14:19:45] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:22:31] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#959372 (Krinkle) Looks like I either forgot a few of the slaves or they regressed again. Iterated over all 9 integration slaves and had to re-apply it to: * integ...
[14:22:57] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#959373 (Krinkle) Open>Resolved
[14:41:02] (PS1) Hashar: Switch wikidata-gremlin to Java 8 on Trusty [integration/config] - https://gerrit.wikimedia.org/r/183251
[14:48:32] (CR) Hashar: "Job updated, confirmed to work on a patchset that requires java 8 : https://gerrit.wikimedia.org/r/#/c/183169/" [integration/config] - https://gerrit.wikimedia.org/r/183251 (owner: Hashar)
[15:00:11] (CR) Hashar: [C: 2] Switch wikidata-gremlin to Java 8 on Trusty [integration/config] - https://gerrit.wikimedia.org/r/183251 (owner: Hashar)
[15:05:38] Ops-Access-Requests, Continuous-Integration, operations: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (hashar) I believe ops expect shell access / permission changes to be in #Ops-Access-Requests
[15:07:55] (Merged) jenkins-bot: Switch wikidata-gremlin to Java 8 on Trusty [integration/config] - https://gerrit.wikimedia.org/r/183251 (owner: Hashar)
[15:15:09] Beta-Cluster: monitor that application servers are responding - https://phabricator.wikimedia.org/T54867#959479
(hashar) That is excellent @yuvipanda . Thank you very much !
[15:20:01] Quality-Assurance: Update QUnit to 1.16.0 - https://phabricator.wikimedia.org/T85994#959498 (Krinkle) Note that `assert.expect` is not new. It's just been renamed. It was previously known as `QUnit.expect` and via `QUnit.test( name, expect, callback )`.
[15:31:54] Quality-Assurance: Document how to debug Selenium tests - https://phabricator.wikimedia.org/T50216#959551 (Aklapper)
[15:31:58] Quality-Assurance: Update QA/testing documentation - https://phabricator.wikimedia.org/T59841#959552 (Aklapper)
[15:34:04] https://integration.wikimedia.org/ci/job/mwext-UploadWizard-qunit/735/console <-- this looked bogus to me because I didn't touch any qunit tests, but apparently it's persisting
[15:34:34] Exception thrown by test.module1 [...] Error: expected Error: expected [...] PhantomJS timed out, possibly due to a missing QUnit start() call.
[15:34:49] Krinkle|detached: Maybe you know? :D
[15:37:22] MediaWiki-Core-Team, Beta-Cluster: no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://phabricator.wikimedia.org/T74275#959614 (Aklapper)
[15:37:24] Quality-Assurance: Quality Assurance/Browser testing/Setup instructions is out of date - https://phabricator.wikimedia.org/T74732#959615 (Aklapper)
[15:37:32] Continuous-Integration: Allow extensions to automatically generate jsduck documentation on doc.wikimedia.org - https://phabricator.wikimedia.org/T50337#959620 (Aklapper)
[15:37:47] Continuous-Integration: Document CI doc publishing process - https://phabricator.wikimedia.org/T71975#959633 (Aklapper)
[15:37:54] Continuous-Integration: Enhance tutorial to trigger a job manually - https://phabricator.wikimedia.org/T63322#959638 (Aklapper)
[15:38:04] Continuous-Integration: JJB: document "current build parameters" to trigger-build builder - https://phabricator.wikimedia.org/T47910#959645 (Aklapper)
[15:38:34] Quality-Assurance:
Document how the entire browser test tool chain is set up - https://phabricator.wikimedia.org/T58192#959672 (Aklapper)
[15:38:35] Quality-Assurance: update browser test docs on mw.o - https://phabricator.wikimedia.org/T58980#959674 (Aklapper)
[15:38:36] Quality-Assurance: update "How to run browser tests" wiki page - https://phabricator.wikimedia.org/T56601#959673 (Aklapper)
[15:38:45] Continuous-Integration: document CI workflow using sequence diagram - https://phabricator.wikimedia.org/T55716#959682 (Aklapper)
[15:40:18] Continuous-Integration: Jenkins: Publish error output of Doxyxgen - https://phabricator.wikimedia.org/T35524#959714 (Aklapper)
[15:40:28] Continuous-Integration: Describe integration/* git repositories - https://phabricator.wikimedia.org/T44518#959722 (Aklapper)
[15:40:33] Continuous-Integration: document Jenkins Job Builder - https://phabricator.wikimedia.org/T44293#959725 (Aklapper)
[15:40:37] Beta-Cluster: document deployment on beta - https://phabricator.wikimedia.org/T39943#959729 (Aklapper)
[15:59:19] Figured it out. Stubbing document.createElement isn't a good idea.
[16:01:40] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:26:17] Continuous-Integration, Citoid, VisualEditor: Set up CI in the mediawiki/services/citoid.git repo - https://phabricator.wikimedia.org/T76069#959860 (Jdforrester-WMF)
[16:26:42] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:30:44] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[16:30:51] Yippee, build fixed!
[16:30:52] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce build #115: FIXED in 18 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce/115/
[16:35:00] Project beta-scap-eqiad build #37212: FAILURE in 58 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37212/
[16:38:01] hi twentyafterfour I hear you know something about an issue with a mobile varnish server on beta labs? the mobile view is gone right now, http://en.m.wikipedia.beta.wmflabs.org/
[16:40:07] I don't think he knows anything about specifically, I just suggested because I'm not on the releng team and he is. :)
[16:43:10] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#959942 (Cmcmahon) NEW
[16:43:29] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#959942 (Cmcmahon)
[16:44:34] oh, marktraceur has access to gallium, just noticed. /me notes to self to bug him for CI issues too ;)
[16:45:01] Project beta-scap-eqiad build #37213: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37213/
[16:46:20] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:47:35] * greg-g assumes DNS for the above two failures and waits
[16:49:25] greg-g: Tch no
[16:49:30] What happened?
[16:49:52] marktraceur: re CI or re those two failures
[16:50:06] CI
[16:50:32] I was just looking at this old patch and saw your name: https://gerrit.wikimedia.org/r/#/c/181211/3/modules/admin/data/data.yaml
[16:50:55] Ah yes.
[16:51:09] I thought it was because there was something I could do for you
[16:51:40] marktraceur: not right now, maybe later :P
[16:51:53] Got it.
[16:55:29] Yippee, build fixed!
[16:55:30] Project beta-scap-eqiad build #37214: FIXED in 1 min 33 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37214/
[17:00:42] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:07:56] chrismcmahonbrb: not sure I know something about it but I'll take a look
[17:11:49] twentyafterfour: bd808 suggested you, but turns out only because you are actually RelEng :-)
[17:12:43] chrismcmahon: I don't mind looking into it
[17:23:48] twentyafterfour: YuviPanda|food may have something to do with it
[17:24:35] chrismcmahon: hmm ok .. I'm not having much luck tracking down what host is running that service
[17:25:44] twentyafterfour: I know what you mean. I have lost track of all of the beta labs hosts and what they do. This would be either a varnish not responding or an apache I guess, but which one is pretty mysterious
[17:27:30] I wish there was an easy way to generate an outline from the puppet manifests
[17:33:38] twentyafterfour: https://wikitech.wikimedia.org/wiki/Special:NovaAddress will show you which instances are mapped to which hosts
[17:33:56] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#960055 (hashar) a:hashar Ah I must have broken varnish on the instance. The text varnish suffered from some issue early today.
[17:34:05] Continuous-Integration: On all slaves, /srv/deployment/integration/slave-scripts permissions went crazy - https://phabricator.wikimedia.org/T85969#960057 (cscott) Hm. The fact that this 'regressed' by itself makes me think that puppet is making this change deliberately. If we retickle puppet on these slaves...
[17:34:10] Mobile stuff should all run through deployment-cache-mobile03 I think
[17:35:35] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#960063 (hashar) The varnish frontend is dead: ``` root@deployment-cache-mobile03:~# ps -u varnish f|cat PID TTY STAT TIME COMMAND 1209 ? Sl 0:22 /usr/sbin/varnishd -P /var/run/varnishd...
[17:37:17] so hashar broke mobile on beta it looks like
[17:39:53] bd808: that page is helpful, thanks
[17:41:08] (PS1) Krinkle: Change parsoidsvc-jslint back to UbuntuPrecise. [integration/config] - https://gerrit.wikimedia.org/r/183277
[17:41:27] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#960069 (hashar) Open>Resolved ``` apt-get install --reinstall varnish varnish-dbg rm /etc/varnish/* puppet agent -tv puppet agent -tv ``` And it is magically back: ``` root@deployment-cache-mobile0...
[17:43:15] chrismcmahon: I don’t actually think I did much to it :)
[17:43:20] Beta-Cluster: Mobile URL on beta labs does not respond at all - https://phabricator.wikimedia.org/T86020#960075 (Cmcmahon) That is frightening.
[17:43:39] YuviPanda: it was hashar :-)
[17:44:44] suuuure it was ;)
[17:45:55] (CR) Cscott: [C: 1] Change parsoidsvc-jslint back to UbuntuPrecise. [integration/config] - https://gerrit.wikimedia.org/r/183277 (owner: Krinkle)
[17:45:57] so apparently the instance was just down? I tried logging in, failed, then tried again and it was back, I guess it got rebooted while I was trying to ssh to it?
[17:48:56] twentyafterfour: hmm, don’t think it was gone - shinken would have noticed
[17:49:19] my ssh attempts were rejected
[17:52:14] twentyafterfour: strange, I am looking at auth.log and there aren’t many gaps there (we get a lot of spam there) so don’t think it restarted. perhaps OOM’d enough to not accept new connections?
I see only one try from you there
[17:52:24] which reminds me, I should add ssh checks to all instances too
[17:53:03] rather than just ping ones
[17:53:54] RECOVERY - Puppet failure on deployment-cache-mobile03 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:54:12] It gave me: The authenticity of host 'deployment-cache-mobile03.eqiad.wmflabs ()' can't be established. followed by: Host key verification failed.
[17:54:17] twice
[17:54:21] then it worked the 3rd try
[18:36:24] Yippee, build fixed!
[18:36:25] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #184: FIXED in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/184/
[18:38:02] Yippee
[18:44:37] marktraceur: That "Timed out, possibly due to missing start call" error usually means the page did not respond with a valid html page. Or that qunit.js wasn't loaded.
[18:44:42] marktraceur: Where are you getting it
[18:47:39] Krinkle: We figured it out I think
[18:47:49] Krinkle: I stubbed document.createElement and qunit did NOT like it.
[18:47:58] Yippee, build fixed!
[18:47:59] Project browsertests-Flow-test2.wikipedia.org-windows_8-internet_explorer-sauce build #378: FIXED in 46 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-test2.wikipedia.org-windows_8-internet_explorer-sauce/378/
[18:50:23] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#960316 (Krinkle) Originally we were hinting towards using our OpenStack cloud for this. That seems to me still like the most feasible path forwards, because: * We don't have to ensure that it is secure and...
[18:50:45] marktraceur: Probably PhantomJS, not QUnit. Assuming it worked for you locally in Chrome/Firefox on the special page.
[18:50:54] It didn't.
[18:51:06] Not even a little.
[18:58:12] Continuous-Integration: V+1 checks for non-whitelisted users are missing some linters included in V+2 voting checks - https://phabricator.wikimedia.org/T85800#960375 (Krinkle) a:Krinkle
[19:10:45] (CR) Dduvall: [C: -1] "This is a much appreciated contribution. Thanks!" (6 comments) [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[19:14:03] (CR) Dduvall: Tokens now auto-refresh on badtoken (1 comment) [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[19:28:59] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#960529 (greg) NEW
[19:31:47] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:38:19] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #408: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/408/
[19:42:16] ...hm.
[19:42:52] Just a timeout.
[19:45:24] Continuous-Integration: Evaluate JClouds Jenkins plugin - https://phabricator.wikimedia.org/T85933#960586 (Krinkle) We can look forever at different cloud backends, hypervisors, vm containers and provisioning (OpenVZ, [Docker](https://www.docker.com/), OpenStack, Vagrant, minijail0, [firejail](https://l3net.w...
[20:01:48] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:07:16] Sigh.
https://phabricator.wikimedia.org/T2001#959812
[20:08:12] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#960640 (greg) See also the draft RFC: https://docs.google.com/a/wikimedia.org/document/d/17g8cU2qkv256X5cb_TLfqubwMMgCIAxvJdoOwA5335g/edit (I *think* that's the latest version)
[20:13:01] (PS6) Dduvall: Tokens now auto-refresh on badtoken errors [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[20:15:40] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #222: FAILURE in 1 hr 16 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/222/
[20:15:51] (CR) Dduvall: [C: -1] "Since webmock can be a little tricky to work with, and you've already very graciously done this on volunteer time, I went ahead and implem" [ruby/api] - https://gerrit.wikimedia.org/r/183200 (owner: Abartov)
[20:17:41] PROBLEM - Puppet failure on deployment-rsync01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:18:37] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:24:35] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #430: FAILURE in 1 hr 16 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/430/
[20:30:20] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #273: FAILURE in 14 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/273/
[20:31:44] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[20:33:10] Project browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce build #305: FAILURE in 2 min 49 sec:
https://integration.wikimedia.org/ci/job/browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce/305/
[20:35:30] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #46: FAILURE in 2 min 19 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/46/
[20:38:45] § Phabricator-Sprint-Extension, Release-Engineering, Phabricator: Create a continuous integration plan for Wikimedia Phabricator patches - https://phabricator.wikimedia.org/T85123#960700 (Aklapper) p:Normal>High Thanks Hashar for creating that Jenkins job! As described earlier, development processes di...
[20:40:49] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#960712 (hashar)
[20:40:50] Continuous-Integration: Evaluate JClouds Jenkins plugin - https://phabricator.wikimedia.org/T85933#960709 (hashar) Open>Resolved a:hashar Thanks Timo for the conversation with Monty, it highlights JClouds spawns the VM when the job is scheduled so that causes some startup overhead. Learned as well t...
[20:41:40] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:42:34] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#514898 (hashar) Yup definitely heading toward reproducing the OpenStack CI infrastructure which is supported/developped by more than a handful of person. I found out today that the blocking tasks are rathe...
[20:42:40] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:44:52] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#960715 (Aklapper) releng is silent enough so far but I am slightly concerned how noisy this will get with #qa's grrrit-wm, shinken-wm, wmf-insecte and wikibugs and how much their output (expe...
[20:46:34] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:48:37] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:48:45] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#960733 (greg) The output in -qa is manageable, I don't know if we want to announce all of what is announced in -devtools in -releng though (ie: probably not #_phabricator.org, just #_phabrica...
[20:50:47] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[20:59:38] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:01:40] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[21:02:45] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:04:36] (PS1) Hashar: Send Jenkins IRC notifs to -releng [integration/config] - https://gerrit.wikimedia.org/r/183322 (https://phabricator.wikimedia.org/T86053)
[21:11:33] RECOVERY - Puppet failure on deployment-cache-text02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:15:46] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:21:09] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce build #46: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce/46/
[21:21:44] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:21:44] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:27:40] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:29:37] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:41:52] Quality-Assurance: Advanced Topics in Browser Test Automation - https://phabricator.wikimedia.org/T86070#960963 (Cmcmahon) NEW
[21:42:13] Release-Engineering, Quality-Assurance: Advanced Topics in Browser Test Automation - https://phabricator.wikimedia.org/T86070#960963 (Cmcmahon)
[21:46:10]
Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#960991 (hashar) a:hashar [21:46:41] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#960529 (hashar) Taking it, I have created a list of tasks to be done in this Task description and mentioned the Gerrit patches next to each item. [21:46:43] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:49:56] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #441: FAILURE in 56 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/441/ [21:50:48] greg-g: the bots are indeed rather spammy :D [21:51:04] greg-g: we can probably make wmf-insecte less noisy though [21:52:00] (CR) Hashar: [C: 2] "Updated jobs:" [integration/config] - https://gerrit.wikimedia.org/r/183322 (https://phabricator.wikimedia.org/T86053) (owner: Hashar) [21:57:19] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#961019 (hashar) [21:57:53] hashar: and wmf-insecte, by having yuvi/marc fix DNS :P [21:58:19] greg-g: I haven't pinged Coren since monday though :/ [21:58:25] (Merged) jenkins-bot: Send Jenkins IRC notifs to -releng [integration/config] - https://gerrit.wikimedia.org/r/183322 (https://phabricator.wikimedia.org/T86053) (owner: Hashar) [21:58:45] oh right [21:59:41] PROBLEM - Puppet failure on deployment-cache-upload02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:04:53] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:07:44] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] 
[22:08:16] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#961059 (hashar) [22:11:52] Release-Engineering: Unify RelEng related IRC channels to #wikimedia-releng - https://phabricator.wikimedia.org/T86053#961064 (hashar) [22:12:46] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:24:41] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0] [22:29:54] RECOVERY - Puppet failure on deployment-cache-mobile03 is OK: OK: Less than 1.00% above the threshold [0.0] [22:31:26] Continuous-Integration: Jenkins: Run jobs in disposable VMs - https://phabricator.wikimedia.org/T47499#961086 (greg) [22:35:16] Project browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce build #271: FAILURE in 15 min: https://integration.wikimedia.org/ci/job/browsertests-Echo-test2.wikipedia.org-linux-firefox-sauce/271/ [22:36:16] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #381: FAILURE in 1 hr 15 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/381/ [22:37:45] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:37:45] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:54:53] Project beta-scap-eqiad build #37252: FAILURE in 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37252/ [23:07:04] Project browsertests-Flow-test2.wikipedia.org-linux-chrome-sauce build #380: FAILURE in 31 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-test2.wikipedia.org-linux-chrome-sauce/380/ [23:15:27] Yippee, build fixed! 
[23:15:27] Project beta-scap-eqiad build #37254: FIXED in 1 min 32 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/37254/ [23:22:40] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:33:26] Yippee, build fixed! [23:33:27] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #386: FIXED in 14 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/386/ [23:33:43] Yippee, build fixed! [23:33:44] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #392: FIXED in 26 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/392/ [23:34:31] Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #410: FAILURE in 48 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/410/ [23:47:43] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0]