[02:16:40] Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #110: FAILURE in 3 min 39 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/110/
[04:18:35] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #100: FAILURE in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/100/
[10:05:32] Beta-Cluster-Infrastructure, MediaWiki-extensions-ORES, ORES, Revision-Scoring-As-A-Service, Spike: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2531267 (Ladsgroup) a:Ladsgroup>None
[11:01:39] Zuul seems to be extremely slow: https://integration.wikimedia.org/zuul/
[11:03:11] Amir1 that means nodepool has stalled
[11:05:51] Amir1 meaning until it is restarted no tests will work
[11:06:26] okay, thanks
[11:07:08] You're welcome
[11:08:10] Aug 07 11:07:54 labnodepool1001 nodepoold[16727]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
[11:08:39] Oh that has happened before
[11:08:49] I'm sure it has
[11:08:56] it's just to stop > 10 concurrent jobs
[11:09:01] yep
[11:09:06] but zuul is not working at all
[11:09:06] But that's not the problem
[11:09:20] yep
[11:09:30] I don't think restarting nodepool is the answer
[11:09:39] Oh but it has before i think
[11:09:52] Active: active (running) since Mon 2016-07-04 12:39:33 UTC; 1 months 3 days ago
[11:09:57] It's been running a while
[11:10:06] oh
[11:10:25] I'd presume restarting zuul would be a better answer
[11:11:03] Maybe but unlikely zuul
[11:11:11] zuul would not display anything
[11:11:17] But maybe it is
[11:11:17] More likely zuul than nodepool
[11:11:28] Do you have access to restart zuul?
[11:11:44] yes
[11:12:09] ok
[11:12:12] :)
[11:13:30] Last reconfigured: Sun Aug 07 2016 11:26:08 GMT+0100 (GMT Summer Time)
[11:13:36] It was restarted today
[11:13:40] Reedy ^^
[11:13:44] Amir1: FWIW, recheck isn't going to do anything
[11:14:11] Reedy: I realized that after a minute :D
[11:14:30] Nothing logged to say anyone restarted it
[11:14:49] Reedy tests are working
[11:14:53] for non nodepool tests
[11:15:06] Only the nodepool tests are not working
[11:15:19] I'm not restarting nodepool on a whim
[11:16:04] ok
[11:16:30] I'll text hashar
[11:16:56] ok
[11:19:12] Reedy oh wait, didn't labs disable creating instances
[11:30:10] paladox: Not sure?
[11:30:14] That could be an issue
[11:30:17] Yep
[11:30:20] It was disabled
[11:30:28] due to an issue that arose
[11:30:39] That might explain it
[11:30:47] https://phabricator.wikimedia.org/T142165
[11:30:53] It was disabled because of ^^
[11:34:31] I don't know how related the vm pools are
[11:35:29] Oh
[11:35:59] It's possible, but I don't know what runs where
[11:36:10] Ok
[11:36:24] Anyway, I texted hashar a bit ago
[11:36:27] ok
[11:36:28] :)
[11:36:39] thanks
[11:38:50] Reedy ^^ i guess he got it
[11:39:06] Reedy: paladox around :)
[11:39:12] Yes
[11:39:16] :)
[11:39:18] yup Sam sent me a text
[11:39:24] Yep
[11:39:39] nodepool on zuul seems to be down
[11:39:44] has a task been filed in Phabricator ?
[11:39:47] probably related to labs having problems
[11:40:01] labs disabled creating instances
[11:40:06] on friday
[11:40:16] due to https://phabricator.wikimedia.org/T142165
[11:40:57] hashar nope
[11:41:04] if it is labs yeah
[11:41:05] definitely
[11:41:09] yep
[11:41:25] nodepool knows about 8 instances all flagged as being used
[11:41:29] But i guess they can revert
[11:41:33] disabled instances now
[11:41:41] they found the problem and downgraded some kernels
[11:41:53] oh
[11:42:16] Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
[11:42:22] Yep
[11:43:49] hashar someone restarted zuul
[11:43:50] today
[11:43:50] Last reconfigured: Sun Aug 07 2016 11:26:08 GMT+0100 (GMT Summer Time)
[11:43:53] and didn't log it
[11:46:06] paladox: unlikely to be related
[11:46:10] or labs died again
[11:46:14] Oh
[11:46:15] maybe
[11:46:18] seems it started occurring roughly 4 hours ago
[11:46:50] yep
[11:53:30] paladox: could you check whether the jobs are processing again please ?
[11:53:32] my net is slow :(
[11:53:47] Ok
[11:54:00] hashar nope
[11:54:04] !log nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete )
[11:54:05] doesn't look like they're processing
[11:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:54:08] qa-morebots: poke
[11:54:09] I am a logbot running on tools-exec-1204.
[11:54:09] Messages are logged to https://tools.wmflabs.org/sal/releng.
[11:54:09] To log a message, type !log .
[11:54:38] bah
[11:54:54] !log Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
[11:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:57:38] your net is slow?
[12:00:54] hashar your internet sounds slow
[12:01:02] for it to be doing Max SendQ exceeded
[12:01:24] yeah
[12:01:32] Oh
[12:01:32] qa-morebots: ping
[12:01:32] I am a logbot running on tools-exec-1204.
[12:01:32] Messages are logged to https://tools.wmflabs.org/sal/releng.
[12:01:32] To log a message, type !log .
[12:01:35] Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #105: FAILURE in 34 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/105/
[12:01:38] Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #105: FAILURE in 38 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/105/
[12:01:40] !log nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete )
[12:01:43] !log Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
[12:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[12:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[12:03:50] I don't even know what that Max SendQ exceeded means
[12:09:39] paladox: the IRC server buffer has been filled out waiting to send data to me
[12:09:48] Oh
[12:09:48] and eventually gave up / disconnected the client (me)
[12:09:52] oh
[12:10:08] I have been doing gerrit all week
[12:10:12] making it puppetised
[12:10:18] and we managed to do that
[12:10:42] ostriches and mutante fixed all the problems, now we only need to make https optional
[12:10:49] due to the rate limit in letsencrypt
[12:11:03] I have the steps here https://phabricator.wikimedia.org/P3637 :)
[12:12:22] I also got https://phabricator.wikimedia.org/T141329 fixed upstream
[12:12:29] should be fixed in gerrit 2.12.4
[12:12:31] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[12:22:28] Project selenium-GettingStarted » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #106: FAILURE in 27 sec: https://integration.wikimedia.org/ci/job/selenium-GettingStarted/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/106/
[12:24:20] hashar it seems to be working now
[12:24:46] But not sure if it will go back to the quota error again
[12:27:52] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[12:32:34] hashar i guess it is annoying it keeps disconnecting you?
[12:41:41] Reedy: paladox ci back
[12:41:46] Yep
[12:41:49] thanks
[12:42:04] :)
[12:42:14] hashar i have been working on puppetising gerrit
[12:42:17] so it works in labs
[12:42:26] ostriches and mutante fixed most of the problems
[12:42:48] and now it works except we need to do http instead of https
[12:42:57] due to letsencrypt rate limit
[12:43:05] I have a list https://phabricator.wikimedia.org/P3637
[12:43:06] ;0
[12:43:07] ;0
[12:43:08] :)
[12:47:06] paladox: root cause is https://phabricator.wikimedia.org/T126552
[12:47:13] Oh
[12:47:40] hashar we could run cron
[12:47:48] that could run garbage collection
[12:48:07] since that is what we did for gerrit before gerrit added support internally for doing that
[12:59:54] I am off again *wave*
[13:40:10] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is OK: OK: Less than 100.00% above the threshold [0.0]
[13:46:51] Release-Engineering-Team, Operations, User-greg, Wikimedia-Incident: Institute a weekly review of all UBN! tasks - https://phabricator.wikimedia.org/T141130#2531402 (Aklapper) >>! In T141130#2515799, @greg wrote: > Should we re-instate a recurring weekly meeting for this? Something public (which...
[13:52:13] (PS1) Reedy: Make EducationProgram depend on WikiEditor for tests [integration/config] - https://gerrit.wikimedia.org/r/303419
[13:54:04] (CR) Reedy: "The extension has a soft dependancy on it, but it's breaking https://gerrit.wikimedia.org/r/#/c/303384 so easier to just define it" [integration/config] - https://gerrit.wikimedia.org/r/303419 (owner: Reedy)
[13:58:49] Beta-Cluster-Infrastructure, Release-Engineering-Team, Beta-Cluster-reproducible: Beta-cluster is down, internal error - https://phabricator.wikimedia.org/T142334#2531404 (Luke081515)
[13:58:59] Beta-Cluster-Infrastructure, Release-Engineering-Team, Beta-Cluster-reproducible: Beta-cluster is down, internal error - https://phabricator.wikimedia.org/T142334#2531416 (Luke081515) p:Triage>Unbreak!
[14:04:54] Release-Engineering-Team, Beta-Cluster-reproducible: Beta-cluster is down, internal error - https://phabricator.wikimedia.org/T142334#2531435 (Krenair)
[14:07:11] Luke|away: why is the beta cluster being down UBN?
[14:07:41] valhallasw`cloud: because it seems like it is an mediawiki error, which would brake work of other people working with the maste rtoo?
[14:07:44] *master too
[14:08:01] [V6c9AwpEEaoAAEC3YlQAAAAA] / MWException from line 176 of /srv/mediawiki/php-master/includes/Hooks.php: Invalid callback MWOAuthAPISetup::onTestCanonicalRedirect in hooks for TestCanonicalRedirect
[14:08:04] I guess I broke something
[14:09:01] Yeah
[14:09:26] Release-Engineering-Team, MediaWiki-extensions-OAuth, Beta-Cluster-reproducible: Beta-cluster is down, internal error - https://phabricator.wikimedia.org/T142334#2531452 (Reedy)
[14:09:29] valhallasw`cloud, It's UBN inside the Beta-Cluster- projects
[14:09:40] uploaded a fix: https://gerrit.wikimedia.org/r/303421
[14:09:55] Luke|away, Krenair, thanks for the clarification, that makes sense.
[14:10:03] Fuck namespacing
[14:10:11] cheers Krenair
[14:10:23] I also thought I was reading -operations which is clearly not the case
[14:10:33] * valhallasw`cloud grabs coffee
[14:10:43] Release-Engineering-Team, MediaWiki-extensions-OAuth, Beta-Cluster-reproducible: Invalid callback MWOAuthAPISetup::onTestCanonicalRedirect in hooks for TestCanonicalRedirect - https://phabricator.wikimedia.org/T142334#2531404 (Reedy)
[14:11:15] I guess it means it's well tested by the unit tests
[14:11:17] * Reedy coughs
[14:11:26] Fix is merged
[14:11:34] So should be back up when next scap runs
[14:11:41] beta works again :)
[14:11:50] I will close the task now
[14:11:53] yeah I made the fix on beta then uploaded it
[14:12:20] Release-Engineering-Team, MediaWiki-extensions-OAuth, Beta-Cluster-reproducible: Invalid callback MWOAuthAPISetup::onTestCanonicalRedirect in hooks for TestCanonicalRedirect - https://phabricator.wikimedia.org/T142334#2531455 (Luke081515) Open>Resolved a:Krenair Works again, thx :)
[14:13:14] (Draft2) Paladox: Testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303422
[14:15:11] (Draft1) Paladox: Testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303422
[14:18:33] (PS3) Paladox: Testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303422
[14:31:54] (Abandoned) Paladox: Testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303422 (owner: Paladox)
[14:32:19] (Draft1) Paladox: Testing gerrit reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303424
[14:33:36] (PS2) Paladox: Testing gerrit reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303424
[14:35:02] (Abandoned) Paladox: Testing gerrit reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303424 (owner: Paladox)
[14:37:17] (Draft2) Paladox: testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303429
[14:57:32] (Abandoned) Paladox: testing reviewer bot [integration/config] - https://gerrit.wikimedia.org/r/303429 (owner: Paladox)
[14:58:30] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:59:15] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:32:01] Beta-Cluster-Infrastructure, Commons, Multimedia: Setup deployment-imagescaler host(s) in Beta Cluster - https://phabricator.wikimedia.org/T142289#2531512 (AlexMonk-WMF)
[15:32:04] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2531513 (AlexMonk-WMF)
[15:32:07] Beta-Cluster-Infrastructure, Tracking: Consolidate or remove Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) - https://phabricator.wikimedia.org/T142288#2531511 (AlexMonk-WMF)
[15:39:10] Beta-Cluster-Infrastructure, Citoid, VisualEditor: Move citoid to deployment-sca* hosts in Beta Cluster - https://phabricator.wikimedia.org/T142150#2531527 (AlexMonk-WMF) sqlite> select backend.url from backend, route where backend.route_id = route.id and route.domain = 'citoid-beta.wmflabs.org.'; ht...
[15:46:09] Beta-Cluster-Infrastructure, Citoid, VisualEditor: Move citoid to deployment-sca* hosts in Beta Cluster - https://phabricator.wikimedia.org/T142150#2531535 (AlexMonk-WMF) Oh, is this deployment-zotero01?
[15:57:56] PROBLEM - Puppet run on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:19:48] Beta-Cluster-Infrastructure, Tracking: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) - https://phabricator.wikimedia.org/T142288#2531549 (AlexMonk-WMF)
[16:29:34] Beta-Cluster-Infrastructure, Tracking: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) - https://phabricator.wikimedia.org/T142288#2531552 (AlexMonk-WMF) I did some digging through that page as well as instance lists, config files, `htop` and `d...
[16:38:22] PROBLEM - Host andrew-test-deployment is DOWN: CRITICAL - Host Unreachable (10.68.22.32)
[16:39:59] Beta-Cluster-Infrastructure, Commons, Multimedia: Setup deployment-imagescaler host(s) in Beta Cluster - https://phabricator.wikimedia.org/T142289#2531558 (AlexMonk-WMF) So I just noticed we have deployment-imagescaler01, but it has puppet class `role::thumbor::mediawiki` instead of `role::mediawiki:...
[17:14:37] Beta-Cluster-Infrastructure, Tracking: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) - https://phabricator.wikimedia.org/T142288#2531574 (greg) mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?
[17:45:12] PROBLEM - Puppet run on deployment-db01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:37:03] (CR) Paladox: [C: 1] Make EducationProgram depend on WikiEditor for tests [integration/config] - https://gerrit.wikimedia.org/r/303419 (owner: Reedy)
[18:57:31] Beta-Cluster-Infrastructure, Commons, MediaWiki-Authentication-and-authorization, MediaWiki-extensions-CentralAuth, Reading-Web-Backlog: Unable to log in on https://commons.m.wikimedia.beta.wmflabs.org/wiki/Special:UserLogin - https://phabricator.wikimedia.org/T142015#2531610 (Anomie) When I...
[21:00:42] scap: Improve scap canary check messages - https://phabricator.wikimedia.org/T142342#2531696 (Legoktm)
[21:13:03] (CR) Legoktm: [C: -1] "Per Erik's comment. Everything else looks fine and is ready to go." [tools/codesniffer] - https://gerrit.wikimedia.org/r/296395 (owner: Lethexie)
[21:14:08] (CR) Legoktm: [C: 2] "@Erik: I'll file that as a follow-up bug." [tools/codesniffer] - https://gerrit.wikimedia.org/r/297355 (owner: Lethexie)
[21:14:48] (Merged) jenkins-bot: Add the SpaceBeforeClassBraceSniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/297355 (owner: Lethexie)
[21:16:04] MediaWiki-Codesniffer: SpaceBeforeClassBraceSniff should also handle "class Foo extends Bar{" - https://phabricator.wikimedia.org/T142343#2531731 (Legoktm)
[21:18:30] MediaWiki-Codesniffer: False call-time pass-by-reference warning when an array literal with a reference is passed - https://phabricator.wikimedia.org/T137014#2355182 (Legoktm) Sorry, missed this getting filed. We'll need to file this upstream.
[21:31:47] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:34:38] PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:35:44] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:39:42] (PS2) Legoktm: Add detection for calling global functions in target classes. [tools/codesniffer] - https://gerrit.wikimedia.org/r/301335 (owner: Lethexie)
[21:40:52] (CR) Legoktm: Add detection for calling global functions in target classes. (1 comment) [tools/codesniffer] - https://gerrit.wikimedia.org/r/301335 (owner: Lethexie)
[21:43:37] (CR) jenkins-bot: [V: -1] Add detection for calling global functions in target classes. [tools/codesniffer] - https://gerrit.wikimedia.org/r/301335 (owner: Lethexie)
[21:56:24] Beta-Cluster-Infrastructure, Tracking: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) - https://phabricator.wikimedia.org/T142288#2531757 (bd808) > Replace -logstash2 (@bd808) with a large instance? XLARGE! The xlarge was used for disk space mo...