[00:04:58] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (Legoktm) It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about the...
[00:05:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:03:07] !log Fix Jenkins' collapsible sections for "npm-install" and "mw-install" under quibble to match current patterns
[02:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[02:03:24] !log Fix Jenkins' section pattern for 'setup-docker-quibble-castor-load'
[02:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[02:03:55] Example at https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/7578/consoleFull
[02:13:00] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 11397 bytes in 0.108 second response time
[02:13:32] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 11398 bytes in 2.744 second response time
[02:13:45] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0]
[02:14:25] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 12004 bytes in 0.162 second response time
[02:14:37] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 11978 bytes in 0.135 second response time
[02:19:28] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 (Krinkle)
[02:19:34] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 (Krinkle)
[02:33:44] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[03:05:00] Project beta-update-databases-eqiad build #28953: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28953/
[04:05:00] Project beta-update-databases-eqiad build #28954: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28954/
[04:21:00] Project beta-update-databases-eqiad build #28955: FAILURE in 59 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28955/
[04:23:24] MigrateActors::migrateToTemp\nError: 1062 Duplicate entry '333417-662' for key 'PRIMARY'
[04:38:06] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (MZMcBride) >>! In T206694#4657075, @Legoktm wrote: > It's unclear to me why these problems have become so pressing that they merit shutting off l1...
[04:57:22] (PS1) Legoktm: Keep test logs in php-compile-* jobs [integration/config] - https://gerrit.wikimedia.org/r/465931
[04:57:45] (CR) Legoktm: [C: 2] Keep test logs in php-compile-* jobs [integration/config] - https://gerrit.wikimedia.org/r/465931 (owner: Legoktm)
[05:03:41] (Merged) jenkins-bot: Keep test logs in php-compile-* jobs [integration/config] - https://gerrit.wikimedia.org/r/465931 (owner: Legoktm)
[05:21:51] Release-Engineering-Team (Watching / External), DBA, Operations, cloud-services-team, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (Marostegui)
[05:32:02] Release-Engineering-Team, Wikidata, Jenkins: Wikibase appears to be failing checkstyle on all builds now - https://phabricator.wikimedia.org/T206738 (Marostegui)
[05:37:51] Release-Engineering-Team, Wikidata, Jenkins: Wikibase appears to be failing phan on all builds now - https://phabricator.wikimedia.org/T206738 (Legoktm)
[06:05:00] Project beta-update-databases-eqiad build #28956: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28956/
[07:05:00] Project beta-update-databases-eqiad build #28957: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28957/
[07:13:38] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (Strainu) There is a mention in the description of an "instigating security incident", which is probably the cause for this halt. @greg, is any dat...
[07:18:35] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (Nikerabbit) > This update process is currently disabled and will remain disabled until there is a clear plan and resourcing for improving the arch...
[08:05:00] Project beta-update-databases-eqiad build #28958: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28958/
[08:32:51] Release-Engineering-Team (Kanban), Code-Health-Metrics: Review code metric tools - https://phabricator.wikimedia.org/T205133 (zeljkofilipin) a:zeljkofilipin
[08:33:06] Release-Engineering-Team (Kanban), Code-Health-Metrics, User-zeljkofilipin: Review code metric tools - https://phabricator.wikimedia.org/T205133 (zeljkofilipin)
[08:34:05] Release-Engineering-Team (Kanban), Code-Health-Metrics, User-zeljkofilipin: Review code metric tools - https://phabricator.wikimedia.org/T205133 (zeljkofilipin) Assigning to myself while I dig up the data for core/extensions we already have from code climate.
[08:57:18] Continuous-Integration-Config, MediaWiki Language Extension Bundle: Run php5 tests for MLEB extensions - https://phabricator.wikimedia.org/T197561 (Nikerabbit) p:Triage>Low I am expecting this won't happen before PHP5 support is phased out, but leaving open for now.
[09:05:00] Project beta-update-databases-eqiad build #28959: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28959/
[10:05:01] Project beta-update-databases-eqiad build #28960: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28960/
[11:05:00] Project beta-update-databases-eqiad build #28961: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28961/
[12:12:26] Project beta-update-databases-eqiad build #28962: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28962/
[12:57:51] :/
[12:58:04] getting CI builds timing out
[12:58:04] for Wikibase
[13:02:15] Release-Engineering-Team, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata: Undeploy WikibaseQuality extension from the WMF - https://phabricator.wikimedia.org/T205064 (Addshore) stalled>Open
[13:05:00] Project beta-update-databases-eqiad build #28963: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28963/
[13:18:59] Release-Engineering-Team, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, Wikidata-Campsite: Undeploy WikibaseQuality extension from the WMF - https://phabricator.wikimedia.org/T205064 (Addshore)
[13:21:17] Release-Engineering-Team, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, Wikidata-Campsite: Undeploy WikibaseQuality extension from the WMF - https://phabricator.wikimedia.org/T205064 (Addshore)
[13:24:25] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 36361 bytes in 0.909 second response time
[13:24:37] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47852 bytes in 0.963 second response time
[13:26:33] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 (Anomie)
[13:27:57] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 47261 bytes in 1.065 second response time
[13:28:35] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 47271 bytes in 3.824 second response time
[14:05:00] Project beta-update-databases-eqiad build #28964: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28964/
[14:30:23] reminder: services switch to eqiad starting now
[14:30:39] please take a break from puppet merges etc for the next little while
[14:50:54] because the next window is coming up in about 10 minutes, I'd ask folks to just wait on any deploys/puppet merges until after everything is wrapped up
[14:51:00] rather than sneaking them in now
[15:01:26] Project mediawiki-core-code-coverage-docker build #3815: FAILURE in 1 min 26 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/3815/
[15:05:00] Project beta-update-databases-eqiad build #28965: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28965/
[15:10:48] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (greg) >>! In T206694#4657614, @Nikerabbit wrote: >> This update process is currently disabled and will remain disabled until there is a clear plan...
[15:14:53] Release-Engineering-Team (Watching / External), Operations, Release Pipeline (Blubber): Update Debian package of Blubber (0.6.0-1) - https://phabricator.wikimedia.org/T206766 (thcipriani)
[15:19:17] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (Nikerabbit) I am planning to write down my thoughts and proposals and post them early next week. I don't know the current WMF setup very well, so...
[15:27:15] Continuous-Integration-Config: Capture failure logs from php-compile jobs (at least for luasandbox) - https://phabricator.wikimedia.org/T205453 (Legoktm) Open>Resolved a:Legoktm https://gerrit.wikimedia.org/r/465931
[15:28:49] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (greg) >>! In T206694#4657613, @Strainu wrote: > @greg, is any data about it public yet? Could you perhaps add a link to whatever you can share in...
[15:58:26] Phabricator-Sprint-Extension: Unhandled Exception ("PhutilMissingSymbolException") - https://phabricator.wikimedia.org/T206489 (Christopher) FYI: This is a known issue reported and discussed here https://phabricator.wikimedia.org/T90906#4525254
[15:59:48] Phabricator-Sprint-Extension: Wikimedia Phabricator sprint extension uses removed ManiphestEditStatusCapability - https://phabricator.wikimedia.org/T206538 (Christopher) FYI: This is a known issue reported and discussed here https://phabricator.wikimedia.org/T90906#4525254 also related to T206489
[16:05:00] Project beta-update-databases-eqiad build #28966: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28966/
[16:05:14] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[16:16:26] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:02:22] Continuous-Integration-Infrastructure, Release-Engineering-Team, Page-Previews, Readers-Web-Backlog (Tracking): selenium-daily-beta-Popups CI job failing - https://phabricator.wikimedia.org/T206640 (phuedx) There were 5 builds triggered on 2018/10/10, of which 2 failed. The build has passed consi...
[17:05:00] Project beta-update-databases-eqiad build #28967: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28967/
[17:32:14] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[17:34:26] (CR) Dduvall: [C: 1] "There are 320 affected jobs. I'll deploy wmf-quibble-* and see how that goes before deploying all of them." [integration/config] - https://gerrit.wikimedia.org/r/463852 (owner: Dduvall)
[17:37:25] !log deploying https://gerrit.wikimedia.org/r/c/integration/config/+/463852/ to 8 wmf-quibble-* jobs
[17:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:44:34] Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Quibble, Patch-For-Review: Quibble docker instance running on CI instance for 6 hours - https://phabricator.wikimedia.org/T198517 (dduvall) More long-running containers seen today: {P7668}
[17:47:12] !log killing long-running containers (> 3 hours) on docker integration nodes (T198517)
[17:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:47:15] T198517: Quibble docker instance running on CI instance for 6 hours - https://phabricator.wikimedia.org/T198517
[17:57:57] Release-Engineering-Team (Kanban), User-zeljkofilipin: Onboarding liw - https://phabricator.wikimedia.org/T206466 (greg)
[18:05:00] Project beta-update-databases-eqiad build #28968: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28968/
[18:09:58] (CR) Dduvall: [C: 1] "Verified to be working correctly. Deploying for all 320 affected jobs:" [integration/config] - https://gerrit.wikimedia.org/r/463852 (owner: Dduvall)
[18:14:47] (CR) Dduvall: [C: 2] Revert "Use a volume under workspace for /tmp in docker containers" [integration/config] - https://gerrit.wikimedia.org/r/463852 (owner: Dduvall)
[18:15:10] Release-Engineering-Team (Kanban), User-zeljkofilipin: Onboarding liw - https://phabricator.wikimedia.org/T206466 (chasemp)
[18:16:09] grr, that databases run
[18:16:26] !log deployed https://gerrit.wikimedia.org/r/c/integration/config/+/463852/ to all 320 affected jobs
[18:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:17:38] !log killed 42 long-running (> 3 hours) containers on docker integration nodes (T198517)
[18:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:17:41] T198517: Quibble docker instance running on CI instance for 6 hours - https://phabricator.wikimedia.org/T198517
[18:17:48] (Merged) jenkins-bot: Revert "Use a volume under workspace for /tmp in docker containers" [integration/config] - https://gerrit.wikimedia.org/r/463852 (owner: Dduvall)
[18:18:08] !log ores:1a5630f is going beta
[18:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:20:03] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (greg) >>! In T206694#4657075, @Legoktm wrote: > It's unclear to me why these problems have become so pressing that they merit shutting off l10nupd...
[18:25:43] Phabricator: Create Core Platform Team Herald rules - https://phabricator.wikimedia.org/T205973 (Aklapper) a:Aklapper>None
[18:53:11] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, QuickSurveys, Readers-Web-Backlog (Tracking): selenium-QuickSurveys failing - https://phabricator.wikimedia.org/T206776 (Jdlrobson) [W76xwgpEEZ8AACAA9jYAAAAJ] Exception caught: Cannot read both schemas (internal_api_error_Inv...
[18:53:45] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, QuickSurveys, Readers-Web-Backlog (Tracking): Stability issues with beta cluster caused selenium-QuickSurveys to fail - https://phabricator.wikimedia.org/T206776 (Jdlrobson)
[19:16:38] Project beta-update-databases-eqiad build #28969: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28969/
[19:17:59] that db update job is real annoying. it frequently enough times out
[19:19:18] is it actually making any changes atm?
[19:20:08] maintenance-disconnect-full-disks build 10449 integration-slave-jessie-1003 (/srv: 95%): OFFLINE due to disk space
[19:25:08] maintenance-disconnect-full-disks build 10450 integration-slave-jessie-1003: OFFLINE due to disk space
[19:28:14] !log integration-slave-jessie-1003:sudo rm -rf /srv/jenkins-workspace/workspace/*
[19:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:28:44] !log integration-slave-jessie-1003 back online
[19:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:38:24] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:44:33] greg-g: Reedy: Looks like the db update is usually a no-op that runs quite quickly. Iterating over all beta wiki tables in < 2min. But seems like it got stuck somehow. The end of the output at "Abort" shows a 43min gap between the last line and the abort. So it wasn't doing anything. Not sure what's going on there.
[19:48:40] probably just needs a manual run?
[19:53:15] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[19:59:26] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:02:41] greg-g: Yeah, if there's actual migration going on that takes too long, may need to be done manually, but I'm not entirely sure that's the case. Looks like maybe it's just stuck for no clear reason. Or a really slow migration with no progress output, but we usually provide progress in standard out.
[20:05:00] Project beta-update-databases-eqiad build #28970: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28970/
[20:09:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:12:16] ok task series are fixed
[20:13:22] https://phabricator.wikimedia.org/T206663 properly links to prev/next tasks. I also implemented proper 'create subtask' link that doesn't create a 'release' subtask. Thanks to krinkle for the suggestion
[20:14:33] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 47267 bytes in 1.247 second response time
[20:19:46] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[20:24:17] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[20:25:33] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (Legoktm) >>! In T206694#4659215, @greg wrote: >>>! In T206694#4657075, @Legoktm wrote: >> It's unclear to me why these problems have become so pre...
[21:00:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:04:32] Deployments, Language-Team, I18n: Determine desired architecture to update localization strings for Wikimedia - https://phabricator.wikimedia.org/T206694 (greg) >>! In T206694#4659757, @Legoktm wrote: >>>! In T206694#4659215, @greg wrote: >>>>! In T206694#4657075, @Legoktm wrote: >>> It's unclear to...
[21:05:00] Project beta-update-databases-eqiad build #28971: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28971/
[21:05:34] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 47267 bytes in 1.639 second response time
[21:30:27] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:31:40] What's going on with mwmaint1001? It's refusing to let me SSH in
[21:32:22] RoanKattouw: I switched to 1002.
[21:32:35] Aha didn't know that existed
[21:33:59] 1001 was killed yesterday I think
[21:59:20] (CR) Thcipriani: "One problem inline, other than that looks good!" (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517) (owner: Dduvall)
[22:03:10] (PS4) Dduvall: Provide common docker-run builder [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517)
[22:03:59] (CR) Dduvall: Provide common docker-run builder (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517) (owner: Dduvall)
[22:05:00] Project beta-update-databases-eqiad build #28972: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28972/
[22:05:29] (CR) Thcipriani: [C: 1] "lgtm!" [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517) (owner: Dduvall)
[22:10:32] (CR) Thcipriani: [C: 1] "Couple of comments inline, overall lgtm" (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/465671 (https://phabricator.wikimedia.org/T198517) (owner: Dduvall)
[22:21:25] Continuous-Integration-Config, Quibble: Move extension "npm test" out of quibble jobs - https://phabricator.wikimedia.org/T206816 (Legoktm)
[22:25:15] RECOVERY - SSH on integration-slave-docker-1021 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0)
[22:36:26] PROBLEM - SSH on integration-slave-docker-1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:48:55] legoktm: Do you know whether or not jobs now have sql on tmpfs? Seems like things are pretty slow again, regularly 20min in swat merges.
[22:49:35] I think so? marxarelli ^
[22:49:37] Before the quibble we were around 6-9 minutes, which regressed to about 13min due to combination of things serially instead of parallelly, and then the logs of tmpfs I think made it 16min.
[22:49:53] the loss of*
[22:50:23] i believe quibble now puts sql databases on tmpfs, yes
[22:50:46] but i took a brief look at the stats before/after and it didn't seem to improve things much
[22:50:53] https://tools.wmflabs.org/coverage/charts/mediawiki.png there's also been a significant increase in tests in the past 2 months
[22:50:58] (it was a very cursory look though)
[22:51:10] Hm.. may need to confirm that it applies to the different job variants.
[22:51:12] we're still seeing slowdown due to other problems at the moment
[22:51:28] wmf-quibble-core-vendor-mysql-hhvm-docker takes the longest (18-22min)
[22:51:32] The other variants are all 9-10 min
[22:52:02] that's probably because we removed some of the duplicate steps from the other jobs, not because it's missing tmpfs datadir
[22:52:03] I've suggested on IRC that we should split up wmf- jobs into multiple (wmf1, wmf2) due to the size, but never got around to writing a proper ticket for it
[22:52:03] but maybe
[22:52:18] hashar has a few patches to parallelize quibble
[22:52:38] and i'm working on patches to make sure docker containers are properly reaped
[22:52:40] https://phabricator.wikimedia.org/T198517
[22:52:41] If we dumped HHVM faster we'd save a lot of time.
[22:52:53] today i saw 42 zombie containers running in ci :(
[22:53:00] :||||
[22:53:05] damn
[22:53:08] legoktm: right, that's a fair point. But given the tests do run in 9min on some of the variants, I assume more tests isn't what makes it slower
[22:53:58] James_F: hhvm/php7 run concurrently, dropping it would save and reduce executor wait times at peak, but won't lower the median/minimum wait time of ~ 18min
[22:54:24] hhvm is slower than php7 at running tests, it'd probably save a minute or two overall
[22:54:25] marxarelli: cool, I'll take a look at some of those :)
[22:54:28] Krinkle: max(20 min, 10 min) > max(10 min)
[22:54:40] we could switch the test queue to run php70 and leave hhvm only on gate
[22:54:58] James_F: Aye, maybe, but at least a significant portion of the tests we run on the hhvm one are things we only need 1 variant of, so those would move to a php7 job.
[22:54:58] OK, not 20 and 10, but PHP7 seems to run our test suite ~40% faster
[22:55:10] per the above, that's not hhvm vs php7 alone
[22:55:21] Hmm.
[22:55:25] Also see https://phabricator.wikimedia.org/T206816 - we should remove npm test from quibble
[22:55:42] I don't know which ones we de-duped but e.g. selenium doesn't need to run 5x with php7/vendor/non-vendor/hhvm etc.
[22:56:05] If we distribute the one-variant chunks among them, that would also speed up.
[22:56:11] instead of having all the solo things in the same job.
[22:56:26] why shouldn't selenium run across all flavors?
[22:57:26] Because the frontend code doesn't differ. the backend interaction is already covered by unit/integration tests within php.
[22:57:35] I guess maybe it gives us some extra coverage where we are missing tests.
[22:57:44] I don't think the backend coverage is good enough yet
[22:57:49] fair enough
[22:57:57] but npm-test and qunit can definitely go
[22:58:03] not needed on every variant.
[22:58:45] how backend-agnostic is qunit? I know it runs a bit of PHP code right?
[22:59:28] It runs a SpecialPage with OutputPage::disable() to produce basically 1 script tag, which, if failing, will fail the build
[22:59:40] the rest is RL code to statically serve the scripts, which is pretty well-covered.
[23:03:41] Yeah, dumping the npm, selenium, and qunit tests into one variant (php70 or whatever) would make sense.
[23:04:44] Or maybe a "quibble-front-end" job and a set of "quibble-back-end-{variant}" jobs?
[23:05:00] Project beta-update-databases-eqiad build #28973: ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/28973/
[23:08:45] !log deploying https://gerrit.wikimedia.org/r/c/integration/config/+/465668 for 1 job mediawiki-quibble-vendor-mysql-php70-docker
[23:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:09:52] Regarding the databases-eqiad failure. Both times it aborts 43min after the last activity, which is "ruwiki: have ref_id field in flow_wiki_r"
[23:10:06] so either that step is hanging, or is taking 43min without progress/output.
[23:10:18] for ruwiki-beta
[23:14:15] MediaWiki-Releasing, Growth-Team, Notifications, MW-1.32-release: Bundle Echo extension with MW 1.32 - https://phabricator.wikimedia.org/T191738 (Tgr)
[23:14:30] !log rolling back deployment of https://gerrit.wikimedia.org/r/c/integration/config/+/465668 for 1 job
[23:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:14:37] MediaWiki-Releasing, Growth-Team, Notifications, MW-1.32-release: Bundle Echo extension with MW 1.32 - https://phabricator.wikimedia.org/T191738 (Tgr) License is MIT which is compatible with GPL 2+.
[23:23:52] (PS5) Dduvall: Provide common docker-run builder [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517)
[23:24:08] (PS1) Legoktm: dockerfiles/.tox isn't a docker image directory [integration/config] - https://gerrit.wikimedia.org/r/466810
[23:24:10] (PS1) Legoktm: Make MediaWiki PHP 7.1 tests voting [integration/config] - https://gerrit.wikimedia.org/r/466811 (https://phabricator.wikimedia.org/T204884)
[23:24:54] (CR) Legoktm: [C: 2] dockerfiles/.tox isn't a docker image directory [integration/config] - https://gerrit.wikimedia.org/r/466810 (owner: Legoktm)
[23:25:08] (CR) Jforrester: "Yay." [integration/config] - https://gerrit.wikimedia.org/r/466811 (https://phabricator.wikimedia.org/T204884) (owner: Legoktm)
[23:28:26] (Merged) jenkins-bot: dockerfiles/.tox isn't a docker image directory [integration/config] - https://gerrit.wikimedia.org/r/466810 (owner: Legoktm)
[23:32:12] (PS1) Legoktm: GoogleAuthenticator: disable browser tests (noselenium) [integration/config] - https://gerrit.wikimedia.org/r/466812
[23:34:39] !log redeploying https://gerrit.wikimedia.org/r/c/integration/config/+/465668 for 1 job mediawiki-quibble-vendor-mysql-php70-docker
[23:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:45:45] !log killed 6 long-running containers (> 3 hours) on docker integration nodes (T198517)
[23:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:45:48] T198517: Quibble docker instance running on CI instance for 6 hours - https://phabricator.wikimedia.org/T198517
[23:46:16] I'll deploy the 7.1 stuff later tonight when I have time to roll it out
[23:46:34] sounds good
[23:50:19] !log rolling back deployment of https://gerrit.wikimedia.org/r/c/integration/config/+/465668 (full deployment delayed until around 1600 tomorrow)
[23:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:52:48] (CR) Dduvall: [C: 1] "Deployed for just 1 job and it seems to be working." [integration/config] - https://gerrit.wikimedia.org/r/465668 (https://phabricator.wikimedia.org/T198517) (owner: Dduvall)
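The repeated "killed N long-running containers (> 3 hours)" entries for T198517 amount to an age filter over the running containers. The log does not say how the cleanup was actually done, so the following is only a minimal sketch: the function name `stale_containers`, the sample container IDs, and the fixed `now` timestamp are all hypothetical, and it assumes a list of (container ID, start time) pairs has already been obtained by whatever means.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the log entries ("> 3 hours").
MAX_AGE = timedelta(hours=3)


def stale_containers(containers, now, max_age=MAX_AGE):
    """Return the IDs of containers that have been running longer than max_age.

    `containers` is an iterable of (container_id, started_at) pairs, where
    started_at is a timezone-aware datetime. How that list is gathered is
    outside this sketch.
    """
    return [cid for cid, started_at in containers if now - started_at > max_age]


# Hypothetical sample data, not taken from the log.
now = datetime(2018, 1, 1, 18, 0, tzinfo=timezone.utc)
sample = [
    ("aaa111", now - timedelta(hours=6)),             # well past the limit
    ("bbb222", now - timedelta(minutes=20)),          # a normal, recent job
    ("ccc333", now - timedelta(hours=3, minutes=1)),  # just past the limit
]

print(stale_containers(sample, now))
```

Keeping the selection separate from the kill step makes it easy to print and review the candidate list before removing anything, which fits the cautious one-job-first rollouts seen elsewhere in this log.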