[00:10:38] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661034 (Krenair)
[00:22:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[00:36:50] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:41:14] Deployment-Systems, Scap3, Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2661081 (thcipriani) Resolved→Open @fgiunchedi could I get you to upload scap_3.3.0-1 to carbon? The version contains fixes for {T134156} and {T145373}. (cc: @mobrovac)
[00:47:34] I got/caused deployment-mediawiki04's puppet problem; it should be fixed.
[00:54:58] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:01:53] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:02:37] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:30:25] MediaWiki-Codesniffer: Undefined index: parenthesis_closer in SpaceBeforeControlStructureBraceSniff.php - https://phabricator.wikimedia.org/T146439#2661144 (MaxSem)
[01:30:32] legoktm, ^
[01:34:17] thank you
[01:34:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:55:57] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[02:31:03] Continuous-Integration-Config: All code style CI tests should be run even if some fail - https://phabricator.wikimedia.org/T146445#2661267 (Tgr)
[02:35:59] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:37:18] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[04:12:18] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:15:24] PROBLEM - Puppet staleness on deployment-db1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:19:05] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #150: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/150/
[04:27:03] PROBLEM - Puppet staleness on deployment-db2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:44:40] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[05:24:41] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[05:45:43] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:03:54] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[06:43:54] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:50:45] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[07:17:00] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2661458 (jcrespo) There was a refactoring of mariadb's puppet code...
[07:18:42] MediaWiki-Codesniffer: Undefined index: parenthesis_closer in SpaceBeforeControlStructureBraceSniff.php - https://phabricator.wikimedia.org/T146439#2661462 (Paladox) Caused by https://phabricator.wikimedia.org/rMCSN42c7dca89f1ac314795802f8fd0bcc1ef62bba1b
[08:05:01] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661511 (Paladox)
[08:12:00] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661526 (Paladox) Oh, looks like the repo was already created at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/libs/Assert
[08:12:38] Gerrit, GitHub-Mirrors, Librarization, Project-Admins: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661527 (Legoktm) >>! In T114330#2661497, @thiemowmde wrote: > Yesterday I wasted a full hour trying to rebase open pull requests on htt...
[08:31:04] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2661542 (Aklapper)
[09:04:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[09:44:51] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:49:31] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661797 (hashar)
[09:50:29] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2606121 (hashar) As an after-effect, ORES jobs failed following the deployment: T146461. Which is entirely related to us having no monitoring/alarming about jobs, that is...
[09:51:00] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661805 (Paladox)
[09:51:21] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2556665 (Paladox) Gerrit 2.12.5 has been released, so we can update to that release now.
[09:52:00] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2661810 (Paladox)
[09:52:03] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661809 (Paladox)
[09:53:26] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2556665 (Paladox)
[10:01:06] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2661837 (Paladox) This is fixed in gerrit 2.12.5, which has been released now.
[10:10:17] Gerrit: Update gerrit to 2.12.5 - https://phabricator.wikimedia.org/T143089#2661851 (hashar)
[10:10:32] Gerrit: Update gerrit to 2.13 - https://phabricator.wikimedia.org/T146350#2658303 (hashar)
[10:11:05] paladox: I have linked the "update gerrit to X" tasks to each other :]
[10:42:23] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:06:54] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2661955 (Ladsgroup)
[11:22:25] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:24:42] Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662046 (MoritzMuehlenhoff) mira is now running jessie. Please give it some more testing; for migrating tin, we could temporarily make mira the...
[12:35:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[12:53:15] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662100 (thiemowmde) To clarify: I can clone https://github.com/wikimedia/Assert and checkout all the branches from the open pull requests just fine. The on...
[13:15:53] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:20:42] Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662117 (Krenair) >>! In T144578#2662046, @MoritzMuehlenhoff wrote: > mira is now running jessie. Please give it some more testing The next de...
[13:31:36] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662149 (Krenair) >>! In T114330#2661527, @Legoktm wrote: >> * Who does have admin access to this GitHub repo? > > I guess anyone in the wikimedia org? I'm...
[13:34:01] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662165 (Krenair) >>! In T114330#2662100, @thiemowmde wrote: > add people (preferably @daniel and myself, since we are among the main contributors) as admin...
[13:41:16] (CR) Hashar: [C: 2] Change tmpfs to /srv instead of /mnt [integration/jenkins] - https://gerrit.wikimedia.org/r/312330 (https://phabricator.wikimedia.org/T146381) (owner: Hashar)
[13:41:46] (Merged) jenkins-bot: Change tmpfs to /srv instead of /mnt [integration/jenkins] - https://gerrit.wikimedia.org/r/312330 (https://phabricator.wikimedia.org/T146381) (owner: Hashar)
[13:41:56] !log Switching tmpfs from /mnt to /srv https://gerrit.wikimedia.org/r/#/c/312330/ and running fab deploy_slave_scripts
[13:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:46:39] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #154: FAILURE in 2 min 38 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/154/
[13:56:04] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662261 (thiemowmde) @Krenair: https://github.com/thiemowmde
[13:57:20] Gerrit, GitHub-Mirrors, Librarization: Move wikimedia/assert to gerrit or wikimedia's github account - https://phabricator.wikimedia.org/T114330#2662262 (Krenair) Done
[14:04:51] !log remove the /mnt based tmpfs for T146381 / https://gerrit.wikimedia.org/r/#/c/312518/ via: salt -v '*' cmd.run 'umount /mnt/home/jenkins-deploy/tmpfs'
[14:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[14:15:22] Gerrit, Upstream: Gerrit's new side-by-side diff screen sometimes cuts off the last few characters of a line - https://phabricator.wikimedia.org/T144565#2662345 (Aklapper) p: Triage→Low
[14:56:53] Gerrit: Update gerrit to 2.13.1 - https://phabricator.wikimedia.org/T146350#2662462 (Paladox)
[15:02:15] !log rebooting integration-slave-jessie-1001
[15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:07:06] Gerrit: Update gerrit to 2.13.1 - https://phabricator.wikimedia.org/T146350#2662494 (Paladox)
[15:16:45] Deployment-Systems, Scap3, Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2662514 (fgiunchedi) Open→Resolved scap uploaded and deployed, resolving
[15:49:09] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #168: FAILURE in 27 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/168/
[16:50:39] i know it's friday ... but i'd like to deploy a patch to record some timing marks. the wmf.20 deploy increased p95 as seen from php (time spent in elasticsearch is unchanged) from 40ms to >200ms. the patch adds some timing marks to try and get an idea of where in php this time is being spent: https://gerrit.wikimedia.org/r/#/c/312529/1
[16:51:37] for autocomplete, that is. p95 means ~1M req/day are getting that degraded performance. p75 isn't quite so bad, at a regression from 11ms -> 26ms
[16:52:03] ebernhardson: I'll allow it
[16:52:16] :) thanks
[16:54:55] PROBLEM - Puppet run on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[16:55:09] PROBLEM - Puppet run on integration-slave-jessie-1005 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:12:09] Continuous-Integration-Config, RESTBase, Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RB is updated on beta labs - https://phabricator.wikimedia.org/T146488#2662808 (Mholloway)
[17:12:33] Continuous-Integration-Config, RESTBase, Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RESTBase is updated on beta labs - https://phabricator.wikimedia.org/T146488#2662822 (Mholloway)
[17:19:31] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:25:32] scap sync-file is taking an amazingly long time for sync-masters; the rsync on mira is up to 2:27, and real time is much longer than that
[17:26:15] ouch. that's happened when there were ipv6 networking issues before
[17:26:40] Not sure what else would make tin->mira sync so slow
[17:27:07] deployment-mira was just reimaged to jessie today/last night
[17:27:13] ebernhardson: or is this in production?
[17:27:22] greg-g: prod, syncing out my timing patch
[17:27:30] thcipriani: ^ ideas?
[17:28:30] are there 2 procs running too?
[17:28:32] blerg. I wonder if it has to sync everything over to mira since it was just reimaged?
[17:28:40] on mira
[17:29:02] oh right, prod mira was reimaged last night
[17:29:11] ps axf on mira only shows one chain for the scap/rsync
[17:29:14] man, I'm tired
[17:29:24] ok, so it's just syncing the whole thing. will keep waiting :)
[17:30:08] RECOVERY - Puppet run on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:30:41] mediawiki-staging is only 700MB on mira
[17:30:58] except I can't read php-1.28.0-wmf.17
[17:31:20] 21 gigs on tin
[17:32:22] there are probably 10G of that that won't be transferred, i.e. the cdb files.
[17:32:30] this sync-file may take a while :((
[17:33:01] actually it just finished, but with somewhat scary looking output: https://phabricator.wikimedia.org/P4110
[17:34:05] doesn't look to be too important for my sync, but something is up with l10n permissions i guess
[17:34:55] RECOVERY - Puppet run on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:34:59] hrm, looks like l10nupdate has a different uid
[17:35:08] similar/related discussion going on in -operations from mutante
[17:35:08] on mira vs tin, I think.
[17:36:11] user l10nupdate on mira: l10nupdate:x:1001:10002::/home/l10nupdate:/bin/bash
[17:37:03] on tin: l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash
[18:09:36] thcipriani: yeah prod mira got reimaged to Jessie. I acked that to moritz earlier today
[18:09:39] since beta looked fine
[18:11:13] thcipriani: mira.deployment-prep, we should dish it out, it is too small
[18:11:18] err twentyafterfour ^^^
[18:11:29] it should get migrated to deployment-mira or deployment-mira02 with a larger disk
[18:12:35] heh, wait, what? deployment-mira was too small? I thought it was based on some custom image?
[18:13:45] oh, wait, misread: mira.deployment-prep was too small. Got it.
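The uid mismatch spotted at [17:36:11]/[17:37:03] above is the kind of thing worth confirming in one pass from a host that can reach both deploy masters. A minimal shell sketch; the bare hostnames and the assumption of ssh access to both are illustrative, not taken from the log:

    # Print the l10nupdate passwd entry (name:x:uid:gid:gecos:home:shell)
    # from each deploy master, so the uid/gid columns can be compared.
    for host in tin mira; do
        printf '%s: ' "$host"
        ssh "$host" getent passwd l10nupdate
    done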
[18:18:41] ideally we should have deployment-tin / deployment-mira both on Jessie with the custom OS flavor
[18:18:50] something like c8.m8.d60
[18:19:00] the others can be dished
[18:19:08] and some puppet patches need to be adjusted in consequence
[18:20:48] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2641971
[18:24:21] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42)
[18:24:57] ^^ that is the old one, guess legoktm is deleting the old CI puppetmaster :]
[18:26:33] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663057 (dduvall) LGTM. Thanks, @jcrespo!
[18:26:55] Beta-Cluster-Infrastructure, Release-Engineering-Team, DBA, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663062 (dduvall) Open→Resolved
[18:34:11] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.28.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T144644#2663094 (Krinkle)
[18:34:13] Release-Engineering-Team (Deployment-Blockers), Patch-For-Review, Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2663095 (Krinkle)
[19:23:37] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:28:59] Release-Engineering-Team, Developer-Relations, Wikimedia-Blog-Content: blog.wikimedia.org post on Phabricator improvements - https://phabricator.wikimedia.org/T141457#2663223 (Aklapper) So...... I created https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/Recent_improvements_in_the_Phabricator_pro...
[19:41:50] Nodepool does fewer API queries to openstack https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=16&fullscreen&from=1474486854104&to=1474659654105&var-provider=All&var-task=CreateServerTask&var-task=DeleteKeypairTask&var-task=ListFloatingIPsTask&var-task=ListServersTask :D
[19:44:38] twentyafterfour: thcipriani: so for beta the proper instances are deployment-tin02 and deployment-mira, which have the flavor with the large disk
[19:45:07] hashar: I'm working on deployment-mira now
[19:47:13] and looks like puppet is all settled
[19:48:13] bah, and https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/ has some issue
[19:49:38] I don't think it's the job that has some issues
[19:49:51] looks like missing dependencies
[19:50:36] Error: Cannot find module 'ajv'
[19:51:52] yeah
[19:51:58] and the kafka server is never killed
[19:52:13] twentyafterfour: is that job created via JJB?
[19:53:11] apparently not :D
[19:53:21] !log added a 30 minute build timeout to https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/
[19:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:00:43] PROBLEM - Puppet run on integration-slave-jessie-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:00:46] !log rebooting all CI permanent slaves. Making sure nothing is left on /mnt (which is no longer mounted)
[20:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:01:10] hashar: originally yes, but my jjb stuff never got merged
[20:01:27] and I modified the job a bit manually to make it work
[20:01:39] so what's the plan for -db1 and -db2?
[20:03:08] hashar: https://gerrit.wikimedia.org/r/#/c/295396/
[20:03:39] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:05:10] PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:05:44] PROBLEM - Puppet run on integration-slave-jessie-1003 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[20:10:44] RECOVERY - Puppet run on integration-slave-jessie-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:10:44] RECOVERY - Puppet run on integration-slave-jessie-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:11:21] Krenair: Dan will delete them when their time has come
[20:11:29] twentyafterfour: neat :)
[20:12:04] twentyafterfour: looks like I reviewed that one back in June and never looked back :(
[20:14:18] yuvipanda: mind if I make the http port configurable on the aptly class? https://gerrit.wikimedia.org/r/#/c/312562/
[20:15:09] RECOVERY - Puppet run on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:19:29] W: GPG error: http://debian.saltstack.com jessie-saltstack InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B09E40B0F2AE6AB9
[20:19:31] oh man ..
[20:20:05] It seems some CI slaves are unable to access MySQL
[20:20:08] Two jobs failed in a row.
[20:20:20] https://gerrit.wikimedia.org/r/#/c/312467/
[20:20:35] and another two, and more.
[20:20:36] bah :(
[20:21:08] forgot to gracefully depool?
[20:21:11] !log integration: salt -v '*trusty*' cmd.run 'service mysql start'
[20:21:14] did
[20:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:21:22] but mysql does not come back properly on boot on Trusty :(
[20:21:35] due to some crazy hack
[20:21:51] it only comes back when puppet runs
[20:22:21] Krinkle: all good now. Sorry, I should have spawned mysql again
[20:24:59] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure: Build an apt repository on deployment-prep (for testing packages from jenkins) - https://phabricator.wikimedia.org/T146497#2663278 (mmodell)
[20:31:31] twentyafterfour: sure! we have a freeze until Monday tho, as ops are travelling
[20:37:35] RoanKattouw hi, gerrit 2.12.5 was released today :), I've updated https://gerrit-test.wmflabs.org/ to use the new version
[20:37:46] and it includes the fix for allowing line wrap
[20:37:49] in the preferences :)
[20:38:56] paladox: Hmm I don't see the orange thing you're talking about. Looking at http://gerrit-test.wmflabs.org/gerrit/#/c/17/49
[20:39:13] Oh, it won't be on there
[20:39:18] but it will be on gerrit-new
[20:39:19] Oh I see
[20:40:59] yep
[20:44:48] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, Patch-For-Review: Build an apt repository on deployment-prep (for testing packages from jenkins) - https://phabricator.wikimedia.org/T146497#2663333 (yuvipanda) No objections. Poke me again next week and I'll CR? We're in a puppet fr...
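An aside on the [20:19:29] GPG error above: the usual quick remedy is to import the key id that the NO_PUBKEY message names, although the durable fix is to have puppet manage the archive key. A sketch only; fetching from keyserver.ubuntu.com is an assumption, not something done in this log:

    # Import the missing saltstack archive key so apt can verify the
    # jessie-saltstack InRelease signature again (key id from the error).
    sudo apt-key adv --keyserver keyserver.ubuntu.com \
        --recv-keys B09E40B0F2AE6AB9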
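And picking up the MySQL-on-Trusty thread around [20:21:11]: a gentler variant of that one-off salt command would start mysql only where it is actually down. A sketch under the same salt targeting shown in the log; mysqladmin ping is the stock liveness check shipped with the MySQL client tools:

    # Start MySQL only on the Trusty slaves where it is not answering,
    # leaving already-running instances untouched.
    salt -v '*trusty*' cmd.run \
        'mysqladmin ping >/dev/null 2>&1 || service mysql start'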
[20:45:35] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:46:41] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:50:51] hashar, do you know why deployment-tin.eqiad.wmflabs says "do not use this server" (and "Connect to 'deployment.eqiad.wmnet' instead, it will route you to the correct server.", which is certainly wrong)?
[20:51:05] hashar, is there really a new Beta Cluster deployment server, or is it just a Puppet bug?
[20:52:13] matt_flaschen: use deployment-mira
[20:52:17] ah yeah
[20:52:20] the message is wrong
[20:54:12] PROBLEM - Puppet run on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:05:12] bah
[21:11:08] matt_flaschen: we're mid-migration to jessie for the deploy servers in both beta and prod, hence that motd that recommends using deployment.eqiad.wmnet (which will always point to the right place)
[21:12:09] greg-g, okay, but the message is wrong when you ssh to deployment-tin. It should point to deployment-mira.
[21:12:35] right, I think it's a prod'ism, as it were
[21:14:45] matt_flaschen: greg-g: yeah, the motd is hardcoded in a puppet file
[21:14:55] in theory one can convert the file to a puppet erb template
[21:15:07] and inject whatever variable represents the current scap master
[21:15:11] or the entry point
[21:15:48] hashar, yes, etonkovidova asked me about the message (probably not because of that per se, but it would be nice to make it accurate).
[21:15:52] hashar it looks like CI is very slow,
[21:15:58] one nodepool instance
[21:16:06] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663465 (Mattflaschen-WMF)
[21:16:10] T146505
[21:16:10] Oh never mind
[21:16:12] paladox: yeah, we have been out of quota / resources since July 4th
[21:16:13] only one patch
[21:16:18] Oh yep
[21:16:28] The American holiday
[21:16:38] and that is with most php jobs still on permanent slaves :(
[21:16:44] basically we are stuck
[21:16:55] Oh
[21:17:19] We should probably set up a labs-type thing for releng
[21:17:36] so that you can have more space than using labs, where everyone goes to test things
[21:17:38] :)
[21:17:44] at the bottom of the Zuul status page there are some links to CI / Nodepool / Zuul which are grafana boards
[21:17:50] the nodepool one has some extended details
[21:17:56] e.g. https://grafana.wikimedia.org/dashboard/db/nodepool
[21:17:58] Oh thanks
[21:18:01] notably it shows you the status of the pool
[21:18:05] 12 instances max
[21:18:09] oh
[21:18:09] green = ready to get a job
[21:18:18] blue = busy executing
[21:18:27] yellow = node is spawning / being provisioned
[21:18:32] around 8pm it started getting problems
[21:18:41] yeah
[21:18:44] typical busy hour
[21:19:07] Oh, I guess we should get dedicated hardware for releng :)
[21:19:12] RECOVERY - Puppet run on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:19:24] the challenge is to find some metrics that represent the wait time / annoyance to the devs
[21:19:28] so we can claim more instances
[21:19:43] but I am unable to find a meaningful one that would say: we need X instances because of that
[21:19:49] so we are stuck at 12 max for now
[21:21:06] Yep, but I mean dedicated hardware not provided by labs, since it seems they are going through a (resource) crunch :)
[21:21:17] hardware isn't free
[21:21:32] there's all kinds of "coulds" that could be done
[21:21:43] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[21:21:50] Oh
[21:21:53] ok
[21:22:39] PROBLEM - Keyholder status on deployment-mira is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:22:51] paladox: yeah, what greg said; we can come up with something that is lighter / takes fewer resources
[21:23:01] Oh ok
[21:23:26] we'll be talking about it a lot at our team offsite in October (Oct 17-21); more later :)
[21:23:37] Oh :)
[21:23:57] I should add a sleep 180 to all jobs and see whether anyone complains :D
[21:24:10] anyway, time for bed. Had a rather crazy week with the jobrunners and all
[21:24:18] +++
[21:24:18] Oh yep
[21:24:19] :)
[21:26:13] g'night hashar, have a good weekend
[21:26:51] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, Puppet: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663483 (hashar)
[21:28:38] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, Puppet: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663487 (hashar)
[21:28:54] ^^ that one is trivial :D
[22:58:12] Deployment-Systems, Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (Mattflaschen-WMF)
[23:23:15] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
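On the motd fix hashar sketched at [21:14:55] (convert the hardcoded puppet file to a template that knows the current scap master), here is one possible shape, as a hedged sketch only: rather than baking the hostname in at puppet-compile time, a small motd fragment could resolve the deployment.eqiad.wmnet alias that greg-g mentioned at [21:11:08] and warn when the local host is not the active master. The /etc/update-motd.d path and the CNAME lookup are illustrative assumptions, not the actual change made for T146505:

    #!/bin/sh
    # Hypothetical motd fragment, e.g. /etc/update-motd.d/97-deploy-master.
    # deployment.eqiad.wmnet is said above to always point at the active
    # deployment server; warn when this host is not it.
    active=$(dig +short deployment.eqiad.wmnet CNAME | sed 's/\.$//')
    if [ -n "$active" ] && [ "$active" != "$(hostname --fqdn)" ]; then
        echo "NOTE: active deployment server is ${active}; use it instead."
    fi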