[00:20:07] PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:22:55] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[01:38:47] Release-Engineering-Team, MW-1.23-release: MW 1.23.16 patch is not compatible with PHP 5.3 - https://phabricator.wikimedia.org/T162572#3167286 (TTO)
[01:41:37] Release-Engineering-Team, MW-1.23-release: MW 1.23.16 is not compatible with PHP 5.3 - https://phabricator.wikimedia.org/T162572#3167298 (TTO)
[02:06:21] Release-Engineering-Team, MW-1.23-release: MW 1.23.16 is not compatible with PHP 5.3 - https://phabricator.wikimedia.org/T162572#3167301 (tstarling) Credit to @Erkan_Yilmaz who reported the issue on IRC. He also reported having an oddly corrupted XmlTypeCheck.php file, which didn't match the one in the 1...
[03:38:33] Release-Engineering-Team, MW-1.23-release: MW 1.23.16 is not compatible with PHP 5.3 - https://phabricator.wikimedia.org/T162572#3167342 (demon) >>! In T162572#3167301, @tstarling wrote: > I wonder what needs to change to make it possible to run CI on security release tarballs. I understand this is curre...
[09:39:58] MediaWiki-Releasing, Analytics: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3167774 (Addshore) So to be done 'properly' this will require an oozie job to be created for the hadoop -> graphite data. A vague example can be seen @ https://githu...
[10:12:41] Continuous-Integration-Config, Release-Engineering-Team, Regression: doc.wikimedia.org docs for old releases is actually master - https://phabricator.wikimedia.org/T162506#3167807 (hashar) On contint1001 the files are from November 25th 2015 which is when 1.26.0 got released ( https://lists.wikimedia...
[10:27:20] Continuous-Integration-Config, Release-Engineering-Team, Regression: doc.wikimedia.org docs for old releases is actually master - https://phabricator.wikimedia.org/T162506#3167817 (hashar) Looks like you have hacked the job :-] echo 'Skipped! --Krinkle' exit 1
[13:06:54] does zuul schedule stuff twice now? I'm seeing 2 chains that have 1 job in common
[13:25:11] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1002.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-1002.diskspace.root.byte_percentfree (<11.11%)
[13:28:48] ^^ current job on that instance is android-publish
[13:45:09] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1002.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-1002.diskspace.root.byte_percentfree (<11.11%)
[13:47:00] Release-Engineering-Team, MediaWiki-General-or-Unknown, Wikimedia-log-errors: Failed connecting to redis server at rdbXXX.eqiad.wmnet: Bad file descriptor in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/redis/RedisConnectionPool.php on line 235 - https://phabricator.wikimedia.org/T158770#3168347 (el...
[13:51:51] the grafana dashboard still lists precise on some of the graphs; those should be removed...
[14:09:47] Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: OpenStack instances stuck in deletion state - https://phabricator.wikimedia.org/T162529#3168379 (hashar) Resolved>Open Only noticed it now, but that happened again on 04/09 around 8:50, 21:00 and on 04/10 at 2:00. Each time two...
[14:11:17] is it the same ones, hashar?
[14:11:48] Zppix: what do you mean?
[14:11:58] T162529
[14:11:59] T162529: OpenStack instances stuck in deletion state - https://phabricator.wikimedia.org/T162529
[14:16:01] Zppix: it won't be the same instances, as nodepool creates them on the fly.
[14:16:29] paladox: I meant the same servers
[14:17:02] for example saucelabs1001, saucelabs1002, etc.
[14:17:44] Oh
[14:38:20] Release-Engineering-Team, Labs, MediaWiki-extensions-Linter, Wikimedia-log-errors: Table 'labtestwiki.linter' doesn't exist (208.80.153.14) - https://phabricator.wikimedia.org/T162605#3168469 (jcrespo) I do not think mediawiki production should access that non-production db, ever. We should ask m...
[14:42:00] Release-Engineering-Team, Labs, MediaWiki-extensions-Linter, Wikimedia-log-errors: Table 'labtestwiki.linter' doesn't exist (208.80.153.14) - https://phabricator.wikimedia.org/T162605#3168477 (Legoktm) Disabling Linter there is also fine with me, I'm just not sure what the expectations for that w...
[14:47:12] Release-Engineering-Team, Labs, MediaWiki-extensions-Linter, Wikimedia-log-errors: Table 'labtestwiki.linter' doesn't exist (208.80.153.14) - https://phabricator.wikimedia.org/T162605#3168510 (jcrespo) > I'm just not sure what the expectations for that wiki are... I also don't know, and I don't...
[14:51:17] hashar: o/
[14:51:32] do you have any idea who would be a good point of contact / owner for T125735 ?
[14:51:32] T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735
[14:52:07] elukey: hello this is your task router. Please enter OK to proceed:
[14:52:13] hahahaha
[14:52:17] OK
[14:52:25] EAGAIN
[14:52:30] OK
[14:52:31] EAGAIN
[14:52:33] ;D
[14:52:45] * elukey awards a token to hashar
[14:52:51] made my day, thanks
[14:53:15] looks like multiple errors have surfaced
[14:53:28] at least the RST spam seems to have a fix
[14:53:42] though I don't know whether it would actually solve the timeout issue
[14:53:51] no no, it won't, sadly
[14:54:13] I think that the jobrunners and jobqueues have a serious application-level issue
[14:54:44] that cannot be solved easily by tuning Redis
[14:54:46] yup :(
[14:55:07] but maybe I am deeply wrong and there is a clear path to resolution..
[14:55:16] I think part of the reason it has not been addressed is that we have no clue what is happening on the server side
[14:55:29] my understanding is the following
[14:55:31] maybe redis just fails to allocate a socket
[14:55:49] or linux has too many OPEN / CLOSE_WAIT connections
[14:56:00] no no, I checked, all good from that side
[14:56:31] I think that sometimes Redis (single thread, executes tons of Lua, etc..) blocks for more than the allowed 0.3s to return a valid answer
[14:56:35] even seconds
[14:56:40] ahh
[14:56:50] this causes the jobrunners to mark the Redis shard as "failed" for 30s
[14:57:04] the former problem generates "Could not connect to server"
[14:57:21] the latter "Unable to connect to redis server"
[14:57:45] and then it keeps going in this way
[14:57:49] maybe the jobrunner supports retrying
[14:58:23] probably until it marks the job as "done" (or another state) it keeps grabbing it from the todo list
[14:58:32] ah so it still tries to connect to it once it has been marked failed?
[14:58:56] yes and it emits "Unable to connect to redis server"
[14:59:11] because getConnection returns false
[14:59:28] (there is a debug log that would confirm this but I need to enable debugging and I don't know how to do it)
[14:59:38] ah that is your comment at https://phabricator.wikimedia.org/T125735#3109415
[14:59:43] // Mark server down for some time to avoid further timeouts
[14:59:43] $this->downServers[$server] = time() + self::SERVER_DOWN_TTL;
[15:00:07] that one
[15:00:17] elukey: I have to get out for some grocery shopping + getting kids back home
[15:00:25] could you summarize the idea on the task ?
[15:00:45] and potentially what kind of debug logs might be needed? Maybe one from releng can check it out
[15:00:50] we have our team meeting in an hour
[15:00:58] sure sure
[15:01:06] else I will try a patch based on your idea and poke Aaron about it tonight
[15:01:25] iirc the jobrunner client has some ->debug() helper
[15:01:35] though I can't remember offhand whether it is sent to logstash
[15:01:46] or whether it is logged at all on the jobrunners
[15:01:57] last time I had to mess with that, I live-hacked on the beta cluster to come up with a patch
[15:02:01] got a review for it
[15:02:10] and then deployed the patch on a single prod jobrunner
[15:02:22] (which has led me to mess up with trebuchet :( )
[15:02:48] but yeah, I guess debug logs to check what the jobrunner is doing when a server is marked down would be a good thing
[15:03:08] I'd need https://github.com/wikimedia/mediawiki/blob/9d25aa8ecac1ac2b8f765d4d4fcdfaaa798187b2/includes/libs/redis/RedisConnectionPool.php#L186
[15:03:29] I think that this one generates part of the log spam
[15:03:52] yup
[15:04:08] ah it is already there
[15:04:22] so it is "all about" figuring out how the error level is set :]
[15:04:36] which probably involves all spam being sent to the local rsyslog
[15:07:44] gotta rush out, will catch up tonight
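For readers following the Redis discussion above: the hold-off behaviour elukey describes (the shard being marked "failed" for 30s, after which getConnection() returns false and the jobrunner logs "Unable to connect to redis server") can be summarised in a small sketch. This is not the actual MediaWiki RedisConnectionPool code; apart from the two lines quoted at 14:59:43 (SERVER_DOWN_TTL and $downServers) and the 30-second hold-off mentioned at 14:56:50, the class name, helpers and the 0.3s timeout value are illustrative assumptions.

```php
<?php
// Minimal sketch of the down-marking pattern discussed above; not the real
// MediaWiki RedisConnectionPool class. Requires the phpredis extension,
// which is what the real pool wraps.

class RedisPoolSketch {
    // 30-second hold-off mentioned in the conversation; assumed constant value.
    const SERVER_DOWN_TTL = 30;

    /** @var int[] map of "host:port" => UNIX time until which the server is treated as down */
    protected $downServers = [];

    /** @var float connect timeout in seconds (0.2-0.3s values are mentioned in T125735) */
    protected $connectTimeout = 0.3;

    /**
     * @param string $server "host:port"
     * @return Redis|bool false while the hold-off is active, which is what the
     *  caller ends up reporting as "Unable to connect to redis server".
     */
    public function getConnection( $server ) {
        if ( isset( $this->downServers[$server] ) && time() < $this->downServers[$server] ) {
            // Roughly where the debug line referenced at RedisConnectionPool.php#L186 would fire.
            return false;
        }
        unset( $this->downServers[$server] );

        list( $host, $port ) = array_pad( explode( ':', $server, 2 ), 2, 6379 );
        $conn = new Redis();
        try {
            if ( !$conn->connect( $host, (int)$port, $this->connectTimeout ) ) {
                // Timeout/refusal path ("Could not connect to server ...").
                // Mark server down for some time to avoid further timeouts
                $this->downServers[$server] = time() + self::SERVER_DOWN_TTL;
                return false;
            }
        } catch ( RedisException $e ) {
            $this->downServers[$server] = time() + self::SERVER_DOWN_TTL;
            return false;
        }
        return $conn;
    }
}
```

For the debug output elukey wants, one option would be routing the relevant MediaWiki log channel to a file on a single jobrunner via $wgDebugLogGroups (for example $wgDebugLogGroups['redis'] = '/tmp/redis-debug.log';). This assumes the pool's messages go to the 'redis' channel; as hashar notes, the real question is how the error level is wired up, so that assumption would need checking.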
[15:30:09] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1002.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-1002.diskspace.root.byte_percentfree (<11.11%)
[15:44:06] PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[16:06:55] MediaWiki-Releasing, Analytics: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#1835400 (Nuria) It is not clear what is the value of this data, can someone explain?
[16:19:05] RECOVERY - Puppet run on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:54:53] (CR) Hashar: "That looks like a good first step for a simple jobs. Eventually we would have to a reference Docker image that is build using the existin" (3 comments) [integration/config] - https://gerrit.wikimedia.org/r/347130 (owner: Thcipriani)
[17:01:30] Scap (Scap3-Adoption-Phase1), Operations, RESTBase, Patch-For-Review, and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3168878 (mobrovac)
[18:43:02] Release-Engineering-Team, Page-Previews, Reading-Web-Backlog: Generate compiled assets from continuous integration - https://phabricator.wikimedia.org/T158980#3169269 (Jdlrobson) a: Jhernandez hey Joaquin, we talked about this in the backlog grooming today. We'd like to get this scheduled for an u...
[18:56:01] hashar: trilead-ssh2 now supports eddsa keys.
[18:56:07] the change has been merged
[18:56:23] though the hmac-256 change will be merged within the next 24 hours
[19:22:42] MediaWiki-Releasing, Analytics: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3169411 (Legoktm) To get an idea of how many people are using the MediaWiki tarballs. Additionally, this data could be used to see if people are still downloading old...
[19:33:34] hashar: it's been merged now :)
[19:33:35] https://github.com/jenkinsci/trilead-ssh2/pull/14#event-1037054693
[20:08:58] !log Update mobileapps to 1695900
[20:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:35:09] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1002.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-1002.diskspace.root.byte_percentfree (<11.11%)
[20:38:12] hashar: what's up with the problem shinken-wm keeps talking about with jessie slave 1002?
[20:39:19] some job apparently consumes disk space and then releases it
[20:39:24] and it is short on disk
[20:39:28] guess that needs some cleanup
[20:39:45] hashar: want me to file a task?
[20:39:51] sure!
[20:39:58] what exactly would we need done?
[20:40:09] stop the alarm! :]
[20:40:11] nah, seriously
[20:40:16] investigate the disk usage
[20:40:25] hashar: stop the alarm, okay *turns off shinken-wm*
[20:40:26] figure out what is filling up and clean!
[20:40:29] ;D
[20:40:30] alrighty
[20:41:01] the easiest is probably to delete it and let it refill
[20:43:28] Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3169667 (Zppix)
[20:43:53] there you go hashar
[20:44:01] thx
[20:44:09] no problemo
[20:46:53] PROBLEM - Puppet run on saucelabs-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:48:22] Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3169685 (hashar) Yeah / is almost full: ``` $ df -h / Filesystem Size Used Avail Use% Mounted on /dev/vda3 19G 17G 1.1G 95% / $ df -ih / Filesystem...
[20:48:29] PROBLEM - Puppet run on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:49:18] Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3169687 (Zppix) @hashar maybe its a good idea to invest in a script that auto cleans up /tmp?
[20:49:46] !log integration-slave-jessie-1002 : cleaning up /tmp: sudo find /tmp -path '/tmp/android-tmp-robo*' -delete # T162635
[20:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:49:49] T162635: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635
[20:52:49] !log integration-slave-jessie-1001 : cleaning up /tmp: sudo find /tmp -path '/tmp/android-tmp-robo*' -delete # T162635
[20:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:55:10] PROBLEM - Free space - all mounts on integration-slave-jessie-1002 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1002.diskspace._mnt.byte_percentfree (No valid datapoints found) integration.integration-slave-jessie-1002.diskspace.root.byte_percentfree (<11.11%)
[20:55:19] hashar: you really should just write a script that runs weekly (or even bidaily) to clean up /tmp
[20:56:12] more or less
[20:56:24] almost all jobs run on instances that are discarded on build completion
[20:56:30] so there is no more need to clean up /tmp
[20:56:44] then why does shinken-wm always complain?
[21:01:14] hashar: we have a job in zuul that's been queued for 3 hrs now
[21:04:48] I've already taken the liberty of notifying labs
[21:11:50] Zppix: why?
[21:12:02] greg-g: a job's been in postmerge for 3 hrs
[21:12:04] Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3169737 (hashar) Cleared out the /tmp . There are multiple copies of the Android SDK on the slaves. I guess because its installation path got moved with time. Among ot...
[21:12:19] hashar: if you are still here, could you add me as a member to the puppet3-diffs project in nova please?
[21:12:24] Zppix: I understand that, but why do you feel it is Labs' fault?
[21:12:54] greg-g: reason to believe it was nodepool, as there's a ticket that is in effect with similar symptoms
[21:13:59] Zppix: any number of things could cause this. Please do not assume it is Labs' fault and report the issue to them without first understanding the current issue.
[21:14:04] Zppix: the visualeditor job is scheduled on an instance named ci-jessie-wikimedia-607806 but it has been deleted
[21:14:08] Zppix: there is a bug somewhere :(
[21:14:37] hashar: it's been 3 hours, surely it would have been created by now, yes?
[21:15:01] Zppix: you clearly do not understand how nodepool works, based on that comment and previous ones
[21:15:18] Zppix: I understand you are trying to help, but at this point you are causing more noise than actual signal
[21:15:24] Zppix: Apr 10 17:43:21 contint1001 jenkins[4480]: SEVERE: I/O error in channel ci-jessie-wikimedia-607806
[21:15:24] :D
[21:16:04] greg-g: as hashar just said, I was wrong, but to be fair I was acting upon the information I had at the time; it doesn't hurt to cover all the bases, but I understand
[21:16:41] Zppix: it does when it wastes people's time because you are unable to do the first-level investigation before going to an Ops person
[21:17:00] Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3169757 (Niedzielski) > `/mnt/home/jenkins-deploy/.android-sdk` Feel free to try this out by moving this directory to a temporary location and triggering our jobs. (I tr...
[21:17:03] !log marked a nodepool node online manually. The instance was up but Jenkins failed to reach it due to some SEVERE: I/O error in channel
[21:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:17:16] greg-g: ack, my bad, sorry!
[21:17:47] Zppix: thank you for understanding.
[21:17:57] greg-g: no problem
[21:18:00] mobrovac: for puppet3-diffs, I would rather have it discussed with joe
[21:18:19] hashar: ok, can you then just check the puppet version there please?
[21:18:21] mobrovac: I don't even know why I am an admin there. Maybe to assist
[21:19:41] hashar: any idea why the I/O error happened?
[21:21:16] mobrovac: jessie and puppet 3.7.2 from jessie/main
[21:21:28] huh, ok, thnx hashar
[21:21:35] mobrovac: our puppet versions are inconsistent iirc
[21:21:44] with trusty having 3.7 vs jessie having 3.8
[21:21:51] but don't quote me on the exact version numbers :D
[21:21:51] RECOVERY - Puppet run on saucelabs-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:21:56] hehe ok
[21:22:04] Zppix: the visualeditor job is running
[21:24:55] I am heading to bed!
[21:25:15] hashar: okay, night!
[21:48:28] PROBLEM - Puppet run on deployment-logstash2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:16:18] Deployment-Systems, Scap, Patch-For-Review: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#3169911 (thcipriani) >>! In T127762#3158160, @thcipriani wrote: > Just tagged scap debian/3.5.4-1: {rMSCAf78caf877875bfc38f9dbb4a8ac67ddb969b3c78} > > @fgiunchedi could you update...
[22:24:04] MediaWiki-Releasing, MW-1.28-release: MW 1.28.1 tarball has languages/i18n/{en,qqx}.json.rej files - https://phabricator.wikimedia.org/T162643#3169920 (Legoktm)
[22:29:52] MediaWiki-Releasing, MW-1.28-release: MW 1.28.1 tarball has languages/i18n/{en,qqx}.json.rej files - https://phabricator.wikimedia.org/T162643#3169935 (Peachey88)
[23:01:49] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.29.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T160552#3170021 (Krinkle)
[23:04:44] Hm.. some wmf.17 blockers unresolved, that seems unlikely given all wikis are on 19/20 now
[23:04:52] they were reverted? Or still broken in prod?
[23:05:33] there's also 1 blocker from wmf.19? Krinkle
[23:11:04] they should either be resolved/removed or have a comment on them expressing "okay-ness"
[23:11:57] greg-g: okay, I do see that now, I didn't see it at first, thanks
[23:19:55] Continuous-Integration-Config, Release-Engineering-Team, Regression: doc.wikimedia.org docs for old releases is actually master - https://phabricator.wikimedia.org/T162506#3170061 (Krinkle) >>! In T162506#3167817, @hashar wrote: > Looks like you have hacked the job :-] > > echo 'Skipped! --Krinkl...
[23:35:10] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.29.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T160549#3170080 (Krinkle) Open>Resolved Subtasks are resolved (fixed in master and later branches). Although some of the issues still have underlying (older) issues as...
[23:35:13] greg-g: Yep, I misread it.
[23:35:22] greg-g: Thought they were sub tasks but were in fact sub-sub tasks
[23:36:17] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.29.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T160549#3170098 (Krinkle)