[00:00:02] yeah, filing a ticket would be good :)
[00:00:57] thcipriani@deployment-changeprop:~$ ps aux | grep -i salt | wc -l => 340
[00:01:00] oh boy
[00:02:00] slat loves to do fun things like that
[00:02:00] problem started on august 4th?
[00:02:03] *salt
[00:03:12] looks like it failed (Result: signal) since Thu 2016-08-04 15:06:07 UTC; 2 weeks 0 days ago
[00:03:43] looks like these are all systemctl start salt-minion jobs run by puppet that are all waiting on some unix socket at fd 3...
[00:03:52] 2 weeks worth of puppet runs
[00:04:20] 10Beta-Cluster-Infrastructure, 10Salt, 07Puppet: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2566370 (10AlexMonk-WMF)
[00:04:55] thanks :)
[00:26:39] thcipriani, aha, I got a result
[00:26:57] oh?
[00:27:18] Oh dear.
[00:27:44] Now it's decided to add an extra line to sshd_config and is trying to systemctl stop ssh.service
[00:27:53] just an extra newline at the end of the file
[00:28:18] though I don't know why it's just sitting there
[00:28:25] maybe the problem is actually with systemctl?
[00:29:44] when I jumped on that box it was under some super heavy load, I got it back to where it was before I started fiddling and left it be...
[00:30:13] I'll take a look tomorrow, lots of folks on that box right now seemingly :)
[00:30:56] just me and ppchelko
[00:31:02] though he does appear to be doing something
[00:35:38] Okay
[00:35:54] on deployment-fluorine02, /srv/mw-log/archive is just a copy of what deployment-fluorine had
[00:36:40] deployment-fluorine's files directly under /srv/mw-log have been copied with the prefix 'deployment-fluorine-'
[00:37:29] sweet
[00:41:08] I've shut down deployment-fluorine and we can probably delete it soon
[00:41:55] PROBLEM - Host deployment-fluorine is DOWN: CRITICAL - Host Unreachable (10.68.16.198)
[00:57:19] Hm... parent projects were not imported to Diffusion?
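(editor's note: the pile-up above can be confirmed from systemd's own job queue. A minimal triage sketch, assuming a systemd host like deployment-changeprop; the live-host commands are shown as comments since they need a running systemd, and the runnable part only demonstrates the counting pipeline on canned ps-style output — the `reset-failed` step is an assumption, not the fix actually applied in the log:)

```shell
# On the affected host (requires live systemd, shown for reference only):
#   systemctl list-jobs            # queue of pending "start salt-minion" jobs
#   systemctl status salt-minion   # shows "failed (Result: signal)" and since when
#   sudo systemctl reset-failed salt-minion   # assumption: clear the failed state
# The 340 count in the log came from a pipeline like ps aux | grep -i salt | wc -l.
# Same counting logic, demonstrated on canned ps-style output:
printf '%s\n' \
  'root   101  0.0  0.1 systemctl start salt-minion' \
  'root   102  0.0  0.1 systemctl start salt-minion' \
  'root   103  0.0  0.1 /usr/sbin/sshd -D' \
  | grep -c 'salt'
```

(`grep -c` counts matching lines, so the demo prints 2; note a bare `grep -i salt` also matches the grep process itself in a real `ps` pipeline, which slightly inflates the count.)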
[00:57:29] How can I browse refs/meta/config of performance.git?
[00:57:34] This used to be possible via GitBlit
[00:58:01] https://phabricator.wikimedia.org/r/branch/performance;refs/meta/config
[00:58:14] Linked from https://gerrit.wikimedia.org/r/#/admin/projects/performance,branches
[01:28:26] PROBLEM - Puppet staleness on deployment-imagescaler01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[02:54:57] (03PS9) 10Lethexie: Add usage to forbid superglobals like $_GET,$_POST [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/296395
[04:18:01] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #113: 04FAILURE in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/113/
[07:56:33] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2566824 (10thiemowmde)
[07:56:47] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2566837 (10thiemowmde) p:05Triage>03Unbreak!
[07:56:59] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2566839 (10thiemowmde)
[07:57:18] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2566824 (10thiemowmde)
[12:01:45] Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #118: 04FAILURE in 44 sec: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/118/
[14:40:02] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 07Documentation, 15User-zeljkofilipin: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2567594 (10zeljkofilipin) a:03zeljkofilipin
[14:49:57] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 07Documentation, 15User-zeljkofilipin: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2567612 (10zeljkofilipin) [[ https://phabricator.wikimedia.org/diffusion/CICF/browse/master/jjb/seleniu...
[14:58:30] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:59:16] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:57:56] PROBLEM - Puppet run on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:02:45] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 05MW-1.28-release-notes, 13Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2567900 (10Jdlrobson) p:05Normal>03High I've personally started ignoring browser test...
[16:06:33] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 05MW-1.28-release-notes, 13Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2567910 (10zeljkofilipin) a:03zeljkofilipin
[16:07:36] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 05MW-1.28-release-notes, 13Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2540639 (10zeljkofilipin) I am working on this, I have tried a few things, no luck yet.
[16:09:03] 10Beta-Cluster-Infrastructure, 05Goal: Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] - https://phabricator.wikimedia.org/T142288#2567912 (10AlexMonk-WMF)
[16:09:49] PROBLEM - Puppet run on deployment-fluorine02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:19:48] RECOVERY - Puppet run on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:28:43] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2567942 (10AlexMonk-WMF)
[16:48:59] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 10DBA, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2567996 (10AlexMonk-WMF)
[16:57:15] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2568039 (10greg) p:05Unbreak!>0...
[16:59:15] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2568055 (10greg) >>! In T143389#25...
[17:03:13] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 07Documentation, 15User-zeljkofilipin: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2568080 (10greg) >>! In T142409#2567612, @zeljkofilipin wrote: > [[ https://phabricator.wikimedia.org/d...
[17:12:40] thcipriani: greg-g: so far the last 2 days or so most patches get CI'd in 1-2m and I've seen a few 4-5m
[17:12:48] but most of what I've experienced directly has been pretty good
[17:13:32] yeah, probably mostly since we moved a ton of things off of nodepool, along with the nodepool changes
[17:14:00] greg-g: how difficult to move things back slowly?
[17:14:52] it per job type (eg: all rake jobs, or npm jobs), so... it's delicate/needs baby-sitting
[17:14:56] it's per*
[17:17:57] greg-g: welp let me know what you want to do, may take some adjustments w/ new load depending, who knows
[17:18:08] not being like for like with historical makes it all the more interesting
[17:18:19] but we have headroom
[17:20:24] chasemp: yeah, I want to stay where we are (sans any issues, /me knocks on all the wood in his office) until antoine can get caught up
[17:21:01] greg-g: when does he return?
[17:21:29] Monday, but you know, he's been gone for a month, so "sometime next week" is when he'll be "kinda caught up"
[17:22:05] he did pop in on tasks last week, so he's aware of the issues, but he'll need more time I'm sure
[17:22:36] I read pop and poop for a good minute there
[17:22:45] no worries just wondering on relative timeline
[17:24:24] word
[17:24:38] and yeah, poop comes out of my mouth more often than pop lately, you understand
[17:24:54] without the quotes around those words that sentence is weird
[17:35:28] sorry, in meeting, yeah, nodepool is looking a *lot* better. Waittime for nodepool jobs went from a max of 1.15 hours to 12 minutes yesterday, so it's seeming more stable.
[17:36:49] want to work through what we think we found wrt nodepool's non-waiting for instance deletion thing with hashar when he's back next week, we have a pairing session Wednesday where I'll rant :)
[17:37:31] *max waittime of 1.15 hrs on Wednesday
[17:41:57] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2568199 (10Anomie) >>! In T143389#...
[17:43:16] (03PS2) 10Urbanecm: Whitelist my second e-mail adresss [integration/config] - 10https://gerrit.wikimedia.org/r/302314
[17:45:29] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2568225 (10greg) Note, afaict: thi...
[18:34:52] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2568370 (10Jdforrester-WMF)
[18:36:10] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-Scribunto, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint-2016-08-16: [Bug] Scribunto_LuaSandboxTests repeatedly fails because it relies on code execution time - https://phabricator.wikimedia.org/T143389#2568372 (10Anomie) Yeah, it looks...
[18:46:00] 10Beta-Cluster-Infrastructure, 10Salt, 07Puppet: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2568418 (10thcipriani) So when I logged on yesterday, there were several hundred: systemctl start salt-minion jobs systemd on thi...
[18:52:32] PROBLEM - SSH on deployment-changeprop is CRITICAL: Connection refused
[18:52:39] 10Beta-Cluster-Infrastructure, 07Puppet, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2568441 (10thcipriani)
[18:52:42] 10Beta-Cluster-Infrastructure, 10Salt, 07Puppet: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2568438 (10thcipriani) 05Open>03Resolved a:03thcipriani Seems to be fixed: ``` thcipriani@deployment-changeprop:~$ sudo sys...
[18:57:31] RECOVERY - SSH on deployment-changeprop is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[18:58:53] thcipriani: ok, let me know how it goes :)
[18:59:52] chasemp: yarp. I'm sure we'll drag you into the fray at some point :P
[19:06:48] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[19:31:45] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[19:44:22] PROBLEM - Free space - all mounts on deployment-sentry01 is CRITICAL: CRITICAL: deployment-prep.deployment-sentry01.diskspace.root.byte_percentfree (<55.56%)
[20:06:05] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:20:26] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[20:21:06] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:41:59] Yippee, build fixed!
[20:42:00] Project selenium-Echo » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #122: 09FIXED in 58 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/122/
[20:42:59] Project selenium-Echo » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #122: 04FAILURE in 1 min 58 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/122/
[20:59:19] Project selenium-Wikidata » firefox,test,Linux,contintLabsSlave && UbuntuTrusty build #91: 04FAILURE in 2 hr 9 min: https://integration.wikimedia.org/ci/job/selenium-Wikidata/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/91/
[21:03:03] PROBLEM - Puppet run on deployment-sentry01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:12:03] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:22:04] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:25:12] getting a segfault from php55 in CI, any way to get access to the core file and see what's up (also, do we have the right versions to have usable core files in CI?)
[21:25:17] https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/6666/consoleFull
[21:31:47] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:34:38] PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:35:43] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:36:32] ebernhardson: in that instance, you should be able to ssh to integration-slave-trusty-1003.integration.eqiad.wmflabs to dig around
[21:37:20] thcipriani: that server doesn't like my cred's :(
[21:37:24] that job ran there in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55
[21:37:50] hrm, getent passwd ebernhardson returns stuff ebernhardson:x:3088:500:EBernhardson:/home/ebernhardson:/bin/bash
[21:38:34] thcipriani: i'm not in the integration project though, so seems like it shouldn't let me in?
[21:40:38] ebernhardson: give it a try now
[21:40:58] thcipriani: success! thanks
[21:41:31] absotively been meaning to dig into this, there's a bug somewhere...
[21:43:47] I think all labs instances can look up info about all users like this
[21:43:55] it's the ssh key lookup that controls access to hosts, right?
[21:44:48] yeah, ssh-key-ldap-lookup
[21:45:20] you're right, the passwd thing evidently just looks to see if user is in ldap
[21:47:20] maybe it's not the ssh-key-ldap-lookup itself that checks project membership
[21:47:31] but still pretty sure it's separate
[21:48:25] something in pam
[22:12:45] 10Beta-Cluster-Infrastructure, 10VisualEditor: Getting error while uploading image with VisualEditor - https://phabricator.wikimedia.org/T141814#2568888 (10Tanveer07) I would like to add that I am still being able to reproduce this bug even though my network connection seems fine.
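(editor's note: the core-file question at 21:25 usually comes down to whether dumps are enabled and where the kernel writes them. A sketch assuming a Trusty-era CI slave; the gdb invocation and the `php5-dbg` package name are illustrative, not taken from the log:)

```shell
# Is core dumping enabled for the current shell? "0" means disabled.
ulimit -c
# Enable for this shell (assumption: no wrapper already limits it):
#   ulimit -c unlimited
# Where the kernel writes cores (pattern varies per host/apport setup):
#   cat /proc/sys/kernel/core_pattern
# With a core and matching debug symbols (e.g. php5-dbg) installed,
# a backtrace would be taken roughly like:
#   gdb /usr/bin/php5 /path/to/core -batch -ex bt
```

(only `ulimit -c` is runnable anywhere; the rest needs the slave and a real core file, so it is left as comments.)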
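(editor's note: the access distinction worked out in the 21:37-21:48 exchange — account info visible everywhere via NSS/LDAP, login gated separately by the key lookup plus a PAM-side project-membership check — can be probed like this. `ssh-key-ldap-lookup` is the helper named in the log; invoking it directly, and the PAM detail, are assumptions. The runnable line just shows `getent` consulting NSS:)

```shell
# Visible on any labs instance: NSS resolves the LDAP account,
# whether or not the user may log in:
#   getent passwd ebernhardson
# Gated separately: sshd's AuthorizedKeysCommand fetches keys via
# the helper named in the log (direct invocation is an assumption):
#   ssh-key-ldap-lookup ebernhardson
# ...and PAM (project membership) decides whether the session opens.
# getent itself works against any NSS source, e.g. the local passwd db:
getent passwd root | cut -d: -f1
```

(the last line prints `root`, showing that `getent` reports presence in the account database, nothing about login rights.)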
[22:20:15] 10Beta-Cluster-Infrastructure, 10VisualEditor: Getting error while uploading image with VisualEditor - https://phabricator.wikimedia.org/T141814#2568900 (10greg) >>! In T141814#2568888, @Tanveer07 wrote: > I would like to add that I am still being able to reproduce this bug even though my network connection se...
[22:21:10] Project selenium-CentralAuth » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #120: 04FAILURE in 1 min 10 sec: https://integration.wikimedia.org/ci/job/selenium-CentralAuth/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/120/
[22:27:39] (03PS1) 10Krinkle: integration: Add slash to redirect from '/cover'. [integration/docroot] - 10https://gerrit.wikimedia.org/r/305745 (https://phabricator.wikimedia.org/T139620)
[22:27:55] (03CR) 10Krinkle: [C: 032] integration: Add slash to redirect from '/cover'. [integration/docroot] - 10https://gerrit.wikimedia.org/r/305745 (https://phabricator.wikimedia.org/T139620) (owner: 10Krinkle)
[22:34:20] (03Merged) 10jenkins-bot: integration: Add slash to redirect from '/cover'. [integration/docroot] - 10https://gerrit.wikimedia.org/r/305745 (https://phabricator.wikimedia.org/T139620) (owner: 10Krinkle)
[22:59:33] (03PS1) 10Krinkle: Fix-up 61d394c: RewriteRule without leading slash [integration/docroot] - 10https://gerrit.wikimedia.org/r/305749
[22:59:45] (03CR) 10Krinkle: [C: 032] Fix-up 61d394c: RewriteRule without leading slash [integration/docroot] - 10https://gerrit.wikimedia.org/r/305749 (owner: 10Krinkle)
[23:00:05] 10Continuous-Integration-Infrastructure (phase-out-gallium): Target architecture without gallium.wikimedia.org - https://phabricator.wikimedia.org/T133300#2568980 (10Krinkle)
[23:00:08] 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review, 07WorkType-NewFunctionality: Move CI coverage reports out of integration.wikimedia.org to a new domain or doc.wm.o - https://phabricator.wikimedia.org/T139620#2568979 (10Krinkle) 05Open>03Resolved
[23:00:14] (03CR) 10Dzahn: "lgtm, checked that mod_rewrite is already loaded on gallium" [integration/docroot] - 10https://gerrit.wikimedia.org/r/305745 (https://phabricator.wikimedia.org/T139620) (owner: 10Krinkle)
[23:00:16] (03Merged) 10jenkins-bot: Fix-up 61d394c: RewriteRule without leading slash [integration/docroot] - 10https://gerrit.wikimedia.org/r/305749 (owner: 10Krinkle)
[23:00:50] (03CR) 10Dzahn: "would merge but on this repo i have no permissions, not even a +1" [integration/docroot] - 10https://gerrit.wikimedia.org/r/305745 (https://phabricator.wikimedia.org/T139620) (owner: 10Krinkle)
[23:08:02] RECOVERY - Puppet run on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:08:38] 10Beta-Cluster-Infrastructure, 13Patch-For-Review: deployment-fluorine becomes unresponsive frequently - https://phabricator.wikimedia.org/T140313#2568998 (10greg) 05Open>03Resolved Patch merged. Donezors.
[23:10:03] 06Release-Engineering-Team, 03releng-201617-q1, 15User-greg, 15User-zeljkofilipin: Perform a technical debt analysis of software and services maintained by WMF Release Engineering - https://phabricator.wikimedia.org/T138225#2569000 (10greg)