[00:12:36] Beta-Cluster-Infrastructure, MediaWiki-DjVu: DjVu rendering broken on Beta - https://phabricator.wikimedia.org/T117132#1767552 (Bawolff) same issue with tiffs and missing libvips.
[01:07:02] Beta-Cluster-Infrastructure: +Sysop for User:Mww113 - https://phabricator.wikimedia.org/T116364#1767683 (Mww113) Thank you
[03:08:23] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<20.00%)
[05:51:29] Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768017 (Tgr) >>! In T102566#1761834, @saper wrote: > Question: wouldn't that be possible to ship the certifi...
[06:00:37] (PS1) Awight: Skip PHPUnit tests on all deployment branches [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062)
[06:01:43] Continuous-Integration-Config, Fundraising-Backlog, Unplanned-Sprint-Work, Fundraising Sprint William Shatner, Patch-For-Review: Tests on deployment branches of wikimedia/fundraising/crm falling causing to force merge (and deadlock of Zuul) - https://phabricator.wikimedia.org/T117062#1768025 (aw...
[06:38:21] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[09:06:33] hashar: meeting reminder :)
[09:07:08] zeljkof: ah I thought you were still sick sorry
[09:21:52] net breaking :/
[09:30:37] PROBLEM - Puppet failure on deployment-fluorine is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:38:28] Differential, Gerrit-Migration: Differential notification emails lack headers for repository and state change - https://phabricator.wikimedia.org/T117186#1768274 (hashar) NEW
[09:53:48] Beta-Cluster-Infrastructure, MediaWiki-DjVu: DjVu rendering broken on Beta - https://phabricator.wikimedia.org/T117132#1768318 (hashar)
[09:53:50] Beta-Cluster-Infrastructure: beta cluster missing vips command needed to render tiffs and pngs - https://phabricator.wikimedia.org/T116816#1768319 (hashar)
[09:57:26] PROBLEM - Host deployment-cache-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.19.197)
[09:59:54] Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools - https://phabricator.wikimedia.org/T117071#1768338 (hashar)
[10:03:33] Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools - https://phabricator.wikimedia.org/T117071#1768348 (hashar) The Wikimedia git hosting solution is Gerrit. The aim is to replace Gerrit with Differential, a solution which is embedded in Phabricator and track...
[10:18:37] Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1768361 (hashar) Indeed we have a bunch of them scattered around in multiple extensions: | OpenStackManager | Spyc.php | ht...
[10:20:40] Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768363 (saper) Right, might be difficult due to the way OpenSSL usually wants to have certificates. Will try...
[10:20:44] Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1768364 (hashar) @bd808 pecl extension is: http://bd808.com/pecl-file_formats-yaml/ which uses libyaml C bindings.
[10:21:35] Continuous-Integration-Config, Fundraising-Backlog, Unplanned-Sprint-Work, Fundraising Sprint William Shatner, and 2 others: Tests on deployment branches of wikimedia/fundraising/crm falling causing to force merge (and deadlock of Zuul) - https://phabricator.wikimedia.org/T117062#1768365 (hashar)
[10:22:55] (CR) Hashar: [C: 2] "We will see what happens :-)" [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062) (owner: Awight)
[10:24:06] (Merged) jenkins-bot: Skip PHPUnit tests on all deployment branches [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062) (owner: Awight)
[10:37:33] zeljkof: hi
[10:41:44] Continuous-Integration-Config, MediaWiki-General-or-Unknown, MediaWiki-extensions-General-or-Unknown: Add grunt-concurrent to mediawiki/core and decide whether to add it to extensions (Improve grunt performance) - https://phabricator.wikimedia.org/T116988#1768397 (hashar) The slowdown is usually due...
[10:53:33] hashar: https://phabricator.wikimedia.org/T95892
[10:53:37] is there anything left to do?
[10:54:02] Is ContentTranslation still different from other extensions in any regard? It should be the same.
[10:55:18] aharoni: do you have a PHPUnit test that lint the YAML files ?
[10:55:44] I don't think so. Where do I add it?
[10:55:50] How is it done in other extensions?
[10:56:18] It's .rubocop.yml, so it should be the same as all the other extensions that have .rubocop.yml.
[10:56:28] hashar: ^
[11:11:10] Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768437 (Amire80) [12:53] hashar: https://phabricator.wikimedia.org/T95892 [12:53] is there anything left to do? [12:54] Is ContentTranslatio...
[11:11:59] Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1768440 (yuvipanda)
[11:15:23] Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1768447 (yuvipanda) I think you're missing a lot of context here @hashar. This is a ticket that is gathering requirements for what sort of git setup we'd...
[11:27:34] aharoni: sorry disconnected
[11:27:55] hashar: np
[11:27:58] aharoni: rubocop is a style checker for ruby language. So it has nothing to do with linting YAML files :-D
[11:27:59] so do you have an example?
[11:28:05] nop
[11:28:20] if ContentTranslation has Spyc or some other YAML parser
[11:28:28] hashar: yes, but rubocop's own config file is a yaml file
[11:28:32] you can write a PHPUnit test that look for .yaml files and attempt to parse them
[11:28:38] is it checked by anything?
[11:28:41] this way if the .yaml file is wrong, the test will break
[11:28:42] does it have to be checked?
[11:28:50] I don't think
[11:28:58] if it is invalid, I guess rubocop will fail :-}
[11:29:40] hm
[11:29:49] hashar: so this task can probably be closed, because there's no other yaml to check in ContentTranslation, and I don't expect there to be any.
[11:29:58] yeah
[11:30:08] I am not sure why we had a yamllint job on that repo :}
[11:30:26] maybe copy-and-paste
[11:30:28] ok thanks
[11:30:48] yup :)
[11:31:20] Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768464 (hashar)
[11:31:22] marked as resolved :)
[11:31:32] aharoni: sorry for the mess
[11:31:43] hashar: no problem at all, thanks for the help.
[11:31:51] Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768467 (Amire80)
[11:32:21] Beta-Cluster-Infrastructure: beta cluster missing vips command needed to render tiffs and pngs, pnmtojpeg for DjVus - https://phabricator.wikimedia.org/T116816#1768469 (Rillke)
[11:35:51] !log have varnish stats collector to emit to labmon1001.eqiad.wmnet instead of production statsd ( cherry picked https://gerrit.wikimedia.org/r/#/c/249490/ )
[11:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:56:03] zeljkof: https://phabricator.wikimedia.org/T116164
[13:32:26] MediaWiki-Codesniffer, Outreachy-Round-11: Outreachy proposal for : Improving static analysis tools for MediaWiki - https://phabricator.wikimedia.org/T115585#1768654 (01tonythomas) We are approaching the Outreachy'11 application deadline, and if you want to have your proposal considered to be part of this...
[13:33:27] Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768664 (BBlack) It really shouldn't be hard for an OS/distribution/platform/language/whatever to have workin...
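The 11:28 suggestion above (find the .yaml files in a repository and try to parse each one, so that a broken file fails the test) can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHPUnit test hashar had in mind; the function name `find_unparseable` and its parameters are hypothetical, and the parser is passed in by the caller so the sketch does not assume any particular YAML library.

```python
import os


def find_unparseable(root, parse, suffixes=(".yaml", ".yml")):
    """Return (path, error) pairs for files under `root` that `parse` rejects.

    `parse` is any callable that takes the file's text and raises on
    invalid input (hypothetical interface, chosen for this sketch).
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
            try:
                parse(text)
            except Exception as exc:  # any parser error counts as a failure
                failures.append((path, str(exc)))
    return failures
```

If PyYAML happens to be installed, `find_unparseable(".", yaml.safe_load)` would be one way to call it; a test would then assert that the returned list is empty. Because the suffix check includes `.yml`, a file such as `.rubocop.yml` would be picked up as well.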
[13:51:52] Continuous-Integration-Scaling, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1768741 (hashar) NEW
[14:00:21] Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1768774 (hashar) Applied `role::ci::pmcache` to `pmcache.integration.eqiad.wmflabs`.
[14:06:58] PROBLEM - Puppet failure on pmcache is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:14:20] Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Evaluate devpi for caching Pypi python packages - https://phabricator.wikimedia.org/T114871#1768810 (hashar) Integration being done via T117207
[14:16:50] RECOVERY - Puppet failure on pmcache is OK: OK: Less than 1.00% above the threshold [0.0]
[14:42:46] (PS1) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:47:37] (PS2) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:47:49] (CR) jenkins-bot: [V: -1] (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:50:14] (PS3) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:53:11] (CR) jenkins-bot: [V: -1] (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:54:21] (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:58:26] (PS4) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:59:57] (CR) Hashar: "rebuild" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:00:08] * hashar whistles
[15:00:25] (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:00:39] * hashar feels tired
[15:01:18] * greg-g waves
[15:01:54] (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:09:27] Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768954 (Nemo_bis) > This almost feels like trying to ship a bundled TCP/IP stack and set of ethernet hardwa...
[15:15:32] (PS1) Hashar: rake-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:15:52] (PS2) Hashar: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:17:02] (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:17:24] (CR) jenkins-bot: [V: -1] tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:17:43] (PS3) Hashar: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:20:24] thcipriani: post-checkout maybe?
[15:20:28] * ostriches ponders for a moment
[15:22:01] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms
[15:23:34] (CR) Hashar: [C: 2] "Helpful to find out potential slowdown of a build. For integration/config I have spotted:" [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:24:08] ostriches: not entirely sure. I think the key is that the first time this code is run, the same time we setup the git hook, the first git-related thing happening currently will be calling `git fetch` from a target.
[15:25:26] (Merged) jenkins-bot: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:29:00] thcipriani: And the clients fetching are dumb?
[15:30:02] yeah, happening over http, so my assumption is that the first thing git-related that happens currently is a GET request for .git/info/refs
[15:30:11] * ostriches nods
[15:37:28] (PS5) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T117207)
[15:38:50] Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1769069 (hashar) Playing with it at https://gerrit.wikimedia.org/r/#/c/250009/ Before running tox, we will need to export a couple variables, probably: ``` export PIP_...
[15:39:39] Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1769072 (hashar) a:hashar
[15:41:06] Continuous-Integration-Scaling: Package Zuul for Debian Jessie - https://phabricator.wikimedia.org/T117223#1769087 (hashar) NEW a:hashar
[15:42:35] Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1769101 (bd808) >>! In T116965#1768364, @hashar wrote: > @bd808 pecl extension is: http://bd808.com/pecl-file_formats-yaml/...
[15:44:19] Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1769106 (bd808) If we wanted to get really fancy we could make a Wikimedia library that wraps both symfony/yaml and the yaml...
[15:44:39] bd808: you are lovely :-}
[15:44:46] thcipriani: Bahaha, I figured it out :p
[15:44:49] wikimedia/ultimate-yaml
[15:45:18] ostriches: ?
[15:45:21] Use post-update hook, create a remote "self" and push to yourself to trigger it :p
[15:45:30] Which seems totally unworth the labor compared to what we do now.
[15:45:42] There doesn't seem to be a post-tag hook of any sort.
[15:45:49] And we can't ensure a checkout has happened.
[15:46:07] (presumably it did, but it might not)
[15:48:00] tagging would be nice thing to hooks to that I didn't think about, bummer that there's not a hook there.
[15:48:27] s/hooks/hook/
[15:49:04] Yeah and post-commit doesn't run on tagging...I tried
[15:49:40] hmm, reading the git-scm docs this seemed like the right thing to do, but maybe not for the weird thing we're trying to do.
[15:50:31] It assumes a push to a central prior to a pull really for that scenario
[15:51:43] yeah, and presumably, there is a checkout that happens prior to a deploy, but in the first instance that would be pre our ability to set hooks.
[15:52:10] Yerp
[15:52:28] So, I think I'm gonna abandon D24 and its associated task. What we're doing now works, and we can't hook it better.
[15:53:57] ostriches: kk, sounds right, thanks for looking into it.
[15:54:11] np
[15:55:09] Woulda been nice to avoid a shell out :)
[15:57:32] indeed.
[16:28:40] Beta-Cluster-Infrastructure, Collaboration-Team-Backlog, Flow, Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1769274 (matthiasmullie)
[16:39:59] (PS2) JanZerebecki: [WIP] Run Wikidata browsertests without saucelabs [integration/config] - https://gerrit.wikimedia.org/r/247901 (https://phabricator.wikimedia.org/T116166)
[16:42:06] Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769311 (JanZerebecki) Yes that seems to disable saucelabs. But now I get "unable to obtain stable firefox connection in 60 seconds" as a failure for the seco...
[16:58:22] PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
[17:00:53] Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769396 (zeljkofilipin) I think the only thing left is to tell the machine you want to run the browser headless. ``` export HEADLESS=true ``` https://github...
[17:12:13] Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769439 (JanZerebecki) Yes that seems to work.
[17:13:01] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:13:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:13] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:46] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:46] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:05] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38610 bytes in 7.529 second response time
[17:30:03] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38731 bytes in 0.939 second response time
[17:30:35] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30350 bytes in 0.459 second response time
[17:30:37] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39021 bytes in 0.468 second response time
[17:32:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 38731 bytes in 0.705 second response time
[17:33:49] (PS1) Ori.livneh: Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043
[17:36:13] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:44] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:44] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:37:14] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[17:38:11] meh?
[17:39:04] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.430 second response time
[17:39:08] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:39:40] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1769612 (Catrope) Looks like this is the same query as the one called out at {T78671}, which is the one I suspected.
[17:40:43] Browser-Tests, Wikidata, Patch-For-Review, Wikidata-Sprint-2015-10-13: [Bug] fix negative argument (ArgumentError) in browsertests - https://phabricator.wikimedia.org/T110510#1769625 (zeljkofilipin) a:zeljkofilipin>None
[17:41:07] (CR) Chad: [C: 2] Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043 (owner: Ori.livneh)
[17:41:48] (Merged) jenkins-bot: Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043 (owner: Ori.livneh)
[17:42:58] is beta down? http://commons.wikimedia.beta.wmflabs.org/ doesn't load for me
[17:42:58] ---------------------------------------------------------------------------------------
[17:43:10] (whoops, bad paste)
[17:43:16] could've been worse :P
[17:43:25] something is flapping, shinken reported a problem, then a recovery, now another problem :)
[17:43:52] Apparently the DBs have too many connections?
[17:43:56] Original exception: [762b386d] / DBConnectionError from line 871 of /srv/mediawiki/php-master/includes/db/Database.php: DB connection error: Too many connections (10.68.17.94)
[17:44:00] when I load the page
[17:44:27] Grr.
[17:44:33] enwiki and commonswiki show different errors.
[17:44:34] https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
[17:45:06] Well, same underlying error I presume.
[17:45:09] Just different error page formats
[17:45:26] How do you log in as mysql root again?
[17:45:50] sudo -i; myql
[17:45:53] mysql*
[17:46:19] root@deployment-db2:~# mysql
[17:46:19] ERROR 1040 (HY000): Too many connections
[17:50:25] (CR) Zfilipin: [C: -2] "I disagree that VisualEditor browser tests were fine before we deleted the jobs. They were deleted because they were broken for a month or" [integration/config] - https://gerrit.wikimedia.org/r/247612 (https://phabricator.wikimedia.org/T94162) (owner: Jforrester)
[17:50:29] ori: https://phabricator.wikimedia.org/P2262
[17:52:39] looks like another bad query filled up /mnt/tmp
[17:52:55] bah :/
[17:53:00] we should get a `show full processlist` this time
[17:53:04] somehow ...
[17:53:06] :(
[17:53:13] db1 is not doing anything
[17:53:24] still can't get to db2
[17:53:45] anyone comfortable doing this? https://www.percona.com/blog/2010/03/23/too-many-connections-no-problem/
[17:53:59] * marxarelli isn't 100% on which pid to connect to
[17:54:20] from just viewing the file in /mnt/tmp i'm pretty sure it's flow
[17:54:36] root@deployment-db2:/mnt/tmp# strings \#sql_7912_0.MAD | head
[17:54:36] 20130917231900
[17:54:37] post-summarycreate-topic-summaryenwikienwikiSandboxdiscussionutf-8,gzip,html
[17:54:47] it was last time
[17:54:50] I can't actually connect anymore?
[17:54:51] that's the 82G file
[17:55:00] i'd really like to get a full view of the query this time
[17:55:08] i can do https://www.percona.com/blog/2010/03/23/too-many-connections-no-problem/
[17:55:08] so they can debug the root cause
[17:55:16] hahahaha
[17:55:16] ori: sweet. hit it
[17:55:17] thanks
[17:55:22] I tried that with no luck ori
[17:55:23] "Credit for the gdb magic goes to Domas."
[17:55:33] haha
[17:55:55] the pid file is /mnt/sqldata/deployment-db2.pid
[17:56:11] yeah, 30994
[17:56:20] I got the same thing with "ps -e | grep mysql"
[17:56:45] but, "Could not attach to process. If your uid matches the uid of the target" etc.
[17:56:50] ptrace: Operation not permitted.
[17:56:52] despite being root
[17:57:02] try with sudo -u mysql?
[17:57:19] that didn't work when I tried it earlier
[17:58:06] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #708: FAILURE in 5.3 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/708/
[17:58:54] did it work for you ori?
[17:59:24] well ,/proc/sys/kernel/yama/ptrace_scope is 0
[17:59:34] the only other reason for getting this message is if someone is already attached to that process via gdb
[17:59:52] i see two gdb sessions open
[17:59:54] can i kill them?
[17:59:57] root 2042 0.1 0.1 64440 23000 pts/1 T 17:51 0:00 gdb -p 30994 -ex show variables like 'max_connections';
[17:59:59] root 2058 0.0 0.0 49016 8968 pts/1 T 17:52 0:00 gdb -p 30994 -ex set max_connections = 300
[18:00:09] * ori does
[18:00:12] yep
[18:00:18] the first one didn't work and I thought I had killed it
[18:00:20] apparently not
[18:02:08] got it
[18:04:19] i got the output of show full processlist but it contains an ip, what's the right perm in phab to ensure you guys can see it but not everyone else?
[18:05:08] just dump it on the labs instance itself?
[18:05:14] do nda for now, but beta is on labs :)
[18:05:52] sanitized: SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions X.X.X.X */ * FROM `flow_revision`,`flow_workflow`,`flow_tree_node` WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wiki = 'enwiki' AND (rev_id < 'G��p\0\0\0\0\0\0') AND workflow_wiki = 'enwiki' AND (tree_descendant_id = rev_type_id) AND rev_type = 'post-summary' ORDER BY rev_id DESC LIMIT 51
[18:06:14] and let's open a task for moving tmpdir to its own volume/partition
[18:06:37] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39002 bytes in 1.891 second response time
[18:06:37] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30351 bytes in 2.432 second response time
[18:06:37] is that one being spammed by the same IP ori?
[18:07:29] it's just one query
[18:07:38] there are 300 checks for slave lag queued up behind it
[18:07:42] but it is the cause
[18:07:45] ori: https://phabricator.wikimedia.org/T78671#850799
[18:07:51] on db1 as /tmp/show_full_processlist.db1.1446228429
[18:08:02] "·
[18:08:05] Dec 16 2014, 16:17"
[18:08:28] i killed the query so it recovered
[18:08:35] ori: thanks!
[18:08:53] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 38713 bytes in 0.805 second response time
[18:08:53] np
[18:08:59] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38707 bytes in 1.357 second response time
[18:09:16] ori: so, for future reference, the gdb thing worked? and you connected to the mysqld process?
[18:09:44] Beta-Cluster-Infrastructure, Collaboration-Team-Backlog, Flow, Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1769830 (Catrope) Updated explain result of that query: ``` +------+-------------+...
[18:10:12] marxarelli: yep https://dpaste.de/AWxG/raw
[18:10:35] ori: awesome. good to know
[18:11:01] looks like slave replication is broken now ...
[18:11:05] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38714 bytes in 0.968 second response time
[18:11:07] the relay log is probably corrupt
[18:12:00] marxarelli: in fact since the process is still running probably good to set max connections to whatever it is normally
[18:12:06] might be able to fix it with a slave reset
[18:12:31] ori: it's normally 250
[18:12:36] which seems fairly low
[18:12:55] but yeah, i'll set it back
[18:13:06] no idea what a good value would be
[18:13:14] not something i know much about
[18:13:37] usually just app processes + other processes + some change
[18:15:02] i don't know enough about mw architecture to make a good estimate but 500 would still be conservative
[18:15:21] aaron or jynus would probably know
[18:15:26] at nanowrimo i think we did ~ 2000
[18:15:43] * ori parts to keep set of irc channels manageable *wave*
[18:15:44] but that was easier to compute
[18:15:49] ori: :)
[18:15:51] see ya
[18:16:41] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1769852 (Catrope) >>! In T116447#1769612, @Catrope wrote: > Looks like this is the same query as the one called out at {T78671}, whic...
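Two pieces of the exchange above lend themselves to a small sketch: the 18:13 rule of thumb for sizing max_connections ("app processes + other processes + some change") and the gdb invocation used to change the value on a running mysqld that refuses new connections. The Python below is only an illustration under assumptions: the function names and the sample process counts are hypothetical, pid 30994 and the value 250 are taken from the log, and the code only builds the command line, it does not run gdb.

```python
import shlex


def estimate_max_connections(app_processes, other_processes, headroom=50):
    """Rule of thumb from the chat: app + other + some spare connections."""
    if min(app_processes, other_processes, headroom) < 0:
        raise ValueError("counts must be non-negative")
    return app_processes + other_processes + headroom


def gdb_set_max_connections(pid, value):
    """Build (not run) a gdb argv that sets max_connections in a live mysqld.

    -batch makes gdb exit once the -ex command has run, so no session is
    left attached to the process.
    """
    command = f"set max_connections={int(value)}"
    return ["gdb", "-p", str(int(pid)), "-ex", command, "-batch"]


# Hypothetical counts that happen to add up to the usual value of 250.
target = estimate_max_connections(app_processes=200, other_processes=30, headroom=20)
argv = gdb_set_max_connections(30994, target)
print(shlex.join(argv))
```

Only one debugger can be attached to a process at a time, which matches what happened at 17:59: the two leftover gdb sessions had to be killed before a new attach succeeded.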
[18:17:55] !log stopping/resetting slave on deployment-db2 to fix replicate after relay log corruption
[18:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:18:23] RECOVERY - Free space - all mounts on deployment-db2 is OK: OK: All targets OK
[18:19:58] !log deployment-db2 replication recovered after slave stop/reset/set master position/start
[18:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:20:35] well good thing it was only the relay binlog this time
[18:39:37] Deployment-Systems, Scap3: Scap3's checks.yaml file should be optional - https://phabricator.wikimedia.org/T116204#1769990 (dduvall) Open>Resolved
[18:54:08] Deployment-Systems, Scap3: Scap3 should have idempotent deploys - https://phabricator.wikimedia.org/T109513#1770044 (dduvall) Open>Resolved
[19:07:14] greg-g: looks like the query that took down db2 is the same one from last friday https://phabricator.wikimedia.org/T116447#1758398
[19:07:52] "good"
[19:08:02] at least we get more data now!
[19:08:53] it's essentially the same with a different rev_id range
[19:08:56] but yeah, more!
[19:19:16] Beta-Cluster-Infrastructure, Collaboration-Team-Backlog, Flow, Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1770170 (dduvall) Another instance of this query took down Beta Cluster again toda...
[19:56:49] Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1770275 (Tgr) >>! In T102566#1768664, @BBlack wrote: > I haven't followed the low-level details too hard, but...
[20:16:12] Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1770298 (Tgr) The problem there is that libYAML implements YAML 1.1 while Symfony implements a subset of YAML 1.2. For the...
[20:51:15] Continuous-Integration-Config, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Validator, Wikidata, Patch-For-Review: [Task] make core wmf branches only use submodule branches that run with it in CI - https://phabricator.wikimedia.org/T113731#1770348 (mmodell) But I thought that media...
[21:11:20] Yippee, build fixed!
[21:11:21] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #813: FIXED in 45 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/813/
[21:27:49] Continuous-Integration-Config, Continuous-Integration-Scaling: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770392 (hashar) NEW a:hashar
[21:30:41] Continuous-Integration-Config, Continuous-Integration-Scaling, Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770410 (hashar) Our DIB manifest uses the `debian` element which injects: ``` $ cat /etc/cloud/cloud.cfg.d/01_h...
[21:46:35] Continuous-Integration-Config, Continuous-Integration-Scaling, Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770500 (hashar) Upstream patch https://review.openstack.org/#/c/240614/
[21:55:53] (PS1) Hashar: nodepool: acquire hostname from cloud [integration/config] - https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283)
[21:56:45] (CR) Hashar: "Need to rebuild the image ... https://wikitech.wikimedia.org/wiki/Nodepool#Diskimage" [integration/config] - https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: Hashar)
[22:00:28] Release-Engineering-Team: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770541 (bd808) NEW
[22:00:50] Release-Engineering-Team, User-Bd808-Test: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770550 (bd808)
[22:20:28] Continuous-Integration-Config, Continuous-Integration-Scaling, Patch-For-Review, Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770603 (hashar) Rebuilding an image: ``` dib-run-parts Fri Oct 30 22:19:39 UTC 2015 Runnin...
[22:20:40] (CR) Hashar: "Rebuilding an image:" [integration/config] - https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: Hashar)
[22:21:19] (CR) Hashar: [C: 2] "I havent uploaded the new image. Cause it is Friday :)" [integration/config] - https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: Hashar)
[22:22:20] (Merged) jenkins-bot: nodepool: acquire hostname from cloud [integration/config] - https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: Hashar)
[22:22:32] Continuous-Integration-Config, Continuous-Integration-Scaling, Patch-For-Review, Upstream, WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770606 (hashar)
[22:28:13] Continuous-Integration-Config, Continuous-Integration-Scaling, Patch-For-Review, Upstream, WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770621 (hashar) The new image is on labnodepool1001.eqiad.wmnet...
[22:28:46] Continuous-Integration-Config, Continuous-Integration-Scaling, Patch-For-Review, Upstream, WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770623 (hashar) p:Triage>Normal
[22:32:34] Gerrit-Migration, Release-Engineering-Team, User-Bd808-Test: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770541 (hashar) I would love to get @bd808 involved in Differential as an early adopter. Might prove useful to gather feedb...
[22:48:17] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770662 (hashar)
[22:49:37] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1749727 (hashar) Pro tip: you can syntax highlight IRC logs with in Markdown with `lang=irc`
[22:50:16] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770670 (hashar) Resolved>Open Reopening since we still follow up on determining the root cause of the outage.
[22:50:24] Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770672 (hashar) p:Unbreak!>High
[22:51:14] Beta-Cluster-Infrastructure, Tracking: Setup monitoring for Beta Cluster (tracking) - https://phabricator.wikimedia.org/T53497#1770676 (hashar)
[22:51:15] Beta-Cluster-Infrastructure: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#1770675 (hashar) Open>stalled
[22:51:31] Beta-Cluster-Infrastructure, Release-Engineering-Team, Tracking: Use Beta cluster as a true canary for code deployments (tracking) - https://phabricator.wikimedia.org/T53494#1770680 (hashar)
[22:51:32] Beta-Cluster-Infrastructure, Tracking: Setup monitoring for Beta Cluster (tracking) - https://phabricator.wikimedia.org/T53497#1770679 (hashar) Open>stalled
[22:52:07] Beta-Cluster-Infrastructure, Release-Engineering-Team: [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770681 (hashar)
[23:56:40] Deployment-Systems, operations, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1770800 (Dzahn) - moved "labsdbmanager" to role keyword: https://gerrit.wikimedia.org/r/#/c/250083/ - moved base::firewall to the deployment-server role https://g...