[00:02:07] now to try build tools-cron-01 [00:02:09] :D [00:03:02] tools-submit is running so many things! [00:04:01] andrewbogott: I'm going to disable puppet on tools-submit again [00:04:12] hmm [00:04:14] not [00:04:15] YuviPanda: ok [00:04:16] needed actually [00:04:18] I'm not! [00:04:21] this time we should at least only get one email [00:04:26] or zero, better yet! [00:05:29] :D [00:50:20] $gridmaster = "${::labsproject}-master.${::site}.wmflabs" [00:50:22] class { 'gridengine': gridmaster => $gridmaster } [00:50:24] what the fuck [00:50:24] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1946536 (10Reedy) @Florian ^ Any chance you could review those? If there's slightly better ways things can be done, ie collapsing function calls etc, I don't rea... [01:41:07] oh no [01:41:13] I've fallen into another puppet hole [01:42:50] mmm [01:42:52] mayben ot [01:52:13] 6Labs: Create Labs project for ifttt - https://phabricator.wikimedia.org/T124131#1946760 (10madhuvishy) 3NEW [01:53:49] 6Labs: Create Labs project for ifttt - https://phabricator.wikimedia.org/T124131#1946772 (10madhuvishy) [01:53:51] 6Labs, 7Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1946771 (10madhuvishy) [01:57:44] YuviPanda andrewbogott just a thought [01:57:49] on the freezing up issue [01:57:59] * YuviPanda istns [01:58:01] err [01:58:03] *listens [01:58:08] from what i can tell it may be related to storage issues but not necessarily nfs [01:58:22] it seems like maybe process 'writes' and the os buffers [01:58:31] and then at some point that dirty copy is flushed to disk [01:58:54] and maybe this is happening too often or not often enough and it's taking to long and locking up [01:59:19] ugh [01:59:21] I se [01:59:23] e [01:59:26] interesting avg queue lenth [01:59:26] I wonder if a kernel upgrade will help [01:59:26] 
http://graphite.wmflabs.org/render/?width=973&height=375&_salt=1453254904.406&target=tools.tools-webgrid-lighttpd-*.iostat.vda.average_queue_length
[01:59:30] when you say ‘the os buffers’ you mean the os in the vm, or the virt host os?
[01:59:35] * YuviPanda puts the hammer away
[02:00:00] is that in fractions
[02:00:09] I was told to look at tweaking
[02:00:09] vm.dirty_background_ratio
[02:00:12] vm.dirty_ratio
[02:00:32] if we have issues with lockup on write as in we are waiting too long based on storage performance
[02:00:55] andrewbogott: well I guess maybe both?
[02:00:56] we can def play with 'em
[02:01:07] I'm not sure how the VM on virtual disk on real disk handles life man
[02:01:13] I mean I guess in both cases they buffer
[02:01:39] so this is just to say, if you see it happen try to remember w/ me to take a look at iostat
[02:01:53] is there any pattern regarding what vm OS or what hosts do this?
[02:01:57] Is it all and everywhere?
[02:01:59] I've seen 1 that ref'd a "job" which would imply maybe nfs and one that ref'd vda1 which would imply not
[02:02:24] don't understand the question
[02:02:28] do you mean like, best practice?
[02:02:36] no, I mean--
[02:02:40] are we losing all kinds of vms?
[02:02:46] ah
[02:02:51] or just ones with a particular os or on a particular virt host?
[02:02:59] the one I'm looking at was
[02:03:02] root@tools-webgrid-lighttpd-1412:~# cat /etc/issue
[02:03:02] Ubuntu 14.04.3 LTS \n \l
[02:03:07] I'm not sure anyone has gone through and tallied it up
[02:03:37] tip: exec node numbers starting with 14 are trusty, 12 are precise!
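The naming tip above (exec node numbers starting with 14 are trusty, 12 are precise) could be applied when tallying frozen instances by release. A minimal sketch, assuming the two-digit prefix convention holds for every exec node; `guess_release` and `RELEASE_BY_PREFIX` are hypothetical names, not part of any Labs tooling:

```python
import re

# Mapping taken from the tip in the log: a leading "14" means trusty,
# a leading "12" means precise. Other prefixes are treated as unknown.
RELEASE_BY_PREFIX = {"14": "trusty", "12": "precise"}


def guess_release(hostname):
    """Guess the Ubuntu release of an exec node from its numeric suffix.

    Returns None when the short host name does not end in a number of at
    least three digits, or when the two-digit prefix is not in the mapping.
    """
    short_name = hostname.split(".")[0]  # drop any domain part
    match = re.search(r"-(\d{2})\d+$", short_name)
    if match is None:
        return None
    return RELEASE_BY_PREFIX.get(match.group(1))


if __name__ == "__main__":
    for name in ("tools-webgrid-lighttpd-1412", "tools-webgrid-lighttpd-1209"):
        print(name, guess_release(name))
```

Hosts such as tools-cron-01 fall outside the convention and come back as None instead of being guessed.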
[02:03:42] I have a vague recollection of checking that, but… we should probably keep a running tally of which instances freeze [02:04:28] guess who found an unpuppetized script on the cron host [02:04:31] * YuviPanda puppetizes [02:06:03] just a thought I want to try to check iostat on the VM and the virt node at the time of lockup [02:06:07] maybe we can catch something [02:06:09] :) [02:06:16] yeah [02:06:25] but how do we check iostat on VM [02:06:27] when it's stukc [02:06:30] *cstuck [02:06:33] *stuck [02:06:35] bah [02:06:37] well the historical will at least be in graphite [02:06:38] but yeah [02:06:46] chasemp: can you leave a comment on the phab ticket so valhallasw and tim l are also aware? [02:07:09] which ticket is best? [02:07:54] * YuviPanda tries again to use phab search [02:08:27] https://phabricator.wikimedia.org/T124038 [02:08:29] chasemp: ^ [02:08:36] andrewbogott: btw, that's at least one precise host and one trusty host [02:08:46] yeah [02:08:50] with different kernels I expect [02:09:00] yeah [02:09:15] so we have https://phabricator.wikimedia.org/T124038 and https://phabricator.wikimedia.org/T124038 [02:09:19] ah dang [02:09:21] same one [02:09:26] https://phabricator.wikimedia.org/T124038 [02:09:27] and [02:09:41] https://phabricator.wikimedia.org/T123835 [02:09:49] the one I was tinking of was another even :) [02:09:52] huh [02:10:10] chasemp: bah, the seocnd is what I meant [02:10:31] * YuviPanda 's track record of typing what he means goes down the proverbial toilet today [02:10:45] better than a literal toilet [02:11:27] oh crap I gotta go 10m ago [02:12:28] hehe [02:12:31] chasemp: good night [02:12:48] is there a tracking bug with an inventory of lockups? [02:13:56] andrewbogott: not from a quick look. 
[02:13:59] maybe we should start one [02:14:31] 6Labs, 10Labs-Infrastructure: Labs instances sometimes freeze (Tracking bug) - https://phabricator.wikimedia.org/T124133#1946783 (10Andrew) 3NEW [02:14:41] woo [02:14:43] thanks [02:15:06] 6Labs: Webservice stuck, won't stop, can't restart - https://phabricator.wikimedia.org/T124038#1946791 (10Andrew) [02:15:08] 6Labs, 10Labs-Infrastructure: Labs instances sometimes freeze (Tracking bug) - https://phabricator.wikimedia.org/T124133#1946790 (10Andrew) [02:15:10] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1204 locked up - https://phabricator.wikimedia.org/T123835#1946792 (10Andrew) [02:15:13] ok, there’s a baby step [02:15:19] LD [02:15:21] :D [02:15:25] there were the k8s hosts getting stuck [02:15:31] but that was a jessie kernel issue we found and fixed [02:15:37] and then there was tools-checker that got stuck once [02:17:35] 6Labs, 10Tool-Labs, 5Patch-For-Review: Migrate tools-submit to tools-cron-01/-02 - https://phabricator.wikimedia.org/T123873#1946797 (10yuvipanda) a:3yuvipanda [02:36:59] 6Labs, 10Tool-Labs, 5Patch-For-Review: Migrate tools-submit to tools-cron-01/-02 - https://phabricator.wikimedia.org/T123873#1946821 (10yuvipanda) So failover should be: 1. Stop cron in active (or shut it down) 2. Clean out / Backup all the active crontabs in active (note that there's also a puppet initiate... [02:42:36] 6Labs, 10Tool-Labs, 5Patch-For-Review: Migrate tools-submit to tools-cron-01/-02 - https://phabricator.wikimedia.org/T123873#1946833 (10yuvipanda) Alright, I think that moves them all off to their appropriate places. I'll switchover tomorrow. [02:43:01] 6Labs, 10Tool-Labs, 5Patch-For-Review: Migrate tools-submit to tools-cron-01/-02 - https://phabricator.wikimedia.org/T123873#1946834 (10yuvipanda) The cron runner hosts do not have the whole exec_environ anymore. No idea why they did. 
[02:45:40] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1946836 (10yuvipanda) I just bumped up instance count limit for you guys! And \o/ to seeing this happen, maybe we can finally get rid of NFS from deployment-prep! [02:46:54] 6Labs, 10Tool-Labs, 5Patch-For-Review: puppet/apt issues on tools-submit - https://phabricator.wikimedia.org/T124014#1946845 (10yuvipanda) I don't actually think there's any reason libsnd should be on toolssubmit. I've 'fixed' the problem by removing the exec_environ from tools-submit, no idea why it was the... [02:47:11] 6Labs, 10Tool-Labs, 5Patch-For-Review: puppet/apt issues on tools-submit - https://phabricator.wikimedia.org/T124014#1946847 (10yuvipanda) 5Open>3Resolved And the reason multiarch support was removed was to make things consistent, I think. [07:47:41] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1947143 (10Florian) Reviewed, merged and uploaded changes for the core submodules. But now only for wmf.10, I assume, that we need them for wmf.11, too? [10:04:16] !log deployment-prep bump ram quota to 350G [10:05:20] !log deployment-prep bump cores to 165 [10:06:04] no feedback from log bot eh :( [10:56:41] 6Labs: Webservice stuck, won't stop, can't restart - https://phabricator.wikimedia.org/T124038#1947424 (10Magnus) Another "immortal" web service, this time for tool "catfood". 
[10:57:01] 6Labs, 10Labs-Infrastructure: Labs instances sometimes freeze (Tracking bug) - https://phabricator.wikimedia.org/T124133#1947426 (10Magnus) [10:57:03] 6Labs: Webservice stuck, won't stop, can't restart - https://phabricator.wikimedia.org/T124038#1947425 (10Magnus) 5Resolved>3Open [11:27:57] 6Labs, 10DBA, 10wikitech.wikimedia.org, 5Patch-For-Review: Untangle wikitech/labtestwikitech and s7 DBs and networking and mysql grants - https://phabricator.wikimedia.org/T124002#1947479 (10jcrespo) Should this be closed now (no more db errors) or should firewall be changed? [11:50:53] did something change in the proxy setup? [11:51:43] didn't change anything but I get Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at [11:52:04] and my tools send a 'Access-Control-Allow-Origin', '*' [11:54:37] nevermind it's a misinterpreted 503 by the brower [11:55:45] so this lead to tools-webgrid-lighttpd-1209 which look like dead ;(, ssh freeze [12:35:59] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1947580 (10Reedy) >>! In T123599#1947143, @Florian wrote: > Reviewed, merged and uploaded changes for the core submodules. But now only for wmf.10, I assume, tha... [12:50:22] 6Labs, 10DBA, 10wikitech.wikimedia.org, 5Patch-For-Review: Untangle wikitech/labtestwikitech and s7 DBs and networking and mysql grants - https://phabricator.wikimedia.org/T124002#1947627 (10Krenair) Don't the servers have inconsistent database access (whether by network restrictions or grants) still? [12:52:32] 6Labs, 10DBA, 10wikitech.wikimedia.org, 5Patch-For-Review: Untangle wikitech/labtestwikitech and s7 DBs and networking and mysql grants - https://phabricator.wikimedia.org/T124002#1947642 (10jcrespo) @Krenair yes, but that is only because T120122 is ongoing (and will take ~6 months to fully cover all serve... 
[12:58:38] 6Labs, 10DBA, 10wikitech.wikimedia.org, 5Patch-For-Review: Untangle wikitech/labtestwikitech and s7 DBs and networking and mysql grants - https://phabricator.wikimedia.org/T124002#1947659 (10Krenair) 5Open>3Resolved I guess this is resolved. [12:59:05] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1947662 (10Krenair) [13:34:51] 10Quarry: Cannot download data from a query with Unicode characters in its title - https://phabricator.wikimedia.org/T123031#1947747 (10XXN) I confirm this. I encountered the same problem with http://quarry.wmflabs.org/query/6945 [13:37:46] 6Labs, 10Labs-Infrastructure: Labs instances sometimes freeze (Tracking bug) - https://phabricator.wikimedia.org/T124133#1947758 (10scfc) @Andrew: If an instance locks up, do you want to take care of that yourself for debugging purposes, or is rebooting and mentioning the instance name here/in the blocking bug... [14:11:22] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1947854 (10fgiunchedi) {{done}}, required a #cores and memory bump too (following https://wikitech.wikimedia.org/wiki/OpenStack#Managing_project_quotas) [14:17:29] 6Labs, 10Labs-Infrastructure, 10Tool-Labs: Can't start webservice: timeout - https://phabricator.wikimedia.org/T124169#1947877 (10Nemo_bis) 3NEW [14:24:07] 6Labs, 10Labs-Infrastructure, 10Tool-Labs: Can't start webservice: timeout - https://phabricator.wikimedia.org/T124169#1947891 (10Magnus) Duplicate of T124038 [14:28:10] 6Labs, 10Labs-Infrastructure, 10Tool-Labs: Can't start webservice: timeout - https://phabricator.wikimedia.org/T124169#1947915 (10scfc) [14:28:12] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1947916 (10scfc) [14:28:50] 6Labs, 10Labs-Infrastructure, 10Tool-Labs: Can't start webservice: timeout - https://phabricator.wikimedia.org/T124169#1947877 (10scfc) The stuck job is on 
`tools-webgrid-lighttpd-1209`, so merging there. [14:29:35] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1947928 (10scfc) [14:30:50] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1946783 (10scfc) [14:30:52] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1947934 (10scfc) [14:32:27] looks like my webservice is on 1209 [14:32:51] 6Labs: Webservice stuck, won't stop, can't restart - https://phabricator.wikimedia.org/T124038#1947938 (10scfc) `catfood` is "running" on `tools-webgrid-lighttpd-1209`, that instance is handled by T124162, so the (initial) scope of this task was resolved. [14:33:08] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1947941 (10scfc) [14:33:10] 6Labs: Webservice stuck, won't stop, can't restart - https://phabricator.wikimedia.org/T124038#1947940 (10scfc) 5Open>3Resolved [14:34:54] I see that host is hung and I'm looking at why [14:35:33] ^me [14:38:58] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1947956 (10scfc) Contrary to the other instances, the console log on wikitech shows no indications of a problem: ``` […] cloud-init boot finished at Wed, 30 Dec 2015 03:17:15 +0000. Up 11.03 seconds Ubuntu 12.0... [14:39:50] PROBLEM - Host tools-test2-for-backports-scfc is DOWN: CRITICAL - Host Unreachable (10.68.18.197) [14:43:30] tools.taxonbot does report replication lag starting the script catstruct.tcl for 5 minutes now (dewiki) [14:45:04] dewiki is at 7106 [14:45:22] 7106 what? 
[14:47:19] seconds lag [14:50:55] !log tools reboot tools-webgrid-lighttpd-1209 as frozen [14:52:40] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1947994 (10doctaxon) 3NEW [14:54:35] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948009 (10doctaxon) [14:57:48] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948028 (10jcrespo) > Replication lag of 10 minutes Are you talking about the production replica databases or something else? [14:59:37] I have not mentioned it here, because it is a production-originated issue, but expect dewiki and wikidata replicas on labs to be delayed for some time while a schema change is ongoing on production [15:00:16] jynus: what chance is being made? [15:01:03] Betacommand, https://phabricator.wikimedia.org/T62539 [15:02:33] should be transparent for most applications, but it is a 500 million table primary key change, so it is impossible to not have any impact at all [15:06:59] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948039 (10doctaxon) I think, it is MariaDB ... but I am not sure. [15:11:17] Betacommand: jcrespo is asking me about the database: Can you help me to explain? https://phabricator.wikimedia.org/T124172 [15:13:48] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948045 (10Betacommand) >>! In T124172#1948028, @jcrespo wrote: >> Replication lag of 10 minutes > > Are you talking about the production replica databases or something else? Its the dewiki_p b... [15:14:04] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948046 (10jcrespo) Then yes, this is a known issue. 
A production schema change is ongoing on wikidata, affecting replication lag on some servers, both in production and labs of wikidata and dewi... [15:16:29] Betacommand: thank you! [15:20:16] sorry for the inconveniences, we only do this kind of things when not doing them would be worse [15:20:27] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948059 (10doctaxon) It's since about 12:12 UTC, now it's 15:18 UTC - the 3 hours should have ended now. Replication lag should go back to normal these minutes now ... Is it possible to monitor? [15:25:43] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1948066 (10jcrespo) @doxtaxon I meant 3 hours from now (a bit less, now) You can monitor the exact replication lag at https://tools.wmflabs.org/replag/ I will announce the end of the maintenance... [15:28:19] doctaxon, feel free to ask any questions here regarding the maintenance [15:40:53] jynus: you are jcrespo? I didn't know. Okay, three hours from now. Had it been not possible, to support this schema change by an own space? Other tools users were not affected of it this way, I guess. But I am not sure ... [15:43:22] "support this schema change by an own space" ? I do not get that [15:43:27] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1948132 (10scfc) [15:43:29] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1948130 (10scfc) 5Open>3Resolved a:3scfc [15:57:21] jynus: "support this schema change running on an own server only for this job" [15:58:37] doctaxon: the schema change was done in *production*. On the wikis that we all love. 
I am Wikimedia's DBA [15:59:15] labs gets automatically synced from production, so it has to follow to be in sync [15:59:55] okay [16:00:00] DBA = Database administrator [16:00:04] yeah [16:02:13] jynus: okay, then let's wait of it, but thank you for your explanations [16:18:51] (03PS1) 10BBlack: add traffic channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/265281 [16:26:40] (03PS2) 10BBlack: add traffic channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/265281 [16:29:23] (03CR) 10BBlack: [C: 031] add traffic channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/265281 (owner: 10BBlack) [16:41:46] (03CR) 10Alex Monk: [C: 032] add traffic channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/265281 (owner: 10BBlack) [16:42:18] (03Merged) 10jenkins-bot: add traffic channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/265281 (owner: 10BBlack) [16:42:32] !log tools.wikibugs Updated channels.yaml to: af30a77133e5b45db928e8f4850dd73a71169b16 add traffic channel [16:53:01] 6Labs, 5Patch-For-Review, 7Tracking: Support instance manipulation, proxies, dns with Horizon (Quarterly goal tracking bug) - https://phabricator.wikimedia.org/T124181#1948361 (10Andrew) 3NEW [16:55:26] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1948384 (10Andrew) [16:55:28] 6Labs, 5Patch-For-Review: Rename labcontrol2001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1948382 (10Andrew) 5Open>3Resolved a:3Andrew [16:55:35] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1948385 (10Andrew) 5Open>3Resolved a:3Andrew [16:56:34] 6Labs: Unable to change projects in horizon - https://phabricator.wikimedia.org/T123310#1948395 (10Andrew) [16:57:01] 6Labs, 5Patch-For-Review, 7Tracking: Create web-proxy editing panel in Horizon - 
https://phabricator.wikimedia.org/T124183#1948396 (10Andrew) 3NEW [16:58:36] 6Labs, 5Patch-For-Review, 7Tracking: Switch to using Horizon/Designate for labs public dns - https://phabricator.wikimedia.org/T124184#1948403 (10Andrew) 3NEW [16:59:24] 6Labs: Unable to change projects in horizon - https://phabricator.wikimedia.org/T123310#1948420 (10Andrew) This is because we're currently using the Keystone v3 api, which does not work with ldap assignment. [17:00:14] 6Labs: Switch from ldap project assignment to keystone/mysql project assignment - https://phabricator.wikimedia.org/T124186#1948427 (10Andrew) 3NEW a:3Andrew [17:03:48] 6Labs: [Tracking] Create labtest cluster - https://phabricator.wikimedia.org/T120293#1948450 (10Andrew) [17:03:50] 6Labs, 5Patch-For-Review, 7Tracking: Support instance manipulation, proxies, dns with Horizon (Quarterly goal tracking bug) - https://phabricator.wikimedia.org/T124181#1948449 (10Andrew) [18:07:39] 6Labs: Unable to change projects in horizon - https://phabricator.wikimedia.org/T123310#1948683 (10Andrew) a:5Andrew>3None [18:08:28] 6Labs: labstore2001 disk space WARNING - https://phabricator.wikimedia.org/T123874#1948687 (10Andrew) a:3Andrew [18:17:43] 10Wikibugs, 5Patch-For-Review: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1948768 (10Samtar) I //think// it's working - wikibugs is going mad on -dev at the moment but the rate looks good and it's not being flooded off :D [18:18:41] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1948777 (10scfc) At https://wikitech.wikimedia.org/wiki/Special:NovaInstance, "get console output" gives "Failed to get console output for instance tools-webgrid-lighttpd-1201.". Trying... 
[18:23:44] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Monitor 'showmount' behavior for labstore - https://phabricator.wikimedia.org/T123588#1948799 (10Andrew) 5Open>3Resolved [18:23:53] 6Labs: Monitor labs new instance creation - https://phabricator.wikimedia.org/T123590#1948801 (10Andrew) a:5Andrew>3None [18:24:07] 6Labs, 7Graphite: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#1948802 (10yuvipanda) a:5yuvipanda>3None Unfortunately I don't think I'll have time to do this anytime soon :( [18:25:03] 6Labs: labstore2001 disk space WARNING - https://phabricator.wikimedia.org/T123874#1948807 (10Andrew) mark says: don't forget the filesystem! Add a flag to lvextend. [18:44:21] 6Labs: labstore2001 disk space WARNING - https://phabricator.wikimedia.org/T123874#1948934 (10Andrew) sudo lvextend -L+2T --resizefs /dev/mapper/backup-tools [18:55:44] 6Labs, 10Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1949003 (10jcrespo) @doctaxon, maintenance has finished on production. How much time it will take for labsdb to catch up will depend on how loaded is the wikidata (s5) shard on labs. I will monit... [18:58:29] why can i see users like "Westonnh" on the wikitech wiki, where they got a shell https://wikitech.wikimedia.org/wiki/User_talk:Westonnh but the user does not exist in LDAP? 
[18:58:52] the user has 0 contributions though [19:01:06] how to find a shell name from wiki name (currently) [19:29:27] mutante: https://phabricator.wikimedia.org/T122595 [19:31:58] andrewbogott: thank you, the user was able to tell me the shell user and i could add him to missing WMF group [19:32:29] great [19:36:36] 6Labs: labstore2001 disk space WARNING - https://phabricator.wikimedia.org/T123874#1949193 (10Andrew) 5Open>3Resolved [19:46:22] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949238 (10chasemp) p:5Triage>3High [19:47:43] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1946783 (10chasemp) [19:47:45] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1204 locked up - https://phabricator.wikimedia.org/T123835#1949240 (10chasemp) 5Open>3Resolved a:3chasemp It's currently up and going and I don't have more details to add here so I'm going to close the subtask. [19:49:21] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1949249 (10chasemp) [19:49:23] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949248 (10chasemp) [19:49:32] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1949250 (10chasemp) 5Open>3Resolved a:3chasemp [19:49:34] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1946783 (10chasemp) [19:53:30] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1949261 (10chasemp) @scfc did you ever end up rebooting this? It was frozen when I saw it at around `Tue Aug 11 14:29:36` I ended up rebooting it after some general info grabbing but top said `top - 13:36:26 up... 
[19:54:40] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1949264 (10Boshomi) https://tools.wmflabs.org/replag/ shows high replag again: Shard Lag (seconds) Lag (time) s1 0 00:00:00 s2 0 00:00:00 s3 122 00:02:02 s4 0 00:00:00 s5 17701 04:55:01 s6 0 00:00:... [19:55:43] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1949274 (10jcrespo) This is intended- see https://phabricator.wikimedia.org/T124172#1948046 [20:36:26] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1949507 (10Reedy) [20:41:13] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949529 (10chasemp) I did some looking regarding the lockup of `tools-webgrid-lighttpd-1209.tools.eqiad.wmflabs` since I was able to catch it and I indicated in the [[ https://phabricator.wiki... [20:50:45] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1209 frozen - https://phabricator.wikimedia.org/T124162#1949580 (10scfc) @chasemp: Yes, I rebooted it via Special:NovaInstance a short time after 14:38Z (and before 15:43Z), and the instance became responsive afterwards. [21:01:16] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949648 (10scfc) I haven't checked if it is accurate, but my mental picture for SGE is that if the master and the execds don't exchange messages, they assume that everything's unchanged. So h... 
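The replag figures quoted in T123130 above pair a lag in seconds with an HH:MM:SS rendering (122 with 00:02:02, 17701 with 04:55:01). A minimal sketch of that conversion, useful for reading a bare number like the "7106" reported for dewiki earlier; `format_lag` is a hypothetical helper name and not the replag tool's actual code:

```python
def format_lag(seconds):
    """Render a replication lag given in seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Values taken from the log: the s3 and s5 shards, and the dewiki report.
    for lag in (122, 17701, 7106):
        print(lag, format_lag(lag))
```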
[21:02:16] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1949655 (10Reedy) [21:04:24] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949675 (10scfc) [21:04:27] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1949673 (10scfc) 5Resolved>3Open I still cannot `ssh` into that instance: ``` [tim@passepartout ~]$ ssh -v tools-webgrid-lighttpd-1201.tools.eqiad.wmflabs OpenSSH_7.1p2, OpenSSL 1.0... [21:04:28] 6Labs, 10wikitech.wikimedia.org: Semantic search : Provide a search filter for semantic search and a dedicated page to view logged in users' shell access requests. - https://phabricator.wikimedia.org/T124231#1949676 (10dg711) 3NEW [21:05:18] 6Labs, 10wikitech.wikimedia.org: Semantic search : Provide a search filter for semantic search and a dedicated page to view logged in users' shell access requests. - https://phabricator.wikimedia.org/T124231#1949685 (10dg711) [21:06:22] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949688 (10chasemp) At this point I think: * processes and jobs on tools-webgrid-lighttpd-1209.tools.eqiad.wmflabs were hung in a state of [[ http://stackoverflow.com/questions/1475683/linux-... [21:10:15] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1949695 (10Reedy) 5Open>3Resolved a:3Reedy [21:11:25] 10Wikibugs, 5Patch-For-Review: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1949700 (10Dzahn) @Samtar looks like you fixed it indeed. 
I saw mass edits that would have always kicked the bot in the past.. but it stayed online. very cool :) [21:12:37] 10Wikibugs: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1949702 (10Dzahn) [21:15:39] 10Wikibugs: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1949712 (10Luke081515) So we can close this now? [21:17:19] 10Wikibugs: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1949714 (10Dzahn) 5Open>3Resolved a:3Dzahn i'll say yes.. we can always reopen it if needed meme, src="tech-barnstar", above="tech barnstar", below="for samtar" [21:17:39] 10Wikibugs: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#1949717 (10Dzahn) a:5Dzahn>3Samtar [21:31:52] 6Labs, 10wikitech.wikimedia.org: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#1949771 (10Reedy) 3NEW [21:33:02] 6Labs, 10wikitech.wikimedia.org: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#1949781 (10Reedy) [21:35:08] 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#1949783 (10Reedy) [21:39:56] anomie: the whole login thing is thoroughly confusing to me … [21:41:08] 6Labs, 10wikitech.wikimedia.org: Semantic search : Provide a search filter for semantic search and a dedicated page to view logged in users' shell access requests. - https://phabricator.wikimedia.org/T124231#1949817 (10scfc) You can add the username to the results by clicking on "[Edit query]" and adding "?She... [21:49:23] anomie: and why would you need a "botpassword" if you have a regular password? [21:50:01] gifti: Because regular passwords aren't guaranteed to work for action=login after AuthManager comes out. 
[21:50:36] why is that?
[21:58:06] gifti: The AuthManager project is working on making a more advanced authentication process possible for MediaWiki in general. Once it is fully deployed then configuration changes could be made to do things such as enable 2factor authentication. If and when that happens, the old password login process will stop working.
[21:58:47] gifti: But the botpassword solution will continue to work even if an account has a 2factor requirement or other authentication changes for interactive logins
[21:59:08] sigh
[21:59:23] so switching soon to either a botpassword or an OAuth grant will protect you against future config changes
[21:59:45] couldn't we provide backwards compatibility?
[21:59:59] we are, until we can't anymore
[22:00:05] :p
[22:00:14] and the botpassword is a future proof fix
[22:00:20] being backwards compatible with one-factor auth means not having a second factor
[22:00:24] (for certain values of future)
[22:00:27] kind of defeats the purpose
[22:00:42] how would that work with bots anyway?
[22:01:51] when an account has a botpassword set up then the authentication flow will try to process the username + password as the botpassword.
If that matches then it will not continue on to try other authentication methods that may be available [22:02:15] it is very similar to the way that Google's "app passwords" work when you have 2factor enabled for your account [22:03:09] the botpassword will be a weaker authentication method than using 2factor authentication but it will be cheap to revoke which helps a bit with account security [22:03:41] the "best" thing to change to is OAuth authentication [22:04:22] but botpasswords are an attempt to support bots that can't for whatever reason use OAuth [22:09:39] I hope you can give me a description of how to log in a bot by botpassword or OAuth [22:10:16] and how to change the login config [22:10:24] doctaxon: https://www.mediawiki.org/wiki/Manual:Pywikibot/OAuth [22:10:39] i do not use pywikibot [22:14:33] doctaxon: in that case, ask the bot developer [22:15:20] ah I think it's gifti [22:16:03] doctaxon: create bot password at Special:BotPasswords, change password in config file, profit [22:16:25] already now? [22:16:35] it doesn't exist yet [22:16:44] so only then [22:16:47] (I have no idea how to work with oauth) [22:18:09] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1949928 (10chasemp) tools-webgrid-lighttpd-1209 graphs: {F3250805} {F3250823} {F3250969} {F3250981} Labstore graphs {F3250796} {F3250984} {F3250990} It seems like we have IO issues an... 
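Note on the botpassword login discussed above: a minimal sketch of what the login request for a bot might look like once Special:BotPasswords exists. It assumes the documented BotPasswords convention that the login name is the account name plus "@" plus the bot name chosen at Special:BotPasswords, and that action=login takes lgname, lgpassword and lgtoken. The account name, bot name, password and token below are hypothetical placeholders, and the helper names are made up for illustration. No network request is made; the token would normally be fetched from the API first.

```python
from urllib.parse import urlencode


def bot_login_name(account: str, bot_name: str) -> str:
    """Combine the wiki account and the bot name into a botpassword login name.

    Assumption: the "Account@botname" form used by BotPasswords.
    """
    return f"{account}@{bot_name}"


def build_login_params(account: str, bot_name: str,
                       bot_password: str, login_token: str) -> dict:
    """Build the POST parameters for an action=login request.

    Only the parameters are assembled here; sending them to the wiki's
    api.php (with cookies kept between requests) is left to the bot's
    own HTTP code.
    """
    return {
        "action": "login",
        "format": "json",
        "lgname": bot_login_name(account, bot_name),
        "lgpassword": bot_password,
        "lgtoken": login_token,
    }


if __name__ == "__main__":
    # All values are placeholders, not real credentials.
    params = build_login_params("ExampleUser", "mybot",
                                "not-a-real-password", "placeholder-token")
    # The urlencoded string is what would go in the POST body.
    print(urlencode(params))
```

In a bot's config file this would mean replacing the regular username and password with the "Account@botname" login name and the generated bot password, which can later be revoked without touching the main account password.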
[22:26:45] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.003 second response time [22:28:12] hahaha what [22:30:02] eh, the string is actually there [22:33:36] !log tools.bash Webservice returning 404 before restart, 502 after restart; not usable log data [22:34:48] !log tools.bash Can't resolve host issues contacting elasticsearch backend [22:35:34] chasemp: are you getting an alert storm from diamond right now? [22:35:34] ouch [22:35:36] bd808: ^ [22:35:38] ok [22:35:40] err [22:35:43] so that's internal labs dns failure [22:35:57] yeah. looks like it may be fixed now [22:36:09] not great [22:36:09] intermittent maybe [22:36:21] that may be hitting a filter inadvertently on my side [22:36:23] as I didn't see it [22:36:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 982543 bytes in 3.763 second response time [22:36:59] !log tools.bash Seems to be working again; possible intermittent Labs dns issues [22:37:08] both recursors (sp?) are responding for me [22:37:59] me too [22:38:16] andrewbogott: I logged errors at 2016-01-20T22:33:45 and 2016-01-20T22:33:56. Seems to be working now though [22:38:30] bd808: and it worked in between? [22:38:49] Errors in /data/project/bash/error.log if you want to peek at them [22:38:58] nothing very detailed however [22:39:10] what is that tool out of curiosity? [22:39:22] it's a quips service [22:39:27] like bash.org [22:39:28] ah [22:39:40] took over for the lost functionality of bugzilla quips [22:40:07] https://phabricator.wikimedia.org/T73245 :) [22:40:33] raw data from BZ https://phabricator.wikimedia.org/P110 [22:40:48] chasemp: so mostly a completely fun toy and not much else [22:41:05] mutante: I loaded all of those :)( [22:41:09] bd808: so there's going to be labs-wide restarts shortly [22:41:16] bd808: how do you think the ES cluster will cope? 
[22:41:21] should we do anything special for it? [22:41:22] bd808: :) cool! [22:41:29] YuviPanda: It should heal itself [22:41:41] ok [22:41:48] YuviPanda: if it doesn't then ping me to make it better [22:42:07] I don't have any monitoring for it at the moment :/ [22:42:19] but bash and sal will die if it's not there [22:56:32] (03CR) 10Hoo man: [C: 031] Add data-values/* to wikidata-feed irc channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/264926 (owner: 10Aude) [23:10:43] (03CR) 10Alex Monk: [C: 032] Add data-values/* to wikidata-feed irc channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/264926 (owner: 10Aude) [23:11:28] (03Merged) 10jenkins-bot: Add data-values/* to wikidata-feed irc channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/264926 (owner: 10Aude) [23:12:29] (03CR) 10Alex Monk: "Oh. After +2ing this, I've realised that I probably don't have the permissions necessary to deploy changes to the bot any more..." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/264926 (owner: 10Aude) [23:13:04] YuviPanda, I've got an instance in ERROR state in maps-team project. it's not rebootable and appears to not quite be there [23:13:56] I don't have anything valuable on it so if the cause is known it can just be deleted [23:14:35] 6Labs, 10wikitech.wikimedia.org: "action=formedit" doesn't work any more - https://phabricator.wikimedia.org/T124248#1950333 (10scfc) 3NEW [23:15:22] 6Labs, 10wikitech.wikimedia.org: "action=formedit" doesn't work any more - https://phabricator.wikimedia.org/T124248#1950355 (10scfc) [23:48:30] anyone having issues with bots not being able to log in? 
[23:48:40] greg-g: yes [23:48:44] greg-g: someone reported that on labs-l [23:48:44] on mw.org [23:48:46] short while ago [23:48:48] hmmm [23:48:49] not sure where [23:49:04] on commons [23:49:28] commons would be group1 [23:49:50] depending on timing that might make sense [23:50:25] 35min ago [23:50:45] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1950441 (10Dzahn) 5Resolved>3Open https://wikitech.wikimedia.org/wiki/Special:... [23:53:44] i can login to wikidata with my bot [23:53:48] via the api [23:57:37] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1950540 (10Reedy) 5Open>3Resolved Because it was reverted back to .10 Backpor... [23:58:01] aude: can you test on mediawiki.org as well?