[03:59:51] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2403791 (RobH)
[04:02:40] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2403792 (RobH) All data has been restored from USB disk and services resumed. However, there are errors. syslog is spamming with Failed value conversion (but i also saw those before the migration) permi...
[04:02:52] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2403793 (RobH)
[08:26:13] Labs, Tool-Labs, User-bd808: task not run via crontab - https://phabricator.wikimedia.org/T138178#2403888 (WikedKentaur) Open>Resolved Task runs now via cron. Thanks.
[08:34:03] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Bonnedav was created, changed by Bonnedav link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Bonnedav edit summary: Created page with "{{Tools Access Request |Justification=I plan to use Tools to begin work on various projects that I have been thinking about for awhile. |Completed=false |User Name=Bonnedav }}"
[09:57:21] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2404129 (yuvipanda) So I looked at uwsgi logs and it looked like a possible race between uwsgi starting and graphite being fully installed. It looks ok now - I just restarted it! \o/ I'm going to throw mo...
[10:40:47] yuvipanda: labmon might well have a ton of useless metrics to garbage collect.
The deployment-prep statsd metrics come to mind
[11:27:39] Tool-Labs-tools-Other, Wikisource, Wikimania-Hackathon-2016: OCR scripts need updating at tools labs by updating the "tesseract-ben" package - https://phabricator.wikimedia.org/T117711#2404334 (Bodhisattwa) Tpt updated during Wikimania hackathon
[11:27:52] Tool-Labs-tools-Other, Wikisource, Wikimania-Hackathon-2016: OCR scripts need updating at tools labs by updating the "tesseract-ben" package - https://phabricator.wikimedia.org/T117711#2404335 (Bodhisattwa) Open>Resolved a:Bodhisattwa
[11:32:25] Tool-Labs-tools-Other, Wikisource, Wikimania-Hackathon-2016: OCR scripts need updating at tools labs by updating the "tesseract-ben" package - https://phabricator.wikimedia.org/T117711#2404358 (Bodhisattwa) a:Bodhisattwa>None
[14:03:51] Labs, MediaWiki-Vagrant, MediaWiki-extensions-Newsletter: Cannot enable/list roles in newsletter-test instance - https://phabricator.wikimedia.org/T131460#2404786 (Tgr) Resolved>Open This seems to be triggered when the encoding environment variables (`LC_*`) are changed.
[14:15:23] Labs, MediaWiki-Vagrant, MediaWiki-extensions-Newsletter: Cannot enable/list roles in newsletter-test instance - https://phabricator.wikimedia.org/T131460#2404846 (Tgr) The two files at fault are probably `puppet/modules/role/manifests/echo.pp` and `puppet/modules/role/manifests/contenttranslation.pp...
[15:01:55] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Get access to an old database on tools-db - https://phabricator.wikimedia.org/T101709#1345755 (TTO) Was this database found?
[15:13:06] hashar: yeah I'm sure you are right, we'll do some cleanup early next week I imagine
[15:13:40] chasemp: hi!
I am not sure though how one can determine a given metric is no longer updated / used :(
[15:13:50] maybe via the file modified time
[15:13:59] yeah that's how I've done it before
[15:14:21] there is another way w/ whisper tools to look at last updated value if it needs extra context
[15:14:36] but probably not needed
[15:23:06] Labs, MediaWiki-Vagrant, MediaWiki-extensions-Newsletter: Cannot enable/list roles in newsletter-test instance - https://phabricator.wikimedia.org/T131460#2405092 (Tgr) Not sure what encoding these files use but it's not UTF8. Someone probably used their system encoding when creating them and setting...
[15:31:30] chasemp: and we can probably have a shorter retention.
[15:31:44] I am sure on beta we can live with just a few months of history
[16:13:31] hashar: yeah I bet we piggy back on prod durations and we could easily do a year default in labs at a reasonable interval
[16:18:50] hrmm, so the labmon1001 data is migrated and coming in successfully but seems the actual portal for graphite is still down.
[16:19:07] it's serving a dms file rather than content?
[16:19:30] chasemp: ^ i already left yuvi a pm about it, but i imagine since data is indeed coming into the system the frontend portal not working is far lower priority
[16:20:44] robh: I think the prod graphite portal is broken too. ori filed a bug about it yesterday
[16:21:12] T138541
[16:21:12] T138541: "unexpected error" on graphite-web - https://phabricator.wikimedia.org/T138541
[16:21:38] huh, different error, but odd they both break around same time, heh
[16:21:50] Looks like yuvipanda may have temporarily reverted the change based on the task
[16:21:59] i think labmon1001 is a config error since it's not serving the proper page at all, not even an error page
[16:22:19] but, both yuvi and i confirmed we saw data streaming to the system, though when i had confirmed there were other errors he fixed his AM
[16:23:20] robh: what page are you trying?
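Editor's note on the 15:13 exchange above: the file-modified-time check can be sketched as below. This is only a minimal illustration, not what was actually run on labmon. It assumes metrics are stored as whisper `.wsp` files under a single directory tree; the function name `find_stale_metrics` and the 90-day threshold in the usage note are hypothetical.

```python
import time
from pathlib import Path


def find_stale_metrics(root, max_age_days, now=None):
    """Return sorted (path, age_in_days) pairs for .wsp files whose
    modification time is older than max_age_days.

    `root` is the top of the whisper storage tree (an assumption: the
    real location depends on how carbon is configured on the host).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for path in Path(root).rglob("*.wsp"):
        mtime = path.stat().st_mtime
        if mtime < cutoff:
            stale.append((str(path), (now - mtime) / 86400))
    return sorted(stale)
```

Calling `find_stale_metrics("/var/lib/carbon/whisper", 90)` (the path is a guess at a typical install location) would list candidates for cleanup. Nothing is deleted, so the list can be reviewed first, for example with the whisper tools mentioned at 15:14.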
[16:23:26] robh: *nod* grafana boards are looking better -- https://grafana.wikimedia.org/dashboard/db/labs-project-board
[16:23:27] nothing yet
[16:23:54] chasemp: I haven't dug back in yet
[16:24:12] http://graphite.wmflabs.org/ pops up to open in ...a text editor so there is some issue there, apache/django seems like
[16:24:40] yeah it's serving a .dms file rather than content
[16:24:52] but it's not normal apache sites, apache just feeds into the graphite software
[16:25:10] graphite's UI is a django app
[16:25:29] (a horribly written one too)
[16:25:33] so it appears to be a graphite app issue, not a stack issue for apache. the permissions for all the graphite data are also correct (afaict)
[16:39:53] huh man I'm not sure
[16:39:56] I will take a look again in a efw
[16:39:57] few
[17:20:18] Hey
[17:20:19] Check HTTPS
[17:20:24] HTTP is doing strange things
[17:20:51] robh chasemp bd808
[17:23:44] yuvipanda: same thing
[17:24:20] firefox tries to download a dms file
[17:24:28] the filename for .dms changes each time.
[17:25:07] I'm looking for a runaway cat at the moment unfortunately, she seems to have flown the coop but isn't prepared to be outside at all
[17:25:12] chrome tries to download some file of download.gz, so different browsers different behaviors
[17:25:32] chasemp: that sucks, hopefully has front claws? (helps if they are outside, as they can climb down things with front claws)
[17:25:45] she doesn't that's part of the issue yeah
[17:25:53] and she is an idiot so there is that
[17:27:52] last time this happened she was in a dresser drawer but seems like a real code red this time
[17:33:57] Robh strange, it worked fine in the morning...
[17:35:09] i assumed something odd blipped since you checked historical data with it, heh
[17:35:29] gotta run afk for 20, laundry swap.
[17:39:52] yuvipanda: so you tested it this morn and all was well?...ook then wth
[17:41:17] Yeah
[17:41:24] (on and off - wikimania)
[17:56:14] I'm having issues with ores-worker-07, ores-worker-09 and ores-worker-10. Can someone help me figure out what's up?
[17:56:26] I tried rebooting ores-worker-07 and it won't come back.
[17:56:40] I left 09 and 10 alone in case someone wants to look at their state.
[17:57:08] Oh wait. looks like 09 is accessible
[17:57:17] 07 is definitely derped and won't turn back on
[17:58:13] maybe chasemp ^
[17:59:08] Labs, MediaWiki-Vagrant, MediaWiki-extensions-Newsletter, Patch-For-Review: Cannot enable/list roles in newsletter-test instance - https://phabricator.wikimedia.org/T131460#2167726 (ori) I'd rather figure out how to get Puppet reliably run from a UTF-8 locale. Has that been explored and deemed im...
[18:04:35] Yeah. OK. looks like only -07 is struggling
[18:09:00] sure I can look at 07 halfak
[18:09:33] Thanks :)
[18:20:35] halfak: it seems like it shut itself down, possibly as the virt host is overloaded
[18:20:47] there have been some known issues that look like this afaik, but I can't get it to come back yet
[18:20:47] Gotcha. I did try to restart it a bit ago
[18:20:51] though it says it will
[18:21:00] yeah it gets into like an administrative shutdown state I think
[18:21:05] I told it to start and it said ok
[18:21:05] Gotcha.
[18:21:08] so I'm giving it a minute
[18:21:12] Thanks
[18:21:22] I can try to cold migrate this to another virt host and I will if it doesn't come back but
[18:21:31] never done it before and no one else is around so even money that works out :)
[18:21:42] but def not anything you're doing
[18:24:40] Good to know. Happily I didn't even notice that the darn node depooled because the rest of the nodes just took over.
[18:42:13] chasemp, doesn't look like it's back online
[18:42:27] yeah nova is a liar pffff
[18:42:49] I think it's the 1001 virt host halfak, you could try spinning up a new one while I look into migration
[18:42:57] I'll leave a note for andrew on where I get with this
[18:43:23] chasemp, so delete this instance and just make a new one?
[18:43:38] I'm at quota :/
[18:43:43] gah ok
[18:44:19] ok then yeah let's delete and recreate and see if that works
[18:44:26] OK will do!
[18:48:57] * halfak deletes and starts recreating ores-worker-07
[18:50:58] chasemp, is there a way to configure puppet through horizon or is that still wikitech-only?
[18:51:18] this coming quarter we hope to port to horizon but still wikitech atm :)
[18:51:22] kk
[18:53:16] * halfak forces the puppet run.
[18:53:19] Almost ready :)
[19:05:03] thanks halfak for your patience, not the ideal way to handle it but it's a weird week
[19:05:20] No worries. Labs is awesome. It's a small price to pay.
[19:05:27] :)
[19:06:02] OK. Working on re-pooling now. :)
[19:07:22] (don't mean to barge in here) Is there a way to completely disable puppet on a labs instance, and what problems might it cause?
[19:07:24] IT'S ALIVE
[19:09:45] tom29739: well...you can disable puppet but I'm not sure if that persists through a reboot and it will cause configuration drift and instance staleness that will eventually break the instance guaranteed
[19:09:56] it's really a short term thing for some specific end to be viable
[19:10:01] puppet agent --disable 'reason'
[19:10:10] How can I stop it overwriting my files?
[19:10:23] It keeps overwriting my salt config. :/
[19:11:06] tom29739: well, are you using a project specific salt master or something? we would have to puppetize your config
[19:11:27] I'm trying to.
[19:12:35] I was using https://wikitech.wikimedia.org/wiki/Help:Project_hosted_salt_master and it worked, but it's a 2 year old version of salt.
[19:12:44] hmm beta does this so there has to be a way to do this
[19:13:00] tom29739: what version?
[19:13:07] and what distro?
[19:13:15] because prod salt is probably jessie but I can check
[19:13:33] When I use that puppet role I get salt master 2014.7.5
[19:13:37] I'm using jessie
[19:14:02] I want to use salt 2016.3.1 because it has some features that I want to use.
[19:14:52] I did try to use the external saltstack repo, but the wikimedia repo overrides it for some reason, and it installs the old version.
[19:15:00] hm I see salt-master 2014.7.5 (Helium)
[19:15:03] in prod for labs
[19:15:12] so we may not support the newer version
[19:15:43] the package being the same name along w/ repo priority yeah
[19:15:48] probably nukes the 3rd party
[19:15:56] How can I override the wikimedia repo, so I can use the 3rd party repo?
[19:16:20] (without puppet overwriting whatever I do)
[19:16:42] I'm surprised it would mess w/ an installed package
[19:16:47] puppet isn't usually that smart
[19:17:20] It runs apt each time and overwrites it. With the specific salt-master package in the apt command.
[19:18:45] so not sure why the labs specific role there but
[19:18:55] the main manifest has a version flag
[19:18:57] class salt::master(
[19:18:57] $salt_version='installed',
[19:19:16] so in theory we can finagle it for hiera maybe if that's all it is
[19:19:53] The salt minion would need to be upgraded too.
[19:19:59] They share salt-common
[19:20:03] (the package)
[19:20:41] And I'd need to upgrade salt-minion too to take advantage of the new features.
[19:20:43] yeah it's going to be tracking down and/or creating settings to match things and making it hiera friendly
[19:21:02] but using version of sw in labs newer or different from prod is difficult
[19:22:06] Can a custom role be created in puppet? Then I could just put that on my instance.
[19:22:41] role::salt::masters::labs::project_master is the current role, so maybe something could be created off of that?
[19:34:17] having a manifest for a specific version seems bad, I'm not sure how that's functionally different from allowing version specification in a command manifest
[19:34:33] and dual code paths for something like this will always lead to trouble
[19:35:13] using salt will always lead to trouble ;)
[19:36:06] Puppet wouldn't even work when I tried it.
[19:36:45] The project puppetmaster just refused to work and then puppet refused to run on the other instances.
[19:39:49] I know you had a lot of issues setting up that cluster which is always lame. I have used self-hosted and project wide puppetmasters a lot in Labs though so I think it was all bad luck and/or bad documentation.
[19:41:46] I don't really understand puppet that much, which I'm sure is helping. ;)
[19:42:51] switching puppetmasters requires some hand holding or getting the hiera settings just right before you start. There are ssl client certs that you have to remove in various places
[19:43:54] It wasn't accepting the certs if I recall. It was moaning, and then I couldn't ssh to the instances because puppet wouldn't run, and then it just fell over.
[19:44:40] that was mostly bad luck I think though because you kept having issues aster you ditched the local puppet setup too right?
[19:44:47] *after
[19:44:53] I think so.
[19:45:27] At least salt appears to be working, in that I can ping the minions.
[19:46:00] tom29739: what is the feature you are looking for in newer salt?
[19:46:33] I've got it on a tab somewhere.. I'll find it.
[19:47:14] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2405716 (RobH) a:RobH>None
[19:49:12] chasemp, it was spm.
[19:49:36] I remember trying it on the saltmaster, and it didn't work because it was that old version.
[19:50:13] hm interesting
[19:51:36] Labs, Labs-Infrastructure: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2405720 (RobH) Updating from ongoing work and chat with @yuvipanda and @chasemp I've updated the task description to reflect that since Yuvi fixed this server in his AM, it is now experiencing a new issu...
[20:16:02] salt 2016.3.1 (Boron) :)
[20:17:07] Wonder what will happen when I re-enable puppet...
[21:26:43] And it's talking to the minions :) \o/
[21:27:51] so you installed the version of salt you want and puppet is back to normal
[21:28:01] but since puppet just checks for installed state
[21:28:04] it all works out
[21:28:05] ?
[21:28:23] my imagining is that is ok but if someone pins a version ever you're going to feel it :)
[22:48:19] !log tools.heritage Recreated source tables (T138606)
[22:48:20] T138606: updating monuments_all fails due to wd_item - https://phabricator.wikimedia.org/T138606
[22:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master