[00:26:58] grrrit-wm: restart [00:29:37] 06Labs: Make ladsgroup admin on the labs 'fa-wp' project - https://phabricator.wikimedia.org/T138372#2456384 (10Huji) No objection in also having Ladsgroup (Amir) as an admin. We can have multiple admins, correct? Also the project is not really idle. It hosts a MW instance that is used for Twinkle i18n and for... [00:45:42] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Fdapuzzo was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=742863 edit summary: [03:44:10] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456611 (10Dereckson) According @mmodell, the Arcanist package is distro agnostic . We need it for every distro we support in apt.wm.o [04:39:02] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2456641 (10zhuyifei1999) 01 has been drained; right now there's 4 files pending server-side-upload in `/srv/v2c/ssu`. [04:45:52] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456643 (10mmodell) @dereckson: According to T137770#2426038, the package was uploaded to Trusty and Jessie. [05:29:49] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456708 (10mmodell) And yet I don't see the package on https://apt.wikimedia.org... [07:52:01] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10valhallasw) https://apt.wikimedia.org/wikimedia/pool/universe/libp/libphutil/ Would it be possible to also upload the package for precise? [08:05:07] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10hashar) The package has been created via {T137770} for both Trusty and Jessie: **Trusty** ``` $ apt-cache madison arcanist arcanist | 0~git2016... [08:21:56] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2456901 (10zhuyifei1999) This can proceed once {T140216} is completed. [08:27:07] yuvipanda: fyi https://phabricator.wikimedia.org/T140216 your video is in there [08:29:03] 50 minutes o.O [08:36:04] 1080p, 2GB [09:08:43] !log tools.heritage Deployed latest from Git, 38363cb, 6117b13 (T139267), 96af0dc, 20d6da6 [09:08:44] T139267: i18n is broken on Heritage project after moving to Kubernetes backend. 
- https://phabricator.wikimedia.org/T139267 [09:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master [10:07:37] 06Labs, 10Labs-Infrastructure: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2457056 (10hashar) [10:19:21] http://paws.wmflabs.org gives 504 :( [10:43:40] jzerebecki on it now [10:47:09] 06Labs, 10Labs-Infrastructure, 07LDAP: Remove parsoid UID/GID from Labs LDAP - https://phabricator.wikimedia.org/T140227#2457134 (10mobrovac) [10:47:54] 06Labs, 10Labs-Infrastructure, 10Parsoid, 06Services, 07LDAP: Remove parsoid UID/GID from Labs LDAP - https://phabricator.wikimedia.org/T140227#2457148 (10mobrovac) p:05Triage>03High [10:48:41] 06Labs, 10Labs-Infrastructure, 10Parsoid, 06Services, 07LDAP: Remove parsoid UID/GID from Labs LDAP - https://phabricator.wikimedia.org/T140227#2457134 (10mobrovac) [10:58:07] jzerebecki fixed it. [11:05:15] 06Labs, 10Labs-Infrastructure, 10Parsoid, 06Services, 07LDAP: Remove parsoid UID/GID from Labs LDAP - https://phabricator.wikimedia.org/T140227#2457192 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done [11:13:17] !log tools reboot tools-worker-1004, was unresponsive [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [11:13:33] !log tools depool tools-worker-1014 - unusable, totally in iowait [11:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [11:19:38] !log tools drained tools-worker-1004 - high ksoftirqd usage even with no load [11:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [11:40:01] !log tools cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation [11:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [11:40:29] yuvipanda: are you aware of the checker.tools.wmflabs.org alert? [11:40:44] nope, looking now [11:40:49] cool, thanks [11:41:48] paravoid downtimed [11:45:54] I probably need to move that to tornado or something that lets me control concurrency / locks better [11:49:10] PROBLEM - Host tools-worker-1012 is DOWN: CRITICAL - Host Unreachable (10.68.16.49) [11:50:43] ^ is me [12:14:13] RECOVERY - Host tools-worker-1012 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [12:17:30] 06Labs: Switch existing and new trusty instances to GRUB 2 - https://phabricator.wikimedia.org/T140100#2457326 (10faidon) [12:19:00] 06Labs: Switch existing and new trusty instances to GRUB 2 - https://phabricator.wikimedia.org/T140100#2457330 (10faidon) I added a $realm guard for base::grub in the meantime (eww) but I'm hoping that's going to be temporary :) [12:23:37] PROBLEM - Host tools-prometheus-01 is DOWN: CRITICAL - Host Unreachable (10.68.18.8) [12:24:04] ^ is me [13:23:43] is there something I should do to have select privileges for a mysql user on labsdb? ERROR 1142 (42000): SELECT command denied to user 'u4849'@'10.68.23.199' for table 'page' [13:23:51] specifically on commonswiki [13:25:29] godog you want 'commonswiki_p' [13:25:30] the _p suffix [13:26:01] yuvipanda: hah! that did it, thanks! [13:35:44] 06Labs: Make ladsgroup admin on the labs 'fa-wp' project - https://phabricator.wikimedia.org/T138372#2457748 (10Andrew) 05Open>03Resolved @Huji -- correct, there can be multiple admins. I've added Amir to the list. Thanks for the quick response! 
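(Editor's note: a minimal sketch of the pattern yuvipanda points out above for godog's ERROR 1142, assuming the usual Tool Labs conventions of the time — credentials in ~/replica.my.cnf and a <wiki>.labsdb replica host. The hostname and query are illustrative; the important part is the `_p` suffix on the database name.)

```bash
# Hypothetical replica query from a Labs instance or tool account.
# Assumes credentials in ~/replica.my.cnf and the commonswiki.labsdb
# replica alias; adjust both to your actual setup.
mysql --defaults-file="$HOME/replica.my.cnf" \
      -h commonswiki.labsdb \
      commonswiki_p \
      -e 'SELECT page_id, page_title FROM page LIMIT 5;'
# Querying plain "commonswiki" instead of "commonswiki_p" produces the
# "ERROR 1142 ... SELECT command denied" seen above.
```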
[14:09:45] 06Labs, 06Operations, 10ops-codfw: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2457845 (10MoritzMuehlenhoff) Why are we keeping this open? The host is reimaged and I'm able to connect to it? [14:30:47] 06Labs, 10Incident-20151216-Labs-NFS, 06Operations: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2457900 (10MoritzMuehlenhoff) Both hosts should be migrated to Linux 4.4, 3.19 is deprecated at this point. [14:41:21] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2443163 (10Dereckson) I'm handling that in the next 30 minutes, so you'll be free to upgrade. [14:53:30] 06Labs, 06Operations, 10ops-codfw: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2457977 (10Andrew) 05Open>03Resolved [14:53:49] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Monitor k8s flannel etcd health - https://phabricator.wikimedia.org/T140246#2457978 (10yuvipanda) [14:54:29] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Health check for k8s etcd - https://phabricator.wikimedia.org/T140247#2457994 (10yuvipanda) [14:55:23] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Check that all k8s nodes are in 'ready' condition - https://phabricator.wikimedia.org/T140248#2458009 (10yuvipanda) [15:02:30] !log video deleting/recreating encoding01 [15:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL, dummy [15:04:06] andrewbogott: can you ping me when the instance is ready? [15:04:14] andrewbogott: did labvirt1012 ever get rebooted, robh had raised concerns HT wasn't enabled there [15:04:24] chasemp: yep, I fixed the HT thing that evening [15:04:29] cool [15:04:44] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Setup monitoring for kubernetes core components. - https://phabricator.wikimedia.org/T131929#2458034 (10yuvipanda) p:05Normal>03High [15:07:01] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Run https://github.com/kubernetes/node-problem-detector on all our nodes - https://phabricator.wikimedia.org/T140249#2458035 (10yuvipanda) [15:08:37] zhuyifei1999_: 01 is up and running and I moved the proxy to point to it [15:09:07] zhuyifei1999_: now your three instances are running on three different ssd servers — 01 is on a server that's been around for a while, 02 and 03 are on new ones [15:09:16] so I'm very interested if you see bad behavior on the last two [15:10:25] 02 and 03 seems working well [15:12:06] andrewbogott: can you apply the /srv puppet role? [15:12:20] zhuyifei1999_: ah, right, that. One minute... [15:12:34] k [15:13:26] zhuyifei1999_: done [15:13:38] k [15:18:59] andrewbogott: should be up [15:19:30] (there's no tasks right now so I can't confirm) [15:20:28] !log ores deployed ores-wmflabs-deploy:e638f1b [15:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [15:34:45] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2458121 (10zhuyifei1999) The service is running on encoding01 now. [15:39:34] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2458138 (10Andrew) 05Open>03Resolved thanks all! [15:41:40] I've got a problem with ores-web-05. 
It became unresponsive and now when I try to reboot, I get "Unable to shut off instance: ores-web-05" [15:41:46] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Split OGE grid status data collection out of admin tool - https://phabricator.wikimedia.org/T140251#2458142 (10bd808) [15:42:10] It could be because I initiated a "Soft restart" a few minutes ago and then tried to just shut it off when that didn't work. [15:42:33] Hmm... Thinks it is running and not trying to reboot now. [15:42:45] Oh! ANd I can log in [15:42:48] Hiccup? [15:43:39] andrewbogott ^ [15:45:08] halfak: It's definitely the case that you can't change the state of a server while it's already in the middle of a different state change (there's not a queue for that) [15:45:13] but I don't know why it hesitated [15:45:21] is there any evidence on the instance that it was stuck? [15:45:27] e.g. clock drift, dmesg? [15:45:37] Hmm.. It seems that none of the services are starting up. [15:45:56] Clock looks good [15:46:26] | 63cf10a6-922a-458e-8bd2-d9022e9fbc73 | ores-worker-05 | ACTIVE | public=10.68.19.255 | [15:46:27] fwiw [15:46:30] OK. Service just kicked in [15:46:39] This is much slower than expected. [15:46:40] ah, so it did indeed reboot then [15:46:50] Yeah. Looks like it. Just was very slow [15:46:59] And then once online services seems to have started up very slow. [15:47:02] I think this might be OK [15:47:14] But it's weird behavior compared to what I'm used to [15:47:25] CPU, and wa % have looked nominal the whole time. [15:47:59] hm, that host is busy but not /that/ busy [15:48:45] soft reboot initiated first though could take awhile to safely work [15:48:53] and a hard reboot following would be ignored possibly? [15:48:55] not sure [15:49:03] yuvipanda, is there a good way to repool a down'd node with nginx without a service interruption? [15:49:51] it should automatically depool when it fails [15:49:58] if you want to manually depool it before for scheduled downtime [15:50:05] Na. Want to repool [15:50:07] you can change the hiera page on wikitech and run puppet on the lb [15:50:15] ah, repool [15:50:31] good question - I think that's supposed to happen automatically after a timeout as well [15:50:42] Gotcha. Hmm. [15:50:57] you can force it with a 'service nginx reload' - which shouldn't cause a service interruption (reload vs restart) [15:51:30] Hmm... Seems this didn't cause a repooling [15:52:00] I checked that web-05 is returning correctly, but this URL reports and "Internal server error" https://ores.wmflabs.org/node/ores-web-05/ [15:52:07] So I still think it is depooled. [15:52:21] * halfak goes for the service restart [15:52:35] Weird. that still didn't get it [15:53:15] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Modernize the admin tool's codebase - https://phabricator.wikimedia.org/T140254#2458229 (10bd808) [15:53:29] the "status" agrees that a restart took place. [15:53:48] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Modernize the admin tool's codebase - https://phabricator.wikimedia.org/T140254#2458233 (10bd808) [15:54:08] Oh interesting. [15:54:29] Looks like I can't git ores-web-05:8080 from lb, but I can hit localhost from ores-web-05 [15:56:15] Hmm... Yeah. This instance seems totally weird. [15:56:16] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Add kubernetes status page to admin tool - https://phabricator.wikimedia.org/T140255#2458241 (10bd808) [15:56:21] I might have to kill it and try again [15:56:47] Internal server error and nothing in the logs. 
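(Editor's note: a sketch of the manual depool/repool flow yuvipanda describes above, under the assumption that the backend list is rendered into the nginx config on the load balancer by puppet from the project's Hiera page on wikitech; names and paths are not the actual ORES layout.)

```bash
# Hypothetical repool flow for a web backend behind the project's nginx LB.
# 1. Edit the project's Hiera: page on wikitech to (re)add the backend.
# 2. On the load balancer, pull in the change and reload (not restart) nginx:
sudo puppet agent -tv          # re-renders the nginx upstream config
sudo service nginx reload      # graceful reload, no dropped connections
# nginx also depools a failing backend automatically and retries it after a
# timeout, so the manual step is mainly useful for planned maintenance.
```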
[15:56:51] WEIRD [15:57:13] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458257 (10yuvipanda) [15:58:39] !log ores repooled ores-web-04 [15:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [15:59:25] * halfak runs puppet on lb [15:59:44] Looks like 04 is working [16:17:22] What does "rebuild instance" do? [16:17:33] andrewbogott or chasemp: ^ [16:17:54] I'm hoping that I can say "destroy this instance and reapply all the same puppet roles" [16:18:29] halfak: I don't know, it's not something that I've worked on or supported. It /might/ do that :) [16:18:37] yeah I hvae never tried it :) [16:18:42] If it doesn't, let me know and I'll disable the button [16:19:04] lol OK. I'll try it and see what happens :) [16:19:35] Oooh. It had me choose a disk image [16:19:41] So it seems like this might be happening. [16:21:53] so docs are sketchy on the internals but [16:21:55] 'repartition then install the new OS over it' [16:21:59] new OS beign same OS [16:22:25] if puppet config survives the rebuild it'll be by accident (since horizon doesn't know about puppet) [16:22:32] but I can imagine reasons why it might work fine anyway [16:22:37] since the fqdn will presumably remain the same [16:23:09] what are the rules for reusing instance names? [16:23:48] in practice don't as it will confuse things but in reality there is a lag between deletion and creation that would make it possibly sane iirc [16:24:03] it just creates a lot of edge cases and complication so we say don't unless you want edge cases and complication I think [16:24:12] but a rebuild wouldn't necessarily be that [16:24:18] In theory it should work just fine, in practice sometimes it doesn't, mostly due to races [16:24:22] it would in theory just be a reimage w/ same everything [16:24:24] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458391 (10yuvipanda) I've rebooted the host since I couldn't reach it in any other way. This has been dead for weeks at this point... [16:24:33] in my experience if I wait 3 or 4 minutes after a deletion I get good results with a recreate [16:24:40] but a rebuild would bypass this issue? [16:25:07] OK. So horizon says I have a new instance running Jessie 8.5, but I can log into ores-web-05 and the MOTD says Jessie 8.3 [16:25:10] it depends on whether or not rebuild reuses the same IP [16:25:12] And it looks like it's the same old instance. [16:25:13] if it does then it should be pretty safe [16:25:21] it wouldn't be reallocating a new uuid or new ip or new metadata, just a wipe and reinstall I think [16:25:26] and also I don't know if rebuild sends a delete and recreate notification or if it sends some other message... [16:25:29] that's how I read it anyway [16:25:38] Basically I don't know what rebuild does in the backend, so many unknowns [16:25:51] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458394 (10yuvipanda) It's been clearly stuck for a long time... ``` The last Puppet run was at Sun Jul 3 03:44:20 UTC 2016 (15160 minutes ago). ``` [16:25:57] I /think/ it boots to pxe thus reinstall [16:26:06] halfak: that's not what I expected :( [16:26:14] what's the uptime on that instance? 
[16:26:40] Up for 44 mins -- so since the soft restart from before [16:26:59] hm, so sounds like rebuild did nothing at all [16:27:07] Horizon reports the time since created as 3 months, 3 weeks [16:27:08] Yeah [16:27:10] ha nope [16:27:14] It changes the "image name" [16:27:17] :) [16:27:36] Want me to leave this here for now or is it OK if I delete and manually re-create? [16:28:01] andrewbogott, ^ [16:28:45] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458405 (10yuvipanda) From kern.log: ``` Jul 3 04:00:58 tools-k8s-etcd-02 etcd[515]: start to snapshot (applied: 39044422, lastsnap: 39034420) Jul 3 04:00:59 tools-k8s-etcd-02 etcd[515]:... [16:29:02] halfak: just kill it and recreate [16:29:32] andrewbogott: merp, disable that button as a unknown? [16:29:42] for now anyway [16:29:43] 06Labs, 10Horizon: Investigate (and probably disable) 'rebuild instance' option - https://phabricator.wikimedia.org/T140259#2458418 (10Andrew) [16:29:52] chasemp: ^ [16:30:23] heh gotcha [16:30:27] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458432 (10yuvipanda) That's a different timeperiod than when the other nodes died out... [16:30:48] andrewbogott: i tested that option once, some months ago: It's liek deleting and recreating with same name + other image [16:31:09] so I would disable it, to save users from get their data lost [16:31:20] *like [16:31:28] except it didn't seem to do even taht this time around [16:31:34] Luke081515: that's what it /should/ do but it sounds like it doesn't do that even [16:31:43] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458437 (10yuvipanda) I'm just going to close this out now - we have monitoring for it in T140246 and T140246. Will re-open if it happens again... [16:31:46] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2458439 (10yuvipanda) 05Open>03Resolved a:03yuvipanda [16:31:52] it's like a mystery light switch in your house that does nothing [16:31:57] andrewbogott: what is that button actually doing? [16:32:00] !log ores terminated ores-web-05 [16:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [16:32:32] Luke081515: I don't know how to answer that question any more clearly than I have already [16:32:46] hm [16:32:56] ok, so I think we should disable it, for more safty [16:33:00] *safety [16:39:52] andrewbogott, it's worked when I've tried rebuilding instances. 
[16:44:18] 06Labs, 10Labs-Infrastructure, 10Labs-Kubernetes: ksoftirqd taking up too much resources on tools-worker-1004 - https://phabricator.wikimedia.org/T140262#2458526 (10yuvipanda) [17:14:30] !log ores ladsgroup@ores-web-05:~$ sudo puppet agent -tv [17:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [17:21:48] !log ores deploying to web again [17:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [18:20:30] !log ores ladsgroup@ores-lb-02:~$ sudo service nginx restart [18:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [19:26:33] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Review disk overcommit ratio for Nova - https://phabricator.wikimedia.org/T140122#2459323 (10Andrew) 05Open>03Resolved [19:26:45] 06Labs, 10Labs-Infrastructure: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2459326 (10Andrew) 05Open>03Resolved [19:33:23] aude, do you know if the maps project on labs is still alive? [19:33:38] Is the OSM database up to date? [19:33:59] Who can I talk to about getting updated coastlines imported? [19:34:51] Does Alexandrios Kosiaris still exist? [19:34:57] What happened to Apmon? [19:35:21] Who has access to the OSM postgres server? [19:35:27] so many questions! :-D [19:39:56] dschwen: indeed a plethora of questions! [19:40:47] Alex is akosiaris when he is on irc. I think he's on holiday at the moment [19:41:27] As far as I know the maps project is still running. I can't say anything about the update process for it. [19:42:43] dschwen: it looks like maxsem may have touched the maps servers last -- https://wikitech.wikimedia.org/wiki/Nova_Resource:Maps [19:43:17] you should be able to find him in #wikimedia-tech and/or #wikimedia-discovery [19:47:07] dschwen, what's up? [19:54:20] how do I get a token from an app to log in on wikitech? [19:54:49] hey MaxSem! [19:55:04] Do you have answers for me? [19:55:09] :-) [19:55:24] ragesoss: by enabling two factor authentication in preferences [19:55:50] Is the OSM database up to date? No idea. Check latest osm_id? [19:55:57] valhallasw`cloud: I did that a while ago. which is why I can't log in now! [19:56:03] Hmm... let me see [19:56:13] Who can I talk to about getting updated coastlines imported? Phabricator. [19:56:38] dschwen> Does Alexandrios Kosiaris still exist? Depends:P [19:56:40] Ragesoss: uh. Use the app you used to confirm enabling 2fa? [19:56:51] What happened to Apmon? No idea. [19:56:59] Typically "Google authenticator" on android [19:57:03] dschwen> Who has access to the OSM postgres server? Ops? [19:57:07] Is he like Schroedingers cat? [19:57:18] Do Ops give a shit? [19:57:40] Ask them? :P [19:58:08] dschwen: maybe you should explain what you're trying to do ;-) [19:58:49] Not much to explain. I use the OSM data form our postgres OSM mirror [19:58:55] I need it to be up to date [19:58:58] valhallasw`cloud: thanks. that sounds vaguely familiar. the message (and documentation for 2fa) is not very helpful if you forget what app it is talking about. [19:59:31] you don't use google authenticator anywhere else? [20:02:07] so, did you check osm_id? [20:02:56] yeah [20:03:01] dang it, it is up to date :-) [20:05:13] Ops gives lots of poops as it were. Too many poops to flush if that metaphor holds. [20:05:23] haha [20:05:25] good to know [20:06:48] I'm not sure who is responsible for keeping the database up to date, though. 
Probably not ops, but the people in the labs 'maps' project? [20:07:27] With the historical assistance of alex is my understanding [20:09:41] yeah, that's why I asked about those people [20:09:53] Apmon did something "special" with the coastline data [20:10:07] I don;t remember the details. They are in an extra table in the DB [20:14:32] When will instance creation be re-enabled? It's just that I've managed to mess up an instance and would like to recreate it. [20:15:48] tom29739: Soonish. I can delete/recreate the instance for you if you like. [20:15:53] name/project? [20:16:06] project privpol-captcha [20:16:12] jessie? [20:16:26] instance captcha-logging-01 [20:16:27] Yeah. [20:17:02] and you're ready for me to delete everything that's on there now? [20:17:21] Yeah. [20:17:51] does it still need to be a 'medium'? [20:18:06] I shouldn't have thought so. [20:18:16] A small could probably do it. [20:18:44] ok, the new one is building now [20:19:48] Thanks. [20:20:00] should be ready now, or in a moment [20:21:34] It looks like I can create instances from wikitech too. [20:21:54] Like this: https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=create&project=privpol-captcha®ion=eqiad [20:24:52] 06Labs, 10wikitech.wikimedia.org: Labs front-page statistics are very wrong - https://phabricator.wikimedia.org/T139773#2459562 (10Andrew) [20:25:09] I'm getting this on puppet run on labs: Could not find data item scap::deployment_server in any Hiera data file and no default supplied at /etc/puppet/modules/scap/manifests/target.pp:97 on node wdq-beta.wikidata-query.eqiad.wmflabs [20:25:35] anybody knows what is supposed to be there on labs and should I do it manually or I missed something in configs? [20:25:57] SMalyshev: what project? [20:26:05] bd808: wikidata-query [20:26:17] tom29739: yeah, that's a mistake, but a soon-to-be-unimportant mistake [20:26:38] SMalyshev: has scap been setup there before? It's a little complicated to get running in a project [20:26:44] not impossible but complicated [20:27:03] You pretty much need a self-hosted puppetmaster and slatmaster [20:27:08] *saltmaster [20:27:10] bd808: it wasn't on scap before. But I don't need scap deployment, I just need puppet to run [20:27:18] bd808: I have self-hosted puppetmaster [20:27:58] some role you have enabled is pulling in scap::target which means you need scap3 setup [20:28:15] bd808: yes, wdqs has scap deployment now [20:28:24] but I'm not sure what that means for puppet/labs [20:28:44] there is no labs wide scap3 or trebuchet service [20:29:11] bd808: I don't really need it to deploy anything related to scap... I just want to test other puppet parts [20:29:41] But puppet needs it [20:29:51] bd808: so what exactly does puppet need? [20:30:08] "data item scap::deployment_server in any Hiera data file" [20:30:26] and then probably something else and something else again [20:30:29] bd808: I can add it to hiera. What should be there? 
[20:31:09] SMalyshev: the bits you need are probably all in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep somewhere [20:31:31] like `"scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs` [20:32:12] bd808: aha, thanks, will try that [20:32:55] I think Krenair was working on splitting up role::deployment::server to make it less horrible to setup in a project but I haven't followed along to see how far he got [20:33:20] looks like adding it to https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-query worked, at least now it doesn't bail out [20:33:30] \o/ [20:33:53] thanks! let's see how the run goes... but so far looks ok [20:33:56] The times I have setup a full deploy server were following the instructions at https://wikitech.wikimedia.org/wiki/Trebuchet#Using_Trebuchet_in_Labs [20:33:57] I think it got to the review stage bd808 [20:34:01] In operations/puppet.git. [20:34:14] Krenair: :( the palce refactors go to die [20:35:15] 10Labs-Kubernetes: Can't start k8s webservice for tool "admin-beta" - https://phabricator.wikimedia.org/T140303#2459635 (10bd808) [20:35:37] yuvipanda: ^ k8s is not liking me for some reason [20:35:43] 06Labs: Actually delete instance status pages when the instance is destroyed - https://phabricator.wikimedia.org/T140298#2459650 (10Aklapper) Please associate projects. :) Assuming this is #Labs. [20:37:06] 06Labs, 10Beta-Cluster-Infrastructure: Completely remove Beta Cluster dependency on NFS - https://phabricator.wikimedia.org/T102953#2459656 (10AlexMonk-WMF) [20:39:13] 06Labs, 10Beta-Cluster-Infrastructure: Completely remove Beta Cluster dependency on NFS - https://phabricator.wikimedia.org/T102953#2459672 (10AlexMonk-WMF) 05stalled>03Resolved Given that the only place NFS is mounted in deployment-prep is deployment-upload, which is shutoff and to be deleted at some poin... [20:43:07] bd808 namespace creation lags by like 5mins.. still a problem? [20:43:56] yuvipanda: yup. I created the tool several hours ago. It has a ~/.kube/config [20:44:16] Hmm super weird [20:44:28] (on phone now) [20:44:38] not the end of the world [20:44:57] I'll fire it up on SGE instead for now [20:45:15] Can you look at log of maintain-kubeusers service on tools-k8s-master-01? [20:45:45] !log deployment-prep RIP NFS [20:45:45] Please !log in #wikimedia-releng for beta cluster SAL [20:45:48] no [20:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [20:45:56] yuvipanda: sure. where does it hide? [20:46:05] Journalctl [20:46:10] Krenair: y u no obey the bot! ;) [20:46:29] journalctl -u maintain-kubeusers -l [20:47:29] Write config in /data/project/admin-beta/.kube/config [20:47:47] bd808, it doesn't make much sense to me to have deployment-prep using a separate SAL system, the releng one has stuff about CI [20:48:05] CI and beta cluster [20:48:15] reset [20:48:15] but whatever [20:49:09] yuvipanda: there is an error message about "mkdir /root/.kube/schema/v1.3.0wmf4-dirty: read-only file system" [20:49:21] ops [20:51:10] 10Labs-Kubernetes: Can't start k8s webservice for tool "admin-beta" - https://phabricator.wikimedia.org/T140303#2459715 (10bd808) ``` tools-k8s-master-01.tools:/var/log bd808$ sudo journalctl -u maintain-kubeusers -l --no-pager|grep -2 beta Jul 13 16:47:27 tools-k8s-master-01 maintain-kubeusers[11532]: finished... [20:51:37] bd808 aaah. It seems to be the problem. 
Need to pass --validate=false to the kubectl command in maintain-kubeusers [20:53:48] yuvipanda, it moaned when it couldn't write to ~/.kube/schema when I tried to run kubectl with a yaml file [20:54:02] I had to "take ~/.kube" [21:13:00] (03PS1) 10Reedy: Add generated timestamp [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298878 [21:13:07] 06Labs, 10wikitech.wikimedia.org: Labs front-page statistics are very wrong - https://phabricator.wikimedia.org/T139773#2459785 (10Andrew) p:05Triage>03Low [21:13:21] 06Labs: Actually delete instance status pages when the instance is destroyed - https://phabricator.wikimedia.org/T140298#2459786 (10Andrew) p:05Triage>03Low [21:14:31] (03CR) 10Legoktm: [C: 032 V: 032] Add generated timestamp [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298878 (owner: 10Reedy) [21:21:33] 06Labs, 10Horizon, 13Patch-For-Review: Disable renaming of instances on Horizon - https://phabricator.wikimedia.org/T139768#2459816 (10Andrew) 05Open>03Resolved [21:28:30] Yeah new k8s version [21:28:38] I'll deal with it on Friday [21:28:50] This should only affect new tools [21:32:35] (03PS1) 10Reedy: Display if tasks (unresolved) are tagged as easy [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298882 [21:40:48] (03PS2) 10Reedy: Display if tasks (unresolved) are tagged as easy [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298882 [21:47:13] (03CR) 10Legoktm: [C: 032 V: 032] Display if tasks (unresolved) are tagged as easy [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298882 (owner: 10Reedy) [21:50:09] (03PS1) 10Legoktm: Fix syntax error [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298884 [21:50:25] (03CR) 10Legoktm: [C: 032 V: 032] Fix syntax error [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298884 (owner: 10Legoktm) [21:54:32] (03PS1) 10Legoktm: Bump cache key for phab tasks [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298886 [21:54:44] (03CR) 10Legoktm: [C: 032 V: 032] Bump cache key for phab tasks [labs/tools/extreg-wos] - 10https://gerrit.wikimedia.org/r/298886 (owner: 10Legoktm) [23:09:39] bd808: Hey Bryan, do all the tool labs dbs get backed up automatically anywhere right now? [23:09:54] I mean the user dbs, not the mirror dbs [23:09:55] nope [23:10:02] oh [23:10:08] none of it is backed up afaik [23:10:33] the programs folks were asking about that. Maybe it's something to add to your roadmap :) [23:10:56] db backups are a nightmare [23:11:18] true, but so is lost data [23:11:34] anyway, just wanted to ask [23:11:38] thanks for the info [23:11:46] a tool can take its own snapshots somehow [23:11:56] mysqldump etc [23:12:40] it would be nice to at least do a monthy backup of everyone's dbs, just in case of asteroid collision or something [23:12:51] just sayin [23:12:55] even in prod we don't really have db backups [23:13:33] how long would we hang on to those backups for? [23:13:45] how would we restore them? [23:14:08] I get the desire for it, but its a really non-trivial problem [23:14:55] LVM snapshots would be the closest thing that might be reasonable to do at some point [23:14:56] this isn't really a novel idea. Most hosting services offer some kind of automatic back-up service. 
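(Editor's note: for context on the --validate=false workaround mentioned a few messages up, a rough sketch of the general idea, assuming a kubectl invocation of roughly the shape maintain-kubeusers would use; the file name is illustrative.)

```bash
# kubectl caches the API schema under ~/.kube/schema for client-side
# validation; on a read-only home directory that mkdir fails, as in the
# "mkdir /root/.kube/schema/...: read-only file system" error above.
# Skipping client-side validation avoids the cache write entirely:
kubectl create -f namespace.yaml --validate=false
# Alternatively, make the cache location writable for the account running
# kubectl (e.g. bd808's "take ~/.kube" workaround for a tool).
```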
But yes, not a trivial problem :P [23:15:36] I can't think of one that offers it for free :) [23:15:47] yes, that's true :) [23:15:57] anyway, we can discuss more later [23:16:15] if it's important there should be a path to real production [23:16:22] for a tool [23:16:26] ^ that, so much that [23:17:00] eduwiki is trying to run a "prod" service in labs and its very scary to me [23:17:08] I disagree. Most stuff on Tool Labs is important, but I wouldn't want it all on production [23:17:47] anyway, I'm going to recommend that we build a custom back-up solution for them in the meantime [23:18:55] it would be nice to understand how many tools are generating and storing content that is actually important. [23:19:26] most of the tools I have worked on are just massaging data from somewhere else [23:20:01] sal and bash are storing data, but neither is really "important" in the grand scheme
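(Editor's note: as a rough illustration of the "mysqldump etc" self-service snapshots valhallasw`cloud alludes to above, a minimal sketch a tool could run from cron; the host, database name and retention period are all assumptions to be adjusted to the tool's own setup.)

```bash
# Hypothetical nightly snapshot of a tool's user database on tools.labsdb.
# Assumes credentials in ~/replica.my.cnf and a database named after the
# tool's MySQL credential user, e.g. s51234__mytool.
DB="s51234__mytool"
OUT="$HOME/backups/${DB}_$(date +%F).sql.gz"
mkdir -p "$HOME/backups"
mysqldump --defaults-file="$HOME/replica.my.cnf" \
          -h tools.labsdb --single-transaction "$DB" | gzip > "$OUT"
# Prune dumps older than 30 days.
find "$HOME/backups" -name "${DB}_*.sql.gz" -mtime +30 -delete
```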