[00:09:50] Labs, Tool-Labs: https://tools-static.wmflabs.org/video2commons/video2commons.js 403 Forbidden despite normal file permissions - https://phabricator.wikimedia.org/T124773#2050221 (zhuyifei1999) [00:09:52] Labs, Tool-Labs: Static server returns HTTP 403 Forbidden for valid files in some cases - https://phabricator.wikimedia.org/T112388#2050222 (zhuyifei1999) [00:52:08] andrewbogott: I'm currently seeing tools-login.wmflabs.org resolving to tools-bastion-05, and when I try to submit any jobs there it complains "Unable to run job: denied: host "tools-bastion-05.tools.eqiad.wmflabs" is no submit host." [00:52:37] I believe he's out for the evening now [01:17:33] anomie: I had the same issue, I connected with ssh -A and then did: nemobis@tools-bastion-05:~$ ssh tools-bastion-02.eqiad.wmflabs [01:43:55] Anomie, I'll have a look in a few minutes. [01:57:49] hmph, I’ve no idea how puppet failed to make -05 a submit host. But I’ll just point everyone back to -02 for now [01:59:13] anomie or Nemo_bis, want to try reconnecting to tools-login and seeing if you can submit now? [02:04:01] andrewbogott: Now points to tools-bastion-02 and seems to work. [02:04:06] ok [02:04:14] that should do for now [02:04:17] thanks for checking [03:07:13] !log tools.directory Tried to start webservice; fcgi application crashed on first access [03:55:29] bd808: thanks! [03:55:35] enterprisey: yw [03:55:47] sorry I didn't see your asks for help earlier [03:55:55] no problem at all [03:56:03] did you use the -f option on qdel or something? [03:56:18] yeah, but I had to do it with sudo [03:56:22] oh, okay [03:56:30] any idea why the normal qdel was hanging so badly? [03:57:16] no, but there have been several of these stuck jobs following the rolling restart last week [03:58:39] okay, good to know [03:59:21] My grid engine skills aren't quite up to indepth debugging yet. 
I mostly hit things with larger and larger hammers [03:59:33] :D [03:59:53] there isn't a good reason to let normal users use -f with qdel, right? [04:00:32] it seems that a tool account should be able to qdel -f it's own jobs, yes [04:00:46] it's basically `kill -9` [04:00:57] Labs, Tool-Labs, DBA: Labs queries die after 5 hours - https://phabricator.wikimedia.org/T127266#2050393 (Dispenser) It now dies after 1 hour! https://commons.wikimedia.org/wiki/User:Dispenser/Wrong_Extension Sat Feb 20 13:20:32 UTC 2016 ERROR 2013 (HY000) at line 2: Lost connection to MySQL server du... [04:01:16] I think there's a config option for this in the sun grid engine settings [04:01:25] I tried doing that first: become the tool and qdel -f. It didn't change the state [04:01:50] oh, I see [04:03:09] valhalla would be the person to ask if there is some config we are missing for force deletes [04:03:37] might be worth opening a phab task about [04:04:28] will do [04:05:12] The man page does say something about ENABLE_FORCED_QDEL being needed for non-admins [04:05:34] Labs, Tool-Labs, DBA: Tool Labs queries die - https://phabricator.wikimedia.org/T127266#2050394 (Dispenser) [04:08:35] Labs, Tool-Labs: Labs users should be able to force-delete their own jobs - https://phabricator.wikimedia.org/T127681#2050398 (APerson) [04:46:57] Labs, Tool-Labs: Install rmytop - https://phabricator.wikimedia.org/T58999#2050432 (bd808) declined>Open a:coren>None Asked for in the Tool Labs user survey. [05:14:29] Labs, Tool-Labs, Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#2050489 (bd808) See notes at rOPUPe38255caa48e393cad47092f815f11b3345b98e4#1392953 for a possible fix for tool-uwsgi-python that matches the config used in T104374#1911373. 
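The escalation discussed in the qdel exchange above (a plain `qdel` first, then `qdel -f`, which is likened to `kill -9`) could be sketched roughly as below. This is only an illustration under stated assumptions: `delete_job`, `run_command` and the 30-second timeout are hypothetical names and values, not anything from the Tool Labs setup, and the sketch assumes the exit status reflects success, which may not hold for the stuck jobs described in the log. Per the man page mentioned above, a forced delete by a non-admin may also depend on ENABLE_FORCED_QDEL being set.

```python
import subprocess
from typing import Callable, List


def run_command(argv: List[str], timeout: float = 30.0) -> int:
    """Run a command and return its exit status; a hang counts as failure."""
    try:
        return subprocess.run(argv, check=False, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # The log mentions the normal qdel hanging, so treat a timeout as an error.
        return 124


def delete_job(job_id: str, run: Callable[[List[str]], int] = run_command) -> str:
    """Try a normal qdel first and fall back to a forced qdel only if it fails.

    Returns "deleted", "force-deleted" or "failed".
    """
    if run(["qdel", job_id]) == 0:
        return "deleted"
    # `qdel -f` is the forced variant discussed in the log.
    if run(["qdel", "-f", job_id]) == 0:
        return "force-deleted"
    return "failed"
```

Passing the runner in as a parameter keeps the sketch testable on a machine without grid engine installed.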
[05:25:23] 6Labs, 10Tool-Labs: Installer Bundler for managing Ruby application dependencies - https://phabricator.wikimedia.org/T127685#2050494 (10bd808) [06:31:14] yuvipanda, Can I yell at you right now, or are you sleeping? :p [06:31:24] Oh never mind [06:50:17] Cyberpower678: Y.uvi is going to be out on vacation for the next 2 weeks. [07:15:13] 6Labs, 10Tool-Labs: tools-login.wmflabs.org points to tools-bastion-02 - https://phabricator.wikimedia.org/T127686#2050521 (10scfc) [07:21:10] Hello, I find something wrong, 'setup-tomcat' is not working. BASH says "command not found" [07:21:37] (I have become a tool) [07:22:58] Can't use tomcat [07:30:09] MGdesigner: have you already answered your last question before you left the other day for yourself? i was too late … [07:34:31] gifti: Sorry, I didn't get that final message that day. What I remembered is I type I am reading https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Other_web_servers [07:35:12] I am testing it,and also found can't use tomcat. [07:38:16] does the portgrabber thing work? [07:51:23] gifti: About portgrabber, some path problem happened to me,so I am debuging. [07:53:44] gifti: My server is a java program with itself jetty. It launched by a script. The script can't be launched normally,but by portgrabber httpserver.err give my can't find the main java class. [07:59:36] correct: give me can't find the main java class [08:08:41] gifti:After debuging, I found the situation is portgrabber can't accept non CJKV character file name in utf-8. [08:11:25] gifti: I guess portgrabber only accept ASCII? My upstream program is hard to modify non-ascii filenames So I change to use tomcat. [08:20:10] huh [08:21:10] you could also file a bug to support arbitrary characters in portgrabber [08:21:25] (but i don't know if they will do it) [08:22:24] gifti: OK. And " -bash: setup-tomcat: command not found" the issue where should I report? 
[08:23:32] phabricator.wikimedia.org, tag it with Tool-Labs [08:30:43] 6Labs, 10Tool-Labs: setup-tomcat does not work - https://phabricator.wikimedia.org/T118094#2050580 (10Shoichi) p:5Low>3High I need it to launch my java servlet for runing our plan. [08:32:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [08:32:56] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [08:33:18] PROBLEM - Puppet staleness on tools-exec-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:34:32] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:34:52] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228) [08:35:06] PROBLEM - Puppet staleness on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:35:59] PROBLEM - Puppet staleness on tools-exec-1213 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:36:23] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 790623 bytes in 6.010 second response time [08:36:47] PROBLEM - SSH on tools-webgrid-lighttpd-1206 is CRITICAL: Server answer [08:36:55] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:36:57] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:37:09] PROBLEM - Free space - all mounts on tools-worker-1004 is CRITICAL: CRITICAL: tools.tools-worker-1004.diskspace.root.byte_percentfree (<100.00%) [08:38:25] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 
100.00% of data above the critical threshold [43200.0] [08:46:48] RECOVERY - SSH on tools-webgrid-lighttpd-1206 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0) [08:50:19] 6Labs, 10Tool-Labs: bug :Portgrabber don't support non ASCII characters - https://phabricator.wikimedia.org/T127689#2050607 (10Shoichi) [08:57:24] 6Labs, 10Tool-Labs: setup-tomcat does not work - https://phabricator.wikimedia.org/T118094#2050625 (10scfc) @coren, do you have a backup of `setup-tomcat` or a rough outline what it did? [08:57:47] PROBLEM - SSH on tools-webgrid-lighttpd-1206 is CRITICAL: Server answer [09:07:47] RECOVERY - SSH on tools-webgrid-lighttpd-1206 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0) [09:09:14] 6Labs, 10Tool-Labs: bug :Portgrabber don't support non ASCII characters - https://phabricator.wikimedia.org/T127689#2050607 (10scfc) `portgrabber` should not care about such things. What is the exact error message that you are getting? I see that `/data/project/idsgen/error.log` is empty. [09:13:49] PROBLEM - SSH on tools-webgrid-lighttpd-1206 is CRITICAL: Server answer [09:20:53] 6Labs, 10Tool-Labs: bug :Portgrabber don't support non ASCII characters - https://phabricator.wikimedia.org/T127689#2050646 (10Shoichi) After "jstart -mem 4G -l release=trusty -q webgrid-generic ./httpserver.sh" httpserver.err will show: Error: Could not find or load main class ������������.������������.Http... [09:37:03] RECOVERY - Puppet failure on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [10:32:43] bd808: I'm starting elastic search upgrade for the Discovery cluster. We should probably also upgrade logstash. What tests should we do before upgrade ? [10:33:06] bd808: I'm doing the upgrade on labs at the moment, production probably next week [10:53:58] Can someone here please help fix https://tools.wmflabs.org/glamtools/glamorous.php [10:54:07] It's been dead for 1-2 weeks now... 
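One way the run of replacement characters in the `Could not find or load main class` error reported in T127689 could arise is UTF-8 bytes being decoded in an ASCII-only environment. That is an assumption, not a diagnosis: scfc notes above that `portgrabber` should not care about file names, and the real class name is not recoverable from the log, so the name below is a made-up stand-in.

```python
# Hypothetical four-character CJK package name; the real one is unknown.
name = "資料處理"

# Each of these characters takes three bytes in UTF-8.
utf8_bytes = name.encode("utf-8")

# Decoding those bytes as ASCII replaces every undecodable byte with U+FFFD.
garbled = utf8_bytes.decode("ascii", errors="replace")

print(len(utf8_bytes))
print(garbled)
```

If this is what happened, the fix would lie in the locale or encoding of the job environment rather than in the file names themselves.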
[10:59:33] Does it need a reboot following SGE reboot due to labsdb1002 (c2.labsdb)? [11:02:57] RECOVERY - Puppet failure on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:11:54] yuvipanda: Not enough mails in my inbox :P Trying to get more by subscribing to shinken alerts. I need your help on that (https://gerrit.wikimedia.org/r/#/c/270729/). Ping me when you have 5' [13:28:00] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:49:06] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#2050907 (10Andrew) This has I believe just happened to tools-webgrid-lighttpd-1206.eqiad.wmflabs [13:52:36] 6Labs, 10Labs-Infrastructure: virt host reboots sometimes breaks puppet on instances - https://phabricator.wikimedia.org/T127698#2050910 (10Andrew) [14:03:48] RECOVERY - SSH on tools-webgrid-lighttpd-1206 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0) [14:09:48] PROBLEM - SSH on tools-webgrid-lighttpd-1206 is CRITICAL: Server answer [14:15:05] RECOVERY - Puppet staleness on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [3600.0] [14:39:47] RECOVERY - SSH on tools-webgrid-lighttpd-1206 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0) [14:57:50] is changing 'global groups' under 'configure' working for you? I don't seem to be able to e.g. add 'role::puppet::self', POSTing to Special:NovaInstance seems to hang there and eventually redirect me to instance list, but the group isn't applied [14:58:00] no logs of errors on silver afaics in apache [15:09:06] andrewbogott perhaps? or others if they can confirm/deny they are seeing the same? [15:09:53] godog: I can look shortly... [15:11:16] godog: to confirm, you’re changing the config of an instance, or manipulating the list of available puppet classes? [15:12:34] andrewbogott: thanks! 
I'm trying to change the config of an instance, namely add 'role::puppet::self' to 'filippo-test-jessie2' [15:12:47] ok, lemme see if I can reproduce [15:15:58] godog: works for me :( what project is that instance in? [15:16:19] andrewbogott: sigh :( project is 'monitoring' [15:17:19] andrewbogott: if it works for you can you add role::puppet::self to that instance for me? :D [15:17:37] godog: yes, done [15:17:51] let me log in as a lower-priv user and see if it still works [15:18:48] andrewbogott: thanks, the curious thing is that for me POST Special:Novainstance just hangs there for 5min and then goes back to the instance list [15:20:42] 6Labs, 10Labs-Infrastructure, 10Deployment-Systems, 6Release-Engineering-Team: integration-make-wmf-branch instance stall on Failed to start LSB: NFS support files common to client and server. - https://phabricator.wikimedia.org/T127705#2051101 (10hashar) [15:24:36] andrewbogott: not sure what changed but seems to be working now for me too? [15:27:28] godog: ok, I don’t know if that’s good news or bad news [15:27:33] let me know if it happens again [15:28:03] andrewbogott: will do, thanks! [15:46:05] PROBLEM - Puppet failure on tools-k8s-master-01 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [15:54:09] labs-morebots, everything good? [15:54:10] I am a logbot running on tools-exec-1216. [15:54:10] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:54:10] To log a message, type !log . 
[15:55:40] !log tools redirecting tools-login.wmflabs.org to tools-bastion-05 [15:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [16:10:57] RECOVERY - Puppet failure on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:26:35] Labs: Write some labs tests that monitor login and sudo permissions - https://phabricator.wikimedia.org/T127716#2051454 (Andrew) [16:28:02] Labs: Move labs auth.logs to central logging - https://phabricator.wikimedia.org/T127717#2051476 (Andrew) [16:33:37] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:36:32] Hi Everyone! [16:38:41] A new user johnsmith2167 came on Friday to vadalize my [[user:Youni Veciti/Memo cmd]] page. [16:40:14] sorry : https://wikitech.wikimedia.org/wiki/User:Youni_Verciti/Memo_cmd vandalized by him : https://wikitech.wikimedia.org/wiki/User:Johnsmith2167 [17:28:20] bd808: currently here? [17:28:41] In meetings this morning Luke081515 [17:29:02] ok, then I would ping you later ;) [17:29:06] it's not urgent [18:26:05] RECOVERY - Puppet staleness on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:34:24] hi all [18:34:33] could some admin here please have a look at https://phabricator.wikimedia.org/T126890 [18:34:48] I dont seem to be able to log in to gerrit [18:40:04] RECOVERY - Puppet failure on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:14:26] Labs, Tool-Labs: tools-login.wmflabs.org points to tools-bastion-02 - https://phabricator.wikimedia.org/T127686#2052436 (Andrew) I've moved things from -02 to a new host, -05. There's no real reason that -05 can't be a permanent home, other than it having a 5 instead of a 1. [19:26:23] Hi. Are there more problems known than a DB server having drive issues? 
[19:26:50] https://tools.wmflabs.org/persondata/test.php doesn't load, but it's only a hello world php without database access [19:31:04] apper: no known issues; other tools seem to be working [19:33:34] andrewbogott: thanks. Restart of webservice helped... [19:33:41] great [19:41:01] (PS1) Youni Verciti: First reviews from Hashar [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/272524 [19:46:06] (CR) Youni Verciti: "My firsts commits... Here i tried to briefly correct Hashar'reviews. I still have a pb with *.pyc and i'm not able to push. Help my friend" [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/272524 (owner: Youni Verciti) [19:59:04] so, does anyone have a clue where parsoid-prod.wmflabs.org is defined? [19:59:18] apparently it's not a proxy/IP in the groups I'm a member of on wikitech [20:00:04] it's apparently some kind of instance proxy, which sets that hostname up as a public IP on labs, which then does an HTTP proxy into the public IP for parsoid-lb.eqiad.wikimedia.org over in prod :/ [20:00:31] (which is about to die) [20:06:34] bblack: Krenair should be able to help find it :) [20:34:19] bd808: Meeting finished? ;) [20:34:36] Luke081515: just started another one :/ [20:35:41] bd808: How long do I have to wait? (Just to make a schedule) [20:38:56] Luke081515: about an hour from now I should be open [20:39:08] ok, that's ok [20:49:29] Labs, Patch-For-Review: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#2052804 (Andrew) > From the openstack mailing list: "As I understand it, in Kilo and later mdns must be primary and send data to other backends via XFR." And now a designate developer (Kiall) has correc... [20:51:47] andrewbogott: are you still around by any chance? I suspect nodepoolmanager on contintcloud manager is lacking some permissions. It hasn't refreshed images for 7 days. 
[20:52:09] andrewbogott: somehow it is caught polling on labnet1002.eqiad.wmnet presumably while attempting to snapshot an instance [20:52:16] hashar: didn’t we fix that < 7 days ago? [20:52:44] the last known build is from 172 hours ago or a bit more than 7 days [20:52:58] and the next build 6 days ago hasn't completed somehow [20:53:15] it is hammering labnet1002 right now, I am trying to sniff it but without much success [20:53:27] labnet1002 is just the nova api [20:53:29] probably [20:54:01] is there a way for a client to ask for all permissions it has ? [20:55:01] hashar: not that I know of [20:58:31] hashar: so the nova api isn’t involved in instance creation/deletion [20:58:35] that’s glance, which lives on labcontrol [20:59:19] but could nova be used to stop an instance and ask for a snapshot ? [21:00:07] I will try to get some traces anyway [21:00:24] https://phabricator.wikimedia.org/P2649 [21:00:25] hashar: ^ [21:00:35] lots of that in the glance api logs [21:01:08] oh [21:02:21] 54de67d5-621d-4666-9bb4-2b9c5fb62321 that is the instance being spawned which is supposed to then become a snapshot [21:03:29] those logs are not reachable unless ones is a labs ops right ? [21:04:44] andrewbogott: thanks for the trace. I am filling a task, there is no urgency the current snapshot is working fine [21:04:49] hashar: correct [21:04:56] hashar: do you know what the problem is, specifically? [21:05:05] Is it that nodepool can’t see the image in question? [21:05:19] andrewbogott: I have no clue, I will take traces first and fill them in the task [21:05:34] hashar: ok, let me know [21:05:45] you’re probably correct that it’s a rights issue, just hard to know which right [21:05:50] nodepool can emit a stack dump by sending SIGUSR2 to it, so I will do just that, dig in the code and try to figure out what part of the code is involved/looping [21:07:04] andrewbogott: thank you !! 
sorry for the interupt [21:22:06] pff [21:22:18] even wikitech is gone https://wikitech.wikimedia.org/wiki/ [21:23:20] hashar: Who gets the t-shirt this time? ;-) [21:24:19] hashar: that's why we have static right ? :) [21:24:38] thedj: yeah yeah :D [21:24:41] Wikipedia and Wikidata seem to be back [21:24:55] thedj: I am more worried about "real" wikis having 403 [21:24:57] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:24:58] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [21:25:20] It was actually 404's hashar on both wikidata as nlwp [21:27:26] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [21:29:10] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [21:31:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0] [21:31:48] 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Scaling, 7Nodepool: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053017 (10hashar) [21:34:33] Luke081515: pong! [21:34:51] ah :) [21:35:29] I got two problems: at first: A CA puppet file sets for exmaple "centralauth-lock" permissions, I can't override them via the normal localsettings file [21:35:38] is there a way to override them? [21:36:44] overrides don't work from either settings.d files or after CommonSettings.php is sourced in LocalSettings? [21:37:28] at the moment I stored them at the localsettings, the global localsettings [21:37:44] do they have to be in lexical order before the CA role sets permisisons? 
[21:37:47] *permissions [21:39:35] They would need to be set after CommonSettings.php is sourced as that is where the puppet managed settings get loaded. But as I said the other day, in general mw-vagrant settings should be done in files in the /srv/mediawiki-vagrant/settings.d or settings.d/wikis/$WIKI/settings.d directories [21:40:45] ok, I will try it. But this is blocked by the second problem: [21:40:51] acutally vagrant don't loads any config [21:40:57] You can see the priority order in puppet/modules/mediawiki/templates/multiwiki/CommonSettings.php.erb [21:41:05] I already tried vagrant reload, provision and mwscript [21:41:12] doesn't load any config? [21:42:04] yeah :-/ [21:42:27] I set new permissions, but he doesn't load them [21:43:17] are you sure that your changes aren't just being overwritten by a puppet managed config file that is loaded later? [21:44:02] yeah. I set globalgroupmembership local, this was never used localy before [21:44:05] `vagrant provision` would wipe out any hand edits to the puppet managed config [21:46:26] ==> default: Notice: Finished catalog run in 26.98 seconds [21:46:56] but the rights didn#t change [21:47:03] I'm trying to reboot the whole instnace [21:47:05] *instance [21:48:00] hmmm.. I'm reading CA code [21:48:31] it looks like maybe globalgroupmembership has to be bootstrapped in the CA database before the special page works [21:48:49] * bd808 has never wandered into this code before [21:49:17] db_patches/patch-globalgroups.sql bootstraps a row into global_group_permissions [21:49:27] that's not the problem, I solved this by DB hack with krenair yesterday [21:49:37] but I want now remove this right localy [21:49:51] but this is not possible, even after a whole reboot of the whole insstnace [21:49:56] *instance [21:50:57] global group permissions only come from the database don't they? [21:51:04] not from config files? 
[21:51:22] sorry, maybe this was missunderstanding: [21:51:42] at first I assigned the local right " [21:51:53] globalgrouppermissions and globalgroupmembership to the group steward [21:51:57] but now I want to remove it [21:52:04] but he doesn't load the config [21:52:17] at another wiki I inserted the right for * to read [21:52:28] this part of the config is not loaded too [21:52:40] * still can't read [21:52:55] can you paste your config somewhere? [21:53:34] bd808: Sure, but can we solve this tomorow, I don't have more time at the moment :-/ [21:53:42] I would create a paste at phab [21:54:01] cool. ping me on the paste and I'll take a look [21:54:08] ok, thanks :( [21:54:10] *:) [21:54:12] wrong button [21:57:46] Labs, Patch-For-Review: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#2053143 (Andrew) Kiall has now vetted our config, and I'm about to submit a patch that implements his changes. He suggested that a possible cause of the lock-ups is one of the pdns servers locking the d... [21:59:11] Labs, Patch-For-Review: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#2053144 (Andrew) Ahah! The firewall blocks traffic between pdns on holmium and mdns on labservices1001. That's not specifically what Kiall predicted but it fits his theory. Firewall patch incoming... [22:02:20] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053150 (hashar) Using the poor man debugger on labnodepool1001.eqiad.wmnet as nodepool user: `strace -f -e recvfrom,sendto -s 102... [22:03:28] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053156 (hashar) python-novaclient has not been updated on labnodepool. Maybe it should... 
No idea really ``` labnodepool1001:~$... [22:04:48] RECOVERY - Puppet failure on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [22:05:02] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [22:05:47] !log wikimetrics renaming puppet roles, short maintenance [22:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikimetrics/SAL, Master [22:05:57] !log analytics renaming wikimetrics roles, short maintenance [22:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Analytics/SAL, Master [22:06:16] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [22:06:30] andrewbogott: was last monday (feb 15th) when you did an openstack upgrade ? [22:07:50] hashar: I didn’t do an upgrade, but if you look at the gerrit history in the openstack module you can see my recent changes to policy files. [22:07:59] It’s likely I did /that/ on Monday [22:08:50] yeah something screwed up somehow :-}}} [22:10:06] what I traced is a loop doing GET GET /v2/contintcloud/images/9bc7a824-2f28-44e8-92ee-71ceefe4ad72 [22:10:09] that yields: {"image": { "status": "SAVING" [22:10:22] and at some point that turns in a 404 "message": "Image not found.", :D [22:10:51] ‘saving’? [22:11:05] yeah it spawn an instance out of the image, provision the instance [22:11:08] ‘GET GET’? [22:11:09] then ask for a snapshot [22:11:17] single get obviously sorry [22:11:33] so at some point in the snapshot creation on the server side, something must explode somehow [22:11:42] hashar, if you ‘get’ with curl do you see the same behavior? 
[22:11:52] Or, with ‘glance show’ [22:12:07] should try that [22:14:13] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [22:15:28] i just clicked "configure instance" on an existing , running instance in an active project, and then i get [22:15:32] The specified resource does not exist. [22:15:42] but i need to change the config of this instance [22:16:12] mutante: I can check after I sort out hashar’s thing. Meanwhile, ‘have you tried turning it off and back on again’? [22:16:56] andrewbogott: switch to mutante I am heading to sleep soonish [22:16:58] no, it's not my instance, it's wikimetrics by analytics and the role is called "production" ?/ [22:17:12] mutante: I mean, try logging out and back in to wikitech [22:17:15] andrewbogott: and my issue has no kind of emergency. It is just a pet project for my evening :} [22:17:21] andrewbogott: oh! of course, will do [22:18:21] hashar: that image has been deleted [22:18:24] tried, does not fix it [22:18:25] so the 404 seems justified... [22:18:29] mutante: ok, looking [22:18:39] project: wikimetrics instance: wikimetrics-01 [22:18:46] need to change the name of the puppet class it's using [22:18:58] error appears when clicking "configure instance" [22:19:09] just added myself to that project and as admin a few minutes ago [22:19:13] if that can be related [22:19:28] possible [22:19:30] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=2cb1a4eb-a7b1-4286-8855-a4a013db6ff0&project=wikimetrics&region=eqiad [22:19:34] full URL to it [22:20:13] andrewbogott: is there a way to change the config directly in the backend? [22:20:20] mutante: it’s in ldap [22:20:42] hmmm, if it was a script cool, but that "vildap" stuff is so scary [22:20:49] is it that? 
[22:20:50] directly [22:21:02] there’s no tool other than wikitech [22:21:39] mutante: I suspect a caching bug, can you direct the requestor (who has been in the project for a while) to try? [22:21:43] do you get the same error? [22:21:51] I do, but I just added myself too [22:21:55] there is no other requestor, i am the requestor [22:22:02] i renamed the class [22:22:08] i just need to tell it the new class name [22:22:25] ok, I’ll do it in ldap [22:22:30] what’s the old and new class name? [22:22:46] old: role::wikimetrics [22:22:52] new: role::wikimetrics::production [22:23:15] yes, sounds strange but i'm just making it match the existing description :) [22:23:30] didnt invent it, just want it out of manifests/role [22:23:36] thank you [22:25:02] mutante: shall I replace all instances of that role name? It’s in here three times [22:25:17] or just for that one instance? [22:25:51] (anyway, done for that one case) [22:26:27] andrewbogott: it's a mess, i checked all that. there are like 5 instances, and only one actually uses it .. all 3 is perfect but 1 of them is puppet disabled [22:26:37] I’ll do the others, hang on [22:27:08] ok, done [22:27:11] :) thanks, also, the "production" role is on "staging1" and staging is on "staging" [22:27:24] and more like that [22:27:32] runs it on wikimetrics-01 to confirm [22:27:37] and it's fixed , all good [22:27:49] thanks again [22:28:48] man, a few weeks of writing php and all of a sudden I’m adding semicolons; [22:28:49] all; [22:28:51] over; [22:28:52] the; [22:28:53] place; [22:31:54] ;) [22:33:43] andrewbogott: check this out :) https://gerrit.wikimedia.org/r/#/c/269902/ [22:33:55] it has been tested, and it will be noop [22:34:08] and isn't that more readable etc ? 
[22:34:34] i'm also double confirming now on tools hosts [22:36:48] man that is a /lot/ of tiny files :) [22:37:19] looks better though [22:37:21] yea, so if something changes it's guaranteed it just touched that in a review [22:38:01] well, yea, it's cause it's a lot of tiny classes [22:38:11] the style thing is just "one file per class" [22:38:19] andrewbogott: I am off for tonight, I think we will want to revisit/investigate the available glance policy :-D "manage_image_cache" seems to be needed [22:38:38] hashar: ok! [22:39:21] and maybe we can find a way to allow more rights but at the same time solely for the 'contintcloud' tenant [22:40:07] couldn't found a nice doc of glance policy with details about what each parameters are for [22:51:48] hashar: yeah, in general the policies are poorly documented. [22:53:43] how is it possible that the "quarry" project does not have any instances? [22:53:55] while we know that it runs somewhere [22:54:49] latest SAL line "deploying to master and hoping" hah [22:55:10] on 4th of July [22:56:16] looks like the bug is that they dont get displayed [22:56:29] they dont appear on Special:NovaInstance with the project [23:00:14] 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Scaling, 7Nodepool, 13Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053461 (10hashar) Not yet .. ``` lang=json GET /v2/contintcloud/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde H... [23:02:32] 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Scaling, 7Nodepool, 13Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053463 (10hashar) If I try to create a server image from a running server it works just fine, i.e.: openstac... 
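The loop hashar traced earlier (a repeated GET on the image that reports status SAVING and at some point turns into a 404 "Image not found.") can be pictured with the small sketch below. It is not nodepool's actual code: `wait_for_image`, `fetch_status` and `max_polls` are hypothetical names introduced for illustration, and a real client would sleep between polls.

```python
from typing import Callable, Optional


def wait_for_image(fetch_status: Callable[[], Optional[str]],
                   max_polls: int = 10) -> str:
    """Poll an image until it leaves the SAVING state.

    `fetch_status` is a hypothetical callable that returns the image's
    status string, or None when the API answers 404.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status is None:
            # The image disappeared while the snapshot was still being saved.
            return "gone"
        if status.upper() != "SAVING":
            # Any other state (for example ACTIVE) ends the wait.
            return status.lower()
    return "timeout"
```

Separating the "gone" outcome from a plain timeout makes it easier to tell a deleted image, as in the 404 above, from a snapshot that is merely slow.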
[23:03:11] Labs, Labs-Infrastructure, Beta-Cluster-Infrastructure: Make labs wikitech role aware - https://phabricator.wikimedia.org/T127771#2053464 (Ottomata) [23:05:08] Labs, Labs-Infrastructure, Tool-Labs, DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2053478 (Danilo) Thanks. Ok. We are already developing a cache for the tool. [23:12:07] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool, Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2053505 (hashar) `openstack server image create` does not work. The command returns immediately showing the... [23:12:57] andrewbogott: looks like the rest api maps to method in python code and one as to look at the source :) [23:13:26] andrewbogott: anyway it seems 'openstack server image create' does not snapshot a running instance [23:13:51] that is all I got for today, gotta sleep [23:15:08] *wave* [23:16:03] Labs, Labs-Infrastructure, Tool-Labs, DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2053510 (Danilo) Thanks. Ok. We are already developing a cache for the tool. [23:46:13] andrewbogott, thanks for looking into my issue [23:46:18] can you just delete one of these accounts [23:46:22] https://phabricator.wikimedia.org/T126890 [23:50:10] hannes_roest: I can, sure :) [23:50:24] do you think then I can log in to gerrit? [23:50:31] that would be great [23:51:57] hannes_roest: I hope so :) [23:52:16] * andrewbogott deletes uid=hannesroest,ou=people,dc=wikimedia,dc=org [23:52:24] done [23:52:53] hannes_roest: can you tell me more about what you mean by 'could you delete my email address from the public record?' [23:52:58] do you mean from the contents of that ticket? 
[23:53:07] yes [23:53:22] so that it is not in there [23:54:11] hm… Krenair can do that the fastest [23:54:20] ok [23:54:26] hannes_roest: meanwhile, any luck with gerrit? [23:56:05] let me try [23:56:45] uh... I can delete it from the ticket if you like [23:56:51] if it's still in ldap it's still accessible to others [23:57:24] (done) [23:58:08] ok, gerrit works [23:58:17] win! [23:58:29] the web interface [23:58:37] great, now I want to try ssh [23:58:55] by ‘ssh’ you mean, submitting a patch for review? [23:59:25] or, ssh access to labs instances?