[00:14:54] we're seeing a bunch of shinken DOWN messages for integration-slave-trusty-1026 but the instance has been deleted for a month or so
[00:15:08] can someone check for stale ldap records?
[03:05:07] Tool-Labs-tools-Other, Community-Tech-Tool-Labs, Epic: Convert all Labs tools to use cdnjs for static libraries and fonts - https://phabricator.wikimedia.org/T103934#2159874 (Krinkle)
[03:07:33] Tool-Labs-tools-Other, Community-Tech-Tool-Labs, Epic: Convert all Labs tools to use cdnjs for static libraries and fonts - https://phabricator.wikimedia.org/T103934#2159876 (Krinkle)
[05:37:47] Can someone tell me what a bad specifier error in a crontab is supposed to be?
[05:46:40] Why is it that trying to run a query against the logging or logging_userindex table is always so slow?
[05:46:46] for enwp
[05:46:58] I actually have to leave, but the query in question that I'm messing with is: select log_timestamp, log_action from logging where log_namespace=0 and log_title='Great_Nuclear_Doge_Event' and log_type='delete' and log_timestamp > '20160121065733';
[05:55:57] Why is trying to modify my crontab soo sloooooow. I wait like 20 seconds for a response.
[05:56:19] yuvipanda, tools-login is being very slow.
[05:56:45] Like really slow
[10:05:16] I might have broken mediawiki on deplyoment-prep... https://en.wikipedia.beta.wmflabs.org/ gives me a "502 Bad Gateway". Checking on deployment-mediawiki01 I see apache and HHVM running, not much in error logs.
[10:05:27] What should I check?
[10:16:53] for the context, we were putting some load on search queries to test the new connection pool...
[12:16:43] !log deployment-prep restarting varnish on deployment-cache-text04
[12:16:44] Please !log in #wikimedia-releng for beta cluster SAL
[12:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[13:27:11] chasemp, I can't seem to create instances. They seem to ERROR out as soon as I create them.
[13:27:54] CP678: what project, what kind of instance, what interface are you using, and have you ever made an instance before?
[13:28:46] cyberbot, standard x-large, wikitech, yes. One instance is working perfecly
[13:29:48] chasemp, ^
[13:31:34] chasemp, any ideas?
[13:31:46] I'm looking I see what you mean but it's pretty nonspecific
[13:31:49] can you try via horizon?
[13:35:04] !log deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot
[13:35:04] Please !log in #wikimedia-releng for beta cluster SAL
[13:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[13:37:20] chasemp, what's horizon?
[13:37:30] CP678: horizon.wikimedia.org
[13:37:37] which will be the new control panel for all things labs
[13:39:38] CP678: seems like it may be a more general issue surfacing, please log a task and we will have to take a look when andrewbogott is around later today
[13:40:43] chasemp, so horizon failed too. :-(
[13:42:34] Labs, Labs-Other-Projects: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2160557 (Cyberpower678)
[13:42:47] chasemp, ^
[13:43:04] sure "seems like it may be a more general issue surfacing, please log a task and we will have to take a look when"
[13:43:31] there are a few things inf wise in flux and I'm not sure if this is a systemic issue or not atm
[13:43:42] sorry about that
[13:46:07] I was going to try experimenting with installing pthreads today.
[13:46:50] chasemp, is there anyway to tell what the current usage of the project is. That is, is there any page that lays it out graphically?
[13:47:50] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectid=cyberbot
[13:47:59] and then the whole horizon ui is built for it
[13:49:09] chasemp, i'm actually to see a little more than how much the instances have reserved. I want to see what the current load on those instances are.
[13:49:18] CPU, RAM, etc...
[13:49:41] I know there is a graphical interface for that somewhere.
[13:50:18] for that you want graphite or horizon I think has it
[13:50:25] http://graphite.wmflabs.org/
[13:51:08] http://graphite.wmflabs.org/render/?width=966&height=561&_salt=1459345857.981&target=cyberbot.cyberbot-exec-01.cpu.total.system&target=cyberbot.cyberbot-exec-01.cpu.total.user
[13:51:08] etc
[13:51:11] you'll have to play w/ it
[13:51:27] btw, which credentials I have to use for graphite? or is the access limited?
[13:52:00] it's open afaik
[13:52:09] that is the labs one tho
[13:53:26] so I have to use the login name from wikitech and the wikitech pwd? I can't access with this credentials
[13:53:44] Your link is open, but graphite.wikimedia.org says, I have to identify, otherwise => 401
[13:53:53] those are two diff links and installs
[13:53:57] so...yes they are different
[13:54:02] one is for labs and is open
[13:54:15] one is for prod and is probably attached to an ldap group
[13:55:46] Ugh. Why did we ever ditch ganglia?
[13:56:03] The graphite interface is absolutely horrible, slow, and clunky.
[13:56:20] ah, graphite.wmflabs ist open, .wikimedia is not accessible for me
[13:56:31] I think grafana is better
[13:57:02] chasemp, any chance we could dump graphite for something...better.
[13:57:12] I would say anything is better than this. :/
[13:57:20] Like Ganglia
[13:57:29] Everything is laid so nicely.
[13:57:51] grafana is just a frontend for graphite
[13:58:02] and the graphiet front end is more or less just a exploration prototype
[13:58:33] there is nagf or something in tools for instance grouping or something I don't recall
[14:00:00] http://tools.wmflabs.org/nagf/?project=jupyter for instance
[14:00:03] seems to be broken on some ways tho
[14:00:21] chasemp: I added a grafana link to the project description boxes too
[14:00:58] some time agi
[14:01:01] I don't know anything about http://grafana.wmflabs.org/login
[14:01:03] *ago
[14:01:05] where it's running or what the deal is
[14:01:24] nah, you don't need to login
[14:01:35] chasemp, I don't know how to use grafana. It doesn't seem to be able to access any project, except the 3 you see when you sign up.
[14:01:49] CP678: Wait a moment
[14:01:54] chasemp, nag can't find Cyberbot despite it being in the dropdown
[14:02:09] I see that too but why idk
[14:02:27] my advice is learn to navigate the graphite interface and build your own graphs
[14:02:34] CP678: Use the grafana link here: https://wikitech.wikimedia.org/wiki/Nova_Resource:Cyberbot
[14:02:36] and then look for a curated version
[14:02:40] https://grafana.wikimedia.org/dashboard/db/labs-project-board?var-project=cyberbot&var-server=All is that
[14:04:57] chasemp: Use the grafana.wikimedia.org, this instance is showing labs data too
[14:04:57] Oh goodie. Going AFK. Thanks.
[14:16:18] Labs, Labs-Other-Projects: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2160557 (Andrew) I am not able to reproduce this in a test case. What project are you using? And, have you checked your quotas?
[14:28:03] Labs, Labs-Other-Projects: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2160665 (Cyberpower678) >>! In T131246#2160654, @Andrew wrote: > I am not able to reproduce this in a test case. What project are you using? And, have you checked your quotas? cyberbot, and yes.
[14:37:46] Labs, Labs-Other-Projects: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2160674 (Andrew) ok, I can reproduce now. Investigating...
[14:46:04] andrewbogott, let me know if you find anything.
[14:46:46] Cyberpower678: can you add me to the GitHub org for xtools please?
[14:47:17] Matthew_, sorry, Elee hijacked the Repo.
[14:47:33] Okay.
[14:48:20] (PS1) Youni Verciti: Rev 0.2.2 tools clean [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/280451
[14:48:21] Hiya MusikAnimal
[14:48:35] hey there
[14:49:47] Really loving grafana. :D
[14:50:21] I'll probably just create a pull request anyway XD
[14:52:37] got Elee to transfer ownership of the GitHub project
[14:52:51] did you want to be an owner Cyberpower678?
[14:52:56] MusikAnimal, can you add me back as owner
[14:53:05] done
[14:53:12] Cyberpower678: I think that too, grafana has a great GUI
[14:53:23] I can add you too Matthew_ as a collaborator
[14:53:37] MusikAnimal, thanks. I may not be active, but it would be wise to have an additional active user to fall back on when it comes to repo ownership
[14:54:52] MusikAnimal: please and thank you :)
[14:55:42] done
[14:55:46] I'm going to work on cleaning up the interface
[14:56:07] and I'm going to take a stab at upgrading the PHP to something a little more modern
[14:56:41] PAWS, Revision-Scoring-As-A-Service-Backlog: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#2160761 (Halfak)
[14:57:00] I've hijacked the rebirth repo, I'm going to see if I can do anything with a possible rewrite.
[14:57:06] phpbrew and other version managers don't even have < 5.4, xtools is only 5.3 something
[14:57:41] we should be on 5.5 at least, I think
[14:58:12] Matthew_, go ahead
[14:58:31] MusikAnimal, I'm somewhat tempted to remove Elee from the ownership list.
[14:58:52] Sure he gave us back ownership, but he shouldn't have removed the other owners to begin with.
[14:59:07] he seemed very kind about it, I don't know why he removed you
[14:59:11] I wonder if it were a mistake
[14:59:32] just doesn't seem like him to do something like that
[14:59:37] I'll shrug it off as a mistake this time, and add him back to the project.
[14:59:58] he's not going to be active for quite a while anyway
[15:00:41] And why is T-13 still on the list? Not accusing just curious.
[15:01:10] Matthew_, I added him. I trust him, and he has yet to abuse that trust.
[15:01:21] Ok.
[15:01:36] Matthew_, so please make him feel welcome.
[15:01:53] Oh I know. I'm a nice person :)
[15:02:01] :-)
[15:02:57] Matthew_: CP678 I notice both an xtools project and xtools within Tools, is this another case of migration or migration gone wrong or?
[15:04:08] chasemp, The idea was to migrate, but then we found the code to be unmovable, due to toollabs dependencies. Matthew_ is going to attempt a rewrite, and then xtools should be moved again.
[15:04:57] so there is still movement there but it's TBD
[15:06:52] chasemp, pretty much. Which reminds me, Matthew_: what's your wikitech username?
[15:10:26] andrewbogott, find anything yet?
[15:10:39] I'll let you know
[15:10:45] Ok
[15:11:05] PAWS, Revision-Scoring-As-A-Service-Backlog: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#2160926 (Halfak) Is this done?
[15:12:47] bd808: Maybe something for you? T101615
[15:12:48] T101615: Please create a blacklist of user who can not use Special:GlobalRenameQueue - https://phabricator.wikimedia.org/T101615
[15:16:08] RECOVERY - Host tools-worker-1011 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[15:18:11] CP678: matthewrbowker
[15:18:21] Apologies, had to step away for a second.
[15:20:52] PROBLEM - Host tools-worker-1011 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:21] Matthew_, added you to the xtools project
[15:24:23] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:25:00] CP678: Thank you.
[15:28:47] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:29:53] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:30:15] those puppet failures are because of a brief dns outage that just happened… they should all recover just fine on the next puppet run
[15:31:03] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.005 second response time
[15:31:29] well, hm, that one is new :(
[15:32:20] Seeing that on my end too...
[15:32:39] As well as on tools themselves.
[15:33:34] Matthew_: what do you see, specifically?
[15:33:53] andrewbogott: "502 Bad Gateway
[15:33:53] nginx/1.9.4"
[15:34:01] It's a blank white page with that text
[15:40:38] yes, all tools are down on Tool Labs?
[15:41:12] andrewbogott, why does grafana still report cyberbot-exec-iabot-enwiki as still operational, and continues to graph that node, when I deleted it hours ago?
[15:41:17] 502, even on toollabs:musikanimal which runs on a Unicorn server, not lighttpd
[15:41:55] tools.stewardbots down with 502 as well
[15:41:58] MusikAnimal, Matthew_: move xtools to the project NOW. :P
[15:42:09] That's an order.
[15:42:24] Or else I stuff myself with cheeseburgers. :p
[15:42:27] Cyberpower678: love to sir! :P
[15:42:33] Hehehehehe
[15:42:55] nothing in the error logs
[15:43:18] Though, we may want to create new instances if you're doing a rewrite.
[15:43:24] The current ones use precise./
[15:45:18] Cyberpower678: I'm hoping my rewrite can work on tools.
But do know that it will probably take me a long time to get it there...
[15:45:41] andrewbogott: my bot is still running... it's just the frontend of the tools that are down
[15:46:09] Matthew_, One of the reasons the project was created was so that when tools goes down, like it did now, xtools can continue to function.
[15:46:26] xtools is widely used, and people panic as soon as it goes down.
[15:46:42] as they do with toollabs:pageviews
[15:46:56] Cyberpower678: yessir. Can and should are different things of course :)
[15:47:12] but way too simple of an application to have it's own instance
[15:47:18] But one of the goals would be resource and memory leak management.
[15:47:19] Which is why I actually encourage doing the rewrite on the project itself.,
[15:48:29] Matthew_, and on the project itself you have full control over linux resources
[15:48:45] You can even sudo, provided I give you access. :p
[15:49:20] You can create your own webserver, proxy, load balancing grid engine, etc...
[15:49:48] And you have 50GB of RAM at your disposal
[15:50:08] Fair enough.
[15:50:27] and you won't have a team of people to solve your issues for you ;-)
[15:50:41] There are currently 3 instances using precise however. If you want me to spawn newer versions, let me know.
[15:50:52] !log tools rebooting tools-proxy-01 in hopes of clearing some bad caches
[15:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[15:51:04] valhallasw`cloud, if it's setup right, you won't need any. :
[15:51:05] :p
[15:51:35] Besides managing a project, versus managing a project of projects are different things. ;p
[15:52:07] "503 Service Temporarily Unavailable"
[15:52:10] getting 503's now
[15:52:28] Cyberpower678: Heh. And again, I don't have a ton of time to throw at it, so it'll take a long time.
[15:52:57] Matthew_, MusikAnimal: did you want me to delete those instances now, and create new clean ones for you guys to look at?
[15:53:22] I...
don't see what that will accomplish tbh.
[15:53:36] Matthew_, they currently use precise
[15:53:41] which is deprecated
[15:53:56] Okay. But why would you recreate?
[15:54:13] they're not really doing much to begin with, and I figured a clean slate would be better.
[15:54:39] Besides I can then configure them to how you want them.
[15:55:51] Matthew_, screw it. I added you as a projectadmin
[15:55:52] https://wikitech.wikimedia.org/wiki/Special:NovaInstance
[15:55:56] PROBLEM - Free space - all mounts on tools-worker-1004 is CRITICAL: CRITICAL: tools.tools-worker-1004.diskspace.root.byte_percentfree (<30.00%)
[15:56:02] You can control your own instances there.
[15:56:24] OK
[15:56:48] * Cyberpower678 gets back to his thing, of patiently waiting to hear from andrewbogott
[15:57:42] chasemp, when the @reboot is used in place of a time entry for the crontab, shouldn't those tasks automatically start when the instance is rebooted?
[15:58:13] Because I restarted the working instance, and nothing happened.
[15:58:25] andrewbogott: Back to 502s
[16:01:06] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 806836 bytes in 4.982 second response time
[16:01:13] Cyberpower678: that is in theory how it works
[16:01:36] chasemp, well it doesn't seem to work. :-(
[16:01:56] *sniff
[16:02:33] Cyberpower678: it's your own host, so you'll have to debug it yourself
[16:03:35] valhallasw`cloud, I'm starting to feel the pressure.
:p
[16:03:46] tools works now
[16:03:53] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:51] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:09:24] Labs, Labs-Other-Projects: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2161051 (Andrew) It is specifically xlarge instances that are failing, because all hosts report insufficient RAM. I'll look into adjusting the overprovision ratio.
[16:13:27] !log ores deployed ores-wikimedia-config:58905c5
[16:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[16:14:43] tools-bastion-02 $qstat gives:
[16:14:44] error: commlib error: access denied (client IP resolved to host name "i-00000b8e.tools.eqiad.wmflabs". This is not identical to clients host name "tools-bastion-02.tools.eqiad.wmflabs")
[16:14:49] error: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs"
[16:15:08] andrewbogott: ^ more reverse dns issues
[16:15:38] tools-bastion-05 $qstat is okay
[16:16:36] ok, that will most likely sort itself out then...
[16:16:45] tools-bastion-02 $ webservice restart gives:
[16:16:46] error: commlib error: access denied (client IP resolved to host name "i-00000b8e.tools.eqiad.wmflabs".
This is not identical to clients host name "tools-bastion-02.tools.eqiad.wmflabs")
[16:16:51] error: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs"
[16:16:54] Traceback (most recent call last):
[16:16:56] File "/usr/local/bin/webservice", line 298, in <module>
[16:16:59] main()
[16:17:01] File "/usr/local/bin/webservice", line 257, in main
[16:17:04] job = get_job_xml(job_name)
[16:17:06] File "/usr/local/bin/webservice", line 80, in get_job_xml
[16:17:09] output = subprocess.check_output(['qstat', '-xml'])
[16:17:11] File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
[16:17:14] raise CalledProcessError(retcode, cmd, output=output)
[16:17:16] subprocess.CalledProcessError: Command '['qstat', '-xml']' returned non-zero exit status 1
[16:17:19] Why is it that trying to run a query against the logging or logging_userindex table is always so slow for enwp?
[16:17:24] I actually have to leave, but the query in question that I'm messing with is: select log_timestamp, log_action from logging where log_namespace=0 and log_title='Great_Nuclear_Doge_Event' and log_type='delete' and log_timestamp > '20160121065733';
[16:17:45] doctaxon, really? You couldn't have pastebinned that?
[16:18:18] that's the error report, ya
[16:19:01] but it works on tools-bastion-05 very fine
[16:19:18] Labs, Tool-Labs, Patch-For-Review, Scap3: Setup a proper deployment strategy for Kubernetes - https://phabricator.wikimedia.org/T129311#2101987 (mmodell) p:Triage>High
[16:19:35] Labs, Tool-Labs, Patch-For-Review, Scap3 (Scap3-Adoption-Phase1): Setup a proper deployment strategy for Kubernetes - https://phabricator.wikimedia.org/T129311#2101987 (mmodell)
[16:27:03] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.007 second response time
[16:35:52] hey chasemp, yuvipanda: looks like I can't view/edit the crontabs for either of my tools right now. Tho the existing cron jobs appear to be running on schedule. Any idea what's up?
[16:36:32] J-Mo, they seem to be running really slow. Give it a minute.
[16:36:58] cool. I'll check again in a bit. Thanks Cyberpower678
[16:37:30] J-Mo, when you call crontab, does it give an error, or does it appear to hang?
[16:38:11] it shows me an empty crontab
[16:38:47] I logged in; I "become TOOLNAME"; I type "crontab -e"
[16:39:14] What's -e do?
[16:39:20] I don't usually use it.
[16:40:51] J-Mo, if it's giving you a blank editor, it's probably a seperate issue.
[16:41:42] -e edits your existing crontab, or creates a new one if you don't have an existing crontab
[16:41:47] J-Mo: what does which crontab return?
[16:42:08] "/usr/local/bin/crontab"
[16:42:28] * valhallasw`cloud frowns
[16:42:35] which tools?
[16:43:11] hostbot and grantsbot
[16:44:12] last cronjob was run at 9:05am Pacific, I noticed the problem about 25 minutes later, so if my crontab was deleted, it happened within the past 40 minutes
[16:46:39] J-Mo: Ah. I see what is happening.
host-based-auth is broken
[16:47:09] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 806974 bytes in 3.760 second response time
[16:47:55] Labs, Tool-Labs: host-based auth broken - https://phabricator.wikimedia.org/T131254#2161129 (valhallasw)
[16:48:23] thanks for looking into this, valhallasw`cloud. I'll watch that ticket.
[16:50:44] Labs, Tool-Labs: host-based auth broken - https://phabricator.wikimedia.org/T131254#2161151 (valhallasw) ``` Mar 30 16:47:18 tools-cron-01 sshd[3848]: Connection from 10.68.23.74 port 55943 on 10.68.23.89 port 22 Mar 30 16:47:18 tools-cron-01 sshd[3848]: userauth_hostbased mismatch: client sends tools-bas...
[16:55:59] Labs, Tool-Labs: host-based auth broken - https://phabricator.wikimedia.org/T131254#2161192 (Andrew) I think this is left over from my recent dns confusion -- we were briefly falling back on an older dns server that presented weird/bad reverse dns lookups. Try restarting nscd on the affected host or wait...
[16:58:11] Labs, Tool-Labs: host-based auth broken - https://phabricator.wikimedia.org/T131254#2161197 (valhallasw) Open>Resolved a:valhallasw Yep, restarting nscd on tools-cron-01 fixes it. Thanks!
[17:05:41] Labs, Labs-Other-Projects, Patch-For-Review: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2161208 (Andrew) The attached patch should allow at least one more xlarge instance to be created :/
[17:05:41] Labs, Labs-Other-Projects, Patch-For-Review: Creating instances in projects fail - https://phabricator.wikimedia.org/T131246#2161209 (Andrew) Open>Resolved a:Andrew
[17:12:36] andrewbogott, I was able to spawn an XL instance
[17:13:18] great
[17:13:29] andrewbogott, I have another qwuestion though
[17:13:30] https://grafana.wikimedia.org/dashboard/db/labs-project-board?var-project=cyberbot&var-server=All
[17:14:07] I don't know anything about grafana, I barely use it
[17:14:24] Doesn't seem to have the new instance, but has a bunch of old deleted ones.
[17:14:35] andrewbogott, it supposedly goes off of graphite.
[17:15:41] RECOVERY - Host tools-worker-1011 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[17:15:51] Cyberpower678: hosts are not deleted from hraphite after they are deleted
[17:16:21] And it can take a bit of time before the first value appears (diamond provisioning)
[17:16:40] valhallasw`cloud, why aren't they deleted, they clutter up the graphs.
[17:19:34] Niharika, ping
[17:19:50] Cyberpower678: Pong.
[17:20:53] PROBLEM - Host tools-worker-1011 is DOWN: PING CRITICAL - Packet loss = 100%
[17:23:41] Because of how Graphite works it needs manual intrrvention
[17:23:55] valhallasw`cloud, and who can do that?
[17:24:09] File a bug?
[17:24:27] But under what project do I file it?
[17:24:45] Labs, or Graphite if there is one?
[17:27:47] Labs, Graphite: Delete non-existent instances from Graphite for Cyberbot project - https://phabricator.wikimedia.org/T131259#2161380 (Cyberpower678)
[17:27:56] Labs, Graphite: Delete non-existent instances from Graphite for Cyberbot project - https://phabricator.wikimedia.org/T131259#2161392 (Cyberpower678) p:Triage>Normal
[17:29:07] how to set timeout for a job? (e.g if running time > 4 hours kill it)
[17:34:39] andrewbogott, not sure if this is a bug, but I can't seem to access my newly created cyberbot-exec-iabot-01 node
[17:34:50] I keep getting access denied.
[17:35:34] Check the console log to see if 'Puppet finished correctly?
[17:36:16] I got about a million Error: Could not request certificate: Connection refused - connect(2)
[17:36:26] valhallasw`cloud, andrewbogott ^
[17:36:48] Cyberpower678: I think you hit an unlucky moment when dns was messed up. If you delete and recreate things should go better.
[17:37:17] :|
[17:40:03] answering myself: -l h_rt=4:00:00
[17:40:21] andrewbogott, ssh: connect to host cyberbot-exec-iabot-01 port 22: No route to host
[17:42:14] Cyberpower678: if you want to re-use the same name you'd best delete and wait 5 minutes before recreating. Otherwise there will be conflicts in dns caches and such
[17:44:32] protip: never reuse name as they are free and it will always call into question which iteration of a box caused a certain issue along with dns shenanigans
[17:44:59] chasemp, that's inconvenient.
[17:45:10] I would like to keep naming consistent.
[17:45:37] well, in operational terms that means you are too attached to a name but it's just advice
[17:46:15] chasemp, no it's me refusing to have to give up a name because of a technical problem.
[17:46:47] what I'm saying is, reusing names is bad practice
[17:46:53] you are still free to do it outside of issues
[17:54:11] :\
[17:54:44] I delete the instance and recreated with a 5 minute gap, and I'm still getting the same error.
[17:54:44] just name it -02
[17:54:48] thats what numbers are for
[18:03:45] Labs, Graphite: Delete non-existent instances from Graphite for Cyberbot project - https://phabricator.wikimedia.org/T131259#2161580 (Cyberpower678)
[18:04:10] Cyberpower678: something 'interesting' is going on… I'm still looking.
[18:06:31] andrewbogott, would it have something to do with not resolving host names? The new name I spawned doesn't resolve
[18:10:36] this is just another manifestation of having multiple IPs for labs-ns0. It should be stable by now, I'm syncing things
[18:11:21] https://phabricator.wikimedia.org/T131266
[18:12:59] PAWS: Write 2-3 vision for annual plan - https://phabricator.wikimedia.org/T131267#2161651 (Halfak)
[18:15:34] Cyberpower678: you should be able to access it now
[18:15:34] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161674 (chasemp) p:Triage>Normal
[18:15:54] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161677 (Sigma)
[18:16:44] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161630 (Sigma)
[18:16:52] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161630 (Reedy) EXPLAIN and see what indexes that actual query is using?
[18:17:13] PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#2161690 (Halfak)
[18:17:22] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161630 (jcrespo) Enwiki tables are being loaded currently to a) correct incorrect data b) populate the new server. During the imports (one table at a time), blocks and missing data can happen.
[18:17:48] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161693 (Sigma) Details of the query in question: ``` MariaDB [enwiki_p]> select log_timestamp, log_action from logging where log_namespace=0 and log_title='Great_Nuclear_Doge_Event' and log_type='delete' and log_...
[18:19:22] Yes it is working now.
[18:19:26] andrewbogott, ^
[18:19:32] thank you :-)
[18:20:10] Labs, DBA: The logging table on labs is broken - https://phabricator.wikimedia.org/T131266#2161700 (Sigma) @Reedy I have included the explains that @Legoktm ran for me in the description.
[18:25:45] PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#2161739 (yuvipanda) a:yuvipanda
[18:25:47] PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#2161651 (yuvipanda) dibs
[18:27:56] PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#2161747 (Halfak) Some thoughts to include: * CLI is lame * Open science is currently happening anyway * We're the first open jupyter hub * What if open data science Discuss experimentation period. Discuss larger/longer...
[18:28:38] Labs, Graphite: Delete non-existent instances from Graphite for Cyberbot project - https://phabricator.wikimedia.org/T131259#2161752 (Cyberpower678)
[18:29:26] PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#2161758 (yuvipanda) I think that's just one facet of it. I want to think of it as a way to write code, rather than just 'do science' (not too different, IMO). Science as it happens now is a very important component, but n...
[18:33:05] Labs, Horizon: Only allow labs proxies to be *.wmflabs.org - https://phabricator.wikimedia.org/T131270#2161769 (Andrew)
[18:36:36] Labs, DBA: Querying the logging table on labs is slow - https://phabricator.wikimedia.org/T131266#2161782 (Reedy) p:Normal>Low
[18:38:05] Labs, DBA: Querying the logging table on labs is slow - https://phabricator.wikimedia.org/T131266#2161630 (Reedy) >>! In T131266#2161691, @jcrespo wrote: > Enwiki tables are being loaded currently to a) correct incorrect data b) populate the new server. During the imports (one table at a time), blocks and...
[18:41:11] Labs, DBA: Querying the logging table on labs is slow - https://phabricator.wikimedia.org/T131266#2161789 (jcrespo) The maintenance is to correct several issues: T126946 T115517 T129432. The reimports are done in small chunks so it only affects a subset of the queries each time.
[19:36:49] Labs, Labs-Infrastructure: Make labs proxies https only - https://phabricator.wikimedia.org/T131288#2162108 (Andrew)
[19:38:18] Labs, Labs-Infrastructure: Abolish use of labs proxies in domains other than .wmflabs.org - https://phabricator.wikimedia.org/T131290#2162138 (Andrew)
[19:38:37] Labs, Horizon: Only allow labs proxies to be *.wmflabs.org - https://phabricator.wikimedia.org/T131270#2162152 (Andrew)
[19:38:39] Labs, Labs-Infrastructure: Make labs proxies https only - https://phabricator.wikimedia.org/T131288#2162108 (Andrew)
[19:51:10] andrewbogott: en.wikipedia.beta.wmflabs.org does not resolve; known issue?
[19:51:34] gwicke: I think Krenair is looking at that now
[19:51:52] I just deleted the records to make it consistent with the other non-wikipedia subdomains
[19:51:58] problem is I can't delete the domains
[19:52:06] so it's broken until then
[19:54:02] Labs, Horizon: Allow creation and deletion of domains - https://phabricator.wikimedia.org/T131301#2162322 (Krenair)
[19:55:20] Krenair: if you tell me what domains you would like deleted, I will delete them
[19:56:04] well en.wikibooks.beta.wmflabs.org resolves presumably because of the *.beta.wmflabs.org record
[19:56:16] so no domains should exist under beta.wmflabs.org
[19:56:42] there is no wikibooks.beta.wmflabs.org domain in there but for some reason there is a wikipedia.beta.wmflabs.org
[19:56:49] one way of another this needs to be made consistent
[19:56:51] or*
[19:58:12] let's start with wikipedia.beta.wmflabs.org and see if that starts resolving properly, seeing as that's the one I just started trying to handle
[19:58:14] andrewbogott, ^
[20:00:10] ok, so, delete wikipedia.beta.wmflabs.org ?
[20:00:36] yes
[20:00:55] it should then start to resolve in the same way beta wikibooks does
[20:03:27] Krenair: shall I delete m.wikipedia.beta.wmflabs.org as well?
[20:03:43] let's just try it with this one for now
[20:03:45] or, to clarify: I have to delete m.wikipedia.beta.wmflabs.org before I can delete wikipedia.beta.wmflabs.org
[20:04:11] okay, and zero?
[20:04:17] yeah, presumably
[20:04:34] ok
[20:04:50] all three deleted
[20:06:25] okay so now wikipedia.beta.wmflabs.org works again
[20:06:29] let's do the same with the rest
[20:07:17] if you delete beta.wmflabs.org, will the content be deleted too?
[20:07:30] my pages there?
[20:07:50] beta.wmflabs.org isn't being deleted
[20:08:05] and we're only talking about dns records
[20:08:10] sorry wikipedia.beta.wmflabs.org
[20:08:20] no content on the servers those domains point to
[20:12:15] Krenair: can you just tick the domains you want me to delete?
https://etherpad.wikimedia.org/p/deletedesedomains [20:13:05] done [20:13:24] oh, so, everything :) [20:13:25] ok! [20:14:05] thanks [20:14:54] done [20:17:39] Labs, Labs-Infrastructure: Labs proxy api (aka 'Invisible Unicorn') is a spof - https://phabricator.wikimedia.org/T131308#2162479 (Andrew) [20:17:54] Krenair: did that get you to the state you wanted? [20:18:15] andrewbogott, yep, domain list looks good and all the domains that I'd expect still appear to work [20:18:20] going to run a script through to verify [20:18:20] !ping [20:18:20] !pong [20:18:34] Krenair: well, this is a very simple solution: one domain, one wildcard :) [20:18:39] yep [20:18:55] and now it's in use for all of beta instead of random parts of it [20:19:39] thanks for finally helping me clear that mess up [20:19:49] helping me finally* [20:27:18] sure! [20:30:06] while I was going through all those, something else (unrelated) in the beta domains popped up [20:30:20] I noticed that there was no record for upload.beta.wmflabs.org [20:30:27] so it was going to the text cache [20:30:46] can we still query the old ldap-based server to check where it was sending that andrewbogott? [20:30:58] sure [20:31:38] I just corrected it in horizon but there is, of course, more issues hiding behind that... [20:32:01] andrewbogott, labs-ns0-former-placeholder? [20:32:16] Krenair: yeah, that's the ldap-based nameserver [20:32:22] I can also just dig in ldap itself if you like [20:32:29] huh, it was sending it to the correct cache machine [20:33:03] so something must have got lost in the designate migration? [20:34:22] oh well, will just have to see if I can fix this upload cache box..
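Note on the "run a script through to verify" step at 20:18:20: the log does not show the script that was used. The following is only a minimal sketch of what such a check could look like, assuming a Linux host where `getent hosts` is available. The function name `check_resolves` and the `RESOLVER` override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: the actual verification script is not shown in the log.
# check_resolves reads one hostname per line on stdin and prints a line
# for every name that fails to resolve.
# RESOLVER is a hypothetical override for the lookup command; it defaults
# to `getent hosts`, which exits non-zero when a name cannot be resolved.
check_resolves() {
    while IFS= read -r host; do
        # Skip blank lines in the input list.
        [ -n "$host" ] || continue
        # </dev/null keeps the lookup command from consuming the host list.
        if ! ${RESOLVER:-getent hosts} "$host" </dev/null >/dev/null 2>&1; then
            printf 'UNRESOLVED %s\n' "$host"
        fi
    done
}
```

It could be fed a list such as `printf '%s\n' en.wikipedia.beta.wmflabs.org en.wikibooks.beta.wmflabs.org | check_resolves`. Note that a wildcard record makes every name under it resolve, so a check like this would not have caught the upload.beta.wmflabs.org case described at 20:30, where the name resolved but pointed at the wrong cache.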
[21:13:24] Labs, Labs-Infrastructure: Make labs proxies https only - https://phabricator.wikimedia.org/T131288#2162709 (Andrew) [21:13:26] Labs, Horizon, Patch-For-Review: Only allow labs proxies to be *.wmflabs.org - https://phabricator.wikimedia.org/T131270#2162708 (Andrew) Open>Resolved [21:26:12] WTF [21:26:30] Why do I get a permission denied error when using sudo. [21:26:53] what host? [21:27:09] cyberbot-exec-iabot-01 [21:27:33] sudo echo "extension=pthreads.so" >> /etc/php.ini [21:27:33] -bash: /etc/php.ini: Permission denied [21:28:13] sudo isn't a bash keyword, it's an actual program [21:28:27] oh [21:28:28] the pipe won't be done with elevated privileges [21:28:53] Krenair, so what's the proper way to edit this file then? [21:28:59] sudo -e /etc/php.ini [21:29:23] should open an editor for you to add what you like [21:30:17] you could also `sudo -i` and then run `echo "extension=pthreads.so" >> /etc/php.ini` [21:35:14] Krenair, permission denied for sudo -e [21:36:17] sudo -i works. Thanks [21:36:50] really? uh.. interesting [21:37:10] if sudo -i works, sudo -e should to [21:37:11] too* [21:37:27] The editor returned a permisssion denied [21:37:51] And instead attempted to write to a file in the tmp folder [21:42:57] the tmp thing is normal [21:47:49] anothe common way to do sudo echo "extension=pthreads.so" >> /etc/php.ini is echo "extension=pthreads.so" | sudo tee -a /etc/php.ini [22:17:19] Labs, Beta-Cluster-Infrastructure: deployment-upload won't start, upload.beta.wmflabs.org down - https://phabricator.wikimedia.org/T131322#2163472 (Krenair) [22:49:37] Execution on Bastion has is working very slowly, to the point that it's nearly non-functional [22:50:50] chasemp: given the increasing number of complaints, I'm going to setup a xlarge bastion for now. [22:51:19] It's in on and off doses. [22:52:12] What causes the slowdowns? I seem to remember reading something about NFS causing them, but I'm not sure.
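Note on the sudo exchange between 21:27 and 21:47: `sudo echo ... >> /etc/php.ini` fails because the `>>` redirection is opened by the calling, unprivileged shell before `sudo` runs, so only `echo` is elevated. The `tee -a` form suggested at 21:47:49 works because `tee` itself opens the file. Below is a minimal sketch of that pattern wrapped in a helper that also avoids adding the same line twice. The function name `append_line` and the `RUN_AS` variable are hypothetical and not from the log.

```shell
#!/bin/sh
# Sketch only: append_line and RUN_AS are illustrative names, not from the log.
# Appends a line to a file unless an identical line is already present.
# Set RUN_AS=sudo for root-owned files; leave it unset for files you own.
append_line() {
    line=$1
    file=$2
    # grep -qxF: quiet, whole-line, fixed-string match. This check runs
    # unprivileged, so it assumes the target file is readable by the caller.
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        return 0
    fi
    # The pipe is set up by the unprivileged shell; only tee opens the
    # target file, so only tee needs elevated privileges.
    printf '%s\n' "$line" | ${RUN_AS:-} tee -a -- "$file" >/dev/null
}
```

For the case in the log, the call would be `RUN_AS=sudo; append_line "extension=pthreads.so" /etc/php.ini`. Running it a second time leaves the file unchanged.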
[22:52:14] yuvipanda Magog_the_Ogre: Related? https://phabricator.wikimedia.org/T131122 [22:52:31] yeah, same issue probable. [22:53:30] OK. [22:56:34] I was waiting to stage a node for cutover but no big deal to do both to me [22:56:34] I started one (tools-bastion-03) the puppet run will take a while [22:56:34] chasemp: is there a ticket for the cgroup stuff you were doing? [22:56:35] I know I’ve caused a slowdown once by trying to open an very large log file with vim. [22:56:35] * Matthew_ still feels bad about that. [22:56:37] one for making the bastion not suck, approach is multi-pronged [22:56:48] i cant remember the # [22:57:21] puppet's continuing to run [22:57:24] chasemp: ok. [22:57:35] the thing is xlarge or not doesnt change the nfs case which is most aggregious [22:58:18] NFS causes most of the issues. [22:58:28] I agree, but it's a simple thing for me to start off and not worry about and I think even in the long run bastions should be xlarge anyway [22:58:49] no argument there just noting [22:58:59] chasemp: but if you want me to hold off, I can do that too. but yeah, I couldn't even get into tools-login [22:59:07] pooptop seems like a thing we should do at some point :D [22:59:30] Are you going to upgrade the dev host to? [22:59:46] ha yes although i am thinking cli now as anyone who could help is there already but yes [22:59:47] tom29739: that doesn't seem to have as many issues right now, but long term probably yeah [23:00:54] The dev host compiles faster than my pentium processor at home, which is a start. [23:01:00] heh. 
[23:01:00] :D [23:01:15] one possibility to make the dev host really faster is to not mount NFS there on home [23:01:16] but yeah no worries, it is not a bad idea but i will have to disrupt ppl some in thia case instead od doing it pre cutover [23:01:45] chasemp: I'm in SF, so am all for disrupting all the thingssssss [23:01:54] yes i had the same ideaon home [23:02:09] we could put homedirs on /srv, and then have home mounted on /nfs or someshit that people can cd to to copy stuff over in [23:02:17] and it makes policing nfs for tools simpler too [23:02:23] so it becomes like a dedicated build host type thing [23:02:30] we could maybe do this for *all* bastions [23:02:44] but that'll probably break some people's deploy scripts and stuff [23:02:57] but idk, doing it on dev first and then -login with lots of notice seems like a overall good solution [23:03:25] I'll file a task in a bit [23:03:46] I am conflicted on the idea, I worry yhe things ppl do to keep doing what they prefer are just as bad [23:04:03] > I worry yhe things ppl do to keep doing what they prefer are just as bad [23:04:08] can't parse :D [23:04:31] ugh, brb. [23:05:02] Taking away NFS on home is a win, but only as far as people do not compensate with worse practices [23:05:03] back [23:06:05] and we have to come up with viable home storage locally across [23:06:38] anyway, go for broke and thanks for running it by me. [23:07:02] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:08:49] Another note is we cant make this perfect for all as no matter what the ppl doing too much will hit a ceiling and we need them too [23:08:58] chasemp: :D so all I'm going to do now is to setup tools-bastion-03 and make it tools-login. [23:09:16] so evidence of complaint is not necessarily evidence of problem as we tune [23:09:19] chasemp: +1. 
But can you write up exactly what the mechanism is going to be and how it's implemented etc in a phab ticket + announce and what not before we do it? [23:09:31] I mostly still have no idea what limits are in place etc [23:09:39] that is the plan [23:09:48] ok [23:10:16] so far how ppl work complicates this a lot [23:10:24] sudo as root to a tool user [23:10:41] and tie that all back via x number of sessions to a human [23:10:51] its not very practical atm [23:11:34] fun [23:11:36] To track and police I mean [23:11:40] * yuvipanda nods [23:12:00] loooooong term, I want everyone to be in the terminal in PAWS, or a similar containerized solution over ssh [23:12:09] like, in 2-3-4 years time [23:12:17] looong time away [23:12:29] and when we move off gridengine we'll also have far more flexibility about unix user accounts [23:12:38] The second i have thought a lot about as really all i am looking at falls short of that [23:12:46] which is ideal [23:13:02] > The second i have thought a lot about as really all i am looking at falls short of that [23:13:02] or is the ideal i mean [23:13:06] can't parse that either.. :( [23:13:29] jupyter & ssh in containers [23:14:24] aaah [23:14:26] right [23:37:10] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:38:13] we got a tools-bastion-03? [23:41:23] !log rcm.cac updating all repos [23:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm.cac/SAL, Master [23:48:49] !log tools.articlerequest Updated to v.0.1devel8 [23:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.articlerequest/SAL, Master