[00:16:48] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:20:46] !log tools Repooled nodes tools-worker 1012 and 1013 for T141126
[00:20:47] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126
[00:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[00:26:47] RECOVERY - Puppet run on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:40:33] hi, small question: I now have a new computer and can't connect anymore to my tool account via SSH: "the server refused our key" tells me Putty. Do you have any idea please?
[00:42:02] PROBLEM - Host tools-worker-1002 is DOWN: CRITICAL - Host Unreachable (10.68.16.234)
[00:44:47] PROBLEM - Host tools-worker-1004 is DOWN: CRITICAL - Host Unreachable (10.68.16.126)
[00:50:41] Automatik, username?
[00:50:52] Krenair: Botomatik
[00:50:56] what is the public part of the key you are trying?
[00:52:30] and what host are you trying?
[00:52:32] ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAnfucyWMZcAtdLxeKnX2BW0HEL1hp+L8Z6Hg+r1zo+0WhDk2PVOQgiWLy/lPZShM8/eUkmaeb/naKIP1xe1bLqpgTP4v36qW2qYCY7w8AxkwPEpXajQZ3sbAUdYKTcsz0hBEBlXtHjfjIaFkcplM1e43m7Opj4fqcmJJYT/zw4axV18MlcbJVXpV81/FpOc6CxBhBZ54NhAmJQoi+0LdnGhTXV7j0g8Um/+lQ/wOvF7l8klaCpXgIi+n76ud439+YCD4t4aNxvgrhU3g3dkknm8X5QH764rbm64bsAmnEr2fDzll7DorfQsOsbV04I0L+UHMLsqNNJfcWB9oI02O/jQ== rsa-key-20160306
[00:54:02] you don't have that key registered
[00:54:16] oh no wait
[00:54:37] there it is, my bad
[00:54:38] i tried bastion.wmflabs.org and login.tools.wmflabs.org
[00:55:28] Well... Maybe there is an issue with your setup
[00:55:39] I don't use Windows and don't really know putty
[00:55:43] don't guess where...
[00:56:39] i was using w7 before and now w10, but i don't know if it can be a problem
[00:57:35] was there a recent change that could explain this:
[00:57:52] no such variable: $EQIAD_PRIVATE_LABS_HOSTS1_A_EQIAD
[00:57:53] you're using Pageant, Automatik?
[00:58:03] where do you see that mutante?
[00:58:10] on our icinga prod host :/
[00:58:19] krenair: yes
[00:58:38] modules/icinga/manifests/nsca/firewall.pp: $EQIAD_PRIVATE_LABS_HOSTS1_A_EQIAD \
[00:59:18] no idea
[00:59:49] the last change to that file itself was in January.. hrmm
[01:00:46] isn't the nsca stuff for frack?
[01:02:48] yea, well, not specifically, they had the first use case
[01:03:01] it should not mean that a variable like that should disappear though
[01:03:42] ok, thanks krenair for your help, i will search again on my own later
[03:00:43] Labs, Labs-Infrastructure, Tracking: NFS overload is causing instances to freeze - https://phabricator.wikimedia.org/T124133#2517840 (chasemp)
[03:07:48] Labs, Labs-Infrastructure, Tracking: NFS overload is causing instances to freeze - https://phabricator.wikimedia.org/T124133#2517843 (chasemp) This task encompassed a particular failure pattern we were seeing really often in late 2015 and early 2016 all involving NFS and client failure. This affecte...
[03:13:38] Labs: Interactive consoles? - https://phabricator.wikimedia.org/T130806#2517849 (chasemp)
[03:13:40] Labs, Labs-Infrastructure: Instance console does not gives output / keystroke access - https://phabricator.wikimedia.org/T64847#2517851 (chasemp)
[03:15:51] Labs: Interactive consoles? - https://phabricator.wikimedia.org/T130806#2517854 (chasemp) >>! In T130806#2511774, @Andrew wrote: > Proposed: add a root password (managed like a prod password) but also modify policy files so that the Console tab is only visible for people with the admin keystone right. Yep,...
[03:23:08] Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2507274 (chasemp) This issue seems to be prevalent and maybe limited to Jessie hosts but not Jessie hosts built from the same exact image. It has seemed to affect k8s infrastructure but it's entirely possible that a) k8s is the m...
[03:29:07] Labs, Operations: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959#2517865 (chasemp)
[03:29:16] Labs, Operations: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959#2517877 (chasemp) p:Triage>Normal
[05:21:04] PROBLEM - Puppet staleness on tools-worker-1008 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[05:29:35] PROBLEM - Puppet staleness on tools-worker-1005 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[05:37:40] PROBLEM - Puppet staleness on tools-worker-1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[05:51:55] PROBLEM - Puppet staleness on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[07:47:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[08:22:54] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:28:35] Labs: Creating new instances goes into ERROR state - https://phabricator.wikimedia.org/T141966#2518168 (yuvipanda)
[08:31:58] Labs: Creating new instances goes into ERROR state - https://phabricator.wikimedia.org/T141966#2518188 (yuvipanda) tools-worker-1014, which I created after these two, built fine.
[08:58:54] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/PasqualeSignore was created, changed by PasqualeSignore link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/PasqualeSignore edit summary: Created page with "{{Tools Access Request |Justification=Wikimedia Italia related work on tools. |Completed=false |User Name=PasqualeSignore }}"
[09:28:34] hi valhallasw`cloud , can you approve https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/PasqualeSignore ?
[09:28:55] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/PasqualeSignore was modified, changed by Merlijn van Deen link https://wikitech.wikimedia.org/w/index.php?diff=816092 edit summary:
[09:29:20] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/WebIntegrity was modified, changed by Merlijn van Deen link https://wikitech.wikimedia.org/w/index.php?diff=816095 edit summary:
[09:29:33] Nemo_bis: done! (well, once wikitech responds to the request)
[09:30:14] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518306 (Danny_B)
[09:30:15] valhallasw`cloud hi it seems the grrrit-wm bot is not working in -releng
[09:30:17] i get error
[09:30:35] error: prefix=barjavel.freenode.net, server=barjavel.freenode.net, command=err_cannotsendtochan, rawCommand=404, commandType=error, args=[grrrit-wm, #wikimedia-releng, Cannot send to channel]
[09:33:25] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518323 (Legoktm) Ahem, what? They're two separate things, and both are active.
[09:33:48] paladox: I have no clue
[09:33:49] thanks
[09:33:55] Oh
[09:34:00] Hi valhallasw`cloud, thanks for approving me!
[09:34:02] paladox: the 'cannot send to channel' suggests -releng is muted?
[09:34:09] Oh thanks
[09:34:14] Oh yes
[09:34:16] now i remember
[09:36:55] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518328 (Legoktm) Open>declined
[09:40:36] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518333 (Danny_B) Please update the #labs-project-extdist project description then as it is obviously confusing. Thank you.
[09:44:01] Labs, Patch-For-Review: Kill ldapsupportlib.py - https://phabricator.wikimedia.org/T114063#1683588 (valhallasw) >>! In T114063#2504069, @AlexMonk-WMF wrote: > `ldapsearch -x objectClass=posixaccount` should give you the same thing as `ldaplist -l passwd` > `ldapsearch -x uid=krenair` should give you the...
[09:44:50] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518337 (Legoktm) Updated, hopefully that's more clear.
[09:52:11] !log tools.lolrrit-wm restarting grrrit-wm bot to pickup cherry pick of auth change
[09:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[10:12:10] !log tools.lolrrit-wm testing some changes to see whether it fixes T141329
[10:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[10:12:30] T141329: Patchsets created through web interface attributed to the wrong user - https://phabricator.wikimedia.org/T141329
[10:16:02] (Abandoned) Merlijn van Deen: Testing. [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/302348 (owner: Merlijn van Deen)
[10:21:05] (CR) Paladox: "recheck" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: Paladox)
[10:33:35] valhallasw`cloud I've tried to fix the T141329
[10:33:36] T141329: Patchsets created through web interface attributed to the wrong user - https://phabricator.wikimedia.org/T141329
[10:33:43] but it seems it still gets the owner
[10:33:47] no matter what i do
[10:59:06] valhallasw`cloud i wonder what i change to get it working
[10:59:11] it works when using ssh and http
[10:59:22] but doesn't when using gerrit in web editor
[11:00:46] Labs, Patch-For-Review: Clarify public/private role for holmium (aka labs-ns2) - https://phabricator.wikimedia.org/T93639#2518573 (mark) a:mark>Andrew I'm not sure what the purpose of this ticket is. Could you please clarify?
[11:03:32] paladox: look at the event stream from gerrit and see what fields do the right thing?
[11:03:41] Ok
[11:03:45] how do i do that?
[11:03:48] please
[11:09:37] valhallasw`cloud ^^
[11:12:01] ?
[11:24:00] info: Connecting to gerrit..
[11:24:00] throw err;
[11:24:01] ^
[11:24:01] Error: ENOENT, no such file or directory '/secret/ssh-key'
[11:24:04] I get error ^^
[11:24:10] valhallasw`cloud ^^
[11:29:47] paladox: I have no clue.
[11:29:52] Ok
[11:37:04] valhallasw`cloud i filed it upstream https://bugs.chromium.org/p/gerrit/issues/detail?id=4324
[11:37:08] seems to be a bug
[11:37:14] Labs-project-Extdist, MediaWiki-extensions-ExtensionDistributor, Project-Admins: Archive #Labs-project-Extdist - https://phabricator.wikimedia.org/T141969#2518668 (Danny_B) Thank you. That link itself without any further description around was confusing and implicating that former labs tool promoted...
[11:37:20] since it works for ssh and http but not for gerrit inline editing
[13:15:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[13:50:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:06:53] valhallasw`cloud ok i have the stream-events here https://phabricator.wikimedia.org/P3630
[14:06:58] its from a test install
[14:07:10] that's unreadable
[14:07:15] but if you look at line 17, that was me who did that but it says jenkins did i
[14:07:16] it
[14:07:19] when it didn't
[14:08:01] valhallasw`cloud it only says jenkins on line 17 and no mention of admin.
[14:08:14] https://phabricator.wikimedia.org/P3630$17
[14:08:41] paladox: I'm not sure what you're trying to tell me. If you found a bug in gerrit, report it?
[14:08:48] Ok
[14:08:56] I have here https://bugs.chromium.org/p/gerrit/issues/detail?id=4324
[14:09:08] that paste doesn't tell me anything. I don't know what you did, what jenkins did, or what the state of that entire patchset is
[14:09:12] nor do I really care
[14:10:12] Ok
[16:47:40] there are some instabilities with labsdbs s6 and s7 on labsdb1001
[16:48:06] they should not affect you, but I wanted to let you know I detected some lag on non-default hosts
[16:50:16] oh, I see why that happened
[16:50:27] labsdb1001 crashed some hours ago
[16:50:43] it just restarted automatically and I didn't notice at first
[17:15:16] non-default hosts?
[17:34:30] ^ notice anything different about the host? :)
[17:38:41] It uses the instance name
[17:38:43] for cloak
[17:39:05] Labs, Labs-Infrastructure, Patch-For-Review: Support reverse dns for public labs IPs - https://phabricator.wikimedia.org/T104521#2519824 (AlexMonk-WMF) Open>Resolved * labs-morebots (tools.more@instance-tools-exec-1216.tools.wmflabs.org) has joined ^ that used to show an IP :)
[17:39:36] Labs, Labs-Infrastructure: Support reverse dns for public labs IPs - https://phabricator.wikimedia.org/T104521#2519826 (AlexMonk-WMF)
[17:51:01] Krenair: ooooh, rdns
[18:13:53] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2519965 (Dzahn) crons have been created on serpens and seaborgium. they will check once an hour (at a random minute so they are never restarted at the same time) if more than 50% of memory...
[18:15:00] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2519966 (Dzahn) ``` [seaborgium:~] $ /bin/ps -C slapd -o pmem= 3.5 [serpens:~] $ /bin/ps -C slapd -o pmem= 27.1 ```
[19:30:34] Labs, Labs-project-Librarybase, Operations: librarybase project cannot create a proxy for librarybase.wmflabs.org - https://phabricator.wikimedia.org/T131448#2520317 (Harej)
[19:30:53] Labs-project-Librarybase, The-Wikipedia-Library, WikiCite: Initial population of Librarybase - https://phabricator.wikimedia.org/T120115#2520330 (Harej)
[19:31:09] Labs-project-Librarybase, Project-Admins: Create Librarybase Phabricator Component Project - https://phabricator.wikimedia.org/T137091#2520331 (Harej)
[19:31:35] Labs-project-Librarybase, Reports-bot, The-Wikipedia-Library, WikiCite: Create recommendations for databases/journals/websites, by WikiProject for WikiProject X - https://phabricator.wikimedia.org/T111066#2520333 (Harej)
[19:31:59] Labs-project-Librarybase, WikiCite: Open Librarybase SPARQL endpoint to the internet - https://phabricator.wikimedia.org/T123633#2520334 (Harej)
[19:32:16] ...Does Wikibugs ping on any mention of "Labs" even if the main "Labs" project is not mentioned?
[19:32:46] Labs-project-Librarybase, Data-release, Research-and-Data, WikiCite: Retrieve DOI metadata and identify non-resolving DOIs. - https://phabricator.wikimedia.org/T99046#2520335 (Harej)
[19:33:41] harej, this channel has a bunch of phab tags configured in the bot
[19:33:52] these are the ones using regex:
[19:33:53] - Tool-Labs(.*)?
[19:33:54] - Labs(-.*)?
[19:33:54] - Wikimedia-Labs(-.*)?
[19:34:05] I hope I'm not creating a nuisance :(
[19:34:21] nah it's fine
[19:35:01] I just got a shiny new Phabricator tag to play with.
[19:35:20] Labs, Labs-project-Librarybase: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#2520338 (Harej)
[19:35:22] I think we should change it to not have the regex for Labs-(.*)
[19:35:44] Unless IRC really needs to know about something happening in Labs-project-whateverthefuck
[19:36:03] Labs-project-Librarybase, WikiCite: Fix librarybase SPARQL endpoint updater - https://phabricator.wikimedia.org/T121381#2520339 (Harej)
[19:36:15] Labs-project-Librarybase, Labs-project-Wikipedia-Requests, WikiCite: Require sources, format with Citoid - https://phabricator.wikimedia.org/T137044#2520341 (Harej)
[19:36:43] Labs, Labs-project-Librarybase: proxy hostnames containing dots - https://phabricator.wikimedia.org/T129655#2520342 (Harej)
[19:37:56] Labs, Labs-project-Librarybase: Review resource usage for projects with quotas over the default. - https://phabricator.wikimedia.org/T140381#2520343 (Harej)
[19:39:01] yuvipanda: you know how you said the librarybase instance had a failure?
[19:39:23] ah, yes. it is still stuck I think no?
[19:39:34] chasemp ^ was a non-trusty instance stuck btw. I forgot about it
[19:39:40] harej what was the name of the instance again?
[19:39:42] you mean non-jessie?
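A minimal sketch of why every `Labs-project-Librarybase` task above gets announced here, using the three patterns quoted at 19:33. How the bot actually applies its patterns is not shown in the log; whole-string matching with `re.fullmatch` is an assumption, and `tag_matches` and `NARROWED_PATTERNS` are hypothetical names used only for illustration.

```python
import re

# The three regex tags quoted in the channel at 19:33.
CHANNEL_TAG_PATTERNS = [r"Tool-Labs(.*)?", r"Labs(-.*)?", r"Wikimedia-Labs(-.*)?"]

# Hypothetical narrowed list reflecting the 19:35 suggestion to drop the
# "Labs-(.*)" part, so that only the plain "Labs" tag matches.
NARROWED_PATTERNS = [r"Tool-Labs(.*)?", r"Labs", r"Wikimedia-Labs(-.*)?"]


def tag_matches(tag, patterns=CHANNEL_TAG_PATTERNS):
    """Return True if the whole tag matches any pattern (assumed semantics)."""
    return any(re.fullmatch(pattern, tag) for pattern in patterns)


def task_is_announced(tags, patterns=CHANNEL_TAG_PATTERNS):
    """A task is announced if at least one of its project tags matches."""
    return any(tag_matches(tag, patterns) for tag in tags)


if __name__ == "__main__":
    librarybase_task = ["Labs-project-Librarybase", "WikiCite"]
    print(task_is_announced(librarybase_task))                     # current patterns
    print(task_is_announced(librarybase_task, NARROWED_PATTERNS))  # narrowed patterns
```

Under these assumptions the `Labs(-.*)?` pattern is what catches the `Labs-project-*` tags, which is the behaviour being questioned at 19:32.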
[19:39:48] librarybase
[19:40:14] the data's not in trouble, is it?
[19:40:14] chasemp yup
[19:40:17] sorry, non-jessie
[19:40:20] harej nope.
[19:40:24] that's good
[19:40:34] I nearly gave halfak a heart attack when I said the librarybase instance had a failure
[19:40:52] i don't see that instance
[19:40:53] we have
[19:40:55] | fbcc47ae-8855-4132-bd78-f005ccb0ff17 | librarybase-sparql-01 | ACTIVE | public=10.68.20.155 |
[19:40:55] | 4bd57688-ab78-43c6-8f6d-a2dd75b5f48f | librarybase-reston-01 | ACTIVE | public=10.68.18.95
[19:41:02] it's the second I think
[19:41:16] oh that's jessie too
[19:41:17] interesting
[19:41:22] I thought that was trusty since it has mw running on it
[19:41:27] harej, you may wish to keep a backup if you're worried about data getting lost in labs
[19:41:36] it's on labvirt1006
[19:41:50] In principle that's a good idea. What backup workflows have worked for other Labs users?
[19:41:52] is it jessie?
[19:41:54] it also shows
[19:42:00] [8255401.203519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19:42:00] [8255401.204660] INFO: task nscd:4102 blocked for more than 120 seconds.
[19:42:19] yeah, it's jessie.
[19:42:21] not trusty
[19:42:24] harej: it really depends on what you are backing up and how much of it there is
[19:42:33] yuvipanda: yeah so far seems similar
[19:42:38] right
[19:42:40] I would probably chalk this one up too
[19:42:49] chasemp I think I'll just enter this in the list and then let harej reboot?
[19:43:10] and I'm pretty hesitant about putting "freeze" or "unavail" in this bucket w/o some semblance of related symptom not just unavail but yeah
[19:43:16] this is probably whatever the f this is
[19:43:26] yeah, but I think > [8255401.204660] INFO: task nscd:4102 blocked for more than 120 seconds.
[19:43:35] is a good thing to base it off
[19:43:42] I'll amend the task to mention that too
[19:43:44] even that's not totally convincing but yeah decent indicator
[19:43:54] we see that sometimes anyway esp w/ all the weird things tools can do
[19:44:06] and also badly written code etc but considering I do agree
[19:44:36] I looked into the phlogiston things a bit since it has stuck proc issues
[19:44:43] and it's not jessie and it has had issues for a long time
[19:44:49] shrug closest I have. it seems to be some process doing any amount of IO, and IO never returns.
[19:44:55] ah, right.
[19:44:58] yeah that seems unrelated
[19:45:01] but I don't actually think it's related, I think it's just struggling code / resources
[19:45:11] right
[19:45:40] one thing is nscd is pretty light weight almost always, I can't ever recall an nscd io spike anywhere and the same for cron the last time
[19:45:49] makes me think I don't understand the trigger
[19:46:01] Right
[19:46:18] but this jessie thing is seeming more and more interesting
[19:46:56] Right
[19:48:34] https://phabricator.wikimedia.org/P3634
[19:49:12] The jbd thing is also common to most of them
[19:49:26] Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2520386 (chasemp) https://phabricator.wikimedia.org/P3634
[19:49:32] I'm not entirely sure what it does but I know it does it related things
[19:51:17] it's the ext4 journaling I think?
[19:51:30] which could have similar patterns to io cache and flush
[19:51:33] I wonder if that's tunable
[19:51:35] Right
[19:51:55] I agree it comes up often but idk if ...it's active and so being caught more or guilty
[19:52:51] I at least found one blog post about it
[19:53:59] Trying to find it now
[19:54:20] http://fenidik.blogspot.com/2010/03/ext4-disable-journal.html ?
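Since the "blocked for more than 120 seconds" kernel message is proposed above as the indicator to base the tracking on, here is a hedged sketch of pulling those lines out of console or dmesg text. The regular expression is modelled only on the single line quoted in this log; `find_hung_tasks` is a hypothetical helper, and task names containing spaces would not be captured by it.

```python
import re

# Modelled on the line quoted above:
#   [8255401.204660] INFO: task nscd:4102 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds\."
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task report found in the given text."""
    return [
        {
            "uptime": float(match.group("uptime")),    # seconds since boot
            "name": match.group("name"),               # process name, e.g. nscd
            "pid": int(match.group("pid")),
            "seconds": int(match.group("seconds")),    # reporting threshold
        }
        for match in HUNG_TASK_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        '[8255401.203519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" '
        "disables this message.\n"
        "[8255401.204660] INFO: task nscd:4102 blocked for more than 120 seconds.\n"
    )
    for report in find_hung_tasks(sample):
        print(report["name"], report["pid"], report["seconds"])
```

Grouping such reports by process name across instances would be one way to compare the nscd, cron and jbd cases mentioned above, though that goes beyond what the log itself describes.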
[19:54:46] https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
[19:54:55] chasemp ^
[19:55:50] yeah saw this too, but we already set that
[19:55:55] and lower
[19:56:21] Right.
[19:57:37] https://ext4.wiki.kernel.org/index.php/Ext4_Howto#.22No_Journaling.22_mode
[20:00:33] Hello, is there currently a known issue with the SQL replicas?
[20:00:40] we use ext4 elsewhere w/o issue so idk
[20:01:42] thparkth: not to my knowledge
[20:03:23] thparkth: what's the issue you're encountering?
[20:04:04] getting "access denied" connecting to enwiki or frwiki, tried from two different tool lab projects
[20:05:35] The database seems up to me, so that suggests it's a configuration issue. How are you connecting to the database?
[20:05:48] "mysql enwiki"
[20:06:00] thparkth, do it without the 'my'
[20:06:14] seriously? that's it? :D
[20:06:31] or use: mysql --defaults-file=replica.my.cnf -h enwiki.labsdb enwiki
[20:06:39] ~/replica.my.cnf
[20:06:45] I am currently slapping myself with a fish
[20:06:48] also enwiki_p rather than enwiki
[20:06:54] the sql command handles all of this for you
[20:10:18] Labs, Operations, Ops-Access-Requests, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520473 (Dzahn) cc: @ema the pwstore part should also be unblocked now.
[20:10:25] thanks all
[20:20:14] andrewbogott: wanted to thank you, the video project is acting much better recently, keep up the good work!
[20:20:39] cool
[20:20:46] ssd ftw
[20:39:24] Labs, Operations, Ops-Access-Requests, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520557 (madhuvishy) Still need to be added as labs root. This is not done yet
[20:51:17] !log git recreating gerrit-test3 instance
[20:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master
[20:59:10] Quarry: Make a Quarry automatically refresh on a set time interval - https://phabricator.wikimedia.org/T141698#2520629 (yuvipanda) a:yuvipanda>None I too would like this to happen, but don't have time to work on it actively right now though :(
[20:59:27] PAWS: Paws display 502 - Bad gateway error - https://phabricator.wikimedia.org/T140578#2520632 (yuvipanda) Has this happened recently again?
[21:01:34] Can I not set DNS entries to point to a public IP I have allocated to an instance in labs anymore? All I see is web proxies but...that gets in the way....
[21:02:30] ostriches which project is this?
[21:02:53] `staging` -- been making a pristine gerrit-staging environment like prod that lets me test stuff moar :)
[21:03:07] I can only see web proxy options, not domains.
[21:03:19] do you have a floating ip?
[21:03:20] andrewbogott ^ does this need creating a staging.wmflabs.org domain?
[21:03:38] Oh ooohhhh.
[21:03:41] https://horizon.wikimedia.org/project/dns_domains/40285354-be95-4de1-a652-97ce36ef4916/records
[21:03:48] I have a staging.wmflabs.org domain
[21:03:53] I could do a subdomain of that I think
[21:03:56] ah, yes. manage records there?
[21:04:02] yeah
[21:04:10] I was looking for more a subdomain of wmflabs, but this works too
[21:04:10] Thx
[21:05:11] that all changed not too long ago so confusion expected I think
[21:07:59] going to reboot the tools puppetmaster for kernel upgrade, is ok for a few things to fail just now
[21:10:17] !log tools rebooting tools-puppetmaster-01 for kernel upgrade
[21:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:11:02] Labs, Graphite: Install WmfPageview datasource plugin on Labs Grafana install - https://phabricator.wikimedia.org/T120298#2520658 (yuvipanda)
[21:11:07] Labs, Graphite, Patch-For-Review: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#2520656 (yuvipanda) Open>Resolved I'm going to count this as done, and will open a separate ticket for issues.
[21:11:18] chasemp I've a network related puzzle.
[21:11:45] I only have a few minutes but hit me with it
[21:12:17] chasemp a new node (tools-worker-1023) can't connect to tools-puppetmaster. The new node is no different from all other nodes I created.
[21:12:23] there is no ferm involved anywhere
[21:12:26] "The requested instance cannot be launched as you only have 0 of your quota available."
[21:12:28] * ostriches stabs
[21:12:32] (PS1) Krinkle: Use www.wikimedia.org instead of deprecated bits.wikimedia.org [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824
[21:12:34] and it has the same security groups as other things
[21:12:50] yuvipanda@tools-worker-1023:~$ curl tools-puppetmaster-01:8140
[21:12:50] curl: (56) Recv failure: Connection reset by peer
[21:12:51] yuvipanda@tools-worker-1023:~$
[21:13:13] hmm
[21:13:28] chasemp If you only have a few mins, go ahead, I'll just investigate some more
[21:13:29] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:13:58] Also I'm apparently using 5/7 public IPs already :p
[21:14:22] yuvipanda: work telnets tools-puppetmaster-01.eqiad.wmflabs 8140
[21:14:26] works even
[21:15:21] chasemp yup, I restarted puppetmaster which seems to have magically fixed this.
[21:15:22] so nvm
[21:15:22] yeah it's getting bounced back by the master
[21:18:42] PROBLEM - Host tools-worker-1014 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[21:20:04] (CR) MarcoAurelio: "check experimental" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:21:34] PROBLEM - Puppet run on tools-worker-1023 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:22:49] Labs, Labs-Infrastructure: Support reverse dns for public labs IPs - https://phabricator.wikimedia.org/T104521#2520693 (AlexMonk-WMF) This script also sets up instance-$instance.$project.wmflabs.org records where possible, primarily for the benefit of cases where you have an instance with a public IP, bu...
[21:22:57] (CR) MarcoAurelio: [C: 2] Use www.wikimedia.org instead of deprecated bits.wikimedia.org [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:29:33] (Merged) jenkins-bot: Use www.wikimedia.org instead of deprecated bits.wikimedia.org [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:34:00] !log tools rebooting tools-puppetmaster-01 to test a hypothesis
[21:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:34:17] Labs, Operations, Ops-Access-Requests, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520721 (madhuvishy) The labs root thing is all good now. Thanks @yuvipanda
[21:35:30] (CR) MarcoAurelio: "maurelio@tools-bastion-03: Sync. resources/Common.css > Done." [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:35:45] (CR) Krinkle: "Thank you!" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:36:33] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:37:57] (CR) MarcoAurelio: "> Thank you!" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/302824 (owner: Krinkle)
[21:39:13] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:39:42] chasemp ok the 'network problem' is back on another node I just created and I've no clue what's going on...
[21:39:47] ping me if around.
[21:39:59] it can talk to the proxy, the internet, etc, but not to the puppetmaster?!
[21:40:02] and no iptables rules anywhere
[21:40:16] andrewbogott I wonder if this is a security groups issue from the upgrade?
[21:42:04] yuvipanda: there shouldn't be iptables rules, security groups are handled on the network node
[21:42:31] andrewbogott right, so there are no iptables rules *on* the nodes themselves, but I can't communicate between them
[21:42:32] valhallasw`cloud, I've sent an email about it to labs-l... let me know if you notice any issue with tools.wmflabs.org or indeed toolserver.org mail
[21:42:48] yuvipanda: ok, what are the two hosts concerned?
[21:42:55] andrewbogott it just gets 'hung' when I try to hit any port on the other node, which is reminiscent of security groups issues
[21:42:59] (and, what's one that works right?)
[21:43:01] andrewbogott tools-worker-1004 and tools-puppetmaster-01
[21:43:29] andrewbogott tools-worker-1021 works right. try hitting any port on the latter from the former - https://tools-puppetmaster-01:8140 if you want something specific
[21:44:34] yuvipanda: can you ssh to tools-worker-1004?
[21:45:17] andrewbogott yup
[21:45:32] I have a shell there now
[21:45:35] w
[21:47:15] yuvipanda: telnet 10.68.22.61 8140 connects from all three hosts for me
[21:47:58] andrewbogott ok, it works now...
[21:48:05] dammit :(
[21:48:07] It wasn't >10mins ago...
[21:48:30] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:51:44] andrewbogott a new instance I spawned works fine from get go
[21:51:52] I think labs just hates me now.
[21:52:02] that's two transient errors in under 24h
[21:52:11] :(
[21:52:19] I just modified the config again, so expect further hiccups
[21:52:36] I see
[21:52:37] ok
[21:53:09] andrewbogott unrelated, but how to change the ordering of images in wikitech? for tools it has precise set to default rather than jessie...
[21:53:43] hm, I'll check. It's probably wrong everywhere
[21:53:54] right
[21:57:33] Krenair: thanks for the heads-up!
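The curl and telnet checks against port 8140 above can be scripted. This is a minimal sketch, not the tooling anyone in the channel used: `probe_tcp` is a hypothetical helper that only reports how the TCP connect attempt ended. A "timeout" result would be consistent with the "hung" behaviour described at 21:42, while "open" only says the handshake completed and would not rule out a later reset like the curl error at 21:12.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError:
        # Anything else: DNS failure, unreachable network, and so on.
        return "error"


# Example use from a worker node (host name and port taken from the log):
#   print(probe_tcp("tools-puppetmaster-01", 8140))
```

Running the probe in a loop from a freshly built instance and noting when the result changes would be one way to put a number on the suspected delay.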
[21:59:03] andrewbogott tools-worker-1014.eqiad.wmflabs - same thing
[21:59:14] PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[21:59:41] yuvipanda: fixed, I think
[22:00:09] andrewbogott which one? the image ordering or the network thing?
[22:00:15] image ordering
[22:00:31] andrewbogott it's still defaulting to precise in tools
[22:03:17] hm, yeah, multiple things look messed up with wikitech's image selection
[22:03:28] mind making a ticket?
[22:03:43] andrewbogott yup. any idea how to approach looking at the network issue?
[22:04:00] (it's back on tools-worker-1014, which is a new image)
[22:04:01] *instance
[22:04:13] RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:04:20] I'm not sure. I suspect that the network issue is just a delayed effect from restarting nova-network
[22:04:42] unless it continues to start/stop working while I'm not actively messing with the config
[22:04:59] andrewbogott so it's currently not working. what do I do?
[22:06:29] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[22:08:07] !log tools depool & delete tools-worker-1007 and 1008
[22:08:08] yuvipanda: it's working now
[22:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:08:27] so either it's a delay on instance setup, or it was just out of whack for a few minutes when I changed the config...
[22:08:46] this first started happening a few hours ago
[22:08:56] andrewbogott I'm going to create a new instance shortly and we'll see if this recurs
[22:09:01] ok
[22:09:15] RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:11:29] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:12:58] PROBLEM - Host tools-worker-1007 is DOWN: PING CRITICAL - Packet loss = 100%
[22:15:33] yuvipanda: I'm about to go -- is the new instance working ok?
[22:15:42] probably without me around everything will work fine :)
[22:16:19] andrewbogott I'll find out in a few mins :)
[22:16:29] ok. Sorry things are flaky today :(
[22:17:38] andrewbogott it's still in 'BUILD' state
[22:17:51] ah it is out of it now
[22:17:53] let's see
[22:18:15] yuvipanda: Niharika just got a new version of mwoauth lib that will make quarry more awesome. She's wondering how we can get the live app updated to use it.
[22:18:36] bd808 \o/ is there a patch?
[22:18:56] yuvipanda: There is a new version of the library out there.
[22:19:07] aaah
[22:19:12] new version of just the mwoauth library?
[22:19:16] yuvipanda: This fixes the repetitive login prompts that I kinda told you about the other day.
[22:19:17] Yep.
[22:19:22] right
[22:19:34] Niharika: change this in your PR -- https://github.com/wikimedia/analytics-quarry-web/blob/master/requirements.txt#L16
[22:19:44] niharika put up a patch with an updated requirements.txt
[22:20:07] yuvipanda: bd808 Okay!
[22:20:51] yuvipanda: working?
[22:21:43] andrewbogott nope :(
[22:21:59] so there's just an arbitrary 10 minute delay for security groups
[22:22:12] well, gather what info you can and I'll investigate tomorrow
[22:22:18] ok
[22:22:23] :/
[22:22:32] I'll do that
[22:22:41] at least I'm not going mad which was my previous explanation :)
[22:23:32] yuvipanda: https://github.com/wikimedia/analytics-quarry-web/pull/4 (whenever you got time)
[22:25:26] niharika cool! I usually use gerrit for CR, but I'll just move it over, no worries.
[22:25:34] do you know if this is fully backwards compatible?
[22:26:54] Aaron hasn't been pushing tags to the github repo so it's a bit hard to tell
[22:28:22] right
[22:28:36] I don't see anything scary in the git history for it. A few bug fixes
[22:29:19] ok! I'll munge around just now, moment. there's a pending puppet patch that needs to be applied as well
[22:33:38] niharika I updated it! try now?
[22:34:38] I should also move Quarry to tools
[22:35:40] RECOVERY - Host tools-worker-1007 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[22:40:26] PROBLEM - Host tools-worker-1008 is DOWN: CRITICAL - Host Unreachable (10.68.23.83)
[22:43:40] yuvipanda: Sorry, I didn't think of using gerrit. :| My bad.
[22:44:19] niharika no worries! Nobody sane would voluntarily use the Gerrit UI :D
[22:44:24] yuvipanda: bd808: Hmm, it doesn't work as expect.
[22:44:27] expected*
[22:44:31] does it need changes in code?
[22:45:08] yuvipanda: It shouldn't, I think. I haven't used mwoauth for this before. bd808 you know what's up?
[22:46:35] Niharika: hmmm... not sure. Let me look at the package that was released
[22:47:13] yuvipanda, if I read right andrew network gremlins are in play?
[22:47:33] chasemp yeah. it looks like there's a maybe ~10min delay in applying security groups?
[22:47:39] on 1005 I thought we decided to reboot/repool yesterday
[22:47:53] that...is interesting
[22:47:55] chasemp ah, ok. I'll do that later, yeah.
[22:48:12] chasemp yup. I'm going to create another one shortly, do you want to obsere?
[22:48:13] *observe?
[22:48:40] !log tools deleted tools-worker-1005
[22:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:49:02] I am on mobile afk atm (its dinner time here)
[22:49:17] PROBLEM - Host tools-worker-1005 is DOWN: CRITICAL - Host Unreachable (10.68.20.191)
[22:49:55] chasemp oh, right, I keep forgetting. I'll just go ahead and put things on a ticket. sorry!
[22:50:08] Niharika: it's hitting the /authenticate endpoint. Let's track down the grant and see if that's the issue
[22:52:25] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:52:37] RECOVERY - Puppet staleness on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [3600.0]
[22:53:04] bd808:Okay. I thought the grant was good because the text in the prompt says "In order to complete your request, SQL Quarry needs permission to access information on meta.wikimedia.org on your behalf. No changes will be made with your account." which sounds like the basic name/info access we have for copypatrol.
[22:53:26] yeah... it looks like it only has useoauth
[22:55:06] so it needs a new Grant?
[22:55:39] grr... yeah. the grant is an older version.
[22:55:44] or ... I can cheat
[22:56:06] https://github.com/wikimedia/mediawiki-extensions-OAuth/blob/master/frontend/specialpages/SpecialMWOAuth.php#L321
[22:56:08] (cheat! cheat! cheat!)
[22:56:30] so it looks like this doesn't work with grants that were made before the authonly perm was created
[22:56:43] the cheat would be to update the db record
[22:57:06] bd808: Sounds too extreme. What does it take to update the grant nicely?
[22:57:26] I don't think you can update a grant
[22:57:28] you have to create new one
[22:57:35] requesting a new one, getting approved, and changing the keys the app uses
[22:58:00] getting approved is easy enough (or should be)
[22:58:26] I kind of wonder if we shouldn't file a bug and fix this for all of the older grants though
[22:58:35] Gah. bd808 We can also update the oauth code to accept useoauth.
[22:59:17] (The older grant I mean)
[22:59:27] maybe. I'm not 100% sure if that allows anything different. tgr might know
[23:00:32] a "new" grant would have authonlyprivate or authonly instead of useoauth
[23:02:31] I'm not a fan of the obtaining a new grant process.
I'm inclined to say we cheat... [23:02:56] heh. I got one approved in 30 minutes the other day [23:04:40] I;m still not sure how useoauth is different than authonly [23:05:05] it allows you to read pages, for one thing [23:05:38] ah. so it allows api authentication? [23:05:56] makes sense [23:06:14] well, it gives you the read right [23:06:31] (along with a bunch of others which don't do much on their own) [23:06:41] authonly does not give you any rights [23:06:57] *nod* [23:07:17] and authonlyprivate lets you read email address [23:07:21] you can probably still use the API and authenticate but without read rights only a handful of not too interesting modules work [23:08:14] there is probably not much significance for public wikis, but for private ones it is an important difference [23:08:42] yeah. that makes sense [23:08:56] jsut hard to grep for in the code :) [23:09:17] authonlyprivate changes the return of /identify but you still don't get any rights [23:10:16] currently there is no way to both access private data and allow the app to do things [23:10:55] which kind of sucks, we have a bug about it somewhere, but the whole thing is blocked on considerations about privacy policy, IIRC [23:11:57] anyway I can approve new consumers if needed [23:14:32] the other big difference is that you can't use /authenticate with anything other than authonly[private] [23:14:40] Niharika: I think the "right" fix is to ask yuvipanda to request a new grant that is authonly (or authonlyprivate) and then switch the tokens after it is approved [23:14:55] tgr: yeah, that's what we are trying to get to work for quarry [23:15:33] ah, ok, that's the cheat part [23:15:56] yeah. I could just change the grant in the db [23:17:07] you'd have to ask csteipp or someone from security about allowing promptless authentication with useoauth, I am unsure about the security considerations [23:17:34] csteipp does not work for wmf anymore [23:17:40] yeah. 
it seems not right [23:17:59] mutante: he's still the authority on the OAuth extension ;) [23:18:04] he is a good person to ask about oauth security, nevertheless [23:18:05] ok [23:18:10] got it [23:18:40] bd808: I don't see how you could update grants [23:19:07] "1:02:56 a.m. heh. I got one approved in 30 minutes the other day" - yeah, once in a blue moon [23:19:10] tgr: update oauth_registered_consumer set ... where ...; [23:19:23] you create a new consumer, the user has to go through an authorization screen to create a new consumer_acceptance record for that consumer [23:19:46] when you are looking at a db prompt many things are possible (and probably wrong) [23:20:02] oh in Wikimedia production DB? that sounds scary [23:20:33] hence my line about the right way [23:21:36] yuvipanda: It will be totally awesome if you can find the time to request a new grant and update Quarry tokens. It will save quite a few furry kittens. [23:21:37] yeah just propose a new consumer, it's 10% of the time we have spent talking about it :) [23:22:18] ok, what grant option should I select? [23:23:01] yuvipanda: "Authentication only with access to real name and email address via Special:OAuth/identify, no API access." [23:23:39] ok I've requested [23:23:41] now someone should approve [23:26:24] My app already uses that, and it uses mwoauth, so do I need to change anything for that? [23:26:27] ("that" being the authentication only grant) [23:26:41] Argh [23:27:00] I used 2 of "that" :D [23:27:13] tom29739: there is a new mwoath out today. If you upgrade to it it should "just work" [23:27:24] Yay [23:27:31] tom29739: Which app is this? [23:27:59] Niharika: the ircredirector tool on tool labs [23:28:05] Niharika: you should totally write something to wikitech-l about the trick :) [23:28:16] I had to integrate it into flask, it was jerk [23:28:19] *hell [23:28:31] bd808: But this is your trick! You should do that. 
:D [23:28:49] Flask-mwoauth doesn't work with that grant, so I had to roll my own [23:28:55] yuvipanda: approved [23:28:56] Niharika: but you care enough to make the world better! [23:29:28] thanks tgr [23:29:35] * bd808 -> bar down the street [23:30:50] * tom29739 goes to the pub [23:31:53] bd808: I'll do that. [23:31:55] :) [23:38:11] PROBLEM - SSH on tools-docker-builder-03 is CRITICAL: Server answer [23:38:26] thanks tgr [23:51:54] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]