[00:02:23] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2743542 (bd808)
[00:04:12] PROBLEM - SSH on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:03] RECOVERY - SSH on tools-webgrid-lighttpd-1401 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[00:21:49] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2743542 (Tgr) The first nonce check is the usual one. The second happens because CentralAuth is not installed so central ID lookup happens...
[00:37:57] PROBLEM - Host tools-docker-builder-01 is DOWN: CRITICAL - Host Unreachable (10.68.19.180)
[00:53:00] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org, Patch-For-Review: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2743646 (Tgr) >>! In T149150#2743597, @Tgr wrote: > ...we can also track nonce checks - if the memcached check for a...
[01:28:03] (CR) Legoktm: [C: 2] Report Cognate to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/317809 (owner: Addshore)
[01:28:24] (Merged) jenkins-bot: Report Cognate to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/317809 (owner: Addshore)
[01:28:32] (CR) Legoktm: [C: 2] Report WMDE team boards to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/317810 (owner: Addshore)
[01:28:52] (Merged) jenkins-bot: Report WMDE team boards to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/317810 (owner: Addshore)
[01:31:09] !log tools.wikibugs Updated channels.yaml to: 06b5d789dc5bb3aa059733595efbcfc50d380d61 Report WMDE team boards to wmde tech chan
[01:31:10] Labs, Labs-Infrastructure, Operations, ops-eqiad: labstore1003 - RAID fail - https://phabricator.wikimedia.org/T149156#2743689 (Dzahn)
[01:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master
[03:34:13] Labs, Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2743778 (Beetstra) @valhallasw The bot yesterday moved to 1216. It is not backlogging, but maybe it is good to make sure other tasks do not run on this instance.
[06:47:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:22:22] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:48:41] Labs, Labs-Infrastructure, DBA, Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2744088 (jcrespo)
[08:48:44] Labs, Labs-Infrastructure, DBA, Patch-For-Review: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2744086 (jcrespo) Open>Resolved This technically works, the only things missing is having multiple proxysql instances per server an...
[08:58:32] (PS1) Ricordisamoa: Support GET requests in get_json() and get_json_cached() [labs/tools/ptable] - https://gerrit.wikimedia.org/r/318054
[09:01:27] Labs, Labs-Infrastructure, DBA: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2744093 (jcrespo)
[09:16:16] Labs, Labs-Infrastructure, DBA: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2744167 (Marostegui) From my side - I am fine with the hostnames.
[10:13:38] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[10:28:41] RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:41:35] Labs, Labs-Infrastructure, DBA: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2744292 (jcrespo) a:jcrespo
[13:00:17] I am still uncertain about the route one should take to deploy a Django app on wmflabs.
Is the current approach to create a Docker container?
[13:14:48] tobias47n9e-c: no. we decided against custom docker containers for deploying to kubernetes. There are a growing shared set of containers that we are using instead.
[13:15:01] tobias47n9e-c: are you wanting python3 or python2?
[13:15:14] python3
[13:15:29] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web/Kubernetes#python_.28uwsgi_.2B_python3.4.29
[13:19:05] there are some instructions for configuring uwsgi for django at https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/uwsgi/
[13:19:47] the uwsgi config file goes in $HOME/www/python/uwsgi.ini
[13:20:11] "python code setup so that your app.py file lives under ~/www/python/src", but what is the app.py in Django?
[13:23:37] hmmm... let's think about this. I think I remember that we figured out how to change the default sometime before
[13:24:57] the django docs are suggesting `--module=mysite.wsgi:application` for something that would be effectively ~/www/python/mysite/wsgi.py
[13:28:50] tobias47n9e-c: I'm digging in the code...
[13:30:05] hmmm... on gridengine we could use uwsgi-plain and give a complete config in the uwsgi.ini
[13:35:05] tobias47n9e-c: it looks like yuvipanda has been very opinionated about the use of ~/www/python/src/app.py as the uwsgi entry point (pretty standard for a flask app). I think that means we are going to have to figure out how to bootstrap a django app using a ~/www/python/src/app.py file
[13:37:23] bd808: ok. at least that gives me a direction to investigate in
[13:40:05] tobias47n9e-c: something like this would be a place to start -- https://phabricator.wikimedia.org/P4309
[13:41:07] that would make the callable be 'app' as the generated config wants
[13:42:12] yuvipanda: I'm still having issues with my PAWS kernel dying unexpectedly. can you ping me when you have a few moments to help me troubleshoot?
I've tried restarting my server and the memory usage isn't going over 800 MB so I'm not hitting the 1 GB limit
[13:42:20] I don't think I'll have time to try it out today, but I'll try to do a kubernetes+django+python3 hello world soon
[13:48:24] Labs, Labs-Kubernetes, Tool-Labs, Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2744806 (bd808)
[13:49:04] tobias47n9e-c: I opened that task ^. If you make progress it would be great to have some notes there.
[13:49:29] I'll try to poke at it myself soon but can't promise much for at least a couple of days
[13:52:41] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org, Patch-For-Review: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2743542 (Anomie) >>! In T149150#2743597, @Tgr wrote: > The first nonce check is the usual one. The second happens be...
[13:53:30] Labs, Labs-Kubernetes, Tool-Labs, Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2744832 (bd808) Creating a `~www/python/src/app.py` file like this would be a starting point to iterate on until the app can be made to work:...
[13:56:17] correct me if I'm wrong but the grid does have access to my tool's files right (read and write)
[13:58:41] Change on www.mediawiki.org a page Wikimedia Labs was modified, changed by 128.177.161.159 link https://www.mediawiki.org/w/index.php?diff=2269558 edit summary: [-66] Zkx
[13:58:45] bd808: Thanks. I will do that.
[14:01:31] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org, Patch-For-Review: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2744874 (bd808) >>! In T149150#2744705, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operati...
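A minimal sketch of the bridge discussed above, assuming the generated uwsgi config imports `~/www/python/src/app.py` and expects a callable named `app`. This is not the content of P4309. `load_wsgi_callable` and the `WSGI_TARGET` variable are hypothetical names, and `mysite.wsgi:application` is only the placeholder from the Django docs quoted at 13:24:57. The default target is the standard library's `wsgiref.simple_server:demo_app`, so the file runs without Django installed.

```python
# app.py: hypothetical bridge. uwsgi is assumed to look for a callable named "app".
import importlib
import os


def load_wsgi_callable(spec):
    """Resolve a 'package.module:callable' string to the WSGI callable it names."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    # Django's generated wsgi.py names its callable "application".
    return getattr(module, attr or "application")


# For a Django project this would be "mysite.wsgi:application" (placeholder
# project name). The stdlib demo app is the default so the sketch runs as-is.
app = load_wsgi_callable(
    os.environ.get("WSGI_TARGET", "wsgiref.simple_server:demo_app")
)
```

With a real Django project, the project package would also have to be importable from `~/www/python/src`; that part is an assumption and untested here.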
[14:02:03] Change on www.mediawiki.org a page Wikimedia Labs was modified, changed by Jianhui67 link https://www.mediawiki.org/w/index.php?diff=2269560 edit summary: [+66] Reverted edits by [[Special:Contributions/128.177.161.159|128.177.161.159]] ([[User talk:128.177.161.159|talk]]) to last revision by [[User:Mainframe98|Mainframe98]]
[14:25:03] Labs, MediaWiki-extensions-OAuth, wikitech.wikimedia.org, Patch-For-Review: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150#2744932 (Anomie) >>! In T149150#2744874, @bd808 wrote: > Maybe this is all just a case of "don't do that" for accide...
[15:36:01] PROBLEM - SSH on tools-webgrid-generic-1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:46:29] Labs, Patch-For-Review: Update or remove certcleaner.py - https://phabricator.wikimedia.org/T146303#2745209 (Andrew) Open>Resolved I cleaned up another pile of i-xxxxx puppet certs and now I think we're in good shape.
[16:01:58] Labs, DBA: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2745255 (Dispenser) Several short comings compared to the previous implementation # No active watcher count ([[https://www.mediawiki.org/wiki/API:Info|visitingwatchers]], i.e....
[16:18:18] Labs, Puppet: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2745291 (Andrew) Open>Resolved Yep, upgraded labcontrol1001 to 3.8.5 and now everything is fine.
[16:23:47] (CR) BryanDavis: [C: 2] jsub: Make trusty the default release target for jsub commands [labs/toollabs] - https://gerrit.wikimedia.org/r/316823 (https://phabricator.wikimedia.org/T143284) (owner: BryanDavis)
[16:25:16] (Merged) jenkins-bot: jsub: Make trusty the default release target for jsub commands [labs/toollabs] - https://gerrit.wikimedia.org/r/316823 (https://phabricator.wikimedia.org/T143284) (owner: BryanDavis)
[16:36:31] Labs, Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2745368 (valhallasw) I have rescheduled the other continuous jobs on the instance. Thanks!
[16:48:08] !log tools Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)
[16:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[16:50:48] !log tools Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
[16:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[16:52:53] !log tools Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
[16:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:05:30] Labs, Tool-Labs, Community-Tech-Tool-Labs, Epic, User-bd808: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#2745469 (bd808) Switched default to trusty at 2016-10-26T16:48Z
[17:05:55] Labs, Tool-Labs, Community-Tech-Tool-Labs, Epic, User-bd808: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#2745473 (bd808)
[17:05:57] Labs, Tool-Labs, Patch-For-Review, User-bd808: Make trusty the default release target for jsub commands - https://phabricator.wikimedia.org/T143284#2745470 (bd808) Open>Resolved a:bd808
https://lists.wikimedia.org/pipermail/labs-announce/2016-October/000171.html
[17:10:09] Labs, Tool-Labs, Community-Tech-Tool-Labs, Epic, User-bd808: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#2745490 (bd808)
[17:10:11] Labs, Tool-Labs: Create more trusty nodes in anticipation of the default for jsub switching to trusty - https://phabricator.wikimedia.org/T147205#2745488 (bd808) Open>Resolved a:Andrew
[17:12:34] Labs, Tool-Labs, Community-Tech-Tool-Labs, Epic, User-bd808: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#2745499 (bd808) With the default switched we are now in the long tail phase of prodding people who have pinned to `-l release=precise` to switch...
[17:17:21] Labs, Tool-Labs, Community-Tech-Tool-Labs: Make a nag system to email maintainers of tools still running on precise gird hosts - https://phabricator.wikimedia.org/T149214#2745525 (bd808)
[18:29:58] heya, i'm trying to get a web proxy addy to work in deployment-prep
[18:30:18] i've added it, and it resolves, but i can't get traffic to my instance through it
[18:32:15] ottomata name?
[18:32:36] evenstreams-beta.wmflabs.org.
[18:32:38] evenstreams-beta.wmflabs.org
[18:32:42] sorry
[18:32:51] eventstreams-beta.wmflabs.org
[18:33:42] sqlite> select backend.url from backend, route where backend.route_id = route.id and route.domain = 'eventstreams-beta.wmflabs.org';
[18:33:42] http://10.68.17.9:6947
[18:34:13] ja that's right
[18:34:37] and, that IP:port works fine
[18:34:38] internally
[18:34:49] 9.17.68.10.in-addr.arpa domain name pointer deployment-kafka04.deployment-prep.eqiad.wmflabs.
[18:34:53] yup
[18:35:20] I can't connect to that port from the proxy
[18:35:27] Sure you put it in the right security group?
[18:35:43] krenair@novaproxy-01:~$ curl -vvv http://10.68.17.9:6947
[18:35:43] * Rebuilt URL to: http://10.68.17.9:6947/
[18:35:43] * Hostname was NOT found in DNS cache
[18:35:44] * Trying 10.68.17.9...
[18:35:46] It's just stuck there
[18:35:48] security group....
[18:36:11] IIRC it's the 'web' security group in deployment-prep
[18:37:04] trying
[18:38:05] added it, does it take a while to have an effect?
[18:39:56] Oh, right
[18:40:04] Yeah that group won't work for this
[18:40:07] oh?
[18:40:07] Forgot you had that weird port
[18:40:25] It only does 80, 443 and 8000
[18:40:48] oh ha
[18:40:51] Is 6947 used for anything else on any other machine?
[18:40:53] i can use 8000
[18:40:57] no, i just picked one
[18:41:06] If so there might be a group for this existing. If not we'll have to make a new one
[18:41:08] ok
[18:41:32] We have space for 3 more security groups. This will be 1.
[18:41:58] After that it becomes a case of weekly meeting reviews to get more.
[18:42:18] 6947? no don't do it, 8000 is fine
[18:42:21] i'm just testing stuff right now
[18:42:24] ok
[18:42:57] yeahhh! thanks that works
[18:43:19] great
[18:44:31] ottomata: I see you went for SSEs :D
[18:44:40] zareen: ouch :( I'm on the way to the office now, I'll look once I get there
[18:47:38] yuvipanda: thanks! here's my paws link: http://paws-public.wmflabs.org/paws-public/45876923/?C=N&O=D
[18:47:56] zareen: ok! what's your wiki username?
[18:48:24] yuvipanda: yup
[18:48:33] yuvipanda: zareenf
[18:48:49] zareen: ok! I'll hopefully be able to look at it in a short while :)
[18:48:54] going afk now
[18:49:52] hmm, maybe i'll try some paws...
[18:49:53] :)
[18:50:00] no js notebook? :)
[18:51:26] ottomata: :P not yet\
[19:30:37] Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2746073 (harej-NIOSH) I have created the infrastructure for logging the play counts in a central database.
Currently I am working on ingesting all the historical...
[21:24:00] zareen: am looking at your paws issues now
[21:30:45] zareen: hmm, I'm wondering if we're just running out of space in the cluster
[21:31:01] causing things to fail well below the 1Gi limit
[21:31:53] zareen: yup that looks like what's happening
[21:32:01] > Oct 26 13:34:18 tools-worker-1023 kernel: [7236022.815841] [] ? pagefault_out_of_memory+0x44/0xc0
[21:32:05] > Oct 26 13:34:18 tools-worker-1023 kernel: [7236022.815886] memory: usage 1048576kB, limit 1048576kB, failcnt 36725
[21:32:17] zareen: so it looks like at some point your memory usage did spike
[21:32:20] and that caused it to die immediately
[21:32:32] zareen: the resource indicator is 5s late, so it wouldn't catch a small spike
[21:32:39] so this isn't the cluster filling up unfortunately ;(
[21:32:46] it's just that our resource limits are too low
[21:33:45] yuvipanda: ah, okay so I won't be able to complete that analysis in PAWS?
[21:33:55] zareen: probably, unfortunately :(
[21:34:08] zareen: I'm going to do some math to see if we can raise the limit
[21:34:15] zareen: if we can then you could
[21:34:31] zareen: I'll probably have an answer to that in 30min or so
[21:34:37] zareen: but there's a good chance you can't
[21:34:38] okay, so I'm just hitting the 1 GB limit and crashing my server?
[21:34:45] zareen: yup
[21:34:58] > Oct 26 13:34:18 tools-worker-1023 kernel: [7236022.815886] memory: usage 1048576kB, limit 1048576kB, failcnt 36725
[21:34:59] specifically shows that
[21:36:28] bummer. okay, if you think we can raise the limit let me know. thanks for looking into that though.
[21:44:19] [23:02:24] * Platonides has quit (K-Lined) /// wait, what ?
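A small sketch for reading the kernel line quoted at 21:34:58, assuming the `memory: usage ...kB, limit ...kB, failcnt ...` layout shown there. `parse_memcg_line` is a hypothetical helper, not part of PAWS or Kubernetes tooling. It checks that the 1048576 kB limit is exactly 1 GiB and that usage had reached it, which is what the 5-second resource indicator missed.

```python
import re

# Pattern for the cgroup memory summary line as it appears in the log above.
MEMCG_RE = re.compile(
    r"memory: usage (?P<usage>\d+)kB, limit (?P<limit>\d+)kB, failcnt (?P<failcnt>\d+)"
)


def parse_memcg_line(line):
    """Return usage and limit in kB plus the fail count, or None if no match."""
    match = MEMCG_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = (
    "Oct 26 13:34:18 tools-worker-1023 kernel: [7236022.815886] "
    "memory: usage 1048576kB, limit 1048576kB, failcnt 36725"
)
stats = parse_memcg_line(line)
limit_gib = stats["limit"] / (1024 * 1024)  # kB -> GiB
at_limit = stats["usage"] >= stats["limit"]
print(f"limit = {limit_gib:.0f} GiB, at limit: {at_limit}")
```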
[21:48:55] * abian has quit (K-Lined)
[21:48:55] * jem has quit (K-Lined)
[22:26:18] Labs, Operations, netops, Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2746728 (Krenair)
[22:59:36] yuvipanda: FYI I'll be upgrading prometheus on tools and merge https://gerrit.wikimedia.org/r/#/c/317880/
[23:03:12] !log tools.heritage Deploy latest from Git master: c943e89, e3ba148, 063f9e2 (T148772 & T148773), 1db9f42, fe119e1, de2e590, 82147b8, f6ff350 & 63a1819 (T140795)
[23:03:15] T148772: Bounding box in documentation should be changed - https://phabricator.wikimedia.org/T148772
[23:03:16] T148773: Error calling search with bounding box - https://phabricator.wikimedia.org/T148773
[23:03:17] T140795: Map sources using Wikidata to wd_item - https://phabricator.wikimedia.org/T140795
[23:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master
[23:06:21] godog: awesome
[23:17:27] !log tools upgrade prometheus on tools-prometheus-02
[23:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[23:18:11] yuvipanda: heh, "ish", it is complaining the metrics storage is messed up, I moved "metrics" out of the way temporarily
[23:18:17] odd though, in beta the upgrade worked
[23:18:31] godog: do we lose all current metrics?
[23:18:36] godog: there's also a tools-prometheus-02 btw
[23:20:27] !log tools Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
[23:20:27] yeah that's where I'm testing
[23:20:28] T143638: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638
[23:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[23:21:02] k8s discovery also changed a bit, will send amendments to puppet
[23:23:32] godog: ok!
[23:25:00] zareen: don't think we can raise limits now :( sorry!
[23:25:12] zareen: we'll hopefully have a paws internal available in a few weeks with much higher limits
[23:36:15] yuvipanda: okay, thanks for checking
[23:38:53] Tool-Labs-tools-stewardbots: StewardBot not logged into irc - https://phabricator.wikimedia.org/T149265#2747044 (Platonides)
[23:41:43] Tool-Labs-tools-stewardbots: StewardBot not logged into irc - https://phabricator.wikimedia.org/T149265#2747065 (Platonides)