[00:03:59] thanks guys :D http://sn1per-api-tests.wmflabs.org/wiki/Main_Page
[00:05:36] GEOFBOT: :)
[01:29:00] Labs, Database: Provision a labsdb useraccount that can be used to run replica-addusers.pl - https://phabricator.wikimedia.org/T104476#1487205 (yuvipanda) @jcrespo using these credentials from labstore1002 to try to grant grants ends up with: ```Can't connect to MySQL server on 'labsdb1001.eqiad.wmnet'...
[01:50:01] Labs, Database: Provision a labsdb useraccount that can be used to run replica-addusers.pl - https://phabricator.wikimedia.org/T104476#1487225 (yuvipanda) And if I use the root pw (to check rest of script), it fails at executing: ``` GRANT SELECT, SHOW VIEW ON `%\_p`.* TO 's52632'@'%';``` with ```Acc...
[02:01:02] Labs, Database: Provision a labsdb useraccount that can be used to run replica-addusers.pl - https://phabricator.wikimedia.org/T104476#1487238 (yuvipanda) Messing around, even: ```mysql:root@labsdb1001.eqiad.wmnet [(none)]> GRANT SELECT, SHOW VIEW ON enwiki_p.* TO 's52632'@'%'; ERROR 1044 (42000): Access...
[02:04:58] Labs, Database: Provision a labsdb useraccount that can be used to run replica-addusers.pl - https://phabricator.wikimedia.org/T104476#1487242 (yuvipanda) The actual SQL I need to execute for each user is: ``` CREATE USER 's52632'@'%' IDENTIFIED BY 'somepasswordhere'; GRANT SELECT, SHOW VIEW ON `%\...
[02:05:42] PROBLEM - Free space - all mounts on tools-bastion-01 is CRITICAL tools.tools-bastion-01.diskspace.root.byte_percentfree (<50.00%)
[02:06:58] !log tools removed pacct files from tools-bastion-01
[02:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[02:20:42] RECOVERY - Free space - all mounts on tools-bastion-01 is OK All targets OK
[02:21:21] Labs, Database: Provision a labsdb useraccount that can be used to run replica-addusers.pl - https://phabricator.wikimedia.org/T104476#1487263 (Springle) At Yuvi's request on IRC I added `'labsdbadmin'@'10.64.37.7'` with the same permissions/password as @jcrespo added for `'labsdbadmin'@'10.64.37.6'`, sinc...
[02:51:59] labsd badmin
[02:55:43] Labs, Labs-Infrastructure, Labs-Sprint-105, Labs-Sprint-106, and 2 others: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1487301 (scfc) (The rewrite so far only covers service group users ("tool accounts"), for human users this is not fixed yet and this is known, so no n...
[02:56:41] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 33.33% of data above the critical threshold [0.0]
[03:25:47] YuviPanda: I've got a puppet question that you might know something about. How in the heck does Service['ganglia-monitor'] get applied on a prod instance?
[03:26:32] ::ganglia::monitor::service defines it but I can't see that applied anywhere
[03:27:02] I'm getting this failure on a labs instance -- Error: Failed to apply catalog: Could not find dependent Service[ganglia-monitor] for File[/etc/ganglia/conf.d/elasticsearch.pyconf] at /etc/puppet/modules/elasticsearch/manifests/ganglia.pp:8
[03:27:25] because the service isn't in the manifest
[03:27:40] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL tools.tools-webgrid-lighttpd-1404.diskspace.root.byte_percentfree (<50.00%)
[03:30:23] found it!
[03:31:26] ::standard conditionally includes ::ganglia which includes ::ganglia::monitor which has this horrible "include service" in it
[04:39:55] !log stashbot Setup basic vms
[05:06:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[05:20:37] bd808: the conditional should've made it not appear in labs....
[05:20:38] no?
[05:57:28] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1406 is CRITICAL tools.tools-webgrid-lighttpd-1406.diskspace.root.byte_percentfree (<22.22%)
[06:15:00] (CR) Sitic: [C: 2 V: 2] Switch to python 3 [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/227164 (owner: Sitic)
[06:17:22] sitic: nice!
[06:17:23] :)
[06:17:35] Did your webservice-new problem get sorted?
[06:20:11] YuviPanda: well halfak is to blame (only supporting python3 with mediawiki-utilities) ;-)
[06:20:23] sitic: aaah heh :)
[06:20:31] I think he will be happy to hear that :)
[06:20:31] yeah, was a strange issue, I think a stale NFS file handle on one of the webgrid nodes
[06:21:03] gevent was blocking me, they finally added support two weeks ago
[06:21:52] YuviPanda: btw, if you want to have a look: https://gerrit.wikimedia.org/r/#/c/227163/ adds ORES support
[06:22:16] sitic: fight.
[06:22:17] Err
[06:22:19] Right
[06:22:22] (Am on phone)
[06:22:33] :-)
[06:22:47] sitic: hopefully I'll have a non-nfs dependent webservice setup in a few months
[06:23:27] wow, great =)
[06:24:00] sitic: yeah, marathon or kubernetes. Unsure yet
[06:24:04] Should be fun
[06:42:33] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1406 is OK All targets OK
[06:47:41] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1404 is OK All targets OK
[07:38:47] YuviPanda: do I still have to un-1 something?
[07:38:54] valhallasw`cloud: nope
[07:39:00] ok!
[07:57:44] Labs, Labs-Infrastructure, Labs-Sprint-105, Labs-Sprint-106, and 2 others: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1487534 (yuvipanda) So it's mostly fixed for tool accounts now \o/
[08:16:28] <_acs_> re
[08:21:26] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1487542 (jcrespo) Aside from the aggregated totals on Ganglia and Tendril (which lead to discover OOMs on labsdb1002, you can do queries right now on information_schema such as: ``` MariaDB LAB...
[10:07:32] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[12:42:41] Is the channel topic still accurate?
[12:43:13] The crontabs part seems old.
[12:57:19] Katie, I think that was post-NFS failure
[13:18:52] Labs, Tool-Labs, Database: Tool Labs enwiki_p replicated database missing rows - https://phabricator.wikimedia.org/T106470#1488023 (jcrespo) a:jcrespo
[13:32:56] Coren: Is there a diagram or a description of how the database pieces all connect?
[13:35:43] Katie: Don't worry, there will be another error soon, and the topic will change ;)
[13:36:13] a9309131, what are you looking for exactly? Topology, databases?
[13:38:16] jynus: My boss can't believe wikipedia uses SQL because he doesn't think it scales well enough. Something to show him how it's done so that it copes.
[13:38:30] ha
[13:38:34] :p
[13:38:53] 50% of the wikipedia mysql dbas here
[13:39:10] all enwiki writes goes to one server
[13:39:18] that is scaling up
[13:39:53] all uncached reads goes to 2-6 slaves per shard, that is scaling out
[13:40:22] let me see if I can find any pretty graph
[13:49:14] a930913, I do not have a pretty graph for you, but I can share you the production configuration of 1 datacenter: https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/33494f49c75de0e1e0dda7cf4217edd3856ac5eb/wmf-config%2Fdb-eqiad.php
[13:49:56] that holds the >800 wikis
[13:57:05] and https://noc.wikimedia.org/db.php <--- Jynus ;)
[13:57:29] from noc.wikimedia.org there is also https://dbtree.wikimedia.org/
[13:57:42] which shows query per seconds and the master / slave relationships
[13:57:56] not sure whether that last one is actually up to date
[13:58:55] Tool-Labs-tools-Other, Epic: Convert all Labs tools to use cdnjs for static libraries - https://phabricator.wikimedia.org/T103934#1488103 (Ricordisamoa)
[13:59:09] hashar, that is a parsed config, so I do not use it
[13:59:33] and the other, I do, but the version on another internal tool with more sensitive data
[13:59:58] wasn't sure whether you knew about them
[14:00:24] I kinda did, but I am still discovering services every day to be fair
[14:01:01] also, those pretty graphs do not work nice with curl, command line for the win :-)
[14:04:54] Katie: templatelinks sync on enwiki_p is running, now we only have to wait- it may take quite some time
[14:07:49] jynus: hashar: Thanks guys.
[14:09:44] to be fair, I think my job is "easy", i think the best thing at wmf is the load balancing and caching, with takes away most of the problems
[14:27:30] that's correct ;)
[14:39:23] Hey, Can you create a submodule of pywikibot named pywikibase?
[14:39:29] pywikibot/pywikibase
[14:39:39] I'm pulling out Wikibase related stuff
[14:47:25] legoktm: hey around?
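The split described at 13:39 (every write for a wiki goes to one master, uncached reads are spread over several replicas) can be illustrated with a toy router. This is only a minimal sketch of the idea, not MediaWiki's actual load balancer: the class name `ShardRouter`, the verb-based classification, and the host names are all hypothetical.

```python
import random


class ShardRouter:
    """Toy read/write splitter for one shard (hypothetical, for illustration)."""

    # Statement verbs treated as read-only in this sketch.
    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def server_for(self, sql):
        """Return the host that should run `sql`."""
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in self.READ_VERBS and self.replicas:
            # Scaling out: reads can go to any replica of the shard.
            return random.choice(self.replicas)
        # Scaling up: every write goes to the single master.
        return self.master


# Host names below are made up for the example.
router = ShardRouter("db-master", ["db-replica-1", "db-replica-2", "db-replica-3"])
print(router.server_for("UPDATE page SET page_touched = NOW()"))
print(router.server_for("SELECT page_title FROM page LIMIT 1"))
```

A real setup would also weigh replicas by load and replication lag, which this sketch ignores.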
[14:53:01] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1488195 (valhallasw) As for other hosts: tools-exec-1201 has entries as well, but only roughly twice a minute: ``` 10.64.37.10-man F root __ 0.00 secs...
[15:24:34] YuviPanda: hey, around?
[15:25:27] Amir1_: kind of not really
[15:25:27] Sup
[15:25:53] I just want to create a project in pywikibot
[15:26:14] I don't know how to and probably I don't have permission YuviPanda
[15:26:47] I don't know either - shouldn't you ask the pywikibot people?
[15:26:54] Unless you mean a gerrit project?
[15:27:10] In which case there is a request form somewhere on mediawiki.org
[15:30:59] YuviPanda: I'm considering turning on nfs debugging for a second or two on tools-bastion-01, but I'm afraid it could make the entire server unresponsive
[15:32:08] YuviPanda: I meant a gerrit project
[15:32:12] okay, thanks :)
[15:32:40] valhallasw`cloud: we could have DNS point -login to -02 and then start messing around?
[15:32:49] Also brb shower
[15:32:54] hm, maybe, yeah.
[15:33:11] inb4 'dude, where's my screen?'
[15:33:27] valhallasw`cloud: heh :D
[15:33:42] Any running screens on -01?
[15:33:44] (there's quite a few people running processes again, we should do something about that)
[15:33:54] probably
[15:34:07] Ulimits or something
[15:34:13] one screen, three tmuxes
[15:34:20] Ok physically stepping into shower. Brb
[15:34:31] Daily reboots :D
[15:35:06] That'll stop 'em.
[15:36:57] https://www.mediawiki.org/wiki/Git/New_repositories/Requests
[15:43:09] Labs, Tool-Labs, Database: Tool Labs enwiki_p replicated database missing rows - https://phabricator.wikimedia.org/T106470#1488314 (jcrespo) Sync is in progress. Looking better already. ``` MariaDB LABS localhost enwiki > SELECT -> tl_title -> FROM page -> JOIN templatelinks -> ON tl_...
[15:49:12] Labs, Labs-Sprint-107, Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1488338 (coren) a:coren
[15:51:40] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 371 bytes in 0.029 second response time
[15:54:13] Labs, Labs-Infrastructure: Rename labstore module to labs_store or similar - https://phabricator.wikimedia.org/T107160#1488346 (scfc) NEW
[15:55:10] Coren: ^^ shinken alert for toollabs
[15:55:50] * YuviPanda is on phone on way to office
[15:55:54] valhallasw`cloud: ^^
[15:56:20] YuviPanda: {{worksforme}}
[15:56:29] Ok
[16:00:21] Labs, Tool-Labs, Database: Provide replication lag as a database function - https://phabricator.wikimedia.org/T50628#1488359 (jcrespo)
[16:00:27] YuviPanda: Not sure what shinken is whining about, I see nothing wrong with it, and plenty of Magnuses.
[16:00:37] owait.
[16:00:42] Shinken is trying to HTTP
[16:00:47] And /that/ fails.
[16:01:31] Hm. Intermittently.
[16:01:34] * Coren digs further
[16:06:42] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 838325 bytes in 2.402 second response time
[16:06:47] Labs, Security, operations: create-dbusers can be used to clobber existing files on the NFS server - https://phabricator.wikimedia.org/T107161#1488399 (scfc) NEW
[16:07:29] Labs, Labs-Infrastructure: Rename labstore module to labs_store or similar - https://phabricator.wikimedia.org/T107160#1488414 (coren) Open>declined a:coren It used to be named labs_storage but @faidon felt that it should reflect the host naming schemes better (labstoreXXXX) for clarity. So, as d...
[16:14:06] Labs: Find a different backup solution for Wikimetrics - https://phabricator.wikimedia.org/T103001#1488442 (Milimetric) Open>Resolved Thanks, Yuvi. I verified that backup is working perfectly fine again.
We'll stick with /data/project for now because there's no real incentive to do anything else.
[16:17:15] Labs: Find a different backup solution for Wikimetrics - https://phabricator.wikimedia.org/T103001#1488455 (yuvipanda) Cool. I'll re-open when we have a 'real' backup solution available.
[16:17:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[16:19:20] YuviPanda: do you know what’s up with the toollabs page flapping?
[16:19:29] andrewbogott: no... Coren was looking into it
[16:19:34] ok, cool.
[16:19:43] I've to get off to get on BART now, I'll be online in 15
[16:19:51] * andrewbogott should’ve read the backscroll
[16:26:06] As far as I can tell, it was the admin webservice php that went a bit cray-cray.
[16:26:21] I've kicked its butt and now it seems to be okay.
[16:27:33] Aha.
[16:27:42] PHP Fatal error: Out of memory
[16:27:44] is the cause.
[16:29:28] Labs, Labs-Infrastructure: Rename labstore module to labs_store or similar - https://phabricator.wikimedia.org/T107160#1488533 (scfc) Ah, I have faint memories of that discussion (and as I never touch actual iron, that reasoning is usually not on my mind :-)).
[16:44:17] hashar: hey! i'm a little confused on what to do next for the android emulation test job. you said we should get the job finished first but you disabled it.
[16:49:37] Labs, Beta-Cluster, Wikimedia-Logstash, Patch-For-Review: Logstash on beta yields 500 due to NFS outage (can't open /data/project/logstash/.htpasswd) - https://phabricator.wikimedia.org/T102962#1488597 (bd808)
[16:54:41] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 20.00% of data above the critical threshold [0.0]
[16:59:18] Coren: what's your hunch on enabling sunrpc.nfs_debug on tools-bastion-01? If it causes a large system load, can we easily set it back to 'off'?
[16:59:40] it might spam >10k messages per second
[16:59:55] My first question would be "Why?"
[17:00:15] Coren: https://phabricator.wikimedia.org/T107052
[17:00:48] nfs is starting a thread very often (~10k times per second), which floods pacct, but which also indicates an underlying issue
[17:02:57] valhallasw`cloud: I need to read the kernel source a bit first to figure out what the thread is meant to be doing.
[17:04:09] it updates nfs client state, basically
[17:04:45] and as far as I can see, it gets called mostly on errors in the NFS connection
[17:05:24] Labs, Puppet: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488633 (yuvipanda) NEW
[17:06:43] valhallasw`cloud: Yeah, it's not immediately clear what's up. But I'd debug on anything /but/ -bastion-01. Why not -bastion-02 at least?
[17:06:53] Coren: because it's not happening on bastion-02
[17:07:05] At all, or just not as often?
[17:07:11] but I can do a more exhaustive search for other hosts where it's happening
[17:07:34] the manager typically runs once or twice per minute
[17:08:03] * Coren ponders.
[17:08:39] I'd be surprised if it wasn't just a matter of scale - but if you turn it on do -01 make sure it's only briefly and keep a very close eye on it.
[17:09:14] *nod*. My fear is not being able to turn it off anymore because of the load
[17:09:22] Labs, Puppet: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488644 (demon) Some of them shouldn't be labs-only probably ;-)
[17:09:30] let me search for other hosts, maybe there is another one borking
[17:09:36] valhallasw`cloud: foo;sleep 2;not-foo
[17:09:37] :-)
[17:10:42] * valhallasw`cloud nods
[17:11:12] ah, it's also happening on some webgrid-14xx nodes it seems
[17:11:40] yep
[17:12:18] ok, let's do it on tools-webgrid-lighttpd-1401 instead
[17:12:50] Labs, Puppet: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488657 (yuvipanda) Ah, right.
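The "foo;sleep 2;not-foo" suggestion at 17:09 is the usual pattern for switching a noisy debug flag on only briefly. A minimal Python sketch of the same idea follows, assuming the flag is the `sunrpc.nfs_debug` sysctl discussed in this log and that 1023 and 0 are the on and off values (1023 is the value that appears in a later !log entry). The helper names `temporary_sysctl` and `sample_nfs_debug` are hypothetical, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def temporary_sysctl(key, on_value, off_value, run=subprocess.check_call):
    """Set a sysctl key for the duration of the block, then set it back."""
    run(["sysctl", "-w", f"{key}={on_value}"])
    try:
        yield
    finally:
        # Runs even if the body raises, so the flag is not left enabled.
        run(["sysctl", "-w", f"{key}={off_value}"])


def sample_nfs_debug(seconds=2, run=subprocess.check_call):
    """Enable NFS client debugging for `seconds`, then disable it again."""
    with temporary_sysctl("sunrpc.nfs_debug", 1023, 0, run=run):
        time.sleep(seconds)
```

Calling `sample_nfs_debug()` needs root and really changes kernel settings, so it is not invoked here. Like the shell one-liner, this cannot help if the host becomes too loaded to run the second command at all, which was the concern raised at 17:09.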
so if it is applied in both prod and labs it should *not* be a labs only role but use hiera. This is for the growing number of things that are 'deployed' to labs...
[17:12:51] * valhallasw`cloud reads up on draining hosts
[17:13:35] What do we have installed on labs in the way of mysql libraries for python3?
[17:13:45] for tools
[17:14:20] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1488658 (valhallasw) This is also happening on `tools-webgrid-lighttpd-1401` (and some other lighttpd-14xx hosts, and maybe others as well) which is a safer host to debu...
[17:14:38] Krenair: we basically have nothing installed in the way of python3 libraries, I think
[17:14:57] virtualenv, virtualenv, virtualenv
[17:15:21] bah, ok
[17:15:23] nevermind..
[17:15:35] we could probably setup pip to allow per-user installs tho
[17:15:39] * YuviPanda gets off bart
[17:16:15] !log tools disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
[17:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:23:19] Labs, Tool-Labs: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1488674 (GoldenRing) I've followed the instructions above, and I'm reasonably confident I've done it as described. But when I try to start the webservice, I get this in the logs: ``` 2015-07-28 17:22:23: (log.c.166)...
[17:32:15] valhallasw`cloud: A point that may be of note is whether the incidence of the accounting entries falls off after the node has been drained or not.
[17:34:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[17:39:46] Coren: *nod*
[17:40:20] Labs, Tool-Labs: Rewrite the meta_p table populating code to python and have it run on a cron - https://phabricator.wikimedia.org/T107094#1488754 (Krenair) a:Krenair
[17:40:55] Krenair: \o/
[17:41:03] Krenair: python3 and python3-pymysql?
[17:41:07] yep
[17:41:10] nice
[17:41:23] not quite working yet
[17:41:26] but it's getting there
[17:41:42] nice!
[17:43:34] !log tools rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
[17:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:49:14] !log tools Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
[17:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:55:41] >>> req = urllib.request.Request(canon + "/w/api.php?action=query&meta=siteinfo&siprop=general&format=json")
[17:55:41] Traceback (most recent call last):
[17:55:41] File "<stdin>", line 1, in <module>
[17:55:41] AttributeError: 'module' object has no attribute 'request'
[17:55:41] >>> req = urllib.request.Request(canon + "/w/api.php?action=query&meta=siteinfo&siprop=general&format=json")
[17:55:47] >>>
[17:55:49] ... wut?
[17:56:32] works if I import urllib.request as well as urllib
[17:56:37] yes
[17:56:41] OR
[17:56:47] import requests
[17:56:50] requests.get(...)
[18:02:07] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1488828 (valhallasw) The rate did not decrease after draining the server of jobs. Using ``` sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0 `...
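On the AttributeError at 17:55: in Python 3, `import urllib` on its own does not guarantee that the `urllib.request` submodule has been loaded, so the attribute lookup can fail until something imports the submodule. A minimal sketch of the explicit import, building the same siteinfo query without sending it; the helper name `siteinfo_request` and the example base URL are assumptions for illustration.

```python
import urllib.parse
import urllib.request  # import the submodule explicitly; plain "import urllib" is not enough


def siteinfo_request(canon):
    """Build (but do not send) a siteinfo API request for the wiki at `canon`."""
    query = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    })
    return urllib.request.Request(canon + "/w/api.php?" + query)


# Example base URL, chosen only for illustration.
req = siteinfo_request("https://en.wikipedia.org")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the request. The third-party `requests` library mentioned at 17:56 is an alternative if it is installed.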
[18:03:25] bah
[18:04:01] Anyone familiar with using lighttpd + fastcgi + flipflop + flask + python3?
[18:04:31] valhallasw`cloud: That was impressively uninformative
[18:08:31] I'm trying to follow the instructions left by valhallasw'cloud on phab ticket 104374, but always get "child exited with status 13" in the error log.
[18:08:36] Can't make out what I'm doing wrong.
[18:08:57] GoldenRing: the error reporting for lighttpd is typically not very informative, no :(
[18:09:34] Any pointers?
[18:09:44] Either to what's wrong or how to diagnose it.
[18:10:06] you can try running the app.fcgi directly
[18:10:12] to see if that gives you a clear error message
[18:11:00] back in ~20 mins
[18:15:08] Yes, that gives "OSError: [Errno 88] Socket operation on non-socket"
[18:15:41] I've tried googling that, but all the answers I've found amount to, "It does that when you run it directly, but it should work under a web server."
[18:17:28] GoldenRing: hi. let me see if I can help
[18:17:40] Thanks.
[18:29:18] Er, so, any ideas?
[18:29:30] GoldenRing: patch incoming to add native python3 support
[18:30:14] Heh. That would be the best solution, for sure.
[18:30:29] As in, using uwsgi / webservice2?
[18:30:33] GoldenRing: yes
[18:30:39] Nice.
[18:30:50] I'll leave you to it, then.
[18:30:55] Thanks.
[18:48:25] Labs, Tool-Labs: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1489137 (valhallasw) I'm at a loss why this would happen. I've changed the app.fcgi to be completely wrapped by a try-except block, and even that fails. That's consistent with the output: 'Flask server started