[02:28:04] Where can I see the current replication lag?
[02:41:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[02:42:28] harej, https://tools.wmflabs.org/replag/
[02:42:48] oof, is enwiki replag usually that bad?
[02:43:20] that's over 13 hours
[02:43:42] not sure
[03:21:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:31:35] harej, https://phabricator.wikimedia.org/T142310 ?
[03:46:25] Krenair harej: https://phabricator.wikimedia.org/T138954 ? https://phabricator.wikimedia.org/T134203 ? for enwiki_p...
[03:57:48] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/EdouardHue was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=816586 edit summary:
[04:26:25] I think that's the latest one...
[05:06:03] ugh
[06:15:28] Labs-project-Wikistats: wikistats: add tcy.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482793 (Dereckson) ady is there, jam is not.
[06:15:44] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2531973 (Dereckson)
[06:30:05] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 531 bytes in 0.010 second response time
[06:30:21] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2531975 (Dereckson)
[06:35:06] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2531977 (Dereckson) To do on the instance: ```lang=mysql INSERT INTO wikipedias (prefix, lang, loclang) VALUES ("jam", "Jamaican", "Patois"), ("tcy", "Tulu", "ತು...
[06:55:04] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.050 second response time
[06:55:16] PROBLEM - Puppet staleness on tools-proxy-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[07:28:47] (CR) Lokal Profil: [C: 2] Centralize methods to connect to database in one file [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303427 (owner: Jean-Frédéric)
[07:29:25] (Merged) jenkins-bot: Centralize methods to connect to database in one file [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303427 (owner: Jean-Frédéric)
[07:46:17] (CR) Lokal Profil: [C: 1] "Looks good. The only thing I'm unsure about is whether this will correctly load the db username/password (does pywikibot load .my.conf aut" (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303428 (owner: Jean-Frédéric)
[08:02:01] (CR) Lokal Profil: [C: 1] "I'll admit this is a bit beyond me to review." (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303498 (owner: Jean-Frédéric)
[08:02:52] (PS2) Lokal Profil: Replace TestFillTableMonumentsBase by CustomAssertions [labs/tools/heritage] - https://gerrit.wikimedia.org/r/302887
[08:38:40] (PS1) Lokal Profil: Add commonscat mapping for sq.wikipedia [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303517 (https://phabricator.wikimedia.org/T141505)
[08:43:46] (PS2) Lokal Profil: Add commonscat mapping for sq.wikipedia [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303517 (https://phabricator.wikimedia.org/T141505)
[08:44:15] hi! My labs instance "puppet-ema" started cronspamming the puppet project admins because of puppetfails on the instance itself. Is there any way to avoid that? The instance is really just a personal playground, it's really broken more often than not :)
[08:45:38] this should also apply to various other test instances by ops people, maybe let's just skip the alert if some /etc/dont-complain-about-puppet-failures file is present or so?
[08:45:43] ema: not really. Any reason why puppet should fail?
[08:46:18] The problem with puppet failing is that you're also missing all important labs updates/changes, which causes instances to become stuck/unconnectable/etc
[08:48:57] valhallasw`cloud: puppet fails on that instance quite often because that's where I test puppet changes, weird combinations of packages and other unspeakable things
[08:49:24] so in general: 1) puppet fails often there 2) nobody should care about that
[08:51:05] the general issue is that 2) is often not the case: people don't want to handle puppet not running, but at the same time their instance is insufficient 'cattle', so to say
[08:51:10] (this might not be true for you, I realise that)
[08:52:22] so I guess a hiera parameter for this might make sense. That's a public note 'I don't care about this server', although it's still something people might set 'to get rid of the errors', and then they complain their instance is suddenly unreachable
[08:53:36] if hiera('send_puppet_failure_emails', false) {
[08:53:42] apparently we already support that :-)
[08:53:48] oh interesting
[08:54:06] most likely for prod vs labs, but you can set that on labs as well
[08:54:21] (the wikitech hiera:* namespace supports both project-based and host-based hiera config)
[08:55:42] thanks valhallasw`cloud!
[09:06:30] !log tools.heritage Updating pywikibot (37 commits)
[09:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master
[09:07:27] (CR) Legoktm: [C: 2] Add Edit Review Improvements to #collaboration-team [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/303251 (owner: Mattflaschen)
[09:07:43] (Merged) jenkins-bot: Add Edit Review Improvements to #collaboration-team [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/303251 (owner: Mattflaschen)
[09:23:45] Labs, Tool-Labs: p50380g50921 has 20+ open persistent connections to labsdb1001 - https://phabricator.wikimedia.org/T142356#2532157 (jcrespo)
[09:32:53] Labs, Tool-Labs: p50380g50921 has 20+ open persistent connections to labsdb1001 & labsdb1003 - https://phabricator.wikimedia.org/T142356#2532180 (jcrespo)
[09:37:03] Labs, Tool-Labs, Discovery, Maps: p50380g50921 has 20+ open persistent connections to labsdb1001 & labsdb1003 - https://phabricator.wikimedia.org/T142356#2532186 (valhallasw) Last time these connections came from 10.68.16.70, and >>! In T138283#2396104, @valhallasw wrote: > 10.68.16.70 is maps-w...
[09:52:33] Labs, Tool-Labs, Discovery, Maps: p50380g50921 has 20+ open persistent connections to labsdb1001 & labsdb1003 - https://phabricator.wikimedia.org/T142356#2532212 (jcrespo) As users were took no action by T138283, I will limit now the number of concurrent connections.
[09:54:31] Labs, Tool-Labs: s51704 had multiple long-running (~1 hour) concurrent queries before labsdb crashed - https://phabricator.wikimedia.org/T142358#2532214 (jcrespo)
[10:42:55] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Alfa80 was created, changed by Alfa80 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Alfa80 edit summary: Created page with "{{Tools Access Request |Justification=Maintenance/Resurrection of some left-alone tools |Completed=false |User Name=Alfa80 }}"
[11:20:08] (CR) Jean-Frédéric: [C: 2] Add commonscat mapping for sq.wikipedia [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303517 (https://phabricator.wikimedia.org/T141505) (owner: Lokal Profil)
[11:20:45] (Merged) jenkins-bot: Add commonscat mapping for sq.wikipedia [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303517 (https://phabricator.wikimedia.org/T141505) (owner: Lokal Profil)
[11:35:43] Tool-Labs-tools-Other, Discovery, Internet-Archive: Restore connectivity project tools - https://phabricator.wikimedia.org/T106373#2532450 (Danny_B)
[11:41:18] Labs, Tool-Labs, WMF-Legal: Make querycache, querycachetwo and querycache_info tables visible on labs dbs - https://phabricator.wikimedia.org/T65782#2532464 (Danny_B)
[13:32:05] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533032 (MoritzMuehlenhoff)
[13:45:50] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533061 (MoritzMuehlenhoff) We could implement this as such: Initially we need to create a PGP key for the encrypted passwords. The public key would be distributed to the VMs via puppet. The secret key would b...
[14:16:58] Labs: Change puppet nag emails to weekly for intentional disable w/ message - https://phabricator.wikimedia.org/T142374#2533102 (chasemp)
[14:38:52] ema: one problem is we manage global labs wide configs via the same means (ldap, dns, etc) so long term puppet disable isn't possible on Labs either, it creates holes
[14:39:05] we are looking at making it once a week for a run on intentional disable for similar reasons really
[14:39:37] but the nature of labs being centrally managed and testing silo's being only partially isolated creates issues
[14:46:21] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533204 (Andrew) @MoritzMuehlenhoff -- that sounds just like what I was envisioning. Would you be willing to spoon-feed me the commands I need (e.g. generating a key, encrypting/decrypting, etc)?
[14:46:49] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Alfa80 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=816643 edit summary:
[14:49:18] Labs, Horizon: Incorrect quota error when creating instances in some projects - https://phabricator.wikimedia.org/T142379#2533205 (Andrew)
[14:52:18] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2527493 (faidon) Generating the root password locally and passing the (encrypted) root password over the puppet logs doesn't sound that great to me. Why wouldn't we generate the cleartext password on the pup...
[14:54:54] chasemp: I understand! In my case, I do run puppet Very Often on my test instance, but in a few cases it happened that I left it disabled for a few days while I had an experiment in progress. I've set the hiera flag to avoid sending out emails which should be good enough for now
[14:55:31] cool just some background for it and like I said we are looking at a week for intentional disable because 24h is pretty common during the course of normal disable for test business
[14:56:26] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533340 (Andrew) @faidon, my design was based on things I already know how to do :) You're suggesting puppet gymnastics that I don't have any experience with, can you be more specific? (Also, for what it's w...
[15:07:20] Labs: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2533382 (Andrew)
[15:14:14] Labs, Horizon: Incorrect quota error when creating instances in some projects - https://phabricator.wikimedia.org/T142379#2533407 (Andrew) Note that this doesn't happen for all users. I'm yet unclear when it does and doesn't hit; I suspect it has to do with how many projects a given user belongs to.
[15:15:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:16:38] PROBLEM - Puppet run on tools-exec-cyberbot is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:17:30] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533422 (faidon) So, a simple way to do this would be: ``` user { 'root': password => generate('/usr/local/bin/password_for_labs', $fqdn) } ``` Where `/usr/local/bin/password_for_labs` could be a script i...
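The comment at 15:17:30 proposes calling a `password_for_labs` script with the host's FQDN, but the description of that script is cut off in the log. The following is only a minimal sketch of one way such a helper could work, assuming a master secret kept on the puppetmaster and a keyed hash (HMAC-SHA256) over the FQDN; the function name, the secret handling, and the output length are all hypothetical and are not taken from the task. Puppet's `user` resource generally expects an already-hashed password, so a real script would likely need an extra hashing step that is left out here.

```python
import base64
import hashlib
import hmac


def password_for_host(secret: bytes, fqdn: str, length: int = 24) -> str:
    """Derive a stable per-host password from a master secret and an FQDN.

    Hypothetical helper: the same (secret, fqdn) pair always yields the
    same password, and different hosts yield different passwords.
    """
    digest = hmac.new(secret, fqdn.encode("utf-8"), hashlib.sha256).digest()
    # URL-safe base64 keeps the result printable; 32 bytes encode to 44 chars.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


if __name__ == "__main__":
    # Example values only; a real secret would not live in source code.
    print(password_for_host(b"example-master-secret", "host-01.example.org"))
```

Because the value is derived rather than stored, it can be recomputed on demand by anyone holding the master secret, which also means the secret itself becomes the thing to protect.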
[15:17:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:18:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:18:18] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:18:24] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:18:30] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:20:24] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:20:46] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:21:39] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:21:41] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:21:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:22:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:27:32] Uh oh ^
[15:28:28] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:29:50] e: We just had some legitimate bots and people K-Lined, supposedly for spamming?
[15:30:13] yes
[15:30:27] sigyn thought they were spammers, they should be immune now
[15:30:28] Krenair: that's a bot klining
[15:30:35] thanks
[15:30:56] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:31:38] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:31:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:31:40] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:31:52] puppet is pretty broken :(
[15:32:02] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:32:24] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:32:26] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:32:35] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:33:05] it's being looked into
[15:35:07] e: can you unkline the 3 users too? Not sure that they've been unklined
[15:35:30] do you have their nicks?
[15:36:36] e, the nicks are CIG, Danny_B, Luke081515, doctaxon and gifti
[15:37:10] (bit more than 3 nicks)
[15:37:28] oh, they should be unklined
[15:37:57] e thanks, was going to ask what happened, saw a random K-line on one of my bots :/
[15:39:09] shinken-wm wasn't configured as a legitimate bot by Sigyn so she punished it
[15:40:33] Labs, Labs-Infrastructure: Audit the labs infrastructure scripts that depend on LDAP to make sure they are resilient to failover - https://phabricator.wikimedia.org/T142394#2533513 (yuvipanda)
[15:40:49] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533527 (faidon) BTW, [[ https://github.com/duritong/trocla | trocla ]] (and [[ https://github.com/duritong/puppet-trocla | puppet-trocla ]]) might also be of interest here. I haven't given it a close look, bu...
[15:43:23] Labs: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2533382 (faidon) Copying from IRC: ``` 18:21 < paravoid> did you load the right netfilter modules? 18:21 < paravoid> you need a module for netfilter's FORWARD to catch bridged traffic 18:22 < paravoid> br_netfilter for sure, pos...
[15:43:29] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:44:52] Labs: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2533534 (zhuyifei1999) (Just realized why "it's worth it, believe me")
[15:52:25] (PS1) Mattflaschen: Add home page and description. [labs/tools/phabricator-bug-status] - https://gerrit.wikimedia.org/r/303562
[15:53:43] (CR) Mattflaschen: [C: 2 V: 2] Add home page and description. [labs/tools/phabricator-bug-status] - https://gerrit.wikimedia.org/r/303562 (owner: Mattflaschen)
[15:55:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:24] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:39] RECOVERY - Puppet run on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:59] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:03] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:03] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:23] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:39] RECOVERY - Puppet run on tools-web-static-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:41] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:59] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:03] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:19] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:20] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:30] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:32] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:42] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:52] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:58] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:46] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:48] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:54] RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:58] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:40] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:41] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:06] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:29] RECOVERY - Puppet run on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:08:27] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:08:39] RECOVERY - Puppet run on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:39] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:03] RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:23] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:35] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:29:40] Tool-Labs-tools-Pageviews: Convert Sass to Less - https://phabricator.wikimedia.org/T142402#2533727 (MusikAnimal)
[16:29:55] Tool-Labs-tools-Pageviews: Convert Sass to Less - https://phabricator.wikimedia.org/T142402#2533739 (MusikAnimal) p:Triage>Low
[16:35:03] Tool-Labs-tools-Pageviews: Restrict Topviews to showing data only for individual days, weeks, or months - https://phabricator.wikimedia.org/T142403#2533750 (MusikAnimal)
[16:35:20] Tool-Labs-tools-Pageviews: Restrict Topviews to showing data only for individual days, weeks, or months - https://phabricator.wikimedia.org/T142403#2533765 (MusikAnimal) p:Triage>Normal
[16:45:35] Tool-Labs-tools-Pageviews: Create tool description page on wikitech - https://phabricator.wikimedia.org/T142405#2533810 (bd808)
[17:16:42] Labs, Labs-Infrastructure, Patch-For-Review: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2534044 (RobH) >>! In T137924#2525923, @MoritzMuehlenhoff wrote: > The temp host, which was added in https://phabricator.wikimedia.org/rOPUP93cef36f9ed6e2ec1e8ec2f6d5d345a4343e1610 is...
[17:22:10] Labs, Labs-Infrastructure, Patch-For-Review: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2534068 (yuvipanda) Can be reclaimed!
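The Puppet PROBLEM and RECOVERY alerts earlier in this log report what share of recent data points sits above a threshold (for example "22.22% of data above the critical threshold [0.0]"). The sketch below shows one way such a figure could be computed. It is an illustration only: the log does not show how the monitoring check is implemented, and the 1% cutoff in `status` is an assumption inferred from the "Less than 1.00%" wording of the RECOVERY messages.

```python
def percent_above(samples, threshold):
    """Return the percentage of samples strictly above the threshold."""
    if not samples:
        return 0.0
    above = sum(1 for value in samples if value > threshold)
    return 100.0 * above / len(samples)


def status(samples, threshold=0.0, max_percent=1.0):
    """Format an alert line; max_percent is an assumed cutoff, not a known setting."""
    pct = percent_above(samples, threshold)
    if pct < max_percent:
        return "OK: Less than {:.2f}% above the threshold [{}]".format(
            max_percent, threshold)
    return "CRITICAL: {:.2f}% of data above the critical threshold [{}]".format(
        pct, threshold)


if __name__ == "__main__":
    # Nine samples with two failures, a made-up series for illustration.
    print(status([0, 0, 0, 0, 0, 0, 0, 1, 1]))
```

Under this reading, a value such as 22.22% would correspond to two failing samples out of nine, which fits an alert that clears once the failing samples age out of the window.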
[17:22:23] Labs, Labs-Infrastructure, Patch-For-Review: Copy labmon data to new SSDs - https://phabricator.wikimedia.org/T137924#2534069 (RobH) Confirmed he doesnt need the data, so I'll create the task and take care of remote steps to reclaim WMF4724
[18:16:38] Labs: Write some labs tests that monitor login and sudo permissions - https://phabricator.wikimedia.org/T127716#2534327 (Andrew)
[18:24:40] Labs: Designate seems broken in labtest - https://phabricator.wikimedia.org/T142220#2534370 (Andrew) Open>Resolved a:Andrew This is weird, but resolved. Puppet on labtestservices2001 was unable to complete its run (due to some terrible interaction with pdns service scripts) so the designate conf...
[18:33:04] Tool-Labs-tools-Pageviews: Don't add throttling if querying for 10 or less articles - https://phabricator.wikimedia.org/T142326#2534405 (MusikAnimal) a:MusikAnimal
[18:54:03] !log rcm upping ram/cpu quotas per T142311 approval in ops labs meeting
[18:54:04] T142311: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311
[18:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL, Master
[18:55:26] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2534543 (chasemp)
[18:55:28] Labs: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311#2534541 (chasemp) Open>stalled Seems good @Luke081515 for transition, we all agreed that since this is for temporary migration work we will change, put this task in stall, and revert the quota later once yo...
[18:55:42] Labs: Revert: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311#2534544 (chasemp)
[19:06:14] PROBLEM - Host tools-secgroup-test-102 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:30] PROBLEM - Host tools-secgroup-test-103 is DOWN: PING CRITICAL - Packet loss = 100%
[19:08:17] Labs, Labs-Infrastructure: Audit the labs infrastructure scripts that depend on LDAP to make sure they are resilient to failover - https://phabricator.wikimedia.org/T142394#2534556 (yuvipanda)
[19:08:19] Labs, Tool-Labs, Patch-For-Review: Add appropriate timeouts to maintain-kubeusers - https://phabricator.wikimedia.org/T141203#2534558 (yuvipanda)
[19:08:24] PROBLEM - Host tools-secgroup-test-101 is DOWN: PING CRITICAL - Packet loss = 100%
[19:09:14] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2534561 (Dzahn) done, but also: ``` update wikipedias set loclang="ತುಳು ಭಾಷೆ" where prefix="tcy"; ``` ``` Mar...
[19:21:49] Labs, Operations: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2534618 (Andrew) Oh, I did not know about generate() -- that's clearly better than what I was imagining. Encryption-wise... I can't convince myself that it's useful. Root access to the puppetmaster already c...
[19:22:18] Labs, Phlogiston, User-bd808: Create new Phlogiston-01 instance - https://phabricator.wikimedia.org/T142277#2534631 (JAufrecht) a:JAufrecht>bd808 Bryan suggested that I had enough access to do it myself. Unfortunately, the web interface shows: [New instance creation disabled] with a link t...
[19:34:40] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2534684 (Dzahn) I checked the diff between the list of wikipedia prefixes i get from curl https://meta.wikimedia.org/w/api.php?action=sitematrix and the wikist...
[19:35:13] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2534685 (Dzahn) Open>Resolved
[19:37:07] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482793 (Krenair) Should this be added to the Add_a_wiki docs?
[19:37:19] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:46:10] Labs: Revert: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311#2534735 (Luke081515) Ok :). I will setup this instance, if I have time to (I guess unused quota does not affect the servers ;))
[19:54:41] Labs, Phlogiston (Interrupt), User-bd808: Create new Phlogiston-01 instance - https://phabricator.wikimedia.org/T142277#2534768 (JAufrecht)
[19:55:40] Labs, Phlogiston (Interrupt): Phlogiston-1 server is unstable and no longer able to run Phlogiston reports - https://phabricator.wikimedia.org/T141796#2534783 (JAufrecht)
[20:04:32] Labs, Phlogiston (Interrupt): Create new Phlogiston-01 instance - https://phabricator.wikimedia.org/T142277#2534820 (bd808) a:bd808>None Labs instance creations are temporarily disabled while {T142165} is worked on. You'll be able to take care of this yourself once that is resolved.
[20:12:39] Labs: Write some full-stack tests - https://phabricator.wikimedia.org/T142421#2534827 (Danny_B)
[20:39:39] Tool-Labs-tools-Wikidata-Periodic-Table, Wikidata: ptable app is broken again! - https://phabricator.wikimedia.org/T142432#2534868 (ArthurPSmith)
[20:44:49] !log ores depooling ores-web-04
[20:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[20:48:40] Labs, Labs-Infrastructure: Create a new labs flavor available to all project: largedisk - https://phabricator.wikimedia.org/T142166#2534932 (yuvipanda) I'm actually going to create s160.small, which has 160 G of storage but only a single CPU.
[20:57:33] !log git enabling puppet on gerrit-test3 again
[20:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master
[20:57:37] mutante ^^
[20:59:18] paladox: great, add the part where we disabled acme-setup
[20:59:34] Ok
[20:59:48] !log git disabled acme-setup in gerrit-test3
[20:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master
[21:08:23] andrewbogott: hey, tell me if you have some time
[21:08:40] Amir1: I have about 5 minutes now, heading out after that
[21:08:41] what's up?
[21:08:56] andrewbogott: we have an instance that hung up now
[21:08:57] PROBLEM - Puppet run on tools-logs-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:09:05] but we rebuilt this instance several times
[21:09:12] and always it gets issues
[21:09:13] Amir1: what instance and project?
[21:09:23] it's like there is a hardware issue
[21:09:31] ores-web-04 in ores project
[21:09:41] we threw it away and re built it
[21:09:44] it's not just running out of memory?
[21:10:06] nope
[21:10:23] Ami1 is it a small instance
[21:10:28] Amir1
[21:10:37] let me check
[21:10:51] Amir1: log says "Out of memory: Kill process 21030 (uwsgi) score 120 or sacrifice child"
[21:11:17] If it is a small one, that will definitely cause it to stall, it did it for phab-01, we moved that to a medium instance
[21:11:26] other instances are like this, all configs but they don't run out of memory
[21:11:28] the syslog is publicly visible via both wikitech and horizon
[21:11:42] this is strange
[21:12:35] btw. I really love to see this caching issue resolved. I need to log out and login to see my instances every week
[21:12:45] it's really annoying
[21:12:56] Amir1: that bug will never be fixed. Use horizon, it works better.
[21:13:41] Amir1: Of course I don't know anything about the internals of that VM; it could be a memory leak, or it could be that RAM consumption grows with usage and it just has more traffic than it can handle.
[21:13:41] I guess
[21:14:18] the problem is that we have two other instances with the exact same configs, puppet, code, everything
[21:14:29] and it's just ores-web-04 that gets the errors
[21:15:04] you've recreated it though
[21:15:21] so clearly there's nothing actively cursed about this instance since it's not the same instance as it was the last time you hit this problem :)
[21:15:21] and still a problem
[21:15:31] So the only thing that's constant is the name
[21:15:37] exactly
[21:15:52] So probably whatever you're doing for balancing favors that instance so it dies first
[21:15:52] try changing the name?
[21:15:52] is it possible they are getting same hardware
[21:16:07] Yes, i had this problem with phab-02
[21:16:08] Amir1: not likely, it gets scheduled wherever the space is available at any one time.
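The syslog line quoted at 21:10:51 is the kernel's OOM-killer message. When comparing several instances, it can help to pull the process name, PID, and score out of such lines automatically. The sketch below is a minimal parser, assuming the message format matches the one quoted in this log; other kernel versions word the message differently, and the function name is hypothetical.

```python
import re

# Pattern based on the single line quoted in the log above.
OOM_RE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def parse_oom_line(line):
    """Return pid, process name, and score from an OOM-killer line, or None."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "score": int(match.group("score")),
    }


if __name__ == "__main__":
    sample = ("Out of memory: Kill process 21030 (uwsgi) "
              "score 120 or sacrifice child")
    print(parse_oom_line(sample))
```

Counting how often each process name appears across the syslogs of otherwise identical instances would be one way to check whether a single instance really is receiving more load than its peers.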
[21:16:13] it stuck with the same settings
[21:16:18] even after deleting it
[21:16:40] I was looking for this
[21:16:42] thanks
[21:16:59] Amir1: I think the hardware host is public (although only on wikitech)
[21:17:02] on the instance info page
[21:17:07] I have to go now, though, sorry.
[21:17:21] thanks for the help
[21:18:49] i am trying to associate a floating IP with an instance
[21:18:53] i get "Error: Unable to associate floating IP."
[21:19:05] does that tell me i have to raise a quota first?
[21:19:08] or something else
[21:19:17] i have done this before with wikitech but now it's horizon
[21:19:25] chasemp andrewbogott ^^
[21:21:58] probably but what is the project?
[21:22:22] the project is called "git" because in gerrit we dont have permissions for paladox
[21:28:06] chasemp ^^
[21:28:24] | floating_ips | 0
[21:28:33] I'm trying to figure out if https://phabricator.wikimedia.org/T140904 applies
[21:28:36] List current IP addresses nova --os-tenant-id floating-ip-list
[21:28:43] ERROR (CommandError): You must provide a username or user id via --os-username, --os-user-id, env[OS_USERNAME] or env[OS_USER_ID]
[21:29:08] from https://wikitech.wikimedia.org/wiki/OpenStack#Managing_floating_IP_addresses
[21:29:25] i do this on labcontrol1001
[21:30:36] os-tenant-id !+ os-tenant-name .. ah
[21:30:46] well no..
[21:31:13] LOL
[21:32:04] it does not like --os-tenant-id or --os-tenant-name anymore
[21:32:10] Oh
[21:32:17] or i am not supposed to use this control host
[21:32:23] or ...
[21:33:32] mutante: we are asking ppl to make a quick task on quota bumps, and I think even for floating ips it makes sense, maybe more since they have a hard limit. If you could throw something up for https://phabricator.wikimedia.org/T140904 I'll ping andrew to see what he thinks [21:36:52] ok [21:39:13] paladox: maybe we can find the ticket that ostriches used [21:39:15] to get his IP [21:39:23] LOL yeh [21:39:57] and then we can use that [21:40:11] when did he make the instance and get the IP? [21:40:12] but then there's the permission issue [21:40:14] I can't find the task [21:40:15] so not sure [21:40:23] the requirement to make a ticket for such things is very new [21:40:24] Krenair we are testing letsencrypt [21:40:29] and need a public ip [21:40:36] we are using staging. [21:40:53] can't find the task [21:41:09] I don't even think he created a task. [21:41:24] we probably won't get one for the git project [21:41:33] Why? [21:41:55] I guess we will need to go with plan b then [21:41:55] https://gerrit.wikimedia.org/r/#/c/303435/7 [21:41:56] we'd have the same thing in 2 separate projects [21:42:18] lol [21:43:00] hrmm.. [21:43:23] mutante [21:43:24] found it [21:43:25] https://community.letsencrypt.org/t/acquire-and-install-certs-on-reverse-proxy-server-configuration/3093 [21:43:36] Seems they support proxy ^^ by doing that [21:46:16] varnish in deployment-prep behaves like a reverse proxy [21:46:20] i don't think that is the right way to solve it. [21:46:23] we don't do letsencrypt on the deployment-mediawiki boxes [21:46:31] just the deployment-cache ones, with varnish and nginx [21:46:45] that would mean we still have a special case for labs [21:46:53] https://github.com/certbot/certbot/issues/2164 [21:48:22] you want to do lets encrypt behind the labs proxy system? [21:48:30] no :p [21:48:47] which is why you want a public IP, right [21:48:49] ok [21:48:57] we just want to do LE, period. 
[21:48:59] Yep [21:49:02] and not use the labs proxy system [21:49:58] ProxyPass /.well-known/acme-challenge/ ! [21:50:02] mutante ^^ [21:50:03] maybe we want to use it with this server though: [21:50:10] acme-staging.letsencrypt.org [21:50:16] instead of the regular one [21:50:39] Oh [21:53:39] yea, soo.. chad already has an IP in the right project. it really is just permissions on that [21:54:53] yep [22:09:22] ok, so i just learned we are not talking about the "gerrit" project either [22:09:29] but it's actually in "staging" [22:10:27] but "staging" has just a single instance, which is gerrit ? [22:10:42] giving up on the project and permission part [22:11:24] Tool-Labs-tools-Pageviews: Don't add throttling if querying for 10 or less articles - https://phabricator.wikimedia.org/T142326#2535150 (MusikAnimal) Open>Resolved Fixed with https://github.com/MusikAnimal/pageviews/releases/tag/2016.08.08T21.10 [22:18:35] Labs: Request increased quota for labs project - https://phabricator.wikimedia.org/T142440#2535165 (Dzahn) [22:18:50] Labs: Request increased quota (floating-IP) for git labs project - https://phabricator.wikimedia.org/T142440#2535180 (Dzahn) [22:19:35] chasemp: thank you, requested paladox: ditto ^ [22:19:43] Oh thanks [22:19:57] paladox: if that works we can still give it back later [22:20:08] Yep, and thanks [22:39:54] Labs, Labs-Infrastructure: Create a new labs flavor available to all project: largedisk - https://phabricator.wikimedia.org/T142166#2535220 (yuvipanda) Proposed naming convention: cX.mY.sZ, where X is number of CPU cores, Y is GB of RAM, Z is GB of Storage. Pros: 1. Allows us to maintain a matrix of... 
[22:50:10] PROBLEM - Puppet staleness on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [23:01:01] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2535280 (Dzahn) p:Triage>Normal [23:07:29] Labs, Labs-Infrastructure: Creating new instance failed - https://phabricator.wikimedia.org/T136656#2535285 (Dzahn) [23:07:31] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2535284 (Dzahn) [23:07:37] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2526306 (Dzahn) Open>stalled [23:07:39] Labs, Labs-Infrastructure: Creating new instance failed - https://phabricator.wikimedia.org/T136656#2343333 (Dzahn) [23:11:25] Labs, Labs-Infrastructure: Creating new instance failed - https://phabricator.wikimedia.org/T136656#2535306 (yuvipanda) Open>Invalid This is just because I didn't wait long enough before recreating instances of the same name... [23:11:54] PROBLEM - Host tools-logs-02 is DOWN: CRITICAL - Host Unreachable (10.68.21.158) [23:12:34] that's me ^ no worries [23:28:08] PROBLEM - Host tools-worker-1025 is DOWN: CRITICAL - Host Unreachable (10.68.22.147) [23:28:57] ^ is also me [23:29:18] RECOVERY - Host tools-worker-1025 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [23:42:27] Labs, Labs-Infrastructure: Create a new labs flavor available to all project: largedisk - https://phabricator.wikimedia.org/T142166#2535359 (tom29739) But Horizon tells you how much RAM/CPU/storage that flavour gives the instance when you pick the flavour on there [23:54:08] RECOVERY - Puppet run on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [0.0]