[04:30:08] I am going to take over the puppet and the mediawiki config repo, please coordinate with me before pushing anything. I will let you all know once it is fine to merge normally, once I am done with the s8 failover
[05:08:40] Failover was done, deployments can resume as normal
[05:13:08] yay!
[10:12:24] andrewbogott: the backports repo is missing as it did not exist when we updated the puppet policies. happy to add it now it's available https://gerrit.wikimedia.org/r/c/operations/puppet/+/526398
[10:30:53] i just got the following error running a puppet-merge https://phabricator.wikimedia.org/P8822 anyone got any ideas
[10:31:17] volans: could this be related to the conftool updates?
[10:31:55] jbond42: checking
[10:32:13] puppet-merge must be run with sudo -i always
[10:32:25] can you paste the previous 5 lines?
[10:32:40] I'm wondering which data it wanted to add/remove given that I already merged those
[10:32:47] 2019-07-30 10:29:24 [INFO] conftool::load: Adding objects for dbconfig-instance
[10:32:50] 2019-07-30 10:29:24 [INFO] conftool::load: Adding objects for discovery
[10:32:53] 2019-07-30 10:29:24 [INFO] conftool::load: Removing stale objects for discovery
[10:32:56] 2019-07-30 10:29:24 [INFO] conftool::load: Removing stale objects for dbconfig-instance
[10:32:59] 2019-07-30 10:29:24 [INFO] conftool::cleanup: Removing dbconfig-instance with tags eqiad/wikitech
[10:33:18] mmmh why it wanted to remove it...
[10:33:59] not sure, fyi i just checked my history and i have always run without `-i`, will of course change behaviour but just thought i should mention
[10:36:41] jbond42: if you re-merged or are re-merging can you tell me what it tells it's deleting?
[10:37:11] i have a new merge i can do now
[10:38:55] thx
[10:38:59] 2019-07-30 10:38:50 [INFO] conftool::load: Removing stale objects for node
[10:39:03] 2019-07-30 10:38:50 [INFO] conftool::load: Removing stale objects for dbconfig-section
[10:39:06] 2019-07-30 10:38:50 [INFO] conftool::load: Removing stale objects for dbconfig-instance
[10:39:09] 2019-07-30 10:38:51 [INFO] conftool::cleanup: Removing dbconfig-instance with tags eqiad/wikitech
[10:39:12] 2019-07-30 10:38:51 [INFO] conftool::cleanup: Removing dbconfig-instance with tags eqiad/s8
[10:39:15] 2019-07-30 10:38:51 [INFO] conftool::load: Removing stale objects for mwconfig
[10:39:18] 2019-07-30 10:38:51 [INFO] conftool::load: Removing stale objects for discovery
[10:39:21] 2019-07-30 10:38:51 [INFO] conftool::load: Removing stale objects for service
[10:39:33] ok, thanks a lot, debugging
[10:39:38] https://phabricator.wikimedia.org/P8823 <- full output of conftool-merge
[10:42:45] jbond42: problem solved, it was a pebcak during debugging stuff earlier today
[10:43:32] ack thanks
[14:37:51] jbond42: fyi, I attempted to address that conftool thing with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/413745/4/modules/puppetmaster/templates/puppet-merge.erb
[14:38:00] I have not circled back to that but something like that seems in order.
[14:42:26] thanks andrewbogott
[14:49:32] getting some seemingly-random errors on compiler runs about puppetdb access?
[14:49:35] Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, PuppetDB query error: [500] java.sql.SQLTransientConnectionException: PDBReadPool - Connection is not available, request timed out after 3000ms., query: ["and",["=","type","Class"],["=","title","Role::Cache::Canary"],["=","exported",false]] at /srv/jenkins-workspace/puppet-c
[14:49:41] ompiler/17664/production/src/modules/prometheus/manifests/class_config.pp:35:21 at /srv/jenkins-workspace/puppet-compiler/17664/production/src/modules/profile/manifests/prometheus/ops.pp:245 on node bast3002.wikimedia.org
[14:49:49] (a few of those on 2/6 nodes being compiled-for)
[14:50:00] any idea?
[14:54:33] bblack: i have been looking into that, i'm just doing a change now
[14:56:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/526441
[15:15:19] bblack: i just re-ran yours and it's working now https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17668/console
[15:15:48] well working except for 'ERROR: Unable to find facts for host dns4002.ulsfo.wmnet, skipping'
[15:16:59] yeah bad hostname
[15:17:05] thanks!
[15:17:25] now I need to make hiera make sense to myself...
[15:18:19] it seems counterintuitive that base site-level data in hieradata/ulsfo/profile/base.yaml settings for profile::base take precedence over hieradata/role/ulsfo/recursor.yaml's profile::base:: stuff :P
[15:18:46] I would've thought the role+dc -specific one would take precendence over the dc+profile one, in other words
[15:18:53] * bblack goes to read wikitech more
[15:19:45] *precedence heh
[15:20:07] bblack: I would agree with you however that is not how it's configured.
[15:20:51] yes exactly. although some progress has been made to fixing this. the role stuff is no longer its own backend which means we can just reorganise the priorities
[15:21:10] it just means unpicking the yaml configs and making sure everything is moved to the correct place
[15:21:37] this was pretty trivial when i did the common re-org no idea how much harder it would be for site
[15:21:51] I'm hesitant to offer concrete opinions on how it should be reorganised, as that probably requires hours of careful thought I haven't done :)
[15:22:29] to avoid the A/B problem, maybe I should state the original problem:
[15:23:37] bblack: i think the easiest solution for now is to move the config in the role hiera to either regex or hosts
[15:23:37] there's a hieradata setting profile::base::nameservers. I want it to have a site-level default for all of a given site (e.g. ulsfo), and I also want to override that default for specific role+site combos, e.g. role::recursor in ulsfo needs a different setting from everything else in ulsfo.
[15:24:16] but yeah, I could use per-host regexes for the hostnames that are currently said recursors
[15:24:32] or could you just put profile::base::nameservers in common/profile/base.yml
[15:24:39] then you can override in the role
[15:24:39] it's not very future-proof though, and regex.yaml has problems of its own (first match hides further matches if people are doing various unrelated things to the same hosts via regex.yaml)
[15:25:12] but I don't want to set a default for the whole of production yet. I'm doing a limited rollout as the default in 3x edge sites, but not yet the core sites.
[15:25:20] ahh ok
[15:25:43] and yes regex is not ideal. i do plan to try and tackle reorganising the site/role stuff as well at some point
[15:26:59] anyways, I can just use the per-host yaml files for now, it "works" and it will eventually be re-factored better once the rollout is complete.
[15:27:21] thanks! :)
[15:27:33] ack
[15:56:34] https://xkcd.com/2173/
[15:58:34] chaomodus: LOL
[16:05:21] I'm behind on xkcd reading :/
[16:05:37] jaufrecht: btw https://xkcd.com/2180/ :)
[16:05:46] that one is so good
[16:21:13] :D
[17:55:21] I'm going to start the ulsfo router software upgrade, ulsfo is depooled
[17:59:23] ack
[18:39:09] hmm something I never quite realized before
[18:39:42] puppet-compiler can't note any diffs (at all, says no-op) if you change a bunch of source files from files/ which are actually deployed by a file{} which recursively processes a whole directory tree of files.
[18:40:34] e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526177/ changes a bunch of files for the install hosts, yet install2002 says no-op: https://puppet-compiler.wmflabs.org/compiler1001/17672/install2002.wikimedia.org/
[18:40:47] interesting. must be no puppetdb entries for the contents?
[18:41:09] (because /etc/dhcpd and /srv/autoinstall are recursively-copied directories of files)
[18:41:53] cdanis: yeah I'd guess so. it's not really diffing or tracking whole directories when used like this. it's just plastering stuff in locally when the agent runs without any metadata regard to whatever was there before, I think.
[18:53:13] ulsfo routers upgrade done, was very quiet (only the check I forgot to downtime alerted) and quick (took ~10min of router downtime), could almost be done without depooling the site next time.
[18:53:43] Going to wait for ~30min and then repool the site - https://gerrit.wikimedia.org/r/c/operations/dns/+/526508
[18:53:53] just in case of horrible new software bugs? :)
[18:55:18] haha yeah, it's a "special" release after all
[18:55:54] also for all our BGP peers to process/propagate the routing down/up
[19:41:52] alright, 2nd try at merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/526203
[19:46:13] everything looks good this time
[20:05:27] i searched phabricator for "dhclient" to find where we added the Icinga check to make sure there are NO running DHCP clients and i find this: https://phabricator.wikimedia.org/rOPUP5e8cae65da8794e45ac4e298d1d689b85a980b43
[20:05:52] the surprising parts on that page: "Unpublished Commit" and "This repository is still importing." (since 2014 ?)
[20:07:01] i guess that comes from when it wasn't sure if we'd keep using Gerrit or switch to Phab completely
[22:55:10] alright, here are some new info/runbook pages for the icinga checks applied to all hosts in base that were so far linking to empty pages. links created by https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/28/modules/base/manifests/monitoring/host.pp#187
[22:56:15] https://wikitech.wikimedia.org/wiki/Monitoring/dpkg https://wikitech.wikimedia.org/wiki/Monitoring/check_cpufreq https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[22:57:04] https://wikitech.wikimedia.org/wiki/Monitoring/root_disk_space https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
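
Editor's note on the hiera precedence thread (15:18 to 15:27): the behaviour described there, where the site-level value for profile::base::nameservers hides the role-level one and only a per-host file can override it, is what a first-match lookup over an ordered list of layers produces. A minimal sketch in Python, not the real Hiera implementation. The layer order (host, site, role, common) is an assumption inferred from the discussion, regex.yaml is left out, and every value is a hypothetical placeholder.

```python
# Sketch of a first-match hierarchy lookup. The layer order is an assumption
# inferred from the log; the real hiera configuration may differ.
HIERARCHY = ["host", "site", "role", "common"]

KEY = "profile::base::nameservers"


def lookup(key, data):
    """Return (layer, value) from the first layer in HIERARCHY defining key."""
    for layer in HIERARCHY:
        if key in data.get(layer, {}):
            return layer, data[layer][key]
    raise KeyError(key)


# Hypothetical data for a recursor host in ulsfo: both the site layer and the
# role layer define the key, so the site layer is found first and wins.
recursor = {
    "site": {KEY: ["site-default"]},
    "role": {KEY: ["role-override"]},
}
site_wins = lookup(KEY, recursor)

# Adding a per-host layer (the workaround chosen in the log) takes precedence
# over both, because "host" is checked before "site" and "role".
recursor_with_host = {**recursor, "host": {KEY: ["host-override"]}}
host_wins = lookup(KEY, recursor_with_host)

print(site_wins)
print(host_wins)
```

Swapping "site" and "role" in HIERARCHY would give the role+dc override that was expected at 15:18:46, which is roughly the reorganisation of priorities mentioned at 15:20:51.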