[00:12:28] finally found a way to scrape the old RT ticket HTML.. I am planning to upload that, and in return we can finally remove the Perl application
[00:13:07] and it's the most horrible hack.. a bash script that simulates key strokes, but it works, as opposed to all the FF extensions that can't programmatically "save as" anymore due to the new API
[07:04:05] mutante: any help needed for the reimage stuff? Is that related just to Puppet failing or to the automation itself?
[07:17:31] <_joe_> we just had a spike of response times on the appservers
[07:18:09] <_joe_> having the data separated by db section would make debugging easier
[07:18:54] _joe_: when was the spike?
[07:19:32] <_joe_> 7:11 to 7:15 I'd say
[07:19:34] <_joe_> more or less
[07:19:49] I repooled db1112 (the one from yesterday) at :09
[07:19:52] Maybe it was cold?
[07:20:06] <_joe_> it's possible, yes
[07:20:17] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?panelId=9&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=mw1270:3903&var-method=GET&var-code=200&from=1564643233014&to=1564643786114
[07:20:34] <_joe_> that's /average/ response time :P
[07:21:05] yeah, it was db1112
[07:22:33] Interesting that it got so cold even without mysql getting restarted yesterday
[07:22:43] just from being out for 12h
[07:24:18] <_joe_> marostegui: we will get many such cases in the next few weeks, things we now notice because we're not blind anymore
[07:24:19] <_joe_> :)
[07:24:34] <_joe_> volans: should reimage work?
[07:25:14] _joe_: I also think that big traffic shifts in s3 didn't use to be an issue (they were an issue on bigger wikis), so maybe it's a sign that s3 is getting loaded again and we need to expedite T226950
[07:25:15] T226950: Move more wikis from s3 to s5 - https://phabricator.wikimedia.org/T226950
[07:25:38] _joe_: I'm not aware of any reason it shouldn't, but recently more than one failed because of first puppet run failures, unrelated to the reimage itself
[07:25:49] <_joe_> lies
[07:26:06] I know that bb.lack was testing a reimage for the anycast dnsrec yesterday
[07:27:18] marostegui: was a backup made in those 12h? it might explain the "coldness"
[07:27:30] volans: no, the backup isn't taken from that host
[07:29:45] volans: I will have a dbctl diff for you in a bit, to check that it matches https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/526944/, if you have time for it :)
[07:30:01] marostegui: of course! anytime
[07:30:08] <3
[07:33:05] volans: dbctl config diff ready for you!
[07:34:31] marostegui: LGTM, +1.
[07:34:35] thanks!
[07:46:35] volans: the cumin2001 alert should clear soon, btw; I have checked and everything is the same on codfw.php and in dbctl
[07:47:17] marostegui: ack, if it's too short we can increase the time before it alerts
[07:47:36] for the temporary one it doesn't really matter, it will disappear next week
[07:47:41] yeah
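The dbctl exchange above (repool an instance, generate a config diff, check it against the intended mediawiki-config change, then commit) boils down to a short command sequence. A minimal sketch in bash, assuming dbctl's pool/diff/commit subcommands; the percentage flag and commit message are illustrative, not taken from the log:

    dbctl instance db1112 pool -p 100          # repool the instance (percentage flag is an assumption)
    dbctl config diff                          # review the generated config against the live one
    dbctl config commit -m "Repool db1112"     # apply once the diff matches the expected change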
[14:46:47] FYI, in ~15 min we're going to start work on the codfw mgmt network; it will be unreachable for a bit
[14:47:18] time for reimage some host! :D
[14:47:26] *time to
[14:47:58] your problem :)
[15:26:03] volans: thanks! that was puppet failing on cp hosts, and it was already known by Brandon that those roles fail on the first run
[15:26:13] ack
[15:26:15] thx
[15:48:08] the codfw mgmt work is starting
[16:17:00] hi all, I have migrated a few more systems to puppetmaster1003 (running puppet 5.5). So far there has been no difference except for:
[16:17:03] Notice: /File[/var/lib/puppet/locales/ja]/ensure: created
[16:17:05] Notice: /File[/var/lib/puppet/locales/ja/puppetlabs-stdlib.po]/ensure: defined content as '{md5}805e5d893d2025ad57da8ec0614a6753'
[16:17:19] you can see which servers are pointing to the new master here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/puppetmaster/frontend.yaml#L22-L32
[17:14:46] codfw mgmt work is now done
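For the puppetmaster migration mentioned above, a quick sanity check on a migrated host could look like the sketch below. This assumes a standard puppet agent setup and sudo access; these are generic puppet commands, not the exact procedure used here:

    sudo puppet config print server --section agent   # confirm which master the agent is configured to use
    sudo puppet agent --test --noop                   # dry run to spot unexpected diffs against the new master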