[00:28:26] 10DBA, 10Data-Services, 10Patch-For-Review: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#3423430 (10Bawolff) I'm fine with it, provided that it's null when ar_deleted&1=1. This may be mildly paranoia, but I'd like to be str...
[01:07:42] 10DBA, 10Data-Services, 10Patch-For-Review: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#3423529 (10APalmer_WMF) 05Open>03Resolved
[01:08:00] 10DBA, 10Data-Services, 10Patch-For-Review: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#1044220 (10APalmer_WMF) 05Resolved>03Open
[05:01:10] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3423679 (10Marostegui) Nice catch faidon!! Thanks for fixing this and especially thanks for fixing dbstore1001, which is a critical host for us!
[06:55:30] hey y'all :) cya at 1400 for T168584! have a nice day <3
[06:55:30] T168584: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584
[06:56:13] Sleep weel madhuvishy! :)
[06:56:16] well
[07:29:49] 10DBA, 10RESTBase-API, 10Reading List Service, 10ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3423906 (10jcrespo) It is my intention, at some point, to convert titles into first-class entities, give them some ids and that way reduce...
[07:47:02] 10DBA, 10Mail, 10Operations: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3423956 (10jcrespo) a:03herron So we need: db name, account name, grants needed, ips/dns of the origin of the connections.
[07:59:01] 10DBA: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811#3423982 (10Marostegui) Dropped from s6 (frwiki and jawiki)
[08:18:28] 10DBA: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811#3424047 (10Marostegui) Dropped from s5 (dewiki)
[08:41:30] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Operations, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3424125 (10aaron) >>! In T164173#3420723, @daniel wrote: > @aaron another question: does Re...
[09:23:30] 10DBA, 10RESTBase-API, 10Reading List Service, 10ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3424163 (10daniel) >>! In T164990#3421869, @Fjalapeno wrote: > @Daniel we went back and forth on this a bit. I originally proposed ids, but in...
[09:30:48] 10DBA, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3424171 (10elukey) I want to observe how the patch that I merged behaves during the next days before closing.
[09:35:33] jynus: o/ - when you have some time can I ask you a couple of questions about the eventlogging_cleaner user on dbstore1002/db1047 ?
[09:39:52] elukey sure
[09:39:57] can I call you?
[09:40:12] I hate my keyboard
[09:40:24] new one arrives tomorrow
[09:47:48] jynus: sorry I was grabbing a coffee, sure!
[09:48:13] (grabbing my headphones)
[10:12:14] 10DBA: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811#3424257 (10Marostegui) Dropped from s4 (commonswiki)
[10:50:47] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3424434 (10Marostegui) db1067 is done: ``` root@neodymium:/home/marostegui# for i in `cat s1_tables`; do echo $i; mysql --skip-ssl -hdb1067 enwik...
[10:50:50] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3424435 (10Marostegui)
[10:54:53] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3424437 (10Marostegui)
[11:09:57] s1 seems to have replication issues
[11:10:06] as in more lag than usual
[11:10:16] on some hosts in eqiad and codfw
[11:12:11] you seeing that on the aggregated?
[11:13:44] I am seeing it everywhere, tendril, icinga, logstash
[11:13:59] https://logstash.wikimedia.org/goto/2c8c67d873b9b4f7b092951474ede9ef
[11:17:10] I was checking my last scap deployment and it was db1067 just being depooled to pooled with weight 0, at 10:56 as per SAL, so I don't think it could be related
[11:18:03] pools and depools rarely cause lag
[11:18:19] either the whole thing is down or not
[11:18:20] yeah, I was just checking if it could be in any way related
[11:18:39] normally when affecting multiple hosts it is a software cause
[12:31:48] 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3424745 (10mforns)
[13:05:35] 10DBA, 10RESTBase-API, 10Reading List Service, 10ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3424855 (10Tgr) ID change on page undeletion was fixed a while ago (T28123). Deletion and normal recreation still changes the ID (and attempts...
[14:01:43] jynus: morning :) labsdb1004 reboot?
[14:05:18] halfak and me are around
[14:05:35] \o/
[14:07:33] I was waiting for chris to be around
[14:07:53] jynus: for 1004? I thought that was only for 1001 and 1003
[14:08:00] ok
[14:08:05] let's do it, then
[14:08:23] any blockers or anything you have to do beforehand?
[14:08:39] jynus: icinga I think
[14:08:45] yes
[14:08:49] i can set up downtime
[14:08:59] let's disable alerts
[14:09:08] downtimes get lost very frequently
[14:09:25] lately
[14:10:17] jynus: okay, done
[14:11:01] let's move to operations-
[14:11:16] jynus: okay!
[14:22:23] Looks like we're down?
[14:24:57] halfak: see progress in #wikimedia-operations ;)
[14:25:15] :) Yeah. Just saw that after I posted.
[14:27:38] All was successful.
[14:27:40] Thanks folks
[14:27:48] * halfak retreats to his usually scheduled channels.
[14:27:50] o/
[14:28:42] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3425136 (10Marostegui) #cloud-services-team I am about to start an alter table on sanitarium2's master (db1064) which once done will replicate to sanitarium2...
[14:28:50] jynus: o/
[14:28:50] so postgres uses default packages
[14:29:04] which means everything is automatic
[14:29:09] start, stop
[14:29:12] right
[14:29:20] in this case I manually did
[14:29:32] systemctl stop postgresql
[14:29:35] but just to be safe
[14:29:47] ah before rebooting?
[14:29:51] mysql is a bit more involved because we use the same ideas as in production
[14:30:05] yes, but almost 100% sure it is not needed
[14:30:05] but when it came up it automatically started postgres - okay
[14:30:08] yes
[14:30:18] mysql, we like to handle it manually
[14:30:32] that means managed => false on puppet
[14:30:43] except on trivial services
[14:30:44] right, okay
[14:30:53] and automatic stuff is deleted from the package
[14:31:11] so in this case there are 2 important things
[14:31:26] STOP ALL SLAVES before stop
[14:31:45] and manually doing /etc/init.d/mysql stop
[14:31:53] the reason is that
[14:32:08] it can take 30 minutes to shut down a server
[14:32:16] and the os won't wait
[14:32:24] creating in some cases corruption
[14:32:24] ah right
[14:32:46] so init.d is still used on jessie
[14:33:00] stretch finally gives us systemd support
[14:33:12] (mariadb, actually, but you get the idea)
[14:33:30] so systemctl stop mariadb on stretch+
[14:33:52] right, makes sense
[14:33:54] the only extra thing is that I upgraded mysql
[14:34:09] probably something you will not be doing on your own etc
[14:34:17] but just out of completeness
[14:34:30] I started mysql with the slave stopped
[14:34:40] right - maybe i'll be a dba someday ;) go on etc!
[14:34:50] /etc/init.d/mysql start --skip-slave-start
[14:34:59] then mysql_upgrade --skip-ssl
[14:35:21] and then mysql --skip-ssl -> START SLAVE
[14:35:33] on stretch, the skip-slave-start
[14:35:37] will be
[14:35:51] jynus: you ran all this post-reboot?
[14:35:57] yes
[14:36:08] or after mysql start if reboot
[14:36:40] important before pooling it back/m*rk it as active
[14:37:08] systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
[14:37:19] mysql_upgrade --skip-ssl
[14:37:26] systemctl unset-environment MYSQLD_OPTS
[14:37:37] then start the slave the same way
[14:37:46] this should all be on the documentation for mariadb
[14:37:53] so, you stop all slaves before reboot, reboot, and then start back mysql with --skip-slave-start, run the upgrade, and finally start the slaves
[14:37:54] and if it is not, I will add it soon
[14:38:01] exactly
[14:38:07] same here as in production
[14:38:07] cool
[14:38:12] so the idea
[14:38:30] is that it requires a bit more work as a conscious decision
[14:38:40] normally the package does that every time
[14:38:47] but that is dangerous in some cases
[14:39:05] here we control when to do it, what kind of maintenance to do, etc.
[14:39:20] not start it automatically
[14:39:29] oh okay understood
[14:39:31] because if it crashed, we want to check it first
[14:39:50] just wanted to share what I just did- maybe not useful
[14:40:04] definitely useful :) thank you
[14:40:50] I will now undo the alerts disabling
[14:41:04] jynus: awesome thanks
[14:41:23] jynus: for the upcoming 1001 and 1003 reboots, Chris is back, do you think we should do one of them thursday?
[14:41:34] mmm
[14:41:44] is manuel around?
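[Editor's note] The stop/upgrade/start sequence jynus spells out above can be collected into a single dry-run sketch. This is only a summary of the conversation, not official tooling: the `run()` wrapper (which merely echoes each command) and the `DISTRO` switch are illustrative additions, while the actual commands and the `--skip-ssl` flags are taken verbatim from the log.

```shell
#!/bin/bash
# Dry-run sketch of the manual MariaDB reboot/upgrade procedure described
# above. run() only echoes; on a real host the commands would be executed.
run() { echo "would run: $*"; }

reboot_sequence() {
    # Before the reboot: stop replication first, then stop mysqld manually.
    # Shutdown can take ~30 minutes and the OS won't wait, risking corruption.
    run "mysql --skip-ssl -e 'STOP ALL SLAVES'"
    run "/etc/init.d/mysql stop"   # jessie; 'systemctl stop mariadb' on stretch+

    # After the reboot (when mysql was also upgraded): start with the
    # replication threads stopped, run mysql_upgrade, then resume.
    if [ "${DISTRO:-jessie}" = "jessie" ]; then
        run "/etc/init.d/mysql start --skip-slave-start"
    else
        # stretch+: mariadb is a systemd unit, so the option is passed
        # via the unit environment rather than on the command line
        run "systemctl set-environment MYSQLD_OPTS=--skip-slave-start"
        run "systemctl start mariadb"
        run "systemctl unset-environment MYSQLD_OPTS"
    fi
    run "mysql_upgrade --skip-ssl"
    run "mysql --skip-ssl -e 'START SLAVE'"  # only then repool the host
}

reboot_sequence
```

The jessie/stretch branch mirrors the distinction made in the conversation: on jessie the init script accepts extra mysqld options directly, while on stretch+ they go through `MYSQLD_OPTS` in the systemd environment.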
[14:41:48] i'm happy to wait till next week too
[14:42:05] to make sure the dust settles after tomorrow's reboot
[14:42:06] my biggest concern is what if they do not come back
[14:42:26] I think manuel was almost finishing the new hosts
[14:42:33] and don't want to pressure him
[14:42:41] but that would be a great plan B
[14:42:41] Yeah, "almost"
[14:42:43] plus
[14:42:50] announcement
[14:43:19] in the ideal world, we should be able to send things 1 day in advance
[14:43:20] if these are being replaced with new hosts and if there's a real risk of those not coming back, let's rather avoid the reboots of 1001/1003
[14:43:32] moritzm: but if we keep doing that
[14:43:37] we will never know
[14:43:42] plus, the replacements
[14:43:54] I am "almost" done with the new hosts, but it can take some days until they have imported the remaining 3 shards
[14:43:57] are not going to be 100% ready
[14:44:01] yeah
[14:44:11] ok, I thought this was a drop-in replacement of some sort
[14:44:14] so I wanted a collective decision
[14:44:17] moritzm: not at all
[14:44:19] same service
[14:44:21] but many changes
[14:44:39] eventual replacement, but the user side of things is problematic
[14:45:13] I don't know, maybe I am too negative
[14:45:20] if we can wait on reboot we should
[14:45:34] what are the chances also
[14:45:37] 404 and 950 days of uptime on 5-ish year old servers?
[14:45:42] of both not coming back?
[14:45:53] Yeah chasemp, that is the thing :(
[14:46:01] if they haven't been rebooted in that long, we have no idea I guess
[14:46:05] If I had to reboot one, I would try the 404 uptime one first
[14:46:17] and we can't lose a single disk iirc it's raid 0 for data
[14:46:24] not only that
[14:46:29] before we run out of space
[14:46:48] we enlarged the disk with a plain partition
[14:46:53] disks are the most likely thing to fail on a reboot after that long
[14:46:57] yeah
[14:47:01] that is my fear
[14:47:09] If we can help it I vote wait
[14:47:10] that is my fear, i do not want to be a jerk
[14:47:17] in a few weeks we would have a viable plan b even if painful?
[14:47:18] it seems like we should hold off to me too
[14:47:21] but this is the part where we give you the facts
[14:47:34] and you kind of value them and take a decision
[14:47:38] as long as moritzm says we can afford to hold off we should
[14:47:56] we are already walking the tightrope let's not also juggle chainsaws :)
[14:47:57] being a devil's advocate
[14:48:01] on the other side
[14:48:18] they are the most common ones to be breached for security issues
[14:48:25] and most needing an upgrade
[14:48:39] understood
[14:49:11] when we finish setting up 9-11
[14:49:22] marostegui: are we something like +/- a month from all replicas on 9-11?
[14:49:39] we should announce them right away
[14:49:43] as beta
[14:49:46] or something
[14:49:51] opt-in
[14:49:57] agreed
[14:50:01] and try to alleviate some load
[14:50:06] chasemp: I would say less than a month
[14:50:10] that will help
[14:50:28] so, these servers have the immediate attack vectors plugged (glibc ld.so, exim and sudo), the reboot is required to fix the underlying problem on the kernel level
[14:51:13] right, okay
[14:51:44] if there's a risk of some data loss (and raid0 for the data partition sounds a bit like it), we can also hold this back, but it would be good if we set up the new servers in a way that allows us to also reboot these servers in the future
[14:52:09] moritzm: sounds good, thank you
[14:52:17] moritzm: the new ones
[14:52:21] have that
[14:52:40] we have haproxy in the middle, and it is extremely simple
[14:52:47] let's avoid trouble then, and fix this by migrating to the new servers
[14:52:50] to the point that it was done weeks ago
[14:53:26] but we need to compromise to push that
[14:53:33] once it is ready
[14:54:05] which is when we give cloud the keys to the Porsche and wish you good luck :-D
[14:54:18] hahahaha
[14:54:29] then we crash it and ask our parents for another
[14:54:33] ok great: summary - We wait on labsdb1001 and 1003 reboots for now, wait for labsdb1009-11 to be available as functional alternatives for users, and then when the usage for 1001 and 1003 has dropped somewhat, we schedule reboots for 1 and 3.
[14:54:34] he he
[14:54:36] :D
[14:55:09] madhuvishy: that is my understanding as well, and that should be able to happen within the next 6 weeks it sounds like
[14:55:24] right, I'll update on task :)
[14:55:42] chasemp: it is also a good reason
[14:55:49] to push people to adopt them
[14:55:56] 10DBA, 10Operations, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3425238 (10madhuvishy)
[14:56:11] "they will be decommissioned no later than X due to hardware and security concerns"
[14:56:17] definitely, stick and carrot
[14:56:39] you should definitely start thinking about the user dbs
[14:56:52] maybe with the new VM hosts
[14:57:06] (not sure how that is going)
[14:57:16] I think the policy decision has to be made before the technical ones on the user dbs there
[14:57:21] and we haven't had a real final thought on it
[14:57:28] jynus: those are ordered and that's about it
[14:57:30] I know
[14:57:37] just a reminder
[14:57:49] so we do not leave those decisions for the last minute
[14:57:56] everything in me wants to say no surviving user dbs on replicas
[14:57:58] right
[14:58:02] yeah
[14:58:08] but....
[14:58:21] (ellipsis, just ellipsis)
[14:58:33] this is where we make bryan the bad guy and stand behind him while the arrows fly
[14:58:34] :)
[14:58:45] j/k idk what we'll do
[14:59:00] 10DBA, 10Operations, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3425254 (10madhuvishy) Status: labsdb1005 reboot is scheduled for July 12 at 1400 UTC. We've decided to wait on labsdb1001 and 1003 reboots for now - given t...
[14:59:30] jynus: will you two be in montreal by chance?
[14:59:39] good time to sit down and talk since it's so nuanced
[14:59:49] I will be around
[14:59:52] Jaime won't
[15:00:28] let's plan on bouncing some ideas and reporting back marostegui
[15:00:39] sure!
[15:00:40] if jynus is ok w/ that
[15:00:44] I would involve some prominent users
[15:00:49] as in
[15:00:58] asking how they use it
[15:01:04] kinda what I'm thinking too, we can set up a table to take concerns possibly
[15:01:07] and maybe 90% are trivial changes
[15:01:20] we have a few "posters" and designated appearances we can use to recruit some use cases
[15:01:26] like people using replicas because they do not know better
[15:01:44] or one-time usages
[15:01:51] halfak is a good proxy for advanced users too so I want to bend his ear a bit
[15:01:55] which could maybe be solved
[15:03:01] thanks jynus marostegui moritzm madhuvishy :) brb before a meeting
[15:03:39] yup, thanks y'all :) /me -> meeting too
[15:05:29] marostegui: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1096
[15:06:17] nice!!!
[15:11:49] 10DBA, 10RESTBase-API, 10Reading List Service, 10ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3425302 (10Fjalapeno) @tgr thanks… I forgot to mention the summary lookup in Cassandra which doesn't support ids
[15:34:48] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425420 (10Cmjohnson) @jcrespo: the issue should be resolved. The cable was in the wrong eth port. Confirmed MAC cmjohnson@asw-b-eqiad> ... ethernet-switching table brief |grep ge-5/0/5...
[15:41:00] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425465 (10jcrespo) May I ask you to check db1100, db1104 and db1105- probably the same issue.
[15:42:54] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425480 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1098.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-re...
[15:49:21] 10DBA, 10Patch-For-Review: Refactor puppet mariadb class to support multi-instance hosts - https://phabricator.wikimedia.org/T169514#3400308 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1096.eqiad.wmnet'] ``` The log can be found in `/var/log/...
[15:50:37] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425524 (10jcrespo)
[16:11:48] 10DBA, 10Patch-For-Review: Refactor puppet mariadb class to support multi-instance hosts - https://phabricator.wikimedia.org/T169514#3425594 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1096.eqiad.wmnet'] ``` and were **ALL** successful.
[16:31:58] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425748 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1098.eqiad.wmnet'] ``` and were **ALL** successful.
[16:44:40] 10DBA, 10Analytics-EventLogging, 10Analytics-Kanban, 10Contributors-Analysis, and 5 others: Drop tables with bad data: mediawiki_page_create_1 mediawiki_revision_create_1 - https://phabricator.wikimedia.org/T169781#3425892 (10kaldari) 05Open>03Resolved Looks good to me!
[16:45:01] 10DBA, 10Analytics-EventLogging, 10Analytics-Kanban, 10Contributors-Analysis, and 4 others: Drop tables with bad data: mediawiki_page_create_1 mediawiki_revision_create_1 - https://phabricator.wikimedia.org/T169781#3425895 (10kaldari)
[16:59:02] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425966 (10jcrespo)
[16:59:11] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10jcrespo)
[18:42:35] 10DBA, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Create a user for the eventlogging_cleaner script on the analytics slaves - https://phabricator.wikimedia.org/T170118#3426549 (10Nuria)
[19:00:23] 10DBA, 10Analytics, 10Analytics-EventLogging: dbstore1002 crashed - https://phabricator.wikimedia.org/T170308#3426597 (10jcrespo)
[19:06:24] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3426638 (10Cmjohnson) @jcrespo db1100, 1105 were the same issue db1104 is something else. I will update once I figure it out
[19:19:00] 10DBA, 10Analytics, 10Analytics-EventLogging: dbstore1002 crashed - https://phabricator.wikimedia.org/T170308#3426698 (10Marostegui) Yeah, big alter on s1 tables (adding PK) was running at the time :-(
[19:20:58] 10DBA, 10Analytics, 10Analytics-EventLogging: dbstore1002 crashed - https://phabricator.wikimedia.org/T170308#3426713 (10Marostegui) I will try the alters tomorrow to see if they go thru or if this host cannot cope with such big ones (which will be worrying)
[19:37:52] 10DBA, 10Analytics, 10Analytics-EventLogging: dbstore1002 crashed - https://phabricator.wikimedia.org/T170308#3426812 (10jcrespo) At least x1 broke- no time to reimport now.
[19:38:59] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3426824 (10Cmjohnson) @jcrespo db1104 is fixed, vlan conflict.
[19:40:13] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3426839 (10jcrespo)
[19:42:03] 10DBA, 10Mail, 10Operations: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3426863 (10herron)
[19:43:18] 10DBA, 10Analytics, 10Security, 10Wikimedia-Incident: MySQL password for research@analytics-store.eqiad.wmnet publicly revealed - https://phabricator.wikimedia.org/T170066#3426873 (10Legoktm)
[20:05:51] 10DBA, 10Analytics, 10Analytics-EventLogging: dbstore1002 crashed - https://phabricator.wikimedia.org/T170308#3427081 (10Marostegui) Oh if a shard at least broke, then I won't try this alter again as it could corrupt another shard and we might need to even reimport it. We will need to skip this host and leav...
[20:24:48] 10DBA, 10Patch-For-Review: Refactor puppet mariadb class to support multi-instance hosts - https://phabricator.wikimedia.org/T169514#3427276 (10jcrespo) So after lots of changes, db1096 is running right now with 7 mysql instances (they are empty), usable and with icinga monitoring. `systemctl start mariadb@s...
[21:04:08] 10DBA, 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3427683 (10Dereckson) a:03Dereckson Wiki scheduled for creation 2017-07-12 10:00–13:00 UTC.
[21:54:31] 10DBA: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351#3428001 (10jcrespo)