[06:44:31] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3074464 (Marostegui)
[06:45:15] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3074430 (Marostegui) a:Papaul Hi @Papaul - please change the disk once you have time for it! Thanks!
[06:47:30] DBA, Operations, ops-codfw: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (Marostegui)
[07:39:39] DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3074514 (Marostegui) This is the status of the jobid assigned to dbstore1001 backup: ``` status jobid=48872 Device status: Device "FileStorage1" (/srv/baculasd1) is not open. Device is BLOCK...
[08:48:46] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3074566 (Marostegui) Update - @Joe is kindly helping and we have seen a few issues which suggest that the storage itself might be having issues: ``` [52078943.9540...
[08:49:33] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3074568 (Marostegui) Update - @Joe is kindly helping and we have seen a few issues which suggest that the storage itself might be having issues: ``` [52078943.954044] ata1.00: BMD...
[09:14:35] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3074604 (Marostegui) For the record: The broken disk above looks like sda, which is not part of the baculads volume but `/`. Although mdstat doesn't see it broken. ``` root@helium:/var/...
[10:25:03] DBA, Labs: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572#3074934 (jcrespo) a:jcrespo
[12:47:38] DBA, Labs: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572#3075232 (jcrespo) p:Triage>Normal
[12:53:19] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#3075262 (Marostegui)
[14:34:07] DBA, Patch-For-Review: Fix dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T130128#3075659 (Marostegui)
[14:34:09] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#3075655 (Marostegui) Open>Resolved The only pending shard to import is x1 - I will mark this as resolved and create a ticket just for x1. x1 is a bit more difficult because...
[14:35:33] DBA: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707#3075661 (Marostegui)
[14:41:47] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3075706 (Marostegui) My idea would be as follows: - stop labsdb1006 -> copy `/srv/postgres` to dbstore1001 -> reimage -> copy the data back. If that works fine, repea...
[14:44:56] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3075715 (jcrespo) The most important step, and why we need to copy that data away in case something goes wrong is the postgres upgrade from 9.1 (precise) to 9.4 (jessie...
[15:06:37] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3075860 (Marostegui) I am manually executing the "predump" script on dbstore1001 on a root screen called `dumps`. At least to have a local copy of the backups until we fix the bacu...
[15:13:57] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3075900 (Marostegui) >>! In T153768#3075860, @Marostegui wrote: > I am manually executing the "predump" script on dbstore1001 on a root screen called `dumps`. At least to have a lo...
[15:20:20] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3075915 (jcrespo)
[15:20:23] DBA, Labs: Possible labs imagelinks drift - https://phabricator.wikimedia.org/T159023#3075912 (jcrespo) Open>Resolved a:jcrespo This is fixed now, but again I encourage you to try the new servers soon, where this problem wasn't there in the first place. ``` labsdb1001> SELECT * FROM imagelin...
[15:20:47] DBA, Wikidata, Wikidata-Sprint: Evaluate feasibility of adding a column for full entity ID to wb_terms - https://phabricator.wikimedia.org/T159718#3075916 (WMDE-leszek)
[15:38:24] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076003 (akosiaris) >>! In T157359#3075706, @Marostegui wrote: > My idea would be as follows: > > - stop labsdb1006 -> copy `/srv/postgres` to dbstore1001 -> reimage -...
[15:44:09] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076026 (jcrespo) > There is also these days pg_upgrade with --link mode, which should in theory help avoid that problem, but I've never tested it in a 9.1 => 9.4...
[15:46:22] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076028 (akosiaris) >>! In T157359#3076026, @jcrespo wrote: >> There is also these days pg_upgrade with --link mode, which should in theory help avoid that problem...
[15:49:18] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076041 (jcrespo) Hey, I am not saying it is going to work 100% sure- I am just suggesting to try it first, and then go the slow route, which is basically what you sugg...
[15:51:29] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076044 (Marostegui) >>! In T157359#3076003, @akosiaris wrote: > > Actually I had a different plan in mind. So, labsdb1007 is a read-only slave of labsdb1006. My propo...
[15:53:41] DBA, Operations, ops-codfw: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3076045 (Papaul) p:Triage>Normal
[15:54:04] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3076046 (Papaul) p:Triage>Normal
[15:59:02] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076066 (akosiaris) >>! In T157359#3076041, @jcrespo wrote: > Hey, I am not saying it is going to work 100% sure- I am just suggesting to try it first, and then go the...
[16:00:55] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076077 (Marostegui) >>! In T157359#3076066, @akosiaris wrote: > Yeah I can do that if it makes you two happier. That would be appreciated. Not happier per se, but a...
[16:01:24] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076082 (jcrespo) @akosiaris Do you know if labsdb1007 is actively in use? If not at all, we could start doing it now, ahead of the maintenance window...
[16:24:23] DBA, Wikidata, Wikidata-Sprint: Evaluate feasibility of adding a column for full entity ID to wb_terms - https://phabricator.wikimedia.org/T159718#3076118 (Marostegui) Hi! The wb_terms table is quite a big table (as you probably already know); it is around 230G on disk. That means the ALTER table wil...
[16:26:47] DBA, Wikidata, Wikidata-Sprint: Evaluate feasibility of adding a column for full entity ID to wb_terms - https://phabricator.wikimedia.org/T159718#3076134 (jcrespo) @Marostegui I think they do not want it done yet, but an ok from us/review. But they should probably clarify that. "feasibility" is...
[16:32:09] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076147 (WMDE-leszek)
[16:33:35] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3075916 (WMDE-leszek)
[16:38:14] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076201 (jcrespo) > Evaluate if it is feasible to add such an "empty" column without making Wikidata readon...
[16:40:13] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076203 (Marostegui) >>! In T159718#3076201, @jcrespo wrote: >> Evaluate if it is feasible to add such an "...
[16:42:09] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076207 (jcrespo) > If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchove...
[16:43:21] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076214 (WMDE-leszek) Thanks @Marostegui for the information so far. And yes, @jcrespo is right. We first want...
[16:48:13] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076229 (jcrespo) > Depending on the answer to this, we will plan further steps. I think you should add th...
[17:00:08] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076247 (WMDE-leszek) Oh, sorry, I missed a few comments while typing mine :) It seems to me @Marostegui and...
[17:01:05] DBA, Wikidata, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076249 (WMDE-leszek)
[17:04:05] DBA, Operations, ops-codfw: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3076250 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below....
[17:04:29] DBA, Wikidata, Schema-change, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076251 (WMDE-leszek)
[17:04:32] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3076252 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[17:18:00] DBA, Wikidata, Schema-change, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3076264 (jcrespo) Ok, now I have some comments against that method, logistically, I am a...
[17:38:14] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076308 (akosiaris) It is not. It is a read-only slave not really being used by anyone currently so we are free to start the process on it well ahead of the maint window.
[17:41:14] DBA, Labs, Labs-Infrastructure, Operations: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076313 (jcrespo) Thanks, I will start "breaking" it tomorrow Tuesday during EU morning- do not worry, I can take care of this- you probably have more urgent things to...
[17:58:25] jynus: marostegui I saw the note from ops meeting about alert for labsdb1009. I didn't know the alert was there - I'd have ack'd it as well. the new db is for https://phabricator.wikimedia.org/T146718, and can be deleted now that the conference is over (if that's what you would prefer)
[17:59:13] yuvipanda: no, it is not an alert, it is our script that runs every week which saw it
[17:59:29] aaah, manual alert! :D
[17:59:37] so datasets_p database can be deleted on 1009?
[18:00:04] I do not think we have to *delete it*, but do it properly
[18:00:14] as in, puppetize it, etc.
[18:00:22] and ping us, etc.
[18:00:54] yeah, we were basically a bit like: what is this? is it a security risk? where did it come from etc
[18:01:55] jynus: marostegui hmm, I commented and logged on ticket while doing it, and assumed that was enough (since there was prior engagement there). sorry if that wasn't - it was a very time bound requirement as well
[18:02:08] and that is ok
[18:02:29] maybe doing a @Manuel @jynus I am doing this because this is urgent, ping.
[18:03:01] ok! sorry! :(
[18:03:02] the thing is, as not being puppetized or scripted, it could be deleted automatically
[18:03:03] yeah, I was going to say that maybe a mention would be cool too, so we can see the specific ping :)
[18:03:27] or it could have raised a page
[18:03:41] because it is data that is not supposed to be there according to puppet
[18:04:33] right. I looked for things that could raise a page and didn't find anything, and watched for them after. but I'll definitely do more @pings if similar situations arise in the future (which they hopefully wouldn't!)
[18:05:06] yep, we decided to schedule that script on a weekly basis so we'd have "private" data or data that isn't supposed to be there under control
[18:05:07] yes, as manuel said, the script is currently in testing, so it only sends an email
[18:05:19] and since this wasn't puppetized, the script said: "hey! what's this!"
[18:05:41] * yuvipanda nods
[18:05:49] we have a check because we do not want any column or db being there by accident
[18:05:56] that's super awesome
[18:06:08] * yuvipanda vaguely remembers aftermath of previous pw leak
[18:06:20] so basically, we received an email saying "there could be a data leak"
[18:06:27] so we are freaking out a bit
[18:06:52] *were
[18:06:59] I did not know of the script either
[18:07:01] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3076412 (Marostegui) db2046: ``` root@neodymium:/home/marostegui/git# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb2046.codfw.wmnet...
[18:07:25] so, is this datasets_p something we need to whitelist?
[18:07:36] the thing is
[18:07:46] the database is not accessible by labs users anyway
[18:08:03] good question. We do need it for the long term, but the immediate need has passed. so I'm ok deleting it for now
[18:08:04] so I am not sure how useful it was- and we could have also helped there
[18:08:41] we also established a firm permission system, so *_p databases are no longer public, also to avoid accidents
[18:09:53] they have to be whitelisted too, manually
[18:11:20] So, we can delete it for now so the script doesn't complain any longer and in the future iteration with yuvipanda we can see how to handle it best, I would say
[18:11:54] marostegui: yup!
[18:12:59] yuvipanda: would you need to back it up or something?
[18:13:47] jynus: yeah, I granted access specifically to the PAWS user with GRANT SELECT, SHOW VIEW ON datasets_p.* TO 's52771'@'%'; and then later revoked it.
[18:13:53] marostegui: nope, we have the original data
[18:13:59] yuvipanda, ah, ok
[18:16:11] dropped
[18:16:24] marostegui: thanks
[18:16:56] jynus: marostegui https://github.com/OCDX/article-quality/blob/master/src/generate_plots_of_monthly_quality.ipynb (quality improvements because of women focused workshops) is what came out of the db, so: 1. thank you for your patience, 2. sorry for the alarm
[18:17:30] yuvipanda: no worries, now we are all aligned and we can improve it for the future iteration :-)
[18:17:42] * yuvipanda nods
[18:18:04] and now, time for dinner!
[18:18:08] will see you tomorrow!
[18:18:09] marostegui: jynus if you've some vague outline of what you would like the process for this to be like somewhere, I'll happily take point / do the work. no rush tho
[18:18:13] cya!
[18:18:16] I too need to go breakfast
[18:29:15] DBA, Operations, Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#3076558 (fgiunchedi) Update from the monitoring meeting, this can be implemented via puppetdb queries, additionally the...
[18:47:58] DBA, Operations, ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3076667 (Ottomata) p:Triage>Normal a:Marostegui Assigning, feel free to reassign.
[18:50:08] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3076675 (Ottomata) a:Marostegui
[18:50:15] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3053239 (Ottomata) p:Triage>Normal
[18:58:47] DBA, Operations, Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3076720 (Ottomata) a:matmarex Triaging, feel free to re-assign.
[19:09:23] DBA, Operations, ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3076755 (Cmjohnson) Slot 7 is just offline for some reason. Changed status to online cmjohnson@db1060:~$ sudo megacli -PDOnline -PhysDrv [32:7] -a0 EnclId-32 SlotId-...
[19:11:47] DBA, Monitoring, Operations: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3076765 (Ottomata) a:Dzahn
[19:24:32] DBA, Operations, ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3076820 (jcrespo)
[19:26:54] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3076821 (Marostegui) a:Marostegui>Cmjohnson I will assign this to @Cmjohnson so he can change the disk once it is onsite Thanks!
[19:35:23] DBA, Operations, ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3076884 (Marostegui) Even though the server's data is now corrupted and needs to be reimaged, the RAID is on optimal status: ``` root@db1060:~# megacli...
[19:41:16] DBA, Monitoring, Operations: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3076947 (Dzahn) a:Dzahn>None
[19:44:32] DBA, Monitoring, Operations: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3077001 (Ottomata) p:Triage>Normal
[19:47:36] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3077023 (jcrespo) There are now 2 backups happening on dbstore1001, one on 201703061505 and another on 201703061552, one from screen, another from bacula-fd :-/
[20:28:09] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3077270 (Marostegui) >>! In T153768#3077023, @jcrespo wrote: > There are now 2 backups happening on dbstore1001, one on 201703061505 and another on 201703061552, one from screen, a...
[21:17:27] DBA, Wikidata, Schema-change, Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3077513 (jcrespo) So the comments: * do not defer the creation of the indexes- those a...