[07:57:12] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[08:26:00] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[08:28:20] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[10:55:24] Domains, Traffic, DNS, Operations, WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (Aklapper) Open→Declined Unfortunately closing this report as no further information has been provided. @Naveenpf: After you have pro...
[11:12:49] Domains, Traffic, DNS, Operations, WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (Naveenpf) It is required. What is the further information required?
[11:15:11] Domains, Traffic, DNS, Operations, WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (Aklapper) Declined→Open
[12:14:59] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2026.codfw.wmnet'] ` The log can be found in `...
[12:21:56] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (Pchelolo) > @Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be problematic if we drop "REST API Varnish hi...
[12:27:02] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.11; 2019-06-26), and 3 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Reedy)
[12:51:27] Traffic, Operations, Wikidata, Wikidata-Campsite, and 4 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (ema)
[12:51:52] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2026.codfw.wmnet'] ` and were **ALL** successful.
[12:54:13] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (ema)
[12:54:17] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ema) Open→Resolved a:ema Done!
[13:00:28] there's no more IPsec for upload \o/
[13:01:01] \o/ yay
[13:01:20] all moved to download right? :-P
[13:01:44] /o\
[13:02:13] I thought dad jokes were my prerogative now
[13:02:21] lol
[13:05:11] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Tgr) @Anomie tracked this down: a Tim...
[13:06:42] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (ema)
[13:07:02] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Tgr) I wonder if [[https://codesearch.wmflabs...
[13:11:09] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1076.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim...
[13:12:33] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (BBlack) Thanks for chasing this down! After...
[13:16:25] Traffic, MediaWiki-extensions-CentralAuth, Operations, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Tgr) Sessions are created all the time, and t...
[13:21:21] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi) >>! In T184942#5303464, @Pchelolo wrote: >> @Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be...
[13:22:07] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[13:49:58] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1076.eqiad.wmnet'] ` and were **ALL** successful.
[14:33:52] bblack: sanity check welcome on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520455
[14:35:14] oh
[14:35:22] didn't you already do one?
[14:35:46] I did, yes, but I thought two more eyes won't hurt :)
[14:36:02] and indeed there was a syntax error in site.pp heh
[14:36:06] so, yeah, there are potential issues around the disk storage change, eqiad is unique about that at present
[14:37:17] I guess in general (even on all the other hosts already done), it doesn't matter that our late_install stuff does mkfs on the storage, ATS will just ignore it when it uses the raw partition.
[14:37:30] correct
[14:37:43] the only thing is that trafficserver didn't start properly on cp1076 due to sd[ab]3 not being around, fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520450/
[14:37:56] yeah
[14:38:14] and now for cp1078 I've included the hiera storage part in the reimage commit
[14:38:39] the other thing is, while just swapping the device-name stuff in the above commit may "work", I don't think anyone (either of us, anyways) has checked whether anything special should be done/tuned for ATS on a 4K nvme device
[14:38:57] the nvme disks are special in more than just their name
[14:40:21] (because, in the name of getting max perf out of them, they're formatted with a 4K block size (== mem page size) rather than traditional 512B blocks, and some of the low-level kernel interface and handling for them is different than scsi disks and even other SSDs, they have a unique driver, newer queueing stuff, etc)
[14:40:53] since ATS is doing raw disk i/o without the FS, I'm not sure to what degree we have to care or tune any kind of related settings?
[14:41:07] it's possible there's absolutely nothing to do here, but it's worth looking.
[14:44:47] interesting yeah, will take a look
[14:46:08] e.g. on some googling around I found:
[14:46:11] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-cache-force-sector-size
[14:47:07] (seems like that should be set to 4096 for these nvme-hardware nodes in eqiad, but there's some caveats and weird stuff to look at there)
[14:48:07] I see 0 for alignments in sysfs FWIW
[14:49:13] also we might be able to get away with just using nvme0n1 (leave off the p1, use the whole raw disk), just in this eqiad case where we're not sharing with rootfs, and make things simpler
[14:50:18] right, that seems to be recommended
[14:50:34] > To be safe in Linux, you could just use the entire drive: /dev/sdb instead of /dev/sdb1 and Traffic Server will do the right thing.
Misaligned partitions on Linux are auto-detected.
[14:51:00] I infer from what's written there that the 4096 part isn't auto-detected, though
[14:51:46] there does seem to be a tradeoff: tiny cache objects will consume 4K instead of 512. But it makes everything else better (native 1:1 mapping of memory pages to real hardware storage blocks and best efficiency)
[14:51:54] seems like a win :)
[14:52:36] if you use 512 byte sectors on such a drive, you'll wind up turning writes into read-modify-writes in the hardware, which will be noticeably slower
[14:53:09] yeah
[14:53:30] I put some stuff in at the install_server level for these hosts, to do the special setup to make them 4K-native (they default to the ugly 512B compat mode)
[14:54:43] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/install_server/files/autoinstall/scripts/late_command.sh#54
[14:54:58] (and the installation of nvme-cli a little earlier than that, so that we have the command)
[14:55:40] oh cool
[14:56:16] yet another special case sort of thing to think about as we try to sanitize host hardware metadata and installer settings/code :)
[14:57:25] mke2fs could be done in puppet instead of there, it's just a PITA to then protect that against re-execution properly, or could be put in a hypothetical future middle-stage (post-install, pre-puppet, some fixed simpler stuff that runs once)
[14:57:39] I guess the nvme command is similar in nature to the mke2fs commands in that sense
[14:58:27] it's been too long since I looked at the details but I know that it's puppet that manages the data drives on swift backend hosts
[14:59:18] we had the mkfs for this in puppet at some past point as well, I think. there are probably still legacy crumbs of it.
[15:00:25] what we're really looking for in cases like these is a way to say "This special setup command, it needs to run only once on a host very early after it has been imaged, before other puppet stuff that might rely on it"
[15:01:11] for various cases, it ends up being convoluted to try to find the "right" way to prove the right condition to puppet (e.g. with an unless that somehow checks the state of affairs at every execution pointlessly)
[15:01:52] and then to tie such a low-level thing that should be a given at runtime into the puppet dependency chain properly (you can't run this service or deploy these files, etc... until you've depended on this command that almost never has to run in practice)
[15:02:20] and then it's easy to make mistakes in all of that which result in things like "oops while refactoring something I subtly broke some dependency somewhere and puppet reformatted all my live disks"
[15:03:19] it just seems far better to me to move such things to an explicit separate stage: some kind of simpler script with minor per-cluster/hardware casing, which executes once from cumin during imaging/reimaging, on first boot of the new OS, before the first puppet run, and then never runs again.
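(Editor's sketch, not from the channel.) The "runs once, then never again" stage described above could be guarded with a marker file. This is a minimal illustration only: `run_once`, the marker path, and the placeholder setup step are all hypothetical names, and nothing here reflects how late_command.sh, cumin, or puppet actually handle this.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_once(marker: Path, setup: Callable[[], None]) -> bool:
    """Run setup() only if the marker file is absent; return True if it ran."""
    if marker.exists():
        # Already done on an earlier boot or run: never touch the disks again.
        return False
    setup()
    # Write the marker only after setup() returned without raising, so a
    # failed first attempt is retried on the next invocation.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("done\n")
    return True


if __name__ == "__main__":
    # Demo in a throwaway directory; a real stage would use a fixed path
    # that survives reboots (hypothetical choice, not from the log).
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "first-boot" / "storage-setup.done"
        step = lambda: print("formatting storage (placeholder)")
        print(run_once(marker, step))  # runs the step
        print(run_once(marker, step))  # skipped, marker exists
```

The point of the design is that the destructive command is not re-evaluated on every configuration run; the only recurring check is whether the marker exists.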
[15:05:49] yeah the 'oops reformatted everything' scenario scared me before when doing swift changes (noun, not adjective) :| even when labelling filesystems that's still scary
[15:07:37] late_command is the closest thing we have to such a stage today, but it's not ideal either (hard to work with and debug since it's running in the installer and needs special chroot commands and putting all the cases in a multi-purpose installer shellscript, etc)
[15:44:47] sector size is currently 4096 on cp1076
[15:45:27] it seems that trafficserver looks at what the sector size is for the drive and uses that
[15:45:33] root@cp1076:~# fdisk -l /dev/nvme0n1p1 | grep 'Sector size'
[15:45:33] Sector size (logical/physical): 4096 bytes / 4096 bytes
[15:46:43] Traffic, MediaWiki-extensions-CentralAuth, Operations, TimedMediaHandler, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Jdforrester-WMF) >>! In T226840#5303566, @Tgr wrote: > I wonder if [[...
[15:49:49] which is the ATS storage config too according to python3-superior-cache-analyzer:
[15:49:57] DEBUG: Stripe.read: Finished reading metadata for Stripe(header=SpanBlockHeader(number=0, offset=0x76F50000, length=195351559, Type=http, free=False, avgObjSize=8000), file=/dev/nvme0n1p1, version=24.1, createTime=1562163160.000000, writeCursor=128645468160, lastWritePos=128643870720, aggPos=128645468160, generation=0, phase=0, cycle=0, syncSerial=11, writeSerial=51990, dirty=0, sectorSize=4096 [
[15:50:03] ... ]
[15:52:13] does that still work if we use nvme0n1, or does it need a partition to figure it out?
[15:53:18] will check tomorrow with cp1078, updating the CR now to use nvme0n1 instead
[15:53:27] ok thanks
[15:53:40] so basically, all of this was probably mostly a meaningless diversion :)
[15:53:55] sorry!
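(Editor's sketch, not from the channel.) The fdisk check above can also be done by reading the block sizes Linux exposes under `/sys/block/<disk>/queue/`. The helper names and the `sysfs_root` parameter are assumptions added for illustration, and the lookup applies to a whole disk such as `nvme0n1`, not to a partition name.

```python
from pathlib import Path
from typing import Tuple


def block_sizes(disk: str, sysfs_root: Path = Path("/sys/block")) -> Tuple[int, int]:
    """Return (logical, physical) block size in bytes for a whole disk."""
    queue = sysfs_root / disk / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical


def is_4k_native(disk: str, sysfs_root: Path = Path("/sys/block")) -> bool:
    """True when both logical and physical block sizes are 4096 bytes."""
    return block_sizes(disk, sysfs_root) == (4096, 4096)


if __name__ == "__main__":
    disk = "nvme0n1"  # device name taken from the log; adjust per host
    try:
        print(disk, block_sizes(disk), "4K-native:", is_4k_native(disk))
    except FileNotFoundError:
        print(f"{disk} not present on this machine")
```

A drive left in the 512B compatibility mode mentioned earlier would be expected to report a 512-byte logical size here, which is the case the 4K-native setup is meant to avoid.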
[15:54:05] very interesting and useful actually
[15:54:39] I found out that there's a thing called traffic_cache_tool which segfaults whatever you do
[15:55:04] but that would otherwise provide great insight :)
[15:56:11] and yeah at the end of the day packaging https://github.com/comcast/Superior-Cache-ANalyzer was a great idea!
[15:56:57] Traffic, MediaWiki-extensions-CentralAuth, Operations, TimedMediaHandler, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (BBlack) >>! In T226840#5303935, @Jdforrester-WMF wrote: >>>! In T2268...
[15:57:26] :)
[15:57:39] lol @ Comcast
[15:58:17] which reminds me of https://github.com/tylertreat/comcast someone linked on IRC the other day :)
[15:59:25] it's kind of crazy how much open source Comcast apparently publishes
[15:59:42] it's very at odd with my view of them otherwise as an organization
[15:59:47] s/odd/odds/
[16:01:38] indeed, my guess would be that in one of the acquisitions they acquired some C level or VP that won the "we're doing open source" turf war
[16:06:17] it's probably a bit easier these days, even in corporate environments like that
[16:06:58] I can recall carrying the open-source banner at corporations back in the 90s and early 2000s (as an individual contributor) and finding it extremely frustrating. Nobody at the top understood or cared, or if they did they were negative on the idea.
[16:07:27] but the world has somewhat changed for the better since! :)
[16:24:37] Traffic, MediaWiki-extensions-CentralAuth, Operations, TimedMediaHandler, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Tgr) At a glance: * The edit API uses it because the way it works is...
[17:04:52] netops, Operations, ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi) We had to rollback to the old switch. codfw is different from all the other sites, as mr1-codfw is connect to the cr1/2 through msw1 using 10G links to each routers. While the new msw1 doesn't ha...
[18:21:02] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH)
[18:21:04] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) Stalled→Resolved ` /admin1-> racadm getsel Record: 1 Date/Time: 02/21/2019 18:11:12 Source: system Severity: Ok Description: Log cleared. ---------------------------...
[18:22:10] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH)
[18:22:12] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) Stalled→Resolved ` /admin1-> racadm getsel Record: 1 Date/Time: 02/21/2019 17:41:01 Source: system Severity: Ok Description: Log cleared. ---------------------------...
[19:38:23] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Cmjohnson) @Ottomata Please decommission the current servers to spare role Please provide the new hostnames...
[20:38:58] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata) > Please decommission the current servers to spare role Ok will do. I'll downtime the the hostnam...