[00:30:23] Traffic, Operations, observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (colewhite) It appears a reload does resolve the issue, but it takes some time for Prometheus to fetch and store an update. I used `kill -HUP ` to reload. The lo...
[00:38:02] Traffic, Operations, serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (CDanis)
[00:38:09] Traffic, Operations, serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (CDanis) p:Triage→Low
[00:49:45] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (Papaul)
[00:53:28] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (Papaul)
[01:03:45] Traffic, Operations, observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (CDanis) Nice catch in the logs! I'm guessing we need to increase some of the inotify tunables. Probably this one: ` ✔️ cdanis@prometheus1003.eqiad.wmnet ~ 🕣🍺 cat /p...
[01:29:40] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (Papaul)
[01:29:57] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (Papaul) Open→Resolved Complete
[01:31:05] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (Papaul)
[01:31:30] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (Papaul) Open→Resolved Complete
[01:32:05] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (Papaul)
[01:32:17] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (Papaul) Open→Resolved Complete
[01:32:42] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (Papaul)
[01:32:52] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (Papaul) Open→Resolved Complete
[01:33:25] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (Papaul)
[01:33:41] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (Papaul) Open→Resolved Complete
[01:34:14] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (Papaul)
[01:34:26] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (Papaul) Open→Resolved Complete
[03:31:33] Traffic, Operations: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (CDanis)
[03:39:29] Traffic, Operations, Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (CDanis) Sample error log: ` Mar 12 05:42:01 cp3050 CRON[9853]: (root) CMD (/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all) [...] Mar 12 05:42:46...
[04:49:32] Traffic, Operations, Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (CDanis) p:Triage→High
[06:03:47] Traffic, Operations, Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (Vgutierrez) This seems to be triggered by the outage reported by globalsign in https://www.globalsign.com/en/status: `Updated 12 March 2020, 5:25 pm EDT We are st...
[06:20:41] Traffic, Operations, Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (Vgutierrez) Open→Resolved a:Vgutierrez
[07:13:43] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (elukey) https://librenms.wikimedia.org/graphs/to=1584082800/id=12085 https://librenms.wikimedia.org/device/device=149/tab=port/port=12086/ stat1005 and kafka-jumbo1006 are in the sa...
[07:49:31] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (elukey) I did some tests and the two hosts are definitely related. I logged as root on both via mgmt console and turned off their interfaces, and the stat1005's broadcast traffic wen...
[08:38:02] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (elukey) The other host in D1, the rack of stat1005 and jumbo1006 is kafka-jumbo1008, one of the new ones: https://netbox.wikimedia.org/dcim/devices/2510/
[11:08:42] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (elukey) I checked the last changes happened yesterday on the switch via: ` elukey@asw2-d-eqiad> show system rollback compare 3 0 [edit interfaces interface-range vlan-private1-d-eqi...
[12:03:16] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (akosiaris) I've had a look as well. I've checked that the mac address of kafka-jumbo1006 is indeed the one the switch learns and indeed that's true. I've bounced the port as well...
[12:21:46] Traffic, Operations, serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (jcrespo) I've made a sanity check, in addition to the ones cdanis did, and verified that indeed the number of http requests to those dcs do...
[13:24:36] Traffic, Operations, Core Platform Team (Icebox), Core Platform Team Workboards (Clinic Duty Team): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (AMooney)
[13:32:27] Traffic, Operations, Core Platform Team (Icebox): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (CCicalese_WMF)
[14:43:33] Acme-chief: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 (Vgutierrez) Open→Resolved a:Vgutierrez
[14:44:45] Traffic, Operations: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (Vgutierrez) for future reference, OCSP response update can be triggered like this: ` sudo -i cumin -b1 'A:cp-eqiad' "/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all" `
[14:51:43] Traffic, Operations: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (CDanis) Thanks Valentin! I also made some edits on wikitech: https://wikitech.wikimedia.org/w/index.php?title=HTTPS%2FUnified_Certificates&type=revision&diff=1860120&oldid=1773414
[16:18:40] Traffic, Operations: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (Vgutierrez)
[16:18:55] Traffic, Operations: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (Vgutierrez) p:Triage→Medium
[16:21:58] Traffic, Operations: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (Vgutierrez)
[16:29:57] Traffic, Operations, SRE-tools, Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (crusnov)
[16:33:34] Traffic, Operations, SRE-tools, Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (crusnov) Summary of status: We have decided that modifying the CI image itself is unnecessary in light of incidental changes to deplo...
[17:54:50] Traffic, Operations, ops-codfw: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (Papaul)
[19:05:01] Traffic, Operations: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (MarcoAurelio)
[19:11:27] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (elukey) Chris moved the servers to different ports, and for kafka-jumbo1006 it helped, since it is now serving traffic. stat1005 is still suffering of the same issue though.
[19:11:45] * elukey off!
[19:22:39] Traffic, Operations, observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (colewhite) a:colewhite
[19:34:15] netops, Operations, cloud-services-team (Kanban): New network request for CloudVPS CODFW instances transport - https://phabricator.wikimedia.org/T247633 (JHedden)
[19:38:43] Traffic, Operations, observability, Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (colewhite) [[ https://github.com/prometheus/prometheus/issues/3446 | Found a related issue. ]] We've upped max_user_instances and max_user_watch...
[19:55:14] Traffic, Operations, observability, Patch-For-Review: some Prometheus not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (Krinkle)
[19:57:23] Traffic, Operations, observability, Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (CDanis)
[19:57:45] Traffic, Operations, observability, Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (CDanis) @Krinkle please see https://phabricator.wikimedia.org/T246860#5964089 :)
[20:10:56] Traffic, Operations, observability, Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (Krinkle) Thanks... I guess that's bound to cause confusion, but I will try to be one of the people that does know and remembers.
[21:43:23] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (Dzahn) Fixed by @Papaul for kafka-jumbo1006. We saw recoveries for kafka lag on other machines all at once.
[22:07:36] netops, Analytics, DC-Ops, Operations: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (Papaul) I have @Jclark-ctr replace the cable to stat1005 same issue. I have him also disconnect the cable while i was looking at the switch the interface went from up up to up down a...