57 Commits

Author SHA1 Message Date
c6dbce8f90 Update README.md
Clarify that this agent is meant to run with Net SNMPd.
2023-08-26 11:58:36 +02:00
aa38c5503f Update README.md
Add a few hints based on previous issues filed in this repo.
Clarify that linux-cp must be used, the API/Stats socket must be accessible, and that no backwards compatibility is given.
2023-08-26 11:55:22 +02:00
684400ff9e Reduce logging on AgentX connections
Previous logging was very noisy when the agent connection to snmpd
drops:

[ERROR   ] agentx.network - run            : Empty PDU, connection closed!
[INFO    ] agentx.network - disconnect     : Disconnecting from localhost:705
[ERROR   ] agentx.agent - run            : An exception occurred: Empty PDU, disconnecting
[ERROR   ] agentx.agent - run            : Reconnecting
[INFO    ] agentx.agent - run            : Opening AgentX connection
[INFO    ] agentx.network - connect        : Connecting to localhost:705
[ERROR   ] agentx.network - connect        : Failed to connect to localhost:705
[ERROR   ] agentx.agent - run            : An exception occurred: Not connected
[ERROR   ] agentx.agent - run            : Reconnecting
[INFO    ] agentx.agent - run            : Opening AgentX connection
[INFO    ] agentx.network - connect        : Connecting to localhost:705
[ERROR   ] agentx.network - connect        : Failed to connect to localhost:705
[ERROR   ] agentx.agent - run            : An exception occurred: Not connected
[ERROR   ] agentx.agent - run            : Reconnecting

Also, reconnects were attempted every 0.1s, but field research shows
that snmpd, if it restarts, takes ~3-5 seconds to come back (note: this
is also due to a systemd delay in restarting it upon failures).
Hammering the connection is not useful.

This change refactors the logging, to avoid redundant messages:
- sleep 1s between attempts (reducing the loop by 10x)
- Either print 'Connected to' or 'Failed to connect to', not both.
- Remove the 'reconnecting' superfluous message
2023-01-14 11:12:06 +00:00
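
The reconnect behaviour described in 684400ff9e can be sketched roughly as below. This is a minimal illustration, not the agent's actual code: `connect_with_retry`, its parameters, and the injected `connect` callable are hypothetical names, and only the 1s delay and the one-message-per-attempt rule come from the commit message.

```python
import logging
import time

logger = logging.getLogger("agentx.agent")


def connect_with_retry(connect, address, retry_delay=1.0, max_attempts=None,
                       sleep=time.sleep):
    """Call connect(address) until it succeeds (hypothetical helper).

    Each attempt logs exactly one line: either 'Connected to' or
    'Failed to connect to', never both, and no separate 'Reconnecting'.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            connect(address)
        except OSError as e:
            logger.error("Failed to connect to %s: %s", address, e)
            sleep(retry_delay)  # 1s between attempts instead of 0.1s
            continue
        logger.info("Connected to %s", address)
        return True
    return False
```

Passing `sleep` as a parameter is only a convenience for testing; a real loop would simply call `time.sleep`.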
43551958f8 Typo fix 2023-01-10 17:13:27 +01:00
31529a2815 improvement: add flag for agentx debugging
agentx/network.py always turned on debugging. It can be useful to have
debugging logs of the main application without the agentx debug logs, as
they are quite noisy.

Now, ./vpp-snmp-agent.py -d will turn on application debugging but NOT
agentx debugging. ./vpp-snmp-agent.py -d -dd will turn on both.

NOTE: ./vpp-snmp-agent.py -dd will do nothing, because the '-d' flag
determines the global logging level.
2023-01-10 15:21:32 +01:00
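
As an illustration of the flag semantics in 31529a2815, the sketch below maps `-d` and `-dd` to two log levels. The function names and the `dest` names are assumptions made for this example; the only behaviour taken from the commit message is that `-d` gates everything and `-dd` adds AgentX debugging on top of it.

```python
import argparse
import logging


def parse_flags(argv):
    """Parse the two debug flags (dest names are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="enable application debug logging")
    parser.add_argument("-dd", dest="debug_agentx", action="store_true",
                        help="also enable AgentX debug logging (needs -d)")
    return parser.parse_args(argv)


def log_levels(args):
    """Return (application_level, agentx_level) for the parsed flags."""
    app_level = logging.DEBUG if args.debug else logging.INFO
    # -dd on its own has no effect: -d determines the global level.
    agentx_debug = args.debug and args.debug_agentx
    agentx_level = logging.DEBUG if agentx_debug else logging.INFO
    return app_level, agentx_level
```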
0d7dea37f5 Merge branch 'main' of github.com:pimvanpelt/vpp-snmp-agent 2023-01-10 11:26:21 +01:00
95d96d5e61 bugfix: add a control_ping() before each update
If VPP were to disconnect either the Stats Segment or the API endpoint,
for example if it crashes and restarts, vpp-snmp-agent will not detect
this. In such a situation, it will hold on to the stale stats and no
longer receive interface updates.

Before each run, send a control_ping() API request, and if that were to
fail (for example with Broken Pipe, or Connection Refused), disconnect
both API and Stats (in the vpp.disconnect() call, also invalidate the interface
and LCP cache), and then fail the update. The Agent runner will then retry
once per second until the connection (and control_ping()) succeeds.

TESTED:
- Start vpp-snmp-agent, it connects and starts up per normal.
- Exit / Kill vpp
- Upon the next update(), the control_ping() call will fail, causing the
  agent to disconnect
- The agent will now loop:
[ERROR   ]      agentx.agent - update         : VPP API: [Errno 1] Sendall error: BrokenPipeError(32, 'Broken pipe'), retrying
[WARNING ]      agentx.agent - run            : Update failed, last successful update was 1673345631.7658572
[INFO    ]     agentx.vppapi - connect        : Connecting to VPP
[ERROR   ]      agentx.agent - update         : VPP API: Not connected, api definitions not available, retrying

- Start VPP again, when its API endpoint is ready:
[INFO    ]     agentx.vppapi - connect        : Connecting to VPP
[INFO    ]     agentx.vppapi - connect        : VPP version is 23.02-rc0~199-gcfaf44020
[INFO    ]     agentx.vppapi - connect        : Enabling VPP API interface events
[DEBUG   ]      agentx.agent - update         : VPP API: control_ping_reply(_0=24, context=12, retval=0, client_index=0, vpe_pid=705326)
[INFO    ]     agentx.vppapi - get_ifaces     : Requesting interfaces from VPP API
[INFO    ]     agentx.vppapi - get_lcp        : Requesting LCPs from VPP API

- The agent resumes where it left off
2023-01-10 11:24:44 +01:00
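
The liveness check from 95d96d5e61 might look something like the following. `FakeVPP` is a hypothetical stand-in written for this example so that the snippet runs without VPP; it is not the real API wrapper, and `update()` here only shows the ping-then-disconnect control flow described in the commit message.

```python
class FakeVPP:
    """Hypothetical stand-in for the VPP API/Stats wrapper."""

    def __init__(self):
        self.connected = False
        self.alive = True       # simulates whether VPP is running
        self.iface_dict = None  # interface cache
        self.lcp_dict = None    # LCP cache

    def connect(self):
        if not self.alive:
            raise ConnectionRefusedError("VPP API not available")
        self.connected = True

    def disconnect(self):
        # Drop the connection and invalidate both caches.
        self.connected = False
        self.iface_dict = None
        self.lcp_dict = None

    def control_ping(self):
        if not (self.connected and self.alive):
            raise BrokenPipeError(32, "Broken pipe")


def update(vpp):
    """One update run; returns False so the caller can retry later."""
    try:
        if not vpp.connected:
            vpp.connect()
        vpp.control_ping()  # detect a stale connection before reading
    except OSError:
        vpp.disconnect()
        return False
    # ... a real agent would refresh its SNMP dataset here ...
    return True
```

Both `BrokenPipeError` and `ConnectionRefusedError` are subclasses of `OSError`, which is why a single `except` clause covers the two failure modes named in the commit.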
b6864530eb Update README.md 2023-01-08 23:11:55 +01:00
7f4427c4b6 Improvement: Use interface/LCP caching on VPP API
- Set an initial vppapi.iface_dict and lcp_dict to None.
- Set an event watcher API call, with a callback
- When events happen, flush the iface/lcp cache (by setting them to None).
- When get_ifaces / get_lcp sees an empty cache, fetch the data from VPP
  API and put into the cache for subsequent calls.

This way, the VPP API is only used upon startup (when the caches are
empty), and on interface add/del/changes (note: the events fire for
link, and admin up/down, but not for MTU changes).

One small race condition exists: if a new LCP is created, this does not
trigger an interface event. Adding a want_lcp_events() makes sense, but
until then, a few options remain:
0) the race exists only if the interface was created; THEN the cache was
   refreshed; and THEN the LCP was created.
1) create the lcp and then force a change to any interface (this will
   create an sw_interface event and flush the cache)
2) restart vpp-snmp-agent
2023-01-08 13:57:08 +01:00
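
A minimal sketch of the caching scheme from 7f4427c4b6, assuming a hypothetical `InterfaceCache` class and an injected `fetch_ifaces` callable in place of the real VPP API call. Only the "None means empty, events flush, getters refill" logic is taken from the commit message.

```python
class InterfaceCache:
    """Cache that is refilled lazily and flushed on interface events."""

    def __init__(self, fetch_ifaces):
        self._fetch_ifaces = fetch_ifaces  # hypothetical: queries the VPP API
        self.iface_dict = None             # None means "cache is empty"

    def on_event(self, *_args):
        # Event callback: flush the cache so the next read refetches.
        self.iface_dict = None

    def get_ifaces(self):
        if self.iface_dict is None:
            self.iface_dict = self._fetch_ifaces()
        return self.iface_dict
```

With this shape the API is queried once at startup and once after each event, no matter how many times `get_ifaces()` is called in between. The LCP cache would follow the same pattern.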
c81a035091 Refactor to use VPPApiJSONFiles 2023-01-08 13:24:54 +01:00
fe794ed286 Remove all global variables 2023-01-08 13:21:00 +01:00
5e11539b44 Format with black 2023-01-08 13:05:42 +01:00
72e9cf3503 Update README.md 2022-12-23 16:01:02 +01:00
a56840d849 Update README.md 2022-12-23 10:20:11 +01:00
cde5d4df94 Update README.md 2022-12-23 10:09:06 +01:00
16c29e0ce6 Allow vppapi!=vppstats count, continue and use those interfaces that are in the API 2022-07-10 20:49:44 +00:00
b4c819af87 Retrieve description from all interface types, not just ethernets 2022-07-10 11:33:47 +00:00
b024a3e96b Move the YAML config to be compatible with vppcfg's config file 2022-07-10 09:47:33 +00:00
3be732e6ab Remove the workaround for endianness in VPP; Remove the --disable-lcp flag. Catch connect exceptions for VPPStats and VPP API 2022-07-09 10:14:15 +00:00
c9233749bc Pulled in latest vpp_stats.py from upstream after https://gerrit.fd.io/r/c/vpp/+/35640 2022-04-01 13:10:17 +00:00
968c0abe2f Fail the setup if we can't connect to VPP; exit the daemon with non-zero value to force restart by systemd 2022-03-14 23:14:59 +00:00
86512dd66b Turn interface mismatch into a warning - it is often recoverable 2022-03-13 12:05:54 +00:00
c112016665 Add a flag to disable lcp lookups, due to pending VAPI issues (https://gerrit.fd.io/r/c/vpp/+/35479) 2022-03-08 13:25:29 +00:00
a9c9e15828 typo fix 2022-02-27 23:01:19 +00:00
f97f50bf30 Update README 2022-02-27 22:59:55 +00:00
c319ef576d Add an optional configuration file
A simple convenience configfile can provide a mapping between VPP
interface names, Linux Control Plane interface names, and descriptions.
An example:

```
interfaces:
  "TenGigabitEthernet6/0/0":
    description: "Infra: xsw0.chrma0:2"
    lcp: "xe1-0"
  "TenGigabitEthernet6/0/0.3102":
    description: "Infra: QinQ to Solnet for Daedalean"
    lcp: "xe1-0.3102"
  "TenGigabitEthernet6/0/0.310211":
    description: "Cust: Daedalean IP Transit"
    lcp: "xe1-0.3102.11"
```

This configuration file is completely optional. If the `-c` flag is
empty, or it's set but the file does not exist, the Agent will simply
enumerate all interfaces, and set the `ifAlias` OID to the same value
as the `ifName`. However, if the config file is read, it will change
the behavior as follows:

*  Any `tapNN` interface names from VPP will be matched to their PHY by
   looking up their Linux Control Plane interface. The `ifName` field
   will be rewritten to the _LIP_ `host-if`. For example, `tap3` above
   will become `xe1-0` while `tap3.310211` will become `xe1-0.3102.11`.
*  The `ifAlias` OID for a PHY will be set to the `description` field.
*  The `ifAlias` OID for a TAP will be set to the string `LCP: `
   followed by its PHY `ifName`. For example, `xe1-0.3102.11` will
   become `LCP TenGigabitEthernet6/0/0.310211 (tap9)`.
2022-02-27 22:58:03 +00:00
80190bf2d0 Merge pull request #1 from amartin-git/patch-1
Set larger receive buffer size for bulk requests
2021-12-06 18:30:29 +01:00
c19df5a77a Set larger receive buffer size for bulk requests
When using SNMP BULK GET requests (from Zabbix in our case), the default value of 1024 truncates the request, resulting in malformed requests reaching the agent. Using an 8K buffer fixes this. A better approach perhaps would be to process the buffer using a loop.
2021-12-06 12:26:22 -05:00
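
The loop-based alternative suggested in c19df5a77a could be sketched as follows. This is not what the commit implements (it enlarges the buffer to 8K); it is one possible reading of "process the buffer using a loop", relying on the 20-byte AgentX header from RFC 2741, whose last four bytes carry the payload length. The function names are hypothetical.

```python
import socket
import struct

AGENTX_HEADER_LEN = 20     # fixed header size per RFC 2741
NETWORK_BYTE_ORDER = 0x10  # header flag: integers are big-endian


def recv_exact(sock, n):
    """Read exactly n bytes, looping over short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 4096))
        if not chunk:
            raise ConnectionError("connection closed mid-PDU")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_pdu(sock):
    """Read one whole AgentX PDU regardless of its size."""
    header = recv_exact(sock, AGENTX_HEADER_LEN)
    flags = header[2]
    endian = ">" if flags & NETWORK_BYTE_ORDER else "<"
    (payload_len,) = struct.unpack(endian + "L", header[16:20])
    return header + recv_exact(sock, payload_len)
```

Reading the length from the header means a large bulk request is never truncated, whatever the size of a single `recv()` call.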
89abebb26b Merge branch 'main' of github.com:pimvanpelt/vpp-snmp-agent into main 2021-09-15 08:02:26 +00:00
18005bbbc2 Fix memory leak in logging (specifically: do not create a new logger for every SNMP PDU) 2021-09-15 07:58:08 +00:00
a574305fb2 Reconnect faster after errors (0.1s sleep) 2021-09-15 07:57:17 +00:00
09a2b6e9e4 Remove logger from dataset, it's not necessary, as there's only one call location that wants to say something. Turn that into an exception instead 2021-09-15 07:56:50 +00:00
bf9d61b95d Restart snmpd if it fails 2021-09-13 07:54:13 +02:00
610d03a14b Refactor README.md 2021-09-12 16:31:13 +00:00
5051ab32ce Update README with the -h/--help argparse hint 2021-09-12 16:22:22 +00:00
6d0ed88722 Add argparse and a few useful arguments
Now that we're explicitly connecting via TCP to localhost:705 (which
can be overridden by the -a flag), we no longer need to run as root.
Therefore, update vpp-snmp-agent.service to run as user Debian-snmp
group vpp, so that /run/vpp/{api,stats}.sock are writable.
Be explicit on the commandline arguments in the service definition.
2021-09-12 16:19:33 +00:00
7206d92f40 Move all loggers to be members of the class, not global objects 2021-09-12 16:08:35 +00:00
9265e211e3 Swap oper/admin status (they were the wrong way around) 2021-09-12 14:09:23 +00:00
96f2a3b4b3 Move to /usr/sbin instead of /usr/local/sbin 2021-09-11 12:55:28 +00:00
c72890868c s/freq/period/ to be more precise; Set default period to 30s; set wait period on reconnect to 10s; Add explicit INFO logline when replacing dataset 2021-09-11 12:45:28 +00:00
8c9c1e2b4a Replace the pyagentx threaded version with a much simpler, non-threaded version. 2021-09-11 12:19:38 +00:00
842bce9d6e Add server_address to initializer, allow for unix path (starts with /) or hostname:port address 2021-09-11 08:13:21 +00:00
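
The address rule in 842bce9d6e (a leading `/` means a unix path, anything else is `hostname:port`) can be illustrated with the sketch below. `parse_server_address` and its return shape are assumptions for this example, not the agent's real interface.

```python
def parse_server_address(server_address):
    """Classify an AgentX server address (hypothetical helper).

    Returns ("unix", path) or ("tcp", (host, port)).
    """
    if server_address.startswith("/"):
        return ("unix", server_address)
    host, sep, port = server_address.rpartition(":")
    if not sep or not host:
        raise ValueError("expected /path or hostname:port, got %r"
                         % server_address)
    return ("tcp", (host, int(port)))
```

The tuple for the TCP case is in the form that `socket.connect()` expects for `AF_INET`, while the unix path can be passed as-is to an `AF_UNIX` socket.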
0c0e4fc14a A better way to specify netns
See docs:

https://www.freedesktop.org/software/systemd/man/systemd.exec.html#NetworkNamespacePath=
2021-09-05 21:02:11 +00:00
184d2eceb2 Restart agent on failure 2021-09-05 20:26:47 +00:00
9b39aa61c2 Clamp all COUNTER32 at mod 2^32 2021-09-05 20:15:06 +00:00
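
For 9b39aa61c2, the wrap-around amounts to a single modulo operation. A minimal sketch, with `to_counter32` as a hypothetical name:

```python
COUNTER32_MODULUS = 2**32


def to_counter32(value):
    """Wrap a 64-bit counter into the Counter32 range 0..2^32-1."""
    return value % COUNTER32_MODULUS
```

SNMP pollers already treat Counter32 as a wrapping counter, so reporting the value modulo 2^32 keeps rate calculations consistent across a wrap.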
7dec1329d2 Turn VPPApi into a threadsafe object
It now is tolerant to VPP restarts. Upon initialization, we connect(),
blocking all but the first thread from trying. The rest will see
self.connected=True and move on.

Then, on each/any error, call vpp.disconnect() and set connected=False
which will make any subsequent AgentX updater run force a reconnect.
2021-09-05 20:02:11 +00:00
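
The locking pattern described in 7dec1329d2 might be sketched like this. `ThreadSafeClient` and the injected `do_connect` callable are hypothetical; only the idea that the first thread connects while the rest see `connected=True` and move on comes from the commit message.

```python
import threading


class ThreadSafeClient:
    """Connects at most once, however many threads call connect()."""

    def __init__(self, do_connect):
        self._do_connect = do_connect  # hypothetical: opens the VPP API
        self._lock = threading.Lock()
        self.connected = False

    def connect(self):
        with self._lock:
            if self.connected:
                return True  # another thread already connected
            self._do_connect()
            self.connected = True
            return True

    def disconnect(self):
        # Called on any error; the next connect() will reconnect.
        with self._lock:
            self.connected = False
```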
e1cddc8c26 Add VPP API support to retrieve mtu/ifspeed/operstatus/adminstatus/mac 2021-09-05 19:39:20 +00:00
238471d25f Ensure more updates can fit in the queue, allow scaling to 20 variables on 1000 interfaces 2021-09-05 18:23:23 +00:00
2e7aa607e4 Add most of the standard (32bit) ifTable.ifEntry MIB, the 5 that are left will require vpp_papi support, coming next 2021-09-05 18:12:02 +00:00
ac8c323abf Ensure VPPStat() is connected before each read; if VPP restarts, we'll lose the connection, and this ensures that once VPP comes back up, we'll re-connect to it seamlessly 2021-09-05 16:19:44 +00:00