wrs_sfp_dump returns "Unable to read consistent data from HAL's shmem!"

An SFP inquiry on a single port (wrs-# wrs_sfp_dump -L -p 01) frequently ends in “Unable to read consistent data from HAL’s shmem!”.
This happens when the switch (v5.0.1) is operated with SFP diagnostics enabled (i.e. CONFIG_READ_SFP_DIAG_ENABLE=y in /wr/etc/dot-config) and many SFPs compliant with SFF-8472 (Digital Diagnostics Monitoring) are plugged in.

Some further analysis:
By probing the I2C lines with an oscilloscope, it has been observed that a monitoring sequence runs once per second. With no SFPs plugged in, scanning the I2C bus takes ~13 ms, and each SFF-8472 compliant SFP adds an I2C sequence of ~38 ms. For a fully loaded switch this adds up to 13 + 18 * 38 ms = ~700 ms. This may explain why the issue appears more often when many SFPs are plugged in.

Question 1: Is “Unable to read consistent data from HAL’s shmem!” due to a clash on the I2C bus?
Question 2: If so, is there a bus arbitration problem that needs to be solved?
Question 3: In my opinion the monitoring rate is quite high (once a second). The chance of an I2C clash (if any) would be lower if the rate were reduced to once every 10 or even 60 seconds. Can this be done?
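
As a rough check of the figures above, here is a small Python sketch that recomputes the length of one monitoring sequence and the fraction of each period it occupies for periods of 1, 10 and 60 seconds. The constants are the oscilloscope measurements quoted in this post. The function names are made up for illustration, and treating the whole sequence as the window in which a request can fail is an assumption, not something confirmed from the HAL source.

```python
# Figures measured on the oscilloscope (taken from the post above).
BASE_SCAN_MS = 13   # I2C scan with no SFPs plugged in
PER_SFP_MS = 38     # extra time per SFF-8472 compliant SFP
PORTS = 18          # fully loaded switch


def scan_time_ms(n_sfps: int) -> int:
    """Estimated duration of one monitoring sequence, in milliseconds."""
    return BASE_SCAN_MS + n_sfps * PER_SFP_MS


def busy_fraction(n_sfps: int, period_s: float) -> float:
    """Fraction of each monitoring period spent inside the sequence.

    Assumption: the sequence length is used as a proxy for the time
    during which an external request may fail.
    """
    return scan_time_ms(n_sfps) / (period_s * 1000.0)


if __name__ == "__main__":
    print(f"sequence length: {scan_time_ms(PORTS)} ms")
    for period_s in (1, 10, 60):
        pct = 100.0 * busy_fraction(PORTS, period_s)
        print(f"period {period_s:>2} s: busy {pct:.1f} % of the time")
```

With 18 SFPs the sequence comes to 697 ms, i.e. roughly 70 % of a 1 s period, about 7 % of a 10 s period and just over 1 % of a 60 s period. A longer period shrinks the window but does not remove it.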

Hi Peter, I ran into an I2C read/write bug earlier when I was using SFPs produced by FS. Please check commit e3db72e907fe171f45b5db87e9e128f3ebc7392c in wr-switch-sw. It was committed after version 5.0.1. Maybe it addresses the bug you’ve encountered.

Hi all,

The fix mentioned by Hongming fixed (I hope) the corruption of the SFPs’ EEPROM during reads. It will certainly not solve the issue you describe here.

I was not aware that it takes ~700 ms to read all of the monitoring data. The problem is that HAL’s shmem is locked while the SFPs’ DOM area is being read. Setting the period to 10 or 60 seconds may only mitigate the problem, not solve it. IMHO, SFP data should first be copied to local memory and then copied to shmem in one go. Please note that locking/unlocking the memory for the read of one SFP at a time will not solve the problem. The time during which the shmem is locked should be as short as possible; chopping it into several slices will not help here.
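
To make the "copy locally, then publish at once" idea concrete, here is a minimal sketch in Python rather than the HAL’s C. Everything in it is hypothetical: `read_sfp_dom`, `SharedSfpTable` and `monitoring_cycle` are illustrative names, the dictionary stands in for the SFP data in shmem, and a plain mutex stands in for whatever locking scheme the HAL actually uses. The only point is that the slow I2C reads happen with no lock held, and the lock covers just a short in-memory copy.

```python
import threading
import time

N_PORTS = 18  # port count of a fully loaded switch, as in the thread


def read_sfp_dom(port: int) -> dict:
    """Hypothetical stand-in for the slow I2C read of one SFP's DOM area."""
    time.sleep(0.001)  # placeholder for the ~38 ms bus transaction
    return {"port": port, "temperature_raw": 0, "rx_power_raw": 0}


class SharedSfpTable:
    """Hypothetical stand-in for the SFP data kept in HAL's shmem."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._table = {}

    def publish(self, snapshot: dict) -> None:
        # The lock is held only for this in-memory copy, never for I2C traffic.
        with self._lock:
            self._table = dict(snapshot)

    def read(self, port: int):
        # Readers (e.g. a dump tool) wait at most for one short copy.
        with self._lock:
            entry = self._table.get(port)
            return dict(entry) if entry is not None else None


def monitoring_cycle(shared: SharedSfpTable) -> None:
    # 1) Slow part: gather all DOM data into local memory, no lock held.
    local = {port: read_sfp_dom(port) for port in range(1, N_PORTS + 1)}
    # 2) Fast part: copy the complete snapshot into the shared table at once.
    shared.publish(local)
```

In this arrangement a reader either sees the previous complete snapshot or the new one, and how long it can be blocked no longer depends on how many SFPs are plugged in.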

Peter, would you like a patch for v5.0.1, or can you wait for the next release?
Thanks for investigating this issue!

Adam

Adam, Hongming,

Thanks for your replies!
Lowering the sequence frequency will not solve the problem, but it will leave more time for external requests such as wrs_sfp_dump -L -p 01. An option to define the sequence period would be a “nice to have”.
The problem would indeed be solved by arbitration of the shared memory. Locking/unlocking the memory for the read of one SFP would help in a scheme where an external request is kept pending and is served when the automatic process of reading the Digital Diagnostics moves from one SFP to the next.

When is the next release scheduled? I guess we can wait for it, since we realize that implementing something like arbitration takes more than a few lines of code, and we are already very happy that you will look into this issue.

Peter