
Conversation


@namiltd namiltd commented Jul 25, 2024

Initialize variables

@namiltd namiltd closed this Jul 25, 2024
@namiltd namiltd deleted the patch-1 branch July 25, 2024 21:46
kuba-moo pushed a commit that referenced this pull request Jul 26, 2024
…mode in i.MX 8QM

Fix the issue where MEM_TO_MEM fails on i.MX8QM due to the requirement
that both source and destination addresses must pass through the IOMMU.
Typically, peripheral FIFO addresses bypass the IOMMU, necessitating
only one of the source or destination to go through it.

Set "is_remote" to true to ensure both source and destination
addresses pass through the IOMMU.

The i.MX8 spec defines the "Local" and "Remote" buses as follows:
Local bus: bypasses the IOMMU to directly access other peripheral
registers, such as FIFOs.
Remote bus: goes through the IOMMU to access system memory.
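
As a rough illustration only (the surrounding function and its name are
assumptions, not the actual patch), the change amounts to marking the
channel as remote on the memory-to-memory preparation path:

  /* Hypothetical sketch of the fix: in the dmaengine memcpy preparation
   * callback, force the channel onto the "remote" bus so both the source
   * and the destination address go through the IOMMU on i.MX 8QM. */
  static struct dma_async_tx_descriptor *
  sketch_prep_memcpy(struct dma_chan *chan, dma_addr_t dst, dma_addr_t src,
                     size_t len, unsigned long flags)
  {
          struct fsl_edma_chan *fsl_chan = to_fsl_edma_chan(chan);

          fsl_chan->is_remote = true;     /* both ends are system memory */

          /* ... build the eDMA descriptor for the copy as before ... */
          return NULL;                    /* elided in this sketch */
  }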

The test failure log is as follows:
[ 66.268506] dmatest: dma0chan0-copy0: result #1: 'test timed out' with src_off=0x100 dst_off=0x80 len=0x3ec0 (0)
[ 66.278785] dmatest: dma0chan0-copy0: summary 1 tests, 1 failures 0.32 iops 4 KB/s (0)

Fixes: 72f5801 ("dmaengine: fsl-edma: integrate v3 support")
Signed-off-by: Joy Zou <[email protected]>
Cc: [email protected]
Reviewed-by: Frank Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
kuba-moo pushed a commit that referenced this pull request Jul 26, 2024
Fix warning at drivers/pci/msi/msi.h:121.

Recently, I added a PCI-to-PCIe bridge adaptor and a PCIe NVMe card
to my rp3440. Then I noticed this warning at boot:

 WARNING: CPU: 0 PID: 10 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x68/0x90
 CPU: 0 PID: 10 Comm: kworker/u32:0 Not tainted 6.9.7-parisc64 #1  Debian 6.9.7-1
 Hardware name: 9000/800/rp3440
 Workqueue: async async_run_entry_fn

We need to select PCI_MSI_ARCH_FALLBACKS when PCI_MSI is selected.

Signed-off-by: John David Anglin <[email protected]>
Cc: [email protected]	# v6.0+
Signed-off-by: Helge Deller <[email protected]>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional, the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago, mlx5 developed a boot-time algorithm
to test whether WC is available by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed, WC is
declared working.
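
A rough outline of that self-test structure, with made-up helper names
(post_64byte_doorbell(), read_large_tlp_counter(), WC_TEST_ITERATIONS)
standing in for the real mlx5 plumbing:

  /* Sketch only: generate many combining opportunities and declare WC
   * working if the device saw even one large TLP. All helpers here are
   * hypothetical stand-ins for the real driver code. */
  static bool wc_self_test_sketch(struct mlx5_core_dev *dev)
  {
          int i;

          for (i = 0; i < WC_TEST_ITERATIONS; i++)
                  post_64byte_doorbell(dev);      /* a combining opportunity */

          return read_large_tlp_counter(dev) > 0; /* any success => WC works */
  }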

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test; WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine any of them, which creates a performance
degradation and, critically, fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. Commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time; it performs well.

5) For future silicon, the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time.

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #2 is
still problematic at a low frequency and the kernel test fails.

Thus, make mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches userspace and
will properly detect the ability of the HW to support the WC workload
that userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2 and does actually give
a performance win to applications. Self-test failures with #2 occur on
roughly 3 out of 10 boots on some systems; #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using a new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one-line inline assembly in
the driver directly.
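
For illustration, such a 64-byte store can be expressed as a single NEON
st1 of four vector registers; the snippet below is only a sketch (not the
driver's actual code) and, in kernel context, would have to run between
kernel_neon_begin() and kernel_neon_end():

  /* Hypothetical sketch: copy 64 bytes to a write-combining mapping
   * using a single NEON store, matching what rdma-core does in
   * userspace. Must be wrapped in kernel_neon_begin()/_end(). */
  static inline void wc_store_64b_sketch(void __iomem *dst, const void *src)
  {
          asm volatile(
                  "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
                  "st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%1]"
                  : : "r"(src), "r"(dst)
                  : "memory", "v0", "v1", "v2", "v3");
  }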

Lastly, this conclusion came from the discussion with the ARM
maintainers, who confirmed that this is the best approach:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 29, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a single line of inline
assembly directly in the driver.
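
As a rough illustration of what such a NEON sequence could look like (an
assumption for illustration, not the actual patch; in-kernel use would also
need kernel_neon_begin()/kernel_neon_end() around it):

/* Hypothetical sketch: copy one 64-byte blob to WC-mapped MMIO with a
 * single NEON ld1/st1 pair. Illustrative only; not the driver's code. */
static inline void wc_copy64_neon(void __iomem *dst, const void *src)
{
	asm volatile(
		"ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
		"st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%1]\n\t"
		:
		: "r"(src), "r"(dst)
		: "memory", "v0", "v1", "v2", "v3");
}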

Lastly, this approach was concluded from the discussion with the ARM
maintainers, which confirms that it is the best solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
When the page size is 4K, the DEFAULT_FRAG_SIZE of 2048 ensures that,
with 3 fragments per WQE, odd-indexed WQEs always share a page with
their subsequent WQE, while WQEs consisting of 4 fragments do not.
However, this relationship does not hold for page sizes of 8K or
larger. In that case, wqe_index_mask cannot guarantee that newly
allocated WQEs won't share a page with old WQEs.
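
To see this concretely, a toy calculation under a simplified model (fragments
packed back to back at DEFAULT_FRAG_SIZE, ignoring headroom and the driver's
real wqe_index_mask computation) shows which consecutive WQEs land on a
shared page:

#include <stdio.h>

/* Toy model only: 3 fragments per WQE, each frag_size bytes, packed
 * back to back. Not the driver's real layout. */
static void show_sharing(unsigned long page_size, unsigned long frag_size)
{
	unsigned int wqe;

	printf("page_size=%lu:", page_size);
	for (wqe = 0; wqe < 6; wqe++) {
		unsigned long last_frag = wqe * 3 + 2;   /* last frag of this WQE  */
		unsigned long next_frag = (wqe + 1) * 3; /* first frag of next WQE */
		int shared = (last_frag * frag_size) / page_size ==
			     (next_frag * frag_size) / page_size;

		printf(" %u-%u:%s", wqe, wqe + 1, shared ? "shared" : "own");
	}
	printf("\n");
}

int main(void)
{
	show_sharing(4096, 2048); /* alternating shared/own pattern        */
	show_sharing(8192, 2048); /* pattern breaks: long runs of sharing  */
	return 0;
}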

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
Since skb_shared_info is stored at the end of the first fragment, its
frags->bv_page pointer can be overwritten, leading to an invalid memory
access when the skb is processed[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC, setting the MTU to 7657 or higher and
applying heavy network load (e.g. iperf) will first trigger a series of
WARNINGs[1] and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.
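
In other words, the change amounts to defining the constant in terms of the
page size rather than as a fixed 2048 bytes; roughly (a sketch, not the
exact patch):

/* Before (fixed at 2048 bytes; half a page only when PAGE_SIZE is 4K):
 *   #define DEFAULT_FRAG_SIZE 2048
 *
 * After (always half a page, so the WQE/page sharing assumptions above
 * keep holding on 8K and larger page systems as well):
 */
#define DEFAULT_FRAG_SIZE	(PAGE_SIZE / 2)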

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Sep 30, 2025
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Oct 1, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.

Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.

There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.

Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Oct 1, 2025
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d114 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit that referenced this pull request Oct 1, 2025
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.

Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").

Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed a boot-time algorithm
to test whether WC is available, using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.
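
The shape of such a self-test, sketched very loosely (the doorbell
mapping, the write pattern and the counter read below are placeholders
for mlx5-specific mechanisms, not real driver APIs):

  #include <linux/io.h>

  /* Loose sketch of a boot-time WC probe: issue many 64-byte writes
   * through a write-combining mapping, then ask the device how many of
   * them actually arrived as one large TLP.
   */
  static bool wc_probe(void __iomem *wc_db, struct probe_dev *dev)
  {
  	u64 pattern[8] = { 0 };		/* one 64-byte doorbell record */
  	int i;

  	for (i = 0; i < 256; i++) {
  		__iowrite64_copy(wc_db, pattern, ARRAY_SIZE(pattern));
  		wmb();			/* flush the combining buffer */
  	}
  	/* placeholder for a device-specific "large TLPs seen" counter */
  	return probe_dev_read_wc_hits(dev) > 0;
  }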

In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.

Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.

Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.

Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
   This was used prior to commit ef30228
   ("IB/mlx5: Use __iowrite64_copy() for write combining stores")
   and failed almost all the time on Grace CPUs.

2) ARM64 assembly with consecutive 8 byte stores. This was implemented
   as an arch-generic __iowriteXX_copy() family of functions suitable
   for performance use in drivers for WC. commit ead7911
   ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
   ARM implementation.

3) ARM64 assembly with consecutive 16 byte stores. This was rejected
   from kernel use over fears of virtualization failures. Common ARM
   VMMs will crash if STP is used against emulated memory.

4) A single NEON store instruction. Userspace has used this option for a
   very long time, it performs well.

5) For future silicon the new ST64B instruction is guaranteed to
   generate a 64 byte TLP 100% of the time.

The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #2 is
still problematic at a low frequency and the kernel test fails.

Thus, make mlx5 use the same instructions as userspace during the
boot-time WC self-test. This way the WC test matches userspace and
will properly detect the ability of the HW to support the WC workload
that userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 occur on
roughly 3 out of 10 boots on some systems; #4 has never produced a boot
failure.

There is no real general use case for a NEON-based WC flow in the
kernel. It is not suitable for any performance-path work, as getting
into/out of a NEON context is fairly expensive compared to the gain
from WC. Future CPUs will fix this issue with a new ARM instruction,
and __iowriteXX_copy() will be updated to use it automatically,
probably via the ALTERNATIVES mechanism.

Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one-line inline assembly
sequence in the driver directly.
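
A rough arm64 sketch of what such a single NEON store can look like,
for illustration only (kernel_neon_begin()/kernel_neon_end() are the
real arm64 helpers that make SIMD registers safe to use in kernel code;
the function name is made up, and the file's compiler flags may need
adjusting since arm64 kernels are normally built with
-mgeneral-regs-only):

  #include <asm/neon.h>

  /* Copy one 64-byte doorbell record to a WC mapping with a single
   * NEON store, mirroring what rdma-core does in userspace.
   */
  static void wc_store_64b(void __iomem *to, const void *from)
  {
  	kernel_neon_begin();
  	asm volatile("ldp q0, q1, [%0]\n\t"
  		     "ldp q2, q3, [%0, #32]\n\t"
  		     "st1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%1]"
  		     : : "r"(from), "r"(to)
  		     : "memory", "v0", "v1", "v2", "v3");
  	kernel_neon_end();
  }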

Lastly, this approach was settled in a discussion with the ARM
maintainers, who confirmed it is the best way to solve the problem:
https://lore.kernel.org/r/[email protected]

Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kuba-moo pushed a commit that referenced this pull request Oct 7, 2025
Attempt to block further PCI config accesses only once, without
waiting, when trying to acquire the VSC GW lock. PCI error recovery on
s390 may be blocking such accesses while it waits for the devlink lock
that mlx5_crdump_collect() already holds. In effect, this sacrifices
the crdump if there is contention over PCI config accesses with other
tasks.
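
A minimal sketch of the key step, assuming the existing VSC gateway
lock path calls pci_cfg_access_lock() unconditionally
(pci_cfg_access_trylock() is the existing non-blocking variant in the
PCI core; the helper name below is made up, and in the driver this
check would sit at the top of mlx5_vsc_gw_lock()):

  static int vsc_gw_lock_nonblocking(struct mlx5_core_dev *dev)
  {
  	/* Do not wait for in-flight config accesses to drain: if PCI
  	 * error recovery has already blocked them, waiting here would
  	 * deadlock against the devlink lock held on the crdump path.
  	 * Give up on the dump instead.
  	 */
  	if (!pci_cfg_access_trylock(dev->pdev))
  		return -EBUSY;

  	return 0;	/* caller goes on to take the VSC gateway semaphore */
  }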

During error recovery testing a pair of tasks was reported to be hung:

[10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
[10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
[10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
[10320.359176] Call Trace:
[10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
[10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
[10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
[10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
[10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
[10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
[10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
[10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
[10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
[10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
[10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
[10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[10320.359492] Call Trace:
[10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
[10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
[10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
[10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
[10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
[10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
[10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
[10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
[10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
[10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
[10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30

No kernel log of the exact same error with an upstream kernel is
available - but the very same deadlock situation can be constructed there,
too:

- task: kmcheck
  mlx5_unload_one() tries to acquire devlink lock while the PCI error
  recovery code has set pdev->block_cfg_access by way of
  pci_cfg_access_lock()
- task: kworker
  mlx5_crdump_collect() tries to set block_cfg_access through
  pci_cfg_access_lock() while devlink_health_report() had acquired
  the devlink lock.

A similar deadlock situation can be reproduced by requesting a
crdump with
  > devlink health dump show pci/<BDF> reporter fw_fatal

while PCI error recovery is executed on the same <BDF> physical function
by mlx5_core's pci_error_handlers. On s390 this can be injected with
  > zpcictl --reset-fw <BDF>

With this patch applied, extensive tests on the same setup that showed
the original deadlock did not reproduce it, nor did the second deadlock
situation trigger.

Fixes: b25bbc2 ("net/mlx5: Add Vendor Specific Capability access gateway")
Signed-off-by: Gerd Bayer <[email protected]>
Signed-off-by: NipaLocal <nipa@local>