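1. My basic setup
A CPU-only laptop, used to control a remote GPU server through MobaXterm. I do not have sudo privileges on the server.
2. Installing TensorFlow
Please open the official English installation guide (https://tensorflow.google.cn/install/pip); the Chinese version may be missing some of the notes.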
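conda create -n tf_sionna python==3.8 # create a new virtual environment
conda activate tf_sionna # activate the newly created environment
pip install tensorflow-gpu==2.10 # install TensorFlow; tensorflow[and-cuda] might also work, though as far as I can tell that extra only applies to TensorFlow 2.14 and later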
conda activate tf_sionna
python
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
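This is how I check the environment. For now the check reports that the GPU cannot be used, yet the GPU is still used when I actually run my code. I have not figured out why yet; to be added later!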
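To see a little more than the bare device list, a small diagnostic along the following lines may help. It is only a sketch: `tf_gpu_report` is a hypothetical helper name, and it assumes a TensorFlow release recent enough to provide `tf.sysconfig.get_build_info()`. If the build reports no CUDA support, or the GPU list is empty while the build does support CUDA, that at least narrows down where to look.

```python
import importlib.util


def tf_gpu_report():
    """Return a dict describing what TensorFlow can see, or None if TF is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    build = tf.sysconfig.get_build_info()  # build-time settings of the installed wheel
    return {
        "tf_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # These keys are absent on CPU-only builds, hence .get()
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    report = tf_gpu_report()
    if report is None:
        print("tensorflow is not installed in this environment")
    else:
        for key, value in report.items():
            print(f"{key}: {value}")
```

Running it once on the node where the check is done and once inside the actual job would show whether the two see the same thing.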
3. Installing Sionna
pip install sionna
Installing Sionna automatically installed TensorFlow 2.13.1 as well. Having two versions present at the same time is a curious thing!
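To see exactly which distributions ended up in the environment, the installed versions can be listed with the standard library. This is a minimal sketch; `installed_versions` is a hypothetical helper name. As far as I know, `tensorflow` and `tensorflow-gpu` are separate pip distributions that both provide the importable `tensorflow` package, so whichever was installed last is most likely what `import tensorflow` actually loads.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(["tensorflow", "tensorflow-gpu", "sionna"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Recording this output before and after `pip install sionna` would make it clear whether the install changed the TensorFlow version.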
4. The code I ran (to be completed)
import os, json
import tensorflow as tf
import numpy as np
import random
#import seaborn as sns
import sionna
import matplotlib.pyplot as plt
import pickle
from tensorflow.keras import Model
from tensorflow.keras.layers import Layer, Conv2D, LayerNormalization
from tensorflow.nn import relu
from sionna.channel.tr38901 import Antenna, AntennaArray, CDL, UMa
from sionna.channel import OFDMChannel, GenerateOFDMChannel, ApplyOFDMChannel
from sionna.ofdm import ResourceGrid
gpu_num = 0 # Use "" to use the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = f"{gpu_num}"
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
if gpus:
    try:
        # Allocate GPU memory on demand instead of reserving it all up front
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError as e:
        print(e)
# Avoid warnings from TensorFlow
tf.get_logger().setLevel('ERROR')
5. Results
(1) After running pip install sionna in the tf2.5 environment that already existed on the lab server, the output was:
/gpu01/miniconda3/envs/tf2.5/bin/python3 /gpu03/gaosongling/RailwayScenario/gen_dataset.py
2025-02-25 11:27:23.147936: I tensorflow/core/platform/cpu_feature_guard.cc:193]
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-25 11:27:24.299367: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64]
Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/clustertech/chess/ng/bin
2025-02-25 11:27:24.299498: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64]
Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/clustertech/chess/ng/bin
2025-02-25 11:27:24.299524: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38]
TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2025-02-25 11:27:26.051376: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64]
Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/clustertech/chess/ng/bin
2025-02-25 11:27:26.051438: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265]
failed call to cuInit: UNKNOWN ERROR (303)
2025-02-25 11:27:26.051497: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156]
kernel driver does not appear to be running on this host (mgmt): /proc/driver/nvidia/version does not exist
[]
2025-02-25 11:27:26.052310: I tensorflow/core/platform/cpu_feature_guard.cc:193]
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P0 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P1 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P2 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P3 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P4 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P5 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P6 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P7 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P8 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P9 finished
Process finished with exit code 0
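The warnings in the log above name each library TensorFlow failed to load, and they are not equally important: the log itself ties the `libnvinfer` libraries to TensorRT, whereas `libcuda.so.1` is the NVIDIA driver library, without which no GPU can be used. A short sketch for pulling those names out of a saved log follows; `missing_libraries` is a hypothetical helper, and the sample text is abbreviated from the log above.

```python
import re

# Matches the dso_loader warning that names the library that could not be opened
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names TensorFlow failed to load, in order, without duplicates."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


sample = """
Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file
Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file
Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file
"""

if __name__ == "__main__":
    for name in missing_libraries(sample):
        print(name)
```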
(2) In the tf_sionna environment I created myself, the output was:
2025-02-25 11:42:29.949793: I tensorflow/core/platform/cpu_feature_guard.cc:182]
This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-25 11:42:33.111435: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38]
TF-TRT Warning: Could not find TensorRT
2025-02-25 11:42:43.428919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639]
Created device /job:localhost/replica:0/task:0/device:GPU:0 with 617 MB memory: -> device: 0, name: NVIDIA RTX A4000, pci bus id: 0000:c4:00.0, compute capability: 8.6
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P0 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P1 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P2 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P3 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P4 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P5 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P6 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P7 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P8 finished
(1, 256, 1, 2) (1, 256, 1, 2)
generate train data P9 finished
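Although the TensorFlow test reports no GPU, the GPU can still be used at run time, which drove me up the wall. I reinstalled the environment several times and never got a positive answer from the TensorFlow GPU test, to the point where I nearly gave up. I also find the official TensorFlow documentation not detailed enough, and nowhere near as clear as PyTorch's.
Later I asked a labmate how the environment should be set up. He said that adding Sionna to the lab's existing tf2.5 environment would be enough. After adding it the code did run, so I went back into the virtual environment to test whether the GPU was usable, and again it was not detected. Perhaps setting up an environment on a server is simply not quite the same as on a local machine that has a GPU.
In addition, the TensorFlow version inside tf2.5 is not 2.5 but 2.11, as shown in the figure below. Because I did not record the versions in the original tf2.5 environment beforehand, I cannot rule out that adding Sionna upgraded TensorFlow.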
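One clue the log in (1) does give is the line "kernel driver does not appear to be running on this host (mgmt): /proc/driver/nvidia/version does not exist", so it may be worth checking which machine the Python process is actually running on when the GPU test is done. The sketch below uses only the standard library; `describe_host` is a hypothetical helper, and reading "mgmt" as a node without a GPU is a guess on my part, not something the log confirms.

```python
import os
import shutil
import socket


def describe_host(driver_file="/proc/driver/nvidia/version"):
    """Collect a few facts about the machine this interpreter is running on."""
    return {
        "hostname": socket.gethostname(),
        # The same file TensorFlow's diagnostics look for in the log above
        "nvidia_driver_loaded": os.path.exists(driver_file),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for key, value in describe_host().items():
        print(f"{key}: {value}")
```

If the hostname differs between the interactive test and the real run, or the driver file is missing in one of them, the two results are not comparable.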
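6. Possible errors
In my installation, if tensorflow-gpu was not specified explicitly, for example by running plain pip install tensorflow, running the code mainly produced the following error: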
ImportError: jit_init_thread_state(): the LLVM backend is inactive because the LLVM shared library ("libLLVM.so")
could not be found! Set the DRJIT_LIBLLVM_PATH environment variable to specify its path.
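Following what I read on the TensorFlow website, I tried downloading an archive from https://github.com/llvm/llvm-project/releases/tag/llvmorg-19.1.0 but did not know which file to pick. In the end I downloaded two of the archives with a Linux suffix, but extracting them locally always failed with messages along the lines of "file XXX: permission denied", so this did not solve the problem. PS: https://github.com/llvm/llvm-project/releases/tag/llvmorg-17.0.2
https://github.com/NVlabs/sionna/discussions/296 offers one answer to this problem, but it is not explained in detail either, so I still could not solve it.
Finally, the complete error output is as follows: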
2025-02-24 23:11:59.500085: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-24 23:11:59.569298: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-24 23:11:59.569393: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-24 23:11:59.570989: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-24 23:11:59.582707: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-24 23:11:59.583154: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-24 23:12:02.438310: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-02-24 23:12:08.482323: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Traceback (most recent call last):
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/__init__.py", line 107, in __getattribute__
_import('mitsuba.mitsuba_' + variant + '_ext'),
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
File "<frozen importlib._bootstrap>", line 565, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1108, in create_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: jit_init_thread_state(): the LLVM backend is inactive because the LLVM shared library ("libLLVM.so") could not be found! Set the DRJIT_LIBLLVM_PATH environment variable to specify its path.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpu03/gaosongling/RailwayScenario/gen_dataset.py", line 6, in <module>
import sionna
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/sionna/__init__.py", line 18, in <module>
from . import rt
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/sionna/rt/__init__.py", line 29, in <module>
mi.set_variant('llvm_ad_rgb')
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/__init__.py", line 317, in set_variant
_import('mitsuba.ad.integrators')
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/python/ad/__init__.py", line 2, in <module>
from .integrators import *
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/python/ad/integrators/__init__.py", line 25, in <module>
importlib.import_module('mitsuba.ad.integrators.' + name)
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/python/ad/integrators/common.py", line 8, in <module>
class ADIntegrator(mi.CppADIntegrator):
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/__init__.py", line 253, in __getattribute__
result = module.__getattribute__(key)
File "/home/gaosongling/.conda/envs/tf_sionna/lib/python3.9/site-packages/mitsuba/__init__.py", line 115, in __getattribute__
raise AttributeError(e)
AttributeError: jit_init_thread_state(): the LLVM backend is inactive because the LLVM shared library ("libLLVM.so") could not be found! Set the DRJIT_LIBLLVM_PATH environment variable to specify its path.
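The traceback above shows sionna.rt calling mi.set_variant('llvm_ad_rgb'), and the log before it says no CUDA driver was found, so the LLVM requirement seems to come from Sionna falling back to the CPU variant of Mitsuba rather than from TensorFlow itself. If a libLLVM shared library does exist somewhere readable without sudo, the environment variable named in the error message can be set before importing sionna. The following is a sketch under assumptions: `find_libllvm` and `configure_drjit_llvm` are hypothetical helpers, and the default search directories (the active environment's lib folder and ~/llvm/lib) are guesses that need adapting.

```python
import glob
import os
import sys


def find_libllvm(search_dirs):
    """Return the first libLLVM shared library found in search_dirs, or None."""
    for directory in search_dirs:
        matches = sorted(glob.glob(os.path.join(directory, "libLLVM*.so*")))
        if matches:
            return matches[0]
    return None


def configure_drjit_llvm(search_dirs=None):
    """Set DRJIT_LIBLLVM_PATH if a library is found; call this before importing sionna."""
    if search_dirs is None:
        # Assumed locations only; adjust to wherever LLVM actually lives
        search_dirs = [
            os.path.join(sys.prefix, "lib"),
            os.path.expanduser("~/llvm/lib"),
        ]
    path = find_libllvm(search_dirs)
    if path is not None:
        os.environ["DRJIT_LIBLLVM_PATH"] = path
    return path


if __name__ == "__main__":
    found = configure_drjit_llvm()
    print(found or "no libLLVM shared library found in the searched directories")
```

Whether this is enough depends on the LLVM version being one that the installed Dr.Jit accepts, which I have not verified.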
7. References
"TensorFlow binary is optimized to use available CPU instructions in performance-critical operations." https://blog.csdn.net/m0_66233032/article/details/134606306?fromshare=blogdetail&sharetype=blogdetail&sharerId=134606306&sharerefer=PC&sharesource=m0_61175448&sharefrom=from_link
"Solving 'TF-TRT Warning: Could not find TensorRT' in TensorFlow" https://blog.csdn.net/weixin_45710350/article/details/140232873?fromshare=blogdetail&sharetype=blogdetail&sharerId=140232873&sharerefer=PC&sharesource=m0_61175448&sharefrom=from_link
"import tensorflow as tf, but Could not find TensorRT" https://blog.csdn.net/m0_54377950/article/details/145753869?fromshare=blogdetail&sharetype=blogdetail&sharerId=145753869&sharerefer=PC&sharesource=m0_61175448&sharefrom=from_link