🎓

財商學院


Nvidia CUDA in 100 Seconds
🎬 互動字幕 (34段)
00:00
CUDA, a parallel computing platform that allows you to use your GPU for more than just plain video games.
CUDA 是一個平行運算平台,讓你能將 GPU 用於電子遊戲以外的更多用途。
00:05
Compute Unified Device Architecture was developed by NVIDIA in 2007, based on the prior work of Ian Buck and John Nickolls.
統一運算裝置架構(Compute Unified Device Architecture)由 NVIDIA 於 2007 年推出,基於 Ian Buck 和 John Nickolls 先前的研究。
00:11
Since then, CUDA has revolutionized the world by allowing humans to compute large blocks of data in parallel, which has unlocked the true potential of the deep neural networks behind artificial intelligence.
從那時起,CUDA 讓人類能夠平行處理大量數據區塊,從而釋放了人工智慧背後深度神經網路的真正潛力,徹底改變了世界。
00:21
The graphics processing unit, or GPU, is historically used for what the name implies.
圖形處理單元(GPU)過去的用途正如其名稱所暗示。
00:23
To compute graphics, when you play a game in 1080p at 60fps, you've got over 2 million pixels on the screen that may need to be recalculated after every frame, which requires hardware that can do a lot of matrix multiplication and vector transformations in parallel.
為了計算圖形,當你在 1080p 解析度、每秒 60 幀下遊玩遊戲時,螢幕上有超過 200 萬個像素可能需要在每一幀後重新計算,這需要能夠大量進行矩陣乘法和向量轉換的硬體。
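As a rough sanity check of the pixel numbers quoted above, the arithmetic can be written out directly (host-side code only, no GPU required; purely illustrative):
上述像素數字可以用簡單的程式驗算(僅為示意,純 CPU 端運算,不需要 GPU):

```cuda
#include <cstdio>

int main() {
    // 1080p = 1920 x 1080 pixels per frame.
    const long long pixels = 1920LL * 1080LL;   // 2,073,600 (> 2 million)
    // At 60 fps, every pixel may be recalculated 60 times per second.
    const long long perSecond = pixels * 60LL;  // 124,416,000 updates/s
    printf("pixels/frame: %lld, updates/s: %lld\n", pixels, perSecond);
    return 0;
}
```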
00:37
And I mean a lot.
而且是大量的運算。
00:41
Modern GPUs are measured in teraflops, that is, how many trillions of floating-point operations they can handle per second.
現代 GPU 的效能以 teraflops 來衡量,也就是每秒能處理多少兆(一兆 = 10¹²)次浮點運算。
00:46
Unlike modern CPUs like the Intel i9, which has 24 cores, a modern GPU like the RTX 4090 has over 16,000 cores.
與擁有 24 個核心的 Intel i9 等現代 CPU 不同,像 RTX 4090 這樣的現代 GPU 擁有超過 16,000 個核心。
00:54
A CPU is designed to be versatile, while a GPU is designed to go really fast in parallel.
CPU 的設計旨在通用性,而 GPU 的設計則是為了在平行運算中達到極快的速度。
00:56
CUDA allows developers to tap into the GPU's power, and data scientists all around the world are using it at this very moment, trying to train the most powerful machine learning models.
CUDA 讓開發者能夠利用 GPU 的強大算力,全世界的數據科學家此刻正使用它來嘗試訓練最強大的機器學習模型。
01:09
It works like this.
它的運作原理如下。
01:10
You write a function, called a CUDA kernel, that runs on the GPU.
你編寫一個稱為 CUDA kernel(核心)的函式,在 GPU 上運行。
01:11
You then copy some data from your main RAM over to the GPU's memory, then the CPU will tell the GPU to execute that function or kernel in parallel.
接著,你將一些數據從主記憶體(RAM)複製到 GPU 的記憶體,然後 CPU 會告訴 GPU 以平行方式執行該函式或核心。
01:19
The code is executed in blocks of threads, and the blocks themselves are organized into a multi-dimensional grid.
程式碼以 Block(執行緒區塊)為單位執行,而這些 Block 本身又被組織成一個多維的 Grid(網格)。
01:26
Then the final result from the GPU is copied back to the main memory.
最後,GPU 產生的最終結果會被複製回主記憶體。
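The copy → launch → copy-back flow just described can be sketched with the explicit CUDA memory APIs. This is an illustrative sketch, not the video's exact code; the kernel name `add` and the array size `N` are assumptions:
上述「複製 → 執行 → 複製回來」的流程,可以用明確的 CUDA 記憶體 API 大致示意如下(僅為示意,非影片原始程式碼;`add` 與 `N` 為假設的名稱):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The CUDA kernel: each GPU thread adds one pair of elements.
__global__ void add(const int* a, const int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    const int N = 256;
    int hostA[N], hostB[N], hostC[N];
    for (int i = 0; i < N; ++i) { hostA[i] = i; hostB[i] = 2 * i; }

    // 1. Copy input data from main RAM over to the GPU's memory.
    int *devA, *devB, *devC;
    cudaMalloc(&devA, N * sizeof(int));
    cudaMalloc(&devB, N * sizeof(int));
    cudaMalloc(&devC, N * sizeof(int));
    cudaMemcpy(devA, hostA, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, N * sizeof(int), cudaMemcpyHostToDevice);

    // 2. The CPU tells the GPU to execute the kernel in parallel:
    //    here, 1 block of N threads.
    add<<<1, N>>>(devA, devB, devC);

    // 3. Copy the final result back to main memory.
    //    (This cudaMemcpy also waits for the kernel to finish.)
    cudaMemcpy(hostC, devC, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("hostC[3] = %d\n", hostC[3]);  // expect 3 + 6 = 9

    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    return 0;
}
```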
01:27
It’s a piece of cake, let's go ahead and build a CUDA application right now.
這很簡單,讓我們現在就來建立一個 CUDA 應用程式。
01:30
First you'll need an NVIDIA GPU, then install the CUDA Toolkit.
首先你需要一張 NVIDIA GPU,然後安裝 CUDA toolkit。
01:34
CUDA includes device drivers, runtime, compilers, and dev tools, but the actual code is most often written in C++, as I'm doing here in Visual Studio.
CUDA 包含設備驅動程式、執行階段、編譯器和開發工具,但實際的程式碼最常使用 C++ 編寫,就像我現在在 Visual Studio 裡做的一樣。
01:42
First, we use the `__global__` specifier to define a function, or CUDA kernel, that runs on the actual GPU.
首先,我們使用 `__global__` 修飾詞來定義一個在實際 GPU 上運行的函式或 CUDA 核心。
01:50
This function adds two vectors or arrays together.
這個函式將兩個向量或陣列相加。
01:51
It takes pointer arguments A and B, which are the two vectors to be added together, and pointer C for the result.
它需要指標引數 A 和 B,也就是要相加的兩個向量,以及用於存放結果的指標 C。
01:58
C equals A plus B, but because hypothetically we're doing billions of operations in parallel, we need to calculate the global index of the thread in the block that we're working on.
C 等於 A 加 B,但假設我們正在平行執行數十億個運算,我們需要計算該執行緒在所屬區塊中的全域索引。
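The kernel just described might look like the following sketch (variable names are assumptions; the global index combines the block's position in the grid with the thread's position inside the block):
上面描述的 kernel 大致如下(變數名稱僅為示意;全域索引由區塊在網格中的位置與執行緒在區塊內的位置組合而成):

```cuda
// __global__ marks a function that runs on the GPU (a CUDA kernel).
// Each GPU thread computes exactly one element of c = a + b.
__global__ void add(const int* a, const int* b, int* c) {
    // Global thread index: which block we are in (blockIdx.x),
    // times the number of threads per block (blockDim.x),
    // plus our position within the block (threadIdx.x).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
```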
02:04
From there, we can use the `__managed__` specifier, which tells CUDA that this data can be accessed from both the host CPU and the device GPU, without the need to manually copy data between them.
接著,我們可以使用 `__managed__` 修飾詞,它告訴 CUDA 這些資料可以同時被宿主端 CPU 和裝置端 GPU 存取,而不需要在兩者之間手動複製資料。
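Declaring the arrays with `__managed__` places them in unified memory, so the same names work on both the CPU and the GPU (a sketch; N = 256 matches the lesson's example):
以 `__managed__` 宣告陣列會將其放進統一記憶體,CPU 與 GPU 都能直接存取(示意;N = 256 對應本課範例):

```cuda
// Unified (managed) memory: visible to both host and device,
// no manual cudaMemcpy needed.
const int N = 256;
__managed__ int a[N], b[N], c[N];
```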
02:12
And now we can write a main function for the CPU that runs the CUDA kernel.
現在,我們可以為 CPU 編寫一個主函式來執行 CUDA kernel。
02:17
We use a for loop to initialize our arrays with data, then from there we pass this data to the add function to run it on the GPU. But you might be wondering what these weird triple brackets are.
我們使用一個 for 迴圈以資料初始化陣列,然後將這些資料傳遞給 add 函式以便在 GPU 上執行。但你可能想知道這些奇怪的三重括號是什麼。
02:29
They allow us to configure the CUDA kernel launch to control how many blocks and how many threads per block are used to run this code in parallel, and that's crucial for optimizing multidimensional data structures like tensors used in deep learning.
它們讓我們能夠配置 CUDA kernel 的啟動,以控制使用多少個區塊以及每個區塊多少個執行緒來平行執行此程式碼,這對於優化像深度學習中使用的張量(tensors)這樣的多維資料結構至關重要。
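The triple-bracket launch configuration can be sketched like this (`add`, `a`, `b`, `c`, and `N` follow the lesson's vector-add example; the multi-dimensional `dim3` form is what makes tensor-shaped data convenient):
三重括號的啟動設定大致如下(`add` 與 `N` 等名稱沿用本課的向量相加範例;多維的 `dim3` 形式便於處理張量狀的資料):

```cuda
// <<<blocks, threadsPerBlock>>> : how the parallel work is split up.
int threadsPerBlock = 256;
// Round up so every one of the N elements gets its own thread.
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks, threadsPerBlock>>>(a, b, c);

// Grids and blocks can also be multi-dimensional:
dim3 grid(16, 16);    // 16 x 16 = 256 blocks
dim3 block(8, 8, 4);  // 8 * 8 * 4 = 256 threads per block
// kernel<<<grid, block>>>(...);
```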
02:38
From there, `cudaDeviceSynchronize()` will pause the execution of this code and wait for it to complete on the GPU.
接著,`cudaDeviceSynchronize()` 會暫停此程式碼的執行,並等待其在 GPU 上完成。
02:46
When it finishes and copies the data back to the host machine, we can then use the result and print it to the standard output.
當它完成並將資料複製回宿主機時,我們就可以使用結果並將其列印到標準輸出。
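Putting the pieces together, a complete managed-memory version of the program described above might look like this (a sketch under the lesson's assumptions; compile with nvcc):
把各步驟組合起來,上述流程的完整 managed 記憶體版本大致如下(依本課假設所寫的示意程式;需以 nvcc 編譯):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const int* a, const int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

const int N = 256;
__managed__ int a[N], b[N], c[N];  // accessible from CPU and GPU

int main() {
    // Initialize the input arrays with data on the CPU.
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = i; }

    // Launch the kernel: 1 block of 256 threads.
    add<<<1, N>>>(a, b, c);

    // Pause the CPU until the GPU kernel has finished,
    // so the results in c are ready to read.
    cudaDeviceSynchronize();

    printf("c[10] = %d\n", c[10]);  // expect 10 + 10 = 20
    return 0;
}
```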
02:51
Now, let's execute this code with a CUDA compiler by clicking the play button.
現在,讓我們透過點擊播放按鈕,使用 CUDA 編譯器執行此程式碼。
02:53
Congratulations, you just ran 256 threads in parallel on your GPU!
恭喜,你剛剛在你的 GPU 上平行執行了 256 個執行緒!
02:57
But if you want to go beyond, NVIDIA's GTC conference is coming up in a few weeks.
但如果你想更進一步,NVIDIA 的 GTC 大會將在幾週後舉行。
03:01
It's free to attend virtually, featuring talks about building massive parallel systems with CUDA.
可以免費線上參加,並包含關於使用 CUDA 構建大規模平行系統的演講。
03:09
Thanks for watching, and I will see you in the next one.
感謝收看,我們下支影片見。

Nvidia CUDA in 100 Seconds

📝 影片摘要

本單元深入淺出地介紹 NVIDIA 的 CUDA 平台,這是一項徹底改變運算世界的平行運算技術。課程首先闡述 CUDA 如何從 GPU 的圖形運算基礎演變而來,解釋 GPU 相較於 CPU 擁有上萬個核心的巨大平行處理優勢,並說明 teraflops 這項運算效能衡量標準。接著,透過實際的 C++ 程式碼範例,教學演示如何編寫 CUDA kernel,使用 `__global__` 修飾詞與 `__managed__` 統一記憶體,並配置 Block 與 Thread 的執行網格,以在 GPU 上高效執行向量相加運算。最後,課程強調此技術對於人工智慧與深度學習模型訓練的關鍵影響,並鼓勵學習者關注 NVIDIA GTC 大會以獲取更多專業知識。

📌 重點整理

  • CUDA 是由 NVIDIA 開發的平行運算平台,專為利用 GPU 進行非圖形運算而設計。
  • GPU 擁有上萬個核心(如 RTX 4090 超過 16,000 個),擅長處理大規模平行運算,而 CPU 則適合通用型任務。
  • CUDA 的核心機制是透過 CUDA Kernel(核心函式)在 GPU 上平行執行大量數據處理。
  • 程式執行結構包含 Grid(網格)、Block(區塊)與 Thread(執行緒),是優化多維資料的關鍵。
  • 使用 `__global__` 修飾詞定義 GPU 函式,並利用 `__managed__` 統一記憶體簡化 CPU 與 GPU 間的資料傳輸。
  • CUDA 應用程式的基本流程:從主機複製資料 → GPU 平行運算 → 結果複製回主機。
  • 現代人工智慧與深度神經網路的突破,極大程度依賴 CUDA 提供的強大運算潛力。
  • Teraflops(每秒一兆次浮點運算)是衡量 GPU 運算效能的關鍵指標。
📖 專有名詞百科|點擊詞彙查看維基百科解釋
平行的
parallel
架構
architecture
神經的
neural
圖形
graphics
轉換
transformation
每秒一兆次浮點運算
teraflops
多功能的
versatile
假設性地
hypothetically
關鍵的
crucial
優化
optimize

🔍 自訂查詢

📚 共 10 個重點單字
parallel /ˈpærəlel/ adjective
happening at the same time or in the same way
平行的;同時發生的
📝 例句
"CUDA, a parallel computing platform that allows you to use your GPU for more than just plain video games."
CUDA 是一個平行運算平台,讓你能將 GPU 用於電子遊戲以外的更多用途。
✨ 延伸例句
"Computers can perform parallel operations on multiple data sets."
電腦可以對多個數據集執行平行運算。
architecture /ˈɑːrkɪtektʃər/ noun
the design and structure of a computer system or software
架構;體系結構
📝 例句
"Compute Unified Device Architecture was developed by NVIDIA in 2007, based on the prior work of Ian Buck and John Nickolls."
統一運算裝置架構由 NVIDIA 於 2007 年推出,基於 Ian Buck 和 John Nickolls 先前的研究。
✨ 延伸例句
"The new software architecture improves data processing speed."
新的軟體架構提升了數據處理速度。
neural /ˈnʊrəl/ adjective
relating to a nerve or the nervous system
神經的;神經系統的
📝 例句
"Since then, CUDA has revolutionized the world by allowing humans to compute large blocks of data in parallel, which has unlocked the true potential of the deep neural networks behind artificial intelligence."
從那時起,CUDA 讓人類能夠平行處理大量數據區塊,從而釋放了人工智慧背後深度神經網路的真正潛力,徹底改變了世界。
✨ 延伸例句
"Deep neural networks require massive computational power."
深度神經網路需要龐大的運算能力。
graphics /ˈɡræfɪks/ noun
visual images or designs used in computing
圖形;圖像
📝 例句
"The graphics processing unit, or GPU, is historically used for what the name implies."
圖形處理單元,或稱 GPU,歷史上僅用於其名稱所暗示的用途。
✨ 延伸例句
"This video card provides excellent graphics rendering."
這張顯示卡提供極佳的圖形渲染能力。
transformation /ˌtrænsfərˈmeɪʃən/ noun
a complete change in form or appearance
轉換;變形
📝 例句
"To compute graphics, when you play a game in 1080p at 60fps, you've got over 2 million pixels on the screen that may need to be recalculated after every frame, which requires hardware that can do a lot of matrix multiplication and vector transformations in parallel."
為了計算圖形,當你在 1080p 解析度、每秒 60 幀下遊玩遊戲時,螢幕上有超過 200 萬個像素可能需要在每一幀後重新計算,這需要能夠大量進行矩陣乘法和向量轉換的硬體。
✨ 延伸例句
"The transformation of data into actionable insights is key."
將數據轉換為可執行的洞察是關鍵。
teraflops /ˈtɛrəflɒps/ noun
a measure of computer performance, trillions of floating-point operations per second
每秒一兆次浮點運算(衡量運算能力的單位)
📝 例句
"Modern GPUs are measured in teraflops, that is, how many trillions of floating-point operations they can handle per second."
現代 GPU 的效能以 teraflops 來衡量,也就是每秒能處理多少兆次浮點運算。
✨ 延伸例句
"The new console boasts 12 teraflops of processing power."
新款遊戲機擁有 12 teraflops 的處理能力。
versatile /ˈvɜːrsətaɪl/ adjective
able to adapt or be adapted to many different functions or activities
多功能的;通用的
📝 例句
"A CPU is designed to be versatile, while a GPU is designed to go really fast in parallel."
CPU 的設計旨在通用性,而 GPU 的設計則是為了在平行運算中達到極快的速度。
✨ 延伸例句
"C++ is a versatile programming language used in many fields."
C++ 是一種通用的程式語言,應用於許多領域。
hypothetically /ˌhaɪpəˈθɛtɪkli/ adverb
based on a hypothesis or supposed to be true
假設性地;理論上
📝 例句
"C equals A plus B, but because hypothetically we're doing billions of operations in parallel, we need to calculate the global index of the thread in the block that we're working on."
C 等於 A 加 B,但假設我們正在平行執行數十億個運算,我們需要計算該執行緒在所屬區塊中的全域索引。
✨ 延伸例句
"Hypothetically, if the market crashes, we would lose liquidity."
理論上,如果市場崩盤,我們將失去流動性。
crucial /ˈkruːʃəl/ adjective
extremely important or necessary
關鍵的;至關重要的
📝 例句
"They allow us to configure the CUDA kernel launch to control how many blocks and how many threads per block are used to run this code in parallel, and that's crucial for optimizing multidimensional data structures like tensors used in deep learning."
它們讓我們能夠配置 CUDA kernel 的啟動,以控制使用多少個區塊以及每個區塊多少個執行緒來平行執行此程式碼,這對於優化像深度學習中使用的張量(tensors)這樣的多維資料結構至關重要。
✨ 延伸例句
"Cash flow management is crucial for any startup."
現金流管理對任何新創公司都至關重要。
optimize /ˈɒptɪmaɪz/ verb
make the best or most effective use of a situation or resource
優化;使最適化
📝 例句
"They allow us to configure the CUDA kernel launch to control how many blocks and how many threads per block are used to run this code in parallel, and that's crucial for optimizing multidimensional data structures like tensors used in deep learning."
它們讓我們能夠配置 CUDA kernel 的啟動,以控制使用多少個區塊以及每個區塊多少個執行緒來平行執行此程式碼,這對於優化像深度學習中使用的張量(tensors)這樣的多維資料結構至關重要。
✨ 延伸例句
"We need to optimize our investment portfolio."
我們需要優化我們的投資組合。
🎯 共 10 題測驗

1 What does CUDA stand for?

CUDA 是什麼的縮寫?

正確答案:B

As mentioned at [5], CUDA stands for Compute Unified Device Architecture.

如 [5] 秒所提到的,CUDA 代表 Compute Unified Device Architecture(統一運算裝置架構)。

2 Which company developed CUDA?

哪家公司開發了 CUDA?

正確答案:C

The video states at [5] that CUDA was developed by NVIDIA.

影片在 [5] 秒指出 CUDA 是由 NVIDIA 開發的。

3 What is the primary historical use of a GPU mentioned in the video?

影片中提到 GPU 的主要歷史用途是什麼?

正確答案:B

At [21], the video states the GPU is historically used for what the name implies, which is graphics processing.

在 [21] 秒,影片指出 GPU 歷史上僅用於其名稱所暗示的用途,即圖形處理。

4 How many cores does an RTX 4090 GPU have compared to an Intel i9 CPU?

RTX 4090 GPU 與 Intel i9 CPU 相比,擁有多少核心?

正確答案:C

At [46], the video contrasts the Intel i9's 24 cores with the RTX 4090's over 16,000 cores.

在 [46] 秒,影片對比了 Intel i9 的 24 個核心與 RTX 4090 超過 16,000 個核心。

5 What is a function that runs on the GPU called in CUDA?

在 CUDA 中,在 GPU 上運行的函式被稱為什麼?

正確答案:C

At [70], the video explains that you write a function called a CUDA kernel.

在 [70] 秒,影片解釋你需要編寫一個稱為 CUDA kernel 的函式。

6 What programming language is most often used for CUDA code according to the video?

根據影片,CUDA 程式碼最常使用哪種程式語言?

正確答案:C

At [94], the video mentions that actual code is most often written in C++.

在 [94] 秒,影片提到實際的程式碼最常使用 C++ 編寫。

7 Which specifier is used to define a function that runs on the GPU?

哪個修飾元用於定義在 GPU 上運行的函式?

正確答案:B

At [102], the video states that we use the `__global__` specifier to define a CUDA kernel.

在 [102] 秒,影片指出我們使用 `__global__` 修飾詞來定義 CUDA kernel。

8 What does 'managed' memory allow in CUDA?

CUDA 中的 'managed' 記憶體有什麼作用?

正確答案:B

At [124], the video explains that managed tells CUDA data can be accessed from both host CPU and device GPU without manual copying.

在 [124] 秒,影片解釋 managed 告訴 CUDA 資料可以被宿主端 CPU 和裝置端 GPU 存取,而不需要手動複製。

9 What do the triple brackets configure when launching a CUDA kernel?

三個括號在啟動 CUDA kernel 時配置什麼?

正確答案:B

At [149], the video explains the brackets configure how many blocks and threads per block are used.

在 [149] 秒,影片解釋這三個括號配置了區塊數量以及每個區塊的執行緒數量。

10 What function pauses code execution to wait for the GPU to finish?

哪個函式會暫停程式碼執行以等待 GPU 完成?

正確答案:C

At [158], the video mentions `cudaDeviceSynchronize()`, which pauses execution and waits for the GPU.

在 [158] 秒,影片提到 `cudaDeviceSynchronize()` 用來暫停執行並等待 GPU。
