Example code for converting PDF to Word and Excel with Python

Updated: 2025-01-22 08:26:09 Author: PandaCode輝

This article presents example code for converting PDF files to Word and Excel with Python. The examples are explained in detail and should be a useful reference for study or work.

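I. Introduction

At work we often receive files in PDF format. Because PDF files are hard to modify, editing them usually means paying for a membership or a premium feature. This article shows how to use Python to convert PDF files to Word and Excel formats so that they can be edited freely.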

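II. Python programming

To convert a PDF to Word, we need to parse the layout and content of the PDF and rebuild them as a Word document, which involves fairly complex text recognition and format conversion.

I have tried the following libraries; pdf2docx gave the best results.

(1) The pdf2docx library
(2) The PyMuPDF library
(3) The pdfplumber library
(4) The PyPDF2 and python-docx libraries

Key point: pdf2docx is a Python library that converts PDF files to DOCX files.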

pip install pdf2docx -i https://mirrors.aliyun.com/pypi/simple

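Changing the pip source
The default pip index is hosted overseas and can be slow, so you can switch to a domestic mirror. Some commonly used mirrors are listed below.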

Douban http://pypi.douban.com/simple/
Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
Alibaba Cloud http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China (USTC) https://pypi.mirrors.ustc.edu.cn/simple/
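
As a small illustration of how the `-i` option combines with a mirror, the sketch below only builds the `pip install` command for a chosen mirror; it does not run it. The function name `pip_install_command` and the `MIRRORS` dictionary are hypothetical helpers introduced here, not part of pip. The URLs are the https ones that appear in this article.

```python
import sys

# Mirror URLs taken from this article; the dictionary keys are illustrative only.
MIRRORS = {
    "tsinghua": "https://pypi.tuna.tsinghua.edu.cn/simple/",
    "aliyun": "https://mirrors.aliyun.com/pypi/simple",
    "ustc": "https://pypi.mirrors.ustc.edu.cn/simple/",
}


def pip_install_command(package, mirror="aliyun"):
    """Build (but do not run) a pip command that installs `package` from a mirror."""
    return [sys.executable, "-m", "pip", "install", package, "-i", MIRRORS[mirror]]


# Show the command line that would be executed for pdf2docx.
print(" ".join(pip_install_command("pdf2docx")))
```

Passing the returned list to `subprocess.run` would perform the actual installation.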

1. PDF to Word

import os
import time

from pdf2docx import Converter

# Convert a PDF file to a Word document
def pdf_to_word(pdf_path, word_path=None, page_nums=None):
    '''
    @Method name: pdf_to_word
    @Description: convert a PDF file to a Word (.docx) document
    @Input:
        @param pdf_path str   path of the PDF file
        @param word_path str  path of the output .docx file (optional; defaults to the PDF's folder and base name)
        @param page_nums str  1-based page numbers such as "1,3-5" (optional; all pages if omitted)
    @Output:
        @return dict with the keys
            error_code      "000000" on success, any other value on failure
            error_msg       status or error message
            recognize_time  elapsed time in milliseconds (success only)
            filename        name of the generated Word file (success only)
            file_size_mb    size of the generated file in MB (success only)
    @Author: PandaCode輝
    @WeChat official account: PandaCode輝
    @Created: 2024-12-17
    @Usage example: pdf_to_word('test.pdf')
    '''
    # Stays None until the converter has been created, so the except block knows whether to close it
    cv = None
    result_dict = {}
    try:
        if not isinstance(pdf_path, str):
            result_dict["error_code"] = "111111"
            result_dict["error_msg"] = "pdf_path must be a string"
            return result_dict
        # Check that the PDF file exists
        if not os.path.isfile(pdf_path):
            result_dict["error_code"] = "999999"
            result_dict["error_msg"] = f"PDF file not found: {pdf_path}"
            return result_dict

        start_time = time.time()

        if not word_path:
            # Directory that contains the PDF
            file_path = os.path.dirname(pdf_path)
            # File name without its extension (splitext keeps dots inside the name)
            file_name = os.path.splitext(os.path.basename(pdf_path))[0]
            # Full path of the Word file
            word_path = os.path.join(file_path, f'{file_name}.docx')

        # Initialise the converter
        cv = Converter(pdf_path)
        # Convert the whole PDF or only the requested pages
        if page_nums:
            # Parse the 1-based page spec into 0-based page indexes
            pages = []
            for part in page_nums.split(','):
                if '-' in part:
                    start, end = part.split('-')
                    pages.extend(range(int(start) - 1, int(end)))
                else:
                    pages.append(int(part) - 1)
            # Convert the requested pages
            cv.convert(docx_filename=word_path, pages=pages)
        else:
            # Convert the whole PDF
            cv.convert(docx_filename=word_path, start=0)

        # convert() has already written the .docx; close the converter to release the PDF
        cv.close()
        cv = None

        # Elapsed time in milliseconds, rounded to 2 decimal places
        end_time = time.time()
        recognize_time = round((end_time - start_time) * 1000, 2)
        result_dict["recognize_time"] = recognize_time
        result_dict["error_code"] = "000000"
        result_dict["error_msg"] = "PDF converted to Word successfully"
        # Name of the generated Word file
        word_file_name = os.path.basename(word_path)
        result_dict["filename"] = word_file_name

        # File size: bytes converted to megabytes
        file_size_mb = os.path.getsize(word_path) / (1024 * 1024)
        result_dict["file_size_mb"] = file_size_mb

        return result_dict

    except Exception as e:
        # Close the converter only if it was actually created
        if cv is not None:
            cv.close()
        print("PDF to Word exception: " + str(e))
        result_dict["error_code"] = "999999"
        result_dict["error_msg"] = "An error occurred while converting PDF to Word: " + str(e)
        return result_dict
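
The page-number parsing inside the function can also be pulled out into a helper of its own. The following is a minimal sketch, assuming the same "1,3-5" input format; the name `parse_page_nums`, the de-duplication, and the validation rules are additions for illustration and are not part of the original code or of pdf2docx.

```python
def parse_page_nums(page_nums):
    """Turn a 1-based spec such as "1,3-5" into sorted, unique 0-based page indexes."""
    pages = set()
    for part in page_nums.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start - 1, end))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page - 1)
    return sorted(pages)


print(parse_page_nums("1,3-5"))  # [0, 2, 3, 4]
```

Raising `ValueError` for input such as "5-3" lets the caller report a clear error instead of silently converting nothing.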

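2. PDF to Excel

I did not find a ready-made library that converts a PDF straight to Excel, so a little extra processing is needed.

I have tried the following approaches:

(1) Using pdf2docx, python-docx, and pandas

First convert the PDF to a Word document, then read the tables in the Word document, and finally write them to an Excel file.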

pip install python-docx -i https://mirrors.aliyun.com/pypi/simple

pip install pandas openpyxl -i https://mirrors.aliyun.com/pypi/simple

import os
import time

from docx import Document
import pandas as pd
# Convert the tables in a PDF file to an Excel workbook (via a Word document)
def pdf_to_excel(pdf_path, xlsx_path=None, page_nums=None):
    '''
    @Method name: pdf_to_excel
    @Description: convert the PDF to .docx first, then copy every table in the .docx to its own worksheet
    @Input:
        @param pdf_path str   path of the PDF file
        @param xlsx_path str  path of the output .xlsx file (optional; defaults to the PDF's folder and base name)
        @param page_nums str  1-based page numbers such as "1,3-5" (optional; all pages if omitted)
    @Output:
        @return dict with the keys
            error_code      "000000" on success, any other value on failure
            error_msg       status or error message
            recognize_time  elapsed time in milliseconds (success only)
            filename        name of the generated Excel file (success only)
    @Author: PandaCode輝
    @WeChat official account: PandaCode輝
    @Created: 2025-01-06
    @Usage example: pdf_to_excel('test.pdf')
    '''
    result_dict = {}
    try:
        if not isinstance(pdf_path, str):
            result_dict["error_code"] = "111111"
            result_dict["error_msg"] = "pdf_path must be a string"
            return result_dict
        # Check that the PDF file exists
        if not os.path.isfile(pdf_path):
            result_dict["error_code"] = "999999"
            result_dict["error_msg"] = f"PDF file not found: {pdf_path}"
            return result_dict

        start_time = time.time()

        # Directory that contains the PDF
        file_path = os.path.dirname(pdf_path)
        # File name without its extension
        file_name = os.path.splitext(os.path.basename(pdf_path))[0]
        # Full path of the intermediate Word file
        word_path = os.path.join(file_path, f'{file_name}.docx')
        if not xlsx_path:
            # Full path of the Excel file
            xlsx_path = os.path.join(file_path, f'{file_name}.xlsx')

        # Step 1: convert the PDF to a .docx file, passing word_path explicitly
        # so that both functions agree on where the file is written
        rsp_dict = pdf_to_word(pdf_path, word_path=word_path, page_nums=page_nums)
        if rsp_dict["error_code"] == "000000":
            # Step 2: open the .docx and write its tables to an .xlsx file
            doc = Document(word_path)

            if len(doc.tables) < 1:
                result_dict["error_code"] = "999999"
                result_dict["error_msg"] = "No tables were found in the PDF, so it cannot be converted to xlsx."
                return result_dict

            # Create an Excel writer object
            with pd.ExcelWriter(xlsx_path, engine='openpyxl') as writer:

                # Loop over every table in the document
                for i, table in enumerate(doc.tables, start=1):
                    # List of rows that holds the table data
                    data = []

                    # Loop over every row in the table
                    for row in table.rows:
                        # Loop over every cell in the row
                        row_data = []
                        for cell in row.cells:
                            row_data.append(cell.text)
                        data.append(row_data)

                    # Turn the rows into a DataFrame
                    df = pd.DataFrame(data)

                    # Write each table to its own worksheet
                    sheet_name = f"Table_{i}"
                    df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)

        else:
            # Pass the Word conversion error back to the caller
            result_dict["error_code"] = rsp_dict["error_code"]
            result_dict["error_msg"] = rsp_dict["error_msg"]
            return result_dict

        # Elapsed time in milliseconds, rounded to 2 decimal places
        end_time = time.time()
        recognize_time = round((end_time - start_time) * 1000, 2)
        result_dict["recognize_time"] = recognize_time
        result_dict["error_code"] = "000000"
        result_dict["error_msg"] = "PDF converted to Excel successfully"
        # Name of the generated Excel file
        xlsx_file_name = os.path.basename(xlsx_path)
        result_dict["filename"] = xlsx_file_name

        return result_dict

    except Exception as e:
        print("PDF to Excel exception: " + str(e))
        result_dict["error_code"] = "999999"
        result_dict["error_msg"] = "An error occurred while converting PDF to Excel: " + str(e)
        return result_dict

(2) Using pdfplumber and pandas

Use pdfplumber to read the tables in the PDF, then write them to an Excel workbook.

pip install pdfplumber -i https://mirrors.aliyun.com/pypi/simple

import os
import time

import pandas as pd
import pdfplumber


def pdf_to_excel_new(pdf_path, xlsx_path=None, page_nums=None):
    '''
    @Method name: pdf_to_excel_new
    @Description: read the table on each PDF page with pdfplumber and write it to its own worksheet
    @Input:
        @param pdf_path str   path of the PDF file
        @param xlsx_path str  path of the output .xlsx file (optional; defaults to the PDF's folder and base name)
        @param page_nums str  1-based page numbers such as "1,3-5" (optional; all pages if omitted)
    @Output:
        @return dict with the keys
            error_code      "000000" on success, any other value on failure
            error_msg       status or error message
            recognize_time  elapsed time in milliseconds (success only)
            filename        name of the generated Excel file (success only)
            file_size_mb    size of the generated file in MB (success only)
    @Author: PandaCode輝
    @WeChat official account: PandaCode輝
    @Created: 2025-01-06
    @Usage example: pdf_to_excel_new('test.pdf')
    '''
    result_dict = {}
    try:
        if not isinstance(pdf_path, str):
            result_dict["error_code"] = "111111"
            result_dict["error_msg"] = "pdf_path must be a string"
            return result_dict
        # Check that the PDF file exists
        if not os.path.isfile(pdf_path):
            result_dict["error_code"] = "999999"
            result_dict["error_msg"] = f"PDF file not found: {pdf_path}"
            return result_dict

        start_time = time.time()

        # Directory that contains the PDF
        file_path = os.path.dirname(pdf_path)
        # File name without its extension
        file_name = os.path.splitext(os.path.basename(pdf_path))[0]

        if not xlsx_path:
            # Full path of the Excel file
            xlsx_path = os.path.join(file_path, f'{file_name}.xlsx')

        # Extract the table data from the PDF, one worksheet per page
        sheets = {}
        with pdfplumber.open(pdf_path) as pdf:
            # Convert the whole PDF or only the requested pages
            if page_nums:
                # Parse the 1-based page spec into 0-based page indexes
                pages = []
                for part in page_nums.split(','):
                    if '-' in part:
                        start, end = part.split('-')
                        pages.extend(range(int(start) - 1, int(end)))
                    else:
                        pages.append(int(part) - 1)
            else:
                pages = range(len(pdf.pages))

            for i in pages:
                # Extract the table on the current page (None if the page has no table)
                table = pdf.pages[i].extract_table()
                if table:
                    # Sheet names use 1-based page numbers in both modes
                    sheets[f'Page {i + 1}'] = pd.DataFrame(table)

        # An empty workbook cannot be saved, so report the problem instead
        if not sheets:
            result_dict["error_code"] = "999999"
            result_dict["error_msg"] = "No tables were found in the PDF, so it cannot be converted to xlsx."
            return result_dict

        # Write each DataFrame to its own worksheet
        with pd.ExcelWriter(xlsx_path) as writer:
            for sheet_name, df in sheets.items():
                # header=False keeps the numeric column labels out of the sheet
                df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)

        # Elapsed time in milliseconds, rounded to 2 decimal places
        end_time = time.time()
        recognize_time = round((end_time - start_time) * 1000, 2)
        result_dict["recognize_time"] = recognize_time
        result_dict["error_code"] = "000000"
        result_dict["error_msg"] = "PDF converted to Excel successfully"
        # Name of the generated Excel file
        xlsx_file_name = os.path.basename(xlsx_path)
        result_dict["filename"] = xlsx_file_name

        # File size: bytes converted to megabytes
        file_size_bytes = os.path.getsize(xlsx_path)
        file_size_mb = file_size_bytes / (1024 * 1024)
        result_dict["file_size_mb"] = file_size_mb
        return result_dict

    except Exception as e:
        print("PDF to Excel exception: " + str(e))
        result_dict["error_code"] = "999999"
        result_dict["error_msg"] = "An error occurred while converting PDF to Excel: " + str(e)
        return result_dict
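
Tables extracted from a PDF are often untidy: cells can be `None`, text can contain line breaks, and rows can differ in length. The sketch below shows one way to tidy such a list of rows before building a DataFrame. The helper name `normalize_table` and its cleaning rules are assumptions made for illustration, not part of pdfplumber or of the original code.

```python
def normalize_table(table):
    """Replace None with "", collapse whitespace, and pad short rows to a common width."""
    width = max((len(row) for row in table), default=0)
    cleaned = []
    for row in table:
        # str.split() with no arguments also removes line breaks inside a cell
        cells = ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        cells += [""] * (width - len(cells))
        cleaned.append(cells)
    return cleaned


raw = [["Item", "Unit\nprice", None], ["Pen", "1.50"]]
print(normalize_table(raw))
# [['Item', 'Unit price', ''], ['Pen', '1.50', '']]
```

The cleaned rows could be passed to `pd.DataFrame` in place of the raw table.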

III. Front-end page demo

1. Select a PDF file

2. Choose the conversion type: PDF to Word or PDF to Excel

3. Page range: optional; if left empty, all pages are converted
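
The three inputs on the page map naturally onto the converter functions. The following is a minimal sketch of how a backend might dispatch such a request. The `run_conversion` function and the `handlers` registry are hypothetical, and the original article does not show its backend code. The error codes follow the convention used in the functions above, and `_stub` stands in for `pdf_to_word` / `pdf_to_excel` so the example runs on its own.

```python
def run_conversion(pdf_path, target, page_nums=None, handlers=None):
    """Route a request to the converter registered for `target` ("word" or "excel")."""
    handler = (handlers or {}).get(target)
    if handler is None:
        return {"error_code": "999999",
                "error_msg": f"unsupported conversion type: {target}"}
    # An empty page range from the form means "convert everything"
    return handler(pdf_path, page_nums=page_nums or None)


def _stub(pdf_path, page_nums=None):
    """Stand-in for a real converter; echoes the page range it received (demo only)."""
    return {"error_code": "000000", "error_msg": "ok", "pages": page_nums}


handlers = {"word": _stub, "excel": _stub}
print(run_conversion("test.pdf", "word", page_nums="", handlers=handlers))
print(run_conversion("test.pdf", "ppt", handlers=handlers)["error_code"])
```

In a real application the registry would look like `{"word": pdf_to_word, "excel": pdf_to_excel}`.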

Summary

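pdf2docx is the most direct choice for converting PDF to Word: it is built on PyMuPDF, is designed specifically for PDF-to-DOCX conversion, and generally does a better job of reproducing the page layout. PyMuPDF on its own is a general-purpose PDF library, so using it directly means rebuilding the Word document yourself.
pdfplumber is better suited to extracting text and tables than to direct format conversion.
Combining PyPDF2 with python-docx offers more flexibility, but it may take more custom code to handle complex layouts and formatting.

Choose the library that best fits your needs. If you need a high-fidelity reproduction of the layout, pdf2docx is probably the better choice. If you need to extract text and table data from a PDF, pdfplumber may be more suitable.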
