Extracting and Removing Rows That Contain Keywords with Python

Updated: 2023-08-01 16:48:24  Author: 上景

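This article shows how to use Python to extract the rows of a dataset that contain given keywords, and how to remove the rows that contain given keywords. The sample code is explained in detail; refer to it as needed.

This is part six of the code I wrote while helping my partner process some data. Feature 1: filter rows from the raw data by keywords found in a given column, then save the rows that match a keyword to a new CSV file. Feature 2: delete from the raw data every row that contains any keyword listed in the "刪除的關鍵詞" column, then save the remaining data to a new CSV file.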

Feature 1: Extracting rows that contain keywords

Section 1 Reading the data and setup

In this part, the code reads data from two different sources:

It reads "Table 1" from an Excel file (需要保留的關鍵詞.xlsx) into a DataFrame called keywords_df.
It reads "Table 2" from a CSV file (原始數據.csv) into another DataFrame called data_df. Passing dtype=str reads every column as text.

It then creates an empty DataFrame named result_df with the same columns as data_df.

import pandas as pd
from tqdm import tqdm
# Read Table 1 (the keywords to keep)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要保留的關鍵詞.xlsx")
# Read Table 2 (the data table), loading every column as text
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Create an empty Table 3 with the same columns as Table 2
result_df = pd.DataFrame(columns=data_df.columns)

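Section 2 Iterating over the keywords and filtering the data

In this part, the code uses a for loop to iterate over every keyword in the "關鍵詞" column, wrapped in tqdm from the tqdm library to display a progress bar labeled "Processing".

For each keyword, it performs the following steps:

It searches "Table 2" (data_df) for rows whose "地址" column contains the current keyword. The str.contains() method checks for partial matches, and na=False makes missing values count as non-matches.

The matching rows are stored in a DataFrame named matched_rows.

matched_rows is appended to the previously created result_df using pd.concat(), with ignore_index=True to reset the index of the concatenated DataFrame.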

for keyword in tqdm(keywords_df['關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column contains the keyword
    # (str.contains treats the keyword as a regular expression by default)
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False)]
    # Append the matched rows to Table 3
    result_df = pd.concat([result_df, matched_rows], ignore_index=True)
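
The filtering step can be tried without any files. The following is a minimal sketch, assuming a small in-memory DataFrame; the sample addresses, the "name" column, and the keyword list are made up for illustration and are not the author's data. It also passes regex=False, which is an added assumption that the keywords are literal text.

```python
import pandas as pd

# Hypothetical sample data standing in for 原始數據.csv (not the author's data)
data_df = pd.DataFrame({
    "地址": ["山西省太原市小店區", "山西省陽泉市城區", "山西省大同市平城區", None],
    "name": ["a", "b", "c", "d"],
})

# Hypothetical keywords standing in for the "關鍵詞" column of the Excel file
keywords = ["太原市", "陽泉市"]

# Collect the matching rows for each keyword, then concatenate once
matches = [
    data_df[data_df["地址"].str.contains(keyword, na=False, regex=False)]
    for keyword in keywords
]
result_df = pd.concat(matches, ignore_index=True)

print(result_df)
```

The row with a missing address is skipped because na=False treats it as a non-match. Collecting the pieces in a list and calling pd.concat() once is a small variation on the loop above that avoids concatenating onto an empty DataFrame.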

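Section 3 Removing duplicate rows and saving the result

In this part, the code performs the following steps:

It removes duplicate rows from "Table 3" (result_df) with the drop_duplicates() method, comparing all columns, so that result_df contains only unique rows. Duplicates arise when one row matches more than one keyword.

It saves the deduplicated DataFrame to a new CSV file named "篩選出包含關鍵詞的行.csv" with the to_csv() method. Setting index=False avoids writing the DataFrame index as a separate column in the CSV file.

Finally, it prints "Query Complete" to indicate that the keyword search, filtering, and CSV export have finished.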

# Remove duplicate rows from Table 3 based on all columns
result_df = result_df.drop_duplicates()
# Save Table 3 to a CSV file
result_df.to_csv(r"C:\Users\Desktop\篩選出包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")
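
To see why the deduplication step is needed, here is a small sketch under assumed data: the addresses below are invented so that one row contains both keywords and is therefore collected twice.

```python
import pandas as pd

# Hypothetical data: the first address contains both keywords
data_df = pd.DataFrame({
    "地址": ["太原市至陽泉市公路", "山西省太原市", "山西省大同市"],
})
keywords = ["太原市", "陽泉市"]

# One filtered piece per keyword, concatenated into a single table
matches = [
    data_df[data_df["地址"].str.contains(keyword, na=False, regex=False)]
    for keyword in keywords
]
result_df = pd.concat(matches, ignore_index=True)

before = len(result_df)                  # the shared row appears once per keyword
result_df = result_df.drop_duplicates()  # keeps the first copy of each row
after = len(result_df)

print(before, after)
```

Note that drop_duplicates() compares whole rows, so two rows that were already identical in the raw data are also merged into one.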

Section 4 Example run

The raw data is as follows:

Suppose the keywords to keep are as follows:

After the code has run (only the rows containing 太原市 and 陽泉市 are kept):

Complete code

import pandas as pd
from tqdm import tqdm
# Read Table 1 (the keywords to keep)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要保留的關鍵詞.xlsx")
# Read Table 2 (the data table), loading every column as text
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Create an empty Table 3 with the same columns as Table 2
result_df = pd.DataFrame(columns=data_df.columns)
# Iterate over the keywords in Table 1
for keyword in tqdm(keywords_df['關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column contains the keyword
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False)]
    # Append the matched rows to Table 3
    result_df = pd.concat([result_df, matched_rows], ignore_index=True)
# Remove duplicate rows from Table 3 based on all columns
result_df = result_df.drop_duplicates()
# Save Table 3 to a CSV file
result_df.to_csv(r"C:\Users\Desktop\篩選出包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")
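
As an alternative to looping, the keywords can be combined into a single pattern so the column is scanned once. This is a sketch of a different technique from the article's loop, not the author's method; the sample data and keyword list are assumptions. re.escape() keeps each keyword literal even if it contains regular-expression metacharacters.

```python
import re

import pandas as pd

# Hypothetical sample data and keywords (not the author's data)
data_df = pd.DataFrame({
    "地址": ["山西省太原市", "山西省陽泉市", "山西省大同市", "太原市至陽泉市"],
})
keywords = ["太原市", "陽泉市"]

# Build one alternation pattern: keyword1|keyword2|...
pattern = "|".join(re.escape(keyword) for keyword in keywords)

# A row is kept if its address contains at least one keyword
mask = data_df["地址"].str.contains(pattern, na=False)
result_df = data_df[mask]

print(result_df)
```

Each row is selected at most once and the original row order is preserved, so a row that matches several keywords does not need to be deduplicated afterwards. An empty keyword list would produce an empty pattern that matches every row, so that case should be checked separately.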

Feature 2: Removing rows that contain keywords

Section 1 Loading the data

In this part, the code imports the required libraries, pandas and tqdm. It then loads two datasets from external files.

import pandas as pd
from tqdm import tqdm
# Read Table 1 (the keywords to delete)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要刪除的關鍵詞.xlsx")
# Read Table 2 (the raw data, as in the complete code), loading every column as text
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)

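Section 2 Processing the keywords and filtering

This part iterates over every keyword in the "刪除的關鍵詞" column of the keywords_df DataFrame. For each keyword, the code searches data_df for rows whose "地址" column contains the keyword as a substring; regex=False makes the keyword match literally rather than as a regular expression. The matches are stored in matched_rows, and data_df is then reduced to the rows that do not match.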

for keyword in tqdm(keywords_df['刪除的關鍵詞'], desc="Processing"):
    # Rows whose "地址" column contains the keyword as a literal substring
    mask = data_df['地址'].str.contains(keyword, na=False, regex=False)
    matched_rows = data_df[mask]
    # Keep only the rows that do not match
    data_df = data_df[~mask]
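
The removal step can be sketched on in-memory data as follows. The addresses and keywords are assumptions made for the demonstration; the logic mirrors the loop above, inverting the match mask with ~.

```python
import pandas as pd

# Hypothetical sample data (not the author's data); one address is missing
data_df = pd.DataFrame({
    "地址": ["山西省太原市", "山西省陽泉市", "山西省大同市", None],
})
delete_keywords = ["太原市", "陽泉市"]

for keyword in delete_keywords:
    # True where the address contains the keyword; missing values count as False
    mask = data_df["地址"].str.contains(keyword, na=False, regex=False)
    # ~mask flips the selection, keeping the rows that do NOT match
    data_df = data_df[~mask]

print(data_df)
```

Because na=False marks missing addresses as non-matches, the row with no address survives the filter. If such rows should be dropped as well, that has to be done in a separate step.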

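Section 3 Saving and finishing

In this part, the data remaining in data_df (after the rows matching any keyword have been filtered out) is saved to a new CSV file on the desktop named "去除掉包含關鍵詞的行.csv". The index=False argument ensures that the index column is not written to the CSV file. Finally, the script prints "Query Complete" to indicate that the keyword processing and filtering have finished.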

data_df.to_csv(r"C:\Users\Desktop\去除掉包含關鍵詞的行.csv", index=False)
print("Query Complete")
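
The effect of index=False is easiest to see by comparing the two outputs. In this sketch, to_csv() is called without a path so that it returns the CSV text instead of writing a file; the one-row DataFrame and its index label are made up for illustration.

```python
import pandas as pd

# Hypothetical one-row table whose index label (7) is left over from filtering
df = pd.DataFrame({"地址": ["山西省大同市"]}, index=[7])

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the real columns are written

print(with_index.splitlines())
print(without_index.splitlines())
```

After rows have been filtered out, the surviving index labels are no longer consecutive, which is one reason to leave them out of the saved file.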

Section 4 Example run

The raw data is as follows:

Suppose the keywords to delete are as follows:

After the code has run (the rows containing 太原市 and 陽泉市 have been deleted):

Complete code

import pandas as pd
from tqdm import tqdm
# Read Table 1 (the keywords to delete)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要刪除的關鍵詞.xlsx")
# Read Table 2 (the raw data)
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Iterate over the keywords in Table 1
for keyword in tqdm(keywords_df['刪除的關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column contains the keyword as a substring
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False, regex=False)]
    # Remove the matched rows from Table 2
    data_df = data_df[~data_df['地址'].str.contains(keyword, na=False, regex=False)]
# Save the remaining data to a CSV file
data_df.to_csv(r"C:\Users\Desktop\去除掉包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")

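When using the code above, pay attention to the file formats: some files are CSV and some are XLSX. Adjust the program to match your own files as needed.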
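
One way to handle both formats is to choose the pandas reader from the file extension. The helper below, read_table, is a hypothetical name introduced for this sketch and is not part of the original scripts; reading .xlsx files also assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def read_table(path):
    """Hypothetical helper: pick a pandas reader based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        # Load every column as text, as the scripts above do
        return pd.read_csv(path, dtype=str)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path, dtype=str)
    raise ValueError(f"Unsupported file type: {suffix}")
```

With such a helper, either script could call read_table(...) for both the keyword file and the data file, whichever format they happen to be in.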

That concludes the detailed walkthrough of extracting and removing rows that contain keywords with Python. For more material on extracting and removing keywords with Python, see the other related articles on 腳本之家.
