A Simple Worked Example of Extracting PDF Text with Python

Author: somenzz  Date: 2023-05-09 05:37:34

Hello! In most cases Ctrl+C is the simplest way to get text out of a PDF. When copying is not possible, Python can help. The steps are as follows:

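Step 1: Install the libraries

1. tika: detects document types and extracts content from many file formats

2. wand: a simple ctypes-based binding for ImageMagick

3. pytesseract: an OCR tool

Create a virtual environment and install these libraries: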

python -m venv venv
source venv/bin/activate
pip install tika wand pytesseract
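The three packages are wrappers around external programs: tika starts an Apache Tika server and so needs a Java runtime, pytesseract calls the tesseract binary, and ImageMagick (used by wand) relies on Ghostscript to render PDF pages. Here is a minimal sketch for checking that these programs are on PATH before running the script. The helper name `missing_tools` and the exact executable names are assumptions and may differ on your platform.

```python
import shutil

# Executables assumed to back each Python package; adjust for your platform.
REQUIRED_TOOLS = {
    "java": "tika (runs the Apache Tika server)",
    "tesseract": "pytesseract (OCR engine)",
    "gs": "wand / ImageMagick (Ghostscript renders PDF pages)",
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"missing: {name} -> needed by {REQUIRED_TOOLS[name]}")
```

An empty result only means the programs were found on PATH. It does not prove that the installed versions work together.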

Step 2: Write the code

Suppose the PDF contains both text and images. The following code extracts the ordinary (selectable) text directly:

import io
import pytesseract
import sys

from PIL import Image
from tika import parser
from wand.image import Image as wi

text_raw = parser.from_file("example.pdf")
print(text_raw['content'].strip())
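One thing to watch for: when a PDF has no text layer at all (for example, a pure scan), the `content` value in the result can be `None`, and calling `.strip()` on it raises an error. A small defensive sketch is shown here; the helper name `text_from_parsed` is made up for illustration.

```python
def text_from_parsed(parsed):
    """Return the stripped text of a Tika-style result dict, or '' if there is none."""
    content = parsed.get("content") if parsed else None
    return content.strip() if content else ""


# Intended usage (requires tika and a Java runtime):
#   from tika import parser
#   print(text_from_parsed(parser.from_file("example.pdf")))
```

An empty string is then a useful signal that only the OCR path will yield any text.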

That is not enough, though. We also need to recognize the text inside the images:

def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    """OCR every page of a PDF and print the recognized text."""
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    # Render the PDF at the given DPI, then convert the pages to images.
    pdf_file = wi(filename=from_file, resolution=resolution)
    images = pdf_file.convert(image_type)
    image_blobs = []
    for img in images.sequence:
        img_page = wi(image=img)
        image_blobs.append(img_page.make_blob(image_type))
    # Run Tesseract on each page image.
    extract = []
    for img_blob in image_blobs:
        page_image = Image.open(io.BytesIO(img_blob))
        text = pytesseract.image_to_string(page_image, lang=lang)
        extract.append(text)
    for item in extract:
        for line in item.split("\n"):
            print(line)

Putting the two parts together, the complete code is:

import io
import sys

from PIL import Image
import pytesseract
from wand.image import Image as wi
from tika import parser


def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    """OCR every page of a PDF and print the recognized text."""
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    # Render the PDF at the given DPI, then convert the pages to images.
    pdf_file = wi(filename=from_file, resolution=resolution)
    images = pdf_file.convert(image_type)
    for img in images.sequence:
        img_page = wi(image=img)
        page_image = Image.open(io.BytesIO(img_page.make_blob(image_type)))
        text = pytesseract.image_to_string(page_image, lang=lang)
        for part in text.split("\n"):
            print(part)


def parse_text(from_file):
    """Print the text layer of a PDF as extracted by Tika."""
    print("-- Parsing text", from_file, "--")
    text_raw = parser.from_file(from_file)
    print("---------------------------------")
    print(text_raw['content'].strip())
    print("---------------------------------")


if __name__ == '__main__':
    # Usage: python run.py <pdf file> <tesseract language code>
    parse_text(sys.argv[1])
    extract_text_image(sys.argv[1], sys.argv[2])
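The script reads `sys.argv` directly, so omitting the language argument ends in an IndexError. As a possible refinement, the following sketch uses `argparse` to make the language optional and expose the resolution. The option names and the `build_parser` helper are assumptions, not part of the original script.

```python
import argparse


def build_parser():
    """Build a command-line parser for the PDF extraction script."""
    arg_parser = argparse.ArgumentParser(
        description="Extract text from a PDF via Tika and Tesseract OCR."
    )
    arg_parser.add_argument("pdf", help="path to the PDF file")
    # Optional positional argument: falls back to the function's default language.
    arg_parser.add_argument("lang", nargs="?", default="deu",
                            help="Tesseract language code (default: deu)")
    arg_parser.add_argument("--resolution", type=int, default=300,
                            help="rendering resolution in DPI (default: 300)")
    return arg_parser


# Intended usage inside the script:
#   args = build_parser().parse_args()
#   parse_text(args.pdf)
#   extract_text_image(args.pdf, args.lang, resolution=args.resolution)
```

The local variable is called `arg_parser` on purpose, so that it does not shadow the `parser` module imported from tika.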

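Step 3: Run it

Suppose example.pdf is a slide that mixes plain text with text embedded in an image.

Save the code as run.py and run it from the command line like this: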

python run.py example.pdf deu | xargs -0 echo > extract.txt
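The shell pipeline saves whatever the script prints. If you would rather collect the output inside Python (for instance on a system without xargs), `contextlib.redirect_stdout` can capture it. This is only a sketch, and the helper name `capture_output` is hypothetical.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


# Intended usage with the functions from run.py:
#   text = capture_output(parse_text, "example.pdf")
#   text += capture_output(extract_text_image, "example.pdf", "deu")
#   with open("extract.txt", "w", encoding="utf-8") as fh:
#       fh.write(text)
```

Writing the file with an explicit `encoding="utf-8"` avoids surprises when the recognized text contains non-ASCII characters.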

The resulting extract.txt looks like this:

-- Parsing text example.pdf --
---------------------------------
Title pure text

Content pure text

Slide 1
Slide 2
---------------------------------
-- Parsing image example.pdf --
---------------------------------
Title pure text

Content pure text

Title in image

Text in image
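Note that the OCR pass also re-reads the ordinary text, so "Title pure text" and "Content pure text" appear twice. If a single merged list is wanted, one simple approach is to keep each distinct line once, in order of first appearance. The sketch below assumes exact matches after trimming whitespace, which is optimistic: real OCR output often differs from the text layer by a character or two. The function name `merge_unique_lines` is made up.

```python
def merge_unique_lines(text_layer, ocr_text):
    """Merge two text blocks, keeping each non-empty line once, in order."""
    seen = set()
    merged = []
    for line in (text_layer + "\n" + ocr_text).splitlines():
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


text_layer = "Title pure text\n\nContent pure text\n\nSlide 1\nSlide 2"
ocr_text = "Title pure text\n\nContent pure text\n\nTitle in image\n\nText in image"
for line in merge_unique_lines(text_layer, ocr_text):
    print(line)
```

With the sample output above, this prints the four text-layer lines followed by "Title in image" and "Text in image".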

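You may wonder what to pass as the lang parameter for Simplified Chinese: pass 'chi_sim'. The available language codes are listed in the official documentation: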

https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md
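Tesseract also accepts several languages at once, joined with "+", for example 'chi_sim+eng' for pages that mix Chinese and English. The language data must be installed locally, otherwise the OCR call fails. A minimal sketch that builds the `lang` string and checks it against the installed set follows; the helper name `build_lang` is hypothetical, and the usage comment assumes a recent pytesseract version that provides `get_languages()`.

```python
def build_lang(codes, available):
    """Join language codes with '+' after checking that each one is installed."""
    missing = [code for code in codes if code not in available]
    if missing:
        raise ValueError("language data not installed: " + ", ".join(missing))
    return "+".join(codes)


# Intended usage (assumes pytesseract.get_languages() is available):
#   import pytesseract
#   lang = build_lang(["chi_sim", "eng"], pytesseract.get_languages())
#   extract_text_image("example.pdf", lang)
print(build_lang(["chi_sim", "eng"], {"chi_sim", "eng", "deu"}))
```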

Final words

A script that extracts text from a PDF is not complicated to write. Several libraries do the heavy lifting and give good results.

Source: https://blog.csdn.net/somenzz/article/details/124440977

Tags: python, extraction, pdf
