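Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested]

Author: wanlifeipeng  Date: 2023-05-14 08:03:02

This article walks through several ways to count how often each English word occurs in a plain-text file with Python. They are shared here for reference.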

Version 1: inefficient (walks the text one character at a time)

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # convert to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
    # handle the last word (the file may not end with whitespace)
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
for k, v in words_dict.items():
    print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
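
The first version repeats its counting logic twice: once inside the loop and once for the final word. As a minimal sketch (not the author's code), the same character-by-character rule can be moved into a generator so the flush happens in one place. The names `iter_words`, `count_words`, and `sample` are made up for this illustration, and the sample string stands in for the file contents.

```python
def iter_words(chars):
    """Yield lowercased words from an iterable of characters.

    Same rule as the first version: alphanumeric characters build up a
    word, whitespace ends it, and any other character is skipped.
    """
    buf = []
    for ch in chars:
        if ch.isalnum():
            buf.append(ch)
        elif ch.isspace() and buf:
            yield ''.join(buf).lower()
            buf = []
    if buf:  # flush the last word when the text does not end with whitespace
        yield ''.join(buf).lower()


def count_words(chars):
    """Tally words in a single pass using dict.get with a default of 0."""
    counts = {}
    for word in iter_words(chars):
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample text standing in for the contents of test.txt.
sample = "We are busy all day, we have lost themselves."
for word, n in count_words(sample).items():
    print(word, n)
```

Because punctuation is skipped rather than treated as a separator, a token such as `don't` would be counted as `dont` under this rule, exactly as in the first version.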

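Version 2:

Drawback: the whole file is read into memory at once, which performs poorly on large files. In addition, word_list.count() rescans the entire list once for every distinct word.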

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()
    word_reg = re.compile(r'\w+')
    #word_reg = re.compile(r'\w+\b')
    word_list = word_reg.findall(data)
    word_list = [word.lower() for word in word_list]  # convert to lowercase
    word_set = set(word_list)  # avoid counting the same word more than once
    # words_dict = {}
    # for word in word_set:
    #     words_dict[word] = word_list.count(word)
    # concise form
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
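
In the second version, `word_list.count(word)` scans the full list once per distinct word. The sketch below shows one possible alternative that tallies in a single pass over the regex matches. It also makes an assumption the original does not: a "word" is a run of ASCII letters (`[A-Za-z]+`), whereas `\w+` also matches digits, underscores, and non-English letters. `WORD_RE`, `count_words`, and the sample string are illustrative names only.

```python
import re

# Assumption for this sketch: an English word is a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def count_words(text):
    """Count lowercased words in one pass, without list.count()."""
    counts = {}
    for match in WORD_RE.finditer(text):
        word = match.group().lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample; the digit 2 is ignored by the ASCII-letter pattern.
sample = "We grew up; we have 2 memories."
counts = count_words(sample)
for word, n in counts.items():
    print(word, n)
```

The text is still held in memory as one string here. Only the counting step changes.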

Version 3:

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:  # read line by line instead of loading the whole file
        #line_words = word_reg.findall(line)
        # simpler than the regex above, but punctuation stays attached
        # to the words and case is preserved (see the output below)
        line_words = line.split()
        word_list.extend(line_words)
    word_set = set(word_list)  # avoid counting the same word more than once
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
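
The output of the third version treats `We` and `we`, or `away` and `away,`, as different words, because `line.split()` only splits on whitespace. One possible way to keep `split()` and still get merged counts is to normalize each token, sketched below under the assumption that stripping leading and trailing ASCII punctuation and lowercasing is enough. `normalize`, `count_line_words`, and the sample lines are hypothetical names for this illustration.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


def count_line_words(lines):
    """Count normalized whitespace-separated tokens, line by line."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were punctuation only
                counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical lines standing in for iterating over an open file.
lines = ["We are busy all day,\n", "we have lost themselves.\n"]
counts = count_line_words(lines)
for word, n in counts.items():
    print(word, n)
```

Punctuation inside a token (for example the apostrophe in `don't`) is left alone, since `str.strip` only removes characters from the two ends.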

Version 4: counting with Counter

# -*- coding:utf-8 -*-
#!python3
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    for line in f:
        # split on whitespace only: punctuation and case are kept
        line_words = line.split()
        word_list.extend(line_words)
    words_dict = dict(collections.Counter(word_list))  # count with Counter
for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
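
The fourth version still collects every token in `word_list` before counting. As a rough sketch, `collections.Counter` can instead be updated one line at a time, so only the tallies are kept in memory. The `\w+` pattern and lowercasing are borrowed from the second version. `count_stream` and `fake_file` are names invented for this example, and `io.StringIO` stands in for the opened file.

```python
import collections
import io
import re

WORD_RE = re.compile(r"\w+")


def count_stream(lines):
    """Update a Counter line by line from any iterable of text lines."""
    counter = collections.Counter()
    for line in lines:
        counter.update(w.lower() for w in WORD_RE.findall(line))
    return counter


# io.StringIO is used here as a stand-in for open('test.txt', encoding='utf-8').
fake_file = io.StringIO("We are busy all day,\nwe grew up, we have lost themselves.\n")
counter = count_stream(fake_file)

# most_common(n) returns the n highest counts, largest first.
for word, n in counter.most_common(3):
    print(word, n)
```

`Counter.most_common()` is convenient when only the most frequent words matter, which the plain dict in the earlier versions does not provide directly.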

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

Hopefully this article is helpful for your Python programming.

Source: http://www.cnblogs.com/hupeng1234/p/6680491.html

Tags: Python, counting, English words
