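Applying Count-Based NLP Methods to the PTB Dataset: A Detailed Example
Author: jym蒟蒻 Date: 2023-10-26 17:24:07
The PTB dataset
Its contents look like this:
Each line holds one sentence; rare words are replaced with the special token <unk>; and concrete numbers are replaced with "N".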
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university
ptb.py
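Using the PTB dataset:
The following line shows that, when the PTB dataset is loaded, all sentences are joined end to end: every newline is replaced with an <eos> token and the result is split into one long list of words.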
words = open(file_path).read().replace('\n', '<eos>').strip().split()
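To make the effect of that line concrete, here is a small sketch that applies the same expression to a made-up two-line string instead of the real file. It assumes every line begins and ends with a space, so that the inserted <eos> stays a separate token; without that padding the marker would fuse with the neighbouring words.

```python
# Toy stand-in for the contents of a PTB text file (not the real data).
# Assumption: every line begins and ends with a space.
text = " there is no asbestos in our products now \n we have no useful information \n"

# Same expression as in ptb.py, applied to a string instead of a file.
words = text.replace('\n', '<eos>').strip().split()

print(words)
print(words.count('<eos>'))  # one <eos> marker per original line
```

The sentence boundaries survive only as <eos> tokens inside one flat word list, which is why the corpus can later be treated as a single long sequence.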
ptb.py downloads the PTB dataset, saves it next to the script, and then extracts three things from it: corpus, word_to_id, and id_to_word.
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle
import numpy as np

url_base = 'https://raw.githubusercontent.com/tomsercu/lstm/master/data/'
key_file = {
    'train':'ptb.train.txt',
    'test':'ptb.test.txt',
    'valid':'ptb.valid.txt'
}
save_file = {
    'train':'ptb.train.npy',
    'test':'ptb.test.npy',
    'valid':'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'
dataset_dir = os.path.dirname(os.path.abspath(__file__))


def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    # Skip the download if the file is already cached locally
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')
    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        # Fallback: retry with HTTPS certificate verification disabled
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)
    print('Done')


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    # Reuse the pickled vocabulary if it has been built before
    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    # The vocabulary is always built from the training split
    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()

    # Assign IDs in order of first appearance
    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_data(data_type='train'):
    '''
    :param data_type: which split to load: 'train', 'test', or 'valid' ('val')
    :return: corpus, word_to_id, id_to_word
    '''
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    # Reuse the cached ID array if it exists
    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word


if __name__ == '__main__':
    for data_type in ('train', 'val', 'test'):
        load_data(data_type)
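Using ptb.py
corpus holds the list of word IDs, id_to_word is a dictionary that maps word IDs to words, and word_to_id is a dictionary that maps words to word IDs.
Load the data with ptb.load_data(). Its argument selects the split: 'train' for the training data, 'test' for the test data, and 'valid' (or 'val') for the validation data.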
import sys
sys.path.append('..')
from dataset import ptb
corpus, word_to_id, id_to_word = ptb.load_data('train')
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])
Output:
corpus size: 929589
corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]
id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz
word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426
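The output above shows corpus[:30] running from 0 to 29. That follows from how load_vocab assigns IDs: each new word gets the next free integer in order of first appearance, so a run of words that have not occurred before is numbered consecutively. A minimal sketch of the same loop on a made-up sentence (the sentence is an assumption for illustration, not PTB data):

```python
import numpy as np

# Toy word list standing in for the flattened PTB training text.
words = "you say goodbye and i say hello . <eos>".split()

word_to_id = {}
id_to_word = {}
for word in words:
    if word not in word_to_id:
        new_id = len(word_to_id)      # next unused ID
        word_to_id[word] = new_id
        id_to_word[new_id] = word

# Encode the text as IDs, then decode it again to check the round trip.
corpus = np.array([word_to_id[w] for w in words])
decoded = [id_to_word[i] for i in corpus.tolist()]

print(corpus)             # the repeated word "say" reuses its earlier ID
print(decoded == words)
```

Only a repeated word breaks the consecutive numbering, because it reuses the ID it was given the first time.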
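Applying the count-based method to the PTB dataset
The only real difference from the version that does not use the PTB dataset is this line: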
corpus, word_to_id, id_to_word = ptb.load_data('train')
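The following line performs the dimensionality reduction by keeping only the first wordvec_size columns of U. With the full np.linalg.svd, U has one column per vocabulary word, so this slice is what truncates it; randomized_svd already returns only wordvec_size columns.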
word_vecs = U[:, :wordvec_size]
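As a small illustration of that slice, the sketch below runs the full NumPy SVD on a random 6 x 6 matrix that merely stands in for the PPMI matrix. The matrix and the size of 2 are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))        # stand-in for a vocab_size x vocab_size PPMI matrix
wordvec_size = 2

# Full SVD: U is 6 x 6 and S holds the singular values in descending order.
# The third return value is the transposed right singular matrix.
U, S, V = np.linalg.svd(W)

# Keep only the columns that belong to the largest singular values.
word_vecs = U[:, :wordvec_size]

print(U.shape, word_vecs.shape)
print(np.all(S[:-1] >= S[1:]))   # singular values are sorted largest first
```

Because the singular values come out sorted, the leading columns of U are the directions that carry the most weight, which is why taking the first wordvec_size columns is a reasonable way to shrink each word vector.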
The most time-consuming step in the whole script is the following function call:
W = ppmi(C, verbose=True)
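The source of ppmi lives in common.util and is not shown here. If it visits the matrix cell by cell in a Python loop, as the verbose progress option suggests, the cost grows with the square of the vocabulary size. The sketch below is a hypothetical vectorized variant, ppmi_vectorized, written from the standard definition PPMI(x, y) = max(0, log2(C(x, y) * N / (C(x) * C(y)))). It is an assumption about the technique, not the library's implementation.

```python
import numpy as np

def ppmi_vectorized(C, eps=1e-8):
    """Hypothetical vectorized PPMI; not the code in common.util."""
    C = np.asarray(C, dtype=np.float64)
    N = C.sum()                  # total number of co-occurrences
    S = C.sum(axis=0)            # co-occurrence total for each word
    denom = np.outer(S, S)       # S[i] * S[j] for every pair
    # eps keeps log2 away from zero and guards against an all-zero row.
    pmi = np.log2(C * N / np.maximum(denom, eps) + eps)
    return np.maximum(0.0, pmi)  # clip negative PMI to zero

# Co-occurrence matrix of "you say goodbye and i say hello ." with window size 1.
C = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
])
W = ppmi_vectorized(C)
print(np.round(W, 3))
```

Whether this is actually faster on the PTB matrix depends on available memory, since np.outer builds a second vocab_size x vocab_size array.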
Full code:
import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import ptb
window_size = 2
wordvec_size = 100
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)
print('calculating SVD ...')
#try:
# truncated SVD (fast!)
print("ok")
from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                         random_state=None)
#except ImportError:
# SVD (slow)
# U, S, V = np.linalg.svd(W)
word_vecs = U[:, :wordvec_size]
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
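create_co_matrix is imported from common.util and its source is not shown here. As a rough illustration of what counting co-occurrences inside a window looks like, the sketch below uses a hypothetical function, co_matrix_sketch, on a tiny hand-made corpus. It is an assumption about the technique, not the library's code.

```python
import numpy as np

def co_matrix_sketch(corpus, vocab_size, window_size=1):
    """Hypothetical stand-in for create_co_matrix: count neighbours within the window."""
    C = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    n = len(corpus)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                C[word_id, corpus[left]] += 1    # neighbour on the left
            if right < n:
                C[word_id, corpus[right]] += 1   # neighbour on the right
    return C

# IDs for "you say goodbye and i say hello ." (say = 1 appears twice).
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
C = co_matrix_sketch(corpus, vocab_size=7, window_size=1)
print(C)
```

Row 1 of the result counts the neighbours of "say": "you", "goodbye", "i", and "hello" each appear next to it once.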
The results below were produced with the ordinary np.linalg.svd(W).
[query] you
i: 0.7016294002532959
we: 0.6388039588928223
anybody: 0.5868048667907715
do: 0.5612815618515015
'll: 0.512611985206604
[query] year
month: 0.6957005262374878
quarter: 0.691483736038208
earlier: 0.6661213636398315
last: 0.6327787041664124
third: 0.6230476498603821
[query] car
luxury: 0.6767407655715942
auto: 0.6339930295944214
vehicle: 0.5972712635993958
cars: 0.5888376235961914
truck: 0.5693157315254211
[query] toyota
motor: 0.7481387853622437
nissan: 0.7147319316864014
motors: 0.6946366429328918
lexus: 0.6553674340248108
honda: 0.6343469619750977
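The results below were produced with the randomized_svd function from sklearn. It is a truncated SVD that uses random numbers and computes only the components with large singular values, so it runs faster than the regular SVD.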
calculating SVD ...
ok
[query] you
i: 0.6678948998451233
we: 0.6213737726211548
something: 0.560122013092041
do: 0.5594725608825684
someone: 0.5490139126777649
[query] year
month: 0.6444296836853027
quarter: 0.6192560791969299
next: 0.6152222156524658
fiscal: 0.5712860226631165
earlier: 0.5641934871673584
[query] car
luxury: 0.6612467765808105
auto: 0.6166062355041504
corsica: 0.5270425081253052
cars: 0.5142025947570801
truck: 0.5030257105827332
[query] toyota
motor: 0.7747215628623962
motors: 0.6871038675308228
lexus: 0.6786072850227356
nissan: 0.6618651151657104
mazda: 0.6237337589263916
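The scores in both listings stay below 1 and look like cosine similarities between word vectors. Assuming most_similar ranks words that way (its source is in common.util and is not shown here), the ranking can be sketched as follows. The names cos_sim and rank_similar and the two-dimensional vectors are made up for illustration.

```python
import numpy as np

def cos_sim(x, y, eps=1e-8):
    """Cosine similarity; eps avoids division by zero for all-zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))

def rank_similar(query, word_to_id, id_to_word, word_vecs, top=5):
    """Hypothetical ranking helper: score every other word against the query."""
    q = word_vecs[word_to_id[query]]
    scores = [(id_to_word[i], cos_sim(q, word_vecs[i]))
              for i in range(len(id_to_word)) if id_to_word[i] != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]

# Made-up vectors: 'auto' points almost the same way as 'car', 'year' does not.
word_to_id = {'car': 0, 'auto': 1, 'year': 2}
id_to_word = {i: w for w, i in word_to_id.items()}
word_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

result = rank_similar('car', word_to_id, id_to_word, word_vecs, top=2)
print(result)
```

Cosine similarity depends only on the direction of the vectors, not on their length, so two words score highly when their vectors point the same way.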
Source: https://www.cnblogs.com/jiangyiming/p/16102323.html
Tags: nlp, count-based method, PTB, dataset