Pre-warming the jieba Cache and Inspecting the TMP Environment Variables


Python 2020-03-28 17:27:22

Running a short script is enough to observe how jieba.cache gets generated.

import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search-engine mode
print(", ".join(seg_list))

The log output includes:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.

Open the file site-packages/jieba/__init__.py in the installation directory:

class Tokenizer(object):

    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = None
        self.cache_file = None

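self.tmp_dir defaults to None, in which case jieba falls back to tempfile.gettempdir() (typically /tmp on Linux). The process needs write permission on that directory. If it has none, point self.tmp_dir at a directory that is writable.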
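Before changing anything, it may help to confirm that the temp directory really is the problem. The following is a minimal sketch, not part of jieba: the helper name can_write_cache is made up for illustration, and it simply tries to create a throwaway file in the given directory.

```python
import tempfile


def can_write_cache(directory):
    """Return True if a temporary file can be created in `directory`."""
    try:
        # TemporaryFile is removed automatically when the with-block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as missing write permission.
        return False


cache_dir = tempfile.gettempdir()
print(cache_dir, "writable:", can_write_cache(cache_dir))
```

If this reports False for the directory shown in the "Dumping model to file cache" log line, the "Dump cache file failed." message is most likely a permission issue.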

To find the system's default temp directory:

>>> import tempfile
>>> tempfile.gettempdir()
'/var/folders/1m/68x8hh1n38v2q2r5d9qybt240000gn/T'

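Some functions in tempfile create files that are deleted automatically once they are closed, but a temporary file created by mkstemp() is not deleted automatically. It is only created securely, so that it is readable and writable by the creating user alone. After use, close it with os.close(fd) and delete it yourself with os.remove(path).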
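The mkstemp() lifecycle described above can be sketched as follows. The ".demo" suffix and the variable names are arbitrary choices for this example.

```python
import os
import tempfile

# mkstemp returns an OS-level file descriptor and the absolute path.
fd, path = tempfile.mkstemp(suffix=".demo")
try:
    # Wrapping the descriptor with os.fdopen means that closing the file
    # object also closes the descriptor, so os.close(fd) is not needed here.
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello")
    still_there = os.path.exists(path)  # the file survives being closed
finally:
    os.remove(path)  # the caller has to delete it explicitly

removed = not os.path.exists(path)
print(still_there, removed)
```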

tempfile.tempdir

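This attribute specifies the default directory in which temporary files (and directories) are created. If it is unset or None, Python searches the directories named by the environment variables TMPDIR, TEMP, and TMP, in that order, and then a platform-specific list (such as /tmp, /var/tmp, and /usr/tmp on non-Windows systems). Only as a last resort are temporary files created in the current working directory.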
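The lookup order can be observed directly. This sketch assumes nothing beyond the standard library: it creates a scratch directory (standing in for a custom temp location), points TMPDIR at it, clears the cached tempfile.tempdir so the value is recomputed, and restores everything afterwards.

```python
import os
import tempfile

# A scratch directory standing in for a custom temp location.
custom_dir = tempfile.mkdtemp(prefix="custom-tmp-")

saved_env = os.environ.get("TMPDIR")
saved_tempdir = tempfile.tempdir
try:
    os.environ["TMPDIR"] = custom_dir
    tempfile.tempdir = None  # drop the cached value so it is recomputed
    resolved = tempfile.gettempdir()
finally:
    # Restore the original environment and cached value, then clean up.
    if saved_env is None:
        os.environ.pop("TMPDIR", None)
    else:
        os.environ["TMPDIR"] = saved_env
    tempfile.tempdir = saved_tempdir
    os.rmdir(custom_dir)

print(resolved == os.path.abspath(custom_dir))
```

Because gettempdir() caches its result in tempfile.tempdir, changing TMPDIR inside an already running process has no effect unless that cached value is reset, as done here.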

Inspecting environment variables:

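import os

env_dist = os.environ  # os.environ is a dict-like mapping of the process environment

print(os.environ.get('JAVA_HOME'))  # .get() returns None when the variable is unset
if 'JAVA_HOME' in os.environ:
    # Indexing directly raises KeyError when the variable is unset
    print(os.environ['JAVA_HOME'])

print(os.environ.get('TMPDIR'))
print(os.environ.get('TMP'))


# Print all environment variables by iterating over the mapping
for key in os.environ:
    print(key + ' : ' + env_dist[key])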

Modify self.tmp_dir:

class Tokenizer(object):

    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = "/home/admin/"
        self.cache_file = None
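Edits under site-packages are lost whenever the package is reinstalled. Since tmp_dir is a plain instance attribute, it can presumably be set on the tokenizer object at runtime instead. The sketch below uses a hypothetical helper, point_cache_at, and a stand-in object so that it runs without jieba installed. It assumes that jieba.dt is the module-level default Tokenizer, as jieba's README describes.

```python
import os
import tempfile
from types import SimpleNamespace


def point_cache_at(tokenizer, directory):
    """Set tokenizer.tmp_dir to `directory`, creating it if needed."""
    directory = os.path.abspath(directory)
    os.makedirs(directory, exist_ok=True)
    tokenizer.tmp_dir = directory
    return directory


# Stand-in with the same attributes as the Tokenizer shown above,
# so the sketch runs without jieba installed.
fake_tokenizer = SimpleNamespace(tmp_dir=None, cache_file=None)
base = tempfile.mkdtemp()
chosen = point_cache_at(fake_tokenizer, os.path.join(base, "jieba_cache"))
print(fake_tokenizer.tmp_dir)
```

With jieba installed, the equivalent call would be point_cache_at(jieba.dt, "/home/admin/"), made before the first jieba.cut(...), because the cache path is computed when the tokenizer initializes.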

The cache-handling snippet from jieba's def initialize(self, dictionary=None):

            if self.cache_file:
                cache_file = self.cache_file
            # default dictionary
            elif abs_path == DEFAULT_DICT:
                cache_file = "jieba.cache"
            # custom dictionary
            else:
                cache_file = "jieba.u%s.cache" % md5(
                    abs_path.encode('utf-8', 'replace')).hexdigest()
            cache_file = os.path.join(
                self.tmp_dir or tempfile.gettempdir(), cache_file)
            # prevent absolute path in self.cache_file
            tmpdir = os.path.dirname(cache_file)

            load_from_cache_fail = True
            if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
                os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
                default_logger.debug(
                    "Loading model from cache %s" % cache_file)
                try:
                    with open(cache_file, 'rb') as cf:
                        self.FREQ, self.total = marshal.load(cf)
                    load_from_cache_fail = False
                except Exception:
                    load_from_cache_fail = True

            if load_from_cache_fail:
                wlock = DICT_WRITING.get(abs_path, threading.RLock())
                DICT_WRITING[abs_path] = wlock
                with wlock:
                    self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
                    default_logger.debug(
                        "Dumping model to file cache %s" % cache_file)
                    try:
                        # prevent moving across different filesystems
                        fd, fpath = tempfile.mkstemp(dir=tmpdir)
                        with os.fdopen(fd, 'wb') as temp_cache_file:
                            marshal.dump(
                                (self.FREQ, self.total), temp_cache_file)
                        _replace_file(fpath, cache_file)
                    except Exception:
                        default_logger.exception("Dump cache file failed.")

                try:
                    del DICT_WRITING[abs_path]
                except KeyError:
                    pass
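The dump step above writes to a temporary file in the cache directory and only then moves it into place, so a reader never sees a half-written cache. A standalone sketch of that pattern follows. Here os.replace stands in for jieba's _replace_file, and the function names and sample data are invented for the example.

```python
import marshal
import os
import tempfile


def dump_cache_atomically(data, cache_file):
    """Write `data` with marshal to a temp file, then rename it into place."""
    target_dir = os.path.dirname(os.path.abspath(cache_file))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            marshal.dump(data, f)
        os.replace(tmp_path, cache_file)  # stand-in for jieba's _replace_file
    except Exception:
        # Do not leave a stray temp file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_cache(cache_file):
    """Read back the object stored by dump_cache_atomically."""
    with open(cache_file, "rb") as f:
        return marshal.load(f)


demo_dir = tempfile.mkdtemp()
demo_cache = os.path.join(demo_dir, "demo.cache")
dump_cache_atomically(({"北京": 3, "清华大学": 2}, 5), demo_cache)
freq, total = load_cache(demo_cache)
print(freq, total)
```

This also shows why the directory itself must be writable: mkstemp(dir=...) fails first if it is not, which is what leads to the "Dump cache file failed." message.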

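Because the jieba Chinese word segmentation module has to build its prefix dictionary and dump it to the cache file (jieba.cache) the first time it loads, processing speed takes a hit.

To solve this, generate the jieba.cache file ahead of time and upload it to the server.

When jieba runs, it then reads the cache file directly instead of regenerating it every time, which speeds up module loading.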

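1. Use the jieba module in your local environment to generate the jieba.cache file (by default it is written to the local temp directory), then copy that file into the jieba/ directory.

2. Edit jieba/__init__.py and set self.tmp_dir to the directory that holds the copied jieba.cache, since the cache path is built as os.path.join(self.tmp_dir or tempfile.gettempdir(), cache_file).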
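The copy in step 1 can be scripted. The sketch below uses a hypothetical helper, install_prebuilt_cache, with a placeholder file standing in for a real cache. The only detail taken from jieba's code above is the file name "jieba.cache", which is what initialize() looks for with the default dictionary.

```python
import os
import shutil
import tempfile

DEFAULT_CACHE_NAME = "jieba.cache"  # name used for the default dictionary


def install_prebuilt_cache(prebuilt_file, target_dir):
    """Copy a pre-generated cache into `target_dir` under the expected name."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, DEFAULT_CACHE_NAME)
    # shutil.copy gives the copy a fresh modification time.
    shutil.copy(prebuilt_file, destination)
    return destination


# Demo with a placeholder file in place of a real jieba.cache.
source_dir = tempfile.mkdtemp()
source_file = os.path.join(source_dir, "prebuilt.bin")
with open(source_file, "wb") as f:
    f.write(b"placeholder")

installed = install_prebuilt_cache(source_file, os.path.join(source_dir, "deploy"))
print(installed)
```

For a custom dictionary the cache is only used when it is newer than the dictionary file, so a copy with a fresh modification time is the safer choice. For the default dictionary the timestamp is not checked.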
