Using Stopwords | Elasticsearch: The Definitive Guide

2026-07-22

Note:
This book is based on Elasticsearch 2.x, so some of its content may be out of date.

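Using Stopwords

Stopwords are removed by the stop token filter, which you can use when creating a custom analyzer (see Using the stop Token Filter). However, some built-in analyzers come with the stop filter already integrated:

Language analyzers: Each language analyzer defaults to the stopword list appropriate for its language. For example, the english analyzer uses the _english_ stopword list.
standard analyzer: Defaults to the empty stopword list, _none_, which effectively disables stopwords.
pattern analyzer: Defaults to the empty stopword list, _none_, like the standard analyzer.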

Stopwords and the Standard Analyzer

To use custom stopwords with the standard analyzer, all we need to do is create a configured version of the analyzer and pass in the list of stopwords we want:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ] 
        }
      }
    }
  }
}

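	The custom analyzer is named `my_analyzer`.
	This analyzer is the `standard` analyzer with some custom configuration.
	The stopwords to filter out are `and` and `the`.

Any language analyzer can be configured with custom stopwords in the same way.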
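As a minimal sketch of that pattern, the Python snippet below builds the same kind of settings body for an arbitrary analyzer type. The helper name `analyzer_settings`, the analyzer name `my_french`, and the French word list are hypothetical and chosen only for illustration; the `type` and `stopwords` keys are the ones used in the example above.

```python
import json

def analyzer_settings(name, analyzer_type, stopwords):
    """Build an index-settings body that configures one analyzer with inline stopwords."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Same shape as the my_analyzer example: a type plus a stopword list.
                    name: {"type": analyzer_type, "stopwords": list(stopwords)}
                }
            }
        }
    }

# Hypothetical example: a French analyzer with a tiny custom stopword list.
body = analyzer_settings("my_french", "french", ["et", "le", "la"])
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `PUT /my_index` request, just like the hand-written example.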

Maintaining Positions

The output from the analyze API is quite interesting:

GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead

{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 4,
         "end_offset":   9,
         "type":         "<ALPHANUM>",
         "position":     1 
      },
      {
         "token":        "dead",
         "start_offset": 18,
         "end_offset":   22,
         "type":         "<ALPHANUM>",
         "position":     4
      }
   ]
}

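position indicates the position of each token in the original text.

The stopwords have been filtered out, as expected, but the interesting part is that the position of the two remaining terms is unchanged. Counting from 0, quick is at position 1 because it is the second word in the original sentence, and dead is at position 4 because it is the fifth. This matters for phrase queries: if the positions had been adjusted to close the gaps, a phrase query for quick dead would incorrectly match the preceding example.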
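The effect on phrase matching can be illustrated with a toy model. The sketch below is not how Lucene implements analysis or phrase queries; it assumes whitespace tokenization and exact consecutive-position matching, and the function names `analyze` and `phrase_matches` are hypothetical.

```python
def analyze(text, stopwords):
    """Toy analyzer: split on whitespace, lowercase, drop stopwords, keep original positions."""
    tokens = []
    for position, word in enumerate(text.split()):
        term = word.lower()
        if term not in stopwords:
            tokens.append((term, position))
    return tokens

def phrase_matches(tokens, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    terms = phrase.lower().split()
    return any(
        all(start + i in positions.get(t, ()) for i, t in enumerate(terms))
        for start in positions.get(terms[0], ())
    )

tokens = analyze("The quick and the dead", {"and", "the"})
print(tokens)                                # [('quick', 1), ('dead', 4)]
print(phrase_matches(tokens, "quick dead"))  # False

# If positions were renumbered after filtering, the phrase would wrongly match.
renumbered = [(term, i) for i, (term, _) in enumerate(tokens)]
print(phrase_matches(renumbered, "quick dead"))  # True
```

Keeping the gaps left by removed stopwords is what lets the first check correctly report that quick and dead are not adjacent.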

Specifying Stopwords

Stopwords can be passed inline, as we did in the previous example, by specifying an array:

"stopwords": [ "and", "the" ]

The default stopword list for a particular language can be specified using the _lang_ notation:

"stopwords": "_english_"

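TIP: The predefined language-specific stopword lists available in Elasticsearch are listed in the stop token filter reference documentation.

Stopwords can be disabled by specifying the special list _none_. For example, to use the english analyzer without stopwords, you can do the following: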

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", 
          "stopwords": "_none_" 
        }
      }
    }
  }
}

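	The `my_english` analyzer is based on the `english` analyzer.
	But stopwords are disabled.

Finally, stopwords can also be stored in a file with one word per line. The file must be present on every node in the cluster, and its path is set with the stopwords_path parameter: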

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" 
        }
      }
    }
  }
}

The path to the stopwords file, relative to the Elasticsearch config directory.
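The one-word-per-line format is simple enough to generate from a list. The sketch below only renders the file contents as a string; the helper name `render_stopwords_file` is hypothetical, and skipping blanks and duplicates is a convenience assumed here, not a documented requirement. Copying the result to stopwords/english.txt under the config directory of every node is left to your deployment tooling.

```python
def render_stopwords_file(words):
    """Return file contents with one stopword per line, skipping blanks and duplicates."""
    seen = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return "\n".join(seen) + "\n"

# The padded duplicate and the empty entry are dropped.
content = render_stopwords_file(["and", "the", " and ", ""])
print(content, end="")
```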

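Using the stop Token Filter

When you need to create a custom analyzer, the stop token filter can be combined with a tokenizer and other token filters. For example, suppose we want to create a Spanish analyzer with the following:

A custom stopword list
The light_spanish stemmer
The asciifolding token filter to remove diacritics

We can set that up as follows: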

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":        "stop",
          "stopwords": [ "si", "esta", "el", "la" ]  
        },
        "light_spanish": { 
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": [ 
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}

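	The `stop` token filter takes the same `stopwords` and `stopwords_path` parameters as the `standard` analyzer.
	See Algorithmic Stemmers.
	The order of the token filters is important, as explained next.

We have placed the spanish_stop filter after the asciifolding filter. This means that the three words esta, ésta, and está will first have their diacritics removed by asciifolding, all becoming esta, and will then be removed by the stop filter. If we wanted to remove esta and ésta but keep está, we would have to place the spanish_stop filter before asciifolding and list both esta and ésta in the stopwords.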
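The two orderings can be compared with a small simulation. This is a sketch, not the real filters: `asciifold` is a rough stand-in that strips combining marks after Unicode NFD decomposition, which covers these Spanish accents but not everything the asciifolding filter handles, and all function names are hypothetical.

```python
import unicodedata

def asciifold(token):
    """Rough stand-in for asciifolding: strip combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(tokens):
    """Apply the folding stand-in to every token."""
    return [asciifold(t) for t in tokens]

def stop(stopwords):
    """Return a filter that drops tokens found in the stopword set."""
    return lambda tokens: [t for t in tokens if t not in stopwords]

def run(tokens, filters):
    """Pass the token list through each filter in order."""
    for f in filters:
        tokens = f(tokens)
    return tokens

# esta, ésta, está (accents written as escapes to avoid encoding ambiguity)
words = ["esta", "\u00e9sta", "est\u00e1"]

# Fold first, then stop: all three collapse to "esta" and are removed.
fold_then_stop = run(words, [fold, stop({"esta"})])
print(fold_then_stop)  # []

# Stop first (listing both spellings), then fold: "está" survives, folded to "esta".
stop_then_fold = run(words, [stop({"esta", "\u00e9sta"}), fold])
print(stop_then_fold)  # ['esta']
```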

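Updating Stopwords

There are a few ways to update the list of stopwords used by an analyzer. They all rely on the fact that analyzers are instantiated at index creation time, when a node is restarted, and when a closed index is reopened.

If you specify stopwords inline with the stopwords parameter, your only option is to close the index, update the analyzer configuration with the update index settings API, and then reopen the index.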
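The close, update, reopen sequence can be sketched as an ordered list of requests. The snippet below only builds the requests and does not send them; the helper name `stopword_update_steps` and the extra stopword `or` are hypothetical, and the index and analyzer names are taken from the earlier examples. Check the request shapes against the documentation for your Elasticsearch version before relying on them.

```python
import json

def stopword_update_steps(index, analyzer, analyzer_type, stopwords):
    """Return the (method, path, body) requests for an inline stopword update, in order."""
    settings = {
        "analysis": {
            "analyzer": {
                analyzer: {"type": analyzer_type, "stopwords": list(stopwords)}
            }
        }
    }
    return [
        ("POST", f"/{index}/_close", None),        # analysis settings need a closed index
        ("PUT", f"/{index}/_settings", settings),  # update index settings API
        ("POST", f"/{index}/_open", None),         # reopening re-creates the analyzer
    ]

steps = stopword_update_steps("my_index", "my_analyzer", "standard", ["and", "the", "or"])
for method, path, body in steps:
    print(method, path, json.dumps(body) if body else "")
```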

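If you specify stopwords in a file with the stopwords_path parameter, updating them is easier. You just update the file on every node in the cluster and then force the analyzers to be re-created by doing either of the following:

Close and reopen the index (see Opening and Closing an Index), or
Restart each node in the cluster, one by one.

Of course, updating the stopword list does not change any documents that have already been indexed. The new list applies only to searches and to new or updated documents. To apply the change to existing documents, you need to reindex your data. See Reindexing Your Data.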

Official source: https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html
