博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
PostgreSQL 全文检索 - 词频统计
阅读量:6954 次
发布时间:2019-06-27

本文共 5119 字,大约阅读时间需要 17 分钟。

标签

PostgreSQL , 全文检索 , 词频统计 , ts_stat , madlib


背景

TF(Term Frequency 词频)/IDF(Inverse Document Frequency 逆向文本频率)是文本分析中常见的术语。

PostgreSQL支持全文检索,支持tsvector文本向量类型。

如何在一堆文本中,找到热词,或者对词频进行分析呢?

方法1,ts_stat

第一种方法来自PostgreSQL的内置函数,ts_stat,用于生成lexeme的统计信息,例如我想知道某个问答知识库中,出现最多的词是哪个,出现在了多少篇文本中。

ts_stat介绍如下

12.4.4. Gathering Document Statistics

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ]          OUT word text, OUT ndoc integer,          OUT nentry integer) returns setof record

sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are

  • word text — the value of a lexeme

  • ndoc integer — number of documents (tsvectors) the word occurred in

  • nentry integer — total number of occurrences of the word

If weights is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')  ORDER BY nentry DESC, ndoc DESC, word  LIMIT 10;

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')  ORDER BY nentry DESC, ndoc DESC, word  LIMIT 10;

测试

1、创建生成随机字符串的函数

create or replace function gen_rand_str(int) returns text as $$      select substring(md5(random()::text), 4, $1);    $$ language sql strict stable;

2、创建生成若干个随机词的函数

create or replace function gen_rand_tsvector(int,int) returns tsvector as $$      select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2);    $$ language sql strict;
postgres=# select gen_rand_tsvector(4,10);                             gen_rand_tsvector                             -----------------------------------------------------------------------   '21eb' '2c9c' '4406' '5d9c' '9ac4' 'a27b' 'ab13' 'ba77' 'e3f2' 'f198'  (1 row)

3、创建测试表,并写入测试数据

postgres=# create table ts_test(id int, info tsvector);  CREATE TABLE  postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10);  INSERT 0 100000

4、查看词频,总共出现了多少次,在多少篇文本(多少条记录中出现过)

postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')  ORDER BY nentry DESC, ndoc DESC, word  LIMIT 10;   word | ndoc | nentry   ------+------+--------   e4e6 |   39 |     39   9596 |   36 |     36   a84c |   35 |     35   2b44 |   32 |     32   5146 |   32 |     32   92f6 |   32 |     32   cd56 |   32 |     32   fd00 |   32 |     32   4258 |   31 |     31   5f18 |   31 |     31  (10 rows)

5、再写入一批测试数据,查看词频,总共出现了多少次,在多少篇文本(多少条记录中出现过)

postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10);  INSERT 0 100000  postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')                               ORDER BY nentry DESC, ndoc DESC, word  LIMIT 10;   word | ndoc | nentry   ------+------+--------   30   | 4020 |   4020   a7   | 4005 |   4005   20   | 3985 |   3985   c5   | 3980 |   3980   e6   | 3970 |   3970   f1   | 3965 |   3965   70   | 3948 |   3948   5e   | 3943 |   3943   e4   | 3937 |   3937   2b   | 3934 |   3934  (10 rows)

方法2,madlib

实际上MADlib也提供了词频统计的训练函数

Term frequency

Term frequency tf(t,d) is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t occurs in document d. For this function, 'word' and 'term' are used interchangeably. Note: the term frequency is not normalized by the document length.

term_frequency(input_table,                     doc_id_col,                     word_col,                     output_table,                     compute_vocab)

Arguments:

input_table

TEXT. The name of the table storing the documents. Each row is in the form <doc_id, word_vector> where doc_id is an id, unique to each document, and word_vector is a text array containing the words in the document. The word_vector should contain multiple entries of a word if the document contains multiple occurrence of that word.

id_col

TEXT. The name of the column containing the document id.

word_col

TEXT. The name of the column containing the vector of words/terms in the document. This column should of type that can be cast to TEXT[].

output_table

TEXT. The name of the table to store the term frequency output. The output table contains the following columns:

  • id_col: This the document id column (same as the one provided as input).

  • word: A word/term present in a document. This is either the original word present in word_col or an id representing the word (depending on the value of compute_vocab below).

  • count: The number of times this word is found in the document.

compute_vocab

BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called output_table_vocabulary (suffix added to the output_table name) and contains the following columns:

  • wordid: An id assignment for each word

  • word: The word/term

参考

转载地址:http://jtjil.baihongyu.com/

你可能感兴趣的文章
DRDS SQL 审计与分析——全面洞察 SQL 之利器
查看>>
微信小程序:模板消息推送实现
查看>>
CodePush自定义更新弹框及下载进度条
查看>>
自己总结的php开发中用到的工具
查看>>
小程序视频或音频自定义可拖拽进度条
查看>>
PHP导出超大的CSV格式的Excel表方案
查看>>
Mac 环境下如何生成Git shh key
查看>>
jenkins 使用磁盘检查插件 disk check plugin
查看>>
使用 Ruby 拓展 Vim
查看>>
centos7下安装LNMP(nginx+PHP7.1.9+mysql5.7)
查看>>
NodeAPI学习之Buffer
查看>>
深入java单例模式
查看>>
create-react-app
查看>>
20170812-XSS跨站脚本攻击
查看>>
Let’s Build |> 使用Elixir,Phoenix和React打造克隆版的Slack(part 1)
查看>>
如何让 StackNaivgator 实现越级回跳
查看>>
工具简述
查看>>
Hbase 集群搭建
查看>>
分布式文件服务器dfs
查看>>
正则表达式
查看>>