Azure Cognitive Services - Trying Out Text Analytics

TL;DR

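Named entity recognition can extract people, places, organizations, and similar entities.
For Japanese, only Person, Location, and Organization can be analyzed, and the accuracy falls short of English.
PII detection can find personal information and health information, but it currently supports only English and Spanish.
Azure Cognitive Search is a service for advanced search over cloud storage.
It uses an Analyzer for morphological analysis when building an index, and an API is provided for testing that Analyzer.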

Named Entity Recognition (NER)

API : /text/analytics/v3.1-preview.3/entities/recognition/general

NER v3 supports only English and Spanish. For Japanese, v2 runs as a fallback, and only "Person", "Location", and "Organization" are returned.
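
As a rough illustration of how this endpoint might be called, the following Python sketch only builds the HTTP request and does not send it. The host `example-resource.cognitiveservices.azure.com`, the key value, and the helper name `build_text_analytics_request` are placeholders assumed for this example. The `Ocp-Apim-Subscription-Key` header is the usual way a Cognitive Services key is passed.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own resource endpoint and key.
ENDPOINT = "https://example-resource.cognitiveservices.azure.com"
API_KEY = "<your-subscription-key>"
NER_PATH = "/text/analytics/v3.1-preview.3/entities/recognition/general"


def build_text_analytics_request(endpoint, path, key, documents):
    """Build (but do not send) a POST request for a Text Analytics endpoint."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + path,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


documents = [
    {
        "id": "1",
        "language": "en",
        "text": "Nice to meet you. Yamada is my family name and Taro is my given name.",
    },
]
request = build_text_analytics_request(ENDPOINT, NER_PATH, API_KEY, documents)
print(request.get_method(), request.full_url)

# Sending the request needs a real endpoint and key:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to swap `NER_PATH` for the other endpoints tried in this post.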

En

Nice to meet you. Yamada is my family name and Taro is my given name. I'm from Osaka, Japan. I was born on April 14th in 1973.

Request

{
    "documents": [
        {
            "id": "1",
            "language": "en",
            "text": "Nice to meet you. Yamada is my family name and Taro is my given name. I'm from Osaka, Japan. I was born on April 14th in 1973."
        }
    ]
}

Response

    "documents": [
        {
            "id": "1",
            "entities": [
                {
                    "text": "Yamada",
                    "category": "Person",
                    "offset": 18,
                    "length": 6,
                    "confidenceScore": 0.52
                },
                {
                    "text": "Taro",
                    "category": "Person",
                    "offset": 47,
                    "length": 4,
                    "confidenceScore": 0.76
                },
                {
                    "text": "Osaka",
                    "category": "Location",
                    "subcategory": "GPE",
                    "offset": 79,
                    "length": 5,
                    "confidenceScore": 0.8
                },
                {
                    "text": "Japan",
                    "category": "Location",
                    "subcategory": "GPE",
                    "offset": 86,
                    "length": 5,
                    "confidenceScore": 0.55
                },
                {
                    "text": "April 14th",
                    "category": "DateTime",
                    "subcategory": "Date",
                    "offset": 107,
                    "length": 10,
                    "confidenceScore": 0.8
                },
                {
                    "text": "1973",
                    "category": "DateTime",
                    "subcategory": "DateRange",
                    "offset": 121,
                    "length": 4,
                    "confidenceScore": 0.8
                }
            ],
            "warnings": []
        }
    ],
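
The `offset` and `length` fields can be checked against the input text. The sketch below slices the English sentence with the values copied from the response above and reports any entity whose slice does not match its `text`. It assumes offsets count characters the same way Python string indices do, which holds for this sample but may not for text containing emoji or combining characters. The helper names are made up for this example.

```python
# Text and entity values copied from the English request and response above.
TEXT = (
    "Nice to meet you. Yamada is my family name and Taro is my given name. "
    "I'm from Osaka, Japan. I was born on April 14th in 1973."
)
ENTITIES = [
    {"text": "Yamada", "category": "Person", "offset": 18, "length": 6},
    {"text": "Taro", "category": "Person", "offset": 47, "length": 4},
    {"text": "Osaka", "category": "Location", "offset": 79, "length": 5},
    {"text": "Japan", "category": "Location", "offset": 86, "length": 5},
    {"text": "April 14th", "category": "DateTime", "offset": 107, "length": 10},
    {"text": "1973", "category": "DateTime", "offset": 121, "length": 4},
]


def slice_entity(text, entity):
    """Return the substring that an entity's offset and length point at."""
    start = entity["offset"]
    return text[start:start + entity["length"]]


def mismatched_entities(text, entities):
    """Return the entities whose slice differs from their reported text."""
    return [e for e in entities if slice_entity(text, e) != e["text"]]


for entity in ENTITIES:
    print(entity["category"], repr(slice_entity(TEXT, entity)))
print("mismatches:", mismatched_entities(TEXT, ENTITIES))
```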

Ja

はじめまして。山田が苗字で、太郎が名前です。日本の大阪出身です。私は、1973年4月14日生まれです。

Request

{
    "documents": [
        {
            "id": "1",
            "language": "ja",
            "text": "はじめまして。 山田が苗字で、太郎が名前です。 日本の大阪出身です。 私は、1973年4月14日生まれです。"
        }
    ]
}

Response

    "documents": [
        {
            "id": "1",
            "entities": [
                {
                    "text": "山田",
                    "category": "Person",
                    "offset": 8,
                    "length": 2,
                    "confidenceScore": 0.66
                },
                {
                    "text": "日本",
                    "category": "Location",
                    "subcategory": "GPE",
                    "offset": 24,
                    "length": 2,
                    "confidenceScore": 0.98
                },
                {
                    "text": "大阪",
                    "category": "Location",
                    "subcategory": "GPE",
                    "offset": 27,
                    "length": 2,
                    "confidenceScore": 0.98
                }
            ],
            "warnings": []
        }
    ],
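
To make the gap between the two languages concrete, here is a small sketch that groups the entities from the English and Japanese responses above by category and lists the categories that appear only in the English result. The entity lists are copied from the responses. The function name `group_by_category` is just for this example.

```python
from collections import defaultdict

# (text, category) pairs copied from the two responses above.
EN_ENTITIES = [
    ("Yamada", "Person"),
    ("Taro", "Person"),
    ("Osaka", "Location"),
    ("Japan", "Location"),
    ("April 14th", "DateTime"),
    ("1973", "DateTime"),
]
JA_ENTITIES = [
    ("山田", "Person"),
    ("日本", "Location"),
    ("大阪", "Location"),
]


def group_by_category(entities):
    """Group entity texts under their category name."""
    grouped = defaultdict(list)
    for text, category in entities:
        grouped[category].append(text)
    return dict(grouped)


en_groups = group_by_category(EN_ENTITIES)
ja_groups = group_by_category(JA_ENTITIES)

# Categories present in the English result but absent from the Japanese one.
missing_in_ja = sorted(set(en_groups) - set(ja_groups))
print("English:", en_groups)
print("Japanese:", ja_groups)
print("Missing in Japanese:", missing_in_ja)
```

In this sample the Japanese result has no DateTime entities at all, and the given name 太郎 is not picked up as a Person even though Taro is in the English result.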

Detecting Personally Identifiable Information (PII)

API : /text/analytics/v3.1-preview.3/entities/recognition/pii

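The NER v3.1 preview can detect personally identifiable information (PII) and protected health information (PHI). The API is currently available for English and Spanish only.

The categories that can be extracted are listed in the documentation on the entity categories supported by Named Entity Recognition.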

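Japan

Japan Bank Account Number

Japan Driver's License Number

Japan My Number (Personal)

Japan My Number (Corporate)

Japan Resident Registration Number

Japan Residence Card Number

Japan Social Insurance Number (SIN)

Japan Passport Number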

En

Nice to meet you. Yamada is my family name and Taro is my given name. I'm from Osaka, Japan. I was born on April 14th in 1973. My number is 345-6890. My bank account number is 1234567.

Request

{
    "documents": [
        {
            "id": "1",
            "language": "en",
            "text": "Nice to meet you. Yamada is my family name and Taro is my given name. I'm from Osaka, Japan. I was born on April 14th in 1973."
        },
        {
            "id": "2",
            "language": "en",
            "text": "My number is 345-6890. My bank account number is 1234567."
        }
    ]
}

Response

    "documents": [
        {
            "redactedText": "Nice to meet you. ****** is my family name and **** is my given name. I'm from Osaka, Japan. I was born on ********** in 1973.",
            "id": "1",
            "entities": [
                {
                    "text": "Yamada",
                    "category": "Person",
                    "offset": 18,
                    "length": 6,
                    "confidenceScore": 0.52
                },
                {
                    "text": "Taro",
                    "category": "Person",
                    "offset": 47,
                    "length": 4,
                    "confidenceScore": 0.76
                },
                {
                    "text": "April 14th",
                    "category": "DateTime",
                    "subcategory": "Date",
                    "offset": 107,
                    "length": 10,
                    "confidenceScore": 0.8
                }
            ],
            "warnings": []
        },
        {
            "redactedText": "My number is ********. My bank account number is ********",
            "id": "2",
            "entities": [
                {
                    "text": "345-6890",
                    "category": "Phone Number",
                    "offset": 13,
                    "length": 8,
                    "confidenceScore": 0.8
                },
                {
                    "text": "1234567.",
                    "category": "Japan Bank Account Number",
                    "offset": 49,
                    "length": 8,
                    "confidenceScore": 0.85
                }
            ],
            "warnings": []
        }
    ],
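
The `redactedText` values above look like the input with each detected span replaced by asterisks. The sketch below reproduces that masking locally from `offset` and `length`. It is only a guess at how the two fields relate, not the service's own implementation, and `redact` is a name invented for this example.

```python
def redact(text, entities, mask="*"):
    """Replace every entity span in text with mask characters."""
    chars = list(text)
    for entity in entities:
        start = entity["offset"]
        end = start + entity["length"]
        chars[start:end] = mask * (end - start)
    return "".join(chars)


# Inputs and entity spans copied from the request and response above.
DOC1 = (
    "Nice to meet you. Yamada is my family name and Taro is my given name. "
    "I'm from Osaka, Japan. I was born on April 14th in 1973."
)
DOC1_ENTITIES = [
    {"offset": 18, "length": 6},    # Yamada
    {"offset": 47, "length": 4},    # Taro
    {"offset": 107, "length": 10},  # April 14th
]
DOC2 = "My number is 345-6890. My bank account number is 1234567."
DOC2_ENTITIES = [
    {"offset": 13, "length": 8},  # 345-6890
    {"offset": 49, "length": 8},  # 1234567. (the span includes the period)
]

print(redact(DOC1, DOC1_ENTITIES))
print(redact(DOC2, DOC2_ENTITIES))
```

Note that the bank account entity covers the trailing period, which is why the redacted second document ends without one.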

Ja

Not supported.

Key Phrase Extraction

API : /text/analytics/v3.1-preview.3/keyPhrases

En

Request

{
    "documents": [
        {
            "id": "1",
            "language": "en",
            "text": "Nice to meet you. Yamada is my family name and Taro is my given name. I'm from Osaka, Japan. I was born on April 14th in 1973."
        }
    ]
}

Response

    "documents": [
        {
            "id": "1",
            "keyPhrases": [
                "family",
                "Taro",
                "Yamada",
                "Osaka",
                "Japan"
            ],
            "warnings": []
        }
    ],

Ja

Request

{
    "documents": [
        {
            "id": "1",
            "language": "ja",
            "text": "はじめまして。 山田が苗字で、太郎が名前です。 日本の大阪出身です。 私は、1973年4月14日生まれです。"
        }
    ]
}

Response

    "documents": [
        {
            "id": "1",
            "keyPhrases": [
                "苗字",
                "太郎",
                "日本",
                "大阪出身",
                "山田",
                "名前",
                "まし",
                "生まれ"
            ],
            "warnings": []
        }
    ],
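
The key phrase response carries no offsets, so one way to see where each phrase came from is a naive substring search over the input. The sketch below does that for the Japanese result above. `locate_phrases` is a helper invented for this example, and `str.find` only reports the first occurrence.

```python
# Input text and key phrases copied from the Japanese request and response above.
TEXT = "はじめまして。 山田が苗字で、太郎が名前です。 日本の大阪出身です。 私は、1973年4月14日生まれです。"
KEY_PHRASES = ["苗字", "太郎", "日本", "大阪出身", "山田", "名前", "まし", "生まれ"]


def locate_phrases(text, phrases):
    """Map each phrase to the index of its first occurrence (-1 if absent)."""
    return {phrase: text.find(phrase) for phrase in phrases}


positions = locate_phrases(TEXT, KEY_PHRASES)

# Print the phrases in the order they appear in the input.
for phrase, position in sorted(positions.items(), key=lambda item: item[1]):
    print(position, phrase)
```

Run against this sample, "まし" turns out to sit inside the greeting はじめまして, which suggests it is a segmentation artifact rather than a meaningful key phrase.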

Azure Cognitive Search

Azure Cognitive Search itself is a service for advanced search over cloud storage.
The Analyzer used when building an index performs morphological analysis, and an Analyzer API is provided for testing analyzers.
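
As a sketch of what a call to that testing API might look like, the following Python builds a request for the Analyze Text operation of the Azure Cognitive Search REST API without sending it. The service name, index name, and key are placeholders assumed for this example, and the choice of `api-version` 2020-06-30 and the `ja.microsoft` analyzer is also an assumption rather than something taken from this post.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own service, index, and key.
SERVICE = "example-search-service"
INDEX = "example-index"
API_VERSION = "2020-06-30"
ADMIN_KEY = "<your-admin-key>"


def build_analyze_request(service, index, api_version, key, text, analyzer):
    """Build (but do not send) an Analyze Text request for a search index."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/analyze?api-version={api_version}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url=url,
        data=body,
        headers={"Content-Type": "application/json", "api-key": key},
        method="POST",
    )


analyze_request = build_analyze_request(
    SERVICE, INDEX, API_VERSION, ADMIN_KEY,
    text="日本の大阪出身です。",
    analyzer="ja.microsoft",
)
print(analyze_request.get_method(), analyze_request.full_url)

# Sending the request needs a real service, index, and key:
# with urllib.request.urlopen(analyze_request) as response:
#     print(json.load(response))
```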


yotiky Tech Blog
