Extra: language detection with character quadgrams

`cld2` library for language detection

CLD2 is a Naïve Bayesian classifier, using one of three different token algorithms:

- For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result.
- For the 80,000+ character Han script and its CJK combination with Hiragana, Katakana, and Hangul scripts, single letters (unigrams) are scored.
- For all other scripts, sequences of four letters (quadgrams) are scored.
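To make the third case concrete, here is a minimal sketch of Naïve Bayes scoring over character quadgrams. It is not CLD2's actual implementation: the helper names (`quadgrams`, `train`, `classify`), the add-one smoothing, the uniform prior over languages, and the two hand-written training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def quadgrams(text, n=4):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples):
    """Build per-language quadgram counts from {language: training text}."""
    return {lang: Counter(quadgrams(text)) for lang, text in samples.items()}


def classify(text, models):
    """Pick the language with the highest summed log-probability.

    Add-one smoothing keeps unseen quadgrams from zeroing out a language;
    a uniform prior over languages is assumed, so it is left out of the sum.
    """
    vocabulary = set().union(*models.values())
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values()) + len(vocabulary) + 1
        scores[lang] = sum(
            math.log((counts[gram] + 1) / total) for gram in quadgrams(text)
        )
    return max(scores, key=scores.get)


# Hand-written training texts (illustrative only, far too small for real use).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
}
models = train(samples)
prediction_en = classify("the dog sleeps in the house", models)
prediction_nl = classify("de hond slaapt in het huis", models)
print(prediction_en, prediction_nl)  # prints: en nl
```

Even with one sentence of training text per language, the quadgram overlap is enough to separate the two test sentences. A production classifier such as CLD2 relies on much larger, precomputed tables.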

pycld2 provides Python bindings for a fork of this C++ library. To demonstrate the effectiveness of character quadgrams for language detection: what percentage of the posts in my LinkedIn activity is in English versus Dutch?

import altair as alt
import pandas as pd
import pycld2 as cld2


posts_data = "https://github.com/jads-nl/public-lectures/blob/main/nlp/data/linkedin-shares.csv?raw=true"
posts = pd.read_csv(posts_data)

# cld2.detect returns a tuple (is_reliable, bytes_found, details) for each post
posts[["IsReliable", "length", "languages_detected"]] = pd.DataFrame(
    posts.ShareCommentary.astype("str").apply(cld2.detect).tolist(), index=posts.index
)
# each entry in details is (name, code, percent, score);
# keep the language code and percentage of the top-ranked guess
posts[["first_lang", "first_lang_perc"]] = pd.DataFrame(
    posts.languages_detected.apply(lambda x: [x[0][1], x[0][2]]).tolist(),
    index=posts.index,
)

Running this cell requires the `pycld2` package; without it, the import fails with `ModuleNotFoundError: No module named 'pycld2'`.

posts.head()

alt.Chart(posts).mark_bar().encode(
    alt.X("length:Q", bin=alt.Bin(maxbins=20)),
    alt.Y("count()", stack=None),
    alt.Facet("first_lang:N", columns=2),
)
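The chart shows the length distribution per detected language, but the question above asks for a percentage. A minimal sketch of that calculation follows, using a made-up list of language codes in place of the real `first_lang` column (the values are hypothetical, not results from the LinkedIn data). With the DataFrame loaded, `posts.first_lang.value_counts(normalize=True)` should give the same shares as fractions.

```python
from collections import Counter

# Hypothetical detected language codes, one per post (not real results).
first_langs = ["en", "nl", "en", "en", "nl", "en", "nl", "en"]

counts = Counter(first_langs)
# Share of posts per language, as a percentage of all posts.
shares = {lang: 100 * count / len(first_langs) for lang, count in counts.items()}

for lang, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{lang}: {share:.1f}%")
```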

Exercise: compute the top-10 character quadgrams from the Universal Declaration of Human Rights

Given:

- The Universal Declaration of Human Rights in 500 languages
- The example code below, which shows how to
  - fetch the plain text of the declaration in a given language
  - iterate over a single string to compute the frequency of character quadgrams

Compute the top-10 quadgrams for a handful of languages of your choice.

Learning objectives:

- know how to scrape simple websites
- know how to clean HTML tags

# fetch text
from bs4 import BeautifulSoup
import requests


lang_id = "dut"
url = f"https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID={lang_id}"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
soup.body.find("span", attrs={"class": "udhrtext"})
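The last line above returns a tag that still contains markup. With BeautifulSoup, calling `get_text()` on the found tag is the usual way to drop it. The sketch below shows the same idea using only the standard library's `html.parser`, so it runs without a network connection. The `TextExtractor` class, the `strip_tags` helper, and the HTML fragment are hand-written stand-ins, not the page's real markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of html with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


# Hand-written stand-in for the scraped element (not the real page content).
fragment = (
    '<span class="udhrtext"><h4>Artikel 1</h4>'
    "<p>Alle mensen worden <b>vrij</b> geboren.</p></span>"
)
plain = strip_tags(fragment)
print(plain)  # prints: Artikel 1 Alle mensen worden vrij geboren.
```

Joining the text nodes with a space keeps words in adjacent block elements apart. The trade-off is that a word split by an inline tag would gain a stray space.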

# count n-grams in string
from collections import Counter


def count_n_grams_frequencies(text, n=4):
    """Counts the frequency of character n-grams in text.

    Note that the string should be cleaned such that
    - it is lower case only
    - punctuation is replaced by whitespace
    """
    frequencies = Counter()
    # + 1 so that the final n-gram of the string is included
    for i in range(len(text) - n + 1):
        frequencies[text[i:i + n]] += 1

    return frequencies.most_common()


example = "the quick brown fox jumps over the lazy dog "
count_n_grams_frequencies(example)

[('the ', 2),
 ('he q', 1),
 ('e qu', 1),
 (' qui', 1),
 ('quic', 1),
 ('uick', 1),
 ('ick ', 1),
 ('ck b', 1),
 ('k br', 1),
 (' bro', 1),
 ('brow', 1),
 ('rown', 1),
 ('own ', 1),
 ('wn f', 1),
 ('n fo', 1),
 (' fox', 1),
 ('fox ', 1),
 ('ox j', 1),
 ('x ju', 1),
 (' jum', 1),
 ('jump', 1),
 ('umps', 1),
 ('mps ', 1),
 ('ps o', 1),
 ('s ov', 1),
 (' ove', 1),
 ('over', 1),
 ('ver ', 1),
 ('er t', 1),
 ('r th', 1),
 (' the', 1),
 ('he l', 1),
 ('e la', 1),
 (' laz', 1),
 ('lazy', 1),
 ('azy ', 1),
 ('zy d', 1),
 ('y do', 1),
 (' dog', 1),
 ('dog ', 1)]
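The docstring asks for input that is lower case with punctuation replaced by whitespace. One possible way to do that cleaning and then keep only the most frequent quadgrams is sketched below. The helper names `clean_text` and `top_quadgrams` and the sample sentence are illustrative assumptions, not part of the exercise material.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase text, replace punctuation by spaces, collapse whitespace."""
    text = text.lower()
    # Anything that is neither a word character nor whitespace becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def top_quadgrams(text, k=10, n=4):
    """Return the k most frequent character n-grams of the cleaned text."""
    cleaned = clean_text(text)
    grams = (cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    return Counter(grams).most_common(k)


sample = "All human beings are born free and equal, in dignity and rights."
top = top_quadgrams(sample, k=3)
print(top)
```

Because `\w` matches Unicode letters in Python 3, accented characters survive the cleaning step, which matters for most of the languages in the declaration.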
