Extra: language detection with character quadgrams
cld2 library for language detection
CLD2 is a Naïve Bayesian classifier, using one of three different token algorithms:
For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result.
For the 80,000+ character Han script and its CJK combination with Hiragana, Katakana, and Hangul scripts, single letters (unigrams) are scored.
For all other scripts, sequences of four letters (quadgrams) are scored.
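To make the quadgram idea concrete, here is a minimal sketch of a Naïve Bayes classifier over character quadgrams. It is not CLD2's actual implementation: the function names (`quadgrams`, `train`, `classify`), the add-one smoothing, and the two tiny training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def quadgrams(text, n=4):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples):
    """Count quadgrams per language from a dict {language: training text}."""
    return {lang: Counter(quadgrams(text.lower())) for lang, text in samples.items()}


def classify(text, models):
    """Pick the language with the highest sum of quadgram log-probabilities.

    Add-one smoothing (an assumption of this sketch) keeps unseen quadgrams
    from driving a score to minus infinity.
    """
    vocabulary = set().union(*models.values())
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values()) + len(vocabulary)
        scores[lang] = sum(
            math.log((counts[gram] + 1) / total) for gram in quadgrams(text.lower())
        )
    return max(scores, key=scores.get)


# Hypothetical, tiny training texts; a usable model needs far more data.
models = train({
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
})
print(classify("the dog sleeps", models))  # en
print(classify("de hond slaapt", models))  # nl
```

Even with one sentence per language, quadgrams such as "the " or "hond" already separate the two test phrases, which hints at why the approach works well on short texts.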
pycld2 provides Python bindings for this C++ library. To demonstrate the effectiveness of character quadgrams for language detection, consider this question: what percentage of the posts in my LinkedIn activity is written in English, and what percentage in Dutch?
import altair as alt
import pandas as pd
import pycld2 as cld2

posts_data = "https://github.com/jads-nl/public-lectures/blob/main/nlp/data/linkedin-shares.csv?raw=true"
posts = pd.read_csv(posts_data)

# cld2.detect returns (is_reliable, text_bytes_found, details) for each post
posts[["IsReliable", "length", "languages_detected"]] = pd.DataFrame(
    posts.ShareCommentary.astype("str").apply(cld2.detect).tolist(), index=posts.index
)
# keep the language code and percentage of the top-ranked language
posts[["first_lang", "first_lang_perc"]] = pd.DataFrame(
    posts.languages_detected.apply(lambda x: [x[0][1], x[0][2]]).tolist(),
    index=posts.index,
)
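The share of posts per language then follows from counting the top-ranked language of each post. The sketch below uses only the standard library and four made-up results shaped like the return value of `cld2.detect`, so the numbers are illustrative and not the outcome for the LinkedIn data. On the `posts` frame itself, `posts.first_lang.value_counts(normalize=True)` should give the equivalent proportions.

```python
from collections import Counter

# Hypothetical results in the shape (is_reliable, text_bytes_found, details),
# where each entry of details is (language_name, language_code, percent, score).
unknown = ("Unknown", "un", 0, 0.0)
detections = [
    (True, 120, (("ENGLISH", "en", 99, 1100.0), unknown, unknown)),
    (True, 85, (("DUTCH", "nl", 98, 950.0), unknown, unknown)),
    (True, 240, (("ENGLISH", "en", 97, 1250.0), unknown, unknown)),
    (True, 60, (("ENGLISH", "en", 95, 800.0), unknown, unknown)),
]

# details[0][1] is the language code of the top-ranked language
first_langs = Counter(details[0][1] for _, _, details in detections)
total = sum(first_langs.values())
shares = {lang: round(100 * count / total, 1) for lang, count in first_langs.items()}
print(shares)  # {'en': 75.0, 'nl': 25.0}
```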
posts.head()
alt.Chart(posts).mark_bar().encode(
    alt.X("length:Q", bin=alt.Bin(maxbins=20)),
    alt.Y("count()", stack=None),
    alt.Facet("first_lang:N", columns=2),
)
Exercise: compute the top-10 character quadgrams from the Universal Declaration of Human Rights
Given:
The example code below to
fetch the plain text of the declaration in a given language
iterate over a single string to compute the frequency of character quadgrams
Compute the top-10 quadgrams for a handful of languages of your choice.
Learning objectives:
know how to scrape simple websites
know how to clean HTML tags
# fetch text
from bs4 import BeautifulSoup
import requests
lang_id = "dut"
url = f"https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID={lang_id}"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
soup.body.find("span", attrs={"class": "udhrtext"})
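The element returned above still contains HTML tags. One way to clean them, sketched here with the standard library only, is to collect the text nodes and discard everything else. The `TextExtractor` class, the `strip_tags` helper, and the HTML snippet are hypothetical and only stand in for the real page content. With BeautifulSoup, calling `get_text(" ", strip=True)` on the element found above should achieve much the same.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and drop every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with a space so text from adjacent block elements does not fuse
    return " ".join(" ".join(parser.chunks).split())


# Made-up fragment that mimics the structure of the scraped element
snippet = '<span class="udhrtext"><h3>Artikel 1</h3><p>Alle mensen worden <b>vrij</b> en gelijk geboren.</p></span>'
print(strip_tags(snippet))  # Artikel 1 Alle mensen worden vrij en gelijk geboren.
```

Joining the chunks with a space is a deliberate simplification: it keeps headings and paragraphs apart, at the cost of an extra space when an inline tag splits a word or precedes punctuation.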
# count n-grams in string
from collections import Counter


def count_n_grams_frequencies(text, n=4):
    """Count the frequency of character n-grams in text.

    Note that the string should be cleaned such that
    - it is lower case only
    - punctuation is replaced by whitespace
    """
    frequencies = Counter()
    # + 1 so that the final n-gram of the string is included
    for i in range(len(text) - n + 1):
        frequencies[text[i:i + n]] += 1
    return frequencies.most_common()


example = "the quick brown fox jumps over the lazy dog "
count_n_grams_frequencies(example)
[('the ', 2),
('he q', 1),
('e qu', 1),
(' qui', 1),
('quic', 1),
('uick', 1),
('ick ', 1),
('ck b', 1),
('k br', 1),
(' bro', 1),
('brow', 1),
('rown', 1),
('own ', 1),
('wn f', 1),
('n fo', 1),
(' fox', 1),
('fox ', 1),
('ox j', 1),
('x ju', 1),
(' jum', 1),
('jump', 1),
('umps', 1),
('mps ', 1),
('ps o', 1),
('s ov', 1),
(' ove', 1),
('over', 1),
('ver ', 1),
('er t', 1),
('r th', 1),
(' the', 1),
('he l', 1),
('e la', 1),
(' laz', 1),
('lazy', 1),
('azy ', 1),
('zy d', 1),
('y do', 1),
(' dog', 1),
('dog ', 1)]
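As a starting point for the exercise, the sketch below applies the cleaning rules from the docstring and keeps only the ten most frequent quadgrams per language. The helpers `normalise` and `top_n_grams` are hypothetical names, the `eng` key is an assumed language identifier, and the two sample strings are Article 1 of the declaration typed in by hand rather than scraped, so a full solution would substitute the complete text.

```python
import re
from collections import Counter


def normalise(text):
    """Lower-case the text and replace punctuation by whitespace."""
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    # Collapse runs of whitespace left behind by the replaced punctuation
    return re.sub(r"\s+", " ", text).strip()


def top_n_grams(text, n=4, top=10):
    """Return the `top` most frequent character n-grams with their counts."""
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(grams).most_common(top)


# Article 1 only; replace with the full declaration for meaningful counts.
samples = {
    "eng": "All human beings are born free and equal in dignity and rights. "
           "They are endowed with reason and conscience and should act towards "
           "one another in a spirit of brotherhood.",
    "dut": "Alle mensen worden vrij en gelijk in waardigheid en rechten geboren. "
           "Zij zijn begiftigd met verstand en geweten, en behoren zich jegens "
           "elkander in een geest van broederschap te gedragen.",
}

for lang_id, text in samples.items():
    print(lang_id, top_n_grams(normalise(text)))
```

`Counter.most_common(top)` does the ranking, so no explicit sorting is needed; ties keep the order in which the quadgrams first appear.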