# Extra: language detection with character quadgrams

## `cld2` library for language detection

[CLD2](https://github.com/CLD2Owners/cld2) is a Naïve Bayes classifier that uses one of three token algorithms, depending on the script:
1. For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result.
2. For the 80,000+ character Han script and its CJK combination with Hiragana, Katakana, and Hangul scripts, single letters (unigrams) are scored.
3. For all other scripts, sequences of four letters (quadgrams) are scored.
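
To make the third case concrete, here is a minimal sketch of Naïve Bayes scoring over character quadgrams. It is a toy, not CLD2's actual implementation: the names `quadgrams`, `SAMPLES`, `PROFILES`, `log_likelihood` and `guess_language` are made up for illustration, the training text is a single clause per language, and it assumes a uniform prior with add-one smoothing.

```python
import math
from collections import Counter


def quadgrams(text, n=4):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Tiny hand-picked training samples (first clause of the UDHR preamble).
# A real detector such as CLD2 relies on far larger, precomputed tables.
SAMPLES = {
    "en": "whereas recognition of the inherent dignity and of the equal and "
          "inalienable rights of all members of the human family is the "
          "foundation of freedom justice and peace in the world",
    "nl": "overwegende dat erkenning van de inherente waardigheid en van de "
          "gelijke en onvervreemdbare rechten van alle leden van de "
          "mensengemeenschap grondslag is voor de vrijheid gerechtigheid en "
          "vrede in de wereld",
}

# One quadgram frequency profile per language
PROFILES = {lang: Counter(quadgrams(text)) for lang, text in SAMPLES.items()}
VOCABULARY = set().union(*PROFILES.values())


def log_likelihood(text, lang):
    """Sum of smoothed log-probabilities of the quadgrams in text."""
    profile = PROFILES[lang]
    total = sum(profile.values())
    return sum(
        math.log((profile[gram] + 1) / (total + len(VOCABULARY)))
        for gram in quadgrams(text)
    )


def guess_language(text):
    """Pick the language whose profile makes the text most likely."""
    return max(PROFILES, key=lambda lang: log_likelihood(text, lang))


print(guess_language("de rechten van de mens"))
print(guess_language("the rights of the human family"))
```

Even with such small profiles, the quadgrams of a short phrase overlap far more with one language than with the other, which is the property CLD2 exploits at scale.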

[pycld2](https://pypi.org/project/pycld2/) provides Python bindings for this C++ library. To demonstrate how effective character quadgrams are for language detection, consider a simple question: what percentage of the posts in my LinkedIn activity is in English and what percentage is in Dutch?

In [1]:
import altair as alt
import pandas as pd
import pycld2 as cld2


posts_data = "https://github.com/jads-nl/public-lectures/blob/main/nlp/data/linkedin-shares.csv?raw=true"
posts = pd.read_csv(posts_data)
posts[["IsReliable", "length", "languages_detected"]] = pd.DataFrame(
    posts.ShareCommentary.astype("str").apply(cld2.detect).tolist(), index=posts.index
)
posts[["first_lang", "first_lang_perc"]] = pd.DataFrame(
    posts.languages_detected.apply(lambda x: [x[0][1], x[0][2]]).tolist(),
    index=posts.index,
)
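
The cell above relies on the shape of the value returned by `cld2.detect`: a tuple of a reliability flag, the number of text bytes found (stored in the `length` column, so it counts bytes rather than characters) and a tuple of per-language details. The sketch below unpacks one such value. The `sample_result` is hand-written to mirror the shape visible in the `languages_detected` column and is not real detector output.

```python
# Hand-written sample shaped like one return value of cld2.detect:
# (is_reliable, text_bytes_found, details). Not real detector output.
sample_result = (
    True,
    265,
    (
        ("DUTCH", "nl", 99, 1089.0),
        ("Unknown", "un", 0, 0.0),
        ("Unknown", "un", 0, 0.0),
    ),
)

is_reliable, text_bytes_found, details = sample_result

# Each detail is (language name, language code, percent of text, score);
# the cell above treats the first entry as the top guess.
language_name, language_code, percent, score = details[0]
first_lang, first_lang_perc = language_code, percent

print(first_lang, first_lang_perc)  # nl 99
```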

In [2]:
posts.head()

| Unnamed: 0 | Date | ShareLink | ShareCommentary | SharedURL | MediaURL | Visibility | IsReliable | length | languages_detected | first_lang | first_lang_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-29 22:24:35 | https://www.linkedin.com/feed/update/urn%3Ali%... | MIjn college-reeks voor de huidige Discover gr... | | | MEMBER_NETWORK | True | 538 | ((DUTCH, nl, 81, 1007.0), (ENGLISH, en, 18, 59... | nl | 81 |
| 1 | 2021-01-25 14:08:14 | https://www.linkedin.com/feed/update/urn%3Ali%... | Ook ik heb soms mijn vraagtekens bij het belei... | | | MEMBER_NETWORK | True | 265 | ((DUTCH, nl, 99, 1089.0), (Unknown, un, 0, 0.0... | nl | 99 |
| 2 | 2021-01-23 22:11:48 | https://www.linkedin.com/feed/update/urn%3Ali%... | Compassion, even (especially?!) for the person... | | | MEMBER_NETWORK | True | 63 | ((ENGLISH, en, 98, 1486.0), (Unknown, un, 0, 0... | en | 98 |
| 3 | 2021-01-23 22:05:01 | https://www.linkedin.com/feed/update/urn%3Ali%... | Ik kende ze nog niet, maar ga gelijk inschrijv... | | | MEMBER_NETWORK | True | 81 | ((DUTCH, nl, 98, 1292.0), (Unknown, un, 0, 0.0... | nl | 98 |
| 4 | 2021-01-22 06:43:45 | https://www.linkedin.com/feed/update/urn%3Ali%... | Van Vleuten in warm bad bij Movistar: 'Pontifi... | | | MEMBER_NETWORK | True | 166 | ((DUTCH, nl, 99, 657.0), (Unknown, un, 0, 0.0)... | nl | 99 |


In [3]:
alt.Chart(posts).mark_bar().encode(
    alt.X("length:Q", bin=alt.Bin(maxbins=20)),
    alt.Y('count()', stack=None),
    alt.Facet('first_lang:N', columns=2))
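
The faceted histogram shows how post length is distributed per language. The share per language, which is what the opening question asks for, follows from a simple count. The sketch below uses only the five language codes visible in the `posts.head()` output, so the numbers are illustrative and say nothing about the full dataset. On the full frame, `posts["first_lang"].value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter

# The five codes from the posts.head() output, standing in for posts.first_lang
first_lang = ["nl", "nl", "en", "nl", "nl"]

counts = Counter(first_lang)
total = sum(counts.values())

# Percentage of posts per detected language, most frequent first
shares = {lang: 100 * n / total for lang, n in counts.most_common()}

print(shares)  # {'nl': 80.0, 'en': 20.0}
```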

## Exercise: compute the top-10 character quadgrams from the Universal Declaration of Human Rights

Given:
- The [Universal Declaration of Human Rights in 500 languages](https://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx)
- The example code below, which
  - fetches the plain text of the declaration in a given language
  - iterates over a single string to compute the frequency of character quadgrams

Compute the top-10 quadgrams for a handful of languages of your choice.

Learning objectives:
- know how to scrape simple websites
- know how to clean HTML tags


In [4]:
# fetch text
from bs4 import BeautifulSoup
import requests


lang_id = "dut"
url = f"https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID={lang_id}"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
soup.body.find("span", attrs={"class": "udhrtext"})

<span class="udhrtext" id="ctl00_PlaceHolderMain_usrUDHRLanguage_lblLang"><h3>UNIVERSELE VERKLARING VAN DE RECHTEN VAN DE MENS</h3>
<h4>Preambule</h4>
<p>Overwegende, dat erkenning van de inherente waardigheid en van de gelijke en onvervreemdbare rechten van alle leden van de mensengemeenschap grondslag is voor de vrijheid, gerechtigheid en vrede in de wereld; </p>
<p>Overwegende, dat terzijdestelling van en minachting voor de rechten van de mens geleid hebben tot barbaarse handelingen, die het geweten van de mensheid geweld hebben aangedaan en dat de komst van een wereld, waarin de mensen vrijheid van meningsuiting en geloof zullen genieten, en vrij zullen zijn van vrees en gebrek, is verkondigd als het hoogste ideaal van iedere mens; </p>
<p>Overwegende, dat het van het grootste belang is, dat de rechten van de mens beschermd worden door de suprematie van het recht, opdat de mens niet gedwongen worde om in laatste instantie zijn toevlucht te nemen tot opstand tegen tyrannie en onderd
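
The span above still contains markup, capitals and punctuation, while the counting function expects lower-case text with punctuation replaced by whitespace. With BeautifulSoup, calling `.get_text()` on the span found above returns the text without tags. The sketch below shows the same idea using only the standard library, followed by the normalisation step. `TextExtractor` and `clean_html` are hypothetical helper names, and the sample is the first two elements of the output above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and drop the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(html):
    """Strip tags, lowercase, and replace punctuation runs by one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # [\W_]+ matches runs of characters that are neither letters nor digits
    return re.sub(r"[\W_]+", " ", text).strip()


sample = (
    "<h3>UNIVERSELE VERKLARING VAN DE RECHTEN VAN DE MENS</h3>\n"
    "<h4>Preambule</h4>"
)
print(clean_html(sample))
# universele verklaring van de rechten van de mens preambule
```

Because `\W` is Unicode-aware in Python 3, accented letters in other languages are kept rather than replaced by spaces.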

In [5]:
# count n-grams in string
from collections import Counter


def count_n_grams_frequencies(text, n=4):
    """Count the frequency of each character n-gram in text.

    Note that the string should be cleaned first such that
    - it contains only lower case
    - punctuation is replaced by whitespace

    Returns a list of (n_gram, count) tuples, most frequent first.
    """
    # A string of length L has L - n + 1 n-grams, hence the + 1 in the range
    n_grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(n_grams).most_common()


example = "the quick brown fox jumps over the lazy dog "
count_n_grams_frequencies(example)


[('the ', 2),
 ('he q', 1),
 ('e qu', 1),
 (' qui', 1),
 ('quic', 1),
 ('uick', 1),
 ('ick ', 1),
 ('ck b', 1),
 ('k br', 1),
 (' bro', 1),
 ('brow', 1),
 ('rown', 1),
 ('own ', 1),
 ('wn f', 1),
 ('n fo', 1),
 (' fox', 1),
 ('fox ', 1),
 ('ox j', 1),
 ('x ju', 1),
 (' jum', 1),
 ('jump', 1),
 ('umps', 1),
 ('mps ', 1),
 ('ps o', 1),
 ('s ov', 1),
 (' ove', 1),
 ('over', 1),
 ('ver ', 1),
 ('er t', 1),
 ('r th', 1),
 (' the', 1),
 ('he l', 1),
 ('e la', 1),
 (' laz', 1),
 ('lazy', 1),
 ('azy ', 1),
 ('zy d', 1),
 ('y do', 1),
 (' dog', 1),
 ('dog ', 1)]
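
For the exercise, only the most frequent quadgrams are needed. One possible way to get them is to pass a limit to `Counter.most_common`. The sketch below assumes the text has already been cleaned; `top_n_grams` is a hypothetical helper name and `dutch` is the opening of the Dutch preamble quoted earlier, lowercased and without punctuation.

```python
from collections import Counter


def top_n_grams(text, n=4, k=10):
    """Return the k most frequent character n-grams of a cleaned string."""
    n_grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(n_grams).most_common(k)


# Opening of the Dutch preamble, lowercased and without punctuation
dutch = (
    "overwegende dat erkenning van de inherente waardigheid en van de "
    "gelijke en onvervreemdbare rechten "
)

top = top_n_grams(dutch)
for n_gram, count in top:
    print(repr(n_gram), count)
```

Running the same helper on the cleaned text of several languages gives the per-language top-10 lists to compare.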