What is tagging? Just how much of tagging do we need? How much is enough? Are we doing it wrong?
Tagging is the process of identifying and marking segments in a language corpus, whether it is spoken or written.
Most corpora, like the Brown Corpus, have POS (part-of-speech) tagging. These corpora have only one purpose and they fulfil that purpose well. What I have in mind is multilayer tagging of corpora. This is accomplished by keeping the corpus and its annotations in separate layers. For instance, consider the following example:
This multilayer proposal can help us annotate and view different parts of corpora.
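One way to picture the separation of corpus and annotation layers is the minimal Python sketch below. It is only an illustration: the sentence, the layer names (`pos`, `phrase`), the tag labels, and the helper `annotations_at` are all hypothetical and are not taken from any existing corpus or tool. Each layer stores spans over token positions, so the text itself is never modified.

```python
# The corpus layer: plain tokens, never modified by annotation.
tokens = ["The", "dog", "barked"]

# Annotation layers (hypothetical names and labels). Each entry is
# (start, end, label), covering tokens[start:end].
layers = {
    "pos": [(0, 1, "DET"), (1, 2, "NOUN"), (2, 3, "VERB")],
    "phrase": [(0, 2, "NP"), (2, 3, "VP")],
}


def annotations_at(layers, index):
    """Collect the labels from every layer that cover token `index`."""
    return {
        name: [label for start, end, label in spans if start <= index < end]
        for name, spans in layers.items()
    }


# View all layers for the token "dog" (position 1).
print(tokens[1], annotations_at(layers, 1))
```

Because every layer only points at token positions, a new layer (prosody, for example) can be added or removed without touching the others.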
As you might know, I’m building the first online school for linguistics courses on the Teachable platform (Glot home). I’m working hard on courses like “statistical methods in phonology” and “LING101”. Both are completely free.
The general linguistics course is going to cover most aspects of linguistics and can serve as a general introduction to linguistics. I have recently finished building the curriculum of the course (despite the fact that I have my graduate university entrance exam in 2 months!). In general, I hope this course will be helpful to students who are looking for an introduction to linguistics. I will post more updates on the course.
2018 update:
Well, it was a good project, but it didn’t last long.
If you are willing to update this project, email me. [x]@[this_domain] where x = this_domain
If you are looking for a specific file to download but are tired of searching different sites for it, just add [parent directory] before the file’s name and BINGO!
PS. You can also add [index], [file size], or similar keywords, but [parent directory] works best.
When considering how frequency changes over a given time domain,
we often judge by looking at (or, hopefully, calculating) the
frequency change. But there are times when utterances are
prolonged. (Duration is mostly a function of tone, i.e. our tone
alters duration; for that reason most of my calculations do not
treat duration or time as an independent variable.) This increase
in the duration of the utterance does have a pragmatic effect, and
thus it is worthwhile to consider it when transcribing prosody.
ToBI is a system for transcribing prosodic features of speech.
ToBI is easy to learn, easy to share, and easy to use. These three
features have made ToBI a widely used system for transcribing
prosody. But ToBI has some drawbacks too. Since it is based on
the phonological level of speech, it does not capture phonetic
features of speech. I have some ideas for implementing phonetic
features of prosody in ToBI and, on top of that, adding frequency
change to the ToBI system.
I, therefore, decided to design a formula based on frequency time
change. The formula is:
Z is the time frequency function, on a scale from 1.0 to 4.0,
where Z from 1.0 to 2.0 is flagged as low, 2.0 to 3.0 as medium,
and 3.0 to 4.0 as high. F is the frequency in Hertz, scaled from
75 to 300. T is short for time on the FFT frequency display. Both
F and T are indicators of the change in the pitch track, and hence
they are calculated as F2 - F1 and T2 - T1. T is always positive,
but F can sometimes be negative, so I used the absolute value in
the formula to avoid F being negative.
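The two inputs described above can be computed from a pair of points on the pitch track. The following is a minimal sketch, not the code used for the plots: the function name `pitch_change` and its parameter names are hypothetical, time is assumed to be in seconds, and rejecting a non-positive T is an assumption that follows from T always being positive. It only prepares F and T; it does not compute Z.

```python
def pitch_change(f1_hz, t1, f2_hz, t2):
    """Return (F, T) for two points on a pitch track.

    F is the absolute frequency change |F2 - F1| in Hertz.
    T is the elapsed time T2 - T1 (assumed here to be in seconds).
    """
    delta_t = t2 - t1
    if delta_t <= 0:
        # Assumption: the second point must come after the first.
        raise ValueError("T2 must be later than T1")
    delta_f = abs(f2_hz - f1_hz)  # a falling pitch gives a negative F2 - F1
    return delta_f, delta_t


# Illustrative values only, not measurements from a recording:
# a fall from 180 Hz to 120 Hz over a quarter of a second.
print(pitch_change(180.0, 0.25, 120.0, 0.5))
```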
I have written Python code that plots the distribution of the
outputs along the Z axis. The X axis is T and the Y axis is F.
The Python code is:
The proposed symbols for transcribing this scale in ToBI are
commas, semicolons, and colons:
a) because they are more intuitive
b) because they are easily read.
[comma] for low change, [semicolon] for medium change,
and [colon] for high change. Examples are:
(a) L+H:% (Indicating high frequency change in the given
time domain)
(b) L+H;* (Indicating medium frequency change in the given
time domain)
(c) H,* (Indicating low frequency change in the given time
domain)
Consider the following example:
This is a recording from Kurmanji Kurdish.
The values for F and T are:
Plugging the above values into the formula, we get:
2.80 in our Time Frequency Function (TFF) is considered a
medium change, so we transcribe it as: H+L;%
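The mapping from a Z value to a punctuation mark, and the placement of that mark in a ToBI label, can be sketched in Python as follows. This is a rough illustration under stated assumptions: the function names `change_symbol` and `annotate` are hypothetical, a value that falls exactly on a boundary (2.0 or 3.0) is assigned to the higher band, and a label that does not end in `*` or `%` simply gets the mark appended. None of these choices is fixed by the proposal itself.

```python
def change_symbol(z):
    """Map a Z value on the 1.0-4.0 scale to a punctuation mark."""
    if not 1.0 <= z <= 4.0:
        raise ValueError("Z must lie between 1.0 and 4.0")
    if z < 2.0:
        return ","  # low change
    if z < 3.0:     # assumption: exactly 2.0 counts as medium
        return ";"  # medium change
    return ":"      # high change (assumption: exactly 3.0 counts as high)


def annotate(label, z):
    """Insert the change mark before a trailing '*' or '%' in a ToBI label."""
    symbol = change_symbol(z)
    if label and label[-1] in "*%":
        return label[:-1] + symbol + label[-1]
    # Assumption: labels without a trailing diacritic get the mark appended.
    return label + symbol


# The worked example above: Z = 2.80 on an H+L% label.
print(annotate("H+L%", 2.80))
```

Keeping the band thresholds in one small function makes it easy to adjust the boundaries later without touching the labelling step.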
Edit Feb 16: On second thought, I have observed that there are numerous problems with ToBI and AM, so I’m thinking of changing the framework. I’ll post later if I have another idea.