Phonological CorpusTools

Take the stress out of corpus analysis!

About Phonological CorpusTools

There is an ever-increasing interest in exploring the roles of frequency and usage in understanding phonological phenomena (e.g., Bybee 2001, Ernestus 2011, Frisch 2011, Archangeli & Pulleyblank 2013, Hume et al. to appear). Corpora of language give us a way of making generalizations across wide swaths of such usage, exploring patterns in under-documented languages, and creating balanced stimuli in experiments, but:

Phonological CorpusTools (PCT) is our answer to these problems – a free, downloadable program with both a graphical and command-line interface, designed to be a search and analysis aid for dealing with questions of phonological interest in large corpora.

An overview article about the software is available:

Hall, Kathleen Currie, J. Scott Mackie, and Roger Yu-Hsiang Lo. (2019). Phonological CorpusTools: Software for doing phonological analysis on transcribed corpora. International Journal of Corpus Linguistics 24(4). 522-535. https://doi.org/10.1075/ijcl.18009.hal

Various files related to PCT, including example corpus and feature files, are available at https://github.com/PhonologicalCorpusTools/PCT_Fileshare.

About us

We are a group of researchers in the Linguistics Department at the University of British Columbia. The PI on this project is Dr. Kathleen Currie Hall, and the project is supported in part by a SSHRC Insight Development grant to Dr. Hall.

What do we mean by a “corpus”?

Within PCT, a corpus generally refers to a structured list of words, each phonologically transcribed and accompanied by its token frequency of occurrence within some body of usage. It can contain other information as well – e.g., lexical class of the word, number of syllables, morphological parse, etc. – but it doesn’t have to. PCT can also be used to convert running text into a frequency-tagged corpus of this sort.
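To make the idea of a frequency-tagged corpus concrete, here is a minimal sketch in plain Python. It does not use PCT's own code or API; the function name `build_frequency_corpus`, the field names, and the toy transcriptions are all invented for illustration, and the tokenization is deliberately simplistic.

```python
from collections import Counter

def build_frequency_corpus(text, transcriptions):
    """Turn running text into a list of word entries with token frequencies.

    `transcriptions` is a hypothetical lookup from spelling to a list of
    segments; words missing from it get a transcription of None.
    """
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    counts = Counter(t for t in tokens if t)
    return [
        {"spelling": word,
         "transcription": transcriptions.get(word),
         "frequency": freq}
        for word, freq in sorted(counts.items())
    ]

# Toy lexicon; the transcription symbols are illustrative only.
toy_lexicon = {
    "the": ["D", "@"],
    "cat": ["k", "a", "t"],
    "sat": ["s", "a", "t"],
}

corpus = build_frequency_corpus("The cat sat. The cat!", toy_lexicon)
for entry in corpus:
    print(entry)
```

Each resulting entry corresponds to one word type, with its transcription and the number of times it occurred in the text. Optional columns such as lexical class could be added as further keys.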

Functionality

Documentation

Please see the user’s manual of the latest version for complete documentation, currently available at http://corpustools.readthedocs.org/en/latest/. Documentation can also be found throughout the PCT software itself by clicking on “Help” (either in the main menu or in dialogue boxes relating to individual functions).

Information about how to cite PCT itself can be found at https://corpustools.readthedocs.io/en/latest/citing_pct.html.

Workshops / Tutorials

We’re delighted to have participated in various workshops / conferences:

Example files and tutorial handouts can be downloaded from: https://github.com/PhonologicalCorpusTools/PCT_Fileshare.

Standard installation (executable)

Windows

NOTE 1: This method requires that you are running a 64-bit version of Windows. You can check this in Control Panel -> System and Security -> System.

NOTE 2: When the software is downloaded, you may get a security warning indicating that you have tried to launch an unrecognized app. Selecting “Run anyway” should allow PCT to work as expected.

Download the latest version’s installer (should be a file ending in .exe) from the Phonological CorpusTools page on GitHub (https://github.com/PhonologicalCorpusTools/CorpusTools/releases). Double-click this file to install PCT to your computer. It can then be run the same as any other program, via Start -> Programs.

Mac OS X (requires 10.8 or higher) – PCT v 1.4.1 is confirmed to work on 10.13 and higher, but may have issues on earlier OS platforms

NOTE 1: When the software is downloaded, you may get a security warning indicating that you have tried to launch an unrecognized app. If you Ctrl-click on the application and select “Open,” you should be able to override the security warning and use PCT normally.

Download the latest version’s installer (should be a file ending in .dmg) from the Phonological CorpusTools page on GitHub (https://github.com/PhonologicalCorpusTools/CorpusTools/releases). You can then double-click this file to run Phonological CorpusTools. You can move the icon to your toolbar like any other application.

Linux

There is currently no executable option available for Linux operating systems. Please use the fallback installation method below to install from source.

Fallback installation (setup.py)

Windows, Mac OS X, or Linux

Dependencies:

If you expect to use the acoustic similarity module, there are additional dependencies:

Download the latest version of the source code for Phonological CorpusTools from the GitHub page (https://github.com/PhonologicalCorpusTools/CorpusTools/releases). After expanding the file, you will find a file called ‘setup.py’ in the top level directory. Run it in one of the following ways:

  1. Double-click it. If this doesn’t work, access the file properties and ensure that you have permission to run the file; if not, grant yourself that permission. In Windows, this may require that you open the file in Administrator mode (also accessible through file properties). If your computer opens the .py file in a text editor rather than running it, you can access the file properties to set Python 3.x as the default program for running .py files. If the file is opened in IDLE (a Python editor), you can use the “Run” button in the IDLE interface to run the script instead.

  2. Open a terminal window and run the file. In Linux or Mac OS X, there should be a Terminal application pre-installed. In Windows, you may need to install Cygwin ( https://www.cygwin.com/ ). Once the terminal window is open, navigate to the top level CorpusTools folder, i.e., the one that has setup.py in it. (Use the command ‘cd’ to navigate your filesystem; Google “terminal change directory” for further instructions.) Once in the correct directory, run this command: “python3 setup.py install” (no quotes). You may lack proper permissions to run this file, in which case on Linux or Mac OS X you can instead run “sudo python3 setup.py install”. If Python 3.x is the only version of Python on your system, it may be possible or necessary to use the command “python” rather than “python3”.

Phonological CorpusTools should now be installed! Run it from a terminal window using the command “pct”. You can also open a “Run” dialogue and use the command “pct” there. In Windows, the Run tool is usually found in All Programs -> Accessories.
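If the “pct” command is not recognized after installation, a quick check along the following lines may help narrow down the problem. This is only a sketch using the Python standard library: the helper name `check_pct_install` is invented here, and the check simply looks for a “pct” executable on your PATH and confirms that the interpreter running the script is Python 3.x. It does not verify that PCT itself works.

```python
import shutil
import sys

def check_pct_install(command="pct"):
    """Return (is_python3, path_to_command_or_None)."""
    # The fallback installation expects a Python 3.x interpreter.
    python_ok = sys.version_info >= (3,)
    # shutil.which returns the full path of an executable found on PATH,
    # or None if no such executable exists.
    location = shutil.which(command)
    return python_ok, location

python_ok, location = check_pct_install()
print("Running under Python 3.x:", python_ok)
print("pct found at:", location or "not found on PATH")
```

If the command is reported as not found, the install step may have placed it in a directory that is not on your PATH, or the installation may not have completed.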

Versions

Get the latest version of PCT from: https://github.com/PhonologicalCorpusTools/CorpusTools/releases. See the release notes for each version at: https://corpustools.readthedocs.io/en/latest/release.html.

Multi-character sequences

Below, you can find lists of the multi-character sequences that are included in each of the built-in transcription-to-feature system files. You may want to use these in order to copy & paste them into the corpus creation dialogue box if you are using a corpus file that is not already delimited. See also https://github.com/PhonologicalCorpusTools/PCT_Fileshare for more detail on the transcription / feature files.

Buckeye: aa, aan, ae, aen, ah, ahn, ao, aon, aw, awn, ay, ayn, ch, dh, dx, eh, el, em, en, ey, eyn, hh, ih, ihn, iy, iyn, jh, ng, nx, ow, own, oy, oyn, sh, th, tq, uh, uhn, uw, uwn, zh, ehn, er, ern

CELEX: tS, dZ, aU, aI, eI, OI

CMU: TH, DH, CH, JH, SH, ZH, NG, HH, IY, IH, UH, EH, ER, AH, AH, AO, AE, AA, UW, AW, AY, EY, OW, OY, AH N, AH L

CPA: T/, J/, ^/, u:, A/, a/, e/, O/, o/, n,, l,

IPA: k̟x̟, ɡ̟ɣ̟, k̟͡x̟, ɡ̟͡ɣ̟, dʑ, tɕ, d͡ʑ, t͡ɕ, dʒ, dz, dɮ, d̠ɮ̠, tʃ, t̠ɬ̠, ts, tɬ, t̪s̪, t̪ɬ̪, d̪z̪, d̪ɮ̪, ʈʂ, ɖʐ, pf, bv, pɸ, bβ, t̪θ, d̪ð, cç, ɟʝ, kx, k̠x̠, ɡɣ, ɡ̠ɣ̠, qχ, ɢʁ, kp, ɡb, pt, bd, d͡ʒ, d͡z, d͡ɮ, d̠͡ɮ̠, t͡ʃ, t̠͡ɬ̠, t͡s, t͡ɬ, t̪͡s̪, t̪͡ɬ̪, d̪͡z̪, d̪͡ɮ̪, ʈ͡ʂ, ɖ͡ʐ, p͡f, b͡v, p͡ɸ, b͡β, t̪͡θ, d̪͡ð, c͡ç, ɟ͡ʝ, k͡x, k̠͡x̠, ɡ͡ɣ, ɡ̠͡ɣ̠, q͡χ, ɢ͡ʁ, k͡p, ɡ͡b, p͡t, b͡d, aʊ, aɪ, eɪ, oʊ, ɔɪ

SAMPA: I:, i:, U:, u:, E:, O:, Q:, tS, dZ, @U, aU, aI, eI, OI, n,, l,, 3`
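To illustrate why these lists matter for non-delimited transcriptions, here is a minimal sketch of greedy longest-match segmentation in plain Python. It is not PCT's actual parsing code; the function name `segment` and the example string are invented for illustration, and the CELEX list is copied from above.

```python
def segment(transcription, multichar):
    """Split an undelimited transcription into segments.

    At each position, the longest matching multi-character sequence is
    taken as one segment; otherwise a single character is taken.
    """
    # Sort longest first so that longer sequences win over their prefixes.
    by_length = sorted({s for s in multichar if s}, key=len, reverse=True)
    segments = []
    i = 0
    while i < len(transcription):
        for seq in by_length:
            if transcription.startswith(seq, i):
                segments.append(seq)
                i += len(seq)
                break
        else:
            # No multi-character sequence starts here: take one character.
            segments.append(transcription[i])
            i += 1
    return segments

CELEX_MULTICHAR = ["tS", "dZ", "aU", "aI", "eI", "OI"]

# Hypothetical undelimited transcription string.
print(segment("tSaIld", CELEX_MULTICHAR))  # ['tS', 'aI', 'l', 'd']
print(segment("tSaIld", []))               # ['t', 'S', 'a', 'I', 'l', 'd']
```

Without the multi-character list, the affricate and the diphthong are each split into two separate symbols, which is why supplying these sequences when creating a corpus from an undelimited file can matter.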

References