General Corpus study sites

Concordancing tools

  • Conc
    a Mac concordance program.
  • MonoConc
    a Mac/Windows concordance program that allows sorts (2R,1R,2L,1L) and provides simple frequency information.
  • OCP: The Oxford Concordance Program.
  • ParaConc
    a Mac/Windows concordance program for parallel texts. A version is available for free for research purposes (under license).
  • SARA (SGML-Aware Retrieval Application)
    MS-Windows-based concordance and word-frequency package. Especially set up for BNC.
  • TACT (Text Analysis Computing Tools)
    MS-DOS programs "designed to do text-retrieval and analysis on literary works".
  • WordSmith Tools
    Easy-to-use MS-Windows programs for generating word frequency listings and concordances.
  • MicroConcord
    produced by Oxford University Press. (Demo via ftp.)
  • XKwic
    Fast Concordance Program for X-Windows. University of Stuttgart project "Textual Corpora and Tools for their exploration".
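As an illustration of what the concordancers above produce, here is a minimal Python sketch of a KWIC (key word in context) listing with a positional sort such as 1R or 1L, as mentioned in the MonoConc entry. It is not the code or interface of any listed program; the function names `kwic` and `sort_hits` and the sample text are invented for this example, and tokenisation is deliberately naive.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) token lists for each hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits, position="1R"):
    """Sort lines by the word n places right (nR) or left (nL) of the node."""
    n, side = int(position[:-1]), position[-1].upper()

    def key(hit):
        left, _, right = hit
        # Reverse the left context so index 0 is the word nearest the node.
        ctx = right if side == "R" else left[::-1]
        return ctx[n - 1] if len(ctx) >= n else ""

    return sorted(hits, key=key)

# Invented sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat under the table. A bird sat beside the cat."
lines = sort_hits(kwic(sample, "sat"), "1R")
for left, node, right in lines:
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Sorting on 1R groups the lines by the word immediately following the search term, which is how recurring right-hand patterns become visible in a concordance.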

Tagging & Parsing (available software or on-line tagging)

  • Amalgam Tagger (Univ. of Leeds)
    Enter text via e-mail and have it tagged. Choice of eight tagging schemes.
  • The AMAZON parser AutoMAtische ZinsONtleding (automatic analysis of sentences). Dutch only.
  • Apple Pie Parser
    "bottom-up probabilistic chart parser which finds the parse tree with the best score by best-first search algorithm." (Available by ftp)
  • Brill: Trainable Part of Speech Tagger
    Rule-based part of speech tagger (available by ftp).
  • Conexor linguistic software developed in Finland.
    "based on linguistic generalisations and rules rather than linguistically naive corpus probabilities"
  • Dependency Parser of English.
    Parse English text on-line (WWW demo).
  • EngCG Parser
    Constraint Grammar Parser of English. (Use online)
  • EngCG tagger
    Constraint Grammar tagging of English.
  • EngCG-2 tagger
    Newer version of EngCG tagger (Constraint Grammar tagging of English). WWW demo
  • EngLite parser
    "a fast, light parser that assigns word class and shallow syntactic tags to words in English texts" (Conexor) WWW demo
  • Ergo Linguistic Technology Parser
    Online demo.
  • FDG Functional Dependency Grammar of English
    "builds functionally labelled dependency links between words and assigns morphosyntactic tags to words" (Conexor) WWW demo
  • Georgetown University Natural Language Processing Parser Modularity Demo
  • Link Grammar Parser
    Syntactic parsing of English. Use on-line or download (MS Windows or Unix).
  • MBT Memory Based Tagging.
    Demo. Tags text in Dutch, English, or Spanish online.
  • TOSCA/LOB tagger for English.
    Available (MS-DOS) by ftp.
  • QTAG Part of speech tagger
    Java/C from Birmingham U. (Oliver Mason).
  • XRCE Part of Speech Disambiguators (Xerox Research Centre)
    Tag text online in French, Dutch, English, German, Spanish, Portuguese, Italian. 

Online tools

  • VIEW (Variation in English Words and Phrases)
    http://view.byu.edu/ Publicly accessible web interface by Mark Davies for searching the BNC. Various search options.
  • PIE (Phrases in English)
    http://pie.usna.edu/ Publicly accessible web interface by William H. Fletcher for searching the BNC. Allows the search of word, part-of-speech, or character n-grams as well as phrase frames.
  • The Sketch Engine
    http://www.sketchengine.co.uk/ The Sketch Engine by Adam Kilgarriff and Pavel Rychly is a corpus search engine incorporating word sketches, grammatical relations, and a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour. Free demo account after registration.

APIs and frameworks

  • Annotation Graph Toolkit (AGTK)
    http://agtk.sourceforge.net/ Free software library in C++ (Java port available) for the processing of annotation graphs. Annotation Graphs are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.
  • Atlas (Architecture and Tools for Linguistic Analysis Systems)
    http://www.nist.gov/speech/atlas/ Software library in Java for the processing of annotation graphs. Atlas provides a data model, a storage format, and an API.
  • LT XML
    http://www.ltg.ed.ac.uk/software/xml/ Free software library in C for the processing of XML documents, including searching and extracting, down-translation (e.g. report generation, formatting), tokenising and sorting.
  • NITE XML Toolkit (NXT)
    http://www.ltg.ed.ac.uk/NITE/ Software library in Java for developing tailored end user corpus tools, especially for highly structured and/or cross-annotated multimodal corpora. NXT provides a data model, a storage format, and API support for handling data, querying it, and building graphical user interfaces.

Corpus creation tools

  • CLaRK
    http://www.bultreebank.org/clark/ An XML-based system for corpora development
  • GATE - General Architecture for Text Engineering
    http://gate.ac.uk/ GATE is a modular system for the linguistic processing of texts. It comprises an architecture, library framework and graphical development environment. Plugins can be used to build an application for a particular annotation task. GATE is freely available under GNU Library General Public License (LGPL 2.0) and can be downloaded after a registration. It is implemented in Java, and thus available for all major platforms.
  • SPre - configurable pre-processor
    SPre is a program for segmenting and annotating texts of arbitrary formats. The algorithms for the segmentation are relatively freely configurable via an XML file. Other annotators can be integrated. SPre is published as a plugin for GATE. SPre is implemented in Java, and thus available on all major platforms.
  • jTokeniser
    http://www.andy-roberts.net/software/jTokeniser/ Program and API for tokenising natural language text strings. Various tokenisers are provided for the segmentation of sentences into words and texts into sentences. Written in Java, hence available on all major platforms. Free Software (LGPL).
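The two segmentation steps described for jTokeniser (texts into sentences, sentences into words) can be sketched in a few lines of Python. This is only a rough regular-expression approximation under simplifying assumptions, not jTokeniser's API or algorithm; the function names `split_sentences` and `tokenise` are made up for this example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenise(sentence):
    """Words (keeping internal apostrophes and hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

# Invented sample text, for illustration only.
text = "Corpora are large. Don't tokenise by hand! Is it well-formed?"
for sentence in split_sentences(text):
    print(tokenise(sentence))
```

A splitter this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason dedicated tokenisers offer several strategies.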

Annotation tools

  • Alembic Workbench Project
    http://www.mitre.org/tech/alembic-workbench/ Tool for manual and automatic annotation of text corpora. Automatic annotation is achieved by a mixed approach: heuristics for information extraction can be manually composed or automatically induced. Available free of charge.
  • PALinkA: A Discourse Annotation Tool
    http://clg.wlv.ac.uk/projects/PALinkA/ An annotation program which allows a wide range of annotations. At present it has been used to annotate texts for anaphora resolution, centering, summarisation and marking certain features in texts.
  • TASX (Time Aligned Signal data eXchange) (currently down)
    http://medien.informatik.fh-fulda.de/tasxforce TASX provides an XML based annotation format, an annotation tool and a web based query system for multimodal corpora.
  • Annotate
    http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/annotate.html Annotate is a tool for efficient semi-automatic annotation of corpus data. It facilitates the generation of context-free structures and additionally allows crossing edges.
  • EXMARaLDA
    http://www.exmaralda.org/ EXMARaLDA (EXtensible MARkup Language for Discourse Annotation) provides an XML-based format and a variety of tools for discourse transcription and annotation. It is written in Java, and thus available for all major computer platforms.
  • Transcriber
    http://www.etca.fr/CTA/gip/Projets/Transcriber/ Transcriber is a tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical user interface for segmenting long duration speech recordings, transcribing them, and labeling speech turns, topic changes and acoustic conditions. It is more specifically designed for the annotation of broadcast news recordings, for creating corpora used in the development of automatic broadcast news transcription systems, but its features might be found useful in other areas of speech research.
  • Anvil
    http://www.dfki.de/~kipp/anvil/ Anvil is a free video annotation tool.
  • MMAX
    http://mmax.eml-research.de A tool for multi-modal annotation in XML


Corpus analysis tools

  • IMS Open Corpus Workbench (CWB)
    http://cwb.sourceforge.net/ The IMS Open Corpus Workbench (formerly IMS Corpus Workbench) is a set of tools for full-text retrieval from text corpora. The Corpus Query Processor (CQP) is a powerful corpus search tool supporting regular expressions, match conditions on all annotation levels and collocation analysis. Research and evaluation licences are available free of charge.
  • WordSmith Tools
    http://www.lexically.net/wordsmith/ Commercial set of tools to explore the behaviour of words in texts. It provides a tool for generating lists of all words or word-clusters in a text, a concordancer to see a word in its context, and a tool for identifying key words in a text. Demo mode available (restricted functional range).
  • AntConc
    http://www.antlab.sci.waseda.ac.jp/software.html Freeware concordance software; produces, among other things, KWIC (key word in context) concordances, word clusters, n-grams and word frequencies.
  • TextSTAT - Simple Text Analysis Tool
    http://neon.niederlandistik.fu-berlin.de/en/textstat/ Open source concordance software; produces, among other things, KWIC (key word in context) concordances, word clusters, n-grams, word frequencies and retrograde/reverse sorting.
  • QLDB - Querying Linguistic Databases
    http://www.ldc.upenn.edu/Projects/QLDB/ Project about data models and query languages for linguistic databases.
  • An On-Line Repository of Association Measures
    http://www.collocations.de/AM/ Statistical association measures, applied to cooccurrence frequency data collected in a contingency table, are the most widely used tool for the analysis of word combinations and the extraction of collocations from text corpora.
  • The UCS Toolkit (version 0.3)
    http://www.collocations.de/ The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data.
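To make the idea of association measures over a contingency table concrete, the following Python sketch computes two widely used scores, pointwise mutual information (MI) and the log-likelihood ratio (G2), from the four observed cell counts of a 2x2 table. It is a hedged illustration only and not code from the UCS toolkit; the function name `association` and the example counts are invented.

```python
import math

def association(o11, o12, o21, o22):
    """Association scores for a word pair from a 2x2 contingency table.

    o11: pair (w1, w2) co-occurs     o12: w1 without w2
    o21: w2 without w1               o22: neither word
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22          # row totals
    c1, c2 = o11 + o21, o12 + o22          # column totals
    observed = [o11, o12, o21, o22]
    # Expected counts under independence: row total * column total / N.
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    mi = math.log2(o11 / expected[0]) if o11 > 0 else float("-inf")
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return {"expected_o11": expected[0], "MI": mi, "log_likelihood": g2}

# Invented counts: the pair occurs 30 times in 1000 co-occurrence slots.
scores = association(30, 70, 70, 830)
print(scores)
```

When the observed counts match the expected counts exactly, both scores are zero; larger values indicate that the pair co-occurs more often than independence would predict.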
     This page offers information about some common corpus tools and links to resources on the web. 

    Online search in corpora


    This section links to corpora that can be freely searched online. Each of them comes with its own search engine/interface and with different features. Some of the websites offer search in more than one corpus.
    NB: This section focusses on the features available online. The corpora themselves (e.g. Bank of English, British National Corpus, Brown Corpus) are briefly described in the English corpora section.
    • Bank of English sampler – search in a 56 million word subset of the Bank of English:
      - Search by word, phrase, wildcard, part of speech or a combination of these.
      - KWIC concordances of variable length (concordance output restricted to 40 lines).
      - Collocation sampler to retrieve a word's most significant collocates.
    • British National Corpus (BNC) - sample search in the BNC at the BNC website:
      - Search by word, phrase, wildcard, part of speech or a combination of these.
      - Sentence concordances (output restricted to 50 samples).
      Also available for the BNC:
    • PIE (Phrases in English) – web interface based on BNC phrases, by W.H. Fletcher:
      - Search for frequently co-occurring word sequences of 2 to 8 words in length (word clusters).
      - Search all clusters of a particular length or clusters containing a particular word, phrase or part of speech.
      - Cluster lists with frequency statistics, and KWIC concordances of the clusters.
    • VIEW (Variation in English Words and Phrases) – web interface for the BNC, by M. Davies:
      - Search by word, phrase, wildcard, part of speech or a combination of these.
      - Search in the entire corpus as well as genre-specific searches.
      - Frequency statistics, collocates and KWIC concordances.
      - Compare quasi-synonyms or other related words and their collocates.
    • Business Letter Corpus – search in business letters and some other texts, by S. Yasumasa:
      - Search by word, phrase or wildcard.
      - KWIC concordances of variable length.
    • Compleat Lexical Tutor ('corpus-based concordance' section) - search in a range of corpora, in particular Brown Corpus and a 2 million word subset of the BNC as well as a range of smaller corpora:
      - Search by word, phrase or wildcard.
      - KWIC concordances of variable length, collocate frequencies.
      - Gapped KWIC concordances as a basis for exercises.
    • Corpuseye – search in different types of corpora, especially The Wikipedia as a corpus:
      - Search by words or phrases.
      - KWIC concordances, collocate frequency.
      - Morphosyntactic analysis of concordance lines.
    • Edict Virtual Language Centre Web Concordancer – search in a range of corpora, especially Brown Corpus, LOB as well as literary and other texts (The Times, Hitchhiker's Guide to the Galaxy, King James Bible, Starr Report)
      - Search by word, phrase or wildcard
      - KWIC concordances of variable length, collocate frequencies, sentence concordances
      - Gapped KWIC concordances as a basis for exercises
      - Collocational frameworks
    • ELISA - English Language Interview Corpus as a Second-Language Application - a small audiovisual corpus of spoken English developed with pedagogical goals:
      - Easy access to full interview text and videos
      - Browse corpus by topic index
      - Online concordancer (KWIC and sentence format, search by word, phrase or wildcard)
      - Ready-made concordance of all words in the whole corpus and in each interview
      - Ready-made frequency lists for the whole corpus and each interview
    • MICASE - Michigan Corpus of Academic Spoken English - search according to a range of criteria:
      - Browse according to specified speaker and speech event attributes (file references)
      - Search by word or phrase in specified contexts (KWIC concordances)
    • WebCorp – search in the entire Web as the corpus (basis: Google)
      - Search by word, phrase or wildcard
      - KWIC concordances, word lists, some good advanced features
      - Disadvantage: not language-specific  
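The word cluster (n-gram) searches offered by several of the interfaces above can be approximated offline with a short Python sketch. This is an illustration under simplifying assumptions (lower-casing, word characters only), not the method used by PIE or any other listed site; the function name `clusters` and its parameters are invented for this example.

```python
from collections import Counter
import re

def clusters(text, n=2, min_freq=2, containing=None):
    """Return (cluster, frequency) pairs for word n-grams seen at least min_freq times.

    If `containing` is given, keep only clusters that include that word.
    """
    tokens = re.findall(r"\w+", text.lower())
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(gram), freq) for gram, freq in grams.most_common()
            if freq >= min_freq and (containing is None or containing in gram)]

# Invented sample text, for illustration only.
sample = "at the end of the day, at the end of the week, by the end of the year"
for cluster, freq in clusters(sample, n=3):
    print(freq, cluster)
```

Passing `containing="at"` restricts the list to clusters with a particular word, mirroring the "clusters containing a particular word" search option.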

    Online full-text search in books


    Text and media archives


    The archives listed below offer a variety of texts and smaller corpora for download. To search them with corpus analysis methods, you will normally need an offline text/corpus analysis tool, i.e. a concordancer. Alternatively, you may be able to carry out some simple analyses with online text analysis tools.
    • American Rhetoric project – media archive
      More than 5000 full text, audio and (streaming) video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events.
    • Internet Archive - media archive
      A digital library of Internet sites and other cultural artifacts in digital form (text, audio, video).
    • Literary Web Concordances – literary texts
      Free online search (concordances and a range of interesting features).
    • Online Books Page (University of Pennsylvania) – literary texts
      Free access to texts in different formats (meta search in a number of archives).
    • Oxford Text Archive – literary texts
      Free download as well as online search (concordances), wide variety of languages.
    • Project Gutenberg –  literary texts
      Free download (e.g. complete works of Shakespeare).
    • State of the Union Archive - media archive
      All State of the Union addresses, provided by c-span.org (transcripts, and since 1989 video clips as well).
    • University of Virginia eBook Library – literary texts
      Approx. 2,000 literary texts in html format.

    Online text/corpus analysis tools


    This section lists a selection of simple text analysis tools that can be used online, i.e. without installation. These tools allow you to create e.g. concordances, wordlists, text profiles from your own texts or from web pages of your choice. 
    • Compleat Lexical Tutor ('text-based concordances' section) - analyse your own text:
      - KWIC concordance for each word in the text.
      - See also 'phrase extractor' section to build concordance with word clusters.
    • Edict Virtual Language Centre ('Word Frequency Text Profiler' section) - analyse your own text:
      - Compares the text against well-known word lists (1000/2000 most frequent English words and others).
      - Highlights words of different frequency bands in different colours.
      - See also 'Unique Words Text Profiler' (finds all words which occur only once in a text).
    • Spaceless – analyse a text or web page of your choice:
      - Returns a variety of word lists.
    • TurboLingo - analyse a text or web page of your choice:
      - KWIC concordance for all words in the text/web page
      - Frequency lists and other features
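The kinds of output these online tools return (frequency lists, words occurring only once, and a profile against frequency bands) can be sketched in Python as follows. This is a rough illustration, not the implementation of any listed site; the function names are invented, and the one-band word list below is a toy stand-in for real lists such as the 1000 or 2000 most frequent English words.

```python
from collections import Counter
import re

def frequency_list(text):
    """Count lower-cased word forms."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def hapaxes(freq):
    """Words that occur exactly once in the text."""
    return sorted(word for word, count in freq.items() if count == 1)

def profile(freq, bands):
    """Token counts per frequency band; unlisted words are counted as 'off-list'.

    bands: dict mapping a band name to a set of words (hypothetical lists).
    """
    totals = Counter()
    for word, count in freq.items():
        band = next((name for name, words in bands.items() if word in words), "off-list")
        totals[band] += count
    return dict(totals)

# Invented sample text and toy band list, for illustration only.
text = "The cat and the dog saw the corpus"
bands = {"band1": {"the", "and", "cat", "dog", "saw"}}
freq = frequency_list(text)
print(freq.most_common(3))
print(hapaxes(freq))
print(profile(freq, bands))
```

With real band lists, the share of off-list tokens gives a quick indication of how much low-frequency vocabulary a text contains.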

    Offline text/corpus analysis tools


    This section lists software packages that are commonly referred to as concordancers. They provide a more comprehensive range than the online analysis tools listed above (usually creation of concordances, alphabetical and frequency word lists, comparison of word lists and other statistical functions). Most packages can be freely downloaded but require installation. 
    • AntConc - free; by L. Anthony
      - For Windows and Linux.
      - Reads text, html, and xml files.
      - Main functions: concordances, citation of search term in its co-text, collocates, word clusters, frequency lists, text profiling through keyword lists.
    • ConcApp - free; by C. Greaves
      - For Windows.
      - Main functions: concordances, collocate search, frequency lists.
    • Concordance - by R.J.C. Watts
      - For Windows.
      - Creates a complete concordance for each word in a corpus and supports
        its publication as a web concordance.
      - Other functions: individual concordances, citation of search term in its co-text,
        frequency lists, text profiling through keyword lists, and a range of other statistical functions.
    • KwicFinder - by W.H. Fletcher
      - For Windows.
      - Different from the other packages in that it focusses on the analysis of web pages.
    • MonoConc Pro - by Michael Barlow/Athelstan.
      - For Windows.
      - Very comprehensive package.
    • Simple Concordance Program - free; by A. Reed
      - For Windows and Mac.
      - Main functions: concordances, citation of search term in context, frequency lists.
    • TextSTAT - free; by M. Huening
      - For Windows, Linux and Mac.
      - Reads text, html, Word and Open Office files.
      - Web spider facility for corpus creation directly from Internet sources.
      - Main functions: concordances, citation of search term in context, frequency lists.
    • Wordsmith Tools - by Mike Scott
      - For Windows.
      - Very comprehensive package. 

    Further resources 


    This section focusses on corpus-related resources for the learning and teaching context. 

    Corpus linguistics websites 


    The following websites include resources and link collections generally related to corpus linguistics.

