



Tools & websites

This page offers information about some common corpus tools and links to resources on the web. 


    Online search in corpora


    Online full-text search in books


    Text and media archives


    Online text/corpus analysis tools


    Offline text/corpus analysis tools (concordancers)


    Further resources


    Corpus linguistics websites

This section links to corpora that can be freely searched online. Each comes with its own search interface and its own set of features. Some of the websites offer search in more than one corpus.

NB: This section focusses on the features available online. The corpora themselves (e.g. Bank of English, British National Corpus, Brown Corpus) are briefly described in the English corpora section.

  • Bank of English sampler – search in a 56 million word subset of the Bank of English:


    - Search by word, phrase, wildcard, part of speech or a combination of these.

    - KWIC concordances of variable length (concordance output restricted to 40 lines).

    - Collocation sampler to retrieve a word's most significant collocates.


  • British National Corpus (BNC) - sample search in the BNC at the BNC website: 


    - Search by word, phrase, wildcard, part of speech or a combination of these.

    - Sentence concordances (output restricted to 50 samples).


    Also available for the BNC:


  • PIE (Phrases in English) – web interface based on BNC phrases, by W.H. Fletcher: 


    - Search for frequently co-occurring word sequences of 2 to 8 words in length (word clusters).

    - Search all clusters of a particular length or clusters containing a particular word, phrase or part of speech.

    - Cluster lists with frequency statistics, and KWIC concordances of the clusters.


  • VIEW (Variation in English Words and Phrases) – web interface for the BNC, by M. Davies: 


    - Search by word, phrase, wildcard, part of speech or a combination of these.

    - Search in the entire corpus as well as genre-specific searches.

    - Frequency statistics, collocates and KWIC concordances.

    - Compare quasi-synonyms or other related words and their collocates.


  • Business Letter Corpus – search in business letters and some other texts, by S. Yasumasa:


    - Search by word, phrase or wildcard.

    - KWIC concordances of variable length.


  • Compleat Lexical Tutor ('corpus-based concordance' section) - search in a range of corpora, in particular the Brown Corpus and a 2 million word subset of the BNC, as well as several smaller corpora:


    - Search by word, phrase or wildcard.

    - KWIC concordances of variable length, collocate frequencies.

    - Gapped KWIC concordances as a basis for exercises.


  • Corpuseye – search in different types of corpora, especially Wikipedia as a corpus:


    - Search by words or phrases.

    - KWIC concordances, collocate frequency.

    - Morphosyntactic analysis of concordance lines.


  • Edict Virtual Language Centre Web Concordancer – search in a range of corpora, especially Brown Corpus, LOB as well as literary and other texts (The Times, Hitchhiker's Guide to the Galaxy, King James Bible, Starr Report)


    - Search by word, phrase or wildcard

    - KWIC concordances of variable length, collocate frequencies, sentence concordances

    - Gapped KWIC concordances as a basis for exercises

    - Collocational frameworks


  • ELISA - English Language Interview Corpus as a Second-Language Application - a small audiovisual corpus of spoken English developed with pedagogical goals:


    - Easy access to full interview text and videos

    - Browse corpus by topic index

    - Online concordancer (KWIC and sentence format, search by word, phrase or wildcard)

    - Ready-made concordance of all words in the whole corpus and in each interview

    - Ready-made frequency lists for the whole corpus and each interview


  • MICASE - Michigan Corpus of Academic Spoken English - search according to a range of criteria:


    - Browse according to specified speaker and speech event attributes (file references)

    - Search by word or phrase in specified contexts (KWIC concordances)


  • WebCorp – search in the entire Web as the corpus (basis: Google)


    - Search by word, phrase or wildcard

    - KWIC concordances, word lists, some good advanced features

    - Disadvantage: not language-specific  
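
To illustrate what the search features listed above do, here is a minimal Python sketch of a KWIC (key word in context) concordance with wildcard search and an optional gapped display for exercises. It is not the implementation used by any of the websites above; the function name `kwic` and its parameters are assumptions made for this example, and real corpus interfaces additionally support part-of-speech search, which this sketch does not.

```python
import re


def kwic(text, pattern, width=30, gapped=False):
    """Return KWIC lines for every match of `pattern` in `text`.

    `pattern` is a word or phrase; `*` stands for any run of word
    characters (a simple wildcard, assumed for this sketch).
    `width` is the number of context characters on each side.
    With `gapped=True` the search term is replaced by underscores.
    """
    # Escape the pattern, then turn the escaped wildcard into \w*.
    body = re.escape(pattern).replace(r"\*", r"\w*")
    regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)

    # Collapse line breaks and runs of spaces so context stays on one line.
    flat = " ".join(text.split())

    lines = []
    for match in regex.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        node = "_" * len(match.group()) if gapped else match.group()
        # Right-align the left context so the search terms line up.
        lines.append(f"{left:>{width}} {node} {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "She walked home. He walks to work and was walking all day."
    for line in kwic(sample, "walk*", width=20):
        print(line)
```

Calling `kwic(sample, "walk*", width=20, gapped=True)` returns the same lines with the search term blanked out, which corresponds to the gapped concordances that some of the sites offer as a basis for exercises.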



This section links to websites that offer full-text search inside books online.

  • Amazon Search inside the book


    - Search in books by word or phrase, and then browse relevant books online.


  • Google book search


    - Search in books by word or phrase, and then browse relevant books online.




The archives listed below offer a variety of texts and smaller corpora for download. To search them with corpus analysis methods, you will normally need an offline text/corpus analysis tool, i.e. a concordancer. Alternatively, you may be able to carry out some simple analyses with online text analysis tools.

  • American Rhetoric project – media archive


    More than 5000 full-text, audio and (streaming) video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews and other recorded media events.


  • Internet Archive - media archive


    A digital library of Internet sites and other cultural artifacts in digital form (text, audio, video).


  • Literary Web Concordances – literary texts


    Free online search (concordances and a range of interesting features).


  • Online Books Page (University of Pennsylvania) – literary texts 


    Free access to texts in different formats (meta search in a number of archives).


  • Oxford Text Archive – literary texts 


    Free download as well as online search (concordances), wide variety of languages.


  • Project Gutenberg –  literary texts 


    Free download (e.g. complete works of Shakespeare).


  • State of the Union Archive - media archive


    All State of the Union addresses, provided by c-span.org (transcripts, and since 1989 video clips as well).


  • University of Virginia eBook Library – literary texts


    Approx. 2,000 literary texts in html format.



This section lists a selection of simple text analysis tools that can be used online, i.e. without installation. These tools allow you to create, for example, concordances, word lists and text profiles from your own texts or from web pages of your choice.

  • Compleat Lexical Tutor ('text-based concordances' section) - analyse your own text:


    - KWIC concordance for each word in the text.

    - See also 'phrase extractor' section to build concordance with word clusters.


  • Edict Virtual Language Centre ('Word Frequency Text Profiler' section) - analyse your own text:


    - Compares the text against well-known word lists (1000/2000 most frequent English words and others).

    - Highlights words of different frequency bands in different colours.

    - See also 'Unique Words Text Profiler' (finds all words which occur only once in a text).


  • Spaceless – analyse a text or web page of your choice:


    - Returns a variety of word lists.


  • TurboLingo - analyse a text or web page of your choice:


    - KWIC concordance for all words in the text/web page

    - Frequency lists and other features
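
The kind of text profiling described above can be sketched in a few lines of Python. The example below sorts the words of a text into frequency bands and lists the words that occur only once. It is an illustration only, not the method used by any of the tools listed: the function names are hypothetical, and the two tiny band lists are invented stand-ins for real word lists such as the 1000/2000 most frequent English words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def profile(text, bands):
    """Assign each distinct word to the first band that contains it.

    `bands` maps a band name to a set of words. Words found in no
    band are collected under "off-list". Also returns the words that
    occur exactly once in the text.
    """
    counts = Counter(tokenize(text))
    result = {name: [] for name in bands}
    result["off-list"] = []
    for word in sorted(counts):
        for name, words in bands.items():
            if word in words:
                result[name].append(word)
                break
        else:
            # No band matched this word.
            result["off-list"].append(word)
    once_only = sorted(w for w, n in counts.items() if n == 1)
    return result, once_only


if __name__ == "__main__":
    # Invented miniature band lists, for demonstration only.
    demo_bands = {"K1": {"the", "cat", "sat", "on"}, "K2": {"mat"}}
    banded, unique = profile("The cat sat on the mat. The cat purred.", demo_bands)
    print(banded)
    print(unique)
```

An online profiler would typically display the bands in different colours; here they are simply returned as lists.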



This section lists software packages that are commonly referred to as concordancers. They provide a more comprehensive range of functions than the online analysis tools listed above (usually creation of concordances, alphabetical and frequency word lists, comparison of word lists and other statistical functions). Most packages can be freely downloaded but require installation.

  • AntConc - free; by L. Anthony


    - For Windows and Linux.

    - Reads text, html, and xml files.

    - Main functions: concordances, citation of search term in its co-text, collocates, word clusters, frequency lists, text profiling through keyword lists.


  • ConcApp - free; by C. Greaves 


    - For Windows.

    - Main functions: concordances, collocate search, frequency lists.


  • Concordance - by R.J.C. Watts 


    - For Windows.

    - Creates a complete concordance for each word in a corpus and supports its publication as a web concordance.

    - Other functions: individual concordances, citation of search term in its co-text, frequency lists, text profiling through keyword lists, and a range of other statistical functions.


  • KwicFinder - by W.H. Fletcher 


    - For Windows.

    - Different from the other packages in that it focusses on the analysis of web pages.


  • MonoConc Pro - by Michael Barlow/Athelstan.


    - For Windows.

    - Very comprehensive package.


  • Simple Concordance Program - free; by A. Reed


    - For Windows and Mac.

    - Main functions: concordances, citation of search term in context, frequency lists.


  • TextSTAT - free; by M. Huening


    - For Windows, Linux and Mac.

    - Reads text, html, Word and Open Office files.

    - Web spider facility for corpus creation directly from Internet sources.

    - Main functions: concordances, citation of search term in context, frequency lists.


  • Wordsmith Tools - by Mike Scott


    - For Windows.

    - Very comprehensive package. 
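
For readers who want to see what the main functions named above amount to, the following Python sketch computes a frequency list, word clusters and collocates for a short text. It is a simplified illustration under stated assumptions, not the algorithm of any package listed here: the function names are hypothetical, and the collocates are ranked by raw co-occurrence counts, whereas concordancers usually also offer statistical significance measures.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens, top=10):
    """Return the `top` most frequent words with their counts."""
    return Counter(tokens).most_common(top)


def clusters(tokens, n=2):
    """Count every sequence of `n` adjacent words (word clusters)."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def collocates(tokens, node, span=2):
    """Count the words occurring within `span` words of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of the node, excluding the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


if __name__ == "__main__":
    words = tokenize("The cat sat on the mat and the cat slept.")
    print(frequency_list(words, top=3))
    print(clusters(words, n=2).most_common(3))
    print(collocates(words, "cat", span=1).most_common())
```

Running the functions over the files of a downloaded text archive instead of a single sentence gives a rough idea of what a concordancer does when it builds its word lists.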



This section focusses on corpus-related resources for the learning and teaching context. 

  • ICT4LT resources: 


    Module on Using concordance programs in the modern foreign languages classroom

    by Marie-Noëlle Lamy and Hans Jørgen Klarskov Mortensen.

    Module on Corpus linguistics by Tony McEnery and Andrew Wilson.


  • Concordance and Corpora tutorial – Georgetown University Washington



  • McEnery & Wilson's Corpus Linguistics course - University of Lancaster



  • Tim Johns' data-driven learning page including the virtual DDL library with examples of concordance-based exercises



  • CLAWS tagset developed at the University of Lancaster



  • Frequency lists for the British National Corpus - Lancaster University



  • Wortschatz-Lexikon - University of Leipzig


    English and German online dictionary based on newspaper corpus, with frequency of occurrence, explanation, grammatical information and more


  • Chemnitz Internet Grammar - University of Chemnitz


    A corpus-based pedagogical grammar of English


  • Online Dictionary of Corpus Linguistics Terms


  • EUROCALL Special Interest Group on Corpora in CALL


The following websites include resources and link collections generally related to corpus linguistics. 

  • University of Lancaster corpus linguistics page


  • David Lee's bookmarks for corpus-based linguists


  • Michael Barlow's corpus linguistics page


  • Yvonne Breyer's Corpus Linguistics page

S.Braun (at) surrey.ac.uk


updated 03/06/06