EngLangBlog: corpus linguistics


Monday, September 10, 2018

Getting involved in some research

Welcome back to all of you picking up your A level English Language studies from last year and hello to all new students of the best A level course*. There will be a few new posts this term about different areas of the course and lots of stuff on the @EngLangBlog Twitter account, but this post is about a project that linguists at Lancaster University need your help with.

Researchers there are currently putting together a 100-million-word corpus of written language. A corpus is basically a well-organised database of language that provides a body of material to be explored and analysed in different ways later on. Because the ways we write and the devices we use to write on and with have changed over time, linguists need examples of electronic language from actual, real-life users of it to build up a better picture of what's happening. Which is where you come in.

Extracts of emails, online conversations, WhatsApp messages and the like are all needed, so if you can help feed data into this project, have a look at the instructions here and take part. If you are a teacher, you might also want to build some of this into your work on language change and technology.

Once the data has been collected and analysed, the linguists at Lancaster will be writing an article for the English and Media Centre's emagazine, which will take a look at how the corpus is tracking changing language and what the data tells us about the directions the English language is taking.

(*true)

Thursday, November 15, 2012

Tracking tweets

Twitter is proving to be a fantastic resource for linguists and you can see several examples of what's been done with it in these previous blog posts.

More recently, researchers at UCL have been looking at what tweets in London reveal about patterns of language use. The people looking at it, Ed Manley and James Cheshire, aren't linguists but an engineer and a spatial analysis lecturer (eh?), so they're more interested in where things happen than in exploring why, which might be more interesting for English Language students. Still, their work raises good questions about why certain languages don't appear as often as might be expected - Bengali and Somali, for example - while others do make an appearance, Haitian Creole, Basque and Swahili among them.

But what might be of even greater interest to A level English Language students is how Twitter is proving to be a source of fantastic data on gender and language, and what we might term communities of practice: groups of people that you "do" language with in your various day to day activities and whose language styles influence your own.

This excellent article by Ben Zimmer in the Boston Globe gives you a clear introduction to what Twitter is offering linguists, and you can see more of the work of Tyler Schnoebelen and his colleagues in this PowerPoint of their presentation to #nwav41 (Warning! Contains advanced statistics to boggle the mind).

In other Twitter-related research, this link to a paper by Rebecca Maybaum gives you a glimpse of how Twitter can be used to track evolving slang: in this case, the words used to describe people on Twitter. Tweeps? Twiends? Tweethearts?

Wednesday, March 30, 2011

My Little Pony must die

The language used to target young consumers (or children as we used to call them) is often designed to appeal to their developing sense of gender identity, and some would argue that many ads manipulate that identity to encourage boys and girls into thinking that certain toys and games are only for the other gender. I've seen it happen with my own kids who've left the protective cocoon of CBeebies and CBBC and entered the commercial wasteland that is ITV2 and the Cartoon Network.

This brilliantly simple piece of corpus analysis using Wordle (and flagged up by an anonymous person on the English Language list today) takes the language of TV ads and represents it in word clouds. The results based on gender are really striking:


boys' toys

girls' toys

The effects of this kind of polarised language are harder to gauge perhaps, but it's pretty stunning that in the 21st Century kids are still being sold a line that fighting is what boys do and love is what girls do. This pressure group - Pink Stinks - has done some good work in challenging gender stereotypes around children's toys and clothes, and is well worth a look.

For A level Language Investigations (AS ENGA2 projects into representation for the AQA A spec or ENGB4 for the A2 investigation in AQA B) this sort of work is ideal.

Thursday, January 27, 2011

Word clouds and sputnik moments

The US President's State of the Union address always attracts a fair bit of media analysis, but in recent years the analysis has taken on an apparently more linguistic angle. Thanks to cool tools like Wordle and Concordle, which allow you to paste in text and then create word clouds based on lexical frequency (see the graphic above for a word cloud of this blog post), many commentators have started to argue that key themes can be discerned, and that from these patterns judgements can be made about the president's concerns.
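If you're curious about what these tools are doing under the bonnet, here is a minimal Python sketch of the counting step. It is not how Wordle or Concordle actually work (I don't know their internals); the sample sentence, the tiny stopword list and the function name are all made up for illustration.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list (an assumption; real tools use much longer ones).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "we", "our"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each word occurs, ignoring case and common function words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Invented sample text, not a quotation from any real speech.
sample = "We do big things. The future is ours to win, and we win the future together."
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
print(freqs.most_common(3))
```

A word cloud is essentially this table of counts with the font size scaled to each number, which is why very common grammatical words are usually filtered out first.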

This link from the BBC news site gives us a chance to examine the relative frequency of 10 words over the last 220 years. Meanwhile, this piece from the BBC last year takes a look at the frequency of particular words in British political parties' election manifestos from 1945 to 2010.
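Comparing texts of different lengths only makes sense if you use relative rather than raw frequency. As a rough sketch (the two "speeches" below are invented strings, and `per_thousand` is just a name I've picked), you can scale a count to occurrences per 1,000 words:

```python
import re

def per_thousand(word, text):
    """Occurrences of `word` per 1,000 words of `text` (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1000 * words.count(word.lower()) / len(words)

# Two invented texts of very different lengths.
short_speech = "jobs and growth and jobs"        # 5 words, 'jobs' appears twice
long_speech = "jobs " + "and so on " * 33        # 100 words, 'jobs' appears once

print(per_thousand("jobs", short_speech))
print(per_thousand("jobs", long_speech))
```

The raw counts (two against one) look similar, but the relative frequencies are wildly different, which is the figure you need when comparing a short text with a long one.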

On the surface this sort of analysis makes sense, but as linguists point out, just because you use a word many times doesn't necessarily mean that that specific word reveals a great deal about you. For example, I really dislike Michael Gove, the Education Secretary. The reasons are numerous and varied, ranging from a knee-jerk dislike of Gove's posh Tory background, through to Gove's dogmatic belief that all schools should become "free" schools or academies, Gove's scrapping of the EMA and his bizarre insistence that state schools' problems can be solved by frisking teenagers for porn and putting ex-soldiers in classrooms. But it doesn't stop there: I also have an (admittedly childish) dislike of his resemblance to Dobby from Harry Potter. Now, if you were to create a word cloud of this post, I'm sure the words Michael and Gove would crop up quite frequently, leading some to suggest that this post is all about Michael Gove. It's not. It's about language, not about Gove at all. But presumably, you can see my point by now...

This blog entry by the Lousy Linguist puts it more analytically (and more sensibly), so is well worth a read.

Arguably, one of the most telling moments in President Obama's speech this week was his use of the phrase Sputnik moment. It draws a parallel between the US's economic and technological status in the world now (and the need to reinvent the USA's research and development programmes) and the moment in the last century when the Soviet Union gave the USA a nasty shock by successfully launching the world's first satellite. However, he used the word sputnik only twice in his address, so such an interesting image is unlikely to be identified in a basic crunching of the data.

So, in short, word clouds and simple crunching tools such as those mentioned above can be really helpful for carrying out what corpus linguists like to call "quick and dirty" breakdowns of word frequencies in texts, but we always need to be aware of context and meaning if we're to avoid drawing unhelpful conclusions from the data. In many ways, a quick corpus analysis using something like Wordle can be a brilliant way of opening up new ways into a text, be it a speech, a poem or an extract from a novel, but it's not a substitute for detailed analysis.
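One way of putting context back in is a concordance, which lists each occurrence of a word with a few words either side of it. The sketch below is a bare-bones version of that idea, assuming a made-up sample sentence and a hypothetical `concordance` function rather than any particular corpus tool:

```python
import re

def concordance(text, keyword, width=3):
    """Return each occurrence of `keyword` with `width` words of context either side."""
    words = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

# Invented sample text, not a quotation from the address.
sample = "The launch came as a shock. Some called it a sputnik moment for research funding."

for line in concordance(sample, "sputnik"):
    print(line)
```

Even a word that appears only once or twice shows up here with its surroundings, so you can judge how it is being used rather than just how often.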

Tuesday, March 16, 2010

Using the internet to research new words

Here's a very neat website called WebCorp that allows you to search for words across the internet to see where and when they have appeared. There are probably loads of language uses for this, but one that springs to mind is to see how long ago new words might have first appeared in print for work on ENGA3 Language Change.

So, you could try a search for recent faves such as staycation, frenemy, moobs, credit crunch and gas sipper... go on, give it a go. Then do what everyone else does and search for your own name and some rude words.
