TED Talks Download Subtitles
Date: January 5th, 2010 | Category: Internet
UPDATE: An online version of this tool is also available.
This is what I've been working on today: a simple console script that downloads subtitles for TED Talks. I couldn't find a way to download them directly from the site in a compatible format (I generally use '.srt' subtitles), so I wrote one in Python: TEDTalkSubtitles.py
Key parts of the program:
A simple function to convert a value in milliseconds into an SRT timestamp such as "00:34:32,334":

    def getFormatedTime(intvalue):
        # Split a duration in milliseconds into hours, minutes, seconds and milliseconds.
        mils = intvalue % 1000
        segs = (intvalue // 1000) % 60
        mins = (intvalue // 60000) % 60
        hors = intvalue // 3600000
        return "%02d:%02d:%02d,%03d" % (hors, mins, segs, mils)
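As a quick check of the arithmetic, here is a minimal Python 3 sketch of the same conversion written with divmod. The name format_srt_time is only for illustration and is not part of the script.

```python
def format_srt_time(ms):
    """Convert a non-negative integer number of milliseconds to HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)     # whole seconds, leftover milliseconds
    minutes, seconds = divmod(seconds, 60)  # whole minutes, leftover seconds
    hours, minutes = divmod(minutes, 60)    # whole hours, leftover minutes
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


print(format_srt_time(2072334))  # 00:34:32,334
```

2072334 ms is 2072 seconds plus 334 ms, and 2072 seconds is 34 minutes and 32 seconds, which gives the timestamp above.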
This recursive function collects the language codes available for the talk:

    import re

    def availableSubs(subs):
        a = subs.find("LanguageCode")
        if a == -1:
            return []
        subs = subs[a+len("LanguageCode"):]
        # The code is the next %22-quoted run that contains no capital letters.
        m = re.search("%22([^A-Z]+)%22", subs)
        if m is None:
            return []
        return [m.group(1)] + availableSubs(subs)
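The same scan can also be written as a loop, which avoids one recursive call per language. This is only a Python 3 sketch: the name available_subs_iter and the sample string are made up for illustration, and the sample simply assumes URL-encoded data with upper-case hex escapes (%22 for a quote, %3A for a colon).

```python
import re


def available_subs_iter(subs):
    """Collect every language code that follows a "LanguageCode" key."""
    marker = "LanguageCode"
    codes = []
    pos = subs.find(marker)
    while pos != -1:
        # Continue scanning just after the marker that was found.
        subs = subs[pos + len(marker):]
        match = re.search("%22([^A-Z]+)%22", subs)
        if match is None:
            break
        codes.append(match.group(1))
        pos = subs.find(marker)
    return codes


# Hypothetical URL-encoded fragment, not copied from a real page.
sample = ("%5B%7B%22LanguageCode%22%3A%22en%22%2C%22Name%22%3A%22English%22%7D"
          "%2C%7B%22LanguageCode%22%3A%22es%22%7D%5D")
print(available_subs_iter(sample))  # ['en', 'es']
```

The character class [^A-Z] is what skips the encoded colon: "%3A" contains a capital letter, so the first quoted run that can match is the language code itself.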
Get information about the video:

    import re
    import urllib

    def getVideoParameters(urldirection):
        # Fetch the talk page (Python 2 urllib).
        ht = urllib.urlopen(urldirection).read()
        # Grab the body of the flashVars object literal.
        var = re.search('flashVars = {\n([^}]+)}', ht)
        if var:
            var = var.group(1)
        else:
            return None
        var = [a.replace('\t', '') for a in var.split('\n')]
        # Cut each line at its last comma; leave lines without a comma untouched.
        for a in range(len(var)):
            coma = var[a].rfind(',')
            if coma != -1:
                var[a] = var[a][:coma]
        # Split each "key:value" line at the first colon.
        resultado = []
        for a in var:
            l = a.find(':')
            if l != -1:
                resultado.append((a[:l], a[l+1:]))
        return dict(resultado)
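To try the parsing step without hitting the network, the sketch below applies the same idea to a string in Python 3. The function name parse_flash_vars, the sample fragment and its keys (talkId, introMs, label) are all made up for illustration; they are not taken from a real TED page.

```python
import re


def parse_flash_vars(html):
    """Return the flashVars entries found in html as a dict, or None."""
    match = re.search(r"flashVars = \{\n([^}]+)\}", html)
    if match is None:
        return None
    params = {}
    for line in match.group(1).split("\n"):
        line = line.replace("\t", "")
        comma = line.rfind(",")
        if comma != -1:
            line = line[:comma]  # cut at the last comma, as the script does
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no "key:value" shape
            params[key] = value
    return params


# Hypothetical page fragment, only for illustration.
sample = 'var flashVars = {\n\ttalkId:"123",\n\tintroMs:16500,\n\tlabel:"x"\n}'
print(parse_flash_vars(sample))
```

Values come back as raw text, so a quoted value such as "123" keeps its quotes, and a value that itself contains a comma would be cut short. That is good enough for simple entries, but it is not a real JavaScript parser.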
Putting it all together:

    import simplejson
    import urllib

    def downloadSub(idtalk, lang, timeIntro):
        print("Downloading subtitles for language %s"%lang)
        # Fetch the captions as JSON for the given talk and language.
        c = simplejson.load(urllib.urlopen('http://www.ted.com/talks/subtitles/id/%s/lang/%s'%(idtalk, lang)))
        salida = open('subs_%s_%s.srt'%(idtalk,lang), 'w')
        conta = 1
        c = c['captions']
        for linea in c:
            # An SRT entry is: index, time range, text, blank line.
            salida.write("%d\n"%conta)
            conta += 1
            # timeIntro (in milliseconds) shifts every caption by the same offset.
            salida.write("%s --> %s\n"%(getFormatedTime(timeIntro+linea['startTime']), getFormatedTime(timeIntro+linea['startTime']+linea['duration'])))
            salida.write("%s\n\n"%(linea['content'].encode('utf-8')))
        salida.close()
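The conversion itself can be tried offline. Below is a minimal Python 3 sketch that builds the SRT text in memory from a captions list with the same keys the script reads (startTime, duration, content). The function names and the sample captions are invented for illustration.

```python
def format_srt_time(ms):
    """Convert milliseconds to an SRT timestamp, HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def captions_to_srt(captions, time_intro=0):
    """Render a list of caption dicts as SRT text, shifted by time_intro ms."""
    blocks = []
    for number, caption in enumerate(captions, start=1):
        start = time_intro + caption["startTime"]
        end = start + caption["duration"]
        blocks.append("%d\n%s --> %s\n%s\n\n" % (
            number, format_srt_time(start), format_srt_time(end),
            caption["content"]))
    return "".join(blocks)


# Invented sample data shaped like the JSON the script reads.
sample = {"captions": [
    {"startTime": 0, "duration": 2500, "content": "Hello."},
    {"startTime": 2500, "duration": 1500, "content": "Welcome."},
]}
srt_text = captions_to_srt(sample["captions"], time_intro=15000)
print(srt_text)
```

With a 15000 ms offset, the first caption runs from 00:00:15,000 to 00:00:17,500 and the second from 00:00:17,500 to 00:00:19,000. Returning a string instead of writing a file makes the step easy to test; writing it out afterwards with open(..., "w", encoding="utf-8") would replace the manual encode call.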
