Email threading Example [REST API]ΒΆ
Thread a email collection
Out:
0. Load the test dataset
GET http://localhost:5001/api/v0/datasets/fedora_ml_3k_subset
1.a Parse emails
POST http://localhost:5001/api/v0/email-parser
=> received [u'id', u'filenames']
=> dsid = 3dc5e4bd0b2b4b948ac18031d27ecff2
1.b. check the parameters of the emails
GET http://localhost:5001/api/v0/email-parser/3dc5e4bd0b2b4b948ac18031d27ecff2
- encoding: latin-1
- type: EmailParser
- data_dir: ../freediscovery_shared/fedora_ml_3k_subset/data
- n_samples: 3063
2. Email threading
POST http://localhost:5001/api/v0/email-threading/
=> model id = 4fdffe397e5947299d0ec23cc59e0011
Threading examples
cf. https://www.redhat.com/archives/rhl-devel-list/2008-October/thread.htlm
for ground truth data (mailman has a maximum threading depth of 3, unlike FreeDiscovery)
Strange ext3 problem (id=3049)
> Re: Strange ext3 problem (id=3055)
Dia has .la files (id=3039)
> Re: Dia has .la files (id=3040)
> > Re: Dia has .la files (id=3048)
> > > Re: Dia has .la files (id=3047)
> > > > Re: Dia has .la files (id=3051)
> > > > > Re: Dia has .la files (id=3052)
> > > > > > Re: Dia has .la files (id=3057)
> > > > > > > Re: Dia has .la files (id=3060)
> > > > > > Re: Dia has .la files (id=3061)
> > > Re: Dia has .la files (id=3062)
PackageKit 0.3.10 into F9 (id=3032)
> Re: PackageKit 0.3.10 into F9 (id=3034)
rawhide report: 20081031 changes (id=3019)
> Re: rawhide report: 20081031 changes (id=3021)
> > Re: rawhide report: 20081031 changes (id=3023)
> > > Re: rawhide report: 20081031 changes (id=3025)
> > > > Re: rawhide report: 20081031 changes (id=3026)
> Re: rawhide report: 20081031 changes (id=3022)
> > Re: rawhide report: 20081031 changes (id=3027)
> > > Re: rawhide report: 20081031 changes (id=3028)
> > > Re: rawhide report: 20081031 changes (id=3029)
> > > > Re: rawhide report: 20081031 changes (id=3030)
> > > > > Re: rawhide report: 20081031 changes (id=3031)
> > > > > Re: rawhide report: 20081031 changes (id=3035)
> > > > > > Re: rawhide report: 20081031 changes (id=3037)
> > Re: rawhide report: 20081031 changes (id=3033)
> > Re: rawhide report: 20081031 changes (id=3036)
Seeking comaintainers (id=3002)
> Re: Seeking comaintainers (id=3004)
> > Re: Seeking comaintainers (id=3006)
> > > Re: Seeking comaintainers (id=3008)
> Re: Seeking comaintainers (id=3005)
> > Re: Seeking comaintainers (id=3009)
> Re: Seeking comaintainers (id=3007)
> Re: Seeking comaintainers (id=3010)
> > Re: Seeking comaintainers (id=3011)
from __future__ import print_function
from time import time
import sys
import platform
from itertools import groupby
import numpy as np
import pandas as pd
import requests
import sys
import os
pd.options.display.float_format = '{:,.3f}'.format
if platform.system() == 'Windows' and sys.version_info > (3, 0):
print('This example currently fails on Windows with PY3 (issue #')
sys.exit()
dataset_name = "fedora_ml_3k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
print(" 0. Load the test dataset")
url = BASE_URL + '/datasets/{}'.format(dataset_name)
print(" GET", url)
res = requests.get(url)
res = res.json()
# To use a custom dataset, simply specify the following variables
data_dir = res['data_dir']
print("\n1.a Parse emails")
url = BASE_URL + '/email-parser'
print(" POST", url)
fe_opts = {'data_dir': data_dir }
res = requests.post(url, json=fe_opts)
dsid = res.json()['id']
print(" => received {}".format(list(res.json().keys())))
print(" => dsid = {}".format(dsid))
print("\n1.b. check the parameters of the emails")
url = BASE_URL + '/email-parser/{}'.format(dsid)
print(' GET', url)
res = requests.get(url)
data = res.json()
print('\n'.join([' - {}: {}'.format(key, val) for key, val in data.items() \
if key != 'filenames']))
print("\n2. Email threading")
url = BASE_URL + '/email-threading/'
print(" POST", url)
t0 = time()
res = requests.post(url,
json={'dataset_id': dsid }).json()
mid = res['id']
print(" => model id = {}".format(mid))
def print_thread(container, depth=0):
print(''.join(['> ' * depth, container['subject'],
' (id={})'.format(container['id'])]))
for child in container['children']:
print_thread(child, depth + 1)
print("\nThreading examples\n"
"cf. https://www.redhat.com/archives/rhl-devel-list/2008-October/thread.htlm \n"
"for ground truth data (mailman has a maximum threading depth of 3, unlike FreeDiscovery)"
)
for idx in [-1, -2, -3, -4, -5]: # get latest threads
print(' ')
print_thread(res['data'][idx])
Total running time of the script: ( 0 minutes 15.251 seconds)