Web Processing

Overview

Today, cyberspace is an inseparable part of human social life. People spend much of their time in this space to meet most of their daily needs, such as reading product reviews and shopping, communicating with family and friends, and reading the daily news. Developing high-quality domestic web-based services to answer these needs is of great importance; otherwise people will turn to foreign counterparts, which can have many negative economic and social consequences. Because a large part of cyberspace is related to human language (written and spoken), the most important requirement for developing domestic web-based services is a suitable language processing infrastructure. In this regard, we focus on tasks related to the web and cyber-linguistics.

Activities

  • Providing technical services and consultation in the fields of cyber-linguistics, web services, and text & information retrieval
  • Performing and managing fundamental, strategic and applied R&D projects in IT-based technologies, including search engines, cyber-linguistics and web-based applications
  • Supporting the development of tools, applications and services in the field of Persian language and handwriting
  • Evaluating, testing & monitoring web services, NLP tools and data
  • Evaluating, monitoring and forecasting in the domains of cyber-linguistics, web services, and text & information retrieval

Services & Products

Text Search Engine (Parsijoo, Yooz)

English-Persian Machine Translation (Targoman, Faraazin)

Quranic Automatic Question Answering System (Quranjooy)

Image Search Engine

Data

Persian WordNet (FarsNet)

Persian Knowledge Graph (Farsbase)

Persian Image Network (TasvirNet)

Quranic Ontology (Qurannegar)

Tools

Persian Language Processing Toolkit (Parsi Pardaz)

Ayehyab

Title                                       Date           Size
Named Entity Recognition (NER) for Farsi    29 May 2019    2.7 MB
Co-reference Resolution in Farsi            29 May 2019    3.9 MB

Products

Web Azma

Web Azma is a testing and evaluation laboratory established in 2015 at the Iran Telecommunications Research Center (ITRC), with the aim of standardized quality assessment of local IT-based services such as search engines, machine translators, email, and social services. In its first two years of activity, Web Azma has developed several automatic and human-based testing tools and platforms, which it uses to publish regular quality assessment reports on the web-based services supported by the Persian Search Engine Program. The Web Azma testing tools and platforms are as follows:

  • An automatic testing platform for assessing the effectiveness of Persian text, image, video, and audio search engines, as well as measuring the precision of English-Persian machine translators,
  • Automatic tools for evaluating the availability, response time, and concurrent-user capacity of any web-based service,
  • A crowdsourcing platform, with more than 300 users, for designing and executing any kind of testing activity that requires human participation,
  • Analytical script-based tools for collecting and inferring statistics about user engagement and interaction with web-based services,
  • UI and UX evaluation platforms for assessing any web-based service.

Using the tools and platforms listed above, Web Azma has executed several tests, whose results have been published on the official Web Azma website (http://webazma.itrc.ac.ir/). Several conference and journal papers have also been published domestically and internationally. Currently, the laboratory is focused on developing tools for the quality assessment of intelligent advertising platforms and of NLP services such as knowledge graphs and treebanks.
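
As an illustration of the kind of effectiveness measure an automatic search-engine test might compute, the following minimal Python sketch calculates precision at k from a ranked result list and a set of relevance judgments. The page does not state which metrics Web Azma actually uses, so this is only an assumption; the function names and sample identifiers are hypothetical.

```python
from typing import Dict, Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return the fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(
    runs: Dict[str, Sequence[str]],
    judgments: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average precision@k over all queries that have a result list."""
    if not runs:
        raise ValueError("at least one query is required")
    scores = [
        precision_at_k(ranked, judgments.get(query, set()), k)
        for query, ranked in runs.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical sample data: two queries, each with a ranked result list
# returned by a search engine and a set of human relevance judgments.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d7", "d8", "d9", "d5"]}
judgments = {"q1": {"d1", "d3"}, "q2": {"d5"}}
print(mean_precision_at_k(runs, judgments, k=4))
```

In a setup like this, the relevance judgments could come from human assessors (for example, a crowdsourcing platform), while the ranked lists are collected automatically from each engine under test.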

Text search engine

This project aims to provide Persian text search services with the following specific goals:
  • Meeting the information needs of Persian speakers within the vast information pool of the internet,
  • Providing full coverage of the Persian part of the web,
  • Providing a search service that takes into account the particularities of the Persian language and of Islamic and Persian culture,
  • Increasing the intelligence of Persian search services through efficient processing and retrieval algorithms,
  • Using a distributed architecture.

Two Persian search engines were selected to participate and compete in this project; both launched their services in 2010. Currently, they provide various services such as text, image, and multimedia search, Persian-English translation, news, music, and blogs. Like other local search engines such as Yandex, Baidu, and Naver, the Persian search services focus on deep language-oriented processing to improve their accuracy, and they benefit from recruiting native-language experts.

Yooz

  • Development of the Yooz search engine began in 2010. Its first goal is to cover Persian web pages; later goals are to increase the volume of covered pages in Persian and in other languages such as Arabic and English, and to improve the quality of its search service.
  • Its main features are a crawl of over a billion Persian web pages and a range of services. In addition to text search, it currently offers news, weblog and sports search services and a machine translation service (www.yooz.ir).

English-Persian machine translation

Communication between nations is a necessity of modern societies, and translation between languages is a prerequisite for it. However, human translation is a very time-consuming and costly task, so machine translation is vital for providing translation services. Nowadays, machine translation (MT) is one of the most important and widely used services in search engines. The machine translation system developed in this project offers quality comparable to well-known machine translation systems (e.g. Google Translate) and can translate from English to Persian and vice versa. The system serves about 6,000 users every day and can provide translation in some specialized domains. In addition to the MT system, the project has developed a parallel English-Persian corpus containing about 50 million words. By implementing new approaches such as deep learning, we aim to reach human-level quality in machine translation.

Parsijoo

With the ever-growing use of the web, internet search engines have become extremely important tools that provide users with the information they are looking for. Today, over 80% of internet users access the information they need through web search engines. Local search engines can provide better local services and turn global network traffic into local traffic. The Parsijoo search engine has been developed to meet this growing need. It covers more than 500 million Persian web pages and provides value-added services, responding to more than 2 million queries per day. Its main features and services include:

  • Text Search
  • Image Search
  • News Search
  • Map Search
  • Speech Search
  • Scientific Search
  • Covering local websites and web pages across general domains
  • Searching and indexing all Persian web pages on the web
  • Intelligent updating of content
  • Providing fast answers to queries

Targoman English-Persian Bilingual Translator

The Targoman English-Persian translator is a pioneer in statistical machine translation in Iran. It was developed by the “Human Language Technology & Machine Learning” laboratory at Amirkabir University, with the support of the ICT Research Institute, and is regarded as one of the major tools for the development of the Persian script and language. Targoman can be used in a variety of applications such as text processing, search engines and digital libraries, and it can also be employed independently. Currently, Targoman is used in the Parsijoo search engine for translation and provides English-to-Persian and Persian-to-English text translation based on a corpus of 38 million lexical items in the domains of news and fiction. One major advantage of Targoman is the high quality of its translations compared to existing Persian translators. Based on evaluation with the BLEU metric, the quality achieved by Targoman in both translation directions is reported to be far superior to that of competing translators such as Google and Bing. Targoman is undergoing a further upgrade and is expected to provide services such as speech translation and the display of related images in the near future. Its quality and domain coverage are also gradually improving. Its main features include:

  • Highest translation quality compared to counterpart systems
  • Dedicated, locally developed bilingual English-Persian translator
  • Ease of upgrade
  • High-speed translations
  • Inclusion of a rich corpus of Persian words
  • End users can suggest improvements to translation results
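
To make the BLEU-based evaluation mentioned above more concrete, here is a minimal Python sketch of a simplified, sentence-level BLEU score with a single reference and no smoothing. It is not the evaluation code used for Targoman, whose setup is not described on this page; BLEU is normally computed over a whole test corpus, and the function names below are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Simplified BLEU: clipped n-gram precisions, uniform weights, brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
print(sentence_bleu(candidate, reference))
```

The clipping step prevents a translation from scoring well by repeating a frequent word, and the brevity penalty discourages overly short output.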

Faraazin

This translator was designed and implemented in the project "Development of English-Persian machine translator service, design and data development", carried out in 2011. The project was defined as a way of expanding the existing service through technology transfer. The “Faraazin” translator takes a rule-based approach. Its main features are:

  • English/Persian translation service
  • Translation of English documents and texts
  • Translation of files in DOC, PPT and PDF formats
  • Delivery of translated files to users via e-mail
  • Rich dictionary
  • High-speed translations
  • Lower hardware requirements than other services
  • Fully native
  • Professional translations in various fields
  • www.faraazin.ir

Quranjooy

Quranjooy is a comprehensive search engine that offers its users a convenient and accurate way to search the holy Quran. The system was pioneered at the ICT Research Institute and serves the queries of Farsi speakers about the holy book. Its main features include:

  • Lexical (keyword), Syntactic (multi-words phrases) and Semantic (concept) search
  • Applying reasoning and data mining algorithms to retrieve results
  • Using unstructured resources such as Quran translations and interpretations, as well as structured resources such as the Quran conceptual graph, which was developed specifically for this search tool
  • Improving and updating the system with a large volume of resources
  • Easy extension to other domains: with minor modification, the system can be reconfigured for use on material other than the Quran

Image search engine

Deep learning is a machine learning technique that has produced promising results in image processing and machine vision applications. The aim of this project is therefore to employ deep-learning techniques in developing Persian-language image search engines. Three kinds of image search are considered: tag-based, content-based, and search based on a combination of tags and content. Because deep-learning techniques will be used in all three kinds of image search engine, the project can be considered a comprehensive study of the impact of deep learning on the performance of different kinds of image search engines.
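
The three kinds of image search described above can be sketched as follows. This is only an illustrative Python example under stated assumptions: the page does not describe the project's actual scoring, so the Jaccard tag overlap, the cosine similarity over feature vectors (which in practice could come from a deep network), the weighting parameter `alpha`, and all function names are hypothetical.

```python
import math
from typing import Sequence, Set


def tag_score(query_tags: Set[str], image_tags: Set[str]) -> float:
    """Tag-based score: Jaccard overlap between query tags and image tags."""
    union = query_tags | image_tags
    return len(query_tags & image_tags) / len(union) if union else 0.0


def content_score(query_vec: Sequence[float], image_vec: Sequence[float]) -> float:
    """Content-based score: cosine similarity between two feature vectors."""
    dot = sum(q * v for q, v in zip(query_vec, image_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(v * v for v in image_vec))
    return dot / norm if norm else 0.0


def combined_score(
    query_tags: Set[str],
    image_tags: Set[str],
    query_vec: Sequence[float],
    image_vec: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Combined score: weighted sum of the tag-based and content-based scores."""
    return alpha * tag_score(query_tags, image_tags) + (1 - alpha) * content_score(query_vec, image_vec)


# Hypothetical query and image: tags plus a tiny two-dimensional feature vector.
print(combined_score({"cat"}, {"cat", "dog"}, [1.0, 0.0], [1.0, 0.0]))
```

Setting `alpha` to 1 or 0 reduces the combined search to the purely tag-based or purely content-based case.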

Persian Wordnet (Farsnet3)

  • Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. A number of semantic lexicons exist for English and some other languages, and Persian needs a similarly complete resource for use in NLP work.
  • FarsNet is a bilingual lexical ontology that consists of two main parts: a semantic lexicon and a lexical ontology. Each entry in the semantic lexicon contains natural language descriptions and phonological, morphological, syntactic and semantic knowledge about a lexeme. In the ontology part, lexemes can participate in relations with other lexemes in the same lexicon or with entries of other lexicons and ontologies. Here, the semantic lexicon serves as a lexical index to the ontology. [1] The ontology part contains not only the standard relations defined in WordNet but also some additional conceptual ones such as Agent, Patient, Location, Instrument, Entailment, Salient defining feature and Co-Occurrence. FarsNet also has inter-lingual relations connecting Persian synsets to English ones (in Princeton WordNet 3.1).
  • FarsNet has been developed in three projects. FarsNet 1.0 was established in 2008 and has more than 17,000 lexical entries organized in about 10,000 synsets of nouns, adjectives and verbs. FarsNet 2.0 was released in 2010 with 30,000 words in 20,000 synsets and added a new part of speech, the adverb. More than 200 universities and organizations have requested FarsNet 2.0 and received it for use in their projects.
  • The current project is FarsNet 3.0, which aims to collect 100,000 Persian words. The FarsNet 3 project started in August 2016 and runs until August 2017. In its first phase, all FarsNet 2.0 data was manually revised to correct errors. Two phases of the project, covering 65,000 words, have now been completed, and work on completing FarsNet continues.
  • The current statistics of FarsNet are given in the table below:

Persian Knowledge Graph

Knowledge bases are considered one of the most important factors in the development of the internet and its future. In this regard, the Persian knowledge graph project has been implemented using Semantic Web and Linked Data technologies. The data model of the Semantic Web is RDF (Resource Description Framework), which is a graph; the project is therefore built on an RDF model made up of many facts, each consisting of a subject, a predicate and an object. The project extracts knowledge from many resources, including Persian Wikipedia, raw text and web tables, but its main focus is on Wikipedia. So far, over 6.5 million facts describing about 500,000 entities have been extracted. The DBpedia ontology, which includes 732 classes, has been applied to the project and expanded with 25 classes for Persian-specific entities. In addition, 800 Wikipedia templates and more than 2,000 properties have been mapped to the ontology by human experts. As for data storage, two levels are designed, for metadata and for data triples. Overall, the project aims to produce a knowledge graph of high-precision data to improve search engine results.
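
The subject-predicate-object structure of RDF facts can be illustrated with a small in-memory triple store. The following Python sketch is not the project's storage system, which is not described in detail here; the `TripleStore` class, the `ex:` identifiers and the sample facts are hypothetical and chosen only to show how facts are stored and matched by pattern.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """A tiny in-memory store of RDF-style facts, indexed by subject."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []
        self._by_subject: Dict[str, List[Triple]] = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        triple = Triple(subject, predicate, obj)
        self._triples.append(triple)
        self._by_subject[subject].append(triple)

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, [])
        else:
            candidates = self._triples
        return [
            t for t in candidates
            if (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        ]


# Hypothetical sample facts using DBpedia-style class and property names.
store = TripleStore()
store.add("ex:Tehran", "rdf:type", "dbo:City")
store.add("ex:Tehran", "dbo:country", "ex:Iran")
store.add("ex:Iran", "rdf:type", "dbo:Country")
print(store.match(subject="ex:Tehran"))
```

Each call to `add` records one fact, and `match` with wildcards plays the role of a very simple graph query, for example listing every entity of a given class.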

Persian image network (TasvirNet)

“TasvirNet” is the first image network for Persian nouns. It aims to establish a hierarchical database with a large number of images available to researchers in Iran and other Persian-speaking countries. The goal is to provide a Persian-language image network and to give Persian users access to images independently of the policies set by “ImageNet” (the image network provided by Stanford University). The hierarchy of synsets in TasvirNet is based on the hierarchy in ImageNet; the Persian equivalents of more than 93 percent of the ImageNet synsets have been translated automatically and included in TasvirNet. The images in TasvirNet are downloaded automatically from the image links provided by ImageNet. Moreover, more than 1,000 synsets for Iranian and Islamic culture, with more than 70 thousand related images, are included in TasvirNet. The added images have been tagged by crowdsourcing. TasvirNet can be accessed via tasvir-net.ir.

Qurannegar

Qurannegar is a conceptual graph representation of the Quran, the first of its kind at this level in Iran. It includes concepts, semantic relations and their related instances. Its main features are:

  • More than 27,000 concepts within a graph
  • More than 9,000 instances
  • More than 430,000 relations between concepts and instances, based on 130 conceptual relation types within a graph

Parsi Pardaz Persian Language Processing Toolkit

The ICT research center has developed and implemented a comprehensive and smart Persian language processing tool that can handle various kinds of text processing, from the lexical level to the semantic level. The aim of this processing engine is to enable users to perform searches and view the results in their native language. Parsi Pardaz comprises nine Persian language processing and integration tools covering all language analysis layers, in the following categories:

  • Lexical Layer Tools: normalizer, tokenizer, spell checker
  • Morphological Layer Tools: lemmatizer, stemmer, POS tagger
  • Syntactic Layer Tool: dependency parser
  • Semantic Layer Tool: semantic role labeling
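
To give a sense of what the lexical layer involves, the following Python sketch shows a very small normalizer and tokenizer for Persian text. It is not the Parsi Pardaz API, whose interface is not described on this page; the function names and the particular normalization rules (mapping Arabic yeh and kaf to their Persian forms, dropping tatweel and diacritics, keeping the zero-width non-joiner inside words) are assumptions chosen for illustration.

```python
import re
from typing import List

# Character mappings commonly applied when normalizing Persian text.
_TRANSLATION = {
    0x064A: "\u06CC",  # Arabic yeh -> Persian yeh
    0x0643: "\u06A9",  # Arabic kaf -> Persian keheh
    0x0640: None,      # tatweel (kashida) is dropped
}
# Arabic diacritics in the range U+064B..U+0652 are dropped.
_TRANSLATION.update({code_point: None for code_point in range(0x064B, 0x0653)})

# A token is a run of word characters and zero-width non-joiners (U+200C),
# so that compound forms written with a ZWNJ stay in one token.
_TOKEN_RE = re.compile(r"[\w\u200c]+")


def normalize(text: str) -> str:
    """Unify character variants, strip diacritics and collapse whitespace."""
    return " ".join(text.translate(_TRANSLATION).split())


def tokenize(text: str) -> List[str]:
    """Normalize the text, then split it into word tokens (punctuation is dropped)."""
    return _TOKEN_RE.findall(normalize(text))


# Sample input written with escapes: a word containing Arabic kaf (U+0643),
# followed by a verb form that contains a ZWNJ, and a final period.
sample = "\u0643\u062A\u0627\u0628  \u0645\u06CC\u200C\u0631\u0648\u0645."
print(tokenize(sample))
```

Later layers such as lemmatization, POS tagging and parsing would typically consume the token list produced by a step like this.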

Ayehyab

Ayehyab is a Persian-language search tool that accepts a single-word or multi-word query, looks through the Quran verses (Ayats), and returns a response that includes:

  • The Sureh (Surah) name and Ayeh (Ayah) number
  • A portion of the retrieved verse together with its Farsi translation

Its main features are:

  • A tool for searching the Quran and returning a verse along with its Farsi translation
  • A Quran search engine