Python crawler implementation code example for taking names

Y2J
Release: 2017-05-10 11:42:16
Original
4850 people have browsed it

Everyone will encounter something in their life. They will not care about it before it appears, but once it comes, they will find that it is extremely important and require a major decision to be made in a short period of time. That is for yourself. Give your newborn baby a name. The following article mainly introduces how to use Python crawler to give your child a good name. Friends in need can refer to it.

Preface

I believe every parent has experienced it, because it is necessary to name the child within two weeks after birth (you need to apply for a birth certificate), which is estimated to be a lot. Everyone is like me. I was very confused at first. Although I felt that there are so many Chinese characters, I could just find any character to make a name. But later I realized that it was really not a casual thing. No matter how I thought about it, I found that it was inappropriate, so I looked around in dictionaries and online. I search and read Tang and Song Dynasty poems, the Book of Songs, and even martial arts novels. However, the name I have been thinking about for a long time often encounters the opinions and objections of my family members, such as problems such as difficulty in speaking, the same accent as the name of relatives, etc., so I fall into a cycle of repeated searches and denials. The cycle becomes more and more confusing.

So we went back to the Internet again and searched various , and found many articles on the Internet such as "A complete list of good baby boy names". These articles gave hundreds of articles at once. Thousands of names are too dizzying to use. There are many websites or apps that test names. When you enter a name, you can get a rating of eight characters or five characters. This function is quite good and can be used as a reference. However, we either need to input names one by one for testing, or These websites or APPs have very few names, or they cannot meet our needs such as qualifying words, or they start charging, and in the end we can't find any useful ones.

So I want to make a program like this:

  1. The main function is to provide reference for batch names. These names are combined with the baby's Calculated from birth date and horoscope;

  2. You can expand the name library. For example, if you find a batch of good names in the Book of Songs on the Internet and want to see how they are, you can add them and use them;

  3. You can limit the characters used in the name. For example, some family trees have restrictions. If the current generation is "国", the name must have the character "国";

  4. The name list can be given scores, so that after inversion, you can look at the names from high scores to low scores;

In this way, you can get a copy There is a list of names that match your child's birth date, your family tree restrictions, and your preferences, and the list has given scores for reference. Based on this, we can figure it out one by one to find the name we like. Of course, if you have new ideas, you can add new names to the vocabulary at any time and recalculate.

Code structure of the program

Code introduction:

  • /chinese-name-score Code root directory

  • /chinese-name-score/main Code directory

  • /chinese-name-score/main/dicts Dictionary file directory

  • /chinese-name-score/main/dicts/names_boys_double.txt Dictionary file, two-letter names for boys

  • /chinese-name-score/main/dicts/names_boys_single.txt Dictionary file, single-letter names for boys

  • /chinese-name-score/ main/dicts/names_girls_single.txt Dictionary file, two-letter names for girls

  • /chinese-name-score/main/dicts/names_grils_double.txt Dictionary file, one-letter names for girls

  • /chinese-name-score/main/outputs Output data directory

  • /chinese-name-score/main/outputs/names_girls_source_wxy.txt Output Sample files

  • /chinese-name-score/main/scripts Some scripts for preprocessing dictionary files

  • /chinese-name -score/main/scripts/unique_file_lines.py Set the dictionary file to remove duplicate names and blank lines in the dictionary

  • ##/chinese-name -score/main/sys_config.py The system configuration of the program, including the crawled target URL and dictionary file path

  • /chinese-name-score/main/user_config.py The user configuration of the program , including the baby’s age, month, day, time, gender and other settings

  • /chinese-name-score/main/get_name_score.py Program running entrance


How to use the code:

  1. If there are no qualified words, find the dictionary files names_boys_double.txt and names_grils_double.txt, you can add yourself here For some name lists found, just split them by line and add them at the end;

  2. If there are qualified words, find the dictionary files names_boys_single.txt and names_girls_single.txt, and add your favorites here. A single word list can be divided by line and added at the end;

  3. Open user_config.py and configure it. See the next section for configuration items;

  4. Run the script get_name_score.py

  5. In the outputs directory, view your own output files, which can be copied to Excel for sorting and other operations;

Program The configuration entry

The configuration of the program is as follows:

# coding:GB18030
 
"""
在这里写好配置
"""
 
setting = {}
 
# 限定字,如果配置了该值,则会取用单字字典,否则取用多字字典
setting["limit_world"] = "国"
# 姓
setting["name_prefix"] = "李"
# 性别,取值为 男 或者 女
setting["sex"] = "男"
# 省份
setting["area_province"] = "北京"
# 城市
setting["area_region"] = "海淀"
# 出生的公历年份
setting['year'] = "2017"
# 出生的公历月份
setting['month'] = "1"
# 出生的公历日子
setting['day'] = "11"
# 出生的公历小时
setting['hour'] = "11"
# 出生的公历分钟
setting['minute'] = "11"
# 结果产出文件名称
setting['output_fname'] = "names_girls_source_xxx.txt"
Copy after login

According to the configuration item setting["limit_world"] , the system automatically determines whether to use a single-word dictionary or a multi-word dictionary Dictionary:

  1. If this item is set, for example, if it is equal to "国", then the program will combine all the words into names for calculation. For example, both the names Guohao and Haoguo will be calculated;

  2. If you do not set this item and keep it empty String, the program will only read the double-word dictionary of *_double.txt

Principle of the program

This is a simple crawler. You can open the life.httpcn.com/xingming.asp website to view. This is a POST form. Fill in the required parameters and click submit. A results page will open. The bottom of the results page contains the eight-character score and the five-frame score.

If you want to get scores, you need to do two things. One is to automatically submit the form to the crawler and get the results page; the other is to extract the scores from the results page;

For the first thing, it is very simple , urllib2 can achieve it (the code is in /chinese-name-score/main/get_name_score.py):

 post_data = urllib.urlencode(params)
 req = urllib2.urlopen(sys_config.REQUEST_URL, post_data)
 content = req.read()
Copy after login

The params here is a parameter dict. In this way, POST with data is submitted. Then the result data was obtained from content.

The parameters of params are set as follows:

 params = {}
 
 # 日期类型,0表示公历,1表示农历
 params['data_type'] = "0"
 params['year'] = "%s" % str(user_config.setting["year"])
 params['month'] = "%s" % str(user_config.setting["month"])
 params['day'] = "%s" % str(user_config.setting["day"])
 params['hour'] = "%s" % str(user_config.setting["hour"])
 params['minute'] = "%s" % str(user_config.setting["minute"])
 params['pid'] = "%s" % str(user_config.setting["area_province"])
 params['cid'] = "%s" % str(user_config.setting["area_region"])
 # 喜用五行,0表示自动分析,1表示自定喜用神
 params['wxxy'] = "0"
 params['xing'] = "%s" % (user_config.setting["name_prefix"])
 params['ming'] = name_postfix
 # 表示女,1表示男
 if user_config.setting["sex"] == "男":
  params['sex'] = "1"
 else:
  params['sex'] = "0"
  
 params['act'] = "submit"
 params['isbz'] = "1"
Copy after login

The second thing is to extract the required scores from the web page. We can use BeautifulSoup4 to achieve this, and its syntax is also very simple:

 soup = BeautifulSoup(content, 'html.parser', from_encoding="GB18030")
 full_name = get_full_name(name_postfix)
 
 # print soup.find(string=re.compile(u"姓名五格评分"))
 for node in soup.find_all("p", class_="chaxun_b"):
  node_cont = node.get_text()
  if u'姓名五格评分' in node_cont:
   name_wuge = node.find(string=re.compile(u"姓名五格评分"))
   result_data['wuge_score'] = name_wuge.next_sibling.b.get_text()
  
  if u'姓名八字评分' in node_cont:
   name_wuge = node.find(string=re.compile(u"姓名八字评分"))
   result_data['bazi_score'] = name_wuge.next_sibling.b.get_text()
Copy after login

Through this method, HTML can be parsed and the scores of eight characters and five grids can be extracted.

Example of running results

1/1287 李国锦 姓名八字评分=61.5 姓名五格评分=78.6 总分=140.1
2/1287 李国铁 姓名八字评分=61 姓名五格评分=89.7 总分=150.7
3/1287 李国晶 姓名八字评分=21 姓名五格评分=81.6 总分=102.6
4/1287 李鸣国 姓名八字评分=21 姓名五格评分=90.3 总分=111.3
5/1287 李柔国 姓名八字评分=64 姓名五格评分=78.3 总分=142.3
6/1287 李国经 姓名八字评分=21 姓名五格评分=89.8 总分=110.8
7/1287 李国蒂 姓名八字评分=22 姓名五格评分=87.2 总分=109.2
8/1287 李国登 姓名八字评分=21 姓名五格评分=81.6 总分=102.6
9/1287 李略国 姓名八字评分=21 姓名五格评分=83.7 总分=104.7
10/1287 李国添 姓名八字评分=21 姓名五格评分=81.6 总分=102.6
11/1287 李国天 姓名八字评分=22 姓名五格评分=83.7 总分=105.7
12/1287 李国田 姓名八字评分=22 姓名五格评分=93.7 总分=115.7
Copy after login

With these scores, we can sort them, which is a very practical reference.

Friendly reminder

  1. The score is related to many factors, such as the time of birth, the limited words, the strokes of the limited words, etc. These conditions It has been decided that some names will not have high scores, so don’t be affected by this, just find the ones with high relative scores;

  2. Currently, the program can only crawl the content of one website, and the address is http ://life.httpcn.com/xingming.asp

  3. This list is for reference only. I have read some articles. There are many celebrities and great people in history. Their names have very low ratings but they all made great achievements. , the name does have some influence, but sometimes catchy words are the best;

  4. After selecting a name from this list, you can check it on Baidu, Renren and other places to Just in case some negative people have the same name, or there are too many people with this name;

  5. The eight-character score is inherited from China, and the five-frame score was invented by the Japanese in modern times. Sometimes You can also try the Western zodiac naming method, and strangely, the horoscopes and five scores are very different on different websites, which further proves that this thing is for reference only;

## The code of this article has been uploaded to github

Summary

[Related recommendations]

1.

Python Free Video Tutorial

2.

Python Meets Data Collection Video Tutorial

3.

Python Learning Manual

The above is the detailed content of Python crawler implementation code example for taking names. For more information, please follow other related articles on the PHP Chinese website!

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template