Does MongoDB favor putting all data in a single Collection?

Question

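For example, take a database of user information and relationships between users. Following the SQL approach, you would create two tables, one for user information and one for user relationships. In MongoDB, is the preferred approach to embed the user relationships in the user information, so that everything forms a single document?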

天蓬老师 · Answer

That’s not the case.

A single doc in a Collection has a size limit, currently 16MB, so you cannot cram everything into one document. Moreover, if the document structure is too complex, it not only hurts query and update efficiency but also causes maintenance difficulties and operational risk. Have you ever overwritten a doc with null through a slip of the hand? I have. If all of a person's information lives in that one document, the result is painful.

The general principle is:

Group by how the data is queried
- Keep data that is frequently read together in the same place.
- Keep logically closely related information together.
- Keep data that needs map-reduce/aggregation together; these operations run against a single collection (joining a second collection inside an aggregation only became possible with $lookup in MongoDB 3.2).
Split by data volume
- If a document needs an array whose length will keep growing, move that data into a dedicated collection in which each entry references the primary key of the current doc (like a 1:N foreign key in MySQL).
- If a doc is nested too deeply (more than 2 levels), splitting it is probably worth considering; otherwise performance and maintainability suffer.
Design around a table structure
- MongoDB has no concept of table structure, but in practice a collection rarely holds docs of widely varying shapes. If the doc structures drift further and further apart, consider abstracting the common part into a fixed structure, moving the varying parts into other collections, and linking the two by reference, as with foreign keys.

For example, when designing a user system, the user collection should contain commonly used information such as name, as well as fields like lastLoginAt that relate only to the user. Information about the user's access rights probably belongs there too, but the user's login log does not, because it keeps growing.
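
A rough sketch of that split, in plain JavaScript with arrays standing in for collections. The field names permissions and userId and the recentLogins helper are made up for illustration; they do not come from the answer.

```javascript
// In-memory stand-ins for two collections (hypothetical names and fields).
const users = [
  {
    _id: "u1",
    name: "alice",
    lastLoginAt: new Date("2014-06-25T08:00:00Z"),
    permissions: ["read", "write"], // small, bounded, read together with the user
  },
];

// Login log entries grow without bound, so each one is its own document
// and points back at the user through userId.
const loginLogs = [
  { _id: "l1", userId: "u1", at: new Date("2014-06-23T08:00:00Z") },
  { _id: "l2", userId: "u1", at: new Date("2014-06-24T08:00:00Z") },
  { _id: "l3", userId: "u1", at: new Date("2014-06-25T08:00:00Z") },
];

// Return the newest `limit` log entries for one user.
function recentLogins(userId, limit) {
  return loginLogs
    .filter((entry) => entry.userId === userId) // filter() copies, so sort() is safe
    .sort((a, b) => b.at - a.at)
    .slice(0, limit);
}

const latest = recentLogins("u1", 2).map((entry) => entry._id);
console.log(latest);
```

The user document stays the same size no matter how often the user logs in; only the log collection grows.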

As for relationships between users, whether they belong in the user collection depends on the case. If you only need to record the uids of a user's friends, and the number of friends is modest (a few hundred at most), I would put them in the user document. If the relationship data itself is more complex, or the number of friends runs into the thousands, I would split it out.

In addition, MongoDB's official documentation on data model design is well worth a careful read.

怪我咯 · Answer

Original post: http://pwhack.me/post/2014-06-25-1 (please credit the source when reposting)

This article is excerpted from Chapter 8 of "MongoDB: The Definitive Guide" and thoroughly answers the following two questions:

/q/1010000000364944
/q/1010000000364944

There are many ways to represent data, and one of the most important decisions is how far to normalize it. Normalization means spreading data across multiple collections that reference each other. Many documents may reference a given piece of data, but that data is stored in only one place, so changing it means changing a single document. However, MongoDB's ordinary queries do not join collections, so gathering data from several collections takes several queries.

Denormalization is the opposite of normalization: embedding the data required for each document within the document. Each document has its own copy of the data, rather than all documents collectively referencing the same copy of the data. This means that if the information changes, all related documents need to be updated, but when a query is executed, only one query is needed to get all the data.

Deciding when to normalize and when to denormalize is difficult. Normalization generally makes writes faster, and denormalization generally makes reads faster. You need to weigh the two carefully against the needs of your own application.

Examples of data representation

Suppose you want to store student and course information. One way to represent this is a students collection (one document per student) and a classes collection (one document per course), plus a third collection, studentsClasses, that stores the relationship between students and courses.

> db.studentsClasses.findOne({"studentId": id});
{
  "_id": ObjectId("..."),
  "studentId": ObjectId("..."),
  "classes": [
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("...")
  ]
}

If you are familiar with relational databases, you may have built this kind of join table before, although you would probably store one student and one course per document (rather than a list of course "_id"s). Putting the courses in an array is a bit more MongoDB style, but in practice you usually would not store data this way, because it takes many queries to get the actual information.

Suppose you want to find the courses a student has selected. You first query the students collection for the student, then query studentsClasses for the course "_id"s, and finally query the classes collection for the course information. That is three round trips to the server. You probably do not want this data organization in MongoDB unless the student and course information changes frequently and read speed does not matter.
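
To make the cost concrete, here is a small sketch in plain JavaScript in which arrays stand in for the three collections and a counter models round trips. The findOne and findMany helpers, the string ids, and the queryCount variable are illustrative assumptions, not driver calls.

```javascript
// Arrays stand in for the three collections; ids are plain strings here.
const students = [{ _id: "s1", name: "John Doe" }];
const classes = [
  { _id: "c1", class: "Trigonometry" },
  { _id: "c2", class: "Physics" },
];
const studentsClasses = [{ _id: "sc1", studentId: "s1", classes: ["c1", "c2"] }];

let queryCount = 0; // each helper call models one round trip to the server

function findOne(collection, predicate) {
  queryCount += 1;
  return collection.find(predicate) || null;
}

function findMany(collection, predicate) {
  queryCount += 1;
  return collection.filter(predicate);
}

// Student -> join document -> class documents.
function classesForStudent(name) {
  const student = findOne(students, (s) => s.name === name);
  if (student === null) return [];
  const link = findOne(studentsClasses, (sc) => sc.studentId === student._id);
  if (link === null) return [];
  return findMany(classes, (c) => link.classes.includes(c._id));
}

const taken = classesForStudent("John Doe").map((c) => c.class);
console.log(taken, queryCount);
```

Each of the three steps depends on the result of the previous one, so they cannot be issued in parallel.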

You can drop one of those queries by embedding the course references in the student document:

{
  "_id": ObjectId("..."),
  "name": "John Doe",
  "classes": [
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("..."),
    ObjectId("...")
  ]
}

The "classes" field is an array that stores the "_id"s of the courses John Doe is taking. When you need information about these courses, you can use these "_id"s to query the classes collection. This takes only two queries. It is a fairly good way to organize data that does not need to be instantly accessible and that changes, but not constantly.

If you need to further optimize the reading speed, you can completely denormalize the data and save the course information as an embedded document in the "classes" field of the student document. In this way, you can get the student's course information with only one query:

{
  "_id": ObjectId("..."),
  "name": "John Doe",
  "classes": [
    {
      "class": "Trigonometry",
      "credits": 3,
      "room": "204"
    },
    {
      "class": "Physics",
      "credits": 3,
      "room": "159"
    },
    {
      "class": "Women in Literature",
      "credits": 3,
      "room": "14b"
    },
    {
      "class": "AP European History",
      "credits": 4,
      "room": "321"
    }
  ]
}

The advantage of this approach is that a single query returns the student's course information. The disadvantage is that it takes more storage space and makes the data harder to keep in sync. For example, if Physics becomes a four-credit course (instead of three), every student document that contains Physics must be updated, not just the "Physics" document.
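
The synchronization cost can be sketched in plain JavaScript, with arrays standing in for the collections. The extra students (Jane Roe, Sam Poe), the setCredits helper, and the spelling credits for the field are assumptions made for this example.

```javascript
// Source of truth for the class, plus embedded copies in student documents.
const classes = [{ _id: "c2", class: "Physics", credits: 3 }];
const students = [
  {
    _id: "s1",
    name: "John Doe",
    classes: [
      { class: "Trigonometry", credits: 3 },
      { class: "Physics", credits: 3 },
    ],
  },
  { _id: "s2", name: "Jane Roe", classes: [{ class: "Physics", credits: 3 }] },
  { _id: "s3", name: "Sam Poe", classes: [{ class: "Trigonometry", credits: 3 }] },
];

// Change a class's credits in the class document and in every embedded copy.
// Returns the number of student documents that had to be rewritten.
function setCredits(className, credits) {
  for (const c of classes) {
    if (c.class === className) c.credits = credits;
  }
  let touched = 0;
  for (const student of students) {
    let changed = false;
    for (const embedded of student.classes) {
      if (embedded.class === className) {
        embedded.credits = credits;
        changed = true;
      }
    }
    if (changed) touched += 1;
  }
  return touched;
}

const rewritten = setCredits("Physics", 4);
console.log(rewritten);
```

With references instead of embedded copies, only the one class document would change. Against a real server the fan-out would likely be a multi-document update using the positional $ operator rather than a loop in the application.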

Finally, you can also mix embedding and referencing: create an array of subdocuments that holds the commonly used information, and follow the reference to the full document when more detail is needed:

{
  "_id": ObjectId("..."),
  "name": "John Doe",
  "classes": [
    {
      "_id": ObjectId("..."),
      "class": "Trigonometry"
    },
    {
      "_id": ObjectId("..."),
      "class": "Physics"
    },
    {
      "_id": ObjectId("..."),
      "class": "Women in Literature"
    },
    {
      "_id": ObjectId("..."),
      "class": "AP European History"
    }
  ]
}

This approach is also a good choice, because the embedded information can change as requirements change. If a page needs to show more (or less) information, put more (or less) of it in the embedded document.
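
A minimal sketch of how an application might use such a mixed document, in plain JavaScript with an array standing in for the classes collection. The classNames and classDetails helpers and the string ids are hypothetical.

```javascript
// Full class documents (stand-in for the classes collection).
const classes = [
  { _id: "c1", class: "Trigonometry", credits: 3, room: "204" },
  { _id: "c2", class: "Physics", credits: 3, room: "159" },
];

// Student document with a small embedded summary plus a reference per class.
const student = {
  _id: "s1",
  name: "John Doe",
  classes: [
    { _id: "c1", class: "Trigonometry" },
    { _id: "c2", class: "Physics" },
  ],
};

// List view: everything needed is already inside the student document.
function classNames(studentDoc) {
  return studentDoc.classes.map((c) => c.class);
}

// Detail view: follow the embedded "_id" to the full class document.
function classDetails(studentDoc, className) {
  const embedded = studentDoc.classes.find((c) => c.class === className);
  if (!embedded) return null;
  return classes.find((c) => c._id === embedded._id) || null;
}

console.log(classNames(student));
console.log(classDetails(student, "Physics"));
```

The common path (listing class names) needs no second lookup; only the detail view pays for one.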

Another important consideration is whether the information is updated more often or read more often. If the data is updated regularly, normalization is the better choice. If the data changes infrequently, it is not worth sacrificing the speed of every read to make updates more efficient.

A textbook example of normalization is storing users and their addresses in separate collections. However, people rarely change their addresses, so every query should not be slowed down for the unlikely case that someone moves. In this case, the address should be embedded in the user document.

If you decide to use embedded documents, set up a cron job that checks that every update has reached all the documents it should. For example, you might try to apply an update to multiple documents and have the server crash before all of them are updated. You need to be able to detect this and redo the unfinished update.
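
One possible shape for such a repair job, sketched in plain JavaScript over in-memory arrays. It assumes each embedded copy keeps the "_id" of its source document; the repairEmbeddedClasses function and the sample data are illustrative only.

```javascript
// Source documents: Physics was already changed to 4 credits.
const classes = [{ _id: "c2", class: "Physics", credits: 4 }];

// Embedded copies: the crash happened after s1 was updated but before s2.
const students = [
  { _id: "s1", classes: [{ _id: "c2", class: "Physics", credits: 4 }] },
  { _id: "s2", classes: [{ _id: "c2", class: "Physics", credits: 3 }] },
];

// Compare every embedded copy with its source document and fix stale ones.
// Returns the "_id"s of the student documents that were repaired.
function repairEmbeddedClasses() {
  const byId = new Map(classes.map((c) => [c._id, c]));
  const repaired = [];
  for (const student of students) {
    let stale = false;
    for (const embedded of student.classes) {
      const source = byId.get(embedded._id);
      if (source && embedded.credits !== source.credits) {
        embedded.credits = source.credits;
        stale = true;
      }
    }
    if (stale) repaired.push(student._id);
  }
  return repaired;
}

const firstRun = repairEmbeddedClasses();
const secondRun = repairEmbeddedClasses();
console.log(firstRun, secondRun);
```

The job is idempotent: a second run finds nothing to fix, so it is safe to schedule repeatedly.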

Generally speaking, the more frequently data is generated, the less it should be embedded in other documents. If embedded fields, or the number of embedded fields, grow without bound, that content should be stored in a separate collection and accessed by reference instead of being embedded. Information such as comment lists or activity lists should be stored in its own collection and should not be embedded in other documents.

Finally, fields that are an integral part of a document's data should be embedded in it. If a field is almost always excluded when you query the document, it belongs in another collection rather than in the current document.

Better suited to embedding	Better suited to referencing
Small subdocuments	Large subdocuments
Data that does not change regularly	Data that changes frequently
Eventual consistency is acceptable	Immediate consistency is required
Documents that grow by a small amount	Documents that grow by a large amount
Data that usually needs a second query to fetch	Data that is usually excluded from the results
Fast reads	Fast writes

Suppose we have a user collection. Below are some fields that may be required and whether they should be embedded in the user document.

User preferences (account preferences)

User preferences are only relevant to a specific user and will most likely need to be queried with other user information within the user document. So user preferences should be embedded into the user document.

Recent activity

This depends on how much the recent activity grows and changes. If it is a fixed-length field (such as the last 10 events), it should be embedded in the user document.
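
A fixed-length list can be sketched as follows in plain JavaScript. The pushRecent helper, the recentActivity field name, and the event strings are assumptions; on a real server the same effect is usually obtained with $push combined with the $each and $slice modifiers.

```javascript
const MAX_RECENT = 10; // keep only the last 10 events

// Append an event and trim the array so it never exceeds MAX_RECENT entries.
function pushRecent(user, event) {
  user.recentActivity.push(event);
  if (user.recentActivity.length > MAX_RECENT) {
    user.recentActivity = user.recentActivity.slice(-MAX_RECENT);
  }
  return user;
}

const user = { _id: "u1", recentActivity: [] };
for (let i = 1; i <= 12; i += 1) {
  pushRecent(user, `event-${i}`);
}
console.log(user.recentActivity);
```

Because the array is bounded, the user document cannot grow indefinitely, which is what makes embedding safe here.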

Friends

Usually you should not embed friend information into user documents, at least not completely. The next section will introduce relevant content of social network applications.

All user generated content

Should not be embedded in the user document.

Cardinality

The number of references one collection holds to another is called cardinality. Common relationships are one-to-one, one-to-many, and many-to-many. Take a blogging application. Each blog post has one title, which is a one-to-one relationship. Each author can have multiple posts, which is a one-to-many relationship. Each post can have multiple tags, and each tag can be used on multiple posts, so this is a many-to-many relationship.

In MongoDB, it helps to split "many" into two subcategories: "many" and "few". For example, authors and posts may have a one-to-few relationship: each author publishes only a few posts. Blog posts and tags may have a many-to-few relationship: there are probably more posts than tags. Blog posts and comments have a one-to-many relationship: each post can have many comments.

Once you know whether a relationship is "few" or "many", the trade-off between embedding and referencing is easier to make. Generally, embedding works better for "few" relationships, and referencing works better for "many" relationships.

Friends, followers, and other troublesome things

Keep friends close and stay away from enemies

Many social applications need to link people, content, followers, friends, and so on. Choosing between embedding and referencing for this highly connected data is not easy. This section covers considerations for social graph data. Following, friending, or favoriting can often be reduced to a publish-subscribe system: one user subscribes to notifications from another. Two basic operations then need to be efficient: storing subscribers, and notifying all subscribers of an event.

There are three common ways to implement subscriptions. The first is to embed the content producers in the subscriber's document:

{
    "_id": ObjectId("..."),
    "username": "batman",
    "email": "batman@waynetech.com",
    "following": [
        ObjectId("..."),
        ObjectId("...")
    ]
}

Now, for a given user document, you can query all the activity information that the user is interested in using the form db.activities.find({"user": {"$in": user["following"]}}). However, for a piece of activity information that has just been released, if you want to find out all the users who are interested in this information, you have to query the "following" field of all users.

Another way is to embed the subscriber into the producer document:

{
    "_id": ObjectId("..."),
    "username": "joker",
    "email": "joker@mailinator.com",
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

When this producer publishes a new message, we can immediately know which users need to be notified. The disadvantage of this is that if you need to find a list of users that a user follows, you must query the entire user collection. The advantages and disadvantages of this method are exactly opposite to those of the first method.
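
The mirrored trade-off can be sketched in plain JavaScript. The user ids, the examined counter, and the invert helper are illustrative, and the scan models an unindexed lookup over an in-memory array rather than real server behavior.

```javascript
// Layout 1: each user document stores the users it follows.
const users = [
  { _id: "batman", following: ["joker"] },
  { _id: "robin", following: ["joker", "batman"] },
  { _id: "joker", following: [] },
];

// Cheap in layout 1: read one document.
function followingOf(userId) {
  const user = users.find((u) => u._id === userId);
  return user ? user.following : [];
}

// Expensive in layout 1: every user document is inspected.
function followersOf(producerId) {
  let examined = 0;
  const followers = [];
  for (const u of users) {
    examined += 1;
    if (u.following.includes(producerId)) followers.push(u._id);
  }
  return { followers, examined };
}

// Layout 2 is the mirror image: build "followers" lists once, after which
// finding followers is a single lookup and finding "following" needs the scan.
function invert(userDocs) {
  const followersById = new Map(userDocs.map((u) => [u._id, []]));
  for (const u of userDocs) {
    for (const producerId of u.following) {
      if (followersById.has(producerId)) followersById.get(producerId).push(u._id);
    }
  }
  return followersById;
}

console.log(followingOf("robin"), followersOf("joker"), invert(users).get("batman"));
```

Whichever direction is embedded, the opposite question becomes the costly one.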

Both methods also share another problem: they make user documents larger and larger and change them more and more often. The "following" and "followers" fields often do not even need to be returned: how often is the list of followers actually queried? If users frequently follow or unfollow people, this also causes a lot of fragmentation. The final approach therefore normalizes the data further and stores the subscription information in a separate collection to avoid these shortcomings. That much normalization may seem excessive, but it is useful for fields that change frequently and do not need to be returned with the rest of the document. Normalizing the "followers" field this way makes sense.

Use a collection to save the relationship between publishers and subscribers. The document structure may be as follows:

{
    "_id": ObjectId("..."),   // "_id" of the user being followed
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

This keeps the user document lean, but an extra query is needed to get the follower list. Because the "followers" array changes size often, "usePowerOf2Sizes" (an allocation option of the older MMAPv1 storage engine) can be enabled on this collection while the users collection stays as compact as possible. If the followers collection is kept in a separate database, it can also be compacted without affecting the users collection much.

Coping with the Wil Wheaton Effect

No matter which strategy is used, embedding only works well when the number of subdocuments or references is not especially large. For very famous users, the document that holds the follower list may overflow. One workaround is to use "continuation" documents when necessary. For example:

> db.users.find({"username": "wil"})
{
    "_id": ObjectId("..."),
    "username": "wil",
    "email": "wil@example.com",
    "tbc": [
        ObjectId("123"),    // just for example
        ObjectId("456")     // same as above
    ],
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("123"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("456"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}

In this case, the application needs logic to fetch the data referenced by the "tbc" ("to be continued") array.
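
A minimal sketch of that application-side logic in plain JavaScript. A Map keyed by "_id" stands in for the collection, and the allFollowers helper, the string ids, and the follower values are made up for illustration.

```javascript
// Documents keyed by "_id"; the Map stands in for the users collection.
const docs = new Map([
  ["wil", { _id: "wil", username: "wil", tbc: ["123", "456"], followers: ["a", "b"] }],
  ["123", { _id: "123", followers: ["c", "d"] }],
  ["456", { _id: "456", followers: ["e"] }],
]);

// Gather followers from the main document and every continuation document
// listed in its "tbc" array, in order.
function allFollowers(userId) {
  const user = docs.get(userId);
  if (!user) return [];
  const followers = [...user.followers];
  for (const continuationId of user.tbc || []) {
    const continuation = docs.get(continuationId);
    if (continuation) followers.push(...continuation.followers);
  }
  return followers;
}

const wilFollowers = allFollowers("wil");
console.log(wilFollowers);
```

Users without a "tbc" array go through the same function unchanged, so only the rare oversized accounts pay for the extra lookups.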

A final word

There is no silver bullet.

伊谢尔伦 · Answer

If the business frequently needs to query the relationships between users, it is better to put the relationships in a separate Collection.